Why we moved AI workflows off the API server

A practical look at background jobs, reliability, and keeping the UI fast.

Eloovor Team · 8 min read

Early on, a user clicked "Analyze Company" and waited. The spinner kept spinning. Then we deployed a new version of the API and the job silently died. The user refreshed. The analysis was gone.

We should have seen it coming. AI workflows are heavy. They pull in context, run model calls, and store structured results. When all of that happens inside your API server, in the same process that handles user requests, things break in ways that are hard to debug and impossible to explain to users.

The problem with running AI in the request path

We hit the same failures over and over:

  • Memory would spike when multiple analyses ran at once, sometimes enough to slow down unrelated requests
  • Users would hit timeouts, or worse, the upstream model provider would
  • Deploys would kill running work mid-analysis with no way to resume
  • When something failed, we had almost no visibility into where or why

Even when the system technically worked, it felt slow. The UI was blocked waiting on the server. The server was blocked waiting on the model. The user was staring at a spinner, wondering if something had broken. Nothing about that inspired confidence, even when the analysis eventually came back fine.

Moving to background jobs

We pulled AI analysis out of the request path and into Trigger.dev.

The flow now works like this:

  1. The API receives a request and stores a new analysis record
  2. A Trigger.dev job picks up the work in the background
  3. Results get written back when the job completes

From the user's perspective, they still get their analysis. But the UI returns immediately instead of hanging. The server goes back to serving requests. And the heavy model work runs in its own isolated process where a timeout or crash doesn't take anything else down with it.
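
The hand-off above can be sketched as a simplified, in-memory model. To be clear, this is not the Trigger.dev API — the `analyses` map, the `queue` array, and the function names are all illustrative — but it captures the shape of the flow: the API's only job is to write a record and enqueue an id.

```typescript
type AnalysisStatus = "pending" | "running" | "completed" | "failed";

interface AnalysisRecord {
  id: string;
  status: AnalysisStatus;
  result?: string;
}

// Illustrative stand-ins for the database and the job queue.
const analyses = new Map<string, AnalysisRecord>();
const queue: string[] = [];

// Step 1: the API handler stores a record, enqueues the id, and returns
// immediately. The UI gets back an id and a status, not a blocked request.
function requestAnalysis(id: string): AnalysisRecord {
  const record: AnalysisRecord = { id, status: "pending" };
  analyses.set(id, record);
  queue.push(id); // hand-off point; the API's work ends here
  return record;
}

// Steps 2 and 3: the background worker picks up the work, runs it in its
// own process, and writes results back. A crash here never blocks the API.
function processNext(run: (id: string) => string): void {
  const id = queue.shift();
  if (id === undefined) return;
  const record = analyses.get(id)!;
  record.status = "running";
  try {
    record.result = run(id); // the heavy model work lives here
    record.status = "completed"; // flip the status so the UI picks it up
  } catch {
    record.status = "failed";
  }
}
```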

What a job actually does

A typical analysis job has a lot going on:

  • Gather context from the job description and the user's profile
  • Pull in company or market signals that might be relevant
  • Build a structured prompt with the right guardrails
  • Run model calls and validate that the output actually makes sense
  • Store the results and flip the status so the UI picks it up

Each of those steps can fail independently. The model might time out. The company data might come back empty. The output might fail validation. When this all lived inside an API request, any one of those failures would just surface as a vague 500 error to the user. Now each step can fail, retry, or report on its own without locking up the product.
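
One way to make those failures distinguishable is a small error type that records which step failed and whether a retry is worth attempting. This is a sketch under our own naming, not a Trigger.dev type; `StepError`, and the rule that validation failures should not retry, are illustrative assumptions.

```typescript
// Hypothetical error type: each step identifies itself, so a failure
// surfaces as "validate: empty summary" rather than a vague 500.
class StepError extends Error {
  constructor(
    public readonly step: string,
    public readonly retryable: boolean,
    message: string,
  ) {
    super(`${step}: ${message}`);
  }
}

// A model timeout is usually worth retrying; an output that fails
// validation usually is not, because the same input will fail again.
function validateOutput(output: { summary?: string }): string {
  if (!output.summary || output.summary.trim() === "") {
    throw new StepError("validate", false, "empty summary");
  }
  return output.summary;
}

function classify(err: unknown): "retry" | "fail" {
  return err instanceof StepError && err.retryable ? "retry" : "fail";
}
```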

Why Trigger.dev

We looked at a few options. Trigger.dev won because it gave us the three things we cared about most: scheduling, automatic retries, and actual observability into what jobs are doing.

Before this move, when an analysis failed, it just vanished. The user would see a spinner that never resolved. We'd have to dig through server logs and guess. Now we can open the Trigger.dev dashboard, find the specific job run, and see exactly which step failed and what the error was.

Here is what changed in practice:

  • The API stays responsive under load because it is not doing the heavy work anymore
  • Deploys don't kill running analyses since the jobs run in a separate process
  • Failed tasks retry automatically, so we don't have to write custom retry logic for every failure mode
  • We can actually see what's happening in the pipeline at any moment

The reliability improvement was the real payoff. When someone uses a job search tool, they need to trust that it will finish what it started. If you click "analyze" and your results disappear because we pushed a deploy, that trust is gone.
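
For context, the retry behavior we no longer hand-roll is essentially the following. Trigger.dev's own retry configuration is richer; this synchronous helper is just a sketch of the pattern, with the backoff sleep reduced to a comment.

```typescript
// Minimal retry sketch: try up to `maxAttempts` times, rethrow the last
// error if every attempt fails. In a real queue we would also sleep with
// exponential backoff (e.g. roughly 2^attempt seconds) between attempts.
function withRetries<T>(fn: (attempt: number) => T, maxAttempts = 3): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return fn(attempt);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```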

Designing around background jobs

Moving work off the API changed how we think about building features. It's a different set of constraints.

Jobs need to be idempotent. If a job retries, it shouldn't create duplicate results or send duplicate notifications. We had to go back and make several of our early jobs safe for replay, which was not fun but was necessary.

Status updates need to be visible to the user. A spinner with no context is almost worse than no feedback at all. Users need to see that something is happening, where it is in the process, and roughly how long it might take.

Errors need enough context attached to them that we can actually fix things quickly. A generic "job failed" log entry is useless at 2am. We attach the input parameters, the step that failed, and the raw error.
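
In practice that means building a structured failure record instead of logging a bare message. The shape below is our illustration, not a Trigger.dev type.

```typescript
// Hypothetical failure record: enough context that a 2am log entry points
// straight at the failing step, the inputs, and the underlying error.
interface JobFailure {
  step: string;
  input: Record<string, unknown>;
  cause: string;
}

function describeFailure(
  step: string,
  input: Record<string, unknown>,
  err: unknown,
): JobFailure {
  return {
    step,
    input,
    cause: err instanceof Error ? err.message : String(err),
  };
}
```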

Long tasks should break into discrete steps. A single monolithic job that runs for 90 seconds is hard to observe and hard to debug. Breaking it into steps means we can see progress and isolate failures.
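
Breaking a job into steps can be as simple as a wrapper that runs named steps in order and records how far the job got. This is a sketch; the real pipeline reports progress through Trigger.dev rather than returning a log.

```typescript
type StepResult = { name: string; ok: boolean };

// Run named steps in order; stop at the first failure so later steps
// never run against broken state, and return a per-step record so we can
// see exactly where the job stopped.
function runSteps(steps: Array<[string, () => void]>): StepResult[] {
  const log: StepResult[] = [];
  for (const [name, fn] of steps) {
    try {
      fn();
      log.push({ name, ok: true });
    } catch {
      log.push({ name, ok: false });
      break;
    }
  }
  return log;
}
```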

Trigger.dev handles the infrastructure side of all this well, but we still had to do the product work of designing how the UI communicates what is happening behind the scenes.

The UX side

A background job the user can't see might as well not exist. If you move work to the background but the UI still just shows an indefinite spinner, you haven't really improved anything from the user's perspective.

We show progress states and status labels for each analysis. When a job finishes, the UI updates. When a job fails, we tell the user what happened in plain language and give them a retry button. We also handle the case where a user navigates away and comes back — their analysis is still there, still running or completed, with its status intact.

This layer of communication is a relatively small amount of code, but it completely changes how the product feels. The system goes from "I clicked a button and now I'm waiting and hoping" to "I can see this thing working."
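
The mapping behind those labels is small. The copy and the `canRetry` flag below are illustrative, not our production strings, but this is roughly the whole translation layer between job state and what the user reads.

```typescript
type UiStatus = "pending" | "running" | "completed" | "failed";

// Every job state gets a plain-language label; only failures get a retry
// affordance. The wording here is illustrative.
function statusLabel(status: UiStatus): { text: string; canRetry: boolean } {
  switch (status) {
    case "pending":
      return { text: "Queued, starting shortly", canRetry: false };
    case "running":
      return { text: "Analyzing your profile and the role", canRetry: false };
    case "completed":
      return { text: "Analysis ready", canRetry: false };
    case "failed":
      return { text: "Something went wrong with this analysis", canRetry: true };
  }
}
```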

When we use background jobs and when we don't

Not everything needs to be a background job. We've been deliberate about where the line is.

We move work to background jobs when it is slow (anything touching model inference), when it depends on external services that might be unreliable, or when it needs to survive retries. That covers AI analysis, multi-step data gathering, and anything that could run longer than a few seconds.

Fast operations stay synchronous. A quick database lookup or a simple CRUD operation doesn't need the overhead of a job queue. Adding that complexity where it isn't needed just makes the system harder to reason about.

What changed after the migration

The migration was a technical change, but it ended up changing how we design features. We now ask different questions early in the process. What happens if this takes 30 seconds? What does the user see while they wait? What happens if it fails halfway through? What does a retry look like?

We built better status tracking because we had to. We wrote better error messages because we could actually see the errors now. We started thinking about partial failure states — what if step 3 of 5 fails? Can the user see the results from steps 1 and 2?

The result is that users who run multiple analyses at once get a much more predictable experience. Each one has its own status, its own progress, and its own error handling. Before, running several analyses concurrently was basically a coin flip on whether the server would hold up.

Scaling

Background jobs also changed how we handle traffic spikes. Before, a burst of analysis requests would hit the API directly, and if enough came in at once, the server would slow down for everyone, including people just trying to log in or browse.

Now the API stays light. It writes records and returns. The analysis jobs queue up and process at whatever concurrency we've configured. If traffic spikes, the queue gets longer but the API doesn't buckle. Users might wait a bit longer for results, but the product doesn't degrade in that broken, unresponsive way.
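
The spike behavior can be modeled in a few lines: a burst of N requests becomes roughly N / concurrency rounds of processing, while the API's per-request cost stays constant. The function below is a toy model, not how Trigger.dev actually schedules work.

```typescript
// Toy model: a queued burst drains `concurrency` jobs at a time. A spike
// makes the queue longer, not the API slower.
function planBatches<T>(queued: T[], concurrency: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < queued.length; i += concurrency) {
    batches.push(queued.slice(i, i + concurrency));
  }
  return batches;
}
```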

We can also tune job concurrency independently from the API. If we need to throttle analysis jobs because the model provider is rate limiting us, we can do that without touching the API server. We can look at "is the API healthy" and "are analysis jobs backed up" as two separate questions, which makes operations a lot more straightforward. That separation has already saved us from a few incidents that would have been much worse under the old architecture.

Supercharge your job search with eloovor

Create your free account and run your full search in one place:

  • Smart job application tracking and follow-ups
  • ATS-optimized resumes and personalized cover letters
  • Smart Profile Analysis
  • One-click company research and hiring insights
  • Profile-based job fit analysis
  • Interview preparation and practice prompts
Engineering · Trigger.dev · Background Jobs · AI