Three Months of Killing Processes by Hand

For about three months, part of my morning routine was SSHing into our VM and killing processes.

OpenFunnel runs AI agents for customers every day - detection jobs that scan job boards, LinkedIn activity, website traffic, and hiring patterns, then surface the accounts worth reaching out to. These agents ran on a cron that fired every 24 hours. The problem was that some jobs took longer than 24 hours to complete. So by day three, you'd have three instances of the same job running simultaneously, competing for DB connections, slowly choking the VM. The fix every morning was: SSH in, figure out which processes were from two days ago, kill them, watch the VM stabilize, go back to work.

The actual problem

The cron approach worked when we had fewer customers and simpler agents. It stopped working as both of those grew.

Agents at the tail of the queue often never ran. The cron fetched and sorted all jobs at startup. If your agent was at position 180 out of 200, and the first 50 were taking hours each, yours either ran at 2am or not at all. Enterprise customers with more accounts were effectively starving out everyone else.

Every code change had a 24-hour lag before it hit production agents. Push a fix at 3pm, it doesn't affect the running pipeline until tomorrow. For bugs that were causing job failures, this was genuinely painful.

And the pile-up problem - once cron instances started stacking, the only solution was manual intervention. There was no graceful degradation. The VM just got slower until something died.

What I built

I considered the obvious options - Celery, Bull, SQS. All of them meant adding infrastructure I'd then have to monitor and debug separately. We already had Postgres. A table with the right schema and atomic updates gives you a job queue without introducing a new system to babysit.

The replacement is a persistent queue backed by a Postgres table. When agents are due to run, jobs get enqueued with a tier, an attempt count, a claimed-by field, and a heartbeat timestamp. Workers claim the next job by updating claimed_by atomically - if another worker has already claimed it, the update touches zero rows and the worker moves on. No separate broker, no Redis, no new infrastructure. Just a table and long-running workers.
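Here is a minimal sketch of that claim step, under loud assumptions: the `jobs` table, its column names, and the `try_claim` helper are hypothetical, and it uses Python's built-in sqlite3 in place of Postgres so it runs without a server. The part that carries over is the conditional UPDATE and the row-count check.

```python
import sqlite3


def make_queue() -> sqlite3.Connection:
    # In-memory SQLite stands in for Postgres here; the schema is illustrative only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY,
            tier INTEGER NOT NULL,
            attempts INTEGER NOT NULL DEFAULT 0,
            claimed_by TEXT,
            heartbeat_at REAL
        )
        """
    )
    return conn


def try_claim(conn: sqlite3.Connection, job_id: int, worker_id: str, now: float) -> bool:
    # The WHERE clause makes the claim conditional: the row is only updated
    # if nobody has claimed it yet.
    cur = conn.execute(
        "UPDATE jobs SET claimed_by = ?, heartbeat_at = ?, attempts = attempts + 1 "
        "WHERE id = ? AND claimed_by IS NULL",
        (worker_id, now, job_id),
    )
    conn.commit()
    # Zero rows touched means another worker got there first.
    return cur.rowcount == 1
```

A worker that gets False back simply moves on to the next candidate job. No lock is held between looking at the queue and claiming a row; the UPDATE itself decides who wins.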

Tiers solved the starvation problem from the cron days. Enterprise customers with heavier workloads get routed to dedicated workers. Smaller accounts don't sit behind a six-hour enterprise job anymore. The queue sorts by tier first, then by age within each tier.
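The ordering rule is small enough to sketch in plain Python. The `Job` fields, the convention that a lower tier number is served first, and the idea of giving each worker a set of tiers it is allowed to serve are all assumptions made for illustration, not the real schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Job:
    id: int
    tier: int           # hypothetical convention: lower number is served first
    enqueued_at: float  # epoch seconds; older jobs have smaller values


def next_job(jobs: Iterable[Job], worker_tiers: Set[int]) -> Optional[Job]:
    # A dedicated worker only considers the tiers it has been assigned.
    eligible = [job for job in jobs if job.tier in worker_tiers]
    # Sort key: tier first, then age within the tier.
    return min(eligible, key=lambda job: (job.tier, job.enqueued_at), default=None)
```

With this shape, a worker assigned only the small-account tier never even sees the six-hour enterprise job, which is the property that removes the starvation.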

Heartbeats update every 30 seconds. If a worker dies mid-job, the heartbeat goes stale and a recovery job requeues it automatically. Crashed workers don't leave permanently stuck jobs. One thing I learned the hard way here - the heartbeat thread uses run_coroutine_threadsafe to push updates from a daemon thread back into the asyncio loop, and if the coroutine hangs, the thread blocks indefinitely. A 10-second timeout on future.result() keeps the heartbeat from deadlocking silently.
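A sketch of that heartbeat thread, assuming a coroutine function `beat` that writes the timestamp (hypothetical, as are `heartbeat_loop`, `start_heartbeat`, and the `on_timeout` hook). The two calls that matter are asyncio.run_coroutine_threadsafe and future.result(timeout=...).

```python
import asyncio
import concurrent.futures
import threading


def heartbeat_loop(loop, beat, stop, interval=30.0, timeout=10.0, on_timeout=None):
    """Body of the daemon thread. `beat` is a coroutine function that writes
    the heartbeat; `stop` is a threading.Event set on shutdown."""
    while not stop.is_set():
        # Schedule the coroutine on the worker's event loop from this thread.
        future = asyncio.run_coroutine_threadsafe(beat(), loop)
        try:
            # Without a timeout this call blocks forever if beat() hangs.
            future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            future.cancel()  # ask the loop to cancel the stuck coroutine
            if on_timeout is not None:
                on_timeout()
        # Sleep until the next beat, but wake immediately on shutdown.
        stop.wait(interval)


def start_heartbeat(loop, beat, **kwargs):
    stop = threading.Event()
    thread = threading.Thread(
        target=heartbeat_loop, args=(loop, beat, stop), kwargs=kwargs, daemon=True
    )
    thread.start()
    return thread, stop
```

A real worker would likely also catch and log other exceptions raised by `beat`, since an unhandled one ends the thread and the heartbeat with it.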

Each worker is a systemd service:

ExecStart=... python -m auto-run.signal_worker --worker-id %i
TimeoutStopSec=7200
Restart=always
MemoryMax=4G
MemorySwapMax=0
MemoryHigh=3200M

The %i placeholder expands to the instance name, so one template unit file serves multiple worker instances. TimeoutStopSec=7200 matters because agent jobs legitimately take hours. On stop, systemd sends SIGTERM and follows up with SIGKILL once the stop timeout expires, so with a short timeout a deploy would kill a job mid-execution. With two hours of headroom, the worker can finish its current job and exit cleanly. The SIGTERM handler sets a shutdown flag; the worker checks it before claiming the next job. Combined with Restart=always, you get clean rolling restarts for free - the worker exits, systemd restarts it, and it picks up the next job from the queue. No orchestration needed.
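The shutdown flag can be sketched like this. `claim_next` and `run_job` are hypothetical stand-ins for the real queue and agent code, and `run_worker` is an illustrative loop rather than the actual worker.

```python
import signal
import threading

shutdown = threading.Event()


def handle_sigterm(signum, frame):
    # Only record the request; the job in progress is left to finish.
    shutdown.set()


def install_handler():
    # Must be called from the main thread of the worker process.
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(claim_next, run_job, poll_interval=5.0):
    """claim_next() returns a job or None; run_job(job) runs it to completion."""
    completed = 0
    while not shutdown.is_set():          # checked before every claim
        job = claim_next()
        if job is None:
            shutdown.wait(poll_interval)  # idle, but wake early on shutdown
            continue
        run_job(job)                      # finishes even if SIGTERM arrives meanwhile
        completed += 1
    return completed
```

Because the flag is only consulted between jobs, a SIGTERM that lands in the middle of a three-hour run changes nothing until that run is done. That is why the long TimeoutStopSec is needed alongside it.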

Each worker claims one job at a time, runs it to completion, then polls for the next. Intentional - it keeps memory bounded per worker and makes the system easy to reason about.

When the workers started OOMing

The queue solved the pile-up immediately. Then, within the first week, workers started dying for a different reason.

The pattern: a worker claims a job, runs for two to three hours, gets killed by the OOM killer. Requeues, restarts, claims again, dies at the same point. The queue was running but heavy jobs were stuck in a restart loop.

Two things were happening. Python RSS accumulates over long runs - caches, thread pool state, LLM response buffers. A worker starting at 800MB would be at 3.5GB after an hour. With eight workers on the VM, combined RSS was pushing the host into memory pressure.

The other thing: some agents run headless browser sessions via Playwright. The headless_shell child processes don't inherit the parent's cgroup memory limit cleanly. By the time the worker's own RSS looked fine, the total process tree was already over available memory.

My first instinct was that MemoryMax=4G would handle it. It didn't. Without MemorySwapMax=0, a cgroup that hits its memory limit can push pages to swap instead of getting killed. The VM starts swapping, everything degrades, and the OOM killer eventually fires on an unpredictable process. Adding MemorySwapMax=0 made the OOM killer fire predictably at the cgroup ceiling - clean death, five-second restart, no collateral damage. Those two settings need to go together. On a host with swap enabled, MemoryMax alone is mostly useless.

That fixed the worst of it. But I also didn't want heavy jobs to restart from scratch every time. So I added a MemoryMonitor thread inside each worker. It checks RSS every five seconds. If the process crosses 3GB - 1GB below the hard ceiling - it sets the agent's status to pausing in the database before the OOM killer can fire. The worker sees the pausing status, requeues the job, exits cleanly. On the next restart, the agent picks up from its last checkpoint. No data loss, no stuck jobs.
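A rough sketch of such a monitor, with assumptions flagged: the class body, the `on_soft_limit` callback (standing in for the database write that sets the status to pausing), and the injectable `read_rss` function are all hypothetical. The default reader is Linux-only and measures the worker process itself, not its child processes.

```python
import os
import threading


def read_rss_bytes() -> int:
    # Linux-only: the second field of /proc/self/statm is resident pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


class MemoryMonitor(threading.Thread):
    def __init__(self, on_soft_limit, soft_limit_bytes=3 * 1024**3,
                 interval=5.0, read_rss=read_rss_bytes):
        super().__init__(daemon=True)
        self._on_soft_limit = on_soft_limit  # e.g. mark the agent as pausing
        self._soft_limit = soft_limit_bytes  # 3GB, below the 4G hard ceiling
        self._interval = interval
        self._read_rss = read_rss
        self._stop_event = threading.Event()
        self.tripped = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            if self._read_rss() >= self._soft_limit:
                self.tripped.set()
                self._on_soft_limit()
                return  # fire once; the worker handles the rest
            self._stop_event.wait(self._interval)

    def stop(self):
        self._stop_event.set()
```

Passing the RSS reader in as a parameter is a design choice for the sketch: it lets the threshold logic be exercised with fake readings instead of actually allocating gigabytes.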

One of the more frustrating bugs during this whole process was asyncio semaphores failing silently. Creating them at module level binds them to a different event loop than the one actually running, which causes either silent failures or a RuntimeError depending on the Python version. The fix is simple - create every semaphore inside the coroutine that needs it - but the failure mode is subtle enough that it took me longer to find than I'd like to admit.
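The safe pattern looks roughly like this. `run_limited` and `handler` are hypothetical names; the point is only where the Semaphore is constructed.

```python
import asyncio

# Avoid: SEM = asyncio.Semaphore(4) here at module level, where it is created
# outside the event loop that will later run the coroutines.


async def run_limited(items, handler, limit=4):
    # Created inside a coroutine, so it belongs to the loop that is running now.
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:  # at most `limit` handlers run at once
            return await handler(item)

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(item) for item in items))
```

Called as asyncio.run(run_limited(urls, fetch, limit=4)), every semaphore is created and used on the same loop, so the cross-loop failure cannot occur.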

What changed

The system has been running for about a month. Around 800 agent jobs complete every day.

The clearest test came in late May. During the initial ramp-up period, the system had accumulated roughly 7,500 jobs that were queued but not yet processed - a combination of onboarding new customers and backfilling historical data. I didn't intervene. I didn't even know about the backlog until I checked the queue metrics the next morning. Workers had already started draining it overnight. Over two days, they cleared the entire thing - 2,400 claims in a single peak hour - without affecting any customer's regular daily runs.

Under the old cron system, a backlog like that would have meant a week of manual triage. SSHing in, killing stale processes, manually restarting jobs, prioritizing which customers to run first. This time it just resolved itself while I was asleep.

I haven't SSHed in to kill a process since the migration.