Peter Wang

Anyone else running into the same problem deploying long-running AI agents?

I’ve been working on some AI projects recently — things like scheduled agents, API responders, and multi-agent systems that need to run continuously. One of the biggest headaches I’ve run into is deployment.

Most cloud platforms (AWS, GCP, etc.) are built for stateless apps or short-lived functions. But for long-running, stateful agents, the kind that need to persist data, auto-recover from crashes, and expose custom endpoints, it gets surprisingly messy. I’ve spent far more time setting up VMs, Docker configs, and recovery logic than actually writing agent behavior.

Has anyone else faced this?

Curious how others are handling deployment for autonomous agents that aren’t just scripts or jobs, but actual long-lived services. I’ve been working on a solution to make this easier, but before I share anything I’d love to hear how others are solving (or working around) this.

Replies

Sandy Suh

We use AWS Batch to run longer functions, AWS Step Functions for orchestration, Postgres for persisting data, Redis for temporary caching for jobs that need it, and Django for custom API endpoints. We have separate jobs scheduled to scan recent jobs for crashes and re-run those jobs or otherwise handle the crashes accordingly.

And yeah, it definitely gets messy (I've certainly learned a lot about VMs, Docker configs, etc.), but I don't know that there's any good way around it. Designing the right database schemas to persist your job data, figuring out the best way to handle crashes, making things scale—imo the right way to do these things is super use-case-dependent and there's no silver bullet.

(Context: We scrape many thousands of government bodies each day to collect/parse their data/publications. Some jobs take as long as 30 minutes, especially if a government site is not very modern).
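
As a rough illustration of the crash-scan approach described above, here is a minimal sketch assuming boto3; the job queue name, job definition, and retry policy are placeholders, not the actual setup.

```python
# Minimal sketch of a crash-scan job: list recently FAILED AWS Batch jobs and
# resubmit them. Queue and job definition names are placeholders.
import boto3

batch = boto3.client("batch")

def rerun_failed_jobs(job_queue: str = "my-job-queue") -> None:
    response = batch.list_jobs(jobQueue=job_queue, jobStatus="FAILED")
    for job in response.get("jobSummaryList", []):
        # Resubmit under a new name so the original failure stays visible in history.
        batch.submit_job(
            jobName=f"{job['jobName']}-retry",
            jobQueue=job_queue,
            jobDefinition="my-job-definition",  # placeholder
        )

if __name__ == "__main__":
    rerun_failed_jobs()
```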

Peter Wang

@sand1929 I can totally relate lol. I was building a multi-agent system too, and one agent crash ended up taking down the whole process. There was no clean way to recover its last state, and getting it production-ready with crash recovery, persistent storage, scaling, and container orchestration took hours to set up.

What you described sounds very similar to the pain points I ran into — and it's exactly what I'm working on solving now. I’ve got a working proof of concept and would love to connect, share what I’ve built, and see if it could help streamline your setup (or learn from your use case to improve it). Let me know if you're interested!

Sandy Suh

@cywdev Sure thing, I'll shoot you a request

Meghan Henry

Yes, long-running AI agents often face memory leaks and random crashes. I’ve been managing them with checkpointing and by breaking tasks into smaller modules.
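
For readers newer to the pattern, a minimal sketch of file-based checkpointing between small task modules; the checkpoint path and the way tasks are broken up are assumptions for illustration, not Meghan's actual setup.

```python
# Illustrative checkpointing: persist progress after each small module so a
# crashed agent can resume where it left off instead of starting over.
import json
import os

CHECKPOINT_PATH = "agent_checkpoint.json"  # placeholder location

def load_checkpoint() -> dict:
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {"completed": []}

def save_checkpoint(state: dict) -> None:
    with open(CHECKPOINT_PATH, "w") as f:
        json.dump(state, f)

def run_agent(modules: dict) -> None:
    state = load_checkpoint()
    for name, task in modules.items():
        if name in state["completed"]:
            continue  # already finished before the last crash/restart
        task()
        state["completed"].append(name)
        save_checkpoint(state)  # checkpoint after every module
```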

Sanskar Yadav

Completely relate to this, @cywdev.
I’ve spent more hours wrangling infrastructure than actually building agent behaviors, and “plug-and-play” just doesn’t seem to exist yet for this use case (maybe a good opportunity).

Most cloud setups really feel geared toward stateless, quickly recycled jobs or scripts. When you try to keep something running for days with consistent state, crash recovery, and custom logic, you suddenly need to stack together Docker, health checks, databases, and custom watchdogs. The agent logic almost becomes the smallest piece in the puzzle! I'm not a hardcore techie but I’ve started to wonder if maybe there’s a better middle ground. Some kind of layer that handles agent lifecycles and auto-recovery, so I can focus on their reasoning, not on reinventing deployment wheels each time I want to create something tangible.
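
As one concrete example of the "custom watchdog" piece, a rough sketch using the Docker Python SDK; the container name, polling interval, and restart policy are assumptions, and real setups usually add backoff and alerting.

```python
# Simple watchdog loop: restart an agent container if it exits or reports unhealthy.
# Assumes the Docker Python SDK (pip install docker) and a container named "my-agent".
import time
import docker

client = docker.from_env()

def watch(container_name: str = "my-agent", interval: int = 30) -> None:
    while True:
        container = client.containers.get(container_name)
        container.reload()  # refresh status from the Docker daemon
        # "Health" only exists if the image defines a HEALTHCHECK.
        health = container.attrs.get("State", {}).get("Health", {}).get("Status")
        if container.status != "running" or health == "unhealthy":
            container.restart()
        time.sleep(interval)
```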

Curious if anyone’s actually found a setup they like for continuous (autonomous) agents, or is everyone just getting by on homegrown scripts and hope?

Peter Wang

@sanskarix Exactly, this is the problem I’m solving, and I already have a working local version. The demo is live and lets you deploy your agent with one click using a prebuilt Docker image. Once deployed, your agent includes the following features:

  • State persistence: Both in-memory runtime state and long-lived state snapshots, so the agent remembers what it was working on even after a recovery.

  • Recovery: Resume a paused or stopped agent with a single command.

  • Dedicated API endpoint: Each deployed agent gets its own API endpoint, so you don't need to reconfigure it every time (a generic routing sketch follows after this list).

    (For example: https://agentainer.io/agents/{ag...)

  • Metrics and logs: System-level logs and basic API response tracking (still very basic for now).

  • Coding-agent compatible: The demo can be driven by a coding agent; I asked Cursor to create and deploy an agent end to end by itself without any issues.
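
To make the dedicated-endpoint idea concrete in general terms, here is a hypothetical Flask sketch of per-agent routing. This is an illustration of the pattern only, not the platform's actual implementation; the route shape and the agent registry are invented for the example.

```python
# Hypothetical illustration of per-agent routing (NOT the platform's actual code):
# each deployed agent is reachable at its own stable path under /agents/<agent_id>/.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder registry: agent_id -> callable that handles a request payload.
AGENTS = {"demo": lambda payload: {"echo": payload}}

@app.route("/agents/<agent_id>/invoke", methods=["POST"])
def invoke_agent(agent_id):
    agent = AGENTS.get(agent_id)
    if agent is None:
        return jsonify({"error": "unknown agent"}), 404
    return jsonify(agent(request.get_json(silent=True)))
```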

Some features I’m working on for the production version:

  • Auto-scaling: Scaling based on workload, with load balancing between containers. All agents share the same memory through the state persistence feature.

  • Message bus: Lets multiple agents communicate more efficiently within the platform.

  • Enhanced metrics/logs: A production-grade metrics and logs dashboard.

  • Team workspace: A shared workspace where you can see what others are working on and manage all your agents in one place.

  • Flexible database: Connect any external or internal database per agent.

  • White-labeling: Use your own domain for agent endpoints. For example, https://{yourDomain}/agents/{agentId}/{yourAgentEndpoint}

----

I'll be sharing the full tech stack and build details soon (been busy recently).

In the meantime, I'm looking for early adopters and testers to try out the demo version. If you're interested, I'd love to work closely with you to make sure everything runs smoothly and to gather your first-hand feedback. Let me know if you'd like early access!

Priyanka Gosai

Yes,

We ran into the same pain points while building backend sync flows between EHR systems and pharmacy networks: long-lived agents that had to persist state, retry gracefully, and run 24/7. Most platforms weren’t built for this kind of workload.

What worked better for us:

  • Breaking agents into micro-tasks with checkpointing

  • Using managed container services (like GKE Autopilot) + Redis for state tracking (see the sketch below)

  • Supervisors for crash recovery, not just retries
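
A condensed sketch of the Redis-backed state tracking plus supervisor heartbeat pattern; the key names, TTLs, and heartbeat mechanism are assumptions for illustration, not the actual production setup.

```python
# Illustrative Redis-backed state tracking with a heartbeat a supervisor can watch.
# Key names and TTLs are placeholders.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_state(agent_id: str, state: dict) -> None:
    r.set(f"agent:{agent_id}:state", json.dumps(state))
    r.set(f"agent:{agent_id}:heartbeat", "alive", ex=60)  # expires if the agent dies

def load_state(agent_id: str) -> dict:
    raw = r.get(f"agent:{agent_id}:state")
    return json.loads(raw) if raw else {}

def supervisor_check(agent_id: str) -> bool:
    # A separate supervisor process can poll this and restart/resume the agent
    # from its last saved state once the heartbeat key has expired.
    return r.exists(f"agent:{agent_id}:heartbeat") == 1
```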

Still far from seamless. Would love to hear more about your solution because the infra side still feels like overkill for most product teams.

Peter Wang

@priyanka_gosai1 Hey Priyanka, I’ve shared a post about how I’m solving this, with a demo version available. You can find it here: https://www.producthunt.com/p/self-promotion/how-we-built-a-solution-runs-long-lived-llm-agents

Feel free to sign up for early access; it would help us a lot!