Are agent frameworks ready for scale?
None of the existing frameworks is ready for building enterprise B2B applications that require thousands of concurrent agent runs
Over the last 12 months, I’ve been building agents and working with many different agent frameworks, looking for one that fits the bill for building B2B applications. There are many that are useful to get you started, but none of them seems to be built for enterprise B2B use cases yet. They all lack basic features that would allow concurrent executions to scale efficiently.
Most of them let you create agent workflows or multi-agent systems quickly by building an abstraction layer on top of LLM provider specifics. But they all introduce their own abstraction and execution model, which might pose constraints later depending on your use case.
Agents at Maze
At Maze, we started our journey with agents by teaching the first Llama 3 models how to return JSON that we could use for tool calling. We also prototyped with the initial tool calling in the Claude 3 models and all the problems that came with it. The first stable release of LangGraph in June of last year was a great step forward in standardizing things like tool calling, memory and orchestrated workflows. By then, we were used to LangChain’s breaking changes, but this made us hopeful that the grass would be greener in the new project.
We use agents in multiple parts of our architecture, and we think of them the same way Anthropic describes in the blog post Building Effective Agents. We have workflows, multi-agent systems and what I would call agent graphs: an agent uses tools that are themselves agents (and this can nest recursively a few levels deep).
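To make the agent-graph idea concrete, here is a deliberately simplified sketch in plain Python; the agent names and the naive fan-out logic are placeholders, not our actual implementation:

```python
from typing import Callable

# Minimal sketch of an "agent graph": an agent whose tools can themselves be
# agents, nesting a few levels deep. Names and logic are illustrative only.
Agent = Callable[[str], str]

def make_agent(name: str, tools: dict[str, Agent]) -> Agent:
    def run(task: str) -> str:
        # A real agent would let the LLM decide which tool to call;
        # here we just call every tool to show the shape of the graph.
        results = [tool(task) for tool in tools.values()]
        return f"[{name}] combined {len(results)} tool result(s) for: {task}"
    return run

query_agent = make_agent("query-agent", {})                            # leaf agent
analysis_agent = make_agent("analysis-agent", {"query": query_agent})  # an agent used as a tool
root_agent = make_agent("root-agent", {"analysis": analysis_agent})

print(root_agent("How did signups evolve last quarter?"))
```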
Our Use Cases
We have a set of specific constraints on how and where we run our agents: the great majority of agent runs are async, all of them need to run in our own infrastructure, and they mostly use LLM models supported in Bedrock. No telemetry data can leave our AWS accounts. This alone rules out a few frameworks that are mostly built around integrating with a SaaS for observability.
Aside from that, we wanted a framework that is open source, has a strong community and, trickiest of all, has features for scaling concurrent executions efficiently.
Generally, this comes in the form of features that model or inference providers introduce over time to save on tokens or perform inference more efficiently.
In Bedrock, for example, this comes in the form of two key features:
Batch Inference: run inference on a minimum of 100 requests by uploading them to S3 as a JSONL file and processing the output asynchronously, at half the price of regular inference (see the sketch after this list).
Prompt Caching: when prompts share a significant initial context across runs, you can cache the part that doesn’t change and avoid paying full price for it again. Think of a prompt that generates a database query: the database schema is common across queries and can be cached, so the cached tokens are not charged at the full rate on new inference runs. It went GA two days ago for some models; for Claude Sonnet it’s still in closed beta.
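As an illustration of the batch inference flow, here is a hedged sketch using boto3; the bucket, IAM role, model ID and request bodies are placeholders:

```python
import json
import boto3

# Hedged sketch of Bedrock batch inference: bucket, role ARN, model ID and
# request bodies below are placeholders.
s3 = boto3.client("s3", region_name="us-east-1")
bedrock = boto3.client("bedrock", region_name="us-east-1")

# 1. Write the requests (minimum of 100) to S3 as a JSONL file.
records = [
    {
        "recordId": f"req-{i}",
        "modelInput": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": f"Summarize document {i}"}],
        },
    }
    for i in range(100)
]
s3.put_object(
    Bucket="my-batch-bucket",
    Key="input/requests.jsonl",
    Body="\n".join(json.dumps(r) for r in records),
)

# 2. Start the async job; results land in the output S3 prefix when it finishes.
bedrock.create_model_invocation_job(
    jobName="nightly-summaries",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-batch-bucket/input/requests.jsonl"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-batch-bucket/output/"}},
)
```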
From Anthropic, token-efficient tool use is another example: it promises output-token savings of 14% on average and up to 70%. AWS is doing a great job at implementing these features in Bedrock quickly; however, most frameworks take quite some time to support them, if they ever do. Sometimes there are temporary workarounds for using these features, but they might go away in the next release because they are not officially supported.
Frameworks
We have looked at many frameworks, implemented complex prototypes in at least four, and use two of them in production. What we have learnt is that there is no one-size-fits-all solution, and keeping the effort required to try a framework low pays off.
Best fitting
AWS Bedrock Agents
I would say this is the framework that focuses the most on secure, scalable multi-agent orchestration. It’s clearly aimed at enterprises fully embedded in the AWS ecosystem. However, it’s telling that even AWS’ own framework does not support batch inference or prompt caching yet.
LangGraph
For structured agent workflows with clearly defined interactions, it’s probably one of the frameworks that gets you started the fastest, and it works quite well. Writing an agent is a matter of minutes; debugging without LangSmith takes longer.
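To give a sense of the “matter of minutes” claim, here is a minimal sketch of a LangGraph agent running against Bedrock; the tool, model ID and question are placeholders, and exact APIs can shift between releases:

```python
from langchain_aws import ChatBedrockConverse
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

# Hedged sketch of a minimal LangGraph agent on Bedrock.
# The tool, model ID and question are placeholders.
@tool
def lookup_order(order_id: str) -> str:
    """Look up the shipping status of an order."""
    return f"Order {order_id} has shipped."

model = ChatBedrockConverse(model="anthropic.claude-3-5-sonnet-20240620-v1:0")
agent = create_react_agent(model, tools=[lookup_order])

result = agent.invoke({"messages": [("user", "Where is order 42?")]})
print(result["messages"][-1].content)
```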
I’m generally not a fan of how LangChain abstracts many of the underlying components, which are shared with LangGraph. The way they have handled dependencies like Pydantic (supporting v1 and v2 concurrently) and the number of bugs introduced in weekly releases have brought quite a few headaches.
Autogen
Its main use case is multi-agent applications with autonomous agents performing distinct roles. Using it with AWS Bedrock has been bumpy, and the latest v0.4 release required quite a lot of refactoring to get it working.
The AG2 fork of Microsoft’s AutoGen was a reason to re-evaluate our choices. We stuck with Microsoft after the fork and accepted the pain of migrating to v0.4 and living with missing features for a few weeks.
Honorable Mention: DSPy
Its main focus is prompt optimization and building pipelines that can be iteratively refined. I think this is a novel approach that provides value for specific use cases, especially during R&D phases. However, the target use case is not running thousands of concurrent agent workflows; one data point is how categorically they have explained in their GitHub issues that their architecture is not built for batch inference.
What’s our take?
The general advice would be to pick any framework that fits your use case to get started. As you learn more about your use case and the trade-offs you are and aren’t willing to make, start looking into alternatives or consider using no framework at all.
Prompt Caching
If you need prompt caching today, your only option is to skip the frameworks and call the provider API directly, as none of them supports this feature (not even LangChain). The feature just hit GA for some models in AWS Bedrock two days ago (but is not yet available for Claude Sonnet, for example).
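Calling Bedrock directly, prompt caching boils down to adding a cache checkpoint content block in the Converse API. A hedged sketch (model ID, region and schema text are placeholders; verify which models support caching in your region):

```python
import boto3

# Hedged sketch of prompt caching through the Bedrock Converse API.
# Model ID, region and schema text are placeholders; check which models
# support caching before relying on this.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

LARGE_SCHEMA = "CREATE TABLE users (...); CREATE TABLE orders (...);"  # shared context

response = client.converse(
    modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
    system=[
        {"text": f"You translate questions into SQL for this schema:\n{LARGE_SCHEMA}"},
        {"cachePoint": {"type": "default"}},  # everything above this marker is cached
    ],
    messages=[
        {"role": "user", "content": [{"text": "How many orders were placed last week?"}]}
    ],
)
print(response["output"]["message"]["content"][0]["text"])
```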
Batch Inference
You can use LangGraph in the following way to achieve batch inference:
For every graph node that executes an LLM call, add an outbound conditional edge that checks whether execution needs to be paused. If it does, the edge routes to the end of the graph.
In the LLM graph node, instead of calling the LLM, do three things: a) insert the LLM request into your batch processing system, b) persist the state of the graph, and c) update the graph state to signal that it needs to be paused.
Once the batch processing system has finished, load the persisted graph state, update it with the LLM response, and invoke the graph again using the new state.
This requires a bit of orchestration and state/memory management on your side, but it is doable and you keep the benefits of LangGraph.
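Here is a hedged sketch of that pattern with LangGraph; enqueue_batch_request and the way the batch result is injected back into the state are placeholders for your own batch system and persistence layer:

```python
from typing import Optional, TypedDict

from langgraph.graph import END, START, StateGraph


def enqueue_batch_request(prompt: str) -> None:
    """Placeholder: write the request into your batch inference input (e.g. a JSONL file in S3)."""
    print(f"queued for batch inference: {prompt}")


class AgentState(TypedDict):
    question: str
    llm_response: Optional[str]
    paused: bool


def llm_node(state: AgentState) -> dict:
    if state["llm_response"] is None:
        # First pass: don't call the LLM. Enqueue the request for batch
        # inference, persist the graph state elsewhere, and signal a pause.
        enqueue_batch_request(state["question"])
        return {"paused": True}
    # Second pass (resume): the batch result has been injected into the state.
    return {"paused": False}


def postprocess_node(state: AgentState) -> dict:
    return {"llm_response": state["llm_response"].strip()}


def should_pause(state: AgentState) -> str:
    return "pause" if state["paused"] else "continue"


graph = StateGraph(AgentState)
graph.add_node("llm", llm_node)
graph.add_node("postprocess", postprocess_node)
graph.add_edge(START, "llm")
graph.add_conditional_edges("llm", should_pause, {"pause": END, "continue": "postprocess"})
graph.add_edge("postprocess", END)
app = graph.compile()

# First invocation stops early with paused=True; once the batch job completes,
# merge the LLM result into the persisted state and invoke the graph again.
state = app.invoke({"question": "Summarize the incident report", "llm_response": None, "paused": False})
if state["paused"]:
    state["llm_response"] = "...result loaded from the batch output in S3..."
    state = app.invoke(state)
```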
The other option is to skip frameworks completely. This brings a lot more flexibility, and I would recommend it for simple agents or simple workflows. For complex agent workflows or multi-agent systems, this option is too costly.
Conclusion
There’s no perfect agent framework for building enterprise B2B applications. They all lag behind on most of the features you need to scale your application efficiently in terms of cost and performance. A good strategy is to keep the cost of trying frameworks low and to keep evaluating what is on the market. Using one framework for some use cases and going without a framework for others seems to strike a healthy balance. This way, you can adopt the features released by model or inference providers as soon as they become available in most cases.
PS: As I write this, I see that Google just released the Agent Development Kit a few hours ago. That might be worth a look!