Meta and Cerebras team on fast inference for new Llama API
Targeting developers, Meta has partnered with Cerebras to deliver ultra-fast inference in its new Llama API, combining the most popular open-source model family, Llama, with Cerebras' high-speed inference technology.
Developers building on the Llama 4 Cerebras model in the API can expect generation speeds up to 18 times faster than traditional GPU-based setups. This acceleration unlocks an entirely new generation of applications that are impossible to build with other technologies. Real-time agents, conversational low-latency voice, interactive code generation, and instant multi-step reasoning—all of which require chaining multiple LLM calls—can now be completed in seconds rather than minutes.
For Cerebras, serving Llama models through Meta's new API service brings its technology to an expanded global developer audience.
Since launching its inference technology in 2024, Cerebras has delivered the fastest Llama inference currently available, serving billions of tokens through its own AI infrastructure. The broad developer community now has direct access to a robust, OpenAI-class alternative for building intelligent, real-time systems.
“Cerebras is proud to make Llama API the fastest inference API in the world,” said Andrew Feldman, CEO and co-founder of Cerebras. “Developers building agentic and real-time apps need speed. With Cerebras on Llama API, they can build AI systems that are fundamentally out of reach for leading GPU-based inference clouds.”
According to the third-party benchmarking site Artificial Analysis, Cerebras AI achieves an inference speed of over 2,600 tokens per second for Llama 4 Scout, while ChatGPT processes approximately 130 tokens per second and DeepSeek around 25 tokens per second.
Developers can access the fastest Llama 4 inference by selecting Cerebras from the model options within the Llama API, making it easy to prototype, build, and scale real-time AI applications.
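As a rough illustration of what such a call could look like, the sketch below uses the OpenAI-compatible Python client pattern that many inference providers support. The base URL, model identifier, and provider-selection detail are assumptions for illustration only, not documented Llama API values.

```python
# Illustrative sketch only: the endpoint URL, model name, and provider selection
# below are hypothetical placeholders, not confirmed Llama API parameters.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.llama.com/v1",   # hypothetical Llama API endpoint
    api_key="YOUR_LLAMA_API_KEY",          # placeholder credential
)

response = client.chat.completions.create(
    model="llama-4-scout-cerebras",        # hypothetical Cerebras-backed model option
    messages=[
        {"role": "user", "content": "Summarise the benefits of low-latency inference."}
    ],
)

print(response.choices[0].message.content)
```

In practice, developers would consult the Llama API documentation for the actual endpoint, model names, and the mechanism for selecting Cerebras as the serving backend.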