Amazon Web Services (AWS) and Cerebras Systems have announced a partnership to deliver accelerated AI inference capabilities for generative AI and large language model (LLM) tasks.
The new service will launch in the coming months on Amazon Bedrock within AWS data centres, combining Amazon’s Trainium-powered servers, Cerebras CS-3 systems, and Elastic Fabric Adapter (EFA) networking.
AWS also plans to expand its offerings later this year by providing access to open-source LLMs and Amazon Nova using Cerebras hardware.
The collaboration employs “inference disaggregation,” a method that splits AI inference into two distinct phases: prompt processing (prefill) and output generation (decode).
Prefill is highly parallel and compute-intensive with moderate memory requirements, while decode is serial, less computationally demanding, and dependent on high memory bandwidth.
This division allows each stage to benefit from dedicated compute architectures connected by EFA networking, with Trainium optimised for prefill operations and the Cerebras CS-3 system focusing on decode tasks.
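The split can be sketched in a few lines of Python. This is a toy illustration of the general prefill/decode pattern, not AWS's or Cerebras' implementation; the function names, cache layout, and stand-in "model" are all invented for clarity:

```python
# Toy sketch of disaggregated inference. Prefill (parallel, compute-bound)
# builds a key-value cache from the whole prompt in one pass; decode (serial,
# memory-bandwidth-bound) then generates one token at a time by repeatedly
# re-reading that cache. All names and arithmetic here are hypothetical.

def prefill(prompt_tokens):
    """Process every prompt token in one parallel pass, returning a KV cache."""
    # In a real system this is a batched matrix multiply over all tokens at
    # once; here each token's (position, value) pair stands in for its state.
    return [(pos, tok) for pos, tok in enumerate(prompt_tokens)]

def decode(kv_cache, max_new_tokens):
    """Generate tokens one at a time, each step re-reading the full cache."""
    output = []
    for _ in range(max_new_tokens):
        # Each step scans the whole cache (the memory-bandwidth-bound part),
        # then appends the new entry so the next step can attend to it.
        next_token = sum(tok for _, tok in kv_cache) % 100  # toy "model"
        kv_cache.append((len(kv_cache), next_token))
        output.append(next_token)
    return output

cache = prefill([3, 1, 4, 1, 5])  # prefill stage (Trainium's role here)
tokens = decode(cache, 3)         # decode stage (the CS-3's role here)
```

In a disaggregated deployment the two functions would run on separate accelerators, with the KV cache shipped between them over the interconnect (EFA, in the AWS case); the sketch keeps both on one machine only for readability.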
The system runs on the AWS Nitro System, which underpins AWS’s cloud infrastructure. This ensures that both Trainium-powered instances and Cerebras CS-3 systems maintain established security protocols and operational standards.
Trainium is designed specifically for AI training and inference at scale. Major AI research organisations, including Anthropic and OpenAI, have committed significant workloads to Trainium technology.
Anthropic uses Trainium as its main platform for model training and deployment, while OpenAI will utilise 2 gigawatts (GW) of Trainium capacity through AWS infrastructure to meet the increasing workload demands of its advanced models.
Since its launch, Trainium3 has seen broad adoption across multiple industries.
Cerebras’ CS-3 system offers high memory bandwidth suited for key parts of the inference workload, particularly decode operations, which make up a significant share of total inference time.
Companies such as OpenAI, Cognition, and Mistral use Cerebras systems to support demanding workloads like agentic coding, where efficient token generation is critical.
In this setup, the CS-3 handles all decode processes while Trainium manages prefill functions. The EFA network connects both processors to maximise efficiency for each stage of the inference task.
Cerebras Systems founder and CEO Andrew Feldman said: “Partnering with AWS to build a disaggregated inference solution will bring the fastest inference to a global customer base.
“Every enterprise around the world will be able to benefit from blisteringly fast inference within their existing AWS environment.”
Earlier this year, Cerebras secured a $10bn agreement with OpenAI to supply AI chips, Reuters reported.
The company, valued at $23.1bn, aims to offer an alternative to Nvidia’s technology by designing chips that do not depend on high-bandwidth memory components.