Modal Inference

The fastest way to scale inference

Whether you are running low-latency LLM inference or async batch workloads, Modal lets you serve, scale, and optimize inference globally.

“Modal lets us move fast while keeping full control over our models and serving stack. The flexibility meant we could train high-accuracy models and hit the real-time performance our product demands.”

Decagon, Voice AI team

“We use Modal to run edge inference with <10ms overhead and batch jobs at large scale. Our team loves the platform for the power and flexibility it gives us.”

Brian Ichter, Co-founder

“Modal's infrastructure gave us the performance and reliability we need to ship this in every global region, at production scale.”

Kamil Sindi, CTO of Runway

Code-first inference

Stay in your application code. Modal handles scaling, serving, and infrastructure behind the scenes.


Engineered for inference.

Run any model

Serve open source or custom models with Python. Easily keep ML dependencies and GPU requirements in sync with application code.


Real-time serving

Optimized and highly tunable infrastructure for low-latency serving and routing.

Elastic scale

Instantly scale to 1000+ GPUs during traffic spikes, then back down to 0 when idle. No commitments, no waits.
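
To make the scale-up and scale-to-zero behavior concrete, here is a minimal sketch of how a replica target could be derived from request backlog. It is an illustration only, not Modal's actual autoscaling logic: the function name `desired_replicas`, the `per_replica_concurrency` parameter, and the `max_replicas` cap of 1000 are all hypothetical.

```python
import math


def desired_replicas(
    pending_requests: int,
    per_replica_concurrency: int,
    max_replicas: int = 1000,  # hypothetical cap, chosen for illustration
) -> int:
    """Return how many GPU replicas to run for the current backlog.

    Assumption: each replica handles `per_replica_concurrency` requests
    at once, and an idle service should hold zero replicas.
    """
    if per_replica_concurrency <= 0:
        raise ValueError("per_replica_concurrency must be positive")
    if pending_requests <= 0:
        return 0  # scale to zero when there is no traffic
    # Round up so a partial batch of requests still gets a replica.
    needed = math.ceil(pending_requests / per_replica_concurrency)
    return min(needed, max_replicas)


if __name__ == "__main__":
    for backlog in (0, 1, 17, 1_000_000):
        print(backlog, "->", desired_replicas(backlog, per_replica_concurrency=8))
```

Rounding up keeps small bursts from being starved, and returning 0 for an empty backlog is what lets cost track usage when the service is idle.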

Infrastructure optimized for every deployment pattern




Get clear insight into production deployments



Built with Modal

Ship your first app in minutes.

Get Started

$30 / month free compute