“Modal lets us move fast while keeping full control over our models and serving stack. The flexibility meant we could train high-accuracy models and hit the real-time performance our product demands.”
“We use Modal to run edge inference with <10ms overhead and batch jobs at large scale. Our team loves the platform for the power and flexibility it gives us.”
Stay in your application code. Modal handles scaling, serving, and infrastructure behind the scenes.
Serve open source or custom models with Python. Easily keep ML dependencies and GPU requirements in sync with application code.
Optimized and highly tunable infrastructure for low-latency serving and routing.
Instantly scale to 1000+ GPUs during traffic spikes, then back down to 0 when idle. No commitments, no waits.
Real-time
Sub-10ms overhead latency from anywhere with our globally distributed compute. Out-of-the-box support for token streaming, WebRTC, WebSocket.
Dynamically batched
Add one line of code to accumulate requests and process them in dynamically-sized batches.
Offline batched
Run inference on millions of inputs in record time—Modal scales instantly to thousands of GPUs.
