Low-latency Nemotron 3 with SGLang and Modal

In this example, we show how to serve NVIDIA’s Nemotron models on Modal at low latency with SGLang.

The Nemotron models combine Mixture-of-Experts (MoE) layers with hybrid attention (mixing Transformer and Mamba layers) to deliver powerful capabilities in a model that’s efficient to run. You can read more in the paper here.

This example is intended to demonstrate everything required to run inference at the highest performance and with the lowest latency possible, and so it includes advanced features of both SGLang and Modal. For a simpler introduction to LLM serving, see this example. It also runs a small and efficient model, as far as LLMs go. For more on serving very large language models, see this example.

To minimize routing overheads, we use @modal.experimental.http_server, which uses a new, low-latency routing service on Modal designed for latency-sensitive inference workloads. This gives us more control over routing, but with increased power comes increased responsibility.

Set up the container image 

Our first order of business is to define the environment our server will run in: the container Image.

We start from a container image provided by the SGLang team via Dockerhub.

While we’re at it, we import the dependencies we’ll need both remotely and locally (for deployment).

We also choose a GPU to deploy our inference server onto. We choose the B200 GPU, which offers excellent price-performance and supports both 8-bit and 4-bit quantized floating point operations.

Loading and caching the model weights 

We’ll serve NVIDIA’s Nemotron 3 Nano. For lower latency, we pick a smaller model (30B params quantized to 4-bit floating point). This reduces the amount of data that needs to be loaded from GPU RAM into SM SRAM in each forward pass. Loading fewer bytes of model weights also speeds up cold starts of our inference server.

We load the model from the Hugging Face Hub, so we’ll need their Python package.

We don’t want to load the model from the Hub every time we start the server. We can load it much faster from a Modal Volume. Typical speeds are around one to two GB/s.

In addition to pointing the Hugging Face Hub at the path where we mount the Volume, we also turn on “high performance” downloads, which can fully saturate our network bandwidth.

Define the inference server and infrastructure 

Selecting infrastructure to minimize latency 

Minimizing latency requires geographic co-location of clients and servers.

So for low latency LLM inference services on Modal, you must select a cloud region both for the GPU-accelerated containers running inference and for the internal Modal proxies that forward requests to them, as part of defining a modal.experimental.http_server.

Here, we assume users are mostly in the northern half of the Americas and select the us-east cloud region to serve them. This should result in at most a few dozen milliseconds of round-trip time.

Latencies for multi-turn interactions with LLMs are substantially cut when previous interaction turns are in the KV cache. KV caches are stored in GPU RAM, so they aren’t shared across replicas. To improve cache hit rate, modal.experimental.http_server includes sticky routing based on a client-provided header. See the client code below for details.

For production-scale LLM inference services, there are generally enough requests to justify keeping at least one replica running at all times. Having a “warm” or “live” replica reduces latency by skipping the slow initialization work that occurs when a new replica boots up (a “cold start”). For LLM inference servers, that latency runs from seconds to minutes.

To ensure at least one container is always available, we can set the min_containers of our Modal Function to 1 or more.

However, since this is documentation code, we’ll set it to 0 to avoid surprise bills during casual use.

Finally, we need to decide how we will scale up and down replicas in response to load. Without autoscaling, users’ requests will queue when the server becomes overloaded. Even apart from queueing, responses generally become slower per user above a certain minimum number of concurrent requests.

So we set a target for the number of inputs to run on a single container with modal.concurrent. For details, see the guide.

Generally, this choice needs to be made as part of LLM inference engine benchmarking.

Controlling container lifecycles with modal.Cls 

We wrap up all of the choices we made about the infrastructure of our inference server into a number of Python decorators that we apply to a Python class that encapsulates the logic to run our server.

The key decorators are:

  • @app.cls to define the core of our service. We attach our Image, request a GPU, attach our cache Volumes, specify the region, and configure auto-scaling. See the reference documentation for details.

  • @modal.experimental.http_server to turn our Python code into an HTTP server (i.e. fronting all of our containers with a proxy with a URL). The wrapped code needs to eventually listen for HTTP connections on the provided port.

  • @modal.concurrent to specify how many requests our server can handle before we need to scale up.

  • @modal.enter and @modal.exit to indicate which methods of the class should be run when starting the server and shutting it down.

Modal considers a new replica ready to receive inputs once the modal.enter methods have exited and the container accepts connections. To ensure that we actually finish setting up our server before we are marked ready for inputs, we define a helper function to check whether the server is finished setting up and to send it a few test inputs.

We use the requests library to send ourselves these HTTP requests on localhost/127.0.0.1.
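A minimal version of that readiness probe might look like the following: poll the local server, ignoring connection errors and non-200 responses until it answers, and raise if it never comes up. The `/health` route is SGLang’s health-check endpoint; a real setup would also send a few warm-up generation requests afterward.

```python
# Poll a locally running server until it responds to health checks.
import time

import requests


def wait_until_ready(port: int, timeout: float = 300.0) -> None:
    url = f"http://127.0.0.1:{port}/health"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(url, timeout=1).status_code == 200:
                return  # server is up and answering
        except requests.RequestException:
            pass  # not listening yet; keep polling
        time.sleep(0.5)
    raise TimeoutError(f"server on port {port} not ready after {timeout}s")
```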

With all this in place, we are ready to define our high-performance, low-latency Nemotron inference server.

Deploy the server 

To deploy the server on Modal, just run
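Assuming the code lives in a file called `nemotron_inference.py` (the filename here is a placeholder; substitute your own):

```shell
modal deploy nemotron_inference.py
```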

This will create a new App on Modal and build the container image for it if it hasn’t been built yet.

Interact with the server 

Once it is deployed, you’ll see a URL appear in the command line, something like https://your-workspace-name--example-nemotron-inference-server.us-east.modal.direct.

You can find interactive Swagger UI docs at the /docs route of that URL, i.e. https://your-workspace-name--example-nemotron-inference-server.us-east.modal.direct/docs. These docs describe each route, indicate the expected inputs and outputs, and translate requests into curl commands. For simple routes, you can even send a request directly from the docs page.

Note: when no replicas are available, Modal will respond with the 503 Service Unavailable status. In your browser, you can just hit refresh until the docs page appears. You can see the status of the application and its containers on your Modal dashboard.

Test the server 

To make it easier to test the server setup, we also include a local_entrypoint that hits the server with a simple client.

If you execute the command
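(again assuming the placeholder filename `nemotron_inference.py`)

```shell
modal run nemotron_inference.py
```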

a fresh replica of the server will be spun up on Modal while the code below executes on your local machine.

Think of this like writing simple tests inside of the if __name__ == "__main__" block of a Python script, but for cloud deployments!

This test relies on the two helper functions below, which ping the server and wait for a valid response to stream.

The probe helper function specifically ignores two types of errors that can occur while a replica is starting up — timeouts on the client and 5XX responses from the server. Modal returns the 503 Service Unavailable status when an experimental.http_server has no live replicas.

We include a header with each request — Modal-Session-ID. This header is used by clients of http_servers on Modal to identify which requests should be routed to the same container (with caveats explained below).

The value associated with this key is used to map requests onto containers such that while the set of containers is fixed, requests with the same value are sent to the same container. Set this to a different value per distinct multi-turn interaction (prototypically, a user conversation thread with a chatbot) to improve KV cache hit rates. Additionally, when the set of containers changes (e.g. due to autoscaling), sessions are rebalanced such that load is approximately evenly spread, much like in RAID rebalancing. This ensures no container ends up as a “hot spot” handling too many client requests.
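On the client side, this can be as simple as deriving one stable session id per conversation thread, so that all requests in a thread carry the same header value. The helper below is a hypothetical sketch (the `session_headers` name and UUID-based derivation are illustrative; only the `Modal-Session-ID` header name comes from the description above).

```python
# Derive a stable sticky-routing header from a conversation thread id.
import uuid

def session_headers(thread_id: str) -> dict[str, str]:
    # uuid5 is deterministic: the same thread id always maps to the
    # same session id, and different threads map to different ones.
    session_id = str(uuid.uuid5(uuid.NAMESPACE_URL, thread_id))
    return {"Modal-Session-ID": session_id}
```

A client would then pass these headers on every request in the thread, e.g. `requests.post(url, json=payload, headers=session_headers("thread-42"))`.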