Stream transcriptions with Kyutai STT
This example demonstrates how to deploy a streaming audio transcription service with Kyutai STT on Modal.
Kyutai STT is an automatic speech recognition (ASR) model designed to operate on streams of audio, rather than on complete audio files. See the linked blog post for details on their “delayed streams” architecture.
Setup
We start by importing some basic packages and the Modal SDK.
Then we define a Modal App and an Image with the dependencies of our speech-to-text system.
One dependency is missing: the model weights.
Instead of including them in the Image or loading them every time the Function starts, we add them to a Modal Volume. Volumes are like a shared disk that all Modal Functions can access.
For more details on patterns for handling model weights on Modal, see this guide.
Run Kyutai STT inference on Modal
Now we’re ready to add the code that runs the speech-to-text model.
We use a Modal Cls so that we can separate out the model loading and setup code from the inference.
For more on lifecycle management with Clses and cold start penalty reduction on Modal, see this guide.
We also define multiple ways to access the underlying streaming STT service — via a WebSocket, for Web clients like browsers, and via a Modal Queue for Python clients.
Together with the code for manipulating the streams of audio bytes and output text, that makes for a pretty big class! But nothing in it is too complex.
Run a local Python client to test streaming STT
We can test this code on the same production Modal infrastructure we'll deploy it on by writing a quick local_entrypoint for testing.
We just need a few helper functions to control the streaming of audio bytes and transcribed text from local Python.
These communicate asynchronously with the deployed Function using a Modal Queue.
Now we write our quick test, which loads in audio from a URL and then passes it to the remote Function via a Modal Queue.
If you run this example with `modal run`, you will:
- deploy the latest version of the code on Modal
- spin up a new GPU to handle transcription
- load the model from Hugging Face or the Modal Volume cache
- send the audio out to the new GPU container, transcribe it, and receive the transcription locally to be printed.
Not bad for a single Python file with no dependencies except Modal!
Deploy a streaming STT service on the Web
We’ve already written a Web backend for our streaming STT service — that’s the FastAPI app with the WebSocket endpoint in the Modal Cls above.
We can also deploy a Web frontend. To keep things almost entirely “pure Python”, we use the FastHTML library here, but you can also deploy a JavaScript frontend with a FastAPI or Node backend.
We do use a bit of JS for the audio processing in the browser.
We add it to the Modal Image using add_local_dir.
You can find the frontend files here.
You can deploy this frontend with `modal deploy` and then interact with it at the printed `ui` URL.