Fine-tune Whisper to Improve Transcription on Domain-Specific Vocab
This example demonstrates how to fine-tune an ASR model (whisper-tiny.en) and deploy it for inference using Modal.
Speech recognition models work well out-of-the-box for general speech transcription, but can struggle with terms that are not well represented in their training data, such as proper nouns, technical jargon, and industry-specific vocabulary. Fine-tuning on examples of domain-specific vocabulary can improve transcription of these terms.
For example, here is a sample transcription from the baseline model with no fine-tuning:
| | Transcription |
|---|---|
| Ground Truth | “deuterium you put into one element you make a new element” |
| Prediction | “the theorem you put into one element you make a new element” |
After just 1.5 hours of training on a small dataset (~7k samples), the model has already improved:
| | Transcription |
|---|---|
| Ground Truth | “deuterium you put into one element you make a new element” |
| Prediction | “deuterium you put into one element you make a new element” |
We’ll use the “small” subset of “Science and Technology” from the GigaSpeech dataset, which is enough data to see the model improve on scientific terms in just a few epochs.
Note: GigaSpeech is a gated dataset, so you’ll need to accept the terms on the dataset card and create a Hugging Face Secret to download it.
Setup
We start by importing our standard library dependencies, fastapi, and modal.
We also need an App object, which we’ll use to
define how our training application will run on Modal’s cloud infrastructure.
Set up the container image
We define the environment where our functions will run by building up a base container Image with our dependencies using Image.uv_pip_install. We also set environment variables
here using Image.env, like the Hugging Face cache directory.
Next we’ll import the dependencies we need for the code that will run on Modal.
The image.imports() context manager ensures these imports are available when our
Functions run in the cloud, without the need to install the dependencies locally.
Storing data on Modal
We use Modal Volumes for data we want to persist across function calls. In this case, we’ll create a cache Volume for storing Hugging Face downloads for faster subsequent loads, and an output Volume for saving our model and metrics after training.
Training
We use a dataclass to collect some of the training parameters in one place. Here we
set model_output_name which is the directory on the Volume where our model will be
saved, and where we’ll load it from when deploying the model for inference.
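Such a dataclass might look like this; apart from `model_output_name`, the fields and defaults shown are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Hypothetical hyperparameters for illustration.
    model_name: str = "openai/whisper-tiny.en"
    num_train_epochs: int = 3
    learning_rate: float = 1e-5
    # Directory on the output Volume where the fine-tuned model is saved,
    # and from which we later load it for inference.
    model_output_name: str = "whisper-tiny-finetuned"

config = Config()
```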
Defining our training Function
The training Function does the following:
- Load the pre-trained model, along with the feature extractor and tokenizer
- Load the dataset -> select our training category -> extract features for training
- Run baseline evals
- 🚂 Train!
- Save the fine-tuned model to the Volume
- Run final evals
We run evals before and after training to establish a baseline and see how much the model improved. The most common way to measure the performance of speech recognition models is “word error rate” (WER):
WER = (substitutions + deletions + insertions) / total words.
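As a concrete illustration of the formula, here is a small pure-Python WER computation via word-level edit distance (in practice you'd use a library like `jiwer`, as the Hugging Face `evaluate` metric does):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution + one insertion over 6 reference words ≈ 0.33
print(wer("deuterium you put into one element",
          "the theorem you put into one element"))
```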
The @app.function decorator is where we attach infrastructure and define how our
Function runs on Modal. Here we tell the Function to use our Image, specify the GPU,
attach the Volumes we created earlier, add our access token, and set a timeout.
Calling a Modal function from the command line
The easiest way to invoke our training Function is by creating a local_entrypoint —
our main function that runs locally and provides a command-line interface to trigger
training on Modal’s cloud infrastructure.
This will allow us to run this example with:
Arguments passed to this function are turned into CLI arguments automagically. For
example, adding --test will run a single step of training for end-to-end testing.
Training will take ~1.5 hours, and will log WER and other metrics throughout the run.
Here are a few more examples of terms the model predicted correctly after fine-tuning:
| Base Model | Fine-tuned |
|---|---|
| and pm package | npm package |
| teach them | tritium |
| chromebox | chromevox |
| purposes | porpoises |
| difsoup | div soup |
| would you | widget |
Deploying our fine-tuned model for inference
Once fine-tuning is complete, Modal makes it incredibly easy to deploy our new model.
We can define both our inference function and an endpoint using a Modal Cls.
This will allow us to take advantage of lifecycle hooks to load the model just once on container startup using the @modal.enter decorator.
We can use modal.fastapi_endpoint to expose our inference function as a web endpoint.
Deploy it with:
Note: you can specify which model to load by passing the model_name as a
query parameter when calling the endpoint. This defaults to model_output_name, which
we set in our Config above, and is the name of the directory where our model
was saved.
Here’s an example of how to use this endpoint to transcribe an audio file:
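A sketch of such a client using only the standard library; the endpoint URL is a placeholder (`modal deploy` prints the real one), and the JSON response shape is an assumption:

```python
import json
import urllib.request

def transcribe(audio_path: str, endpoint_url: str) -> str:
    """POST raw audio bytes to the deployed endpoint and return the text."""
    with open(audio_path, "rb") as f:
        req = urllib.request.Request(
            endpoint_url,
            data=f.read(),
            headers={"Content-Type": "application/octet-stream"},
            method="POST",
        )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]

# Usage (placeholder URL; append ?model_name=... to select a different model):
# print(transcribe(
#     "sample.wav",
#     "https://your-workspace--transcriber-transcribe.modal.run",
# ))
```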