Fine-tune Whisper to Improve Transcription on Domain-Specific Vocab
This example demonstrates how to fine-tune an ASR model (whisper-tiny.en) and deploy it for inference using Modal.
Speech recognition models work well out-of-the-box for general speech transcription, but can struggle with terms that are not well represented in their training data, such as proper nouns, technical jargon, and industry-specific vocabulary. Fine-tuning on examples of domain-specific vocabulary can improve transcription of these terms.
For example, here is a sample transcription from the baseline model with no fine-tuning:
| | Transcription |
|---|---|
| Ground Truth | “deuterium you put into one element you make a new element” |
| Prediction | “the theorem you put into one element you make a new element” |
After just 1.5 hours of training on a small dataset (~7k samples), the model has already improved:
| | Transcription |
|---|---|
| Ground Truth | “deuterium you put into one element you make a new element” |
| Prediction | “deuterium you put into one element you make a new element” |
We’ll use the “small” subset of “Science and Technology” from the GigaSpeech dataset, which is enough data to see the model improve on scientific terms in just a few epochs.
Note: GigaSpeech is a gated dataset, so you’ll need to accept the terms on the dataset card and create a Hugging Face Secret to download it.
Setup
We start by importing our standard library dependencies, fastapi, and modal.
We also need an App object, which we’ll use to
define how our training application will run on Modal’s cloud infrastructure.
Set up the container image
We define the environment where our functions will run by building up a base container Image with our dependencies using Image.uv_pip_install. We also set environment variables
here using Image.env, like the Hugging Face cache directory.
Next we’ll import the dependencies we need for the code that will run on Modal.
The image.imports() context manager ensures these imports are available when our
Functions run in the cloud, without the need to install the dependencies locally.
Storing data on Modal
We use Modal Volumes for data we want to persist across function calls. In this case, we’ll create a cache Volume for storing Hugging Face downloads for faster subsequent loads, and an output Volume for saving our model and metrics after training.
Training
We use a dataclass to collect some of the training parameters in one place. Here we
set model_output_name which is the directory on the Volume where our model will be
saved, and where we’ll load it from when deploying the model for inference.
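Such a dataclass might look like this; apart from `model_output_name`, the fields and defaults shown are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Hypothetical hyperparameters for illustration.
    model_name: str = "openai/whisper-tiny.en"
    num_train_epochs: int = 3
    learning_rate: float = 1e-5
    # Directory on the output Volume where the fine-tuned model is saved,
    # and from which we later load it for inference.
    model_output_name: str = "whisper-tiny-finetuned"

config = Config()
```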
Defining our training Function
The training Function does the following:
- Load the pre-trained model, along with the feature extractor and tokenizer
- Load the dataset -> select our training category -> extract features for training
- Run baseline evals
- 🚂 Train!
- Save the fine-tuned model to the Volume
- Run final evals
We run evals before and after training to establish a baseline and see how much the model improved. The most common way to measure the performance of speech recognition models is “word error rate” (WER):
WER = (substitutions + deletions + insertions) / total words.
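As a concrete illustration of the formula, here is a small pure-Python WER computation via word-level edit distance (in practice you'd use a library like `jiwer`, as the Hugging Face `evaluate` metric does):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution + one insertion over 6 reference words ≈ 0.33
print(wer("deuterium you put into one element",
          "the theorem you put into one element"))
```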
The @app.function decorator is where we attach infrastructure and define how our
Function runs on Modal. Here we tell the Function to use our Image, specify the GPU,
attach the Volumes we created earlier, add our access token, and set a timeout.
Calling a Modal function from the command line
The easiest way to invoke our training Function is by creating a local_entrypoint —
our main function that runs locally and provides a command-line interface to trigger
training on Modal’s cloud infrastructure.
This will allow us to run this example with:
Arguments passed to this function are turned into CLI arguments automagically. For
example, adding --test will run a single step of training for end-to-end testing.
Training will take ~1.5 hours, and will log WER and other metrics throughout the run.
Here are a few more examples of terms the model predicted correctly after fine-tuning:
| Base Model | Fine-tuned |
|---|---|
| and pm package | npm package |
| teach them | tritium |
| chromebox | chromevox |
| purposes | porpoises |
| difsoup | div soup |
| would you | widget |
Deploying our fine-tuned model for inference
Once fine-tuning is complete, Modal makes it incredibly easy to deploy our new model.
We can define both our inference function and an endpoint using a Modal Cls.
This will allow us to take advantage of lifecycle hooks to load the model just once on container startup using the @modal.enter decorator.
We can use modal.fastapi_endpoint to expose our inference function as a web endpoint.
Deploy it with:
Note: you can specify which model to load by passing the model_name as a
query parameter when calling the endpoint. This defaults to model_output_name, which
we set in our Config above, and is the name of the directory where our model
was saved.
Here’s an example of how to use this endpoint to transcribe an audio file:
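A sketch of such a client using only the standard library; the endpoint URL is a placeholder (`modal deploy` prints the real one), and the JSON response shape is an assumption:

```python
import json
import urllib.request

def transcribe(audio_path: str, endpoint_url: str) -> str:
    """POST raw audio bytes to the deployed endpoint and return the text."""
    with open(audio_path, "rb") as f:
        req = urllib.request.Request(
            endpoint_url,
            data=f.read(),
            headers={"Content-Type": "application/octet-stream"},
            method="POST",
        )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]

# Usage (placeholder URL; append ?model_name=... to select a different model):
# print(transcribe(
#     "sample.wav",
#     "https://your-workspace--transcriber-transcribe.modal.run",
# ))
```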