Efficient LLM Finetuning with Unsloth

Training large language models is an extremely compute-hungry process. Open-source LLMs often require tens of gigabytes (or, in extreme cases, around a terabyte!) of VRAM just to fit in memory. Finetuning requires even more memory: a common estimate for naive finetuning puts the VRAM requirement at roughly 4.2x the model size: 1x for the model weights, 1x for gradients, 2x for optimizer state, plus 20% for activations. Parameter-efficient methods like LoRA improve matters significantly, since the gradient and optimizer-state terms then apply only to the small LoRA adapter weights rather than to the entire model's. Further gains can be made by quantizing each of the components mentioned above, but doing so requires quantization-aware training, which can be tricky to combine with methods like LoRA.
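The 4.2x estimate above is straightforward to sanity-check in code. A minimal sketch (the function name and the 28 GB example figure are illustrative, not from the example itself):

```python
def naive_finetune_vram_gb(model_gb: float) -> float:
    """Rough VRAM estimate for naive full finetuning:
    weights (1x) + gradients (1x) + optimizer state (2x, e.g. Adam's
    two moments) + ~20% of the model size for activations."""
    weights = model_gb
    gradients = model_gb
    optimizer_state = 2 * model_gb
    activations = 0.2 * model_gb
    return weights + gradients + optimizer_state + activations

# A model whose weights take 28 GB would need roughly 117.6 GB to finetune naively.
print(naive_finetune_vram_gb(28.0))
```

With LoRA, only the `gradients` and `optimizer_state` terms shrink to the adapter size; the full base weights must still fit in memory, which is why quantizing them helps so much.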

Unsloth provides optimized methods for LLM finetuning with LoRA and quantization, leading to typical gains of 2x faster training and 70% lower memory usage. This example demonstrates using Unsloth to finetune a version of Qwen3-14B on the FineTome-100k dataset on Modal, using only a single GPU!

We create a Modal App to organize our functions and shared infrastructure like container images and volumes.
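In outline, that configuration looks something like this (the App name here is illustrative):

```python
import modal

# The App groups our functions and shared resources (images, volumes).
app = modal.App("unsloth-finetune")  # name is an assumption, not the example's
```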

Container Image Configuration 

We build a custom container image with Unsloth and all necessary dependencies. The image includes the latest version of Unsloth (as of writing) with optimizations for the latest model architectures. Once the image is defined, we can specify the imports we’ll need to write the rest of our training code. Importantly, we import unsloth before the rest so that Unsloth’s patches are applied to packages like transformers, peft, and trl.
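A sketch of what that image definition might look like (the package list and Python version are assumptions, not the example's exact pins):

```python
import modal

# Build a container image with Unsloth and its companion libraries.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("unsloth", "transformers", "datasets", "peft", "trl")
)

# Imports inside image.imports() only run in the container.
# unsloth comes first so its patches land before transformers/peft/trl load.
with image.imports():
    import unsloth  # must be imported first
    import torch
    from datasets import load_dataset
```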

Volume Configuration 

Modal Volumes provide storage that persists between function invocations. We use separate volumes for different types of data to enable efficient caching and sharing:

  • A cache for pretrained model weights - reused across all experiments
  • A cache for processed datasets - reused when using the same dataset
  • Storage for training checkpoints and final models
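A sketch of that Volume setup, assuming illustrative names and mount points (not necessarily the example's actual ones):

```python
import modal

# create_if_missing=True creates each Volume on first use.
pretrained_volume = modal.Volume.from_name("example-pretrained", create_if_missing=True)
dataset_volume = modal.Volume.from_name("example-datasets", create_if_missing=True)
runs_volume = modal.Volume.from_name("example-runs", create_if_missing=True)

# Map each Volume to a path inside the container.
VOLUME_CONFIG = {
    "/pretrained": pretrained_volume,  # cached model weights
    "/datasets": dataset_volume,       # processed datasets
    "/runs": runs_volume,              # checkpoints and final models
}
```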

Picking a GPU 

We use an L40S GPU for its healthy balance of VRAM, CUDA core count, and clock speed. The timeout provides an upper bound on our training time; if our training run finishes faster, we won't end up using the full 6 hours. We also specify 3 retries, which will be useful in case our training function gets preempted.
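These choices show up as arguments to the function decorator. A fragment sketching the shape (it assumes `app`, `image`, and `VOLUME_CONFIG` are defined as elsewhere in the example; the function signature is elided):

```python
@app.function(
    image=image,
    gpu="L40S",
    timeout=6 * 60 * 60,  # upper bound of 6 hours on the run
    retries=3,            # restart automatically if preempted
    volumes=VOLUME_CONFIG,
)
def train():
    ...
```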

Data Processing 

We'll be finetuning our model on the FineTome-100k dataset, which is a subset of The Tome curated with the fineweb-edu-classifier. Below we define some helpers for processing this dataset.
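FineTome-100k stores its conversations in the ShareGPT style (`"from"`/`"value"` keys), while chat templates expect `"role"`/`"content"` messages. A hypothetical helper sketching that conversion (the exact helpers in the example may differ):

```python
# Map ShareGPT speaker tags to chat-template roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_chat_messages(conversation):
    """Convert one ShareGPT-style conversation into chat-template messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]

example = [
    {"from": "human", "value": "What is LoRA?"},
    {"from": "gpt", "value": "A parameter-efficient finetuning method."},
]
print(to_chat_messages(example)[0])  # {'role': 'user', 'content': 'What is LoRA?'}
```

The resulting messages can then be rendered to training text with the tokenizer's chat template.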

Loading the pretrained model 

We can't finetune without a pretrained model! Since these models are fairly large, we don't want to re-download them for each training run. We solve this by caching the weights in a Volume on first download, then loading them from the Volume on subsequent runs.
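One way this can look in practice is to point the Hugging Face cache at the mounted Volume before loading the model with Unsloth. A sketch, assuming a `/pretrained` mount point (the path and hyperparameters are assumptions):

```python
import os

# Direct the Hugging Face cache into the Volume so weights downloaded
# once are reused by later runs.
os.environ["HF_HOME"] = "/pretrained"

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit quantized base weights to save VRAM
)
```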

Training Configuration 

First we'll define which layers our LoRA modules should target. Generally, it's advisable to LoRA-finetune every linear layer in the model, so we target every projection matrix in the attention and MLP blocks.
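For Llama/Qwen-style architectures, that set of linear layers is conventionally spelled out like this (a sketch; the variable name is illustrative):

```python
# All projection matrices in each transformer block.
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
    "gate_proj", "up_proj", "down_proj",     # MLP projections
]
print(len(TARGET_MODULES))  # 7
```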

We want to expose the different hyperparameters and optimizations that Unsloth supports, so we wrap them into a TrainingConfig class. Later, we’ll populate this config with arguments from the command line.
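A dataclass is a natural fit for this kind of config. A sketch with illustrative field names and defaults (not the example's actual hyperparameters):

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    """Hyperparameters exposed on the command line; values here are illustrative."""
    learning_rate: float = 2e-4
    per_device_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    max_seq_length: int = 2048
    lora_rank: int = 16
    lora_alpha: int = 16

# Any field can be overridden at construction time.
cfg = TrainingConfig(learning_rate=1e-4)
print(cfg.learning_rate)  # 0.0001
```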

Main Training Function 

This function orchestrates the entire training process, from model loading to final model saving. It’s decorated with Modal function configuration that specifies the compute resources, the volumes needed, and execution details like timeout and retries.

Finally, we invoke our training function from an App.local_entrypoint. Arguments to this function automatically get converted into CLI flags that can be specified when we modal run our code. This allows us to do things like tweak hyperparameters directly from the command line without modifying our source code.
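The shape of that entrypoint is roughly as follows (a fragment assuming `app` and a remote `train` function are defined elsewhere in the example; the parameter names are illustrative):

```python
@app.local_entrypoint()
def main(learning_rate: float = 2e-4, batch_size: int = 2):
    # Each keyword argument becomes a `modal run` CLI flag automatically.
    train.remote(learning_rate=learning_rate, batch_size=batch_size)
```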

To try this example, check out the examples repo, install the Modal client, and run

You can also customize the training process by tweaking hyperparameters with command line flags, e.g.
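For instance, an invocation might look like this (the filename and flag names here are illustrative; Modal converts the entrypoint's underscored parameter names to hyphenated flags):

```shell
modal run finetune.py --learning-rate 1e-4 --batch-size 4
```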

Utility Functions 

These functions handle the core logic for model loading, dataset processing, and training setup. They’re designed to be hackable for new use cases.