Run open-source LLMs with Ollama on Modal
Ollama is a popular tool for running open-source large language models (LLMs) locally. It provides a simple REST API, including an OpenAI-compatible endpoint, that lets you interact with models like Llama, Mistral, and Phi.
In this example, we demonstrate how to run Ollama on Modal’s cloud infrastructure, leveraging:
- Modal’s powerful GPU resources that far exceed what’s available on most local machines
- Serverless design that scales to zero when not in use (saving costs)
- Persistent model storage using Modal Volumes
- Web-accessible endpoints that expose Ollama’s OpenAI-compatible API
Since the Ollama server provides its own REST API, we use Modal’s web_server decorator to expose these endpoints directly to the internet.
import asyncio
import subprocess
from typing import List
import modal
Configuration and Constants
Directory for Ollama models within the container and volume
MODEL_DIR = "/ollama_models"Define the models we want to work with You can specify different model versions using the format “model:tag”
MODELS_TO_DOWNLOAD = ["llama3.1:8b", "llama3.3:70b"] # Downloaded at startup
MODELS_TO_TEST = ["llama3.1:8b", "llama3.3:70b"] # Tested in our exampleOllama version to install - you may need to update this for the latest models
OLLAMA_VERSION = "0.6.5"Ollama’s default port - we’ll expose this through Modal
OLLAMA_PORT = 11434Building the Container Image
First, we create a Modal Image that includes Ollama and its dependencies. We use the official Ollama installation script to set up the Ollama binary.
ollama_image = (
modal.Image.debian_slim(python_version="3.11")
.apt_install("curl", "ca-certificates")
.uv_pip_install(
"fastapi==0.115.8",
"uvicorn[standard]==0.34.0",
"openai~=1.30", # Pin OpenAI version for compatibility
)
.run_commands(
"echo 'Installing Ollama...'",
f"OLLAMA_VERSION={OLLAMA_VERSION} curl -fsSL https://ollama.com/install.sh | sh",
"echo 'Ollama installed at $(which ollama)'",
f"mkdir -p {MODEL_DIR}",
)
.env(
{
# Configure Ollama to serve on its default port
"OLLAMA_HOST": f"0.0.0.0:{OLLAMA_PORT}",
"OLLAMA_MODELS": MODEL_DIR, # Tell Ollama where to store models
}
)
)
Create a Modal App, which groups our functions together.
app = modal.App("example-ollama", image=ollama_image)
Persistent Storage for Models
We use a Modal Volume to cache downloaded models between runs, so large model files don't need to be re-downloaded each time.
model_volume = modal.Volume.from_name("ollama-models-store", create_if_missing=True)
The Ollama Server Class
We define an OllamaServer class to manage the Ollama process. This class handles:
- Starting the Ollama server
- Downloading required models
- Exposing the API via Modal’s web_server
- Running test requests against the served models
@app.cls(
gpu="H100", # Use H100 GPUs for best performance
volumes={MODEL_DIR: model_volume}, # Mount our model storage
timeout=60 * 5,  # Terminate any single input after 5 minutes
)
class OllamaServer:
ollama_process: subprocess.Popen | None = None
@modal.enter()
async def start_ollama(self):
"""Starts the Ollama server and ensures required models are downloaded."""
print("Starting Ollama setup...")
print(f"Starting Ollama server on port {OLLAMA_PORT}...")
cmd = ["ollama", "serve"]
self.ollama_process = subprocess.Popen(cmd)
print(f"Ollama server started with PID: {self.ollama_process.pid}")
# Wait for server to initialize
await asyncio.sleep(10)
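# NOTE: a fixed sleep is a simple heuristic. A more robust approach (not
# shown here) would poll Ollama's /api/version endpoint until it responds.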
print("Ollama server should be ready.")
# --- Model Management ---
# Check which models are already downloaded, and pull any that are missing
loop = asyncio.get_running_loop()
models_pulled = False # Track if we pulled any model
# Get list of currently available models
ollama_list_proc = subprocess.run(
["ollama", "list"], capture_output=True, text=True
)
if ollama_list_proc.returncode != 0:
print(f"Error: 'ollama list' failed: {ollama_list_proc.stderr}")
raise RuntimeError(
f"Failed to list Ollama models: {ollama_list_proc.stderr}"
)
current_models_output = ollama_list_proc.stdout
print("Current models detected:", current_models_output)
# Download each requested model if not already present
for model_name in MODELS_TO_DOWNLOAD:
print(f"Checking for model: {model_name}")
model_tag_to_check = (
model_name if ":" in model_name else f"{model_name}:latest"
)
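# Note: substring matching against `ollama list` output is a simple
# heuristic; the CLI's table format could change in future releases.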
if model_tag_to_check not in current_models_output:
print(
f"Model '{model_name}' not found. Pulling (output will stream directly)..."
)
models_pulled = True # Mark that a pull is happening
# Pull the model - this can take a while for large models
pull_process = await asyncio.create_subprocess_exec(
"ollama",
"pull",
model_name,
)
# Wait for the pull process to complete
retcode = await pull_process.wait()
if retcode != 0:
print(f"Error pulling model '{model_name}': exit code {retcode}")
else:
print(f"Model '{model_name}' pulled successfully.")
else:
print(f"Model '{model_name}' already exists.")
# Commit the volume only if we actually pulled new models
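# (Committing makes the newly downloaded weights durable and visible
# to future containers that mount this Volume.)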
if models_pulled:
print("Committing model volume...")
await loop.run_in_executor(None, model_volume.commit)
print("Volume commit finished.")
print("Ollama setup complete.")
@modal.exit()
def stop_ollama(self):
"""Terminates the Ollama server process on shutdown."""
print("Shutting down Ollama server...")
if self.ollama_process and self.ollama_process.poll() is None:
print(f"Terminating Ollama server (PID: {self.ollama_process.pid})...")
try:
self.ollama_process.terminate()
self.ollama_process.wait(timeout=10)
print("Ollama server terminated.")
except subprocess.TimeoutExpired:
print("Ollama server kill required.")
self.ollama_process.kill()
self.ollama_process.wait()
except Exception as e:
print(f"Error shutting down Ollama server: {e}")
else:
print("Ollama server process already exited or not found.")
print("Shutdown complete.")
@modal.web_server(port=OLLAMA_PORT, startup_timeout=180)
def serve(self):
"""
Exposes the Ollama server's API endpoints through Modal's web_server.
This is the key function that makes Ollama's API accessible over the internet.
The web_server decorator maps Modal's HTTPS endpoint to Ollama's internal port.
"""
print(f"Serving Ollama API on port {OLLAMA_PORT}")
Running prompt tests
The following method lets us run test prompts against our Ollama models. This is useful for verifying that the models are working correctly and for seeing how they respond to different types of prompts.
@modal.method()
async def run_tests(self):
"""
Tests the Ollama server by sending various prompts to each configured model.
Returns a dictionary of results organized by model.
"""
import openai
from openai.types.chat import ChatCompletionMessageParam
print("Running tests inside OllamaServer container...")
all_results = {} # Store results per model
# Configure OpenAI client to use our Ollama server
base_api_url = f"http://localhost:{OLLAMA_PORT}/v1"
print(f"Configuring OpenAI client for: {base_api_url}")
client = openai.AsyncOpenAI(
base_url=base_api_url,
api_key="not-needed", # Ollama doesn't require API keys
)
# Define some test prompts
test_prompts = [
"Explain the theory of relativity in simple terms.",
"Write a short poem about a cat watching rain.",
"What are the main benefits of using Python?",
]
# Test each model with each prompt
for model_name in MODELS_TO_TEST:
print(f"\n===== Testing Model: {model_name} =====")
model_results = []
all_results[model_name] = model_results
for prompt in test_prompts:
print(f"\n--- Testing Prompt ---\n{prompt}\n----------------------")
# Create message in OpenAI format
messages: List[ChatCompletionMessageParam] = [
{"role": "user", "content": prompt}
]
try:
# Call the Ollama API through the OpenAI client
response = await client.chat.completions.create(
model=model_name,
messages=messages,
stream=False,
)
assistant_message = response.choices[0].message.content
print(f"Assistant Response:\n{assistant_message}")
model_results.append(
{
"prompt": prompt,
"status": "success",
"response": assistant_message,
}
)
except Exception as e:
print(
f"Error during API call for model '{model_name}', prompt '{prompt}': {e}"
)
model_results.append(
{"prompt": prompt, "status": "error", "error": str(e)}
)
print("Internal tests finished.")
return all_results
Running the Example
This local entrypoint function provides a simple way to test the Ollama server.
When you run `modal run ollama.py`, this function will:
- Start an OllamaServer instance in the cloud
- Run test prompts against each configured model
- Print a summary of the results
@app.local_entrypoint()
async def local_main():
"""
Tests the Ollama server with sample prompts and prints the results.
Run with: `modal run ollama.py`
"""
print("Triggering test suite on the OllamaServer...")
all_test_results = await OllamaServer().run_tests.remote.aio()
print("\n--- Test Suite Summary ---")
if all_test_results:
for model_name, results in all_test_results.items():
print(f"\n===== Results for Model: {model_name} =====")
successful_tests = 0
if results:
for result in results:
print(f"Prompt: {result['prompt']}")
print(f"Status: {result['status']}")
if result["status"] == "error":
print(f"Error: {result['error']}")
else:
successful_tests += 1
print("----")
print(
f"\nSummary for {model_name}: Total tests: {len(results)}, Successful: {successful_tests}"
)
else:
print("No results returned for this model.")
else:
print("No results returned from test function.")
print("\nTest finished. Your Ollama server is ready to use!")Deploying to Production
While the local entrypoint is great for testing, for production use you’ll want to deploy this application persistently. You can do this with:
modal deploy ollama.py
This creates a persistent deployment that:
- Provides a stable URL endpoint for your Ollama API
- Spins up containers on demand and scales automatically with usage
- Scales to zero when idle, so you aren't billed between requests
- Preserves your models in the persistent volume between invocations
After deployment, you can find your endpoint URL in your Modal dashboard.
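As a quick sanity check, you can call the raw Ollama REST API from your local machine. Here is a minimal sketch using only the standard library; the URL is a placeholder for the endpoint shown in your dashboard:
import json
import urllib.request

ENDPOINT = "https://your-endpoint-url"  # placeholder; copy the real URL from the dashboard

# Ollama's native API lists downloaded models at GET /api/tags
with urllib.request.urlopen(f"{ENDPOINT}/api/tags") as resp:
    tags = json.load(resp)

for model in tags["models"]:
    print(model["name"])  # e.g. "llama3.1:8b"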
You can then use this endpoint with any OpenAI-compatible client by setting:
OPENAI_API_BASE=https://your-endpoint-url/v1  # note the /v1 suffix, matching run_tests above
OPENAI_API_KEY=any-value # Ollama doesn't require authentication
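Or configure the client directly in Python, mirroring the run_tests method above. This sketch again uses a placeholder URL:
import openai

client = openai.OpenAI(
    base_url="https://your-endpoint-url/v1",  # placeholder; use your real endpoint
    api_key="not-needed",  # any non-empty string works; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.1:8b",  # one of the models in MODELS_TO_DOWNLOAD
    messages=[{"role": "user", "content": "Hello from Modal!"}],
)
print(response.choices[0].message.content)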