Chat with PDF: RAG with ColQwen2

In this example, we demonstrate how to use the ColQwen2 model to build a simple “Chat with PDF” retrieval-augmented generation (RAG) app. The ColQwen2 model is based on ColPali but uses the Qwen2-VL-2B-Instruct vision-language model. ColPali is in turn based on the late-interaction embedding approach pioneered in ColBERT.
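To make the late-interaction idea concrete, here is a minimal, dependency-free sketch of ColBERT-style “MaxSim” scoring: each query token is matched against its best-matching document token, and the scores are summed. Real ColQwen2 embeddings are tensors of learned vectors; plain lists of floats stand in for them here.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_embs, doc_embs):
    """Late-interaction score: sum, over query tokens, of the best
    similarity against any document token."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

# Two-token query against a three-token document (toy 2-d embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
print(maxsim_score(query, doc))  # best matches are 0.9 and 0.8, so ≈ 1.7
```

Because each page keeps one vector per token rather than a single pooled vector, fine-grained matches (a number in a table, a label in a figure) are not averaged away.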

Vision-language models with high-quality embeddings obviate the need for complex pre-processing pipelines. See this blog post from Jo Bergum of Vespa for more.

Setup 

First, we’ll import the libraries we need locally and define some constants.

Setting up dependencies 

In Modal, we define container images that run our serverless workloads. We install the packages required for our application in those images.
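An image definition looks roughly like the sketch below; the exact package list and versions are assumptions for illustration, not the example’s actual pins.

```python
import modal

app = modal.App("chat-with-pdf")

# Packages for embedding and generation with ColQwen2; this list is
# illustrative, not the example's actual pins.
model_image = modal.Image.debian_slim(python_version="3.12").pip_install(
    "colpali-engine",  # ColQwen2 / ColPali late-interaction models
    "transformers",    # Qwen2-VL backbone
    "torch",
    "qwen-vl-utils",   # input preprocessing helpers for Qwen2-VL
)
```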

These dependencies are only installed remotely, so we can’t import them locally. Use the .imports context manager to import them only on Modal instead.
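The pattern looks like this sketch, assuming an image named `model_image` as above; imports inside the block only execute remotely, inside containers built from that image.

```python
import modal

model_image = modal.Image.debian_slim().pip_install("torch", "colpali-engine")

# These imports run only in containers built from model_image, so the
# packages need not be installed locally.
with model_image.imports():
    import torch
    from colpali_engine.models import ColQwen2, ColQwen2Processor
```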

Specifying the ColQwen2 model 

Vision-language models (VLMs) for embedding and generation add another layer of simplification to RAG apps based on vector search: we only need one model.
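A sketch of the two roles one model plays in this app; the method names and checkpoint below are illustrative assumptions, not the example’s actual API.

```python
# One VLM covers both sides of RAG. The checkpoint name and the
# embed_image/generate methods here are hypothetical stand-ins.
MODEL_NAME = "vidore/colqwen2-v0.1"  # ColQwen2 on top of Qwen2-VL-2B-Instruct (assumed)

def build_index(page_images, model):
    """Retrieval side: one multi-vector embedding per page image."""
    return [model.embed_image(img) for img in page_images]

def answer(question, retrieved_pages, model):
    """Generation side: the same model reads the retrieved pages."""
    return model.generate(images=retrieved_pages, prompt=question)
```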

Managing state with Modal Volumes and Dicts 

Chat services are stateful: the response to an incoming user message depends on past user messages in a session.

RAG apps add even more state: the documents being retrieved from and the index over those documents, e.g. the embeddings.

Modal Functions are stateless in and of themselves. They don’t retain information from input to input. That’s what enables Modal Functions to automatically scale up and down based on the number of incoming requests.

Managing chat sessions with Modal Dicts 

In this example, we use a modal.Dict to store state information between Function calls.

Modal Dicts behave similarly to Python dictionaries, but they are backed by remote storage and accessible to all of your Modal Functions. They can contain any Python object that can be serialized using cloudpickle.

A Dict can hold a few gigabytes across keys of up to 100 MiB each, so it works well for our chat session state (a few KiB per session) and for our embeddings (a few hundred KiB per PDF page), enough for roughly 100,000 pages of PDFs.
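Session storage with a `modal.Dict` looks roughly like this sketch; the Dict name and the shape of the session record are assumptions.

```python
import modal
import uuid

# A named, persisted Dict shared across all Functions in the workspace.
sessions = modal.Dict.from_name("pdf-chat-sessions", create_if_missing=True)

session_id = str(uuid.uuid4())
sessions[session_id] = {"messages": [], "pdf_id": None}

# Later calls, possibly in other containers, read the same state.
record = sessions[session_id]
record["messages"].append({"role": "user", "content": "What is on page 3?"})
sessions[session_id] = record  # write back: in-place mutations aren't synced
```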

At a larger scale, we’d need to replace this with a database, like Postgres, or push more state to the client.

Storing PDFs on a Modal Volume 

Images extracted from PDFs are larger than our session state or embeddings — low tens of MiB per page.

So we store them on a Modal Volume, which can store terabytes (or more!) of data across tens of thousands of files.

Volumes behave like a remote file system: we read and write from them much like a local file system.
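A sketch of writing page images to a mounted Volume; the Volume name and mount point are assumptions.

```python
from pathlib import Path

import modal

app = modal.App("chat-with-pdf")
pdf_volume = modal.Volume.from_name("pdf-page-images", create_if_missing=True)

PDF_ROOT = Path("/vol/pdfs")  # where the Volume appears in the container

@app.function(volumes={PDF_ROOT.as_posix(): pdf_volume})
def save_page(pdf_id: str, page_number: int, png_bytes: bytes):
    target = PDF_ROOT / pdf_id / f"{page_number}.png"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(png_bytes)  # ordinary file I/O on the mount
    pdf_volume.commit()            # persist writes for other containers
```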

Caching the model weights 

We’ll also use a Volume to cache the model weights.

Running this function will download the model weights to the cache volume. Otherwise, the model weights will be downloaded on the first query. For more on storing model weights on Modal, see this guide.
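The download function might look like this sketch, assuming the weights come from the Hugging Face Hub; the Volume name, mount point, and checkpoint are assumptions.

```python
import modal

app = modal.App("chat-with-pdf")
weights_image = modal.Image.debian_slim().pip_install("huggingface_hub")

# Cache Volume mounted where we point Hugging Face downloads.
cache_volume = modal.Volume.from_name("hf-hub-cache", create_if_missing=True)

@app.function(image=weights_image, volumes={"/cache": cache_volume})
def download_model():
    from huggingface_hub import snapshot_download

    # The checkpoint name is an assumption; substitute the model you serve.
    snapshot_download("vidore/colqwen2-v0.1", cache_dir="/cache")
```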

Defining a Chat with PDF service 

To deploy an autoscaling “Chat with PDF” vision-language model service on Modal, we just need to wrap our Python logic in a Modal App:

We use the Modal @app.cls decorator to organize the “lifecycle” of the app: loading the model on container start (@modal.enter) and running inference on request (@modal.method).

We include in the arguments to the @app.cls decorator all the information about this service’s infrastructure: the container image, the remote storage, and the GPU requirements.
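The skeleton looks roughly like this; the GPU type, Volume name, checkpoint, and method signature are assumptions for illustration.

```python
import modal

app = modal.App("chat-with-pdf")
model_image = modal.Image.debian_slim().pip_install(
    "torch", "colpali-engine", "transformers"
)
cache_volume = modal.Volume.from_name("hf-hub-cache", create_if_missing=True)

# Infrastructure goes in the decorator; lifecycle goes in the methods.
@app.cls(image=model_image, gpu="A100", volumes={"/cache": cache_volume})
class Model:
    @modal.enter()  # runs once per container, at startup
    def load(self):
        from colpali_engine.models import ColQwen2, ColQwen2Processor

        self.model = ColQwen2.from_pretrained(
            "vidore/colqwen2-v0.1", cache_dir="/cache"
        )
        self.processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

    @modal.method()  # runs on each request
    def respond(self, session_id: str, question: str) -> str:
        ...  # embed the question, retrieve pages, generate an answer
```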

Loading PDFs as images 

Vision-Language Models operate on images, not PDFs directly, so we need to convert our PDFs into images first.

We separate this step from our indexing and chatting logic so that it runs in a different container with different dependencies.
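One common approach to rasterizing PDFs is the pdf2image library, which wraps the poppler utilities; whether this example uses that particular library is an assumption.

```python
# pdf2image requires the poppler utilities, which would be installed in
# the container image (e.g. via apt_install("poppler-utils")).
from pdf2image import convert_from_bytes

def pdf_to_images(pdf_bytes: bytes, dpi: int = 150):
    """Return one PIL image per page of the PDF."""
    return convert_from_bytes(pdf_bytes, dpi=dpi)
```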

Chatting with a PDF from the terminal 

Before deploying in a UI, we can test our service from the terminal.

Just run
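```shell
# The filename is an assumption; substitute the name of your copy of
# the example script.
modal run chat_with_pdf_vision.py
```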

and optionally pass a path or URL for a PDF with the --pdf-path argument and a question with the --question argument.

Continue a previous chat by passing the session ID printed to the terminal at start with the --session-id argument.

A hosted Gradio interface 

With the Gradio library, we can create a simple web interface around our class in Python, then use Modal to host it for anyone to try out.

To deploy your own, run
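```shell
# The filename is an assumption; substitute the name of your copy of
# the example script.
modal deploy chat_with_pdf_vision.py
```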

and navigate to the URL that appears in your terminal. If you’re editing the code, use modal serve instead to see changes hot-reload.

Addenda 

The remainder of this code consists of utility functions and boilerplate used in the main code above.