Analyze NYC yellow taxi data with DuckDB on Parquet files from S3
This example shows how to use Modal for a classic data science task: loading table-structured data into cloud stores, analyzing it, and plotting the results.
In particular, we’ll load public NYC taxi ride data into S3 as Parquet files, then run SQL queries on it with DuckDB.
We’ll mount the S3 bucket in a Modal app with CloudBucketMount.
We will write to and then read from that bucket, in each case using
Modal’s parallel execution features to handle many files at once.
Basic setup
You will need to have an S3 bucket and AWS credentials to run this example. Refer to the documentation for the exact IAM permissions your credentials will need.
Once you have created a bucket and configured its IAM settings,
create a Secret to share
the relevant AWS credentials with your Modal apps.
The dependencies installed in the container image are not available locally, so we instruct Modal to import them only inside the container.
Download New York City’s taxi data
NYC makes data about taxi rides publicly available. The city’s Taxi & Limousine Commission (TLC) publishes files in the Parquet format. Files are organized by year and month.
We are going to download all available files and store them in an S3 bucket. We do this by
attaching a modal.CloudBucketMount with the S3 bucket name and its respective credentials.
The files in the bucket will then be available at MOUNT_PATH.
As we’ll see below, this operation can be massively sped up by running it in parallel on Modal.
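The per-file bookkeeping of such a download step can be sketched in plain Python. This is only a sketch under stated assumptions, not the example's actual code: the helper name `plan_download`, the value of `MOUNT_PATH`, and the URL pattern for the TLC's public files are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Assumed mount location inside the container; the real value is set by the app.
MOUNT_PATH = PurePosixPath("/bucket")

# Assumed URL pattern for the TLC's public trip-record files.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def plan_download(year: int, month: int) -> tuple[str, PurePosixPath]:
    """Return the source URL and the destination path for one month of data."""
    # Files are organized by year and month, with a zero-padded month.
    filename = f"yellow_tripdata_{year}-{month:02d}.parquet"
    return f"{BASE_URL}/{filename}", MOUNT_PATH / filename


url, dest = plan_download(2023, 1)
print(url)
print(dest)
```

Because each (year, month) pair maps to its own file, the downloads are independent of one another, which is what makes them easy to run in parallel.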
Analyze data with DuckDB
DuckDB is an analytical database with rich support for Parquet files. It is also very fast. Below, we define a Modal Function that aggregates yellow taxi trips within a month (each file contains all the rides from a specific month).
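As a rough illustration of the kind of query such a function might run, the sketch below builds a DuckDB SQL string that counts rides per pickup date in a single Parquet file. The helper name `daily_rides_query` and the output column names are hypothetical, and the sketch assumes the file has a `tpep_pickup_datetime` column, as the TLC yellow taxi files do at the time of writing.

```python
def daily_rides_query(parquet_path: str) -> str:
    """Build a DuckDB SQL query counting rides per pickup date in one file."""
    # Double any single quotes so the path is a valid SQL string literal.
    safe_path = parquet_path.replace("'", "''")
    return (
        "SELECT tpep_pickup_datetime::DATE AS pickup_date, "
        "COUNT(*) AS n_rides "
        f"FROM read_parquet('{safe_path}') "
        "GROUP BY 1 ORDER BY 1"
    )


query = daily_rides_query("/bucket/yellow_tripdata_2023-01.parquet")
print(query)
```

With the `duckdb` package installed, the string could be executed with something like `duckdb.connect().execute(query).fetchall()`, which returns one row per pickup date.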
Plot daily taxi rides
Finally, we want to plot our results. The plot created shows the number of yellow taxi rides per day in NYC. This function runs remotely, on Modal, so we don’t need to install plotting libraries locally.
Run everything
The function decorated with @app.local_entrypoint() defines what happens when we run our Modal program locally.
We invoke it from the CLI by running modal run s3_bucket_mount.py.
We first call download_data() with starmap (so named because it’s kind of like map(*args))
on tuples of inputs (year, month). This downloads, in parallel,
all yellow taxi data files into our locally mounted S3 bucket and returns a list of
Parquet file paths. Then, we call aggregate_data() with map on that list. These files are
also read from our S3 bucket. So one function writes files to S3 and the other
reads files from S3; both run across many files in parallel.
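The difference between starmap and map can be seen in a local, sequential analogy built on Python's standard library. This is not Modal's API: the two functions below are hypothetical stand-ins with made-up bodies, the year range is arbitrary, and nothing here runs remotely or in parallel.

```python
from itertools import starmap


def download_data(year: int, month: int) -> str:
    """Stand-in for the remote function: returns the path it would write."""
    return f"/bucket/yellow_tripdata_{year}-{month:02d}.parquet"


def aggregate_data(path: str) -> str:
    """Stand-in for the remote function: returns the file's YYYY-MM label."""
    return path.rsplit("_", 1)[-1].removesuffix(".parquet")


# One (year, month) tuple per file; the years are chosen only for illustration.
inputs = [(year, month) for year in (2022, 2023) for month in range(1, 13)]

# starmap unpacks each tuple into arguments: download_data(year, month).
paths = list(starmap(download_data, inputs))

# map passes each item as a single argument: aggregate_data(path).
labels = list(map(aggregate_data, paths))

print(len(paths), labels[0], labels[-1])
```

The shape of the pipeline is the same as in the entrypoint: the output list of the first stage becomes the input list of the second.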
Finally, we call plot to generate a figure showing the number of rides per day.
This program should run in less than 30 seconds.