Process Hundreds of GB of Data with DuckDB in the Cloud#
DuckDB is great tool for running efficient queries on large datasets. When you want cloud data proximity or need more RAM, Coiled makes it easy to run your Python function in the cloud. In this post we’ll use Coiled Functions to process the 150 GB Uber-Lyft dataset on a single machine with DuckDB.
Query Parquet data with DuckDB#
We start with creating the SQL queries that we want to run against the data locally.
def load_data(conn): # Load data into memory so that subsequent queries are fast conn.execute( ''' CREATE TABLE test AS SELECT * FROM read_parquet("s3://coiled-datasets/uber-lyft-tlc/*") ''' ) def compute_percentage_of_tipped_rides(conn): # Run the actual query return conn.execute( ''' SELECT hvfhs_license_num, sum(tipped) / count(tipped) FROM (select *, CASE WHEN tips > 0.0 then 1 ELSE 0 end as tipped from test) GROUP BY hvfhs_license_num ''' ).fetchall() def create_conn(): import duckdb return duckdb.connect() def query_results(): conn = create_conn() load_data(conn) return compute_percentage_of_tipped_rides(conn)
These queries aren’t particularly fancy, they are meant to illustrate how we can process these files. If we execute these queries as is, it would pull all the data onto our machine. The whole dataset won’t fit in memory on most workstations, so let’s look at how Coiled can make this work. The loading would take a long time, even if we had enough memory.
Query S3 data in the cloud with DuckDB + Coiled serverless functions#
Coiled Functions come into the equation since we need access to machines that have enough resources and are also close to our data. Coiled can connect to AWS or GCP and thus, use all resources that are available there. We will go through the necessary steps execute these queries on a VM in the same region as our data with enough memory available.
We’ll have to adapt our
create_conn function to use
load_aws_credentials and set the AWS region using the DuckDB AWS extension.
def create_conn(): import duckdb conn = duckdb.connect() conn.execute("CALL load_aws_credentials()") return conn
The next step is adding the
@coiled.function decorator to the function that executes our queries.
The decorator will tell Coiled that it should spin up a large VM on AWS and run the query there, and then return the result locally.
@coiled.function( vm_type="m6i.16xlarge", # 256 GB of RAM region="us-east-2", # region of our data keepalive="5 minutes", # keep alive to run multiple queries if necessary ) def query_results(): conn = create_conn() load_data(conn) return compute_percentage_of_tipped_rides(conn)
Let’s execute our queries and pull the results back to our local machine:
result = query_results() print(result) [ ('HV0005', 0.1912300216459857), ('HV0003', 0.1498555901186066), ('HV0004', 0.09294857737045926), ('HV0002', 0.08440046492889111), ]
The data are now all in memory on our VM in the cloud:
There is no need to adjust the other functions. Coiled will run our query on a VM in the cloud with enough resources and close to our data.
Let’s take a brief look at the arguments to
vm_type: This specifies the type of AWS EC2 instance. We are looking for an instance that has enough memory to hold our data. This instance has 256GB, so this should be sufficient.
region: The region specifies the AWS region that our VM is started in. Our data are also in
keepalive: Keeps the VM alive so that we can run multiple queries against the data in memory.
coiled.function() will now start a VM in AWS with the specified EC2 instance. The VM is normally up
and running in 1-2 minutes. Coiled will scan our local environment and replicate the same
dependencies on this machine. We don’t have to specify an explicit Python environment. Inputs of
your function are serialized and sent to the VM
as well. Coiled will return our results back to our local machine.
Coiled would normally shut down the VM immediately after the Python interpreter finishes. This is mostly to
reduce costs. We specified
keepalive="5 minutes" to keep the VM alive for a few of minutes after our Python interpreter
finished. This ensures that new local runs can connect to the same VM avoiding
the boot time of up to 2 minutes; we call this a warm start.
You can use Coiled serverless functions to run queries on a machine with as much memory as you want. This grants you access to computational resources that can be very close to your data. Doing data processing in the cloud becomes very easy with this functionality.
Want to run this example yourself? Get started with Coiled for free at coiled.io/start. This example run comfortably within the free tier.