Posts by Patrick Hoefler

TPC-H Benchmarks for Query Optimization with Dask Expressions

Dask-expr is an ongoing effort to add a logical query optimization layer to Dask DataFrames. We can now share the first results from benchmarks run against the current DataFrame implementation.



Coiled observability wins: Chunksize

Distributed computing is hard; distributed debugging is even harder. Dask tries to simplify this process as much as possible. Coiled adds observability features for your Dask clusters and processes the resulting data to help users understand their workflows better.




Reduce training time for CPU intensive models with scikit-learn and Coiled Functions

You can use Coiled Run and Coiled Functions to easily run scripts and functions on a VM in the cloud.

Code snippet adding coiled.function decorator to scikit-learn model training.



Process Hundreds of GB of Data with DuckDB in the Cloud

DuckDB is a great tool for running efficient queries on large datasets. When you want cloud data proximity or need more RAM, Coiled makes it easy to run your Python function in the cloud. In this post we’ll use Coiled Functions to process the 150 GB Uber-Lyft dataset on a single machine with DuckDB.

Code snippet of using the coiled.function decorator to run a query with DuckDB on a large VM in the cloud.



High Level Query Optimization in Dask

Dask DataFrame doesn’t currently optimize your code for you (like Spark or a SQL database would). This means that users waste a lot of computation. Let’s look at a common example which looks ok at first glance, but is actually pretty inefficient.



How to Train a Neural Network on a GPU in the Cloud with coiled functions

We recently pushed out two new, experimental features: coiled run and coiled functions, the latter being a variation of coiled run. We are excited about both of them because they:

Read more ...


Dask performance benchmarking put to the test: Fixing a pandas bottleneck

Getting notified of a significant performance regression the day before release sucks, but quickly identifying and resolving it feels great!



Utilizing PyArrow to improve pandas and Dask workflows

Get the most out of PyArrow support in pandas and Dask right now.
