Posts by Patrick Hoefler
TPC-H Benchmarks for Query Optimization with Dask Expressions
- 05 October 2023
Dask-expr is an ongoing effort to add a logical query optimization layer to Dask DataFrames. We now have the first benchmark results to share that were run against the current DataFrame implementation.
Coiled observability wins: Chunksize
- 19 September 2023
Distributed computing is hard; distributed debugging is even harder. Dask tries to simplify this process as much as possible. Coiled adds further observability features for your Dask clusters and processes them to help users understand their workflows better.
Reduce training time for CPU intensive models with scikit-learn and Coiled Functions
- 01 September 2023
You can use Coiled Run and Coiled Functions to easily run scripts and functions on a VM in the cloud.
Process Hundreds of GB of Data with DuckDB in the Cloud
- 07 August 2023
DuckDB is a great tool for running efficient queries on large datasets. When you want cloud data proximity or need more RAM, Coiled makes it easy to run your Python function in the cloud. In this post we’ll use Coiled Functions to process the 150 GB Uber-Lyft dataset on a single machine with DuckDB.
High Level Query Optimization in Dask
- 04 August 2023
Dask DataFrame doesn’t currently optimize your code for you (like Spark or a SQL database would). This means that users waste a lot of computation. Let’s look at a common example that looks OK at first glance but is actually pretty inefficient.
How to Train a Neural Network on a GPU in the Cloud with coiled functions
- 24 July 2023
We recently pushed out two new and experimental features: coiled run and coiled functions, which is a deviation of coiled run. We are excited about both of them because they:
Dask performance benchmarking put to the test: Fixing a pandas bottleneck
- 23 June 2023
Getting notified of a significant performance regression the day before release sucks, but quickly identifying and resolving it feels great!
Utilizing PyArrow to improve pandas and Dask workflows
- 05 June 2023
Get the most out of PyArrow support in pandas and Dask right now