All Posts
Shuffling large data at constant memory in Dask
- 15 March 2023
With release 2023.2.1, dask.dataframe introduces a new shuffling method called P2P, which makes sorts, merges, and joins faster and keeps memory use constant.
Benchmarks show impressive improvements.

Just in time Python environments
- 23 February 2023
Docker is a great tool for creating portable software environments, but we found it’s too slow for interactive exploration. Clusters that depend on Docker images often take 5+ minutes to launch. Ouch.
How many PEPs does it take to install a package?
- 17 January 2023
A few months ago we released package sync, a feature that takes your Python environment and replicates it in the cloud with zero effort.
Scaling Hyperparameter Optimization With XGBoost, Optuna, and Dask
- 06 January 2023
XGBoost is one of the most well-known libraries among data scientists, having become one of the top choices among Kaggle competitors. It is performant in a wide array of supervised machine learning problems, implements scalable training through the rabit library, and integrates with many big data processing tools, including Dask.

Handling Unexpected AWS IAM Changes
- 06 January 2023
The cloud is tricky! You might think the rules that determine which IAM permissions are required for which actions will continue to apply in the same way. You might think they’d apply the same way to different AWS accounts. Or that if these things aren’t true, at least AWS will let you know. (I did.) You’d be wrong!
AWS Cost Explorer Tips and Tricks
- 06 January 2023
Spending time in AWS Cost Explorer is one of the best ways to understand what’s going on in your AWS account. It’s one of the few places in the AWS Console where you can get a global view of your account or even of your entire organization.
Automated Data Pipelines On Dask With Coiled & Prefect
- 19 December 2022
Dask is widely used among data scientists and engineers proficient in Python for interacting with big data, doing statistical analysis, and developing machine learning models. Operationalizing this work has traditionally required lengthy code rewrites, which makes moving from development to production hard. This gap slows business progress and increases risk for data science and data engineering projects in an enterprise setting. The need to remove this bottleneck has prompted the emergence of production deployment solutions that allow code written by data scientists and engineers to be deployed directly to production, unlocking the power of continuous deployment for pure Python data science and engineering.
