All Posts

Shuffling large data at constant memory in Dask

With release 2023.2.1, dask.dataframe introduces a new shuffling method called P2P that makes sorts, merges, and joins faster while running in constant memory. Benchmarks show impressive improvements:

P2P shuffling uses constant memory while task-based shuffling scales linearly.
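
For context, here is a minimal sketch of opting in to P2P shuffling. The exact configuration key and keyword have shifted between Dask releases (2023.2.1-era code could also pass shuffle="p2p" to individual operations), so treat the names below as illustrative rather than canonical:

    import dask
    import dask.dataframe as dd

    # Select the P2P shuffle method globally; recent Dask releases use
    # this config key.
    dask.config.set({"dataframe.shuffle.method": "p2p"})

    # Hypothetical dataset path; any shuffle-heavy operation benefits.
    df = dd.read_parquet("s3://my-bucket/data/*.parquet")
    result = df.set_index("id")  # set_index forces a full shuffle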

Read more ...


Just in time Python environments

Docker is a great tool for creating portable software environments, but we've found it's too slow for interactive exploration: clusters that depend on Docker images often take 5+ minutes to launch. Ouch.

[Figure: building, pushing, and pulling environment images]

Read more ...


How many PEPs does it take to install a package?

A few months ago we released package sync, a feature that takes your Python environment and replicates it in the cloud with zero effort.
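
As a rough sketch of what this looks like from the user's side (assuming a configured Coiled account; in recent releases this behavior is the default, so the keyword below may be unnecessary):

    import coiled
    from dask.distributed import Client

    # package_sync scans the locally installed packages and recreates
    # the same environment on the cluster VMs, with no Docker image to
    # build, push, or pull.
    cluster = coiled.Cluster(n_workers=5, package_sync=True)
    client = Client(cluster)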

Read more ...


Scaling Hyperparameter Optimization With XGBoost, Optuna, and Dask

XGBoost is one of the best-known libraries among data scientists and a top choice among Kaggle competitors. It performs well on a wide array of supervised machine learning problems, implements scalable training through the rabit library, and integrates with many big data processing tools, including Dask.

[Figure: Dask, Optuna, and XGBoost]
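
A minimal sketch of the pattern, using Optuna's DaskStorage to share study state across workers. The dataset and search space are illustrative stand-ins, and a Coiled cluster can replace the local Client for real scale:

    import optuna
    import xgboost as xgb
    from dask.distributed import Client, wait
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    def objective(trial):
        # Toy data and a small search space, purely for illustration.
        X, y = make_classification(n_samples=1_000, random_state=42)
        model = xgb.XGBClassifier(
            max_depth=trial.suggest_int("max_depth", 2, 10),
            learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            n_estimators=trial.suggest_int("n_estimators", 50, 400),
        )
        return cross_val_score(model, X, y, cv=3).mean()

    client = Client()  # local cluster; swap in coiled.Cluster for real scale
    storage = optuna.integration.DaskStorage()  # study state lives with the scheduler
    study = optuna.create_study(direction="maximize", storage=storage)

    # Each future runs a batch of trials on a worker, all sharing one study.
    futures = [
        client.submit(study.optimize, objective, n_trials=5, pure=False)
        for _ in range(4)
    ]
    wait(futures)
    print(study.best_params)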

Read more ...


Handling Unexpected AWS IAM Changes

The cloud is tricky! You might think the rules that determine which IAM permissions are required for which actions will continue to apply in the same way. You might think they’d apply the same way to different AWS accounts. Or that if these things aren’t true, at least AWS will let you know. (I did.) You’d be wrong!

Read more ...


AWS Cost Explorer Tips and Tricks

Spending time in AWS Cost Explorer is one of the best ways to understand what’s going on in your AWS account. It’s one of the few places in the AWS Console where you can get a global view of your account or even of your entire organization.

Read more ...


Automated Data Pipelines On Dask With Coiled & Prefect

Dask is widely used among data scientists and engineers proficient in Python for interacting with big data, doing statistical analysis, and developing machine learning models. Operationalizing this work has traditionally required lengthy code rewrites, which makes moving from development to production hard. This gap slows business progress and increases risk for data science and data engineering projects in an enterprise setting. The need to remove this bottleneck has prompted the emergence of production deployment solutions that let code written by data scientists and engineers be deployed directly to production, unlocking continuous deployment for pure-Python data science and engineering.

[Figure: the problem Coiled and Prefect solve]
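
As a hedged sketch of the pattern, using the prefect-dask collection to run Prefect tasks on an on-demand Coiled cluster (the task bodies and cluster settings are illustrative, and a configured Coiled account is assumed):

    from prefect import flow, task
    from prefect_dask import DaskTaskRunner

    @task
    def clean(record: int) -> int:
        return record + 1  # stand-in for real transformation logic

    @task
    def summarize(records) -> int:
        return sum(records)

    # DaskTaskRunner accepts a cluster class by dotted name and
    # provisions it for the duration of the flow run.
    @flow(
        task_runner=DaskTaskRunner(
            cluster_class="coiled.Cluster",
            cluster_kwargs={"n_workers": 10},
        )
    )
    def pipeline() -> int:
        cleaned = clean.map(list(range(100)))
        return summarize(cleaned)

    if __name__ == "__main__":
        pipeline()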

Read more ...