Upstream testing in Dask
Dask has deep integrations with other libraries in the PyData ecosystem, such as NumPy, pandas, Zarr, and PyArrow. Part of providing a good experience for Dask users is making sure that Dask continues to work well with this community of libraries as they push out new releases. This post walks through how Dask maintainers proactively make sure Dask keeps working with its surrounding ecosystem.
Dask has a dedicated CI build that runs Dask’s normal test suite once a day with unreleased, nightly versions of several libraries installed. This lets us check whether a recent change in a library like NumPy or pandas breaks some aspect of Dask’s functionality.
To increase visibility when such a breakage occurs, the upstream CI build automatically opens an issue that summarizes which tests failed and links to the build logs for the corresponding failure (here’s an example issue).
This makes it less likely that a failing upstream build goes unnoticed.
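As a rough illustration of the "summary of failed tests" step, here is a minimal sketch that turns a JSON-lines test log into a plain-text issue body. The record shape (one JSON object per line with `$report_type`, `nodeid`, and `outcome` keys) is an assumption modeled on what pytest's report-log plugin writes, and `summarize_failures` and `sample_log` are hypothetical names used only for this example. It is not the code the upstream build actually runs.

```python
import json
from collections.abc import Iterable


def summarize_failures(log_lines: Iterable[str]) -> str:
    """Build a plain-text issue body listing failed tests from a JSON-lines log."""
    failed: list[str] = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed record shape: test reports carry "$report_type",
        # "nodeid", and "outcome" keys; other record types are skipped.
        if record.get("$report_type") != "TestReport":
            continue
        nodeid = record.get("nodeid", "<unknown test>")
        # A test can report a failure in more than one phase; list it once.
        if record.get("outcome") == "failed" and nodeid not in failed:
            failed.append(nodeid)
    if not failed:
        return "No test failures."
    header = f"{len(failed)} test(s) failed in the upstream build:"
    return "\n".join([header, *(f"- {nodeid}" for nodeid in failed)])


# Hypothetical log contents, for illustration only.
sample_log = [
    '{"$report_type": "SessionStart"}',
    '{"$report_type": "TestReport", "nodeid": "tests/test_a.py::test_ok", '
    '"outcome": "passed", "when": "call"}',
    '{"$report_type": "TestReport", "nodeid": "tests/test_b.py::test_broken", '
    '"outcome": "failed", "when": "call"}',
]
print(summarize_failures(sample_log))
```

A real workflow would read the log file produced by the test run and pass the resulting text to whatever tool opens the issue; the sketch only covers the summarizing step.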
How things can break and are fixed
Things usually break in one of two ways:
1. A library made an intentional change in behavior and a corresponding compatibility change needs to be made in Dask (the next section has an example of this case).
2. A change made in a library had some unintended consequence that resulted in a breakage in Dask.
When the latter case occurs, Dask maintainers can work with the other library’s maintainers to resolve the unintended breakage. This all happens before the library pushes out a new release, so no user code breaks.
Example: pandas 2.0
One specific example of this process in action is the recent pandas 2.0 release. This is a major version release and contains significant breaking changes, such as the removal of deprecated functionality.
As these breaking changes were merged into pandas, we started seeing related failures in Dask’s upstream CI build. Dask maintainers were then able to add a variety of compatibility changes so that Dask worked well with pandas 2.0 as soon as it was released.
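Compatibility changes of this kind are often gated on the installed library version. The following is a minimal sketch of that pattern, not Dask's actual implementation: `release_tuple`, `at_least`, and `select_code_path` are hypothetical helpers, and the version strings are made up for illustration. Note that it deliberately treats a nightly build such as `2.0.0.dev0+...` as already being 2.0.0, which differs from strict PEP 440 ordering (where a `.dev` build sorts before its final release) but is convenient when testing against unreleased code.

```python
import re


def release_tuple(version: str) -> tuple[int, ...]:
    """Extract the leading numeric release, e.g. '2.1.0.dev0+12.gabc' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def at_least(installed: str, minimum: str) -> bool:
    """Return True if the installed release is >= the minimum release.

    Pre-release and dev suffixes are ignored, so a nightly build of 2.0.0
    counts as 2.0.0 (an intentional simplification, see the note above).
    """
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "2.0" and "2.0.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def select_code_path(pandas_version: str) -> str:
    """Hypothetical switch between a pandas 2.x path and a legacy path."""
    if at_least(pandas_version, "2.0.0"):
        return "pandas-2"
    return "legacy"


for version in ["1.5.3", "2.0.0.dev0+1135.gabc1234", "2.0.0"]:
    print(version, "->", select_code_path(version))
```

In real code the version string would come from the installed library (for example `pandas.__version__`), and the two branches would call the old or new API as appropriate.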
Special thanks to Justus Magin for his work on the xarray-contrib/issue-from-pytest-log GitHub action. We’ve found it really convenient for opening GitHub issues when test failures occur.
Also, thanks to Irina Truong (Coiled), Patrick Hoefler (Coiled), and Matthew Roeschke (NVIDIA) for their efforts ensuring pandas and Dask continue to work together.