Posts by Sarah Johnson

One Trillion Row Challenge

Last month Gunnar Morling launched the One Billion Row Challenge with the task of writing a Java program that reads temperature measurements from a text file and calculates the min, mean, and max temperature per weather station. It took off far more than anyone expected, attracting dozens of submissions built with different tools.

Read more ...


1BRC in Python with Dask

Last week Gunnar Morling launched the One Billion Row Challenge and it’s been fun to follow along. Though the official challenge is limited to Java implementations, we were inspired by an unofficial Python submission and have our own unofficial submission for Dask.

Bar chart comparing 1BRC runtime for various Python implementations.

Read more ...


How to Run Your Jupyter Notebook on a GPU in the Cloud

You can often significantly reduce the time it takes to train your neural network by using advanced hardware, like GPUs. In this example, we’ll go through how to train a PyTorch neural network on a GPU in the cloud using Coiled notebooks.

Snippet showing how to start a Jupyter notebook on a GPU with `coiled notebook start --vm-type g5.xlarge --region us-west-2`.

Read more ...


Processing a 250 TB dataset with Coiled, Dask, and Xarray

We processed 250 TB of geospatial cloud data in twenty minutes with Xarray, Dask, and Coiled. We did this to demonstrate scale and to think about costs.

County-level heat map of the continental US showing mean depth to soil saturation (in meters) in 2020.

Read more ...


How well does Dask run on Graviton?

ARM-based processors are known for matching the performance of x86-based instance types at a lower cost, since they consume far less energy for the same performance. It’s not surprising, then, that some companies, like Honeycomb, are switching their entire infrastructure to ARM.

Bar chart of AWS cost vs. processor type.

Read more ...


Just in time Python environments

Docker is a great tool for creating portable software environments, but we found it’s too slow for interactive exploration. Clusters that depend on Docker images often take 5+ minutes to launch. Ouch.

Read more ...