Why I Zarr?

Josh Moore
Open-Source Science (OSSci)
4 min read · Jul 10, 2022

[Figure: the Zarr logo, a three-dimensional “Z” set within an XYZ coordinate system, with the word “Zarr” to the right.]
Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. (“Z” is for compressed, as in zlib.)

In bioimaging, there are hundreds of ways that acquired images get written to disk, leaving developers like me constantly wondering how to (best) read what are essentially just n-dimensional arrays of data.

Discovering HDF5 in the early ’00s was an eye-opener. The thought of a single library that cleanly managed I/O for large volumes, rather than, e.g., ungainly stacks of TIFFs, gave me hope. Unfortunately, opinions were mixed enough that a community-wide move to a standard format never materialized.

Fast forward several years, and the growing pressures of cloud storage and gargantuan volumes have led to the development of next-generation file formats like Zarr. Built very much on the model of HDF, the Zarr format not only provides a clean library for accessing the internals of binary files but also makes those internals themselves simple and transparent. Each n-dimensional chunk of data is given a name and placed in a separate file, making it accessible without the need for a library and vastly improving parallelism.
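
To make that concrete, here is a minimal sketch using zarr-python (the store name, shape, and chunk sizes are illustrative, not taken from any real dataset): writing a small array produces a directory of plainly named chunk files sitting next to a small metadata file.

```python
import numpy as np
import zarr

# Create a 3D array backed by a directory store; each 100×100×100 chunk
# is written to its own file, named by its position on the chunk grid
# (e.g. "0.0.0", "0.0.1", ...).
z = zarr.open(
    "example.zarr",
    mode="w",
    shape=(200, 200, 200),
    chunks=(100, 100, 100),
    dtype="uint16",
)
z[:] = np.random.randint(0, 2**16, size=z.shape, dtype="uint16")

# Reading a corner only touches the chunk files that overlap the slice.
corner = z[:100, :100, :100]
print(corner.shape)  # (100, 100, 100)
```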

[Figure: benchmark plot of seconds per chunk (10⁻⁵ to 10⁰) on the X axis against “Overhead”, “Zarr”, “Tiff”, and “HDF5” on the Y axis, with values for local, HTTP, and S3 storage; on S3, each Zarr chunk access is roughly an order of magnitude faster.]
2021 benchmark comparing access latency per chunk, published in Nature Methods.

Even so, Zarr’s not the ONE file format for bioimaging. As with HDF5, it’s not ideal for every use case. It thrives in situations where many atomic operations can be coordinated lock-free (like the cloud), but on inode-limited cluster filesystems Zarr may make your system administrators very unhappy. It’s important to know the trade-offs and choose what will work in a given situation. As a community, though, we can strive to keep users’ lives simple by enabling interoperability between a few n-dimensional formats. With some work, covering the full breadth of use cases and reducing the number of ways that things get written to disk is closer than ever before.

Who can Zarr?

An added benefit of Zarr’s simplicity is how tractable it becomes to keep your data FAIR and accessible over the long term. Though the bytes need decoding, a fair amount of the general structure needs no interpretation. This also makes implementing the Zarr specification a relatively approachable task. Though Zarr was initially developed in Python, the ease of implementation and the utility of having a way to persist collections of n-dimensional arrays to disk and share them widely have led to its adoption in C, C++, Java, Julia, JavaScript, and Rust (with the R community in hot pursuit).
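
As a rough illustration of that transparency (this assumes the hypothetical example.zarr store from the sketch above, written in the version 2 format that was current when this post appeared), the metadata describing an array is a small JSON document sitting alongside the chunk files, readable in any language, or by eye, without a Zarr library at all:

```python
import json
from pathlib import Path

# A Zarr v2 array is just a directory, roughly:
#   example.zarr/.zarray   <- JSON metadata: shape, chunks, dtype, compressor, ...
#   example.zarr/0.0.0     <- compressed chunk at grid position (0, 0, 0)
#   example.zarr/0.0.1     <- and so on
meta = json.loads(Path("example.zarr/.zarray").read_text())
print(meta["shape"], meta["chunks"], meta["dtype"])
# e.g. [200, 200, 200] [100, 100, 100] <u2
```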

[Figure: the logos for Python, C++, Java, JavaScript, Julia, and Rust passing Zarr cubes to one another with small drawn arms.]
Languages with Zarr support, happily sharing n-dimensional data via Zarr.

Data scientists working in any of those programming languages (and others!) are encouraged to give Zarr a try. If you have (humongous) binary data that you want to compress and share online, it’s likely a good fit. There is also a growing number of related libraries that provide explicit methods for reading and writing Zarr (like dask and xarray in Python), which encourages its use as a common exchange mechanism.
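
As one sketch of that exchange pattern (the S3 URL below is a placeholder, not a real dataset), dask can open a Zarr store lazily and only fetch the chunks a computation actually needs:

```python
import dask.array as da

# Lazily open a (hypothetical) Zarr array on S3; only the lightweight
# metadata is read at this point, no chunk data is downloaded.
arr = da.from_zarr("s3://example-bucket/volume.zarr")

# Reducing over a single plane pulls just the overlapping chunks,
# in parallel, rather than the whole volume.
print(arr[0].mean().compute())
```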

Or, if you are interested in getting involved in the internals, much of the feverish development at the moment revolves around how best to fit Zarr into more storage models. There are experiments, for example, in how to store “distributed” Zarrs on IPFS, how to create “virtual” Zarr files by pre-processing large monolithic formats with Kerchunk, and how best to group chunk files into “sharded” Zarrs to reduce the total number of files on disk.
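
To give a flavor of the “virtual” approach, here is a rough sketch of the Kerchunk workflow (the file name is made up, and the exact API may have shifted since this was written): an existing HDF5 file is scanned once, and the resulting JSON references let zarr read its internal chunks in place, without copying any data.

```python
import fsspec
import zarr
from kerchunk.hdf import SingleHdf5ToZarr

# Scan a (hypothetical) monolithic HDF5 file and build Zarr-style
# references pointing at the byte ranges of its internal chunks.
with open("acquisition.h5", "rb") as f:
    refs = SingleHdf5ToZarr(f, url="acquisition.h5").translate()

# Expose those references through fsspec's "reference" filesystem;
# no data is copied, but the file now behaves like a Zarr store.
mapper = fsspec.get_mapper("reference://", fo=refs)
group = zarr.open(mapper, mode="r")
print(list(group.keys()))
```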

[Figure: a “monolithic file” shown as a single file containing large binary blocks, versus a “Zarr directory” of nested directories containing many much smaller binary files.]
Simple schematic of the difference in internals between monolithic and cloud-native formats.

But we’re also always looking for ways to bridge communities. The NetCDF team at UCAR, for example, has added Zarr as a backend to the netcdf-c and netcdf-java libraries, allowing users to keep the same API while gaining the benefits of both HDF5 and Zarr. We’d similarly like to see TileDB users able to easily ingest existing Zarrs, letting TileDB do the heavy lifting that is not Zarr’s focus.

How to Zarr?

If you are as excited as we are about sharing (or reducing!) the burden of software development for scientific data, we look forward to discussing it with you through the new Open Source Science (OSSci) channels. Until then, if you’re interested in reaching out to the Zarr developers, the best place to do so is on Gitter. Or if you want to learn more first, check out the tutorial at https://zarr.readthedocs.io or the introductory talks on the YouTube playlist. All materials are also available from the main website, https://zarr.dev.

Josh Moore is a maintainer of the Open Microscopy Environment (OME) and the Zarr projects. You can read more on the effort to link these two developments in “OME-NGFF: a next-generation file format for expanding bioimaging data-access strategies” (with more of the backstory in the supplementary note).
