Shuffle sharding in Dropbox's storage infrastructure

First, some terms and context:

[We aggregate blocks] into 1GB logical storage containers called buckets. [Buckets] are aggregated together and erasure coded for storage efficiency. We use the term volume to refer to one or more buckets replicated onto a set of physical storage nodes.

OSDs (Object Storage Devices) [are] storage boxes full of disks that can store over a petabyte of data in a single machine, or over 8 PB per rack.

Cells are self-contained [clusters of OSDs] that store around 50PB of raw data.
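
To keep those terms straight, here's a rough sketch of the hierarchy in Python. The types and field names are purely illustrative (my own, not Dropbox's data model); the only thing it encodes is the containment described above:

```python
from dataclasses import dataclass, field

@dataclass
class Bucket:
    """~1GB logical container of aggregated blocks."""
    bucket_id: int
    blocks: list[bytes] = field(default_factory=list)

@dataclass
class Volume:
    """One or more buckets, erasure coded and replicated onto a set of OSDs."""
    volume_id: int
    bucket_ids: list[int]
    osd_ids: list[int]  # the physical storage nodes holding this volume

@dataclass
class Cell:
    """A self-contained cluster of OSDs, around 50PB of raw storage."""
    osd_ids: list[int]
    volumes: dict[int, Volume] = field(default_factory=dict)
```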

Now, here's how buckets and volumes are shuffle sharded throughout the data center to preserve availability and durability in the face of failure:

Volumes are spread somewhat randomly throughout a cell, and each OSD holds several thousand volumes. This means that if we lose a single OSD, we can reconstruct the full set of volumes from hundreds of other OSDs simultaneously.

In practice there are thousands of volumes per OSD, and hundreds of other OSDs they share this data with. This allows us to amortize the reconstruction traffic across hundreds of network cards and thousands of disk spindles to minimize recovery time.
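
Here's a minimal simulation of that idea, just to make the arithmetic concrete. The cell size, OSDs-per-volume count, and volume count are assumptions I picked for illustration, not Dropbox's actual parameters:

```python
import random

OSDS_IN_CELL = 1_000     # assumed number of OSDs in a cell
OSDS_PER_VOLUME = 4      # assumed number of OSDs backing each volume
NUM_VOLUMES = 500_000    # enough that each OSD holds thousands of volumes

rng = random.Random(0)

# Shuffle sharding: each volume lands on a small, somewhat-random subset of OSDs.
placement = {
    vol: tuple(rng.sample(range(OSDS_IN_CELL), OSDS_PER_VOLUME))
    for vol in range(NUM_VOLUMES)
}

# Simulate losing one OSD: which volumes are affected, and how many distinct
# peer OSDs hold the data needed to reconstruct them?
failed = 17
affected = [vol for vol, osds in placement.items() if failed in osds]
peers = {osd for vol in affected for osd in placement[vol] if osd != failed}

print(f"volumes on the failed OSD: {len(affected)}")            # ~2,000
print(f"peer OSDs available for reconstruction: {len(peers)}")  # nearly the whole cell
```

Because the volumes on the failed OSD are shared with nearly every other OSD in the cell, the rebuild reads fan out across hundreds of network cards and thousands of spindles instead of hammering a handful of machines.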

From James Cowling’s 2016 description of Dropbox’s storage architecture. I don’t know how I missed that in my review of object storage prior art and literature.