Not much tech content here. They racked a dense array of drives in a JBOD. There's no description of a complex or sophisticated ingestion engine, system for distributing data across the array, file system, error correction, nothing.
An exabyte of disk is definitely eyebrow-raising, but companies like LinkedIn had an exabyte of storage in 2021 just in their HDFS clusters.
That's about 1GB per user. If you imagine them storing every interaction anyone has with the site-- every file upload, every page load, every mouse movement-- and throw in some duplication-- it's not completely impossible.
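The per-user figure checks out with quick arithmetic; the ~800 million LinkedIn members circa 2021 is my assumption, not a number from the thread:

```python
# Back-of-the-envelope: one exabyte spread over LinkedIn's user base.
# The ~800M member count circa 2021 is an assumption, not from the thread.
exabyte = 1e18                 # bytes
members = 800e6                # assumed member count

gb_per_user = exabyte / members / 1e9
print(round(gb_per_user, 2))   # -> 1.25, i.e. "about 1GB per user"
```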
Do these logs ever get pruned? Is it worth knowing that 10 years ago, Johnny took 1.5 seconds to click the upvote button? Or is it just easier to keep it, imagining some what-if value extraction?
If I remember correctly, JBOD means that if a drive fails, the data on that drive is lost. With that many disks you have an expected failure rate of roughly 1.7% per year (according to the popular drive statistics from Backblaze).
Isn't it a bit harsh to lose 1.7% of data every year? Why not a dead simple RAID that can tolerate the loss of a disk without data loss?
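To make that 1.7% concrete, here's a rough sketch of how many drive failures an exabyte-scale deployment would see per year without redundancy; the ~26 TB per-drive capacity is an assumption, not a figure from the article:

```python
# Rough expected-failure arithmetic for an exabyte of raw disk.
# The 26 TB drive capacity is a guess; 1.7% is the Backblaze-style
# annualized failure rate quoted in the comment above.
total_bytes = 1e18           # one exabyte
drive_bytes = 26e12          # assumed ~26 TB per drive
afr = 0.017                  # annualized failure rate

drives = total_bytes / drive_bytes
failures_per_year = drives * afr
print(round(drives), round(failures_per_year))   # -> 38462 654
```

At hundreds of failed drives per year, some form of software-side redundancy (replication or erasure coding) seems unavoidable, whatever the enclosure itself does.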
A JBOD is just what the initialism says: Just a Bunch of Disks. They are simply enclosures that allow a large number of disks to be hooked up to a server. Any RAID or the like would be done on the software side, with Ceph or something similar, as another reply states.
CERN has had some interesting storage technology, at least in the past. When I was at Seagate, we were trying to pitch our Kinetic drives (basically a 3.5" HDD with an ethernet port that talks key/value store instead of SATA), and CERN was one of the large purchasers of these.
Their datacenter is used for a lot more than just LHC results, also. Zenodo, the open science result repository, also lives in their storage.
Looks like a marketing piece publicizing that CERN is using WD HDD products at scale with no technical details. To make matters worse, the WD product links don’t even work!
CERN publicizing their HGST Ultrastar use [0] in 2013 was what got me started buying their drives, and I never had issues with them. HGST is now part of Western Digital [1].
The last time I had to buy disks I switched to Seagate Exos X and figured I'd continue buying them. I think it was one of the Backblaze Drive Stats posts that made me buy them. I like the drives.
So CERN is now running this kind of advertising campaign again:
> When Bonfillou shared the requirements from the next generation collider, the team suggested testing the company’s new series of JBODs (Just a Bunch of Drives), the Ultrastar hybrid storage platforms.
Since the project is expected to start in 2029, add at least five more years for CERN to collect data on the drive stats; that's a long time to wait.
Does anyone here know if WD's Ultrastar drives are still as good as back then, when HGST was HGST? Was it just a brand change, with everything else (R&D team, design, production) still the same and separate from WD?
A lot of distributed systems innovations came out of the High Energy Physics space.
Nvidia itself was basically bankrolled by DoE labs for much of its early existence, MPI was an HPC project driven by Oak Ridge, the WWW was a side project at CERN to simplify information retrieval, etc.
High Energy Physics is a very data- and computationally heavy problem that lends itself nicely to HPC, and a lot of the innovations in CS subfields like Bioinformatics, Machine Learning, and Theory were enabled by this research.
How is all this data not compressible? XML, for example, compresses 20:1 or better using just gzip. Is all this physics data indistinguishable from white noise? Are there incentives not to compress? Don't take the big data out of big science. Then it's no longer big science and we cannot impress the masses.
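The 20:1 claim for markup is easy to reproduce, and the contrast with noise-like data shows why the answer depends on what the bytes look like; the XML payload below is a made-up example, not CERN's format:

```python
import gzip
import os

# Highly repetitive, XML-like text compresses dramatically with gzip...
xml_like = b"<event><hit ch='42' t='7'/></event>\n" * 10_000
# ...while uniformly random bytes (noise-like data) barely compress at all.
noise = os.urandom(len(xml_like))

xml_ratio = len(xml_like) / len(gzip.compress(xml_like))
noise_ratio = len(noise) / len(gzip.compress(noise))
print(xml_ratio > 20, noise_ratio < 1.01)
```

If detector output is closer to the second case (high-entropy sensor readings), generic compressors buy very little, which is why the pipelines described downthread use domain-specific schemes instead.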
> 'How's the spacecraft doing?' 'I dunno. All this equipment is just used to measure TV ratings.'
What makes you think it's not compressed? (or that the data is stored as XML?)
There are very sophisticated compression systems throughout each experiment's data acquisition pipelines. For example, this paper [1] describes the ALICE experiment's system for Run 3, involving FPGAs and GPUs to be able to handle 3.5TB/s from all the detectors. This one [2] outlines how HL-LHC & CMS use neural networks to fine-tune compression algorithms on a per-detector basis.
Not to mention your standard data files are ROOT TFiles with TTrees which store arrays of compressed objects.
The 'uncompressed' stream has already been winnowed down substantially: there's a lot of processing that happens on the detectors themselves to decide what data is worth even sending off the board. The math for the raw detectors is 100 million channels of data (not sure how many per detector, but there are a lot of them stacked around the collision) sampling at 40 MHz (which is how often the bunches of accelerator particles cross). Even with just 2 bits per sample, that's 1 PB/sec. But most of that is obviously uninteresting and so doesn't even get transmitted.
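The raw-bandwidth arithmetic in that comment checks out:

```python
# Reproduce the back-of-the-envelope from the comment: 100M channels
# sampling at the 40 MHz bunch-crossing rate, 2 bits per sample.
channels = 100_000_000
crossings_per_s = 40_000_000
bits_per_sample = 2

bytes_per_s = channels * crossings_per_s * bits_per_sample / 8
print(bytes_per_s / 1e15)   # -> 1.0, i.e. 1 PB/s before any on-detector filtering
```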
https://www.linkedin.com/blog/engineering/open-source/the-ex...