It doesn't seem like they actually provided or did anything new here. (Hard to tell, because there's no obvious code either. Is it just cachefsd?)
About a year and a half ago I did something similar with essentially the exact same hardware.
A BeeGFS filesystem across three IBM POWER8 Minsky nodes with NVMe drives, for distributing data quickly, in parallel, and effectively across multiple machines. It really helped with scripting batch jobs, since the filesystem was unified across the platform and had good in-memory caching.
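To give a sense of why the unified mount helped with batch scripting, here's a minimal sketch (my own illustration, not anything from the article; the mount point, node names, and process_shard.py script are all hypothetical): because every node sees the same namespace, the driver only has to build the file list once and can hand each node a slice of it.

    #!/usr/bin/env python3
    # Sketch of fanning a batch job out over nodes that all share one
    # BeeGFS mount. Paths and hostnames are made-up placeholders.
    import subprocess
    from pathlib import Path

    # Hypothetical BeeGFS mount point, visible identically on every node.
    BEEGFS_ROOT = Path("/mnt/beegfs/dataset")

    # Hypothetical worker nodes (e.g. the three Minsky boxes).
    NODES = ["node01", "node02", "node03"]

    def main() -> None:
        # The namespace is unified, so the file list is built once;
        # each node then gets every Nth file as its shard.
        files = sorted(str(p) for p in BEEGFS_ROOT.glob("*.dat"))
        procs = []
        for i, node in enumerate(NODES):
            shard = files[i::len(NODES)]
            # Launch the per-node work over ssh; process_shard.py stands
            # in for whatever the actual per-file processing is.
            procs.append(subprocess.Popen(
                ["ssh", node, "python3",
                 "/mnt/beegfs/scripts/process_shard.py", *shard]))
        for p in procs:
            p.wait()

    if __name__ == "__main__":
        main()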
I have the numbers somewhere, but I felt like theirs aren't really that great considering 40 Gbps Mellanox & NVMe; you should really be able to get quite a bit better throughput. I also ran thousands of jobs over many TB of data, not just over 24 gigabytes.
(FWIW, I didn't read the actual article that thoroughly.)