Hi all! Yucheng (CEO of XetHub) here, happy to answer any technical questions anyone might have. Our current tech is a significant enhancement over the original Git Is For Data paper we published last year: https://www.cidrdb.org/cidr2023/papers/p43-low.pdf . Hope to write more about it soon! (Maybe with a follow-up paper or, at minimum, a blog post.)
HF’s Git LFS storage backend makes it easy to publish datasets and models, but difficult for anyone to collaborate on repos: LFS gets slower with every change because history bloats with a full copy of every changed file.
Now with over 12PB stored in LFS (1.3M models, 450k datasets, 680k spaces), HF is replacing LFS with XetHub’s content-defined store that uses Merkle trees to deduplicate at the block level, unlocking the ability to push small changes to huge files without having to transfer/save the whole file again.
The implementation scales to 100TB repos while preserving all Git history to enable fast development on HF Hub.
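To make the block-level deduplication idea concrete, here is a minimal sketch of content-defined chunking plus a hash-keyed block store. It is not XetHub's implementation: the gear-style rolling hash, the chunk-size parameters, and the names `chunk`, `store`, and `restore` are all assumptions chosen for illustration, and the Merkle tree over the chunk hashes is left out to keep it short.

```python
import hashlib
import random

# Hypothetical parameters, chosen only for illustration.
MASK = (1 << 11) - 1      # boundary when low 11 hash bits are zero (~2 KiB average)
MIN_CHUNK = 256           # never cut a chunk shorter than this
MAX_CHUNK = 8192          # always cut once a chunk reaches this length

# Fixed pseudo-random table for the gear-style rolling hash (seeded for determinism).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> list:
    """Split data at boundaries that depend on content, not on byte offsets."""
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Older bytes shift out of the low bits, so the boundary test
        # only depends on the most recent bytes.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def store(data: bytes, blocks: dict):
    """Save only unseen chunks in `blocks`; return (recipe, bytes newly stored)."""
    recipe = []
    new_bytes = 0
    for c in chunk(data):
        key = hashlib.sha256(c).hexdigest()
        if key not in blocks:
            blocks[key] = c
            new_bytes += len(c)
        recipe.append(key)
    return recipe, new_bytes


def restore(recipe: list, blocks: dict) -> bytes:
    """Rebuild a file version from its list of chunk hashes."""
    return b"".join(blocks[key] for key in recipe)
```

Storing a file, inserting a few bytes in the middle, and calling `store` again with the same `blocks` dict should add only the chunks around the edit, because the boundaries after the insertion point realign with the old ones. Fixed-size blocks would instead shift every boundary after the edit and re-store the rest of the file.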
HF is a cool website but I wish they offered a more obvious way to download. Let me download the model or dataset as a 7zip archive, or even better create torrents.
HF is the major distribution center for so-called open models, yet I need an account to access most stuff. GitHub just lets me grab a release or download a project as a zip folder.
It really grinds my gears cuz it's not a barrier to open-sourcing fully open models and datasets, more like a really annoying pothole.
HF has a great service but I really dislike how they serve models. If they created torrents for models, torrents that don't force you to log in to access them, I would seed as many models and datasets as possible.