Show HN: We scaled Git to support 1 TB repos (xethub.com)
279 points by reverius42 on Dec 13, 2022 | 144 comments
I’ve been in the MLOps space for ~10 years, and data is still the hardest unsolved problem. Code is versioned using Git, data is stored somewhere else, and context often lives in a third location like Slack or GDocs. This is why we built XetHub, a platform that enables teams to treat data like code, using Git.

Unlike Git LFS, we don’t just store the files. We use content-defined chunking and Merkle Trees to dedupe against everything in history. This allows small changes in large files to be stored compactly. Read more here: https://xethub.com/assets/docs/how-xet-deduplication-works
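
For intuition, here is a toy content-defined chunker in Python (a simplified sketch, not our production Rust implementation; the constants and names are made up for illustration). Because chunk boundaries come from the bytes themselves rather than fixed offsets, a small edit only disturbs the chunks around it, and every other chunk dedupes by hash against history:

  import hashlib

  # Illustrative constants: roughly 8-10 KiB average chunks, 2 KiB min, 64 KiB max.
  MASK = (1 << 13) - 1
  WINDOW = 48
  MIN_CHUNK, MAX_CHUNK = 2048, 65536

  def cdc_chunks(data: bytes) -> list[bytes]:
      """Cut a chunk wherever a hash of the trailing WINDOW bytes hits a bit
      pattern. Boundaries depend only on local content, so inserting bytes
      early in a file does not shift the chunk edges (or hashes) later on."""
      chunks, start = [], 0
      for i in range(len(data)):
          if i - start + 1 < MIN_CHUNK:
              continue
          # Re-hashing a window per byte is slow; FastCDC-style chunkers use a
          # rolling (gear) hash instead, but the boundary idea is the same.
          h = int.from_bytes(hashlib.blake2b(data[i - WINDOW + 1:i + 1],
                                             digest_size=8).digest(), "big")
          if (h & MASK) == 0 or i - start + 1 >= MAX_CHUNK:
              chunks.append(data[start:i + 1])
              start = i + 1
      if start < len(data):
          chunks.append(data[start:])
      return chunks

  def chunk_ids(data: bytes) -> list[str]:
      # Each chunk is stored once under its hash; repeated content dedupes away.
      return [hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)]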

Today, XetHub works for 1 TB repositories, and we plan to scale to 100 TB in the next year. Our implementation is in Rust (client, cache, and storage) and our web application is written in Go. XetHub includes a GitHub-like web interface that provides automatic CSV summaries and allows custom visualizations using Vega. Even at 1 TB, we know downloading an entire repository is painful, so we built git-xet mount, which, in seconds, provides a user-mode filesystem view over the repo.

XetHub is available today for Linux & Mac (Windows coming soon), and we would love your feedback!

Read more here:

- https://xetdata.com/blog/2022/10/15/why-xetdata

- https://xetdata.com/blog/2022/12/13/introducing-xethub




There are a couple of other contenders in this space. DVC (https://dvc.org/) seems most similar.

If you're interested in something you can self-host... I work on Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't have a Git-like interface, but also implements data versioning. Our approach de-duplicates between files (even very small files), and our storage algorithm doesn't create a number of objects proportional to directory nesting depth, as Xet appears to. (Xet is very much like Git in that respect.)

The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.

Xet's system for mounting a remote repo as a filesystem is a good idea. We do that too :)


We have found pointer files to be surprisingly efficient as long as you don't have to actually materialize those files. (Git's internals are actually very well done.) Our mount mechanism does avoid materializing pointer files, which makes it pretty fast even for repos with a very large number of files.


For bigger annex repos with lots of pointer files, I just disable the git-annex smudge filters. Consider whether smudge filters are a requirement or a convenience. The smudge filter interface does not scale well at all.


They're not a requirement! git-xet has a --no-smudge option if you prefer to deal with an unsmudged repo.


By the way, our mount mechanism has one very interesting novelty. It does not depend on a FUSE driver on Mac :-)


That's smart! I think users have to install a kext still?


Nope. No kernel driver needed :-) We wrote a localhost NFS server.


Based on unfsd or entirely in-house?


Entirely in house. In Rust!


Is this available somewhere open source? This sounds amazing!


Fancy! That’s awesome.


> The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.

...isn't that just parsing git diff --name-only A..B tho? "Process only files that changed since the last commit" is an extremely simple problem to solve.
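
i.e. something like this naive sketch (process_file stands in for whatever your per-file job is):

  import subprocess

  def changed_files(repo: str, old: str, new: str) -> list[str]:
      # Ask git which paths differ between the two commits.
      out = subprocess.run(
          ["git", "-C", repo, "diff", "--name-only", f"{old}..{new}"],
          check=True, capture_output=True, text=True,
      )
      return [p for p in out.stdout.splitlines() if p]

  def incremental_run(repo: str, old: str, new: str, process_file) -> None:
      # Reprocess only the files that changed since the last processed commit.
      for path in changed_files(repo, old, new):
          process_file(path)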


Yeah, that's the basics of it. Just make sure all the output is atomic, scale up the workers, handle inputs that are joins, retry when the workers get rescheduled, etc.


Is DVC useful/efficient at storing container images (Docker)? As far as I remember they are just compressed tar files. Does the compression defeat its chunking / differential compression?

How about cleaning up old versions?


Wouldn't any container registry be more suitable for that task than dvc ..?


Yes but I’m looking for general purpose storage, so that’s one litmus test :)


I also have a lot of issues with versioning data. But look at git annex - it's free, self-hosted, and has a very simple underlying data structure [1]. I don't even use the magic commands it has for remote data mounting/multi-device coordination; I just back up using basic S3 commands and can use rclone mounting. Very robust, open source, and useful.

[1] When you run `git annex add` it hashes the file and moves the original file into the `.git/annex/objects` folder under a hash-based, content-addressable layout, much like git's own object store. Then it replaces the original file with a symlink to this hashed file path. The file is marked read-only, so any command in any language which tries to write to it will error (you can always `git annex unlock` so you can write to it). If you have duplicated files, they simply point to the same hashed location. As long as you git push normally and back up `.git/annex/objects`, you're fully version controlled, and you can share the subset of files as needed.
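
Roughly the mechanics, as a Python sketch (illustrative only; git-annex's real key format, hash backend, and directory layout differ):

  import hashlib
  import os
  from pathlib import Path

  def annex_add(path: str, store: str = ".git/annex/objects") -> Path:
      """Move a file into a content-addressed store and leave a read-only
      symlink behind, mimicking the layout described above."""
      p = Path(path)
      digest = hashlib.sha256(p.read_bytes()).hexdigest()
      target = Path(store) / digest[:2] / digest[2:4] / digest
      target.parent.mkdir(parents=True, exist_ok=True)
      if target.exists():
          p.unlink()                 # duplicate content: reuse the stored copy
      else:
          p.replace(target)          # move the original into the store
          target.chmod(0o444)        # read-only, so stray writes fail loudly
      p.symlink_to(os.path.relpath(target, p.parent))
      return target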


Sounds like `git annex` is file-level deduplication, whereas this tool is block-level, but with some intelligent, context-specific way of defining how to split up the data (i.e. Content-Defined Chunking). For data management/versioning, that's usually a big difference.


XetHub Co-founder here. Yes, one illustrative example of the difference is:

Imagine you have a 500MB file (lastmonth.csv) where every day 1MB is changed.

With file-based deduplication every day 500MB will be uploaded, and all clones of the repo will need to download 500MB.

With block-based deduplication, only around the 1MB that changed is uploaded and downloaded.
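
Mechanically, the upload step is just a set difference over chunk hashes. A sketch (the `upload` call here is a hypothetical transport function, and the chunks come from any content-defined chunker):

  import hashlib

  def push_new_chunks(chunks: list[bytes], stored: set[str]) -> int:
      """Send only chunks the store hasn't seen; return bytes uploaded.
      For lastmonth.csv, nearly every chunk hash is already stored, so only
      roughly the edited ~1MB worth of chunks goes over the wire."""
      sent = 0
      for chunk in chunks:
          digest = hashlib.sha256(chunk).hexdigest()
          if digest not in stored:
              # upload(digest, chunk)   # hypothetical network call
              stored.add(digest)
              sent += len(chunk)
      return sent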


I combine git-annex with the bup special remote[1], which lets me still externalize big files, while benefiting from block level deduplication. Or depending on your needs, you can just use a tool like bup[2] or borg directly. Bup actually uses the git pack file format and git metadata.

I actually wrote a script which I'm happy to share, that makes this much easier, and even lets you mount your bup repo over .git/annex/objects for direct access.

[1]: https://git-annex.branchable.com/walkthrough/using_bup/

[2]: https://github.com/bup/bup


Have you tested this out with Unreal Engine blueprint files? If you can do block-based diffing on those and other binary assets used in game development, it'd be huge for the industry.

I have a couple ~1TB repositories I've had the misfortune of working with using perforce in the past.


Last time I used perforce in anger it did pretty decently with an ~800GB repo (checkout + history).

I keep expecting someone to come along and dethrone it, but as far as I can tell it hasn't been done yet. The combination of specific filetree views, drop-in proxies, and a UI-forward, checkout-based workflow that works well with unmergeable binary assets still leaves Git LFS and other solutions in the dust.

+1 on testing this against a moderate-size gamedev repo, which usually has some of the harder constraints: code and assets can be coupled, and the art portion of a sync can easily top a couple hundred GB.


1TB of checkout is the kind of repo I'm talking about; I have two such repos checked out on this box currently. I'm not sure I've ever checked out a repo of this scale locally with history. I'd love to have the local history.


Not yet. Would be happy to try - can you point me to a project to use?

Do you have a repo you could try us out with?

We have tried a couple of Unity projects (41% smaller due to deduplication) but not many Unreal projects yet.


Most of my examples of that size are AAA game source that I can't share. However, I think this is a project based on Unreal that uses similar files, and it should show whether there is any benefit: https://github.com/CesiumGS/cesium-unreal-samples, where the .umap binaries have been updated, and in this example the .uasset blueprints have been updated: https://github.com/renhaiyizhigou/Unreal-Blueprint-Project


Does that work equally well whether the changes are primarily row-based or primarily column-based?


HashBackup author here. Your question is (I think) about how well block-based dedup functions on a database, whether rows are changed or columns are changed. This answer describes how most block-based dedup software, including HashBackup, works.

Block-based dedup can be done either with fixed block sizes or variable block sizes. For a database with fixed page sizes, a fixed block size matching the page size is most efficient. For a database with variable page sizes, a variable block size will work better, assuming the dedup "chunking" algorithm is fine-grained enough to detect the database page size. For example, if the db used a 4-6K variable page size and the dedup algo used a 1M variable block size, it could not save a single modified db page but would save more like 200 db pages surrounding the modified page.

Your column vs row question depends on how the db stores data, whether key fields are changed, etc. The main dedup efficiency criteria are whether the changes are physically clustered together in the file or whether they are dispersed throughout the file, and how fine-grained the dedup block detection algorithm is.


Yes, see this for more details of how XetHub deduplication works: https://xethub.com/assets/docs/xet-specifics/how-xet-dedupli...


"Sounds like `git annex` is file-level deduplication, whereas this tool is block-level ..."

I am not a user of git annex but I do know that it works perfectly with an rsync.net account as a target:

https://git-annex.branchable.com/forum/making_good_use_of_my...

... which means that you could do a dumb mirror of your repo(s) - perhaps just using rsync - and then let the ZFS snapshots handle the versioning/rotation which would give you the benefits of block level diffs.

One additional benefit, beyond more efficient block level diffs, is that the ZFS snapshots are immutable/readonly as opposed to your 'git' or 'git annex' produced versions which could be destroyed by Mallory ...


> let the ZFS snapshots handle the versioning/rotation which would give you the benefits of block level diffs

Can you explain this a bit? I don't know anything about ZFS, but it sounds as though it creates snapshots based on block level differences? Maybe a git-annex backend could be written to take advantage of that -- I don't know.


ZFS does snapshots (very lightweight and quick) and separately it can do deduplication. It has a lot of nice features, I'd recommend looking into it if you find it interesting. It's quite practical these days (I think it comes with ubuntu even) and it's saved my butt a time or two.


No, that is not correct, git-annex uses a variety of special remotes[2], some of which support deduplication. Mentioned in another comment[1]

When you have checked something out and fetched it, it consumes space on disk, but that is true of git-lfs and most other tools like it. It does NOT consume any space in the git object files.

I regularly use a git-annex repo that contains about 60G of files, which I can use with github or any git host, and uses about 6G in its annex, and 1M in the actual git repo itself. I chain git-annex to an internal .bup repo, so I can keep track of the location, and benefit from dedup.

I honestly have not found anything that comes close to the versatility of git-annex.

[1]: https://news.ycombinator.com/item?id=33976418

[2]: https://git-annex.branchable.com/special_remotes/


If git annex stores large files uncompressed, you could use filesystem block-level deduplication in combination with it.


Can you be more specific here? Very interested.


There are filesystems that support inline or post-process deduplication. btrfs[1] and zfs[2] come to mind as free ones, but there are also commercial ones like WAFL etc.

It's always a tradeoff. Deduplication is a CPU-heavy process, and if it's done inline, it is also memory-heavy, so you're basically trading CPU and memory for storage space. It heavily depends on the use-case (and the particular FS / deduplication implementation) whether it's worth it or not

[1]: https://btrfs.wiki.kernel.org/index.php/Deduplication

[2]: https://docs.oracle.com/cd/E36784_01/html/E39134/fsdedup-1.h...


One problem is if you need to support Windows clients. Microsoft charges $1600 for deduplication support or something like that: https://learn.microsoft.com/en-us/windows-server/storage/dat...


Deduplication is included with every version and edition of windows server since 2012. You need to license windows server properly of course, but there is no add-on cost for deduplication.


there exists an open-source btrfs filesystem driver for Windows...


Yeah, which is great for storage but doesn't help over the wire.


ZFS at least supports sending a deduplicated stream.


Right, and btrfs can send a compressed stream as well, but we aren't sending raw filesystem data via VCS.


zbackup is a great block level deduplication trick.


If you like git annex, check out [datalad](http://handbook.datalad.org/en/latest/); it provides some useful wrappers around git annex oriented towards scientific computing.


Founder of DoltHub here. One of my team pointed me at this thread. Congrats on the launch. Great to see more folks tackling the data versioning problem.

Dolt hasn't come up here yet, probably because we're focused on OLTP use cases, not MLOps, but we do have some customers using Dolt as the backing store for their training data.

https://github.com/dolthub/dolt

Dolt also scales to the 1TB range and offers you full SQL query capabilities on your data and diffs.


CEO/Cofounder here. Thanks! Agreed, we think data versioning is an important problem, and we are at related but opposite ends of the space. (BTW we really wanted gitfordata.com. Or perhaps we can split the domain? OLTP goes here, unstructured data goes there :-) Shall we chat?)


You say you support up to 1TB repositories, but from your pricing page all I see is the free tier for up to 20GB and one for teams. The latter doesn't have a price, only a contact option, and I assume it will likely be too expensive for an individual.

As someone who'd love to put their data into a git-like system, this sounds pretty interesting. Aside from not offering a tier for someone like me, who would maybe have a couple of repositories of size O(250GB), it's unclear how e.g. bandwidth would work and whether other people could simply mount and clone the full repo for free if desired, etc.


XetHub Co-founder here. We are still trying to figure out pricing and would love to understand what sort of pricing tier would work for you.

In general, we are thinking about usage-based pricing (which would include bandwidth and storage) - what are your thoughts for that?

Also, where would you be mounting your repos from? We have local caching options that can greatly reduce the overall bandwidth needed to support data center workloads.


Thanks for the reply!

Generally usage-based pricing sounds fair. In the end, for cases like mine where it's "read rarely, but should be available publicly long term", it would need to compete with pricing offered by the big cloud providers.

I'm about to leave my academic career and I'm thinking about how to make sure all my detector data will be available to other researchers in my field in the future. Aside from the obvious candidate, https://zenodo.org, it's an annoying problem, as most universities I'm familiar with only archive data internally, which is hard for researchers from other institutions to access. As I don't want to rely on a single place to have that data available, I'm looking for an additional alternative (that I'm willing to pay for out of my own pocket; it just shouldn't be a financial burden).

In particular, while still taking data a couple of years ago, I would have loved being able to commit each day's data taking in the same way as I commit code. Having things timestamped and backed up, with any notes that came up that day recorded straight in the commit message, would have been very nice.

Regarding mounting I don't have any specific needs there anymore. Just thinking about how other researchers would be able to clone the repo to access the data.


My preferences on pricing.

First, it's all open-source, so I can take it and run it. Second, you provide a hosted service, and by virtue of being the author, you're the default SaaS host. You charge a premium over the AWS fees I'd pay for self-hosting, which works out to:

1. Enough to sustain you.

2. Less than the cost of doing dev-ops myself (AWS fees + engineer).

3. A small premium over potential cut-rate competitors.

You offer value-added premium services too. Whether that's economically viable, I don't know.


As a footnote, the rationale:

1. I'm unlikely to adopt something proprietary for this sort of use. Lock-in is bad, but it's especially bad when the one holding my key data is a startup which can disappear or pivot tomorrow. Open-source means if you disappear, I'm alive. I don't trust you, and open-source mostly means I don't need to.

2. With open-source, pricing which is more than the cost of AWS + engineer makes no sense; I'd rather host myself. However, the labor cost means that AWS + engineer gives a lot of potential profit margin for you. I'd much rather not run servers myself.

3. A cut-rate competitor will have similar per-customer cost structure as you, but you'll have somewhat higher fixed costs. For me, paying a little bit more for the reliability of going with the most competent vendor is an obvious choice (which you would be, by virtue of having written it). I wouldn't consider a cut-rate vendor unless the savings was very significant.

4. Not in my current job, but in past jobs, I'd gladly pay for service and support on top of that. A lot of things are cheaper for you to do (or explain) as the author / expert, than for my guys to figure out themselves.

Making this work requires a certain economy of scale. That requires deep VC pockets to get to profitability, or a good beachhead (e.g. a single big customer). There are many single big customers, but I have no idea how you'd build that connection. Most are in oddball industries, and not companies you'd think of.

For example, a few years back, I interacted with a major military contractor who specializes in manufacturing, and has no competence in technology. They did many billions in business, and had just paid a few million for a semi-incompetent tech consulting firm as an acqui-hire to try to build basic tech expertise. Outsourcing everything to you would have been a far better decision for them (and for many companies like them) if they could find you and vet you, and vice-versa (probably with a strategic investment as well).

They were very good at what they did, but what they did was very much not tech or software.


What does a Merkle Tree bring here? (honest question) I mean: for content-based addressing of chunks (and hence deduplication of these chunks), a regular tree works too if I'm not mistaken (I may be wrong but I literally wrote a "deduper" splitting files into chunks and using content-based addressing to dedupe the chunks: but I just used a dumb tree).

Is the Merkle tree used because it brings something other than deduplication, like chunk integrity verification or something like that?


One monorepo to rule them all, and in the darkness pull them. - Gandalf, probably


And in the darkness merge conflicts.


If I had to "version control" a 1 TB large repo - and assuming I wouldn't quit in anger - I would use a tool which is built for this kind of need and has been used in the industry for decades: Perforce.


I work in gamedev and think perforce is good but far from great. Would love to see someone bring some competition to the space; maybe XetHub can.


So, you wouldn't consider using a new tool that someone developed to solve the same problem despite an older solution already existing? Your advice to that someone is to just use the old solution?


When the new solution involves voluntary use of git? Not just yea, but hell yes. I hate git.


Why do you hate git? I've been pretty happy with it for code, and wouldn't mind being able to use it for data repositories as well.


Is it really worth re-hashing at this point? Reams have been written about the UX


It's used by the vast majority of software engineers, so apparently it's "good enough".


Don’t ascribe positive feelings to popularity. I’m only using git until the moment there’s a viable alternative written by someone who knows what DX is.


This was my thought as well. Perforce has its own issues, but is an industry standard in game dev for a reason: it can handle immense amounts of data.


What does immense mean in the context of game dev?


On "real" (AA/AAA) games? Easily hundreds of gigabytes or several terabytes of raw assets + project files.

Sometimes even individual art project files can be many gigabytes each. I saw a .psd that was 30gb because of the embedded hi-res reference images.

You can throw pretty much anything in there, in one place, with things like locking, partial checkout, etc., which gets artists to actually use it.


Perforce also has support for proxies right? It’s not just the TB of data, it’s all of your coworkers in a branch office having to pull all the updates first thing in the morning. If each person has to pull from origin, that’s a lot of bandwidth, and wasted mornings. If the first person in pays and everyone else gets it off the LAN, then you have a better situation.



I have a 1.96 TB git repo: https://github.com/unqueued/repo.macintoshgarden.org-fileset (it is a mirror of a Macintosh abandonware site)

  git annex info .
Of course, it uses pointer files for the binary blobs that are not going to change much anyway.

And the datalad project has neuro imaging repos that are tens of TB in size.

Consider whether you actually need to track differences in all of your files. Honestly git-annex is one of the most powerful tools I have ever used. You can use git for tracking changes in text, but use a different system for tracking binaries.

I love how satisfying it is to be able to store the index for hundreds of gigs of files on a floppy disk if I wanted.


There seem to be a lot of data version control systems built around ML pipelines or software development needs, but not so many built for the sort of data editing that happens outside of software development and analysis.

Kart (https://kartproject.org) is built on git to provide data version control for geospatial vector and tabular data. Per-row (feature and attribute) version control and the ability to collaborate with a team of people are sorely missing from those workflows. It's focused on geographic use-cases, but you can work with 'plain old tables' too, with MySQL, PostgreSQL and MSSQL working copies (you don't have to pick - you can push and pull between them).


Maybe a silly question:

Why do you need 1TB for repos? What do you store inside, besides code and some images?


Repositories for games are often larger than 1TB, and with things like UE5's Nanite becoming more viable, they're only going to get bigger.


A whole lot of images?

I personally would love to be able to store datasets next to code for regression testing, easier deployment, easier dev workstation spin up, etc.


Still, 1TB?

Once you get to that number of images, it would be much easier to manage them with some file storage solution.

Or am I missing something important?


All of them require having some sort of parallel authentication, synchronization, permissions management, change tracking, etc.

Which is a huge hassle, and a lot of work I’d rather not do.

My current photogrammetry dataset is well over 1TB, and it isn’t a lot for the industry by any stretch of the imagination.

In fact, the only thing that considers it ‘a lot’ and is hard to work with is git.


Some docker images? ;)


https://github.com/facebook/sapling is doing good work, and they have suggested that their git server for large repositories exists.


I actually encountered a 4TB git repo. After pulling all the binary shit out of it the repo was actually 200MB. Anything that promotes treating git like a filesystem is a bad idea in my opinion.


Yes... and no. The git userspace is horrible for this. The git data model is wonderful.

The git userspace would need to be able to easily:

1. Not grab all files

2. Not grab the whole version history

... and that's more-or-less it. At that point, it'd do great with large files.


Exactly. For the giant repo use case, we have a mount feature that lets you get a filesystem mount of any repo at any commit very quickly.


Just in case you are wondering about alternatives: there is Unity's Plastic (https://unity.com/products/plastic-scm), which happens to support bidirectional sync with git. I'm curious how this solution compares to it! I'll definitely give it a try over the weekend!


I was already upset about Codice Software pulling Semantic Merge and only making it available as an integrated part of Plastic SCM. Now that I see the reason such a useful tool was taken away was to stuff the pockets of a large company, I'm fuming.

I know that they're well within their rights to do this as they only ever offered subscription licensing for Semantic Merge, but that doesn't make it suck less to lose access.


What about SVN?

Among other features, Subversion supports representation sharing, so adding new textual or binary files with identical data won't increase the size of your repository.

I’m not familiar with ML data sets, but it seems that SVN may work great with them. It already works great for huge and small game dev projects.


Can I upload a full .pbix file to this and use it for versioning? If so, I'd use it in a heartbeat.


CEO/Cofounder here. We are file format agnostic and will happily take everything. Not too familiar with the needs around pbix, but please do try it out and let us know what you think!


The link takes me to a login page. It would be nice to see that fixed to somehow match the title.


Visit https://xetdata.com for more info! (Sorry, can't edit the post link now.)


Can it be used to store container images (Docker)? As far as I remember they are just compressed tar files. Does the compression defeat Xet's own chunking?

Can you sync to another machine without Xethub ?

How about cleaning up old files?


Yeah... The compression does defeat the chunking (your mileage may vary; we do see a small amount of dedupe in some experiments but never investigated it in detail). That said, we have experimental file-type-specific preprocessors/chunkers, so we could potentially do something about tar.gz. Not something we have explored much yet.


Does this fix the problem that Git becomes unreasonably slow when you have large binary files in the repo?

Also, why can't Git show me an accurate progress-bar while fetching?


Mostly! (At the moment, it doesn't fully fix the slowdown associated with storing large binary files, but it reduces it by 90-99%. We're working on getting closer to 100% by moving even the Merkle tree storage outside the git repo contents.)

As for why git can't show you an accurate progress bar while fetching (specifically when using an extension like git-lfs or git-xet), this has to do with the way git extensions work -- each file gets "cleaned" by the extension through a Unix pipe, and the protocol for that is too simple to reflect progress information back to the user. In git-xet, we do write a percent-complete to stdout so you get some more info (but a real progress bar would be nice).
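
To illustrate the shape of that pipe, here is a sketch of a smudge filter (not git-xet's actual code; fetch_from_store is a stand-in for the real pointer lookup):

  #!/usr/bin/env python3
  # git runs a smudge filter roughly as:
  #   cat <pointer file> | <smudge command> > <working-tree file>
  # stdin and stdout carry the file bytes themselves, so there is no
  # structured side channel for git to aggregate into one accurate bar.
  import sys

  def fetch_from_store(pointer: bytes) -> bytes:
      raise NotImplementedError("stand-in for resolving a pointer to content")

  def smudge() -> None:
      pointer = sys.stdin.buffer.read()       # tiny pointer file from git
      content = fetch_from_store(pointer)     # download the real bytes
      sys.stdout.buffer.write(content)        # hand them back to git
      print(f"fetched {len(content)} bytes", file=sys.stderr)  # best-effort progress

  if __name__ == "__main__":
      smudge()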


Thank you for posting this. Is there any way to access a Xet dataset via a URL (assuming the dataset owner has opted to share in that manner) so that, for example, one could visit a web page that contains some embedded code (JS, WASM etc) which pulls the Xet data into the page for processing?


How is data split into chunks? Just curious.


They mention 'content-defined chunking', but as far as I understand it, that requires different chunking algorithms for different content types. Does it support plugins for chunking different file formats?


Today we just have a variation of FastCDC in production, but we have alternate experimental chunkers for some file formats (ex: a heuristic chunker for CSV files that will enable almost free subsampling). Hope to have them enter production in the next 6 months.


That's interesting. Can a CSV chunker make adding a column not affect all of the chunks?


The simplest really is to chunk row-wise so adding columns will unfortunately rewrite all the chunks. If you have a parquet file, adding columns will be cheap.
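
As a rough illustration of row-wise chunking (a hypothetical sketch, not our production CSV chunker): cut only at newline boundaries near a target size, so chunks always hold whole rows. Appending rows then only touches the trailing chunk, while adding a column rewrites every row and therefore every chunk.

  import hashlib

  def row_aligned_chunks(csv_bytes: bytes, target: int = 16 << 10) -> list[bytes]:
      # Cut at the first newline after each ~16 KiB so every chunk holds whole rows.
      chunks, start = [], 0
      while start < len(csv_bytes):
          cut = csv_bytes.find(b"\n", start + target)
          if cut == -1:
              cut = len(csv_bytes) - 1
          chunks.append(csv_bytes[start:cut + 1])
          start = cut + 1
      return chunks

  def chunk_ids(csv_bytes: bytes) -> list[str]:
      return [hashlib.sha256(c).hexdigest() for c in row_aligned_chunks(csv_bytes)]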


CEO/Cofounder here! Content defined chunking. Specifically a variation of FastCDC. We have a paper coming out soon with a lot more technical details.


... why do you have 1TB of source code? (You don't! Mandatory hacker snark.) Is git really supposed to be used for data? Or is this just a git-like interface for source control on data?


Git is only not "supposed" to be used for data because it doesn't work very well with data by default. Not because that's not a useful and sensible thing to want from a VCS.


It's a fundamentally bad idea because of how any DVCS works. You really don't want to be dragging around gigabytes of obsolete data forever.

Something like git-lfs is the appropriate solution. You need a little bit of centralization.


Because of how Git's current implementation of a DVCS works. There's nothing fundamental about it. Git already supports partial clones and on-demand checkouts in some ways, it's just not very ergonomic.

All that's really needed is a way to mark individual files as lazily fetched from a remote only when needed. LFS is a hacky substandard way to emulate that behaviour. It should be built in to Git.


Game development, especially Unreal engine, can produce repos in excess of 1TB. Git LFS is used extensively for binary file support.


Signed up and browsed the auto-generated "Flickr30k" repo, and it was really slow for me. Like CSV, does it also support other data formats like JSON, YAML, etc.?


We are file format agnostic and you should be able to put anything in the repo. We have special support for CSV files for visualizations. Sorry for the UI perf... there are a lot of optimizations we need to work on.


The tl;dr is that "xet" is like GitLFS (it stores pointers in Git, with the data in a remote server and uses smudge filters to make this transparent) with some additional features:

- Automatically includes all files >256KB in size

- By default data is de-duplicated in 16KB chunks instead of whole files (with the ability to customize this per file type).

- Has a "mount" command to allow read-only browse without downloading

When launching on HN it would be better if the team was a bit more transparent with the internals. I get that "we made a better GitLFS" doesn't market as well. But you can couple that with a credible vision and story about how you are better and where you are headed next. Instead this is mostly closer to marketing speak of "trust our magic solution to solve your problem".


These details seemed.... really clear to me from the post the OP made? Did you just not read it, or have they updated it since you commented?

(excerpt from the OP post:

> Unlike Git LFS, we don’t just store the files. We use content-defined chunking and Merkle Trees to dedupe against everything in history. This allows small changes in large files to be stored compactly. Read more here: https://xethub.com/assets/docs/how-xet-deduplication-works)


I see a lot of reasons to version code.

I see far fewer reasons to version data; in fact, I find reasons against versioning data and storing it as diffs.


Something that I've experienced from many years in enterprise software: 90% of enterprise software is about versioning data in some way. SharePoint is half as complicated as it is because it has to be a massive document and data version manager. (Same with Confluence and other competitors.) "Everyone" needs deep audit logs for some likely overlap of SOX compliance, PCI compliance, HIPAA compliance, and/or other industry-specific standards and practices. Most business analysts want accurate "point in time" reporting tools to revisit data as it looked at almost any point in the past, and if you don't build it for them they often build it as ad hoc file cabinets full of Excel export dumps for themselves.

The wheels of data versioning just get reinvented over and over and over again, with all sorts of slightly different tools. Most of the job of "boring CRUD app development" is data version management and some of the "joy" is how every database you ever encounter is often its own little snowflake with respect to how it versions its data.

There have been times I've pined for being able to just store it all in git and reduce things to a single paradigm. That said, I'd never actually want to teach business analysts or accountants how to use git (and would probably spend nearly as much time building custom CRUD apps against git as against any other sort of database). There are times though where I have thought for backend work "if I could just check out the database at the right git tag instead of needing to write this five-table-join SQL statement with these eighteen differently named timestamp fields that need to be sorted in four different ways…".

Reasons to version data are plenty and most of the data versioning in the world is ad hoc and/or operationally incompatible/inconsistent across systems. (Ever had to ETL SharePoint lists and its CVC-based versioning with a timestamp based data table? Such "fun".) I don't think git is necessarily the savior here, though there remains some appeal in "I can use the same systems I use for code" two birds with one stone. Relatedly, content-addressed storage and/or merkle trees are a growing tool for Enterprise and do look a lot like a git repository and sometimes you also have the feeling like if you are already using git why build your own merkle tree store when git gives you a swiss army knife tool kit on top of that merkle tree store.


You're suffering from a failure of imagination, maybe because you've never been able to version data usefully before. There are already lots of interesting applications, and it's still quite new.

https://www.dolthub.com/blog/2022-07-11-dolt-case-studies/


One use-case would be for including dependencies in a repo. For example, it is common for companies to operate their own artifact caches/mirrors to protect their access to artifacts from npm, pypi, dockerhub, maven central, pkg.go.dev, etc. With the ability to efficiently work with a big repo, it would be possible to store the artifacts in git, saving the trouble of having to operate artifact mirrors. Additionally, it guarantees that the artifacts for a given, known-buildable revision are available offline.


Cofounder/CEO here! I think it's less about "versioning" and more about the ability to modify with confidence, knowing that you can go back in time anytime. (Minor clarification: we are not quite storing diffs; we hold snapshots just like Git, plus a bunch of data dedupe.)


> the ability to modify with confidence knowing that you can go back in time anytime

This is versioning


Versioning is a technique. Backups and copy+paste+rename also do it.


Anything that might be audited. Being able to look at how things were and how they changed, to find out how they got to where they currently are and who did what, is amazing for many applications: finance, healthcare, elections, etc.

Well unless fraud is the goal.


Shameless plug for https://snapdir.org which focuses on this particular use case using regular git and auditable plain text manifests


Versioning data is great, but storing as diffs is inefficient when 99% of the file changes each version.


We don't store as diffs, we store as snapshots -- and it's efficient thanks to the way we do dedupe. See https://xethub.com/assets/docs/how-xet-deduplication-works/


As always it depends on the application. It can definitely be useful in some applications.


What are the reasons against?


The lack of reasons for doing it IS the reason against. Git isn't a magic 'good way' to store arbitrary data; it's a good way to collaborate on projects implemented using most programming languages, which store code as plain text broken into short lines, where edits to non-sequential lines can generally be applied concurrently without careful human verification. That is an extremely specific use case, and anything outside of it leaves git terrible and inefficient, giving almost no benefit despite huge problems.

People in ML ops use git because they aren't very sophisticated with programming professionally and they have git available to them and they haven't run into the consequences of using it to store large binary blobs, namely that it becomes impossible to live with eventually and wastes a huge amount of time and space.

ML didn't invent the need for large artifacts that can't be versioned in source control but must be versioned with it, but they don't know that because they are new to professional programming and aren't familiar with how it's done.


I literally don't know anyone or any team in ML using git as a data versioning tool. It doesn't even make sense to me, and most MLOps people I have talked to would agree. Is that really the point of this tool? To be a general-purpose data store for MLOps? I thought it was for very specialized ML use cases. Even 1TB isn't much for ML data versioning.

MLOps people are very aware of tools that are more suited for the job... even too aware, in fact. The entire field is full of tools, databases, etc., to the point where it's hard to make sense of it all. So your comment is a bit weird to me.


I build MLOps solutions for a big tech company. Agreed, most mature ML teams are not using git for ML data versioning, but in my experience and user research it's not due to lack of intent. Teams have been forced to move to other ML data tools in the absence of a scalable git solution, most of which come with a lot of cognitive overhead for ML engineers who don't want to spend time adopting several custom tools in their ML pipelines.


I think you'll find varying levels of maturity in ML ops. Anyway I think we basically agree, if you use something like this you aren't that mature, and if you are mature you would avoid this thing.


Indeed, there is a lot of pain if you actually try to store large binary data in git. But we managed to make that work! So a question worth asking is: how might things change IF you can store large binary data in git?


That is exactly what git-lfs is: a way to "version control" binary files by storing revisions, possibly separately, while the actual repo contains text files plus "pointer" files that reference the binary files.

It's not perfect, and still feels like a bit of a hack compared to something like p4 for the context I use LFS in (game dev), but it works, and doesn't require expensive custom licenses when teams grow beyond an arbitrary number like 3 or 5.


XetHub Co-founder here. Yes, we use the same Git extension mechanism as Git LFS (clean/smudge filters) and we store pointer files in the git repository. Unlike Git LFS we do block-level deduplication (Git LFS does file-level deduplication) and this can result in a significant savings in storage and bandwidth.

As an example, a Unity game repo was reduced in size by 41% using our block-level deduplication vs Git LFS: the raw repo was 48.9GB, Git LFS was 48.2GB, and with XetHub it was 28.7GB.

Why do you think using a Git-based solution is a hack compared to p4? What part of the p4 workflow feels more natural to you?


The centralised model of Perforce is more of a natural fit, for one thing, since by default it allows you to clone subsets, and just the latest version of files. File locking is much more integrated into the p4 workflow as well; in git you can still modify files locally, then commit them. The check happens on push, and sometimes git fails to send the lock notification upstream. Oh, and it breaks down entirely if you use branches.

Some of these have workarounds and hacks for more experienced users. I'm not about to run around teaching people the intricacies of arcane git incantations while p4 functions, by default, how you'd want it to. The programming side is better on git though, yeah.


(XetHub engineer here)

We're working on perforce-style locking on XetHub, and I believe git already supports things like only cloning the latest version of files. Cloning the full repo without "smudging" (pulling in binary file contents) is already possible, and cloning while smudging a subset is on our roadmap. We're definitely on a path to making git UX for dealing with large binary files as easy as perforce, and there are lots of advantages to keeping a git-based workflow for teams that already work with git.


I think this is a foot-gun: it's a bad idea even if it works great, and I doubt it works very well. You should manage your build artifacts explicitly, not just jam them into git along with the code that generates them because you are already using it and haven't thought it through.


I don't think you've made your case here. The practices you describe are partly an artifact of computation, bandwidth, and storage costs; not the current ones, but those from when git was invented more than 15 years ago. In the short term, we have to conform to the computer's needs. But in the long term, it has to be the other way around.


You're right! It makes way more sense, in the long run, to abuse a tool like git in a way it isn't designed for and can't actually support, and then, instead of actually using git, use a proprietary service that may or may not be around in a week. Here I was thinking short term.


You seem nice.


Thank you, I've worked very hard to become so. You seem nice, too!


Xet's initial focus appears to be on data files used to drive machine learning pipelines, not on any resulting binaries.


This feels like something that is prime for abuse. I agree with @bastardoperator: treating git as file storage is going to go nowhere good.


git-xet follows the patterns established by other git extensions like git-annex and git-lfs, but with some UX improvements. In all of these cases, the large/binary file contents are actually stored outside of git, but available with git commands. The goal here is to make working with and versioning code and data together seamless, not to blindly use the git internals to store data. Does that still seem like a bad idea to you?


How does this differ from using Git LFS?


We are significantly faster? :-) Also, block-level dedupe, scalability, perf, visualization, mounting, etc.


Write a blog post about these statements. Otherwise it's just your opinion.


Even better :-) We have a paper. It's on the way.


Please consider https://sso.tax/ before making that an "enterprise" feature.


I mean yeah, that's working as intended surely? Some of those price differences are pretty egregious but in general companies have to actually make money, and charging more for features that are mainly needed by richer customers is a very obvious thing to do.


I believe the counter-argument is that they should charge for features but that security should be available to anyone. Imagine if "passwords longer than 6 chars: now only $8/mo!"

That goes double for products where paying for "enterprise" is only to get SAML, which at least in my experience causes me to go shopping for an entirely different product because I view it as extortion


Security is available for everyone. It's centralised security that can be easily managed by IT that isn't.

I don't see an issue with charging more for SSO, though as I said, some of the prices are egregious.


Very sad to see Bitwarden in this list



