I'll say what I said in February: I started a company with the same premise 9 years ago, during the prime "big data" hype cycle. We burned through a lot of investor money only to realize that there was not a market opportunity to capture. That is, many people thought it was cool - we even did co-sponsored data contests with The Economist - but at the end of the day, we couldn't find anyone with an urgent problem that they were willing to pay to solve.
I wish these folks luck! Perhaps things have changed; we were part of a flock of 5 or 10 similar projects and I'm pretty sure the only one still around today is Kaggle.
We also started a "Git for data" several years ago but have since pivoted to data science/ML tooling (https://dotscience.com/) by building the features people actually want on top of the original product. Since then, the "git for data" part probably accounts for only about 5% of the total functionality :)
I guess "Git for data" is not very useful if you don't have the whole platform built around it to actually use the features. We mainly use it for data synchronization between the nodes and provenance tracking so people can see what data was used to build specific models and to track how the project evolves itself without forcing people to "commit" their changes manually (as we have seen that often data scientists don't even use git, just files on their Jupyter notebooks).
Agreed, it's always refreshing to read candid posts like this. A useful term for Google is "startup post mortem" although you still have to do plenty of sifting.
Git succeeded because it was free, and business models were then built up around the open-source ecosystem after a market evolved naturally. There is a need, but if you go into it trying to build a business from scratch, you're going to have a bad time.
Mercurial would not have won. Mercurial has since added features (not the recommended workflow, according to its docs) to support a branching model similar to git's, but the default "as designed" workflow of hg is arguably inferior to git's (yes, I know that word will get downvoted).
Without git, git's style of branching would likely never have been added to hg, and even though it's been added now, AFAICT hg people don't use it. No idea why. Git people get how much freedom git branches give them, freedom that other VCSs, including hg, don't/didn't.
Nice to read this. I was trying to collaborate on a single project that used Mercurial, and man, as a git user I could not understand the branching model... I had the hardest time. I ended up working from a local git repo, doing my work there, and then very carefully pushing the commits one at a time at the very end. If I made a mistake, I basically re-cloned the Hg repo, because apparently editing history is a no-no. I found the experience very frustrating.
Sibling comment to mine:
> Git branching is not intuitive, because they are not branches but pointers/labels.
Funny, that's exactly why I DO find git branches more intuitive.
Git branching is not intuitive, because they are not branches but pointers/labels. When you talk about the master branch, you actually talk about the master pointer.
The other VCSes have an intuitive concept of branches, because they are in fact branches.
I liked Mercurial more than Git, but when Bitbucket dropped Mercurial I also switched to Git.
I must be an outlier, because it's always been the opposite for me.
I started on Mercurial and didn't use Git for years. The moment I switched to Git everything made so much more sense to me. Mercurial seemed like it did magic and wouldn't explain it to you. There were multiple kinds of branches, there were revision numbers, octopus merges were impossible to understand, the whole thing tried to act immutable but effective workflows included history editing for squashing and merging and amending and cherry-picking, which is anything but. Partial commits were a little bit of a mystery to me, and shelves seemed to be their own separate thing.
To me Git was simple in comparison. The working copy was the last state at the end of a long sequence of states. Patches were just the way you represented going from one state to another, rather than being canonical, so you wouldn't resolve an octopus merge so much as you would get to your desired state and call it a day. Branches were labels on a particular state. Stashes were labels with an optimized workflow. Reflog was just a list of temporary-ish labels. New commits were built against the index, which you could add to or remove from independently of file state. Branches were branches were branches, no matter where the repository was. Disconnecting from upstream was simply a matter of removing a remote.
I know it doesn't match up with other people, but I simply have never been able to see Mercurial as an example of a good tool /despite starting on it/. It's always been easier to use git at any level of complexity I need it depending on the problem I'm solving, whether it's saving code or rescuing a totally botched interactive rebase, merge, etc.
Git branches as labels into a DAG of edits maps exactly to what I think branches are. The difference between two branches is their respective edits from a common base. If you muck up a commit, you reset the pointer to the previous commit. If you muck that up, and accidentally reset too much, you can use your reflog to find out where you used to be on the DAG and reset the branch to that.
The transparency of the mechanism enables the user to be more powerful while knowing fewer concepts in total. The power of the system comes from the composition of simple parts.
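For anyone who hasn't internalized that yet, here's a minimal sketch of the recovery workflow described above (the commit hash is made up; yours comes from your own reflog):

    # threw a bad commit on the branch? move the branch pointer back one
    git reset --hard HEAD~1

    # reset too far by accident? the reflog lists where HEAD recently pointed
    git reflog

    # move the branch pointer back to the state you actually wanted
    git reset --hard a1b2c3d

The branch itself is never "damaged"; you're only moving a label around the DAG, and the reflog remembers where the label used to be.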
AFAIK (from the rumour mill and not from any kind of reliable source) the `git branch` command was only added as a cargo cult from all the SVN users flocking to git and asking "So how do I branch?!". Prior to this, everything was tags and checkouts.
Again, no verifiable source, just water cooler talk with other devs.
Git exists because BitKeeper was being A-holes. [1] A developer needed some metrics on the BitKeeper repository that Linux used. Remember, this was a proprietary and commercial product that granted a handful of licenses to the Linux community as a token of support. So when Andrew Tridgell reverse engineered the format that BitKeeper used, they threatened to sue him under the DMCA.
This caused a firestorm; some defended him, others defended BitKeeper, and a lot of people said why the hell is Linus using proprietary software to manage an open source project?!?!! Linus waded in and said he'd think about it, I think it was on a Thursday or Friday, and by the next week he had a working prototype of git. [2] The rest is history. BitKeeper faded into irrelevance and git became the lingua franca for open source projects. Arguably its biggest strength was not revision control, but being designed in a manner that many collaborators could seamlessly submit changes for merging. Obviously architected to fulfill the time-consuming requirements of Linus Torvalds, it has stood the test of time. I'm writing this from memory, so if it disagrees with Wikipedia take it with a grain of salt.
99.99999% of projects are not the Linux kernel, so how could Git have succeeded because of Linus, other than Linus originating the genius design of it? The Ruby community jumped onto Git even though there was no Github, and Ruby itself didn't use Git. In my opinion it was because Git was the first tool that was superior to SVN in every way.
The first time I used Git I swore I would never use SVN again. It was even popular back then to set up git+svn systems so you could work on your git repo, and push a branch to svn to satisfy your employer.
People associate git with Github (and Gitlab), but it used to be very common to just set up a ssh server that people could push projects on to, my server still has a dozen or so projects on it that I haven't touched in a decade. Github spawned from the popularity of Git in the Ruby community, and the desire to make it a little more accessible to people that didn't want to have their own git servers.
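For anyone who hasn't tried it, that setup is only a couple of commands; the host and paths here are just placeholders:

    # on the server: create an empty bare repository
    git init --bare /srv/git/myproject.git

    # on your machine: point an existing repo at it and push
    git remote add origin user@example.com:/srv/git/myproject.git
    git push origin master

Anything you can reach over ssh works as a remote; GitHub's value-add is the UI and the social layer on top.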
> so how could Git have succeeded because of Linus, other than Linus originating the genius design of it?
Perhaps that is exactly the point. There was a fair amount of hype and press coverage over Git when it was first unveiled. And it was because Linus wrote it, and wrote it in an unexpectedly short time. And it was on the coattails of the whole Bitkeeper saga.
I'd argue it succeeded (at least more recently) because of the UX that companies like GitHub and GitLab gave it, not particularly because of Linus or because it was free.
Just because Steph Curry uses Under Armour doesn't mean everyone will too.
I think something created by Linus was obviously deemed to be of great value, but what made Git as dominant as it is today is the UX it gained for the majority of people who are beginners, which was GitHub, backed by millions of other developers.
Darcs 2 (introduced in 2008-04) reduces the number of scenarios that will trigger an exponential merge. Repositories created with Darcs 2 should have fewer exponential merges in practice.
Was git really better than, for example, Mercurial or Darcs for most 'normal' projects when it was released? I certainly don't remember it as such. It was certainly better for Linus and the specific workflow problems he had with kernel development, but I don't recall it being better overall at the time.
What its release (and the very public BitKeeper spat leading up to its release) did do was bring the idea of distributed VCS to the forefront.
It's ~5 years old and I really wanted it to be huge. Hoping this new project is a success. Especially since I notice I went to high school with one of the founders of Dolt (Hey Tim!)
That project looks like a command-line p2p file sharing system. There doesn't appear to be any branching. It also doesn't appear to be a database (like with a schema), but simply raw files being passed around. There's no data types or queries.
I'm not sure why you bring it up now. They don't call it "git for data" anywhere that I see, and it's missing 2 of the 3 core features that I think a "git for data" would need to have.
Like I said, this project is old. I brought it up in the context of older projects, independent of whether they succeeded, pivoted, etc. If you did some research you would have made this connection:
My point is they pivoted and so maybe this idea won't work, or this was too early.
EDIT: Looking back on _your_ post, I mentioned it because you specifically said "It's a program, and it appears to be an open-source one you can download and use today." And that is what 'dat' is/was. I thought I would mention it.
The need is definitely there. My day job involves such need. But we simply cannot trust a drive-by startup to fill the gap. It's safer just to roll your own.
I'll check it out. We think the world is a little more ready for this now, given how widely Git is adopted and the advances in other data tooling (like ML). But, as we're all aware, starting a business is hard :-)
I wonder how much of the market opportunity here is contingent upon market education. Everybody can clearly see the value of having a personal automobile, but how successful can you be at selling automobiles to people who don't know how to drive? Do people desire cars enough to buy one if they don't know how to drive?
Everyone can see how FAANG companies are growing wealthy off the mountains of data they are amassing, so everyone understands how data can be desirable. But what if your potential market base doesn't understand how to "drive" data - how to identify which data would be valuable for them and how best to exploit it? It seems to me that part of a go-to-market strategy needs, at least in the short term, to help potential customers transition from "that's a really shiny bauble" to "I understand how this is going to make me money."
Maybe, but also maybe there just isn't a huge demographic of data scientists with discretionary purchasing capacity and a hair-on-fire problem that they are desperately searching for someone to take their money and fix.
I think that a lot of the data VIP types we met with honestly wanted to know why they needed it, but the more they thought about it, the more it just seemed like a shiny thing.
It's telling that dozens of similar companies with smart people behind them have thrown their talents at this solution, and none of them have located the problem people are eager to pay to solve.
You may be right in the big picture, but not all "GitHubs for data" are the same. This product seems super cool:
> Dolt is Git for data. Instead of versioning files, Dolt versions tables. DoltHub is a place on the internet to share Dolt repositories. As far as we can tell, Dolt is the only database with branches.
I find it hard to believe no other database has branches, but if that's true and if this product works like you'd imagine, that is really cool.
Given your historical observation, I think you're right that this will not lead to a market revolution, but sometimes you need the right product to change the landscape.
In our group we use git for code repos and cloud for storage and actual compute. It works seamlessly and git APIs work fantastically with almost any service, IDE or whatever your tool of choice.
I suspect with increasing cloud adoption, accessing data is getting easier by the day, and I see no real need for a “git for data” tool. Plus, as a data scientist, it allows me to keep code and data separate, especially if I’m working with confidential data.
On one hand, startups are as exciting as bank heists. You put together an amazing team, do your homework as best you can, then get killed trying to execute your perfect plan. I'm proud of what we built and we all learned a lot the hard way.
On the down-side, building a company is emotionally, physically, mentally exhausting. It wasn't really a matter of whether I could pay myself fairly; I drew a typical programmer salary along with everyone else.
However, the important detail is that you're spending someone else's money and every penny of it represents someone you respect putting their trust in you, and you feel the weight of that every day.
Ultimately, I don't exactly regret it but I certainly wish that we weren't so convincing that we convinced ourselves of a market opportunity that we couldn't access or didn't exist at all. There was so much heat for "data" in 2011 it really seemed like we just had to show up with an amazing product.
> Ultimately, I don't exactly regret it but I certainly wish that we weren't so convincing that we convinced ourselves of a market opportunity that we couldn't access or didn't exist at all.
How should you really know this without trying?
If you aren't convinced, where should the motivation come from?
Every business has about 100 questions where if you know the answer to those questions, you are quite likely to be successful. The hard part is knowing which questions need to be asked of each business.
To be clear: my co-founders and I were all seasoned, multi-startup people. We had extremely high-calibre investors with finely tuned bullshit meters. In the end, our own pitching skill undid us because I think less-convincing founders would have undergone 25% more scrutiny and that would have required us to demonstrate that we had signed LOIs from 3-5 real customers before starting.
There was a subconscious misdirection around the fact that we didn't actually have anyone beating down our door. We let the excitement for "data", our personalities and our track records carry the moment.
Of course founders have to drink their own Kool-Aid to some extent or they won't make it. But there's real power and value in the customer development mindset. People want this? Prove it.
Grant funders are starting to require teams to publish their code and data. Maybe they're the target audience? Data repo vendors could get on the list of approved vendors for teams receiving funding.
OP says "we couldn't find anyone with an urgent problem that they were willing to pay to solve" and "there was no market to capture". Wouldn't a counter example contradict those statements, regardless of the size of Palantir's clients?
We've got a blog post on the storage system coming on Wednesday. It's a mashup of a Merkle DAG and a B-tree called a Prolly Tree. It comes from an open source package called Noms (https://github.com/attic-labs/noms).
The post Tim links here is a very apt description of what Pachyderm does. We're designed for version controlling data pipelines, as well as the data they input and output. Pachyderm's filesystem, pfs, is the component that's most similar to dolt. Pfs is a filesystem, rather than a database, so it tends to be used for bigger data formats like videos, genomics files, and sometimes database dumps. And the main reason people do that is so they can run pipelines on top of those data files.
Under the hood the data structures are actually very similar, though; we use a Merkle Tree rather than a DAG, but the overall algorithm is very similar. Dolt, I think, is a great approach to version controlling SQL-style data and access. Noms was a really cool idea that didn't seem to quite find its groove, whereas dolt seems to have taken the algorithm and made it into more of a tool with practical uses.
How does Pachyderm deal with GDPR requests? Is it possible to remove a file not just from the present but also from the history? It would be no use to delete a file on a GDPR request from the current version while still keeping it around in past commits.
Requests to purge data are one aspect of the GDPR that Pachyderm makes trickier. It makes it easier to remove a piece of data and recompute all of your models without it, because it can deduplicate the computation. But to truly purge a piece of data, deduplication becomes a hindrance, because the data can be referenced by previous commits, and even by other users' data. You can delete a piece of data and have it not be truly purged.
The best recommendation we have for that is that each user's data should be encrypted with a key that's unique to the user, and when that user asks you to purge their data you throw away the key. That means that even if two users have the same data, it will be stored encrypted with different keys, so if one asks for the data to be purged the other can still keep their data.
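A minimal sketch of that "throw away the key" idea using stock tooling (file and key names are illustrative only; this isn't a description of Pachyderm's implementation):

    # one key per user; encrypt that user's data with it before committing
    openssl rand -out keys/user_42.key 32
    openssl enc -aes-256-cbc -salt -in user_42.csv \
        -out user_42.csv.enc -pass file:keys/user_42.key

    # "purge" on request: destroy the key, and every copy of the
    # ciphertext still sitting in old commits becomes unreadable
    rm keys/user_42.key

The ciphertext can stay deduplicated and referenced by history forever; only the key is ever deleted.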
But then wouldn't the storage and distribution of keys become a similar problem to the original one? If the keys get distributed, then it's hard to really remove them.
Yes, all the keys do is scale the problem down. In general this is a very tough problem: everything else in the system is designed to avoid data loss, since that's the biggest, scariest failure case. But then, when you want to lose data, all the measures in the system to prevent data loss prevent that from happening.
What is your take on the need for time-travel queries for versioned, mutable data? Versioning immutable data items is not enough if you have structured data that is updated. Every time you update a data item, you store a full copy - not a diff of the actual data. You are not able to make "time-travel queries" - give me the data that was generated in this time-range, for example.
For example, if you have a Feature Store for ML and you want to say "Give me train/test data for these features for the years 2012-2020", this isn't possible with versioned immutable data items. Also, if you don't store the diffs in the data and instead store immutable copies, you get explosive growth in data volumes. There are 2 (maybe 3) frameworks I am aware of that allow such time-travel queries: Apache Hudi (Uber) and Databricks Delta. (Apache Iceberg by Netflix will have support soon.)
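For readers who haven't seen those, the Delta Lake flavour of this is exposed directly in SQL, roughly like the following (table and dates are hypothetical, and the exact syntax varies by version):

    -- snapshot of the feature table as it existed at a point in time
    SELECT *
    FROM feature_store.user_features TIMESTAMP AS OF '2020-12-31'
    WHERE event_date BETWEEN '2012-01-01' AND '2020-12-31';

Hudi exposes a similar point-in-time and incremental query model through its Spark integration.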
That's nice. Do you have any idea if it is possible to translate those rows into higher-level time-travel queries? Like if you could plug in an adapter to transform the rows into a data structure (parquet, arrow, json, whatever) that could be useful to analytics and ML apps?
Hi Sid, if you are curious about how it works internally, you can read some of the old docs from Noms here (Dolt uses a fork of Noms as its internal storage layer).
I can't tell you how happy I was when I discovered noms several days ago, and then how disappointed I was to find that it is not developed anymore. Anyway, now that it's in the open, maybe someone with the technical chops to develop such a thing will continue development. Here's to hoping! And good luck in your new endeavour, which also looks like a very cool project.
Thank you! For the record, Replicache uses Noms internally, so I think ... something ... will still become of Noms. We're just not sure how to move forward with it at this point.
And I personally find Noms an incredibly satisfying mental model to work with, so I hope that eventually some others will too.
Unfortunately it looks like merging is still under development, and encryption doesn't appear to be supported either. Definitely one for me to watch though.
You are correct. It is an homage to Git. We needed a word that meant "idiot" or similar, that started with D for data that was short enough to type on the command line.
I actually thought at first glance the name was Do-It, which reminds me of this story from Bill Atkinson doing user testing at Apple:
We had a place where we would put up a dialog that would say “Do It”, or “Cancel”, and we’d give somebody a document and say “Here, edit this and save it”, and they’d get to the point where they’re supposed to choose “Do It”, and they’d look a little miffed, and then hit cancel. And we saw that several times, different people.
[...] And when we saw this [in the video recordings], people looking a little miffed and then hitting cancel instead of do it, we turned up the volume and played it back, and [heard them mutter]: “What’s this ‘Dolt’?” I’m no dolt! So I hit cancel”.
In case it isn't obvious, this is a joke, Linus actually didn't name Linux after himself, the person who hosted the original source tree did: https://en.wikipedia.org/wiki/Linux#Naming
Linus is self-aware enough that I'm sure there's a kernel of truth to that joke, but both git and mercurial were clearly inspired by the BitKeeper fiasco. They describe an unpleasant and ill-tempered person, respectively.
There's a lot of different ways that you could interpret the name Linus chose. That's part of what made it clever.
“Mercurial” doesn’t mean ill-tempered. It typically means fickle, changing. That can be related to moods (regularly pleasant, but inclined to fly off the handle with little provocation) but that is by no means its only application. It can also mean sprightly (connecting with the “changing” meaning), which is quite the opposite of ill-tempered.
Interesting. I don't think I've heard it used in a positive way before. If I knew anything at all about Mercury in Roman mythology, perhaps the other possible interpretations would have been more obvious.
I was wondering how many new projects with names derived from insults would now arise. Having searched on github for a few, my conclusion is that all the best ones have already been taken.
Honestly, maybe this reflects my Americanness, but I presumed it was derived from the (western film/culture) word, a corruption of 'get'. Today I learned that it means something else in British English.
Yeah, I too am tired of projects named mean-ly. Git, dolt, LAME, Gimp... DWARF is borderline, even if it has historical reasons (being based on ELF) it's not nice to hear out of context.
Even if it's a joke on yourself, just like, why would you give anyone who hasn't heard of your project the idea that it might be mean?
You wouldn't name your pet Dumbass. Why your pet project?
A bit off topic, but at summer camp many years ago, a counselor looked over my shoulder while I was using GIMP, and said something like "that's a sick joke". And that's how I learned about BDSM.
So, we ingest a third-party dataset that changes daily. One of our problems is that we need to retrospectively measure arbitrary metrics (how many X had condition Y on days 1 through 180 of the current year?). Imagine the external data like this:
When we ingest a new UUID, we add a column "START_DATE" which is the first date the UUID's metrics were valid. When any of the metric counts changes, we add "END_DATE" to the row and add a new row for that UUID with an updated START_DATE.
It works, but it sucks to analyse because you have to partition the database by the days each row was valid and do your aggregations on those partitions. And it sucks to get a snapshot of how a dataset looked on a particular day. It would be much easier if we could just access the daily diffs, which seems like a task Dolt would accomplish.
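For concreteness, the "snapshot as of day D" query against that layout ends up looking something like this every single time (the table name is made up; the columns are the ones described above):

    -- all rows that were valid on 2020-03-15
    SELECT *
    FROM external_metrics
    WHERE START_DATE <= '2020-03-15'
      AND (END_DATE IS NULL OR END_DATE > '2020-03-15');

...and every aggregation has to drag that validity-window predicate along with it, which is exactly the pain a first-class diff/versioning layer would remove.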
I mean it has a better chance of working than getting the third party to implement versioning on their data feed.
You can accomplish this using time-travel queries in frameworks like Apache Hudi and Databricks Delta, which I mentioned in more detail in an earlier comment. They only work for Spark-based data pipelines, though.
A year or so I looked into "git for data" for medical research data curation. At the time I found a couple of promising solutions based on wrapping git and git annex:
At the time GIN looked really promising as something potentially simple enough for end users in the lab but with a lot of power behind it. (Unfortunately we never got it deployed due to organizational constraints... but that's a separate story.)
I think they could find funding and use cases if they had something like licensing and terms of use baked into the data to track lineage.
E.g. "this columns contains emails" and is revokable. Or when you publish data, "this column needs hashing/anonymizing/...".
And if you track data across versions and can version relations, you can create lineage.
Overall, I've seen many of these lately and am waiting for one to really shine. Not because I think it's a grand problem, as I can already version my DDL/DML/code, but I see some need for it because I have a lot of non-tech people working with data, throwing it left and right and expecting me to clean up after them.
Amazon has an internal k/v store with branches (as well as schema versioning and hierarchical keys). It was primarily used for config management, which it excelled at.
Eh, I worked on a database with branches for 3 years starting in 2002, while I was at ESRI. It is called a versioned system... Here is how it works, from an answer I gave several years back on gis.stackexchange: https://gis.stackexchange.com/questions/15203/when-versionin...
Seems like a lot of work went into this and there are very smart people behind it. However, I can’t help the feeling that this will lead to so many unintentional data leaks.
Nevertheless, starred. Let’s see what it gives.
Really interesting. Would be nice to see documentation. All their examples show modifying the database by running command-line SQL queries; does it stand up a normal MySQL instance or just emulate one? Are hooks available in Go? Surprised they don't market it as a blockchain database. I'm building a Dapp right now and this could be really useful.
I think data (as in raw, collected / measured / surveyed data) doesn't really change, but you get more of it. Some data may occasionally supersede old data. Maybe the schema of the data changes, so your first set of data is in one form, and subsequent data might have more information, or recorded in a different way.
One really important feature of time series data is the preservation of what the dataset looked like at each point in time. Financial data providers will make a mistake (off by order of magnitude, missed a stock split, etc) and then go back and correct it. This means you end up training models entirely on corrected data, but trade based on uncorrected data.
We think this is the assumption (that data does not change) but we don't think it holds and data diffs help show you when something you did not expect did change.
Maybe not a killer app, but there are certain kinds of collaborative 'CRUD' apps that could benefit greatly from having versioning built into the database as a service.
For instance, how much of a functional wiki could one assemble from off-the-shelf parts? Edit, display, account management, templating, etc could all be handled with existing libraries in a wide array of programming languages.
The logic around the edit history is likely to contain the plurality if not the majority of the custom code.
We think over time (like years) we can achieve read performance parity with MySQL or PostgreSQL. Architecturally, we will always be slower on writes than other SQL databases, given the versioned storage engine.
Right now, Dolt is built to be used offline for data sharing. And in that use case, the data and all of its history need to fit on a single logical storage system. The biggest Dolt repository we have right now is 300GB. It tickles some performance bottlenecks.
In the long run, if we get traction we imagine building "big dolt" which is a distributed version of Dolt, where the network cuts happen at logical points in the Merkle DAG. Thus, you could run an arbitrarily large storage and compute cluster to power it.
Since Wil Shipley's presentation "Git as a Document Format" (AltConf, 2015, [1]), the idea of using git to track data has stuck with me.
Cool to see another approach at this.
From the first look, I miss the representation of data as plain-old-text-files, but I guess that's a little bit in competition with the goal of getting performance for larger data sets.
Anyway, I am wondering, did somebody here try using plain git like a database to track data in a repository?
The idea is good, and the product may be good too (I can't find any whitepapers or anything about the underlying technology). But some of their marketing is suspiciously unprofessional, like "Better Database Backups". In the DB world, you can't call a "backup" anything that can't restore all of your DB files bit-for-bit, anything non-deterministic. You can call it a "dump", an "export" or whatever, but not a backup.
I don't think they plan to compete in the DB backup storage market. So please don't mislead your potential customers.
I use a Python based CMS called CodeRedCMS for my website. They store all their content in a file called db.sqlite3. I use PythonAnywhere for hosting the site and they read the website-files from GitHub. So whenever I update my site (including the blog), I just push the latest version of the db.sqlite3 file to GitHub and pull it into PythonAnywhere.
So, as I understand, as long as the DB can be converted into files, it will work as anything else on Git and GitHub. What am I missing?
Non binary data can be saved as text - for example you can have an SQL database dump. You can put that text into git. What does this solution add to that simple idea?
Git takes existing files and allows you to version them.
Git for data would take existing tables or rows and allow you to version them.
A uniform, drop-in, open-source way to have a history of rows, merge them, restore them, etc., that works for Postgres, MySQL or Oracle in the same way. And that is compatible with migrations.
You can already have a history if you use Bigtable or CouchDB; there's no need for Dolt if the answer is to adopt a specific product.
Slightly related - how does ML track new data input and ensure that the data hasn't introduced a regression?
I would assume there's an automated test suite, but also some way of diffing large amounts of input data and visualizing those input additions relative to model classifications?
You generally can't analyse the accuracy of an ML system by each individual piece of data in the training set. Each batch of examples slightly changes the model, making their updates interact and combine during the training process, so it becomes extremely difficult to assign the contribution of individual examples. Of course you could retrain the model leaving one example out, but that would be exceedingly slow and the result would be inconclusive from a single run, because the stochastic noise of the training process is larger than the effect of removing or adding one example.
Related areas are confidence calibration, active learning and hard-example detection during training. Another approach is to synthesise a new, much smaller dataset that would train a neural net to the same accuracy as the original, larger dataset.
Looks interesting, depending on performance this could neatly cover a few use-cases I have at the moment without needing to build as much myself. At least dolt on its own, whether we would need the hub is another matter but I guess it depends on uptake.
Recently I was working with some open data and I was in need of a tool that transforms those CSVs/JSONs into something standardized that I can run queries against and use to patch the data. Maybe this is a use case for dolt.
git-lfs lets you store large files on GitHub. With Dolt, we offer a similar utility called git-dolt. Both of these allow you to store large things on GitHub as a reference to another storage system, rather than the object itself.
Is there a way to page SQL results? Also, it would be awesome if I could use rlwrap with `dolt sql`, so I can use the shortcuts I'm used to in a REPL environment.
Awesome. Yeah, my question wasn't clear, but I meant paging in the shell, as you correctly assumed. In a pinch, I can page an inline `dolt sql -q` query in the OS shell. But it would be ideal to be able to page results in the dolt shell, as we can in most SQL database shells.
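In case it helps anyone else, the two workarounds mentioned here look like this (the query is just an example):

    # readline-style shortcuts in the interactive shell
    rlwrap dolt sql

    # page a one-off query from the OS shell
    dolt sql -q "SELECT * FROM my_table" | less -S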
BTW, I should have written it above, but dolthub/dolt is quite impressive. I hope you all make it, because it's a great product that I would love to use at work if I eventually shift back over to a data science position (right now, working as a software dev).
Pardon my ignorance, but is data copyrightable? Or can it be owned? Obviously someone can get into trouble uploading proprietary code to git. Is there proprietary data?
We use AWS but will switch over to Google or multi-cloud when we exhaust our credits.
The system is still pretty simple. The main cost is the storage for the blobs in the Dolt repos pushed to DoltHub. We use S3 for that. There is an API that receives pushes and writes any other metadata (user, permissions, etc.) into an RDS instance that stores metadata for DoltHub. That instance is also used to cache some critical things. Then it's just a set of web servers and a GraphQL layer sitting on top, serving our React app.
You can export to whatever you like. In our imagination, data versioning would sit upstream of production just like source code versioning. You take the data out and do what you need with it in a "compile" step.
An example use case that "git for data" seems to break: storing data for medical research where the participants are allowed to withdraw from the study after the fact. Then their data must be deleted retroactively, not just in the head node. I don't know of a good methodology for dealing with this at all as it breaks backups, for example.
The problem extends beyond medical research due to privacy laws like the GDPR. A participant or user must be able to delete their data not merely hide it so as to protect themselves from data breaches. Suggestions welcome.
In principle, you should be able to 'rewrite history' in the same way you can already do with git. It is clunky to remove a file from all versions using git itself but easy using tools like bfg[0].
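For reference, the BFG route is short; the repo and file names here are examples, and the cleanup commands afterwards are the usual ones from its docs:

    # remove a file from every commit in a fresh (bare) clone
    java -jar bfg.jar --delete-files participant_42.csv my-repo.git

    # expire the old references and garbage-collect the now-unreachable objects
    cd my-repo.git
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive

As with any history rewrite, everyone downstream has to re-clone (or hard-reset) afterwards, which the sibling comment about Dolt also points out.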
You can rebase to change the history. As with git, if you do this, everyone with a clone will need to clone a fresh copy, as they can no longer merge with the remote HEAD.
But there are some limitations for large tables. Diffs only work if the file is kept in the same order. Plus, you have to import the data into some other tool to do anything with it.
https://www.youtube.com/watch?v=EWMjQhhxhQ4