Yet another file in the .git directory. The work is impressive and certainly helpful, but I can already hear Fossil proponents say "just use SQLite", which is getting more and more true.
I love SQLite, and Fossil is very cool, but I don't see the fundamental difference between Git adding another file in the .git directory, and Fossil adding another table or index in the SQLite database.
This is about having a single, unified interface for all operations. This is all explained in great detail by SQLite itself at https://sqlite.org/appfileformat.html
> This is about having a single, unified interface for all operations.
Unless you're intending to run joins on git data, where exactly do you see any fundamental difference between running CRUD operations via an SQL interface or just importing/exporting a file?
That's the whole point: git data is highly relational. Retrieving a commit alone is completely useless to you, just as retrieving any of the core objects alone is. Every operation you do requires retrieving multiple, interconnected objects... which SQL excels at.
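To make the "highly relational" claim concrete, here is a minimal sketch using Python's stdlib `sqlite3`. The table names, columns, and short hashes are all made up for illustration; this is not git's actual storage layout, just a model of the commit/parent relationship:

```python
import sqlite3

# Illustrative relational model of commit metadata (not git's real format).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE commits (sha TEXT PRIMARY KEY, message TEXT);
CREATE TABLE parents (child TEXT, parent TEXT);
""")
db.executemany("INSERT INTO commits VALUES (?, ?)", [
    ("a1", "initial commit"),
    ("b2", "add feature"),
    ("c3", "merge feature"),
])
db.executemany("INSERT INTO parents VALUES (?, ?)", [
    ("b2", "a1"),
    ("c3", "b2"),
    ("c3", "a1"),  # merge commit: two parents
])

# One join retrieves a commit together with all of its parents,
# which is exactly the "multiple interconnected objects" case.
rows = db.execute("""
    SELECT c.sha, c.message, p.parent
    FROM commits c LEFT JOIN parents p ON p.child = c.sha
    WHERE c.sha = ?
""", ("c3",)).fetchall()
for sha, message, parent in rows:
    print(sha, message, parent)
```

The point is that "get a commit and everything it references" collapses into one declarative query instead of several lookups stitched together in application code.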
As usual with the SQLite / Fossil developer argumentation, it just seems very biased and far-fetched. Just one example:
> Pile-of-Files Formats. Sometimes the application state is stored as a hierarchy of files. Git is a prime example of this, though the phenomenon occurs frequently in one-off and bespoke applications. A pile-of-files format essentially uses the filesystem as a key/value database, storing small chunks of information into separate files. This gives the advantage of making the content more accessible to common utility programs such as text editors or "awk" or "grep". But even if many of the files in a pile-of-files format are easily readable, there are usually some files that have their own custom format (example: Git "Packfiles") and are hence "opaque blobs" that are not readable or writable without specialized tools. It is also much less convenient to move a pile-of-files from one place or machine to another, than it is to move a single file. And it is hard to make a pile-of-files document into an email attachment, for example. Finally, a pile-of-files format breaks the "document metaphor": there is no one file that a user can point to that is "the document".
More precisely:
> But even if many of the files in a pile-of-files format are easily readable, there are usually some files that have their own custom format (example: Git "Packfiles") and are hence "opaque blobs" that are not readable or writable without specialized tools.
What is advocated here is to transform the pile-of-files into a single SQLite database accessed through SQL queries. So instead of having only a few binary blobs, it turns everything into a single binary blob and forces the use of one specialized tool for everything.
> It is also much less convenient to move a pile-of-files from one place or machine to another, than it is to move a single file.
This is not true.
> And it is hard to make a pile-of-files document into an email attachment, for example.
I would not trust someone who had just sent their git repo over email.
> Finally, a pile-of-files format breaks the "document metaphor": there is no one file that a user can point to that is "the document".
A VCS will track source files. Maybe their argument is true for other applications, but for a VCS this is plain useless.
Indeed, having only an SQL connector accessing a database is a unified interface to the file. But unifying this for the user means that you have to move the complexity further down, as explained:
> But an SQLite database is not limited to a simple key/value structure like a pile-of-files database. An SQLite database can have dozens or hundreds or thousands of different tables, with dozens or hundreds or thousands of fields per table, each with different datatypes and constraints and particular meanings, all cross-referencing each other, appropriately and automatically indexed for rapid retrieval, and all stored efficiently and compactly in a single disk file. And all of this structure is succinctly documented for humans by the SQL schema.
Yeah, and I don't want to have this complexity managed by a single "entity", I want to have several different tools available to do whichever kind of work I need to do. If I'm working on graphs and need to store them, I would prefer having the ability to read my file directly in my other tools for graph analysis / debugging, without having to take the intermediate step of connecting to the SQL database, or redefining a way to work with the SQL paradigm to adapt my file format to the "dozens or hundreds or thousands of different tables, fields per table, each with different datatypes".
This point is even more salient regarding grep / awk. The author obviously prefers using the query language of his choice and disregards the variety of tools for working on text, but there are many, many tools available to do all kinds of work on it, and believing that
> An SQLite database file is not an opaque blob. It is true that command-line tools such as text editors or "grep" or "awk" are not useful on an SQLite database, but the SQL query language is a much more powerful and convenient way for examining the content, so the inability to use "grep" and "awk" and the like is not seen as a loss.
is just nonsense. File editing is conveniently swept under the rug, and querying the text is usually only the beginning: someone usually wants to parse the output and act upon it, maybe even put back a modified version (sed), and so on.
The author just seems close-minded and living in his own world, unable to imagine that other people might want to work differently.
This reminds me a lot of his rant against git and for fossil, with the exact same bad faith arguments and lack of knowledge about other ways to do things.
Your points are valid, especially considering that this page is explaining the benefits (for the author) of using SQLite as a generic application file format; however we're only talking about git here, and my usage of git is limited to "git some-command", sometimes "git some-command | grep foobar", and most of the time I'm in a GUI anyway. I'm not grepping the git objects directly, so whether I use a git subcommand or a sql subcommand won't make any difference to me. The real advantages of using sql subcommands for me are:
- I could probably plug that into something else with more ease than something that is git-specific
- I have more flexibility for querying out of the box, without learning the specifics of each subcommand. The full SQL language is there at my disposal for outputting exactly what I need
Use SQLite instead of git? Or git should use SQLite? If the latter, then one problem is you'd need to keep your own fork forever, as they don't accept patches. I'm not sure that's a price worth paying to reduce the number of files git uses.
Why is this a problem for you, anyway?
It's not a problem for me, because all I see is the different commands that _use_ the underlying infrastructure. It's more about the design that was chosen: if you want to speed up things with git you have to implement specific logic in application code that will write a file and will need to update it periodically to keep it up-to-date, instead of using a querying engine made specifically for this purpose.
I don't see git ever changing its file format, but I do see another tool that imports everything from git and gives you a read-only sqlite db where you can do whatever you want, including displaying a graph quickly as the post advertises.
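A toy version of such an importer can be sketched in a few lines. The sample text below stands in for the output of something like `git log --pretty=format:'%H|%P|%s'` (the format string, hashes, and table names are all illustrative assumptions, not a real tool):

```python
import sqlite3

# Stand-in for parsed `git log` output: "sha|parent shas|subject" per line.
sample_log = """\
c3|b2 a1|merge feature
b2|a1|add feature
a1||initial commit"""

# Import the history one-way into SQLite; git's own files stay untouched.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE commits (sha TEXT PRIMARY KEY, message TEXT);
CREATE TABLE parents (child TEXT, parent TEXT);
""")
for line in sample_log.splitlines():
    sha, parent_field, message = line.split("|")
    db.execute("INSERT INTO commits VALUES (?, ?)", (sha, message))
    for p in parent_field.split():
        db.execute("INSERT INTO parents VALUES (?, ?)", (sha, p))

# Once imported, arbitrary SQL is available; e.g. find merge commits
# (commits with more than one parent).
merges = db.execute("""
    SELECT child FROM parents GROUP BY child HAVING COUNT(*) > 1
""").fetchall()
print(merges)  # → [('c3',)]
```

The read-only db is then free to grow whatever indexes or derived tables a graph view needs, without git itself changing anything.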
I think you don't fully understand what you are proposing. The storage engine (file system or SQLite) has little to do with git graph algorithm performance. SQLite doesn't magically "display a graph quickly".
What I'm saying is, the iteration step from "existing dataset" (what we have today) to "faster data traversal" (what the article proposes) is a custom file with a custom format on one side, and the appropriate query/index on the other side; one is definitely more understandable, portable and maintainable than the other.
Except that SQL has never had great DAG data structures, queries, or indexes. You can model a DAG in a relational database, and you can use non-standard SQL extensions to get decent, but not great, recursive queries to do some okay, semi-poorly indexed graph work. But having maintained databases like that at various times, all of that becomes just as much a "custom file with a custom format", just as dependent on database version and business logic, as anything git is doing here.
If there was a stronger graph database store and graph query language for consideration than SQL you might be on to something. SQL isn't a great fit here either.
Fossil itself is stored entirely inside a SQLite db and uses nothing but SQLite to do everything it needs; if Fossil can do it, any VCS can do it. In fact, there is a whole section on exactly that point in the official SQLite documentation (https://www.sqlite.org/lang_with.html#rcex2).
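The "non-standard SQL extension" in question here is `WITH RECURSIVE` (the feature documented on the SQLite page linked above). A minimal sketch of walking a commit DAG with it, using made-up short hashes and an illustrative `parents(child, parent)` table:

```python
import sqlite3

# Illustrative ancestry edges: d4 -> c3 -> {b2, a1}, b2 -> a1.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE parents (child TEXT, parent TEXT)")
db.executemany("INSERT INTO parents VALUES (?, ?)", [
    ("d4", "c3"), ("c3", "b2"), ("c3", "a1"), ("b2", "a1"),
])

# All ancestors of d4 -- roughly the relational analogue of `git log d4`.
# UNION (not UNION ALL) deduplicates commits reachable via multiple paths.
ancestors = db.execute("""
    WITH RECURSIVE ancestor(sha) AS (
        SELECT ?
        UNION
        SELECT p.parent FROM parents p JOIN ancestor a ON p.child = a.sha
    )
    SELECT sha FROM ancestor WHERE sha != ?
""", ("d4", "d4")).fetchall()
print(sorted(r[0] for r in ancestors))  # → ['a1', 'b2', 'c3']
```

Whether this traversal is fast enough on a million-commit, heavily branched history is exactly the open question in this subthread; the sketch only shows that the query is expressible.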
I'm not saying SQL is the best way to store and query DAGs; any graph database would be better. All I'm saying is that SQL is probably better at designing and maintaining a solution than what git does with its custom file format and custom code.
I'm only comparing what the pile-of-files that git currently is and a full-fledged SQL database. None is perfect, but one feels overall easier than the other.
But you are also almost intentionally conflating the SQL standard with the SQLite implementation (a de facto standard of a sort, but not one recognized by any standards body to my knowledge), and with SQLite's particular binary format (which does change between versions even). That is a custom file format with custom code. Certainly it is very portable custom code, as SQLite is open source and ported to a large number of systems, but just because it is related to the SQL standards doesn't gift it the benefit of being an SQL standard in and of itself.
The SQL standards define a query language, not a storage format. There are SQL databases that themselves optimize their internal storage structures into "piles of files". In fact, most have at one point or another. SQLite is an intentional outlier here; it's part of why SQLite exists.
There's nothing stopping anyone from building an SQL query engine that executes over a git database, for what that is worth. Because you can't execute SQL queries against it today doesn't really say anything at all about whether or not git's database storage format is insufficient or not.
All of that is also before you even start to get into the weeds about standards compliance in the SQL query language itself, and how little is truly compliant between database engines, as they all have slightly different dialects due to historic oddities. Or the weeds that there has never been a good interchange format between SQL database storage formats other than overly verbose DDL and INSERT statement dumps. Those, again, are sometimes subject to compatibility failures when migrating between database engines, due to dialectal differences. Including what should be incredibly fundamental things like making sure that foreign key relationships import and index correctly, without data loss or data security issues, because even some of that is dialectal and varies between engines (drop keys, ignore keys, read keys, make sure everything is atomically transacted to the strongest transaction level available in that particular engine, etc).
Git's current pile of files may not be better than "a full-fledged SQL database", that's a long and difficult academic study to undertake, but a "a full-fledged SQL database" isn't necessarily the best solution just because it has a mostly standard query language, either.
Also, the on-disk format for SQLite has been extended, but has not fundamentally changed since version 3.0.0 was released on 2004-06-18. SQLite version 3.0.0 can still read and write database files created by the latest release, as long as the database does not use any of the newer features. And, of course, the latest release of SQLite can read/write any database. There are over a trillion SQLite databases in active use in the wild, and so it is important to maintain backwards compatibility. We do test for that.
The on-disk format is well-documented (https://sqlite.org/fileformat2.html) and multiple third parties have used that document to independently create software that both reads and writes SQLite database files. (We know this because they have brought ambiguities and omissions to our attention - all of which have now been fixed.)
@jasode's reply in this thread made a good summary of the two parallel back-and-forth discussions:
- on filesystem vs sqlite (put git files in sqlite): there's a good benchmark at https://www.sqlite.org/fasterthanfs.html claiming sqlite is up to 35% faster than the fs. I'd like to see the same benchmark with git's file pattern; also, it's a known issue with git that it was written for Linux first, hence optimized for Linux's (relatively) good fs performance (vs Windows and Mac at the time). Same with most OSS build systems that (over)use process forking, which is also very optimized in Linux.
- on Fossil vs git (why bother putting git files in sqlite and not directly jump to Fossil?): that was my comment, and it relates to the subject of this article (the commit-graph). I'm wondering if Fossil has seen the optimization that git has with regards to the number of commits, considering that sqlite is the only high-profile project that uses it. Maybe performance is supposed to be taken care of by the sqlite database itself?
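The fasterthanfs-style comparison in the first bullet can be sketched as a toy micro-benchmark: store N small blobs as individual files vs. as rows in one SQLite table, then read them all back. This is only illustrative (tiny N, warm caches, equal payloads), not a faithful reproduction of the sqlite.org methodology:

```python
import os
import sqlite3
import tempfile
import time

N = 1000
payload = os.urandom(512)  # one small blob, reused for every entry

# Pile-of-files variant: N separate files in a directory.
tmp = tempfile.mkdtemp()
for i in range(N):
    with open(os.path.join(tmp, f"blob{i}"), "wb") as f:
        f.write(payload)

# Single-file variant: the same N blobs as rows in one SQLite db.
db = sqlite3.connect(os.path.join(tmp, "blobs.db"))
db.execute("CREATE TABLE blobs (id INTEGER PRIMARY KEY, data BLOB)")
db.executemany("INSERT INTO blobs VALUES (?, ?)",
               [(i, payload) for i in range(N)])
db.commit()

t0 = time.perf_counter()
fs_data = []
for i in range(N):
    with open(os.path.join(tmp, f"blob{i}"), "rb") as f:
        fs_data.append(f.read())
t1 = time.perf_counter()
db_data = [row[0] for row in db.execute("SELECT data FROM blobs ORDER BY id")]
t2 = time.perf_counter()

print(f"filesystem: {t1 - t0:.4f}s  sqlite: {t2 - t1:.4f}s")
```

Whatever the timings come out to on a given machine, the interesting version of this experiment would use git's actual access pattern (loose objects plus packfiles), as the bullet above suggests.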
It's very much a meaningful metric when the entire point of TFA is caching the commits graph. This is only an issue when you actually have lots of commits, and even more so a very branchy graph.
> It's not a meaningful metric when discussing whether to use sqlite as storage for git instead of files in the .git
There are 2 different conversations happening.
You seemed to be responding to ggp (rakoo) "sqlite db vs files".
However masklinn was responding to gp (Aissen) question of "Fossil vs Git" performance and a charitable reading would be comparing the Fossil algorithms (also affected by combination with underlying SQLite db engine algorithm) for commit searches, graph traversal, etc. In that case, the high number of commits and total repository size to stress test Fossil/Git would be very relevant.