Have you ever played with SQLite virtual tables (https://sqlite.org/vtab.html)? They could let you provide an SQLite interface while keeping the same structure on disk. Though it requires a bit of work (implementing the interface can be tedious), it can avoid the conversion in the first place.
Then again, do you need virtual tables?
The .warc structure won't change, so the tables won't change. But you could define SQL views for common queries instead.
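For example (hypothetical table and column names, just to illustrate the view idea):

    import sqlite3

    con = sqlite3.connect("archive.sqlite")

    # Assuming a hypothetical `records` table with one row per WARC
    # record; the view captures a common query once, so consumers
    # don't have to rewrite it.
    con.executescript("""
    CREATE VIEW IF NOT EXISTS html_responses AS
    SELECT target_uri, warc_date, payload
    FROM records
    WHERE warc_type = 'response'
      AND content_type LIKE 'text/html%';
    """)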
> PS: I do believe that at some point .sqlite will become the de facto standard for such initiatives. Sure, it's not text... but it's pretty close.
What is the advantage of moving around .sqlite-files, over just loading the (compressed) WARCs into sqlite databases when you need them?
I've been messing around with different formats for my own search engine crawls, and ended up concluding that WARC is a pretty amazing intermediary format that balances the needs of producers and consumers very well. I don't use WARC right now, but something similar, and I'll probably migrate to that format eventually.
WARC's real selling point is that it's such an extremely portable format.
Anecdotal evidence, but I produced a medium-size crawl in the past (~20TB compressed). I used distributed resources with off-the-shelf libraries (i.e. warcprox etc.) and managed to get corrupted data in some cases, where neither the length-delimited (i.e. offset + payload length) nor the newline-delimited (triple newlines between records) framing held any longer. It took me some time to build a repair tool for that.
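The resync idea boils down to something like this (a simplified sketch of one way to do it, not the actual tool: uncompressed WARC only, reads the whole file for brevity):

    def find_record_offsets(path, magic=b"WARC/1.0\r\n"):
        # Scan an uncompressed WARC for candidate record starts.
        # When the declared Content-Length and the blank-line framing
        # disagree, resyncing on the version line is a last resort; a
        # real repair also has to handle WARC/1.1 and per-record gzip
        # members.
        with open(path, "rb") as f:
            data = f.read()
        offsets = []
        pos = data.find(magic)
        while pos != -1:
            offsets.append(pos)
            pos = data.find(magic, pos + 1)
        return offsets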
SQLite has an amazing set of well-understood and documented guarantees on top of its performance, there's a host of potential validation tools to choose from, and you can even use transactions etc. So that alone seems like a great idea.
What's more, you can potentially skip CDX files if you have SQLite databases (or quickly build your own meta SQLite database for the others).
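A sketch of that meta database idea, using warcio (I believe its ArchiveIterator exposes offset/length accessors for exactly this kind of indexing, but double-check against the version you use):

    import sqlite3
    from warcio.archiveiterator import ArchiveIterator

    def index_warc(warc_path, db_path):
        # Build a CDX-like index of one WARC file in SQLite.
        con = sqlite3.connect(db_path)
        con.execute("""CREATE TABLE IF NOT EXISTS cdx
                       (uri TEXT, date TEXT, warc TEXT,
                        offset INTEGER, length INTEGER)""")
        with open(warc_path, "rb") as stream:
            it = ArchiveIterator(stream)
            for record in it:
                if record.rec_type != "response":
                    continue
                con.execute(
                    "INSERT INTO cdx VALUES (?, ?, ?, ?, ?)",
                    (record.rec_headers.get_header("WARC-Target-URI"),
                     record.rec_headers.get_header("WARC-Date"),
                     warc_path,
                     it.get_record_offset(),
                     it.get_record_length()))
        con.commit()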
A bit late to this thread, but I think WARC is a reasonable format for raw HTTP traffic. We should definitely have better tools to ensure the WARC files produced are valid, and that's one of the things we're building at Webrecorder.
Unless you're crawling really text-heavy content, most of the WARC data is binary content that doesn't really need to be in a db. However, SQLite or another database as a replacement for CDX is an appealing option, where WARC files can remain static data at rest and derived data (offsets, full-text search) can be put into a db.
We are experimenting with a new format, WACZ, which bundles WARC files into a ZIP, while adding CDXJ and exploring sqlite as an option for full-text search.
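Since WACZ is just ZIP, you can peek inside with stock tooling; a rough sketch (the exact file layout follows the current draft and may still change):

    import zipfile

    with zipfile.ZipFile("crawl.wacz") as z:
        for name in z.namelist():
            # e.g. archive/data.warc.gz, indexes/index.cdx.gz,
            # pages/pages.jsonl, datapackage.json
            print(name)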
I agree that it's better to build on solid, existing formats that can be validated, especially when large amounts of data are concerned!
> What is the advantage of moving around .sqlite-files, over just loading the (compressed) WARCs into sqlite databases when you need them?
I suppose a case could be made for easy extension: you don't need to change the entire spec to add another table to the SQLite database, maybe containing other metadata.
> WARC's real selling point is that it's such an extremely portable format.
I mean, so is SQLite: it is also approved as a LoC archival method. (See SQLite Archive [0])
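For reference, the SQLite Archive format is just one conventional table; roughly (snippet mine, not from that page):

    import sqlite3, time, zlib

    con = sqlite3.connect("files.sqlar")
    # The published sqlar schema:
    con.execute("""CREATE TABLE IF NOT EXISTS sqlar(
      name  TEXT PRIMARY KEY,  -- file name
      mode  INT,               -- access permissions
      mtime INT,               -- last modification time
      sz    INT,               -- original (uncompressed) size
      data  BLOB               -- zlib-compressed; stored raw when sz == length(data)
    )""")
    body = b"hello, archive"
    con.execute("INSERT INTO sqlar VALUES (?, ?, ?, ?, ?)",
                ("hello.txt", 0o644, int(time.time()), len(body),
                 zlib.compress(body)))
    con.commit()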
> What is the advantage of moving around .sqlite-files, over just loading the (compressed) WARCs into sqlite databases when you need them?
The .warc spec is ideal. I'm not saying we replace it (ref. xkcd: standards).
On top of what uniqueid said, "loading" is much slower and more cumbersome than it sounds.
I'm not saying SQLite will replace text (maybe my aphorism sounded too firm). I'm saying that maybe along with the .warc.gz archives at rest, one could have .sql.gz files at rest as well.
In other words: why not have ACID-compliant archives moving around?
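Concretely, getting an .sql.gz at rest out of a live database is nearly a one-liner (sketch):

    import gzip, sqlite3

    con = sqlite3.connect("archive.sqlite")
    # iterdump() yields the same SQL text that `sqlite3 archive.sqlite .dump` would.
    with gzip.open("archive.sql.gz", "wt", encoding="utf-8") as out:
        for line in con.iterdump():
            out.write(line + "\n")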
Seems like the benefit of SQLite is the sort of use case where you maybe don't want to load everything you've crawled into a search engine, but want to be able to cherry-pick the data and retrieve specific documents for further processing. Which is certainly a use case that exists, and indeed not really what WARC is designed for.
End-user facing applications usually don't consume website crawls, do they? That's impractical for many reasons, the sheer size alone being perhaps the biggest obstacle.
If you want to do something like have an offline copy of a website, ZIM[1] is a far more suitable format as it's extremely space-efficient and also fast.
WARC is a file format written and read by a rather small but specialized set of web crawler tools, most notably the Internet Archive's tooling. For example, its Java-based crawler Heritrix produces WARC files.
There are a couple of other very cool tools, such as warcprox, which can create web archives from web browser activity by acting as a proxy, and pywb, which plays back the same to make archived versions browsable, plus related libraries [1] (shoutout to Noah Levitt, Ilya Kreymer and collaborators for building all this).
The file format itself is an ISO standard, and it's very simple: capture all HTTP headers sent and received, and all HTTP bodies sent and received, and do simple de-duplication based on digests of the body.
There is a companion format, CDX, which builds indexes from warc files (which in turn are just concatenated records, so rather robust).
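To make the CDX idea concrete, here's what consuming one line of the JSON-flavored variant (CDXJ) looks like; field names follow the drafts I've seen, so treat this as a sketch:

    import json

    # SURT-form URL, timestamp, then a JSON blob pointing into the WARC.
    line = ('com,example)/ 20200101120000 '
            '{"url": "http://example.com/", "filename": "data.warc.gz", '
            '"offset": "345", "length": "1234"}')

    surt, timestamp, blob = line.split(" ", 2)
    meta = json.loads(blob)
    print(surt, timestamp, meta["offset"], meta["length"])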
Although all of this is great, I worry a bit about where we're heading with QUIC / UDP-based protocols, WebSockets and other very involved protocols, which ultimately make archival much harder.
If there's anything you can do to help these (or other) tools to keep our web's archival records alive and flowing, please do so.
The web archiving community is surprisingly small and fragmented (in terms of tools) given its impact. Thankfully the .warc format looks pretty powerful and standard for the web we have so far (which is a lot!).
Now with the new protocols, dunno, maybe it's too soon to worry? Then again, maybe it's an IPv4 / IPv6 analogy.
> There is a companion format, CDX, which builds indexes from warc files (which in turn are just concatenated records, so rather robust).
Good point. I was planning on combining this fact with the ATTACH option that SQLite has, which allows querying multiple database files [0].
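A quick sketch of what that looks like (hypothetical per-crawl databases, each with a `records` table):

    import sqlite3

    con = sqlite3.connect("crawl-2020-01.sqlite")
    # ATTACH lets one connection query several database files at once.
    con.execute("ATTACH DATABASE 'crawl-2020-02.sqlite' AS feb")
    rows = con.execute("""
        SELECT target_uri FROM records
        UNION ALL
        SELECT target_uri FROM feb.records
    """).fetchall()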
Oh hi, thanks for building this! I haven't had the chance to play with it, but my hunch is that sqlite for warc can fill a great niche and would be much more portable (and probably performant).
Allowing multiple DB files is a great idea, since that fundamentally enables large archives, cold-storage and so on.
A WARC file gives you exactly what you would see on the wire, or in the network inspector tab of your browser. It does nothing to the content, and that's the point.
The only thing you gain (and that's very important for other reasons as well) is an immutable ground truth to work from when creating the reader view of a given article.
It'd be neat to extend the warc format and tooling to support cached http responses for things like REST endpoints. Then you could make sure everything you did in a session is recorded for later use.
From the specification it would appear fairly straightforward once an approach was chosen...
Here's the relevant extract from the warc spec that informed my difficulty estimate -
"The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file."
Edit: Upon 2 minutes of reflection I think the way to go for what I'm envisioning is some kind of browser session recording -> replayable archive solution.
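For a feel of how little it takes to write such records, here's a sketch using warcio (one library option; the calls below match its README examples, but verify against your version):

    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    with open("session.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        # Hypothetical REST response captured as a 'response' record.
        http_headers = StatusAndHeaders(
            "200 OK",
            [("Content-Type", "application/json")],
            protocol="HTTP/1.1")
        record = writer.create_warc_record(
            "https://api.example.com/items", "response",
            payload=BytesIO(b'{"items": []}'),
            http_headers=http_headers)
        writer.write_record(record)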
Although this is a nice idea, it's extremely difficult to get it completely right.
Consider a SPA where navigation happens via XHR or similar requests and updates are JSON patched into the DOM. Even browsers have a hard time figuring out how to make this a coherent session.
Now with WARC, you get a single record per transfer, i.e. every JSON file, every image, every stylesheet is an individual record. It's completely up to the client/downstream tech to re-assemble this into a coherent page.
If you want to go down that road, my best suggestion would be to start with a browser's history - that's probably the most solid version of a session that we have right now.
I like that you're logging responses instead of just the result / payload. Reminds me of an idea I had of using something like a queue as an intermediary between a webserver and the backend. I don't exactly remember my reasoning right now.
The WARC format is extremely simple and yet so powerful. Most importantly, though, there are already pebibytes of crawled archives.
This is a fairly straightforward mapping of a .warc file to a .sqlite database. The goal is to make such archives SQL-able even in smaller pieces.
The schema I've come up with is tailored to my requirements, but comment if you can spot any obvious pitfalls.
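To give a flavor (a simplified sketch of the idea, not the full schema): one row per WARC record, headers kept verbatim, payload untouched:

    import sqlite3

    con = sqlite3.connect("warc.sqlite")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS records (
      id           INTEGER PRIMARY KEY,
      warc_type    TEXT,   -- request / response / metadata / ...
      record_id    TEXT,   -- WARC-Record-ID
      target_uri   TEXT,   -- WARC-Target-URI
      warc_date    TEXT,   -- WARC-Date
      content_type TEXT,   -- from the HTTP headers, for filtering
      headers      TEXT,   -- raw header block, kept verbatim
      payload      BLOB    -- raw body, untouched
    );
    CREATE INDEX IF NOT EXISTS idx_records_uri ON records(target_uri);
    """)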
PS: I do believe that at some point .sqlite will become the de facto standard for such initiatives. Sure, it's not text... but it's pretty close.