GNU Recutils (tomasino.org)
505 points by carlesfe on Jan 26, 2020 | 143 comments

This is really neat. But before you integrate it into anything else you build, consider that it appears to be a very old blob of C code. It took afl-fuzz something like 2 minutes to start finding wild free crashes†. This seems like a worthwhile thing to reimplement in a higher-level language.

I don't know, maybe they've already done this work and ruled out anything bad from the dozens of unique crashes AFL trivially finds? I wouldn't want to pretend that I've done any serious inspection here.

For anyone else reading this thread and wondering if someone has reported this issue yet: as of 10:35 UTC 27 Jan, nothing seems to have been posted to the bugs mailing list [1]. I'd encourage anyone who has even minimal knowledge about fuzzing and security and AFL (I don't) to make a post there. Don't let the combination of bystander effect + impostor syndrome stop you from doing it.

[1] https://lists.gnu.org/archive/html/bug-recutils/

I think that a blog post with the details (exactly what to install, exactly what to run) will get a few upvotes here. (At least, I promise my upvote.)

Sometimes the manuals seem clear, but when you actually want to run the program you discover that it needs a library, or that the directories must have some specific names, or ...

For context, afl can take weeks to find crashes in nontrivial binaries.

The maintainer is still active, with a bugs mailing list and an IRC channel. Your findings would surely be well received.

My "findings" here would just be "take the first recins example from your blog post and feed it to afl-fuzz, then wait 2 minutes".

A good follow-up to this would be to get the afl-fuzz error report and send it to the maintainers. Maybe they're not even aware of those problems.

Honestly, I don’t think it’s fair to request from 'tptacek to do this. It was just a friendly heads-up from him. We are many here who could do it. So if anyone has some time over and feel free to dig in :) (Or feel free to refrain.)

s/and feel/feel/

if you can be bothered, rebuild it with debug symbols, run it, dump core and try and find exactly where the bug is.

I vaguely remember doing this with wget, there was a way to make it think the terminal's width is (unsigned)-4, then when printing the download status to stdout, it clears a buffer with a memset(ptr, ' ', -4). Of course -4 in this context is a huge number. It overwrote its whole self until segfault. (this issue was fixed, btw)

great learning experience, for anyone who knows enough C to understand what they're looking at.
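To see why that -4 is "a huge number": memset takes its length as a size_t, which is unsigned, so a negative value wraps around. Here is a small Python sketch that uses ctypes to mimic the conversion. The helper name as_size_t is mine, and this only illustrates the arithmetic; it says nothing about what wget's code actually looked like.

```python
import ctypes

# Width of size_t on this platform, in bits (64 on most desktops today).
SIZE_T_BITS = 8 * ctypes.sizeof(ctypes.c_size_t)


def as_size_t(n: int) -> int:
    """Return the value a C size_t would hold after being assigned n."""
    # ctypes integer types do no overflow checking, so this wraps like C does.
    return ctypes.c_size_t(n).value


wrapped = as_size_t(-4)
print(f"memset(ptr, ' ', -4) asks for {wrapped} bytes")
```

On a 64-bit machine that is four bytes short of 2^64, so the write runs off the end of the buffer long before it finishes.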

If I'd done anything significant, I would, but all I did was confirm the suspicion that this old C-language GNU tool hadn't been exhaustively fuzzed. I'm sure the recutils team can do a perfectly fine job fuzzing it themselves.

Or perhaps you could write a blog post on how to use a fuzzer so we can all learn from your findings?

Something about this request gets under my skin like nothing I've read on HN in a very long time.

It was weird. I decided to take it as a compliment. But as it's an unearned one, I think I probably won't write a blog post about it.

Indeed it's a compliment. And also a humble request because I'm interested in this subject. Perhaps my way of expression was not the best? But I'll take some time to see what other resources are available there.

I'm serious that you can read the (excellent) Quick Start for AFL, pick a C program (try recutils!) and get afl-fuzz running very quickly, and it's really sort of self-explanatory once it's running. It's a really well-built piece of software.

Sometimes people write strange things ¯\_(ツ)_/¯

The Github page for afl-fuzz has a really excellent Getting Started doc.

Sounds cool. Could you share the link to their official git repo?

Hmm. Doesn't look very "hands on" to me (README.md). Or then I just couldn't find the document you mentioned in the previous post. But I guess one has to learn these things by trial and error then.

It did not take me much time to find https://github.com/google/AFL/blob/master/docs/QuickStartGui... which I guess is the doc he referenced.

> Or then I just couldn't find the document you mentioned in the previous post.

There you go: https://github.com/google/AFL/blob/master/docs/QuickStartGui...

For those looking for tutorials; in addition to the one already linked, I’m quite sure there are quite a few decent YouTube videos about fuzzing with AFL.

https://llvm.org/docs/LibFuzzer.html might also be quite interesting due to the potentially significantly higher fuzzing speed (no fork(2) for each try).

I give it a couple of months before it is rewritten in Rust, just like bat and ripgrep.

There is something here: https://github.com/aisamanra/rrecutils

I think Rust is not a choice for everyone / everything. Learning to debug and master C is still very much a valuable skill.

Blog author here. Glad to see someone submitted this and you all liked it.

Recutils has MUCH more to offer beyond the basic intro I gave here. It has wonderful org-mode integration for you emacs people.

Here's a recfile of my read books for reference. I generated this from my Goodreads export csv and a few recutils calls: https://ttm.sh/Equ.rec

Would you consider writing a second article on some of the more useful things you can do with it?

Or maybe link some resources you found most helpful?

Thank you!

A tip: on accented characters such as Spanish names:

    setxkbmap us -option compose:rwin &
Then just press [Right windows key] + [' ] , [a] in order to type in an 'á'.

I think you're referring to my book data? It's all from Goodreads export, not hand entered. I love my compose key and will get around to cleaning this up.

You can also use an international layout that uses AltGr

I forgot the "us" switch. Now it will work fine.

I noticed the “Yeelong” mention in one of the examples. Do you happen to have a Lemote Yeelong? Always been on the lookout for one and would love to hear about it.

I don't, sadly. That is an example from the Recutils help docs, not my own personal collection.

Recutils is really handy when coupled with command-line tools that return structured data.

`guix search`[^] outputs data in recutils format, so if you are searching for a database driver for Python, but want to filter out "python2" variants, and ignore uninteresting fields such as versions or dependencies, you can do:

  $ guix search python mysql | recsel -q 'python-' -p name,synopsis,homepage
  name: python-mysqlclient
  synopsis: MySQLdb is an interface to the popular MySQL database server for Python
  homepage: https://github.com/PyMySQL/mysqlclient-python

  name: python-pymysql
  synopsis: Pure-Python MySQL driver
  homepage: https://github.com/PyMySQL/PyMySQL/

  name: python-peewee
  synopsis: Small object-relational mapping utility
  homepage: https://github.com/coleifer/peewee/
Without recsel, the output is 100 lines long, with lots of duplication between the Python 2 and 3 variants.

[^] GNU Guix is a package manager that works on top of any GNU/Linux distribution.

Brilliant! Thanks. I will start outputting recfile format from now on.

Ah the joy of something old as a new discovery! What a great feeling, where has this been all this time?

OP/blog author: Great post, thank you for sharing your experiences

I've been sloshing in the text soup of Tiddlywiki, Sqlite3 (csv), VimWiki, OrgMode, Mediawiki and Freemind (xml), and now it looks like Rec is the next ingredient to experiment with.

FZF and, even more so, Ripgrep (https://github.com/BurntSushi/ripgrep) have been really great to add to the mix.

Inb4 someone makes or mentions a JavaScript-based version of this, which operates on JSON files.

In fact, you can already do the querying part using https://stedolan.github.io/jq/, I believe you can make modifications with it as well, but a different front end a la recins/recdel would make that a bit more convenient.

Sure, but do you want to write

    [{"title": "Opening",
      "lede": "What happens when we \"open\" a file?"},
     {"title": "ltrace",
      "lede": "It turns out that ltrace uses the \"ptrace\" system call."}]

or

    title: Opening
    lede: What happens when we "open" a file?

    title: ltrace
    lede: It turns out that ltrace uses the "ptrace" system call.
? When you get a parsing failure after you edit it (or after a sector on your SD card gets a read error), which one do you think will be easier to fix?
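For anyone who already has data in the JSON shape and wants the second form, the conversion is short. A minimal Python sketch, assuming flat records whose keys are already valid rec field names (the function name to_rec is mine, and it does no validation):

```python
import json


def to_rec(records):
    """Render a list of flat dicts as recfile text.

    One "Field: value" line per field and one blank line between records.
    A newline inside a value becomes a "+ " continuation line.
    """
    blocks = []
    for record in records:
        lines = []
        for name, value in record.items():
            lines.append(f"{name}: " + str(value).replace("\n", "\n+ "))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# The same two records as in the JSON example above.
doc = json.loads(r'''
[{"title": "Opening",
  "lede": "What happens when we \"open\" a file?"},
 {"title": "ltrace",
  "lede": "It turns out that ltrace uses the \"ptrace\" system call."}]
''')

rec_text = to_rec(doc)
print(rec_text, end="")
```

Note that the quotes need no escaping on the rec side, which is most of the readability difference.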

There's also jq for yaml: https://github.com/kislyuk/yq

In my (oceanographic) research area, some formats are binary and others are text-based.

Binary tends to be used for big datasets recorded by instruments that are left in the field, unattended, for months to years. Since every byte counts, these instruments cram information in very tightly. The binary nature of the files makes them a pain to deal with, but it also confers an advantage: the files are very seldom corrupted by a person who thinks they are benignly viewing the data.

The text files, on the other hand, sometimes get extra junk inserted because someone in the data analysis pipeline thinks it's OK to look at information in MSword or MSexcel.

Sometimes, opaque binary data formats are superior, in terms of data integrity.

I thought of this, while reading about recutils and thinking of a contrast with sqlite. Recutils looks great, but if I started sharing data in that format, I bet it wouldn't be long before derived versions of the files had become corrupted, as someone edited with MSword.

In this forum, people will sniff at people who use MSword, and I have done so, myself. But, it's a simple fact that some people who are good at one thing are not good at another. Some of my colleagues who use MSword for every silly thing (e.g. seminar announcements) are actually very good at their subject matter (e.g. the science talked about in those seminars).

I've been experimenting recently with YAML for this kind of thing, and it's working out really well for me so far.

I have a tool called yaml-to-sqlite ( https://github.com/simonw/yaml-to-sqlite ) which converts a YAML file into a SQLite database, which I can then use with Datasette ( https://github.com/simonw/datasette )

My biggest project with it so far has been my site https://www.niche-museums.com/ - a guide to small and niche museums. The museums themselves live in a single ~100KB YAML file in GitHub: https://github.com/simonw/museums/blob/master/museums.yaml

I have a CI script which builds that YAML file into a SQLite database and deploys it + Datasette + custom templates to https://www.niche-museums.com/

I've been running the site like this for a few months now and I really like it. I love having my content in source control, I find editing the YAML to be reasonably pleasant (I even edit it on my iPhone sometimes using the Working Copy app) and any YAML errors are caught by CI before they are deployed.

This looks at least as nice as the recutils format. But I like it better because there are more tools around to work with yaml.

datasette looks cool. I am going to try it out for a tool that I am building (store all my metadata in a SQLite database).

I especially like this idea where you distill data into its most primitive form. It isn't quite as hairy as using n3-notation, but the ability to just collect the data without regard to being CSV or SQL gets you to an interesting space where tracking these changes with GIT works better. I've worked with CSV with 400+ columns (don't ask, it was a horror) and SQL is really too verbose for these kinds of collections.

GNU Recutils is a nice metalanguage for data. You can churn out CSV from it which in turn can be \copy loaded into PostgreSQL. Maybe it's too many steps, but I found it an important format for creating the logical model of representing data without predisposing it to any other technology or encoding.

Unfortunately its org-mode integration with emacs is broken when using spacemacs-- too far for me to fix as I'm no Elisp wizard.

> tracking these changes with GIT works better

That's actually the main selling point I am seeing here.

Is there any better format (apart from yaml maybe?) to collaborate on datasets, that can also immediately be used with code?

Why haven’t I ever encountered this before? Seems cool and better than all the bespoke config files lying around on any given unix machine (passwd, hostnames, fstab, etc).

I wonder sometimes if, every 1-2 years, I should review the full catalog of well-maintained libraries and utilities available on Linux / GCC / LLVM / Python. I forget about many of them because they weren't relevant at the time I came across them.

Then I forget about that idea 10 minutes later when a new episode of my anime watchlist gets dubbed. I wonder how much that has cost me.

This is a fantastic idea! I might start a checklist (hmm, or a recutils db?)

I've been using emacs for a couple of years now and every couple of months I find something new and huge buried in it that surprises the heck out of me... how can these things hide for so long!

I used to do something similar, namely; every 6-9 weeks I would open up the Linux Kernel menuconfig and keep up to date with "everything", reading the helptext on each "(NEW!)" item.

Off topic, but do you have any good anime suggestions? My favorites are Deathnote, Attack on Titan, Full Metal Alchemist, etc. Trying to find something as amazing as Deathnote is challenging.

Thanks I'll check out all of these!

> a cat/mouse detective story with well defined rules and Sherlock vs. Moriarty levels of intelligence that _actually convince_ you that the characters are geniuses

Ah, then you'll love Monster: https://myanimelist.net/anime/19/Monster

Deathnote is pretty exceptional, so it's not surprising that you're having trouble finding other series of that caliber.

From your list, I'm guessing that you like anime that makes you think, and aren't turned off by story arcs that are overall dark and depressing.

The first thing that comes to mind is "Death Parade". It's currently available on Funimation, not sure where else.

A few more suggestions are "Elfen Lied" (although it can be a bit gory) and "Black Lagoon" (not super dark, but very entertaining imho).

Note: I only watch English dubs because I can't follow the action while also reading subtitles. If you're open to subtitled anime, I'm sure someone else could give you a much bigger list of suggestions.

I'm pretty much with you on that. I would prefer an English manga, but the translations are weaker in my experience when they're in the respective show. Although to be fair I've only both read and seen Deathnote.

I did watch the subs for the latest Attack on Titan because the dubs weren't out yet. But I was already invested enough in getting answers to that show's many questions as soon as possible.

It's a shame there's not much Deathnote level stuff. I'm so invested in that particular execution of a cat/mouse detective story with well defined rules and Sherlock vs. Moriarty levels of intelligence that _actually convince_ you that the characters are geniuses. Watchmen also comes to mind.

Any recommendations for non-anime that fits that genre? Novels, shows, comic books, etc?

I'm not sure. Are we talking about stories where the author is a genius, or where one of the characters is a genius?

For example, the film "A Beautiful Mind" is about a character who's clearly a genius, but I don't recall the script showing signs of its writer having rare intelligence.

Whereas stories like Deathnote have some plot twists that make me think the author is highly intelligent, regardless of the IQ of most of the characters. I'm guessing I'd put a story into this category if there were plot revelations at the end which in retrospect I could have figured out myself, but didn't until the moment chosen by the author.

Looks really interesting. I could imagine someone building an open source Notion alternative with this and offering Recfiles as an export option, which would enable all types of crazy use cases. Add a read/write API and now you can have bots acting on your knowledge base. Basically MediaWiki but with a long-standing underlying format and compatibility with some unix tools already.

Does anyone have any more resources (besides the manual of course [https://www.gnu.org/software/recutils/manual ]) they can recommend about Recutils and Recfiles?

I couldn't find much more, which is why I decided to write this intro. I may do a video exploration of the various tools and features as well. It seems like others are as excited by this as I am.

Earlier this weekend I was doing experimental performance testing of an idea I have for a wiki solution[0].

This, or SQLite could be really useful to embed data into pages.

[0]: yeah, I've had to dig beneath the surface of Confluence lately and Confluence has this weird property of immediately making me motivated to try writing a decent wiki

FWIW, if you want a ready-made solution, wiki.js might be worth looking in to.

Looks really nice, except for the AGPL license, which I used to think extended to any integration with any existing solution.

I've recently read someone from the FSF(?) saying this is only a misinterpretation by MongoDB and others, so at this point I am just utterly confused.

Also, when somebody uses AGPL that usually means: we found the scariest license we could find while still calling it open source, but, we have a commercial license to sell you.

However I couldn't find any licensing option. Does this mean it isn't just a way to sell commercial licenses?

I'm completely honest here. I really don't get it, but then again it took a while before I really understood the GPL as well so I'm ready to be enlightened :-)

TBH I hadn’t noticed that it’s licensed under the AGPL. I haven’t studied that license closely but yeah, I guess for many companies it’s considered a liability.

Thanks anyway!

I went ahead to study the FSF FAQ but they don't really answer it completely as far as I can see. The clear cut answers are:

- if you combine your program with an AGPL program then your program has to become AGPL as well, just like the GPL.

- if you use an AGPL program unmodified it doesn't seem like you have to distribute it to users who use it over the network

But as far as I can see the FAQ doesn't say what happens if the AGPL program reaches out to other applications to get data. For some reason I always thought anything that was touched by the AGPL program, either over the network or otherwise, would have to become AGPL.

If that isn't the case - and there is more and more to suggest that, then I think FSF should point that out clearly.

Notion is a free-form information space, Recutils is more about a collection of entries with a specific format. If anything Recutils could be the import/export format of Airtable instead.

Just for answering how and where it could be used: GNU Guix uses recutils format to display search results and more, for example use the recsel command to select sessions of interest

   $ sudo guix processes | \
    recsel -p ClientPID,ClientCommand -e 'LockHeld ~ "perl"'
    ClientPID: 19419
    ClientCommand: cuirass --cache-directory /var/cache/cuirass …

That's awesome!

This is so interesting, how is it that nobody seems to have heard of this?

There are so many usecases where this would fit much better than traditional approaches.

Thank you for sharing!

I think this gets really interesting when combined with the bash builtin that supports reading records into variables[0].

    recsel contacts.rec | while readrec
    do
       if [ $Checked = "no" ]
       then
          mail -s "You are being checked." ${Email[0]} < email.txt
          recset -e "Email = '$Email'" -f Checked -S yes contacts.rec
          sleep 1
       fi
    done

[0] https://www.gnu.org/software/recutils/manual/Bash-Builtins.h...

> and even field-level crypto

Like you might expect, this is quite poor. CRC32 for authenticity, mac-then-encrypt, no binding between keys and values, using low-entropy passwords directly as AES keys, and a fairly trivial looking read overflow in the decrypt function. That's just two minutes looking at one source file.
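On the "low-entropy passwords directly as AES keys" point, the usual remedy is to run the password through a key derivation function with a random salt first. A minimal Python sketch of that step alone, using PBKDF2 from the standard library; the function name and the iteration count are my own illustrative choices, and none of this describes what recutils does or fixes the other issues listed:

```python
import hashlib
import os


def derive_key(password: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Stretch a password into a 32-byte key (AES-256 sized) with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )


# The salt is random per encryption and stored next to the ciphertext; it is not secret.
salt = os.urandom(16)
key = derive_key("correct horse", salt)
print(len(key), "byte key derived")
```

The salt means the same password yields a different key each time, and the iteration count makes each guess cost something for an attacker.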

I like the idea. But I love sqlite. I think the perfect match of this idea and sqlite is a serialization format to and from Sqlite, in a nice text format. It could look something like this, but probably closer to the "line mode" that the sqlite3 tool implements.

I never used 'recutils' before, so, before I start investigating, did anyone try to pair this with Markdown? Could this be made an extension to Markdown, so I can encode data fields within markdown?

I had an idea a while back to this end. Markdown files as a database of sorts with queries, validations, actions/plugins, etc. written in Javascript. I documented the interfaces but haven't had time to implement it.



I imagine you could use recfmt templates [0] which generate valid Markdown, and use (pp [1] and) pandoc [2] to (pre)process into the desired final format.

[0] https://www.gnu.org/software/recutils/manual/Templates.html#...

[1] https://github.com/CDSoft/pp

[2] https://pandoc.org/

I'm not sure what you mean "data fields within markdown".

You could definitely use markdown within fields of a recfile. There would be some multi-line syntax you wouldn't be able to use, because it would conflict (e.g. '+' unordered lists), but other than that you could mark up text with Markdown, no worries.

Are some/most of the operations O(n)? How practical is this database then?

The purpose behind the software (from the authors point of view) can be found here: http://www.gnu.org/software/recutils/manual/Purpose.html#Pur...

It's practical for many use cases where that doesn't matter. Keeping track of your personal book collection is good, your bank's online transaction processing system probably not.

I can think of two scenarios (they overlap):

1. Reference data (countries, car makes and models, zip codes, tax rates, etc)

2. “Human scale” databases - ie stuff that a small group of people are manually curating

O(n) would probably be fine for either of those.

That is really cool.

An alternative is using JSON files with tools like jq, but Recutils looks much more powerful.

EDIT: nice, Python bindings https://github.com/maninya/python-recutils

I think the most current version is this one [0]; I submitted a bug report to request the GitHub repository be updated to point to the Savannah repository. Also, the Python bindings are here [1]

[0] https://savannah.gnu.org/projects/recutils/

[1] https://git.savannah.gnu.org/cgit/recutils.git/tree/python

I do not think that there was ever a period when "the only option for computing for quite a while" was text files. It certainly wasn't the 1960s, 1970s, or 1980s. dBase databases were not text files, for just one example.

I was referring to plain text as the interface to the machine, not the file formats. When your primary interface is through text, it makes sense that good tooling would develop.

So what I don't understand about recutils is: what category of problems does it really excel at over other solutions?

If I want human readable files I think I would opt for just a bunch of markdown files in a directory structure before I go the recutils route. If I want a lightweight database I think I would go for SQLite like others have already mentioned. In what situation do you really need human readable/editable referential integrity?

It's like markdown, but for data.

Just like markdown provides a quick way to render something pretty but is still readable in plain text, recfiles do the same for data: they have data types, keys and integrity, are easily queryable with the tools, and are also easily read in plain text.

Sure, you could use some type of CSV, but that is not as pretty to read as recfiles. The point is the same as markdown: usable as just text, but with good tooling around it.

With markdown files in directories, you would have to provide your own tooling.

Okay, but markdown is data (perhaps not well structured, but data nonetheless) + formatting. If I'm editing and going into the files directly anyway why do I need to be able to do queries? I guess I'm asking what's the point? Can't I find the info I want just using find and grep on a bunch of markdown files? Maybe I don't even have to do that if I've organized my markdown files sufficiently well.

Let me be clear, I get that you don't have db operations with a bunch of markdown files with a directory structure, but on small scale data repositories do you really need that? For example, if I have a bunch of recipes in my "personal database of markdown files" I can quickly find the chili recipe I'm looking for by going to the directory "Soups & Salads > Chili > Slow Cooker Chili Recipe" or something along those lines. Or I could have grepped for "slow cooker chili". Either way I'm going to find that recipe with no extra tooling on my part.

Where do rec file features add value? Plus, if I use rec files now it seems I have to build my own formatting because I can't rely on markdown editors to build it automatically for me. Or is there a way to specify formatting for rec files?

Recfile is for tabular data, markdown is for text documents.

Agreed. A Recfile is to CSV what YAML is to JSON (and what MarkDown is to HTML).

Some people might prefer human readable databases. For example if your database is not too large, you could use it with Recutils and version control the database itself.

You can do this with SQLite but it would require CSV export:



So maybe the key technical benefit is human readable diffs of the database where the tool enforces referential integrity? Otherwise, it feels like a subjective personal preference thing. (e.g. I like the color green, but you like the color red)

I think it would work for any situation that people would normally use MS Access? Though arguably Sqlite solves the same problem.

Those are not plain-text, not easily "hand-searchable" or editable by hand.

The idea is to take full advantage of plain text, just like, markdown! You could even use git for versioning.

They made the multiline field value format incompatible with the email/HTTP/etc headers one...


In the 90's there was rdb aka nosql. It was packaged in debian too at that time. I was a happy user. Then lost track of it. Thanks to this post I tracked it down again


    your database is a human-readable text file that you can grep/awk/sed freely, and a line-oriented structure makes it perfect for version control systems. 
But in fact it's not great for grep/awk/sed, etc. For those tools to work well, you'll find you need to keep each record on its own line.

Not for awk:

    BEGIN { RS = ""; FS = "\n" }   # paragraph mode: blank-line records, one field per line
    {
        for (i = 1; i <= NF; i++) {
            nfields = split($i, afields, ": ")
            if (nfields != 2) {
                printf("Bad fields count: %s %s %s\n", NR, i, $i) > "/dev/stderr"
                exit 1
            }
            field[afields[1]] = afields[2]
        }
    }

(Untested, but should be generally accurate.)

That parses records based on blank lines, into fields based on lines, splitting each field into a name and a value based on ": " as a regex.

For more, see the GNU Awk User's Guide:


Previous discussion (September 2017): "Recutils – Tools and libraries to access plain text databases called Recfiles (gnu.org)" https://news.ycombinator.com/item?id=15302035

Is recutils available as a library? Are there language bindings? Specifically asking for Go support.

Never heard of recutils before but the on-disk format looks compelling. There has been limited (no?) takeup however which makes me fear recutils are intrinsically broken

Yes, recutils comes in 3 forms: a C library, command line utilities, and an org-mode plugin.

> (yes, the offical package image is two gay turtles)

It's weird that the FAQ page linked there [1] doesn't seem to link back in any way to recutils' own page. The only way to get there [2] seems to be to click on the "Software" header link and Ctrl-F for recutils.

[1] https://www.gnu.org/software/recutils/faq.html#whyturtles [2] https://www.gnu.org/software/recutils/

Does it support values with newlines?

Yes, you can do \n, or a literal newline and start the next line with a +. The second technique is wonderfully readable.
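If you ever need to read the "+" form without the recutils tools at hand, it is easy to handle. A rough Python sketch, assuming only the basics of the format (blank lines between records, "#" comment lines, "+" continuation lines); the function name parse_rec is mine, and it treats "%rec:" descriptors as ordinary fields rather than interpreting them:

```python
def parse_rec(text):
    """Parse recfile text into a list of records.

    Each record is a list of (field, value) pairs rather than a dict,
    because a field name may appear more than once in a record.
    """
    records, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = []
        elif line.startswith("#"):
            continue  # comment line
        elif line.startswith("+") and current:
            # Continuation: drop the "+" and one optional space, then
            # append to the previous field's value after a newline.
            extra = line[2:] if line.startswith("+ ") else line[1:]
            name, value = current[-1]
            current[-1] = (name, value + "\n" + extra)
        else:
            name, _, value = line.partition(":")
            # Drop the single blank that separates the name from the value.
            if value[:1] in (" ", "\t"):
                value = value[1:]
            current.append((name, value))
    if current:
        records.append(current)
    return records


sample = """\
Name: Ada
Note: first line
+ second line
+
+ fourth line

# a comment between records
Name: Bob
Email: bob@example.org
Email: bob@example.com
"""

records = parse_rec(sample)
for record in records:
    print(record)
```

Keeping pairs instead of a dict is deliberate: the second record has two Email fields, and a dict would silently keep only one.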

Is there a GUI for viewing and editing recutils files? Or at least a generic GUI that's easy enough to add new backends?

Given this is human-readable, I'm curious to learn what you're looking for in a specialised GUI that any text editor wouldn't provide? (vim, nano, notepad++ VS Code, whatever)

I'm thinking of using this myself, while still being able to collaborate with non-programmers used to Excel and similar GUIs. But I enjoy working only with the text formats :)

Ah, that makes sense now!

I was so focused on "human readable" == "text editor" that I hadn't thought about GUIs in the more "G" (graphical) sense.

With the ubiquity and general awesomeness of sqlite I see no real use case for this toolset over sqlite.

While I agree with the general awesomeness of sqlite, I worry about the blanket statement that you've claimed.

If you read the 'Purpose' section of the documentation [1], it lists these issues with databases like SQLite:

    - The stored data is not directly human readable.
    - The stored data is definitely not directly writable by humans.
    - They are program dependent.
    - They are not easily managed by version control systems. 
I just want to say that I also love sqlite, but I cannot avoid worrying that when the asteroid strikes and the ocean men invade, I will not be able to program a sqlite driver on a homemade 8 bit computer.

[1] https://www.gnu.org/software/recutils/manual/Purpose.html#Pu...

Sure, text files are awesome, I get it and generally I agree. But sqlite is in every nook and cranny of our computer and mobile systems, it is in Android, it is in the browsers. I find it hard to imagine a postapocalyptic world where one would be able to read a file somehow, yet not be able to get access to some sqlite processing software.

I like recutils for educational purposes, but beyond that, for any project, however small, I'd rather switch to sqlite pretty soon.

Human readability. I don't need any other software but a text editor in order to easily view/edit the database.

My ambivalence exactly. The user in me likes plain text for its accessibility and flexibility, but the programmer in me dislikes it for its gross inefficiency.

Does some sort of general data compiler/decompiler exist? Please don't say Binary XML.

There are so many use-cases where the compute inefficiency of this just doesn't matter but the human efficiency of being able to read and edit the files is a huge win.

Consider the gross inefficiency in the other direction. To simply edit some simple data, you have to have how much code and infrastructure? And how much of that is plagued by bitrot?

It is a "scaling" problem. At small scales, yes, human time can be more valuable than computer time. At large scales, humans will sacrifice their own time to make computers work faster. The problem is with intermediate scale, where the computer takes a bit too long to do simple things.

Case: You really, really enjoy your text editor and have even built special tools that work within it, but it doesn't support SQLite and you generally enjoy using text tools anyway.

One key differentiator is that recutils are for use cases where you want your data files to be in a plain text format.

sqlite databases are binary blobs which don't have meaningful diffs if you check them into source control.

I wrote a SQLite <-> Json converter for this reason. Ended up coming up with a weird convention in the Json to represent relations. It might be cool to instead write a recfiles <-> SQLite converter. I want to use SQLite from within my app, but want to check in readable data to a source control system.
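The SQLite to recfile direction is small enough to sketch. A minimal Python version, assuming an ordinary rowid table and a trusted table name; the function name table_to_rec and the Book table are made up for illustration, and relations are not handled at all:

```python
import sqlite3


def table_to_rec(conn, table):
    """Dump one SQLite table as recfile text, with a %rec: descriptor on top."""
    # The table name is interpolated directly, so it must come from trusted code.
    # Ordering by rowid keeps the output stable, which keeps diffs small.
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid')
    names = [column[0] for column in cursor.description]
    blocks = [f"%rec: {table}"]
    for row in cursor:
        lines = []
        for name, value in zip(names, row):
            if value is None:
                continue  # a NULL simply becomes an absent field
            lines.append(f"{name}: " + str(value).replace("\n", "\n+ "))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Book (Title TEXT, Author TEXT, Year INTEGER)")
conn.executemany(
    "INSERT INTO Book VALUES (?, ?, ?)",
    [("Dune", "Frank Herbert", 1965), ("Untitled", None, 2001)],
)

rec_dump = table_to_rec(conn, "Book")
print(rec_dump, end="")
```

Checking in the text dump next to the code gives line-by-line diffs per field, while the app keeps using SQLite at runtime.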

The CSV format is really ugly - I barely consider it human readable.

Why would you check a database into source control?

Not trying to be funny, genuinely wondering.

Sometimes you want data under version-control. You want to track changes and maybe you want to be able to reverse them. An example is human-supervised data-imports.

Example: we operate a pipeline that pulls datasets from multiple government agencies and merges them. The sources are updated periodically and updates frequently introduce new inconsistencies. To reconcile them, we use auxiliary datasets that record tweaks to each source, such as entries to add, mutate or delete.

In our implementation of the pipeline, the source datasets, the tweak tables, and the resulting merged dataset are kept in Sqlite for ease of processing. The pipeline writes out dumps of each table during processing, even for intermediate stages. When running the pipeline, I can scan the diff and decide whether the changes are reasonable or tweaks must be introduced or removed. Once I'm satisfied with the result, the dumps are then committed to record the current known-good state.

When somebody wants to know why one record in the result is the way it is, I can determine how it changed in the source data, how it was tweaked, and what the result was before and after. It's really easy to produce diffs between revisions. Code-reviews are conducted over changes in pipeline code, validation logic, source and result all in one step.
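A rough sketch of what such a dump step might look like, not the actual pipeline: `dump_tables`, the one-file-per-table layout and the TSV format are all assumptions made for illustration. The point is only that rows are written in a deterministic order, so `git diff` shows real changes rather than reordering noise.

```python
import sqlite3
from pathlib import Path


def dump_tables(db_path, out_dir):
    """Write each user table to <out_dir>/<table>.tsv with sorted rows."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        names = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        # Skip SQLite's internal bookkeeping tables.
        tables = [n for n in names if not n.startswith("sqlite_")]
        for table in tables:
            cur = conn.execute(f'SELECT * FROM "{table}"')
            header = "\t".join(d[0] for d in cur.description)
            # Sorting the rendered rows keeps the output stable between runs.
            rows = sorted(
                "\t".join("" if v is None else str(v) for v in row)
                for row in cur
            )
            text = "\n".join([header, *rows]) + "\n"
            (out / f"{table}.tsv").write_text(text, encoding="utf-8")
    finally:
        conn.close()
    return tables
```

Values containing tabs or newlines would need escaping before this could be used on real data.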

In GNU Emacs SqlMode is not as nice to work in, whereas recfiles integrate out of the box with org mode. Doing the same with sqlite works, but is not nearly as nice.

Tell me when you can grep or sed an sqlite database, or cleanly check it into a git repo.

I've never heard of Recutils before, but I've been using grep, sed, awk, cut, etc. to work with data for years. sqlite is just too much of a hassle for version control and sharing. If I use sqlite, I have to export to text, commit and push, pull on the new machine, then import into sqlite. I've never run into performance issues even with thousands of records.

And where you need a reasonably performant relational data store, all those cumbersome gymnastics might be worth it.

But if you're working with a modest dataset where volume and performance aren't a primary concern, working straight in text is infinitely more convenient. There's a reason people do so much bulk data processing in CSV, JSON, or other easily manipulated formats.

SQLite is hugely more complex and therefore more brittle and difficult to debug than this.

Especially if you absolutely need 100% reliable compatibility between many different devices.

Even more so if the format needs to be part of a standard and/or a contract.

- SQLite is damn easy.

- It's almost a standard due to licensing.

- Tiny, really tiny.

Ah the difference between easy and simple!

It would be simple to implement an append-only writer in Recutils. I don't know where I'd start if I wanted that for Sqlite. I don't think the format allows it.

You would write a program that opens the database and performs insert statements.

I meant append-only on the filesystem layer which allows stream-reading for example.
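To make the filesystem-level idea concrete, here is a minimal sketch of an append-only writer plus a streaming reader for a subset of the rec format. `append_record` and `stream_records` are invented names, not part of recutils, and the reader only understands `Name: value` fields, `+` continuation lines, `#` comments and blank-line record separators.

```python
def append_record(path, fields):
    """Append one record, given as (name, value) pairs, to the end of a recfile."""
    lines = [""]  # a blank line separates this record from the previous one
    for name, value in fields:
        # Multi-line values continue on lines that start with "+ ".
        lines.append(f"{name}: " + str(value).replace("\n", "\n+ "))
    # Mode "a" only ever adds bytes at the end; existing data is never rewritten.
    with open(path, "a", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


def stream_records(lines):
    """Yield records as lists of (name, value), reading line by line."""
    record = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if record:  # a blank line closes the current record
                yield record
                record = []
        elif line.startswith("#"):
            continue  # comment line
        elif line.startswith("+") and record:
            name, value = record[-1]
            cont = line[2:] if line.startswith("+ ") else line[1:]
            record[-1] = (name, value + "\n" + cont)
        else:
            name, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # drop the single blank after the colon
            record.append((name, value))
    if record:
        yield record
```

Because the reader never seeks, it works just as well on a pipe or a file that is still being appended to.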

I understand. I think that should be avoided.

Go ahead and compare the SLOC count with a parser for the "rec" format.

Also, why are you ignoring the other points?

I recall a type of XML representation that looked like this, i.e. was line-based. And there was a Python (?) tool to convert XML to and from the line-based format. Would be nice to compare with recutils, but I can't find it now.

It was this. "PYX". It's far from as nice as I remembered.


I'm not sure exactly what I'm imagining or how it would work, but reading this post I can't help thinking some kind of Wikidata integration could be really interesting.

I remember at Amazon they open-sourced something in a similar vein, but it worked with a bunch of various key-value formats.

We used it heavily for triaging into log files and such.

Am I the only one who has problems with record insertion?

This is cool in a kind of "neat idea" way but please god nobody use this for anything that anybody else might ever use! I feel like it's almost irresponsible to tell people that this exists.
