
Crab – SQL for your filesystem - cogs
http://etia.co.uk
======
jewel
> Just try these in Bash or PowerShell! > select fullpath from files where
> fullpath like '%sublime%' and fullpath like '%settings%' and fullpath not
> like '%backup%';

This isn't a very good example, because it's trivial to do in bash:

    
    
        locate sublime | grep settings | grep -v backup
    

(Replace `locate sublime` with `find / | grep sublime` if locate's results are
too old.)

> select fullpath, bytes from files order by bytes desc limit 5;

This is better. Here it is in bash:

    
    
         find / -type f -exec stat -c '%s %n' {} \; | sort -nr | head -n 5
    

Cherry picking another one that stood out to me.

> select writeln('/Users/SJohnson/dictionary2.txt', data) from fileslines
> where fullpath = '/Users/SJohnson/dictionary.txt' order by data;
    
    
        cd /Users/SJohnson/; sort dictionary.txt > dictionary2.txt
    

Some of the rest of the examples are trivial in bash, and others look
potentially useful. Of course they are trying to demonstrate its capabilities
so the examples are contrived. I can see how this would be useful for someone
who doesn't know the command-line, but as someone who is proficient in both
SQL is pretty verbose.

In the real world I'd switch to a scripting language for some of the more
complex cases, since they'd be rare.

~~~
untog
Sure, if you already know all that. But

    
    
        select fullpath, bytes from files order by bytes desc limit 5;
    

is infinitely more readable/parseable than:

    
    
        find / -type f -exec stat -c '%s %n' {} \; | sort -nr | head -n 5
    

Just something exists doesn't mean it isn't an incomprehensible pile.

~~~
hobs
I think powershell still isnt really that hard for this (and was mentioned
earlier)

Something like:

    
    
      gci / -recurse | sort fullname | select fullName, length -first 5
    

Really doesnt seem that hard in comparison.

~~~
untog
Agreed. It's interesting - last time I looked at PowerShell it seemed insanely
verbose, but maybe I was looking at some odd examples.

~~~
WorldMaker
As pointed out a lot of the verbosity in PowerShell is optional these days as
there are a good number of aliases for common commands.

The nice thing about the verbosity is kind of similar to the point of SQL in
that it too has something of a natural language DSL intent (in PowerShell
commands are expected to be "Verb-Noun" and verbs for the most part are
encouraged to be from a relatively small set of common verbs). This can make
it a bear to fully write out without an IDE (and there are multiple choices
these days) or tab completion (which gets better with each release).

The benefit however, is that typically you can very easily read someone's PS1
script if it has been written at full verbosity and know what it is doing.
It's often very close to self-documenting at that point.

------
cphoover
Reminds me of facebook's [https://osquery.io/](https://osquery.io/) does this
offer anything different?

~~~
michaelmior
osquery also has some level of setup overhead and requires a daemon running in
the background. I don't think this is true of crab.

~~~
sharvil
You don't need to run the daemon. It's also possible to just use it
interactively via osqueryi.

~~~
michaelmior
Thanks for clarifying. I haven't actually used osquery myself so I wasn't
aware.

------
thrownaway2424
As a reminder, we actually did have a filesystem with integrated SQL (-like)
query features in BeFS, in 1997.

------
gourneau
This is pretty cool, I have find myself wanting a tool like this for ages.
However, does anyone know of a pure open source alternative (just for Linux) ?

~~~
rakoo
A more unixy alternative to osquery is termsql
([https://github.com/tobimensch/termsql](https://github.com/tobimensch/termsql)):
it works with anything on the input, so it's a matter of "ls"-ing the correct
folder and then using SQL to output what you want.

------
justin_vanw
This is a brilliant idea!

You can do most of the same kinds of things via find and grep and some shell
foo, but honestly who can remember all of that? Maybe someone smarter than me,
but every time I need it I am reading man pages.

The find syntax to get files modified more than 20 minutes ago? How can you
remember that? But modified > now() - interval '5 minutes' (well that's
postgres but still), I can remember it and I haven't used it in 2 or 3 years,
because it's slightly less arbitrary and doesn't have 8 different gotchas.

EDIT:

find . -mmin -5 # that gets you files modified in the last 5 minutes. The part
I can't ever remember:

find . -mmin +5 # that gives you files last modified more than 5 minutes ago

find . -mmin 5 # apparently this is files modified exactly 5 minutes ago? The
fact that this syntax exists (and is different from +5) seems absurd to me.
What is the resolution? It must be minutes. This option exists only to confuse
people.

~~~
enesunal
may be an alternative

find . -mmin <5 # less than 5mins

find . -mmin >5 # more than 5mins

find . -mmin =5 # exact 5mins

------
oconnore
"Multi platform" ... only runs on OSX.

~~~
cogs
We're working on Windows, the query syntax and the functions are going to be
the same.

------
eka808
I like to use ad-hoc linq queries with linqpad to get this type of stuff done.

Ex : Directory.GetFiles(theDirectory).Take(50).GroupBy(a => ...)

------
brixon
Log Parser does this too (Windows only). [http://www.microsoft.com/en-
us/download/details.aspx?id=2465...](http://www.microsoft.com/en-
us/download/details.aspx?id=24659)

~~~
cogs
We looked at Log Parser before building Crab. The syntax was far from standard
SQL, it didn't have joins etc.

It didn't use string matching to identify subsets of files to query.

And maybe the most important thing, it doesn't have the exec function to run
operating system commands on the files you get back in your query results.

------
rufugee
I've been looking for something like this (and thinking about developing
something if I can't find a satisfactory solution) to use across many
different hosts to identify duplicate files, etc. I've got media spread across
many different linux and os x machines. Can crab handle this?

~~~
Qantourisc
Well a workaround is mounting and using joins ?

Now, be careful the example find duplicate file names (with equal size). Not
duplicate files ! Those would require check of the contents of the file.

Also there are tools like fdupe, same problem regarding remote hosts though.
Some tools use xattr to store hashes. Some might use DB's. (With xattr tools,
you just run the tool on the remote host first, then on the local host, if you
need to save bandwith.)

I however don't have the perfect answer.

~~~
rufugee
Yeah, ideally what I want is a daemon which hooks into libevent or something
similar, and each time a file changes or is created, calculates a checksum and
updates other metadata, and then finally provides this information back to a
central queryable database.

------
g4nt1
Makes me think of txt-sushi ([http://keithsheppard.name/txt-
sushi/](http://keithsheppard.name/txt-sushi/)). Can be pretty useful instead
of using awk.

------
crivabene
Reminds me of WinFS, which was cancelled but I always thought it was a
brilliant concept.

[https://en.wikipedia.org/wiki/WinFS](https://en.wikipedia.org/wiki/WinFS)

~~~
jug
I think it was abandoned when Microsoft realized they could achieve similar
object orientation built on the same old file system (so no
incompatibilities), just with the .NET layer in between. See also PowerShell
and the example above. :)

------
bechampion
IMHO ... doesn't replace the good old for,find,grep etc..

------
otterley
Does this maintain an inode metadata index as well? Otherwise how will you
avoid stat'ing the whole filesystem (or a branch thereof)?

Does it handle extended attributes?

~~~
cogs
Crab doesn't pick up metadata during the scan, because this would slow the
scan too much.

There is a wrapper function 'metadata' which returns a specific metadata item
from a file at a given path at query run time, but this basically runs mdls
under the hood.

Crab can handle extended attributes using the EVAL function function which
runs an OS command and returns the result as a string. But you have to parse
the string, for example to return the size of a resource fork:

select bytes +
coalesce(matchedgroup(eval('ls','-l@',fullpath),'.com.apple.ResourceFork\t(\d+)'),0)

------
zrail
Yay for commercial open source!

~~~
worldadventurer
It doesn't like it's open source - you download a time-limited trial version,
after which you need to buy a license. And there's no link to the source.

