Hacker News
Show HN: FSQL – Search through your file system with SQL-esque queries (github.com)
341 points by kshvmdn on May 15, 2017 | 126 comments



Does anyone remember WinFS (1)?

Bill Gates described it as his biggest product regret (2).

I remember thinking it was brilliant. Too bad it was probably a little too futuristic for its time, like a few other things they launched before the time was right... the clunky Tablet PCs (3) were for sure another example.

(1) https://en.m.wikipedia.org/wiki/WinFS

(2) http://www.zdnet.com/article/bill-gates-biggest-microsoft-pr...

(3) https://en.m.wikipedia.org/wiki/Microsoft_Tablet_PC


To clarify the comment re: Bill Gates's biggest regret: according to the referenced article, his regret was that Microsoft never shipped WinFS. He did not regret the product itself. It was unclear to me what the parent meant, as my first question was, "If he regretted it, did he allude to reasons why it's a bad idea?" But that question no longer makes sense when you realize that Mr. Gates ostensibly still believes in the idea.


I remember reading this from Bill Gates on his reddit AMA:

> We had a rich database as the client/cloud store that was part of a Windows release that was before its time. This is an idea that will remerge since your cloud store will be rich with schema rather than just a bunch of files and the client will be a partial replica of it with rich schema understanding.

He then confirms a few comments later that he was talking about WinFS.

https://www.reddit.com/r/IAmA/comments/18bhme/im_bill_gates_...



While WinFS was a good start, I think the idea of a semantic file system could be extended much further to the whole system (if it weren't for pesky POSIX)

I think most people would expect

/home/geokon/program1/src/

and

/src/program1/home/geokon

to have pretty much the same content

A tag-based file system that makes the two equivalent would eliminate all sorts of annoyances where you can't decide how to structure your file hierarchy (should you have bin/program1 bin/program2 src/program1 src/program2 or program1/src program1/bin program2/src program2/bin? Both layouts have their advantages).

Something like a "path/path/bin/path/path/bin" wouldn't work... but it's hard to find a case where you really need it. And the vast majority of the time the subfolders aren't strict subsets of the parent (like mammals/dogs mammals/cats mammals/whales - where dogs/mammals would be a little weird)


Years ago, I had a similar idea, but never did anything with it. Look at DNs in LDAP (and X.400/X.500), they are based on attribute=value pairs. What about a filesystem in which filenames were collections of attribute=value pairs?

e.g. /home/geokon/program1/src/foo.c

could become: user=geokon/program=program1/category=src/lang=c/name=foo

You could potentially decide that the order of the attributes is not significant, only the set of attribute-value pairs.

Downside: Too much typing. Although, maybe you could allow standard aliases for attribute names, so that:

user=geokon/program=program1/category=src/lang=c/name=foo

could also be written as:

u=geokon/p=program1/c=src/l=c/n=foo

(I think the attributes should be first-class filesystem objects, just like files are, as opposed to just text strings. Some of the values, e.g. an enumerated value like lang=c, should be first-class objects as well.)

I also took some inspiration (in concept not syntax) from https://en.wikipedia.org/wiki/Faceted_classification such as https://en.wikipedia.org/wiki/Colon_classification

Being too different from what everyone else is doing would be the real killer, however.
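
As a rough illustration of the attribute=value idea above, here is a minimal Python sketch that parses such a path into an order-independent set of pairs. The `ALIASES` table and the `parse_attr_path` helper are hypothetical names invented for this example; nothing here corresponds to an existing filesystem API.

```python
# Hypothetical alias table; the short names follow the comment above.
ALIASES = {"u": "user", "p": "program", "c": "category", "l": "lang", "n": "name"}


def parse_attr_path(path: str) -> frozenset:
    """Turn 'user=geokon/lang=c' into an order-independent set of (attribute, value) pairs."""
    pairs = set()
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected attribute=value, got {segment!r}")
        # Expand a short alias to its full attribute name, if one is defined.
        pairs.add((ALIASES.get(key, key), value))
    return frozenset(pairs)


long_form = parse_attr_path("user=geokon/program=program1/category=src/lang=c/name=foo")
short_form = parse_attr_path("u=geokon/p=program1/c=src/l=c/n=foo")
shuffled = parse_attr_path("lang=c/name=foo/user=geokon/category=src/program=program1")
print(long_form == short_form == shuffled)  # True
```

Because the result is a frozenset, the long form, the aliased form, and any reordering all compare equal, which is the "order is not significant" property described above.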


Just write geokon program1 src c foo, and then just display a list of files that match. You could for example also make fake folder/menus to navigate tags. (Which would just be appending filters.)
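
A minimal sketch of that "type the tags, list the matches" behaviour, assuming a made-up in-memory index (`INDEX`) that maps file names to tag sets. A real tool would keep this index on disk; the data here is purely illustrative.

```python
# Made-up in-memory index: file name -> set of tags (illustrative data only).
INDEX = {
    "foo.c": {"geokon", "program1", "src", "c", "foo"},
    "foo.o": {"geokon", "program1", "bin", "foo"},
    "bar.c": {"geokon", "program2", "src", "c", "bar"},
}


def search(query: str) -> list:
    """Return the files whose tag set contains every word in the query."""
    wanted = set(query.split())
    return sorted(name for name, tags in INDEX.items() if wanted <= tags)


print(search("geokon program1 src c foo"))  # ['foo.c']
print(search("geokon src"))                 # ['bar.c', 'foo.c']
```

Navigating a fake folder then amounts to appending one more word to the query, which narrows the match set.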


There's a project that kinda wants to do that, ie provide a virtual interface on top of the existing filesystem based on tags you give each file: https://tmsu.org/


That was my first thought, too. I'd love a file system that was more like a relational database.

Also see the Pick operating system.


Transactional ACID updates to file systems would be pretty fantastic.


yes. thinking about this for a long time already


BTRFS has had a bunch of problems trying to actually compete with traditional filesystems, though. In the distributed world, CalvinFS seems pretty promising to me.


I always wanted the file system to be built on a version control.


Also MUMPS. See https://en.wikipedia.org/wiki/Pick_operating_system for info.

In some ways, Microsoft is on the way there with the way PowerShell works, and the ability to script things through OS functions that return objects which can be queried. If we ever see a WinFS, it would be very powerful combined with PowerShell


For those of us on Windows: Everything [1] does the job quite nicely with much less verbose syntax.

[1] http://voidtools.com


I've been using this for years and it is LIGHTNING fast.

No need to "index" all the files because it reads directly from the MFT. If you create a new file matching the search pattern it's already sitting in the results window by the time you alt-tab.

Also, the "directory size" equivalent of Everything is WizTree [1] ... much faster than WinDirStat, which I see recommended way too often.

[1] http://antibody-software.com/web/software/software/wiztree-f...


Are you sure it doesn't index anything? There is even a section in the settings called Indexes.

Also when you launch it first, it's going to be empty and says it's scanning your folders and it takes a little bit until you see something.

I think it is still indexing (maybe using the MFT instead of recursively listing files and directories), it's just a lot better than Windows search indexing. And it might use this [1] to keep up to date? It's mentioned in the settings

[1] https://en.wikipedia.org/wiki/USN_Journal


No I'm not sure. Maybe it builds a rudimentary index... but give it a shot yourself and see. 15 seconds after installing, you can search your entire system instantly. It's crazy whatever it is doing.

And yes I do believe it uses the USN Journal to stay up to date.


There's actually a wikipedia page, and it explains how it works. As we suspected, it uses the MFT for initial indexing and change journal for updates.

https://en.wikipedia.org/wiki/Everything_(software)


Thanks for WizTree, you're right that it's an order of magnitude faster than WinDirStat. Only thing it's missing is that graphical block view, but for my usecase it will be perfect.


@all, do you guys know if osx / linux have the MFT like structures accessible to the user? Any links?


Linux ext systems have inodes, which store all the metadata except the name. Directories contain name + inode number.

NTFS stores the name in the MFT instead, which makes hard links a bit weirder.
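
To make the name-versus-inode split concrete, here is a small Python sketch using only the standard library. It assumes a filesystem that supports hard links (true for typical Linux and macOS temp directories); the file names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    alias = os.path.join(d, "alias.txt")
    with open(original, "w") as f:
        f.write("hello")
    os.link(original, alias)  # a second directory entry for the same inode

    st_a, st_b = os.stat(original), os.stat(alias)
    same_inode = st_a.st_ino == st_b.st_ino
    link_count = st_a.st_nlink
    # Each directory entry pairs a name with an inode number.
    entries = sorted((e.name, e.inode()) for e in os.scandir(d))

print(same_inode, link_count)
print(entries)
```

Both names resolve to the same inode number, and the metadata (size, timestamps, link count) lives with the inode rather than with either name.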


I've been using "SpaceSniffer" which is similarly faster than WinDirStat.


Directory Report is also faster than WinDirStat, and it can create more reports


Everything is pretty awesome, one of the things I really miss on Linux.

When I last looked for a Linux replacement for it, none had the real time updates or the instant search ui, and even those that claimed to index the file system for a quick search were very slow. I actually ended up writing my own hacky and rudimentary GUI over locate to achieve something that fits my needs (and of course doesn't support real time updates).

Maybe things changed since then? Any chance that the HN crowd knows of a good Linux replacement for Everything?


Have you tried FZF (the fuzzy finder)?

https://github.com/junegunn/fzf

It integrates well with UNIX shells, editors (VIM integration is excellent), git log search etc. It's really versatile.

Of course, there is also q:

http://harelba.github.io/q/

for something that is more like the software described by OP.


I don't know what features exactly Everything has, but maybe fsearch is a good alternative?

http://www.fsearch.org/


Try out KDE's search. It does realtime fulltext indexing and just works.

Also integrated with the KRunner framework.


I heard that Everything is dependent on the exact format of NTFS. The same algorithm isn't applicable to ext4 or HFS+.


Everything is awesome


I'd really like to know why Microsoft can't do something like this in Windows itself. Everything is very useful.


Isn't that what they tried to do with WinFS?

https://en.wikipedia.org/wiki/WinFS


I don't know their reasons for WinFS, but I do know that Everything's authors figured out a long time ago how to get great search with Windows' current FS.


My guess is that MS needs an excuse to index the contents of your files.


Everything does not index the content of your files, only the name and some attributes.

Also MS already indexes file contents by default, it just sucks at it. There have been several occasions where I use the find file by name syntax and Windows can't even find the file in the current folder.


This is exactly my point. The key to finding files is to use

  name:*mylostfile*.txt

But Everything excels at this so pretend you never saw it.


I recently learned that you can also do advanced search (for e.g. date of modification) in Everything 1.4 (beta):

filename dm:>=16/05/2017 size:5mb..9mb parents:>=3


This is pretty amazing. I've been using a Windows program called Agent Ransack [0] for finding files with regex, but this Everything program is so much faster. Incredibly so. Thanks for the tip!

[0] https://www.mythicsoft.com/agentransack


Everything is one of the first tools I install on every windows box. For me personally it is a must have. I don't think I know of any other tool which altered my workflow that heavily.


Does it keep everything local?

I couldn't find anything on the FAQ and I remember a similar tool posted here on HN and people complained that it called home for some search functionality.


everything in Everything is local


Cool, thanks


Agent Ransack is another good one.


This is my secret weapon.


Reminds me of osquery [0].

[0] - https://github.com/facebook/osquery


Thanks for sharing! I've installed this locally and I'm really impressed by how easy to use and powerful this is! I must have missed the previous mentions on HN [0],[1].

[0] https://news.ycombinator.com/item?id=8528460 [1] https://news.ycombinator.com/item?id=12600790


Haha, you're not alone -- https://github.com/kshvmdn/fsql/issues/2.


Doesn't Windows have something like this built-in? WMI or something?


yeah; wmi is incredibly powerful. before osquery linux and os x had nothing like it. it even has performance counters (albeit at slower intervals than win etl) at the ready.


Osquery also runs on windows now :)


On macOS, there is a query syntax [0] that's usable in Spotlight and the mdfind(1) command. Richer searchable attributes [1], but the results may have to be piped through other tools for formatting or other output.

[0]: https://developer.apple.com/library/content/documentation/Ca...

[1]: https://developer.apple.com/library/content/documentation/Co...


I think SQL is too verbose for use on the terminal. find + grep does the trick with way less verbose syntax (but also probably less readable). With that said, it is quite cool.


Don't see it offhand so asking:

1) How/where are you storing the index 2) Have you tried this on large (30+ TB filesystems)?


even without an index, having a way to project declaratively instead of relying on cut/sed is giving me hot flashes.


I wanted to write the exact opposite: a Mysql/Postgres client as a FUSE filesystem driver. Namespaces -> folders, tables -> (editable) CSV files, stored procedures and settings accessible as (editable) plain text files.


If someone put data in a column that wasn't valid, like a string in a bigint column, would the table be altered or would the FUSE driver refuse to make the change?


There's been some attempts. Here's one:

https://github.com/BMDan/DFuse


Sounds dangerous!


No more than DELETE FROM ... WHERE ... wait, where is WHERE?


Even this can be made safe(r) if you only only only connect to your database through a proxy that sanitizes queries. IIRC vitess adds an implicit LIMIT 10 to queries that don't have a limit.


A simple solution is to not allow UPDATE or DELETE statements without a WHERE clause.
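
A deliberately naive sketch of such a guard in Python. The `is_unsafe` helper is hypothetical, and the keyword check is an assumption made for brevity: a string literal containing the word "where" would fool it, so a real proxy would parse the statement instead.

```python
import re


def is_unsafe(statement: str) -> bool:
    """Naive check: an UPDATE or DELETE with no WHERE keyword anywhere in it."""
    text = statement.strip().rstrip(";")
    is_write = re.match(r"(?i)(update|delete)\b", text) is not None
    has_where = re.search(r"(?i)\bwhere\b", text) is not None
    return is_write and not has_where


print(is_unsafe("DELETE FROM users"))               # True
print(is_unsafe("DELETE FROM users WHERE id = 1"))  # False
print(is_unsafe("SELECT * FROM users"))             # False
```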


Seems if the query is always going to start with SELECT, that maybe it should be assumed? I would never use this though, ack or find seem sufficient to me.


It's a shame Bash used 'select' as an elaborate menu built-in - it'd be quite neat to name the binary that (and drop the quotes). Then you could just type the query right into your prompt!


You could use an alias. They're case sensitive so SELECT could be mapped without impacting the built in "select".


Just use the fish shell instead, and you can avoid the years of shell cruft of bash, or the endless customization of zsh. Bash is a good environment for shell scripting, but not really the best for user interaction. Although perl is probably the best environment for shell scripting.


Xonsh is far better both interactively and for scripting than bash, fish, zsh, or perl.


"alias select=command select" lets you override it.


Yeah, I like the idea of using sql but it is painful to write.

It would be nice to omit the select and the quoting; my suggestion would be that

    fsql "select * from ..."

could be written as

    select all from ...

and all could be assumed if omitted; so you could write

    select from

or just

    from ...

But unfortunately `select` is a sh(1) reserved word and `from` is an existing command! (It shows who your mail is from.)

So maybe select and from could be shortened to sel and frm.


> or just
>
>     from ...

Or just ... Why even bother naming fields. Just make <space> enter return everything from anywhere from all time for all people on all platforms.

How many times are we going to have this ridiculous suggestion that less characters/words is automatically better.

This is 'short/arrow functions' (pick your language) all over again, and invariably ends up with the situation where the new syntax is just fucking impossible to read at first glance, because it has so many variances.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

Parens are optional. Unless you have no arguments, then they're required. Curly braces are optional, unless you want more than a simple expression, or no return or to return an object literal, then they're required.

I grow weary enough of this bullshit notion that code must 'look pretty', but when you're using "less characters is always best" as the definition for 'pretty' it just becomes unbearable.


> this ridiculous suggestion that less characters/words is automatically better.

Not what I suggested; yes they are shorter but that's not the point. Can you find a longer name than for `select` that is more appropriate?

I sympathise but for interactive shells the ergonomics are different than for most languages.

Even then I don't mind long names but I do mind hitting the shift key. Which means avoiding most punctuation.


still arrow functions are the single one ES6 addition to js that made the language finally nice enough for me to enjoy ;-)

as a fp geek


You can do "alias select=command select", it works in bash and zsh (and maybe others).

In zsh you could also "alias select=noglob command select" and it wouldn't do wildcard matching. Then you could use

    select * from ...

and it would pass the asterisk to your select binary :)


I like this! I've started work on this [0], feel free to add to the issue if you'd like.

[0] https://github.com/kshvmdn/fsql/issues/10


It could be extended with other queries like “update” for batch file operations, I suppose. Like you, I’m fine with my existing suite of find/grep/ack/awk/perl/whatever, but I already know how to use them; a beginner or someone who doesn’t live their whole life at a Unix terminal could probably benefit from the simpler interface.


Pay me then sick of walking or riding


Didn't BeOS have some awesome database-like file system indexing and query system?


Sure; in general, a file system endowed with extended/extensible attributes can be naturally seen as a relational database (in which the files themselves are BLOBs).


This video talks in detail about the extended attributes in BeOS:

https://systemswe.love/archive/minneapolis-2017/ivan-richwal... - "Metadata Indexes & Queries in the BeOS Filesystem"

https://player.vimeo.com/video/209021697



find/grep/awk/ag get me a long way to be honest. However, I think this is a cool project because it makes filtering of file attributes (such as size) so much easier. No need for splitting strings and using regex. Cheers.


Not to take anything away from this project, but you can filter on size, permission, etc easily and robustly using just `find` (try -perm, -size, -{c,m}time, etc flags): https://linux.die.net/man/1/find

If you are splitting strings (from output of `ls -l` presumably) for such tasks, then definitely take a look at find.
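
For comparison, here is a rough Python analogue of that kind of `find` filtering, reading the attributes from `stat` data instead of splitting `ls -l` output. The `find_files` helper and its parameters are made up for this sketch, and it only loosely mirrors `find -type f` with `-size` and `-perm` tests (for instance, `find -size` rounds to blocks by default, while this compares bytes).

```python
import os
import stat


def find_files(root, min_size=0, perm=None):
    """Yield regular files under root that are at least min_size bytes and,
    if perm is given, whose permission bits equal perm exactly."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue
            if st.st_size < min_size:
                continue
            if perm is not None and stat.S_IMODE(st.st_mode) != perm:
                continue
            yield path
```

For example, `list(find_files(".", min_size=1024 * 1024))` would list files of at least 1 MiB under the current directory, and passing `perm=0o644` would additionally restrict the results to files with exactly those permission bits.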


Nice project, wish you the best ! Although tbh, I personally won't use this simply because I know enough of find(1) to not see the cognitive overhead of switching to sql to do filesystem /queries/.

Any examples where this would be better than using find (with the occasional filter thrown in) ?


Subselects would be a pretty awesome feature.

"select name from foo where name not in (select name from ../bar where date < ...)"

I'm usually fine with `find`, but when doing things more interesting than just "find files in this directory that are not in that directory", while uncommon, tend to make me think about my pipeline a bit.


Out of curiosity, I'm interested in how folks would do the "in this not that" folder query. At a gut shot, I'd assume that diff would be used. I'm about to dig through the find man page to see if it has something directly to help.


One tool for this is the 'comm' utility: given two files containing sorted lines, it can output one or more of (1) lines only in file 1, (2) lines only in file 2, and (3) lines common to both files.


Assuming no dups in file1, this outputs lines in file1 that aren't in file2:

    sort file1 file2 file2 | uniq -u

(double file2 is not a typo :)


Or (a little faster):

  comm -23 <(sort file1) <(sort file2)
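
The same "in this, not in that" logic can be sketched in Python, which also shows why the `uniq -u` trick needs file2 twice. Both helpers are illustrative names, and the sample lists assume no duplicate lines within file1.

```python
from collections import Counter


def only_in_first(lines1, lines2):
    """Lines of the first list that do not occur in the second (like comm -23)."""
    seen = set(lines2)
    return sorted(line for line in set(lines1) if line not in seen)


def uniq_u_trick(lines1, lines2):
    """Mimic `sort file1 file2 file2 | uniq -u`: keep lines that occur exactly once."""
    # Counting file2 twice guarantees none of its lines can have a count of 1.
    counts = Counter(lines1) + Counter(lines2) + Counter(lines2)
    return sorted(line for line, n in counts.items() if n == 1)


file1 = ["a.txt", "b.txt", "c.txt"]
file2 = ["b.txt", "d.txt"]
print(only_in_first(file1, file2))  # ['a.txt', 'c.txt']
print(uniq_u_trick(file1, file2))   # ['a.txt', 'c.txt']
```

With file2 counted only once, `d.txt` would also have a count of 1 and leak into the output, which is what the doubled argument prevents.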


Great idea! Got the issue up [0], feel free to add to it if you'd like.

[0] https://github.com/kshvmdn/fsql/issues/4


SQL is familiar-ish to a lot of people.

Contrariwise, I rarely see a non-trivial `find` invocation that does not have a few bugs. (including my own!)


I think the BeOS had a file system that was set up like a database that could be queried.


It sure did. Alas at the time I had a BeBox in college (mid-late 90s) I didn't know SQL yet ;) I think it just searched over file metadata, not contents, though I might be mistaken.


We have implemented something like this with a sqlite extension; it's pretty powerful, with all the goodies sqlite (and its extensions) provides...


Sounds interesting, would love to take a look if it's available anywhere.


Unfortunately, it was commercial development, so it's not released. But the implementation is relatively easy - it was a sqlite virtual table that (as far as I remember) looked in the WHERE condition for a dir field, and listed that directory (= returned stat() data). The whole thing was quite interesting, because almost every component was somehow hooked into sqlite (either with a function or a virtual table), so one could do pretty interesting things with SQL alone.
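
A much simpler stand-in for that idea can be sketched with Python's built-in `sqlite3` module. Note the swap: this is not a virtual table. It eagerly copies `stat()` data for one directory into an ordinary in-memory table and then queries it. The `files` table layout and the `load_dir` helper are assumptions made for this example, not the commenter's design.

```python
import os
import sqlite3
import tempfile


def load_dir(conn, directory):
    """Copy name/size/mtime for each regular file in `directory` into a plain table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (dir TEXT, name TEXT, size INTEGER, mtime REAL)"
    )
    with os.scandir(directory) as it:
        rows = [
            (directory, e.name, e.stat().st_size, e.stat().st_mtime)
            for e in it
            if e.is_file()
        ]
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)


with tempfile.TemporaryDirectory() as d:
    # Create two sample files so the query has something to find.
    for name, payload in [("small.txt", b"x"), ("big.bin", b"x" * 4096)]:
        with open(os.path.join(d, name), "wb") as f:
            f.write(payload)
    conn = sqlite3.connect(":memory:")
    load_dir(conn, d)
    big = conn.execute(
        "SELECT name, size FROM files WHERE dir = ? AND size > 1024 ORDER BY name",
        (d,),
    ).fetchall()

print(big)  # [('big.bin', 4096)]
```

A virtual table would produce the rows lazily from the `dir` constraint instead of copying them up front, but the querying side would look much the same.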


Did someone come up with a generalized rule about putting SQL on top of every possible system that contains queryable information? Here is first-pass:

>eventually every system that contains information that can be queried will have a sql interface


That's just a lemma in the wider theorem that every sufficiently complicated application evolves to have ad-hoc SQL reports exported to Excel.


SQL is the standard language for querying relational data, so why not.


It's not a great standard. It turns practically the whole English language into keywords...


Considering its longevity as compared with other 'standards' that came around 40 years ago, I'd say it is in fact, a great standard.


Yes, it is a "Structured-English Query Language", formerly abbreviated as SEQUEL.


Yes, it is called "Spark driver" these days.


Nice! I'm actually working on a similar project to push lsof and files from /proc into some postgres tables. Lets me do cool things like query log files across a ~6000 server infrastructure similar to:

  SELECT distinct(l.name) 
    FROM lsof l, lsofer_runs r 
  WHERE l.lsofer_id = r.id 
    AND fd_type = 'REG' 
    AND l.fd ~ '[0-9][uw]' 
    AND l.name like '%log' 
  GROUP BY l.name, r.hostname 
  ORDER BY name

Best of luck!


so you're rewriting osquery? https://osquery.io/


His description sounded like it would do joins across different hosts. Osquery looks to be single host at a time only.


Yep.

I'm specifically writing it to find any log file that isn't being pushed into our third party logging service. It's a surprisingly difficult problem, especially considering the amount of tech sprawl that's accumulated. Since it's also a relatively low latency environment, it has to be written in a way that doesn't add too much load (without core isolation..).



Definitely crossed my mind, but I'm working on hosts where installing auditd isn't really easy. Broken yum and apt all over the place makes installing new packages almost impossible. Same goes for lsof, but its installed in "enough" places. Kinda nightmarish, but it gives me a chance to write some fun code ;).

Also, thanks for the article! Super interesting. Think that'd be better than implementing something on top of sysdig?


Auditd has the advantage of not being intermittent polling, which could miss something. Sounds like it isn't an option though.



+1 for anything by Brendan Gregg.

I wasn't aware of the polling limitations of sysdig, but it definitely explains some things I've seen in the past. This is definitely going in my toolkit. Cheers!

Edit: dammit, spelled his name wrong.


Hah, apparently.


I need this immediately.


I'll see about getting it into a public repo soon and let you know.


Is this something you could do with Presto? You'd need to write a custom connector and it doesn't look like there is support for dynamically adding/removing catalogs (https://github.com/prestodb/presto/issues/2445) but it would presumably handle the heavy lifting for you.


Looks very cool, is the code available anywhere? Would love to take a look.


The code is behind my company's GH, but here's a rewrite of the collector script:

https://github.com/red-bin/lsofer/blob/master/lsofer.sh


This is nice, but what I'm actually looking for is a lightweight clone of SharePoint Search[1].

Something that has a self-hosted web interface, and an engine that I can point at some file servers and let index the files to its heart's content. All I then have to do is search the index 'google style' for my files.

Any suggestions?

--

[1] https://i-technet.sec.s-msft.com/dynimg/IC423463.jpg


This is really neat. macOS has an easy-to-use smart folder feature, which I use to find recent files and large files. An interface like this is an advantage because it's easy to understand what it's doing and it's cross-platform. Other people claim it may be verbose, but being verbose makes the operation clearer, and SQL is familiar to programmers who are power users.


FROM dir1, dir2 doesn't mean the same as in SQL. In SQL that's a join of dir1 and dir2, but here it's a union.
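
The difference is easy to see in a real SQL engine. The sketch below uses Python's `sqlite3` with two tiny made-up tables standing in for the directories; it illustrates standard SQL behaviour and says nothing about how fsql is implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dir1 (name TEXT);
    CREATE TABLE dir2 (name TEXT);
    INSERT INTO dir1 VALUES ('a.txt'), ('b.txt');
    INSERT INTO dir2 VALUES ('c.txt'), ('d.txt'), ('e.txt');
    """
)

# In SQL, a comma in FROM is a cross join: every row of dir1 paired with every row of dir2.
cross_join = conn.execute("SELECT * FROM dir1, dir2").fetchall()

# What the comment says fsql does with "FROM dir1, dir2" is closer to a union of both listings.
union = conn.execute(
    "SELECT name FROM dir1 UNION ALL SELECT name FROM dir2"
).fetchall()

print(len(cross_join), len(union))  # 6 5
```

Two files and three files give six paired rows under the join but five rows under the union, which is the result a file search would be expected to return.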


Hence the esque. :)

Good point though, I'll make a note of that in the README.


I never know when I have to use find vs grep. And linux grep is different from macOS grep so I google about it every day lol. I just never figure it out. I think I'll be a heavy user of FSQL.


Reminds me a bit of SPARQL, that is used on the linux desktop e.g. by Gnome Music to find your music collection through Gnome Tracker


Kudos to the authors for taking an idea I've had for a while and actually doing it. Very, very cool.


Really cool idea, but I'm really missing the WHY section in the Readme file.

Thinking about a use case is quite hard. Anyone?


Really cool idea


This is exactly why I love PowerShell.



