
Ask HN: How to (instantly) search by filename in Linux? - brbsix
I'm hoping someone has some advice or a tool to recommend for this particular problem. I have many tens of thousands of (binary) files in a few directories spread over a few filesystems. I'd like a way to search by filename (either pattern or regular expression).

Right now I am using a quick script to compose a somewhat complex `find` command. This is otherwise effective but quite slow. Each query takes anywhere from 1-20 seconds.

An alternative that I pursued was `mlocate`. A daily cron (or systemd timer as the case may be) script generates a database for each directory (e.g. `updatedb --database-root DIRECTORY_A --output DATABASE_A.db --require-visibility 0`). Then to search: `locate --basename --database DATABASE_A.db:DATABASE_B.db:DATABASE_C.db PATTERN`. However, unlike `find`, `mlocate` does not offer an ignore or '!'. I suppose I could then strip ignored paths from the output with another tool, but things are starting to get pretty hackish at this point.

Perhaps there is something like Bitbucket's Quick File Search [1]?

Or something along the lines of etsy's hound [2], but for files rather than code of course. I've been using hound for instant search of all my repos and it is quite incredible.

[1]: http://blog.bitbucket.org/2013/02/07/introducing-quick-file-search/

[2]: https://github.com/etsy/hound
======
jjoe
This is probably over the top. But why not mirror those file names in
/dev/shm/ (memory) and search against that instead? Ex:

# create dir structure

find /media -type d -exec mkdir -p /dev/shm/{} \;

# create file structure

find /media -type f -exec touch /dev/shm/{} \;

Then search /dev/shm/ for your files.
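A minimal self-contained sketch of the idea — throwaway temp directories stand in for the real `/media` source and the `/dev/shm` shadow tree:

```shell
# Demo of the shadow-tree idea. In practice SRC would be /media and
# SHADOW a directory under /dev/shm so the name tree lives in memory.
SRC=$(mktemp -d)
SHADOW=$(mktemp -d)

# Fake a couple of binary files to mirror.
mkdir -p "$SRC/disk1/videos"
touch "$SRC/disk1/videos/clip-001.bin" "$SRC/disk1/readme.bin"

# Mirror the directory structure, then the file names
# (empty placeholder files only -- no data is copied).
(cd "$SRC" && find . -type d -exec mkdir -p "$SHADOW"/{} \;)
(cd "$SRC" && find . -type f -exec touch "$SHADOW"/{} \;)

# Searches now run against the small in-memory tree.
find "$SHADOW" -name '*clip*'
```

Note the rebuild walks the real filesystem once per refresh, so the shadow tree is only as fresh as the last run.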

------
tgflynn
How about:

locate "pattern" | grep -v "antipattern"

locate is installed by default on Ubuntu and, I think, on most Linux
distributions, so I'm not sure why you need to worry about the cron jobs and
all of those command line arguments. Normally that's taken care of
automatically (unless of course you have special requirements for when the
cron job runs).

~~~
brbsix
The system `updatedb` job (that creates the database `locate` uses) only scans
one root tree. This one root tree includes lots of random junk (basically your
entire filesystem minus a few things) and does not support multiple
filesystems/root trees. Hence the necessity to create multiple databases. I
agree it's pretty silly.

~~~
tgflynn
"does not support multiple filesystems/root trees."

I just tried the experiment on Ubuntu 14.04 and that statement does not appear
to be true. If I create a file on a mounted non-root filesystem then run
updatedb followed by locate with a pattern matching that filename I get the
file's path.

There are a couple of things that could be preventing this in
/etc/updatedb.conf. Make sure the mounted filesystem isn't excluded by
PRUNEPATHS or PRUNEFS. Also I'm not clear on what PRUNE_BIND_MOUNTS does but
my guess is that if set to yes it would exclude nfs mounted filesystems.
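For reference, the relevant knobs in /etc/updatedb.conf look something like this (values here are illustrative, not your distribution's defaults):

```shell
# /etc/updatedb.conf (illustrative values)
PRUNE_BIND_MOUNTS="yes"
PRUNEFS="NFS nfs nfs4 proc smbfs autofs sshfs"
PRUNEPATHS="/tmp /var/spool /media"   # a path listed here is skipped by updatedb
```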

~~~
brbsix
I was incorrect. It does support other mounted filesystems, just not multiple
root trees. As you suspected, `updatedb.conf` had /media in PRUNEPATHS.

------
zehemer
Since you mentioned Bitbucket's file search, fzf[1] immediately springs to my
mind. I don't know how well it would scale to your use case but it supports
pattern and regexp based searching.

[1]: [https://github.com/junegunn/fzf](https://github.com/junegunn/fzf)

------
RogerL
Recoll?

[http://www.lesbonscomptes.com/recoll/](http://www.lesbonscomptes.com/recoll/)

Haven't used it, I'm a Windows guy. I searched for "Everything for linux"
(Everything by voidtools provides this functionality for Windows, and is
fantastic).

Alternatively, can you run Everything under Wine?

------
atsaloli
How about caching output from "find" on the few filesystems and then searching
the output? You can refresh the cache from cron. It'd be faster than running
"find" each time.

~~~
brbsix
Yes, this was one of the first things I tried. It cut down on the extremely
long queries, but on average it took about the same time, if not longer. The
cached output from find was hundreds of MB in total. Despite its drawbacks,
`mlocate` is a speedier alternative. I'm just interested in finding out what
sort of better tools are out there. If only I could get access to Bitbucket
source and see how they managed. :)

~~~
atsaloli
Got it.

Do you search based on pattern in the filename only, or on the pattern in the
full path to the file?

I ask because my next suggestion would be to put the output of find into a
database so you can have the benefit of modern indexing technologies instead
of just having a linear index.

(If that works out, you might want to go the whole hog and put the objects
into the database too.)

~~~
brbsix
I only search based on a pattern in the filename. So I was using find:

    -name '*something*'

locate:

    --basename

Or to search the cached find output:

    grep -P '(?<=/)[^/]*something[^/]*$' find_output.txt
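To illustrate, the lookbehind in that grep restricts the match to the basename, so a pattern appearing only in a directory name does not match (throwaway listing for demonstration):

```shell
# Fake cached find output to show the basename-only behavior.
CACHE=$(mktemp)
cat > "$CACHE" <<'EOF'
/media/something-dir/other.bin
/media/disk1/archive-something.bin
EOF

# 'something' must appear in the final path component, since [^/]*
# cannot cross a slash and the match is anchored at end of line.
grep -P '(?<=/)[^/]*something[^/]*$' "$CACHE"
```

Only the second line matches; `/media/something-dir/other.bin` is skipped because "something" occurs in a directory component.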

Using modern indexing technologies sounds like a great suggestion. Any idea
where to start with this? I don't have much in the way db experience.

~~~
atsaloli
I would recommend storing the data in Postgres (an advanced open source
database). See how searching the table compares to using find.

You can optimize further by adding indexes. Postgres supports indexing that
can accelerate regular-expression matches, so if there are particular
patterns that you often search for, you can index for those.
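A sketch of the load-and-index step, using sqlite3 here as a lightweight stand-in for Postgres (in Postgres it is the pg_trgm extension that makes infix and regex patterns indexable; all names and paths below are illustrative):

```shell
# Build a tiny demo tree, load (path, basename) pairs into a table,
# and index the basename column. A plain B-tree index speeds prefix
# lookups; indexing arbitrary infix/regex patterns needs something
# like Postgres's pg_trgm.
DB=$(mktemp)
ROOT=$(mktemp -d)
touch "$ROOT/movie-trailer.bin" "$ROOT/notes.bin"

# One line per file: full path, tab, basename (GNU find).
TSV=$(mktemp)
find "$ROOT" -type f -printf '%p\t%f\n' > "$TSV"

sqlite3 "$DB" <<EOF
CREATE TABLE files (path TEXT, name TEXT);
.mode tabs
.import $TSV files
CREATE INDEX idx_name ON files (name);
SELECT path FROM files WHERE name LIKE 'movie%';
EOF
```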

------
DrScump
What can't you do with find? Are you quoting your wildcards properly?

~~~
brbsix
find is great, just slow. I'm interested in some sort of solution that can
scale.

