
How to build a search engine with common Unix tools (2018) [pdf] - faizshah
https://www.iaria.org/conferences2018/filesDBKDA18/AndreasSchmidt_Tutorial_SearchEngine.pdf
======
heinrichhartman
Those methods were always appealing to me, and I tried to build some home-use
applications with them (document management, even some bash CGI scripts), but
they fall over quite soon:

* How do you deal with words like "---" in your text that look like the match separator of grep?

* What if your filenames contain spaces?

* Are the sed/awk/perl one liners really all that readable and correct?

* How to catch and report failure conditions ... in pipe steps?

This stuff is great for interactive use and one-off ETL, not for applications.

Not sure what real alternatives are that give you:

* parallel execution

* seamless composition (like |)

* object passing, not byte streams

* quick to write

Most of the time I switch to Python for this, but it does not give you sane
parallelism. Sure, you can do this with Java + Akka, but that takes days to
build out...

Any recommendations?
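
One way to sidestep the pitfalls listed above is to stay inside a single Python process instead of a shell pipeline. In the minimal sketch below the pattern is matched as a literal string (so "---" is just text), paths are passed as objects (so spaces in filenames are harmless), and every failure is returned alongside the results rather than vanishing mid-pipe. The names `search_file` and `search_tree` and the `*.txt` filter are made up for this illustration, and the thread pool is only one of several ways to get parallel execution.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def search_file(path, needle):
    """Return (line_number, line) pairs for lines containing the literal needle."""
    hits = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:  # plain substring test, no regex or option parsing
                hits.append((number, line.rstrip("\n")))
    return hits


def search_tree(root, needle, workers=4):
    """Search every *.txt file under root using a pool of worker threads.

    Returns (results, errors): results maps path -> hits, errors maps
    path -> exception, so a failing file is reported rather than lost.
    """
    paths = sorted(Path(root).rglob("*.txt"))
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {path: pool.submit(search_file, path, needle) for path in paths}
        for path, future in futures.items():
            try:
                hits = future.result()
            except OSError as exc:
                errors[path] = exc
            else:
                if hits:
                    results[path] = hits
    return results, errors
```

Threads suit I/O-bound scanning like this. For CPU-heavy per-file work, swapping in `ProcessPoolExecutor` from the same module is the usual standard-library route.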

~~~
faizshah
The two I reach for first: Dask and SparkSQL

Dask is super easy and quick to learn. It provides similar features to Spark
but can be somewhat easier for the Pandas crowd. There's also Modin/Ray for
this, but I haven't tried it yet.

For very fast processing and ease of writing, SparkSQL is the tool I reach for.
Start a single-node Spark instance (super easy), then interactively wrangle
your data declaratively with SQL. Great for quick and dirty cleaning and
aggregation of big-ish data.

If you're into Google Cloud, BigQuery is currently my top tool for quick and
dirty processing, but you can do a lot more with your $5/TB on a giant Compute
Engine high-mem instance and Dask or SparkSQL.

~~~
heinrichhartman
Thanks for this. I did not know about Dask! Wow, this looks great. Love the
web-based task visualizations:
[https://distributed.dask.org/en/latest/web.html](https://distributed.dask.org/en/latest/web.html)

~~~
faizshah
Check out the Dask Bag; it’s my favorite feature. It helps you deal with
non-tabular data that also might not be structured consistently:
[https://examples.dask.org/bag.html](https://examples.dask.org/bag.html)

Everybody I show it to likes it even more than working with data frames once
they grok it.
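
As a rough illustration of the kind of inconsistently structured data a Dask Bag handles, here is a minimal sketch. It assumes the `dask` package is installed; the records and field names are invented for the example and do not come from the linked page.

```python
import dask.bag as db

# Made-up records with inconsistent structure: some lack the "tags" key.
records = [
    {"user": "ann", "tags": ["unix", "search"]},
    {"user": "bob"},
    {"user": "cyd", "tags": ["search"]},
    {"user": "dee", "tags": []},
]

bag = db.from_sequence(records, npartitions=2)

tag_counts = (
    bag.filter(lambda rec: rec.get("tags"))  # drop records with missing or empty tags
       .pluck("tags")                        # keep only the tag lists
       .flatten()                            # one element per tag
       .frequencies()                        # (tag, count) pairs
)

# Nothing has run so far; compute() executes the task graph.
# The threaded scheduler keeps this sketch easy to run anywhere.
counts = dict(tag_counts.compute(scheduler="threads"))
print(counts)
```

The same chain of `filter`, `pluck`, `flatten` and `frequencies` applies unchanged when the bag is built from files on disk instead of an in-memory list.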

------
limesontoast
I am planning to read through this more thoroughly, as something I really want
is my own personal search engine. Ideally I need it to:

1. Store data locally for offline retrieval.

2. Support indexing big sites including Stack Overflow, Wikipedia, Reddit,
news.ycombinator.com, microsoft.com docs, and a bunch of other domains.

3. Be easy to add a single URL into the index from the command line and
optionally a browser plugin. Only index that page; this would replace my
bookmarks.

4. Optionally auto-store browser history for a custom period of time, and
purge it when it expires.

Does anything like that exist?

------
ko56
For those interested in building their own local offline datasets/search
engines check out Kiwix and Zeal. Understand how they work.

Code is open and there are a ton of already created data dumps + indexes. You
don't have to spend time rebuilding a Wikipedia/Wikidata/Stack Overflow dump
and index by yourself.

~~~
limesontoast
Thanks for the pointers, I'll check those out as well.

------
mickael-kerjean
I did build something like this (but even simpler) for the support website of
my side project:
[https://support.filestash.app/](https://support.filestash.app/). It's a PHP
script calling grep and displaying the results. It's very hacky but good enough
for its intended use case: searching through the entire IRC chat log.

~~~
probably_wrong
I did something similar too. For my last move, I wrote a detailed list of
which item was in which box. My original plan was to add a QR code to each
box, so I could quickly see what's inside.

But once I was done, I realized that I had it backwards. So I wrote a PHP page
that greps the list to figure out which box a specific item is in.
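
The same "grep the list" idea, sketched in Python rather than PHP. The "Box N: items" line format, the sample inventory and the function name `find_boxes` are assumptions made for this example; the original list and page are not shown in the thread.

```python
def find_boxes(inventory_text, query):
    """Return the box labels whose item list mentions query (case-insensitive)."""
    matches = []
    for line in inventory_text.splitlines():
        label, sep, items = line.partition(":")
        # Skip lines without a "label: items" shape.
        if sep and query.lower() in items.lower():
            matches.append(label.strip())
    return matches


# Hypothetical inventory, one box per line.
inventory = """\
Box 1: kettle, toaster, mugs
Box 2: winter coats, scarves
Box 3: coffee mugs, cutlery
"""

print(find_boxes(inventory, "mugs"))  # ['Box 1', 'Box 3']
```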

------
SanchoPanda
The linked site on the slides is password protected, and the Internet Archive
is silent on it as well; does anyone have a copy of the referenced materials?

~~~
cr0sh
I've put out a request to his email address for access; if I hear anything, I
will post back to this thread...

~~~
cr0sh
Should be open now:

[https://www.smiffy.de/dbkda-2018/](https://www.smiffy.de/dbkda-2018/)

------
duggan
I always enjoy demonstrating various combinations of cat, grep, uniq, sort,
and cut to folks unfamiliar with the command line; data scientists in
particular.

Even if you can't ship a bash script to production, they're great tools for
ad-hoc exploration and validation.
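
For readers coming from the Python side, a typical combination such as `cut -f3 | sort | uniq -c | sort -rn | head` corresponds roughly to the sketch below. The log lines and the function name `top_values` are invented for the illustration, and the correspondence is approximate (for instance, `cut` treats lines without a delimiter differently).

```python
from collections import Counter


def top_values(lines, field, sep="\t", limit=3):
    """Count the values in one column and return the most common ones.

    The field is numbered from 1, as in cut. Lines with too few
    columns are skipped.
    """
    counts = Counter()
    for line in lines:
        columns = line.rstrip("\n").split(sep)
        if len(columns) >= field:
            counts[columns[field - 1]] += 1
    return counts.most_common(limit)


# Made-up, tab-separated log lines.
log = [
    "2018-05-01\tGET\t/index.html",
    "2018-05-01\tGET\t/search",
    "2018-05-02\tPOST\t/search",
    "2018-05-02\tGET\t/search",
]
print(top_values(log, field=3))  # [('/search', 3), ('/index.html', 1)]
```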

------
riddleronroof
In SQL
[https://gist.github.com/sanealytics/0e910380576fbe4825455264...](https://gist.github.com/sanealytics/0e910380576fbe4825455264b125ecdb)

~~~
pstuart
sqlite has full text search capability:
[https://sqlite.org/fts5.html#overview_of_fts5](https://sqlite.org/fts5.html#overview_of_fts5)
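
A minimal sketch of FTS5 driven from Python's standard `sqlite3` module. It assumes the SQLite library that Python links against was compiled with FTS5, which is common but not guaranteed; the table name, columns and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Fails with an OperationalError if FTS5 is not compiled into SQLite.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("grep notes", "search text files with regular expressions"),
        ("sort notes", "order lines of text files"),
        ("moving list", "which box holds the kettle"),
    ],
)


def search(query):
    """Return titles of matching documents, ordered by FTS5's built-in rank."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank", (query,)
    )
    return [title for (title,) in rows]


print(search("text files"))
print(search("kettle"))
```

Multiple bare terms in an FTS5 query are combined with an implicit AND, so the first query only returns rows containing both words.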

------
blondin
Man... slides always feel out of context for me. I would rather have a blog
post than slides. These seem to cover the basic theory as well...

------
ninjamayo
why?

~~~
faizshah
There was a post earlier on using command line tools instead of Hadoop for
quick data processing. This shows a non-trivial example of how you could
implement a complex data pipeline and an overview of some of the commands you
could learn if you’re interested.

~~~
crmrc114
Yeah, this is a pretty cool post. I never considered doing something like this
in the shell. It seems silly to me that I forget how powerful basic tooling in
the native shell can be. Sometimes I have a jackhammer and I forget that a
sledgehammer will do the job just fine.

~~~
ahi
I've been there. 100s of lines into a Ruby program, then "oh yeah, cut sort
grep".

~~~
e12e
Just for anyone else: you could probably end up in a similar corner with Perl,
but in both cases it's likely a case of "holding it wrong". Ruby borrows
heavily from Perl, which borrowed heavily from shell with sed, awk, grep, cut
and friends.

So this kind of thing _should_ be quite doable in a short Ruby script, or a
few short scripts, albeit written in "shell" style with e.g. -n or -p (-n
wraps the code in "while gets ... end", -p does the same and also prints each
line), probably along with -a (automatically split lines).

It's in some senses an entirely different dialect of Ruby, though.

Some examples here:

[https://github.com/learnbyexample/Command-line-text-processi...](https://github.com/learnbyexample/Command-line-text-processing/blob/master/ruby_one_liners.md)
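
For comparison, the loop that the -n and -a flags set up implicitly can be written out by hand. The Python sketch below does roughly what `ruby -ane 'puts $F[1] if $F.size >= 2'` would do; both the one-liner and the function name `second_column` are illustrations, not taken from the linked page.

```python
import io


def second_column(stream):
    """Yield the second whitespace-separated field of each line that has one."""
    for line in stream:        # -n: implicit loop over the input lines
        fields = line.split()  # -a: split each line into fields
        if len(fields) >= 2:
            yield fields[1]


sample = io.StringIO("a b c\nd\ne f\n")
print(list(second_column(sample)))  # ['b', 'f']
```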

