
Show HN: PageFish search algorithm - peterburkimsher
https://github.com/peterburk/pagefish
======
sonofgod
Instead of addition, why not use an actual hashing algorithm, truncated to a
certain number of bits? That'll give you better binning and give you
flexibility in the filesize / number-of-files tradeoff.
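
For illustration, something like this (FNV-1a is an arbitrary choice of fast
hash; none of this is PageFish's actual code). Truncating to k bits gives 2^k
bins: raising k means more, smaller files; lowering it means fewer, bigger
ones.

```javascript
function fnv1a32(str) {
  let h = 0x811c9dc5;                            // FNV-1a 32-bit offset basis
  for (const byte of new TextEncoder().encode(str)) {
    h = Math.imul(h ^ byte, 0x01000193) >>> 0;   // xor, then multiply by the FNV prime
  }
  return h >>> 0;
}

function binIndex(key, kBits = 12) {
  return fnv1a32(key) & ((1 << kBits) - 1);      // keep the low k bits
}

console.log(binIndex("san francisco")); // which of the 2^12 = 4096 bin files to fetch
```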

I like the idea of using Unicode characters -- have you performance-tested it
against an ASCII filename of the same number of bytes?

~~~
peterburkimsher
Binning! That's the word, not "hash distribution" like I wrote. Yes, I'd like
to experiment more with different hashing algorithms. It needs to run fast on
the query side.

Unicode seems great, but I noticed that filenames are restricted on various
platforms, including Github. Uppercase and lowercase URLs are equivalent. So
perhaps I should consider a different subset of characters instead of the
first 65536 Unicode characters.
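
For example, bin names spelled in lowercase base 36 would survive
case-insensitive URLs and most filesystem restrictions (a sketch of the idea,
not the repo's current encoding):

```javascript
// Lowercase base 36 uses only 0-9 and a-z, so uppercase/lowercase URL
// equivalence can't merge two different bins.
function safeBinName(bin) {
  return bin.toString(36) + ".txt";
}

console.log(safeBinName(65535)); // "1ekf.txt" -- four safe ASCII chars instead of one exotic code point
```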

Performance is much better than grep's, because it only reads a fraction of
the file. I couldn't find a good benchmark for ElasticSearch/Lucene/Sphinx,
and apparently there isn't an easy way to compare.
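
Concretely, the query side only has to fetch one small bin file. A hedged
sketch of that shape (the base URL is a placeholder and the hash is my
arbitrary choice, not the repo's actual code):

```javascript
function binName(key, kBits = 12) {
  let h = 0x811c9dc5;                            // FNV-1a, one choice of fast hash
  for (const b of new TextEncoder().encode(key)) {
    h = Math.imul(h ^ b, 0x01000193) >>> 0;
  }
  return (h & ((1 << kBits) - 1)).toString(36) + ".txt";
}

async function lookup(query, baseUrl = "https://example.github.io/index/") {
  const key = query.toLowerCase();               // normalise before hashing
  const res = await fetch(baseUrl + binName(key));
  if (!res.ok) return false;                     // missing bin file => no match
  const lines = (await res.text()).split("\n");
  return lines.includes(key);                    // exact membership test
}

lookup("Kaohsiung").then(found => console.log(found));
```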

[https://stackoverflow.com/questions/44846271/performance-com...](https://stackoverflow.com/questions/44846271/performance-comparison-elasticsearch-vs-solr-vs-sphinx-vs-whoosh)

~~~
sonofgod
Whilst there are a few hashing algorithms explicitly designed to be memory-
and time-intensive to make password retrieval impractical, most are really
fast, even in Javascript, and so should be fine. (Given there's going to be
network access, I expect the hashing cost to be entirely negligible.)

In terms of performance I was more wondering whether or not "£" (one code
point, two bytes) was much less expensive than "Ao" (two code points, two
bytes). Just restricting yourself to ASCII might also have advantages for
ensuring portability.
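
A quick way to check the byte counts (my snippet, not from the repo):

```javascript
// "£" is one code point but two UTF-8 bytes, so it's the same wire size as "Ao".
const utf8Bytes = s => new TextEncoder().encode(s).length;
console.log(utf8Bytes("£"), utf8Bytes("Ao")); // 2 2
```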

[https://qntm.org/safe](https://qntm.org/safe) might also be of interest:
discusses potential issues with using unicode as a high-density encoding --
base-2048 is Twitter-safe.

------
nine_k
> _Server-side operations cost money. Client-side JavaScript is free._

This is a reasonable heuristic. OTOH when a page spins up fans in your laptop,
or, worse, makes your browser unresponsive due to the amount of JavaScript
executed, it's not as free as many users would like.

(I'm talking about the general principle, not the particular implementation.)

------
gregoriol
Obviously it is faster than grep... which is "linear search"! Compare it
instead to some "hashing search" to see how good or fast it really is. For
example, take SQLite's Full Text Search as a reference.
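
A rough baseline could look like this (a sketch using Node's better-sqlite3
package; my choice of tool, not part of PageFish):

```javascript
const Database = require("better-sqlite3"); // npm install better-sqlite3

const db = new Database(":memory:");
db.exec("CREATE VIRTUAL TABLE cities USING fts5(name)"); // SQLite full-text index

const insert = db.prepare("INSERT INTO cities (name) VALUES (?)");
for (const name of ["Baden-Württemberg", "Egypt", "San Francisco"]) insert.run(name);

// FTS5 tokenises and case-folds, so "egypt" finds "Egypt".
const hits = db.prepare("SELECT name FROM cities WHERE cities MATCH ?").all("egypt");
console.log(hits); // [ { name: 'Egypt' } ]
```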

Imho, there is no point in such a custom solution: it is too limited and too
simplified for real general use.

~~~
LordHeini
Using some database-backed full-text search will beat this by a huge margin.

Especially when it comes to actually working correctly.

As the author states, there are some caveats with invalid "hashes", and I
managed to get false negatives on the third try.

For example, 'Baden-Württemberg' and 'Egypt' are in the file but not found by
the search.

Not to mention fuzzy search, which is highly important if you are doing any
kind of text search.

And setting up some PostgreSQL or SQLite costs nothing compared to the cost
of debugging this rather naive approach.

~~~
peterburkimsher
Some queries (e.g. San Francisco) work on localhost but not Github, apparently
because of case-insensitive URLs. That's probably what's up with Egypt and
B-W.

A better hashing/binning algorithm should fix that.

I don't think this can compete with a database for features. It's more a proof
of concept for using the filesystem as a search index, which might have some
other applications when server-side isn't available (e.g. IPFS).

------
rjmunro
Why not just have one file per city name? Or use, e.g., the first 2 characters
of the city name as the hash, so that near misses that differ later in the
city name still appear in the downloaded files.
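
E.g. (a sketch, not the repo's scheme):

```javascript
// Two-character prefix bins: "San Francisco" and the near miss "San Fransisco"
// both land in sa.txt, so the typo is still in the file the client downloads.
const prefixBin = name => name.toLowerCase().slice(0, 2) + ".txt";
console.log(prefixBin("San Francisco")); // "sa.txt"
```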

~~~
peterburkimsher
Good idea! Even take it to the extreme: a subfolder for every character in the
string.
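
Something like this, i.e. a trie laid out on the filesystem (illustrative
only):

```javascript
// One subfolder per character; real names would need escaping for "/" etc.
const triePath = name => [...name.toLowerCase()].join("/") + "/entries.txt";
console.log(triePath("Cairo")); // "c/a/i/r/o/entries.txt"
```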

------
dangoljames
This is a very vertical search algorithm; it is useful only for indexing and
searching cities as shown.

~~~
peterburkimsher
I plan to use it for much more than just city names, specifically Chinese-
language lyrics/subtitles/Bible/menus for Pingtype. I could also use it to
speed up the word spacing, which does a lot of dictionary searching.

[https://pingtype.github.io](https://pingtype.github.io)

It's really hard to type Chinese. But that means there are no typos in the
query! Therefore fuzzy searching isn't as important.

~~~
ohum
It could be made into a real meta search engine: you could have layers of
searchers, a network of searchers with a common memory space, parallelize the
whole thing in SequenceL, throw it on a VPS with a few hundred/thousand cores,
and index the entire IPv4 space... You would probably need to use a big cloud.

