They generated a lot of JSON files covering all permutations of alphanumerics up to three characters (like a.json, b.json, aa.json, sta.json); when the user types the initial characters, they fetch the corresponding JSON, which holds the most relevant results.
They only do an "actual" search after three characters, but most of the time you find the result within the first three. This is not full-text search, but it has its own advantages.
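A minimal sketch of that approach, building only the prefixes that actually occur in the data (the record fields, filenames, and limits here are made up for illustration):

```python
import json
import os
from collections import defaultdict

# Hypothetical records to make searchable; in a real setup these would come
# from whatever generates the site.
articles = [
    {"title": "Static site search", "url": "/static-search"},
    {"title": "Standard Bloom filters", "url": "/bloom"},
    {"title": "Starting with Zola", "url": "/zola"},
]

def build_prefix_files(articles, out_dir="prefixes", max_len=3, max_results=20):
    """Write one JSON file per title prefix of up to max_len characters."""
    buckets = defaultdict(list)
    for article in articles:
        title = article["title"].lower()
        for length in range(1, min(max_len, len(title)) + 1):
            buckets[title[:length]].append(article)

    os.makedirs(out_dir, exist_ok=True)
    for prefix, matches in buckets.items():
        with open(os.path.join(out_dir, prefix + ".json"), "w") as f:
            json.dump(matches[:max_results], f)

build_prefix_files(articles)
# The client fetches e.g. prefixes/sta.json as soon as "sta" has been typed,
# and only falls back to a real search after three characters.
```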
Bloom filters seem to have similar advantages.
You can keep modifying the list over time to account for recency, new data, etc., but this is a pretty scalable solution for the most part. Rebuilding your entire index on S3 should still cost you less than a dollar.
mdbook (https://github.com/rust-lang/mdBook) uses elasticlunr (http://elasticlunr.com/) as its inverted index, and the index is generated statically.
The same is done for zola (https://www.getzola.org/documentation/content/search/).
If you wanted a very small bundle to use with a static site, I don't think it's obvious that a bloom filter variant is a bad approach.
I mean, yeah, this thing that the author said "made for a fun hour of Saturday night hacking" is probably not the optimal solution, but that would've been true whether or not the author chose to build something based on bloom filters or an inverted index.
It took me literally half an hour to implement this using elasticlunr in the browser. We went with the backend solution in the end as it felt more extensible, but this got me thinking there's some abuse of ES for things you could easily do this way.
That is exactly the property of bloom filter searches (unless I'm misunderstanding) - you should never get a false negative but can expect false positives. The design of the filter will define how many false positives are likely for a given lookup in a given dataset.
> might be good for a first-pass.
Exactly. If the number of false positives is low then you can perform an exhaustive search on the few results to exclude the wrong ones. You won't miss anything as there are no false negatives.
Of course defining the bloom filter such that the number of false positives is low enough to "win" by a significant margin might not be easy, particularly for growing rather than static data, just like choosing an optimal hash function often isn't.
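To make that concrete, a minimal Bloom filter sketch (the sizes and the hashing scheme are arbitrary choices for illustration, not anything from the article):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=2 ** 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, word):
        # Derive k bit positions from a single SHA-256 digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, word):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))

bf = BloomFilter()
for token in ["bloom", "filter", "search"]:
    bf.add(token)

assert bf.might_contain("bloom")      # always true for words that were added
print(bf.might_contain("banana"))     # usually False, occasionally a false positive
```

Growing the filter well beyond its designed capacity is exactly what pushes the false-positive rate up, which is the sizing problem mentioned above.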
The only thing that worries me about it is that e.g. with Lektor, I can just write a small Python template filter if I need something, whereas Zola isn't extensible at all (as far as I know). Has that been a problem in practice?
Source code for the blog is on GitHub: https://github.com/cetra3/cetra3.github.io
I haven't had a need to customise zola itself, as the template language it uses (tera) is quite usable with macros etc...
(26 letters + 10 digits + space/wildcard) * (26 letters + 10 digits + space/wildcard) = 37 * 37 = 1369
Peanuts for Apache.
Each of my 10,000-ish articles gets a single bit in each of those files, where the bit offset is the article number.
A search query like foo bar would involve the files " f", "fo", "oo", "o ", " b", "ba", "ar", "r "
These are only 8 files, for a whopping 8 KB-ish of data!
We then AND the bitmaps together: 00000000 01000000 00000000
Remaining articles are pulled [in reverse order] and checked for the presence of the exact phrase or the presence of keywords.
The PHP script then flushes the output.
(I also did a version with a file of URLs in it, where the first URL is article 1, i.e. bit 1. Fun for things linked to from the blog.)
For longer articles one could make 3-character files; 37 x 37 x 37 is a little over 50,000 files. Each query only involves a few of those.
Sadly one can't push single bits to files, but for a new article a bit eventually has to be appended to each data file. I just add new articles as false positives until I have a bunch of them.
The advantage of the disadvantage that each page has to be checked for the keywords is that if a page is gone it won't show up in the results. Even a highly accurate search solution will have false positives.
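To make the above concrete, here is a rough Python sketch of that query path (the original is PHP; the file naming and bit ordering here are guesses):

```python
import os

DATA_DIR = "bigrams"   # one bitmap file per 2-gram, e.g. " f", "fo", "o "

def bigrams(query):
    padded = " " + query + " "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def candidate_articles(query):
    """AND the per-bigram bitmaps; returns article numbers that *might* match."""
    result = None
    for gram in bigrams(query):
        with open(os.path.join(DATA_DIR, gram), "rb") as f:
            bitmap = f.read()
        if result is None:
            result = bytearray(bitmap)
        else:
            n = min(len(result), len(bitmap))
            result = bytearray(a & b for a, b in zip(result[:n], bitmap[:n]))

    # Expand the surviving bits back into article numbers (LSB-first is a guess).
    matches = []
    for byte_index, byte in enumerate(result or b""):
        for bit in range(8):
            if byte & (1 << bit):
                matches.append(byte_index * 8 + bit)
    return matches

# Each survivor is then fetched and checked for the exact phrase or keywords,
# which is where the false positives fall out.
```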
view source for laughs
> Sadly one can't push single bits to files, but for a new article a bit eventually has to be appended to each data file.
You can; it's just that each of the files has to be padded out to a multiple of full bytes. Just read the last byte and OR in the new bit at its offset.
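Roughly like this, assuming the bitmap file already exists and article numbers map to bit offsets (LSB-first within each byte is an assumption):

```python
import os

def set_bit(path, article_number):
    """Set one article's bit in a byte-padded bitmap file, growing it if needed."""
    byte_index, bit = divmod(article_number, 8)
    with open(path, "r+b") as f:                      # file must already exist
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if byte_index >= size:
            f.write(b"\x00" * (byte_index - size + 1))   # pad out with zero bytes
        f.seek(byte_index)
        current = f.read(1)
        f.seek(byte_index)
        f.write(bytes([current[0] | (1 << bit)]))
```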
The method you are using is called q-gram filtration.
The main optimization you could do is to use the specificities of the grams to prioritize the ANDing so that the most specific grams are chosen first. This can allow early stopping for misses.
You can also use multiple threads to read and AND in parallel after an AND-tree order has been decided from the specificities.
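A small sketch of that ordering, once the gram bitmaps have been read into memory (nothing here comes from the parent's code, it is just the general shape):

```python
def intersect_most_specific_first(bitmaps):
    """AND bitmaps starting from the one with the fewest set bits, stopping early."""
    def popcount(bm):
        return sum(bin(byte).count("1") for byte in bm)

    ordered = sorted(bitmaps, key=popcount)            # most specific gram first
    result = bytearray(ordered[0])
    for bitmap in ordered[1:]:
        result = bytearray(a & b for a, b in zip(result, bitmap))
        if not any(result):
            break                                      # no article can match: stop early
    return result
```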
I would also remark that it is possible to implement approximate matching or error-tolerant search with q-grams, that is, to rank results according to edit distance or Levenshtein distance from the query.
It is also possible to do regular expression search over them, as Russ Cox explains in his write-up of how he implemented Google Code Search: "Regular Expression Matching with a Trigram Index or How Google Code Search Worked" (https://swtch.com/~rsc/regexp/regexp4.html). This is not specific to 3-grams; it can be made to work with any gram size, or even with indexes consisting of variable-size grams.
Unless I've missed something, I don't think it works like that. It is just an HTML document; it comes as is. If you modify the DOM, the source stays the same?
Also, the HTML parser does a wonderful job of ignoring things like comments and display:none nodes / script nodes; it just doesn't do anything with them. After the innerHTML is modified there is nothing further to do with each script. It is just ignored.
I've tried things like adding a 5 MB comment to the end of a page to see the behavior.
I had lots of ideas but chose to keep it simple enough that brute force still works. Ideas: merging letters into one, either ones that appear too frequently (a, o, e become one letter) and/or rare ones (q is just a k); ones that sit side by side on the keyboard (dootbell); discarding the order of the characters (so that "becuase" might just work); typos that don't look like typos (1 is equal to l); and finding ways to scale the solution to fit the texts or chunks thereof.
Probably some of the lessons from that domain could be applied here
> Efficient discovery of information, based on partial knowledge, is a challenging problem faced by many large-scale distributed systems. This paper presents Plexus, a peer-to-peer search protocol that provides an efficient mechanism for advertising a bit sequence (pattern), and discovering it using any subset of its 1-bits. A pattern (e.g., Bloom filter) summarizes the properties (e.g., keywords, service description) associated with a shared object (e.g., [...]
But I think that paper is more about discovery than search. It would be good for a tagging system but I didn't read enough to know if it would be useful for a more general search index.
Well, also trying to rank them, but that's kind of harder in this context.
As I'm using Markdown, as in the example above, I can probably apply this method to update the search index when a post is published, and then enable client-side search in the platform.
I'll try it using this approach.
dex blog: eddieoz.com (or eddieoz.crypto using unstoppable extension)
Works really well, even 9 years later. I doubt anything compares.
In 2006 I built the regular kind of posting-list inverted index for my web pages at http://canonical.org/~kragen/inverted-index.html. I just used regular HTML <table>s for the representation, trusting to Content-Encoding: gzip to squeeze out the resulting redundancy, and the index is split into many different files so that the client can snarf just the part of the index that interests it. Building the index (over my smallish set of pages) was surprisingly easy, although I don't think I ever implemented the actual client-side JS search engine.
The same year, I also wrote a search engine using a kind of structure similar to a Bloom filter per post, similar to the one Stavros describes, but transposed and encoded for better locality of reference; source code is in https://email@example.com/msg0... and the algorithm is outlined in https://firstname.lastname@example.org/msg001....
Three weeks ago, I wrote https://gitlab.com/kragen/derctuo/-/blob/master/recursive-bl... describing a data structure, not yet tested, for using a bunch of Bloom filters for indexing large data sets. This structure seems like it might be capable of competing with traditional database sorted-column indexes for ad-hoc query evaluation, perhaps even exceeding their performance on some queries, though at a larger space cost. I hadn't thought about combining that with my 2006 client-side inverted-index search approach, but it's designed to have reasonable locality of reference, so it might work.
Related comment thread: https://news.ycombinator.com/item?id=23473303
— ⁂ —
Bloom filters are a reasonable thing to try when looking for the "frequency-domain representation of thinking", but I've long thought Pentti Kanerva's Fully Distributed Representation was interesting, in which each atom is encoded by a large random bitvector, attribute-value pairs are encoded by XORs of the atomic bitvectors, and aggregates of multiple attributes are encoded as the bitwise majority rule (with ties broken at random) of all their attribute-value pairs.
You can recover the value of an attribute from an aggregate by XORing the aggregate with the attribute's atom and examining its Hamming distance from all the candidate atoms, a process which Kanerva originally suggested you could train a neural network to be able to do in better than linear time. The atoms actually stored for that attribute will have a Hamming distance many, many standard deviations from 50%, while the false matches will all be within, say, six standard deviations.
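A toy NumPy sketch of that scheme (the dimension and the exact tie-breaking rule are my guesses, not Kanerva's parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 10_000   # large enough that random bitvectors sit near 50% Hamming distance

def atom():
    """A random dense bitvector for one symbol (attribute name or value)."""
    return rng.integers(0, 2, DIM, dtype=np.uint8)

def bind(attribute, value):
    """Attribute-value pair: XOR of the two atoms."""
    return attribute ^ value

def bundle(pairs):
    """Aggregate several pairs: bitwise majority rule, ties broken at random."""
    counts = np.sum(pairs, axis=0)
    majority = (counts * 2 > len(pairs)).astype(np.uint8)
    ties = counts * 2 == len(pairs)
    majority[ties] = rng.integers(0, 2, ties.sum(), dtype=np.uint8)
    return majority

def probe(record, attribute, candidates):
    """Recover a value: XOR the record with the attribute, pick the nearest atom."""
    noisy = record ^ attribute
    distances = {name: int(np.count_nonzero(noisy ^ vec))
                 for name, vec in candidates.items()}
    return min(distances, key=distances.get), distances

# Example: a record with two attribute-value pairs.
atoms = {name: atom() for name in ["colour", "red", "shape", "round"]}
record = bundle([bind(atoms["colour"], atoms["red"]),
                 bind(atoms["shape"], atoms["round"])])
best, dists = probe(record, atoms["colour"],
                    {k: atoms[k] for k in ["red", "round", "shape"]})
print(best, dists)   # "red" lands far below the ~50% distance of the other atoms
```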
Skimming the paper, this reminds me of word2vec and similar techniques for embedding words into vector spaces, except that instead of using XOR the embedding is accomplished with another neural network. Here's a link to a good description of the embedding process: https://towardsdatascience.com/neural-network-embeddings-exp....
I think the latent-spaces approach of ANNs is pretty interesting but I don't feel like I understand it well enough. It's a bit lower dimensional than Kanerva's proposal, though.
 Kanerva, Pentti. “Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors.” Cognitive Computation 1 (2009): 139-159.
I can't wait for this to exist. Holochain, by ceptr.org (Arthur Brock + Eric Harris-Braun), is the closest thing to an agent-centric web [also called Protocol Cooperativism] that I've come across so far; it's truly distributed. Have you come across them?
Absolutely. Because it is agent centric it can do this completely transparently. Since the individual agent is only the starting point, what you describe will definitely be developed on top of it, as I would agree that this is a common use case.
The real breakthrough with this project is that it puts the individual user back in the middle, enabling peer-to-peer networks organized into a beautiful symphony sometimes referred to as Protocol Cooperativism. It stands in contrast to Platform Capitalism, which is what we have today and which extracts humongous rents through corporate middlemen and third parties.
Building out from the individual agent allows, for example, cooperative Open Value Networks to be built (Sensorica is working on this), where users have complete agency/autonomy over the protocols and agreements they use, and can temporarily hand off decision-making to others they trust to act on their behalf, reflecting how they work together in the real world.
Basically this project is allowing everybody to carry a copy of the rules. This allows the protocols to rapidly evolve and change to meet the shifting needs of individuals and their communities.
There is a project building on the work of Ceptr called Holo-REA (developed out of http://valueflo.ws), which is doing exactly what you describe above.
Here is a story to quickly explain their work:
If you read my comment, I'd be grateful to know if it helped clarify some things. Does it help you see how what you described above is possible?
I've actually reported a couple of security bugs in servers in the past, which were caused by people running grep for searches.
Imagine you have a tree of web content at /var/www; when a user searches for "foo" you run "grep -r -l foo /var/www" and then parse out the resulting filenames.
Now imagine what happens if your user searches for "foo /tmp; rm -rf /". Whenever you pass user input to a command-line application you need to be damn sure you know what you're doing; if the shell is involved in parsing your command you'll have problems.
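The usual fix is to keep the shell out of it entirely and pass the query as a single argument. A hedged Python sketch of that safer pattern:

```python
import subprocess

def search(query, root="/var/www"):
    """Run grep without a shell, so the query can never grow extra arguments."""
    result = subprocess.run(
        ["grep", "-r", "-l", "--", query, root],   # "--" also blocks option injection
        capture_output=True, text=True, check=False,
    )
    return result.stdout.splitlines()   # one matching filename per line

print(search("foo /tmp; rm -rf /"))    # harmless: grep searches for that literal string
```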
Bleh. I can see why people install WAFs.