I didn't, it is AI generated. I wasn't sure if I'd re-use them after the Memory Allocation post, but they've proven to work really well and people like them. I've been considering getting proper, consistent sets commissioned. One of the things I want to do in future is make it so you can pet them, which could require some animation work. Open to artists reaching out to me about this. :)
DALL-E 3? I've seen some similar images out of it while trying for more pointilist views. (DALL-E 3 has a certain obsession with regularity & centering.)
Midjourney! The prompt for the puppy with his head cocked to one side was: “playful happy cartoon huskie puppy asking an insightful question looking at the camera, question mark above, muted colors, block print.” Never could convince the model to give me that question mark.
Ah. A little surprising, MJ's tend to be messier and more textured.
But yeah, I too know the sorrow of getting MJ to generate specific text in images - funnily enough, I too wanted a question-mark for my cat image (for https://gwern.net/review/cat thumbnail ) and I just couldn't get MJv5 to do it and had to settle for making the entire cat a question-mark: https://gwern.net/doc/cat/psychology/2023-11-04-gwern-midjou... It looks cool and I'm satisfied with it, but it wasn't what I had been going after.
In MJ's defense, MJv6 (c. January 2024) is considerably better at text, and you could probably get that question-mark now with some prompting & maybe region-inpainting.
This is really reassuring to hear, thank you! I make sure my posts are reviewed by people with a range of experience and familiarity, without them the posts wouldn't be as accessible as they are. I'm glad this is working. :)
I remember discussing bloom and cuckoo filters a while back and someone linked.to.great papers from late 2010s with much more advanced probabilistic structures that were handling modification much better or had other interesting properties. Anyone here knows where to look for these? I didn't take notes...
Not sure, but I recently came across xorsat filters (https://eecs.ceas.uc.edu/~weaversa/XORSATFilters.pdf). It's not a great replacement for bloom filters in many use cases because they are not modifiable but have some other interesting properties:
1. They are very close to the information theoretic minimum needed to encode the information so are more space-efficient than bloom filters or cuckoo filters.
2. They have a natural way to store metadata (at the cost of additional space of course) so you can associate a few bits of metadata with each set bit at pretty reasonable space cost.
The first I heard of a Bloom filter was when I was tasked with implementing one in iBooks. I used a Bloom filter for search.
My recollection is a little fuzzy now but as I recall, as we parsed each pagination of the ePub, we would hash or index (the first three characters of?) every word on the page. Each page got its own bitfield or some such structure to hold these hashes.
When a user typed a word to search we could potentially eliminate a large percentage of the pages of the ePub. If a page had no hash matches to the hashed search term then we knew the word was not contained on the page at all.
A best-case example might be searching for "Jabberwocky" in the book "Through the Looking Glass". Very likely all but one page of the ePub will fall away.
As is pointed out in the article, if we did get a hash match for a page it was not a guarantee the word was there — but then we would drop down to a more standard string compare to find not only if the word truly existed on the page but where it was as well (or if it existed in multiple places on the page).
The worst case was of course entering a search term like the word "the" where potentially every page would have to be exhaustively string-compared. But that kind of falls in the lap of the user — why did you search for "the"?
This is really interesting, thanks for sharing! I'd imagine the three-character trick would have failed on programming-based ebooks, right? With all those underscores and stuff. Do you remember the preprocessing steps you took for this?
On a side note, I just spent like an hour on your blog. Fascinating stuff!
WebKit, at the time, was not very good at pagination, something ePub requires. iBooks work resulted in a number of "Radars" filed against WebKit with regard to pagination. As a side effect I suspect printing a web page from Safari (who does that?) is markedly better since.
One example: I personally used it to deduplicate recently viewed ads in an app, to avoid spamming users with the same ad too much. One interesting challenge there was that we needed to make sure that the client and server encode and decode the filter exactly the same, despite being implemented in different languages.
Something I spent time thinking about, but wasn’t able to find a huge amount of information on, is how you could use compression alongside bloom filters. You could make enormous bloom filters that make use of run length encoding or sparse bitmaps. You sacrifice insert and lookup speed but you could make enormous bloom filters this way.
Bloom Filters are already lossy compression, so depending on how sparse they are I'm not sure you'll get too much benefit out of it, but maybe I'm just thinking of much, much smaller filters.
BTW we ended up open sourcing that BF library that encodes and decodes filters in multiple languages, the company has been out of business for nearly a decade but the project is still out there https://github.com/EverythingMe/inbloom
I'm not sure if it's a big deal for business, but for my bachelor's thesis, we showed how we could use a type of counting bloom filters (mentioned in the blog) known as Spectral Bloom Filters, for client-side searching on static websites. Basically, it means doing all the searching and ranking directly on the client side, without needing to send any requests to the server.
It's used a lot in databases for query processing. Specifically, if you're performing a hash join, probing the hash table is more expensive than probing the Bloom filter. If you are pretty sure you're not going to find very many matches in the hash table (i.e. the join is very selective), you can skip going into the hash table by first checking the Bloom filter.
Rather than using multiple hash functions, would it make more sense to use a single algorithm over (prefix | input), with k different prefixes? This may allow computing those hashes in parallel, using SIMD for example, and caching the prehash state of the prefixes.
There's a lot of optimizations you can do on top of the classic Bloom filter. In short, you can use a single hash to compute an offset into a table of bit patterns. SIMD lets you perform multiple lookups in parallel. I wrote a blog post about more advanced Bloom filters if you're curious:
this is a great article! I love how you explained it. here are my takeaways from it:
- Bloom filter is a space-efficient data structure
- You're trading off some accuracy (possibility of false positives) for the significantly smaller memory footprint and faster membership checks
Very good explanation of Bloom filters, but are they optimum from a complexity point of view given a false positive rate and a bounded number of bits? Are there other kind of filters if we relax de condición of zero false negative rate?, are there other better but more complex data structures able to provide the same effect?
Thank you so much. I still have a way to go. People who struggle with low contrast will struggle with the default colour scheme, for example. But I'm trying to make progress with every new post <3
You're very kind! I'm a big believer in writing you can really lose yourself in, and try and achieve that with interactivity and visuals. The aesthetics of my posts is very important to me.
Have loved bloom filters for a long time. If probabilistic data structures that make use of hashing are your thing, you should also check out: https://djhworld.github.io/hyperloglog/
...this way, the output of the 3 hashes don't affect each other during the set and membership-check. I wonder how that would affect how much more data you can store. I'm sure someone has considered this.
And if you want to try some code, I've prepared three toy Bloom filter implementations to accompany the article:
JavaScript: https://codapi.org/embed/?engine=browser&sandbox=javascript&...
Python: https://codapi.org/embed/?sandbox=python&src=gist:e7bde93f98...
Go: https://codapi.org/embed/?sandbox=go&src=gist:e7bde93f98c5e4...