Creating a search engine with PostgreSQL (xata.io)
530 points by todsacerdoti 11 months ago | 174 comments



Please do not try to do this while attempting to mirror Elasticsearch capabilities.

I know what I am talking about. Back in the 2000s I was asked to build a search engine: parsing data from image EXIF information and indexing that into a taxonomy, three levels deep and with counts. In MySQL 3.x.

Before that, the company went through multiple vendors who charged fortunes and were not capable of doing this properly, quite shockingly. One was Autonomy, and that thing just straight up could not do a taxonomy even at the top level.

It was 6 weeks of doing the impossible, writing very fragile SQL queries where performance literally changed if you rearranged the SELECT columns. We did it, amazingly, but this is not something I will ever do again. Databases are essentially the same, but search engines have come a long way.

As an intellectual exercise, please go ahead. "You just tokenize and then you are done!"

No, a search engine does a LOT more than just splitting your corpus of text into tokens. Soon after you are "done", new requirements come in. Taxonomy navigation? Multiple languages support? Automatic synonyms? Spellcheck "Did you mean" functionality? Performance at massive scale?

You will engineer yourself right into a corner. Just use a search engine for your own sanity.

Finally, there are things for syncing PG and ES data - ZomboDB, PGSync.


I think this comment highlights the two separate discussions going on in this thread. If you're building a customer-facing search engine, avoid reinventing the wheel by leveraging powerful tools like Elasticsearch.

On the flip side, if you're a data analyst or developer with a large database that has one or more text columns you want to query more flexibly than with "LIKE/ILIKE" SQL queries, it's probably easier and faster to create an FTS index/table in that database to get you 90% of the way there.


Looking forward to part 2 / Postgres vs. Elasticsearch. One application at my company uses PG for CRUD on objects and Elastic for searching them. We completely underestimated the effort of keeping the two datastores in sync and are actually contemplating getting rid of Elasticsearch.


A technique I've used before is to treat Elasticsearch as rebuildable at any time, consider this approach:

A cron runs every 5 minutes that looks at your database for any objects you're indexing where last_modified_at timestamp > last_indexing_started_timestamp.

Index the objects into Elasticsearch, then update the last_indexing_started_timestamp value to when you started the current sync process, so that the next run catches any objects modified between the start and end of this run.

Then if Elasticsearch needs rebuilding you can just clear out the last indexing timestamp and resync from the start of time, and it's self-recovering / won't get out of sync.
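
Roughly, the bookkeeping could look like this in SQL (a sketch; the documents table, the single-row index_state table and the :run_started_at parameter are made-up names):

    -- one row holding the indexing watermark
    create table index_state (last_indexing_started_at timestamptz not null default 'epoch');
    insert into index_state default values;

    -- each run: note when the run started, then pick up everything
    -- modified since the previous run started
    select d.*
      from documents d, index_state s
     where d.last_modified_at > s.last_indexing_started_at;

    -- after the batch is safely in Elasticsearch, advance the watermark
    -- to this run's start time (not its end time)
    update index_state set last_indexing_started_at = :run_started_at;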


I've done exactly this for about a decade and it has worked supremely well. Robust and resilient because it's simple and idempotent.

In my case I'm using Solr and my last_indexed field isn't written to until the Solr index call completes without error. I have a very basic lock on the indexing process which hasn't failed me yet, and if it ever did fail the consequences would only be wasted CPU cycles. I consider that a lower risk than updating last_indexed only to have the actual indexing fail unexpectedly.

In the rare instances I've needed to re-index from scratch the process has been incredibly simple:

1. Start a new instance of Solr on a powerful AWS instance and direct index updates to it

2. Set all last_indexed fields to NULL

3. Wait for the scheduled task to complete the re-indexing

4. Reboot the new Solr instance on a sufficient AWS instance

5. Shift to the new Solr instance for search engine reads


Tried this as well, but the difficulty is that Postgres is a relational database whereas Elasticsearch stores schema-less documents.

Your record in ES might include data from many different tables, and figuring out what to (efficiently) update when there is a change in Postgres is not a simple task.


For me, a shotgun approach seemed the least likely to break.

Anything that is a dependency of the Elasticsearch index should trigger a job to export to it. And since it is idempotent, it doesn't matter if a background job accidentally exports the same index two or ten times. Just make sure before writing that you do a quick check that you're not overwriting a fresher one. So just have a freshness timestamp which is the latest timestamp of any record used in the indexing data.

Furthermore, you can do a daily job to just re-export a critical part of the index. Doesn't matter if it is or isn't fresh. So, let's say you query all records that were modified in the last day, and trigger the export job that may include that record. Even if it causes duplicate work, idempotency saves you there.


Perhaps include a "last modified" timestamp w/ timezone in tables of interest, PG can update this on a trigger so no app code has to change. Index this field. Then build a view on top of the relevant tables that assembles the document for ES. Include in the view a field which contains the most recent of all the "last modified" dates, and filter the view on that timestamp?
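
A sketch of the trigger part, against a hypothetical items table:

    alter table items add column last_modified timestamptz not null default now();
    create index on items (last_modified);

    create function touch_last_modified() returns trigger as $$
    begin
      new.last_modified := now();
      return new;
    end;
    $$ language plpgsql;

    create trigger items_touch
      before insert or update on items
      for each row execute function touch_last_modified();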


We've solved this ("any of these N related tables needs to update the search record") by building reactivity into our backend ORM:

https://joist-orm.io/docs/advanced/full-text-search

Granted, currently we still do pgsearch against this derived field, but could sync it over to ES.


This issue comes up in several domains and tech layers (back to front, bottom up etc.)

Are there any generic algorithms, or even just heuristics, that help with this?

It's something I've been thinking about for some time now. Any pointers, strategies and tips are appreciated.


Or we can also use Logstash


I had a similar setup at a previous gig and didn't find it particularly challenging. Whenever an update happens to an entity in PG we send a message to async replicate that entity in ES (via ID lookup in PG). As always you need good monitoring and retries for async computation but ES is quite stable and fast so we rarely had any issues.

Now, we had pretty lax consistency requirements, as long as the "latest state" of PG ended up in ES within a reasonable timeframe everything was fine so maybe your requirements are different.


Yeah, I think more folks should learn about postgres LISTEN and NOTIFY functionality, https://www.postgresql.org/docs/current/sql-notify.html, it's great for stuff like this, and there are now various client libraries like pg-listen in NPM that make this really easy to use.

This is especially useful if you have lots of different teams/code paths that may update your DB - just set up a trigger that causes a NOTIFY message to get sent, then have a client responsible for reading from PG and populating ES. Alternatively, if you can accept a bit more latency, just have a trigger that sets a "needsESIndexing" dirty column somewhere and have a polling process that picks rows WHERE needsESIndexing = TRUE and sets the column back to FALSE when the indexing is complete.
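
The trigger side of the NOTIFY approach might look roughly like this (table and channel names are made up):

    create function notify_es_sync() returns trigger as $$
    begin
      -- payload is just the row id; the listener fetches the full row itself
      perform pg_notify('es_sync', new.id::text);
      return new;
    end;
    $$ language plpgsql;

    create trigger products_es_sync
      after insert or update on products
      for each row execute function notify_es_sync();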


It has been several years since I worked on a system that implemented listen/notify, but I recall there was a significant caveat that if the listener was unavailable at the exact moment the NOTIFY executes, the message would be lost.

That’s a significant risk for things that need to be in sync between two systems, so we stuck with listen/notify for info-level ops things and used polling/queue systems that offered better guarantees for more important tasks. Don’t want to be in a position where a quiet hiccup with a deploy or something results in fun bugs like 0.5% of rows being silently out of sync between ES and Postgres.


You could put a NATS JetStream connection in line and you'd have a good buffer to catch the notifications. Not foolproof, but then you'd need some kind of batch process that looks for any outstanding changes and syncs them.

We do something like this with our systems. External events get written to the event bus but all operations are idempotent on the event bus. So at night we send another round of the days events to clean up any inconsistencies.


Making jobs idempotent forces good design imo.


Yes agreed. It's (usually) more work, but the gains in robustness and fault tolerance are (usually) well worth it


Or, a bit more 'web scale': use the WAL and the outbox pattern to transactionally replicate events to a messaging system (Kafka, etc.).


You can also just use CDC to import from Postgres into Kafka using Debezium, and then on the other end use Kafka Connect to write into Elasticsearch. No need to write any code at all with this setup; YMMV, of course.


Sounds like a lot of moving parts though, unless the team is already familiar with maintaining all those parts


If you already use Kafka anyway it's a valid solution; I wouldn't introduce Kafka just for this, obviously. Also, Kafka Connect and Debezium are mature, well-maintained open source projects, in case that wasn't clear enough.

https://debezium.io/

https://docs.confluent.io/platform/current/connect/index.htm...


I went a similar route (Postgres for CRUD and Elastic for searching) and also underestimated the effort of keeping the two datastores in sync, as well as the effort of maintaining a reliable Elastic cluster with limited manpower/experience. After moving to Postgres full text search with indexes and query boosting, I accomplished everything I needed inside Postgres with update triggers and search queries that were incredibly performant.


Curious, did you try zombodb? [https://www.zombodb.com/]


For a moment I thought it was from zombo.com


The site was updated just to replace flash and add TLS. Amazing!


I have done this. It was a challenging task, even with significant experience, and some results aren’t as good (most are as good, very few are better).

Overall it was a success - our ops work is very much reduced, enough so to have easily paid for the engineering time invested. Not to be undertaken lightly though.

Depending on your needs, you may be better served by materialised views, normal views, or triggers. The builtin text search may not suit your use cases; it’s not necessarily hard to come up with alternative schemes.


>We completely underestimated the effort of keeping the two datastores in sync and are actually contemplating to get rid of elasticsearch.

Can you expand on this? Is it that it's tedious to write code that updates both? I've been meaning to play around with meilisearch and was trying to think about the synchronization issue.


This is exactly why I (as a solo developer at the time) just used Postgres' built-in fulltext search when that was a requirement. Worked out great, although the "simple" interface to it was a bit too simple and I ended up having to build up a DSL to translate searches into the more complex interface to the fulltext index.

And any modifications to any field that was indexed, or having to update how things were indexed, was a chore thanks to referential integrity enforcement at the DB level: I had to remove and afterwards reapply things like foreign key constraints, triggers, stored procs, etc.... for both the "up" AND the "down"! Fortunately, since Postgres lets you make schema changes in a transaction, there usually wasn't anything to worry about integrity-wise.


I want a public standard for a WAL format that disparate databases can share to stay in sync.


I think they are not comparable. For some basic queries maybe PG is enough, but I don't think the text processing capabilities of Lucene are there, and that's OK, because one is a database and the other a search engine. Elastic has drifted a lot toward being like a database, but at least for me, what has worked very well is having a database as the source of truth and denormalizing the data as needed into Solr, which is queried by the site. Something like faceted search navigation is way easier to do well in Solr or Elastic than in PG.


Back in 2017 when Spectre and Meltdown hit, I was boarding a flight to Mexico with my wife for a short getaway. As the flight started to take off I noticed a PagerDuty alert about a backup in our queue. Our company handles millions of calls daily, and at that time we indexed each call record as a single document into Elasticsearch (ES). When I landed and finally got to our hotel and plugged into the wifi, the problem was significantly worse. As a result of the reduced compute capacity from the patches AWS rolled out the night before, we observed roughly a 30% reduction in computing power across all servers, including our application servers, database and Elasticsearch.

I remember sitting on the beach trying to figure out a good solution. I realized immediately that batching the index operations via the _bulk endpoint was what we had to do; it just wasn't clear how to get there. After 2 days of hacking on the problem and explaining to customers that it would be okay and no data was lost, I had a solution that involved Redis sets to ensure we only index unique call records, plus a self-queueing bulk indexing job that limits the total number of jobs queued as a function of our overall capacity. Most frameworks assume a single record indexed per update/create/delete operation is the way to go, but that doesn't scale...

It's been a few years since then, and Elasticsearch has proven its worth over and over again as a denormalized index allowing for faster search and aggregated reporting that our normalized database could not provide.



I'm wondering if CrateDB [https://github.com/crate/crate] could fit your use case without keeping 2 places to store and sync data.

It's a relational SQL database which aims for compatibility with PostgreSQL. Internally it uses Lucene as a storage and such can offer fulltext functionality which is exposed via MATCH.


CrateDB is very promising. It's not ACID, though.

Another option is to use ZomboDB (haven't tried it myself).


Yep, CrateDB is not ACID but has eventual consistency at the row level.

I think for some use cases it's fine that some inserts are not visible in search results immediately. Also, one could tweak refresh_interval or use an explicit refresh for "must be visible" inserts.


The easiest log retention setup I've ever used was just a server with logs dumped into a text file and searched by zgrep. I know it's not ideal for all use cases, but I miss the simplicity of it.


This can be done and centralized with rsyslog and it's magical. And yes it can scale.


We did that at my previous job. Because we had a really good idea of what was searched and used we indexed everything really well. We only had around ~ 6b records but it turned out to be faster to search than ES. And we didn’t have nodes turning bad every other month.


You might trade for a different problem, which is that search could end up being a very different query load on the Postgres cluster compared to existing queries. That might make it hard to scale it out for both use cases.


You can always have a separate, streaming replica [1] used for full text searches.

[1] https://www.postgresql.org/docs/current/warm-standby.html#ST...


Normally best practice is to return primary keys in ES and then do a lookup query in the db. Did you attempt this?


Got any links or documentation about this method by any chance? Not sure if you are talking about querying or syncing data here.


I think what they mean is that you only have your searchable fields indexed in ES, and then, when you do a search query, the only thing ES returns is PKs; you then do a second roundtrip to the database (or cache) to get the actual records. That way you don't have to worry too much about all the record information being perfectly in sync.
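
On the database side that's something like the following, assuming ES handed back an array of ids in relevance order (records and :ids are made-up names):

    select r.*
      from unnest(:ids::bigint[]) with ordinality as hit(id, rank)
      join records r on r.id = hit.id
     order by hit.rank;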


I was researching this recently and came across this article: https://pganalyze.com/blog/gin-index

The GIN index has some similarities to Elasticsearch's inverted indices (last I knew anyway), which also can be quite expensive to write to. If you're doing heavy writes, something to test and consider carefully.

TL;DR: writes get a lot more expensive with GIN indices.


Check out debezium.


Has anyone tried it on a large and frequently updated database?


Anecdotally, I've heard of successes using it at relatively large companies for syncing pg to elastic. I don't have first hand experience.

There's also pgsync[0], but again, this is just from some preliminary research on my part, can't speak to relative or absolute quality for either option.

[0] https://pgsync.com/


p50/p99 retrieval times at realistic loads or it didn't happen. A search engine that returns results in, say, a minute is not "advanced". Of course a relational database like Postgres can do it on paper.


(disclaimer: author of the post)

I'm planning a follow-up to compare it with Elasticsearch; however, I don't think I'm going to attempt benchmarking, because whatever realistic scenario I come up with, it will not necessarily be relevant to your use case.

I mostly agree with you, and I probably wouldn't use this at large scale (say, more than a few million records). I was primarily interested in how much of the functionality I could replicate. For small search use cases this has some clear advantages: less infra to maintain, strong consistency, joins, etc.

Also, at Xata we're thinking about having a smooth transition between using Postgres at small scale, then migrating to use Elasticsearch with minimal breaking changes.


Using measured comparisons and making informed choices is certainly the best way.

If you can start with Postgres to have a relational database with the benefit of full text search (i.e. avoid Elasticsearch) as well as JSON fields (i.e. avoid MongoDB), then you end up simplifying initial hardware/software requirements while retaining the ability to migrate to those solutions when user demand requires it.

So many developers seem to build with the idea that they'll become the next FAANG when actual (or reasonably forecasted) user load doesn't remotely require such a complex software stack.


I think you're being ungenerous in the reasoning. The initial setup to get something running is similar, while a specific technology may scale better. From that perspective, you may be better off picking ES to avoid paying back technical debt if you do need to grow. Early strategic investments help - sure, you won't get everything right, but getting enough things right can be the difference between being bottlenecked on deal flow in the sales channel vs. on engineering. The former can be a lot faster to resolve than the latter.


No failed company ever failed because of a lack of technical complexity, nor have successful companies been held back by not adopting complex architectures early.

> "It is hard for less experienced developers to appreciate how rarely architecting for future requirements / applications turns out net-positive."

— John Carmack

https://en.m.wikipedia.org/wiki/You_aren%27t_gonna_need_it


I think picking Lucene/ES for search is hardly "architecting for future requirements" if your task is to build search functionality. Using Postgres FTS for search feels much more like cutting corners to me, and an under-appreciation of the complexity inherent in full-text search, a complexity that ES solves competently and Postgres barely even addresses.


I'm not against picking the right tool for the job.

My only point is that making an architecture decision because it will immediately reduce complexity is much more sensible than basing your choice on potential future needs.

Evaluating needs from a "complexity reduction" standpoint is safe and will net returns. Evaluating needs from a "potential risks" standpoint is a lot harder, and easy to do wrong; the true risk is not growing at all, so the heuristic for any new project should be to do the simplest possible thing that solves the problem and starts the scaling process (i.e. whatever produces a saleable product).

The other benefit to starting with uncomplicated architecture is that you leave yourself with more scaling vectors, so once you deeply understand the problem(s) you actually need to solve, you can pick the right tool.

For us, Postgres FTS covered 95% of our use-cases. If we had started by just using ElasticSearch, we would have had a lot more complexity to maintain, and we would never have discovered our current (surprisingly elegant) architecture.


You're talking about application complexity while the parent is talking about operational complexity. The difference between some app servers talking to a Postgres instance with a replica, and that plus an ES cluster, is a jump. If you're paying for something managed then it probably doesn't matter much to you. In my experience it's actually quite rare that teams working on a product where search is a feature, and not the product itself, ever graduate from what Postgres offers, so I tend to push back on ES when it isn't free.


You may be able to hedge your bets by separating the Postgres search DB from the OLTP one. Perhaps it could use FDW or similar to make them appear the same from the app layer.

Then if FTS won't scale to unpredictable future needs it should be easier to rip out and replace with ES or anything else. And one doesn't have to pay that cost if/until it's certainly a requirement.


I had done something similar recently. Goal was to take a huge Postgres database and make it searchable and usable.

It ended up that doing an offline batch job to boil down the much bigger dataset into a single pre-optimized table was the best approach for us. Once optimized and tsvectored, it was fairly performant, and Elastic wasn't a huge gain over it. Still keeping the Elastic code around "in case", but yeah, Postgres search can be "good enough" when you aren't serving a ton of clients.
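
The shape of that is roughly (a sketch with made-up table/column names):

    create table search_docs as
      select d.id,
             d.title,
             to_tsvector('english', coalesce(d.title, '') || ' ' || coalesce(d.body, '')) as tsv
        from documents d;

    create index on search_docs using gin (tsv);

    select id, title
      from search_docs
     where tsv @@ websearch_to_tsquery('english', 'postgres search')
     order by ts_rank(tsv, websearch_to_tsquery('english', 'postgres search')) desc
     limit 10;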


Sure, benchmarks won't prove that it works in ALL cases, but it can at least prove it works in SOME cases?


I interviewed somewhere with an ex-Googler who revealed they cache all of the search results for terms they have seen before, then when they update the index they also update the cached results. From that perspective, fast search results aren't actually that exciting since you can constantly run a background task to update the cached results and just serve those as the requests come in. This caching and response time seem orthogonal to the speed of the actual act of calculating search results.


I work on Google Search infrastructure. Of course, caching is important, but a large percentage of queries don't hit the cache. The backend that serves results that aren't cached has very strict latency bounds.

> From that perspective, fast search results aren't actually that exciting since you can constantly run a background task to update the cached results and just serve those as the requests come in.

If that's how it worked, I agree, it wouldn't be that impressive (every search result would just be a 1-1 cache lookup). That's not how it works, though, and as someone who works adjacent to the system, it is pretty impressive how fast it is when the work it's doing is actually pretty expensive.


Almost everything is faster if you can precompute.


For sure, but the approach is quite viable. If 19 out of 20 searches by a user are almost instantaneous and the single novel one takes a few seconds, they'll assume a hiccup in their internet connection and still view the site as "really fast". It's certainly useful for limiting demands on expensive hardware.


20% of all Google searches are brand new


Then 80% aren't, and those ones will perform very well.


The other 20% also perform well. I have quite literally never encountered a Google search that took more than some tens of milliseconds for a round trip.


You're absolutely correct. Caching common search queries allows the site to allocate hardware for processing "expensive" queries, with the objective of having both complete at near the same time.

Without caching, the cost of operating the site would dramatically escalate.


And use 80% of the resources?


Eh, a non trivial % of those are likely just brand new misspellings.


Yup, all FastComments threads are precomputed for all sort directions! Each page comes from the DB in < 1.5ms 75% of the time.

It gets tough with pages with 100k+ comments though, so there are different tricks and switches for different flows and data sizes.


Your site is killing me. Every page I visit I have to accept the cookies :(


Fixed. Was an issue specific to certain countries - my bad!


Oh, that's weird...


I recall reading about a local transit department which had gotten some fancy accelerator card to help with route searches, i.e. when a customer wanted to go from A to B, which buses, trams, etc. to take.

Can't recall if it was "just" a bunch of FPGAs but it was a big-ass PCI card.

Some years later I tried to find this story again, and to check if they still used it. Turned out they had ditched it after just a couple of years. As memory sizes had increased, they could just precompute all possible routes for the next day and keep them all in memory...


which is why es uses multiple levels of caching


> p50/p99 retrieval times at realistic loads or it didn't happen.

Therein lies the problem - how do you generate actual realistic loads for a search engine without having a large number of people use it for searches? Simply hitting it with random search terms isn't realistic.

Some people will be on slow connections, search terms for something specific might spike in only a certain region (earthquake, etc), etc.

If your terms are too random, it'll perform worse than it should (results not in the cache), and if not random enough it will perform better than it should.


One actual solution is to use historical search logs. Just because "random" is a bad answer doesn't mean people don't try to make reasonable reproductions of load to replay and benchmark. Caching is also a big factor.


I don't know if this is true for elasticsearch, but at least with solr, when you update an index, the default is to run some of the queries in the cache of the old searcher to warm up the new one.


Lucene (which Elasticsearch and Solr use) is based on inverted indexes, which is also what the GIN index in the article uses.

So the benefits of ES etc. are being able to scale horizontally across nodes, plus any additional features it adds on top of the main index.


We use Postgres FTS and it works fine; you just have to watch how you rank-order the rows in the query. If you only use ts_rank then it's perfect, but you likely want to use some other relevance metric to adjust the ranking, and then you can't rank-order primarily by that metric. Once you nail this, the results are as fast as any other typical DB table query with an index.


Actually, good ranking is the most valuable thing in a search engine.


What I mean is you can't do

    select * 
      from table 
     where ts_query(...)
  order by relevance_metric
but instead do

    select * 
      from (
        select * 
          from table 
         where ts_query(...) 
      order by ts_rank(...)
         limit 1000
    ) 
  order by relevance_metric 
     limit 10


Good ranking is what made Google a trillion dollar company.

Google indexed the same sites as Altavista, but Google Page Rank made the right sites bubble to the top and made Sergey and Larry billionaires...


You're right. But at the same time that quality seems to be declining, or at least stagnating. I think they face the hard problem of a decreasing signal-to-noise ratio.

Ranking is definitely easier when you also provide and moderate the content. That implies the technical solutions might differ qualitatively.


I've tried using PG FTS in a real application and it didn't scale very well. It worked well for 100s-1000s of records, but after that queries would quickly go from a few ms to like 500ms, plus another 400ms for highlighting.

Mind you, this was 10-15 years ago, so things will have changed and improved. I know the indices have become a lot faster since then.


I've built something like this before, and it was pretty fast. What load could one node have supported, I have no idea, but read-only operations can be spread across replicas easily.


When I was a teenager I made a search engine and database from scratch, not knowing anything about either. I just wanted to see how far I could get, how fast I could make it, returning somewhat relevant search results.

It's actually very easy to make a fast basic database and search engine, even as an amateur programmer. As long as you understand basic CS algorithms and how to exploit the operating system and hardware, you can put one together in a month or two. Speed is not bad even with high-level languages; something like 250K QPS, back in 2003, on a laptop. Scalability isn't much of an issue either if you shard it. Indexing, locking, and consistency are more complicated than the storage and retrieval parts.

The big problem to overcome is the subjective nature of search. What do I really want to find? How do I find something when I don't know what I'm looking for? How do I get around people trying to game the system? How do I handle complex queries and datasets? That's when it gets orders of magnitude harder.


250k RPS seems like a lot when SQLite can’t do that today and I feel like inverted indices are more expensive (eg RocksDB gets to ~130k RPS but I think their numbers are on beefier hardware than my laptop or my machine isn’t set up well). Are you sure you wrote a general purpose database and those numbers are correct? I’m curious what you did to outperform these engines (I agree it’s possible, but I’m curious what techniques you employed)


I'd guess:

- everything was in RAM,

- it was written in a compiled language, and

- probably skipped worrying about too many crazy edge cases


Same guesses; I'd also add: a single client, without network connections. Anyone who says "writing a database from scratch is easy" makes me sweat profusely.


Especially not easy if you actually implement read/write transactions (=> concurrency control, WAL, ...)


When building a search engine the biggest problem isn't QPS but the size of the data set you are indexing. If your search structures can fit in memory of a single machine then you can serve effective infinite QPS with ~zero latency. When it grows beyond that you have to get creative, and that's when all the different tradeoffs come into the picture.


Is it open sourced anywhere? I am curious.


One of the many problems with search inside Postgres is that search is very CPU-heavy (and starting to become GPU-heavy). Ideally, you'd like to reserve CPU in your database for transactional updates to your core data model.

I've seen a lot of ES and Solr clusters operating at 100% of 10+ nodes during a re-index, or just 30-50% of 10+ nodes during normal operation. The corresponding database would be say an AWS L/XL instance at 50-100GB of data and 30% CPU utilization. Moving all of the search CPU into your primary DB means now you'd have to shard it.

But I love PG extensions for search, recursive joins, vectors, etc on side projects. It can keep things fun and simple.


Could you solve this by searching on a read-only replica?


Convincing your devs to keep two handles to the DB from the beginning, one for reads and one for writes, is a blessing.


DB client libs really should have this built in. Even the heavy-handed things wrapping them don't. I have my own that I copy into every new project with things like xactReadWrite(innerFunc) and xactReadOnly(innerFunc), where the read-only randomly picks from a pool of connections to read replicas.


Can you elaborate?


Determining whether a query is appropriate to hit the read replica is something I've never seen done automatically and well. But if you design your system from the beginning so that the developer chooses whether to hit the reader or the writer (essentially do_query / do_query_reader), it's a million times easier to scale reads, and it doesn't have to come with any additional operational complexity. When you're small you can just have both handles point to the writer, but use a read-only user or set TRANSACTION READ ONLY on the session.
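
For example, while both handles still point at the writer, the "reader" handle can just do this in plain Postgres (orders is only an example table):

    -- make the whole session read-only
    set session characteristics as transaction read only;

    -- or per transaction on the reader handle
    begin transaction read only;
    select count(*) from orders;
    commit;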


Some of it, yes.

But in practice, you want to fix a bug in Chinese tokenization, or OpenAI releases the next version of its embeddings, or you want to add a few synonyms, or change the aggressiveness of the stemmer.

Then you have to rebuild your whole search index, and if it's part of your primary DB, you're pretty sad.


I’ve actually done this before and made it pretty damn fast -

https://austingwalters.com/fast-full-text-search-in-postgres...

The website is https://askhn.ai for the moment


FYI the askhn.ai sub-title "Discover, Manage, Query...." is kerned[1] horribly

[1]: https://en.wikipedia.org/wiki/Kerning


You can also marry these techniques with pgvector to find related content through embeddings. I've found this to be pretty magical.


Could you get into some more detail or point at something that does?



It's interesting that folks go down this hacky route when they can use something like Vespa, which is orders of magnitude better from a performance, relevance, scalability, and developer ergonomics perspective.


The advantage of pg_vector is that you don't need a second, specialised database and you also don't need to synchronise data.

It makes much more operational sense to use pg_vector if your use case can be implemented that way.


It makes terrible operational sense. What are the HA/DR, sharding, replica, and backup strategies and tools for pg_vector? What are the embedding integration and relevance tools? What are the reindexing strategies? What are the scaling, caching, and CPU thrashing resolution paths?

You're going to spend a bunch of time writing integrations that already exist for actual search engines, and you're going to be stuck and need to back out when search becomes a necessity rather than an afterthought.


The HA/DR, sharding, replica and backup strategies would all be the same as before. It's all in PG, so you use the existing methods.

If you have two systems, then you have two (unique) answers for HA,DR,Shard,Replica,Backup - the PG set and the Vespa.

That's more complicated, from an operational perspective.

PG FTS is quite good, and there are in-pg methods that can improve it.

And, from experience, when it's time to upscale to Solr/ES/etc., it's not a very heavy lift.


What makes most operational sense is going to depend on your context.

From my vantage point, you’re both right in the appropriate context.


What if you don't need those things yet and you just have some embeddings you want to query for cosine similarity? A dedicated vector database is way, way overkill for many people.
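
With pgvector that case really is a few lines (a sketch, assuming a recent pgvector and a hypothetical docs table with 1536-dim embeddings; :query_embedding is a bound parameter):

    create extension if not exists vector;

    create table docs (id bigint primary key, body text, embedding vector(1536));
    create index on docs using hnsw (embedding vector_cosine_ops);

    -- nearest neighbours by cosine distance
    select id, body
      from docs
     order by embedding <=> :query_embedding
     limit 10;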


Is that a different system?

Sorry I’m on spotty mobile that can’t open anything besides HN lol (God bless this website).

Sometimes it is just easier to use the existing systems and squeeze them as much as possible. Especially when it’s a small team or solo without much $$


When it comes to search I cannot disagree more.

https://vespa.ai is a purpose built search engine.

If you start bolting search onto your database, your relevance will be terrible, you'll be rewriting a lot of table stakes tools/features from scratch, and your technical debt will skyrocket.


Or it'll be good enough for whatever minimal search use case you have, and you upgrade to vespa (or whatever new thing) later when it's actually needed. If we jumped right to the most capable long-term solution for every feature we had, our systems would be nuts.


FYI it’s “Darth” Vader not “Dark” Vader. I too thought it was “Dark” as a kid


It's actually Dark Vador in some translations such as French, which makes the "error" common.


Thanks, fixed it. That was embarrassing :)


I couldn't concentrate on anything in this post, it was so distracting. How can you know Yoda quotes, but you're still saying Dark Vader?!


Very nice and clear article. SQLite also offers advanced indexing functionality and stemming with a standard plugin. For English, SQLite works well too.


I have put my experiments with SQLite here: https://github.com/daitangio/knowledge

Give it a try, it is very powerful


I wonder if an FDW could do this. Since just replicating the to-be-searched data should have only a few writers, it could work well for moderately sized data.


Nice article, although the part about fuzzy search not being supported by PostgreSQL is partly incorrect. The pg_trgm extension and a GIN trigram index support the fuzzy search use case, like in the example in this article:

https://www.postgresonline.com/article_pfriendly/169.html

This can vastly speed up queries, at the added cost of more memory usage and longer operation times during updates.
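
For reference, the setup is roughly this (sketched against a hypothetical movies table):

    create extension if not exists pg_trgm;

    create index on movies using gin (title gin_trgm_ops);

    -- % matches on trigram similarity, so typos like 'suprman' still hit
    select title, similarity(title, 'suprman') as sim
      from movies
     where title % 'suprman'
     order by sim desc
     limit 10;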


I've been meaning to attempt running a custom search engine for particular sites I've 'bookmarked'. Some sites contain gold that could be useful in the future and is not often discovered in Google results.

Should I go the Postgres/Elasticsearch route or are somewhat out-of-the-box solutions available?


For such a light demand and fixed site requirements, a single-file SQLite DB is probably best. Modern SQLite has full-text capabilities that are quite powerful and relatively easy to implement.

https://www.sqlite.org/fts5.html
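
A minimal FTS5 setup is something like this (a hypothetical pages table for the extracted page text):

    -- virtual table holding the url plus extracted text
    create virtual table pages using fts5(url, title, body);

    insert into pages (url, title, body)
    values ('https://example.com', 'Example', 'full extracted text of the page');

    -- rank is FTS5's built-in relevance score
    select url, title
      from pages
     where pages match 'postgres search'
     order by rank
     limit 20;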


For something small with a minimal footprint, I'd recommend Typesense. https://github.com/typesense/typesense

Elasticsearch is heavy, and relational databases with search bolted on (like Postgres or SQLite) aren't great.


It depends on what the user requirements are. FTS works pretty well with both Postgres and SQLite, in my experience.

Here's a git repo someone can modify to do a cross comparison on a specific dataset, if they're interested. It doesn't seem to indicate the RDBMSs are outclassed in a small-scale FTS implementation.

https://github.com/VADOSWARE/fts-benchmark


For personal use nobody cares about 100ms vs 10ms response times. What they do care about is relevance. Consider the following from that repo's outputs:

Typesense

    [timing] phrase [superman]: returned [28] results in 4.222797.ms
    [timing] phrase [suprman]: returned [28] results in 3.663458.ms
SQLite

    [timing] phrase [superman]: returned [47] results in 0.351138.ms
    [timing] phrase [suprman]: returned [0] results in 0.07513.ms
So SQLite is faster, but who cares? I want things like relevance and typo resilience without having to configure anything.


The article covers typo resilience in the section "Typo tolerance / fuzzy search".

This adds a step between query entry and text search where you find the similarity of query words to unique lexemes if the word is not a lexeme. Seems like a reasonable compromise to me?
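
Roughly, with pg_trgm on top (a sketch; search_docs/tsv are hypothetical names for the indexed table and its tsvector column):

    -- collect the distinct lexemes from the existing tsvector column
    create table unique_lexemes as
      select word from ts_stat('select tsv from search_docs');

    -- closest lexeme to a (possibly misspelled) query word, via trigram distance
    select word
      from unique_lexemes
     order by word <-> 'suprman'
     limit 1;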


I'm not trying to be argumentative. As long as people find a solution they're happy with, I think that's great. For me, I'm far less interested in handling typos, but I can see how it would be valuable in many applications. I'm usually less interested in tying in and learning another set of services if I can get 90% of the way there with one, but leaving the option of adding it later if additional requirements make it necessary.


Also, I've got a small project in which I try to compare Meilisearch and Postgres FTS with pg_trgm; it's called PodcastSaver:

Podcastsaver.com (click on the nerds tab in the top right)

Never got to it but there are a bunch of other search engines worth adding — Sonic, Typesense, etc. Maybe some day


I am wanting to do something similar. Archivebox seems to be the best solution for this sort of self-hosted, searchable web archive. It has multiple search back-ends and plugins to sync browser bookmarks (or even history).

I haven't finished getting it set up though, so take this recommendation with a hefty grain of salt.


How would something like this work in practice? Would you generate any tags or summaries per site when inserting it into the db?


ArchiveBox can extract text from HTML (and possibly PDFs too). I think it can be configured to extract subtitles from YouTube videos as well. So it can do full text searches. Basically you could have your own, offline & curated search-engine.


You could run a full text search or search against an auto-generated summary. Or if you want to be fancy, use semantic search like in Retrieval Augmented Generation.


Edit: I forgot to add: how would I get the webpages into the databases already suggested here? Do I need to use a separate program to spider/index each site and check it for updates?


If you're looking for a turn-key solution, I'd have to dig a little. I generally write a scraper in python that dumps into a database or flat file (depending on number of records I'm hunting).

Scraping is a separate subject, but once you write one scraper you can generally reuse relevant portions for many others. If you can get adept at a scraping framework like Scrapy [1] you can do it fairly quickly, but there aren't many tools that work out of the box for every site you'll encounter.

Once you've written the spider, it can generally be rerun for updates unless the site code is dramatically altered. It really comes down to how brittle the spider's code is (i.e. hunting for specific heading sizes or fonts or something) instead of grabbing the underlying JSON/XHR, which doesn't usually change frequently.

1. https://scrapy.org


Depending upon the type of content, one might want to look into using Readability (the browser's reader view) to parse the webpage. It will give you all the useful info without the junk. Then you can put it in the DB as needed.

https://github.com/mozilla/readability

Btw, Readability is also available in a few other languages, like Kotlin:

https://github.com/dankito/Readability4J


Do you prefer it locally or in the cloud? If in the cloud, check out Xata (the domain of the blog post here).


Searchkick gem + Elasticsearch is a good combo


    You can also design more complex boosters, for example, boost by the rating, 
    but only if the ranking has a certain number of votes. To do this, you can 
    create a function like this:

    create function numericBooster(rating numeric, votes numeric, voteThreshold numeric)
    returns numeric as $$
      select case when votes < voteThreshold then 0 else rating end;
    $$ language sql;
There's really no need to use a function here. It does nothing and makes the code at the call site harder to understand. Additionally, using a function might incur unintended performance penalties. At the very least, the function should be marked IMMUTABLE LEAKPROOF PARALLEL SAFE.
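
If you do keep the function, the marked version would be something like this (same body as in the post):

    create function numericBooster(rating numeric, votes numeric, voteThreshold numeric)
    returns numeric
    immutable leakproof parallel safe
    language sql
    as $$
      select case when votes < voteThreshold then 0 else rating end;
    $$;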


Every time I read blogs like this I'm tempted to go back to the product, still in use, where we scrapped the search integration due to the performance issues we had with pgsql. Although Elasticsearch (nowadays: OpenSearch) is available, we never deemed it important enough for this part of the product.

In case you care, my comment from a few years ago: https://news.ycombinator.com/item?id=27977526


PostgreSQL’s pg_trgm extension is also very useful. If you want to be able to search and find partial words etc then trigram indexes can be more useful than full text indexes. I’ve made the mistake of just chucking full text search at the problem before and often it doesn’t give the results you might want.


Why does this site break keyboard navigation? Having to scroll with my mouse only is kind of annoying, but it also means I can't page-up and page-down.


This is the first I'm seeing this. So it's abstracting the business logic of the API (ie. GraphQL) into the database? What are we trading off here?


https://rcs.aawadia.dev/ is also using pg search


Has the word "advanced", must be good ..


How does SQLite fare against PostgreSQL for this particular use case?


Why do I see so much discussion about SQLite on HN? I've literally never seen it used in production, my usage of it is purely as a database for local testing


With the move to the "edge" for compute, having your database at the edge reduces latency and basically makes it viable. So the theory is that SQLite read replicas will be the answer to a lot of these plans. See what Fly and Cloudflare are investing in with the edge network and SQLite plans.

On top of that, SQLite is now viable in the browser with WASM and the "origin private file system" API. So people are increasingly looking at moving more state and persistence to the browser with an eventually consistent sync. This is what the "local-first" movement, myself included, is excited about.


That's what Deno is doing, by putting what is traditionally known as the middleware on the edge with a DB (localStore). Essentially emulating the browser stack in some ways. It will be very interesting to see different kinds of code sharing among Deno edge nodes and the browser.


Sqlite is extremely powerful for any locally-embedded databases, such as apps with an offline component. For example, searching text in notes, books, cached news, etc. It's very simple to implement inside Sqlite with much less overhead than including a Postgres/Mysql instance.


Lol, you’re using it “in production” right now! SQLite is everywhere, almost definitely in the browser you’re using to view this post.


I guess it's a difference in industry. Everything I do is accessed over a network, and for caching we use simpler key-store libraries rather than relational databases. I don't develop client software


Oh I don’t currently have it in my stack per se, more that it is all over the damn place and can do a ton.


It is used in a huge number of production environments, including within your browser (most probably).

Probably not something you want to run a multi tenant SaaS database with, but it is useful if you are going the one tenant per database route.


HN has fads like everything else. If this were 10 years ago they'd be asking about MongoDB. 5 years ago they'd be asking if the Blockchain could power their search engine. Now they want an AI search engine.


Sqlite has nothing to do with AI searching in the way I think you're suggesting.

It was not designed for "AI usage" or as a "AI database"

SQLite is from 2000 and is more comparable to a single-file Postgres/MySQL database.

It's also on literally billions of devices including whatever you're reading this on.


Tech that has stood the test of time is the opposite of a fad.


I feel like MongoDB was on the wane even 5 years ago.


Also, the interest may not have all been…organic.


It’s used in production quite a bit, probably just not in your field as much. Anytime I’m thinking about storing something in a file I use SQLite instead


SQLite has been the default store for CoreData in iPhone apps for years afaik. I imagine that's a lot of apps.

Of course, CoreData sucks, but that's for unrelated reasons.


We do natural language processing with some custom tokenisation, the data it uses at runtime is put into an sqlite db at build-time. Works fine for our usecase.


Put simply - SQLite is incredibly powerful and simple. It can handle a lot more than people might think, and AFAIK it is the most "used in production" database, whether you're aware of it or not.

Depending on the use case it can scale incredibly well, and is tiny and battle hardened.

It's the most deployed database, it's on every smartphone, "smart device", computer, it's inside of many of your apps, powers many of the websites you use, etc.

---

tl;dr - If you're following "use the simplest, but reliable, tool for the job" then SQLite is a valid option for a crapload of use cases; most people aren't aware of what it can actually do because it doesn't "web scale".

SQLite isn't new, it's old, and its "used in production" count is in the literal billions.


I've had good luck using FTS5 in Sqlite for small-scale use scenarios, such as examining scraped data, etc. It's remarkably powerful and quick.

https://www.sqlitetutorial.net/sqlite-full-text-search/

https://www.sqlite.org/fts5.html


I put together a demo of this for work where I pulled all our private repos and built an index, then used datasette as the front end.

It was incredibly fast, so much so that I found it more useful than github search. The index is pretty out of date now but I still use it purely based on convenience and speed


There is https://www.sqlite.org/fts5.html but I cannot deduce if it has the same features; the way of working is completely different


This looks cool as heck. +1


thanks for this!


For something like this I'd use FoundationDB to store the entries redundantly and consistently with high availability and TypeSense for searching through the records and regenerating the index as necessary from FDB.


Does Typesense work with external vector databases?


Typesense has a vector store / search built-in: https://typesense.org/docs/0.24.1/api/vector-search.html

In the upcoming version, we've also added the ability to automatically generate embeddings from within Typesense either using OpenAI, PaLM API or a built-in model like s-bert or E5. So you only have to send json and pick a model, Typesense will then do a hybrid vector+keyword search for queries.


I see you run hnswlib but do you (plan to) support external vector databases, so users can upgrade?


We don't plan to support external vector databases, since we want to build Typesense as a vector + keyword search datastore by itself.


I see. Do you plan to replace hnswlib with your own technology?



