Show HN: Fastest search engine in the world
38 points by marcuslager 150 days ago | 44 comments
Hi! According to my benchmark tests, I've just built the fastest free-text search engine in the world [0]. However, proving that, and getting people to care, has turned out to be a near-impossible task. I could use some help from fellow programmers, both to work on a formal proof and to test this code against the Big 5's full-text search offerings.

Would you care to fetch some Common Crawl data to test the abilities of ResinDB?

If not, do you perhaps have another strategy for convincing people when you have invented something big? Is writing papers and formally proving things a way into the marketplace? If so, how do you figure MongoDB overthrew its previous market ruler? It had no formal proof of anything. The market loved it anyway, for its speed.

I would love for formal proof to be a success indicator. Diverting people from using Lucene/Elasticsearch would then be a breeze.

Why ResinDB? Because why use a cloud-based luxury cruiser when the most precise relevance, the fastest querying and the most energy-aware choice is an on-premise kayak? ResinDB, a word2vec search engine implementation, is the fastest (local) information retrieval system known to me.

[0] https://github.com/kreeben/resin




Lucene is the standard. Companies have an upfront investment in it, and it would cost them more to change than not to change. If this is a business thing, you need to find the company that has thousands of Lucene instances and try to convince them that they will gain a measurable benefit.

How to convince people when you have invented something big?

You need to hand them a piece of paper that says: this [existing setup] costs you $5000/mo. This [ResinDB] will cost you $200/mo.

If you can directly show a financial benefit to your buyer it becomes a no-brainer for them. It's something many (especially SaaS) founders get wrong.


Yes, I should probably try to step out of the dev role and see things commercially, through a business-utility perspective. My mind, however, wants me to dive even deeper into the rabbit hole that is search and relevance and the hard problems of concurrent read/write. I do believe I'm already aligned well enough for the database/search market, but how do I find time for marketing?

Edit: I read your post again and you are saying I need to be orders of magnitude cheaper than my competitors. Does the technical solution (being orders of magnitude more performant) count in any way?


In my anecdotal experience, selling something, especially to businesses, typically comes down to 4 things (in descending order of what's easiest to sell):

Will it make them more money?

Will it save them money?

Will it make their (the decision maker's) life significantly easier?

Will it make their (the decision maker's) life feel dramatically better?

Notice how they get more subjective as the list goes down. It's hard to convince people of, or prove, subjective things. That last one is so hard that I'm not sure I've ever seen someone successfully sell something that way in person. But the first one needs almost no convincing at all.

Now, does your search engine do one or more of those things for people? If the added performance doesn't fulfill one of those desires, sorry but they probably won't be interested enough to go through the pain of switching and taking the risk of failure with a new and largely unproven system.

But if you can show that it will demonstrably fulfill one of those needs, say by cutting down the number of servers required, reducing downtime, or delivering better search results that in turn drive more sales or engagement, then you've got a product you can sell.


Thanks for that very good advice.

From what I hear, folks (not only here at HN) aren't really looking for speed, instead they look for features. I'm starting to look at performance less and less as something to strive for and more and more just a confirmation that the architecture is healthy.

Say my benchmarks are correct and I do in fact beat my closest competitor by some measurement. I'm thinking I should start utilizing that performance to achieve higher relevance, donate it if you will to that cause instead of the speed cause.

To me though, Lucene is not so much of a competitor as she is a role model. Well maybe she's both. My real competitor, some day, will hopefully be Elasticsearch. I've used versions 2 and 5. I'm underwhelmed.


That's a big claim and kudos if you really pulled it off. There is also the aspect of relevancy in addition to speed.

I think the best way you can showcase it is to build a few sample proof-of-concept search engines. For example: how about a search engine for Wikipedia? Project Gutenberg? StackOverflow? All these datasets are freely available. You can set up a search engine over them and easily let anyone verify your engine's speed and relevancy.

Lastly, in addition to both speed and relevance is how easy it is to install, customize and extend.

Hope that helps!


Yes (it helped), and speed, i.e. querying and indexing performance, is for sure only a USP if you also have relevance.

I'm confident the relevance is as good as or better than Lucene's. I especially like my phrase queries and how they seem more relevant than a Lucene phrase query. The scoring is a half-way implementation of word2vec (in a lot of ways similar to the scoring mechanics of Lucene's tf-idf scheme). I'm aiming for a full word2vec implementation in vNext.
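To sketch the tf-idf-style mechanics I'm comparing against (a toy illustration in plain Python, not ResinDB's or Lucene's actual code):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse tf-idf weighted term vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Scoring a query then amounts to treating the query as a tiny document and ranking by cosine against each document vector; word2vec-style scoring swaps these sparse term vectors for learned dense ones.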

I have only my own benchmark tests to tell me I'm faster than Lucene, which is why I'm contemplating writing a formal proof of both ResinDB's performance and its relevance.

My test data has been the English version of Wikipedia plus Project Gutenberg. I suppose I could publish those indices to the world as a demo search engine. I don't think a soul would care about a properly searchable Project Gutenberg though. Looking into Common Crawl now.

I'm a part-time father of two, employed doing tedious, unmotivating work, focusing completely on my spare-time project. I need some advice as to what the next step should be if I want to make this into a business that I could spend all of my time on, not only nights and weekends. Formal proof? Demo?

Side note: one of the most approachable people in the database building community is Oren Eini, creator of RavenDB. He's reviewing ResinDB on his blog. I've read a preview of the entire series of posts, implemented solutions for the best parts of the critique and just released v2. Blog is here: http://ayende.com/blog


Great that you already have those datasets. Yes, putting it on a small public server where people can search and evaluate would be good. Honestly the "speed" part cannot be truly verified if it is on a standard public server but at least the other aspects of it can be. As I see it, you are up against primarily Elasticsearch and Postgres's search engines. [To put out a full index could be costly so you could always try a small subset but still a good enough chunk, say a million docs or so].

Another thing to keep in mind is how easy it is to install and how pluggable it is. I know you have designed it as a library, but a small wrapper around it with its own HTTP server, so anyone can start it as a service and access it over HTTP via JSON, would be useful too. [At least everyone these days seems to do everything in containers.] Also, Elasticsearch sets a good bar for how easy it is to spin up a search engine and get started. Again, not sure how far you want to go to make ResinDB that easy to install, use, document and all that.

One way to get adoption is to approach a few open-source projects and non-profit orgs (or for-profit orgs, but you might have to start out for free) and see if you can convince them to use your search engine. Once you have a couple or more, it helps in two ways. First, you get good feedback on the steps someone besides you needs to take to get it into production, plus updates and maintenance; second, you can use them as reference customers.

Feel free to contact me via email [same as hn id but with the popular email service from another search engine out there! :)].


Thank you so much for this feedback. My eyes have been on Elasticsearch ever since their first funding of 80 million bucks. But I have also noticed how Postgres is the only database engineering team that seems to care about full-text search. Their indexing capabilities are just awesome. They have a library of index types you can use. They all seem well constructed, and so does Postgres (and the team).

An HTTP wrapper. Sure. It's in the backlog. I can push it up a bit.

What is it that you like most about ELK? The easy-peasy install where you can immediately start writing data, the HTTP JSON API, or something else?


> Diverting people from using Lucene/Elasticsearch would then be a breeze.

Maybe. But most people's problems with Lucene/Elasticsearch aren't really about speed. They use Lucene because it's feature-rich and has been worked on for almost two decades. And there are plenty of strategies to mitigate any speed problems (caching, sharding, etc.), with knowledge spread throughout hundreds of orgs to the point where "making Elasticsearch fast" is something you can hire someone for. Lots of sunk cost in the Lucene stack.

Not saying what you're doing is without value. Just be conscious that "speed" is only one of many, many factors when evaluating a search engine. In fact, as someone who regularly helps clients evaluate search solutions, it's very rare that it comes up. "Fast enough" with the right features for the problem is really what people want.


I'm glad I posted here because one point is maybe finally clear to me about speed. It's not a good word for performance. It's the wrong word to use.

The great folks at Elasticsearch would _love_ for Lucene to be more performant. It would make life so much easier for them.

The Lucene team spend a good buck on their nightly performance tests. It's astonishing how well-tested Lucene is.

I wonder why I'm faster at writing and reading. Maybe it's because I have been benchmarking against an older version (4.8). But still. I wonder if my tests are all wrong or if I just got lucky in my design. ResinDB has flaws. It puts massive pressure on the GC at write time if your batches are huge. I'm working hard at optimizing that Achilles heel away. Until I have, write speed is achieved through lots of memory allocations. It's surprisingly easy, though, to move away from using the GC as a service.

But performance is not a feature anymore?

Edit:typos and tried to clarify things


Late to the party.

As a search practitioner, a formal proof would not convince me, so I don't think sales would become a breeze with one. I am interested in performance, but I am more interested in relevance. In my experience, sub-second performance is good enough for most use cases, less than 300ms is ideal, and faster is great but I stop caring.

What is really hard with Solr, ES, and Lucene is relevance. Solr has the best out-of-the-box experience, but I find ES the easiest to customize, though still not easy. What I personally would love is something that has great defaults and easy customization. I would also love to see integration with machine learning algorithms, such as word2vec, as a feature rather than something you have to do yourself.

My advice would be to build on top of Lucene/Solr/ES rather than start from scratch because performance for all three is already good enough. Instead do something that makes using those technologies easier/better. For example, Algolia built autocomplete on steroids which is a big value add if you need that and it is something Lucene doesn't do well. Algolia did write their own search engine from scratch but you can duplicate their functionality in ES (contrary to their marketing claims).

So if I could give Resin any arbitrary dataset and it would automatically compute word vectors and add that as a relevance signal along with BM25 and custom ranking (e.g. popularity), that would get me very excited.
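To make that concrete, here is a toy sketch of the kind of signal blend I mean; `blended_score` and its weights are made up for illustration, not any existing engine's API:

```python
import math
from collections import Counter

def bm25(query_terms, doc, docs, k1=1.2, b=0.75):
    """Classic BM25: rewards term frequency but damps it by document length."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in docs if t in d)
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        f = tf[t]
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

def blended_score(bm25_score, vector_sim, popularity, weights=(1.0, 0.5, 0.2)):
    """Hypothetical linear blend of lexical, word-vector and custom ranking signals."""
    return (weights[0] * bm25_score
            + weights[1] * vector_sim
            + weights[2] * popularity)
```

The point is that word-vector similarity and popularity ride alongside BM25 as extra ranking signals rather than replacing it.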


Offering a formal proof is really cool, but shouldn't be the whole of your marketing efforts. Instead, focus on what's gained. Even if it is faster, people are still going to rightfully ask: so what? How does that actually improve my life/sales/etc?


I agree with what you said whole-heartedly, and I have to conclude by now that I am not even close to being a salesperson. I don't care for the psychology of a sale. I've been in many. I've only seen one or two beautiful ones.


Totally unrelated to this topic, so please ignore if this is a bad place to discuss this, but: how difficult is this to embed in Go? I've never heard of embedding C#. I suspect if I were to use this it would likely be outside of Go.

Right now I'm in need of an embedded indexer with full-text search for schemaless queries. I've settled on an (incomplete) custom indexer I wrote that applies FTS via Bleve.

However, I doubt this will scale well - so I was assuming that I'd switch to ElasticSearch or Solr. Resin sounds interesting though. Especially if I can embed it, and not have to run a separate process.


"and not have to run a separate process"

Not at all an unrelated issue to me, but unresolvable at the moment, methinks. To use Resin within the same process as a Go app, Resin would have to be a Go library.

I'm glad you posted this because I don't think there is an embedded search engine library for Go, which is both a little funny and could also constitute a business opportunity for a Go programmer.

Would you care to talk a little more about your requirements?


Sure. I'm using it for a mildly distributed locally focused offline-able content addressable store _(that's a mouthful)_. Think Camlistore, with the things that I wanted. Personal storage is the main use case but with some limited database capabilities.

As such, the records stored in the .. store, need to be indexed with provided fields for later retrieval. The indexer is responsible for this. This ends up being far more like a "database" than anything, honestly, as my queries can be complex, or simple. Eg, tags:foo title:hasWord:bar, etc.

So basically the indexer should be able to run a full suite of database-like queries, I just don't care about the data being retrieved, only the id(s) that matches the queries. To reiterate, the indexer is just responsible for returning the content hashes/ids. The content addressed store actually stores/retrieves the data. Needed operations are all the standard ones: AND, EQ, OR, NOT, PREFIX/SUFFIX is nice too but not required, etc. and of course FullTextSearch.
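In code, the contract I have in mind looks roughly like this toy sketch (all names invented, not Bleve's or Resin's API):

```python
from collections import defaultdict

class IdOnlyIndex:
    """Toy inverted index: maps field:term keys to sets of document ids,
    leaving storage and retrieval of the documents themselves to the
    content-addressed store."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        # fields is e.g. {"tags": "foo", "title": "bar baz"}
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[field + ":" + term].add(doc_id)

    def all_of(self, *keys):  # AND
        sets = [self.postings[k] for k in keys]
        return set.intersection(*sets) if sets else set()

    def any_of(self, *keys):  # OR
        return set().union(*(self.postings[k] for k in keys))

    def none_of(self, universe, key):  # NOT
        return set(universe) - self.postings[key]
```

The queries only ever hand back ids (content hashes), which the store then resolves to actual records.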

Anyway, hope this answers your question.


Thx for the feedback. What I think you should realize, and hope you already do, is that Lucene is nowhere near maximum performance for full-text search, nor maximum relevance. And implementing new scoring routines is a drag in Lucene.

Google is also nowhere near maximum relevance. I like word2vec. That model fits into my world view. I'm going to implement it and then take it further. Hopefully while being funded. If not then it shall be my contribution to the open source space and nothing more.


If you want to do word vector similarity search, try the "annoy" library from Spotify. It's much much faster than Gensim. https://github.com/spotify/annoy
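For context, Annoy approximates the nearest-neighbour lookup that, done exactly, is a linear scan like this (plain-Python sketch with made-up toy vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest(query, vectors, k=3):
    """Exact top-k neighbours by cosine similarity: the O(n) scan that
    Annoy's random-projection trees approximate in sub-linear time."""
    ranked = sorted(vectors, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:k]
```

Annoy trades a little recall for speed, which is usually the right trade once the vector set no longer fits a brute-force scan.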


It appears to do the vector similarity part but not the vector creation part and therefore not a gensim replacement. Am I missing something?


Would your technology be adaptable to other problem domains? I can see a few different applications outside of the traditional "search engine" space where speed will be critical and other features would be less important.


Maybe you can use our index... this would be interesting if you joined us and helped develop our search platform, Bubblehunt. Imagine if every user could get their own search engine. And if your solution is better, why not? Write me if this interests you :)


You wouldn't mind me using your data?

I've seen Bubblehunt and I've played around with it a little. Reach me at github.com/kreeben/resin.


FWIW, various HNers have expressed interest in having something similar to the Algolia service: basically, a service that can index a website (including static websites) and provide a search API for it. Good luck!


Stolen!


What do you mean?


I mean, great idea, I'm stealing it.

Edit:

A distributed search engine: https://github.com/kreeben/dire

Open for feedback.


Upvoting my comment would have been more like a good HN answer ;)


What you call a full-text search index is an inverted index, isn't it? How exactly do you use word embeddings? There is no mention of them in the README.

Good luck trying to take on Lucene and Elasticsearch.


Have some live demos and practical use examples.


What would impress _you_?

Personally, I don't find it impressive that Google is able to store every web page in existence and refresh them every other minute. I just don't think that is a good way of spending electricity while we haven't yet figured out how to properly utilize the sun's energy. Google is all-knowing while burning shit-loads of coal. What's impressive about that?


Back up your claims. Have benchmarks and demos. Why say these things when there are neither benchmarks nor demos out there?


My claims are backed up by the code I've spent blood and sweat to create. Disprove me, please, because I need to know of scenarios that should go into vNext, scenarios where I'm currently not doing great.

Edit: and also: I'm reaching out to you guys not because I want a pat on the back or free PR. I'm looking for advice on how to move from having unique tech to having a business. Is this the right forum?


> having unique tech

You have claims and they aren't even unique.

> Disprove me please

That's not how it works. Why would I waste my time testing your software when you don't seem to possess common sense or experience? The probability that you can back up what you say is excessively low.


I'm scrolling up to see where you hit a nerve with me (or is it the other way around?). Anyway, sorry about that.


It's not about hitting a nerve, you are making extraordinary claims and for some reason you think other people should compile your source and disprove you instead of you showing any evidence in the first place. Why would you think that?


I think what OP was saying is that yes, it would impress _them_, and it may impress other _people_, and impressing people is generally good for business.


Compared to MongoDB, I have no idea how to use your database, or whatever it is. With MongoDB I can just download and unstall, then copy a code snippet and run it myself.


You download it and then after some time you unstall it, got it. Well, I don't think I need to worry about MongoDB then ;)

But I know what you mean. ResinDB is a library that lets you embed a database inside your application. It's not a service like MongoDB. MongoDB loads and keeps indices in memory. That's the fastest type of architecture you can have if you want to answer queries quickly: have it all in memory.

Well, there _is_ one faster way: construct a smart index file, bitmapped, stored on an SSD, where the data is laid out in such a way that reading from it is just as fast as or faster than reading from an in-memory data structure. This is what Resin achieves.
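Resin's actual on-disk format isn't shown here, but the principle — fixed-layout records in a file read through a memory map, riding the OS page cache — can be sketched like this:

```python
import mmap
import os
import struct
import tempfile

# Write a tiny fixed-width "index" file: each record is two little-endian
# 32-bit ints, so record i lives at byte offset i * 8 and can be seeked to
# directly, with no deserialization pass over the whole file.
path = os.path.join(tempfile.mkdtemp(), "index.bin")
records = [(1, 42), (2, 99), (3, 7)]  # (doc_id, weight) pairs
with open(path, "wb") as f:
    for rec_id, rec_weight in records:
        f.write(struct.pack("<ii", rec_id, rec_weight))

# Read the second record back through a memory map; after the first touch
# the pages sit in the OS cache, so repeated reads approach RAM speed.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    doc_id, weight = struct.unpack_from("<ii", mm, 1 * 8)
    mm.close()
```

The fixed layout is what makes this competitive with an in-memory structure: lookups are pointer arithmetic rather than parsing.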


I don't even know what you are selling. Show me a demonstration. Can I run it in nodejs? Or sell directly to those who know exactly what you are talking about and solve their particular need.


I agree that a library such as this project is not at all as consumer-friendly as an application. Some might even call it completely unsexy. It's a component of something bigger, though, something you can indeed call into from nodejs. But that's another project.

> or sell directly to those who know exactly what you are talking about

Yeah I've been thinking I should try to get a few gigs as a speaker at tech meetups or conferences to talk about this tech but I haven't yet found a good enough story to tell.

Edit: give me one more chance to describe what Resin is.

Have you heard of SQL Server LocalDB? It's proper SQL Server, but it runs inside your application's process. It's a library with support for SQL and fast reads and writes. It's a database like any other database, but it's a library.

Unfortunately (or not), SQL Server LocalDB has no support for full-text search. This is why there is a marketplace for libraries such as Lucene, which make full-text search their priority. That marketplace has been as fixated on Lucene, an open-source free software project (Apache-licensed), as the world has been on Google, for about as long a timespan. If I want to make a dent in that market I need to be as open as Lucene and as performant.

To keep up with academia, the code base of a search engine should move fast (my view). Managed code lacks the precision of C++ but allows you to work fast. As hell.

So, ResinDB looks very much like a much smaller (in code size) version of Lucene. We will see in the coming months who moves fastest: me or the Lucene team.


Thank you, I think I somewhat understand now. For it to be accessible for me, I'm used to something like:

installation:

  npm install resin
usage:

  var resin = require("resin");
  var wikipedia = resin.init({file: "c:/temp/wikipedia.json", dir: "c:/resin/data/wikipedia"});
  var dogs = wikipedia.query("title: dog");
  
  // or ...
  var players = resin.init({dir: "c:/resin/data/playerData"});
  var oldPlayers = players.query("age > 30");


How does it compare to Groonga?


"Groonga is an open-source fulltext search engine and column store."

We seem to be at least cousins. Thx for that link. I will have to get back to you.

Edit:

Groonga seems to be cloud software. ResinDB is an in-process library, not a service.

Put ResinDB behind a service end-point and you have "ResinDB as a service", much more like the Groonga architecture.

Orchestration of read/write in a distributed service-like environment is something that is not solved within the ResinDB codebase. ResinDB is intended to be a component of a distributed database, not a distributed database in itself.

Groonga has been around since 2011. I started on ResinDB last year, in March of 2016.

Groonga makes monthly releases. I take long pauses because of my lifestyle.

Groonga is a team of devs. I'm an independent solo dev.

Groonga is unmanaged code. ResinDB is managed code.


# Creating immediate value

You could use apiblueprint.org and swagger.io to create SDK bindings in various languages for your distributed search engine service, which you could build using a Paxos library for the consensus algorithm, (lib)torrent for the data exchange, and the s2n or OpenSSL library for SSL/TLS encryption.

# User facing values

None of the points in your last four paragraphs, even if impressive from a developer angle, are of any relevance to a paying customer (end-user), unless your end-user is as thrilled and motivated as you are. But even then, you need to keep the motivation up with excellent and enjoyable docs, tutorials, a cool website and good integration into developer tools.

# Growth Hacking

After reading the whole discussion I have the impression that you're looking for growth hacking but have no idea how to express it other than through differentiating features. Marketing and growth hacking are different in that they don't exploit cleanness but messiness. That means your whole task as a growth hacker/marketer is to convince a (healthily) growing mass of people, decision-makers and early adopters, using manipulative tricks: be it neuro-marketing, selling techniques, (programmatic) scaling to gain an advantage, or any other form of gaining mass recognition and presence. You can find more concise and useful explanations of this on your digital book-shelf.



