
Show HN: Fastest search engine in the world - marcuslager
Hi! According to my benchmark tests I've just built the fastest free-text search engine in the world [0]. However, providing proof of that, as well as making people care, has proven to be a near impossible task. I could use some help from fellow programmers, both to work on the formal proof and to test this code against the Big 5's offerings of full-text search.

Would you care to go fetch some common crawl data to test the abilities of ResinDB?

If not, then do you perhaps have another strategy for convincing people when you have invented something big? Is writing papers and formally proving things a way into the marketplace? If so, how do you figure MongoDB overthrew its previous marketplace ruler? It had no formal proof of anything. The market loved it, though, for its speed.

I would love for formal proof to be a success indicator. Diverting people from using Lucene/Elasticsearch would then be a breeze.

Why ResinDB? Because why use a cloud-based luxury cruiser when the most precise relevance, the fastest querying and the most energy-aware choice is an on-premise kayak? ResinDB, a word2vec search engine implementation, is the fastest (local) information retrieval system known to me. How about you?

[0] https://github.com/kreeben/resin
======
grizzles
Lucene is the standard. Companies have an upfront investment in it, and it
would cost them more to change than not to change. If this is a business
thing, you need to find the company that has thousands of Lucene instances and
try to convince them that they will gain a measurable benefit.

 _How to convince people when you have invented something big?_

You need to hand them a piece of paper that says: This [existing setup] costs
you $5000/mo. This [resindb] will cost you $200/mo.

If you can directly show a financial benefit to your buyer, it becomes a no
brainer for them. It's something many (especially SaaS) founders get wrong.

~~~
marcuslager
Yes, I should probably try to step out of the dev role and see things
commercially, or through a business-utility perspective. My mind, however,
wants me to dive even deeper into the rabbit hole that is search, relevance
and the hard problems of concurrent read/write. I do believe I'm already
aligned well enough for the database/search market, but how do I find time for
marketing?

Edit: I read your post again and you are saying I need to be orders of
magnitude cheaper than my competitors. Does the technical solution (being
orders of magnitude more performant) count in any way?

~~~
Alacart
In my anecdotal experience, selling something, especially to businesses,
typically comes down to 4 things (in descending order of what's easiest to
sell):

Will it make them more money?

Will it save them money?

Will it make their (the decision maker's) life significantly easier?

Will it make their (the decision maker's) life feel dramatically better?

Notice how they get more subjective as the list goes down. It's hard to
convince people of, or to prove, subjective things. That last one is so hard
that I'm not sure I've ever seen someone successfully sell something that way
in person. But the first one needs almost no convincing at all.

Now, does your search engine do one or more of those things for people? If the
added performance doesn't fulfill one of those desires, sorry but they
probably won't be interested enough to go through the pain of switching and
taking the risk of failure with a new and largely unproven system.

But if you can show that it will demonstrably fulfill one of those needs, say
by cutting down the number of servers required, less down time, or better
search results that also result in more sales or engagement, then you've got a
product you can sell.

~~~
marcuslager
Thanks for that very good advice.

From what I hear, folks (not only here at HN) aren't really looking for speed;
instead they look for features. I'm starting to regard performance less and
less as something to strive for and more and more as just confirmation that
the architecture is healthy.

Say my benchmarks are correct and I do in fact beat my closest competitor by
some measurement. I'm thinking I should start utilizing that performance to
achieve higher relevance, donate it if you will to that cause instead of the
speed cause.

To me though, Lucene is not so much of a competitor as she is a role model.
Well maybe she's both. My real competitor, some day, will hopefully be
Elasticsearch. I've used versions 2 and 5. I'm underwhelmed.

------
krishna2
That's a big claim and kudos if you really pulled it off. There is also the
aspect of relevancy in addition to speed.

I think the best way you can showcase is to build a few sample proof-of-
concept search engines. For e.g., How about a search engine for Wikipedia?
Project Gutenberg? StackOverflow? All these datasets are freely available. You
can set up a search engine for this and easily let anyone be able to verify
your search engine's speed and relevancy.

Lastly, in addition to both speed and relevance is how easy it is to install,
customize and extend.

Hope that helps!

~~~
marcuslager
Yes (it helped), and speed, i.e. querying and indexing performance, is for
sure only a USP if you also have relevance.

I'm confident the relevance is as good as or better than Lucene. I especially
like my phrase queries and how they seem more relevant compared to that of a
Lucene phrase query. The scoring is a half-way implementation of word2vec (in
a lot of ways similar to the scoring mechanics of Lucene's tf-idf scheme). I'm
aiming for full word2vec implementation in vNext.
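To make the comparison concrete: this is not ResinDB's code, just a minimal sketch of the classic tf-idf cosine scoring that schemes like Lucene's are built on, with invented toy documents:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse tf-idf vectors (term -> weight) for each document.

    docs is a list of token lists; returns (vectors, idf)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [["fast", "search", "engine"],
        ["slow", "database", "engine"],
        ["fast", "fast", "query"]]
vecs, idf = tfidf_vectors(docs)
# Score the query "fast search" against every document.
query = {t: idf.get(t, 0.0) for t in ["fast", "search"]}
ranked = sorted(range(len(docs)), key=lambda i: cosine(query, vecs[i]), reverse=True)
```

A word2vec-based scheme would replace the tf-idf vectors with learned dense embeddings but keep the same cosine-ranking shape.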

I have only my own benchmark tests to tell me I'm faster than Lucene, which is
why I'm contemplating writing a formal proof both of ResinDB's performance and
of its relevance.

My test data has been the English version of Wikipedia plus Project Gutenberg.
I suppose I could publish those indices to the world, as a demo search engine.
I don't think a soul would care about a properly searchable Project Gutenberg
though. Looking into common crawl now.

I'm a part-time father of two, employed doing tedious, unmotivating work, and
focusing completely on my spare-time project. I need some advice as to what
the next step should be if I wanted to make this into a business that I could
spend all of my time on, not only nights and weekends. Formal proof? Demo?

Side note: one of the most approachable people in the database building
community is Oren Eini, creator of RavenDB. He's reviewing ResinDB on his
blog. I've read a preview of the entire series of posts, implemented solutions
for the best parts of the critique and just released v2. Blog is here:
[http://ayende.com/blog](http://ayende.com/blog)

~~~
krishna2
Great that you already have those datasets. Yes, putting it on a small public
server where people can search and evaluate would be good. Honestly the
"speed" part cannot be truly verified if it is on a standard public server but
at least the other aspects of it can be. As I see it, you are up against
primarily Elasticsearch and Postgres's search engines. [To put out a full
index could be costly so you could always try a small subset but still a good
enough chunk, say a million docs or so].

Another thing to keep in mind is how easy it is to install and how pluggable
it is. I know you have designed it as a library, but I think a small wrapper
around it with its own HTTP server, so anyone using it can start it as a
service and access it over HTTP via JSON, would be useful too. [At least
everyone these days seems to do everything in containers.] And to add to that,
Elasticsearch sets a good bar for how easy it is to spin up a search engine
and get started. Again, not sure how far you must go to make ResinDB as easy
to install, to use, to document and all that.
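The wrapper idea is small in practice. Here is a hedged sketch (in Python rather than ResinDB's C#, and with a toy dict standing in for the actual engine call) of what "library behind an HTTP JSON endpoint" amounts to:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Hypothetical stand-in for the embedded engine: a real wrapper would
# delegate to the search library's query API instead of this dict.
INDEX = {"kayak": [1, 3], "search": [1, 2]}

def query(term):
    """Return the matching document ids for a single term."""
    return INDEX.get(term, [])

class SearchHandler(BaseHTTPRequestHandler):
    # GET /search?q=term  ->  {"ids": [...]}
    def do_GET(self):
        term = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        body = json.dumps({"ids": query(term)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), SearchHandler).serve_forever()
```

Anyone could then `curl 'http://localhost:8080/search?q=kayak'` without linking the library, which is what makes container deployments easy.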

One way to get adoption is to approach a few open-source projects and non-
profit orgs (or for-profit orgs, but you might have to start out free) and see
if you can convince them to use your search engine. Once you have a couple or
more, it helps in two ways. First, you get good feedback on the steps someone
besides you needs to take to get it into production, plus updates and
maintenance; second, you can use them as reference customers.

Feel free to contact me via email [same as hn id but with the popular email
service from another search engine out there! :)].

~~~
marcuslager
Thank you so much for this feedback. My eyes have been on Elasticsearch ever
since their first funding round of 80 million bucks. But I have also noticed
how Postgres is the only database engineering team that seems to care about
full-text search. Their indexing capabilities are just awesome. They have a
library of index types you can use. They all seem well constructed, and so
does Postgres (and the team).

An HTTP wrapper. Sure. It's in the backlog. I can push it up a bit.

What is it that you like about ELK the most? The easy-peasy install where you
can immediately start writing data, the HTTP JSON API, or something else?

------
softwaredoug
> Diverting people from using Lucene/Elasticsearch would then be a breeze.

Maybe. But most people's problems with Lucene/Elasticsearch aren't really
speed. They use Lucene because it's feature-rich and has been worked on for
almost two decades. And there are plenty of strategies to mitigate any speed
problems (caching, sharding, etc.), with so much knowledge spread throughout
hundreds of orgs that "making Elasticsearch fast" is something you can hire
someone for. Lots of sunk cost in the Lucene stack.

Not saying what you're doing is without value. Just be conscious that "speed"
is only one of many, many factors when evaluating a search engine. In fact, as
someone who regularly helps clients evaluate search solutions, it's very rare
that it comes up. "Fast enough" with the right features for the problem is
really what people want.

~~~
marcuslager
I'm glad I posted here because one point is maybe finally clear to me about
speed. It's not a good word for performance. It's the wrong word to use.

The great folks at Elasticsearch would _love_ for Lucene to be more
performant. It would make life so much easier for them.

The Lucene team spend a good buck on their nightly performance tests. It's
astonishing how well-tested Lucene is.

I wonder why I'm faster at writing and reading. Maybe it's because I have been
benchmarking against an older version (4.8). But still, I wonder if my tests
are all wrong or if I just got lucky in my design. ResinDB has flaws. It puts
massive pressure on the GC at write time if your batches are huge. I'm working
hard at optimizing that Achilles heel away. Until I have completely done so,
writing speed is achieved through lots of memory allocations. It's
surprisingly easy, though, to move away from using GC as a service.

But performance is not a feature anymore?

Edit:typos and tried to clarify things

------
rpedela
Late to the party.

As a search practitioner, a formal proof would not convince me, so I don't
think sales would become a breeze with one. I am interested in performance,
but I am more interested in relevance. In my experience, sub-second
performance is good enough for most use cases, less than 300ms is ideal, and
faster is great but I stop caring.

What is really hard with Solr, ES, and Lucene is relevance. Solr has the best
out-of-the-box experience, but I find ES the easiest to customize, though
still not easy. What I personally would love is something that has great
defaults and easy customization. I would also love to see integration with
machine learning algorithms, such as word2vec, as a feature rather than
something you have to do yourself.

My advice would be to build on top of Lucene/Solr/ES rather than start from
scratch because performance for all three is already good enough. Instead do
something that makes using those technologies easier/better. For example,
Algolia built autocomplete on steroids which is a big value add if you need
that and it is something Lucene doesn't do well. Algolia did write their own
search engine from scratch but you can duplicate their functionality in ES
(contrary to their marketing claims).

So if I could give Resin any arbitrary dataset and it would automatically
compute word vectors and add that as a relevance signal along with BM25 and
custom ranking (e.g. popularity), that would get me very excited.
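That wish list can be made concrete. Here is a sketch, not any existing engine's API, of Okapi BM25 blended linearly with a word-vector similarity and a popularity signal; the weights are invented and would in practice be tuned on click or judgment data:

```python
import math

def bm25(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 weight for one term in one document.

    tf: term frequency in the document, df: document frequency of the term,
    n_docs: collection size, doc_len/avg_len: length normalization inputs."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def blended_score(bm25_score, vec_sim, popularity, w_vec=0.3, w_pop=0.1):
    """Combine the lexical, semantic and popularity signals linearly.

    vec_sim would come from cosine similarity between word2vec-style
    document and query vectors; popularity is any custom ranking signal."""
    return bm25_score + w_vec * vec_sim + w_pop * popularity

# Example: a document matching "kayak" twice, with a semantic-similarity
# score of 0.8 and a normalized popularity of 0.4.
score = blended_score(bm25(2, 10, 1000, 100, 100.0), vec_sim=0.8, popularity=0.4)
```

Automating the "compute word vectors from any dataset" half is the hard part; the blending itself is a one-liner.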

------
andkon
Offering a formal proof is really cool, but shouldn't be the whole of your
marketing efforts. Instead, focus on what's gained. Even if it is faster,
people are still going to rightfully ask: so what? How does that actually
improve my life/sales/etc?

~~~
marcuslager
I agree with what you said whole-heartedly, and I have to conclude by now that
I am not even close to being a sales person. I don't care for the psychology
of a sale. I've been in many. I've only seen one or two beautiful ones.

------
notheguyouthink
Totally unrelated to this topic, so please ignore if this is a bad place to
discuss this.. but: How difficult is this to embed in Go? I've never heard of
embedding C#. I suspect if I was to use this it would likely be outside of Go.

Right now I'm in need of an embedded indexer with full text search for schema
less queries. I've settled with a (incomplete) custom indexer I wrote that
applies FTS via Bleve.

However, I doubt this will scale well - so I was assuming that I'd switch to
ElasticSearch or Solr. Resin sounds interesting though. Especially if I can
embed it, and not have to run a separate process.

~~~
marcuslager
"and not have to run a separate process"

Not at all an unrelated issue to me, but unresolvable at the moment, methinks.
To use Resin within the same process as a Go app, Resin would have to be a Go
library.

I'm glad you posted this because I don't think there is an embedded search
engine library for Go, which is both a little funny but could also constitute
a business opportunity for a Go programmer.

Would you care to talk a little more about your requirements?

~~~
notheguyouthink
Sure. I'm using it for a mildly distributed locally focused offline-able
content addressable store _(that's a mouthful)_. Think Camlistore, with the
things that I wanted. Personal storage is the main use case but with some
limited database capabilities.

As such, the records stored in the .. store, need to be indexed with provided
fields for later retrieval. The indexer is responsible for this. This ends up
being far more like a "database" than anything, honestly, as my queries _can_
be complex, or simple. Eg, tags:foo title:hasWord:bar, etc.

So basically the indexer should be able to run a full suite of database-like
queries, I just don't care about the data being retrieved, only the id(s) that
matches the queries. To reiterate, the indexer is just responsible for
returning the content hashes/ids. The content addressed store actually
stores/retrieves the data. Needed operations are all the standard ones: AND,
EQ, OR, NOT, PREFIX/SUFFIX is nice too but not required, etc. and of course
FullTextSearch.
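To make those requirements concrete, here is a toy sketch of that id-only query model; the index contents and field names are invented, and this is not Bleve's or Resin's API:

```python
# An id-only inverted index: (field, value) -> set of content ids.
# The store, not the indexer, holds the actual data for those ids.
INDEX = {
    ("tags", "foo"):  {1, 2, 5},
    ("title", "bar"): {2, 5, 7},
    ("tags", "baz"):  {5, 9},
}
ALL_IDS = {1, 2, 5, 7, 9}

def eq(field, value):
    """EQ: ids of records whose field contains the value."""
    return INDEX.get((field, value), set())

def AND(*sets):
    """Intersection of all operand id sets."""
    out = sets[0].copy()
    for s in sets[1:]:
        out &= s
    return out

def OR(*sets):
    """Union of all operand id sets."""
    out = set()
    for s in sets:
        out |= s
    return out

def NOT(s):
    """Complement against the full id universe."""
    return ALL_IDS - s

# Roughly: tags:foo AND title:bar NOT tags:baz
ids = AND(eq("tags", "foo"), eq("title", "bar"), NOT(eq("tags", "baz")))
```

Everything composes over plain id sets, which is why such an indexer can stay ignorant of the stored data.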

Anyway, hope this answers your question.

------
marcuslager
Thx for the feedback. What I think you should realise, and hope you already
do, is that Lucene is nowhere near maximum performance for full-text search,
and neither is its relevance. And implementing new scoring routines is a drag
in Lucene.

Google is also nowhere near maximum relevance. I like word2vec. That model
fits into my world view. I'm going to implement it and then take it further.
Hopefully while being funded. If not then it shall be my contribution to the
open source space and nothing more.

~~~
visarga
If you want to do word vector similarity search, try the "annoy" library from
Spotify. It's much much faster than Gensim.
[https://github.com/spotify/annoy](https://github.com/spotify/annoy)

~~~
rpedela
It appears to do the vector similarity part but not the vector creation part,
and is therefore not a gensim replacement. Am I missing something?

------
tixocloud
Would your technology be adaptable to other problem domains? I can see a few
different applications outside of the traditional "search engine" space where
speed will be critical and other features would be less important.

------
vvvkkk
Maybe you can use our index... This would be interesting if you joined us and
helped develop the search platform Bubblehunt. Imagine if every user could get
their own search engine. And if your solution is better, why not? Write me if
this interests you :)

~~~
marcuslager
You wouldn't mind me using your data?

I've seen Bubblehunt and I've played around a little bit with it. Reach me at
github.com/kreeben/resin.

------
amirouche
FWIW, various HNers have expressed interest in having something similar to the
Algolia service. Basically, a service that can index a website (including
static websites) and provide a search API for it. Good luck!

~~~
marcuslager
Stolen!

~~~
amirouche
What do you mean?

~~~
marcuslager
I mean, great idea, I'm stealing it.

Edit:

A distributed search engine:
[https://github.com/kreeben/dire](https://github.com/kreeben/dire)

Open for feedback.

~~~
amirouche
Upvoting my comment would have been more like a good HN answer ;)

------
amirouche
What you call a full-text search index is an inverted index, isn't it? How do
you use word embeddings exactly? There is no mention of them in the README.

Good luck trying to climb past Lucene and Elasticsearch.

------
z3t4
Have some live demos and practical use examples.

~~~
marcuslager
What would impress _you_?

Me personally, I don't think it is impressive of Google to be able to store
every web page in existence and to refresh them every other minute. I just
don't think that is a good way of spending electricity while we haven't yet
figured out how to properly utilize the sun's energy. Google is all-knowing
while burning shit-loads of coal. What's impressive about that?

~~~
CyberDildonics
Back up your claims. Have benchmarks and demos. Why say these things when
there are neither benchmarks nor demos out there?

~~~
marcuslager
My claims are backed up by the code I've spent blood and sweat to create.
Disprove me, please, because I need to know of scenarios that I need to solve
for vNext, scenarios where I'm currently not doing great.

Edit: and also: I'm reaching out to you guys not because I want a pat on the
back or free PR. I'm looking for advice as to how to move from having unique
tech to having a business. Is this the right forum?

~~~
CyberDildonics
> having unique tech

You have claims and they aren't even unique.

> Disprove me please

That's not how it works. Why would I waste my time testing your software when
you don't seem to possess common sense or experience? The probability that you
can back up what you say is excessively low.

~~~
marcuslager
I'm scrolling up to see where you hit a nerve with me (or is it the other way
around?). Anyway, sorry about that.

~~~
CyberDildonics
It's not about hitting a nerve, you are making extraordinary claims and for
some reason you think other people should compile your source and disprove you
instead of you showing any evidence in the first place. Why would you think
that?

------
jedisct1
How does it compare to Groonga?

~~~
marcuslager
"Groonga is an open-source fulltext search engine and column store."

We seem to be at least cousins. Thx for that link. I will have to get back to
you.

Edit:

Groonga seems to be cloud software. ResinDB is an in-process library, not a
service.

Put ResinDB behind a service end-point and you have "ResinDB as a service",
much more like the Groonga architecture.

Orchestration of read/write in a distributed service-like environment is
something that is not solved within the ResinDB codebase. ResinDB is intended
to be a component of a distributed database, not a distributed database in
itself.

Groonga has been around since 2011. I started on ResinDB last year, in March
of 2016.

Groonga makes monthly releases. I take long pauses because of my lifestyle.

Groonga is a team of devs. I'm an independent solo dev.

Groonga is unmanaged code. ResinDB is managed code.

~~~
FractalNerve
# Creating immediate value

You could use apiblueprint.org and swagger.io to create SDK bindings in
various languages for your distributed search engine service, which you could
build using a Paxos library for the consensus algorithm, (lib)torrent for the
data exchange and the s2n or openssl library for SSL/TLS encryption.

# User facing values

None of the points in your last four paragraphs, even if impressive from a
developer angle, are of any relevance to a paying customer (end-user), unless
your end-user is as thrilled and motivated as you are. But even then, you need
to keep the motivation up with excellent and enjoyable docs, tutorials, a cool
website and good integration into developer tools.

# Growth Hacking

After reading the whole discussion, I have the impression that you're looking
for growth hacking but have no idea how to express it other than with
differentiating features. Marketing and growth hacking are really different in
that they don't exploit clean-ness but messy-ness. That means your whole task
as a growth hacker/marketer is to convince a (healthily) growing mass of
people, decision-makers and early adopters using manipulative tricks, be it
neuro-marketing, selling techniques, (programmatic) scaling and taking
advantage, or any other form of gaining mass recognition and presence. You can
find a more concise and useful explanation of this on your digital book-shelf.

