
Is Ontolo the fastest crawler, parser, indexer? - ppsscc
https://ontolo.com/blog/ontolo-v7-behind-the-scenes/
======
howlingfantods
Why phrase it as a question? The linked article is a PR piece that obviously
argues that Ontolo is the fastest crawler, parser, indexer.

The actual article title "400,000,000+ Prospects a Day: Behind the Scenes of
ontolo" seems less clickbaity.

~~~
joekrill
I didn't even bother clicking through because I saw the domain name
(ontolo.com) and assumed it was a PR piece. Also, Betteridge's law of
headlines gave me my answer already (spoiler: it's "no").

~~~
benwills
Hi Joe. I just saw this post here on HN. I'm the sole author of ontolo.

About your comment on the ontolo domain and assuming it was a PR piece: is
that because you know of ontolo as a marketing tool, or because it wasn't on a
more generalized site like Medium?

Also, if you know of any other crawlers, etc, that are faster than ontolo (in
the public domain, or public claims), let me know. There aren't a lot of folks
I know who are working on this sort of thing and not also working at search
engines. It'd be neat to meet some other folks who are interested in the
problem.

~~~
joekrill
Hey Ben. It's because it was immediately apparent it would be a totally biased
piece. When companies write about their own products, they usually do so from
a completely biased perspective, in their favor, because they want whoever is
reading it to favor their product. Microsoft.com isn't going to feature an
article proclaiming OSX is the better operating system, for example -- so
anything comparing the two on that site is going to be heavily biased towards
Windows. If I'm looking for something that's actually informative, whatever I
find there isn't going to be much use to me.

~~~
benwills
That makes sense. I don't know if this factors in, but I didn't post the link
here. And if I ever did post a link here to ontolo, the HN post title would be
very different.

And I agree that few, if any, folks who talk about their own stuff are able to
do so objectively.

------
Cyph0n
Interesting read. I'm assuming throughput is a key requirement for the system,
but I wonder how much more difficult writing and maintaining the codebase is
compared to a "safer" but slower language such as Java, Scala, or Go.

~~~
benwills
I just saw this post. I'm the sole author of ontolo.

As for requirements, they're very low. The machines running the
crawler+parser+indexer, which achieve those speeds, are dual Xeon 5639s with
72GB of RAM. When the indexer isn't running, the crawler and parser use 6
cores, but only "need" 4 of the 24 total on the box. The indexer generally
uses very little CPU; all of those cores are really only needed for queries.
So the limitation on throughput, at this point, is the 1Gbps connection to the
box. I say 'box', singular, because the 400,000,000/day is per box.

As for maintaining the codebase, it's about 10k lines, plus another 10k of
data that's referenced throughout. Since I've written everything in the
crawler and parser by hand, there are no magic boxes or libraries the data
goes into, so debugging means working through only the logic I've constructed,
as opposed to someone else's.

That said, I've designed it to use as few system resources as possible, unless
it would run faster using a uint32_t vs a bit field, for example. If someone
else were to begin working with my code, I'm unsure of what that would be like
for them. I'd like to believe it's generally straightforward, but I can't say
until that happens.

I've also designed it to be easily updated as the way HTML is written
continues to evolve. I know that's a bit hand-wavy, but it's as much as I'm
comfortable saying about how it works behind the scenes, since no one else I
know of (other than the major search engines) parses HTML the way ontolo does.

If you have any other questions, just let me know. I'll be monitoring this
post today.

- Ben

~~~
Cyph0n
Thanks for the reply!

That's truly impressive - around 4.6k rps, including all of the additional
post-processing involved - on a _single_ machine! I'm also quite surprised by
the lines of code: ~20k is quite small for a project of this complexity
written in C.

~~~
benwills
It's my pleasure. I love talking about this sort of stuff.

As for the request rate, I just couldn't figure out how to max out the gigabit
connection on Linux with epoll. I ended up moving to FreeBSD and kqueue, and
when I did, I was finally able to comfortably max out the connection. I could
have just been doing something wrong with epoll, but I couldn't figure it out,
and neither could a friend who looked at it. My guess is that some limitations
of the epoll API, ones that kqueue resolves, just weren't cutting it for that
much incoming data.

I briefly considered writing my own tcp stack to run on raw sockets, but I'm
quite happy with FreeBSD's network stack, which seems to be more than
sufficient.

As for the size of the project, toward the end it kept shrinking as I
eliminated redundant logic. There might be more ways to improve it, but I'm
pretty happy with where it is.

~~~
ppsscc
Ben, ontolo is a truly awesome product. What is the tech stack you're using?
Any open source projects like Lucene? Also, are you hosted in the cloud, e.g.
on AWS? Just curious.

~~~
benwills
I think that everything I'm willing to list publicly is in that post.
Understand that almost all of it is written in pure C. [Edited to add: I do
not use Java in any part of ontolo.]

For example, I do use Redis for some queueing operations. But there's a
bottleneck in part of the process: not in Redis itself, but in being unable to
pop multiple items off a list at once with a blocking pop, which forces either
multiple requests over a socket, or two separate data stores (one to notify,
another to hold the data). So I'm in the process of writing my own queue for
the crawler system.

I'm guessing that your question comes from a curiosity or desire to build
something similar, or something similarly performant. I don't know if this is
a useful perspective or not, but it's mine: that every time I don't know
what's happening to a byte of data and how it's processed, there's the
possibility of some unknown amount of inefficiency there, and that
inefficiency could be small or large. But the not knowing is the problem. So
I've tried to eliminate as many of those as possible. I don't always get it
right, and I'm sure there are many areas for improvement. But when I find a
performance bottleneck, I can go in and edit my own code, rather than having
to learn someone else's codebase, one written to handle edge cases and use
cases I'll never need, and then either modify their code (and maintain that
modification against their future updates) or end up writing my own anyway. So
I just write my own from the start and keep refining it.

I'm certainly in no position to give any sort of programming advice to anyone,
but that's just the philosophy I've had from the beginning while I worked on
this. And I think it's an important one that's helped me get the performance
I've achieved with it. It's a conversation I have often with other friends who
are programmers and who make more liberal use of tech stacks. But those are
often problems of a very different nature than the ones I'm interested in
solving.

And if you do hope to build something similar, or something different but
similarly performant, be sure to test everything. You don't need amazing
tests, just consistent tests with consistent units of measure. Learn where the
compiler optimizes things out, and how to make sure that doesn't happen, to
ensure a level of quality in your tests. (I will often insert an integer that
gets incremented by a random number from a fast xorshift PRNG at the critical
point of measurement, to increase the odds the compiler won't optimize it
out.) And when you're working at the byte level, keep refining the operations
that run most frequently, and test every possible way you can imagine doing
them.

~~~
ppsscc
I have already built similar large-scale crawling systems using forks of
Apache Nutch, Lucene, and projects from the Hadoop ecosystem. To be honest,
with my limited knowledge I just couldn't believe the numbers in the blog
post, so I came over to HN for smarter people to shed some light. But the
master himself has revealed most of the answers. Thank you. We're not into
what Ontolo does; that's just too complex for us.

------
web007
No, Google is.

~~~
ijamj
https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines

