
17 Year Old Builds a Better Search Engine - adebelov
http://live.wsj.com/video/young-scientist-builds-a-better-search-engine/6F3049AB-61FE-4CEA-8972-45EFA94FFF27.html?mod=wsj_article_tboright#!6F3049AB-61FE-4CEA-8972-45EFA94FFF27
======
tlrobinson
The biggest reason I immediately switched to Google when I first tried it was
that it only matched documents containing exactly the words I searched for,
nothing more.

It's fine (and good) if search engines add more intelligence like this, but
I'll always need a way to search for exact phrases. The default behavior of
Google is much "fuzzier" than it was 5 years ago, so I'm surprised they don't
already do something like this (or do they?)

~~~
rblackwater
That's a good point. It's sometimes difficult trying to search for programming
topics now. "Did you mean _UIView_?" No, Google, I really did mean NSView.

~~~
haberman
"Did you mean" is just a suggestion. It doesn't change your search results
AFAIK.

~~~
wahnfrieden
If it's confident enough, it does.

~~~
haberman
Yes but that's a different message ("Showing results for..."), and there's a
link to force a search for the original term. For me it's right at least 50%
of the time that it does this (probably much more, but I can confidently say
>50%), so it's a net win for me personally.

------
tg3
That interviewer was terrible:

> How long have you been interested in computer coding, and searching, and
> things like that?

If you're going to have someone do an interview about computer science, at
least let the interviewer be someone with a cursory knowledge of the field.

~~~
unohoo
Your comment is just ignorant.

The interviewer wasn't terrible. If the interview were being shown to an
audience of primarily geeks, your argument would make sense. Given that the
audience of that video is probably a mix of geeks and non-geeks, I think it
was appropriate. After all, if he starts talking in terms of Markov chains,
how many people in his audience are going to understand that?

~~~
sekm
Agree. However, if I were to say anything bad about the interviewer, I'd go
with the line:

> "Good luck in the future and I have a feeling we're going to be seeing more
> of you in the (uhh..) future."

That's just bad interviewing.

~~~
unohoo
Come on -- cut him a break. You can't analyze each and every word and
sentence like that, unless it's a major goof-up.

~~~
alttab
Dude, it's his job to form coherent, well-structured sentences in ad hoc
interviews; it's in the job description. My thought was that this guy either
doesn't take his job seriously, doesn't know jack about the field, or isn't
cut out for that line of work.

------
cantbecool
If I have to see another story titled 'XYZ-YEAR-OLD BUILDS XYZ', I'm going to
go thermonuclear war on HN. Adolescents have been exposed to high-speed
internet and technology since before they can even remember. What do you
expect is going to happen, honestly?

~~~
lumberjack
While the internet was probably a good source of information, I'd put it down
to mentorship as the decisive factor. In all of these cases, the successful
teenagers seem to be lucky enough to have a good private mentor, be enrolled
in a mentorship program, or attend a high school famous for the encouragement
and resources it provides to its enterprising students.

In contrast, many of us would have been told to "aim for a simpler project"
if we had proposed something like that to our computer science teachers.

~~~
a_bonobo
Your post summarizes Malcolm Gladwell's book Outliers, on extremely
successful people, quite well:

it's mostly a) being born to rich parents who can afford to pay for
extracurricular activities and good schools, or who have the time to
"improve" their kids, or b) being around schools or universities that can
recognise the talent and give talented kids the input and resources they
need. Both a) and b) are pure luck.

For an example, check out Bill Gates' early life:
<https://en.wikipedia.org/wiki/Bill_Gates#Early_life>

Rich parents, exclusive school (with their own computer! unheard of at the
time) and local companies that let Gates and other kids screw around on their
computers. Luck!

~~~
OmegaHN
I both agree and disagree with your post. I do agree that people's
discoveries and successes are very much based on luck (i.e. external
factors), but I don't think personal accomplishment, progression, and overall
contribution are luck-based. Sure, becoming a Mark Zuckerberg takes luck, but
the only difference between him and other people (assuming sufficient time
spent on a certain subject) is that the general public liked and found a use
for Zuckerberg's idea more. Essentially, success is luck-based, but genius is
not.

~~~
jahewson
I don't like this term "genius", as there is no such thing. We need to move
on from the idea of "being a genius", a modern phenomenon originating from
the older phrase to "have a genius", where _genius_ is the Latin word that
gave us _genie_. When we call someone a genius, we're saying that they _are_
a mythical all-knowing being, which seems to absolve us from having to think
about why a person is the way they are, and why we are not. It is the
intellectual equivalent of burying our heads in the sand when we see someone
brilliant, and it ensures that we'll never really understand or appreciate
success.

~~~
wamatt
You've touched on something real here.

The term 'genius' bothers me too, because calling someone a genius seems to
be less about the object (i.e. the person we are talking about) and more
about the speaker's _feelings_ toward that object (e.g. godlike awe).

------
enjo
A much better explanation:

<http://www.youtube.com/watch?v=fmxNuVDJZEY>

~~~
sosuke
One of those ideas that, after you hear it, you think: yeah, that makes total
sense, why isn't this already being done? Very cool project to be working on
while still in high school.

Edit: This is from 2011? Anyone know if something usable came out of it?
<http://www.theglobeandmail.com/technology/science-fair-gold-medalist-17-invents-better-way-to-search-internet/article600329/>

~~~
ehsanu1
Who says it isn't already being done? I'm pretty damn sure Google and Bing do
some sort of topic modeling (LDA, LSA or something else) that basically does
the same thing - though the exact method is not the same.

Watching the TED talk, he seemed to be comparing his search results to that of
some academic search engine, which are generally very dumb and purely keyword
based. Google does a whole lot more than that. Kudos to the guy for figuring
this all out on his own, but I don't think this research is as original as
portrayed by the media here.

------
wickedchicken
Armchair analysis of his algorithm after watching his TED talk: a version of
LSA that uses PageRank instead of a straight SVD to calculate rankings.

LSA[1] has been around since the 80s and is used in many applications, from
GRE testing to Apple's junk mail filtering[2]. It's been used a lot since the
patent expired; it's relatively good and can be computed quickly. Of course,
a lot of text-retrieval research has happened in the past few decades, one of
my favorites being LDA[3], which rests on a much sounder statistical basis
than finding lower-dimensional representations of term-document vectors.
Unfortunately, LDA's model is not directly computable, and answers must be
determined via Monte Carlo methods.
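
For anyone who hasn't seen LSA before, here's a minimal sketch of the core
idea (a toy example of my own, not any production implementation): build a
term-document count matrix, take a truncated SVD, and compare documents in
the reduced "latent topic" space.

```python
import numpy as np

def lsa_doc_vectors(docs, k=2):
    """Toy LSA: term-document counts -> truncated SVD -> k-dim doc vectors."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-document count matrix: rows are terms, columns are documents.
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[index[w], j] += 1
    # Truncated SVD keeps only the k strongest latent "topics".
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T * s[:k]          # one row per document, k latent dims

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = ["cat feline pet", "feline cat whiskers", "stock market trading"]
vecs = lsa_doc_vectors(docs, k=2)
# The two cat documents land close together in latent space even though they
# don't share every word; the finance document lands elsewhere.
```

The real systems do far more (weighting, stemming, huge sparse matrices), but
this is the whole skeleton of the technique.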

As for 'independence,' his terminology gets a little confused here. At first
I thought he was talking about the 'bag-of-words' assumption that most large-
scale language models make. These effectively ignore grammar (other than
stemming) in order to efficiently determine the 'gist' of a document without
its intricacies. However, his videos imply he is talking about word-sense
disambiguation[4], which is certainly well known and was the crux of LSA in
the first place. If he _is_ talking about lifting the bag-of-words
assumption, there has been some interesting work going on, such as [5]
(disclaimer: I am a coauthor on that paper).

If you're interested in this stuff, I highly recommend trying out the LSA
demo server at [6] (it can get swamped sometimes, so don't kill it) and David
Blei's LDA implementation at [7]. The LDA-C inputs and parameters are a
little obtuse when you first look at them, and I don't have my notes on how
to use it at the moment, but if you play around with it, it should make
sense.

This kid is crazy smart, and I hope he gets exposed to a lot of really cool
research since he can obviously pull off a lot at a young age. Best of luck to
him.

[1] <http://en.wikipedia.org/wiki/Latent_semantic_analysis>

[2]
[http://developer.apple.com/library/mac/#samplecode/LSMSmartC...](http://developer.apple.com/library/mac/#samplecode/LSMSmartCategorizer/Introduction/Intro.html)

[3] <http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation>

[4] <http://en.wikipedia.org/wiki/Word-sense_disambiguation>

[5] <http://aclweb.org/anthology-new/D/D12/D12-1020.pdf>

[6] <http://lsa.colorado.edu/>

[7] <http://www.cs.princeton.edu/~blei/lda-c/>

~~~
bambax
This kind of comment is what I come to HN for, so thank you.

I use OpenCalais to compute keywords for texts; where does it stand vs. [6]
and LSA in general?

~~~
Dn_Ab
At their scale I doubt they are using LSA or anything of similar complexity.
My guess would be something like tf-idf coupled with anything from naive
Bayes to logic programming for entity recognition.

The final output would likely use some hand-curated rules on how to combine
the above with information from curated databases like DBpedia.
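
This isn't OpenCalais's actual pipeline (which isn't public), just a toy
sketch of the tf-idf guess above: score each term by its frequency in the
document times the log-inverse of how many documents contain it, and keep the
top scorers as keywords.

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus, top_n=3):
    """Toy tf-idf keyword picker: score(w) = tf(w) * log(N / df(w))."""
    N = len(corpus)
    df = Counter()                      # document frequency of each term
    for d in corpus:
        df.update(set(d.split()))
    tf = Counter(doc.split())           # term frequency in this document
    scores = {w: tf[w] * math.log(N / df[w]) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "markov chains model random walks",
]
# "cat" and "mat" appear in only one document, so they outscore the
# stopword-ish terms that occur in every document.
```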

________________

There is this thing called random projection, based on the observation that
in high dimensions, most directions are nearly orthogonal to each other. This
is a wonderful observation because it allows us to write stupendously simple
(and fast) algorithms that give good results for mathematically sound
reasons.

I strongly suggest that anyone who is interested in LSA but wants something
scalable, simpler to write, and faster look into _Random Indexing_.
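
A toy illustration of the random-indexing idea (corpus and dimensions are
made up for the example): every word gets a fixed sparse random "index
vector", and a word's meaning vector is just the sum of the index vectors of
the words it co-occurs with -- no SVD required.

```python
import random

DIM, NONZEROS = 64, 4

def sparse_random_vector(rng, dim=DIM, nonzeros=NONZEROS):
    """A mostly-zero vector with a few random +/-1 entries."""
    v = [0.0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        v[pos] = rng.choice([-1.0, 1.0])
    return v

def random_indexing(docs, seed=0):
    rng = random.Random(seed)
    index_vec = {}                 # fixed random index vector per word
    context = {}                   # accumulated meaning vector per word
    for d in docs:
        words = d.split()
        for w in words:
            if w not in index_vec:
                index_vec[w] = sparse_random_vector(rng)
        for w in words:
            ctx = context.setdefault(w, [0.0] * DIM)
            for other in words:
                if other != w:     # add each neighbour's index vector
                    for i, x in enumerate(index_vec[other]):
                        ctx[i] += x
    return context

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

ctx = random_indexing(["cat purrs", "feline purrs", "stock falls"])
# "cat" and "feline" share the context word "purrs", so their vectors agree;
# "stock" lives in an unrelated context and stays far away.
```

Because nearly-orthogonal random vectors barely interfere with each other,
this incremental summing approximates the similarity structure that LSA
recovers with an expensive SVD.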

------
econner
Essentially: PageRank where nodes are words and edges are occurrence in the
same document.

Really cool idea. Great work. I love to see this kind of stuff.
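
A minimal sketch of that framing (toy code, not the student's actual system):
link every pair of words that co-occur in a document, then run power-
iteration PageRank over the word graph.

```python
def word_graph(docs):
    """Undirected graph: an edge between every pair of co-occurring words."""
    adj = {}
    for d in docs:
        words = set(d.split())
        for w in words:
            adj.setdefault(w, set()).update(words - {w})
    return adj

def pagerank(adj, damping=0.85, iters=50):
    """Plain power-iteration PageRank over an adjacency dict."""
    n = len(adj)
    rank = {w: 1.0 / n for w in adj}
    for _ in range(iters):
        new = {}
        for w in adj:
            inbound = sum(rank[v] / len(adj[v]) for v in adj if w in adj[v])
            new[w] = (1 - damping) / n + damping * inbound
        rank = new
    return rank

ranks = pagerank(word_graph([
    "search engine index",
    "search query",
    "search ranking engine",
]))
# "search" co-occurs with every other word, so it ends up ranked highest.
```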

------
naner
_...for the social-media world_

His search is focused on short text blurbs from social media. Not replacing
Google.

EDIT: Though the technique may work well for general search.

------
hokua
Definitely Bill Gates's Mini-Me.

------
saalweachter
It's not really applicable to this particular story -- high school student
vs. VC-funded startup -- but what really bugs me about all the "next Google"
reporting is that it ignores the biggest gotcha of search: corpus size
matters.

The secret sauce of a successful search engine isn't the core algorithm,
which handles 90% of the work 90% of the time; it's the millions of tweaks
which keep the millions of pathological cases from making every query
useless. On a small enough corpus, even grep does a great job. But once you
are trying to search the world, it stops being an insight problem; at scale
there are so many corner cases that you can't ignore them.

------
DigitalSea
In this day and age, building "X is better than well-known product Y" means
nothing. You could build a car with five times the gas mileage of any car
around today, and I bet you'd be hard-pressed to make a dime for the first
few years while competing with the likes of Ford, Toyota, or General Motors.
People don't care about things being better when it comes to the web; people
stick with what they know, and if you're a Google user like me who is also
invested in Gmail, Google Docs, Google Analytics, Google AdWords, etc., then
you're in too deep to switch to any other search engine.

I remember Cuil when it launched. They were onto something great, made some
pretty bold claims, and in many ways were better than Google at search, and
look what happened to them: nobody cared, and they died in a huge internet
tire fire. Sure, the issues with poor results didn't help, and people wanted
them to succeed, but let's be honest: at best Cuil would have ended up like
Bing (a few users but nothing to gloat about). People are too lazy to switch
to anything new, especially when it comes to search; it takes time to woo a
user away from another product that still does the job perfectly well.

Having said that, this kid is 17 and he's done a f*cking amazing job. How
many people can say that at the age of 17 they built anything remotely as
cool as this? I'm sure some can, but not many. If he keeps on this path,
he'll be achieving bigger things than this in his 20s and 30s, and it'll be
well deserved. The interviewer was pretty bad, though; he had no clue
whatsoever, which was an insult to the kid being interviewed, who deserves at
least an interviewer with a remotely above-average IQ.

~~~
chc
Cuil didn't die because people were "too lazy to switch to something new." It
died because, among the few who did know what Cuil was, opinions ranged from
"disappointment" to "laughingstock". Remember "Cuil theory", in which degrees
of disconnection from reality were measured in Cuils?

Cuil was ambitious, and that was laudable and did get it some attention. But
unfortunately it was nowhere near good enough for the level of hype they
built. Many of these "Better than the big guy" products have the same problem,
albeit to a lesser degree — they are better by some very specific metric, but
that metric isn't the one that most people associate with the product's value.
In Cuil's case, relevant and accurate results were what people wanted, but
Cuil was worse on those axes than ancient search engines like Mamma.com.

~~~
EdiX
> Cuil didn't die because people were "too lazy to switch to something new."
> It died because among the few who did know what Cuil was, opinions ranged
> from "disappointment" to "laughingstock".

Cuil died the day it was announced. They went live with great PR fanfare and
a corrupted index; everyone marginally interested in it saw it while it was
broken, which is how it got its reputation as a laughingstock.

~~~
chc
Let's not forget the Cpedia pivot, where they turned their search engine into
a sort of Dadaist Wikipedia. That was probably an even bigger mistake than
launching a broken search engine (and more ambitious, too).

------
nschiefer
Hi, this is Nicholas, a long time lurker on HN and the person in the video. I
saw this thread during my morning commute to work (and was very surprised, to
say the least!) and wanted to register to mention a few important details that
the news articles always omit. Hopefully this helps correct a few
misconceptions!

To begin, I'd like to flatly deny that I "built a better search engine." I did
my (very academic) work in information retrieval and developed a new algorithm
that seems to give significantly better search results (when compared to other
academic search techniques, more on this later) on short documents like
Twitter tweets. Specifically, my algorithm uses random walks (modelled as
Markov chains) on graphs of terms representing documents to perform a type of
semantic smoothing known as document expansion, where a statistical model of a
document's meaning (usually based on the words that appear in the document) is
expanded to include related words. My system is in no way, shape, or form a
"search engine" or even comparable with something like Google---rather, it is
an algorithm that could help improve search results in a real, commercial
search engine.
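
To make "document expansion via random walks" concrete for readers, here's a
toy sketch of the general idea (my own illustration, not Nicholas's actual
algorithm or data): put probability mass on the short document's own terms,
take a few steps of a random walk with restart over a term co-occurrence
graph, and read the resulting distribution as the expanded document model.

```python
def expand_document(doc_terms, adj, steps=2, restart=0.5):
    """Toy document expansion: a short random walk with restart on a term
    graph. Returns a probability distribution over terms, including related
    words that never appear in the document itself."""
    base = {w: 1.0 / len(doc_terms) for w in doc_terms}
    dist = dict(base)
    for _ in range(steps):
        # With probability `restart`, jump back to the document's own terms;
        # otherwise step to a uniformly random neighbour in the term graph.
        new = {w: restart * base.get(w, 0.0) for w in adj}
        for w, p in dist.items():
            for nb in adj.get(w, ()):
                new[nb] = new.get(nb, 0.0) + (1 - restart) * p / len(adj[w])
        dist = new
    return dist

# A term co-occurrence graph from some (hypothetical) background corpus.
adj = {
    "obama": {"president", "election"},
    "president": {"obama", "election", "whitehouse"},
    "election": {"obama", "president"},
    "whitehouse": {"president"},
}
expanded = expand_document(["obama"], adj)
# "president" and "election" gain probability mass even though the one-word
# "document" never mentions them.
```

The expanded distribution can then be scored against a query the same way the
original bag of words would be, which is what makes this useful for very
short documents like tweets.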

My work is not, by far, the first to attempt document expansion. A number of
related techniques already exist, including pseudo-relevance feedback
expansion, translation models, some forms of latent semantic indexing, and
some of those mentioned by exg. However, to my knowledge, the knowledge of my
science fair judges (some of whom are active IR researchers), and the
knowledge of my research mentor (also more on this later), my work is a novel
method (not a synthesis of existing methods) that seems to work quite well in
comparison to other, similar algorithms on collections of small documents
like tweets.

The last point is certainly important: it is simply impossible to compare my
algorithm to something like Google, for several reasons. First, I'm not a
software engineer or a large company; it is downright impossible for me to
craft a combination of algorithms like that found in Google to get comparable
results. No commercial search engine would be so foolish as to use only a
single algorithm (essentially a single feature, from an ML perspective).
Instead, they use hundreds or thousands. Second, it is essentially impossible
to compare search engines with any level of scientific rigour. I evaluated my
system using a standard corpus of data published by NIST as part of TREC (the
Text REtrieval Conference), consisting not only of 16+ million tweets, but
also of sample queries and the correct, human-determined results for these
queries. However, to achieve statistically comparable results, many variables
have to be controlled in a way that is impossible with a large, complex search
engine. Instead, the academic approach compares individual algorithms one-on-
one and postulates that these can be combined to give better search results in
aggregate.

Specifically, my research showed that my system achieved above-median scores
on the official evaluation metrics of the 2011 Microblog corpus when compared
to research groups that published last November. Furthermore, my system did
the best of all of the "single algorithm" systems, including those that used
other document expansion techniques like I described above.

Most of my work was spent on the development of the algorithm, proofs of its
convergence and asymptotic complexity, a theoretical framework, and a
statistical analysis of my results. Notably absent from this list is
engineering. My project is not, by any means, "a toy engineering project" as
some commenters have suggested. Actually, the engineering in my project is
quite poor, as that area is not one I've had much exposure to.

To briefly address my research mentor: my parents had nothing to do with my
project other than providing emotional support when I was stressed. I had a
research mentor at a university who I found after I did very well at the 2011
Canada-Wide Science Fair. He provided me with important computational and data
resources (such as the corpus I used), but did not develop my algorithm,
proofs, or code, which were my own work.

Given the recent attention to my project (and Jack Andraka's project on
cancer detection), I'd like to point out a general trend in news articles
about science fair projects. In general, the media has a tendency to focus on
the potential applications of a project and ignore the science in it, leading
to (seemingly fair) criticism. Using me as an example, the talk about "toy"
projects and "synthesis" is fair given how the work is portrayed in the
media. Somehow, "novel IR algorithm based on Markov chain-based document
expansion," even with careful (and thorough!) explanation, gets turned into
"Teen builds a better search engine." Similarly, a great friend (and
roommate) of mine had his project on drug combinations to treat cystic
fibrosis completely shredded on Reddit when it got significant media
attention last year. In his project, he never once claimed or tried to claim
that he had done anything with immediate (or even near-term) medical
applications. Instead, he discussed his work to identify molecules that bind
to different sites on the damaged protein and can work synergistically as
drugs. The media spin machine quickly turned this into "Teen cures cystic
fibrosis" and other such nonsense. Even Jack's project (I know both him and
his project), which is unusually "real world", has been overspun by the
media. It's just what happens. Heck, people even make fun of it at
upper-level science fairs, but it still happens.

Finally, thank you for the encouraging words! To finish with a shameless
plug: while fairs like ISEF tend to be very well funded (because of the
positive publicity), many regional and state (in the US) or national (outside
the US) youth science organizations struggle to find funding (and even
volunteers) to run the fairs that send people to ISEF. If you ever find
yourself in a position where you can help (financially, with your time,
whatever), I'd strongly encourage it. Given the impact science fairs have had
on my life, I know that I certainly will.

~~~
SiVal
Wonderful work, Nicholas. I'm adding you to my list of good role models for my
little boys. The 17-yr-old sports stars get most of the press, because their
accomplishments are easily seen. You are the equivalent of a 17-yr-old
basketball star with ten years of training behind him, but you play on a court
that is nearly invisible (mathematical terrains and state spaces).

You are the kind of role model I want for my boys. Now, I have my work cut out
for me explaining to them why. ;-)

------
koide
I was instantly reminded of David Evans (from UVA and Udacity fame:
<http://www.cs.virginia.edu/~evans/>) listening to this kid, both physically
and in his manner of speech.

I really expect to... eeeh... see more of him in the future.

------
PetroFeed
I know very little about search, but I love how he looked at the graph
element, exploring relationships between "entities".

The relationships between entities in our world hold so much information, and
yet in most databases they're reduced to a join between tables. Mapping the
relationships and capturing that "hidden" information, thereby making it
available for use, unlocks amazing potential.

------
akrymski
This kid would definitely make a great marketing person. There's absolutely
nothing new in what he's done, but he's presenting it really well; thumbs up
to his parents, who have probably done quite a bit of the work ;)

When I was 18 (10 years ago) I did loads of research into that stuff and knew
just as much about information retrieval, vector-space tf-idf models, latent
semantic indexing, WordNet analysis, etc. At the time it was fairly
cutting-edge research; this stuff isn't anything new now. I was actually
forced to decipher research papers instead of reading popular books on the
subject. It was fairly obvious to me back then that none of these techniques
worked well for general web search. I did end up building a system that
clustered Google search results (in real time) into DMOZ categories, letting
you refine your search results by clicking a category (which was actually
useful and worked quite well in case you were searching for something
ambiguous like "jaguar").

None of these techniques are new to anyone working in information retrieval.
Just looking at co-occurrence of words in tweets and expanding the query with
some related terms (weighted appropriately) would probably achieve what he has
done (weekend project for an average dev).

I'd call this kid really smart if he'd actually figured out how to improve
general web search, or could at least think of a useful application. Talking
about existing research and making it look like your own isn't great form, in
my opinion. Coming up with your own definition for a "word" just makes you
look stuck-up. You're much better off acknowledging the work of other
researchers and quoting them, although I guess that would never generate as
much press.

Sorry if the rant is quite negative; I'm just getting a bit fed up with all
the marketing surrounding "young geniuses" and teenage entrepreneurs these
days. If I wanted to read that stuff, I'd get the local paper.

~~~
nschiefer
Most of your feedback is perfectly fair. My algorithm was not meant for
general search, nor am I a "young genius" or anything of the sort. Similarly,
techniques like tf-idf, latent semantic analysis, document modelling, and
even pseudo-relevance feedback expansion are no longer "cutting edge".

However, your blanket characterization that "there's absolutely nothing new"
in my work and that I just talked "about existing research" while "making it
look like [my] own" is somewhat offensive. Based on a fairly extensive review
of the literature, the algorithm that I developed is novel and seems to
outperform a number of these standard techniques on short documents like
tweets. As for "just looking at co-occurrence," that's essentially a type of
pseudo-relevance feedback expansion and is, of course, well known and easy to
implement.

Please realize the difference between a news interview published online and
the paper I submitted to the fair when assessing the novelty of my work and
when suggesting that I made no reference to others' work.

Regarding my "definition for a 'word'", I apologize for appearing pretentious.
I was asked to speak on the conference's theme of "redefinition" in relation
to my research and did the best I could.

Finally, I think it's kind of strange that you automatically assumed that my
parents worked on my project. Neither of them is even familiar with the
details of the project. My work was my own, and your automatic assumption of
what is tantamount to broad academic fraud and plagiarism assumed
particularly bad faith. In any case, thanks for your honest feedback; I try
to be careful about how I come across, and I'll be even more mindful in the
future.

------
richardburton
What I found amazing was the contrast between the interviewer's grasp of the
English language and the interviewee's.

------
Miner49er
He read the Harry Potter books when he was 6?!

~~~
godDLL
The first time I read Bulgakov's "Master & Margarita" I was 7. I enjoyed it
very much.

But it was only two years later, when I read it again, that I was left with
no questions. You see, the first time I read it, many things in it were more
grown-up than my understanding of such matters could be at that age.

And unless it's "Harry Potter and the Methods of Rationality", I'm not that
impressed anyway.

~~~
madmax108
<sarcasm> Yes, we're all geniuses here </sarcasm>

I read The Master & Margarita when I was 9 (or 10) AND my first Harry Potter
around the same time... and yes, I enjoyed both of them thoroughly! :)

------
lwat
Has Google not been doing this for ages?

~~~
ehsanu1
Almost surely - though with a different technique, perhaps. The only "study"
I've seen of it is here:
<http://www.seomoz.org/blog/lda-and-googles-rankings-well-correlated>

------
spaghetti
If it's so great why aren't I using it right now? After my first Google search
in 2000 I was hooked.

Does the article actually link to anything related to his computer coding? Oh
wait here it is: <http://www.cuil.com>

~~~
hokua
Well he is only 17. Did you know about Markov chains when you were 17?

~~~
srj55
Not to diminish his work in any way, but Markov chains are (were?) part of
the senior finite math curriculum in Ontario.

E.g., we used them to examine weather-prediction models in grade 12.

~~~
hokua
Wow. The most wonkish thing I remember from high school is learning about the
Balmer series (atomic spectral emissions).

