

Better Search Doesn't Mean Beating Google - limeade
http://bits.blogs.nytimes.com/2009/03/09/better-search-doesnt-mean-beating-google/?hp

======
augustiner
The most fundamental problem with natural language search engines is that the
"natural language" part is more a limitation than a feature to me. Natural
language is meant for people to communicate with other people and not with
computers. I believe that a well designed keyword/tag based search combined
with factual auto suggestions extracted from formal/semantic sources (similar
to wikipedia) could be far more efficient for people to use and computers to
run.

~~~
micks56
I am in law school and as a law student I spend lots of time searching through
past cases. My searching is done almost exclusively through Westlaw, the
online database of Thomson West products.

They have many different types of searches, but the two applicable to this
discussion are (as they are written on the site) both "terms & connectors" and
"natural language." T&C works well using your standard OR/AND/etc. However,
natural language works so much better even though you type in the exact same
words.

The natural language search returns cases more on point and has one awesome
feature: the most relevant text is in red type, set apart from the rest of the
case. From the natural language search West is better able to determine what
the legal researcher wants and shows it to him.

I spent almost 2 years of law school searching using terms and connectors
because I thought the same thing as you do. But I recently converted when I
realized West returns better results from their natural language search.
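
The contrast described above, with the same words producing different results in the two modes, can be sketched in a few lines. This is not how Westlaw works internally (that is not public in this thread); it is a minimal illustration, assuming a strict AND filter for "terms & connectors" and a simple TF-IDF-style score as a stand-in for "natural language" ranking. The names `DOCS`, `boolean_and` and `ranked` are made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus, for illustration only.
DOCS = [
    "Petition for modification of a custody order in Massachusetts",
    "Massachusetts divorce lawyers directory",
    "Custody modification standards: material change in circumstances",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def boolean_and(query, docs):
    """Strict AND: keep only documents that contain every query term."""
    terms = set(tokenize(query))
    return [i for i, doc in enumerate(docs) if terms <= set(tokenize(doc))]

def ranked(query, docs):
    """Score each document by summed tf * idf of matching terms, best first."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for toks in tokenized for word in set(toks))
    scored = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = sum(tf[t] * math.log(1 + n / df[t])
                    for t in set(tokenize(query)) if t in tf)
        if score > 0:
            scored.append((score, i))
    # Highest score first; ties broken by document order.
    return [i for score, i in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

With the query "massachusetts custody modification", the strict filter keeps only the first document, while the ranked version returns all three, ordered by how many of the terms they share with the query.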

~~~
gord
@micks56 - re your legal search using natural language -

Are you able to compare, say, West's results to those of a pure Google text
search on the keyword terms?

[ To do that you'd need some example of large legal texts fully online and
thus indexed by Google - I don't know if that exists ]

It's sometimes hard to discern the value of the tech versus the quality of the
implementation + usability factors - but your observations are interesting. I
wonder how search on medical information compares...

gord.

~~~
micks56
I am trying to think of a test that can be run on my West search engine and
Google. West's legal resources dwarf Google's. Google might as well be
considered non-existent in the area compared to West or LexisNexis. That is
just what those two companies do. They have people that enter cases into
databases as they become available. Google just doesn't do that.

I haven't thought of a fair test to run yet. The two engines do different
things. My West can search case decisions, statutes, administrative codes,
briefs filed to the court, secondary sources (sort of the research paper of
the legal field) and the news.

So I tried doing a search on the news only. I searched "ycombinator" and the
results returned are news articles only, whereas on Google someone probably
wants and gets the YC home page, this site, or the actual function. None of
those show up on the West site.

Then I ran a search of these terms on each (I didn't enter quotes on the
actual searches): "massachusetts custody modification"

On Westlaw, I get cases, and statutes on point. With extra terms I will easily
get to cases that deal with my specific issue. On Google, the first link is a
divorce resource site and the rest are for lawyers.

Searching statutes might work. But the main reason statutes search well on
Google is the Cornell Law site. The quality of results for statutes is
probably a bigger testament to them and their cataloging efforts.

I would say both search engines hit their target markets well. Most people
searching "massachusetts custody modification" don't want 20 decisions of the
Mass SJC. And people searching the same on Westlaw don't want attorneys.
Google is much much faster though. It returns in a fraction of a second.
Westlaw took about 12 seconds to return 10,000 hits. First three hits were
decided yesterday, which is pretty cool.

There is a group of people creating an open legal database. I can't remember
its name. I think they are based in the San Fran area. I think it was started
by some hacker that worked on opening up some other government data and is now
on the court system. I have the bookmark buried somewhere and of course can't
find it. Does anyone know which one I am talking about? We could maybe test
that database versus the commercial West one.

~~~
gord
Thanks for the write-up... interesting to see how things develop in the real
world outside your own domain.

I'm surprised the big G hasn't just paid some money to get that data, given
their plan to scan all the world's books.

I wonder what percent of all text is legal or medical.

~~~
micks56
I doubt West, LexisNexis, or any other legal aggregator will sell the
information to Google. Those companies make a lot of money selling it to
lawyers on a monthly subscription basis. They also do some value-add to the
materials. What I see on West or LexisNexis is more than just the publicly
available decision. West and Lexis employ lawyers to create summaries and
other helpful things for the legal researcher.

There certainly is a lot of legal text. Lawyers certainly are good at creating
volumes of paper. For example, the Supreme Court just decided a case, _Wyeth
v. Levine._ It will be recorded in volume 555. So to date the Supreme Court
decisions have filled 554 volumes of 1000 pages each. And that is just one
court. Every state court, state appeals court, and state supreme court,
federal court, land court, etc has similar volumes and page counts.

And all of this is just the primary sources. Once you add secondary sources,
aka books and papers written by learned scholars on individual topics or
cases, the number of books and pages increases by orders of magnitude. And we
still haven't archived any statutes (those go on forever, for each state) or
any administrative law. And each one of those has comment sections that go on
for pages whereas the actual rule is only a paragraph.

I wonder what percentage this is, too. I bet it is still extremely small
compared to what the rest of the world has produced. There are so few law
writers when compared to all other writers.

~~~
gord
That's a lot of text. The few patents I've read strike me as quite verbose. I
was quite amazed at what was patentable, and how loosely described
{ephemeral!} the descriptions were. I'm not suggesting all legal text is as
sparse in information.

We could certainly do with a better text search for patents.. but I wonder if
that's possible unless a form of restricted prose is used that makes the text
less obtuse/verbose.

Maybe an algorithm can reduce the common legal motifs and replace them with
shorter versions thus refactoring legal-speak into human-readable prose on
which text search can be effective.
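
A rough sketch of the mechanical part of that idea: a lookup table of stock phrases and shorter replacements, applied with one regular expression. The table `PLAIN` and the function `simplify` are hypothetical and made up for this example; nothing here checks whether a substitution preserves the legal meaning of the original wording.

```python
import re

# Hypothetical phrase table, for illustration only.
PLAIN = {
    "in the event that": "if",
    "prior to": "before",
    "null and void": "void",
}

def simplify(text, table=PLAIN):
    """Replace each stock phrase in `table` with its shorter form."""
    # Try longer phrases first so overlapping entries resolve predictably.
    phrases = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    # Table keys are lowercase, so look up the lowercased match.
    return pattern.sub(lambda m: table[m.group(0).lower()], text)
```

Note that the replacement does not restore capitalization at the start of a sentence; that is fine if the output only feeds a search index.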

[ For some reason this reminds me of the law student drama series 'the paper
chase'. ]

How well is the information hyper-linked? Presumably one paper references many
previous rulings, and you'd jump around a lot in researching issues.

~~~
micks56
Thank you for reminding me of patents. I forgot to mention those. A patent is
a completely different entity compared to case law. Case law and case
briefs/motions written by lawyers have to be short, concise, to the point, and
logical. The judge will quickly (in a matter of a few seconds) ignore your
argument if he has to spend any time figuring out what you have to say.

That leads all of our law professors to drill into our heads brevity, clarity,
and conciseness in everything we write. But as you mentioned, patents have a
completely different audience and goal.

I am amazed at what is patentable too. I wrote a research paper arguing
against software patents. The professor that graded my paper disagreed with
the position very much. I wrote mine a few days before the Court of Appeals
for the Federal Circuit heard the _Bilski_ case. When the decision was
rendered this past fall they made some law that is similar to what I argued. I
should go show the professor the paper he marked down and the _Bilski_
decision. But I digress...

Patents are a land grab. The goal is to get the vaguest, broadest patent
possible and protect the most space. And the legal-speak is there because
those words have been litigated time and time again and they have a known
meaning to the courts. As soon as you write a new phrase you open yourself to
debate in front of the court. A macro to convert legal-speak to human-readable
prose should be used at the researcher's own peril.

We are told time and time again: read the case for yourself. Do not read
anyone else's summary. And don't paraphrase words unless you know to stay away
from the special ones.

Example: There was a contracts case where the contract says "only use pipe
made by Company A to build my house." The builder uses pipe from Company B.
The court ruled against the home buyer because "only use pipe by Company A"
doesn't actually mean that! It means use pipe similar to the quality of
Company A! So translating the legal speak required to really get a builder to
use pipe from Company A into "only use pipe from Company A" will result in
failure.

The information is hyper-linked very well. I wish I could show you, but I
can't. My student access to the site is restricted to school use only. I am
pretty sure I will be violating the TOS by posting any of the information.

But every case cited is linked. Those are the most important. Judges are
linked to other decisions. Same with arguing lawyers. Statutes are linked.
Footnotes are linked. Obscure terms are linked. For example, a medication will
have a link but a legal term of art will not.

I just wish the search and the site overall were faster. Sometimes the
navigation is quirky, too.

------
mixmax
The article argues that one thing is having a successful technology, and
another is having a successful business. This is, of course, true - but
there's a significant correlation between the two. Relational databases,
Google's search algorithm and the light bulb were all influential technologies
that founded hugely successful companies.

There's a rule of thumb saying that your solution to a problem has to be 200 to
300% better than the existing state of the art for it to be adopted
successfully without artificial help (marketing $, monopolies, etc.) and
Google certainly lived up to that when it went live. It remains to be seen
whether Wolfram Alpha will.

~~~
vaksel
It won't.

It'll probably be another case of Powerset or Cuil, lots of hype by the
company in question that is impossible to live up to.

------
kailashbadu
You can’t beat Google only by developing a superior technology. You’ll still
be left with the Herculean task of drilling your search engine deep into the
minds of hundreds of millions of users around the world. Google the brand is
far mightier than Google the technology. It’s Google’s redoubtable
omnipresence and visibility that make it tick.

~~~
sketerpot
Microsoft tried, and is still trying, with their Live Search. It's the default
search in Internet Explorer, and the Microsoft brand is very well known. But
you know what the problem is? The search results just aren't as good!

I tried out MS Live Search, thinking that a company with as much money to
throw at things as Microsoft would probably offer something of similar quality
to Google. I quickly got frustrated when most of the search results were
nothing like what
I was looking for. Meanwhile Google consistently gave exactly what I wanted
near the top of their results page.

What I'm saying is that technology does matter. There are other brand-name
titans in the world.

------
gord
No. Better search does mean beating Google.

But you don't have to play within Google's rules... there's lots of territory
between text search [=cool] and Natural Language [=sucks, to a first order
approximation].

For example, just treat data as a graph of tagged pieces of text.. and give a
good web interface to that. Bypass all the RDF, semantic web hype and just
make something workable, usable. A wiki for data.

Anyone looking for a co-founder? I'm working on this.
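
For a sense of what "a graph of tagged pieces of text" could look like as a data structure, here is a minimal in-memory sketch. The class `TagGraph` and its method names are hypothetical, not taken from any existing project, and a real version would need persistence and a web front end on top.

```python
from collections import defaultdict

class TagGraph:
    """Pieces of text as nodes, with free-form tags and undirected links."""

    def __init__(self):
        self.text = {}                  # node id -> text
        self.tags = defaultdict(set)    # tag -> ids of nodes carrying it
        self.links = defaultdict(set)   # node id -> ids of linked nodes

    def add(self, node_id, text, tags=()):
        """Store a piece of text and index it under each of its tags."""
        self.text[node_id] = text
        for tag in tags:
            self.tags[tag].add(node_id)

    def link(self, a, b):
        """Connect two nodes in both directions."""
        self.links[a].add(b)
        self.links[b].add(a)

    def find(self, *tags):
        """Sorted ids of nodes that carry every one of the given tags."""
        if not tags:
            return sorted(self.text)
        sets = [self.tags.get(tag, set()) for tag in tags]
        return sorted(set.intersection(*sets))

    def neighbors(self, node_id):
        """Sorted ids of nodes linked to the given node."""
        return sorted(self.links.get(node_id, set()))


g = TagGraph()
g.add("wyeth", "Wyeth v. Levine", tags=["case", "supreme-court"])
g.add("bilski", "In re Bilski", tags=["case", "patent"])
g.add("note1", "Notes on software patents", tags=["patent"])
g.link("bilski", "note1")
```

Here `g.find("case", "patent")` narrows by tag intersection and `g.neighbors("note1")` follows links, which covers the two basic moves: filter by tags, then walk the graph.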

