

 Garbage In, Garbage Out: Why Scraping Doesn't Work for Local Search  - mcxx
http://blog.nelso.com/2009/07/garbage-in-garbage-out-why-scraping.html

======
jack_deneut
The blog post I wrote wasn't primarily about the legality of scraping (and I
also didn't expect it to be read by more than a few people). But as that seems
to be the topic of the thread, here's my response.

The courts found that it isn't possible to copyright facts, and that's all we
were scraping - things like addresses, business names, and phone numbers. We
weren't even scraping things like business category, because something as
simple as putting a restaurant in the category "Fine Dining" might be
considered a judgment call and therefore value-add by the original site.

And think of what would have happened if the court had found otherwise (i.e.
had found that lists of facts could be copyrighted). If you opened a store,
and I was the first one to put your address and phone number on-line, no one
else could ever include your address or phone number on their site. Even if
you created a website for your own business after I published your address,
you wouldn't be able to include it on your site, because you'd violate _my_
copyright.

I can't see how the Supreme Court could have ruled any other way.

------
mshafrir

      "We've tried scraping ourselves in the past (yes, it's perfectly legal),"
    

Is scraping indeed "perfectly legal"?

~~~
roc
Yes. Facts can't be legally protected.

A compilation is protected only as a completed work. You can't, for example,
photocopy the yellow pages and resell it. But you could copy each and every
phone number from the yellow pages and use those numbers in your competing
directory.

~~~
imp
Is that true in all cases? There's dispute about who owns sports statistics
because the leagues are trying to claim they have licensing rights to them.

~~~
coderdude
These would be the same people who claim the right to control who may show
their broadcasts to whom, no? I don't see how anyone can have licensing rights
to stats regarding events.

------
jshen
There will always be garbage in. Your algorithms have to overcome this for
the most part. Some things have to be dealt with manually, and some things
could be, but it's impossible to manually verify tens of millions of local
listings.

~~~
jack_deneut
That was the point of the post. Without some manual oversight of every new
listing, too many errors will creep in for the database to be truly useful.

Local search is not like general text search, where one expects that many of
the results returned are not relevant, are wrong, are spam, etc. and can be
ignored. Having the incorrect address or phone number for a business on a
local search site is a major irritant to the user, and will quickly erode
one's user base.

~~~
jshen
I think you missed my point. Manual oversight of tens of millions of listings
is not remotely practical. The article points out bad data in Google, and it
hasn't eroded their user base ;) Don't you think there is a reason Google
hasn't manually reviewed each listing?

~~~
tokenadult
_The article points out bad data in google, and it hasn't eroded their user
base ;)_

I don't trust Google at all for local information. I use social networks and
ask my local friends rather than Google when I'm looking for a restaurant or
other business in my town.

~~~
jshen
apples and oranges

------
mbarr
It looks like it still needs a lot of work. As a quick test I looked for
Sports Bars in London (via their categories) and it returned an Antique Shop
in Westerham. I then tried editing the record to remove irrelevant categories
and got a server error.

~~~
jack_deneut
The only cities we have good data for are Prague, Copenhagen, and lower
Manhattan. Try <http://www.nelso.com/cz/prague/>

