

Ask HN: Do you know a good resource for a large data scraping job? - hugo31370

My company, Easy Vino (easyvino.com), is gearing up for a beta release and we need to populate our database with wine lists. The job consists of extracting information from wine lists (which we already have, usually as PDF, HTML or pictures) and putting it into our database.

We have a simple back office that connects to a wine API to search for wine info, and we need help inputting the data. I'd rather have the same person (or team) do this, as the learning curve is significant.

Does anyone know a cheap resource for this type of task? Any help or reference is appreciated.

Thanks a lot!
======
devs1010
I'm not sure exactly what sort of answer you are expecting. Unless the data
you want is in a standardized format (such as a standardized XML schema), any
effort to extract data would require writing custom parsers for each set of
data that has a different structure. Are you asking for advice on which
technology stack to use for writing this, or are you looking for a pre-made
tool that can extract this for you? There may be some tools that can "attempt"
to do this without requiring you to write custom code, but I am not sure how
effective they would be.

~~~
hugo31370
I believe it has to be a person. I've used Mechanical Turk in the past and
it's great for easy, simple tasks. This one requires a little learning, which
means sticking to one person/team would be best because they can quickly get
faster and more efficient.

I'm looking for advice on companies or people you've used in the past that you
liked. Thanks!

------
ig1
The typical way of doing this is to use Mechanical Turk. There are some
third-party services (their name escapes me) built on top of MTurk to provide
reliability.

The typical way they do this is to have two different people enter the data
and when there's a mismatch have a supervisor decide which is right.
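
A minimal sketch of that double-entry check, assuming each wine list line gets a numeric ID and is keyed in independently by two people. The `WineEntry` fields, the `normalize` rules and the `reconcile` function are all hypothetical names chosen for illustration, and the sample rows are made up; this is not how any particular service implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WineEntry:
    # Hypothetical fields; a real wine list would likely need more.
    name: str
    vintage: str
    price: str


def normalize(entry):
    """Lowercase and collapse whitespace so trivial typing differences don't count."""
    fields = (entry.name, entry.vintage, entry.price)
    return tuple(" ".join(value.lower().split()) for value in fields)


def reconcile(first_pass, second_pass):
    """Compare two independent passes keyed by line ID.

    Returns (accepted, needs_review): lines where both passes agree, and the
    IDs of lines that disagree or are missing from one pass, which a
    supervisor would then check by hand.
    """
    accepted = {}
    needs_review = []
    for line_id in sorted(first_pass.keys() | second_pass.keys()):
        a = first_pass.get(line_id)
        b = second_pass.get(line_id)
        if a is not None and b is not None and normalize(a) == normalize(b):
            accepted[line_id] = a
        else:
            needs_review.append(line_id)
    return accepted, needs_review


# Made-up sample data: line 1 differs only in case and spacing, line 2 in vintage.
first = {
    1: WineEntry("Chateau Margaux", "2005", "450"),
    2: WineEntry("Opus One", "2010", "300"),
}
second = {
    1: WineEntry("chateau  margaux", "2005", "450"),
    2: WineEntry("Opus One", "2011", "300"),
}
accepted, needs_review = reconcile(first, second)
print(needs_review)  # [2]
```

Only the mismatched lines reach the supervisor, so the extra review cost depends on how often the two passes disagree.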

~~~
hugo31370
I've used Mechanical Turk in the past for easy tasks. This one requires a
little learning, and I feel people get a lot faster even after one day. My
concern with MTurk is having different people all the time, which is a lot
less efficient. To give you a number, right now it takes me 1-2 minutes to add
a line, whereas it takes someone new 5-8 minutes. That's the kind of
learning curve I'm hoping for if I hire the same person to do this for 2-3
weeks.

Do you know if MTurk can offer this? Thanks a lot!

~~~
ig1
I believe you can create a custom group of qualified workers on MTurk.

~~~
hugo31370
thanks

------
polyfractal
You might have good luck just hiring some cheap Virtual Assistants to do this
work for you. oDesk or Elance are pretty good for these types of
administrative tasks.

~~~
hugo31370
Thanks! Do you know anyone in particular?

~~~
polyfractal
Alas, I don't have any personal experience with hiring a VA - I've just
listened to Rob Walling talk about them a lot.

I've used oDesk several times for other things, though, and it works fairly
well. My suggestion is to write up a short "test" project and hire five or
six competent-looking VAs. Give them a hard deadline (two hours tops, etc.) and
see who completes the job in a satisfactory manner.

Some will totally suck, some will never get back to you, and a few will be
awesome. If your entire group sucks, ditch them and move on to a new group of
candidates.

Once you find someone that is good and in your price range, give them a larger
task and see how it goes.

~~~
hugo31370
thanks! I'll try that

