
Introducing the Priceonomics Analysis Engine and API - ryan_j_naughton
http://priceonomics.com/introducing-the-priceonomics-analysis-engine-and/
======
ecesena
Some ideas/feature requests:

1\. extract twitter/fb/etc. contacts

2\. (optionally) follow about/contact pages to improve contact information

3\. improve Facebook likes if the website declares an FB page/app (as an
example, check out theneeds.com: we have 14k likes, but priceonomics says 203)

~~~
rohin
Thanks for the feedback! Right now the Social analyzer pulls how many people
shared, liked, or commented on that one specific page on FB, rather than how
many followers a Page has or how many users an FB App has. We could add that
per your suggestion, though.

We currently use it to see how popular various articles are on the web, since
FB likes are a good proxy for overall traffic. FB likes are public, whereas
page views typically aren't.
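The per-URL counts described above can be fetched from Facebook's Graph API. Here's a minimal sketch, assuming the `fields=engagement` form of the URL-object endpoint (field names and access-token requirements vary across API versions; older versions returned a bare `shares` field instead):

```python
import urllib.parse

GRAPH_ENDPOINT = "https://graph.facebook.com/"

def build_share_query(page_url, access_token=None):
    """Build a Graph API request URL for a specific page's engagement counts.

    The `fields=engagement` form is an assumption about the current Graph
    API; an app access token is typically required for real requests.
    """
    params = {"id": page_url, "fields": "engagement"}
    if access_token:
        params["access_token"] = access_token
    return GRAPH_ENDPOINT + "?" + urllib.parse.urlencode(params)

def share_count(response):
    """Pull the share count out of a decoded Graph API response dict."""
    engagement = response.get("engagement", {})
    return engagement.get("share_count", 0)

# Example response shape (hypothetical values):
sample = {"id": "http://example.com/article",
          "engagement": {"share_count": 203, "comment_count": 12}}
print(share_count(sample))  # 203
```

Note that this counts engagement with one URL, not followers of a Page, which is exactly the distinction drawn above.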

~~~
ecesena
I meant extracting the fact that the page's twitter account is @..., and that
the Facebook page is fb.com/...
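The kind of extraction being requested here can be sketched with a couple of regexes. This is a naive illustration (the function name and the filtered path lists are hypothetical); a real extractor would parse the DOM and also check `<meta>` tags such as `twitter:site` and `og:url`:

```python
import re

def extract_social_handles(html):
    """Pull Twitter handles and Facebook page names out of raw HTML."""
    twitter = set(re.findall(r'twitter\.com/(\w{1,15})', html))
    facebook = set(re.findall(r'facebook\.com/([\w.\-]+)', html))
    # Drop common non-account paths that appear in share widgets
    twitter -= {"share", "intent", "home"}
    facebook -= {"sharer", "sharer.php", "plugins"}
    return {"twitter": sorted(twitter), "facebook": sorted(facebook)}

page = ('<a href="https://twitter.com/theneeds">Follow us</a> '
        '<a href="https://facebook.com/theneeds">Like us</a>')
print(extract_social_handles(page))
# {'twitter': ['theneeds'], 'facebook': ['theneeds']}
```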

------
jobposter1234
Love what you guys are doing! In a similar space in terms of structuring the
web (although I'm not doing general purpose tools like you are).

Can you comment on why I'd use your API instead of writing my own stuff? I'm
very comfortable with crawlers and data extractors, don't mind running my own
infrastructure. I've played with Kimono and Import.IO quite a bit, but they
get in my way and actually slow down my work.

Also, do you have any advice on gathering info from social media? I've found
some places are fairly liberal (facebook), some are locked down tightly. I'd
love to know what the professionals do in terms of infrastructure and scraping
policy too...

~~~
bwood
The main reason we think there needs to be a platform for crawling and
structuring data is that it's a huge hassle to reinvent the wheel every time.
If you have experience writing crawlers and have been able to afford the time
to learn to write them well, you've basically already reinvented the wheel.
But now you also have to maintain the wheel. We think it would be better if
people started working together on the problem of structuring the web. Besides
running infrastructure (which has its own challenges), there is really no
reason why everyone needs to build their own extractor for every website.
People generally want to extract the same things, and just a handful of
quality implementations should be enough to satisfy almost everyone.

One of the keys to actually having an extractor repository is providing
incentive for people to build and maintain the extractors, which is actually
where we're headed once we allow developers to start building their own
applications on the Analysis Engine.

If you're good at extracting structured data from HTML, you should continue
writing your own extractors and even consider selling access to them. On the
other hand, if you can't be bothered learning the art of extracting data, why
not pay someone to use their commercial grade extractor?

So, to summarize, you should use our API so that you can better leverage your
time and contribute to the creation of even better structuring tools.

We really like the stuff that Kimono and Import.IO are working on. They are
significantly lowering the barriers to getting started at
extracting/structuring data, which is great for everyone. Of course there are
limitations to what their tools can do, but that's what you'll see every time
someone attempts to simplify a complicated process. We aim to be the glue that
connects people with data acquisition, structuring, and analysis tools.

Crawling social media is pretty tough because social platforms tend to be very
reserved about access to their data (probably due to the enormous amount of
personal data they have). It's a disturbing trend for sites to require an
account before they even let you past the landing page. It's basically not
publicly available data anymore. That's actually not something we have much
experience doing, since we haven't had any customers ask for it yet.

------
mbesto
How do you get around sites blocking the IP of your engine?

------
zeeshanm
One thing you may want to do is parse HTML to get relevant data into a
structured format so the end user doesn't have to write a post-parser.

~~~
bwood
Thanks zeeshanm! We actually have a lot of parsers that do just that; we just
haven't made them available on the engine yet. We're also looking at ways to
let people create their own parsers and make them available to other users,
since we think that's the best way to get to a fully structured web. Are there
any sites in particular that you think we should focus on first?

~~~
zeeshanm
I was thinking more along the lines of creating a dynamic parser. It's an
interesting and challenging problem, and I have thought about it for some
time. There may need to be some human intervention involved, but by design it
should be dynamic.
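One way to sketch such a dynamic parser is to keep the parsing code generic and push the per-site knowledge into a small rule table, which is where the human intervention comes in. A minimal illustration using Python's stdlib `html.parser` (the class and rule names are hypothetical, not part of the Analysis Engine):

```python
from html.parser import HTMLParser

class RuleParser(HTMLParser):
    """A tiny 'dynamic' parser: per-site rules map a field name to a
    (tag, class) pair, so the parsing code stays generic and only the
    rule table needs human curation."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules    # e.g. {"title": ("h1", "headline")}
        self.active = None    # field currently being captured
        self.result = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        for field, (want_tag, want_class) in self.rules.items():
            if tag == want_tag and want_class in classes.split():
                self.active = field

    def handle_data(self, data):
        if self.active and data.strip():
            self.result.setdefault(self.active, data.strip())
            self.active = None

rules = {"title": ("h1", "headline"), "price": ("span", "price")}
parser = RuleParser(rules)
parser.feed('<h1 class="headline">Widget</h1><span class="price">$9.99</span>')
print(parser.result)  # {'title': 'Widget', 'price': '$9.99'}
```

Pointing the same class at a different site is then just a matter of writing a new rule table, not new parsing code.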

