
I'm working on a startup that has web scraping at its core. The vision is a bit larger and includes fusing data from various sources in a probabilistic way (e.g. matching the same people, products, or companies that appear on different sites with ambiguous names and information; this is based on the research I've done at uni). However, I found that there are no web crawling frameworks out there that allow for large-scale, continuous crawling of changing data. So the first step has become to actually write such a system myself, and perhaps even open source it.
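To make "continuous crawling of changing data" a bit more concrete, here's a minimal sketch of an adaptive recrawl scheduler: pages that keep changing get revisited more often, pages that don't get pushed back. All the names and intervals are made up for illustration, not taken from any existing framework.

    import heapq
    import time
    import hashlib

    class RecrawlScheduler:
        """Toy priority-queue scheduler: pages that change often are revisited sooner."""

        def __init__(self, min_interval=3600, max_interval=86400 * 7):
            self.min_interval = min_interval
            self.max_interval = max_interval
            self.intervals = {}   # url -> current revisit interval (seconds)
            self.last_hash = {}   # url -> hash of last fetched content
            self.queue = []       # (next_fetch_time, url)

        def add(self, url):
            self.intervals[url] = self.min_interval
            heapq.heappush(self.queue, (time.time(), url))

        def record_fetch(self, url, content):
            digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
            changed = digest != self.last_hash.get(url)
            self.last_hash[url] = digest
            interval = self.intervals[url]
            # Shrink the interval when the content changed, grow it when it didn't.
            interval = interval / 2 if changed else interval * 2
            interval = max(self.min_interval, min(self.max_interval, interval))
            self.intervals[url] = interval
            heapq.heappush(self.queue, (time.time() + interval, url))

        def next_due(self):
            # Return the next URL whose revisit time has passed, if any.
            if self.queue and self.queue[0][0] <= time.time():
                return heapq.heappop(self.queue)[1]
            return None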

In terms of use cases, here are some I've come across:

- Product pricing data: Many companies collect pricing data from e-commerce sites. Latency and temporal trends are important here. Believe it or not, there are still profitable companies out there that hire people to manually scrape websites and input data into a database.

- Various analyses based on job listing data: Similar to what you do by looking at which websites contain certain widgets, you can start parsing job listings (using NLP) to find out which technologies are used by which companies. Several startups are doing this. Great data for bizdev and sales. You can also use job data to understand technology hiring trends, infer the long-term strategies of competitors, or use it as a signal for the health of a company. (There's a rough sketch of the extraction idea after this list.)

- News data + NLP: Crawling news and extracting the facts mentioned in it (using Natural Language Processing) in real time is used in many industries: finance, M&A, etc.

- People data: Crawl public LinkedIn and Twitter profiles to understand when people are switching jobs/careers, etc.

- Real-estate data: Understand pricing trends and merge information from similar listings found on various real estate listing websites.

- Merging signals and information from different sources: For example, crawl company websites, Crunchbase, news articles related to the company, and LinkedIn profiles of employees, then combine all the information found in the various sources to arrive at a meaningful structured representation. This isn't limited to companies; you can probably think of other use cases.
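Here's the rough sketch I mentioned for the job-listing idea: a toy keyword-dictionary extractor plus per-company aggregation. A real pipeline would use a proper NLP model and a much larger curated vocabulary; every name here is invented for illustration.

    import re
    from collections import Counter

    # Hypothetical technology vocabulary; a real system would curate or learn this.
    TECH_TERMS = {"python", "django", "react", "postgresql", "kafka", "kubernetes", "spark"}

    def extract_technologies(listing_text):
        """Return the set of known technologies mentioned in one job listing."""
        tokens = re.findall(r"[a-z0-9+#.]+", listing_text.lower())
        return TECH_TERMS.intersection(tokens)

    def company_tech_profile(listings_by_company):
        """Aggregate per-company technology counts across many listings."""
        profile = {}
        for company, listings in listings_by_company.items():
            counts = Counter()
            for text in listings:
                counts.update(extract_technologies(text))
            profile[company] = counts
        return profile

    # Example:
    # company_tech_profile({"Acme": ["We use Python, Django and PostgreSQL ..."]})
    # -> {"Acme": Counter({"python": 1, "django": 1, "postgresql": 1})}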

In general, I think there is a lot of untapped potential and useful data in combining the capabilities of large-scale web scraping, Natural Language Processing, and information fusion / entity resolution.

Getting changing data with low latency (and exposing it as a stream) is still very difficult, and there are lots of interesting use cases for it as well.
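To make the "exposing it as a stream" part concrete, here's a toy sketch that polls pages and yields change events. It assumes you supply a fetch function and uses content hashing as the change test; it's a flavour of the idea, not a production design.

    import hashlib
    import time

    def change_stream(urls, fetch, poll_interval=60):
        """Yield (timestamp, url, content) whenever a page's content hash changes.

        `fetch` is any callable you supply that returns the page body as a string.
        """
        seen = {}
        while True:
            for url in urls:
                body = fetch(url)
                digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
                if seen.get(url) != digest:
                    seen[url] = digest
                    yield (time.time(), url, body)
            time.sleep(poll_interval)

    # Consumers subscribe to the stream, e.g.:
    # for ts, url, body in change_stream(product_pages, fetch=my_http_get):
    #     push_to_queue({"ts": ts, "url": url, "body": body})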

Hope this helps. Also, feel free to send me an email (in my profile) if you want to have a chat or exchange more ideas. Seems like we're working on similar things.




Would love to hear more about your use case, Denny -- sending you a PM.

As for web-scale crawling of particular verticals such as products and news, you might want to try: http://www.diffbot.com/products/automatic/

We're planning on releasing support for jobs, companies, and people later.

(disclosure: I work there)


How do you use a probabilistic approach to scraping data? Were you able to get a low number of false positives?


Sorry for the confusion. The probabilistic methods are used for "merging" scraped data from various sources, not in the scraping process itself. For example, they help figure out whether similar-sounding listings on related websites refer to the same "thing".

If interested, take a look at this paper (and related ones): http://www.cs.ubc.ca/~murphyk/Papers/kv-kdd14.pdf
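A toy sketch of the matching step I mean, with invented fields, weights, and threshold (a real system learns these, and the paper above covers the probabilistic fusion side properly):

    from difflib import SequenceMatcher

    def text_sim(a, b):
        """Crude string similarity in [0, 1]."""
        if not a or not b:
            return 0.0
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Invented field weights; in practice you'd learn these from labelled pairs.
    WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}
    MATCH_THRESHOLD = 0.75

    def match_score(listing_a, listing_b):
        """Weighted similarity over shared fields of two scraped listings (dicts)."""
        return sum(
            w * text_sim(listing_a.get(field, ""), listing_b.get(field, ""))
            for field, w in WEIGHTS.items()
        )

    def same_entity(listing_a, listing_b):
        return match_score(listing_a, listing_b) >= MATCH_THRESHOLD

    # Example:
    # a = {"name": "Joe's Pizza", "address": "12 Main St", "phone": "555-0100"}
    # b = {"name": "Joes Pizza NYC", "address": "12 Main Street", "phone": "555-0100"}
    # match_score(a, b) comes out around 0.85 here, so same_entity(a, b) -> True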


That makes more sense. Thanks! I'll check out the paper. I was hoping you had some revolutionary new scraping method.


>I found that there are no web crawling frameworks out there that allow for large-scale and continuous crawling of changing data.

Are you distinguishing between "I found that there are no" and "I didn't find any so far"?

Which ones that came close have you rejected, and why?


I can't say for sure that there are none, but I believe I've done quite a bit of research. If there really were an excellent web crawling framework, it should have bubbled up to the top.

I don't remember the names of all the projects I've looked at, but the main ones were Nutch, Heritrix, Scrapy, and crawler4j. I've come across several companies/startups that have built their crawlers in-house for the same reasons (e.g. http://blog.semantics3.com/how-we-built-our-almost-distribut...).



