Ask HN: What info do you web scrape for?
129 points by cblock811 on Aug 11, 2014 | 110 comments
I have been web scraping for several months and am starting to teach it at Meetups. I'm lucky enough to work for a company that has a few pre-crawled copies of the web that I can query against and a distributed processing platform to speed up any scraping I do.

I'm running out of ideas for what to build, though. I build scrapers to produce content for the company based on the data and insights I find, but they are usually marketing verticals, such as finding all websites that use feedback tools (I search for their JavaScript widgets) and doing analysis on that info.

So if you had these resources, what would you be looking for? I love building tools that help people so any feedback/ideas would be great!

I'm also open to hearing what you would scrape for on the live web. I find that if I'm doing broad analysis then the pre-crawled copies are best, and for specific sites/information I use the live web.




There is a cottage industry for scraping any sort of data that can move markets: Fed, crop, weather, employment, etc.

Anything that is released at a certain time on a fixed calendar, you can bet that multiple parties are trying to scrape it as fast as possible.

If you can scrape this data (the easy part), put it in a structured format (somewhat hard) and deliver it in under a few seconds (this is where you get paid), then you can almost name your price.

It's an interesting niche that hasn't been computerized yet.

If you can't get the speed then the first 2 steps can still be useful to the large number of funds that are springing up using "deep learning" techniques to build a portfolio over timelines of weeks to months.

To answer the question below:

> Wouldn't this require a huge network of various proxy IPs to constantly fetch new data from the site without being flagged and blacklisted?

This is why I gave the caveat of only looking at data that comes out at certain times. That way you only have to hit the server once, when the data comes out, or at least a few hundred times in the seconds leading up to the data's release :)
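For a sense of what that looks like in practice, here's a bare-bones sketch of the "poll hard right around release time" pattern; the URL, release time, and readiness check are all placeholders, not anyone's actual setup.

    import time
    import datetime as dt
    import requests

    RELEASE_AT = dt.datetime(2014, 9, 5, 8, 30, 0)   # hypothetical 8:30 a.m. release
    URL = "https://example.gov/releases/latest.csv"  # placeholder endpoint

    # Idle until a few seconds before release, then poll in a tight loop.
    while dt.datetime.now() < RELEASE_AT - dt.timedelta(seconds=5):
        time.sleep(0.5)

    while True:
        resp = requests.get(URL, timeout=2)
        if resp.status_code == 200 and resp.content:  # naive "is it out yet?" check
            payload = resp.content
            break
        time.sleep(0.05)

    # Structuring the payload and delivering it to subscribers -- the hard and
    # valuable parts -- would go here.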


That is a fairly surprising opportunity. I have experience monitoring/scraping thousands of government websites for a different purpose. Considering some government sites have a round trip of well over 5 seconds, it seems like it'd be a fun challenge to parse, format, and deliver it that fast.

What types of data formatting are you talking about here? Would it require a unique template for each individual site?


> deliver it in under a few seconds(this is where you get paid) then you can almost name your price.

Wouldn't this require a huge network of various proxy IPs to constantly fetch new data from the site without being flagged and blacklisted?

Or are you referring to the time from when you scrape the data to when you deliver it being under 3 seconds?


My understanding is that you need to deliver the data with latency measured in milliseconds, and even then that might not be fast enough, since some traders have had direct access to the releases themselves. Here are a couple of articles in the WSJ --

"Speed Traders Get an Edge" - Feb 6, 2014 - http://online.wsj.com/news/articles/SB1000142405270230445090...

"Firm Stops Giving High-Speed Traders Direct Access to Releases" - Feb 20, 2014 - http://online.wsj.com/news/articles/SB1000142405270230377550...


A bit off topic, but if I were to scrape such data without any intention of selling it, and instead use it myself... how fast are stock markets? Surely the price of the stock would keep rising over the next few days, so buying it, say, an hour after the news hit a major news site would still make me a profit? If not, why not? I mean, surely you can find someone selling that stock at all times, no?

edit: replaced mysql with myself


>It's an interesting niche that hasn't been computerized yet.

That's quite an assertion. I'm certain it has been.


I work as a research analyst for a Canadian provincial opposition party. Most government data is in terrible HTML tables, often dynamically generated, and almost none of it is in an easily machine-readable format. I spend a lot of time downloading PDF files of data and converting them to JSON (a rough sketch of that step follows the list below).

I have two main recurring scrapes:

- political donations. Every donation to a political party in my province above ~$300 is posted publicly on a gov't website (in a PDF). I use the data to run machine learning algorithms to predict who is most likely to want to donate to my party.

- public service expenses. My province has a "sunshine list" which publishes the salaries and contracts for all senior government officials. We grab it weekly (as once someone quits the gov't, their data disappears).
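A minimal sketch of the PDF-table-to-JSON step for sources like these; pdfplumber is an illustrative library choice and the filename is a placeholder, not necessarily what's actually used here.

    import json
    import pdfplumber  # assumed library choice for table extraction

    rows = []
    with pdfplumber.open("donations.pdf") as pdf:   # hypothetical disclosure PDF
        for page in pdf.pages:
            table = page.extract_table()
            if not table:
                continue
            header, *body = table
            rows.extend(dict(zip(header, r)) for r in body)

    with open("donations.json", "w") as f:
        json.dump(rows, f, indent=2)

Real disclosure PDFs are rarely this clean, so expect per-layout tweaks and manual spot-checks.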

One tool that you could consider building is an easily accessible expense website, where people can enter the name of a public official and see all their expenses, including a summary of the total amount spent. There have been a number of massive expense scandals here in Canada related to this [1, 2].

[1] http://news.nationalpost.com/tag/alison-redford/ [2] http://en.wikipedia.org/wiki/Canadian_Senate_expenses_scanda...


This is very interesting work. Regarding your last point, I came across this site, which is searchable by employee name or government organization: http://canada.landoffree.com/.

I think the salary and expense disclosure is only for Ontario, based on the sunshine list.


I don't know much about the civil service, but I know that most provincial assemblies have expense disclosures, at least for the elected officials, e.g. [1], [2]. Federal cabinet ministers also have to disclose all of their office expenses (can't find a link, but I know it exists).

[1] BC: http://www.leg.bc.ca/Mla/remuneration/index.htm [2] Alberta: http://alberta.ca/travelandexpensedisclosure.cfm


I've had three primary uses of web scraping. The hard part for me has never been speed. Getting the results structured is somewhere between easy and hideously complicated.

1. Reformatting and content archival (lag times of hours to days are no prob).

As an example, I put together a site to archive comments of a ridiculously prolific commenter on a site I follow. I needed the content of his comments, as well as the tree structure to shake out all the irrelevant comments leaving only the necessary context. Real time isn't an issue. Up until recently it ran on a weekly cron job. Now it's daily.

2. Aggregating and structuring data from disparate sources (real time can make you money).

I work in real estate. Leasing websites are shitty and the information companies are expensive and also kinda shitty. Where possible we scrape the websites for building availability, but a lot of the time that data is buried in PDFs. For a lot of business domains, being able to scrape data in a structured way from PDFs would be killer if you could do it! I guarantee the industries chollida1 mentioned want the hell out of this too. We enter the PDFs manually. :(

Updates go in monthly cycles, timeliness isn't a huge issue. Lag times of ~3-5 business days are just fine especially for the things that need to be manually entered.

This is exactly the sort of scraping that Priceonomics is doing [1]. They charge $2k/site/month. Hopefully y'all are making that much.

3. Bespoke, one-shot versions of #2.

One-shot data imports, typically to initially populate a database. I've done a ton of these and I hate them. An example is a farmers' market project I worked on. We got our hands on a shitty national database of farmers markets; I ended up writing a custom parser that worked in ~85% of cases and we manually cleaned up the rest. The thing that sucks about one-shot scrape jobs from bad sources is that it almost always means manual cleanup. It's just not worth it to write code that works 100% of the time when it will only be used once.

Make any part of structuring scraped data easier and you guys are awesome!

[1] http://priceonomics.com/data-services/


There are services that cover at least part of what you mentioned. These effectively provide you with a tool to visually build a scraper and then automate the scraping in the background, creating an API or spreadsheet from the data.

Import.io is one example, and I think there's another more recent YC-backed one. I tried using import.io a little while back, though, without much joy.


I think import.io is buggy, to say the least; having used it in the past to scrape some websites, it was a pain to work with. Kimonolabs is still very lacking in its ability to handle different websites; it is very much limited to a certain portion of the web. It looks like they are more about creating APIs... APIs that people are supposed to find interesting and valuable, but as with the topic of this question, a dataset seems to be valuable only to someone with a direct need for it, and by itself serves little wider interest.

Having private access to Scrape.it, I can say that it focuses strictly on being a great tool for scraping websites without costing a fortune. I know the founders and they are extremely dedicated to making a tool that can pretty much handle anything you throw at it, like AJAX, single-page apps, and crawling selected links and all sub-links beyond the page. They've just begun adding login and form support, so you should be able to play with those as well very soon. It only supports CSV output at the moment, but hopefully they will make something like API output available.


We have a non-tech intern and import.io looks like a great tool to get him chewing up data. I'm playing with it now. Why didn't it work out for you? Beyond the wrapped browser interface being a little funky lol. (Edit: eugh, selecting data for import is really clunky.)

Ask HN: Anybody got a visual scraping service they like?


It was the data extraction and selection process I couldn't get to work. I was trying to scrape a particular search on autotrader.co.uk (I wanted more up to date results than their daily emails provide, and I wanted to filter out cars that had been written off). I don't remember all the details, but I followed the tutorial video and got to the stage where you select a single item that matches your criteria and it's supposed to extrapolate from there. However I just seemed to be stuck in an infinite loop of it asking me to do this.


I found you often have to select two, then it figures it out. I assumed it was probably because of alternating odd/even row CSS classes.


Thank you for the great feedback. I have a real estate background as well and keep wanting to find a project that would benefit that industry. It feels like a lot of what's out there is stuck in the past. I would love to help fix that. Sounds like I have a new project!


Regarding 2)

Why wouldn't it work for PDFs? If you're able to get the file itself, you should be able to OCR it...

Is there anything obvious that I am missing in regards to PDFs?


I've worked with OCRed PDFs; the main thing that should be obvious is that OCR results range from poor to horrendous. It takes a lot of manual cleanup if a high degree of accuracy is required. Or, depending on why you want the text, you can adjust expectations or add layers of software such as fuzzy search algorithms to deal with the issues.

Again depending on the application, the mixed quality of OCR isn't always a deal breaker, but it's not always as simple as it might appear.


It's not the text that's the issue, it's the structure. PDFs have nowhere near as much structure as markup. You end up having to do this for dozens of layouts and it gets hurty really fast:

http://schoolofdata.org/2013/06/18/get-started-with-scraping...


There are computer vision libraries that automatically extract tables from PDFs. For example, http://ieg.ifs.tuwien.ac.at/projects/pdf2table/.

You may want to give that a try if you haven't looked at it before.


>For a lot of business domains, being able to scrape data in a structured way from PDFs would be killer

Can you mention some of those domains? I'm interested. I had worked on one such project earlier, for a financial startup.


What about legal implications? Do you get permission from the sites you crawl?


Legality of scraping is a subtle issue - I wrote up my take on it here: https://blog.scraperwiki.com/2012/04/is-scraping-legal/


[deleted]


Thank you for answering. Though private and internal, you are still using the data for profit, is that correct? Does it mean that one can't sell the data, but can use it for analysis etc and still profit from it?

I saw a service recently that emails app store reviews/ratings for a fee. Not sure if they are scraping or getting the reviews some other way. The same idea can be extended to lots of things like Amazon reviews etc. Not sure of the legal stuff though.


I deleted it 'cause I'm not comfortable having those details online. An extremely long story short, in our domain we're using the scraped data exactly as the owners intend, albeit via machines instead of people. Consult a lawyer.


How can I contact you?


My username at google's email service.


Partial plug, but very related to the topic: if you're doing large-scale analysis of the web and you don't want to have to actually run a large-scale crawl, use the Common Crawl dataset[1]! Common Crawl is a non-profit organization that wants to allow anyone to use big web data.

I'm one of the team behind the crawl itself. Last month (July) we downloaded 4 billion web pages. Thanks to Amazon Public Datasets, all of that data is freely distributed via Amazon S3, under a very permissive license (i.e. good for academics, start-ups, businesses, and hobbyists). If your hardware lives on EC2, you can process the entire thing quickly for free. If you have your own cluster and many many terabytes of storage, you can download it too!

People have used the dataset to generate hyperlink graphs[2], web table content[2], microdata[2], n-gram and language model data (ala Google N-grams)[3], NLP research on word vectors[4], and so on, so there's a lot that can be done!

[1]: http://commoncrawl.org/ [2]: http://webdatacommons.org/ [3]: http://statmt.org/ngrams [4]: http://nlp.stanford.edu/projects/glove/
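If you want a feel for how you'd actually read the data, here's a minimal sketch; it assumes boto3 and warcio, and the object key is a placeholder (the real WARC paths are listed in each crawl's manifest files).

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config
    from warcio.archiveiterator import ArchiveIterator

    # The bucket is a public dataset, so unsigned requests are typically enough.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    obj = s3.get_object(Bucket="commoncrawl",
                        Key="crawl-data/CC-MAIN-.../example.warc.gz")  # placeholder key

    for record in ArchiveIterator(obj["Body"]):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            # ... analyze the page here

In practice you'd run this sort of loop across many WARC files in parallel (e.g. on an EC2 cluster, as mentioned above) rather than one object at a time.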


What would it take in terms of resources, etc. to get to the point where Common Crawl was doing web scale crawls on a regular basis?


We're aiming to do monthly crawls from this point on. The main holdup was automating the intensive manual steps of our crawl process. Now we have scripts that make running our 100-node EC2 cluster and processing the terabytes of web data relatively trivial.

If anyone wants to discuss sourcing well distributed crawl lists for billions of pages per month, we'd love to chat. We want to make sure we cover a diverse variety of languages and domains. Given that we're trying to get a good sample of the web, that's a difficult proposition!


A lot of services with online billing refuse to send bills by e-mail, instead requiring users to log into their websites.

No doubt the companies would justify this by saying e-mail isn't secure enough. The side-effect that it'll stop many users bothering to look at their bill isn't why they do it at all, no sir.

I've been considering making a web scraper that goes to the phone company, electricity company, gas company, broadband company, electronic payslips, bank, stockbroker, AWS and so on; logs in with my credentials; downloads the PDF (or html) statements; and sends them by e-mail.

Of course, such a web scraper would need my online banking credentials, so I'm not in the market for a software-as-a-service offering.
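For the non-banking providers, the fetch-and-email part can be pretty pedestrian. Here's a rough sketch where every URL, form field, and address is a placeholder (real sites will need per-provider logic, CSRF tokens, and sometimes 2FA):

    import smtplib
    from email.message import EmailMessage
    import requests

    session = requests.Session()
    session.post("https://billing.example-telco.com/login",          # placeholder login form
                 data={"username": "me", "password": "secret"})
    pdf = session.get("https://billing.example-telco.com/statements/latest.pdf").content

    msg = EmailMessage()
    msg["Subject"] = "Latest phone bill"
    msg["From"] = "scraper@example.com"
    msg["To"] = "me@example.com"
    msg.add_attachment(pdf, maintype="application", subtype="pdf", filename="bill.pdf")

    with smtplib.SMTP("localhost") as smtp:   # assumes a local mail relay
        smtp.send_message(msg)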


I started working on a tool like this over the weekend to pull down my bills, and pay the balance.

I think there is a market for something like Mint for bill paying - it's a bit of a pain to remember when I have to pay all of my bills each month, and to make sure to go through each one and pay the balance on time.


I scrape about 60 blogs and news sites that deal with a niche topic and examine all the hyperlinks. If more than one of them links to the same page, I assume that it's a page that's generating some buzz, so I send it out in an email. It's proved to be a generally reliable assumption.


Have you thought about the case in which a blogger might be using a URL shortener? You might be missing some potentially 'buzzy' links if you are just looking at the direct URL string.


Would love it if you posted a blog article sometime about the technical details.


There's really not much to it. Scrape each site/feed every X minutes, find all the hyperlinked URLs on the page, add them to a database table ... and if they're already in the table, send out an email with links to the "buzzed about" URL, as well as all of the sites/feeds that mention it. I keep the links in the table for about a month.
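A rough sketch of that loop (sqlite + requests + BeautifulSoup here are illustrative choices, not necessarily what's actually used; the source list is made up):

    import sqlite3
    import requests
    from bs4 import BeautifulSoup

    SOURCES = ["https://example-blog-1.com/", "https://example-blog-2.com/"]  # hypothetical sources

    db = sqlite3.connect("links.db")
    db.execute("CREATE TABLE IF NOT EXISTS links (url TEXT, source TEXT, UNIQUE(url, source))")

    for source in SOURCES:
        soup = BeautifulSoup(requests.get(source, timeout=30).text, "html.parser")
        for a in soup.find_all("a", href=True):
            url = a["href"]
            seen_elsewhere = db.execute(
                "SELECT source FROM links WHERE url = ? AND source != ?", (url, source)
            ).fetchall()
            db.execute("INSERT OR IGNORE INTO links VALUES (?, ?)", (url, source))
            if seen_elsewhere:
                print("buzz:", url, "also linked from", [s for (s,) in seen_elsewhere])
    db.commit()

The real version would also filter out navigation/boilerplate links, send email instead of printing, and expire rows after a month as described.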


I scrape Gumtree and eBay hourly using a python script for certain things I want under a certain price. The script sends me an email with the link in it and I get on top of it sharpish.

Managed to bag a lot of stuff over the last couple of years for not much money.

If someone bags this up as a service I'd pay for it.


@allegory could you share the script with us?


Would love to but not at the minute because it has hard-coded credentials for the eBay API in it. It's on my list as a TODO to tidy it up. Will stick it on GitHub and post a Show HN on it soon :)

I've got one that monitors amazon prices for sudden lows as well.


I'll second a want for this script, let us know when it's up!


That'll be great! Can't wait!


I do a lot of scraping for my day job. We have a business intelligence team that will build us the reports we need from the data that we have. However, I find this process incredibly slow, and sometimes we only need to compile the data for a one-off project. I used to use VB.NET for this as that's what I started learning programming with. Now I use python/requests/bs4 for all my scraping scripts.

I've started working on a new website that will use data scraped from several vBulletin forums. I've found that even two vBulletin forums running the same version may have completely different HTML to work with. I'm assuming it's the templates they are using that change it so much.

I'm setting up the process so that the web scraping happens from different locations than the server where the site is hosted. The scraping scripts upload to the webserver via an API I've built for this. I mostly did this because for now I'm just using a free PythonAnywhere account and their firewall would block all of this without a paid account. It also means none of these sites see the scraping traffic coming from my website, etc...
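A trimmed-down sketch of that scrape-then-push flow; the forum URL, selector, endpoint, and token are all placeholders for whatever the real API expects:

    import requests
    from bs4 import BeautifulSoup

    page = requests.get("https://example-vbulletin-forum.com/forumdisplay.php?f=12", timeout=30)
    soup = BeautifulSoup(page.text, "html.parser")
    threads = [{"title": a.text.strip(), "url": a["href"]}
               for a in soup.select("a.thread-title")]     # placeholder selector

    # Push the structured results to the site's own ingestion API.
    requests.post("https://my-site.example.com/api/threads",
                  json=threads,
                  headers={"Authorization": "Bearer MY_UPLOAD_TOKEN"},
                  timeout=30)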


At Diffbot, we have an automated discussion thread parser, currently in beta testing, that might be exactly what you need. Send me a note at mike@diffbot.com and I'd be glad to hook you up.

(disclosure: I work there)


When I worked at MyEdu, I didn't actually sign on with the dev team originally — I worked on "the scraper team". We scraped college and university websites to get class schedule information: which classes were being taught, broken down by department and course number; by which professors; at which times on which days. If you're ever looking for an interesting challenge, I would encourage you to try getting this data.

Well-formed HTML is the exception rather than the rule and page navigation is often "interesting". Sometimes the school's system will use software from companies like Sungard or PeopleSoft, but there's customization within that... and of course, there's no incentive for the schools to aggregate this information in a common format (hence MyEdu's initiative), so there are plenty of homegrown systems. In short, there's no one-size-fits-all solution.

* NOTE: If you do attempt this, I insist that you teach throttling techniques from the very start. Some schools will IP block you if you hit them too hard; other schools have crummy infrastructure and will be crushed by your traffic. Scrape responsibly!
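A minimal example of the kind of throttling worth teaching from day one: a fixed delay between requests plus a backoff when the server pushes back (the delay values and user agent here are arbitrary):

    import time
    import requests

    def polite_get(url, delay=2.0, retries=3):
        """Fetch url, pausing `delay` seconds first and backing off on 429/5xx."""
        resp = None
        for attempt in range(retries):
            time.sleep(delay * (attempt + 1))   # linear backoff between attempts
            resp = requests.get(url, timeout=30,
                                headers={"User-Agent": "course-schedule-research"})
            if resp.status_code not in (429, 500, 502, 503):
                break
        return resp

    # schedule_pages = [polite_get(u) for u in school_urls]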


I've attempted to do this before for a side project (RateMyProfessor, but for textbooks!) and it's incredibly hard to do accurately. One of the (many) issues that I ran into was that some schools still have all of their course data in PDF format, in addition to the problems Cyranix listed above.

Much respect to any person or team that has to wade through this stuff.


We do a lot of live web scraping of product information from retail sites for http://agora.sh. We basically scrape all of the essential product info and offer it to the user in an optimized view (we call it the 'product portal') that can be accessed without having to load a new page. This reduces tab sprawl and provides a general improvement to a lot of shopping workflows, such as just wanting to see larger/more images of the product (likely clothing) and being able to do so with an overlay on the existing page.


My first scraping project was well over 10 years ago in college. I was a member of the education club and we wanted to get funding, so I convinced the college of education to allow us to charge $10 automatically to students in their school. But then the administration dragged their feet on giving us a list to submit to the accounting office for billing. One of the many professors submitted a list to me via Outlook that they had copied off the site, so I was able to look at the HTML structure of their list. The university used basic security (htaccess) and didn't verify that you had permission for a task once you were in. I had access because I worked for the dean of men. So I scraped all the faculties' student lists and then used another system behind the htaccess point to get all the relevant information on each student. I compiled a list of 300 students and submitted it, getting the club $3,000 in funding. The college of ed office staff were freaked out because they had no clue how I came up with the student roster (no one in their office gave it to me), but nothing came of it.

Been scraping a lot lately but mostly:

- government websites for license holders

- creating lists of businesses for different segments (market research/analysis)

- using those lists to scrape individual sites and do analysis (how many use Facebook/YouTube/etc.)


I am currently scraping for brand product and nutrition data. Having to build custom scrapers per brand is hell.

I have a dream of using something closer to OCR against a rendered page, rather than parsing the DOM. That way it would be less custom, and I could say, for instance, "find 'protein'; the thing to the right of that is the protein grams".

I, personally, don't know how to do this, but I'd be willing to pay for a more generic way to scrape nutrition data (email in profile :) )
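Not the OCR approach described above, but a rough DOM-based analogue of the same "find the label, take the value next to it" heuristic; BeautifulSoup is an illustrative choice and the HTML fragment is made up:

    import re
    from bs4 import BeautifulSoup

    html = "<tr><td>Protein</td><td>12 g</td></tr>"  # hypothetical nutrition-table fragment

    def value_next_to(label, html):
        soup = BeautifulSoup(html, "html.parser")
        cell = soup.find(string=re.compile(label, re.I))
        if cell is None:
            return None
        # Walk forward through the document until something looks like a quantity.
        for text in cell.find_all_next(string=True):
            match = re.search(r"\d+(\.\d+)?\s*g", text)
            if match:
                return match.group(0)
        return None

    print(value_next_to("protein", html))  # -> "12 g"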


There are companies out there that already have this info. Check out http://kwikeesystems.com


Take a block of text, a sentence, or a paragraph.

Build two classifiers: #1 classifies whether a sentence contains the information you want, and #2 classifies the actual data within that subset.

NLP.
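A toy sketch of that two-stage idea; scikit-learn is an illustrative choice, the training data is made up, and stage 2 is simplified to a regex extractor rather than a second classifier:

    import re
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Stage 1: does a sentence mention the fact we care about (protein content)?
    sentences = ["Contains 12 g of protein per serving.", "Ships within 3 business days.",
                 "Protein: 20 g per bar.", "Made in the USA."]
    labels = [1, 0, 1, 0]  # tiny hand-labeled toy set
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(sentences, labels)

    # Stage 2: pull the value out of the sentences stage 1 flags as relevant.
    for s in ["Each bar packs 15 g of protein.", "Free returns for 30 days."]:
        if clf.predict([s])[0] == 1:
            m = re.search(r"(\d+(\.\d+)?)\s*g", s)
            print(s, "->", m.group(1) if m else "no value found")

With a real training set, stage 2 would typically be a token or sequence classifier rather than a regex.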


I once wrote a scraper for a Yellow Pages site in Python. It pulled down the business category, name, telephone and email for every entry, and returned a nicely formatted spreadsheet. The hours I spent learning the ElementTree API and XPath expressions have paid for themselves several times over, now that I have a nicely segmented spreadsheet of business categories and email addresses, which I target via email marketing.
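A stripped-down sketch of that kind of directory scrape; the URL and XPath expressions below are placeholders, not the actual site:

    import csv
    import requests
    from lxml import html

    page = html.fromstring(
        requests.get("https://example-directory.com/category/plumbers", timeout=30).text)

    with open("listings.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["category", "name", "phone", "email"])
        for entry in page.xpath("//div[@class='listing']"):          # placeholder selector
            writer.writerow([
                "plumbers",
                entry.xpath("string(.//h2)").strip(),
                entry.xpath("string(.//span[@class='phone'])").strip(),
                entry.xpath("string(.//a[starts-with(@href,'mailto:')]/@href)").replace("mailto:", ""),
            ])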


As someone responsible for search at a yellow pages company, I can confirm that most YP websites have little to no protection against this. Company information is usually public anyway. We just make it very easy for you to get it :)


I'm working on a startup that has web scraping at its core. The vision is a bit larger and includes fusing data from various sources in a probabilistic way (e.g. the same people, products, or companies found on different sites with ambiguous names and information; this is based on the research I've done at uni). However, I found that there are no web crawling frameworks out there that allow for large-scale and continuous crawling of changing data. So the first step has become to actually write such a system myself, and perhaps even open-source it.

In terms of use cases, here are some I've come across:

- Product pricing data: Many companies collect pricing data from e-commerce sites. Latency and temporal trends are important here. Believe it or not, there are still profitable companies out there that hire people to manually scrape websites and input data into a database.

- Various analyses based on job listing data: Similar to what you do by looking at which websites contain certain widgets, you can start understanding job listings (using NLP) to find out which technologies are used by which companies. Several startups are doing this. Great data for bizdev and sales. You can also use job data to understand technology hiring trends, understand the long-term strategies of competitors, or use it as a signal for the health of a company.

- News data + NLP: Crawling news data and understanding facts mentioned in news (using Natural Language Processing) in real-time is used in many industries. Finance, M&A, etc.

- People data: Crawl public LinkedIn and Twitter profiles to understand when people are switching jobs/careers, etc.

- Real-estate data: Understand pricing trends and merge information from similar listings found on various real estate listing websites.

- Merging signals and information from different sources: For example, crawl company websites, Crunchbase, news articles related to the company, and LinkedIn profiles of employees, and combine all the information found in the various sources to arrive at a meaningful structured representation. Not limited to companies; you can probably think of other use cases.

In general, I think there is a lot of untapped potential and useful data in combining the capabilities of large-scale web scraping, Natural Language Processing, and information fusion / entity resolution.

Getting changing data with low latency (and exposing it as a stream) is still very difficult, and there are lots of interesting use cases as well.

Hope this helps. Also, feel free to send me an email (in my profile) if you want to have a chat or exchange more ideas. Seems like we're working on similar things.


Would love to hear more about your use case Denny--sending you a PM.

As for web-scale crawling of particular verticals such as products and news, you might want to try: http://www.diffbot.com/products/automatic/

We're planning on releasing support for jobs, companies, and people later.

(disclosure: I work there)


How do you use a probabilistic approach to scraping data? Were you able to get a low number of false positives?


Sorry for the confusion. They are used for "merging" scraped data from various sources, not in the scraping process itself. For example, they help in figuring out if similar-sounding listings on related websites refer to the same "thing".

If you're interested, take a look at this paper (and related ones): http://www.cs.ubc.ca/~murphyk/Papers/kv-kdd14.pdf
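For a feel of the problem, here's a very crude version of the "do these refer to the same thing?" check using plain string similarity; it is far simpler than the probabilistic models in the paper, just an illustration:

    from difflib import SequenceMatcher

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    name_a = "Acme Incorporated"
    name_b = "ACME, Inc."

    print(f"similarity={similarity(name_a, name_b):.2f}")
    # Pairs above some tuned threshold become candidate matches; the probabilistic
    # approach replaces this single score with calibrated evidence from many fields.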


That makes more sense. Thanks! I'll check out the paper. I was hoping you had some revolutionary new scraping method.


>I found that there are no web crawling frameworks out there that allow for large-scale and continuous crawling of changing data.

Are you distinguishing between "I found that there are no" and "I didn't find any so far"?

Which ones that came close have you rejected, and why?


I can't say for sure that there are none, but I believe that I've done quite a bit of research. If there really was an excellent web crawling framework it should have bubbled up to the top.

I don't remember the names of all the projects that I've looked at, but the main ones were Nutch, Heritrix, Scrapy and crawler4j. I've come across several companies/startups that have built their crawlers in-house for the same reasons (e.g. http://blog.semantics3.com/how-we-built-our-almost-distribut...).


YQL is surprisingly quite brilliant:

https://developer.yahoo.com/yql/


I'm currently scraping data such as the tweets, comments, and likes a website gets each day so I can graph them over time.

One thing I am having a hard time with is scraping backlinks to websites. Currently I'm using Bing, but their API is paid after around 5,000 queries. I really wonder how companies like SEOmoz do this daily against millions of websites.


1. Monitoring competitors. By monitoring product/service offerings close to my own operations, I can get bizdev people on the phone and speak to partners when I see indications in the public marketplace that someone has a better sourcing deal than I do. Haven't done this in five years or so.

2. Gathering basic data that should be freely available anyway (like currency exchange rates, global weather, etc.). This is always done carefully and with a light touch, with maximum respect for the load imposed on the targeted systems. Again, haven't bothered in about five years.

3. Automating content acquisition. For search engines, media libraries, etc. This is more like ten years ago. These days there's so little call for it... maybe if I ran a boutique hotel chain in a copyright-isn't-respected jurisdiction and wanted to provide a fat library of in-room entertainment...


I just started getting into scraping (I've mostly been using import.io) because it complements what I really care about - data visualization. I've gotten a ton of interest in my side project, but despite that I haven't opened the beta; I'm still worried that it won't be as lucrative as creating some niche reporting services for various verticals (real estate, auto, etc.) - essentially data that is very tabular and not hierarchical or qualitative. You can think of my work as pivot charts on crack. If someone already pre-compiled this data, I'd much rather pay for it than do it myself. My value add is the analysis/viz done on top of the data. If you want to chat, feel free to email me; contact info is in my profile.


Machete.io is awesome!


Thanks!


In terms of ideas, how to scrape Javascript-heavy sites. This one has broken me and the Import.io helpdesk: http://www.usrentacar.co.uk/. I'm now trying in CasperJS/PhantomJS but no joy there either.

I'm looking to buy a house, and not all local estate agents post to Rightmove (or some post with a 24-hour delay). Trying to submit the search form on the agent's own hideous website, parse the results and get a standard structure between them all is hideous - I gave up in the end.

Once I have the data, the next challenge is analysing it (geolocation, commute times, distance to amenities, etc.), which is its own separate problem.


I made a JavaScript-based automated scraper in a Win7/Vista desktop gadget. It was originally for displaying the remaining credit on my mobile. You put in search terms and a website, and it scrapes it and tries to return what it thinks you are looking for (weather, stock prices, remaining balance, etc.). It works OK. I think there is definitely demand for a well-made scraper/alert app/service though.

App is here http://robotification.com/creditile/

Also, didn't Yahoo make a thing for scraping - 'Yahoo Pipes' or something?


I scraped hundreds of the major US universities' websites for one of my clients. He used the data to build a mobile platform for students that integrated different services. I mainly grabbed courses, class schedules, bus routes, and the email addresses of both professors and students. I still have around one and a half million email addresses of academics.

I also did some e-commerce information scraping.

One of the most interesting ones was for a data-selling company. They asked me to collect data on geographic information, disasters, finance, tweets, etc. We applied ML and statistics to produce forecasts from the historical data.


What about https://www.kimonolabs.com/ ? It makes it pretty easy to collect data and presents it in a structured format (JSON).


I have a side-project which scrapes play-by-play data from NBA games to gain more insights into these games.

Here is an example of the (un-finished) side-project: http://recappd.com/games/2014/02/07

I'm far from the only person scraping this data. Look at sites like http://vorped.com and http://nbawowy.com for even better examples.


Scraping really is quite a complex process, and not everybody does it right. Do you employ a (distributed?) crawler pool? What if a scraped page goes offline (404/410)? How do you handle network errors and 403s / getting caught (and possibly blocked) - if at all? Do you conceal the scraping by employing a fake user agent? Do you (sometimes?) request permission for scraping from the relevant webmasters? These are the things that can make it or break it, IMHO.


BTW, I write tailor-made PHP+MySQL scraper scripts targeting English- or Italian-language sites; contact me for more info :)


A while ago, I had the idea of creating a travel site that catered to the group of people that enjoy traveling but aren't bound by time (i.e. I want to go to X, but I don't care when -- just show me the cheapest weekend for the next 3 months).

Anyway... it turns out that flight APIs are ridiculously non-existent. I ended up scraping two different airline sites, but since it was against their terms, I never took the site any further.


So, Skyscanner? (Pick 'Whole month' / 'Whole year' for the date.)


This is the first I've heard of Skyscanner, and they kind of get what I was going for, but I want more of a "set it and forget it" type of service -- something where I can enter in the closest airport and a list of cities I want to visit, and every time I log in I'll see the prices for those cities over the next several months (or get alerted via email, etc.).


The hospitality and travel industries are very slow to update their technologies. I used to work with Ritz Carlton and St. Regis and even those brands are practically in the stone age, so I can't imagine how scraping for flight info would go.

I've thought of even building a simple event aggregator for some friends in the industry and they are blown away that it's possible. Then I remember how many venues are in cities like Charlotte and San Francisco and realize why these industries lag in technology. There just isn't a large pool of developers who want to solve their problems.

Do you have any projects you are currently working on?


Completely agreed on the "lag in technology" -- couple that with the fact that even if you do manage to book a flight as an affiliate, most providers only give you a flat fee (as opposed to hotel affiliates which tend to give a percentage).

Disclaimer: Those claims were relevant the last time I investigated (>1 year ago). It could have changed by now.

On the plus side, building out the flight tool really got me into hacking on things - which eventually led me to leave my last job and co-found something new. :)


Sabre just opened their API to the public. Not sure if it's paid or not but we were going to do this exact idea for YC Hacks.


Oh wow, how did I miss that? I looked at sabre when I first started scraping, but the application process to just look at the APIs was exhausting.

Thank you!


Frankly I'm annoyed to see this topic here. Most people who have taken to scraping are low-life scum. They see content that others have spent months or years producing, simply set up a site that aggregates all of that information, and then sit back and collect revenue from ads or the reseller links they paste everywhere.

People who put in a few hours of work to take advantage of other people's hard work piss me off. :/


Are you against hyperlinks too? Seriously, computers naturally make content reusable. Try to imagine the next level when we don't depend on hoarding. The semantic web imagined this, though it was too complicated. But the idea still has huge benefits; every web site a linked database, with content precisely described. But I think many orgs are too afraid they don't really have something to offer in the big picture (that's what so much of business is about).

I do a lot of work in scraping, but it's for non profit healthcare, academic, and general knowledge augmentation. It's painful, but the only way to get to the next level without waiting a thousand years for everyone to make their own consistent API and metadata descriptions.


I used to scrape the web for a daily-deals search engine I wrote for a client in 2010. But we scraped in real time, as the number of sites was really low (in the tens).

Pre-crawled copies with a distributed processing platform could be cool. You could come up with a better search engine with programmable rules that are edited collaboratively (like Wikipedia).


I've used Mozenda for web scraping. They have a free trial and can scrape some complex formats, like drilling down several levels in a website or database. They can also parse PDFs.

See https://www.mozenda.com


I love scraping and even made a subreddit for the purpose, where I've showcased a few of my public projects. Any scraping lover can join in.

http://www.reddit.com/r/scrapingtheweb/


We do so to detect new pages on websites within our industry. Often the new information is not formally announced, or only weakly so. We regularly uncover valuable new info about company developments and changes.


I currently scrape my Lending Club account to automatically trade loan notes on the secondary market. This way I can buy/sell notes that satisfy my criteria. If anyone is interested in this, I can send you the scripts.


I scrape the Google Play Store for app data, top-ranked apps, etc. Unlike iTunes, they don't have a public API/feed, so in order to get the data I require, I scrape the Play Store on a pretty regular basis.


I scrape Google to save search results.

http://www.labnol.org/internet/google-web-scraping/28450/


PlainSite (http://www.plainsite.org) uses about 20 different scrapers/parsers to download and standardize legal materials.


I scrape Craigslist for side-by-side comparisons of stuff I want to buy from there, e.g. cars, motorcycles, etc. Maybe even real estate would be a good target.


Sports scores, statistics, etc. are always in high demand for scraping and great for getting people interested when learning scraping techniques.


Yeah, very much agreed. I have to scrape schedules together to populate games for my ticket site (http://www.boxrowseat.com).

The popular sites are wising up to scraping of information, so it's really an exercise in futility. But when it works, it's super rewarding.


I am looking for data listing all universities and their associated colleges. I haven't found anything related to it anywhere.


So, like dmoz.org for higher education?


Yes, sort of like that. I found a good list on http://www.4icu.org/ , and Wikipedia too has lists of many colleges and universities.

But information about all the colleges associated with a university is generally available only on that university's own site, and many times it's in PDFs.

Having said that, my above requirement (colleges associated with a university) is itself secondary. I've found lists of universities in many places, but not lists of colleges.

So for now, just getting a list of all colleges with their locations is enough for me.


I scrape the Hacker News website to get links and numbers of points. These allow me to produce my "top 10" lists.


I regularly scrape financial data - historical prices, live quotes, company information, quarterly reports, etc...


Reddit and Twitter accounts, for tendancy.

But Twitter is so vast that you may want to categorize accounts.

Reddit is a good source for a lot of info.


Why not use PRAW? It's a very mature, useful library that uses the Reddit API.


I will dig into this, thanks.


What is tendancy?


Trends, sorry... English is not my native language.


No problem, I thought maybe I didn't understand some of the latest techno-speak ;-)


I used to scrape a bunch of webcomics to turn them into RSS feeds. I still have one or two running, actually.


Try https://www.comic-rocket.com/ and see if it has all the comics you want to read. It supports either RSS, a web-based reader, or an Android app.


I'm building a database of games and scores. Web scraping has been very helpful.

Maybe instead of trying to change up the content, try to change up the method, i.e. do a talk on running crawlers/scrapers to seed your database at an interval (instead of just "scraping").


If you could publish a price list for items sold at major grocery chains, I am sure that many people could use it (bonus if it includes aisle numbers).



Not a bad idea, but I think the grocery chains are going to fight you on it.


Online personal banking



