
I don't want to be too harsh, but I wouldn't find this useful (and my job depends a lot on crawling data).

1. When most people scrape data, they are generally interested in a very specific niche subset of the web. Sure, you might have a billion-row database of every article ever published, but do you have all the rows of every item sold on FootLocker.com, for instance? As well as the price of each item (which is extracted from some obscure XPath; see the sketch below)?

2. Most people are interested in daily snapshots of a page, like the daily prices of items in an ecommerce store, not a static, one-time snapshot.

I strongly believe crawling is something that can rarely be productized. The needs are so different for every use case. And even if you were to provide a product that made crawling obsolete, I would still never use it, because I don't trust that you crawled the data correctly. And clean, accurate data is everything.
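
To make point 1 concrete, this is roughly what those one-off scrapers end up looking like. The URL and XPath here are made up, not FootLocker's real markup, and the whole thing only works until the site changes its HTML:

    import requests
    from lxml import html

    # Both of these are made-up placeholders; a real scraper would use the
    # site's actual product URLs and whatever obscure XPath the price sits in.
    PRODUCT_URL = "https://www.footlocker.com/product/example-sneaker/12345.html"
    PRICE_XPATH = "//span[contains(@class, 'ProductPrice')]/text()"

    def fetch_price(url):
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        tree = html.fromstring(resp.text)
        prices = tree.xpath(PRICE_XPATH)
        return prices[0].strip() if prices else None

    print(fetch_price(PRODUCT_URL))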




From what I understand, this is not trying to solve the typical e-commerce problem of closely watching what your competitors are selling, but rather trying to provide a database to people interested in content on the web.

It probably won't solve the problems you're working on, but I could imagine quite a lot of interesting text analysis cases.


Yes, a generalisation of the way Wikipedia provides a web site, and a more machine-readable form of what they do.

If I get back into blogging, I would be really happy to have my posts structured and indexed in whatever standard way made them easier for bots to use.
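
For what it's worth, one standard that already exists for this is schema.org markup embedded as JSON-LD. A rough sketch of what a post's metadata could look like (all the values are obviously placeholders):

    import json

    # schema.org BlogPosting metadata; every value here is a placeholder.
    post = {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": "Example post title",
        "datePublished": "2019-01-01",
        "author": {"@type": "Person", "name": "Example Author"},
        "articleBody": "Full text of the post...",
    }

    # Embedded in the page, bots can index the post without custom parsing.
    print('<script type="application/ld+json">%s</script>' % json.dumps(post, indent=2))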


I haven't tried Mixnode yet, but the way I understand it, it lets you query websites and retrieve their HTML content, which you can then parse, without having to crawl the site yourself. Looking at their GitHub, they seem to use WARC, so they may also let you request a website at specific timestamps?
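
I don't know how Mixnode actually exposes the data, but if you do end up with WARC files, reading the archived responses (including the capture timestamp) is straightforward with the warcio library. The file name below is just a placeholder:

    from warcio.archiveiterator import ArchiveIterator

    def html_records(warc_path):
        # Yield (url, capture timestamp, raw body) for every archived response.
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                uri = record.rec_headers.get_header("WARC-Target-URI")
                timestamp = record.rec_headers.get_header("WARC-Date")
                yield uri, timestamp, record.content_stream().read()

    for uri, ts, body in html_records("example-crawl.warc.gz"):  # placeholder file
        print(ts, uri, len(body))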

That being said, I find this highly interesting, if it works like that. We are working on a peer-to-peer database that lets you query semantic data, populated mostly from public web data but with strong guarantees of accurate and timely data, and this could be a great way to write more robust linked-data converters.


What if the product was a framework for sourcing, aggregating, and visualizing data? When the user is put in control, you don't need to trust the product to do these things for you - it simply enables you to do what you want.

I think this is where the web is headed - where common users gain the ability to perform tasks that currently only developers or technical experts can do.


It's always been the goal to empower the user, but you also have the movement to simplify everything.


As a data engineer who needs to crawl websites sometimes, Mixnode looks interesting to me. I agree that it is hard to make scraping a product because it is so use-case specific. However, crawling, defined as downloading all the HTML, PDFs, and images on a given site, is a pretty common first step and something that could be a product. Then turning that into SQL sounds pretty awesome.
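
Agreed. Even a naive version of that first step, crawl a site and dump the raw pages somewhere SQL can reach them, is only a few dozen lines. The start URL and limits here are placeholders, and this ignores robots.txt, rate limiting, JS rendering, etc.:

    import sqlite3
    from urllib.parse import urljoin, urlparse

    import requests
    from lxml import html

    START_URL = "https://example.com/"  # placeholder site
    MAX_PAGES = 50

    def crawl(start_url, db_path="crawl.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
        domain = urlparse(start_url).netloc
        queue, seen = [start_url], set()

        while queue and len(seen) < MAX_PAGES:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, resp.text))
            conn.commit()
            # Only try to extract links from HTML, not PDFs, images, etc.
            if "html" in resp.headers.get("Content-Type", "") and resp.text.strip():
                for href in html.fromstring(resp.text).xpath("//a/@href"):
                    link = urljoin(url, href)
                    if urlparse(link).netloc == domain:
                        queue.append(link)
        conn.close()

    crawl(START_URL)
    # After that, "turning it into SQL" is literally SQL:
    #   SELECT url FROM pages WHERE body LIKE '%out of stock%';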


This. I write crawler software (adapters mostly) for the same client, and I could never figure out an easy way for my client to specify the XPath/CSS paths in a meaningful way and extract the data.

Every crawler task requires different paging methods, different XPath patterns, etc., which makes it complicated to generalize.
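
The closest I've gotten is pushing the adapter down into a config the client can at least read, something like the hypothetical shape below, but the paging rule is exactly the part that refuses to generalize:

    from lxml import html

    # Hypothetical "adapter as data": the client fills in selectors and a paging
    # rule instead of writing code. The flat next-page link works for simple
    # sites and falls apart on cursors, POSTed forms, or infinite scroll.
    SITE_ADAPTER = {
        "start_url": "https://example-shop.com/sneakers?page=1",  # placeholder
        "item_xpath": "//div[@class='product-card']",
        "fields": {
            "name": ".//h2/text()",
            "price": ".//span[@class='price']/text()",
        },
        "next_page_xpath": "//a[@rel='next']/@href",
    }

    def extract_items(page_source, adapter):
        # Generic driver: apply the client's selectors to one page of HTML.
        tree = html.fromstring(page_source)
        for node in tree.xpath(adapter["item_xpath"]):
            yield {name: node.xpath(xp) for name, xp in adapter["fields"].items()}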


There is a product (several of them from one company, actually) for crawling, but it's more of a tool than an end-user product: https://scrapinghub.com


I could imagine that they let you schedule a custom crawl as an added service.


> but do you have all the rows of every item sold on FootLocker.com, for instance? As well as the price of each item (which is extracted from some obscure XPath)?

What if they did? Would you buy it then? What could they possibly offer you before you'd be willing to use their product?



