1. When most people scrape data, they are generally interested in a very specific niche subset of the web. Sure, you might have a billion-row database of every article ever published, but do you have all the rows for every item sold on FootLocker.com, for instance? As well as the price of each item (which is extracted from some obscure XPath)?
2. Most people are interested in daily snapshots of a page, like the daily prices of items in an ecommerce store, not a static, one-time snapshot.
I strongly believe crawling is something that can rarely be productized. The needs are so different for every use case. And even if you were to provide a product that would make crawling obsolete, I would still never use it, because I don't trust that you crawled the data correctly. And clean, accurate data is everything.
It probably won't solve the problems you're working on, but I could imagine quite a lot of interesting text analysis cases.
If I get back into blogging, I would be really happy to have my posts structured and indexed in whatever standard way made them easier for bots to use.
That being said, I find this highly interesting, if it works like that. We are working on a peer-to-peer database that lets you query a semantic database, populated mostly by public web data but with strong guarantees of accurate and timely data, and this could be a great way to write more robust linked-data converters.
I think this is where the web is headed - where common users gain the ability to perform tasks that currently only developers or technical experts can do.
What if they did? Would you buy it then? What could they possibly offer you before you'd be willing to use their product?
Every crawler task requires different paging methods, different XPath patterns, etc., which makes it more complicated to generalize.
It's not some proposal to re-architect the web so that crawling isn't necessary. It's a data warehouse of the web as a service.
Very misleading title.
The problem is that there are two valid interpretations here, and it wasn't clear which was the right one.
Let’s hope that the food is varied enough and not too outdated, though.
The tagline of this project is similar in spirit to the following projects, which I have come across.
I am definitely looking forward to seeing more projects like these, which will be helpful in transitioning us from Web 2.0 to Web 3.0.
I think the main hurdle we face in transitioning the web we know today to the vision behind all these projects is the companies that have already aggregated huge volumes of data (e.g. Facebook, LinkedIn, AngelList, CrunchBase, Yelp).
They are now doing their best to protect their data in order to secure their competitive moat. This has the effect of preventing that data from being used in ways other than those originally intended.
I did write a post about this topic around 3 months ago as well.
Only yesterday I kinda messed up in an interview because I wasn't good at SQL. Just cursorily checked the link you posted and it is looking good. Thanks for the suggestion.
Firstly, if you are scraping you would generally only be targeting a specific list of sites, and you'd want to make sure you were getting the freshest content - which means going straight to the source.
Secondly, while plenty was shown around metadata, there wasn't much shown about extracting actual content. I had expected it to be some kind of clever, AI-hype product that extracted semantic data, but it appears to be much more rudimentary than that, effectively letting you query the DOM with SQL.
I don't mean to hate on it - this really does look interesting - I'm just not convinced there is any real value over existing (or custom) scraping tools.
> string_between(content, '<title>', '</title>') as title
Is it really the case? Can you really avoid crawling before doing that? The article is unclear.
I mean, from a human point of view there are no divs or spans on a page, but articles, comments, paragraphs, pictures, links, and so on.
Sure, it's a much larger problem, but Google, for example, seems to be able to extract categorised information from web pages.
Do they do the scraping/crawling and we just search their database?
Do we have to do the scraping/crawling ourselves and dump the results into a mixnode server running locally?
I'm also interested in how often they rescrape their pages, and if they have rate-limit bypassing tech (for the Amazon scrapers).
So far, I think they're calling the Web a database because you can use SQL to query their database — which makes me feel like they're missing the point.
But they've done such hard work and they look like they're really excited about it; I just don't understand why.
string_between(content, '<title>', '</title>') as title
content_type like 'text/html%'
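Assembled into a complete statement, the kind of query those two fragments come from would presumably look something like the sketch below. Only the two expressions above are actually quoted; the table name and the url column are my guesses at what their schema exposes.

    -- Rough sketch only: 'resources' and 'url' are assumed names, not confirmed anywhere.
    SELECT
      url,
      string_between(content, '<title>', '</title>') AS title
    FROM resources
    WHERE content_type LIKE 'text/html%'
    LIMIT 100;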
What they offer is not really clear from the article. It seems that they only provide a raw SQL interface over a database of crawled web pages (to be fair, they added a few HTML-related SQL functions). We don't know where this database comes from, or who is supposed to provide it.
Great to see that SQL is making a comeback, though.
This new product sounds like it is just a query language that can be used on top of what you yourself have paid them to crawl. I don't believe they've actually crawled the whole web and are providing an interface to that. Their website says things like "the entire web" and "trillions of rows", but I'm guessing that's only true if you pay them a few million dollars to do that.
That consisted of querying 100 terms over about 100 sites, and scraping Google's (rather inaccurate) "results found" estimate. About 10k Google queries.
Slowing those queries to the point they don't trip Google's bot detection and request CAPTCHAs is the hard part of this -- given a single IP, the queries stretch over a week or more.
A single source to query that information directly would make these investigations far easier. I've several such projects in mind.
It's a critical problem that the site doesn't explain why people would want to use it. What tool or behaviour will it replace? What are those people doing today?
"I am a paying customer; who am I and what are my problems?" How many of those customers are there, and how much are they willing to pay to solve their problems?
Is it faster/better to use Mixnode than to create my own scraper? Is it possible to purchase an enterprise instance that runs in our datacentre? Is this flexible enough to accommodate my future business rules?
How much will this cost, and who do I call if it breaks? Can I purchase an SLA comparable to what AWS offers?
Most businesses have about 100 hard questions associated with them, where if you have good answers you're probably going to do just fine. The answers are the easy part; figuring out the questions for each company is hard.
You're getting into data parsing. Almost nobody scrapes data without processing, parsing and normalizing it, but scraping is getting the data in the first place.
Scraping isn't easy at scale, though. You have to distribute your crawlers, adhere to TOS (in theory), and avoid getting blocked. It's simple at small scale.
I don't know about the utility of this service, though. It handles the less interesting part of data acquisition and processing. I also agree with other comments that most scraping use cases are targeted and small in scope.
I still believe that sometimes you might need a genuinely anonymous way to crawl data in order to syndicate it. Think of LinkedIn data: how are you going to get the data when LinkedIn blocks every single request you make? Currently I use a paid service that has a crawler, which I use for getting the data I need. https://proxycrawl.com/anonymous-crawler-asynchronous-scrapi...
Do you think Mixnode can help in getting raw data from difficult websites like LinkedIn or Google?
There are slightly shy of 2 billion websites worldwide; about 200 million are active. A 32-bit integer could index each site, with a further hash for site paths.
There were 30 trillion unique URLs as of 2012.
In August 2012, Amit Singhal, Senior Vice President at Google and responsible for the development of Google Search, disclosed that Google's search engine found more than 30 trillion unique URLs on the Web, crawls 20 billion sites a day, and processes 100 billion searches every month (which translates to 3.3 billion searches per day and over 38,000 per second).
There are terabyte MicroSD cards, so this looks viable.
The Second Edition of the OED had 171k+ full entries for words in current use, 47k+ for obsolete words, and ~ 9500 sub-entries.
And there have been a number of supplements since.
40k is low by a significant multiple.
This is OOM-level analysis, not high-precision estimation.
Though I appreciate the correction.
It’d be great if CS courses and bootcamps would teach some basic web history.
Why? I've been around since HTML 1.0 and used to be a die-hard strict-XHTML advocate (now I'm not, just because today HTML5 is usually still written as pretty well-formed XML, has more semantic tags, and is more readable and more unified this way), and I actually love XML, as I find it more readable than JSON. But I still don't get how XML is better than JSON in any aspect other than readability (which is subjective; many people say XML is a pain to read). Sure, XML provides two distinct ways of expressing object properties and allows unencapsulated text within an element alongside sub-elements, but I doubt these are good things at all. I feel like I would even prefer JSON to replace HTML itself, as it could introduce more order to the chaos and make the web more machine-readable.
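To make those two points concrete, here is a made-up snippet, purely for illustration: in XML the same property can live in an attribute or in a child element, and bare text can sit alongside child elements (mixed content), whereas JSON has exactly one way to express a property and no direct equivalent of mixed content.

    <!-- the lang property as an attribute -->
    <post lang="en"><title>Hello</title></post>

    <!-- the same property as a child element, plus mixed content:
         bare text sitting next to a child element -->
    <post><lang>en</lang>This is <em>mixed</em> content.</post>

    {"post": {"lang": "en", "title": "Hello"}}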