cloudblare 5 months ago

I don't want to be too harsh, but I wouldn't find this useful (and my job depends a lot on crawling data).

1. When most people scrape data, they generally are interested in a very specific niche subset of the web. Sure you might have a billion row database of every article ever published, but do you have all the rows of every item sold in FootLocker.com, for instance? As well as the price of each item(which is extracted from some obscure xpath)

2. Second, most people are interested in daily snapshots of a page. Like the daily prices of items in an ecommerce store. Not a static, one time snapshot.

I strongly believe crawling is something that can rarely be productized. The needs are so different for every use case. And even if you were to provide a product that would make crawling obsolete, I would still never use it, because I don't trust that you crawled the data correctly. And clean, accurate data is everything.

From what I understand this is not trying to solve the typical e-commerce problem of closely watching your competitors selling something, but rather trying to provide a database to people interested in content on the web.

It probably won't solve the problems you're working on, but I could imagine quite a lot of interesting text analysis cases.

Yes, a generalisation of the way Wikipedia provides a web site and a more machine-readable form of what they do.

If I get back into blogging, I would be really happy to have my posts structured and indexed in whatever standard way made them easier for bots to use.

I haven't tried Mixnode yet, but the way I understand it, it lets you query websites and retrieve their HTML content that you can then parse - without you having to crawl the site. Looking at their Github, they seem to utilize WARC, so they may also allow you to request the website for certain timestamps?

That being said, I find this highly interesting, if it works like that. We are working on a peer-to-peer database that lets you query a semantic database, populated mostly by public web data, but with strong guarantees of accurate and timely data, and this could be a great way to write more robust linked-data converters.

What if the product was a framework for sourcing, aggregating, and visualizing data? When the user is put in control, you don't need to trust the product to do these things for you - it simply enables you to do what you want.

I think this is where the web is headed - where common users gain the ability to perform tasks that currently only developers or technical experts can do.

It's always been the goal to empower the user but you also have the movement to simplify everything.

> but do you have all the rows of every item sold in FootLocker.com, for instance? As well as the price of each item(which is extracted from some obscure xpath)

What if they did? Would you buy it then? What could they possibly offer you before you'd be willing to use their product?

As a data engineer who needs to crawl websites sometimes, Mixnode looks interesting to me. I agree that it is hard to make scraping a product because it is so use case specific. However crawling, defined as downloading all HTML, PDF, images on a given site, is a pretty common first step and something that could be a product. Then turning that into SQL sounds pretty awesome.

This. I write crawler software (adapters mostly) for the same client, and I could never figure out an easy way for my client to specify the xpath/case paths in a meaningful way and extract the data.

Every crawler task requires different paging methods, different xpath patterns, etc., which makes it more complicated to generalize.

There is a product (several of them from one company, actually) for crawling, but it's more tooling than an end-user product: https://scrapinghub.com

I could imagine that they let you schedule a custom crawl as an added service.

For anyone else confused by the title, this is an alternative to you doing the crawling.

It's not some proposal to re-architect the web so that crawling isn't necessary. It's a data warehouse of the web as a service.

It’s not even an alternative to crawling. It’s just a way to exploit the result of crawling. You have to crawl first.

Very misleading title.

I don't think it's actually wrong to call it an alternative. If someone says that going to a restaurant is an alternative to cooking, you know exactly what they mean. It's not an alternative method, but it is an alternative choice.

The problem is that there are two valid interpretations here, and it wasn't clear which was the right one.

As long as Mixnode took care of the cooking for us, yes, using Mixnode’s SQL cutlery is kind of an alternative to cooking.

Let’s hope that the food is varied enough and not too outdated, though.

Ok, we've taken that bit out of the title above.

Ah, the latest company to replicate ql2's WebQL from the 90s. https://www.directionsmag.com/article/2901 and still at https://www.ql2.com. Props to Ted Kubaitis who developed WebQL on the side and grew it into a very helpful company for data collection, price tracking, etc.

It's heartening to know that more and more folks are starting to share Tim Berners-Lee's vision of the web by starting such projects. https://en.wikipedia.org/wiki/Giant_Global_Graph

The tagline of this project is similar in fashion to the following projects which I have come across.





I am definitely looking forward to seeing more projects like these which will be helpful in transitioning us from Web2.0 to Web3.0

I think the main hurdle we face in transitioning the web we know today to the vision behind all these projects is companies that have already aggregated huge volumes of data. (e.g. Facebook, LinkedIn, Angelist, CrunchBase, Yelp)

They are now doing their best to protect their data to secure their competitive moat. This has the effect of preventing data from being utilized in other ways than was originally intended.

I did write a post about this topic before around 3 months ago as well..


Interesting product. BTW: 10 times posted in HN, first time they’re trending (if you click the domain you’ll see the list). Persistence pays I guess :)

Almost every story is posted multiple times.

Off topic, but thanks for pointing that cool HN feature out

Reminds me of YQL, https://developer.yahoo.com/yql/.

Which used to support a similar usage (select from url) but no longer does.

Yeah, html table is deprecated :(

This does look really interesting for research and discovering content. But I'm not sure how good a replacement it would be for more generally scraping content.

Firstly, if you are scraping you would generally only be targeting a specific list of sites, and you'd want to make sure you were getting the freshest content - which means going straight to the source.

Secondly, while plenty was shown around metadata, there wasn't much shown about extracting actual content. I had expected it to be some kind of clever, AI-hype product that extracted semantic data, but it appears to be much more rudimentary than that, effectively letting you query the DOM with SQL.

I don't mean to hate on it - this really does look interesting - I'm just not convinced there is any real value over existing (or custom) scraping tools.

It would be great if they allowed people to write custom views for a certain group of pages, and allowed them to be run and indexed by default. Then you could create, for example, an Amazon item page view that scrapes price and description, and reviews, and quantity, and seller and all that shit and it would be scraped and indexed for you. They could make it optional and make it default only when the view becomes popular based on their own stats. How awesome and useful would that be?

And if this was centralized, everyone would benefit, since, say, amazon would only get indexed by this service, rather than thousands of individual companies with their own bots doing similar things.

With XHTML 2.0 and related tools like XQuery it could've been a matter of course. Hell XQuery is still a much better tool for this job than SQL, but no one cares.

I mean

> string_between(content, '<title>', '</title>') as title


I think the main point here is that you can get data from many different places without having to run crawlers. Like the etld example. tbh I too want to see better DOM handling (stringBetween is not the best function for HTML parsing lol) but the main value prop is pretty impressive.
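For readers wondering what "better DOM handling" might look like compared with substring matching, here is a minimal sketch using Python's standard-library `html.parser`. It is only an illustration of the general technique; `TitleParser` and `extract_title` are hypothetical names, and nothing here reflects how Mixnode works internally.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> and
        # <title lang="en"> are both recognised here.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def extract_title(html):
    """Return the page title, or None if the page has no <title> text."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Unlike a literal search for `'<title>'`, this still finds the title when the tag is uppercase or carries attributes.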

> you can get data from many different places without having to run crawlers

Is that really the case? Can you really avoid crawling before doing that? The article is unclear.

This is cool. Just an idea, do you think that you can make it do select within the DOM? It would be amazing to do SELECT on the document from a human point of view.

I mean, from a human point of view there are no divs or spans on a page, but articles, comments, paragraphs, pictures, links and so on.

Sure, it's a much larger problem, but Google, for example, seems to be able to extract categorised information from web pages.

I'm confused.

Do they do the scraping/crawling and we just search their database?

Do we have to do the scraping/crawling and we dump the results into a mixnode server running locally?

I'm slightly confused too. They say that the web is a database, but it looks like we're SQL querying their database of the web.

I'm also interested in how often they rescrape their pages, and if they have rate-limit bypassing tech (for the Amazon scrapers).

So far, I think they're calling the Web a database because you can use SQL to query their database — which makes me feel like they're missing the point.

But they've done such hard work and they look like they're really excited about it — but I just don't understand why

This seems like a really cool idea, but I'm struggling to imagine what the use cases would look like? Anybody have any ideas?

Somewhat related, I did this hacky thing in 2012. Worked a charm... No db required...


Is this an alternative to crawling/scraping, or a way to exploit the result of crawling/scraping?

What they offer is not really clear from the article. It seems that they only provide a raw SQL interface over a database of crawled web pages (to be fair, they added a few HTML-related SQL functions). We don’t know where this database comes from, or who is supposed to provide it.

Great to see that SQL is making a comeback, though.

Based on a quick google search (e.g., https://stackoverflow.com/questions/46673751/nutch-vs-heritr...), their existing product appears to be a hosted solution for crawling the web.

This new product sounds like it is just a query language that can be used on top of what you yourself have paid them to crawl. I don't believe they've actually crawled the whole web and are providing an interface to that. Their website says things like "the entire web" and "trillions of rows", but I'm guessing that's only true if you pay them a few million dollars to do that.

I guess they are using Common Crawl as a base. Not sure whether they are doing actual crawling along with it.

This is way less cool than the title suggests. They are doing a bunch of crawling and inserting the raw html content into their big centralized database, where you can run queries on the text inside:

    string_between(content, '<title>', '</title>') as title
    content_type like 'text/html%'

I would say this is more exciting than it looks, though. I used to do a lot of crawling in the early 2000s, and almost all of it was expressed in terms of string_between calls. XPath is more convenient, but I'd say that 85% of the time you can collapse an XPath query into a string_between-style query. It can be awkward and even inconsistent, but in practice it often works well.
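To make the comparison concrete, here is a minimal Python sketch of what a `string_between`-style helper might do. The exact semantics of Mixnode's function are not described in this thread, so the behaviour below (first occurrence of the start marker, then the next occurrence of the end marker, `None` when either is missing) is an assumption.

```python
def string_between(text, start, end):
    """Return the text between the first `start` and the next `end`.

    Assumed semantics: returns None when either marker is missing.
    """
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop]


# Works on the simple case from the article's query...
print(string_between("<html><title>Hi</title></html>", "<title>", "</title>"))

# ...but a tag with attributes no longer matches the literal marker,
# which is the kind of inconsistency mentioned above.
print(string_between('<title lang="en">Hi</title>', "<title>", "</title>"))
```

The second call returns `None`, which shows why this style of extraction tends to need per-site checking.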

Having done a fair amount of work with xslt and using regex to strip out bad data, I agree. But 85% is terrible if you are creating a database. Any regex style query on xml or pseudo-xml requires bespoke treatment and a high amount of human hours to check the results before you can be sure an edge case didn't completely destroy your model.

I've conducted several analyses in which I've looked for trends or patterns across multiple websites or domains. Finding out where discussion / content covering specific topics or keywords appears is an example; see "Tracking the conversation":


That consisted of querying 100 terms over about 100 sites, and scraping Google's (rather inaccurate) "results found" estimate. About 10k Google queries.

Slowing those queries to the point they don't trip Google's bot detection and request CAPTCHAs is the hard part of this -- given a single IP, the queries stretch over a week or more.
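As a rough illustration of the pacing involved, the sketch below works out the average gap needed to spread about 10k queries over a week and wraps it in a simple throttle. The function names and the jitter value are hypothetical; the comment does not say what delay actually avoids CAPTCHAs, only that the run takes a week or more.

```python
import random
import time


def seconds_between_queries(total_queries, days):
    """Average spacing needed to spread total_queries evenly over `days`."""
    return days * 24 * 60 * 60 / total_queries


def throttled(items, interval, jitter=0.2, sleep=time.sleep):
    """Yield items one at a time, pausing about `interval` seconds between them.

    Each pause is randomised by +/- `jitter` (a fraction of `interval`).
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    for index, item in enumerate(items):
        if index:  # no pause before the first item
            sleep(interval * random.uniform(1 - jitter, 1 + jitter))
        yield item


# About 10k queries over 7 days works out to roughly one query a minute.
interval = seconds_between_queries(10_000, 7)
print(f"{interval:.2f} seconds between queries")
```

A caller would loop with `for query in throttled(queries, interval): ...` and issue one request per iteration.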

A single source to query that information directly would make these investigations far easier. I've several such projects in mind.

What does the creator of Mixnode expect, believe or hope that people will use this tool for?

It's a critical problem that the site doesn't explain why people would want to use it. What tool or behaviour will it replace? What are those people doing today?

"I am a paying customer; who am I and what are my problems?" How many of those customers are there, and how much are they willing to pay to solve their problems?

Is it faster/better to use Mixnode than to create my own scraper? Is it possible to purchase an enterprise instance that runs in our datacentre? Is this flexible enough to accommodate my future business rules?

How much will this cost, and who do I call if it breaks? Can I purchase an SLA comparable to what AWS offers?

Most businesses have about 100 hard questions associated with them, where if you have good answers you're probably going to do just fine. The answers are the easy part; figuring out the questions for each company is hard.

When I saw the column called content my face went a little bit sad.

I might be missing something, but could you explain why?

Because 99% of scraping is parsing that column. At least that’s been my experience.

Scraping is literally just the successful acquisition of content.

You're getting into data parsing. Almost nobody scrapes data without processing, parsing and normalizing it, but scraping is getting the data in the first place.

Scraping isn't easy at scale, though. You have to distribute your crawlers, adhere to TOS (in theory) and avoid getting blocked. It's simple at small scale, though.

I don't know about the utility of this service, though. It handles the less interesting part of data acquisition and processing. I also agree with other comments that most scraping use cases are targeted and small in scope.

I hear you, but given they went with SQL, what choice did they have? No schema could adequately represent all the possible content in the document body.

An interesting way to enter the search indexing space.

Would this work as a complement to SPARQL? Or are they completely orthogonal to each other?

The website seems to be down... Slashdot effect... But I really like the idea :)

A few years ago I worked in a startup, and to find customers we needed to find web sites using certain technologies (e.g. WordPress and certain plugins). We used the service of an extremely similar SaaS startup for a little bit -- that basically did the exact same thing as Mixnode. That startup didn't work out and was shut down soon (and my startup didn't work out either). Wish you the best of luck and hope things work out for you; maybe the tech climate and trends have evolved since a few years ago and this could work out as a business now.

It seems to me that the moment you'd need to do anything interesting with a website, you'd need to crawl a lot of its pages and you'd hit robots.txt limitations very quickly.

I could use it for lots of things if it could filter on HTTP headers. If there were additional plug-ins to detect, e.g., 3rd-party tags, it would be even more powerful for testing.

It looks like there is a "headers" column, so my guess is that you can do that.

I would love to know more about this. I use scraping and crawling extensively at my startup, and it's too time-consuming.

The idea of turning the web into database columns/rows is a hell of a great, crazy idea. To be honest, I was like: wow, good luck with it.

I still believe that sometimes you might need a truly anonymous way to crawl data in order to syndicate it. Think of LinkedIn data: how are you going to insert the data when LinkedIn blocks every single request you make? Currently I use a paid service with a crawler to get the data I need. https://proxycrawl.com/anonymous-crawler-asynchronous-scrapi...

Do you think Mixnode can help with inserting row data from difficult websites like LinkedIn or Google?

A database can be owned. I am very uneasy with anything that can be owned by one person hypothetically where many have contributed.

Yahoo Pipes had a very similar thing long ago!

It was YQL, I suppose.

I wonder: if you take the top ten keywords and URL info from all pages on the web, would the data fit on a micro-SD card?

There are ~40k words in English. You don't need a full URL, but only a hash. The words could similarly be hashed, most-frequent words to smallest values.

There are slightly shy of 2 billion websites worldwide, of which 200 million are active. A 32-bit integer could index each site, with a further hash for site paths.
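A back-of-the-envelope version of this estimate can be written out in Python. The site count, vocabulary size, and keyword count come from the comments above; the 16-bit word ID is an assumption (it is enough for a 40k-word vocabulary ranked by frequency), and the extra hash for site paths is left out.

```python
ACTIVE_SITES = 200_000_000   # "200 million are active"
VOCABULARY = 40_000          # "~40k words in English"
KEYWORDS_PER_SITE = 10       # "top ten keywords"

SITE_ID_BYTES = 4            # one 32-bit integer per site
WORD_ID_BYTES = 2            # assumption: 16-bit frequency rank per word

# Check that the chosen ID widths can actually hold the counts.
assert ACTIVE_SITES < 2 ** (8 * SITE_ID_BYTES)
assert VOCABULARY < 2 ** (8 * WORD_ID_BYTES)

bytes_per_site = SITE_ID_BYTES + KEYWORDS_PER_SITE * WORD_ID_BYTES
total_bytes = ACTIVE_SITES * bytes_per_site

print(f"{bytes_per_site} bytes per site")
print(f"{total_bytes / 10**9:.1f} GB for all active sites")
```

Under these assumptions the site-level index is a few gigabytes; indexing individual pages rather than sites would scale the total by the number of pages.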


There were 30 trillion unique URLs as of 2012

In August 2012, Amit Singhal, Senior Vice President at Google and responsible for the development of Google Search, disclosed that Google's search engine found more than 30 trillion unique URLs on the Web, crawls 20 billion sites a day, and processes 100 billion searches every month [2] (which translate to 3.3 billion searches per day and over 38,000 per second).
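The per-day and per-second rates in the quote can be re-derived from the monthly figure with a few lines of Python. The 30-day month is an assumption made for the arithmetic.

```python
SEARCHES_PER_MONTH = 100_000_000_000  # "100 billion searches every month"
DAYS_PER_MONTH = 30                   # assumption: average month length
SECONDS_PER_DAY = 24 * 60 * 60

per_day = SEARCHES_PER_MONTH / DAYS_PER_MONTH
per_second = per_day / SECONDS_PER_DAY

print(f"{per_day / 10**9:.1f} billion searches per day")
print(f"{per_second:,.0f} searches per second")
```

This gives roughly 3.3 billion searches per day and a per-second rate in the high 38,000s.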


There are terabyte MicroSD cards, so this looks viable.

> There are ~40k words in English.

The Second Edition of the OED had 171k+ full entries for words in current use, 47k+ for obsolete words, and ~ 9500 sub-entries.

And there have been a number of supplements since.

40k is low by a significant multiple.

For key-value lookup, the rough magnitudes are 10^4 - 10^5 (words) vs 10^8 (active sites).

This is OOM-level analysis, not high-precision estimation.

Though I appreciate the correction.

Kids these days... We could have had XHTML, xpath, and the web as a semantic DB. I wonder if the author even knows what these things are, or what happened with the vision of a semantic machine-readable web. I rarely come across engineers who even know what XML is (no, it’s not an alternative encoding format to JSON).

It’d be great if CS courses and bootcamps would teach some basic web history.

> I rarely come across engineers who even know what XML is (no, it’s not an alternative encoding format to JSON).

Why? I've been around since HTML 1.0 and used to be a die-hard strict XHTML advocate (I'm not now only because HTML5 today is usually still written as pretty well-formed XML, has more semantic tags, and is more readable and more unified this way). I actually love XML, as I find it more readable than JSON, but I still don't get how XML is better than JSON in any aspect other than readability (which is subjective; many people say XML is a pain to read). Sure, XML provides 2 distinct ways of expressing object properties and allows unencapsulated text within an element alongside subelements, but I doubt these are good things at all. I feel like I would even prefer JSON to replace HTML itself, as it could introduce more order to the chaos and make the web more machine-readable.

The author is working with the Web, like as it is. Not an imaginary one where everyone has formatted their page in validated XHTML. This is the reality regardless of the author's age, be it 17 or 70.

> I rarely come across engineers who even know what XML


To be fair, you can use XPath in HTML as well.
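A minimal sketch of that, using only Python's standard library: `xml.etree.ElementTree` supports a limited XPath subset, and it only works here because the sample markup is well-formed. Real-world HTML usually needs a forgiving HTML parser first. The page content, class names, and `extract_prices` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing.
PAGE = """<html>
  <body>
    <div class="product">
      <span class="name">Runner 2</span>
      <span class="price">89.99</span>
    </div>
    <div class="product">
      <span class="name">Court Classic</span>
      <span class="price">120.00</span>
    </div>
  </body>
</html>"""


def extract_prices(markup):
    """Return every price found via an XPath-style attribute query."""
    root = ET.fromstring(markup)
    # ".//span[@class='price']" is within ElementTree's XPath subset.
    return [float(node.text) for node in root.findall(".//span[@class='price']")]


print(extract_prices(PAGE))
```

The same path expression would carry over to a full XPath engine; only the parsing step changes for messy markup.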

Good, now you need to design an index to that database.

Has anyone received their invitation yet?

Where are the queries stored?

This is very smart. Also a very good choice of language. Reminds me of a comment from a few days ago https://news.ycombinator.com/item?id=18144385

I too looked at the comment and thought, yes it’s about time to get my SQL act together. I can write sql queries, and can also understand looking at them what they supposedly do, but my day job doesn’t really demand more than a simple select on two tables. Where can I go learn/explore more competitive SQL?

Really good introduction to some of the more "modern" features: https://modern-sql.com/

this is an amazing resource http://selectstarsql.com/

The timing of your comment couldn't be better.

Only yesterday I kinda messed up in an interview because I wasn't good at SQL. Just cursorily checked the link you posted and it is looking good. Thanks for the suggestion.

Can you remember the questions? I think that I'm relatively good with SQL, but I just realized, I have never been asked any SQL specific questions, even though most of my work has been tied to it. The questions usually revolve around specifics of the engine, not query language itself.

I think that was a smart choice too. It's good that they could see through all the hype with the more recent languages and pick SQL. Interesting choice indeed. Not sure how it scales, though.

Can you please elaborate on the pricing model?

Their Twitter is strange:

It says

    Likes: 1023
But when you click on it, it says

    @mixnode hasn't liked any Tweets yet
What happened?

It means they unliked all their likes en masse. Twitter's counts of these things are eventually consistent... with strong emphasis on eventual.
