
Launch HN: Dashblock (YC S19) – Turn Any Website into an API - HPouillot
Hey HN,

We're Hugues and Max, co-founders of Dashblock ([https://dashblock.com](https://dashblock.com)). Dashblock turns any website into an API. People use us to access product information, news content, sales-related data, or real-estate offers, for instance.

As a data scientist, Hugues realised how complicated it is to access web data programmatically when a website doesn't provide an API. You have to build a script to pull the HTML, render the page in some cases, find selectors for the information you are interested in, distribute your tasks to scale, and if the structure of the page changes, you have to update your selectors to recover the information.

We decided to build Dashblock to make it really simple to access web data through an API. Our software is basically a browser that lets you visit a website, right-click on the information you want to extract, and preview your API on other pages.

To create long-lasting APIs, we developed a machine learning model that is resilient to website updates. For now we mainly handle changes to the HTML structure, but with enough training data we will also be resilient to UI updates.

Our model also detects similar content on the page to speed up the selection process. When you call your API, we launch a headless browser, render the page, classify the content of the page using structural, visual, and semantic features, and structure it by minimizing entropy to give you a list when needed.

Our pricing is based on the number of API calls our users make per month, and if you want to give it a try, we currently offer 10k API calls when you sign up! You can download our software here: dashblock.com.

If you have any questions, we would be happy to answer them, and if you have any related ideas, feedback, or experiences, feel free to share them :)

Thank you!
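To make the "detects similar content" and "minimizing entropy" ideas concrete, here is a toy sketch of the underlying intuition only, not Dashblock's actual model: text nodes sharing the most repeated structural signature (tag path) are good candidates for list items. It assumes well-formed markup and uses only the standard library.

```python
from html.parser import HTMLParser
from collections import Counter

class PathCollector(HTMLParser):
    """Record the tag-path of every non-empty text node."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.leaves = []  # (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))

def detect_list_items(html):
    """Return the texts under the most repeated structural signature:
    a crude stand-in for 'similar content on the page'."""
    collector = PathCollector()
    collector.feed(html)
    counts = Counter(path for path, _ in collector.leaves)
    best_path, _ = counts.most_common(1)[0]
    return [text for path, text in collector.leaves if path == best_path]
```

For example, on a page with one heading and three products, the repeated `ul/li` signature wins and the three product names come back as a list.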
======
igammarays
A productivized scraping service - useful! Entire companies are built around
scraping certain popular sites - this is a disruptive idea indeed. A growing
catalog of up-to-date scrapers for popular websites would put a lot of
freelancers out of work. I would invest in this.

However, the ML claim is highly suspect. There is no way that a machine could
reliably understand the semantic content of a website - that would require
Artificial General Intelligence. If anyone could do that, it would've been
Google. But even Google relies on human-edited structured metadata to define
the content of sites (i.e. Rich Snippets and the like).

~~~
HPouillot
It doesn't require Artificial General Intelligence. With enough training data
(crowdsourced data and human-edited metadata like JSON-LD or RDF), we can
automatically classify the attributes on the page (product name, movie title,
creation date, author), structure them, and recognise the type of entity.

Feel free to contact us if you want to invest (hello@dashblock.com), we are
currently raising funds ;)

~~~
rvnx
But what's the value compared to using open-source products like Portia
[https://portia.readthedocs.io/en/latest/getting-started.html](https://portia.readthedocs.io/en/latest/getting-started.html)?
Functionally it looks very similar.

~~~
r0rshrk
I'm sure this comment will go down in history like the Dropbox comment

~~~
rvnx
Fine, I'd rather lose my comment than my invested money

------
refrigerator
Congrats on the launch! Look forward to playing around with it.

I remember a similar startup making quite a splash on HN a few years ago —
[http://www.kimonolabs.com/](http://www.kimonolabs.com/). Do you know why they
shut it down and what you guys are doing differently?

~~~
MCorbani
Thank you! Kimono Labs built an amazing product and was eventually acquired
by Palantir. The main difference between us is our machine learning model,
which keeps extraction stable over time and parses webpages in a generic way.
We are also working on automating navigation, which is something they didn't
do =)

~~~
ryanSrich
How would you compare to something like UiPath? Congrats on the launch.

~~~
MCorbani
UiPath and other RPA vendors mainly let their users automate local processes
on Windows. We are cross-platform and focused on the web, which makes us
useful for other use-cases like gathering data or automating navigation
sessions!

------
whoisjuan
This has been tried many times and it never seems to gain enough traction to
become a relevant concept. Off the top of my head, I remember Kimono Labs,
which looked quite promising. Then it was acquired by Palantir and shut down.
I have also seen many similar solutions (basically most scraping companies,
like Diffbot, which also claims to use machine learning for its extraction
techniques).

What's the plan here to really differentiate? Why is now the right time for
this concept when others tried it before? Also, how do you plan to address
the concerns of companies that don't want their data to be accessed
programmatically? That seems like a big challenge to overcome in order to
become commercially successful.

~~~
tipalink
Regarding your question about companies' concerns: if the data is made
publicly available (i.e. the web page is not behind authentication), then why
should it matter how it's accessed?

~~~
whoisjuan
If you can access it programmatically, then you can access it at scale, which
means you can quickly scrape content and replicate it somewhere else. Many
businesses rely on a model where the data or information they generate is
meant to be consumed by a human.

For example, Google temporarily bans your IP when you hit things like Google
Play URLs multiple times in a few minutes. This is clearly an attempt to block
anyone but a human from extracting information from the Play store.

------
d--b
Sorry to be a pain here, but I very much doubt your ML thing works at all.
Opening a website and finding a DOM element is trivial, so the only thing I'd
get by buying from you is the promise that this will be resilient to website
updates.

But at the same time, for $500/month, you can definitely have people update
the selectors manually...

~~~
AznHisoka
Save your money. Create robust regression scripts and hire a freelancer to fix
them when the schema changes. Anyone serious about crawling/scraping data from
websites won't leave it up to some automated ML magic to extract the right
data for them.

~~~
jakoblorz
The freelancer approach does not work if the data you extract is
time-critical. But then again, why would anyone not try to find an API for
such data rather than rely on DOM parsing? So OP's product is worth it if you
have validated and trust their ML model to work correctly, believe they can
guarantee a certain uptime, and the data you extract is time-sensitive or
mission-critical and cannot be sourced from an API. People with such needs may
be willing to pay good money, but good luck finding early adopters as well as
the data niche where there is demand for something like this.

------
turtlebits
Very cool, but super slow, especially for an API endpoint, which I would
expect you could use directly from a front end.

Tested on a site I regularly visit

    dashblock (3 selectors, ~20 items):  16.911 seconds
    curl (no scraping):  60 ms
    chrome:  987 ms
edit: added chrome

~~~
HPouillot
Indeed, we render the whole page with its JavaScript, which is why it takes
longer than a curl. For now it's especially useful for dynamic pages, but we
also plan on supporting pages that don't require rendering.

~~~
__ka
Maybe you already do it, but I think integrating adblocker functionality when
loading JS sites would be desirable to reduce load time. And if ads are what
the API user is interested in, perhaps add a flag for whether or not one wants
ads to load.

Recommendation: [https://github.com/cliqz-oss/adblocker](https://github.com/cliqz-oss/adblocker).
It should be the fastest adblocker library (used by Ghostery, Cliqz, and
Brave).

~~~
HPouillot
Thanks for the advice, it makes a lot of sense!

------
miki123211
I see some potential in this for accessibility. Some websites are impossible
or very hard to use with a screen reader and don't provide an official API,
mostly for corporate lock-in reasons. Using this tool for API generation and
then writing a super nice client would be awesome. My life could be so much
better with solutions like those.

------
the_watcher
This is cool, looking forward to trying it out. Manual scraping is doable, but
there have been plenty of times that I've just decided not to do something
because I'd have to spend an hour getting/scraping the data. Hopefully this
will take that time down to 10 minutes or less.

~~~
MCorbani
Definitely =) Let us know if we can be of any help!

------
omarhaneef
I really want to try it because I think I need something like this.

However, how do I know it is a legitimate product and not some virus/scam
software?

I know YC is a vote of confidence, if it is a YC company, and all the copy
sounds legit, and you sound like a pair of honest, hard working entrepreneurs.

But is there some way to check before I run the software?

Edit: Note that I would not have this concern if it were a web platform like
Kimono used to be.

~~~
MCorbani
That's a good point. Our app is validated by Apple on macOS and the Windows
version will be soon. Also, we have thousands of users and you can google us:
no spam complaints at all =)

Note: FYI, we worked on a SaaS version, but from our point of view the user
experience was not slick enough (e.g. iframing the websites).

------
vessenes
Hey Hugues and Max, congratulations.

Can I ask some questions about how this would apply to a project of mine?

I currently create a personal newspaper, printed daily in my office. It’s a
reasonably large piece of software that pulls in my calendar, emails, news
stories I care about, twitter feeds, weather, stock quotes, etc.

I use python’s newspaper library for parsing RSS feed links to news sites, but
it is, at times lacking, so dashblock strikes me as interesting.
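(For context on this kind of pipeline: article extraction is the hard,
non-standard part, but the RSS side is a fixed format, and the standard
library alone can pull titles and links out of a feed. A minimal sketch; the
feed content below is a made-up sample, not a real site.)

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 sample standing in for a real feed download.
RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>Headline one</title><link>https://example.com/1</link></item>
  <item><title>Headline two</title><link>https://example.com/2</link></item>
</channel></rss>"""

def feed_entries(rss_text):
    """Return (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]
```

Fetching each `link` and extracting the article body from arbitrary markup is
exactly the step where newspaper (or a tool like Dashblock) comes in.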

What I understand from the video is I could over time build out APIs with
dashblock for major news sites; this would help with a few sites that are hard
for newspaper.

How would I use Dashblock in production: unattended, CLI, Linux or Mac? Also,
it looks _really_ slow in the video; are these typical speeds? Is this
something that you require be run on your cloud, or could I run it locally?

Thanks, Peter

~~~
HPouillot
Thanks!

You can create an API for any website (news websites included) from our
Mac/Windows software, and you can access that API from anywhere. It runs on
our servers and you can query it from any language you want. Let us know if
you need more help: hello@dashblock.com

------
sradu
Have done a lot of scraping in my life, and I'm super excited about what
Hugues and Max have built.

I tested a super early version and was surprised how well it worked.

~~~
HPouillot
Thanks Radu!

------
css
How do you avoid getting banned by the companies you scrape? Most ToSs have a
clause like:

> We prohibit crawling, scraping, caching or otherwise accessing any content
> on the Service via automated means... [etc]

~~~
reascenda
This may now be moot after the LinkedIn vs. hiQ Labs case a couple of days
ago, which appears to have blanket-legalised web scraping.

~~~
kube-system
hiQ v. LinkedIn means you probably aren't going to jail for scraping
LinkedIn's website. It doesn't mean LinkedIn can't IP ban you.

------
w457uiw4gftyi
Are you supporting the use case where web site providers consider scraping to
be hostile? Ie, spinning up new cloud instances until one isn't blacklisted by
the site, all behind the scenes so the consumer of your API doesn't have to
worry about such things?

~~~
HPouillot
We don't use sophisticated methods for now; we just use a serverless
architecture, so the IP changes at each invocation. Feel free to contact us at
hello@dashblock.com if it doesn't work for your use-case :)

------
sbr464
Nice work. Do you have an admin API for creating or managing the APIs you
generate? Asking in the case of integrating into another app.

Also, how well does it handle JavaScript apps? Can you specify different
engines to parse a site with or specify JS disabled/enabled etc?

~~~
HPouillot
We don't provide an API to manage other APIs yet, but this inception use-case
is interesting. Could you specify what your app would like to do? We render
the JavaScript of the page; for now we don't provide a way to specify whether
you want to render the page or not, but we plan on doing so.

------
gargarplex
Has anyone tried this for careers pages? Would be interested in how this
performs on a random sample of ~50 Crunchbase NYC startups' careers pages. I
dunno how much time would have to be spent on training data...

~~~
HPouillot
We did :) It works on all kinds of pages. You just have to set it up on one
page and it will work on all similar pages of the website. Were you thinking
of training a model to recognise careers pages across websites?

~~~
gargarplex
Yeah, that would be really helpful. I want to monitor the careers pages of all
local companies in the Crunchbase NYC geo in order to help candidates search
for local companies by keyword (e.g. C#). We already have an API (synced with
Algolia) to receive the jobs, with a unique key on each job's URI, and we
wouldn't want to scrape more than once per day.
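(A once-per-day sync keyed on each job's URI boils down to a de-duplication
step like the sketch below. The field names are illustrative, not the actual
API described above.)

```python
def new_jobs(scraped, seen_uris):
    """Keep only jobs whose URI has not been pushed to the jobs API yet.

    `scraped` is the day's scrape: a list of dicts with a 'uri' key (the
    unique key). `seen_uris` is the set of URIs already stored; it is
    updated in place so the next daily run skips today's jobs too.
    """
    fresh = [job for job in scraped if job["uri"] not in seen_uris]
    seen_uris.update(job["uri"] for job in fresh)
    return fresh
```

In a real pipeline `seen_uris` would be backed by the existing store (e.g.
the Algolia index) rather than an in-memory set.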

~~~
mLuby
Would love to use that if/when you get it working.

~~~
gargarplex
It's quite a daunting project, but if you want to join @codeforcash on
Keybase, I would definitely welcome support.

------
tipalink
Very cool! Is there a way to authenticate to a site and then keep a session
alive to scrape private content? Can it pass cookies, or can you manually set
headers?
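(For anyone hand-rolling this in the meantime: the generic workaround is to
capture a session cookie from a logged-in browser and set the headers
manually when building requests. A minimal stdlib sketch, unrelated to
Dashblock's internals; no request is actually sent here.)

```python
import urllib.request

def build_authed_request(url, session_cookie, extra_headers=None):
    """Construct a request that replays an authenticated session by
    setting the Cookie header manually. The cookie name 'session' is
    illustrative; use whatever the target site actually sets."""
    headers = {"Cookie": f"session={session_cookie}",
               "User-Agent": "my-scraper/0.1"}
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)
```

Passing the resulting request to `urllib.request.urlopen` would then fetch
the page as the logged-in user, until the session expires.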

~~~
MCorbani
Not yet, but we are working on it =)

------
sidcool
Congratulations on launching. This seems like a cool idea. I have some
reservations on how widespread the adoption could be. But I love the concept.

------
h0h0h0h0111
I've had some ideas that have relied on scraping data from sources that don't
provide an open API (and server-render their sites), and the scraping part has
been a bit of a barrier - gotta say I'm amazed how easy it was using your
tool. The UX was pretty intuitive also, I like that you've basically embedded
a web browser, cos everybody already knows how to use a web browser!

~~~
HPouillot
Thanks for your feedback !

------
reeddavid
This looks awesome, just tried it out on Poshmark (they don't have a feature
to alert me when new items in my size are listed). I was a huge fan of Kimono
Labs before they stopped operating, and this serves a similar purpose for me.

I might have missed it, but how can I see (or edit) the configuration of an
API I've created? It looks like all I can do is run the API or delete it.

~~~
HPouillot
I was a huge fan of Kimono as well. You can't edit an API for now, but we will
add this feature in the next release.

------
mLuby
I like how simple it is—best of luck! (BTW I think your demo video can be
shortened in the middle; after 6 selectors it's clear how that works.)

1. How hard would it be to do _inputs_? That is, there's a form that I have
to fill out manually, but I want to do so by API.

2. How well does this work for creating UX tests? The Selenium "no code"
tools I've seen are terrible.

~~~
HPouillot
Thanks!

1. It changes the user experience, but the underlying model stays the same
and will let our users record sessions with inputs and clicks in upcoming
releases.

2. Indeed, if you can replay a session you can check the data is what you
expected. What solutions have you tried so far?

------
surfer77
Love this. I submitted it to API List
([https://apilist.fun/api/dashblock](https://apilist.fun/api/dashblock)). I've
been seeing more and more scraping APIs become available; it seems to be
becoming a very competitive industry, and this is a unique solution (at least
from what I've seen).

~~~
MCorbani
Thanks =)

------
RazZziel
Looks promising, but it's only available for OSX and Windows. Will we be
seeing a Linux release soon?

~~~
MCorbani
Yes! We have been quite busy since the end of YC, but we plan on releasing it
soon =) Please ping us at hello@dashblock.com and we will let you know when
the version is live!

------
borisandcrispin
I tried a couple of web scraping tools over the past few weeks and Dashblock
was by far the best. Easy to get started, and getting the results through an
API is exactly what I wanted. (In my case I connected it to Zapier +
Airtable.)

~~~
HPouillot
Thanks for your feedback !

------
quickthrower2
If this works for amazon.com.au, with its 20 different page layouts and page
navigation systems (sometimes AJAX, sometimes not) for different product
types, I'll be impressed.

~~~
MCorbani
Indeed, Amazon has different layouts and can be tricky. For now our model is
resilient to minor changes, but we are working on improving it. amazon.com.au
looks like a good test ;-)

------
the_watcher
It looks like the 10K API call offer is limited to people who sign up for the
developer plan ($149/mo), but your post implies it's free. Did I misread the
offer in your post?

~~~
MCorbani
No, you read it correctly: by creating an account today, you will get 10k
FREE API calls =)

~~~
lucasverra
Good marketing, I'm creating an account to use later in the year.

What's the minimum macOS version? Why not web, if this is Electron?

~~~
HPouillot
Ahah, great! The minimum required version is 10.10 (Yosemite).

If you want to do this on the web, you have to render the page in an iframe
to select the content, and most websites don't allow it. In short, the user
experience is way better with a desktop app.

~~~
lucasverra
OK, downloading now. Can I still benefit from the offer, pretty please :)

------
kull
I installed it but it didn't get me the data I needed. I am still gonna use
ParseHub, which lets me easily go up and down the HTML tree to get data
hidden under layers of divs.

------
____Sash---701_
See - [http://www.nightmarejs.org/](http://www.nightmarejs.org/)

------
hbcondo714
Congrats on the launch! Seems a little similar to Diffbot.com but they do not
require a client download.

~~~
MCorbani
Correct! Also, Diffbot automatically extracts generic entities (e.g. product
name and price, comments, etc.), while we let our users choose exactly the
data they want on any webpage =)

------
username18
Is the number of API calls per month? Is the answer the same for a free
account?

~~~
HPouillot
You have 10k API calls when you sign up and 1k per month after that. Does
that answer your question?

------
udayrddy
Did you say webSITE, not webPAGE?!

Oh wow, instagram.com is in your YouTube demo video thumbnail. Interested to
know how it traverses the site; I don't think FB has made the usernames
public.

~~~
MCorbani
We don't crawl websites yet. However, you can create an API on a given
webpage and gather data from similar webpages on the same website by calling
the API with the new URL.

------
awad
How do you differentiate from Octoparse?

~~~
MCorbani
There are plenty of differences, among which: 1/ we don't rely on classic
selectors (CSS, XPath, etc.), which allows us to be resilient to website
updates; 2/ we offer a simple UI that automates data selection and
structuring; and 3/ we are available on Windows and macOS =)
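(As a toy illustration of why moving away from fixed selectors helps; this is
not Dashblock's model. Extraction keyed on a semantic cue, here a currency
pattern, survives a markup change that would break a hard-coded CSS/XPath
path.)

```python
import re

def extract_price(html):
    """Find a price by a semantic cue (a currency symbol followed by a
    number) instead of a fixed location in the document, so the same
    extractor works after the page's structure changes."""
    match = re.search(r"[$€£]\s?(\d+(?:\.\d{2})?)", html)
    return float(match.group(1)) if match else None
```

A selector like `div.price-box > span#p` would return nothing after a
redesign, whereas the cue-based extractor still finds the value.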

~~~
kabacha
> we don't rely on classic selectors (CSS, xPath, etc)

I'm not buying this. Does the AI process HTML as text, lol? Surely it
processes it as a tree, right?

------
bayareamaverick
Congrats! Look forward to trying it out.

~~~
MCorbani
Thanks! Let us know if you have any question or feedback =)

------
cryptozeus
Really neat !

~~~
MCorbani
Thank you !

------
jtx22
I wonder how long it'll take these sites to require Captcha for basic access.

~~~
MCorbani
Good question. However, that would require websites' users to solve a Captcha
every time they navigate, which is not optimal in terms of user experience.

~~~
skidMarkUndies
reCAPTCHA v3 operates behind the scenes, though:

[https://developers.google.com/recaptcha/docs/v3](https://developers.google.com/recaptcha/docs/v3)

~~~
MCorbani
Good point! That's why our plan is to focus on use-cases that create value for
websites too, in order to partner up with them.

