
Show HN: Turn any website into an API (for those who miss Kimono) - welanes
https://simplescraper.io/
======
phsource
This is very cool! I love how you brought back the original Kimono UI with the
checkmark and Xs for adding and removing data tags.

We built WrapAPI ([https://wrapapi.com](https://wrapapi.com)) back in the day,
before we ended up starting Wanderlog
([https://wanderlog.com](https://wanderlog.com)), our current travel planning
Y Combinator startup. This definitely is still an unsolved problem.

However, from a business point of view, we found that it was rather difficult
to make a business out of an unspecialized scraping tool. The Kimono founders
expressed a similar sentiment: ultimately, scraping is a solution looking for
a problem.

Developers can often roll their own solution too, which limits your customer
base and how much you can charge. Instead, vertical-specific tools that target
particular industries seem to be the way to go (see Plaid as an example!).

Alternatively, you have to be good at Enterprise and B2B sales. This is a
product where you need to get the word out, find a champion, and do customer
success, since it has a substantial learning curve. We were not, so we chose
to focus on other projects instead.

Best of luck, and feel free to get in touch if you'd like to chat more.

~~~
MetalGuru
Curious, what comparison are you making with Plaid here?

~~~
phsource
Plaid, Yodlee, and others abstract away extracting data from various banks
and financial services providers, so they're providing a solution built on top
of the same data extraction techniques that this tool uses.

~~~
MetalGuru
Oh, interesting. I thought they just provided secure authentication to an
app’s end users’ bank accounts for things like payments (an alternative to
someone like PayPal doing two microtransactions, then having you confirm the
amounts as a way of validating it’s your account). It’s not like Plaid is
scraping financial data though, right?

~~~
detaro
Scraping is incredibly common with banking apps like that, because many banks
do not have APIs (and are only changing slowly).

------
welanes
Hey HN, I posted this in a comment thread the other day and (to my surprise)
it got a positive reception, so I added a few more updates and decided to post
it properly.

The idea is to be able to choose a website, select the data you want, and make
it available (as JSON, CSV or an API) with as little friction as possible.

Kimono was the gold standard for a while, so I did yoink some of their ideas,
while doing some other things differently.

It still needs some work, but as an MVP I'd appreciate any feedback. Cheers.

~~~
nannal
>would appreciate any feedback

Any option for a firefox build?

~~~
welanes
Yes, working on it now.

------
beagle3
I don't feel it's right to describe this as "turns a website into an API";
"gives scraped data through an API" would be more accurate.

"Turn website into an API", for me, evokes the image that I can automate (say)
placing an order in Amazon as an API, or paying my bills automatically. It
includes scraping, of course, but requires a lot more
(mechanize/twill/selenium/phantom/etc power).

There was a company called Orsus that did exactly that. Last I heard about
them it was the year 2000.

------
uberswe
I like the idea, but I was skeptical as to how well it works, and I noticed
that the video on the main page of your website, which scrapes coinmarketcap,
seems to be wrong. It gets 200 cryptocurrency names but only 100 prices, which
means only the first result is correct.

I have a similar idea that I'm working on, your site is definitely bookmarked
and will try the extension later.

~~~
treve
It's also interesting that this main example is a violation of coinmarketcap's
terms. They have a paid API.

~~~
chirau
If I use my pen and notebook to write down all those values, am I also in
violation of those terms?

If they don't want their data to be scraped, it is up to them to secure it.

~~~
treve
The argument you're making here is 'I don't believe in copyright'. Which is
fine, but doesn't really negate my point.

~~~
chirau
It's a moot point. I very much believe in copyright, but you can't just put
info in the public domain and yell, "Take a look but don't remember/retain it"
in the name of copyright. If I redistribute it or reuse it for commercial
purposes without your consent, then maybe there is a case. But if I am just
scraping it, i.e. remembering it... Come on now.

Otherwise, everyone who gets the lyrics to copyrighted songs or memorizes them
and sings them in the shower is also in violation of copyright, which would
reduce the whole copyright thing to ridiculousness.

~~~
treve
All I said was that it's against their terms of use. I didn't try to make a
point about whether it _should_ be or not. If you're curious about it, and
whether using pen and paper is allowed, take a look at them.

------
save_ferris
What is it about this service as a business model that prevents it from taking
off? I’ve known at least two YC startups that tried to build businesses around
this idea.

I think one or both were acquired and immediately shut down, but I’m not 100%
sure about that.

~~~
tsergiu
I'm the founder of parsehub.

We are doing well and are independently owned.

I think there are 3 things that contribute to this:

1\. It is very easy to make a prototype that looks "magical" but very hard to
build something that works in real applications. There is an enormous number
of quirks that a browser allows, and each site you encounter will use a
different set of those quirks. Sites also tend to be unreliable, so whatever
you build has to be very resistant to errors.

2\. There is a technological wall that every company in this space reaches
where it is not yet possible to mass-specialize for different websites. So
even if you're able to build a tool that works very well on any individual
website, the technology is not there yet to be able to generalize the
instructions across websites in the same category. So if a customer wants to
scrape 1000 websites, they still have to build custom instructions for each
website (a 5-10x reduction in labor vs scripting), when what they really want,
and what is economically viable for them, is to build a single set of
instructions that will work for all similar websites (a 10000x reduction in
labor vs scripting).
This is something that we're working on for the next version of parsehub, but
is still a couple years away from launch.

3\. Many of the YC startups you hear about have raised funding from investors
and have short term pressures to exit.

The combination of the three makes it very tempting to give up and sell.

~~~
swalsh
#2 is what would transform this from a nice niche tool into something very
valuable. In the ecommerce space, tracking competitor pricing is a great
example of this type of thing. I can also see use cases for recipes,
finance, healthcare, you name it. Those b2b use cases are worth real money.

Just curious, in your experimentation, have you found it necessary to train a
new model for each "category"? Or have you found a way to generalize it?

~~~
tsergiu
Training a new model for each category is already possible today, but doesn't
achieve the goal (mass-specialization).

The problem is that when you pre-train a model, you can only solve for the
lowest common denominator of what every customer might want.

In ecommerce, for example, you might pre-train to get price, product name,
reviews, and a few other things that are general to all ecommerce. But you
won't pre-train it to get the mAh rating of batteries, because that's not
common to the vast majority of customers (even within ecommerce). It turns out
that most customers need at least a few of these long-tail properties that are
different from what almost every other customer wants, even if most of the
properties they need are common.

And so the challenge is to dynamically train a model that generalizes to all
"battery sites" based on the (very limited) input from a customer making a few
clicks on a single "battery site".

------
ainiriand
Hi, is it possible to make it compatible with firefox?

~~~
welanes
Sure, in fact I'll do it this weekend.

------
mikikian
Maybe a better business model is to offer this as a service to site owners who
are not tech savvy. Site owners would then have the ability to offer an API to
new customers, making it a win/win: site owners can now offer an API (free or
paid), and API consumers can rely on getting the data in the future.

------
MildlySerious
I just gave this a shot on the ISO website to get a list of country codes[1],
but it seems the selection algorithm breaks down when there are no specific
classes applied to elements: every td.v-grid-cell is selected, which is all of
them, instead of, for example, the values of the alpha2 column.

This seems hard to solve entirely programmatically. Maybe having a way to be
more specific, by providing a selector yourself or by selecting multiple
entries and having the plugin figure it out, could add a lot of utility in
such cases.

[1] -
[https://www.iso.org/obp/ui/#search/code/](https://www.iso.org/obp/ui/#search/code/)
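
For illustration, the "select by column position instead of by class" idea can be sketched with Python's stdlib parser. The table snippet below is a simplified stand-in for the ISO page (which is actually rendered dynamically), and ColumnExtractor is a hypothetical helper, not part of Simplescraper:

```python
# Sketch: when every cell shares one class (e.g. td.v-grid-cell), a
# class-based selector matches the whole grid. Picking cells by column
# position recovers a single column instead.
from html.parser import HTMLParser

class ColumnExtractor(HTMLParser):
    """Collect the text of the Nth <td> in each <tr>."""
    def __init__(self, column):
        super().__init__()
        self.column = column
        self.cell_index = -1   # reset at every <tr>
        self.in_td = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell_index = -1
        elif tag == "td":
            self.cell_index += 1
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and self.cell_index == self.column:
            self.values.append(data.strip())

html = """
<table>
  <tr><td class="v-grid-cell">Belgium</td><td class="v-grid-cell">BE</td></tr>
  <tr><td class="v-grid-cell">Ireland</td><td class="v-grid-cell">IE</td></tr>
</table>
"""

parser = ColumnExtractor(column=1)  # second column: the alpha-2 code
parser.feed(html)
print(parser.values)  # ['BE', 'IE']
```

In CSS-selector terms this is roughly `td:nth-child(2)`, which is the kind of positional selector the plugin could fall back to when classes aren't distinctive.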

------
nopcode
I believe this could be a good solution to turn legacy software into an API.
The “generated code” should be a reverse proxy, not a scraping lib.

Also, scraping a website to use/copy its data is illegal in my country
(Belgium). I’m not sure whether this tool itself would be.

~~~
ilrwbwrkhv
Nothing can stop it. Lots of Belgian sites are scraped every day across the
world.

------
flingo
Is there a reason this doesn't spit out some Python or JavaScript code to
scrape the same info?

This just seems to add another dependency to whatever I'm developing. Plus, it
sends data through a server I don't control. (I assume)

~~~
petr-nagy
Did you read the website? It says "Scrape locally or create recipes that run
quickly in the cloud."

Also, what use would a website spitting out essentially the same python/js
script over and over have?

~~~
flingo
I must have skimmed past that. Whoops. I avoided trying it out because it's
not available on Firefox, so I couldn't correct my assumption by testing it.
Also, I couldn't easily find a copy of the extension source and gave up.

The site/extension basically has to do that each time it _scrapes locally_ (or
use a generic parametrised scraper). If you wanted to use it in an API, my
impression is that you either run it in Chrome as an extension you need to get
from the Chrome store, or tunnel your data through a third-party server. Is
that wrong?

Can you scrape data locally without running Chrome/the extension? I can't tell
from reading the site, sorry. (If it's actually there, please link an anchor
tag to it or something.)
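
For what it's worth, the "generic parametrised scraper" idea mentioned above can be sketched in a few lines: one engine interprets a recipe instead of emitting a fresh script per site. This is a toy Python illustration with made-up tag-name "selectors", not how the extension actually works; a real tool would use full CSS selectors and likely a headless browser:

```python
# Sketch: a recipe maps field names to (simplified) selectors, and a
# single engine applies any recipe to any page, so no per-site script
# needs to be generated.
from html.parser import HTMLParser

class RecipeScraper(HTMLParser):
    def __init__(self, recipe):
        super().__init__()
        self.recipe = recipe                      # {"field": "tag"} pairs
        self.current = None                       # field being captured
        self.data = {name: [] for name in recipe}

    def handle_starttag(self, tag, attrs):
        for name, wanted_tag in self.recipe.items():
            if tag == wanted_tag:
                self.current = name

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.data[self.current].append(data.strip())

# Hypothetical recipe and page, just to exercise the engine.
recipe = {"title": "h2", "price": "em"}
scraper = RecipeScraper(recipe)
scraper.feed("<h2>Widget</h2><em>$5</em><h2>Gadget</h2><em>$9</em>")
print(scraper.data)  # {'title': ['Widget', 'Gadget'], 'price': ['$5', '$9']}
```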

------
maroonblazer
I like this.

Please consider adding the ability to script clicks on elements, e.g. buttons.

I manage a site where we load a subset of articles on initial page load and
then have a "Load more" button that executes Javascript to load another batch
of articles. Getting a list of articles from our CMS is a bit of a hassle so
being able to scrape it easily instead would be ideal.

~~~
welanes
Hey, right now you can select a Pagination element that the app will use to
load the next page / new data.

If the site's publicly accessible and you're able to share, send the details
to mike @ simplescraper.io and I'll get this working for you.

------
holeyness
Does this work with authenticated pages?

~~~
welanes
Yes - you're able to save data behind a login using the point-and-click
functionality, as it extracts whatever data is loaded in your browser ("local
scraping").

And no - if you choose to also create a cloud recipe that runs on the server,
the remote browser instance won't be able to access data behind a login.

It's possible but I'd rather not store third-party credentials for the time
being.

------
mrskitch
This is super cool. I really enjoyed and missed the Kimono workflow.
Automating something like this with browserless.io would be really fun (I run
that project). Extensions are one of the things we’re looking to support.

Anyway, send me an email at joel at browserless dot io if you ever want to
chat.

~~~
welanes
Cheers Joel. I have most of your blog posts on Puppeteer bookmarked - super
helpful and well written.

For sure, once the app is a notch more tried and tested I'll get in touch.
Appreciate it.

------
joelvalleroy
Awesome! One question I have after reading the page is: what are the pricing
plans concerning credits (for automated scraping)?

~~~
welanes
Right now it's free and will be until it's stable. Starting price will be
about $25 for 4000 scraping credits, 200k API calls and data storage.

This will likely change as I have more stats and feedback on usage and
expenses. But the goal is to offer a price point that's fair and low relative
to other options.

------
ntaylor
Kimono was cool, nice to see another option. I still have a Kimono t-shirt in
a drawer somewhere.

~~~
kitd
_Kimono t-shirt_

Hmm, definite missed merch opportunity there.

------
matz1
How do I use the 'pagination' feature? The help guide doesn't even mention
it.

~~~
welanes
Hey, yes the guide still needs work. Here's what you gotta do:

\- Click the pagination icon and then click the pagination element (usually
'Next' or an arrow). The icon will turn green

\- Click 'view results' and then choose to save the recipe

\- Select the number of pages you'd like to scrape

\- Run your recipe and it will scrape those pages
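
Under the hood, steps like these boil down to a follow-the-next-link loop. Here's a runnable Python sketch with stubbed fetching (fetch_page and its URLs are hypothetical stand-ins for the extension's actual scraping logic, just to make the loop concrete):

```python
# Sketch of what a pagination recipe does: keep following the "next"
# element until it runs out or the page limit is reached.
def fetch_page(url):
    """Stub: pretend to fetch and parse a page, returning its items
    and the URL of the next page (None when there isn't one)."""
    pages = {
        "/items?page=1": {"items": ["a", "b"], "next": "/items?page=2"},
        "/items?page=2": {"items": ["c"], "next": None},
    }
    return pages[url]

def scrape(start_url, max_pages):
    results, url = [], start_url
    for _ in range(max_pages):
        page = fetch_page(url)
        results.extend(page["items"])
        if page["next"] is None:   # no pagination element left
            break
        url = page["next"]
    return results

print(scrape("/items?page=1", max_pages=5))  # ['a', 'b', 'c']
```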

------
monkeydust
Looks good, could this be integrated into n8n.io to be used to drive a
workflow?

------
cfan01
Firefox add-on please.

------
earth2mars
If you can add an RSS feed response, that would be great.

~~~
SweeToxin
If you need data from a website that updates on a regular basis there’s a
recent Show HN I’ve seen that does exactly this
[https://news.ycombinator.com/item?id=21398524](https://news.ycombinator.com/item?id=21398524)

------
nightnight
OT: or just use Puppeteer. Not really hard, free, and you can rule the
world.

