
Show HN: Link.fish – API to extract data from websites as JSON - linkfish
https://link.fish/api
======
callmeed
Having done a ton of scraping in the past (especially around ecommerce and
products), this looks pretty cool.

A couple comments in general:

1\. Personally I think it's better to be great at extracting _one_ kind of data
instead of _average at many types_. It makes sales and growth efforts easier.
Pick one of those things (products, recipes, social, etc.) and just focus on
that and get great at it.

2\. I don't think you need the credit <-> request abstraction. Anyone using an
API knows what a request is (I hope).

Now, a few comments regarding products specifically:

1\. I got 500 errors on a couple random product URLs.

2\. On an Amazon product that's on sale, I got back the original price but not
the sale price.

3\. If you truly want to be GREAT at scraping products, the 2 things most
people in this space can't do are: (a) extract ALL high-res images for a
product, and (b) extract a product's options and variant data (colors, sizes,
etc. and availability for each combination)

Personally I think there are a ton of opportunities in this space. This is a
good start and I wish you the best.

~~~
matt_wulfeck
> _Personally I think there are a ton of opportunities in this space._

Totally agree. The problem I see is that once you become big enough to be
noticeable, websites start throwing ban hammers at you for flagrant disregard
of their TOS. Working around that stuff becomes an art in and of itself.

~~~
sharpshadow
One solid way would be to offer the service as a browser add-on. With that you
can avoid any blocking, because the user is making the requests themselves.

------
laktek
Nice work! I also built something very similar called Page.REST (Show HN
thread:
[https://news.ycombinator.com/item?id=15189099](https://news.ycombinator.com/item?id=15189099))

Page.REST supports extracting content using CSS selectors, oEmbed, and
OpenGraph tags. Also, do you plan to support extracting from client-side
rendered pages (a la React)?

BTW, I'm interested in how you decided on the pricing. I went with a $5
one-time fee, as most people use such tools for ad-hoc purposes.

General question to readers: What do you think of the Schema.org format? Is it
easy to consume? (from a language library perspective)
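To the question of how easy schema.org output is to consume: here is a minimal Python sketch, assuming a hypothetical JSON-LD `Product` payload (the field names come from the schema.org vocabulary, but the payload itself and the `read_product` helper are made up for illustration):

```python
import json

# Hypothetical schema.org JSON-LD payload, similar to what such an API might return.
payload = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}
}
"""

def read_product(doc: str) -> dict:
    """Pull a few common fields out of a schema.org Product node."""
    node = json.loads(doc)
    if node.get("@type") != "Product":
        raise ValueError("expected a schema.org Product")
    offer = node.get("offers") or {}
    return {
        "name": node.get("name"),
        "price": float(offer.get("price", "nan")),
        "currency": offer.get("priceCurrency"),
    }

print(read_product(payload))
```

In my experience the format is easy to consume from any language with a JSON parser; the main friction is that many fields (like `offers`) can be either a single object or a list, so consumers need to normalize for both shapes.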

~~~
BrandoElFollito
Do you support authenticated calls? (You mention "public urls" but I do not
know if this means "on the Internet" (as opposed to a private IP space) or
"non-authenticated".)

~~~
laktek
I will be launching authenticated requests soon. Would you like to get early
access? Drop me an email (contact details are on my profile).

PS: I'd appreciate it if you could become an early adopter of the service.
That helps me scale faster :)

------
jotaen
Bug report: when I try out the service on your frontpage, the URLs seem to get
converted to lowercase internally. So if I try to fetch this URL
[https://goo.gl/DKukBD](https://goo.gl/DKukBD) (which points to this very HN
submission), it actually queries
[https://goo.gl/dkukbd](https://goo.gl/dkukbd) (which points to some random
website)
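For what it's worth, per RFC 3986 only the scheme and host of a URL are case-insensitive; the path is case-sensitive, which is exactly why lowercasing the whole URL breaks shorteners like goo.gl. A minimal normalization sketch (the `normalize_url` helper is hypothetical, just to show which parts are safe to lowercase):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Lowercase only the scheme and host; RFC 3986 keeps the path case-sensitive."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))

# The path segment "DKukBD" must survive untouched:
print(normalize_url("HTTPS://Goo.gl/DKukBD"))  # https://goo.gl/DKukBD
```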

~~~
linkfish
Thanks a lot! Will investigate and fix. If you run into any other issues
please keep them coming!

------
mericsson
Well done! How does this compare to diffbot's website extraction?
[https://www.diffbot.com/](https://www.diffbot.com/)

~~~
dwynings
_Disclaimer: I work at Diffbot_

Major differences I can see (OP, feel free to correct me if I'm wrong):

Link.fish

* doesn't provide a web crawler

* relies heavily on microdata, schema.org, RDFa, etc

* relies on manual parsers for sites that don't have microdata embedded

* doesn't full-render pages by default (Diffbot renders every page, so it can use computer vision to automatically extract the data)

* doesn't support proxies

* doesn't support entity tagging

Probably plenty more, but that's what jumps out to me at first blush.

\--

Since I see other people have mentioned price as a concern, we're always
willing to help out bootstrapped startups. Just shoot me an email:
dru@diffbot.com

------
pen2l
It's been quite a while since I last did web-scraping (I used to use
BeautifulSoup, more than a decade ago).

I'm just wondering, since a lot of people are using fairly advanced cloud-
hosting solutions with, I assume, tools offered by their respective hosting
place to fight spam, is web-scraping a lot different from what it used to be
about a decade ago? What steps do you guys take to prevent being identified as
a bad actor by the place that you are scraping?

And on the other end, if you have a data-rich website, what are your feelings
toward aggressive scrapers?

~~~
twblalock
CDNs like Distil Networks and Cloudflare make scraping more difficult than it
used to be. If you get caught by them, you can end up blocked from all of the
sites they protect, not just the one you were scraping.

~~~
always_good
Writing some scrapers this week, I noticed it's also common for the origin
server to just check whether the request is coming from a VPN/VPS IP address
range.

For example, the exact same request will work from your home connection while
it fails from EC2.
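A sketch of that kind of check from the server's side, using Python's `ipaddress` module. The CIDR blocks below are illustrative samples, not authoritative; real cloud providers publish full, regularly updated lists (e.g. AWS's ip-ranges.json) that a server would load instead:

```python
import ipaddress

# Sample datacenter-style CIDR blocks for illustration only; in practice you
# would load the providers' published ranges rather than hard-code anything.
DATACENTER_RANGES = [
    ipaddress.ip_network("3.0.0.0/9"),      # AWS-style block (example)
    ipaddress.ip_network("35.190.0.0/17"),  # GCP-style block (example)
]

def looks_like_datacenter(ip: str) -> bool:
    """Return True if the client IP falls inside any known datacenter range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)

print(looks_like_datacenter("3.5.1.2"))        # inside the sample AWS-style block
print(looks_like_datacenter("192.168.1.10"))   # private/home-style address
```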

------
etewiah
I've been checking out link.fish for a while now - awesome product! My
interest is in scraping real estate websites, and it seems to do quite a good
job with many of the sites I've tried. Already mentioned by others, but I
suggest:

1\. Concentrate on a specific segment (like real estate)

2\. Consider a browser extension (helps mitigate the problem of too many
requests coming from one central server)

I have long planned on building an open source real estate website scraper but
just haven't found the time to do it.

------
mmahemoff
Looks good, but why complicate the pricing with "credits" when 1 credit == 1
request? The tiers already reference parallel "requests", so you could just
say N requests instead of N credits.

~~~
janober
The reason is that we also offer full-browser rendering of pages (to execute
all JavaScript and take screenshots), which takes much more resources.

------
holtalanm
This looks almost exactly like the functionality provided by
[https://page.rest/](https://page.rest/)

------
adventurer
520 error here. Hug of death or they took your site down.

~~~
janober
Very sorry for that. I thought a smaller server could handle the load because
of Cloudflare, but I was apparently wrong. It's up again, btw.

------
TenJack
Wasn't this submitted already?
[https://news.ycombinator.com/item?id=15099041](https://news.ycombinator.com/item?id=15099041)
[https://news.ycombinator.com/item?id=14522439](https://news.ycombinator.com/item?id=14522439)
How are you able to do a 'Show HN' in such quick succession?

~~~
janober
It's the same domain but a different product. The one posted before is the
bookmark manager, mainly B2C. The product I posted now is the B2B version,
which uses the same technology behind the bookmark manager but allows access
via API.

------
nl
I've been playing around with a way of extracting information from the text on
websites (eg, finding names of people or price ranges in a textual story
rather than in a table).

I've got as far as something that works much better and is much more flexible
than things like UoW OIE or just using Stanford named entity recognition.

Is this a thing others need or would find useful?

------
kashprime
Is this similar to Apify? [https://www.apify.com/](https://www.apify.com/)

~~~
linkfish
Yes, it's similar in the regard that both tools can extract data from
websites.

------
nailer
Trying a site with a simple HTML schedule:
[https://www.pineapple.uk.com/studio/index/filter/](https://www.pineapple.uk.com/studio/index/filter/)
didn't work. I'd love it to be able to do this and would pay a small per-API-
call fee.

~~~
janober
I clicked something together very quickly, but it did not work in the end
because of the website's domain (the library I use did not know uk.com). Will
fix that issue tomorrow. You can either simply check again tomorrow or contact
me at api@link.fish .

~~~
nailer
It works, but only on that site - a similar site
([http://studio68london.net/work/timetable/](http://studio68london.net/work/timetable/)
fails). I want to be able to point something at an arbitrary URL and have it
extract the tabular data. I'd write this myself, but I'd rather pay someone
else to maintain it.

------
lawl
I tried a random reddit thread. Did not fetch comments, only information about
the submission. Then I tried it with HN. Same. Then I tried it with a github
issue. Same.

Then I tried it with the first link I got from news.google.com which was
nytimes. No article text included.

Maybe I'm misunderstanding the purpose of this? Or was that just a string of
bad luck?

~~~
janober
No, it's not just bad luck. We have not concentrated much on "text pages" like
blogs or articles yet, mainly on pages which contain more data like prices,
geo coordinates, social media profiles, ... That said, support for the
mentioned pages can simply be added via our point-and-click GUI by any user.
Sadly I do not have time right now, but I can add support for these pages by
tomorrow.

~~~
weego
So your paid-for product for scraping and structuring information from a
webpage cannot actually return most content off a webpage as it is now?
Wouldn't that be a more important vertical slice of a product for an MVP than
having a fully thought-out pay-tier system?

~~~
nedwin
Maybe? It's all about tradeoffs.

I could see why you would want to figure out how much value the MVP is
creating, and $$$ is an honest way to do that.

It sounds like two things are happening with the MVP:

\- emphasis on more complicated sites with more data (higher propensity to pay)

\- this functionality is actually possible, but a user needs to take the time
to set it up via the GUI.

Feels like a pretty good tradeoff to me.

~~~
janober
Thanks, seems like you got it ;-)

------
sjs382
In the pricing, different tiers have different "priorities". It would be
helpful to know what these "priorities" mean in the real world.

If I submit a request as a low-priority user, should I expect a response in 1
second? 1 minute? 1 hour? Something else? And how consistent is the amount of
time I should expect to wait?

~~~
janober
The time it takes, in general, is mainly dependent on how fast the page loads.
The priority simply means that if, for some reason, there are more requests at
a given time than we can handle, the people on the higher plans get served
first. In other words, for 99% of requests there should not be any time
difference at all.
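The behavior described above can be sketched as a priority queue that serves higher tiers first but stays FIFO within a tier. This is a toy illustration only; the tier names are made up and this is not necessarily how link.fish is implemented:

```python
import heapq
import itertools

# Lower tier number = higher priority; the counter preserves arrival order
# within a tier so requests in the same plan are served FIFO.
_counter = itertools.count()

def enqueue(queue: list, tier: int, request: str) -> None:
    heapq.heappush(queue, (tier, next(_counter), request))

def drain(queue: list):
    """Yield requests in service order: by tier first, then arrival order."""
    while queue:
        yield heapq.heappop(queue)[2]

q = []
enqueue(q, 2, "free-plan request")
enqueue(q, 0, "business-plan request")
enqueue(q, 1, "starter-plan request")
print(list(drain(q)))  # business first, free last
```

Under normal load the queue never backs up, so every tier sees the same latency; the ordering only matters during a surge, which matches the "99% of requests" claim above.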

~~~
sjs382
I understand that people with higher priority get served first.

The question is how that will affect someone's real-world use.

~~~
janober
As written above, it should normally not make a difference at all. But it is
certainly possible that, if there is an unexpected huge surge, the people on
the lowest plan suddenly have to wait a few seconds longer.

~~~
sjs382
I understand. I realize that I wasn't clear, but my second comment was
feedback regarding your marketing page.

------
ojanik
[https://en.wikipedia.org/wiki/The_Expanse_(TV_series)](https://en.wikipedia.org/wiki/The_Expanse_\(TV_series\))

Error: 500 Message: { "status": 500, "message": "Internal Server Error" }

~~~
janober
Yes, it seems like you found a bug ;-) It's going to get fixed tomorrow. Thanks!

------
tylerpachal
There is a small typo in the "Why link.fish API?" section on the homepage:

> Additionally, do we have a growing collection of custom parsers for websites
> and website independent parsers for specific data.

I don't think you need the "do".

~~~
linkfish
As a non-native English speaker I am always happy to get help in that regard
;-) Thanks!

------
noinput
A 404 on the TOS & Privacy Policy isn't the best way to build confidence.

~~~
linkfish
Very sorry for that! It worked everywhere else, but I should not have used a
relative link on the free-account signup page. It's fixed now.

------
nreece
Looks good!

_Shameless plug:_ Our little startup, Feedity - [https://feedity.com](https://feedity.com) - also helps create custom RSS feeds for any webpage.

------
danielvinson
The feature I'd be looking for in this is to be able to recursively scrape for
contact information (email, phone number, etc.)... that doesn't seem possible
with this?

~~~
linkfish
Not right now, no. However, an endpoint for exactly that is actually planned.
You can simply write to api@link.fish and we will inform you when it is ready.

------
janober
Hi, I just launched this API, so I would love to get feedback on what could be
improved or what is missing. It's a first version, so any kind of comments are
welcome!

~~~
wiradikusuma
I tried a link to some product in some ecommerce
([http://www.lazada.com.my/samsung-galaxy-note-8-6gb-
ram64gb-r...](http://www.lazada.com.my/samsung-galaxy-note-8-6gb-ram64gb-rom-
original-samsung-malaysia-set-black-91710731.html?spm=a2o4k.home.recommended-
items_31972.18.34aa7d86A4adxN&lzd_rec_event_src=&lzd_rec_event_dest=SA356ELABILOGBANMY&strat=rec_global_top_prods&pa=home_page.recommended_items)),
and it does extract the content... but I only care about the "hero" item. Is it
safe to just always take the 1st item in mainEntity>offers>offers[0]?

~~~
linkfish
It did actually extract just the data of the "hero" item. The thing is that it
gets offered by multiple companies for different prices, so all the prices are
valid and none is right or wrong. So it really depends on what you want. If
you simply want "a" price, you can take the first. If you want the cheapest
one, you would have to iterate over them to find it.
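That iteration can be sketched in a few lines, assuming a hand-made schema.org-style `AggregateOffer` (type and field names follow the schema.org vocabulary; the sample data and the `cheapest_offer` helper are made up, and the real API response shape may differ):

```python
# Hypothetical schema.org-style Product with one Offer per seller.
product = {
    "@type": "Product",
    "name": "Samsung Galaxy Note 8",
    "offers": {
        "@type": "AggregateOffer",
        "offers": [
            {"@type": "Offer", "seller": "Shop A", "price": "3399.00"},
            {"@type": "Offer", "seller": "Shop B", "price": "3249.00"},
            {"@type": "Offer", "seller": "Shop C", "price": "3499.00"},
        ],
    },
}

def cheapest_offer(node: dict) -> dict:
    """Return the lowest-priced Offer instead of blindly taking offers[0]."""
    offers = node["offers"]["offers"]
    return min(offers, key=lambda o: float(o["price"]))

print(cheapest_offer(product)["seller"])  # Shop B
```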

------
gbrits
So where's the point & click GUI to select items from a page? Signed up, but
can't seem to find it on first glance

~~~
janober
Sorry, yes, I have to make that clearer or mention it in the email. It's at
the top of the page under "Plugins" -> "Data Selector".

------
JanKoenig
This looks very helpful. Would love to use some of this structured data to
create a sample voice app for Jovo

------
swaraj
Does this use schema.org / og meta tags or are you trying to infer object
types yourself

~~~
linkfish
It uses a combination of everything to extract information, incl. custom
parsers. The data returned to the user is always in schema.org format.

------
attacomsian
Can't access the site. Is it down?

~~~
janober
Very sorry for that. I thought a smaller server could handle the load because
of Cloudflare, but I was apparently wrong. It's up again, btw.

------
purplepotato
I like this article.

------
taivare
like your logo

------
palani666
I tried a link to some product in some ecommerce
([http://www.lazada.com.my/samsung-galaxy-note-8-6gb-
ram64gb-r...](http://www.lazada.com.my/samsung-galaxy-note-8-6gb-
ram64gb-r...)), and it does extract the content... but I only care about the
"hero" item. Is it safe to just always take the 1st item in
mainEntity>offers>offers[0]?

~~~
janober
Actually answered that question already yesterday. For convenience here again:

It did actually extract just the data of the "hero" item. The thing is that it
gets offered by multiple companies for different prices, so all the prices are
valid and none is right or wrong. So it really depends on what you want. If
you simply want "a" price, you can take the first. If you want the cheapest
one, you would have to iterate over them to find it.

------
profalseidol
Nope, nothing useful. Rather build your own parser...

[https://www.pmu.fr/turf/02112017/R4/C4](https://www.pmu.fr/turf/02112017/R4/C4)

