Show HN: Spider Pro – easy and cheap way to scrape the internet (tryspider.com)
391 points by hyperyolo 42 days ago | 103 comments



While looking at the tool I realized that I built something similar many years ago. I wonder if it's worth digging up the source code, polishing it, and publishing it. Does the market need more of these tools? Are there features in this type of tool that seem to be completely missing from, or inadequately implemented by, what's already on the market?

And it's likely that running these as a server-side application is ultimately more appealing than this simpler-to-implement automation inside the user's browser, right? It seems that many companies providing similar (server-side) scraping tools have been successfully sold off... Is that an indication of (still existing) high demand for these?

EDIT to add: The tool and the accompanying website look fantastic, by the way. Congratulations on the 1.0 launch, Amie!


Does the market need yet another browser extension scraper like this? No, I don't think it does.

[1] https://chrome.google.com/webstore/search/scraper?_category=...

[2] https://addons.mozilla.org/en-US/firefox/search/?platform=wi...


25 is not very many on AMO; only a few seem relevant, and most are unmaintained. The only one that really seems usable is the webscraper.io one. Maybe ScrapeMate, but the author has moved on and is working on the Python/cloud ScrapingHub. Similarly, Chrome has Web Scraper Plus with poor maintenance, and the rest are wrappers/helpers for various websites.

The extensions market isn't particularly lucrative; extensions are like mobile apps, but with only 33% of users even knowing they exist. The SaaS companies don't have that growth limitation, hence their success. But if you want an extension as a resume item or something, getting a few thousand AMO users or Chrome reviews shouldn't be hard.


Interesting that their demo screencast shows Zillow. Zillow is fairly aggressive in applying anti-scraping defensive measures.


... which may be the reason their demo screencast shows Zillow.


Yeah, the problem is that Zillow imposes IP bans on you when you've been found to be scraping their site.


Which is why any serious effort involves rotating pools of proxies.


Not just rotating pools of proxies but sometimes shady gray-market residential proxies, so that you can appear to be coming from hundreds or thousands of unique, geographically distributed DOCSIS3/ADSL2+/VDSL2/GPON/whatever last-mile end-user customer netblocks.

If you want to go down a rabbit hole of shady proxies run on compromised/trojaned end-user SOHO routers or PCs, google "residential proxies for sale":

https://www.google.com/search?client=ubuntu&channel=fs&q=res...
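
For illustration only, here's roughly what "rotating through a pool of proxies" looks like with Python's requests library. The proxy addresses and credentials below are placeholders; commercial residential pools usually hand you a single rotating gateway endpoint instead:

    import random
    import requests

    # Placeholder proxy pool (TEST-NET addresses); a real residential pool is
    # typically rented, with the provider rotating exits per request or session.
    PROXIES = [
        "http://user:pass@203.0.113.10:8080",
        "http://user:pass@198.51.100.24:8080",
        "http://user:pass@192.0.2.77:8080",
    ]

    def fetch(url):
        proxy = random.choice(PROXIES)  # different exit IP on each call
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)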


I once worked for a place using this approach to scrape search engines.

It's amazing how easy and comparatively cheap it is to get access to thousands of residential IPs. Is it via spyware running on people's machines? Shady people working at ISPs doing nefarious things for cash? We never knew....

The key thing to know is that if you want your traffic to come from an IP "in" some other country (according to geolocation databases anyway) it's really only a few bucks a month to get a proxy. Most of them have poor IP reputation so they suck to use on Google, but work very well for everything else out there...


> Is it via spyware running on people's machines? Shady people working at ISPs doing nefarious things for cash?

It might be as simple as https://hola.org/ & https://luminati.io/ - "unblock a website, download our VPN client" - meaning you "unblock" by using somebody else's line. And they also sell access at Luminati. Most users aren't aware of the implications.


It's a combination of three general things:

a) The type of "services" luckylion mentions where people have opted in to a shady gray market thing reselling proxies through their connection.

b) Compromised home routers/gateway devices/internet-of-shit devices

c) Compromised home PCs (mostly Windows 7/10 trojans/botnets)


not that shady... luminati.io makes residential and mobile proxies a snap.


And IP tunneling...

Hello, all you social network folks who don't know how spam was the origin of social networks (FB, Friendster, hi5, blah blah blah).

Who the hell is documenting the history of the internet?


IP bans are simple to bypass.


Step 1) Invest money in non-Zillow real estate app

Step 2) Hammer Zillow with all known ip addresses

Step 3) Profit


Step 4) Friendly chats with FBI & SEC?


Most IP bans are only temporary.


I wonder how Spider Pro does with Facebook, Linkedin, Whitepages and others that try their best to block scraping but still have an introductory free to view webpage...


Since this is designed for non-technical users and only scrapes content that's already been displayed to the user, I can't imagine many folks would use it in a way those sites could even detect, unless they included a script on their site to detect this scraper explicitly.


And their documentation shows them scraping HN:

https://www.notion.so/Spider-Pro-Documentation-5d275abd49c64...


Please respect the robots.txt if you do. HN's application runs on a single core and we don't have much performance to spare.
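
For anyone scripting against HN anyway, a minimal sketch of honoring robots.txt with Python's standard library (the user-agent string is a placeholder):

    from urllib import robotparser

    rp = robotparser.RobotFileParser("https://news.ycombinator.com/robots.txt")
    rp.read()

    url = "https://news.ycombinator.com/item?id=1"
    if rp.can_fetch("my-scraper", url):  # placeholder user agent
        pass  # fetch the page, ideally with a generous delay between requests
    else:
        print("disallowed by robots.txt")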


Serious question: how is that possible? Somebody recently gave me a Dell R720 2RU server with 16 cores and 128GB of RAM for free. There's literally that much lightly used server gear showing up on the used market from companies that have migrated everything to aws/gcp/azure/whatever.

If all of HN runs on a single core, then you're running it on fewer server resources than I could buy on eBay with $180 and a Visa card?

https://www.ebay.com/itm/DELL-R610-64GB-12-CORE-2X-HEX-CORE-...


It's pretty easy if you're keeping everything in RAM and don't have layers upon layers of frameworks. You've got about 3B cycles per second to play with, with a register reference taking ~ 1 cycle, L1 cache (48K) = 4 cycles, L2 cache (512K) = 10 cycles, L3 cache (8M) = 40 cycles, and main memory = 100 cycles. This entire comment page is 129K, gzipping down to 19K; it fits entirely in L2 cache. It's likely that all content ever posted to HN would fit in 128G RAM (for reference, that's about 64 million typewriter pages). With about 30M random memory accesses being possible per second (or about 7 billion consecutive memory accesses - you gain a lot from caching), it's pretty reasonable to serve several thousand requests per second out of memory.
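
To make the arithmetic explicit (same rough numbers as above; the 10,000-accesses-per-page figure is just an illustrative assumption, not a measurement):

    # Back-of-envelope, using the estimates above: ~3e9 cycles/s and
    # ~100 cycles per uncached main-memory access.
    cycles_per_second = 3e9
    cycles_per_random_access = 100

    random_accesses_per_second = cycles_per_second / cycles_per_random_access
    print(random_accesses_per_second)           # ~30 million/s

    # Even if rendering one comment page cost, say, 10,000 random accesses,
    # one core still has headroom for roughly 3,000 such pages per second.
    print(random_accesses_per_second / 10_000)  # ~3,000 pages/s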

For another data point, I've got a single box processing every trade and order coming off the major cryptocurrency exchanges, roughly 3000 messages/second. And the webserver, DB persistence, and a bunch of price analytics also run on it. And it only hits about 30% CPU usage. (Ironically, the browser's CPU often does worse, because I'm using a third-party charting library that's graphing about 7000 points every second and isn't terribly well optimized for that.)

https://www.cryptolazza.com

Software gets slow because it has a lot of wasteful layers in between. Cut the layers out and you can do pretty incredible things on small amounts of hardware.


Would love to see a blog post on building it some time, I think the last vaguely similar (but serious) rundown in this vein was “One process programming notes”.

https://crawshaw.io/blog/one-process-programming-notes


> Serious question, how is that possible?

You see, it is possible. Take a step back and look at the evidence again: the site is called Hacker News, one of its first goals was to prove that something useful could be built in an entirely custom programming language, and it handles a huge amount of traffic, so keeping it on a single core is a challenging task.

So the answer from a hacker's mind to why they let it run on a single core might simply be: Because they can.

On the other hand, Y Combinator is a successful company, so buying a larger server would certainly be within their latitude. But that would be less intellectually appealing, and part of their success comes from the fact that they decide as hackers and don't always take the easiest path.


Cool as the intellectual challenge of tight coding may be, running on a bare minimum of resources also makes things more vulnerable to DDoS, the Slashdot effect, and less-than-ethical people running abusive scraping tools that don't respect robots.txt. As a person who is on the receiving end of the very rare 3am phone call for networking-related emergencies, I try to provision a sufficient amount of resources above "the bare minimum" to ensure that I'm not woken up by some asshat with a misconfigured mass HTTP mirroring tool.

RAM is so cheap now for small sized things that you can afford to trivially have an entire db cached at all times, with only very rare disk I/O.

As an example, we have a Request Tracker ticket database for a fairly large ISP that is a grand total of under 40GB and lives in RAM. It's tens of thousands of tickets with attachments and full-body text search enabled. For those not familiar with RT4, it's a convoluted mess of Perl scripts.

I could probably run my primary authoritative master DNS on bind9 on Debian stable on a 15-year-old Pentium 4 with 256MB of RAM, but I don't...


I don't know how much of this is still true, but HN was originally implemented in Arc.

The language homepage[1] says "Arc is unfinished. It's missing things you'd need to solve some types of problems. [...] The first priority right now is the core language."

Perhaps parallelism is still pending. A Ctrl-F on the tutorial doesn't turn up any hits for "process", "thread", "parallel", or "concurrency".

[1]: http://www.arclanguage.org


Arc has threads and HN's code relies heavily on them. Arc currently runs on Racket, though, which uses green threads, so the threadiness doesn't make it down to the lower levels. Racket has other ways of doing parallelism, but as far as I know they don't map as well to Arc's abstractions.


It's a single-threaded process running on a single core.


Worth noting that it deliberately doesn't automate paging through results; they go out of their way to keep the behavior organic, and explicitly say they won't change that approach.

"Automating things in this way could put load on servers in a way that a manual user couldn’t, and we don’t want to enable that behavior."


As an aside, it looks like this was created by one person, which shows an amazing level of talent in design, UX, programming, marketing, and presumably devops. Kind of scary.


is that a bad or a good thing?


"scary" in this context means "makes me feel inadequate". So, good for them to be that way.


Well it makes them a damn unicorn so... good thing?


Depends on the unicorn. Some can spin up the full stack, put it under source control, wrap a CI/CD pipeline around it, and have it deployed to a TLS-encrypted public website in under an hour.

I think you can actually find the above level of developer - what many would classify a 'unicorn' - in approximately 1-2% of the workforce.

That doesn't mean they understand your business or products, know how to work well with other individuals, etc. There are a lot of other factors beyond just the productivity angle.

Now, take the above developer who also has the ability to fundamentally understand the business, as well as interact with all of the key people, while performing said productivity stunts on a consistent basis... I think you are getting into the 0.1-0.01% range.


Wut. If you think someone can build the OP product in less than an hour, this is a new level of HN delusion ...


I imagine that "full stack" in this context refers to the equivalent of a Hello World project wrapped around a CI/CD pipeline.


Correct. Not an actual complete project, but the boilerplate required to begin implementing & delivering features.


Ah, yes, good and experienced engineers can do that.


If it's just "hello world" in this day and age of good frameworks, great open source tooling like Terraform and SaaS offerings that's not too tall of an order.


It looks like they are leveraging Netlify for their website, Gumroad for payments, and Notion for the wiki. This keeps everything that is not actually the product (marketing, documentation, financials) as easy and fast as possible. Most of that hour would be spent actually making the product (the web extension).

Given a talent for design, some marketing, and some experience making extensions, this should be possible within a reasonable time frame.


Through practice and experience you can reduce the amount of time it takes to make routine actions and decisions. But the bottleneck is making good decisions about matters that are not routine, and every new project has plenty of those.


I suspect the "scary" comment may not have been understood, perhaps because of ESL. So I think the person who asked if it was a good thing wasn't being sarcastic.


Yeah but when was the last time you actually caught a unicorn stabbing the fuck out of someone with their horn... so I vote scary


Ah it reminds me of Kimono Labs. I miss that product, it was fantastic.


I was trying to remember the name! I set up an incredible amount of automation with Kimono Labs.. that was one of the best products I've ever used. I remember when they shut down, but only realized now they were acquired by... Palantir?


Me too. Absolutely one of the best products I’ve ever used. It’s also the product that taught me that I can’t trust a SaaS business with important work. Wish they still existed.


Seems like a sure-fire way to get bought out is to build a web-scraper service...


Also seems similar to Dashblock (YC S19):

https://news.ycombinator.com/item?id=21006475


https://www.diffbot.com/ is still well regarded, no?


It's like http://80legs.com which, by the way, was acquired.


Also similar to parts of Yahoo! Pipes and some of Dapper's features (Dapper was acquired by Yahoo!).

I wonder why it was shut down.


Any comments on using Gumroad as your payment processor? I went with Paddle for a Chrome extension - it seems more flexible (e.g. you could do team subscription plans with custom pricing) but I think Gumroad is a lot easier to integrate (e.g. they have a simple license check API).
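
If it helps, the Gumroad license check is roughly a single HTTPS POST. The endpoint and parameter names below are from memory and may have changed, so check Gumroad's current docs before relying on this; the product permalink and key are placeholders:

    import requests

    resp = requests.post(
        "https://api.gumroad.com/v2/licenses/verify",
        data={
            "product_permalink": "your-product",  # placeholder
            "license_key": "XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX",
        },
        timeout=10,
    )
    print(resp.json().get("success"))  # True if the key is valid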


webscraper.io is free and has more functionality. I've used it for a quick and easy way to scrape data off multiple pages.

Probably has a slightly higher learning curve but once you get past that it's easy.


Unfortunate that the contact us page doesn't work. I'm interested, but I wanted to ask if a FF extension is planned, since that's my primary browser.


Yup, Firefox is supported! Though not distributed through the web store yet - I'm planning to move it to the store sometime soon.


Awesome, thanks.


This is scary good and easy to use. Clever idea with the Chrome extension too.


Clever, but how long until sites learn to detect the extension and block the user?


Detection is always a cat-and-mouse game. Using an extension in a real browser (instead of Electron/headless Chrome) is probably one of the hardest setups to detect, precisely because everything runs in a "real" browser.

Of course somebody will find a way to detect it, then the extension maker will patch it, and the cycle will continue.


Correct me if I’m wrong, but is it not still as simple as knowing the “chrome-extension://“ unique id of the extension? I’m aware of the cat and mouse aspect of scraping and that was one of the pitfalls I’ve been wary of as a fingerprinting vector.


I'd be surprised if sites had permission to read a chrome-extension:// URL. That'd be a sizable privacy leak.


I'm not sure about the chrome-extension protocol, but this API still seems to be present: https://developer.chrome.com/extensions/runtime#method-sendM...


I just tested in chrome 77, and I could only do `chrome.runtime.sendMessage("clngdbkpkpeebahjckkjfobafhncgmne", {},{},console.log)` from within the Stylus extension page, not from an external page like hacker news.


So is a Safari extension in development?


Seems to work well, but it appears that you can't re-scrape a site without starting from scratch, it's a one time deal.


Yeah, it needs a lot of additional features.

I hope this isn't one of those "abandonware" kind of deals.


The DRM scheme could use some work. Here's a simple crack (run from the extension license page):

    chrome.storage.sync.set({
        spider_valid_license: {
            key: 'No license (00000000-00000000-00000000-00000000)',
            lastChecked: new Date('Jan 1 3000')
        }
    })


I'm not familiar with web scraping, but I have been looking for feeds from shopping sites (e.g. deals.ebay.com, deals.amazon.com, etc.). In the good ol' days they used to publish RSS, but not any more. Can I use this for what I'm looking for? Will eBay and Amazon end up banning me? Alternately, does anyone know of a good service that aggregates shopping feeds?


This has a nice UI. It reminds me of Kantu (now https://ui.vision/) which I've used with varying degrees of success. That works by recording Selenium scripts; is Spider Pro entirely custom?


The page is lacking clarity on the payment schedule... Is it one-time? Monthly? Yearly?


EDIT: here's a snippet from the website to help answer this question without requiring you to click. I didn't intend this to be a rude answer; I don't think it deserves downvotes.

" + No server involved (zero subscription fee for you!)

...

Unlike other web scraping softwares, it requires only one time payment to scrape for unlimited time and data. No more subscriptions or huge fee for your small data analysis projects! "


How is this at all sustainable as a business model? My company is in charge of scraping a few TB/week. Would we be allowed to use your service?


This is a browser extension that takes the place of copy/pasting from web sites (ie manual scraping), what you’re looking for is probably more automated.


I think this is all done client side, manually, through a browser extension. They aren't providing you with an API to retrieve this data live afterwards.

Still a cool project but I think this limits a lot of its utility. Providing an API to access the data as a service would be a lot more profitable.


Go watch the demo video. From what I can see it's not automated enough to do TB/week. That said, I bet you could hook it into Scrapy and automate it. It would probably be slow, though.
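
If you really need TB/week, you'd more likely reach for a purpose-built framework. A minimal Scrapy spider, with placeholder URL and selectors, looks something like this:

    import scrapy

    class ListingSpider(scrapy.Spider):
        name = "listings"
        start_urls = ["https://example.com/listings?page=1"]  # placeholder

        def parse(self, response):
            # placeholder selectors; adjust to the target site's markup
            for row in response.css("div.listing"):
                yield {
                    "title": row.css("a::text").get(),
                    "url": row.css("a::attr(href)").get(),
                }
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)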


It's all client-side, it's not what you think, it's more a light scraping-on-the-go type product than a permanent enterprise solution. Good for what it does tho.


From a cursory look, it seems that this software runs on your client, i.e. it's not a service.

Scrolling down:

" Can I scrape password protected stuff with Spider?

Yes! It’s a browser extension, so as long as you log in first, you can scrape whatever you like. "


It's just a browser extension


On the homepage, right above the Purchase and Trial buttons is "Pay Once and Use It Forever"


Error info: the "contact us" link at the bottom shows "page not found".


This is quite ironic, since crawling its own website is surely one of the best ways to detect broken links.


It's not a crawler... it's a browser extension to extract data from an optionally paginated list.


Good time to showcase this, given that scraping was recently in the news. Looks awesome, thanks for sharing. https://news.bloomberglaw.com/privacy-and-data-security/insi...


Since the extensions are being distributed via Gumroad, how would updates work?


Have a look at dashblock.com; they went through YC recently.


Nice example of Tailwind CSS website as well.


How is it the cheapest if it is not free? There are plenty of free scraping tools out there.


^ Came to say the same thing.


We've unsuperlatived the title above.


How many are offline-only?

Almost every one on the Chrome web store is online-based and "freemium" only.


Sorry, maybe this is obvious, but what is an offline web scraper?


Probably meant installed and running locally


Yes... it felt silly writing "serverless".


Nonsense - beautiful soup is free.

Don't pay for something you can program yourself.


Disclaimer: I have nothing to do with this project, and I'm also a programmer, so I'm not part of its target demographic. But:

Are you suggesting that if somebody wanted to set up a basic scraper, they should learn how to install Python, use packages, figure out how the webpage is structured, and dig through the page source to figure out how to extract their data?

This project has a decent visual data picker, interface, etc. You could literally hand it to a non-programmer, and have them scraping comments off hacker news without any programming knowledge at all.

I get paid over $25/hr professionally as a programmer. This project costs $40 one time. I'm not saying it's worth the money, but all it has to do is save somebody 2 hours of time.

If you are a programmer then this project likely isn't for you.


Sure, everyone's time is free too, after all.

(Alternate smart-ass reply: Surely you coded the browser, operating system, and network stack involved in posting this response by yourself, right?)


I'm not the OP, but it looks like it was in reply to the "cheapest way" part of the title. Incidentally, that line doesn't appear on the web site.


> Surely you coded the browser, operating system, and network stack involved in posting this response by yourself, right?)

No because they were all free.

And I suck too.

But it was free.


https://www.crummy.com/software/BeautifulSoup/bs4/doc/

I mean...

soup.find(id="link3") # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Is this hard?
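
Even a more complete version stays short - fetch the page, then pull out links with requests + BeautifulSoup. The HN class names below are assumptions from memory and may need checking against the current markup:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://news.ycombinator.com/", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # "titleline"/"storylink" are assumed selectors; inspect the page
    # source to confirm the right one before using this.
    for link in soup.select(".titleline > a, a.storylink"):
        print(link.get_text(), link.get("href"))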


For HN's audience, no, that's not hard. But this product is clearly not only for us, and you've also cherry-picked a very simple example.

There's also the aspect of downloading the page in the first place and dealing with things like authentication and bot detection which a product like this helps solve.

I personally don't have a use for this product right now, but I won't be so bold as to say I'll never find a case where using it wouldn't be easier or more cost-effective than hacking up my own solution.


Good, now do that on a page that uses JS and authentication. Trivial on this extension, not so much using BS.


If you show that to non-programmers, they'll take one look at it and understand it about as well as Egyptian hieroglyphics.

"Soup? Python? What's an 'id'? class? href? I just want to select the title!"



