And it's likely that running these as a server-side application is ultimately more appealing than this simpler (to implement) automation inside the user's browser, right? Seems that many companies providing similar (server-side) scraping tools have been successfully sold off... Is that an indication of (still existing) high demand for these?
EDIT to add: The tool and the accompanying website look fantastic, by the way. Congratulations on the 1.0 launch, Amie!
The extensions market isn't particularly lucrative; they're like mobile apps, but with only 33% of users even knowing they exist. The SaaS companies don't have that growth limitation, hence their success. But if you want an extension as a resume item or something, getting a few thousand AMO users or Chrome reviews shouldn't be hard.
If you want to go down a rabbit hole of shady proxies run on compromised/trojaned end user SOHO routers or PCs, google "residential proxies for sale"
It's amazing how easy and comparatively cheap it is to get access to thousands of residential IPs. Is it via spyware running on people's machines? Shady people working at ISPs doing nefarious things for cash? We never knew....
The key thing to know is that if you want your traffic to come from an IP "in" some other country (according to geolocation databases anyway) it's really only a few bucks a month to get a proxy. Most of them have poor IP reputation so they suck to use on Google, but work very well for everything else out there...
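In practice, pointing your traffic at one of these is just an endpoint and credentials in your HTTP client. A minimal sketch in Python (the provider host, port, and credentials below are placeholders, not a real service):

import requests

# placeholder endpoint/credentials, the kind a proxy provider's dashboard hands you
proxy = "http://USER:PASS@gw.example-proxy-provider.com:8080"

resp = requests.get(
    "https://ifconfig.me/ip",  # echoes back the IP address the server saw
    proxies={"http": proxy, "https": proxy},
    timeout=30,
)
print(resp.text)  # the proxy's (residential) IP, not yours

Geolocation databases then place your traffic wherever that exit node happens to live, which is the whole trick.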
Might be as simple as https://hola.org/ & https://luminati.io/ - "unblock a website, download our VPN client", meaning you "unblock" by using somebody else's line. And they also sell access at Luminati. Most users aren't aware of the implications.
a) The type of "services" luckylion mentions where people have opted in to a shady gray market thing reselling proxies through their connection.
b) compromised home routers/gateway devices/internet of shit devices
c) compromised home PCs (mostly windows 7/10 trojans/botnets)
Hello to ALL the social network folks who don't know that spam was the origin of social networks (Fb, Friendster, hi5, blah blah blah).
Who the hell is documenting the history of the internet?
Step 2) Hammer Zillow with all known ip addresses
Step 3) Profit
If all of HN runs on only a single core, then you're running it on fewer server resources than I could buy on eBay with $180 and a Visa card?
For another data point, I've got a single box processing every trade and order coming off the major cryptocurrency exchanges, roughly 3000 messages/second. The webserver, DB persistence, and a bunch of price analytics also run on it, and it only hits about 30% CPU usage. (Ironically, the browser's CPU usage is often worse, because I'm using a third-party charting library that's graphing about 7000 points every second and isn't terribly well optimized for that.)
Software gets slow because it has a lot of wasteful layers in between. Cut the layers out and you can do pretty incredible things on small amounts of hardware.
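To put rough numbers on the 3000 messages/second figure: that's tiny for a modern CPU. A toy sketch (the message shape and size are made up, and this is obviously not the commenter's actual stack):

import json, time

# a fabricated trade tick, roughly the size of a real exchange message
msg = json.dumps({"symbol": "BTC-USD", "price": 43210.5, "size": 0.042,
                  "side": "buy", "ts": 1700000000.123})

n = 3000  # one second's worth of traffic at the quoted rate
start = time.perf_counter()
for _ in range(n):
    tick = json.loads(msg)
    notional = tick["price"] * tick["size"]  # stand-in for "analytics"
elapsed = time.perf_counter() - start
print(f"parsed {n} messages in {elapsed * 1000:.1f} ms")

Even in plain Python that loop finishes in a few milliseconds, a tiny fraction of one core; presumably most of the quoted 30% goes to persistence, analytics, and the webserver rather than raw message handling.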
You see, it is possible. Take a step back and look at the evidence: the site is called Hacker News, its original goal was to prove that something useful could be built in an entirely custom programming language, and it handles a huge amount of traffic, so running it on a single core is a challenging task.
So the answer from a hacker's mind to why they let it run on a single core might simply be: Because they can.
On the other hand, YCombinator is a successful company, so buying a larger server would certainly be within their means. But that would be less intellectually appealing, and part of their success comes from the fact that they decide as hackers and don't always take the easiest path.
RAM is so cheap now for small sized things that you can afford to trivially have an entire db cached at all times, with only very rare disk I/O.
As an example, we have a Request Tracker ticket database for a fairly large ISP that totals under 40GB and lives in RAM. That's tens of thousands of tickets, with attachments and full-text search on ticket bodies enabled. For those not familiar with RT4, it's a convoluted mess of Perl scripts.
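For anyone wondering what "lives in RAM" means in practice: it's mostly a matter of sizing the database's buffer cache above the dataset. A hypothetical example, assuming the RT instance is backed by MySQL/InnoDB (RT also supports Postgres), with illustrative numbers only:

# /etc/mysql/my.cnf -- illustrative sizing for a ~40GB dataset
[mysqld]
innodb_buffer_pool_size = 48G   # the whole dataset stays resident once warmed up

After a warm-up period, disk I/O drops to little more than writes, which matches the "very rare disk I/O" point above.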
I could probably run my primary authoritative master DNS on bind9 on Debian-stable on a 15-year-old Pentium 4 with 256MB of RAM, but I don't...
The language homepage says "Arc is unfinished. It's missing things you'd need to solve some types of problems. [...] The first priority right now is the core language."
Perhaps parallelism is still pending. A Ctrl-F on the tutorial doesn't turn up any hits for "process", "thread", "parallel", or "concurrency".
"Automating things in this way could put load on servers in a way that a manual user couldn’t, and we don’t want to enable that behavior."
I think you can actually find the above level of developer - what many would classify as a 'unicorn' - in approximately 1-2% of the workforce.
That doesn't mean they understand your business or products, know how to work well with other individuals, etc. There are a lot of other factors beyond just the productivity angle.
Now, take the above developer who also fundamentally understands the business and can interact with all of the key people while performing said productivity stunts on a consistent basis... I think you are getting into the .1-.01% range.
Given a talent for design, some marketing, and some experience making extensions, this should be possible within a reasonable time frame.
I wonder why it was shut down.
Probably has a slightly higher learning curve but once you get past that it's easy.
Of course somebody will find a way to detect it, then the extension maker will patch it, and the cycle will continue.
I hope this isn't one of these "abandon-ware" kind of deals.
// i.e. a placeholder license key, with the "last checked" date pushed out to the year 3000
key: 'No license (00000000-00000000-00000000-00000000)',
lastChecked: new Date('Jan 1 3000')
+ No server involved (zero subscription fee for you!)
Unlike other web scraping software, it requires only a one-time payment to scrape unlimited data for unlimited time. No more subscriptions or huge fees for your small data analysis projects!
Still a cool project but I think this limits a lot of its utility. Providing an API to access the data as a service would be a lot more profitable.
Can I scrape password protected stuff with Spider?
Yes! It’s a browser extension, so as long as you log in first, you can scrape whatever you like.
Almost everything on the Chrome Web Store is online-based and "freemium" only.
Don't pay for something you can program yourself.
Are you suggesting that if somebody wanted to set up a basic scraper, they should learn how to install Python, use packages, figure out how the webpage is structured, and dig through the page source to figure out how to extract their data?
This project has a decent visual data picker, interface, etc. You could literally hand it to a non-programmer, and have them scraping comments off hacker news without any programming knowledge at all.
I get paid over $25/hr professionally as a programmer. This project costs $40 one time. I'm not saying it's worth the money, but all it has to do is save somebody 2 hours of time.
If you are a programmer then this project likely isn't for you.
(Alternate smart-ass reply: Surely you coded the browser, operating system, and network stack involved in posting this response by yourself, right?)
No because they were all free.
And I suck too.
But it was free.
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Is this hard?
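For anyone who hasn't seen BeautifulSoup, that output line is what comes back from a lookup that takes only a couple of lines. A self-contained sketch (the html_doc string stands in for whatever page source you've already fetched):

from bs4 import BeautifulSoup

# a scrap of page source; in real use this is the HTML you fetched
html_doc = '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'

soup = BeautifulSoup(html_doc, "html.parser")
link = soup.find("a", id="link3")
print(link)              # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
print(link.get("href"))  # http://example.com/tillie
print(link.get_text())   # Tillie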
There's also the aspect of downloading the page in the first place and dealing with things like authentication and bot detection which a product like this helps solve.
I personally don't have a use for this product right now, but I won't be so bold as to say I'll never find a case where using it would be easier or more cost-effective than hacking up my own solution.
"Soup? Python? What's an 'id'? class? href? I just want to select the title!"