
Interesting that their demo screencast shows Zillow. Zillow is fairly aggressive in applying anti-scraping defensive measures.



... which may be the reason why their demo screencast shows Zillow.


Yeah, the problem is that Zillow imposes IP bans on you when you've been found to be scraping their site.


Which is why any serious effort involves rotating pools of proxies.
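
For what it's worth, the rotation itself is trivial. Here's a minimal sketch in Python with the requests library, with placeholder proxy URLs standing in for whatever pool you actually buy:

    # Rotate outgoing requests through a pool of proxies so no single IP
    # hammers the target. The proxy URLs below are placeholders, not real services.
    import itertools
    import requests

    PROXIES = [
        "http://user:pass@proxy1.example.net:8080",
        "http://user:pass@proxy2.example.net:8080",
        "http://user:pass@proxy3.example.net:8080",
    ]
    proxy_cycle = itertools.cycle(PROXIES)

    def fetch(url):
        proxy = next(proxy_cycle)  # next proxy in round-robin order
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)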


Not just rotating pools of proxies, but sometimes shady gray-market residential proxies, so that you can appear to be coming from hundreds or thousands of unique, geographically distributed last-mile end-user customer netblocks (DOCSIS3/ADSL2+/VDSL2/GPON/whatever).

If you want to go down a rabbit hole of shady proxies run on compromised/trojaned end-user SOHO routers or PCs, google "residential proxies for sale":

https://www.google.com/search?client=ubuntu&channel=fs&q=res...


Once worked for a place using this to scrape search engines.

It's amazing how easy and comparatively cheap it is to get access to thousands of residential IPs. Is it via spyware running on people's machines? Shady people working at ISPs doing nefarious things for cash? We never knew....

The key thing to know is that if you want your traffic to come from an IP "in" some other country (according to geolocation databases anyway) it's really only a few bucks a month to get a proxy. Most of them have poor IP reputation so they suck to use on Google, but work very well for everything else out there...


> Is it via spyware running on people's machines? Shady people working at ISPs doing nefarious things for cash?

Might be as simple as https://hola.org/ & https://luminati.io/ - "unblock a website, download our VPN client", meaning you "unblock" by using somebody else's line. And they also sell access at Luminati. Most users aren't aware of the implications.


It's a combination of three general things:

a) The type of "services" luckylion mentions where people have opted in to a shady gray market thing reselling proxies through their connection.

b) compromised home routers/gateway devices/internet of shit devices

c) compromised home PCs (mostly windows 7/10 trojans/botnets)


Not that shady... luminati.io makes residential and mobile proxies a snap.


And IP tunneling...

Hello to ALL the social network folks who don't know that spam was the origin of social networks. (Fb, Friendster, hi5, blah blah blah)

Who the hell is documenting the history of the internet?


IP bans are simple to bypass.


Step 1) Invest money in non-Zillow real estate app

Step 2) Hammer Zillow with all known ip addresses

Step 3) Profit


Step 4) Friendly chats with FBI & SEC?


Most IP bans are only temporary.


I wonder how Spider Pro does with Facebook, LinkedIn, Whitepages, and others that try their best to block scraping but still have an introductory free-to-view webpage...


Since this is designed for non-technical users and only scrapes content that's already been displayed to the user, I can't imagine many folks would use it in such a way that the site could tell, unless they included a script to detect this scraper explicitly on their site.


And their documentation shows them scraping HN:

https://www.notion.so/Spider-Pro-Documentation-5d275abd49c64...


Please respect the robots.txt if you do. HN's application runs on a single core and we don't have much performance to spare.
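
If you're scraping in Python, a polite pre-flight check is only a few lines with the standard library's robotparser; the user-agent string below is just a made-up example:

    # Check robots.txt before fetching, and honor any Crawl-delay directive.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://news.ycombinator.com/robots.txt")
    rp.read()

    ua = "my-scraper/0.1"  # hypothetical user agent
    if rp.can_fetch(ua, "https://news.ycombinator.com/item?id=1"):
        delay = rp.crawl_delay(ua)  # seconds to wait between requests, or None
        # ... fetch the page, then sleep for `delay` seconds before the next request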


Serious question: how is that possible? Somebody recently gave me a Dell R720 2RU server with 16 cores and 128GB of RAM for free. There's literally that much slightly used server gear showing up on the used market from companies that have migrated everything to aws/gcp/azure/whatever.

If all of HN runs on a single core, then you're running it on less server hardware than I could buy on eBay with $180 and a Visa card?

https://www.ebay.com/itm/DELL-R610-64GB-12-CORE-2X-HEX-CORE-...


It's pretty easy if you're keeping everything in RAM and don't have layers upon layers of frameworks. You've got about 3B cycles per second to play with, with a register reference taking ~ 1 cycle, L1 cache (48K) = 4 cycles, L2 cache (512K) = 10 cycles, L3 cache (8M) = 40 cycles, and main memory = 100 cycles. This entire comment page is 129K, gzipping down to 19K; it fits entirely in L2 cache. It's likely that all content ever posted to HN would fit in 128G RAM (for reference, that's about 64 million typewriter pages). With about 30M random memory accesses being possible per second (or about 7 billion consecutive memory accesses - you gain a lot from caching), it's pretty reasonable to serve several thousand requests per second out of memory.
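
A rough back-of-envelope in Python using the numbers above (the per-request access count is an assumed figure, not anything measured from HN):

    # Order-of-magnitude estimate of requests/sec served entirely from main memory.
    CYCLES_PER_SEC = 3e9          # ~3 GHz core
    MAIN_MEMORY_CYCLES = 100      # cost of one random main-memory access
    random_accesses_per_sec = CYCLES_PER_SEC / MAIN_MEMORY_CYCLES  # ~30M/sec, as above

    accesses_per_request = 5_000  # assumed: a few thousand random touches per page
    print(int(random_accesses_per_sec / accesses_per_request))     # ~6,000 requests/sec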

For another data point, I've got a single box processing every trade and order coming off the major cryptocurrency exchanges, roughly 3000 messages/second. And the webserver, DB persistence, and a bunch of price analytics also run on it. And it only hits about 30% CPU usage. (Ironically, the browser's CPU often does worse, because I'm using a third-party charting library that's graphing about 7000 points every second and isn't terribly well optimized for that.)

https://www.cryptolazza.com

Software gets slow because it has a lot of wasteful layers in between. Cut the layers out and you can do pretty incredible things on small amounts of hardware.


Would love to see a blog post on building it some time; I think the last vaguely similar (but serious) rundown in this vein was “One process programming notes”.

https://crawshaw.io/blog/one-process-programming-notes


> Serious question, how is that possible?

You see, it is possible. Take a step back and look at the evidence: the site is called Hacker News, one of its original goals was to prove that something useful could be built in an entirely custom programming language, and it handles a huge amount of traffic, so keeping it running on a single core is a challenging task.

So the answer from a hacker's mind to why they run it on a single core might simply be: because they can.

On the other hand, Y Combinator is a successful company, so buying a larger server would certainly be within their means. But that would be less intellectually appealing, and part of their success comes from the fact that they decide as hackers and don't always take the easiest path.


Cool as the intellectual challenge of tight coding may be, running on a bare minimum of resources also makes things more vulnerable to DDoS, the Slashdot effect, and less-than-ethical people running abusive scraping tools that don't respect robots.txt. As a person who is on the receiving end of the very rare 3am phone call for networking-related emergencies, I try to provision a sufficient amount of resources above the bare minimum to ensure that I'm not woken up by some asshat with a misconfigured mass HTTP mirroring tool.

RAM is so cheap now that, for small-sized things, you can trivially afford to keep an entire DB cached at all times, with only very rare disk I/O.

As an example, we have a Request Tracker ticket database for a fairly large ISP which comes to a grand total of under 40GB and lives in RAM. It's tens of thousands of tickets with attachments and full-text search of ticket bodies enabled. For those not familiar with RT4, it's a convoluted mess of Perl scripts.

I could probably run my primary authoritative master DNS on bind9 on Debian stable on a 15-year-old Pentium 4 with 256MB of RAM, but I don't...


Don't know how much this is still true, but HN was originally implemented in Arc.

The language homepage[1] says "Arc is unfinished. It's missing things you'd need to solve some types of problems. [...] The first priority right now is the core language."

Perhaps parallelism is still pending. A Ctrl-F on the tutorial doesn't turn up any hits for "process", "thread", "parallel", or "concurrency".

[1]: http://www.arclanguage.org


Arc has threads and HN's code relies heavily on them. Arc currently runs on Racket, though, which uses green threads, so the threadiness doesn't make it to the lower levels. Racket has other ways of doing parallelism, but as far as I know they don't map as well to Arc's abstractions.


It's a single-threaded process running on a single core.


Worth noting that it really doesn't automate paging through results; they go out of their way to keep the behavior organic, and explicitly say they won't change that approach.

"Automating things in this way could put load on servers in a way that a manual user couldn’t, and we don’t want to enable that behavior."



