
“Google just started mass banning/limiting Archive Team downloads” - dredmorbius
https://twitter.com/textfiles/status/1112494767601053696
======
jpatokal
According to ArchiveTeam's own tracker, there was a temporary dip around 1 AM
PT (when this tweet was posted), but the speed has recovered and they are
again crunching through 100k+ items/hour.

[http://tracker.archiveteam.org/googleplus/](http://tracker.archiveteam.org/googleplus/)

(the graphs show up in the pink block at the bottom, which can take a while to
render)

Where is the graph in the tweet from? If it's just measuring successful
downloads, how did it conclude that it's Google at fault? Is there another
tracker of download failures that shows quota errors/DOS blocking etc?

~~~
dredmorbius
I've been in the ArchiveTeam's IRC channel for ... the past three months or
so. This wasn't just a temporary dip.

Workarounds are applied, but Google not fighting us would be a Really Nice
Thing presently.

(Google's performance throughout this episode has been ... poor.)

The tracker is also only _very_ partial, showing about 1/50th of all activity
at any one time.

A better picture is at the Grafana tracker:

[https://atdash.meo.ws/d/BQbN9QEiz/archive-team-tracker-
chart...](https://atdash.meo.ws/d/BQbN9QEiz/archive-team-tracker-
charts?orgId=1&from=now-6h&to=now)

~~~
Dylan16807
Is there anything that says how many items haven't been added to the download
queue yet? Or a rough sense of the overall percentage somewhere that I'm
missing?

Edit: I see the post from a few days ago said about 80%; that's at least something.

~~~
dredmorbius
88% complete, on track to hit 92% of targets.

~~~
dredmorbius
And we actually hit 98.5%.

------
andrewstuart
The front page of Hacker News - the only way to get your support issue with
(insert big tech company name here) fixed.

As I keep suggesting, these big companies each need their own internal
ombudsman, because at the moment they provide no way to fix things when their
systems have gone crazy.

The other reason (big companies X/Y/Z) need to each have an internal ombudsman
is that it's getting tedious reading on Hacker News about how (big company
X/Y/Z) has done something crazy and the user/developer/customer cannot get it
fixed.

Maybe it's an entrepreneurial opportunity - someone could make a website that
pays a fee to someone at Google/Microsoft/Amazon who is willing to use their
influence to solve a crazy problem. "Rent a manager friend at Google".

~~~
burtonator
You just gave me an idea actually...

I don't have time to implement it but if you like it go ahead and steal it.

Basically a Hacker News-style site for complaining about companies that don't
fix their problems or own their support issues; the user base can upvote
complaints to embarrass the companies into actually fixing the problem.

~~~
dcow
Doesn’t this exist?

~~~
theoh
"Get Satisfaction" partly fits this description.
[https://en.wikipedia.org/wiki/Get_Satisfaction](https://en.wikipedia.org/wiki/Get_Satisfaction)

------
duxup
>talk to Google

Is that even a thing?

Google doesn't seem to want to be talking to anyone, outside of wanting my
information and, in some cases, my credit card.

~~~
dmitrygr
Historically if Google wrongs you, the only ways to get it fixed are:

1. Get lucky and some googler sees your official bug report

2. Know a googler who can file a bug internally

3. Make noise on social media till someone at Google notices

~~~
rohan1024
I've uploaded a lot of data to Google (mostly photos), and someday, if Google
decides that I'm in violation of their policy and chooses to lock me out of
the system, I'm screwed.

I need to start thinking about a proper storage system for my pictures and
other data.

~~~
torgian
I use hard drives.

------
KirinDave
It is 4pm on a Sunday at Google HQ. It seems a bit premature to start reading
deep policy decisions into this.

------
dredmorbius
Context:

Saving of public Google+ content at the Internet Archive's Wayback Machine by
the Archive Team has begun

[https://old.reddit.com/r/plexodus/comments/az285j/saving_of_...](https://old.reddit.com/r/plexodus/comments/az285j/saving_of_public_google_content_at_the_internet/)

Previously on HN:

[https://news.ycombinator.com/item?id=19407865](https://news.ycombinator.com/item?id=19407865)

~~~
itchyjunk
Ah thanks, context helps. Maybe the G+ servers just have security measures to
limit bandwidth usage? Do you think it's malice?

~~~
chx
No, it's just stupidity. If they weren't stupid, this entire effort would be
unnecessary, as they would be sending the content over to the Internet Archive
themselves, probably along with a cheque that looks big to the IA (10 mil/yr
budget) and is not even a rounding error to G (109B cash on hand).

~~~
KirinDave
Why would that be good?

~~~
Dylan16807
Because there is a huge amount of important public posting on there. It didn't
get billions of users but it had enough. Keeping up a read-only blob, or even
better letting someone else do it, should be in the shutdown considerations of
any major product.

~~~
xorand
G+ had a nerdier audience and many interesting communities and posts. I don't
get this ghost town line. For example HN is a niche site, with almost no
audience, with almost all accounts dead. Right?

~~~
KirinDave
I ran several G+ groups and they were destroyed over time by unfettered
harassment, unstoppable spam, and just a lack of interest.

I've yet to see these things happen to HN.

~~~
dredmorbius
HN has niche appeal, a strongly focused discussion (there's one story feed,
and effectively about 30-40 stories that really register on the front page,
though many more are submitted), and pretty dedicated professional moderators.
As of 2015:

 _Roughly 2.6M views a day, 300K daily uniques, 3 to 3.5M monthly uniques. It
depends on how you count, of course._

[https://news.ycombinator.com/item?id=9220098](https://news.ycombinator.com/item?id=9220098)

Which ... actually probably compares favourably with Google+, which had a core
of about 50-100k highly active users (posting 50-100x monthly), and maybe an
extended set of as many as 100 million who'd interacted with the site
significantly at one time or another.

I've done a fair bit of measurement (limited by available resources and
indicators), and one conclusion I'm coming to is that raw numbers do a
_pathetic_ job of indicating media or forum vitality. Most especially raw
census numbers.

Looking at G+ communities, and running grid plots of _engagements_ , it turned
out that _posts_ drove other engagement, not _members_ , and in fact it seems
as if there's some kind of fall-off (at least on a per-member basis) when a
given forum gets above about 5,000 members (though I need to check this).

Google had more users. But they were spread out over a vastly larger set of
forums and discussion, there was no central "square" (as with HN's "new" or
"news" pages), moderation was exceedingly uneven, and often entirely absent,
and there were (and remain) huge barriers for like minds to come together.

HN's overall focus is fairly (but not excessively) narrow, and much of the
conversation takes itself too seriously (myself certainly included), but
_relative to the rest of the Net_ it's an exemplar. Good conversation remains
exceedingly hard to find.

~~~
xorand
> it turned out that posts drove other engagement, not members

Yes, this was something particular to G+. Incidentally, see these quotes from
the Linus interview [0] (HN post [1]):

" The whole "liking" and "sharing" model is just garbage. There is no effort
and no quality control. In fact, it's all geared to the reverse of quality
control"

"I'm not on any social media (I tried G+ for a while, because the people on it
weren't the mindless usual stuff, but it obviously never went anywhere)"

[0] [https://www.linuxjournal.com/content/25-years-later-
intervie...](https://www.linuxjournal.com/content/25-years-later-interview-
linus-torvalds)

[1]
[https://news.ycombinator.com/item?id=19559970](https://news.ycombinator.com/item?id=19559970)

------
jonas21
Did they bother talking to anyone at Google before they started doing this? It
sounds like they’re making a massive number of requests in a short period of
time which is probably indistinguishable from abuse to Google’s automated
systems.

~~~
ClassyJacket
"Did they bother talking to anyone at Google before they started doing this?"

How would you go about contacting Google? As far as I've ever heard, they're
notoriously impossible to contact by anyone who isn't paying large amounts to
run ads.

~~~
baroffoos
You don't even have to pay much for ads to get help. They will do a call and
walk you through the steps for setting everything up to start spending. It's
just that advertisers are the only customers Google has.

~~~
gscott
Only in the past year or two. Before that, you could only contact them via a
support form, and someone in India would answer it.

------
notacoward
What's really stupid about this is that it would have been less effort and
expense for everyone involved if Google had arranged to work with Archive Team
directly instead of forcing them to scrape content via HTTP over the public
internet. That includes Google themselves. What's happening now is probably
that they're ramping down the number of servers handling Google+ content,
which they could have done even sooner if they'd cooperated on a proper
archiving strategy.

I guess sometimes Google does this kind of crap just to prove (mostly to
themselves) that they can. Truly, the bully of the internet.

~~~
identity-haver
There was a claim [1] that the G+ terms of service might legally prohibit them
from doing this after the service is shut down. I haven't verified it.

However, it's clear that for an archiving effort this big, people at Google
are explicitly allowing it. The user agents and fetch patterns of the Archive
Team crawler were clearly distinct enough to get caught by an automated tool,
and someone knew someone at Google in order to get it unblocked.

Unfortunately, any archival effort that requires the "Warrior" crawler (and
not just a guy with a 4TB disk) is at the mercy of the website's remaining
staff and management. Just ask Soundcloud. Archive Team started to archive
their stuff when it looked like they were going to go under, but Soundcloud
shut them down.

[1]
[https://news.ycombinator.com/item?id=19410050](https://news.ycombinator.com/item?id=19410050)

~~~
notacoward
That's a really good point. OTOH, I think it would be nearly impossible for
anyone to make a claim that their privacy has been violated by archiving
_public_ posts. In that case, rights have been granted to everyone (i.e. the
rights Archive Team is currently exercising without issue) so limitations on
rights granted to Google itself are irrelevant. OTOOH, IANAL. ;)

------
mirimir
In my experience, scraping Google data is _hard_. I did it years ago for a
project. And I had to lease a huge block of private proxies. Each one only
lasted a few minutes. But with a large enough block, they'd come back.
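
Roughly, the pattern was rotate and retry. Here's a minimal Python sketch of
the idea; the proxy hostnames, pool size, and target URL are placeholders,
not what I actually used:

    import itertools
    import time
    import requests

    # Placeholder proxies -- in practice this was a large leased block of
    # private proxies, each only usable for a few minutes at a time.
    PROXIES = [
        "http://proxy1.example.net:8080",
        "http://proxy2.example.net:8080",
        "http://proxy3.example.net:8080",
    ]

    def fetch(url, proxy_pool, max_attempts=8):
        """Try the URL through successive proxies, backing off when one is blocked."""
        for attempt in range(max_attempts):
            proxy = next(proxy_pool)
            try:
                resp = requests.get(
                    url, proxies={"http": proxy, "https": proxy}, timeout=30
                )
                if resp.status_code == 200:
                    return resp.text
                # 403/429/503 usually means this proxy is burned for a while;
                # with a big enough pool it has cooled down by the next cycle.
            except requests.RequestException:
                pass
            time.sleep(min(60, 2 ** attempt))  # simple capped backoff
        return None

    proxy_pool = itertools.cycle(PROXIES)
    page = fetch("https://example.com/some/page", proxy_pool)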

~~~
dredmorbius
Google Web Search especially -- I've found that more than a query every 5-10
minutes will start throwing CAPTCHAs. For what I was doing, there was no other
way to get the information I was looking for, so I just resigned myself to
very slow crawls.
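
("Very slow" meaning something like the following sketch: one fetch, then a
long fixed pause. The interval and URLs here are illustrative, not exactly
what I ran.)

    import time
    import requests

    QUERY_INTERVAL = 10 * 60  # roughly one query every 10 minutes, to stay under the CAPTCHA threshold

    def slow_crawl(urls):
        """Fetch each URL in turn, pausing a long time between requests."""
        for url in urls:
            resp = requests.get(url, timeout=30)
            if resp.ok:
                yield url, resp.text
            time.sleep(QUERY_INTERVAL)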

For Google+ itself, over a period of _years_, I'd hammered it with 100s to
~100k requests from residential IP space without ever triggering
rate-limiting, at roughly 2-20 queries/second.

We've started getting news of rate limiting over the past few months as
archival activity has proceeded.

------
tanilama
I mean, enforcing traffic control restrictions is an everyday operation for
big companies; why would they assume that Google would treat them differently?

------
29ssyg
I'm getting lots of 500 errors on Google Plus over the past few hours, so
maybe they're not getting throttled; it's just Google servers crapping out.

~~~
toomuchtodo
> Google servers crapping out.

This seems....highly unlikely for Google.

~~~
ummonk
You've never opened Youtube, have you?

------
soup10
You don't get it, guys: when Google scrapes the web, downloads everyone's
data, and then serves up parts of it with sponsored ads next to it in
searches, it's OK because they are Google. But if you scrape their data, it's
not OK because you're not Google. Once you understand this it makes perfect
sense.

~~~
SpicyLemonZest
In case this is meant to be a serious comment, there's a standard mechanism
called the robots.txt file to tell crawlers you don't want them to scrape your
website. You don't have to let them if you don't want to.
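
A well-behaved crawler checks that file before fetching anything. Here's a
rough sketch of the check using Python's standard library; the URL and
user-agent string are just examples:

    import urllib.robotparser

    # Example robots.txt a site might serve to opt out of crawling entirely:
    #   User-agent: *
    #   Disallow: /

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://plus.google.com/robots.txt")
    rp.read()

    # Only fetch the page if robots.txt allows it for our (example) user agent.
    if rp.can_fetch("example-crawler", "https://plus.google.com/+SomeUser/posts"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")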

~~~
wsh
Except archive.org _doesn’t_ obey robots.txt files any more [1], and they also
ignore requests to remove content.

[1] [https://blog.archive.org/2017/04/17/robots-txt-meant-for-
sea...](https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-
engines-dont-work-well-for-web-archives/)

~~~
0815test
They don't obey robots.txt files _posted after-the-fact by domain hoarders
that have zilch to do with the original content_. This is entirely proper on
their part.

------
xorand
How can I archive my photos from G+ posts? E.g. this one:
[https://get.google.com/albumarchive/110322266958783287132/al...](https://get.google.com/albumarchive/110322266958783287132/albums/photos-
from-posts)

~~~
rasz
In theory you can use Google Takeout. In reality, Takeout has NEVER been able
to back up my YT comments, and there is no way of reporting a
bug/appealing/speaking to anyone about it.

~~~
xorand
I made available all the sources for all those animations, but this will not
bring back 150K views/day. It's a shallow metric of interest, I know; however,
I am human too and I appreciated it, given that at present there is,
technically, no scientific publication avenue for this.
[https://doi.org/10.6084/m9.figshare.4747390.v1](https://doi.org/10.6084/m9.figshare.4747390.v1)

------
kazinator
Banning downloads of what, exactly?

~~~
kristofferR
Google+ before Google deletes it in a few days.

~~~
mcv
A few days? I think it's tomorrow.

------
mrhappyunhappy
What do they mean by this? I did notice that looking for archives of a few
domains, including my own, returned no results.

~~~
dredmorbius
Google+ is shutting down on April 2.

There's a bunch of folks who are interested in preserving content from the
site. Some personally/individually, a whole host of communities, and, well,
because their mission is to preserve the world's data, the Internet Archive
and an unaffiliated though closely-working group, the Archive Team:
[https://archiveteam.org](https://archiveteam.org)

That's a bunch of volunteers who basically suck the guts out of failing
websites, and they've got some big ambitions.

I'm from the Google+ user community side and have been helping organise
information and activity on behalf of others. There's a pretty comprehensive
(and messy) wiki at
[https://social.antefriguserat.de](https://social.antefriguserat.de) with FAQs
and directories and guidance and a whole bunch of other stuff, plus other
resources -- G+ communities, subreddits, and more.

We (the G+ community side) stumbled across Archive Team this past January and
were delighted to discover they existed. (They've told us they're also glad
they exist.) And we've been working together to coordinate this pull, with AT
providing the technical capabilities -- code, servers, bandwidth, and storage
-- and us mostly pestering them with questions but also knowing a good bit
about the characteristics of Google+ data and some of its organisation. (I've
made a hobby of ... measuring much of that.) I was finally able to answer my
mother's persistent question "but who _cares_ about any of this" with
"arkiver, of the Archive Team".

Anyway: the Archive Team grab is in full swing, we've got about 24-48 hours
left, depending on when Google shuts shit down, and all of a sudden all the
archiving agents ("Warriors") stopped collecting data. We're 86% through the
working set, and are on track to complete 92% of all scheduled data requests
by the earliest anticipated shutdown window. (Earlier estimates, before the
throttling, were over 100% completion, meaning we had some slack space.)

Answer your question?

~~~
mrhappyunhappy
Yes, thank you

