> No updates to Blogger have been documented since September 2020. Google has also moved away from Blogger for their own company blogs. For these reasons, Blogger may be at risk of shutting down.
> Archive Team did a discovery between February and May 2015, but did not begin downloading actual content until November 2023.
How exactly does this work? Is this farming out the work of wget-ing URLs, followed by making those contents available on archive.org Wayback Machine?
It seems trivial to poison the uploaded content. And there's no way to validate that the content has not been tampered with in flight, unless a redundant fetch were performed in a trusted environment, at which point farming out the work serves no purpose.
Secondly, this doesn't solve the discoverability problem. Sure, after Blogger shuts down, a particular webpage on a blog might be preserved. But you will only be able to view its content 10 years in the future if you happen to already know the exact URL where it used to exist.
ArchiveTeam has their own "Warrior" tool they run, but it does effectively farm out the equivalent of wget to a bunch of client machines. Their tool is very careful to preserve the content exactly as it was and in an archival format.
The problem I had with ArchiveTeam is that they are very insular, and don't accept help quickly or easily from someone who is not already inside their inner circle. I'm guessing they've been burned more than a few times.
There are publicly available versions of their tools, but those aren't the versions that actually get deployed. You have to know or be told the private locations for the tools that actually get deployed.
What's important to them is not how many resources you could bring to bear on the problem you care about right now, but whether you will still be here in a year to help them with the 99th problem that you don't care about.
I can't say that I blame them for being insular in the face of whatever their past experiences were. I just wish that they had let me join and help them, because I do think they have a good mission and they deserve to be helped. But if they won't let you help, then all you can really do is sit on the sidelines and watch. Or not, as you choose.
I sympathize with both you and AT on this. I spent a ton of time archiving GeoCities when it went down, and that got a lot of attention. Some of that attention seemed aimed more at social-engineering access to what I'd already crawled than at helping out, as they pretended.
But when you genuinely are trying to help it can be pretty frustrating to find a party skeptical or non-responsive. Eventually I turned the project over to someone else. It's still alive ( https://geocities.restorativland.org/ ) and the reocities.com domain still redirects there so at least it wasn't for nothing.
> The problem I had with ArchiveTeam is that they are very insular, and don't accept help quickly or easily from someone who is not already inside their inner circle. I'm guessing they've been burned more than a few times.
Consider also: they're just flawed; they're cliquish and irrational and require you to figure out a way to ingratiate yourself to them ahead-of-time/out-of-band. They don't have to be the victims of past harms.
> Occam's Razor tells us that it is more likely that they are victims of past harms
That doesn't follow. That explanation is hardly parsimonious. If anything, Occam's Razor favors an explanation like the one I gave, which is far less complex.
Sure, they could just be clique-ish and resistant to outside parties wanting to help. But what's the reason for that behaviour?
I still think it makes a lot more sense if they have actually perceived harm from others coming in from the outside, which now explains the behaviour we see today.
But as I said, without more authoritative information from the parties in question, your guess is as good as mine.
IMO, it's not an irrational behaviour on their part. It may seem irrational to us, but I believe it's much more likely to be a rational response to events that have happened in the past. We just don't know what those events were.
I think that's where we differ. You see it as a purely irrational behaviour on their part that doesn't need explaining, whereas I see it as a rational response that does actually have one or more trigger events in the past.
As I understand it, ArchiveTeam doesn't do anything to mitigate the issue, which is why they used to discourage users with a "censored" internet connection from running the Warrior.
They could use some form of quorum, where the archived page is distributed to multiple users, preferably in different countries, and only accepted if the results mostly agree. (Given dynamic pages from some of the archived services, it is an interesting problem as to how you would compare if two pages are almost the same.)
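A rough sketch of what such a quorum check could look like for pages that are byte-identical across fetches. The function name, the worker IDs, and the 0.6 agreement threshold are all made up for illustration; nothing here reflects how ArchiveTeam's tracker actually works, and it does not address the dynamic-page problem.

```python
import hashlib
from collections import Counter
from typing import Dict, Optional


def quorum_accept(submissions: Dict[str, bytes], threshold: float = 0.6) -> Optional[bytes]:
    """Return the payload most workers agree on, or None if agreement is too low.

    `submissions` maps a (hypothetical) worker ID to the bytes that worker
    fetched for one URL. Agreement is exact: two payloads match only if
    their SHA-256 digests are equal.
    """
    if not submissions:
        return None
    digests = {worker: hashlib.sha256(body).hexdigest()
               for worker, body in submissions.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes / len(submissions) < threshold:
        return None  # no sufficiently large group of workers agrees
    # Hand back any one payload carrying the winning digest.
    for worker, body in submissions.items():
        if digests[worker] == winner:
            return body
    return None
```

A single tampered or corrupted copy is outvoted as long as enough honest workers fetch the same URL, at the cost of fetching every page several times.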
Ah, I had the same thought you had and didn't think of dynamic content. Maybe use variable-length blocks split with content-defined chunking, and if 99% of the blocks are the same, assume the content is the same and store the 2 versions?
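For what it's worth, here is a minimal sketch of that chunking idea, assuming a simple polynomial rolling hash to pick chunk boundaries and a Jaccard overlap of chunk digests as the "percentage of blocks that are the same". The window size, mask, size limits, and function names are illustrative guesses, not a tuned or production scheme.

```python
import hashlib
from typing import List

BASE = 257
MOD = (1 << 61) - 1


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
               min_size: int = 32, max_size: int = 1024) -> List[bytes]:
    """Split data where a rolling hash of the last `window` bytes hits the mask."""
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        pos = i - start  # bytes already in the current chunk
        if pos >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        size = pos + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk closed by end of data
    return chunks


def chunk_similarity(a: bytes, b: bytes) -> float:
    """Jaccard overlap between the sets of chunk digests of two payloads."""
    da = {hashlib.sha256(c).digest() for c in cdc_chunks(a)}
    db = {hashlib.sha256(c).digest() for c in cdc_chunks(b)}
    if not da and not db:
        return 1.0
    return len(da & db) / len(da | db)


def looks_same(a: bytes, b: bytes, threshold: float = 0.99) -> bool:
    """Treat two fetches as the same page if enough chunks match."""
    return chunk_similarity(a, b) >= threshold
```

Because boundaries depend on local content rather than fixed offsets, a small insertion (a timestamp, a CSRF token) should usually disturb only the chunks around it. Whether 99% is the right cutoff would need testing against real pages.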
The risk of poisoned content puts this squarely in the domain of our ancestors' storytelling history. This won't be admissible in a court of law, but it'll be a piece of our pre-LLM past occasionally flavored by the storytellers. I still think that's worth it.
I don't think it will. Discovery in Archives has been a problem forever and with more content made today than ever before it is becoming ever more difficult.
It was actually an issue with the Reddit archival, but it wasn't malicious. Some people made changes to the code they thought would make it faster, but actually corrupted results.
Note that "Blogger" here, which I had never heard of, means Blogspot, which I think most people come across regularly, so it's actually very useful to have a backup of.
(See the first sentence under the "strategy" section, as well as that they say blogger.com/profile/NUMBER redirects to a blogspot subdomain)
I think those of us that remember the Internet from 20 years ago think of Blogger as the company and CMS and blogspot.com as merely the hosting domain.
Few people would refer to Google's analytics tool as Urchin though, and that has been fewer than 20 years
Blogger seems to not have been on my radar though I was around then, and is such a generic word that imo it's not a clear name to use when Blogspot is also available and everyone knows it
For another perspective, I’d have told you that what I formerly published my blog with was Blogger, with no thought about the blogspot domain name. I think of the latter as just the fallback they offered for people without their own hosting.
There was a really good (for me) Italian blogger I used to follow. Any way to recover his blog from the archive? Is this what ArchiveTeam will do? I have tried searching the current archive but I get only one page https://archive.is/http://comediventareilmiocane.blogspot.co...
Web preservation projects usually submit their content to the Internet Archive, or at least make it accessible through the Memento protocol. It seems the blog you're looking for already has some archived pages.
You can launch the second command several times to run multiple worker containers in parallel. On a low-end VPS you can easily run 2-3 of them. With better resources, you can run 10 of them, 20, or even more!
Important: don't use VPNs, proxies, hotspots, etc.
ArchiveTeam had a similar project for Google+ and they also said they downloaded zillions of Google+ pages and uploaded them to archive.org, but I have never been able to find the Google+ profile of anyone I know via their uploads — try finding the Google+ posts of Linus Torvalds or Terence Tao for example (just to pick two people who used to post pretty frequently and whose posts I used to read). I don't think they ever did an investigation of why this happened; as far as I can tell they just lost interest after Google+ shut down, and did not try to figure out what went wrong or even try retrieving the posts of any public profile.
Try clicking on any of the links. I only get 302 redirects and blank pages; I don't get to read any of the posts.
Basically, just as testing recovery is part of a backup strategy, testing the usefulness of the “saved” information ought to be part of any “saving” strategy.
The posts I tried worked for me, but the profile pages have CSS with body { visibility: hidden; } that was probably supposed to be removed by JavaScript, which is broken. If you remove it manually using the browser developer console, you can see the content (with broken CSS).
I guess those captures from 2020-2023 are 302s or blank pages because Google+ had already shut down. There are more working captures before that. For Linus Torvalds's Google+, this capture was one of the last, and it was made by ArchiveTeam [1].
I've found the CDX search feature of archive.org incredibly helpful for quickly finding functioning pages (status code 200) and for seeing all content that has been captured under a certain domain or page. For example, you can make a query using Linus's Google+ page as the URL parameter with an asterisk as a wildcard [2]. This allows you to see all the captured content from Linus's Google+ profile, including individual posts.
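In case it helps anyone else, here is a small sketch of building that kind of CDX query and turning the JSON response into replay links. It uses `matchType=prefix` (equivalent to the trailing-asterisk form) and a `statuscode:200` filter; the helper names and the default limit are my own choices, and the sketch only builds the URL and parses rows, it does not make the HTTP request.

```python
from typing import Dict, List
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_prefix: str, status: str = "200", limit: int = 50) -> str:
    """Build a Wayback CDX query for captures under url_prefix with a given status."""
    params = [
        ("url", url_prefix),
        ("matchType", "prefix"),             # everything under this URL
        ("filter", f"statuscode:{status}"),  # skip 302s and other non-working captures
        ("output", "json"),
        ("fl", "timestamp,original,statuscode"),
        ("limit", str(limit)),
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


def rows_to_records(rows: List[List[str]]) -> List[Dict[str, str]]:
    """With output=json, the first row is the header; map the rest to dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def wayback_url(record: Dict[str, str]) -> str:
    """Replay URL for one capture record."""
    return f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"
```

You would fetch the URL from `build_cdx_query("plus.google.com/102150693225130002912")` with any HTTP client, pass the decoded JSON to `rows_to_records`, and open the links from `wayback_url`.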
Thanks, that's useful. That seems to be a truncated snapshot of the profile page (your link 1 shows exactly 10 posts, with 8 of them posted between 2017-Oct-17 and 2017-Oct-25), rather than a true archive (a way to browse all the posts of a given person, without the text and comments being truncated etc).
It's great that the reality is not "nothing is visible" but rather "a bit is visible" — still, the archive just isn't very useful, compared to the impression from statements like "98.6% of profiles saved" etc. A little bit of forethought or testing its usability would have made a huge difference (some js code to expand posts and comments before saving the page, for instance, as some other archiving scripts at the time did, which I used to save the profiles of myself and some of my friends… didn't run it on popular public profiles thinking that ArchiveTeam would have likely covered it in their 98.6%, but it turns out they did not).
They got what they got and they can't get no more. We should consider ourselves lucky to have those 8 posts that were partially saved.
In the future, with the right assistance, I'm sure they could grab better archives in the limited time they have.
I still don't think that makes what they did save less useful or make it less admirable.
It would be like a library burning down and then telling the firefighters that the few scraps of books they managed to save are not useful. Why didn't they use tools needed to save the books better?
I think it's great that they saved so much (especially now that I can see that some of their results are actually usable, even if we have to manually edit the CSS or whatever). It's just that:
2. At the time they were saying things like 98% of content saved, which led one to believe they'd get to properly archiving some of the popular profiles at least, but it turns out everything is only very partial: in your analogy it's like the firefighters celebrating that they saved most of the books, but it turns out they got only the front cover of each book. (Which is great, and better than nothing! It's just… if they had made it easier to see what they were saving, maybe more people would have been able to help? Or would have saved the ones they particularly wanted?)
Isn't this only useful if you already know that the previous URL for Linus's profile is https://plus.google.com/102150693225130002912? Linus is a public figure, so this URL shouldn't be difficult to find. What about discoverability of lesser known individuals?
I hope Blogger sticks around. Twice I have exported my content and imported it into a blog I self-hosted. Both times I went back to Blogger because of ease of use, especially on mobile devices.
If you chose Blogger partly because it was free, and your blog got little to no traffic, that's not a sustainable business model for Google. They need ads or subscription fees to cover their hosting costs.
If there were a sufficient number of popular Blogger blogs, they could carry the financial burden of the unpopular ones, but the platform seems to be stagnant and unpopular, so that model doesn't seem to be viable either.
I'm not sure what went wrong, Blogger had good market share years ago!
Fewer personal blogs, WordPress for many monetized sites that are real businesses, Medium, Substack these days. Blogger is this in-between thing. Personally I still use it but it does seem uncommon.
Ghost managed to see growth in their simple SaaS blogging platform over the past 10 years. Google had the resources to adapt Blogger and help it find a niche to thrive in, but seems to have starved it of resources instead.
I don't think Google will shut down Blogger. There are some HUGE Blogger blogs out there. There are also many local community-focused blogs that get loads of comments. Blogger is very much alive and earning money for Google via AdSense.
They did shut down the subscribe-to-posts via email feature, which was like a free unlimited subscriber newsletter.
I don't even want to get into why Blogger is so great, because if I say what I know, people will start abusing it.
I’ve occasionally toyed with moving from Blogger to something else. But whenever I’ve done so and investigated alternatives it’s looked like a lot of work and would cost money. Want a homepage too.
It’s really nothing fancy but I need a homepage and I want blog posts that may or may not have images and a lot of Wordpress templates would take customization to get what I want—and I don’t really want to have to learn a bunch of web stuff to do.
I think blogspam is a decent source of ad revenue for Google, so it seems unlikely to shut it down anytime soon, at least until they figure out how to monetize LLMs.
They plan to delete accounts with no recent activity, including data from all Google properties. The only exception is YouTube for some reason. The wiki page is simply out of date.
I wonder if they made a big list of the old content that would get deleted under their policy, and decided that some of that actually still brings in enough ad revenue to be worth keeping. And it was easier to just keep everything than make carve-outs to the deletion policy?
I'm sure that's part of it, but content on Youtube is also more discoverable (despite Google running a search engine), and the platform is more active.
> No updates to Blogger have been documented since September 2020. Google has also moved away from Blogger for their own company blogs. For these reasons, Blogger may be at risk of shutting down.
I find that a shame. I think with a lighter, more modern UI, Blogger could come back and take some market share from Medium.
Medium is so bloated and clickbaity that I would take a regular old blog on WordPress or Blogspot despite it being uglier. A lot of technical documentation for some niches is still only present in blogs of this kind.
i have very fond memories of exciting times creating and designing my blogger websites when i was in 3rd grade in my netbook (i learned to use the computer before i could read). the first site i created was a site for downloading PC games.
if i was born earlier or later, those would've translated into web dev skills, and i'd be a dev much sooner.
i'll try to recover my old gmail to check if my blogger sites are still okay.
Send ArchiveTeam the link on IRC or here and we can save it to archive.org, then later you can use wayback-machine-downloader to grab it from archive.org.