Hacker News | danmactough's comments

Funny you should mention both RSS and BlueSky/Mastodon as a return to the web, since Dave Winer (who pioneered RSS) has written about how BlueSky/Mastodon etc. are very much NOT the web. http://scripting.com/2024/12/21.html

> Threads, Bluesky and Mastodon are the IBM, Microsoft and Apple of 2024. It's ridiculous if they think this is a web.


Makes a lot of sense and I don't disagree. They're definitely not a return to the web and I probably misrepresented them that way in my post. I was mostly using them as examples of existing platforms where you can find content without being subjected to the algorithm's suggestions, if you choose to use the "following" or "subscribed" features instead of "Discover" or "For You".


We noticed that all of our instances -- in 3 different availability zones in US East -- all cycled at the same time. That was pretty disappointing. Kind of defeated the purpose of having things in different AZs.


We did a zone-by-zone reboot. If you want to send me the instance ids I will ask our team to see what happened. You can find my email in my profile.


How much delay is there between zones? Some people's services don't come up instantly. Perhaps when zone 3 went down, the user's services in zone 1 hadn't finished coming back up.


They did a different zone every day, so ~24 hours.
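
To make the question above concrete, here is a minimal sketch (not anything AWS publishes) that checks whether any two zones would be down at once, given a reboot start time per zone and an assumed recovery time. The zone names, the start date, and the `zones_down_together` helper are all hypothetical and only there for illustration.

```python
from datetime import datetime, timedelta


def zones_down_together(reboot_starts, recovery):
    """Return pairs of zones whose outage windows overlap.

    reboot_starts: dict mapping zone name -> datetime the reboot begins
    recovery: timedelta a service needs to be healthy again after its reboot starts
    """
    # Walk the zones in the order they are rebooted.
    zones = sorted(reboot_starts, key=reboot_starts.get)
    overlaps = []
    for i, first in enumerate(zones):
        for second in zones[i + 1:]:
            # The later zone overlaps if it goes down before the earlier one is back.
            if reboot_starts[second] < reboot_starts[first] + recovery:
                overlaps.append((first, second))
    return overlaps


# Hypothetical schedule: one zone per day, roughly 24 hours apart.
start = datetime(2000, 1, 1)  # arbitrary placeholder date
schedule = {
    "zone-1": start,
    "zone-2": start + timedelta(days=1),
    "zone-3": start + timedelta(days=2),
}

quick = zones_down_together(schedule, timedelta(minutes=30))
slow = zones_down_together(schedule, timedelta(hours=25))
```

With a one-zone-per-day schedule, `quick` comes back empty and `slow` only reports overlaps because the assumed recovery time exceeds the 24-hour gap between zones.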


Hm... interesting, I had the opposite experience. It was the "aha" moment for me when instances on us-west-1a and us-west-1b were rebooted at different times, which let us avoid any downtime.


> you may also want to take this opportunity to re-examine your AWS architecture to look for possible ways to make it even more fault-tolerant

Basically they won't allow blame to be placed on them for anything they do. According to AWS, it's your fault this happened.


Amazon had to perform the maintenance due to XSA 108, and the timetable they had to meet was set by the Xen project.

They set up the maintenance to have as little impact as possible by splitting availability zones onto separate days so that people utilizing multiple zones for high-availability would not lose multiple zones at once.

Afterwards, they gave a detailed breakdown, linking to the vulnerability and explaining both why they had to perform maintenance and why they could not share more details upfront.

They also provided information on how to improve the fault-tolerance of your systems so that future issues like this won't stop your workflow.

Amazon isn't placing "fault" on users; they did a pretty stellar job of handling a nasty vulnerability and treating their users as well as possible.


Doesn't matter why they did the maintenance, so your first four sentences are simply rationalizations. The parent was complaining all the zones were restarted AT THE SAME TIME. Jeff explains why they had to do the maintenance, and nobody is disputing why they had to do it. What is being disputed is that they did it all at once, without rolling the restarts. This makes ZERO sense given their advice to 're-examine your (AWS based) architecture' for fault tolerance. Don't patronize me while you're doing something that affects me in a way I can't control.

I've had direct experience with AWS in this regard, and was equally disappointed in the outcome. If you want to see a class act in taking ownership of issues arising from such matters, have a look at Rackspace's response:

  Dear Rackspace Customer,

  I’m writing to apologize for the downtime and inconvenience that you and others of our customers have suffered in recent days.  Like other major cloud providers, we were forced to reboot some of our customers’ servers to patch a security vulnerability affecting certain versions of XenServer, a popular open-source hypervisor.  This maintenance was especially difficult for many of you because it had to be performed on short notice, and over the weekend.

  Now that this issue has been fully remediated, without any reports of compromised data among our customers, I’d like to explain what happened, and why.

  Whenever we at Rackspace become aware of a security vulnerability, whether in our systems or (as in this case) in third-party software, we face a balancing act.  We want to be as transparent as possible with you, our customers, so you can join us in taking actions to secure your data.  But we don’t want to advertise the vulnerability before it’s fixed — lest we, in effect, ring a dinner bell for the world’s cyber criminals.

  That’s the dilemma that we faced over the Xen bug. Such vulnerabilities are regularly found in software, whether proprietary or open source. The key, once a bug is identified, is to fix it swiftly and quietly.  This particular vulnerability could have allowed bad actors who followed a certain series of memory commands to read snippets of data belonging to other customers, or to crash the host server. We wanted to flag the issue as quickly as possible to those of you using our Standard, Performance 1, and Performance 2 Cloud Servers, and our Hadoop Cloud Big Data service.  But we didn’t want to do so until we had a software patch in place to address the vulnerability.

  When we learned of the security issue and realized its significance early last week, our engineers worked with our Xen partners to develop and test a patch, and organize a reboot plan.  The patch wasn’t ready until the evening of Friday, Sept. 26.  And the technical details of the vulnerability were scheduled to be publicly released on Wednesday, Oct. 1.  We were faced with the difficult decision of whether to start our reboots over the weekend, with short notice to our customers, or postpone it until Monday. The latter course would not allow us to sufficiently stagger the reboots.  It would jeopardize our ability to fully patch all the affected servers before the vulnerability became public, thus exposing our customers to heightened risk.

  We decided the lesser evil was to proceed immediately, at which time we notified you, and our partners in the Xen community, of the need for an urgent server reboot.  Even then, to avoid alerting cyber criminals, we didn’t mention Xen as the reason for the reboot. Another major cloud provider did attribute its reboot to security problems with Xen, which put all users of the affected versions of that hypervisor at heightened risk.  But we’re relieved to report that, as of now, we’ve learned of no data compromise among Rackspace customers.  Now that the vulnerability has been fully remediated, the Xen community has lifted its embargo on talking about it.

  Those of you who are longtime Rackspace customers know that we have a strong record of open, timely communication with you. We reach out to you whenever there’s an issue.  We answer the phone whenever you call. We do everything we can to find a solution.  This past weekend, our engineers worked tirelessly with customers and partners to remediate the Xen vulnerability.  

  This maintenance affected nearly a quarter of our 200,000-plus customers, and in the course of it, we dropped a few balls.  Some of our reboots, for example, took much longer than they should.  And some of our notifications were not as clear as they should have been. We are making changes to address those mistakes.  And we welcome your feedback on how we can better serve you.

  As a veteran Racker who is proud of our commitment to our customers and their businesses, I am personally sorry for any inconvenience or downtime that we caused you during this incident.

  Sincerely, 

  Taylor Rhodes
  CEO and President
  Rackspace

  taylor.rhodes@rackspace.com


Don't lie. I had ~120 instances I had to juggle between 3 availability zones, and never were two AZs down/rebooted at once. Our environment suffered no downtime, as we had at least 2-3 days notice per AZ to move instances around.

Rackspace's handling of the situation was a joke. They sent notification emails out at 9:30pm on a Friday night, and then proceeded to do the reboots Saturday at peak traffic times.


Yeah Rackspace is hardly to be held up as a standard bearer here, they don't even _have_ availability zones.

We have around 200 instances, had about 59 reboot, and specifically were able to plan around these happening on different days.

We weren't super excited when the window seemed to go to 4h right before it started, but we were prepared.

I'm an ex Racker and I've told people high up at Rackspace for years that until they implement something like availability zones, they're a joke for any kind of production. Their philosophy, as is pervasive in the hosting industry, is that they have paying customers so whatever they are doing must be right. Obviously Amazon often also seems to act this way, but this particular maintenance was handled well afaict, and availability zones showed their value.


> Their philosophy, as is pervasive in the hosting industry, is that they have paying customers so whatever they are doing must be right. Obviously Amazon often also seems to act this way, but this particular maintenance was handled well afaict, and availability zones showed their value.

This could explain why Rackspace was shopped around by Morgan Stanley. They may be profitable now, but Amazon and Google are going to eat their lunch.


Shopped around, with no takers. At last count it was reported they've given up on that.


Indeed they have. Now they're doubling down on their existing platforms (and also, OnMetal).

I think their best move would be to pivot to be a firm that manages solutions for corporations that refuse to move off on-premises equipment for whatever reason. Their CapEx costs fall away, and they already have a deep ocean of talent to draw on.

There are already large orgs that already do this, but Rackspace has the potential to suck A LOT less than they do at the same task.


Amazon and Google are worthless to any company without a robust sysadmin/devops team, which is most companies in the world.

It's easy to get stuck in the tech-savvy bubble here, where most people can write code, pick up Chef in a week, and are trying to build cheaply at "web scale." Those people don't need, or want to pay for, support with their servers.

But most companies need some help to run a few servers for web and email. Rackspace is the only large hosting provider who provides that across the board.

That said, Rackspace needs to beef up their devops support, or they risk limiting their own abilities to grow with their customers.


> Rackspace is the only large hosting provider who provides that across the board.

SoftLayer has provided managed services for years, and was bigger than Rackspace even before they were purchased by IBM.

http://www.datacenterknowledge.com/archives/2009/05/14/whos-... (Ignore the 2009 - the post has been updated as of 2013)


Plenty of hosting companies offer managed hosting as an upsell. You can't even get pricing for it at Softlayer without talking to a sales rep, for example.

Rackspace is the only large hosting provider who provides that across the board, because even their smallest cloud servers come with phone support. As far as I know, you cannot get phone support with an arbitrary set of cloud servers at Softlayer, or Amazon, or Google.


Would you rather be forced to fly first-class or have the option to choose?


No one is forcing anyone to buy from Rackspace. The question is whether Rackspace has a unique value proposition against Google or Amazon, and the answer is support.


Huh?

Elastic Beanstalk, OpsWorks, Google sites, Google apps, AWS Marketplace, etc.

Going by what you said about most companies, most registered businesses in the world are likely just looking for a single dinky site with a mailbox pointing @theirbusiness.com, with definitely no need for more than a shared server. Google, Wordpress, Github pages, Shopify, and dozens of others make this very simple to set up and use. You said Rackspace is the only large hosting provider that provides this across the board. That is not true, and they wouldn't even be in my top 10 if I were looking for a provider.

For a single website that gets fewer than 10 visits a day and has 5 HTML pages, I was just quoted $75/mo minimum by Rackspace, with some server management on my part.


> something like availability zones

How is that different from a datacenter?


With AZs you can at least have some fault tolerance and not have to pay the outbound costs or latency associated with going across the WAN.


AZs in AWS are essentially distinct "datacenters" from a logical perspective. There is no shared infrastructure between them; if AZ B drops off, as long as you have your data and instances replicated and serving in another AZ, you will see no downtime.


That is "in theory". In fact there have been region wide failures at the AWS services level, though none recently that I know of.


I was pretty unhappy with how Rackspace handled the maintenance window.

1. Rackspace's maintenance announcement was sent at 9:00 PM on Friday night (Pacific time). Seriously?! I had already left for a weekend vacation without my laptop, so I couldn't do anything to get my company prepared. Even if the patch wasn't ready until Friday night, Rackspace could have scheduled the maintenance windows and announced them to customers much earlier.

2. The maintenance windows for all three USA regions were scheduled at the same time. We couldn't just move to a different region without going to another continent.

3. Each maintenance window was 24 hours -- that's just too long. Even though our servers were only down for 10 minutes, we had to be on call and ready for 24 hours.

4. Although we have redundant servers in every region, we still couldn't guarantee that those redundant servers wouldn't be rebooted at the same time. As it turns out, we did lose both of our servers in ORD at the same time.


I would like to be able to have alerts go right into SNS.


More generally, it would be cool if every account automatically got two free read-only SQS queues: one for events and one for upcoming events. Every known event (startup, terminate, permission change, network error, etc) could be published to the queue.

Not high-priority for us but the parent comment sparked the idea (I'm not in devops so maybe this exists via a different mechanism.)


These are both great ideas. I will share them with the team today. Keep them coming!


seems like realtime CloudTrail->{SQS,SNS} could satisfy his request and several that I have.

(instead of waiting for events to batch from CloudTrail to S3)


+1 on this.

We got the update on the EC2 reboots, but totally missed the RDS reboot information and as such suffered some downtime through the reboots on Friday (that we could have avoided had we known about it). It would be nice if the RDS console supported the "Scheduled Events" section that the EC2 console had.


Oh the possibilities! This would be super handy.

I'd add SNS support for non-US SMS notifications :p Then I guess I would not need Pingdom anymore.

