
Migrating 23TB from S3 to B2 in 7 hours - CherryJimbo
https://nodecraft.com/blog/development/migrating-23tb-from-s3-to-b2-in-just-7-hours
======
andrewstuart
At USD $0.05 per gigabyte out from S3, that's about USD $1,150 to transfer
23TB - is that right?

[https://aws.amazon.com/s3/pricing/](https://aws.amazon.com/s3/pricing/)

I'd be pretty nervous about hitting the delete button after the data transfer.

~~~
desdiv
Would this crazy-ass strategy work?

1. Spin up 50 AWS Lightsail instances[0] for parallelism

2. In each instance, download from S3 and upload to B2.

S3 to any AWS service in the same region is free[1]. $5 Lightsail instances
come with 2TB of data transfers each, so 50 of them can easily handle 23TB.
The whole transfer can be done within a few hours so the total computing cost
is less than $10 ($5 / 30 * 50 = $8.3). Total data retrieval cost for S3 is
($0.0007 per GB) * 23,000GB = $16.1.

[0] [https://aws.amazon.com/lightsail](https://aws.amazon.com/lightsail)

[1] "Transfers between S3 buckets or from Amazon S3 to any service(s) within
the same AWS Region are free." according to
[https://aws.amazon.com/s3/pricing/](https://aws.amazon.com/s3/pricing/)
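
For what it's worth, the arithmetic in steps 1-2 checks out. A quick sanity check using only the figures quoted in this comment (not current AWS rate cards):

```python
# Back-of-envelope check of the strategy above. All prices come
# from this comment, not from current AWS pricing pages.
LIGHTSAIL_MONTHLY = 5.00       # USD per $5 instance per month
INSTANCES = 50
DAYS_USED = 1                  # generous: the transfer takes a few hours
S3_RETRIEVAL_PER_GB = 0.0007   # USD, figure quoted above
DATA_GB = 23_000

compute_cost = LIGHTSAIL_MONTHLY / 30 * DAYS_USED * INSTANCES
retrieval_cost = S3_RETRIEVAL_PER_GB * DATA_GB
per_instance_gb = DATA_GB / INSTANCES  # well under the 2TB allowance

print(f"compute ~${compute_cost:.2f}, retrieval ~${retrieval_cost:.2f}, "
      f"{per_instance_gb:.0f}GB per instance")
```

Each instance only has to move 460GB, comfortably inside its 2TB transfer allowance.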

~~~
coolgeek
Lightsail has (or at least had) a hard limit of 20 instances. They also have a
soft limit of 2 instances, after which you must request an upgrade to a higher
limit. I had to submit a support request explaining my intended usage, and it
took a week to get approved.

The stated reason for these limits is to avoid unexpectedly large bills. But I
suspect that it's also to prevent crazy-ass strategies for getting around
bandwidth costs.

------
sanxchit
Great writeup on data migrations. I was wondering whether you did a comparison
of this method vs. using AWS Snowball[1] to export the S3 data and B2
Fireball[2] to ingest it.

[1] - [https://docs.aws.amazon.com/snowball/latest/ug/create-export-job-steps.html](https://docs.aws.amazon.com/snowball/latest/ug/create-export-job-steps.html)

[2] - [https://help.backblaze.com/hc/en-us/articles/360001918654](https://help.backblaze.com/hc/en-us/articles/360001918654)

~~~
CherryJimbo
We looked briefly at the snowball and fireball, but wanted to do this as
quickly as possible, whilst keeping the process entirely transparent to our
users. It was also an excuse for our team to get intimately familiar with the
B2 API, since it's not compatible with S3.

If we were to consider another large migration like this, physical media would
probably be the way to go.

~~~
late2part
I evaluated moving 2PB with Snowball vs. putting in 10G/100G links. The issue
with Snowball (I started a company that did what Snowball does, and shut it
down after it failed) and other FedEx/RAID solutions is that you have 3
transfers. You think the LAN transfer will be quick, but you're generally
rate-limited by the systems more than by the bandwidth-delay product. If
you're in a high-traffic DC area, it's pretty easy to get temporary bandwidth
or install circuits to carry that. 10G for 2PB is 18 days of transfer, which
sounds like a lot, but the Snowball route is about the same: 5 days of
transfer on each site, 1 day of setup, and 1 week of shipping. Those numbers
aren't exact, but they're close.

So, Snowball works in a lot of areas, but like so many AWS products, it works
if you adapt to it.

pigz/scp/zstd works extremely fast inline.

In your case you're pulling from S3 to another object store.

I moved ~1PB from one S3 region to another. "Why not use replication," they
asked. That only works if it's turned on when you upload the object - another
fine-print 'gotcha' in the easy AWS service. Then you get into rate limits. In
2010 I asked AWS if I could spin up 1000 servers to test something - nope -
elasticity at that level is for the big boys.

Now I work for a large cloud company and we still run into elasticity limits.

To move the 1PB from one S3 region to another, we spun up hundreds of spot
instances (oh, we were compressing and glacierizing it too) and built a
perl/mysql batch job to parallelize an "s3 get | zstd | s3 put" pipeline. One
nice thing about S3 is that it exposes the MD5 hash as the ETag - unless the
upload was multipart, in which case it's the hash of the per-part hashes. So
you should split files in advance if you want to verify the hash (more fine
print).
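
That multipart caveat is worth spelling out: the ETag S3 reports for a multipart upload is the MD5 of the concatenated per-part MD5 digests, suffixed with the part count, so a local verification pass must chunk each file with the same part size the uploader used. A minimal sketch:

```python
import hashlib

def multipart_etag(data: bytes, part_size: int) -> str:
    """Reproduce the ETag S3 reports: a plain MD5 for single-part
    uploads, otherwise the MD5 of the concatenated per-part MD5
    digests with a "-<part count>" suffix."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        return hashlib.md5(data).hexdigest()
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"
```

Splitting files in advance (as above) sidesteps this entirely, since each single-part object then gets a plain MD5 ETag you can verify with standard tools.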

Worked great. Good for you for sharing this project, very cool.

~~~
semi-extrinsic
As a physicist I've always found the name "elastic scaling" funny. If it's
elastic in the physical sense, it means that the energy required to grow to
some size is quadratic (or higher) in the size. The marketing meaning is "easy
scaling", but the physical meaning is "really hard scaling".

E.g. compare a soap bubble versus a bubble gum bubble. It's a lot easier to
scale up the soap bubble, which is _not_ elastic.
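
To make the aside concrete, compare an idealized Hookean solid with a constant-tension soap film (a physicist's sketch, ignoring real-material subtleties):

```latex
% Hookean elasticity: stored energy is quadratic in the deformation x
E_{\text{elastic}} = \tfrac{1}{2} k x^{2}
% Constant-tension film (two surfaces, tension \gamma):
% energy is only linear in the added area \Delta A
E_{\text{film}} \approx 2 \gamma \, \Delta A
```

Doubling the stretch quadruples the elastic cost, while the film's cost merely doubles.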

~~~
javajosh
It's a very good observation, and I think it's more than just a funny aside.
The word 'elastic' connotes increasing resistance as the cluster grows, but
this is a false intuition. From AWS's POV, the 'resistance' to adding a node
is small, fixed, and generally independent of cluster size. I suspect this is
what makes cloud computing in general, and EC2 in particular, such a cash cow.

Moreover, it turns out that elasticity is a very valuable quality of a cluster
for most workloads; we _want_ this intuition to be true, that our cluster
meets resistance as it grows, in the sense that it will shrink when the
workload decreases. This matches our economic intuition, too. We want this so
much that we build another software layer to make it happen - e.g., k8s.

------
Can_Not
I assume I'm not the only one who has never heard of B2 before now, so I'll
summarize everything I've found out:

[https://www.backblaze.com/b2/cloud-storage-pricing.html](https://www.backblaze.com/b2/cloud-storage-pricing.html)

It appears to be an S3 alternative, competing via lower prices and fewer
micro-charges. Highlights compared to DigitalOcean and Wasabi:

DO/Wasabi: minimum $5/month, but great deals compared to AWS/GCE otherwise.

B2: First 10GB storage free (probably not including bandwidth).

For a side project or startup looking for its first storage option, B2 seems
compelling. But one important question: is it a drop-in replacement? Is the
API available on your platform?

[https://help.backblaze.com/hc/en-us/articles/218513487-Is-the-B2-Cloud-Storage-API-Compatible-with-Amazon-S3-?mobile_site=true](https://help.backblaze.com/hc/en-us/articles/218513487-Is-the-B2-Cloud-Storage-API-Compatible-with-Amazon-S3-?mobile_site=true)

I don't have a clear answer right now, but when I get closer to deploying my
current side project, it's something I'll take a deeper look into.

~~~
iansltx
Their API is _not_ S3-compatible. There are a bunch of client libraries for it
on various platforms, though.

They have 1GB of outgoing bandwidth free per day. If you put them behind
Cloudflare, you effectively get your bandwidth for free. So half a cent per GB
per month is all you pay - plus your time building out the integration instead
of using S3.
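
As a rough comparison using the prices quoted in this thread (B2 at $0.005/GB/month, S3 egress at $0.05/GB; the S3 standard storage rate of $0.023/GB/month is my assumption, worth re-checking against current pricing), the article's 23TB pencils out as:

```python
# Rough monthly cost for the article's 23TB. Prices are the ones
# quoted in this thread; the S3 storage rate is an assumed
# standard-tier figure, not taken from the article.
DATA_GB = 23_000

b2_storage = DATA_GB * 0.005       # B2: $0.005/GB/mo
b2_egress = 0.0                    # free when fronted by Cloudflare
s3_storage = DATA_GB * 0.023       # assumed S3 standard-tier rate
s3_egress_per_tb = 1_000 * 0.05    # $0.05/GB out of S3

print(f"B2: ${b2_storage + b2_egress:.0f}/mo")
print(f"S3: ${s3_storage:.0f}/mo plus ${s3_egress_per_tb:.0f} per TB served")
```

The egress line is the real gap: on S3 every TB of user downloads adds to the bill, while the B2+Cloudflare combination doesn't.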

------
cavisne
I'm guessing the payback time on this is fairly long by the time you factor in
the cost of moving all the backups. S3 One Zone is pretty competitive against
Backblaze.

Cloudflare is betting on most of their customers serving HTML, not large
uncacheable blobs; that "Bandwidth Alliance" will disappear pretty quickly at
any sort of scale.

~~~
CherryJimbo
I can't speak for Cloudflare, but we've talked to them pretty extensively
about the whole project, and even did a case study with them about our use of
the Bandwidth Alliance as we switched cloud providers. Things may change in
the future, of course, but they very much encouraged what we were doing.
[https://www.cloudflare.com/case-studies/nodecraft-bandwidth-alliance/](https://www.cloudflare.com/case-studies/nodecraft-bandwidth-alliance/)

------
toomuchtodo
Would you consider open sourcing the micro service you wrote to perform the
migration? I could see it being helpful to others interested in migrating from
S3 to B2.

~~~
toomuchtodo
This request can be disregarded. I’m going to explore extending s3proxy for
the same purpose (migration and backfill of disparate object storage systems
through an abstraction layer).

------
kbowman
Did you compare it against Wasabi?

~~~
CherryJimbo
We did not. Wasabi looks interesting - thanks!

~~~
chime
We have about 125TB on Wasabi. It costs about $650/mo. Their performance for
large files is great.

~~~
metildaa
Haven't they had stability issues? I know it was troublesome for some Mastodon
instances.

~~~
chime
Once so far in the last 8 months. It wasn’t a big issue in our use case,
especially since there was no data loss.

------
dgemm
If this is purely backup data, wouldn't glacier be a better fit than either S3
or B2?

Glacier is already cheaper than B2 and has the advantage of storing data
redundantly across datacenters. And Glacier Deep Archive is 4 times cheaper
than _that_.

Full disclosure: AWS employee
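
For scale, here is the storage-only math using commonly published per-GB-month rates (assumed figures, worth re-checking, and deliberately excluding Glacier's retrieval, request, and early-deletion fees, which are the usual complaint):

```python
# Storage-only monthly cost for 23TB at assumed per-GB-month rates:
# Glacier $0.004, Glacier Deep Archive $0.00099, B2 $0.005.
# Retrieval and early-deletion fees are NOT included.
DATA_GB = 23_000

glacier = DATA_GB * 0.004
deep_archive = DATA_GB * 0.00099
b2 = DATA_GB * 0.005

print(f"Glacier ~${glacier:.0f}/mo, Deep Archive ~${deep_archive:.0f}/mo, "
      f"B2 ~${b2:.0f}/mo")
```

For data users actually download (as in the article's backup downloads), the retrieval fees and thaw latency change the picture entirely.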

~~~
snissn
No - the pricing on glacier makes it a steaming pile of shit

~~~
dgemm
It's built for off-site backup type workloads, where you write once and read
hopefully never.

You are probably using it wrong.

------
vviktor
Wouldn't it be simpler if you gradually moved?
[https://martinfowler.com/bliki/ParallelChange.html](https://martinfowler.com/bliki/ParallelChange.html)

I understand your main concern for moving was pricing, but developer hours
also cost money. It seems like you had to invest many more developer hours
than you would have if you had moved gradually over the course of a month or
so (probably a week?).

------
Bombthecat
Are they still one data center only?

~~~
BartBoch
It seems that they have two DCs now, not sure if the second is used for
redundancy though.

------
stevefan1999
I mean, the best way to migrate this huge an amount of data would still have
to be a physical migration service like AWS Snowmobile[0], right?

[0]: [https://aws.amazon.com/snowmobile/](https://aws.amazon.com/snowmobile/)

------
paddor
Can I ask why ZIP? Isn’t it quite heavy on CPU for a not-that-good compression
ratio? I’m thinking of LZ4 or Zstandard instead.

~~~
CherryJimbo
We had to pick something that was "good enough" in compression time/size, as
well as easy for our customers to download and view if they wish, on any OS.
Zip being supported in every popular operating system, with the average
Windows user able to right click -> unzip, was the primary reason for the
choice.

There are of course significantly faster and more efficient compression
formats like LZ4, which would be ideal if we were solely using the data
internally in managed environments, but we offer these backups as downloads to
our users, some of whom aren't very technically inclined and still need to be
able to access the files easily.
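
For illustration only (this is not Nodecraft's actual pipeline): even within zip, the CPU/size trade-off is tunable, e.g. via the compresslevel knob in Python's stdlib, while keeping a format every OS can open:

```python
import io
import zipfile

def make_backup_zip(files: dict[str, bytes], level: int = 1) -> bytes:
    """Build a .zip any OS can open; level 1 trades size for speed,
    level 9 spends CPU for the smallest archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED,
                         compresslevel=level) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()

# Round-trip check: the archive opens with the same stdlib tooling
# (or a Windows user's right click -> Extract All).
blob = make_backup_zip({"world/level.dat": b"\x00" * 4096})
with zipfile.ZipFile(io.BytesIO(blob)) as zf:
    extracted = zf.read("world/level.dat")
```

The file names and contents here are hypothetical; the point is that the "universally openable" constraint and the CPU cost are not entirely in tension.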

------
ryanmarsh
The stated reason for moving to B2 was pricing, but no breakdown of the costs
was given. I’m not sure how they arrived at that conclusion.

------
late2part
What's the monthly storage cost on that?

~~~
StavrosK
$115/month: 23,000 GB at $0.005/GB.

------
howiroll
I’ve heard Backblaze has only one region. Is that true?

~~~
chx
They said they use Cloudflare as well...

~~~
CherryJimbo
We do use Cloudflare, but a lot of the instance backups we store are multiple
GBs - Cloudflare doesn't cache those. Not to mention uploads from regions like
Singapore can be very slow all the way to the US.

It's not a deal-breaker for us, but we're very much looking forward to when
they can support more regions.

