
Level 2 Advanced Replacement Cache for ZFS - hepha1979
http://cr.illumos.org/~webrev/skiselkov/3525_simplified/
======
zdw
To give more info on this:
[http://wiki.illumos.org/display/illumos/Persistent+L2ARC](http://wiki.illumos.org/display/illumos/Persistent+L2ARC)

Basically, L2ARC is the "Level 2 Adaptive Replacement Cache" for ZFS, where
level 1 is in RAM and level 2 is on an SSD (usually a cheap/large MLC SSD, as
opposed to an expensive/small SLC SSD for the ZIL). In other words, it uses an
SSD as a huge cache for reads, so they don't have to be serviced by a slower
spinning-rust array.
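The tiering described above can be sketched as a toy two-level read cache. This is purely illustrative (the class, tier sizes, and dict-backed "disk" are all invented here) and has nothing to do with ZFS's actual data structures; it just shows the lookup order: RAM first, SSD second, spinning rust last, with L1 evictions spilling into L2:

```python
from collections import OrderedDict

class TwoLevelCache:
    """Toy two-level read cache: a small fast tier ("RAM") backed by a
    larger slow tier ("SSD"); misses in both fall through to "disk"."""

    def __init__(self, l1_size, l2_size, disk):
        self.l1 = OrderedDict()   # small, fast tier (RAM)
        self.l2 = OrderedDict()   # larger, slower tier (SSD)
        self.l1_size, self.l2_size = l1_size, l2_size
        self.disk = disk          # slowest tier: the backing store

    def read(self, key):
        if key in self.l1:                     # L1 hit
            self.l1.move_to_end(key)
            return self.l1[key]
        if key in self.l2:                     # L2 hit: promote to L1
            value = self.l2.pop(key)
        else:                                  # miss: go to spinning rust
            value = self.disk[key]
        self._insert_l1(key, value)
        return value

    def _insert_l1(self, key, value):
        self.l1[key] = value
        if len(self.l1) > self.l1_size:        # evict LRU from L1 into L2
            old_key, old_val = self.l1.popitem(last=False)
            self.l2[old_key] = old_val
            if len(self.l2) > self.l2_size:    # L2 full: drop its LRU entry
                self.l2.popitem(last=False)
```

Reading a key that was evicted from L1 then finds it in L2 and promotes it back, which is the whole point of the second tier.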

Prior to this change, after every system reboot, the L2ARC would be cleared
and not used/filled until reads from disk happened. On a system that is
rebooted frequently (or even infrequently), this can result in slower
performance until the cache has been primed.

My understanding is that with this change, reads can be served from the L2ARC
devices after a reboot (the "persistence"), which removes the ramp-up period
before the L2ARC becomes useful.

~~~
codys
This (L2ARC being considered "valid" even after a reboot) sounds quite a bit
like ZFS is growing features that already exist in Linux's bcache (and maybe
dm-cache? I'm not sure how it treats data).

~~~
fiatmoney
All the parts (tiered caching, compression, checksums, redundancy,
deduplication, journaling, network availability, FS migration...) likely exist
separately; having them in a single filesystem (especially that doesn't
require kernel patches, just a module) is quite pleasant.

~~~
codys
And some of those things certainly benefit from being integrated into the FS.

I wonder if any knowledge at the filesystem level (as opposed to the block
level, where bcache and dm-cache operate) could help L2ARC make better caching
choices.

------
cokernel_hacker
Neat. My reading of these changes implies that they finally made the L2ARC's
info survive a reboot.

For some background:

ZFS, a modern WAFL clone [1], has a replacement algorithm called ARC [2] which
can concisely be described as a hybridized MRU/MFU (Most Recently Used/Most
Frequently Used) replacement algorithm for deciding which pages make the most
sense to keep in memory. There is considerable literature on replacement-
algorithm design; I have little to say about ARC other than that it is
patented [3] and can be outperformed by newer algorithms.
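To make the recency/frequency split concrete, here is a toy cache loosely in that spirit. To be clear, this is *not* the real (patented) ARC algorithm, which also keeps "ghost" lists of recently evicted keys and adaptively resizes the two sides; this sketch only shows the basic idea of separating one-hit pages from repeatedly hit pages:

```python
from collections import OrderedDict

class RecencyFrequencyCache:
    """Toy recency/frequency hybrid (NOT real ARC): pages seen once live
    in a recency list; pages touched again are promoted to a frequency
    list. Evictions prefer the one-hit recency side, so scan-heavy
    workloads can't flush out the frequently used working set."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.recent = OrderedDict()    # seen once (recency side)
        self.frequent = OrderedDict()  # seen at least twice (frequency side)

    def access(self, key, value):
        if key in self.frequent:           # repeat hit: refresh its position
            self.frequent.move_to_end(key)
        elif key in self.recent:           # second touch: promote
            self.recent.pop(key)
            self.frequent[key] = value
        else:                              # first touch: recency side
            self.recent[key] = value
        while len(self.recent) + len(self.frequent) > self.capacity:
            victim = self.recent if self.recent else self.frequent
            victim.popitem(last=False)     # evict LRU, one-hit pages first

    def __contains__(self, key):
        return key in self.recent or key in self.frequent
```

A page accessed twice ("a" below) survives a burst of one-off accesses that would have evicted it from a plain LRU of the same size.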

Note that this is quite different from the traditional approach to FS/buffer-
cache design. One usually expects the OS kernel to manage the buffer cache for
you (OS X has its Unified Buffer Cache (UBC), NT has its Cache Manager (Cc),
etc.). However, ZFS carries around its own incredibly complex caching
subsystem. I do not know why they didn't want to improve or modify the Solaris
kernel's segmap subsystem, but there are consequences to this design. Notably,
ZFS's memory usage is quite a bit higher because of ARC.

The idea of performing read-caching in memory with ARC seemed like such a good
idea to the ZFS designers that they allowed for a second level of it: L2ARC.
L2ARC essentially runs the ARC algorithm between SSDs and HDDs to, hopefully,
speed up random reads in a ZFS storage pool.

Now to steer back towards what this code dump seems to be about. If you recall
from before, ZFS's ARC is a replacement algorithm based on _usage_, and it
needs to know which things to put where. This so-called persistent L2ARC
remembers where things were on an L2ARC device so that the storage pool can
take advantage of the data already sitting on the SSD across, say, a reboot.

Huh? Why did this require extra code? Remember, ARC was about caching: it
didn't need to remember anything. When coming back online, complicated things
happen: transactions get replayed, metadata integrity needs to be rechecked,
etc. Implementing a persistent cache that is crash safe is incredibly
difficult but not uncommon: auto-tiering [4] solutions like Fusion Drive [5]
have to provide this kind of safety.
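One common way to get that kind of safety for a persisted cache index is checksum-plus-atomic-rename. This is a sketch of the general technique only (the function names and JSON format are invented here, and this is not how the actual L2ARC patch stores its metadata): a crash mid-write leaves either the old valid file or the new valid file, and a torn or stale index is detected and discarded rather than trusted:

```python
import json
import os
import tempfile
import zlib

def save_index(path, index):
    """Persist a cache index atomically: serialize, prepend a CRC32
    checksum, write to a temp file, fsync, then rename over the old
    copy. os.replace() is atomic on POSIX filesystems."""
    payload = json.dumps(index).encode()
    blob = zlib.crc32(payload).to_bytes(4, "big") + payload
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(blob)
            f.flush()
            os.fsync(f.fileno())   # data on stable storage before rename
        os.replace(tmp, path)      # atomic swap: old or new, never torn
    except BaseException:
        os.unlink(tmp)
        raise

def load_index(path):
    """Reload the index, treating a missing file or a bad checksum as
    'no cache' -- a persistent cache must be able to start cold rather
    than trust corrupt data."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except FileNotFoundError:
        return {}
    stored, payload = blob[:4], blob[4:]
    if zlib.crc32(payload).to_bytes(4, "big") != stored:
        return {}                  # corrupt: fall back to an empty cache
    return json.loads(payload)
```

The key design choice is that corruption is never an error here, only a cache miss: worst case you rebuild the cache from scratch, which is exactly the pre-persistence behavior.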

[1]
[http://en.wikipedia.org/wiki/Write_Anywhere_File_Layout](http://en.wikipedia.org/wiki/Write_Anywhere_File_Layout)

[2]
[http://en.wikipedia.org/wiki/Adaptive_replacement_cache](http://en.wikipedia.org/wiki/Adaptive_replacement_cache)

[3]
[http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=6996...](http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=6996676)

[4]
[http://en.wikipedia.org/wiki/Automated_Tiered_Storage](http://en.wikipedia.org/wiki/Automated_Tiered_Storage)

[5]
[http://en.wikipedia.org/wiki/Fusion_Drive](http://en.wikipedia.org/wiki/Fusion_Drive)

~~~
gnoway
A modern WAFL _clone_? I've never read that before. The article you link to
doesn't assert that either. Can you provide more information?

~~~
cokernel_hacker
The original ZFS paper [1] references WAFL with respect to its similarity a
number of times. It seems the biggest distinction the paper claimed was that
ZFS had pooled storage while WAFL was network oriented.

WAFL's biggest idea of the day was "write-anywhere" (the WA in WAFL). Write-
anywhere is another way of phrasing copy-on-write which is a fancy way of
saying _never overwrite_.

The idea, while simple, can be built upon to yield features like cheap
snapshots and reasonable data integrity.
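A toy sketch of the never-overwrite idea (names and structure invented here; real WAFL/ZFS work with block trees and on-disk pointers, not Python dicts): every update appends a new block and hands back a new root pointer, while old roots keep referencing the old blocks, which is why snapshots fall out almost for free:

```python
class CowStore:
    """Toy 'write-anywhere' store: blocks are never overwritten in
    place. An update copies the current mapping, appends it as a new
    block, and returns a new root pointer. Any old root still resolves
    against the old blocks, so a snapshot is just a saved root."""

    def __init__(self):
        self.blocks = []                 # append-only block area

    def write(self, root, key, value):
        mapping = dict(self.blocks[root]) if root is not None else {}
        mapping[key] = value
        self.blocks.append(mapping)      # new block; nothing overwritten
        return len(self.blocks) - 1      # new root pointer

    def read(self, root, key):
        return self.blocks[root][key]
```

Holding on to an old root after a later write still reads the old value, which is the "cheap snapshots" property; data integrity follows similarly, since a crash mid-update can only leave an unreferenced new block, never a half-overwritten old one.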

Perhaps "clone" is a bit too much but the similarity is definitely there.

FWIW, the NetApp folks sued Oracle because they also thought it looked similar
[2].

[1]
[http://users.soe.ucsc.edu/~scott/courses/Fall04/221/zfs_over...](http://users.soe.ucsc.edu/~scott/courses/Fall04/221/zfs_overview.pdf)

[2]
[http://www.netapp.com/us/company/news/press-releases/news-re...](http://www.netapp.com/us/company/news/press-releases/news-rel-20100909-oracle-settlement.aspx)

~~~
gnoway
That's correct, and Oracle (actually Sun at that point) sued back. The cases
were both dismissed w/o prejudice [1]. Of course, that doesn't mean either
case was without merit.

[1]
[http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_di...](http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/)

