More

jeltz · 2025-01-29T16:32:01 1738168321

I tried it a few months ago and it was a buggy mess with strange defalts. Totally unusable on a couple of the projects I tried it on.

jeltz · 2025-01-29T09:35:41 1738143341

How do these differ from SQL functions in PostgreSQL? They look like the same thing.

geysersam · 2025-01-29T23:11:35 1738192295

Yes but they are transparent to the optimizer so the argument against them in the blog article doesn't apply.

jeltz · 2025-01-29T09:27:45 1738142865

I have also tried many and so far none have even managed to be as good as the kludge that is SQL.

jeltz · 2025-01-28T12:37:55 1738067875

While I agree with you this is off-topic. I am happy to not have to see that argument. And you were the wine to bring it here now.

jeltz · 2025-01-21T17:00:54 1737478854

In some European countries water is free. I am from Sweden where all places have free tap water and fancy places often have free sparkling water.

kjkjadksj · 2025-01-21T17:52:51 1737481971

Along the mediterranian seemed like the only place to get free water were the ancient fountains that spittle out a stream. But then you’d have to wait for the inevitable old man to finish washing his head and arm pits in that fountain. Beer was usually substantially cheaper than the water offerings.

jeltz · 2025-01-17T02:51:06 1737082266

No, they are making the possible very late.

distortionfield · 2025-01-17T06:18:44 1737094724

> very late

when was your fully reusable full-flow staged combustion rocket engine scheduled flight, again?

imtringued · 2025-01-17T11:22:31 1737112951

Why does that matter? SpaceX is setting themselves up for failure by insisting that they need to nail re-entry first. Whenever they focused on a test flight for re-entry I'm wondering why they aren't working on more important things like the payload doors or orbital brimming. They will get the re-entry tests for free!

And even if they don't. The upper stage is cheap enough that it can be expended and still be cheaper per flight than Falcon Heavy. So that tells me that the delays are on purpose. Their test flight planning is designed to maximize ego stroking.

jeltz · 2025-01-14T20:52:43 1736887963

Not on most database workloads. There zfs does not scale very well.

ryao · 2025-01-14T21:47:47 1736891267

Percona and many others who benchmarked this properly would disagree with you. Percona found that ext4 and ZFS performed similarly when given identical hardware (with proper tuning of ZFS):

https://www.percona.com/blog/mysql-zfs-performance-update/

In this older comparison where they did not initially tune ZFS properly for the database, they found XFS to perform better, only for ZFS to outperform it when tuning was done and a L2ARC was added:

https://www.percona.com/blog/about-zfs-performance/

This is roughly what others find when they take the time to do proper tuning and benchmarks. ZFS outscales both ext4 and XFS, since it is a multiple block device filesystem that supports tiered storage while ext4 and XFS are single block device filesystems (with the exception of supporting journals on external drives). They need other things to provide them with scaling to multiple block devices and there is no block device level substitute for supporting tiered storage at the filesystem level.

That said, ZFS has a killer feature that ext4 and XFS do not have, which is low cost replication. You can snapshot and send/recv without affecting system performance very much, so even in situations where ZFS is not at the top in every benchmark such as being on equal hardware, it still wins, since the performance penalty of database backups on ext4 and XFS is huge.

LtdJorge · 2025-01-15T03:01:45 1736910105

There is no way that a CoW filesystem with parity calculations or striping is gonna beat XFS on multiple disks, specially on high speed NVMe.

The article provides great insight into optimizing ZFS, but using an EBS volume as store (with pretty poor IOPS) and then giving the NVMe as metadata cache only for ZFS feels like cheating. At the very least, metadata for XFS could have been offloaded to the NVMe too. I bet if we store set XFS with metadata and log to a RAMFS it will beat ZFS :)

ryao · 2025-01-15T04:58:08 1736917088

L2ARC is a cache. Cache is actually part of its full name, which is Level 2 Adaptive Replacement Cache. It is intended to make fast storage devices into extensions of the in memory Adaptative Replacement Cache. L2ARC functions as a victim cache. While L2ARC does cache metadata, it caches data too. You can disable the data caching, but performance typically suffers when you do. While you can put ZFS metadata on a special device if you want, that was not the configuration that Percona evaluated.

If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do. Using a feature ZFS has to improve performance at price point that XFS cannot match is competition, not cheating.

ZFS cleverly uses CoW in a way that eliminates the need for a journal, which is overhead for XFS. CoW also enables ZFS' best advantage over XFS, which is that database backups on ZFS via snapshots and (incremental) send/recv affect system performance minimally where backups on XFS are extremely disruptive to performance. Percona had high praise for database backups on ZFS:

https://www.percona.com/blog/zfs-for-mongodb-backups/

Finally, there were no parity calculations in the configurations that Percona tested. Did you post a preformed opinion without taking the time to actually understand the configurations used in Percona's benchmarks?

LtdJorge · 2025-01-17T17:55:50 1737136550

No I didn't. I separated my thoughts in two paragraphs, the first doesn't have anything to do with the articles, it was just about the general use case for ZFS, which is using it with redundant hardware. I also conflated the L2ARC with metadata device, yes. The point about the second paragraph was that the using a much faster device just on one of the comparisons doesn't seem fair to me. Of course, if you had a 1TB ZFS HDD and 1TB of RAM as ARC the "HDD" would be the fastest on earth, lol.

About the inherent advantages of ZFS like send/recv, I have nothing to say. I know how good they are. It's one reason I use ZFS.

> If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do.

What does proper testing here mean? And what does "if you scale it" mean? Genuinely. From my basic testing and what I've got from online benchmarks, ZFS tends to be a bit slower than XFS in general. Of course, my testing is not thorough because there are many things to tune and documentation is scattered around and sometimes conflicting. What would you say is a configuration where ZFS will beat XFS on flash? I have 4x Intel U.2 drives with 2x P5800X empty as can be, I could test on them right now. I wanna make clear, that I'm not saying it's 100% impossible ZFS beats XFS, just that I find it very unlikely.

Edit: P4800x, actually. The flash disk are D5-P5530.

ryao · 2025-01-17T20:12:22 1737144742

> No I didn't. I separated my thoughts in two paragraphs, the first doesn't have anything to do with the articles, it was just about the general use case for ZFS, which is using it with redundant hardware. I also conflated the L2ARC with metadata device, yes.

That makes sense.

> The point about the second paragraph was that the using a much faster device just on one of the comparisons doesn't seem fair to me. Of course, if you had a 1TB ZFS HDD and 1TB of RAM as ARC the "HDD" would be the fastest on earth, lol.

it is a balancing act. It is a feature ZFS has that XFS does not, but it is ridiculous to use a device that can fit the entire database as L2ARC, since in that case, you can just use that device directly and keeping it as a cache for ZFS does not make for a fair or realistic comparison. Fast devices that can be used with tiered storage are generally too small to be used as main storage, since if you could use them as main storage, you would.

With the caveat that the higher tier should be too small to be used as main storage, you can get a huge boost from being able to use it as cache in tiered storage, and that is why ZFS has L2ARC.

> What does proper testing here mean? And what does "if you scale it" mean?

Let me preface my answer by saying that doing good benchmarks is often hard, so I can't give a simple answer here. However, I can give a long answer.

First, small databases that can fit entirely in RAM cache (be it the database's own userland cache or a kernel cache) are pointless to benchmark. In general, anything can run that well (since it is really running out of RAM as you pointed out). The database needs to be significantly larger than RAM.

Second, when it comes to using tiered storage, the purpose of doing tiering is that the faster tier is either too small or too expensive to use for the entire database. If the database size is small enough that it is inexpensive to use the higher tier for general storage, then a test where ZFS gets the higher tiered storage for use as cache is neither fair nor realistic. Thus, we need to scale the database to a larger size such that the higher tier being only usable as cache is a realistic scenario. This is what I had in mind when I said "if you scale it".

Third, we need to test workloads that are representative of real things. This part is hard and the last time I did it was 2015 (I had previously said 2016, but upon recollection, I realized it was likely 2015). When I did, I used a proprietary workload simulator that was provided by my job. It might have been from SPEC, but I am not sure.

Fourth, we need to tune things properly. I wrote the following documentation years ago describing correct tuning for ZFS:

https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

At the time I wrote that, I omitted that tuning the I/O elevator can also improve performance, since there is no one size fits all advice for how to do it. Here is some documentation for that which someone else wrote:

https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

If you are using SSDs, you could probably just get away with setting each of the maximum asynchronous queue depth limits to something like 64 (or even 256) and benchmark that.

> From my basic testing and what I've got from online benchmarks, ZFS tends to be a bit slower than XFS in general. Of course, my testing is not thorough because there are many things to tune and documentation is scattered around and sometimes conflicting.

In 2015 when I did database benchmarks, ZFS and XFS were given equal hardware. The hardware was a fairly beefy EC2 instance with 4x high end SSDs. MD RAID 0 was used under XFS while ZFS was given the devices in what was effectively a RAID 0 configuration. With proper tuning (what I described earlier in this reply), I was able to achieve 85% of XFS performance in that configuration. This was considered a win due to the previously stated reason of performance under database backups. ZFS has since had performance improvements done, which would probably narrow the gap. It now uses B-Trees internally to do operations faster and also now has redundant_metadata=most, which was added for database workloads.

Anyway, on equal hardware in a general performance comparison, I would expect ZFS to lose to XFS, but not by much. ZFS' ability to use tiered storage and do low overhead backups is what would put it ahead.

> What would you say is a configuration where ZFS will beat XFS on flash? I have 4x Intel U.2 drives with 2x P5800X empty as can be, I could test on them right now. I wanna make clear, that I'm not saying it's 100% impossible ZFS beats XFS, just that I find it very unlikely.

You need to have a database whose size is so big that optane storage is not practical to use for main storage. Then you need to setup ZFS with Optane storage as L2ARC. You can give regular flash drives to ZFS and XFS on MD RAID in a comparable configuration (RAID 0 to make life easier, although in practice you probably want to use RAID 10). You will want to follow best practices for tuning the database and filesystems (although from what I know, XFS has remarkably few knobs). You could give XFS the optane devices to use for metadata and its journal for fairness, although I do not expect it to help XFS enough. In this situation, ZFS should win on performance.

You would need to pick a database for this. One option would be PostgreSQL, which is probably the main open source database that people would scale to such levels. The pgbench tool likely could be used for benchmarking.

https://www.postgresql.org/docs/current/pgbench.html

You would need to pick a scaling factor that will make the database big enough and do a workload simulating a large number of clients (what is large is open to interpretation).

Finally, I probably should add that the default script used by pgbench probably is not very realistic for a database workload. A real database will have a good proportion of reads from select queries (at least 50%) while the script that is being used does a write mostly workload. It probably should be changed. How is probably an exercise best left for a reader. That is not the answer you probably want to hear, but I did say earlier in this reply that doing proper benchmarks is hard, and I do not know offhand how to adjust the script to be more representative of real workloads. That said, there is definite utility in benchmarking write mostly workloads too, although that utility is probably more applicable for the database developers than as a way to determine which of two filesystems is better for running the database.

LtdJorge · 2025-01-19T20:53:13 1737319993

Thanks for the long post. Sorry for the nerd snipe, it might or might not have been intentional :D

I agree with what you said. I'll test what you provided, first with fio and then with Postgres (was also my choice beforehand) with a TPC-E benchmark. If I remember, I'll let you know. Postgres on ZFS is specially difficult to be sure about just from theory around the internet, there's too much contradiction or outdated info.

menaerus · 2025-01-15T08:32:15 1736929935

Refuting the "it doesn't scale" argument with a data from a blog that showcases a single workload (TPC-C) with 200G+10tables dataset (small to medium) at 2vCPU (wtf) machine with 16 connections (no thread pool so overprovisioned) is not quite a definition of a scale at all. It's a lost experiment if anything.

ryao · 2025-01-15T17:27:01 1736962021

The guy did not have any data to justify his claims of not scaling. Percona’s data says otherwise. If you don’t like how they got their data, then I advise you to do your own benchmarks.

jeltz · 2025-01-15T20:07:55 1736971675

It is based on data from internal benchmarks. Zfs is fine for database workloads but scales worse than Xfs based on my personal experience. It is unpublished benchmarks and I do not have access to any farm to win a discussion on the internet.

ryao · 2025-01-15T20:20:58 1736972458

I did internal benchmarks at ClusterHQ in 2016. Those benchmarks showed that a tuned ZFS FS of the time had 85% the performance of XFS on equal hardware (a beefy EC2 instance with 4 SSDs, with XFS using MD RAID 0), but it was considered a win for ZFS because of the performance difference when running backups. L2ARC was not considered since the underlying storage was already SSD based and there was nothing faster, but in practice, you often can use it with a faster tier of storage and that puts ZFS ahead even without considering the substantial performance dips of backups.

menaerus · 2025-01-15T18:31:21 1736965881

I don't have anything to like or not to like. I'm not a user of ZFS filesystem. I'm just dismissing your invalid argumentation. Percona's data is nothing about the scale for reasons I already mentioned.

ryao · 2025-01-15T20:17:21 1736972241

The argument he made was invalid without data to back it up. I at least cited something. The remarks on the performance when backups are made and the benefits of L2ARC were really the most important points, and are far from invalid.

jeltz · 2025-01-14T20:49:29 1736887769

It is possible to corrupt the file system from user space as a normal user with Btrfs. The PostgreSQL devs found that when working on async IO. And as fer as I know that issue has not been fixed.

https://www.postgresql.org/message-id/CA%2BhUKGL-sZrfwcdme8j...

curt15 · 2025-01-14T22:11:11 1736892671

LMDB users also unearthed a btrfs data corruption bug last year: https://bugzilla.redhat.com/show_bug.cgi?id=2169947

jeltz · 2025-01-13T13:29:39 1736774979

There have been bigger increases but most likely not faster ones though I am not sure how far back we have reliable data for rate of increase.

And even if so this rapid rate of increase is still not a good thing. It will require expensive adaptations and lead to extinctions.

aeonik · 2025-01-13T13:42:14 1736775734

It's possible there were higher spikes in the past, but I would guess that they were all pretty dramatic events, and yes the earth used to be much hotter in the past.

This is obvious if you think about it, all that carbon in the ground had to be in the air before photosynthetic life.

And before any life, the Earth was a molten ball of liquid rock.

https://en.wikipedia.org/wiki/Paleoclimatology

jeltz · 2025-01-11T17:12:53 1736615573

They suggest that based on their experience. Error handling in bash is awful so I advice against using it for anything complex.

superq · 2025-01-11T22:04:44 1736633084

You can go a really really long way, with a script that will work everywhere, with just set -e and aborting at the first error.

Better yet, 'bash unofficial strict mode':

    set -euo pipefail
    IFS=$'\n\t'

shawn_w · 2025-01-12T06:08:12 1736662092

That has lots of issues. See https://mywiki.wooledge.org/BashPitfalls#set_-euo_pipefail for some.

citrin_ru · 2025-01-13T11:47:00 1736768820

One can made arguments both for and against `set -e`.

Over may carrier I've seen many shell scripts:

The most common case is no `-e` and no explicit error/status checks for most commands. It's unreliable and such scripts is one of reasons why shell got it's bad reputation (and criticized by coders using languages where an unhanded exception would terminate a program).

In my own scripts I often use `-e` (but not always as it is a tradeoff). If you unfamiliar with shell `-e` also could cause problems. But `-e` is not magic which makes a script better, one have to know how it works.

The option author of this article advocates (no `-e` but explicitly check every command which could fail) is the least common IMHO. Unless the scripts calls just 1-2 commands it's tedious and the resulting script is harder to read. May be you can find such style in some opensource projects but I've never seen such scripts at work.

sgarland · 2025-01-12T15:29:36 1736695776

It does, yes, but IME if you can get people who aren’t familiar with shell to use it, it prevents more problems than it solves on the whole.

Then again, so does enforcing passing shellcheck in CI.

irunmyownemail · 2025-01-11T17:39:02 1736617142

Anything complex should be written in a competent language like Java. Script languages (like Bash and Python) are for short (a few lines long) scripts. Using the tool outside the scope of what it was designed for is not a good idea.

teddyh · 2025-01-11T17:57:12 1736618232

Tell me you have never seriously used Python without telling me you have never seriously used Python.

I mean, viewing Python strictly as a scripting language? I am honestly lost for words. There are many huge and major applications and web sites written in Python, without people regretting it after the fact. And yet here you are dismissing it out of hand without a single argument.

Izkata · 2025-01-11T18:42:54 1736620974

Meanwhile most of the time topics like this come up and people hate on shell scripts, those of us that like them see those criticisms the same way you're looking at this comment about python: So far out there it's almost not worth responding. I think that's why GGP and GGGP think "greybeards" don't think it's worthwhile based on experience - it's actually not worth arguing against misinformed comments so newer people don't realize it's still heavily used, just quietly in the background for things it's actually good at.

Further down is a comment about that: https://news.ycombinator.com/item?id=42664939

mturmon · 2025-01-11T23:08:50 1736636930

> …not worth arguing against misinformed comments …

Yeah, I have these same response patterns. Shell works really well for some use cases. I generally don’t respond to the comments that list the various “footguns” of shell, or that complain about security holes, etc. My use cases are not sensitive to these concerns, and even besides this, I find the concerns overstated.

dingnuts · 2025-01-11T20:21:49 1736626909

Don't be too rude, this is a common view among people who are technically adjacent but not engineers, like IT people. It's an incorrect superstition, of course, but in tech almost everybody has their superstitions. There's no reason to be rude -- ignorance is not a crime.

chillpenguin · 2025-01-11T21:03:27 1736629407

I see that kind of thing all the time. Usually it is about static types. People think that dynamic languages aren't "serious", or something. It is laughable that these people still make up a significant amount of comments, here in 2024.

shakna · 2025-01-12T06:01:57 1736661717

Using a tool beyond its design can be problematic.

But Python is not designed to only be a scripting language:

> What is Python?

> Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and classes. It supports multiple programming paradigms beyond object-oriented programming, such as procedural and functional programming. Python combines remarkable power with very clear syntax. It has interfaces to many system calls and libraries, as well as to various window systems, and is extensible in C or C++. It is also usable as an extension language for applications that need a programmable interface. Finally, Python is portable: it runs on many Unix variants including Linux and macOS, and on Windows.

dotancohen · 2025-01-11T22:56:31 1736636191

When my scripts outgrow bash, they almost always wind up in Python.

That said, Sonnet 3.5 had gotten me much further in bash than was possible before - and it's all really maintainable too. I highly suggest consulting with Sonnet on your longer scripts, even just asking it "what would you suggest to improve".

TypingOutBugs · 2025-01-11T18:06:11 1736618771

Why shouldn't you use Python for larger projects, and why do so many startups succeed with large Python repos?