Wow, that sounds bad and I don't remember hearing about it. Do you have any pointers to bug reports or descriptions of the problem?
HDFS uses chain replication, so I would have expected that by the time the client got acknowledgement of a write, it would already be acknowledged by all replicas (3 by default). So even if there's a bug causing one of the nodes to go down without fsyncing, there shouldn't be any actual data loss.
I think the client considers the data written once dfs.namenode.replication.min replicas have been written, and I think that defaults to 1.
What 'written' actually means inside HDFS, I'm not sure - I'd assume flushed to the OS page cache at a minimum, and I'd hope for an fsync.
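For what it's worth, the Hadoop client API does expose both durability levels explicitly. Here's a minimal sketch of the client-side view (the file path and the property value are just illustrative, not a recommendation). As far as I know, a plain write + close gives you neither guarantee automatically; the writer has to ask for it:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDurabilitySketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The NameNode treats a block as sufficiently replicated once it has
            // dfs.namenode.replication.min replicas (default 1), even though the
            // write pipeline streams the data through all replicas.
            conf.setInt("dfs.namenode.replication.min", 1);

            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/tmp/durability-demo"))) {
                out.writeBytes("some payload\n");

                // hflush(): data has reached every DataNode in the pipeline and is
                // visible to new readers, but may still sit in the OS page cache.
                out.hflush();

                // hsync(): additionally asks the DataNodes to sync the block file
                // to disk -- the stronger, fsync-like guarantee.
                out.hsync();
            }
        }
    }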
>>> OK, it's not simply that a node dies, but that disks on a node are replaced (which might sort of be related to a node dying).
TSB 2015-51: Replacing DataNode Disks or Manually changing the Storage IDs of Volumes in a Cluster may result in Data Loss
Information
Purpose
Updated: 4/22/2015
In CDH 4, DataNodes are identified in HDFS with a single unique identifier. Beginning with CDH 5, every individual disk in a DataNode is assigned a unique identifier as well.
A bug discovered in HDFS, HDFS-7960, can result in the NameNode improperly accounting for DataNode storages for which Storage IDs have changed. A Storage ID changes whenever a disk on a DataNode is replaced, or if the Storage ID is manually manipulated. Either of these scenarios causes the NameNode to double-count block replicas, incorrectly determine that a block is over-replicated, and remove those replicas permanently from those DataNodes.
A related bug, HDFS-7575, results in a failure to create unique IDs for each disk within the DataNodes during upgrade from CDH 4 to CDH 5.
Instead, all disks within a single DataNode are assigned the same ID.
This bug by itself negatively impacts proper function of the HDFS balancer.
Cloudera Release Notes originally stated that manually changing the Storage IDs of the DataNodes was a valid workaround for HDFS-7575.
However, doing so can result in irrecoverable data loss due to HDFS-7960, and the release notes have been corrected.
Users affected:
Any cluster where Storage IDs change can be affected by HDFS-7960.
Storage IDs change whenever a disk is replaced, or when Storage IDs are manually manipulated. Only clusters upgraded from CDH 4 or earlier releases are affected by HDFS-7575.
Symptoms
If data loss has occurred, the NameNode reports “missing blocks” on the NameNode Web UI.
You can determine to which files the missing blocks belong by using FSCK. You can also search for NameNode log lines like the following, which indicate that a Storage ID has changed and data loss may have occurred:
2015-03-21 06:48:02,556 WARN BlockStateChange: BLOCK* addStoredBlock:
Redundant addStoredBlock request received for blk_8271694345820118657_530878393 on 10.11.12.13:1004 size 6098
Impact:
The replacement of DataNode disks, or manual manipulation of DataNode Storage IDs, can result in irrecoverable data loss. Additionally, due to HDFS-7575, the HDFS Balancer will not function properly.
Do not manually manipulate Storage IDs on DataNode disks. Additionally, do not replace failed DataNode disks when running any of the affected CDH versions.
Upgrade to CDH 5.4.0, 5.3.3, 5.2.5, 5.1.5, or 5.0.6.
See Also/Related Articles
Apache.org Bug
HDFS-7575
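The fsck check described in the Symptoms section of the bulletin can also be done from the Java client API. A rough sketch (the starting path "/" is just an example, and this only lists the affected files - it doesn't tell you which Storage ID changed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class ListFilesWithMissingBlocks {
        public static void main(String[] args) throws Exception {
            try (FileSystem fs = FileSystem.get(new Configuration())) {
                // Same information fsck reports: files that currently have
                // corrupt or missing block replicas.
                RemoteIterator<Path> corrupt = fs.listCorruptFileBlocks(new Path("/"));
                while (corrupt.hasNext()) {
                    System.out.println("Affected file: " + corrupt.next());
                }
            }
        }
    }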
That is a bug in CDH/HDFS, but it was an unintended error and it is now fixed. That's not to diminish the severity of the bug, but you can patch and get the correct behaviour without a performance hit.
That is not comparable to what seems to be the case here with ES & MongoDB, where they deliberately (by design) accept the risk of data loss to boost performance. Most systems allow you to make that trade-off, but an honest system chooses the safe-but-slow configuration by default and has you knowingly opt in to the risks of the faster configuration.
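To make that concrete, here's a hedged illustration using the MongoDB Java driver (connection string and collection names are made up). Both configurations are available to you; the argument is only about which one a system should pick for you by default:

    import com.mongodb.WriteConcern;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class WriteConcernTradeoff {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                // Safe but slower: acknowledged by a majority of replicas and journaled to disk.
                MongoCollection<Document> safe = client.getDatabase("demo")
                        .getCollection("events")
                        .withWriteConcern(WriteConcern.MAJORITY.withJournal(true));

                // Fast but riskier: acknowledged by the primary alone, no wait for the journal.
                MongoCollection<Document> fast = client.getDatabase("demo")
                        .getCollection("events")
                        .withWriteConcern(WriteConcern.W1.withJournal(false));

                safe.insertOne(new Document("kind", "durable"));
                fast.insertOne(new Document("kind", "best-effort"));
            }
        }
    }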
I hope you consider editing your initial post - if you conflate bugs with deliberately unsafe design, we just end up with a race to the bottom of increasingly unsafe but fast behaviour.
Maybe not so funny if your multiply redundant cluster loses data because a single node dies...