Is it better to use cat, dd, pv or another procedure to copy a CD/DVD? (2015) (stackexchange.com)



The prospect of using dd for imaging terrifies me to this day, primarily because you still hear horror stories of people trying to write a Linux installer to their flash drive to try out, and accidentally overwriting their main drive instead because they typed /dev/sda instead of /dev/sdb.

I don't know why dd remains the standard for so many tutorials when there are competent GUI tools that can take all the guesswork out of it. It's like a perpetual hazing ritual for new Linux users.


> I don't know why dd remains the standard for so many tutorials

You can also do something like

    dd if=file.iso of=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_120GB
where the last part is basically the brand and name of your USB stick; not much guesswork required. Though I've never seen that in any tutorials.
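A quick way to see what's available is something like this (the device names and serials below are made up, but the shape is typical):

    $ ls -l /dev/disk/by-id/ | grep usb
    lrwxrwxrwx 1 root root  9 ... usb-SanDisk_Ultra_0123456789ABCDEF-0:0 -> ../../sdb
    lrwxrwxrwx 1 root root 10 ... usb-SanDisk_Ultra_0123456789ABCDEF-0:0-part1 -> ../../sdb1

The symlink target on the right also tells you which /dev/sdX it currently maps to, which is a nice sanity check.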


You're not alone. :) The installer for the USB stick Debian Live-based distro I made years ago will only install to a path starting with `/dev/disk/by-id/usb-` (or `/dev/loop`).

I've also been doing paths through `/dev/disk/by-id/` in the ops documentation and scripts for my startup.

It's a tiny convention for us to promote, and one of countless good practices we exercise, but this particular one can save much misery at negligible cost.

And early, precarious startups are one of the contexts in which some of these tiny negligible-cost good practices seem to pay off especially well: in an early startup, it's easy for me to imagine a data loss, bad downtime, or missed opportunity ending the company (when a more-established company might be able to weather a mess-up better).


It's really unfortunate how tutorials still spell out all the basics to keep the beginners with them, but then rely on the old-fashioned device names when it comes to writing images. Using sda is fine if you're experienced and know what drive has what name, but technical references should stop relying on them already. It's not like you're getting some useful cross-platform compatibility out of old-fashioned commands or anything; every modern Unix has its own way to specify disk names.


I think the simplest explanation is that the authors themselves never considered the other options, or weren't even aware of them -- they are simply sharing some historical knowledge they had. Very kind and good behavior! But I would imagine it's something rote and probably just a shortcoming in collective knowledge.

In short, I think ignorance is the simpler answer ;)


Yep, every couple of years I need to use dd. Never knew of an alternative to sda.


When I was first learning Linux it was pretty helpful to look at a few random things in /dev and /bin and /etc and learn what they did, whenever I had a few spare minutes.


ZFS has opened my eyes to /dev/disk/by-id/. It's the only way when you have many of the exact same make/model/size drives per machine. You will learn by fire the first time you need to determine which drive you need to replace.


>Using sda is fine if you're experienced and know what drive has what name

Not even then. The chance of making an occasional mistake, while low, is still clearly much higher than when you are selecting the disk by its name. Being experienced doesn't excuse you from the responsibility of using a less risky tool.


To avoid mistakes I always do “lsblk” (on Linux) or “diskutil list” (on macOS) or “gpart show” (on FreeBSD) and carefully read the output before I do the dd to something.

Usually on Linux I only use /dev/disk/by-id/foo in scripts and config files. But using it when doing routine stuff with dd is a pretty good idea too.

It’s too bad that macOS and FreeBSD don’t have anything similar, to my knowledge. And since I use both of these operating systems so much, and so often for things involving dd, I think in my case I don’t gain much from doing by-id for dd on Linux, unfortunately.


>To avoid mistakes I always do “lsblk”

Even this is a faulty solution. By default, it only lists the devices with their "/dev/sd*" names.


I learned of this from this (rather long and detailed) article:

https://perfectmediaserver.com/

It's full of other great nuts-and-bolts advice.

I feel it's important to mention that you want by-id, as there are five variants of /dev/disk/by-*:

  /dev/disk/by-id
  /dev/disk/by-partlabel
  /dev/disk/by-partuuid
  /dev/disk/by-path
  /dev/disk/by-uuid


What I do to find out which is which is use lsscsi:

    $ lsscsi
    [0:0:0:0]    disk    ATA      ST2000DX001-1NS1 CC41  /dev/sda 
    [3:0:0:0]    cd/dvd  ASUS     SH-224FB         1.00  /dev/sr0 
    [4:0:0:0]    disk    Generic- Multiple Reader  1.11  /dev/sdb


Unfortunately you have to do this again after each reboot, or whenever you plug in a USB stick, so the /dev/disk/by-* option does help when you do that repeatedly for the same device.


Excellent suggestion. It also includes the serial number for non-removable devices, like so:

    /dev/disk/by-id/ata-Samsung_SSD_860_PRO_1TB_S42NNF0K123456N -> ../../sdc


TIL! Thank you so much for this. I didn't know.


What happens if you have two?


Lifehack of the month! So obvious and so little used.


I love this tip! I always get sweaty armpits whenever I'm using dd and I check the of parameter several times just to be sure. This will save me from a lot of pit stains.


The “simple GUI” tool I keep seeing pushed on Maker forums is Etcher. I needed to flash something from my work PC so I gave it a shot: 90MB of Electron bloat, while offering no extra functionality over dd. Plus, dd doesn’t advertise to me during the flash, and certainly doesn’t phone home to balena.

For folks just getting started or more comfortable with a GUI, I’d recommend giving USBimager[0] a look. It does exactly what you’d expect based on the name, it’s performant, and they have native apps. No affiliation, just a fan of a KISS app done right.

[0] https://gitlab.com/bztsrc/usbimager


I can also recommend Fedora Media Writer[0]. It can, despite the name, install other images too, and it's fairly intuitive.

[0] https://flathub.org/apps/details/org.fedoraproject.MediaWrit...


There's also Rufus, sadly it's Windows-only.


WoeUSB is pretty decent on Linux, and does the same UEFI-NTFS magic that Rufus does, which you'll need if you're burning a large ISO like recent Windows ones.

Though if it's just for ISOs, ventoy is fantastic, just drag and drop the file and no burning at all :)


> that you'll need if you're burning a large ISO like recent Windows ones.

I ran into this very issue when trying to make a bootable Windows 10 USB on macOS. No amount of fiddling with dd, unetbootin or Etcher resulted in a bootable USB. Despite being principled about it, I had to admit defeat and just pulled an old <4GB Windows 10 iso and flashed that to the stick.

I know I could have installed Linux through a virtual machine and got it done that way, but that seemed horrible overkill. Oh well.


Double click iso to mount it, format usb as fat32 or exfat, copy and paste contents of mounted iso to usb drive

This has always worked for me to boot UEFI installers


The issue is fat32 only supports files that are up to 4GB in size.

Recent Windows ISOs have a file that is >4GB, so you can't have the partition formatted as FAT32.

exfat isn't compatible with uefi.


Yeah, Win10_20H2_English_x64.iso is UDF these days, which doesn't have ISO 9660's ~4GiB file size limitation (basically the same limit as FAT32's).

These are the two largest files in that image.

    5.0G    ./sources/install.wim
    534M    ./sources/boot.wim
I wonder how Microsoft's media creation tool deals with this...


You can do it by hand with the dism tool like so,

    Dism /Split-Image /ImageFile:C:\folder_name\sources\install.wim /SWMFile:C:\folder_name\sources\install.swm /FileSize:3800
I copied that from a ZDnet article with more details. Wasn't sure about posting the URL but searching will find it.


The esd file that the media creation tool downloads and then immediately deletes after creating the USB stick, has a smaller version of this file.

    4.0G ./sources/install.esd  
    381M ./sources/boot.wim


You are right about exFAT, but what about NTFS? I suppose the issue there, though, is macOS Disk Utility not supporting NTFS out of the box. Honestly, the other suggestion to use Ventoy is probably the best option. Such a great utility.


As far as I know motherboard manufacturers have the option of implementing NTFS support in their UEFI, but it's nowhere near ubiquitous.


Etcher is extra bad on macOS. Populating the file chooser dialog took minutes. I have no idea how it could be this wrong.


While we're all making recommendations, I really like Popsicle[0]. Does what it says, and nothing more.

[0] https://github.com/pop-os/popsicle


> The prospect of using dd for imaging terrifies me to this day, primarily because you still hear horror stories of people trying to write a Linux installer to their flash drive to try out, and accidentally overwriting their main drive instead because they mangled /dev/sda instead of /dev/sdb.

Wouldn't one (at least partial) solution here be that the kernel should refuse to let you write directly to a block device in use by a mounted filesystem? (Maybe combined with some special ioctl/whatever to bypass that restriction if you ever really need to.) Then, if you are running dd from an OS running on your main drive, the kernel will refuse to let dd overwrite the main drive, but will let it overwrite the flash drive (which presumably is not mounted, and anyway shouldn't be if you are about to overwrite it).


There are two major problems with this: partitions/device-mapper and backwards compatibility. First, partitions inherently overlap the parent block device. You would need to carefully track exactly which portions of the device are in use, or your solution is useless (it doesn't protect /dev/sda when /dev/sda1 is in use) or blocks the vast majority of use cases (/dev/sda2 cannot be used when /dev/sda1 is in use). The same applies for device-mapper, but worse. Secondly, the Linux kernel has a very strong backwards compatibility guarantee. Any change that would break valid (or even invalid) uses will be loudly rejected by Linus. With your idea, many disk management programs will be broken.


You could, though, probably make a wrapper that only kicks in when dd is called from the command line (as opposed to, say, from a script or some GUI program) and asks for a confirmation ("Do you really want to overwrite your main drive?"), without breaking backward compatibility?

I have written a few wrappers like that on my own system to prevent me from making a few common mistakes of mine (like scp'ing a file locally to a filename resembling an IP address instead of to a remote server ;)


sure, but it'd be less confusing to just make it a new command. you could add features like invoking lsblk with a sensible set of flags.


Except the whole point of the wrapper is to protect people who are ignorant of alternatives from overwriting their main system drive by carelessly typing /dev/sda when they mean /dev/sdb on their local system. So making a new command and socializing it doesn't actually improve the situation at all.


When it's a new command it doesn't solve the problem of newbies and 3 decades of old tutorials, which was the comment that started this thread.


> partitions inherently overlap the parent block device. You would need to carefully track exactly which portions of the device are in use, or your solution is useless (it doesn't protect /dev/sda when /dev/sda1 is in use)

I think this could be addressed by the idea of a "parent block device". So if /dev/sda1 is mounted, then its parent /dev/sda would be classified as mounted, but /dev/sda2 would not be (assuming there is no partition mounted there).

I'm sure one could work something out that would work for device-mapper, LVM, etc as well

> Secondly, the Linux kernel has a very strong backwards compatibility guarantee

What about a sysctl knob? Turn it on, you get this new behaviour, turn it off, you get the backwards compatible behaviour. Each distribution can decide what to default it to. If it defaults to off in Linus' tree, that should satisfy his backwards compatibility concerns.

> With your idea, many disk management programs will be broken.

There would need to be some escape hatch, e.g. an ioctl, to allow unsafe writes. And disk management programs would have to be patched to invoke that escape hatch. A distribution wouldn't ship the sysctl as defaulting to on until it had patched all the disk management programs in that distribution. And, if you download a third-party tool, either its developers have patched it to use that ioctl, or else you can just temporarily turn off the sysctl knob while you use it.


> I don't know why dd remains the standard for so many tutorials

While dd isn't the best tool for writing images to devices (if only because of its arcane and bizarre command-line syntax), it is a valuable tool when you want to recover data from media that has errors. The 'conv=noerror' option is a lifesaver for when you want to recover something from your media.

But overall I agree: plenty of people recommend using dd just because "it's always been done that way" and often also because it makes them look smart :-)


It’s also nearly guaranteed to be available on the user’s system. Are there other tools to do this? Yes, in fact too many, to the point where none of them are the clear leader.

I hate tutorials that start out with “I’m gonna show you how to do X. I like to use Y, but since the Y tool isn’t installed, we have to get it. Start by editing sources.list. You might need to be root to do this. Here is how you do that.” Halfway through the tutorial we’re still Yak shaving.


ddrescue can also help with that use case: https://www.gnu.org/software/ddrescue/


I always used ddrescue like this. Even on badly scratched discs it was able to recover most of the data.

    ddrescue -b 2048 -d -r 3 -R -v /dev/sr0 image.iso image.log


I used to use 2048; that's probably the wrong value to use. You want to use something that corresponds to the ECC block size of the underlying media. But yes, I do that as well (came to say mostly the same content, but I'd do -r -1).


Is the ECC available to the OS? I thought the drive handled the ECC and just reported errors?


If you're trying to read with a smaller block size than the media's native block size, then you'll make two or more attempts to read data from each corrupted block that's unrecoverable—making your recovery process much longer.

If you try to read with a larger block size than the media's native block size, you'll get errors for chunks of that size even when part of the data may have been recoverable.

The above is true whether or not the OS has access to the media's raw ECC data.


It's not just that. Reading scratched blocks means sometimes it might work, sometimes it might not. Let's say 1% of the time you get a good read. If the ECC block is 16k and you always read in 16k blocks, then when you hit that 1% magic time that the read succeeds, you've got the whole block. If, however, you read 2k blocks, you need to have the same luck 8 times.

From experience of recovering bad DVDs and BRs, there were discs I had to pass through 100+ times to get a valid read on all blocks.

This is also because the underlying hardware will always do a full ECC block read (the only way for it to determine that it read the block correctly is to read the whole block and verify it), so any smaller reads are pointless.


Just in case you didn't read my response below: it's not about the ECC being available to the OS; the OS doesn't need to see it.

The way optical media works is that the drive reads bits from the disc in ECC-block-sized chunks, then verifies/fixes the block and passes it back to the OS if it has a valid block, otherwise returns an error to the OS. Hence the logic that I describe below.

Optical media is an unreliable medium in general and hence depends on the ECC codes to ensure blocks are read correctly, and they are used a lot. Back in the day of CD and DVD burning there were fancier burners that provided APIs for reading the error-correction stats into user space (i.e. how many of the different types of errors were corrected); I don't know if they still exist. It was never 0 across the board, but that's how the medium was designed: not to require that it be 0 across the board.


What I guess I don’t understand is: why does the OS need to know the internal ECC block size if it doesn’t even see the ECC (or even know of its existence)? When I want a sector from my hard drive, I ask for 4096 bytes, not 4096+$ECC.[0] If I asked for 4096+$ECC, it would actually give me 4096 from the sector I requested, and $ECC from the next one. So why ask for 2048+$ECC and not just 2048?

As you said, the nature of the medium requires ECC (side note: modern hard drives do too). So if I ask for a 2048 byte sector, the drive has to read the ECC. So why ask for more than that? It already knows the sector boundaries. In other words, if I tell `dd` to use a block size of 2048+$ECC, won’t that actually work a sector and a half (well, 1 + $ECC/2048) at a time?

[0]: In fact, unlike CDs, I don’t even know or have any way of finding out how many ECC “bytes” there are in my hard drives' sectors


Also, you don't ask for 2048+ECC (or +1); the ECC is not visible to the OS or the user.

You just care about the size of the data the ECC is protecting. If the ECC protects 16K or 32K of data, you want to read on those physical boundaries, as then you'll read a whole ECC block and it will either pass or fail. If it passes, you never have to try to read that ECC-protected block again (and maybe fail).

Of course, there is one hitch to my scheme: ensuring that you always read on ECC-protected block boundaries. I'm pretty sure that if you tell ddrescue to always read the right block size it will, but I'm not 100% sure (why not? perhaps the ECC protects data not visible to the end user in some way; say the first block is only 8kb, not 16kb, in practice).

On the issue of hard drives, there is a lot more going on that puts you at the mercy of the firmware (relocatable sectors and the like). I did lose a RAID5 once (one drive totally died, and then during rebuild another drive threw an error), and I was able to use ddrescue to recover all but a 4K block of it. As I was using a 128KB stripe size, that meant I probably lost somewhere between half a MB and a MB of data, depending on whether the 4K was contained within a single stripe or not (probably it was). I was content with that. I never did discover what data, if any, was corrupted, but I was able to recover the RAID5.


Because the OS didn't cache the full ECC block (it views the block size as 2048 or 4096 bytes), and with scratched media two reads of the same ECC block aren't necessarily both going to succeed.

Simplistic case: imagine we have 1 ECC block of 16k, but we read at 2k, so we'll number the 2k blocks 0-7:

    T0 - read block 0, fails
    T1 - read block 1, succeeds!
    T2 - read block 2, fails
    T3-T7 - repeat for blocks 3-7, all fail

In practice, if we had read a 16k block at T1, we would be golden (and finished). Instead we did 8 steps and only got 1/8 of the data.

This is because the OS doesn't have a concept of the hardware's ECC block size; the optical hardware in a sense virtualizes it, and the OS will just keep on rereading the same ECC block on the media and possibly continue to get errors.


I'd also add that -R (reverse) doesn't make sense to me for optical media; it really would slow down recovery compared to the default direction, I'd think.


It's been so long since I regularly used optical discs (for anything other than the rare rented DVD) that I now assume by default that I'll need ddrescue because the discs are all old.


>its arcane and bizarre command-line syntax

I actually like dd's command-line syntax the most out of the Linux coreutils, and I wish more programs used a similar "key=value" argument system. It's pretty easy to remember too, since there's really only a few keys you need to remember to do 99% of what dd typically gets used for.


> arcane and bizarre command-line syntax

It is not arcane, you just have to read docs, like for every command line tool. Different cli applications have different syntaxes usually influenced by their domains, eg compare find, tcpdump, iptables, cut, docker.


Both dd's name and its command line syntax are intentionally arcane and bizarre, and originally meant as a joke ;)

The tool was originally meant to do various conversions of data formats on 8-track tapes, and both its name and syntax are a reference to the (arguably less bizarre) syntax used by IBM's JCL to produce contents of tapes that need that kind of conversion to be usable on Unix.


The "originally meant as a joke" belief is derived from an Eric Raymond quote in the Jargon File, not the tool makers.


Define arcane. You can type "man dd" and acquire the required knowledge, it is not hidden. Whatever the intention was the syntax is simple and pleasant to use, at least for me ;)


It’s not arcane in the sense of “unknown”, but arcane in the sense of “bizarre” or “why is it like this?” Almost every other Unix/Linux tool would use `--if {if}` not `if={if}`.


No, it is arcane and bizarre. No other command line tool uses dd's syntax.


I really love when I can learn a tool and use it basically forever. In the case of DD, I have been using it since the 90s, and my printed URM from April 1985 lists dd with syntax that is still relevant today. So even if it doesn't fit the normal mold, I only had to learn it once.


Example?

I used dd a lot to burn images on USB. And that command is simple as expected.


I can't provide an example of something which doesn't exist, i.e. another program with dd's syntax.

I didn't say it wasn't simple. But it is very weird.


I think what people see as weird is `dd bs=1024 if=/dev/sr0 of=/home/jason/sr0.iso`, instead of `dd --block-size 1024 --in-file /dev/sr0 --out-file /home/jason/sr0.iso`. It's a bit non-standard, but it's really not bad.


What's weird about a syntax as straightforward as 'command name' 'input file' 'output file'? How exactly would you do it differently?


If it were `dd input output` or `dd input -o output` no one would be claiming it's unique and bizarre. However the actual syntax is `dd if=input of=output` which is unique and (arguably) bizarre.


Unique, yes, and I'd even go as far as to say it's somewhat out of place on a Unix-like OS, but I have an easier time remembering "infile, outfile, block/batch size" than, say, arguments for tar due to its simplicity. I wouldn't call it bizarre because it's extremely straightforward.


> "no other ..."

Do you realize that anyone can make a CLI tool with whatever syntax they like? :) Also, see "man xm", which uses a quite similar approach. And there are probably many other examples.


> and accidentally overwriting their main drive instead because they mangled /dev/sda instead of /dev/sdb

One trick to be sure that you've typed it correctly: start with 'echo':

    # echo dd if=xxx of=yyy [enter]

This way you'll be able to check that command will be executed with the parameters you want. Especially helpful if you're running loops, e.g:

    # for i in xxx; do echo yyy; done

I would never use GUI for that, as you can't be sure what it will execute, no matter what it shows.


How would `echo dd if=linux.iso of=/dev/sda` protect you from nuking the wrong drive? To be sure you aren’t overwriting the wrong drive, you need to read the command before executing it — and you don’t need echo for that, just pause for a second before hitting Enter — and you need to know which drive is which, and that requires running a different command, or using eg. /dev/disk/* instead of /dev/sd??.


> I don't know why dd remains the standard for so many tutorials when there are competent GUI tools

You miss the point entirely then. The point is to use the command line, because what you think is a competent GUI tool isn't nearly as intuitive as you think, especially when a GUI isn't available.


Better permissions could help with this. Make it so that there is a user other than root that can write to the block device the flash drive is on, and then you sudo to that user to write images rather than sudoing to root. Then accidentally trying to write to the main drive will get a permission error.
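A crude version of this works today without any kernel changes; a sketch (the `imgwriter` user name is made up, and the chown only lasts until the device node goes away):

    # one-time setup: a dedicated unprivileged user for writing images
    sudo useradd --system --no-create-home imgwriter
    # hand that user the stick (and only the stick), then write as that user
    sudo chown imgwriter /dev/sdb
    sudo -u imgwriter dd if=file.iso of=/dev/sdb bs=4M status=progress

Point the last command at /dev/sda by mistake and dd just gets a "Permission denied" instead of eating your root filesystem.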


Why do you think dd is special in that it can overwrite devices? Do you think ‘cat’ can’t?

The root user is special in that it can overwrite devices. Look at what you’re typing before using its powers.


One mitigation is to recommend that users use the symlinks in

    ls /dev/disk/by-id
instead of /dev/sd*

Other than that, someone could replace the whole of Etcher with a simple bash script that lists available block devices with nice descriptive names taken from sysfs, lets the user select the one they want to flash to, and handles flashing of a potentially compressed image and verifying the result automatically.

It would probably not be much longer than 100 LoC, at least on Linux.
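Something like this minimal sketch, for instance (no decompression or post-write verification, and the lsblk/awk filtering is an assumption about your output format, so treat it as a starting point):

    #!/bin/sh
    # usage: sudo ./flash.sh image.iso
    set -eu
    img="$1"
    echo "Removable block devices:"
    # RM=1 marks removable devices; keep the header line too
    lsblk -d -o NAME,MODEL,SIZE,RM | awk 'NR==1 || $NF==1'
    printf 'Write %s to which device (e.g. sdb)? ' "$img"
    read -r dev
    printf 'This will ERASE /dev/%s. Type yes to continue: ' "$dev"
    read -r answer
    [ "$answer" = yes ] || exit 1
    dd if="$img" of="/dev/$dev" bs=4M conv=fsync status=progress

A more careful version would resolve the choice through /dev/disk/by-id and compare a checksum of what was written against the image.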


I once mistakenly overwrote an entire disk (or the wrong partition, I don't remember) when trying to install a GNU/Linux distribution to one partition using a graphical installer. That has never happened to me using dd. Sure, some graphical applications can be helpful in this regard, but there's really no reason to assume the one you end up using will help you avoid this type of mistake. I'd rather use something that is stable and remain as vigilant as is always required.


It's really not that hard to use correctly. Maybe it shouldn't be the standard for beginners' tutorials, sure, but it's not terrifying once you know how it works, as long as you exercise appropriate caution. And that's the whole benefit of command-line tools; they're like a chainsaw in that they get a whole lot of work done in a hurry, but you have to point them the right direction and keep them under control.


Alternatively, I have never written to the wrong device when using dd. I always double and triple-check with lsblk and blkid before committing to it. dd is always available, unlike other tools.

If a user blindly copy-pastes and destroys their main drive, that's a very good lesson to thoroughly double check before running destructive operations.


Just because you should do something doesn't mean we should make it easy to screw up. And just because you and I double-check doesn't mean everyone else does. It is their fault, sure, but it's also partly the fault of internet guides that don't give you a massive warning about double checking.


The adventure doesn't start until things go wrong. Just think of it as Linux's way of providing a lifetime of adventures and great stories, like trying to fix your parents' computer at 3am before they wake up, because you thought you had the dual-boot Windows/SuSE setup working but now the master boot record is corrupted and you're still a Linux noob without StackOverflow or Google for answers. Battle scars, my friend.


To add to what the other user said, just because you can do something with one tool, doesn't mean we shouldn't promote other tools that are easier to use and that make it easier for you to do the job without accidentally shooting yourself in the foot.


dd has the advantage of letting you specify a larger buffer. I've seen, for example, that on macOS writing an image to an SD card with the internal reader of my MacBook is a lot faster with a block size of 1M compared to the classic 4k.

Also, an advantage of dd is that you can run it with sudo easily, while with cat it will not work, since the redirection is done by the shell and not the cat binary itself; you either have to open a root shell, or pipe the output of cat into `sudo tee filename >/dev/null`, which is less than ideal.

Also, I think part of the problem is that Linux lets you write to disks that are mounted. On macOS that is forbidden and you must unmount the drive first (so overwriting your root partition by mistake is impossible).


It's simple, already installed, and has huge existing momentum. At least with newer classes of storage drivers like NVMe you end up with completely different names for system drives!


I accidentally repartitioned my main linux drive once.

I just ran gparted:

  $ gparted
and instead of complaining that I didn't specify an argument, it chose the first one: /

When I realized what I had done, I couldn't recover the partition table, but I managed to rsync everything elsewhere. I'm sure there was a better/safer way but everything was recovered in the end.


dd can be faster than alternatives, if you set the block size right, and is always available, even without a GUI.

I've often used whatever disk utility came with my distro. Or, rufus (rufus.ie) if I'm on windows.


You can also just use cp, and then you don't need to worry about the block size. Really only need dd if you have to limit how much data is read.


You assume one has access to a GUI. Not always the case.


That's another advantage of using Qubes. Worst case scenario I botch the image-burning Qube, which is temporary anyways.


Just check dmesg when you plug in your USB drive; it will tell you what device node it created for it.
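Something like this: run util-linux's dmesg in follow mode, plug the stick in, and watch for the new device line (the sample output is illustrative; exact wording varies by kernel):

    $ sudo dmesg --follow
    [ 1234.567890] sd 4:0:0:0: [sdb] Attached SCSI removable disk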


That said, I remember a roommate erasing their Windows "C:" drive using Ghost (that is, "Ghost" before Norton bought it).

Just yesterday I copied my wife's entire (Windows 8.1) HDD to an SSD using the free (GUI) version of "Macrium Reflect". I'm pretty sure you can make the exact same mistake there: copying destination onto source instead of the contrary. I used Windows software and not dd because it was a Windows computer and I didn't feel like booting a live Linux CD to do the dump, but under Linux I always use dd.

Is it really that hard to learn that in dd the 'i' in "if" means "input" and that the 'o' in "of" means "output"?

The problem with "making things simple using a GUI" is that you typically totally lose the ability to do not just advanced things but even "average" things. For example, for read-only media I always write the checksum on the media itself, using a sharpie. That way I can easily verify that my disc (or the copy) is OK by doing a dd and piping into sha256sum (there are some gotchas to keep in mind but it works fine when you know how to do it).
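For what it's worth, the main gotcha is that you have to read back exactly as many bytes as the image, because the drive can return padding past the end. A sketch (image.iso is just a placeholder):

    $ sha256sum image.iso
    $ blocks=$(( $(stat -c %s image.iso) / 2048 ))
    $ dd if=/dev/sr0 bs=2048 count=$blocks | sha256sum

The two hashes should match if the burn is good; if you no longer have the .iso, util-linux's isosize can recover the byte count from the disc itself.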

Like, say, a Debian install ISO. I like to have these on read-only medium and make sure the checksum matches the official one (so I prefer a write-once / read-only DVD than a read/write memory stick).

How do you do that with the GUI? Piping into a cryptographic hash?

I can understand that people prefer GUI over command line for many things but I think that people imaging entire disks are at least power users and can learn the difference between "input" and "output". Heck, maybe it's even doing them a service to teach them to use the CLI. Maybe it's the opportunity to teach them about piping, about cryptographic hashes, etc.

And once again: you can totally screw up with a GUI too.

I don't mean it in a bad way at all but... Linux on the desktop (which I've used for 20 years) ain't exactly enjoying a huge market share compared to Windows or OS X, and I think that's fine. If a Linux user cannot or isn't willing to learn dd, maybe that user is better served by Windows or OS X. And really: I don't mean it in a bad way. I don't think Linux has to "win" the desktop war. I don't think it's wrong not to use Linux.

But I do think it's wrong to believe users willing to learn Linux cannot learn the difference between input and output.

Also, a strong case can be made that someone for whom it's a "disaster" to overwrite their main drive is one hard disk failure away from disaster anyway.

I'm for educating users, not baby-feeding them with tools that are going to keep them in the dark and reinforce their bad practices (like not doing proper backups and hence being "one hard drive failure" away from disaster).


Any time CDs/CD-ROMs (as opposed to DVDs) are discussed, I get really nervous unless someone explicitly confirms that the tool being used handles XA/2352-bytes-per-sector data correctly. Supposedly cdrdao will.[1]

DVDs are for the most part pretty easy to duplicate identically. CDs have a lot of quirks, and it's very easy to end up with a useless "backup". Especially if the file format is ISO instead of bin/cue or similar. If one is trying to duplicate CDs/CD-ROMs, it's really vital to cross all your Ts and dot all your Is.

[1] https://consolecopyworld.com/psx/psx_copy_patch_linux.shtml


Is there a reason that the copying process needs to be aware of how bytes are organized? Why can’t you just read the bytes from disk A and write those exact bytes to disk B?


Because depending on what level the software is operating at, it may not even be aware of the correct data, and the file format used to store the image might not be capable of representing it accurately.

I'm simplifying here somewhat for space, but the CD-ROM specifications allow for at least two ways of storing the data on the disc.

A 100% vanilla data CD-ROM with no copy protection uses 2048 bytes per sector for the data that's visible to you as a user, and the remaining 304 bytes in that sector are used for data that helps recover the user-level data if any of it is unreadable (kind of like a RAID5 setup).[1]

Mode 2/XA discs (or mode 2/XA sections on mixed-mode discs) use those 304 bytes per sector to store user-level data instead. i.e. they are trading more space for less reliability. PlayStation games are the most common example. If you've ever tried to copy XA audio or STR video files off of a PlayStation disc in Windows and wondered why you got an error, or why the files copied but were corrupted, that's why. Your PC was only copying 2048 out of every 2352 bytes in each sector.

The ISO file format for discs can ONLY support 2048 bytes per sector. This is why groups like Redump use bin/cue for anything that comes off of a CD. If you convert a bin/cue to ISO, you are throwing away a little over 10% of the data in every sector.

If you want to learn more about this, read the specs for different types of CDs and CD-ROMs. It's a big mess, and I think everyone is happier that the industry standardized down to fewer options in the DVD era.

[1] This is in addition to the physical-level redundant data encoded in the pits on the disc itself, but AFAIK almost nothing can read discs at that level.


Which bytes? A raw 2352 byte copy that includes the P and Q channels (timing) can behave differently than a 2048 byte copy depending on the software.


Ok but why should we care about that in copying?

This is my mental model for what's going on:

There's a bunch of bytes stored on a CD-ROM in a defined order. Zeroes and ones. Copy them in order. You should now have a file on magnetic disk or flash that is precisely those bytes in precisely that order. Anything that can make sense of one should be able to make sense of the other.

What am I missing here?


There is more than one set of bytes.

You may not care about copying the inode structure when you are copying your files from point a to point b, but you should when you are cloning the disk.

Many times the software on the CD is looking at the physical layout of the disk, not just the logical data, to function correctly.


So filesystem metadata is being copied that references the physical CD, but it's no longer associated with that CD once copied to magnetic storage? Is that what's going on here?


No. Please see my other post in this thread. I'm struggling to think of a non-technical analogy, but basically what's going on here is that most data-copying tools, when used in the most common ways, are looking at CD-ROMs like they would look at the logical disk presented by a RAID5 array.

That works fine most of the time. But imagine if the computing industry had come up with a "RAID5 Mode 2" that was actually RAID0, where the parity disk was just used to store more user data instead of parity data, but most of the copying tools didn't know the difference between "RAID5" and "RAID5 mode 2", and so they just copied 2/3 of the data, on the assumption that the parity data would be recreated on the receiving end. 1/3 of the user data just went down the drain. That's basically what's happening when you try to store anything other than a 100% vanilla data CD-ROM as an ISO file instead of bin/cue.


Do any of the tools mentioned in the title support audio and/or multi-session disc formats? AFAIK dd is not suitable for these purposes (although the original author mentions they wish to create ISO images so presumably they only care about single session data).

Windows had a lot of great disc software in the 90s/00s which would handle just about any disc format one would encounter, e.g. Alcohol 120%, CloneCD, Disc Juggler. Some of these were capable of backing up the various protections of the time such as SecureROM and Starforce.

dd is a powerful tool but I often find it the "wrong" tool, e.g. do I really want to clone a 2GB partition layout onto a 128GB USB stick for a live CD and then have to manually edit the partition tables myself?


That answer does not seem very useful to me. The use cases are different.

Reasons to use dd include being able to set the block size, seek and skip to resume copies or extract parts of the data, and a whole bunch of special options for the very specific use case of large binary files or block devices.

Reasons to use cp include it being file based, so metadata and recursive copying is natural, efficient, and easy to use. This probably includes most copying on a daily basis. See also rsync.


One advantage dd has, specifically when it comes to copying physical media, is that you can tell it to continue in case of a read error, instead of crapping out halfway through. The lack of such functionality can make copying a scratched CD quite challenging with tools like cat or cp. You also have the choice to ignore the data that could not be read (dangerous) or to write a padded block of the correct size in its place and continue.
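In dd terms, the two choices look roughly like this (a sketch; for genuinely damaged discs ddrescue, mentioned elsewhere in the thread, is still the better tool):

    # skip unreadable blocks entirely: the image shrinks and later offsets shift (dangerous)
    dd if=/dev/sr0 of=disc.iso bs=2048 conv=noerror
    # pad each failed read with zeros so the image keeps its original geometry
    dd if=/dev/sr0 of=disc.iso bs=2048 conv=noerror,sync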

I agree that the use cases for all tools are different so comparing them fairly is impossible. dd is still my go-to for making disk images, but it's unnecessary and has high overhead without a lot of tweaking for simple tasks like file operations. cat is often a good tool for things like piped commands where you need a flag cat supports (so you can't use shell redirection) and cp/rsync are obviously superior for copying files rather than just data. Sure, I could hack my commands in such a way that one tool can do another's job, but why would I?


Notice that the way they use cp isn't to copy the files contained on the disc but to copy the image of the disc itself.


I've been dealing with a lot of CD copying problems lately. I'm working on a setup to automatically detect and correct errors reading audio CDs which I've found don't use checksums or other mechanisms to prevent small defects from mangling the bitstream.

I'd previously used cdparanoia, but found it not paranoid enough. (For some reason it always outputs AIFF regardless of options, as well.)

I'm looking at defeating the drive cache by actually reading the disc in a second and 3rd drive and doing a NASA style "all 3 must agree".

It's annoying, but I've got 800,000 CDs to rip right now, and I have to make sure each one is perfect (without proprietary solutions).

I also need to be able to positively reduplicate heavily scratched discs. It's a tough problem.


Don't know if it can help but... if it's audio CDs, the problem is kinda solved in that you can make sure you extract a bit-perfect copy by verifying the checksum against the checksums of other people who extracted the same CD. Even though the Audio CD format tried to make it "hard" to rip in a bit-perfect way, it's been done precisely like that for years and years.

There's a community of people ripping Audio CD to flac file and making sure all their rips are bit-perfect.

I've ripped my own collection and used, as far as I remember, "whipper" on Fedora Linux (for whatever reason I couldn't make that software work on Debian back then). After ripping a track/CD the software would automatically verify the checksum against an online DB of rips.

You say you can't use proprietary solution: as far as I know there are several free rippers on Linux and the online DB of checksum isn't proprietary either (?).

In case you rip with an error, it's near impossible that someone else who ripped the CD would have read exactly the same error and end up with the same checksum.

Now, for all the audio CDs I tried there was at least one other person who had already ripped them, but that's not always the case. For example, I've read about some collections of classical music coming in packs of 300 CDs (!) where nobody had bothered, at home, to rip them / to share the checksums.

But it can already help for all the "common" CDs you own.


The Exact Audio Copy checksum database is not an option for me because of its closed nature and the provenance of the data in it.


> because of it's closed nature and the provenance of the data in it.

How does any of that prevent you from using those checksums as an indicator of whether your own ripping process is working as expected?


As far as I can tell, I can't query the database from a script on linux. You have to use their software to do the rip, and query the database.

The scripts I have are very streamlined, and make multi-machine ripping very fast. I will probably just use md5 or sha1 as my checksum.

But also, I'll have all the flac files so I can always do a full comparison to see exactly what the difference is.


You are ripping 800,000 CDs and you are willfully not taking advantage of the best solution available?


It's not the best solution if I have to use a GUI and the software they have available to do it. I have scripts that will streamline things to a high degree. The EAC database doesn't seem to be open, and I don't feel like paying for it if I don't have to.

I've got enough copies of each disc that I don't need their data, and I want a better check than they do.


fair points!


800k? Are you insane? Even if everything else is automated and you have 800K devices at your fingertips ready to go, the sheer amount of time to lift a CD, open the player tray, insert the CD, and close the tray adds up to decades!

Let's say you do this 8 hours per day, 5 days a week. The above series of movements can take around 30 seconds. That's 400k minutes, which results in 400k / 40 hours == 10k weeks. At 50 weeks per year (only 2 weeks of vacation each year) that still results in 200 years!! Even if you do one CD every second, this results in 200 years / 30 ==~ 7 years.


I very well might be. I took over Murfie when they shut down, and am in the process of re-building the company.

People don't understand the scale, and why I might want to make sure I do it right the first time rather than get halfway through and have to start over because I missed something.


I'm assuming you're the guy in this article? https://www.theverge.com/2020/2/5/21121594/crossies-murfie-m...


Yeah, that's me :)


Ah, I made a terrible mistake in my math above. It's 400k minutes, which is actually only 6666.66 hours (400k / 60). So in the end you'll have only 200 years / 60 = 3.33 years.

I suppose you're going to pay people to do it, so lemme help you a little. 6667 hours at, let's say, $15/hour: that's roughly a hundred thousand dollars for this task. I hope you have large pockets ;).


Actually your math is way off. I don't think it will take more than about 10 seconds per disc to process them if I get things set up right and everything is well organized and streamlined.

https://twitter.com/pontifier/status/1236002652468674562

That's 6 per minute. 360 per hour. 800,000 should take about 2,222 hours. I've got a bunch of people that will work for about $10/hour, and that's only $22,220 worth of labor over 277 days for 1 person.


Are you really looking at 800k CDs to rip, or are you planning on removing duplicates and finding a good specimen?


I will need to verify each disc. I may not have to rip all of them, but I will probably do some sort of rapid de-duplication check. It will probably be much more involved than the normal toc disc id though. I might compare bits from random areas of each disc to verify it is the correct pressing. I'm still not 100% sure how I'll do it, but I don't want to store more than I have to.

I'd like to be able to take the pieces of a broken disc and verify somehow that it's a legit copy. Then I'd treat the pieces like every other copy of that disc and give the owner access.


>audio CDs which I've found don't use checksums or other mechanisms to prevent small defects from mangling the bitstream

Redbook audio CDs do use CIRC error correction, storing 8 bytes of parity data inside each 33-byte F3 frame. It is not enough to correct all errors though, and Yellowbook data CDs store extra correction codes on top of it (276 bytes for each 2352-byte sector).

(and there's also an issue below the F3 frame level about EFM modulation: whether merging codes are generated properly to keep the DSV low enough)

See: https://byuu.net/compact-discs/structure/ https://john-millikin.com/%F0%9F%A4%94/why-i-ripped-the-same... https://john-millikin.com/%F0%9F%A4%94/error-beneath-the-wav...


>800,000 CDs

This is going to be quite a task, even with a bunch of the big-boy 1000-disc autoloaders. I'm not familiar with any solutions capable of opening CD jewel cases. You might want to contact a few hackerspaces to inquire about building specialized machines just for the task of loading discs onto spindles.

BTW, why not proprietary? Like http://wiki.hydrogenaud.io/index.php?title=Exact_Audio_Copy


One non-obvious reason dd is often used in examples is because of device permissions.

You can't (easily/legibly) present a one-liner that combines cat and sudo because sudo doesn't give the shell redirection (sudo cat >> /dev/whatever) permissions. But "sudo dd of=/dev/whatever" works fine.

Edit: Yes, you can make a one-liner via sh -c, tee, etc., but copy/pasting with quotes is tricky, or the one-liner gets long.
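For completeness, the usual shapes look something like this (with /dev/sdX standing in for the target, as elsewhere in the thread):

    # redirection happens inside a root shell
    sudo sh -c 'cat file.iso > /dev/sdX'
    # or let tee do the privileged write (and keep the bytes off your terminal)
    cat file.iso | sudo tee /dev/sdX > /dev/null
    # versus the short form
    sudo dd if=file.iso of=/dev/sdX bs=4M status=progress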


You can also just use `su`, run your command, then `exit`, but you’re right: `sudo dd` is much easier.

(This obviously ignores that you should almost never use `su`)


> you should almost never use `su`

Why?


The same reason you shouldn't be logged in as root. It has its uses, but you need to be careful as it can be easy to forget you’re running as root and screw something up. Explicitly putting `sudo` before every command makes it clear what you’re doing.


    Performance. This is an I/O-bound process; the main influence in performance
    is the buffer size: the tool reads a chunk from the source, writes the chunk
    to the destination, repeats. If the chunk is too small, the computer spends 
    its time switching between tasks. If the chunk is too large, the read and 
    write operations can't be parallelized. The optimal chunk size on a PC is 
    typically around a few megabytes but this is obviously very dependent on the 
    OS, on the hardware, and on what else the computer is doing. I made benchmarks
    for hard disk to hard disk copies a while ago, on Linux, which showed that
    for copies within the same disk, dd with a large buffer size has the advantage,
    but for cross-disk copies, cat won over any dd buffer size.
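If you're curious how this plays out on your own hardware, a throwaway loop is enough to see the shape of it; a sketch (note that the page cache will flatter the later runs unless you drop caches or use a source you haven't read yet):

    for bs in 4K 64K 1M 16M; do
        echo "bs=$bs"
        dd if=/dev/sr0 of=/dev/null bs=$bs 2>&1 | tail -n 1
    done

dd prints its throughput summary on stderr, hence the 2>&1.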
As a meta comment, it has been a shocking revelation in my career that the difference between "get it to work" engineers and "make it right, make it fast"[3] engineers has been this kind of low-level knowledge. At least in my context, the majority of the bugs that have come up in the past while haven't been because someone didn't know how to concatenate 2 strings, nor have there been that many off-by-one errors. They have been things like not knowing the difference between theoretical TCP models and how they're actually implemented by the OS in a high-throughput environment (see TIME_WAIT[1] if you're curious), or misunderstanding the order of operations in distributed computing[2].

[1]: https://9oelm.github.io/2018-05-06--Listening,-Established,-...

[2]: http://book.mixu.net/distsys/

[3]: https://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast


Protocols that don't have a length header but instead rely on delimiters are a good example. And they are everywhere.


I always thought dd stood for "duplicate disk" or "duplicate device" and as such it was the "right" command to use here.

Apparently it's nothing to do with that and originally comes from "data definition" though.

https://en.m.wikipedia.org/wiki/Dd_(Unix)


Its original name was cc (convert and copy), because that is what it does. They could not use that name because it was the name of the compiler, so they went to the next letter of the alphabet and named it dd.


dd is also known as "Data Destroyer", because it's easy to make a mistake and overwrite the wrong device. This actually happened to me once, but I still use it when I need to write an ISO to a USB drive anyway.


A piece of advice that was mentioned many times in this thread to avoid writing to the wrong disk: use the symlinks in /dev/disk/* instead of /dev/sd* directly. Since they are more intuitive, using them will likely reduce the odds of you making a mistake.


Just avoid using /dev/sdX, use /dev/disk/by-id/ instead.


pv's progress display is very useful in many contexts; it has been one of the first things I make sure is installed on a new system for years. Though I have recently come across https://github.com/Xfennec/progress which is brilliant for some of the times when pv is not an option, or when you want to watch the status of several things which aren't convenient to arrange together via tmux/byobu. Or just for when you forget to use pv and don't want to cancel to change the command.


Thanks for the tip for progress! I didn't know this utility and it sounds really helpful.


I found it while putting together something similar myself (I might restart mine later, it was going to be a web & TUI based graph output, but progress+watch does the basic job more than well enough for now).


I've always used

  cdrdao read-cd --read-raw
to back up CDs/DVDs, along with toc2cue and bchunk to convert to .iso.

I have no idea if the resulting iso is "more accurate" than using dd - but they've always worked, even from CDs with copy protection and obscure file systems (old mac CDs).


The buried lede here is that cat can be faster than dd: https://unix.stackexchange.com/questions/9432/is-there-a-way...

The other important part is that cat will not have issues with binary data, which I am embarrassed to admit I assumed was not the case for the reasons stated in TFA. pv is the tool I want to use for things like copying images to SD cards but I can never remember the exact syntax needed to get it to show me the percentage since I don’t use it often. dd has a decent syntax for its command parameters and has some fun options like ability to skip, so I can quickly create large empty files.
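For the large-empty-file trick it's actually seek, not skip, that does the work; a sketch with GNU dd (file name made up):

    # create a 10 GiB sparse file instantly, without writing 10 GiB of zeros
    dd if=/dev/zero of=blank.img bs=1 count=0 seek=10G

truncate -s 10G does the same thing with less typing.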


> pv is the tool I want to use for things like copying images to SD cards but I can never remember the exact syntax needed to get it to show me the percentage

You're in luck: the "syntax needed" is to literally not use any options:

    pv your.img > /dev/sdb


Haha good point. The first time I encountered pv it was in the form of like cat your.img | pv ... > /dev/sdb and then you had to tell it what size to expect and enable it showing the progress bar, etc.


Good ol' "Useless Use of Cat": http://porkmail.org/era/unix/award.html


True, if you pipe a data stream to it, it can't tell what the size is going to be and you'd need the --size option. The progress bar is still default, though.


I don't think it's a good idea to use a program that isn't aware of the device buffer to copy a CD/DVD. Such failures are rare nowadays, but installing some CLI program like cdrecord isn't difficult either.


I used to work at a hosting company and proper use of dd was important when copying data from LVM in a Xen host; unfortunately I seem to have forgotten most of it.

Some pointers: dd has oflag and iflag; maybe oflag=direct is faster?

You can also use conv=sparse and sometimes save space by creating a sparse file.


> maybe oflag=direct is faster?

oflag=direct does direct I/O => copied data won't go into the buffercache.

On Linux search for 'O_DIRECT' in the open(2) manpage.

oflag (for output) and iflag (for input) are indeed useful. During/after a massive non-'direct' copy, a system running other processes which benefit from data in the buffer cache may crawl if the system, while copying, replaces some of that cache with the copied data and then has to re-read it.

In other words, this seems adequate when copying data which will not be read soon after the copy by any process. A raw filesystem image is a good candidate.

As usual, YMMV. If most of the data to be copied is already in the buffer cache, or if it will occupy some unused part of core memory... such optimization is useless. However, in most cases (on most adequately-dimensioned, non-idle systems) 'O_DIRECT' induces less systemwide load than cp, cat, pv(...) when copying a large set of data, if most of it will not be immediately read by anything afterwards.

Other tools (cp, cat, pv...) just cannot easily work in 'O_DIRECT' mode. Using some trick to enable it thanks to a local version of openat() and LD_PRELOAD (which calls openat in O_DIRECT mode), albeit possible, isn't realistic in most contexts.

    $ cd ~/tmp
    $ strace -e openat dd if=/etc/hosts of=useless.tmp count=1 >& nodirect
    $ strace -e openat dd if=/etc/hosts of=useless.tmp iflag=direct oflag=direct count=1 >& direct
    $ diff direct nodirect
    5,6c5,6
    < openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_DIRECT) = 3
    < openat(AT_FDCWD, "useless.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DIRECT, 0666) = 3
    ---
    > openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
    > openat(AT_FDCWD, "useless.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3

Moreover 'dd' has many options without equivalent in most other readily available tools.



I'm so glad that dd now has the status=progress option. It's still not widely known.


I used to cat dd into "pv" : )


Before I learned of the flag, I used to do a 'while true; do killall -SIGUSR1 dd; sleep 5; done' loop in another terminal.


SIGINFO on macOS.


Many years ago I wrote https://github.com/adh/random-tools/blob/master/byc.c which tends to be the tool that I use for copying images from/to block device (it was originally meant for something else which is why you need -p for it to do the right thing).


What's the best way to rip DVDs through the CLI these days? Are the error correction techniques in the DVD standard enough to ensure that a simple `cat /dev/dvd0` or `dd /dev/dvd0` is going to work (and, in particular, fail if an error can't be corrected)?


For those using dd (or cp) on macOS, ⌃T will send the process a SIGINFO and print some additional information about progress. On Linux, SIGUSR1 will do something similar, but it's not as convenient.


I use tail all the time with "-f" to see log files in realtime, but seeing it as an option for reading an entire CD with "-c +1" was surprising.


Interesting that dvdisaster is not mentioned in the answers.


Thanks for mentioning dvdisaster.

Looks like it was originally written for optical disks, but I suppose it would work for other media:

https://en.wikipedia.org/wiki/Dvdisaster



