The vi vs emacs thing permeates a lot of tools. I find it useful to map all tools to be consistent with one or the other. Personally I use vi. Unfortunately setting vi mode for readline has a few caveats, but I was able to work around all of them with the settings in:

https://www.pixelbeat.org/settings/.inputrc
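
For anyone curious, the heart of it is just a couple of readline directives (a minimal sketch; the linked file also works around the caveats mentioned):

    # ~/.inputrc: vi key bindings for bash and all other readline clients
    set editing-mode vi
    # indicate insert vs command mode in the prompt (readline >= 6.3)
    set show-mode-in-prompt on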


One might be wondering why 'test' and '[' are separate binaries at all. This is to give maximum flexibility wrt copying binaries etc. Note one can build coreutils like busybox as a single binary, by just `./configure --enable-single-binary` at build time. This is already available as an option on Fedora at least through the coreutils-single (1.2MB) package.
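
As a rough sketch of what that looks like (the exact configure value here is from memory, and the busybox-style argv[0] dispatch is the assumption):

    ./configure --enable-single-binary=symlinks
    make
    # the installed utils become symlinks to one multicall binary,
    # which picks its personality from the name it was invoked as
    ls -l /usr/bin/test "/usr/bin/["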


Sounds like a case of: https://www.pixelbeat.org/programming/sigpipe_handling.html

uutils uses the coreutils test suite, so it makes sense to add a test case for that, which uutils will eventually get to. I'll do that now.


yes(1) is the standard unix way of generating repeated data. It's good to do this as quickly as possible. I really don't understand why so many get annoyed with this code. 130 lines isn't that complicated in the scheme of things.
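
For example, to quickly generate a file of repeated test data (the content and size here are arbitrary; head -c with a unit suffix is GNU):

    # ~1GiB of a repeating line
    yes 0123456789abcdef | head -c 1G > testfile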


Nice. This handles the mangled example I discussed at:

http://www.pixelbeat.org/docs/unicode_utils/


GNU seq uses this trick:

slow path:

    seq -f '%.1f' inf | pv > /dev/null
    ...[  12MiB/s]
fast path:

    seq inf | pv > /dev/null
    ...[ 491MiB/s]


Can you explain that a bit more? I know "seq", but am unsure what your examples are illustrating. Thanks!


-f specifies a format, so seq can't use its specialized itoa implementation; it probably just falls back to printf.


The GNU variant was discussed recently at: https://news.ycombinator.com/item?id=14542938

The commit that sped up GNU yes has a summary of the perf measurements: https://github.com/coreutils/coreutils/commit/3521722

yes can be used to generate arbitrary repeated data for testing or whatever, so it's useful for it to be fast.


Very good point. I added a whole section to the article, implementing the counting-lines example with `make -j`, which performs just as well as `xargs -P`.
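
Roughly what that looks like with xargs -P (a hypothetical sketch; the file set is made up):

    # count lines of many files, one wc(1) per file, N jobs in parallel
    find . -name '*.c' -print0 | xargs -0 -n1 -P"$(nproc)" wc -l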


That really is a bash bug IMHO. I've discussed it with upstream bash at http://lists.gnu.org/archive/html/bug-bash/2015-02/msg00052....

I've some general notes on SIGPIPE mishandling at http://www.pixelbeat.org/programming/sigpipe_handling.html
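
The classic demonstration of the behavior in question (the statuses shown are what bash typically reports on Linux):

    yes | head -n1
    echo "${PIPESTATUS[@]}"   # "141 0": yes died of SIGPIPE (128+13)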


Yes "D" is not for disk/drive/device/...

It comes from the DD (data definition) statement of OS/360 JCL, and hence why dd has the unusual option syntax compared to other unix utils

BTW if you are using dd to write USB drives etc., it's useful to bypass the Linux VM (page cache) as much as possible to avoid system stalls, especially with slow devices. You can do that with O_DIRECT. Also dd recently got a progress option, so...

    dd bs=2M if=disk.img of=/dev/sda... status=progress iflag=direct oflag=direct
Note dd is a lower-level tool, which is why there are some gotchas when using it for higher-level operations. I've noted a few at:

http://www.pixelbeat.org/docs/coreutils-gotchas.html#dd


You don't have to wait for updates to newer versions to get dd to report its progress. The status line that dd prints when it finishes can also be forced at any point during dd's operation by sending the USR1 or INFO signals to the process. E.g.:

    ps a | grep "\<dd"
    # [...]
    kill -USR1 $YOUR_DD_PID
or

    pkill -USR1 ^dd
It also doesn't require you to get everything nailed down at the beginning. You've just spent the last 20 seconds waiting and realize you want a status update, but you didn't think to specify the option ahead of time? No problem.

I've thought that dd's behavior could serve as a model for a new standard of interaction. Persistent progress indicators are known to cause performance degradation unless implemented carefully. And the reality is, you generally don't need something to constantly report its progress while you're not looking anyway.

To figure out the ideal interaction, try modeling it after the conversation you'd have if you were talking to a person instead of your shell:

"Hey, how much longer is it going to take to flash that image?"

The way dd works is close to this scenario.


Yes, this is true. Note BSD supports this better with Ctrl-T to generate SIGINFO, which one can send to any command; if the program doesn't handle it, it's simply ignored. Using kill on Linux, where the signal kills processes by default, is decidedly more awkward.

It's also worth noting the separate "progress" project, which can be used to report the progress of running file-based utilities.

We have generally pushed back on adding progress output to each of the coreutils for these reasons, but the low implementation overhead and high overlap with existing options were deemed enough to warrant adding this to dd.


Also a gotcha on BSD (FBSD at least): SIGUSR1 kills dd.


Ctrl-T for SIGINFO is pretty useful, it would be good if Linux could pick this up.


Been waiting for years, frankly. I guess if I were motivated enough and had the time, I could do the research and submit patches...


They don't want it. Even if you got the patches accepted to the kernel they'd never accept patches to GNU Coreutils to support it.


The most recent comment I've seen on this is lukewarm at best: http://lkml.iu.edu/hypermail/linux/kernel/1411.0/03374.html


> in which case it's ignored

Ignored by the application, sure. But FreeBSD always prints useful stuff like load, current command, its pid and state:

    $ dd if=/dev/random of=/dev/null
    load: 0.72 cmd: dd 5820 [running] 0.70r 0.02u 0.68s 6% 2008k
    263276+0 records in
    263276+0 records out
    134797312 bytes transferred in 0.708372 secs (190291809 bytes/sec)

(here, the "load:..." line is from the system, and the other 3 lines are from dd)


Yes! I'm thinking of building something like this for my neural net training (1-2 days on AWS, 16 GPUs/processes on the job). In this case the "state" that I'd like to access is all the parameters of the model and training history, so I'm thinking I'll probably store an mmapped file so I can use other processes to poke at it while it's running. That way I can decouple the write-test-debug loops for the training code and the viz code.


> I'm thinking I'll probably store an mmapped file so I can use other processes to poke at it while it's running.

That seems to run a substantial risk of seeing it in an inconsistent state, yeah?


I generally use a semaphore when I'm reading and writing from my shm'd things. The data structure will also likely be append-only for the training process, as I want to see how things are changing over time.

Also I meant shm'd, not mmap'd.


I am new to the shared memory concept. I am familiar with named pipes. Could you please elaborate a bit? I'm curious.

Are you passing a reference to an mmap address, or using the shm system calls? What language are you programming in? Do race conditions endanger the shared memory? If so, how does using semaphores help?

Sorry if I asked a lot of questions; feel free to answer any/none of them :)


Sure! SHM is really cool; I just found out about it. It's old POSIX functionality, so people should use it more!

I'm using shm system calls in Python. Basically I get a buffer of raw bytes of a fixed size that is referred to by a key. When I have multiple processes running I just have to pass that key between them and they get access to that buffer of bytes to read and write.

On each iteration, first I wait until the semaphore is free and then I lock it (P). That prevents anyone else from accessing the shared memory. I have the process read a set of variables from the shared memory - I have little helper functions that serialize and deserialize numpy arrays into raw bytes using fixed shapes and dtypes. Those arrays are then updated using some function combining the output of the process and the current value of the array. Then those arrays are reserialized and written back to the shm buffer as raw bytes again. Finally, the process releases the semaphore (V) so other processes can access it. The purpose of the semaphore is to prevent reading the arrays while another process is writing them - otherwise you might get interleaved old and new data from a given update. In a process-wise sense there is a race condition, as each process can update at different times or in a different order, but for my purposes this is acceptable since neural net training is a stochastic sort of thing and it shouldn't care too much.

[0] http://nikitathespider.com/python/shm/ - original library which works fine for me

[1] http://semanchuk.com/philip/PythonIpc/ - updated version


>I've thought that dd's behavior could serve as a model for a new standard of interaction. Persistent progress indicators are known to cause performance degradation unless implemented carefully. And reality is, you generally don't need something to constantly report its progress even while you're not looking, anyway.

Progress bars by default are also garbage if you are scripting and want to just log results. ffmpeg is terrible for this.


> Persistent progress indicators are known to cause performance degradation unless implemented carefully.

Are you referring to that npm progress bar thing a few months back? I'm pretty sure the reason for that can be summed up as "javascript, and web developers".

Anyway, he's not proposing progress bars by default; he's proposing a method by which you can query a process to see how far it's come. I think there's even a key combination to do this on FreeBSD.

Or, for example, you could write a small program that sends a USR1 signal every 5 seconds, splitting out the responsibility of managing a progress bar:

    % progress cp bigfile /tmp/

And then the 'progress' program would draw you a text progress bar, or even pop up an X window with a progress bar.
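
The signalling half of that could be as simple as this sketch (hypothetical; it only makes sense for commands that handle SIGUSR1, like GNU dd; most others would be killed by it):

    #!/bin/sh
    # progress: run a command and ask it for a status report every 5s
    "$@" &
    pid=$!
    while sleep 5 && kill -USR1 "$pid" 2>/dev/null; do
        :   # the command prints its own status on each signal
    done
    wait "$pid"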



That's great! I think due to the way it's implemented it wouldn't be able to do progress reporting for e.g. "dd if=/dev/zero of=bigfile bs=1M count=2048", but that's a less common case than just cp'ing a big "regular" file.


Yes, C-t for SIGINFO, works on all BSDs (including macOS).


I've always used pv to get progress from dd, or other pipes:

  pv image.img | dd of=/dev/rdisk2 bs=1M
This adds another pipe though. I don't know the effect this has on performance.


On OS X (and BSD?) be sure to use /dev/rdisk[0-9]+ instead of /dev/disk[0-9]+.

Details as to exactly why it's faster are welcome. (I just know it bypasses stuff).

EDIT: someone mentioned this below http://superuser.com/questions/631592/why-is-dev-rdisk-about...


In FreeBSD, cached/block disk devices are long gone: https://www.freebsd.org/doc/en/books/arch-handbook/driverbas... so all disk devices in /dev are implicitly O_DIRECT.

Though read cache can be enabled manually by creating a separate device via gcache(8). This is usually not required, because caching is done at the filesystem layer.

It's important to specify a block size for uncached devices, of course. dd(1) with the bs= option will surely work; with cp(1) your mileage may vary, depending on whether the underlying disk driver supports partial-sector I/O or not.


Ah, I wish I'd known this last week. Writing Xubuntu to my USB took something like 2900 seconds from a Mac.


Usually just specifying a reasonable blocksize works for me. bs=1m or so.

Without that it does literally take hours.

I suspect the default blocksize is really small (it's 512 bytes) and, combined with uncached/unbuffered writes to slower devices, it just kills all performance outright.

Edit: answered! https://news.ycombinator.com/item?id=13350002


Per the sibling comments, you just need to specify a sane block size. dd's default is really low and if you experiment a bit with 2M or around that you'll get near-theoretical throughput.

NB: Remember the units! Without units the value is interpreted as bytes, which is insanely small. I've made that mistake more than once!


In other words, about 48 minutes for a ~1.2 GB file?


About 3 Mbit/s, or 400 KB/s. I'd expect something 50-100 times faster.


"Yes "D" is not for disk/drive/device/..."

But that's the very beauty of unix!

If you can find a way to use 'dd' for disk/drive/device you can use it in interesting new manners (pipelines, etc.) and have very good confidence that it won't break in weird ways. It will do the small, simple thing it is supposed to do even if you are abusing it horribly.

Like this, for instance:

  pg_dump -U postgres db | ssh user@rsync.net "dd of=db_dump"


Is there a benefit to using dd over cat in this case?


You could use it to rate limit, or to arbitrarily set block sizes per use case. I've used it for the former when doing 'over the wire' backups through ssh.
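
pv can also impose an explicit rate limit if you want one, e.g. (hypothetical rate, same pipeline as the example above):

    # throttle to ~1MB/s; dd on the far side just sets the block size
    pg_dump -U postgres db | pv -L 1m | ssh user@rsync.net "dd of=db_dump bs=1M"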


Thanks for the tips.

Clueless noob here... most guides I've seen use bs=1M for writing e.g. a Linux installer to a USB drive. Does 1 MB vs 2 MB change anything?


The setting controls the block size. When writing to block devices, you can maximize throughput by tuning the block size for the filesystem, architecture, and specific disk drive in use. You can tune it by benchmarking various block sizes, e.g. multiples of 512K.

For most modern systems, 1MB is a reasonable place to start. Even as high as 4MB can work well.

The block size can make a major difference in terms of sustained write speed due to reduced overhead in system calls and saturation of the disk interface.

A similar thing happens when writing to sockets where lots of small messages kill throughput, but they can decrease latency for a system that passes a high volume of small control messages.
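
A rough read-only (hence non-destructive) benchmark along these lines can help pick a block size; /dev/sdX is a placeholder for your device:

    for bs in 512K 1M 2M 4M; do
        printf '%s: ' "$bs"
        # dd reports throughput on the last line of its stderr summary
        dd if=/dev/sdX of=/dev/null bs="$bs" count=256 iflag=direct 2>&1 | tail -n1
    done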


>it's useful to bypass the Linux VM as much as possible to avoid systems stalls

Oh man, I didn't even know that was the cause of these problems.


Nice! Thanks. Built-in status and that Linux bypass trick are beautiful.


Oh? I had read that its name was originally "copy and convert", but `cc` was already taken by the compiler.


Yeah, this article is wrong. Have you ever noticed the syntax for dd is unusual? It is set up more like JCL syntax.


I thought it stood for copy and convert, but cc was taken by the C compiler.


Ask a graybeard.


Hasn't the ability to check progress been around forever? You could always send it SIGUSR1 and get back a progress report on stderr.


Sure, it's been around for a while, in the GNU version on Linux at least. Personally I've found pipe viewer (pv) quite handy too: https://www.ivarch.com/programs/pv.shtml (available in most distros).


Yup. Just don't do what I did earlier and `pkill -USR1 -f dd` if your desktop session is currently being provided to you courtesy of `sddm` …

As a programmer, it usually pays off to be on the lazy side, but every once in a while it comes back and bites me in the arse ;)


Thank you for the historical information.

