The vi vs emacs thing permeates a lot of tools.
I find it useful to map all tools to be consistent with one or the other.
Personally I use vi. Unfortunately, setting vi mode for readline has a few caveats, but I was able to work around all of them with the settings in: https://www.pixelbeat.org/settings/.inputrc
One might be wondering why 'test' and '[' are separate binaries at all.
This is to give maximum flexibility wrt copying binaries etc.
Note that one can build coreutils as a single binary, busybox-style,
just by passing `./configure --enable-single-binary` at build time.
This is already available as an option on Fedora at least through
the coreutils-single (1.2MB) package.
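For example, a rough sketch of both routes (the configure flag and package name are the ones mentioned above):

  # build from source as a single multi-call binary
  ./configure --enable-single-binary && make

  # or, on Fedora, install the prebuilt variant
  dnf install coreutils-single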
yes(1) is the standard unix way of generating repeated data.
It's good to do this as quickly as possible.
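For example, a quick way to generate a large file of repeated test data (the size and text here are arbitrary):

  yes 'some repeated line' | head -c 1G > testfile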
I really don't understand why so many get annoyed with this code.
130 lines isn't that complicated in the scheme of things.
Very good point. I added a whole section to the article implementing the counting-lines example with `make -j`,
which performs just as well as `xargs -P`.
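For reference, the xargs -P flavour of that example looks roughly like this (not the article's exact code; the file glob is a placeholder):

  find . -name '*.log' -print0 |
    xargs -0 -n1 -P"$(nproc)" wc -l |
    awk '{ sum += $1 } END { print sum }'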
It comes from the DD (data definition) statement of OS/360 JCL, which is why dd has such unusual option syntax compared to other unix utils.
BTW if you are using dd to write USB drives etc., it's useful to bypass the Linux VM (page cache) as much as possible to avoid system stalls, especially with slow devices.
You can do that with O_DIRECT. Also dd recently got a progress option, so...
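For example (the device path is just a placeholder; double-check it before running):

  dd if=image.iso of=/dev/sdX bs=4M oflag=direct status=progress conv=fsync

Here oflag=direct requests O_DIRECT, status=progress is the progress option mentioned above, and conv=fsync makes dd flush the data before it reports completion.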
You don't have to wait for updates to newer versions to get dd to report its progress. The status line that dd prints when it finishes can also be forced at any point during dd's operation by sending the USR1 or INFO signals to the process. E.g.:
  ps a | grep "\<dd"        # find the PID of the running dd
  # [...]
  kill -USR1 $YOUR_DD_PID   # dd prints a status line and keeps going
or
  pkill -USR1 ^dd           # same thing, matching by process name
It also doesn't require you to get everything nailed down at the beginning. You've just spent the last 20 seconds waiting and realize you want a status update, but you didn't think to specify the option ahead of time? No problem.
I've thought that dd's behavior could serve as a model for a new standard of interaction. Persistent progress indicators are known to cause performance degradation unless implemented carefully. And reality is, you generally don't need something to constantly report its progress even while you're not looking, anyway.
To figure out the ideal interaction, try modeling it after the conversation you'd have if you were talking to a person instead of your shell:
"Hey, how much longer is it going to take to flash that image?"
Yes, this is true. Note that BSD supports this better with Ctrl-T, which generates SIGINFO; one can send that to any command even if it's not supported, in which case it's simply ignored. Using kill on Linux, where an unhandled signal kills the process by default, is decidedly more awkward.
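Concretely (PIDs are placeholders):

  # BSD: press Ctrl-T in the terminal, or send SIGINFO explicitly;
  # a program that doesn't handle it just ignores the signal
  kill -INFO $pid
  # Linux: dd listens for SIGUSR1 instead, but an unhandled SIGUSR1
  # terminates the target process by default
  kill -USR1 $pid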
It's also worth noting the separate "progress" project, which can be used to report the progress of running file-based utilities.
We have generally pushed back on adding progress reporting to each of the coreutils for these reasons, but the low overhead of implementation and high overlap with existing options was deemed enough to warrant adding it to dd.
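For anyone curious, basic usage of that tool is along these lines (invocations from memory, so treat them as approximate):

  progress            # report progress of running cp, mv, dd, tar, etc.
  watch -n1 progress  # poll it once a second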
Yes! I'm thinking of building something like this for my neural net training (1-2 days on AWS, 16 GPUs/processes on the job). In this case the "state" that I'd like to access is all the parameters of the model and training history, so I'm thinking I'll probably store an mmapped file so I can use other processes to poke at it while it's running. That way I can decouple the write-test-debug loops for the training code and the viz code.
I generally use a semaphore when I'm reading and writing from my shm'd things. The data structure will also likely be append-only for the training process, as I want to see how things are changing over time.
I am new to the shared memory concept. I am familiar with named pipes. Could you please elaborate a bit, I'm curious.
Are you passing a reference to an mmap address, or using the shm system calls? What language are you programming in? Do race conditions endanger the shared memory? If so, how does using semaphores help?
Sorry if I asked a lot of questions, feel free to answer any/none of them :)
Sure! SHM is really cool, I just found out about it. It's an old Posix functionality, so people should use it more!
I'm using shm system calls in Python. Basically I get a buffer of raw bytes of a fixed size that is referred to by a key. When I have multiple processes running I just have to pass that key between them and they get access to that buffer of bytes to read and write.
On each iteration I first wait until the semaphore is free and then lock it (P). That prevents anyone else from accessing the shared memory. The process then reads a set of variables from the shared memory - I have little helper functions that serialize and deserialize numpy arrays into raw bytes using fixed shapes and dtypes. Those arrays are updated using some function combining the output of the process and the current value of the array, then reserialized and written back to the shm buffer as raw bytes. Finally, the process releases the semaphore (V) so other processes can access it.

The purpose of the semaphore is to prevent reading the arrays while another process is writing them - otherwise you might get interleaved old and new data from a given update. In a process-wise sense there is still a race condition, since each process can update at different times or in a different order, but for my purposes this is acceptable: neural net training is a stochastic sort of thing and it shouldn't care too much.
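If it helps to see the locking pattern outside Python, here is a loose shell analogy of the same acquire -> read/update -> write -> release cycle, using flock(1) on a file under /dev/shm rather than a POSIX semaphore plus shm segment (all names are made up):

  STATE=/dev/shm/train_state.bin     # stand-in for the shared buffer
  (
    flock -x 9                       # "P": block until we hold the lock
    # read current state, fold in this process's update, write it back
    cp "$STATE" /tmp/snapshot.bin
    # ... modify /tmp/snapshot.bin ...
    cp /tmp/snapshot.bin "$STATE"
  ) 9>"$STATE.lock"                  # "V": the lock drops when fd 9 closes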
>I've thought that dd's behavior could serve as a model for a new standard of interaction. Persistent progress indicators are known to cause performance degradation unless implemented carefully. And reality is, you generally don't need something to constantly report its progress even while you're not looking, anyway.
Progress bars by default are also garbage if you are scripting and want to just log results. ffmpeg is terrible for this.
> Persistent progress indicators are known to cause performance degradation unless implemented carefully.
Are you referring to that npm progress bar thing a few months back? I'm pretty sure the reason for that can be summed up as "javascript, and web developers".
Anyway, he's not proposing progress bars by default, he's proposing a method by which you can query a process to see how far it's come. I think there's even a key combination to do this on FreeBSD.
Or, for example, you could write a small program that sends a USR1 signal every 5 seconds, splitting out the responsibility of managing a progress bar:
% progress cp bigfile /tmp/
And then the 'progress' program would draw you a text progress bar, or even pop up an X window with a progress bar.
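A bare-bones sketch of the signalling half of that idea (hypothetical script; it just triggers the wrapped command's own status output rather than drawing a bar, and it only makes sense for commands like dd that actually handle SIGUSR1 - as noted elsewhere in the thread, an unhandled USR1 kills the process):

  #!/bin/sh
  # usage: progress command [args...]
  "$@" &
  pid=$!
  while kill -0 "$pid" 2>/dev/null; do
      sleep 5
      kill -USR1 "$pid" 2>/dev/null   # nudge it for a status line
  done
  wait "$pid"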
That's great! I think due to the way it's implemented it wouldn't be able to do progress reporting for e.g. "dd if=/dev/zero of=bigfile bs=1M count=2048", but that's a less common case than just cp'ing a big "regular" file.
Though, a read cache can be enabled manually by creating a separate device via gcache(8). This is usually not required, because caching is done at the filesystem layer.
It's important to specify a block size for uncached devices, of course. dd(1) with the bs= option will surely work, and with cp(1) your mileage may vary, depending on whether the underlying disk driver supports I/O with partial sector sizes or not.
Usually just specifying a reasonable blocksize works for me. bs=1m or so.
Without that it does literally take hours.
I suspect the default blocksize is really small (1?) and combined with uncached/unbuffered writes to slower devices, it just kills all performance outright.
Per the sibling comments, you just need to specify a sane block size. dd's default is really low and if you experiment a bit with 2M or around that you'll get near-theoretical throughput.
NB: Remember the units! Without units the size is interpreted as bytes, which is insanely small. I've made that mistake more than once!
If you can find a way to use 'dd' for a disk/drive/device task, you can use it in interesting new ways (pipelines, etc.) and have very good confidence that it won't break weirdly. It will do the small, simple thing it is supposed to do even if you are abusing it horribly.
Like this, for instance:
pg_dump -U postgres db | ssh user@rsync.net "dd of=db_dump"
You could use it to rate limit... or arbitrarily set block sizes per use case. I've used it for the former when doing 'over the wire' backups through ssh.
The bs= setting controls the block size. When writing to block devices, you can maximize throughput by tuning the block size for the filesystem, architecture, and specific disk drive in use. You can tune it by benchmarking across various multiples of 512K.
For most modern systems, 1MB is a reasonable place to start. Even as high as 4MB can work well.
The block size can make a major difference in terms of sustained write speed due to reduced overhead in system calls and saturation of the disk interface.
A similar thing happens when writing to sockets: lots of small messages kill throughput, though they can decrease latency for a system that passes a high volume of small control messages.
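A crude benchmark sketch (the target path is a placeholder, and each pass writes about 1GiB; dd prints its throughput summary on stderr, hence the redirect):

  for bs in 64K 256K 1M 4M; do
      printf 'bs=%s  ' "$bs"
      dd if=/dev/zero of=/path/to/testfile bs="$bs" \
         count=$((1073741824 / $(numfmt --from=iec "$bs"))) \
         conv=fdatasync 2>&1 | tail -n1
  done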
Sure, it's been around for a while, in the GNU version on Linux at least. Personally I've found pipe viewer (pv) quite handy too: https://www.ivarch.com/programs/pv.shtml - available in most distros.
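A typical use, for comparison (the device path is a placeholder):

  pv image.iso | dd of=/dev/sdX bs=4M
  # or simply: pv image.iso > /dev/sdX

pv sits in the pipe and draws the throughput/ETA bar itself, so the tool being fed doesn't need any progress support of its own.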