It seems to me that many core tools could use such updates. They're widely used, yet most of them predate modern development practices and tooling: widespread linter usage, unit tests, peer review before committing, the appearance of various stable and portable libraries (or even compilers), and routine security scans.
I wonder what other nasty things would appear after a serious code review of the GNU and BSD core code bases.
I do agree that the code has probably been read hundreds of times and executed millions of times, but I doubt there have been many formal attempts at improvement, and the overall methodology resembles brute forcing to me.
Most of the base system on Linuxes (which is GNU) certainly has unit tests and static analyzers run on it, to the best of my knowledge (I just looked at the coreutils). Most of these programs were maintained by Cygnus, and now by Red Hat after it acquired Cygnus.
less is an exception to the rule, as it's a small, standalone program and not as critical. Its initial development seems to predate the official announcement of the GNU Project. I don't think most people really expend much thought on something like a pager.
EDIT: Brute forcing is also a very pragmatic approach. Ken Thompson uttered his famous adage for a reason, even if it was somewhat humorous.
They do security scans, yes, but not really full-system code-quality review down to the level of every system utility (esp. not those developed primarily elsewhere). At least in the specific case of 'less', it's almost just a formality that it's even in the FreeBSD SVN tree, since the only activity is occasional re-imports of the upstream version: https://svnweb.freebsd.org/base/head/contrib/less/ Afaict, this Illumos initiative is the first attempt in years by anyone to review/clean up the internals.
There are also regular Coverity scans on the whole system (kernel and userspace) for the BSDs; the NetBSD one has been ongoing for quite some time and a lot has been fixed (a lot of the errors now seem to be in gdb...)
Another thing they could benefit from: removing arbitrary restrictions. For example, "cut" supports a custom delimiter to cut on with the -d option, but it can only be a single character.
Why? Because that was easier for the programmer to do.
From a user's perspective, it seems pathetic.
(Yes, I know that there's sed and awk, but why force the user to switch tools so early?)
<wikipedia>
Lint first appeared (outside of Bell Labs) in the seventh version (V7) of the Unix operating system in 1979. It was derived from PCC, the Portable C Compiler, which was included with that system. Lint and PCC were developed by Stephen C. Johnson, who also authored the parser generator yacc.
While we're talking less(1) features, one I stumbled across a few years ago when wishing really hard that less supported an interactive regex line filter, was its interactive regex line filter:
&pattern
Which will: Display only lines which match the pattern; lines which do not match the pattern are not displayed. If pattern is empty (if you type & immediately followed by ENTER), any filtering is turned off, and all lines are displayed.
I'd prefer this followed a syntax closer to mutt's filters, and that the patterns were editable (e.g., typing '&' during a filter would show the currently extant filter for modification), but it's handy.
> Make less use getopt() instead of its byzantine option parser (it needed that for PC operating systems. We don't need or want this complexity on POSIX.)
This is the kind of thing that always astonishes me to see in a codebase: why reinvent something rather than just finding and including a compatibility implementation? Just grab an appropriate getopt.c and compile it in if the platform doesn't have one, then let the rest of the code pretend every platform has one. (Preferably an implementation of getopt_long; a quick search turned up some licensed under 3-clause BSD.)
less was started (~ '83) before getopt() was made available to the 'general' public ('85).
My guess is nobody bothered to replace these parts, since options were only added gradually, if at all. Coincidentally, I'm in the same predicament with a tool a coworker of mine (initially) wrote: convoluted option parsing, to say the least. But I'm too busy fixing other parts or adding proper functionality to it to replace that.
Many moons ago I mentioned the story of "more" and "less" to our (female) DBA, who was unfamiliar with the commands, and the expression "less is more".
"It's right in the manpage, actually". "No." "Yes, I'll send it to you."
And so I did, with the subject line "man less".
She sat at a desk right in front of mine, and I detected a somewhat painful silence as the email arrived. And realized I'd just inadvertently commented on her social life (confirmed through later conversations).
In general I agree with your sentiment, but getopt is one where I'm willing to make an exception. getopt_long is barely any actual code, and honestly its design stinks for an argument parser. It's also just standard C (i.e., no platform issues to worry about, assuming you're going to get argv and argc). I've written my own argument parser in around 100 lines that's pretty similar to getopt, and so far it's been well worth it.
Does your hand-written parser handle all the corner cases getopt does? Ending option processing with "--"? Interleaving options in any order (ls foo -l)? Stacking short options (ls -la) or writing them separately (ls -l -a)?
I had to go check if it does '--', but yes it does all of those things you listed. My parser is similar to getopt, but it's less clunky and avoids the duplication that getopt_long results in.
Thank you for your diligence, then; most hand-written parsers fail one or more of those.
> less clunky and avoids the duplication that getopt_long results in.
Duplication because of the ugly flag/val logic, which in practice is typically passed as either NULL, 's' (long option for short option 's') or NULL, OPTION_FOO (long option with no short option, OPTION_FOO > 255)?
Yeah, that does seem silly. I think they did that to simplify the setting of boolean parameters, by passing &some_flag, 1, but that seems woefully insufficient when you need to handle arguments. I'd love to have a C library as capable as Python's argparse, instead.
Of course, I wasn't going to half-do it if I was going to replace getopt. I'm with you in that I get annoyed when I run into programs that do a half-done job of a getopt replacement, where you can't do the normal stuff like "-df" instead of "-d -f".
And yes, that's pretty much what I'm getting at. The issue really comes down to the fact that getopt_long is just bolted on to getopt, so there's still the string short-opt syntax like "sf:t", and then you duplicate the options in the array of long opts with the chars or some random number > 255 otherwise. The string is really the most annoying part because there's no good way to generate it via the C preprocessor that I could come up with, leading to some duplication between the long opts and short opts. It also can't easily generate help text for you from your arguments, which bugged me. My argument parser basically works off of a single xmacro header that holds all the argument information for getopt (which gets organized via a few macros into an enum and an array). It's dead simple to add new arguments and there's no duplication or separate strings, etc. that you have to update at the same time, besides adding code to handle that argument.
Personally, I wrote my argument parser specifically because I couldn't find any that I was happy with after looking around. They were either clunky to use (getopt_long), or were full libraries that seemed like a hassle to integrate into my code. I'd love to see a 'standard' argument parser that works well with long options and doesn't feel like an afterthought, as it does with getopt_long, but I think the chance to make such a thing has been missed. Personally, if I were ever to use such a thing, it would need to be a single or just a few .c and .h files that I can stick directly into my program. I wouldn't want to have to bet on the distribution having it or not, and I'm not going to add a separate dependency just for argument parsing.
I don't have it as a separate project specifically for the argument parser, which is why I didn't link to it. But the meat of the parser is in ./common/arg_parse.c, with the header for it in ./include/common/arg_parse.h. If you want to throw it into your own project, you'll want to take a quick look through ./common/arg_parse.c and modify it to suit your project (the help text specifically is for my program, so you'll want to rewrite the text in that part).
You can see an 'example' usage in the same repo. The files listed below parse the arguments into a single struct with a few flags inside, and also look for filenames in the arguments to load into the emulator:
./cmips/args.c
./cmips/args.h
./cmips/args.x
./cmips/args.x is an xmacro header. It mostly just contains the contents of the struct arg array for this program, but it's also used to create the enum entries which index that array. The 'parse_args' function is fairly similar to what you'd do with getopt, just using the 'arg_parser' function instead.
The argument parser code could probably be improved. Just looking back at it, it's got a bit too much logic going on in that single function; it could probably be split up a bit. I'm going to be working on this project again pretty soon, I think, so I might fix up the parser along with it.
Thanks for sharing. I was going to suggest you make the relevant files available as LGPL -- but then I went and looked for getopt.c, and discovered that it's in the util-linux package and is also GPLv2 (the version that I had installed; see more below). I'm normally a great fan of copyleft and the full GPL (rather than LGPL/BSD/MIT etc.), but it does strike me as a little strange to have something like this under the GPL -- it "feels" to me like more of a libc thing (not really talking about your code in particular, as that is part of a different program anyway).
However, the getopt.h file in the actual GNU libc package (libc6-dev) is LGPL.
Interestingly, the getopt.c/.h in gnulib (the GNU portability library) appear to both be GPL, not LGPL.
My code is just under GPL because I just wrote it as part of another program that's also GPL. That said, I'd be happy to re-license it as LGPL since I'm the only one who's touched that code, but since you'd probably want to drop it in to another program instead of compiling it as a library it wouldn't make much of a difference.
I've seen a few getopt.c implementations that are MIT I believe.
Personally, I'm guessing that those getopt implementations are probably all different (just because it hardly takes any time to write one). AFAIK gnulib is GPL itself, so the getopt inside was just licensed GPL too; same thing for util-linux. libc is LGPL, though, so that getopt.c was licensed as LGPL. It is kinda curious; I'm just surprised that there are so many implementations of the same thing.
Interesting! From what I'm reading so far, this seems not to be Illumos-specific (despite being motivated by Illumos's needs), but rather a cleanup that'd be applicable to any POSIX-like system. A fork of less that assumes POSIX-like functionality and cleans up a lot of things accordingly does seem like a worthwhile project. A bit more "unixy" design that uses the system versions of available functionality (like globbing and UTF-8!) should also reduce the risk of weird bugs & inconsistencies with how the rest of the system operates.
I switched from less to most* some time ago; no idea how it looks underneath, but it works great for me. It seems to be available for most (tee hee) distributions.
I used 'most' a few years ago because of the windowing support (this was before 'screen' got support for vertical splits). I also loved that 'most' colorized manpages when other pagers didn't.
I stopped using it around 4 years ago, when someone on IRC told me it was unmaintained and had some bugs.
Note that I never noticed any 'bug' in my time as a 'most' user, even if they may exist, and indeed, it's still available in at least all Debian versions.
It's curious that distributions are able to "maintain" a package (maybe even with custom patches) which is not maintained or updated upstream, for years.
It's encouraging to see actions like the one performed by this illumos developer. If a program is open source, we can fork, improve, and share. Or as users, we can take a look at the code when making choices; lots of people forget this in favor of search engine recommendations.
The level of love packages receive varies -- I'm a great fan of Debian, but in the case of "most" support appears to be less than stellar from the Debian side:
Well, yes. In either of the links -- I was thinking about the two normal-priority forwarded bugs for "most" (files starting with dash, sigpipe) and their age...
It's crazy to me that something as core as less has such bit rot. I'm aware GNU rebuilt a bunch of tools a while ago and added a slew of common features such as `-h` for human-readable mode and `--` long arguments. Does this rewrite have a name yet? And can it be brought into the core project at all (or the GNU project)?
A faster string search would be nice. I used to (and still sometimes do) use less to analyze large trace files. With a few hundred MB, searching becomes a real bottleneck.
Couldn't be that hard to do a Boyer-Moore for non-RE substrings.
One of the lesser-known less features is filtering only matching lines using &pattern. This is also very cool in combination with F, i.e. tail -f mode. Unfortunately, it tends to be extremely slow on large files, even though grep seemingly has no problem with them. I suspect it's related to search performance.
Overall I think less is one of those tools where it's really valuable to spend 10 minutes a day in the man page for a week, which should be enough to learn essentially all of its functionality.
Markers are also very useful, particularly paired with the functionality to pipe data to another file or shell command. E.g. to extract the instance of a server error plus some lines for context from an otherwise unwieldy log file. :) I use markers rarely enough that I invariably need to reread the man-/help-page, but being aware of the functionality is half the battle. :)
Another tip: within less, press -S to toggle line wrap. (Works for most other command line options, too.)
Not for string search, but I got fed up with the extremely long time it takes less to precalculate line counts on large files when I was working with log files of a size in the order of GBs. The result is here: https://github.com/nhaehnle/less
Since it was only a small hack to scratch the itch I was having at the time, I never really completed that project. For example, backwards line counting is not sped up, which can sometimes be noticeable.
If you feel like working on less-speedup issues, feel free to drop me a line.
Sure, but that's not an option if you actually want the line numbers. Given that it was possible to speed up the line number calculation by more than an order of magnitude, I do believe that fixing the code was the right way to go :)
I wholeheartedly agree, and I'm honestly surprised nobody has done this already. I often find that grep is ridiculously fast compared to less. It seems like a huge shame that a tool used by so many people on such a regular basis is so slow at such a simple, commonplace task.
Honestly, I'd take a stab at it myself if I had the time. Maybe I should start a Kickstarter or something like that.
As far as I know, it still holds that ag isn't fast, ack is slow ;-) That is to say: grep is pretty fast (too). In other words, ack improved the user-interface and api for search across files, with an eye towards programming and editors, but was relatively slow -- ag aims to keep the improved ui/api/output but bring speed back up to gnu grep-like levels.
You might be thinking of -F/--fixed-strings. -s is silent (long option --no-messages). For GNU grep 2.12, anyway. Or you might be thinking of BSD grep:
:~/tmp/riak/riak-2.0.0pre5/deps $ time (find . -type f -exec cat '{}' \; |wc -l)
2765699
real 0m5.021s
user 0m0.144s
sys 0m0.792s
:~/tmp/riak/riak-2.0.0pre5/deps $ time (find . -type f -exec cat '{}' \; |grep -E 'Some pattern' -v -c)
2765700
real 0m5.133s
user 0m0.264s
sys 0m0.852s
:~/tmp/riak/riak-2.0.0pre5/deps $ time (find . -type f -exec cat '{}' \; |grep -E 'Some..pattern' -v -c)
2765700
real 0m5.144s
user 0m0.400s
sys 0m0.768s
# "%% " used for leading comment lines in some of this code:
:~/tmp/riak/riak-2.0.0pre5/deps $ time (find . -type f -exec cat '{}' \; |grep -E '^%% ' -c)
27535
real 0m5.597s
user 0m0.520s
sys 0m0.788s
:~/tmp/riak/riak-2.0.0pre5/deps $ du -hcs .
405M .
405M total
:~/tmp/riak/riak-2.0.0pre5/deps $ time (find . -type f -exec cat '{}' \; |ag '^%% ' >/dev/null)
real 0m5.735s
user 0m1.480s
sys 0m0.876s
#actually find/cat is pretty slow -- I guess both GNU grep and ag
#use mmap to good effect:
$ time rgrep '^%% ' . > /dev/null
real 0m0.539s
user 0m0.404s
sys 0m0.128s
:~/tmp/riak/riak-2.0.0pre5/deps $ time ag '^%% ' . |wc -l
27500
real 0m0.252s
user 0m0.284s
sys 0m0.068s
:~/tmp/riak/riak-2.0.0pre5/deps $ time rgrep -E '^%% ' . |wc -l
27553
real 0m0.535s
user 0m0.396s
sys 0m0.140s
Note that grep clearly goes looking in more files here (more matching
lines). Still, I guess ag is indeed faster than grep in some cases (even
if it might not be apples to apples depending how you count -- of course
the whole point of ag is to help search just the right files).
:~/tmp/riak/riak-2.0.0pre5/deps $ time rgrep -E 'Some pattern' . |wc -l
0
real 0m0.266s
user 0m0.128s
sys 0m0.132s
:~/tmp/riak/riak-2.0.0pre5/deps $ time rgrep -E 'Some..pattern' . |wc -l
0
real 0m0.338s
user 0m0.212s
sys 0m0.120s
:~/tmp/riak/riak-2.0.0pre5/deps $ time ag 'Some..pattern' . |wc -l
0
real 0m0.111s
user 0m0.100s
sys 0m0.076s
I guess ag is indeed faster, even if it might not be due to fixed string
search...
[edit2: For those wondering that's an (old) ssd, on an old machine -- but with ~4G ram the working set should fit, as soon as some of my open tabs in ff are paged to disk...]
Thanks for benchmarking ag against grep. You're right that it's not exactly apples to apples. Ag doesn't search as many files, but it does parse and match against rules in .ag/.git/.hgignore. Also, ag prints line numbers by default, which can be an expensive operation on larger files.
I think most of the slowdown you're seeing with "find -exec | cat" is forking at least two processes (ag and cat) for each file. Also, each process has to be run sequentially (to prevent garbled output), which makes use of only one CPU core most of the time. I've tried to keep ag's startup time fast so that classic find-style commands still run quickly. (This is why ag doesn't support a ~/.agrc or similar.)
Just FYI, you can use ag --stat to see how many files/bytes were searched, how long it took, etc. I think I'll add some stats about obeying ignore rules, since some of those can be outright pathological in terms of runtime cost. In many cases, ag spends more time figuring out what to search than actually searching.
I tried to gauge the CPU usage (just looking at the percentage as listed in xmobar) -- but both grep and ag are too fast on the ~400MB set of files for that to work... As I have two cores on this machine, the difference between ag and rgrep could indeed be ag's use of threads.
Many thanks for not just writing and sharing ag as free software, but for the nice articles describing the design and optimizations!
At least this brief benchmarking run convinced me that I should probably try to integrate ag into my workflow :-)
Quickly reviewing some of the posts on the ag blog/page[1], I'm guessing the speedup is mainly from a custom dir scanning algorithm and possibly from running two threads.
In the course of checking out ag (again) I also learned about gnu id-utils[2].
There should be a scrollbar on the bottom (I kept the commands on one line, rather than splitting with "\"). Might not be on mobile, though? In other words, the code-boxes should have overflow:scroll or something to that effect.
I think Chrome on OSX hides scroll bars by default unless you're scrolling. Regardless, the box is tall enough that it doesn't fit in my viewport so I wouldn't see the bottom scrollbar anyway.
Nobody needs POSIX conformance in an interactive pager; that's just conformance for the sake of conformance. Or do they? What is the economic justification ("business case") for working on POSIX compliance in a "more" command?
You should never invoke "more" directly in a script anyway, but rather observe the PAGER environment variable, and fall back on a plain "more" only if that isn't set. (Speaking of which, PAGER isn't described in POSIX, oops!)
If the user wants the pager to exit when the last line is reached, the user can specify the necessary option in PAGER, if their pager supports it. PAGER just has to be properly expanded: treated as a command, not a command name.
Cool, but terminfo and curses are another bunch of things I'd like to see go, or at least delegated from the core of every fullscreen terminal application to a special compatibility layer (e.g. tmux or screen) for those using terminals which don't speak ANSI.
I've looked at it before, and I agree less is a mess.
I got excited to click on this, imagining that someone was modernizing less. Which is true in a narrow sense, but am I the only one to feel a lingering disappointment that we run shells in xterms (or slightly modified equivalents) that emulate ancient terminals, then implement pagers to page through screen after screen?
Can anyone point to work that starts with the combination of the two following propositions:
1. User interface elements invented since the 1970s are a pretty neat thing.
2. Text-based shells and the command line are also a pretty neat thing.
Yes. His sentiments here http://acko.net/blog/on-termkit/ are very close to mine: "It makes me wonder, when sitting in front of a crisp, 2.3 million pixel display (i.e. a laptop) why I'm telling those pixels to draw me a computer terminal from the 80s".
I'm not crazy about every last aspect of his design there - but it's a start.
I find 'view' fine for things like looking at a logfile, but not great for one of my more common use-cases for a pager, which is looking at something from stdin that's either large or slow-to-produce. You have to wait until vim reads in the entire stream before you can do anything with it. With 'less' you can immediately navigate/search/etc. while the stream is still coming in.
Also, the default action of the cursor keys is more useful for paging in less than in view. In view, cursor keys move the cursor; in less, they scroll the screen.
And less has built-in tailing which you can start and stop at any time. That's its killer feature for me.
On the other hand, vim/view can have some nice syntax highlighting for syslog format log files. I haven't found that enough to switch though.
Have you tried using the less-like script distributed with vim? The path varies depending on your distribution of vim
# ubuntu 14.04
alias less='/usr/share/vim/vim74/macros/less.sh'
# ubuntu 12.04: vim73, if I recall right
alias less='/usr/share/vim/vim73/macros/less.sh'
This behaves like `less` in many ways, and uses your syntax highlighting from vim. My only complaint is that some things with escape codes for colors are not flattened, but instead you see the escape codes. (Diffs seem to work fine, at least.) It also appears to read the whole thing into vim, which is likely not what you want for large files.
I also use the 'vimcat' script from the vimpager project [0] as well. (I'm not sure why I haven't just used the whole thing. I must not have realized there was more when I first grabbed it.)
It's probably not that simple. The article indicates that the refactor removed support for a lot of old platforms which the project may very well prefer to keep.
Well, it's certainly older. It's not really so ambiguous since the CSS version is "LESS". Speaking as someone who cut his teeth on the command line, I clicked the article thinking that it was referring to LESS since the idea of modernizing less never occurred to me (even though it's a much older piece of technology; LESS is much more broken...)
A lot of people are talking about revamping these old programs. I don't see what the problem is with less, cat, vim, etc...
I've never been using a program and wished for better functionality. Even when this was all new to me, it was never a problem figuring these tools out and using them, and I was always satisfied with them.
So...here's the question. I don't think these are broken, so what are you fixing?
>So...here's the question. I don't think these are broken, so what are you fixing?
First, it's not just about something not working. It's about creating tools that are extensible, understandable, and hackable. Open source is not just about "working"; it's about being modifiable by the end user. All this cruft (a mess of 200 obsolete architectures, dead code, and deprecated library support that nobody has used since 1988) works against that goal.
Second, there are things that would be essential for some people, like international users (e.g. proper multibyte support), that cannot be added due to the dependency on custom methods of handling encodings. That's not some wishy-washy magical unicorn feature request; it's essential to the main operation of what less does for those who have to deal with these encodings.
Third, there's nothing wrong with taking pride in finely crafting your tools. UNIX is supposed to be made of things that "do one thing and do it well". Less having its own UTF-8 support breaks this division of responsibility. We have libraries for that. The same goes for getopt() vs. its custom option parsing.
At least for programs written in C, most (all?) modern Unix-like platforms should include the functionality in the base install. On the language side, C89 requires support for wide and multibyte characters in a conforming libc implementation. And POSIX furthermore requires a locales/iconv system to specify and convert between encodings. Neither of those strictly require that UTF-8 be one of the supported encodings (C89 predates Unicode), but any reasonably modern implementation will include Unicode locales. And if it doesn't, I think at this point you can just consider that to be the system's problem: the current assumption for POSIXy programs is that they will use the system locales, not try to implement their own encoding machinery.
>I don't think these are broken, so what are you fixing?
This is what the post talks about. But in short: illumos (an operating system derived from OpenSolaris) needed a POSIX-compliant pager (/usr/bin/more), ported the less program to their OS, and in the process found many issues, which they cleaned up and fixed.
> So...here's the question. I don't think these are broken, so what are you fixing?
I know many people that never got a BSOD on Windows. Does this mean that there is nothing that could be fixed about Windows? Or is it possible that different people have different experiences?
Clarity, without having to follow through to the article.
The lack of any context for HN posts (also reddit link shares) is a significant disadvantage of both sites. I've always been partial to Slashdot's link summaries, and wish that style were more widely used. See also Jakob Nielsen and microcontent.
> I wonder what other nasty things would appear after a serious code review of the GNU and BSD core code bases.
> I do agree that the code has been read probably hundreds of times and executed millions of times, but I doubt there have been many formal attempts at improvement and the overall methodology resembles brute forcing to me.
I could be wrong.