Usability Improvements in GCC 9 (redhat.com)
235 points by chx 13 days ago | 61 comments

Great! This directly addresses one of the reasons I use Clang most of the time, and it's great to see the GCC developers realizing that error-message quality of life is important. As an aside:

  $ gcc-9 -c cve-2014-1266.c -Wall
  cve-2014-1266.c: In function ‘SSLVerifySignedServerKeyExchange’:
  cve-2014-1266.c:629:2: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
    629 |  if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
        |  ^~
  cve-2014-1266.c:631:3: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
    631 |   goto fail;
        |   ^~~~
this is a cute example ;)

It saves us frequent users of C++/python from ourselves :P

If it weren't in a file named after a famous vuln, I would wonder what was on line 630, because the problem isn't obvious from just those two lines.
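For anyone else wondering: assuming the file mirrors the well-known "goto fail" bug (which the name cve-2014-1266.c suggests), line 630 is most likely the first, properly guarded `goto fail;`, and line 631 is the stray duplicate. A minimal C++ sketch of that shape, using hypothetical stand-in functions rather than the real SSL code:

```cpp
// Hypothetical stand-ins for the hashing and signature steps.
static int update_hash(int input) { return input < 0 ? -1 : 0; }
static int verify_signature(bool signature_ok) { return signature_ok ? 0 : -2; }

// Returns 0 on "success". The second `goto fail;` is unconditional, so
// verify_signature() is never reached and err is still 0 at `fail`.
int verify_buggy(int input, bool signature_ok) {
    int err;
    if ((err = update_hash(input)) != 0)
        goto fail;
        goto fail;  // misleadingly indented: NOT guarded by the `if`
    err = verify_signature(signature_ok);
fail:
    return err;
}
```

With this sketch, `verify_buggy(1, false)` returns 0 even though the signature check would have failed, because the duplicate jump always skips it.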

Another major improvement to GCC9 is gcov - starting with GCC 9, gcov can spit out coverage results in JSON format as well as take multiple coverage files as arguments, spitting out all the results to stdout, 1 json object per line (this is huge).

This will allow much of the code coverage calculation process to be done in parallel (i.e. much faster than is currently possible). This is because currently gcov writes everything out to files, which forces the wrapper to be single threaded lest one gcov process overwrite the output file of another.

For fun, I wrote a parallelized GCC9 gcov wrapper in python that generates an LCOV coverage report that genhtml can consume. Unofficial/anecdotal benchmarking shows incredible gains over lcov on my own personal projects (700ms vs. 90 seconds). I'm sure it could be improved even more.


lcov can be really, really slow in how it handles its own file format. I ended up rewriting the genhtml stuff myself in Python (originally, it was to drive a treemap view, but extending it to handle the actual detailed coverage wasn't difficult), and then moved the "merge two files into one file" into Python as well for a >10x speedup. And Python is not known for its blazing speed.

After that, I ended up shifting to tackling the generation of the initial data from .gcda and .gcno, which is where I discovered bugs in both lcov and gcov in the collection process. One bug is that gcov somehow manages to compute negative edge counts on certain cycles in the graph, and when that happens, it double-counts those cycles.

My main issue with g++ compiler output is that for each coding error, I get a few lines of useful information, followed by a hundred "note:" lines detailing the finer details of increasingly improbable failed substitutions.

Clang by default shows only 20 error lines, or something like that, so you have to scroll up a ton less when compiling. I swapped to clang just for the improved error messages.

I wish gcc indented all the note lines, to indicate that they are sub-bullet-points of the actual error/warning.

I fixed that by making a small script which sends the output to `less` if it's over 30 lines: https://github.com/mortie/nixConf/blob/master/bin/msoak

With that, `msoak make` will show the first lines first, and let you scroll down to see the rest. (Simply doing `make | less` isn't enough, because you need to fool the program into thinking it's sending output to a TTY for nice colors.)

"With -fdiagnostics-format=json, the diagnostics are emitted as a big blob of JSON to stderr"

I’ve previously been thinking about what would be the consequences of implementing a subset of Unix tools to work with a binary protocol.

For example, tables would be output with type info and proper unambiguous delimiting of some sort.

Unrelated to that, several years ago I wrote a program as two distinct parts that communicated over a pair of Unix pipes: one part was the GUI, which embedded a web view, and the other contained the program logic and output HTML to stdout. The code was quite ugly, but it did the job at the time.

But the idea of outputting JSON is somewhat in between these two things. The main takeaway here, IMO, is the idea of selecting a specific output format with a flag. Of course, other tools have done this in the past too.

Consider a few Unix utilities reimplemented to work a bit more tightly together in how data is represented: they would serialize the data with, for example, Cap’n Proto and pipe that between them, come with a library for lossily converting it to a plain-text representation, and use some method or combination of methods to decide when to use binary piping. For example:

- A user-provided flag. Tiresome to type.

- A user-provided environment variable. Problematic when mixing tools that implement serialization with pure plaintext tools.

- The shell could be aware of which programs implement this serialization and invoke them with the flags or env vars corresponding to whether or not each program on each side of each pipe in a pipeline supports the serialization. Possibly the way to go.

In this way one could incrementally rebuild the Unix tools to work this way without having to change everything at once, while staying compatible with the universal interface of plaintext forever, since no one wants to, or can, implement this in every Unix tool out there.

Once a couple of tools support this, new ways of interacting with pipelines become possible.

I still love and will use the classic command-line. But in some situations the possibilities enabled by this would be very compelling indeed.

An alternative that I have had kicking around in the back of my mind for a few years is having the process on the read side of the pipe call fcntl() with a new flag to set an "Accept" set of formats, which the process on the write side of the pipe can retrieve with a 'get' equivalent. If there's a format both sides recognise, they can use that.

Both sides just default to plain text so it'll be backwards compatible.
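Just the selection step of that idea, sketched in C++. The fcntl() flag is hypothetical and is not modelled here; `negotiate_format` and the format names are made up for illustration, and the reader's list is assumed to be in preference order:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Picks the first format in the reader's "Accept" list that the writer
// also supports; falls back to plain text when there is no overlap.
std::string negotiate_format(const std::vector<std::string>& reader_accepts,
                             const std::vector<std::string>& writer_supports) {
    for (const std::string& format : reader_accepts) {
        if (std::find(writer_supports.begin(), writer_supports.end(), format) !=
            writer_supports.end()) {
            return format;
        }
    }
    return "text/plain";  // backwards-compatible default
}
```

A reader that never sets an "Accept" list is the empty-vector case, so an old plaintext tool on either side of the pipe still gets plain text.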

What you describe is PowerShell. It is available for most Linux distributions and for macOS.

Not quite. PowerShell input and output are .NET objects - not just data, but also behavior. This requires both sides of the pipe to agree on the object model and the execution environment.

JSON (or another pure data serialization format) is much better in that regard, because then you only need to agree on some basic data structures at the boundary.

I am aware that PowerShell is somewhat like this (in fact I started using PS before permanently switching to Linux back in 2009), but I don’t enjoy PS. I know some people like PS a lot, and that’s great for them, but for me it’s not something I am ever going to use again. I found my home with Unix those 10 years ago.

IMHO what kills PowerShell is that it's so incredibly verbose.

Even with the built-in aliases?

FreeBSD is using this in its base system (although I don't think it's all or even most apps as yet).


I wrote a couple of articles about structured data over pipes, which may motivate some features in the Oil shell [1]:

How to Quickly and Correctly Generate a Git Log in HTML http://www.oilshell.org/blog/2017/09/19.html

Git Log in HTML: A Harder Problem and A Safe Solution http://www.oilshell.org/blog/2017/09/29.html

I'm interested in ideas for structured data over pipes, or even better a prototype and example programs :)

[1] http://www.oilshell.org/blog/2018/01/28.html

I think csv (with a separator of your choice) would be ideal.

It wouldn't require special tools but it could work with them.

You could save the column headers, types and separator in extended attributes.

For extra points you could guarantee the structure of data by only allowing certain tools (selinux context maybe) that are trusted to write the files with the required structure.

That would save a lot of repeated wasteful parsing and checking.

This is very similar to PowerShell. But CMD.EXE just kills it for me.

> One concern I’ve heard when changing how GCC prints diagnostics is that it might break someone’s script for parsing GCC output

Honestly though, who would build a tool which depends on the (presumably undocumented) text output of another application, and then expect it not to break?

Compiler error string formats are well-documented. Tools have been relying on the colon-delimited format mentioned in the article since at least the mid-90s (probably longer).
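As a rough illustration of how little it takes to consume that format, here is a sketch of a parser for lines shaped like `file:line:column: severity: message`, as in the GCC output quoted earlier in the thread. `Diagnostic` and `parse_diagnostic` are illustrative names, and the sketch assumes file names contain no colons:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

struct Diagnostic {
    std::string file;
    int line = 0;
    int column = 0;
    std::string severity;  // e.g. "warning", "error", "note"
    std::string message;
};

// Reads decimal digits followed by ':' starting at pos; advances pos past the ':'.
static bool read_number(const std::string& text, std::size_t& pos, int& out) {
    const std::size_t start = pos;
    int value = 0;
    while (pos < text.size() && std::isdigit(static_cast<unsigned char>(text[pos]))) {
        value = value * 10 + (text[pos] - '0');
        ++pos;
    }
    if (pos == start || pos >= text.size() || text[pos] != ':') return false;
    ++pos;
    out = value;
    return true;
}

// Returns true and fills `out` if `text` looks like a located diagnostic.
bool parse_diagnostic(const std::string& text, Diagnostic& out) {
    std::size_t pos = text.find(':');
    if (pos == std::string::npos || pos == 0) return false;
    Diagnostic parsed;
    parsed.file = text.substr(0, pos);
    ++pos;
    if (!read_number(text, pos, parsed.line)) return false;
    if (!read_number(text, pos, parsed.column)) return false;
    if (pos >= text.size() || text[pos] != ' ') return false;
    ++pos;
    const std::size_t separator = text.find(": ", pos);
    if (separator == std::string::npos) return false;
    parsed.severity = text.substr(pos, separator - pos);
    parsed.message = text.substr(separator + 2);
    out = parsed;
    return true;
}
```

Context lines such as "In function ..." carry no line number, so the sketch rejects them instead of guessing.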

That makes them even more stable than flavors of the decade like xml or json.

In fact, the : delimited compiler error format is so well established that it will probably outlive the json schema the article presents, and maybe even json-as-defacto-standard.

I’m not saying the json thing is bad (it provides rich information that will be useful to some tools), but its shelf life will probably be much less than 25 years.

> it will probably outlive the json

Your comment reminds me of a hilarious but very true-ringing anecdote here¹ about json and 'The Sins of the Father' :-)

¹ https://news.ycombinator.com/item?id=8708617#8709223

I've built tools based on undocumented output of other applications because there was no documented way to get the information I needed. I don't expect that these won't break but it certainly doesn't make it less annoying when they do break.

Oh sure – the first few years of my tech career involved a lot of scraping data from the websites of enterprise tools because their APIs were either incomplete or nonexistent. It was annoying, but never unexpected.

Isn’t that part of the Unix Way?

Not expecting it not to break, that is, but building tools that depend on the undocumented text output of other tools.

Most Unix tools have well-documented output.

What do you mean by Unix tools, and what do you mean by well-documented?

Hypothetical example: can I read only the man pages of GNU’s and all BSD’s `ls`es and from that write a cross-platform tool that parses the output of all of them?

Check out the section "stdout" of the POSIX standard here: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/ls...

When that text output has remained consistent for decades(?) can you really consider it an undocumented format?

Yes. It's sort of like using undocumented API calls: even if it's been unchanged for a decade, if it's undocumented it can change at any time without the usual deprecation warnings and versioning considerations.

Text output, particularly warnings and notes, isn't usually intended to be machine parseable and consistent in the first place, but rather is meant for humans to read and may be changed to be better read by humans at will.

Good thing then that it's not an undocumented API. :-)


I’ve been using Terraform a lot lately and, aside from the lack of a debugger or even line numbers on errors, I wish it had used a real language and had usable errors like this. Great to see; it’s the human computers that are the most expensive to run, so you don’t want to waste their time.

I would be really interested to know what kind of unit testing was done to prove that a feature like this won't break anything. I see that he left the original messages and added more guidance, so it would be possible to regress against GCC8. But the addition of the new guidance must require a certain degree of code partitioning and hand-inspection, given how complex GCC is under the hood.


> What is the optimizer doing?

> ...

Nice! Now, can we get warned about UB?

You can at run time with ubsan. At compile time it doesn't make sense, because undefined behavior is used for cases where in the overwhelmingly common case you won't see undefined behavior. It'd have 99% false positives, pretty much by definition.

This is nonsense. The optimizer makes decisions based on UB at compile-time. It should warn about this, especially when deleting code.

> overwhelmingly common case you won't see undefined behavior.

You think so. But this actually can be a source of serious security vulnerabilities. I want at least the option to know.

Would you like to have a warning for every "for(i=a; i<=b; i++)" loop, where the compiler assumes the loop is not infinite?

Or for every a[i++] in a loop (including a[i] with i++ in the for loop of course), where the compiler assumes it can convert the induction variable to *p++?

Both of these are undefined behavior if "i" is signed and it overflows.
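To make the first case concrete: with `for (i = a; i <= b; i++)` and `b == INT_MAX`, the condition can never become false, so the final `i++` overflows. A sketch of one way to write an inclusive loop that never performs that increment (`sum_inclusive` is just an illustrative name):

```cpp
#include <climits>

// Sums the integers a..b inclusive. The exit test runs before the
// increment, so i is never incremented past b, even when b == INT_MAX.
long long sum_inclusive(int a, int b) {
    long long total = 0;
    if (a > b) return total;  // empty range
    for (int i = a;; ++i) {
        total += i;
        if (i == b) break;  // leave before ++i could overflow
    }
    return total;
}
```

The compiler can't tell whether `b` will ever be `INT_MAX` in the naive form, which is why a blanket warning there would fire on nearly every loop.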

Developers who have to work with actual responsibilities need to take those edge cases "you won't see" into account.

Correct me if I'm wrong, but I think you're misunderstanding what they're saying. They're not saying that undefined behavior can be ignored because it "probably won't happen" or something like that. They're saying that undefined behavior is everywhere in the C standard and that this fact is commonly exploited for the purposes of optimization. For example, signed integer overflow is undefined behavior which allows the compiler to avoid extra sign extensions when indexing in a loop.

Would you want a warning to be emitted at compile time every single time you write "for (int i = 0...)"? I'm sure that would actually end up doing more harm than good as it would just condition people to ignore these warnings altogether. As the poster mentioned, ubsan will catch the cases you actually care about, albeit at runtime. It's also worth noting that GCC does try to warn at compile time about a lot of the undefined behavior that you do care about, such as strict aliasing violations when -Wall is enabled.

> Would you want a warning to be emitted at compile time every single time you write "for (int i = 0...)"?

Yes, I would. I have transitioned to always using ranged loops. I haven't had to write a "for (int i = 0...)" style loop for over a year. Even in one case where I thought it was still needed, I learned about zip iterator adapters from a C++ conference video, wrote my own, paired it with a proxy iterator, and the problem was solved. Everywhere such a loop has existed, it has been replaced with a generator class to provide the range needed.

Not only would I want a warning, but I'd personally turn the warning into an error. I'm one of those dudes who runs around with pretty much as many warnings as possible turned on, and most of them turned into errors. Most of the warnings that are disabled by default are extremely helpful in preventing simple errors. Understanding what those warnings mean and how to resolve them is something I consider to be core understanding for anyone who wants to become more than a junior C++ developer.

There's a _ton_ of libraries that don't work like that. Sucks to be them. Linking to boost works (most of the time) though, and that's the only real hard requirement I have.

> I'm one of those dudes who runs around with pretty much as many warnings as possible turned on, and most of them turned into errors. Most of the warnings that are disabled by default are extremely helpful in preventing simple errors. Understanding what those warnings mean and how to resolve them is something I consider to be core understanding for anyone who wants to become more than a junior C++ developer.

I strongly agree, which is why I think it's important that these warnings are not spurious. -Werror is no longer viable if what it takes to resolve warnings is more ceremony than substance. Emitting a warning every single time the compiler encounters undefined behavior would be utterly overwhelming, given that the vast majority of these instances are harmless, like skipping a load because it would be undefined behavior for a couple of pointers to alias. Obviously GCC and Clang could do more (for example, Clang doesn't implement -Wstrict-aliasing if I recall correctly), but their current approach of warning about UB when harmful, rather than reporting all UB, is correct in my opinion.

This is not going to save you either. To have this be efficient, the compiler needs to prove many things: for example, that the end iterator and the current iterator come from the same base object. Compilers don't do this, because comparing iterators from two different objects is UB.

Yes, I do. It's easy to get into a habit of writing "for (size_t i = 0; ...)".

Could you explain why in this hypothetical scenario a warning should be emitted for “for (int i = 0...)”?

bonzini's hypothetical scenario was one where the compiler emitted a warning every single time it encountered undefined behavior. Assuming that a loop counter cannot overflow because signed integer overflow is undefined is one such instance and would fall under the "99% false positives" that they mention.

But of course this way leads to madness, because then we'd also have to warn every time you do `x + 1` or `*ptr`, both of which can lead to UB (if x is a signed integer and the addition overflows, or if ptr is null).
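For the few places where you do want those two cases guarded, it has to be spelled out by hand. A small sketch: `checked_add` and `checked_load` are illustrative names, and `__builtin_add_overflow` is a GCC/Clang extension rather than standard C++:

```cpp
// Adds two ints; returns false (result is then the wrapped value) instead of
// invoking undefined behavior on overflow.
bool checked_add(int x, int y, int& result) {
    return !__builtin_add_overflow(x, y, &result);
}

// Reads through a pointer; returns false instead of dereferencing null.
bool checked_load(const int* ptr, int& result) {
    if (ptr == nullptr) return false;
    result = *ptr;
    return true;
}
```

Wrapping every `x + 1` and `*ptr` like this is exactly the noise a warn-on-all-UB mode would push people toward.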


Range-based for loops are far more idiomatic for modern C++: https://stackoverflow.com/a/15089298/1111557

They're also not in the standard library. Not everyone can justify integrating Boost into their applications :(

You can certainly do this with just the standard library and no boost, for example using iota from <numeric>. https://en.cppreference.com/w/cpp/algorithm/iota

That’s not a generator, though. You need to allocate an extra container for this to fill, which is inconvenient.
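A sketch of what the iota approach looks like in practice, showing the extra container: `make_indices` and `sum_by_index` are illustrative names, not anything from the standard library.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Builds {0, 1, ..., count - 1}. This is the extra allocation: std::iota
// fills an existing range, it does not generate values lazily.
std::vector<std::size_t> make_indices(std::size_t count) {
    std::vector<std::size_t> indices(count);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    return indices;
}

// Index-based traversal written as a range-based for loop.
long long sum_by_index(const std::vector<int>& values) {
    long long total = 0;
    for (std::size_t index : make_indices(values.size())) {
        total += values[index];
    }
    return total;
}
```

It works with only the standard library, but each loop pays for a vector as long as the range, which is the inconvenience being pointed out.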

And underneath they compile to an int for loop and a pointer dereference, both of which can be undefined behavior.

Yeah, enable -Wall. It won't catch everything, but it will get some of them.

And -Wextra. Adds some more useful warnings not enabled by -Wall. Even adding -Wextra doesn't turn on everything; yet more have to be turned on manually, like -Wzero-as-null-pointer-constant.

You are aware that -W turns on more warnings than -Wall?

Be glad they fixed it, sometime back, so "-W -Wall" doesn't turn off the extra -W warnings.

Yeah, but -Wextra has some options that I consider to be of dubious value.
