The most copied StackOverflow snippet of all time is flawed (programming.guide)
216 points by chris_wot on Dec 4, 2019 | 88 comments

Suppose we had an index of snippets, meaning you've parsed them and are able to search isomorphically, so that, e.g., variable names are not significant. Some techniques are discussed in [1].

If we then ran that index against source repos, we could get update notifications for copypasta'd code.

"In file F at line L, it looks like you used some code from SO at revision R. In revision R', it's been corrected."

[1]: https://wiki.haskell.org/Hoogle#Theoretical_Foundations
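
To make "search isomorphically" a bit more concrete, here is a minimal sketch in Python, assuming the snippets are Python source. It renames identifiers to canonical placeholders in order of first appearance and compares the resulting syntax trees. The names `_Renamer` and `fingerprint` are made up for illustration, and a real index would need much more than this (scoping, attribute names, reordered statements).

```python
import ast
import builtins


class _Renamer(ast.NodeTransformer):
    """Replace user-chosen identifiers with v0, v1, ... in order of first use."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        # Keep builtins such as len or sum, so len(x) and sum(x) stay distinct.
        if hasattr(builtins, name):
            return name
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node


def fingerprint(source: str) -> str:
    """Return a string that is equal for snippets differing only in names."""
    tree = _Renamer().visit(ast.parse(source))
    return ast.dump(tree)


snippet_a = "def f(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total\n"
snippet_b = "def g(items):\n    acc = 0\n    for it in items:\n        acc += it\n    return acc\n"
print(fingerprint(snippet_a) == fingerprint(snippet_b))
```

Two snippets match when their fingerprints are equal, so the fingerprint could serve as a lookup key in such an index.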

We essentially have that, they're stored in NPM, and it's horrible.

It turns out when you can package snippets you use so many you can't possibly keep track and audit them all.

Just look at the Left-pad thing, or the event-stream thing.

SO copypasta is better than NPM, because no one can change the code snippet to steal bitcoins once you've copied it into your code base. It's much more secure than a mutable database.

> Just look at the Left-pad thing, or the event-stream thing.

Those prove that we could see the problem. Brokenness doesn't go away when you grab a snippet of code or reinvent the wheel, you're simply unaware of how much of it is buggy or broken.

What do you mean by this? As far as I understand, NPM provides access to packages, not snippets, and as far as I know it doesn't provide a way to search the code in those packages, let alone isomorphically.

A lot of npm packages aren't longer than a typical stackoverflow answer, and they get used everywhere, to the point where installing a dozen packages can lead to tens of thousands of sub-packages being installed.

At that point, the packages are essentially "indexed snippets" of code.

There’s going to be a massive amount of false positives:

“I see you used “for i in...” and that copies this SO question about iteration...”

Agreed, you'd definitely need a mechanism to mitigate false positives.

One technique would be to try and define what constitutes "trivial" code.

Another would be to prioritize sources. Documentation from standard or major third party libraries should take precedence over SO.

Another would be a feedback mechanism. If repo authors vote a particular snippet up or down, after a threshold it could be excluded from matching.

Or you could opt-in by means of a comment, though this might make it useless.

There has been a bit of research on this[1]:

> I qualitatively analyzed the top 50 clones in that list and was able to identify the source (or at least a source) of the snippets in most of the cases.

[1]: https://meta.stackoverflow.com/questions/375761/how-to-handl...

The simple, readable, loop-based snippet mentioned in the article also works for the edge cases. Sometimes it's better to not be so clever.

No, it doesn't. It has the exact same rounding bug described in the article.

The article even mentions:

> FWIW, all 22 answers posted, including the ones using Apache Commons and Android libraries, had this bug (or a variation of it) at the time of writing.

That seems a bit confusing given how the loop works.

999,999 shouldn't be big enough to be affected by floating point rounding.

If the comment is correct then Java will evaluate ( 999999 >= 1000000 ) as true, which simply cannot be correct.

You're misunderstanding the bug.

In the end, the number will be displayed to one decimal place, e.g. "1.1 MB" or similar, due to the "%.1f" format specifier.

If the input is 999999 bytes, the loop will see that it is less than 1 MB and so will format the number 999.999 into "%.1f kB". When this number is rounded to one decimal place as part of that formatting, it rounds up to "1000.0 kB". This is the wrong output.
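
A minimal Python sketch of that failure, assuming a loop-based formatter along the lines of the one in the article (`loop_format` and `UNITS` are illustrative names, not the article's code):

```python
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]


def loop_format(n: int) -> str:
    """Pick the unit by repeated division, then round while formatting."""
    value = float(n)
    unit = 0
    while value >= 1000 and unit < len(UNITS) - 1:
        value /= 1000
        unit += 1
    # The rounding happens here, after the unit has already been chosen.
    return "%.1f %s" % (value, UNITS[unit])


print(loop_format(999_999))    # 1000.0 kB
print(loop_format(1_000_000))  # 1.0 MB
```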

There's a bit of a catch-22, because ideally you would do the rounding first and then see which unit applies, but you can't do the rounding until you know what the divisor will be.

The article solves this by first manually determining what the cut-off should be (it will be some number like "...999500...") but personally I'd probably just decide to round to significant figures instead so that you can cleanly separate the "rounding" and "unit selection" steps.
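
One possible reading of the significant-figures idea, sketched in Python with integer arithmetic only. This is not the article's fix; the function names, the default of three significant figures and the choice to print all of them are assumptions made for the example.

```python
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]


def round_to_sig_figs(n: int, sig: int) -> int:
    """Round a non-negative integer to `sig` significant digits (half up)."""
    digits = len(str(n))
    if digits <= sig:
        return n
    step = 10 ** (digits - sig)
    return (n + step // 2) // step * step


def human_readable(n: int, sig: int = 3) -> str:
    """Round first, then choose the unit from the already-rounded value."""
    if n < 0:
        raise ValueError("size must be non-negative")
    rounded = round_to_sig_figs(n, sig)
    unit = 0
    while unit < len(UNITS) - 1 and rounded >= 1000 ** (unit + 1):
        unit += 1
    if unit == 0:
        return f"{rounded} B"
    divisor = 1000 ** unit
    whole, rem = divmod(rounded, divisor)
    # Show as many decimals as are left after the integer part.
    decimals = max(0, sig - len(str(whole)))
    if decimals == 0:
        return f"{whole} {UNITS[unit]}"
    frac = rem * 10 ** decimals // divisor
    return f"{whole}.{frac:0{decimals}d} {UNITS[unit]}"


print(human_readable(999_999))  # 1.00 MB
print(human_readable(999_499))  # 999 kB
print(human_readable(12_345))   # 12.3 kB
```

Because the unit is picked after rounding, 999999 bytes comes out as "1.00 MB" rather than "1000.0 kB". The trade-off is that the number of decimals varies with the magnitude.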

Oh, that makes sense, although I'm not sure "1000.0 kB" is strictly wrong in this case.

With the loop at least it's easy to adjust the thresholds if that is desirable, although a comment will be necessary to explain why you are making the cutoffs in weird places.

Interestingly enough in the past I've used a loop similar to this:

  static char suffix[] = { ' ', 'k', 'm', 'g', 't' };
  int magnitude = 0;
  /* Step up one unit per division; stop at the last suffix in the table. */
  while ( value >= 1000 && magnitude < 4 ) {
    value /= 1000;
    magnitude++;
  }
  printf("%.1f%cB", value, suffix[magnitude]);

Which is bad because it's subject to repeated rounding from the division, but avoids the problem you described.

I don't think the multiple divisions will do any harm as you only need 4 digits of precision left at the end.

And the pow and log methods they call run loops anyway. What is wrong with loops anyway?

> the pow and log methods they call run loops anyway.

That's actually not true (as somebody mentioned on reddit) https://github.com/openjdk-mirror/jdk/blob/jdk8u/jdk8u/maste...

It is probably slower anyway, but I was surprised to see no loops

They kind of cheated because they used a hard-coded polynomial approximation of log(x). So there is an implicit unrolled loop going over A_i*x^i.

Is using Math.pow and Math.log (twice) faster than a tight loop that won't run more than six times?


You run into the same problem if you are writing something like the C 'itoa' function (integer to ascii); if you want to write the digits out front to back you need to know what divisor to use for the leading digit so you need to either look it up in a table or take the log.

Taking the log is a lot slower than the table lookup, I found that out the hard way.

People convert so many integers to ascii and it is shocking how slow ascii <-> binary numeric conversions are compared to binary numeric operations, so it's not a matter of "premature optimizations".

Now you can write an itoa which generates the digits from back to front and not have to worry about copying the results because you return a pointer to the middle of the result buffer but then memory management gets more complex...

You can use a binary search to make the lookup even faster. There's no need to iterate over all possible lengths.

I'm not sure that you come out ahead with binary search over n=10 given branch prediction issues. You'd have to test it to really know.
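
For reference, a small Python sketch of the two lookup strategies being compared: a linear scan over a table of powers of ten versus a binary search over the same table. It says nothing about which is faster in C; the names are illustrative and the table only covers values below 10**19.

```python
from bisect import bisect_right

# 10, 100, ..., 10**18
POWERS_OF_TEN = [10 ** k for k in range(1, 19)]


def digits_linear(n: int) -> int:
    """Count decimal digits by scanning the table from the smallest power."""
    count = 1
    for power in POWERS_OF_TEN:
        if n < power:
            break
        count += 1
    return count


def digits_bisect(n: int) -> int:
    """Count decimal digits with a binary search over the same table."""
    return bisect_right(POWERS_OF_TEN, n) + 1


print(digits_linear(12345), digits_bisect(12345))  # 5 5
```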

I feel like this would be a good argument in favour of small scoped packages like we sometimes see on npm. Often enough, a trivial code snippet like this turns out to be not so trivial after all.


The point being that you lose all connection with a snippet after you copy+paste it. I can clearly see benefits when you centralize its development, make use of the collective mind to harden it, and get notified about possible updates whenever an edge-case is found.

It's a shame that including unit tests quickly makes the snippet non-trivial to copy.

Rust has a nice solution for this though: tests can also be embedded in documentation comments:

    /// Adds one to the number given.
    /// # Examples
    /// ```
    /// let arg = 5;
    /// let answer = my_crate::add_one(arg);
    /// assert_eq!(6, answer);
    /// ```
    pub fn add_one(x: i32) -> i32 {
        x + 1
    }

From: https://doc.rust-lang.org/book/ch14-02-publishing-to-crates-...

More languages could adopt that idea, and a good StackOverflow answer would include those tests in the snippet. StackOverflow might even automatically run the tests and add a passing/failing badge!

In Python since 1999: https://groups.google.com/forum/#!msg/comp.lang.python/DfzH5... :-)

(Though not in the stdlib til v2.1, April 2001)
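
The linked post is truncated, but this presumably refers to `doctest`, which joined the standard library in Python 2.1. A minimal sketch of the same `add_one` example written as a doctest:

```python
import doctest


def add_one(x: int) -> int:
    """Add one to the number given.

    >>> add_one(5)
    6
    """
    return x + 1


if __name__ == "__main__":
    # Runs every example found in the docstrings of this module.
    doctest.testmod()
```

Running the file directly executes the example in the docstring and reports any mismatch between the expected and the actual output.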

What I would really like is for unit tests to become full fledged features of a language. Any object can contain a Test method (which would be static), this method contains all the unit test code for that object. Select "Run tests" from your compiler, it compiles everything and goes through calling any Test methods it found but the main entry point is never called. A release compile doesn't link the Test methods, nor any method marked [Test] (support functions only needed for testing.)

> I feel like this would be a good argument in favour of small scoped packages like we sometimes see on npm.

Rather, I think this is an argument that this kind of functionality should be in the standard library; perhaps in the equivalent of `*printf` for each language.

Download and run code written by strangers without understanding what it does — what could possibly go wrong?


Ironic that this was published today. :-)

> Download and run code written by strangers without understanding what it does

Like a web browser does?

There's certainly a risk difference with code that runs in a reasonably well thought out sandbox.

That — and I am not a big fan of browsers getting more and more access to the hardware/OS over time.

That means being beholden to native apps on every platform if you want to do anything at a lower level. I'm not sure that's a better solution.

There are no issues with languages, packages, or whatnot.

Random code snippets from the internet are obviously completely unsafe. There is therefore basic "due diligence" to apply when considering using one such snippet:

1. Very carefully read the code to understand it.

2. Test it (corner cases/threshold values are the trivial things to test for such a piece of code doing conversions)

In general I do not copy-paste code snippets. I use them as examples of how to perform a task or how to use an API, then I write my own code. This also avoids IP issues.

Then it's never possible to use any package/module/plugin anywhere. I get the danger, but I'd rather have convenience than write every function from scratch.

There is a very big, and obvious, difference between using a plugin published by a well-known source, and a random code snippet posted by a random person.

I mean, is there, when the snippet is the most copy pasted answer on SO?

This does not tell you anything about the source of the snippet and, almost by definition, people who blindly copy snippets from SO are likely not experts in the field.

On the other hand, when I download and use Openssl (for example) I am reasonably confident that the code was developed and scrutinised by people who know what they are doing.

No, absolutely not! I wholeheartedly despise npm. Whenever I try to install a small node app to try it out, npm literally creates tens of thousands of directories, and that's not okay for any reason! This is a risk worth taking.

In what language is it any different? I tried to help edit the Rust docs. It downloaded > 100 packages and thousands of files. I tried to use some command via brew, and it downloaded 15+ dependencies, each derived from 100s or 1000s of files.

I agree I don't like the risk but is npm more risky?

> In what language is it any different?

C and C++, which haven’t made the decision to bundle a package manager with a programming language (which is dubious IMO because they are almost completely unrelated concerns), and for which you’re normally supposed to get dependencies from your curated, maintained, OS-provided repositories.

NPM creates a node_modules folder and then fills it with the libraries that your app has specified. Then each of those libraries has its own node_modules folder, and NPM installs the dependencies of that library there. This happens recursively, which is absolutely crazy. The directory structure is A-5.0/B-2.0, C-3.0/B-2.0, D-3.0/B-2.0, which leads to B-2.0 being duplicated three times even though it is the same version.

Almost every other package manager uses a completely different strategy. First of all, every package gets a globally unique identifier (in NPM, package identifiers are relative to the node_modules folder, of which there are many). Usually it is the name of the package, and if a package manager needs to support multiple versions of a library within the same program, it just adds the version itself to the identifier. This means that if you need B-2.0, it would be stored in node_modules/B/2.0/ and libraries A, C and D would all use that single copy.
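
A toy Python model of the two layouts as described in this comment, only to show how the counts diverge. The dependency graph `DEPS` is hypothetical, has one version per package and no cycles, and the sketch makes no claim about what any particular npm version actually does on disk.

```python
# Hypothetical dependency graph: the app uses A, C and D, which all need B.
DEPS = {
    "app": ["A", "C", "D"],
    "A": ["B"],
    "C": ["B"],
    "D": ["B"],
    "B": [],
}


def nested_install_count(pkg: str) -> int:
    """Package directories created when each package gets private copies."""
    return sum(1 + nested_install_count(dep) for dep in DEPS[pkg])


def flat_install_count(pkg: str) -> int:
    """Package directories created when each package is stored exactly once."""
    seen = set()
    stack = list(DEPS[pkg])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(DEPS[dep])
    return len(seen)


print(nested_install_count("app"))  # 6
print(flat_install_count("app"))    # 4
```

With deeper graphs the nested count grows multiplicatively along every path, while the flat count is bounded by the number of distinct packages.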

The NPM community is well known for its one-liner packages with most of the work done in the dependencies. If that one-liner has 5 dependencies and your project transitively depends on it 5 times through react or something, then you end up creating 5 times as many files as are needed. It's very easy for a trivial application that uses NPM to have a million files in the node_modules folder.

I forget that everything I copy from SO, and everything I post there, is under a CC BY-SA license. That SA is "share-alike" and I don't think people really understand what that means. From Wikipedia's article on it:

"These licences have been described pejoratively as viral licences, because the inclusion of copyleft material in a larger work typically requires the entire work to be made copyleft."

Now how much code uses something copied from SO? And I wonder how copyright even applies to "code snippets"?

> And I wonder how copyright even applies to "code snippets"?

The Software Freedom Law Center has a substantial article on copyrightability of code: https://softwarefreedom.org/resources/2007/originality-requi...

The important point here is that there's no minimum length for code to be copyrightable. It simply needs to be original and at least minimally creative. Since at least thousands of other developers have found the snippet to be useful enough to directly borrow rather than writing an equivalent, it sure looks copyrightable to me.

Thanks for that link! I found this part especially interesting:

"In particular, the laws stress that it is a programmer’s expression of some functionality that may be protected by copyright, and not the functionality itself. If code embodies the only way (or one of very few ways) to express its underlying functionality, that code will be considered unoriginal because the expression is inseparable from the functionality. Similarly, if a program’s expression is dictated entirely by practical or technical considerations, or other external constraints, it will also be considered unoriginal."

Sounds like a case that at least some snippets aren't copyrightable.

I don't understand how the above principle can distinguish anything, at all.

You could reasonably argue that every piece of code is completely and only expressing functionality, because it's all inherently directing the computer to do stuff. So only comments would be protected.

On the other hand, you could instead argue that every piece of code can be translated into another language, and in fact is, whether interpreted or compiled, so the source code is exclusively expression only as the functionality is never tied to it.

But it doesn't seem to me to make any sense to say that some part or aspect is expression and another is functionality. It's all or nothing.

I assume it's like plagiarism, and if you rewrite the code "in your own words" you've copied the functionality but not the individual expression.

Also seems that copyright regarding code and the whole world of copyleft is still a grey area in the courts.

That's interesting, comparing it to plagiarism, reminds me of when I was shamed when I was like 8 for rewriting a paragraph from a book in my own words for an essay. At least when I was that age, that was totally considered plagiarism (at least by my parents). It was crushing to find that even though I'd worked really hard on paraphrasing each sentence, it didn't count and I'd missed the whole point.

I wonder what standards colleges and research journals have now.

You'll find some more information on the "copyrightability" of code snippets in Section 3 of the paper that Andreas mentioned in his blog post and on the corresponding slides: https://empirical-software.engineering/assets/pdf/emse18-sni... https://empirical-software.engineering/assets/pdf/icse19-sni...

> user contributions licensed under cc by-sa 4.0 with attribution required.

I was just thinking about how everyone ignores the license for the code on SO. Both the code and the way it's used are flawed.

For what it's worth, in the common user journey through SO (Google SRP to SO question page, scroll to answers), where does SO ever tell you about the license or the requirements of complying? If they simply rely on people knowing that generally there are licenses and such and they should look into it, then it's hardly a surprise almost no one complies. I've worked with many people, developers and not, who probably couldn't even rattle off a few common licenses.

I'm not a professional developer, but I've copied snippets from SO before. I've always included the answer URL in a comment next to it though. But mostly because if I ever had issues with it, I wanted to know where I got it and on the off chance anyone else looks at my code, I wouldn't want them thinking I wrote code I didn't write.

Common misconception: that licenses create requirements and reduce rights.

In legal reality if no license is attached to code you have almost zero rights to copy, use, or distribute it.

So if a casual browser of SO doesn’t see any license terms, they should assume that doing almost anything with the code is illegal.

It means the user who asked the question cannot use the code in the answers.

No, it doesn't, because it is licenced under CC BY-SA.

It’s basically the only time I would ever put a URL in a comment.

It’s a testament to SO that those URLs have still worked when I clicked on them years later and often provide valuable context to some esoteric bit of code.

Putting a URL in a comment still requires you to share the rest of your code, does it not? The SA part of CC-BY-SA?

The rest of your code? IANAL but I don’t think it’s viral like the GPL.

The way I understand it, if you fix a bug in an SO code snippet, then in theory you owe that bug fix back to the crowd.

EDIT: The more I read about this the more cringeworthy it becomes. CC BY-SA is not designed for source code. The language in the actual license is extremely imprecise for trying to reason about using modular bits of CC licensed code in a complex system.

“Adapted Material” is defined simply as;

Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material

... and ...

in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor.

Is my 100,000 LOC application “derived from” or “based upon” a function which converts a hex string to a byte[]? That’s a question that can only be decided in a courtroom, and to my knowledge CC BY-SA has never been litigated in the context of source code.

My inclination is that the terms “derived from” and “based upon” must necessarily carry a stronger meaning than, for example, “incorporates” or other similar terms which would not imply a central shared function or feature.

I think they should change it to MIT or BSD or CC0. I've put in my profile that all code I post is CC0. I don't want or need credit for explaining something in 5 to 50 lines of code. Sometimes the code is longer, but that's usually because it's setup for the actual important part. I don't really understand why people feel they need CC-BY-SA.

I really doubt that you could defend the copyright to a snippet like this in court. Saying it's CC BY-SA is all well and good, but if it's unenforceable in reality, it's meaningless.

I recently had the strange experience of writing a Promise.allSettled implementation in Node.js, but it wasn't quite working. I checked Stackoverflow and found an implementation that was almost line for line what I'd written.

I would imagine that's pretty common, especially for small helper functions for a very specific task. There might really only be one real logical way of doing it. How many different ways could you really implement leftpad?

Apparently whatever ended up in npm was shockingly horrible.

[The Error of Our Ways • Kevlin Henney] https://youtu.be/IiGXq3yY70o?t=1368

[slides, page 32] https://gotocon.com/dl/goto-berlin-2016/slides/KevlinHenney_...

Why don’t you think it’d be enforceable?

One of my favorite functions like this that I found years ago on SO is an extremely efficient bit of code to convert a hex string to byte[] in C#.

It would be fun for someone to make a standard library that consisted of all the highest voted SO utility functions.

Would be even better for Microsoft to add the most frequent snippets of code to the CLR!

I wonder what percentage of bugs in general are from people not understanding how floating-point numbers work. I think property-based testing (QuickCheck) should be used whenever floats are involved. Nobody ever seems to get them right.

At least -0.0% are.

Would QuickCheck know to try 999999 as input? Most of the possible inputs would give correct results, and those that don't, aren't very 'special', such as 0,1,-1,MAX_INT, and so on.

Don't many QuickCheck-inspired libraries have special cases to ensure they generate common numbers like those? I could be misremembering, but I would have sworn I read that in the documentation of the last library I looked at (whose name escapes me).

Sorry, I phrased myself poorly - in this case, the inputs that give incorrect output aren't very special, as opposed to 0,1,INT_MAX,...

Although including numbers such as 10^N - 1 isn't out of the question either.
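
A hand-rolled Python sketch of that idea, using only the standard library rather than a real QuickCheck-style tool. The property checked is "the printed number stays below 1000". `naive_format` is an illustrative loop-based formatter with the bug, not the article's code, and the input generator is an assumption.

```python
import random


def naive_format(n: int) -> str:
    """Loop-based formatter that rounds only after choosing the unit."""
    units = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]
    value = float(n)
    i = 0
    while value >= 1000 and i < len(units) - 1:
        value /= 1000
        i += 1
    return f"{value:.1f} {units[i]}"


def mantissa_in_range(text: str) -> bool:
    """Property: the number in front of the unit should be below 1000."""
    return float(text.split()[0]) < 1000


def find_counterexamples(inputs):
    """Return every input whose formatted output violates the property."""
    return [n for n in inputs if not mantissa_in_range(naive_format(n))]


# Boundary candidates of the form 10**k - 1, plus uniformly random inputs.
boundary_inputs = [10 ** k - 1 for k in range(1, 19)]
random_inputs = [random.randrange(10 ** 18) for _ in range(1000)]

print(find_counterexamples(boundary_inputs))
print(len(find_counterexamples(random_inputs)))
```

The boundary list includes 999999, which violates the property, whereas uniformly random inputs only occasionally land in the narrow bands just below a unit boundary.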

What's the cost of reading, correcting, and creating pull requests for 6943 bugs vs `apt-get update && apt-get upgrade`?

Assuming you're talking about publishing a library instead of a snippet in the first place, the main difference is that one puts the onus on every client and the other puts it on you.

Your Github account.

You'd be violating the ToS.

What part of the Github ToS would that violate?

To be human comparable, I'd prefer everything in MBs instead.

    ls -l --block-size=M

A lot of the GNU coreutils have a -h flag for "human-readable" sizes. It works with sort too, so it'll put 990M before 1.2G.

It's not that flawed, though. Showing 1000kb instead of 1mb could be interpreted as a deliberate decision.

To me 1000kB would imply that the result was calculated with 1024 as base instead of 1000, which is a wrong assumption.

It should always use the highest relevant prefix to avoid confusion imho.

The best way to avoid confusion is for everybody to stop using powers of 2 instead of powers of 10 for K, M, etc. The meanings of Kilo, Mega, etc. were well defined before computers were invented. It was a mistake to steal those terms and use them for something different.

This was as of 2010?

Actually, SO's obsession that the answers should contain code ready to copy/paste is flawed. They'd rather give them fish instead of teaching them how to fish.

I don't see that obsession, I have given plenty of prose-only answers that are the most popular one for the given question.

I've even gotten compliments for giving a prose-only answer instead of code.

But those were the users, not the moderators :)

On StackOverflow there's not much difference.

When you're reading the documentation for a software library, do you prefer if it includes examples for common usage or do you prefer a long list of API methods?

Have you ever used the "examples" section of a man page or do you prefer to scroll through the (sometimes long) list of parameters?

In my opinion, code examples are an important tool for teaching real-life usage patterns.

How else do you answer stuff like "how to compare strings in bash" if not with a code example?

On the other hand, often the people reading the answers are expert fishermen, but simply don't want to waste their time on trivial fish. If one wants to learn to fish, there are already many other resources.
