Hacker News new | past | comments | ask | show | jobs | submit login
Parsing can become accidentally quadratic because of sscanf (github.com/biojppm)
510 points by JdeBP on March 1, 2021 | hide | past | favorite | 160 comments



Seen the GTA post already and the sscanf behavior is very surprising as one would expect it to parse the given string until the given format is satisfied or \0 is hit. Seems like this is not the case. Found this: https://stackoverflow.com/a/23924112

> sscanf() converts the string you pass in to an _IO_FILE* to make the string look like a "file". This is so the same internal _IO_vfscanf() can be used for both a string and a FILE*. However, as part of that conversion, done in a _IO_str_init_static_internal() function, it calls __rawmemchr (ptr, '\0'); essentially a strlen() call, on your input string.

Looks correct. Call flow:

https://github.com/bminor/glibc/blob/21c3f4b5368686ade28d90d...

https://github.com/bminor/glibc/blob/21c3f4b5368686ade28d90d...

https://github.com/bminor/glibc/blob/21c3f4b5368686ade28d90d...

https://github.com/bminor/glibc/blob/83908b3a1ea51e3aa7ff422...


By itself, that doesn't add up to N-squared behavior, though.

> It looks like the problem is that reading floats is done using c4::from_chars -> c4::atof -> sscanf which ultimately calls _IO_str_init_static_internal in the glibc implementation. As far as I understand, this essentially causes strlen to be called on the string that rapidyaml passes to it, essentially causing strlen to be called O(n^2) times.

As such, this doesn't follow. Just because sscanf takes the length of the entire source string doesn't mean will get N^2 behavior, unless you do silly things with sscanf, like extract a small token from the beginning of a large string, and then skip past it to do the same thing again after that small token: essentially, build a lexical analyzer in which the pattern matcher consists of calls to scanf. This is not an intended use case for sscanf. No self-respecting language parser designer would do such a thing in a non-throwaway, production program.

sscanf has undefined behaviors for out-of-range numeric values, so you can't use it for just blindly extracting numeric tokens at all, regardless of any other issues.

sscanf should only be used on a string that your program has either prepared itself, or carefully delimited by some other means.

For instance, suppose we have a string which holds a date (could be an input to the program). We can test whether it's in the YYYY-MM-DD format, and get those values out:

   int year, month, day;
   char junk;

   if (sscanf(input, "%4d-%2d-%2d%c", &year, &month, &day, &junk) == 3)
   {
      /* we got it */
   }
The purpose of junk is to test that there is no trailing garbage character. If there is, it will be scanned and the return value will be 4. If we don't care about that, we can just omit that argument and conversion specifier.

This will accept inputs like 01-1-1. Worse, it will scan leading signs!

If you wan to insist on nothing but digits, and exact digit counts, you can break it up as string tokens first:

   char yearstr[5], monthstr[3], daystr[3];
   char junk;

   if (sscanf(input, "%4[0-9]-%2[0-9]-%2[0-9]%c", yearstr, monthstr, daystr, &junk) == 3)
   {
      int year = atoi(yearstr);
      int month = atoi(monthstr);
      int day = atoi(daystr);
   }
atoi is safe here, by the way, because we know that the integers are decimal strings no more than four digits.

This is the sort of thing where sscanf comes in very handy. If you know how to use it, it solves these sorts of things nicely, and the performance is probably okay. At least not degenerately bad.

Complete sample with all the validation:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(int argc, char **argv)
  {
     char yearstr[5] = "", monthstr[3] = "", daystr[3] = "";
     char junk;

     if (argv[0] && argv[1] &&
         sscanf(argv[1], "%4[0-9]-%2[0-9]-%2[0-9]%c", yearstr, monthstr, daystr, &junk) == 3 &&
         yearstr[3] && monthstr[1] && daystr[1])
     {
        int year = atoi(yearstr);
        int month = atoi(monthstr);
        int day = atoi(daystr);

        printf("%d:%d:%d\n", year, month, day);

        return 0;
     }

     return EXIT_FAILURE;
  }
Some tests:

  $ ./date asdf
  $ ./date 1-2-3
  $ ./date 2020-01-1
  $ ./date 20-01-01
  $ ./date 2020-01-01
  2020:1:1
  $ ./date 2020-01-01z
  $ ./date 2020-01-12
  2020:1:12


I am confused about this comment. The parent comment gave clear call stack that pointing to the problem where the glibc's sscanf implementation would strlen the input string no matter what format string you use (because it need to masquerade as a file and that requires strlen the whole input: https://github.com/bminor/glibc/blob/21c3f4b5368686ade28d90d...)

All your magic on format string doesn't change that poor behavior unless you manually add '\0' somewhere to null-terminate the input string early.

If you parse the year-month-day in 100 places of that input string, it will strlen 100 times. And now replace the "100 places" to a linear function of the input string length (strlen(input) / 11?), that is quadratic.


I agree with you. However, the point of the parent comment is that you should not use scanf to do this; it is not designed to parse a small part of a large string. And indeed, I would say it is bad for more reasons than just this performance cliff. You should really prefer to use strtod or strtol on the exact scanned token instead. Even then, these functions may be impacted by C locale settings, which means you should probably instead prefer using implementations of this algorithm outside of libc...


I just ran strtod through a debugger it at least ends up in strlen.S, so it probably wont improve the runtime complexity . It seems to be best to just assume the worst when it comes to any functions in the C standard library that have to do with string handling and avoid them in favor of literally anything else.


Did you call strtod with a null endptr? If not, I'd be surprised if it actually strlen'd the input rather than a copy of it. Still, I agree, you shouldn't really use the C standard library to do strings.


> Still, I agree, you shouldn't really use the C standard library to do strings.

As someone who's been learning some C, I wonder if you have a recommendation for a guide on how to handle strings properly.


It probably shouldn’t involve writing your own string functions. They’re so hard to get right, obviously. Last time I wrote C, I copied out the Rust Vec and String API as C headers and didn’t quite nail the implementation, but it was fun.

There are at least a few libs like sds that store the length information in various ways, which solves at least this kind of problem.

https://github.com/antirez/sds


There’s plenty of string libraries, although ultimately no answer is perfect. It’s totally OK to use the libc string routines when learning C, but it would be good to keep in mind that it is not necessarily easy to write safe and correct software using it.

(Languages with more advanced string types and slicing mechanics are not impacted; C++, Rust, Go, etc. They do have their own issues of course.)


> This is not an intended use case for sscanf.

It's almost like JSON and XML were invented so people didn't have to understand lex/yacc.

And behold... they don't.


> [...] so you can't use it for just blindly extracting numeric tokens at all [...]

So in other words, it's completely useless?

> This is not an intended use case for sscanf.

Then what is?


Yes, scanf is almost completely useless, and sscanf a little less so. sscanf is useful if you need a pattern-based breakdown of some string datum, and the patterns are not out of sscanf's league. It will work "anywhere"; it's been in C since before 1989 ANSI C.

In the entire TXR project, I used sscanf in one function: a floating to integer conversion routine. The context is that integers can be bignums, so we can't just rely on ordinary conversion, except in the fixnum range.

http://www.kylheku.com/cgit/txr/tree/arith.c?h=txr-252#n2912

The approach taken is to sprintf the double value into a generous buffer, with a generous precision, using the %g format which will automatically use the e notation if necessary.

We detect whether the e notation was produced, and other characteristics, and then proceed accordingly.

I see that an impossible case is being tested for; since values with a small magnitude are handled via the fixnum route, we will never hit the case where we can short-circuit to zero.

Can anyone see a bug here, by the way? :)


The comment to which you reply uses the %[] conversion to impose stricter limits on acceptable characters to be interpreted as numbers.

My guess at the historical intention of *scanf is that it was meant for parsing tab-delimited and fixed-width records from old-school data files.


(As opposed to the comma-separated data files that we enlightened future-folk use!)


You know, why don't we use 0x1F (Unit Separator)? Seems like we have had a hierarchical delimiters since ASCII / punch cards, yet I never hear about them. https://en.wikipedia.org/wiki/C0_and_C1_control_codes#Field_...

CSVs are such a pain if you have freeform text (in which you have to handle newlines and escaped quotes).

It seems like using FS/GS/RS/US from the C0 control codes would be divine (provided there was implicit support for viewing/editing them in a simple text editor). I get that it's a non-printable control character, and thus not traditionally rendered... but that latter point could have been done ages ago as a standard convention in editors (enter VIM et. al., as they will indeed allow you to see them).


It's nice being able to edit a CSV in a text editor. It wouldn't be worth it to lose that just to not have to escape some strings.


Text editors that can put control characters into files, and indeed PC keyboards that have a [Control] key, have been around for decades, though. For example: With WordStar, SideKick, and Microsoft's EDIT, one just typed [Control]+[P] and then the control character. M. Darkphibre mentioned VIM, but really this isn't an idea that was ever limited to just VIM and emacs.


It wouldn't take that much effort to add a glyph in the editor for it?


more than that it is nice to edit files with a standard keyboard,


If your editor doesn't allow you to enter arbitrary characters using a standard keyboard then you need a better editor.


The problem is not the editor, it's the human typing on the keyboard that sees , and thinks "that'll do". Using an obscure dedicated character is not going to happen.


I guess my point was a curiosity that these control codes had become obscure, when text format interchange is so prevalent throughout computing history.


I do not want to learn new input methods for every program, I want a standardized compose-like functionality that allow me to write both diacritics and backticks ( a plus for the rest of unicode)


All operating systems have that too. Look up "Input Method Editor" in your favorite OS's documentation. See also dead keys, compose key, AltGr, etc, etc. If you think you can't edit a file because it has funny characters in it, then you're not trying hard enough.


I know, still there is no reasonable way* to do both common european diacritics and backticks/tilde on windows without installing third-party software.

* I find dead keys irritating beyond reason, so I do not count them as an option


There are plenty of ways a lazy programmer will want to pass the records into something else that either doesn't handle non-printables, or does but the user doesn't like how it looks.

TSV also works more reliably than CSV, because most people don't put tab characters in the data in these kinds of records. Tab is even the default field delimiter for cut. But everybody uses CSV, because again it's easier to reason about the above. (shrug)


No matter what delimiters we use, we still have either a data sanitization problem or a data escaping problem. I've worked with a wire protocol that used the ASCII record delimiters, but with binary data so you also had to use binary escape (0x1b) and set the eighth bit on the following byte.


Which is a mess, because some parts of the world use commas as decimal commas: 3,14159etc. So we have to use semicolon for our CSVs.


> some parts of the world

most parts

Why can't you be normal? https://commons.wikimedia.org/wiki/File:DecimalSeparator.svg


I submitted a patch to support decimal commas when parsing timestamps in Go in 2013. I thought this was a slam dunk because while major users of decimal dots include USA, China, India, and Japan, the decimal comma is used by pretty much every other country in the world. Going from ~40% support to ~99.9% support seemed like an obvious win.

Rob Pike politely declined the patch, commenting "I might prefer to leave it unaddressed for the moment. Proper locale treatment is coming (I hope soon) and this seems like the wrong place to start insinuating special locale handling into the standard library."

Three years later another Go team member commented that "Date localization is definitely still in the planning."

We're in year 8 now. The issue is still open. Rob Pike is still hoping.


> ~40% support to ~99.9% support

№ of countries ≠ № of people though. ⅘ of the 5 most populous countries¹ use decimal points, and together they alone² have ~42½% of the world’s people. Do all decimal point countries together have ≤50%? (Edit: Is it really not “normal” to do things the Anglophone way?)

¹According to https://en.wikipedia.org/wiki/List_of_countries_and_dependen...

²Not counting territories


This is where Rust's decision to keep such functions out of the standard library entirely starts to look pretty smart.


Wow, taking a look at the related issues, they are really hostile towards fixing this particular pain point. I knew Go had a reputation for being condescending and opinionated, but I had no idea it was this bad.

https://github.com/golang/go/issues/6189

https://github.com/golang/go/issues/26002

https://github.com/golang/go/issues/27746

https://github.com/golang/go/issues/36145



C is full of record based stuff. Apart from the numeric modifiers to printf and scanf we have fread, fwrite, calloc all taking record sizes


> ... doesn't mean will get N^2 behavior, unless you do silly things with sscanf, like extract a small token from the beginning of a large string, and then skip past it to do the same thing again after that small token: essentially, build a lexical analyzer in which the pattern matcher consists of calls to scanf. This is not an intended use case for sscanf.

Then why does it support the "%n" specifier? Isn't it exactly for that use case?


to my understanding there is a long input string, completely in memory. that string consists of a gazillion of floats. the string is walked with a pointer, the floats are sscanf-ed one after the other. each of the sscanfs will call strlen once, but each strlen will scan the full remaining huge string to find the \0 eos byte.

the sum of all scanned bytes is quadratic to the length of the huge input string.

?

edit: oh the autocorrect...


That description seems to resemble my remark:

> extract a small token from the beginning of a large string, and then skip past it to do the same thing again after that small token: essentially, build a lexical analyzer in which the pattern matcher consists of calls to scanf.

Scanning floats with sscanf means undefined behavior on a datum like 1.0E999 against a %lf (double), where the exponent is out of range. IEEEE double goes to E308, if I recall.

C99 says "Unless assignment suppression was indicated by a[n asterisk], the result of the conversion is placed in the object pointed to by the first argument following the format argument that has not already received a conversion result. If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined."

I'm seeing that on glibc, 1E999 scans to an Inf, but that is not a requirement.

This problem could be solved with sscanf, too, while maintaining the degenerate quadratic behavior. sscanf patterns could be used to destructure a floating-point number textually into pieces like the whole part, fractional part and exponent after the E, which could then be validated separately, like that the E part is between -308 and 308 and so forth. Someone could still do this in a loop over a string with millions of floats. I'd say though that you're way past the point of needing a real lexical analyzer, and not multiple passes through scanf.


What about parsing an unknown number of at-most-4-digits integers from a string (being careful, as you exemplified)? Isn’t that a reasonable use of sscanf?

Do I need a full fledged parser for that? No. Does it have to traverse the whole string every time? No, because I’m being very clear about the maximum length of the thing I’m trying to parse.

You cannot not conclude that it’s a bug.


My thoughts on that is that the problem is at a level below sscanf; in other words, what else is affected?

A mechanism for turning a string in memory into a string should not be taking its length.

GBLIC has an extension called fmemopen. The API is:

  FILE *fmemopen(void *buf, size_t size, const char *mode);
This interface smells like it might have the flaw that if someone wants "buf" to come from a null-terminated string, they must measure that string in order to supply the size parameter.

I.e. the underlying mechanism is binary, and does not stop at null characters.

There needs to be a separate mechanism with a different underlying stream implementation:

  FILE *fstropen(const char *str); // mode implied: "r"
This doesn't require a size.

That fmemopen extension points to a solution to the original problem, though non-portable: open the entire buffer of numbers as a stream in one fmemopen call, and then use fscanf to march through it.


fmemopen is meanwhile portable and part of posix.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/f...


let's say you have a newline delimited dates in &junk. now you do a while loop, whoops quadratic.


Well, not if you do something like

   char * start = data; 
   char * next = strchr(data,'\n'); 
   while(next) 
   {

      *next=0; 
      sscanf(start, ...) 
      start=next+1; 
      char * next = strchr(start,'\n'); 
   }


junk is a character, not a string, though.


The valid point is that the junk could be arbitrarily long, and that glibc's sscanf will scan through it in taking the length of the input.

But we are not continuing to scan into the junk here to get more tokens here; we don't have a loop that will cause O(N*N) behavior.

E.g. this could be user input from a form field where they are expected to type a date, and we want to reject that if there is trailing junk. At that point, the user gets an error, and has to fix the input. We don't look for anything in the trailing junk.


Are you not listening to what anyone else is telling you here? Did you not read the article?

It's not about "scanning into junk". The problem would happen if you used scanf to extract integers from a string of `' '.join([1]*1000000)`. Because you would make 1,000,000 scanf calls, and _each_ of those calls would call strlen on the string, quadratically scanning over the string to repeatedly find the end.

In fact, every sscanf is O(size of the string) even when the string is well formed. This is not a well behaved function.


The calls to sscanf() in the preceding post have been edited since xe wrote that. My guess is that xe meant the parameter now named "input".


I found this while making a collection of what C implementation does what at https://news.ycombinator.com/item?id=26298300 .

There are two basic implementation strategies. The BSD (FreeBSD and OpenBSD and more than likely NetBSD too), Microsoft, GNU, and MUSL C libraries use one, and suffer from this; whereas the OpenWatcom, P.J. Plauger, Tru64 Unix, and my standard C libraries use another, and do not.

The 2002 report in the comp.lang.c Usenet newsgroup (listed in that discussion) is the earliest that I've found so far.

There have been several assertions that library support for strings in C++ is poor. However, note that the fix here was to switch from the Standard C library function sscanf() to a C++ library.

* https://github.com/fastfloat

One might advocate using other people's libraries, on the grounds that they will have got such things right. However, note that this is RapidYAML, and the bug got fixed several years after GTA Online was released. Other third-party parsing libraries have suffered from being layered on top of sscanf() in other ways.

* https://github.com/json-c/json-c/issues/173

* https://github.com/kgabis/parson/commit/96150ba1fd7f3398aa6a...


It turns out that the strlen() behaviour was present as far back as the 4BSD libc [1] circa 1980, and via the Net/2 releases has found its way into NetBSD/FreeBSD/OpenBSD along with countless other BSD-derived libc implementations.

Of course, the underlying vfscanf() implementation that reads from a FILE* simply proceeds one character at a time; the pseudo-quadratic behaviour arising from the strlen() call is simply an artifact of the simplistic "simulate a file using a string" wrapper in sscanf().

[1] https://minnie.tuhs.org/cgi-bin/utree.pl?file=4BSD/usr/src/l...


Musl reads the "file" in 256 byte increments and calls strlen on each one. Still a strlen, but on short strings it would only be called once, on a short amount of data. It would work fine for this use case.


As laid out at https://news.ycombinator.com/item?id=26298300, it doesn't call strlen(); and it isn't 256 bytes. Moreover that use case is not the use cases at hand, which are people parsing 10MiB of JSON and 504KiB of YAML with parsers that are not calling sscanf() just once.

MUSL, with the same implementation strategy of nonce FILE objects, ends up performing multiple passes over those big input strings, once to look for NUL and a second time to pull each character out one by one. And it's doing those multiple passes multiple times, as it sets up and tears down the nonce FILE object each call to sscanf(), meaning that it's scanning the same parts of the string for NUL over and over again as it repeats the same buffer fill(s) from the input string a few characters further on. As I said before, it's not entirely free of this behaviour. With the other implementation strategy, in contrast, there's only a single pass over the whole input string, even with those multiple calls to sscanf().


Let's look at the source: https://git.musl-libc.org/cgit/musl/tree/src/stdio/vsscanf.c

It fills in a buf with a string_read callback, but if we look at what vfscanf does, it uses shgetc, which eventually calls into __uflow, which is where the read is performed:

https://git.musl-libc.org/cgit/musl/tree/src/stdio/vfscanf.c https://git.musl-libc.org/cgit/musl/tree/src/internal/shgetc... https://git.musl-libc.org/cgit/musl/tree/src/stdio/__uflow.c

This will call into the string_read callback with a buf size of 1. The string_read will internally do a memchr(buf, 0, len+256);, so, OK, it's memchr-ing at most 257 bytes, sorry. The read data then gets stored in a stdio-internal buffer (the rpos/rlen below), so this code path will actually be skipped on future reads.

If it then needs to unparse a character, it will use ungetc to put it back in a stdio-internal buffer. Reading this code isn't that hard, I don't know why you keep insisting it does something else.


This isn't solely directed towards you, but as a general comment: if you're keen to illustrate a point about code behaviour (as we often are!) then it can potentially save time and ambiguity to include commit references and line ranges.

It still doesn't guarantee that the behaviour is stable (because dependencies within the linked range may change before a reader gets to it) but it can help support your message (and hopefully your argument) both at time-of-authorship and in future.

(and yep, it's a tradeoff of your own time -- and response time -- against future readers' time; often tricky to handle. I think code browsers could do a lot more to make it easier)


You haven't read the code properly. The memcpy() copies one character. After shgetc() reads that and possibly a few more characters, the entire FILE object goes away and the next call to sscanf() sets up an entirely fresh nonce FILE object and does the initial fill all over again, including scanning the stuff that it has already scanned for NUL, a few characters further on. As I just explained.


Yes, but it's not scanning the entire string, it's scanning at most 256 bytes. In MSVC, the number of bytes scanned is `M*N` where M is the number of calls to scanf and N is the size of the file. In musl, N is now 257, no matter the size of the file. Compared to ~10MB of data, that's basically free.

Sure, it's scanning a bit more than maybe it needs to, but it's much closer to a non-scanning libc in the amount of data it's consuming.


And this will be the third time that I've written the phrase "not entirely free of this behaviour", in comparison with the implementations that are entirely free of this behaviour.


It's not entirely free of overhead. The overhead is constant per invocation though, so it's entirely free of _quadratic_ behavior. Which is what's being discussed.


No, that was not what was being discussed, neither here nor in that discussion. See https://news.ycombinator.com/item?id=26297612 . None of us mentioned quadratic behaviour. We are talking about sscanf() calling strlen(), and in the original discussion why.


Sure, the quadratic behavior is attached to the "calling sscanf in a loop". The point is linear behavior for a call to sscanf. Which isn't the case for musl. Nor does musl call strlen, or even effectively call it. It will bizarrely scan for a NUL in up to 256 bytes ahead of where in the string it is currently. But that's not actually the same behavior as calling strlen or having other linear behavior.


In case anyone missed it, this sscanf behavior was the cause of the long loading times in GTA V, a story that hit the front page yesterday:

https://news.ycombinator.com/item?id=26296339


Well half the cause (or probably two thirds), as the post shows there’s a second quadratic blowup when filling the not-a-hashmap thing after the very slow parsing has completed.


A filling that wasn't even required because it was de-duping already de-duped data instead of just inserting it into a list (or a hashmap, as suggested in the post).


The filling was necessary, but the deduplication (in linear time as it works over an array) indeed was not as the input is unique.

I can see why the devs would not trust that though. So it’s the inefficiency of the dedup’ which bugs me more than the dedup itself.


That was a brilliant investigation. Reminded me of when I used to work on Windows debuggers. I wish the author touched on symbolization more. That is where a lot of the fun lies when dissecting precompiled binaries.


What are the tools for symbolization? I recently spent a lot of time laboriously propagating RTTI information around Visual C++ binaries by hand in Ghidra, so I would be very interested in learning about this tooling, even if it's all IDA for now.


It has been 2-3 decades since I’ve worked on this, so this may be outdated advice. In fact, I just learned of Ghidra when I read the article.

In general there are three cases:

- Developer forgot to strip/obfuscate symbols, in which case you’re golden

- Symbol you seek is in a shared library: patch the DLL. Often what we did was do investigations in a VM, where the n most common shared library calls were patched to send stack and symbol data to a local DB. Very similar play to RR, but only in the patched code.

- If the (missing) symbol you seek is pre linked (static), things get interesting. We built a pattern matcher that was incredible. Idea is, if you have the disassembly and a loop, like:

int fun(char* str) { call processChar on each uppercase character }

It pattern matched a name like rax_foreach_iflte_cx_call_obfuscatedSymbolForProcessChar

90+% of code either:

- enters the same dynamic library trampoline

- displays very very common programming patterns.

From what I know, we never released this outside early MS DN members because the 10% is often very useful code and developer tools weren’t given away for free yet.

But once the DLL is patched, you propagate names up graph. Any remaining symbols were hand annotated (and we run the propagator again).

Brian Parks went on to extend this system to work with C++ templates. C++ mangling is very predictable, so his program could scavenge for symbols (obfuscators only worked on the root symbol name) and build out the class at compile time for us to reference.


just a little related: I watched a guy step'n'watch the no-symbols code once after he hooked WINDBG up to an MS Office process that had a paint issue. He narrowed it down to something IME was doing -- I gained nothing but awe.


I'm sure that symbols were not mentioned much as the code was from an obfuscated game engine.


I write a lot of C but I literally never use scanf. C simply lacks the facilities to make such an API truly useful and flexible. You can't scan custom types, error handling is super limited, it's very easy to mess up the format string and potentially end up with corrupted data and security vulnerabilities.

If you need to parse JSON or some high level format look for a lib, for instance I can vouch for cJSON which is very small and has a rather nice API (by C standards at least).

If I need to parse a simple format string like "id:name:description" something like strtok_r does the job to split the components and then you can parse individual components any way you like, potentially offering better error messages.

If I need to parse a complex format then chances are scanf wouldn't do the trick anyway, so it's either custom made parser state machine or lex/yacc.

Note for instance that in the GTA case the call was `sscanf(p, "%d", buf)` which is a glorified strtol with vastly worse performance.

I would seriously advise not using scanf and friends for anything serious. Too many footguns.


Another example of so many things wrong with the status quo.

(1) Naive people find liberation in text-based protocols that don't require tools to understand.

(2) With experience one finds that variable-length data structures are much more treacherous than fixed-length data structures: you find simplicity instead in binary data formats that are dead easy to parse.

(3) Floating point numbers are astonishingly expensive to parse and unparse to ASCII compared to how many FLOPs you could do in that time. All that, for a function that isn't quite correct... That is, the number 1/5, which you write as "0.2" doesn't actually exist in IEEE floats. You don't notice at first because that number unparses as "0.2" -- but then note that "0.1 + 0.2 == 0.3" is false in most computer languages.

(4) Null-terminated strings are the definition of insanity, rivaled only by "pointers are integers, integers are pointers"

One pet project of mine lately is a "Data Assembler" that takes a graph of data structures (say test data for a family of logic chips, graphics data for a laser light show, or "maps" to upload to the engine control unit in your car) in Python and converts it to a byte array that is incorporated into an Arduino program.

The total size of the structure could be up to 32kbytes on a small device, or up to 250kbytes or so on a big device, but the program memory is huge compared to the RAM which might be as little as 2kb so the unpacking code is going to stream the data with a small working area.

The minimum it has to do is put together blocks of data that have references to one another that are only resolved once all of the blocks are otherwise complete.

I deliberated a lot about how to handle variable length data structures and decided to embed a two-byte length before the block for all blocks. I could have saved some bytes when the length of the structure was known, but then I'd have to figure out how to code lengths then (e.g. for an array of 2-byte ints do I count the ints or bytes) and figured I'd make up my mind once.


> Naive people find liberation in text-based protocols that don't require tools to understand.

I don't think text is necessarily the problem here. Other protocols might allow you to use less memory, but they aren't running into a memory limitation in this instance.

You can divide parsers into two groups: parsers designed to read data which follows a specific schema and parsers designed to read arbitrary data.

Schema-aware parsers have additional context which allows them to process the data much more efficiently. The stricter the schema, the more efficient the parser can be. Meanwhile, parsers for arbitrary data have no such context and must rely completely on syntax to parse the data. A poorly optimized schema-aware parser may still outperform a well optimized parser for arbitrary data.

You can still encode data as text and even use popular syntaxes like JSON or YAML with schema-aware parsers. I've personally helped design a schema-aware parser for XML data. But you should keep in mind that most third-party parsers for those syntaxes will be designed to parse arbitrary data. Parsers for arbitrary data can still be very fast - fast enough for many purposes. However, schema-aware parsers have a significant advantage when it comes to computational performance.


Text is the problem, here and everywhere else. Text encourages looking for delimiters, escaping, and flexibility (optional whitespace, case insensitivity, leading zeros, redundant signs, newline style compatibility). Compared to binary with length prefixes, that’s really slow, prone to injection vulnerabilities and subtle incompatibility, requires copying, introduces more side channels (the length of your message now depends on the values of numbers inside it!), ….


Text =/= Variable Length

I can store fixed-length data using text. A strict schema can specify things like fixed lengths, value padding, orders for maps, exact whitespace rules, etc which can entirely remove the need for delimiters. A schema could even require length prefixes, just like you might use for a non-text format.

These concepts do not necessitate abandoning text. If you need the data to be readable/writable by a human with minimal tooling, you can still leverage text without sacrificing much parsing efficiency.


In theory, sure. In practice, nobody does, and length prefixes stop a format like that from being conveniently writable with a plain text editor.


I work on avionics and we definitely use fixed-length text all the time. Text is often much more convenient than blobs, even when adhering to strict schemas.

Syntaxes like XML, JSON, YAML, etc have become popular for their convenience and versatility, but they do not represent the whole of what is possible with text.


You've not used enough Ada: its variable length string is so clunky that fixed sized text is used instead..


And once you get down to it, text still requires tools to parse. I routinely run stuff through `jq`, because despite JSON being a text format, actually consuming it as a human requires a tool. (Often to filter it, but sometimes just to colorize the output to make it legible.)


That's orthogonal. Msgpack is basically json but easier to parse from C, and less inefficient on floats and integers.


Naive people find liberation in text-based protocols that don't require tools to understand.

As a huge fan of text-based protocols (but not of JSON or YAML), calling a whole class of debuggability, interoperability, and onboarding benefits "naive" is a bit much.


It's a lot of the reason why software has gotten to be so porous in terms of security flaws.

I would trust programmers at Intel to handle FORTRAN-style data structures, and maybe even make a compiler in C.

I wouldn't trust anybody from Intel to write a web server or even a web application with PHP in that wild west of applications programming with strings they are sure to mess it up and thus leave a vulnerability in the management engine on your CPU. It might superficially look like good code, but American Fuzzy Lop will find the branch that isn't obvious, and so will some kid.


To me the problem seems to be old-school C programmers thinking that they don't need to use a library to parse a standard format. I have only ever seen JSON parsing errors in C and C++ codebases because they are the only languages where developers seem to think it's a good idea to parse JSON using string functions rather than a JSON parser.


Some of that is that they don't have systems like maven or npm that would make it really easy to incorporate libraries into the build.

That said, what I like about C is that it is a simple language where you can throw out the standard library and start anew: it makes me think of the old versions of ALGOL that didn't specify mechanisms for I/O, and now we know you can specify that behavior in libraries and leave the language pure.


> Some of that is that they don't have systems like maven or npm that would make it really easy to incorporate libraries into the build.

Definitely! I wish the C and C++ communities would take this problem seriously.

> That said, what I like about C is that it is a simple language where you can throw out the standard library and start anew

This definitely is a nice property. But it's not just C that can do this. You can for example do this with Rust, and get an excellent package management system (that can indeed package these alternatives to the stdlib) to boot.


As a C programmer, i can say that most of the json libraries fail on simple fuzzing test, or do some other idiotic stuff like trying to handle numbers (and doing it wrong of course). We have awesome parser generators such as ragel, yet people still handroll parsing code that is both buggy and slow.


> We have awesome parser generators such as ragel, yet people still handroll parsing code that is both buggy and slow.

Wasn't ragel responsible for the cloudbleed vulnerability? Wherever possible we really shouldn't be using C at all.


No, the bug was in code that was feeding input to ragel [1]. Ragel itself was fine. (Ragel also can output to non C languages). I've used ragel myself extensively and it's really rock solid, I've fuzzed the state machines it generates a lot. I've even used it to parse protobuf specifications and generate code from them, the company still uses that code. (and it's really not a lot of code)

1: https://blog.cloudflare.com/incident-report-on-memory-leak-c... (As for cloudflare, I don't think it's good idea to let single company to MITM the whole internet)


It sounds like Ragel still requires the user to "use it correctly", which is really not great.


That's like saying driving car into a wall full speed should not kill you


It's like saying that we ideally shouldn't have cars that can drive full speed into a wall, or at least that these cars shouldn't be the cars that most people use. And indeed modern cars will generally at least attempt to break before they hit a wall. It seems not too fanciful to think that in another 10 years we will have cars that can reliably prevent themselves from being driven into walls.

Our tech with regard to type and memory safety is more advanced than our car tech: we already have the tech to prevent these errors. And I believe we should be using it wherever possible. Obviously there will be exceptions (the more obvious being legacy codebases), but they should be exceptions.


My Subaru will do its best to turn off the gas and auto break when it detects I’m about to drive into a wall.


I used to be the only one where I worked who would use the standard library for xml in perl rather than just ad hoc regexes.

They'd recycle my scripts when dealing with very large files but wouldn't learn it themselves.


> variable-length data structures are much more treacherous than fixed-length data structures

A fixed-length structure is only fixed until you need to add a field to it. I understand the appeal, but good solutions seem to be very rare here. ASN.1 was possibly the first attempt but turned into a total nightmare. Protobuf seems to be doing a reasonable job, but only as a wire format and not a persistence format.

Text always "works", even if inadequately. And it can be handled with your normal diff and editing tools.


> Text always work

Windows vs Linux end-of-line markers often screws-up things. UTF often screws-up things. Tab-vs-space sometimes screws-up things. Text is related to Postel's principle, which is often questioned because it leads to sloppy implementations on one side and complex implementations on the other side, and other bug-is-now-a-feature niceties.

I designed a couple of small internal protocols, and have been maintaining them for at least a decade. Maintenance time is nearly 0 because I never had to add a field to them. Even though the software using them evolved a lot and in unexpected ways.

Yes, they stood the test of time because they solved the problem we had, and those problems didn't evolve much. Not every protocol or format needs to be hyper-extendable and scalable.

> even if inadequately

So it works poorly but it works, so it can be shipped. That's why you buy a new smartphone every two year and don't see much improvement in response time or battery life.


We've been using protobuf as a persistence format without issue?


A self-describing format like MessagePack can be easier to build readers for in other applications, or if the schema is lost or changed.


If we've lost access to the schema, then we've lost access to our entire source repository, so we probably don't need to read our data anymore?

I guess I could see that being an interesting property in software that wants to deal with incompatible forks, or software that for whatever reason wants other people to read their data but is unwilling to ship a schema with it, but just for persistence on backend servers it doesn't seem like a very interesting property or worth the extra space requirements.


CBOR is the gold standard for schema free data.


You have. Far more apps out there are using JSON or XML, including (as we've seen GTA).


Ah, you were just talking about popularity not fitness? Fair enough.


Whenever floating-point numbers converted to text are not intended to be used by a human, converting the numbers to decimal is a great mistake in my opinion.

The floating-point numbers should be converted to text as hexadecimal, e.g., in C99 and later, using printf with the %a or %A conversion specifiers.

Even when a human uses the textual floating-point numbers, if they are used only for making copies or comparisons, the hexadecimal numbers are more convenient and they are free of conversion errors.


How would the compiler be able to use any fast, floating point specific instructions if you parsed it as a hexadecimal string?


Not sure I understand the question. The suggestion is to use hexfloats, as they are unambiguous and straightforward to convert from and to (as opposed to anything with decimal significands): https://www.cplusplus.com/reference/ios/hexfloat/


You don't need any specific instructions. The hex encoded double contains the final binary representation of the floating point number. So no conversion is required to load it, except maybe for swapping around bytes on some architectures. Conceptually:

    double v;
    memcpy(&v, "\x00\x00\x00\x00\x00\xe4\x94\x40", sizeof(v)); // LE


On (2): implicit type promotions with unsigned and signed types can make it no-so-easy in C. FWIW, I always test with bytes with the sign bit on (on two's complement architectures, never worked with others) when I do stuff like that.

On (3): never perform equality tests on FP numbers in languages that use IEEE standard FP format.

On (4): I have worked both with ASCIIZ and counted strings, I think both have their pros and cons. But if you are writing stuff that has any contact with C/C++, it's often a bad idea to go against the stream. No, auto-adding a final 0 to your counted strings won't save the day every time.


Never perform equality tests on FP numbers period - it defeats the purpose of using floating point. This is not specific to the IEEE format.


Tell that to lua and javascript developers who only have access to FP numbers.


shrug They also only have access to IEEE FP numbers, so that applies equally much to the parent.


If my memory serves, IEEE FP is accurate for those real numbers that are also integers (at least on some range). JS users are safe as long as they stay within the integer algebra (ie they don't use division, mainly, or perhaps round up the result). Lua users have the extra option to recompile their interpreter it uses integers.


That's fair as far as it goes (I pedantically should have said general-purpose FP numbers), but, again, applies equally well (actually more so, since they were talking about IEEE FP specifically) to the parent post.


I agree generally that if you can afford the cycles to serialize and deserialize from decimal ASCII ("0.2") that you can probably afford to use a decimal numeric type in your business logic.

Personally I blame the C standard for not splitting strtod in to 1 parsing function (returning an integer and the power of ten), and 1 computation/conversion function. Without this you're basically hosed with respect to locales.


> Null-terminated strings

There is nothing inherently wrong about a null-terminated string convention.


There's no guarantee that it terminates.


I am not sure there is a universally better solution. If a function takes both string and its length (e.g. strncmp), there is no guarantee that the length is correct. Or like C++, you always represent a string by a (len,max,ptr) 3-tuple, but 1) this is not a primitive type and 2) len and max waste memory when you have millions of short strings.


If it is that bad use a variable length length code like the utf-8 transformation. For strings smaller than 127 bytes overhead is the same as NUL.


This increases code complexity. Also, for many applications, we prefer to have a max field for dynamic strings. Each type of string representation has its own pros and cons. This comes back to my original comment: there is not a universally better solution.


I'll add that variable length stuff is inefficient to parse as it breaks instruction level parallelism. Later parsing events become dependent on earlier events.


> (4) Null-terminated strings are the definition of insanity

Could you elaborate on this one? I've only used C in college and don't know why this is bad.


I would point out two main issues:

1. Such a string cannot contain arbitrary bytes. Since it is frequently useful to work with arrays of arbitrary bytes, we have to repeat every library function twice, once for strings that can contain nuls (the mem* series of functions), and once for strings terminated by nuls (the str* series of functions). The latter are much better supported, and are easier to work with, so we frequently see programs that trip over strings with nuls in them for no particularly good reason.

2. We often want to modify strings, not just read them. Modifications often need more space than was in the original string. A nul terminated string cannot say how much room beyond the current length is allowed to be written to. By default, most library functions assume that there is enough space, however this is frequently an unsafe assumption. This again causes us to duplicate every library function twice over in order to specify the size of the buffer (the strn* series of functions). Since, again, these functions are poorly supported and harder to use, we see many many instances of programs that have buffer overflow bugs, and therefore likely RCEs, for no good reason at all.


Besides the fact that you have to walk the string just to find the length of it...

Null-terminated strings try to save space by encoding the string length by convention. This can fail due to off-by-one errors, mistakenly allocating a fixed buffer, strings that have an unexpected null in them, and more.

Those can lead to all kinds of nasty problems like buffer overflows, which allow someone who can craft an input to write arbitrary stuff into memory.

God only knows how many security vulnerabilities and performance problems could be traced back to null terminated strings.


As is the case in this article it often happens that having the length of a string is required for certain operations, which leads to a lot of calls to strlen. This doesn't really seem like such a big problem to begin with as one assumes that the operations that require the length also have linear complexity. However it turns out (like in the article) that this is not always the case which leads to unexpected quadratic algorithms. The pointer + length representation is only 7 bytes longer on modern architectures and does not have this problem and has a few other advantages like being able to create substrings without allocating new memory.


In fact, having the length of the string was not required. Certain ways of implementing scanf() required getting the length of the string, or otherwise scanning the whole thing.

But saner ways did not.


It's incredibly easy to do something wrong and cause bugs or security issues in your app.

Or, compare string manipulation in C with any high level language like Ruby, Python, JS and the ease of use is night and day.


Another reason is a substring of a nul-terminated string is only nul-terminated if it is a suffix. So slicing out a substring generally requires copying to a new buffer so you can append the nul byte.


The decision to use 0-terminated strings in C may be the single worst decision in programming language history, at least from a security standpoint.


"in-band signaling is the root of all evil" -anonymous


This whole rant is a bit of a non-sequitor for a minor performance bug with a working fix.


Something I've mused about is whether it would be possible to reason about performance characteristics at a types level, to statically identify things like quadratic behavior. No idea what this would look like; I just see this gap where type systems have gotten quite good at ensuring against incorrectness (not C's, of course, but that's another discussion), but there's almost no equivalent story for catastrophic performance regressions, which in some cases can be just as serious as a program-crashing bug.

Basically when it comes to performance we're still in the place that we used to be with correctness: "either read the entire stack of source code to understand everything that's happening, or try and remember to read the docs which may or may not be correct and exhaustive and just hope for the best".

The performance characteristics of a piece of code are part of its contract with the outside world, but that contract is in no way formalized right now.


maybe you could encode the number of maximum nested loops into the type system, so when defining a function you set a maximum number which would fail to compile if exceeded? would be crude but possibly helpful at times


This is brilliant, I too feel like there’s an opportunity there but don’t have any idea off the top of my head how to implement such a thing.


Sounds like a new entry for https://accidentallyquadratic.tumblr.com/?


I ran up against this issue when I was creating a parser that parsed debian package information (code here[1]). Trying to match against a line with regex slowed the program down past feasibility. I had to switch to just assuming that two line breaks meant a new package record instead. After that, the program was zippy.

1: https://github.com/djhaskin987/degasolv/blob/develop/cli-src...


Does the control format have an official spec? I too have had to resort to parsing those files with regexes.


It's chapter 5 of the Debian Policy Manual.


Not that I know of, but I have noticed that two newline characters separate package records, and the record is of the form `Key: Value` for each line. As you can see in my code, I've gotten rid of most/all of the regex stuff, because, as we observe, it kills performance.


I think "(2020)" should be added to the title. (There was a very similar story on the front page yesterday, but alas, it's ultimately unrelated.)


Broadly, it's actually easy to see how naive parsing can become quadratic if a programmer isn't careful.

You parse by processing a long string/file via "get-a-token(), add-token-to-structure()" repeating until you have built your structure or made an error in the construction process (here a token a kind, a type of substring and get-a-token retrieves the substring type in some form).

The problems come because the naive programmer is only thinking about their structure building process, so they use some library function to get their tokens. Whether that's a regular expression interpreter or some other flexible facility by provided by the library, it's easy for it's definition of a token-string to include things "not in the string" as well as things in the string - a "quoted string" begins and ends with a quote, so to discover that X isn't a quoted string, you have to scan to the end. So if thing go wrong, each call to get-a-token may scan the while file to make sure this token isn't that. Just an example. The problem here seems to be tokenizing floats, which also has a "where it end" problem in c++. And the alternative is parsing floats yourself, which indeed no one wants to do since they are complex and defined by a complex standard.


Never been a fan of C style strings for exactly this reason. C style strlen is slow. In C++ strings, the size is included in the string structure. Also, fetching the distance between two std::string::iterator is fast. string_view::size should also be fast when that's available.


Was there ever anybody a fan of C style strings?

Maybe I misremember, but I think long ago people made arguments that C strings were superior because of how simple they were. Such was the thinking 30 years ago.


The null-terminated string is the most compact memory representation for a string. A strong competitor was Pascal strings which are led with a byte that carries the length of the string. The only problem is that limits you to 255 characters in a string. You would have to do some UTF-8 style length encoding to make it memory efficient while still capturing long strings.

Older computers had a lot less memory. Programs would write their own virtual memory systems to swap code and data to and from disk. C was born in this era. By the time large flat memory and many registers were available, the old C string was already the standard. It is very difficult to get away from that because everything from the OS on up assume standard C strings.


Unrelated but I found this but in the repo interesting:

(Under the heading “is it rapid?” Measuring performance).

> The 450MB/s read rate for JSON puts ryml squarely in the same ballpark as RapidJSON and other fast json readers (data from here). Even parsing full YAML is at ~150MB/s, which is still in that performance ballpark, albeit at its lower end. This is something to be proud of, as the YAML specification is much more complex than JSON: 23449 vs 1969 words.

How did YAML get so complicated. I’d assume it would be much more similarity YAML’s quirks adding a few extra things to work on. But that spec is quite huge by comparison (of words alone).


> How did YAML get so complicated

YAML has both more syntactical options, and broader semantic coverage; particularly, it includes the concept of schemas defining YAML profiles.

Also, the YAML spec has much more informational content. Examples are a stark illustration: RFC 8259 has 5 example JSON documents (3 of which are simple scalar values). The preview chapter of the YAML 1.2 spec has 28 examples, only a handful of which are single scalars.


I'm a fan of the parts of YAML that work. I wish there was a universally acknowledged strict YAML standard similar to 'use strict' in Javascript. There are some solid ideas embedded in YAML but the warts make people yell at it. It's also being abused for things that it really isn't meant to do like Ansible.


Two posts about sscanf today. I'm surprised how much people apparently use it. I basically never use it, though I've used strtok a lot.


This a performance bug in glibc (and others), right? Several developers seem to be hit by this.


This is an open bug in glibc from 7 years ago(!) https://sourceware.org/bugzilla/show_bug.cgi?id=17577


And if it’s not a bug in glibc, it’s a bug in the C standard for not requiring it to be big-O of the number of varargs.


Is this good ol' Shlemiel the painter (https://www.joelonsoftware.com/2001/12/11/back-to-basics/) rearing his lovable-but-sometimes-a-little-bit-empty head again?


Same pattern, but what's going on is that sscanf has been implemented as a call to a related function, with setup that incidentally does a bunch of unnecessary work. The scanning isn't what's causing the problem; it's that, before sscanf starts its actual work, it passes over the entire string to see how long it is.


So if I understand is the issue not that sscanf reads the whole string length, but that people call sscanf multiple times on the same string for each value they want to parse?

Meaning sscanf is actually designed to scan all your values in one pass?


It isn't designed that way, no. This is simply an internal choice made by C library implementors, and one implementation choice fares very badly in this situation, as the strings grow long. The other choice is pretty much immune to long strings in these scenarios. You cannot infer from this an intended usage of sscanf(), because this issue isn't part of the interface contract, just an effect of a particular implementation.


I believe I experienced a form of this recently with a regular expression. I never did figure out what exactly was going on but a certain string would cause the expression to run for an eternity. That was an interesting bug to track down.


Depends what you mean by "regular expression," but matching in perl regexps is NP-Hard.

https://perl.plover.com/NPC/



>First of all, I'm happy that as a matter of course ryml is already expected to be faster than alternatives by at least 10x, to the point that a mere 3x speedup is considered an issue!

Lol just say thank you


GTA loading screen with cyberpunk music on loop intensifies


Just another example of the old adage "whatever problem you're facing someone on the internet has already seen it and solved it". We just need to get better at searching


Well, as long as you're facing this problem in 2020 and not around 2012. Even though Rockstar should should have patched GTA by now, battle-tested JSON parsing solutions were a lot rarer 8 years ago.


True. Searching is turning more and more into a not solved problem. For a while, it seemed close to being a solved problem.


I didn't think this is sensational, because I found this behavior while writing one of my first C programs a couple of years ago.

Also you just have to google "sscanf slow"....


I've been hit by this behaviour just a few days ago with the fantastic libfuzzer, which finds much more errors than afl++ or honggfuzz in much less time. But it's buffer strategy is enoying, leading to inefficiency and artificial sscanf bugs. Problem is, oss-fuzz/libfuzzer gives you a const char* buffer without zero termination. In reality every such input buffer is zero terminated, because otherwise strtol or sscanf are worthless and lead to buffer overflows. They don't even allow adding a zero at the end, so you have to copy the whole input buffer just to add your zero termination, which makes all the fuzzer speed advantages worthless.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: