Finding CSV files that start with a BOM using ripgrep (simonwillison.net)
119 points by goranmoomin on May 29, 2021 | 54 comments



>The --multiline option means the search spans multiple lines - I only want to match entire files that begin with my search term, so this means that ^ will match the start of the file, not the start of individual lines.

That's not correct because the `m` flag gets enabled by the multiline option.

    $ printf 'a\nbaz\nabc\n' | rg -U '^b'
    baz
You need to use `\A` to match the start of the file, or to disable the `m` flag using `(?-m)`, but there seems to be some sort of bug (will file an issue soon):

    $ printf 'a\nbaz\nabc\n' | rg -U '\Ab'
    baz
    $ printf 'a1\nbaz\nabc\n' | rg -U '\Ab'
    baz
    $ printf 'a12\nbaz\nabc\n' | rg -U '\Ab'
    $


Yup, that's exactly right. '\A' or '(?-m)^' should work, but don't, because of an incorrectly applied optimization.

The bug is fixed on master. Thanks for calling this to my attention! https://github.com/BurntSushi/ripgrep/issues/1878


Hmm. Thanks for fixing this, but two things about the tests:

1. These seem like they're effectively integration tests. They check the entire ripgrep command line app works as intended. Is this because the bug was not where it looks like it is, in the regex crate, but elsewhere? If not, it seems like they'd be better as unit tests closer to where the bugs they're likely to detect would lie?

2. While repeating yourself exactly once isn't automatically a bad sign, it smells suspicious. It seems like there would be a lot of tests that ought to behave exactly the same with or without --mmap and so maybe that's a pattern worth extracting.


You're right, I probably should have added a unit test. I added one in a branch of ongoing work. I don't always add unit tests, but in this case, it was pretty easy to. (FWIW, this is in the grep-regex crate, not the regex crate. The grep-regex crate is a glue layer between regex and ripgrep that adds in a bunch of line-oriented optimizations.)

> While repeating yourself exactly once isn't automatically a bad sign, it smells suspicious. It seems like there would be a lot of tests that ought to behave exactly the same with or without --mmap and so maybe that's a pattern worth extracting.

Quite possibly. Not a bad idea. I already do that in the tests for PCRE2. Most tests are run with both the default regex engine and with PCRE2. (There are some tests that have intended behavioral differences with mmap enabled, but those can be handled on a case-by-case basis.)


> That's not correct because the `m` flag gets enabled by the multiline option.

Is this documented? RG(1) only says

            -m, --max-count <NUM>
                    Limit the number of matching lines per file searched to NUM.
Is this a different option or is there an implied <NUM> that prevents 'rg -U ^' from searching from the beginning of the file?


In this context the "m flag" refers to a flag inside the regex syntax. That is, when you use ripgrep's regex library as a standalone (as Rust programmers do), '^' only matches at the beginning of the text, whereas '(?m)^' enables multi-line mode and thus permits it to match at either the beginning of the text or at the beginning of a line.

ripgrep also has a -U/--multiline flag, but it's orthogonal to the regex mode called multiline. It's an unfortunate naming clash, but they are names in otherwise distinct namespaces.

ripgrep always enables the regex flag 'm', regardless of whether -U/--multiline is enabled or not.
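
To see the difference concretely outside of ripgrep, here's a minimal sketch using the Rust regex crate (which ripgrep builds on); the haystack mirrors the printf examples above:

    use regex::Regex; // assumes regex = "1" in Cargo.toml

    fn main() {
        // By default, ^ matches only at the very start of the haystack.
        let re = Regex::new(r"^b").unwrap();
        assert!(!re.is_match("a\nbaz\nabc\n"));

        // (?m) turns on the regex-level multi-line mode: ^ also matches
        // at the start of every line, which is what ripgrep enables.
        let re = Regex::new(r"(?m)^b").unwrap();
        assert!(re.is_match("a\nbaz\nabc\n"));
    }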


Thank you for taking the time to explain!


"BOM" == UTF-8 Byte Order Mark I guess.

I initially thought it was searching for "Bill of Materials" for electronics projects or similar.



There is no UTF-8 BOM. UTF-8 has no byte order ambiguity. Only UTF-16 needs a BOM.


From the Unicode Standard, section 23.8[1]:

> In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>. Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings. For example, in systems that employ Microsoft Windows ANSI Code Page 1252, <EF₁₆ BB₁₆ BF₁₆> corresponds to the sequence <i diaeresis, guillemet, inverted question mark> “ï » ¿”.

In practice, the UTF-8 BOM pops up. I usually see it on Windows.

[1] - http://www.unicode.org/versions/Unicode13.0.0/ch23.pdf#G1963...
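
For what it's worth, here's a minimal Rust sketch of stripping that signature (strip_bom is a hypothetical helper, not a std function; the byte values come from the quoted passage):

    /// The UTF-8 BOM byte sequence <EF BB BF> quoted above.
    const BOM: &[u8] = b"\xEF\xBB\xBF";

    /// Return the input with a leading UTF-8 BOM removed, if present.
    fn strip_bom(bytes: &[u8]) -> &[u8] {
        bytes.strip_prefix(BOM).unwrap_or(bytes)
    }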


it pops up a lot and it's annoying as it's an invisible diff in git


I've also had fun due to a BOM. In my case it was while configuring the Assetto Corsa server. It takes an INI file for the entry list, but halts parsing if it encounters unexpected input, without any kind of message. The BOM was unexpected input, so the server just immediately shut down because the entry list was "empty".

That was a fun and totally unstressful way to begin my time managing a racing league's race events.


It’s definitely a thing; it’s even in RFC 3629, though definitely not recommended. However, some Microsoft tools default to writing CSV as UTF-8+BOM and others expect it too, so it’s hard to ignore.



Why does utf-32 not require a bom?


It's practically never serialized to a file. And if you need one, you can use the same BOM value as UTF-16 and just add two zero bytes in the correct place.
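
In byte terms, a quick Rust illustration (the BOM is the code point U+FEFF):

    // In little-endian form, the UTF-32 BOM is the UTF-16 BOM plus two zero bytes.
    assert_eq!(0xFEFF_u16.to_le_bytes(), [0xFF, 0xFE]);             // UTF-16LE BOM
    assert_eq!(0xFEFF_u32.to_le_bytes(), [0xFF, 0xFE, 0x00, 0x00]); // UTF-32LE BOM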


It does: https://www.unicode.org/faq/utf_bom.html#bom4

Well, "require" is a bit excessive, but it certainly allows and recommends one.

UTF-8 does not need one because the code units are bytes, so byte order is not a concern.

Exchanging UTF-32 is pretty rare though, and as long as you don’t move anything between machines, byte order is not an issue.


huh?


Here's a coreutils (two-liner) version:

  printf '\xEF\xBB\xBF' >bom.dat
  find . -name '*.csv' \
    -exec sh -c 'head --bytes 3 {} | cmp --quiet - bom.dat' \; \
    -print
The -exec option for find can be used as a filter (though -exec disables the default action, -print, so it must be re-enabled after).

Could be made into a one-liner by replacing the 'bom.dat' argument to cmp with '<(printf ...)'.


GNU cmp understands:

  -n, --bytes=LIMIT   compare at most LIMIT bytes
so head is not really necessary:

  find . -name '*.csv' -type f -exec cmp -sn 3 {} bom.dat \; -print
Using -exec as a filter is a nice feature more people should use. That -type was put there just to avoid directories.


One large source of byte order marks in UTF-8 is Windows. In MS-DOS and later Windows, 8-bit encoded files are assumed to be in the system code page, which, to enable all the world's writing systems, varies from country to country. When UTF-8 came along, Microsoft tools disambiguated those from the local code page by prefixing them with a byte order mark. They also do this in (for instance) the .NET Framework XML libraries (by default). I don't know what .NET Core does. I suppose it made sense at the time, but I'm sure they regret this by now.


And I bet a significant portion of the offending CSV files are from Excel. Excel is really annoying because it also silently localizes CSV files: my language uses the comma as a decimal separator, so Excel will switch to semicolon for the delimiter.


Off topic but related, why do UTF-16 and UTF-32 even exist? Doesn't UTF-8 have the capability to go up to 32 bit wide characters already?


UTF-16 and UTF-32 are older than UTF-8.

Besides, at the beginning people were really against variable-size encodings. UTF-8 won despite the Unicode Consortium's and all the committees' efforts, not because of them.


Do you know why UTF-8 won? I feel textual data only constitutes a very tiny portion of memory used, but working with a fixed-size encoding is so much easier than with variable-size encodings.


The only universal, fixed-size encoding is UTF-32, which, as you can imagine, is very wasteful on space for ASCII text. Like it or not, most of the interesting strings in a given program are probably ASCII.

UTF-16 is not a fixed-size encoding thanks to surrogate pairs. UCS-2 is a fixed-size encoding but can’t represent code points outside the BMP (such as emoji) which makes it unsuitable for many applications.
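
To make the surrogate-pair point concrete, a small Rust illustration using the standard library:

    // U+1F600 (a grinning-face emoji) lies outside the BMP, so UTF-16
    // has to spend two code units (a surrogate pair) on it.
    let units: Vec<u16> = "\u{1F600}".encode_utf16().collect();
    assert_eq!(units, [0xD83D, 0xDE00]); // high surrogate, low surrogate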

Besides, most of the time individual code points aren’t what you care about anyway, so the cost of a variable-sized encoding like UTF-8 is only a small part of the overall work you need to support international text.


* UTF8 is ascii-compatible, so functions working with ascii will kinda sorta work with utf8

* all content will increase significantly in size using utf32 (utf16 is also variable-size; and with markup being extremely common and usually ascii, utf8, while not a guaranteed winner against utf16, often is)

* unicode itself is a variable-size encoding due to combining code points, so a fixed-size encoding really doesn’t net you anything


> "will kinda sorta work"

That sounds more like an anti-feature resulting in unstable programs that almost work.


Most of the time it works flawlessly; other times it fails on some inputs. What determines how well your old code will behave is what you are using it for, so this property is known at design time, and if you go for reliability (you certainly should, but most people don't) you can know before writing the program whether you'll need to care about text encoding or not.

Either way, a complete rewrite of the text-handling code should give you flawless functionality. At this point in time, all the important ecosystems that use UTF-8 are almost there.

This is very different from the other encodings, where a complete rewrite of the text-handling code was needed just to not fail every time. That made all the important ecosystems that used other encodings get almost there much sooner, but there was an important period when everything was broken, and the improvements are much slower nowadays, because when you need to fix every aspect of something, iterations take much more labor.


> That sounds more like an anti-feature resulting in unstable programs that almost work.

It's both. It will generally ignore non-ascii data, but that is very commonly something you don't care about, in which case it's a net advantage over plain not working at all.


No, it definitely works without errors, as long as the UTF-8 text is in ASCII space.


Because Unicode (not UTF-anything, Unicode itself) is/became a variable-width encoding (e.g. U+78 U+304 "x̄" is a single character, but two Unicode code points[0]). So encoding Unicode code points with a fixed-width encoding is completely useless, because your characters are still variable-width (it's also hazardous, since it increases how long it takes for bugs triggered by variable-width characters to surface, especially if you normalize to NFC).

0: Similarly, U+1F1 "DZ" is two characters, but one Unicode code point, which is much, much worse as it means you can no longer treat encoded strings as concatenations of encoded characters. UTF-8-as-such doesn't have this problem - any 'string' of code points can only be encoded as the concatenation of the encodings of its elements - but UTF-8 in practice does inherit the character-level version of this problem from Unicode.
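
The U+78 U+304 example spelled out in Rust (where a char is a code point, not a user-perceived character):

    let x_bar = "x\u{0304}"; // 'x' followed by U+0304 COMBINING MACRON
    assert_eq!(x_bar.chars().count(), 2); // two code points...
    assert_eq!(x_bar.len(), 3);           // ...and three UTF-8 bytes, but one "x̄"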


The only way to "properly" have a fixed width encoding is to allocate 80-128 bytes per character. Anything else will break horribly on accents and other common codepoints. So everyone uses the less-easy methods.

I base this number off the "Stream-Safe Text Format" which suggests that while it's preferred that you accept infinitely-long characters, a cap of 31 code points is more or less acceptable.


It was mostly because all the C functions that worked with single-byte character encodings also worked with UTF-8.


UTF-16 came after UTF-8. Software had gotten locked into 16 bit back when 16 bit meant fixed width, before either of those formats existed.


Note that UTF-8 wasn't actually standardized until Unicode 2.0 in 1996. This was at the same time as the surrogate pairs needed for UTF-16. And UTF-8 didn't find its final form until 2003, which was around the time when it really started to gain legs.

However, as you say, by 1996 people were already using the older UCS-2 standard.


It wasn't in Unicode until later, but there was a published spec in 1993.

And sure, they updated it in 2003, but "don't use invalid codepoints" is not really a notable update.


UTF-16 was first. Or rather, UCS-2, which was limited to the Basic Multilingual Plane, and which UTF-16 extends to the whole of Unicode.


Others have talked about the history of UTF-16. I'll focus on that last part: You must not write 32-bit wide characters in UTF-8.

Unicode / ISO 10646 is specifically defined to only have code points from 0 to 0x10FFFF. As a result UTF-8 that would decode outside that range is just invalid, no different from if it was 0xFF bytes or something.

It also doesn't make sense to write UTF-8 that decodes as U+D800 through U+DFFF since although these code points exist, the standard specifically reserves them to make UTF-16 work, and you're not using UTF-16.
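
Rust's standard library, for example, enforces exactly these rules when you try to turn a number into a char:

    assert!(char::from_u32(0x10FFFF).is_some()); // the last valid code point
    assert!(char::from_u32(0x110000).is_none()); // beyond the Unicode range
    assert!(char::from_u32(0xD800).is_none());   // a surrogate, reserved for UTF-16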


> You must not write 32-bit wide characters in UTF-8.

You can't tell me what to do, dad. I'll encode 64 bits and you can't stop me! Bwahahahaa!

    $ perl -MEncode=encode_utf8 -e'print encode_utf8 "\x{7fff_ffff_ffff_ffff}"' | hex
    0000  ff 80 87 bf bf bf bf bf  bf bf bf bf bf           ÿ␀␇¿¿¿¿¿¿¿¿¿¿


To be fair, that actually isn't valid UTF-8 - the leading byte has no zero bit. The largest valid UTF-8 encoding is FE BF BF BF BF BF BF, with value U+F'FFFF'FFFF. (In fact, the original specification only listed up to FD BF BF BF BF BF (U+7FFF'FFFF).)

Furthermore, even if you assume an implied zero bit at position -1, that would only be FF BF BF BF BF BF BF BF, with value U+3FF'FFFF'FFFF.

Also 7FFF'FFFF'FFFF'FFFF is only 63 bits - fer chrissakes son, learn to count.


> As a result UTF-8 that would decode outside that range is just invalid, no different from if it was 0xFF bytes or something.

That's needlessly pedantic. If you use an old version of the spec those bytes are valid.

And "have the capability" seems to me to be talking about what the underlying method is able to do, not the full set of "must not" rules.


A character in UTF-8 can even be more than 4 bytes long. Examples are flags or skin-tone emojis.
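
Strictly speaking, each code point still encodes to at most 4 UTF-8 bytes; it's the grapheme that gets longer. A quick Rust sketch with a flag, which is built from two regional-indicator code points:

    let flag = "\u{1F1FA}\u{1F1F8}"; // REGIONAL INDICATOR symbols U + S, i.e. 🇺🇸
    assert_eq!(flag.chars().count(), 2); // two code points
    assert_eq!(flag.len(), 8);           // 4 UTF-8 bytes each, 8 in total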


UTF-8 encodes code points, as do UTF-16 and UTF-32. Once you go from a sequence of bytes to a sequence of code points, you've moved beyond the specifics of the encoding.

Code points might be combined to form graphemes and grapheme clusters. Some of the latest emojis are extended grapheme clusters, e.g. for handling the combinatorics of mixed families. This is a higher-level composition than UTF-x; it's logically a separate layer.

IMO talking about characters in the context of Unicode is often unhelpful because it's vague.


Right. "Character" is almost never what you meant, unless your goal was to be as vague as possible. In human languages I like the word "squiggle" to mean this thing you have fuzzy intuitive beliefs about, rather than "character". In Unicode the Code Unit, and Code Point are maybe things to know about, but neither of them is a "character".

In programming languages or APIs where precision matters, your goal should be to avoid this notion of characters as much as practical. In a high level language with types, just do not offer a built-in "char" data type. Sub-strings are all anybody in a high level language actually needs to get their job done; "A" is a perfectly good sub-string of "CAT", so there's no need to pretend you can slice strings up into "characters" like 'A' that have any distinct properties worth inventing a whole datatype for.

If you're writing device drivers, once again, what do you care about "characters"? You want a byte data type, most likely, some address types, that sort of thing, but who wants a "character"? However, somewhere down in the guts a low-level language will need to think about Unicode encoding, and so eventually they do need a datatype for that when a 32-bit integer doesn't really cut it. I think Rust's "char" is a little bit too prominent, for example; it needn't be more in your face than, say, std::num::NonZeroUsize. Most people won't need it most of the time, and that's as it should be.


They existed before UTF-8, afaik.


A file containing every single Unicode codepoint once would be smaller in UTF-16 than in UTF-8. UTF-16 can make sense in some applications.


> Doesn't UTF-8 have the capability to go up to 32 bit wide characters already?

31.


Or 36 if you allow a leading byte of FE, but that's still not 32.


I don’t know if I’m ever gonna need this, but I loved learning it!


I probably won’t ever need this, but I love the write-up for a tool which I use daily.


Is there anything like --multiline in GNU grep?




