Brian Kernighan adds Unicode support to Awk (github.com/onetrueawk)
618 points by ducktective on Aug 20, 2022 | 208 comments



I became aware of this while watching Professor Brailsford's interview with him (Computerphile channel): https://www.youtube.com/watch?v=GNyQxXw_oMQ

(Around 7-8 minute mark)

Update:

At the 24-25 minute mark, he talks about the technologies he is considering for writing his new book (he mentions troff and groff).

He says he wanted to try "XeTeX" (which supports Unicode) but "...I was going to download it as an experiment and they wanted 5 gigabytes and 5 gigabytes at the particular boonies place I'm living would...mmm..not be finished yet!"

So there we go... We had the opportunity to read the mind of the developer of awk and Unix, and co-author of the literal "C Programming Language", as he confronts the absolute state of the tooling of the modern world.


The problem is that TeX Live still defaults to doing a full install.

A full install means installing ~4000 packages, including their source files (tens of thousands of TeX files), built documentation (thousands of PDF files), and hundreds of free fonts (OTFs, TTFs, and TeX's own formats).

This is huge (>7GB, not just the 5 GB claimed here).

However, you don't need 99 % of this for any given document.

Not installing the source files and documentation PDFs alone will reduce the size by roughly half.

Only installing the packages you really need from a minimal installation gives you a few hundred megabytes at most for even complex documents.

It's a bit annoying to get the list of packages needed though, since there is not really any working dependency management.

I wrote a Python wrapper around the TeX Live installer [1] to make this easy for CI jobs, see e.g. [2].

On a side note: I'd recommend luatex over xetex.

- [1] https://github.com/maxnoe/texlive-batch-installation/

- [2] https://github.com/pep-dortmund/toolbox-workshop/blob/8b00f0...


I did some digging about a year ago and the biggest uses of space are the docs and, surprisingly, fonts: https://www.reddit.com/r/LaTeX/comments/oqmb3w/installed_mac...

It’s mostly a death by a thousand cuts thing.

MiKTeX, which is available for Mac and Linux as well as Windows, will do on-demand downloading of packages and is probably my recommendation for a new installation.

A lot of the drawbacks to TeX stem from trying to meet the combined requirements of computing platforms of the 1970s–1990s. All input files effectively live in a single namespace, so you cannot have two input files with the same name (the same applies to coding, where you have a single namespace for all your commands as well).


Eons (>15 years) ago, I sold software that relied on having a working TeX installation, and to make it easier on my users, my installer provided a very very trimmed down version of MiKTeX as an optional download. The beautiful thing about MiKTeX is that its autoinstall feature meant that one could delete 90% of the packages, and the worst that would happen is that referencing something I deemed unessential would prompt a just-in-time download.

I recall I managed to cut MiKTeX down to about 12MB. Never heard a single complaint about it.


Why hasn’t something like this become the standard way to use LaTeX? Boggles my mind.


Latex is stuck in the past. Academic publishers even more so. It's 2022, and I can't use slightly non-standard letters (e.g. non-Latin characters) in my papers without jumping through hoops. It's anyone's guess as to which ones will work and which won't. The language was designed in the '80s and it shows; many commonly used packages are severely outdated and nobody cares.


I guess no one can receive fake brownie points (paper impact factor) by fixing that stuff.


On Arch Linux there's the texlive-core package, which does not ship the PDF docs (most of the size). It comes in at about 500 MB (most of which is fonts...) and already provides enough to build normal documents, including lualatex for Unicode support.


> It's a bit annoying to get the list of packages needed though, since there is not really any working dependency management.

Sounds like someone should just create a wrapper that catches file opens of uninstalled dependencies and automatically downloads them on the fly.


This has been done and yes people should switch to this. "Tectonic automatically downloads support files so you don't have to install a full LaTeX system in order to start using it. If you start using a new LaTeX package, Tectonic just pulls down the files it needs and continues processing."

https://tectonic-typesetting.github.io/


MikTeX also does this: https://docs.miktex.org/manual/autoinstall.html. As far as I’m aware, TeX Live is unique in forcing the user to install all packages at once.

[EDIT: And in fact, if a cousin comment is to be believed, TeX Live also allows this kind of installation! https://news.ycombinator.com/item?id=32535521]


TeX Live allows a minimal installation, but what MikTeX has that TeX Live doesn't (AFAIK) is the on-demand automatic fetching and installing of packages. For example, if you `\usepackage{foo}` then with a minimal TeX Live you have to install it manually (with `tlmgr install foopkg`), while MiKTeX will either install it automatically or pop up a prompt asking you whether you want to install foopkg.


I can't believe I read this comment literally 10 minutes after installing endless GBs of xetex haha. Thanks for the info!


TeXLive also comes with installation schemes that will give you (if I remember the names correctly) bare, medium, and full installations, if you prefer not to pick packages yourself. Alternately, although I don't use it myself, I'm sure you could use MikTeX, which is much better about on-demand package installation. (Or even Overleaf, if you don't want to put anything on your local device!)


Development and rejuvenation of groff has recently been taking a good path. While groff is still single-pass and doesn't do paragraph-at-once typesetting, it is a much more pleasant experience than it was maybe five years ago.


> the absolute state of the tooling of the modern world

Hah, TeX Live is... not that.

It's been enormous since I installed it off a CD in the 90s. The idea, and it works, is that you can just compile anyone's stuff out of the TeX ecosystem.

There is just... a lot... in it. You don't need a package manager if you install the whole universe locally. Like I said: not what I would call a modern approach to tooling.

On the other hand, I have latex files from the mid-Noughties and, I don't even need to check: they'll compile if I want them to.

But yeah, if you want just a little piece of TeX here and there, you're off the beaten track. That's not how TUG rolls.


Old approach: "we don't have a lot of disk space, we have to be careful what we install".

More modern approach: "hey, we have a lot of disk space, we can install everything locally".

Even more modern approach: "hey, we have fast Internet access, we only have to install stuff when we need it".


all three of those ways are important today; modern is overrated


TeX Live can also be configured to install the bare minimum TeX ecosystem (or just TeX+LaTeX), which only takes a few minutes to download and install but results in hunting down dependencies and manually installing them whenever you want to use a new package.

It also seems quite slow to update, and a recent (?) name change of `tools' to `latex-tools' seems to have broken multicol, which drove me to MikTeX. Internet connection required, but far less headache.


> He says he wanted to try "XeTeX" (which supports Unicode) but "...I was going to download it as an experiment and they wanted 5 gigabytes and 5 gigabytes at the particular boonies place I'm living would...mmm..not be finished yet!"

Man, I think once you're Kernighan there should be like a 1gigabit/sec symmetric circuit wherever you go just in case you use it to do something else useful.


on the other hand, I've done lots of excellent coding in places with awful internet. fast enough to be able to look things up when you really have to, but too annoying to meaningfully distract yourself


... or a fast car with a load of magnetic tapes (or memory cards, per M. Munroe) in the boot fuelled up and ready to roll. (-:


Ah, thought goto usage was considered harmful.


You put it jokingly, but distinguished people should definitely get some state-managed perks like politicians do.


No they should not. They need to live in the same world we do, using the same tools we have. The example being politicians should make one think: do we really need more of this greasy pole and lust for power and freebies in computing? The whole elite mindset is corrosive.

Besides, where would you put the bar for “distinguished”? Who would make the decision?


Other “distinguished” people, obviously.

Which should make the problems of any such system obvious: it inevitably becomes yet another “in club” for people with connections and power.


> No they should not. They need to live in the same world we do, using the same tools we have.

Grand prizes with honorariums and perks of various kinds are a longstanding tradition with value.

I was mostly kidding, but: Kernighan totally can afford a big internet connection wherever he is. But he chooses not to buy it. The social benefit from him having one likely outweighs the personal benefit from him buying it.


You should start a fiber/ISP company


It would be more of a grand prize with an internet access component. :P

Of course, if I still operated internet access and Kernighan was in my area, I'd definitely comp him a circuit. :P


Automatons?


This isn't about bandwidth, it's about the size of modern binaries.


TeX distributions are enormous, but the binaries themselves are actually not big. The overhead imposed by Knuth’s obnoxious license, while nonzero (you can’t modify the original Pascal-in-WEB source, only patch it, so a manual source port to a different language is painful enough that nobody tried; we’re all using an automatic Pascal-to-C translation with lipstick on it), is not huge, and aside from a smattering of utilities that’s it for the binary part.

It’s just that the distros also include oodles of (plain text!) macro packages for everything under the sun. There are some legitimately large things such as fonts, but generally speaking a full TeXLive or MiKTeX distribution is bloat by ten thousand 100-kilobyte files, like a Python distribution with the whole of PyPI included.

If you know what you want, you can probably fit a comprehensive LaTeX workbench in under 50M, but it takes an inordinate amount of time.


I agree with your main point about the binaries being small and a reasonable LaTeX distribution fitting in under 50 MiB (I think starting with a minimal MiKTeX and letting it install what you need a few times will leave you there as well), but regarding

> a manual source port to a different language is painful enough that nobody tried

there are in fact quite a few, many of which are complete: https://tex.stackexchange.com/questions/507846/are-there-any...


Pardon me, but why have modern binaries grown so big? When I wrote my thesis in TeX, the entire installation would fit in some 30 megabytes or so. It was actually right in the uncomfortable middle: far too big to carry around on a set of diskettes, but a CD would be a waste of space.


You can fit a decent TeX distribution in <100MB.

But if you want to have every macro package that everyone everywhere likes, you're going to use some space.


I've often wondered why there isn't a dependency system for TeX that lets you get only the packages you need... feed it a document and tell it to automatically download and install any missing packages.

There may be some technical reason why this isn't practical. Anyone know, offhand?


Totally practical and supported. See MikTeX: https://miktex.org/kb/just-enough-tex

But the reason I don't use it is because I don't always have Internet when I need to write my document. I'd rather download all the packages and their documentation beforehand.


When stuck on a plane without Wi-Fi, the only thing I could find on my computer to read was the TeX manuals. Good stuff.


MikTeX also supports downloading individual packages, though I wish there was an option to download, say, the most commonly used 1GB or something.


There is a TeX distribution that does what you say, TinyTeX. For some reason it is obscure and only really used by R and RStudio users. That may be because it is used to render R Markdown documents to PDF.


> He says he wanted to try "XeTeX" (which supports Unicode) but "...I was going to download it as an experiment and they wanted 5 gigabytes and 5 gigabytes at the particular boonies place I'm living would...mmm..not be finished yet!"

He can try "Tectonic" [0] - a modern XeTeX based TeX/LaTeX distribution that installs a minimum system and then downloads and installs dependencies on-demand. Tectonic is written in C and Rust [1].

[0] https://tectonic-typesetting.github.io/en-US/

[1] https://github.com/tectonic-typesetting/tectonic


It's sad that the Tectonic conversion to Rust[1] was never finished. For now it's just a wrapper around C and C++ code. By far, it was the most promising thing about this distribution.

[1] https://github.com/tectonic-typesetting/tectonic/issues/459


Sounds like he was looking at downloading a complete TeX Live distribution; XeTeX itself isn't anything like that size (by a couple orders of magnitude, at least).


MacTeX is 4.7GB which matches the 5GB he's talking about and...

"MacTeX installs TeX Live, which contains TeX, LaTeX, AMS-TeX, and virtually every TeX-related style file and font. [...] MacTeX also installs the GUI programs TeXShop, LaTeXiT, TeX Live Utility, and BibDesk. MacTeX installs Ghostscript, an open source version of Postscript."

Which is, as you say, considerably more than just "XeTeX".

(Also those are universal binaries containing both Intel and ARM versions which probably adds some heft.)


> (Also those are universal binaries containing both Intel and ARM versions which probably adds some heft.)

Heh, I remember when "universal binaries" meant "PowerPC and Intel". Different universes ….


In the early/mid 1990s there were even "fat binaries" that had 68000 (the original Mac platform) and PowerPC binaries back when PowerPC was the new thing.


I think distros package it as TeX-full, TeX-minimal, etc. The one with the documentation files is a couple of GiB on Ubuntu...

I wonder what distro or editor he is using...


Two years ago, he used macOS on a 13" MacBook Air and an iMac, as per his conversation with Lex Fridman: https://youtu.be/O9upVbGSBFo?t=2523


It's not just the tooling of the modern world.

I remember installing teTeX (the predecessor of TeX Live) back in 2003 and I despaired at the file size because I knew it would take days to download on the shitty dialup connection I had at the time. Yeah, it wasn't 5 GB big back then, but it was certainly big by 2003 standards.


Watching this interview inspired me to start playing around with groff. It has a very steep learning curve... And being as old/niche as it is, I've found it very hard to find any active community to get newbie questions answered. If anybody knows where I could find that sort of thing, I'd be very grateful.


NB: I'm new to HN, so please excuse any formatting errors in this comment.

Well, I've been using groff on and off for a few years, so I can give it a shot. Most "macros", as in TeX, are invoked by having the first character of a line be a `.`, followed by a name. Now, macros in groff are weird. They are not delimited by `{}` as in TeX, but either by an environment block like so:

    .TS
    .TE
Incidentally, `.TS` and `.TE` delimit the environment for tbl(1), the pre-processor for tables. Or by invoking another macro, as in this example for headings and paragraphs in the -ms macro set. (The -ms macro set is called as such because that's how you get it into your document: you supply the -ms flag to groff at build time to get the -ms macros.) See groff_ms(7) for more information than I will supply here.

    .NH 1
    Some heading here
    .PP
    Start of a paragraph.
Sometimes you want to start a line of input with a period. To do that you just use the character `\&` before the period. The `\&` character is a non-printing character, so use it anywhere you want, really. I imagine you want to play with math typography; in that case there is EQN. EQN works by having an environment block, `.EQ` and `.EN`, and in that block you type in the EQN mini-language.

    .EQ
    f ( x ) ~=~ x sup {2} - 1
    .EN
EQN does not care about precedence or anything, so use `{}` to delimit expressions. I think there is some kind of EQN document rescued from Bell Labs that tells you about how the language works. Check troff.org for it, or email the groff mailing list, which is also pretty good for newbie questions; it is not really all that high-traffic.

Groff also does pictures, so you can for example create a box with some text in it (it is capable of more, like strokes, macros, color, and I believe rotation[not sure]). To start a picture, `.PS` and `.PE` will do so, and again, a small language as input in that block.

Now, to print anything you should know that the design of the original troff system was heavily based on the UNIX philosophy, which means the groff program prints its output to stdout. Just redirect the output to a file to get a PDF or a PS document.

Obviously, there is much I have not covered in this comment, such as macros, registers, and other exciting topics such as refer(1), but this should give you a quick start.


This is helpful! Thank you for taking the time to write it up.

I didn't know about '\&' as a way to "short circuit" groff from interpreting something as a macro that should be literal (such as a line-initial period). That's great to know.

I've found a couple of resources that have been helpful. Mostly, I've been referring to this[0] page on gnu.org. I've also learned from this[1] article on Linux journal and Luke Smith's YouTube playlist[2] on groff.

To practice, I'm trying to re-typeset a paper I wrote in school that follows the Chicago Manual of Style[3]. I formatted the original in MS Word using a template my school provided, so I have something to visually compare against. I'm using the -ms macro set, mostly because that's what Luke Smith demos in his videos so that's what I got started with.

I've already got a reasonable imitation of the original document, which is fun and a bit rewarding. But there's a couple of very small details that I'm finding difficult to crack. Are you very well-versed with -ms? In particular, right now I'm trying to figure out how to add one blank line of space in footnotes AFTER the horizontal line, but BEFORE the first footnote begins. I suspect I'll need to redefine the -ms macro that controls the trap for footnotes...? But I haven't figured out which macro that is just yet.

[0] "The GNU Troff Manual", https://www.gnu.org/software/groff/manual/groff.html

[1] "Typesetting with groff Macros", https://www.linuxjournal.com/article/4375

[2] "groff/troff for Minimalist Document Complication", https://www.youtube.com/playlist?list=PL-p5XmQHB_JRe2YeaMjPT...

[3] "The Chicago Manual of Style Online", https://www.chicagomanualofstyle.org/home.html


Sadly I don’t know how to futz about with traps, I just haven’t gotten to that point yet in my use of groff.

First I would point you to the manpage, but, we all know that manpages are a mess, so it might not be there. I will instead point you at the macro source, which is perhaps not /super/ helpful, but alas, I know not of any better solution. If you go to /usr/share/groff/current/tmac/s.tmac and search for «module fn», that should be the code for the footnotes. As an aside, the -ms name comes from this: -m means macro, and whatever letter comes after it is the macro set, hence why -man will be called an.tmac in the groff source code, or in this case, s.tmac.

I am not sure where the documentation on traps in groff is, but I assume it’s buried within some manpage or info document somewhere. Software is crap, eh? :D

Of course, you could ask on the groff mailing list[1], they are usually pretty helpful in these matters. Which course of action you choose is of course up to you; I will investigate for myself as well to educate myself more on traps.

[1]: mailto:groff@gnu.org


> He says he wanted to try "XeTeX" (which supports Unicode) but "...I was going to download it as an experiment and they wanted 5 gigabytes and 5 gigabytes at the particular boonies place I'm living would...mmm..not be finished yet!"

I'm struggling to understand this sentence. Who are they? What's a boonie? What's not finished?


"in the boonies" = "in the wilderness/far away from everything/...". I.e. they probably have only crappy mobile internet and don't want to download large packages.


He could master it in a week if he set his mind to it. I don't have a single doubt about that. He just doesn't really need to.


I just wanted to say: that's an excellent interview and I strongly recommend others listen to it.


Apparently Kernighan is also updating his 1988 Awk book this summer.

https://irreal.org/blog/?p=10746


As I get old enough to start thinking about exactly how long I want my career to be, this is inspiring. He is coming up on 80 and he's still coding.


Is it? I'm glad it's an option, but I'm also glad retiring at 40 is an option (depending on a lot of stars aligning just right of course).

My answer to "how long do I want my career to be?" will never be "until my body or mind fail me".


Depends on how you think of it. Maybe I shouldn't have used the word "career" here, as I wrote code long before people paid me for it, and aim to keep doing it long after.


Depends on what you mean by 'career'. Brian Kernighan doesn't do this because he needs money, but because he would get bored otherwise.


[flagged]


well that escalated quickly


Ok?


I've found this to be THE best book for learning awk and regular expressions. It's astounding how clear and structured the material is and how easy it is to retain in your head when presented that way.


That would be cool


The choice to use UTF-32 (ie Unicode code points as integers, which might as well be 32-bit since your CPU definitely doesn't have a suitably sized integer type) is unexpected, as I had seen so many other systems just choose to work entirely in UTF-8 for this problem.

Now, Brian obviously has much better instincts about performance than I do and may even have tried some things and benchmarked them, but my guess would have been that you should stay in UTF-8 because it's always faster for the typical cases.


He mentions that "The amount of actual change isn't too great, so I think this might be ok" so I wonder if part of the equation has more to do with avoiding messing with legacy code rather than raw performance. If the current code expects all codepoints to have a constant-width representation, it may be complicated to add UTF-8 into the mix.

A complete guess on my part though, I never looked into AWK's source code.


This sounds reasonable. When the GoAWK creator tried to add Unicode support through UTF-8, he discovered that this had drastic performance implications (rendering some algorithms O(N^2) instead of O(N)) if done naively: https://github.com/benhoyt/goawk/issues/35. Therefore the change was reverted until a more efficient implementation can be found.


Yes, that's right. With my simplistic UTF-8-based implementation it turned length() -- for example -- from O(1) to O(N), turning O(N) algorithms which use length() into O(N^2). See this issue: https://github.com/benhoyt/goawk/issues/93

Similar with substr() and other string functions, which when operating as bytes are O(1), but become O(N) when trying to count the number of codepoints as UTF-8.

GNU Gawk has a fancier approach, which stores strings as UTF-8 as long as it can, but converts to UTF-32 if it needs to (eg: the string is non-ASCII and you call substr).

It looks like Brian Kernighan's code has the same issue with length() and substr(). I'm going to try to email him about this, as I think it's kind of a performance blocker.
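
To make the substr() cost concrete, here is a minimal Python sketch (the helper is hypothetical, not GoAWK's or onetrueawk's actual code) of why finding the k-th code point in a UTF-8 byte buffer requires a scan, while UTF-32 can just multiply:

    def codepoint_offset(buf, k):
        """Byte offset of the k-th code point (0-based) in a UTF-8 buffer."""
        seen = 0
        for i, b in enumerate(buf):
            if (b & 0xC0) != 0x80:   # not a continuation byte, so a code point starts here
                if seen == k:
                    return i
                seen += 1
        raise IndexError(k)

    buf = "naïve café".encode("utf-8")
    print(codepoint_offset(buf, 3))  # O(N) scan just to locate the 4th code point ('v')
    # With UTF-32 the same lookup is simply k * 4, no scan required.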


Brian sent me a lovely (and surprisingly lengthy) reply. He noted that length() was already O(N), as onetrueawk just represents strings as NUL-terminated C strings and uses strlen() (though strlen is a lot faster than decoding UTF-8). However, he said that substr() and index() have indeed gone from O(1) to O(N), as it used to be possible to just index by number of bytes.

The most challenging part, he said, was updating the regex engine, which "was written by Al Aho in 1977 and hasn't really been touched much since".


The code only uses UTF-32 in regular expressions where I suppose it was much simpler to adopt the older code. The rest uses UTF-8.


Is UTF-32 fixed size per char? Because then it allows simple math that you can’t do on UTF-8.


It's a fixed size per codepoint. Many clusters that appear atomic in a text editor are made up of multiple codepoints. The flag emojis are among the many examples.
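
A quick Python illustration (just a throwaway example, nothing awk-specific), using a flag built from two regional-indicator code points:

    # One visible flag, two code points (regional indicators "F" and "R"):
    flag = "\U0001F1EB\U0001F1F7"
    print(len(flag))                       # 2 code points
    print(len(flag.encode("utf-32-be")))   # 8 bytes: still exactly 4 bytes per code point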


UTF-8 is this magic ball of mud that somehow managed to convince people overstrikes were stupid and then implemented them so exuberantly that you can have a single glyph represented by half a dozen of them.


To nitpick: Overstrikes/other combining characters and the multiple possible ways to represent the same glyph are aspects of Unicode, the character set which deals with code points from 0 to 0x10ffff. Whereas UTF-8 is a way of encoding that character set into a sequence of bytes. The things you describe are present in all unicode encodings.


> Because then it allows simple math that you can’t do on UTF-8.

That's not actually useful, because unicode itself is a variable length encoding.

So it mostly blows up the size of your data.

Though it might have been selected for implementation simplicity and / or backwards compatibility (e.g. same reason why Python did it, then had to invent "flexible string representation" because strings had become way too big to be acceptable).


UTF-32 is a fixed-length encoding of Unicode[1], so it does simplify things a lot for a regex engine.

[1] At least when talking about code points, which is what matters for regular expressions (unless you want stuff like \X, which is not universally supported).


Wait, does Unicode not have multiple representations even of simple things like the letter "ä"? Then you definitely need to handle actual characters/glyphs in regexes.


Sure, but that's up to whoever is writing the regular expression.

The standard '.' (match any character) in a regexp matches an Unicode code point, not a grapheme cluster. To match a grapheme cluster, you have to use "\X", which is not universally supported.

For example, in Python 3, the built-in module "re" doesn't support "\X"; you have to install the "regex" module for it:

    # text is 'e' followed by U+0301 (combining acute accent):
    text = b'e\xcc\x81'.decode('utf8') 
    print(f'text: "{text}"')

    import re
    print(re.match('^(.)(.)$', text).groups())   # prints "('e', '´')"

    import regex  # must be installed
    print(regex.match(r'^(\X)$', text).groups())  # prints "('é',)"


> To match a grapheme cluster, you have to use "\X", which is not universally supported.

You have the words "supported" and "implemented" mixed up. Kernighan claims Unicode support, so he is required by the standard to implement \X.

If a software does not implement \X, then it is not compliant, and it would be very wrong to say it supports Unicode. Does anyone have a deeplink showing the evidence for awk?


This is just false. UTS#18 specifies multiple levels of Unicode support. \X is part of level 2. It is perfectly valid to generally say "has Unicode support" even if it's just Level 1, assuming you document somewhere what precisely is supported.

For example, I regularly say that Rust's regex crate has Unicode support. But it does not support \X. It's more precisely documented here: https://github.com/rust-lang/regex/blob/master/UNICODE.md


> so he is required by the standard the implement \X.

Which standard is that?

If you're talking about Unicode, the "standard" for regular expressions[1] is an "Unicode Technical Standard", which according to itself isn't required for Unicode conformance:

> A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

So awk can claim Unicode support without supporting "\X" (like many regex engines).

If you're talking about POSIX, its regex chapter[2] doesn't mention "\X". In any case I don't think awk claims to conform to POSIX.

[1] https://unicode.org/reports/tr18/

[2] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...


At the end of this comment there is what I believe is a single grapheme cluster. On disk this single "letter" occupies 73 bytes. A surprisingly large number of tools and editors know how to work with things like these and render them at least somehow.

I think I once created one that was about a kilobyte. Is there an upper limit?

I created it using this page https://glitchtextgenerator.com/

The 73 byte X:

x̧̡̬̘͓̖̲̻̻̲̠̪̻͓͙̜̂̓̊̔̀̀͗̑̀̅̀̂̚͘̕̚͘͢͜͠


Unicode is not an encoding, despite MS Notepad calling some encoding "Unicode".


Unicode isn't a storage encoding and so yeah, Notepad shouldn't do that. However Unicode does encode essentially all extant human writing systems into integers called "code points" between zero and 0x10FFFF. The Latin "capital A" is 65 for example.

However you'd probably like to store something more compact than, say, JSON arrays of integers. So there are also a bunch of encodings which turn the integers into bytes. These encodings would work for any integers, but they make most sense to encode Unicode's code points. UTF-8 turns each code point into 1-4 bytes, a pair of UTF-16 encodings turns them into one or two "code units" each of 2 bytes either little or big endian. And UTF-32 just encodes them as native 32-bit integers but again either little or big endian.
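
A small Python sketch of one code point under the three encodings (the "-be" variants just pin the byte order):

    ch = "\u00e9"                        # U+00E9, 'é'
    print(ch.encode("utf-8").hex())      # 'c3a9'      -> 2 bytes
    print(ch.encode("utf-16-be").hex())  # '00e9'      -> one 16-bit code unit
    print(ch.encode("utf-32-be").hex())  # '000000e9'  -> one 32-bit code unit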


> Q: What is Unicode?

> A: Unicode is the universal character encoding, maintained by the Unicode Consortium. This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols.

https://home.unicode.org/basic-info/faq/


Regardless of official terminology, there are two levels:

1. Map a character to a unique number in a character set (in Unicode: called codepoint)

2. Map a number that represents a character in a character set to a bit pattern for storage (transiently or persistently, internally or externally). Unicode code points can be bit-encoded in various ways: UTF-8, UCS-2/UTF-16, and UCS-4/UTF-32.

The original code points permit the same character to be represented in various ways, which makes equality checks non-trivial: for instance a character like "ä" can be represented as a single code point or alternatively as a composition of "a" + a combining umlaut accent (two code points).
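
A short Python illustration of why that equality check is non-trivial without normalization (just an example, not tied to any particular application):

    import unicodedata

    precomposed = "\u00e4"      # 'ä' as a single code point
    decomposed  = "a\u0308"     # 'a' followed by COMBINING DIAERESIS
    print(precomposed == decomposed)                                # False: raw code point comparison
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True: equal after normalization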

So far, this is all about plain text, so we are not talking about font families or character properties (bold, italics, underlined) or position (superscript, subscript).

Ken Lunde's magnum opus is the standard book on representing text in languages other than English, with a focus on Asian languages: https://www.oreilly.com/library/view/cjkv-information-proces...


In your quote encoding refers to assigning numbers (code points in Unicode parlance) to characters (I am simplifying here, I know the definition of character in Unicode is not that easy).

It’s like a catalogue of scripts. We have to extend it when we encounter new scripts that are not catalogued yet (or when we create new emojis)

Converting a byte sequence to a Unicode code point sequence and vice versa is called a transformation format (or more generally an encoding form, though then it might not be deterministic) by Unicode (see <https://www.unicode.org/faq/utf_bom.html#gen2>). Unicode specifies UTF-8, -16 and -32. We do not have to change these formats unless the catalogue hits the limits of 32 bits (not a big problem for UTF-8, but it is for the other two formats). These formats are already able to encode code points that are not assigned yet.

And the confusion now is that a lot of people call what Unicode calls transformation format (i.e. the byte to code point mapping) encoding as well. The term charset is also used sometimes.

PS: Note that a goal of Unicode is to be able to accommodate legacy encoding/charsets by having a broad enough catalogue. This is so that these legacy encoding which may come with their own catalogue can be mapped to the Unicode catalogue. So we have control codes (even though not part of any “proper” human script), precomposed letters (there is a code point for à although it could be represented by a + combining `), things like the Greek terminal form of sigma separately encoded, although that could be done in font-rendering (like generally done for Arabic), and a lot more to aid with mapping and roundtrips.


Just a note about the Greek terminal form of sigma: when dealing with Greek numerals, ςʹ (the final form of sigma) is 6 while σʹ (the non-final form of sigma) is 200; they need to be differentiated regardless of what the font engine decides to render.


Thanks, good to know. I’ve also realised that determining which form to use in mathematical formulas is maybe not that straightforward.

Edit: by which I mean that I’ve only seen the non terminal form so far in maths but it’s hard to write an algorithm that distinguishes between a word ending in sigma and some juxtaposition of Greek variables.


Unicode uses the term “character encoding form” or “character encoding scheme” for what is normally referred to or abbreviated as “character encoding” or “charset” (see e.g. RFC 8187), and uses “character encoding” or “coded character set” for the abstract assignment of natural numbers to the abstract characters in a character repertoire, which is more usually referred to as just “[coded] character set” (cf. also UCS = Unicode Character Set). This different use of terminology can cause confusion. The GP is correct that Unicode as a whole is not what is colloquially meant by “encoding”.


MS has taken the approach of calling UTF-16 "Unicode" as it's what's used in most of their systems.


It's always a tradeoff: some operations are simpler on UTF-32, but it has a larger memory (and therefore cache) footprint, and since you typically don't want to use UTF-32 externally, you have to convert back and forth, which is not free.

I think these days people don't bother with UTF-32 too much because it's not even like you have a clean "one 32-bit int, one character" relation anyway, since some characters can be built from multiple codepoints. Since most code manipulating character strings is interested in characters and not codepoints, UTF-32 is effectively a variable-length encoding too...


Right, somebody else might have actual metrics but I'd have guessed actual regular expression patterns are split something like:

90% Only care about ASCII, thus individual bytes in UTF-8, and so UTF-32 just wastes memory

1% Care about individual code points, but spread over multiple bytes (e.g. the double dagger ‡), UTF-32 is perfect

9% Care about multiple code points (to form e.g. a Flag, or é written in combining form, or two women kissing) and so UTF-32 doesn't really help again


Applications and text-processing libraries are free to use an internal text encoding or data structure, if you will, which does not suffer from the drawbacks. Conversions to encodings suitable for data exchange (e.g. UTF-8) are performed at the I/O boundaries.


Sure, and if stuff lives inside your "application and text-processing library" for long enough and performs enough of the thusly optimised actions without leaving, you might even amortize the cost of the work you did at the edges.

But probably not.


Another factor is that nowadays machine code execution is much faster than memory accesses, so the trade-off of requiring more program logic to process a more compact format makes a lot of sense.


A “character” can be of fairly arbitrary length in Unicode, so no.


Not to be contradictory, but Unicode is not a specific encoding. UTF-8 is an encoding (with a variable length per code point) and UTF-32 is an encoding of a Unicode code point with a fixed length.


Most people talking about characters in Unicode are referring to grapheme clusters, and I assume the parent is too, since they are in fact defined by Unicode and vary in length in all encodings, hence their answer of "no" to the GP.


It is fixed size per code point, which is what developers (and programming languages) sometimes casually call a character, but in practice a character is a grapheme, which can be multiple code points once you're outside the ASCII range. But it can still be useful to count code points, which would be faster in UTF-32.

Edit: Mixed up code units and code points.


And even then, in some languages at least, what constitutes a grapheme isn't always well defined.


> in some languages at least, what constitutes a grapheme isn't always well defined.

Can you provide some examples? People say this a lot, but the cases I've been able to find tend to be things like U+01F1 LATIN CAPITAL LETTER DZ, which is only not well defined in the sense that Unicode defines it wrong (as one character rather than two) presumably-on-purpose, for compatibility with one or more older character encodings.


I don’t know about DZ, but things that are two letters in one language can be one in another (famously ij in Dutch, but there are others).

Also, things like æ and œ are much easier to deal with if they are single glyphs. Their upper-case versions are respectively Æ and Œ even in languages where they are two letters. I suppose that now we would do it with something like ZWJs to make sure that both letters are transformed consistently, but there are technical reasons behind the current situation.

[edit] here you go: dz is the 7th letter of the Hungarian alphabet, but the capital version (in a normal sentence) of dz is Dz and not DZ. Yeah, languages are weird: https://en.m.wikipedia.org/wiki/Hungarian_alphabet


> famously ij in Dutch

ij is considered two letters in Dutch, although they go by special rules: https://onzetaal.nl/taalloket/ij-plaats-in-alfabet


Is DZ "wrong" because it's not considered a digraph by professionals, or because people don't agree that digraphs should be considered single characters?


"DZ" isn't 'wrong', it's a perfectly valid two-character string consisting of "D" followed by "Z". Assigning to a multi-character string a encoded representation that isn't the concatenation of representations of each character in the string (especially while insisting that that makes it a distinct character in its own right) is what's wrong.


But the letter DZ is not the same thing as the letter D followed by the letter Z. It's a standalone letter in e.g. Hungarian or Slovak, much like æ isn't "ae".


True - I was thinking of Unicode's definition ("[extended] grapheme clusters").


It is fixed size...for now ;)


But what order are the bytes being chomped?


[flagged]


This adds nothing to an interesting discussion. Can you leave this Facebook nonsense there please?


You are right.

Unicode in UTF-8 will have variable char length. Plain ASCII will be one byte for each char, but others might have up to 4 bytes. Anything dealing with it will have to be aware of leading bytes.

UTF-32, on the other hand, will encode all chars, even plain ASCII ones, using 4 bytes.

Take the "length of a string" function, for example. Porting that from ASCII to UTF-32 is just dividing the length in bytes by 4. For UTF-8, you'd have to iterate over each character and figure out if there is a combination of bytes that collapse into a single character.


> Porting that from ASCII to UTF-32 is just dividing the length in bytes by 4. For UTF-8, you'd have to iterate over each character and figure out if there is a combination of bytes that collapse into a single character.

You have to do that with UTF-32 as well. Yes, every codepoint is 32 bits, but every character can be made up of one or more codepoints.


What?

U+004D in UTF-8 is 4D, in UTF-32 is 0000004D

U+0430 in UTF-8 is D0 B0, in UTF-32 is 00000430

I took these examples from page 102 of The Unicode Standard, Version 5.0.

These leading zeros on the UTF-32 are what is omitted from UTF-8. One of the encodings is optimized for space, the other for processing speed.

I am no expert on Unicode though, there might be something I'm missing (ligatures, that kind of stuff). I would gladly accept a more in-depth explanation of why I am wrong on this.


The letter à can be represented in two different ways in Unicode:

* As the single code point U+00E0, which encodes 'à'.

* Or as the sequence of two code points U+0061 U+0300, which respectively encode the Latin letter 'a' and the combining grave accent (it's hard to display as a standalone character, so go e.g. here: https://www.compart.com/fr/unicode/U+0300). These two code points get combined into a single grapheme cluster, the technical name for what most people consider to be a character, that displays as 'à'.

As you can see, there is no difference in the visual representation of à and à. But if you inspect the string (in python or whatever) then you'll see that one of them has one code point (one "char"), while the second has two. If you're on Windows, an easy way is to type it into pwsh:

    PS> 'à'.EnumerateRunes() | % { "U+{0:X4}" -f $_.Value }
    U+0061
    U+0300
    PS> 'à'.EnumerateRunes() | % { "U+{0:X4}" -f $_.Value }
    U+00E0


The character "à" as a single byte E0, as far as I know, is not UTF-8. This is the 224th character of the extended ASCII table (also ISO 8859). It's another system: https://en.wikipedia.org/wiki/Extended_ASCII

This representation is incompatible with UTF-8, which marks the octet series Cx, Dx, Ex and Fx as leading bytes. You can see the table here: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

What I believe Kernighan _wants_, is to reuse a lot of code that was designed for ASCII. Code that treats 1 char as 1 byte. In order to do that, he is going to encode each unicode character using 4 of whatever type he was previously using before, zeroing the "pad" bytes, which is exactly what UTF-32 does. This way, he doesn't have to fundamentally change all the types he is already working with. (ps: I looked at the commit after I wrote this, he doesn't seem to be doing what I suggested).

Your PS snippet failed for me "Method invocation failed because [System.String] does not contain a method named 'EnumerateRunes'."

I did it in bash, which supports UTF-8 and lets you explicitly set the encoding:

    $ LC_ALL=C bash -c 'x=à; echo ${#x}'
    2
    $ LC_ALL=en_US.UTF-8 bash -c 'x=à; echo ${#x}'
    1
These snippets count the number of chars according to the selected encoding.

If I had UTF-32 locales on my machine, "à" would appear as 4. Unfortunately I don't know any system that implements UTF-32 to display its behavior.

Let's now reconstruct "à" in UTF-8, telling bash to use only ASCII:

    $ LC_ALL=C bash -c 'printf %b \\xC3\\xA0'
    à
I've tunneled UTF-8 through an ASCII program, and because my terminal accepts UTF-8, everything is fine. I didn't even have to use \u escapes. This backwards compatibility is by design, and choosing between UTF-8, 16 and 32 plays a role in how you're going to deal with these byte sequences.


> The character "à" as a single byte E0, as far as I know, is not UTF-8.

It's not, but I wrote "Unicode" and "code point", not "UTF-8" and "code unit". It sounds like you don't know the difference, which makes this whole discussion a bit silly and probably a bit pointless. One of the best explanations I've read is this one: https://docs.microsoft.com/en-us/dotnet/standard/base-types/... and especially the section on grapheme clusters.

> Your PS snippet failed for me "Method invocation failed because [System.String] does not contain a method named 'EnumerateRunes'."

Sounds like you're using the outdated windows powershell, not pwsh.


I'm talking about how many bytes it takes to store characters in these encodings, and how the length of these sequences changes according to each standard.

You are talking about another level of abstraction. These UTF-16 builtins are first-class citizens in .NET, and you never have to deal with the raw structure behind all of it unless you want to. C does not have these primitives, a string there is composed of smokes and mirrors using byte arrays.

In summary, you are talking about using a complete, black box Unicode implementation. I'm talking about the tradeoffs one might encounter when crafting stuff that goes inside one of these black boxes, and it's different from the black box you know.


Okay, let's back up a little. You were replying about a comment that said:

> Yes, every codepoint is 32 bits, but every character can be made up of one or more codepoints.

And you expressed surprise and incredulity at the idea. I explained that a "character" (which, given the context, is to be taken as a "grapheme") can, indeed, take up more than one code point. This has nothing to do with UTF-32, UTF-8, or .NET's implementation. It's inherent to how Unicode works.

As an even more obvious example, try to figure out how many bytes are in the character ȃ encoded as UTF-32, using whatever method you'd like. (I say "more obvious" because you'll be forced to copy-paste the character I've used in your code editor.) Other examples include emojis such as this one: https://emojipedia.org/woman-golfing/ (that I can't use in the comment box for some reason).

Then for some reason you felt compelled to offer a rebuttal that showed you didn't understand the difference between a grapheme and a code point. So I just linked some doc. I'm not attacking you.


That reply is not my first interaction on the thread. I guess you didn't see the previous one, so perhaps you are missing context. Here's the link: https://news.ycombinator.com/item?id=32535499

"ȃ" uses two bytes in UTF-8, 2 bytes in UTF-16 and 4 bytes in UTF-32. The woman golfing uses 4 bytes in UTF-8, UTF-16 and UTF-32.

My surprise was at the idea of having to look up a variable length sequence of bytes in a string stored as UTF-32. To me, UTF-32 exists exactly for the use case of "I don't want to calculate variable length" (optimizing for speed), and UTF-8 exists for "I don't want to use more bytes than I need" (optimizing for space).

I never used the term "codepoint" in this entire thread until now. I was originally talking about storing bytes, hence my little discomfort of not being able to communicate what I'm trying to say. Not feeling attacked, just incapable of expressing, which might have come off wrong. Sorry about that.

BTW, there are many things that I don't understand about Unicode. I never read the full standard. I don't know enough about a grapheme to tell whether it is something that impacts or not the number of bytes when storing in UTF-32.

Why we are talking about UTF-8 versus UTF-32 and not Unicode as a whole? Because there is a code comment about these two in the commit linked by the OP, which sparked this particular subthread.


> "ȃ" uses two bytes in UTF-8

"â" uses two bytes when encoded in UTF-8, while "ȃ" (which was what bzxcvbn supplied as an example and you pasted in the quoted section) uses three bytes when encoded in UTF-8.

  >>> s="ȃ"
  >>> len(s)
  2
  >>> s.encode('utf8')
  b'a\xcc\x91'
  >>> import unicodedata as ud
  >>> [ud.name(c) for c in s]
  ['LATIN SMALL LETTER A', 'COMBINING INVERTED BREVE']


> The woman golfing uses 4 bytes in UTF-8, UTF-16 and UTF-32.

No. It's made up of 5 code points. Each of these takes 32 bits, or 4 bytes, in UTF32. So that emoji, which is a single grapheme, uses 20 bytes in UTF32. Once again, just try it yourself, encode a string containing that single emoji in UTF32 using your favorite programming language and count the bytes!
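
Taking up that suggestion, a quick check in Python (the escape sequence below spells out the ZWJ sequence for the emoji):

    golfer = "\U0001F3CC\uFE0F\u200D\u2640\uFE0F"   # woman golfing ZWJ sequence
    print(len(golfer))                        # 5 code points
    print(len(golfer.encode("utf-32-be")))    # 20 bytes
    print(len(golfer.encode("utf-16-be")))    # 12 bytes
    print(len(golfer.encode("utf-8")))        # 16 bytes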


Let's go to the source:

https://www.unicode.org/faq/utf_bom.html

> Q: Should I use UTF-32 (or UCS-4) for storing Unicode strings in memory?
>
> This depends. It may seem compelling to use UTF-32 as your internal string format because it uses one code unit per code point. (...)

It also confirms what I said about implementation vs interface (or storage vs access, whatever):

> Q: How about using UTF-32 interfaces in my APIs?
>
> Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. (...) While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling

Finally:

> Q: What is UTF-32?
>
> Any Unicode character can be represented as a single 32-bit unit in UTF-32. This single 4 code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character.

So, it does in fact reduce the complexity of implementing it _for storage_, as I suspected. And there is a tradeoff, as I mentioned. And the Unicode documentation explicitly separates the interface from the storage side of things.

That's good enough for me. I mentioned before there might be some edge cases like ligatures, and you came up with a zero-width joiner example. None of this changes these fundamental properties of UTF-32 though.


Reading this thread, I feel bad for the person you’re arguing with. It’s clear you are in the “knowledgeable enough to be dangerous” stage and no amount of trying to guide you will sway you from your mistaken belief that you are right.

Now to try one last time: you are misreading the spec and not understanding important concepts. Take the “woman golfing” emoji as an example. That emoji is not a Unicode “character”, and that is part of why it can’t be represented by a single UTF-32 code unit. That emoji is a grapheme which combines multiple “characters” together with a zero-width joiner, “person golfing” and “female” in this case. Rather than have a single “character” for every supported representation of gender and skin color, modern emoji use ZWJ sequences instead, which means yes, something you incorrectly think is a “character” can in fact take up more than 4 bytes in UTF-32.


I am reading the spec, discussing online and trying to understand the subject better, what is wrong with that?

I said I might be wrong _multiple times_, and it's genuine. I'm glad you appeared with an in-depth explanation that proves me wrong. I asked exactly for that.

The first examples in this thread are not zero-width-joiners (á, â) or complex graphemes. They all could be stored in 4 bytes. It took some time to come up with the woman golfing example.

By the way, one can still implement reading "character" by "character" in sequences of 4 UTF-32 bytes, and decide to abstract the grapheme units on top of that. It still saves a lot of leading byte checks.

Maybe someone else learned a little bit reading through the whole thing as well. If you are afraid I'm framing the subject with the wrong facts, this is me assuming, once again, that I never claimed to be a Unicode expert. I don't regret a single thing, I love being actually proven wrong.


> I am reading the spec, discussing online and trying to understand the subject better, what is wrong with that? I said I might be wrong _multiple times_, and its genuine.

I encourage you to reread the comment I replied to again and see if it has the tone of someone “trying to learn” or rather something different.

> I don't regret a single thing, I love being actually proven wrong.

Me too!


> I encourage you to reread the comment I replied to again and see if it has the tone of someone “trying to learn” or rather something different.

To me it sounds ok. I'm not an english native speaker though, so there is a handicap on my side on tone, vocabulary and phrasing.

My intent was to admit I had wrong assumptions. At some point before this whole thread, I _really_ believed all graphemes (which, in my mind, were "character combinations") could be stored in just 4 bytes. I was aware of combining characters, I just assumed all of them could fit in 32 bits. You folks taught me that they can't.

However, there's another subject we're dealing with here as well. Storing these characters at a lower level, whether they form graphemes or not at a higher level of abstraction.

The fact that I was wrong about graphemes, directly impacts the _example_ that I gave about the length of a string, but not the _fundamental principle_ of UTF-32 I was trying to convey (that you don't need to count lead bytes at the character level). Can we agree on that? If we can't, I need a more in-depth explanation on _why_ regarding this as well, and if given, that would mean I am wrong once again, and if that happens, I am fine, but it hasn't yet.


> The fact that I was wrong about graphemes, directly impacts the _example_ that I gave about the length of a string, but not the _fundamental principle_ of UTF-32 I was trying to convey (that you don't need to count lead bytes at the character level). Can we agree on that? If we can't, I need a more in-depth explanation on _why_ regarding this as well, and if given, that would mean I am wrong once again, and if that happens, I am fine, but it hasn't yet.

As I said in the other thread, to try to minimize confusion, consider Character and Grapheme as synonymous. They are made up of one or more codepoints. The world you’ve made up that everything is characters and they all take exactly four bytes in UTF32 is just wrong. Yes, many graphemes are a single codepoint, so yes, they are 4 bytes in UTF32, but not ALL (and it’s not just emoji’s to blame).

If what you’re on about is the leading zeros, yes, they don’t matter individually. Unicode by rule is limited to 21 bits to represent all codepoints, so the 11 bits left as leading zeros are wasted, which is why folks typically don’t use UTF-32, as it’s the least efficient storage-wise and doesn’t really have any advantage over UTF-16 outside easy counting of codepoints (but again, codepoints aren’t characters).


I'm talking more about C (or any low level stuff) than Unicode now. The world I'm using as a reference has only byte arrays, smokes and mirrors.

I'm constantly pointing you _to the awk codebase_. It's a relevant context, the title of the post and it matters. Can you please stop ignoring this? There's no Unicode library there, it's an implementation from scratch.

If you are doing it from scratch, there's a part of the code that will deal with raw bytes, way before they are recognized as Unicode things (whatever they might be).

Ever since this entire post was created, the main context was always this: an implementation from scratch, in an environment that does not have Unicode primitives as first-class citizens. Your string functions don't have \u escapes; YOU are writing the string functions that support \u.


Ok, now I know you’re just trolling. Enjoy and goodbye!


I'm not kidding. Just look at the code:

https://github.com/onetrueawk/awk/commit/d322b2b5fc16484affb...

I am talking about that environment. You are not.


To make the point clearer, the female golfing Unicode “character” is encoded as follows in the various UTFs:

UTF16 (12 bytes total):

\ud83c\udfcc\ufe0f\u200d\u2640\ufe0f

UTF32 (20 bytes total):

u+0001f3cc u+0000fe0f u+0000200d u+00002640 u+0000fe0f

UTF8 (16 bytes total):

\xf0\x9f\x8f\x8c\xef\xb8\x8f\xe2\x80\x8d\xe2\x99\x80\xef\xb8\x8f


You are still presenting me with abstractly encoded data. \u and u+ are at a higher level of abstraction. The only raw bytes I am seeing here are in the UTF-8 string, which you decided to serialize as hexadecimals (did you have a choice? why?).

If you had all of these expressed as pure hexadecimals (or octals, or any single-byte unit), how would they be serialized?

Then, once all of them are just hexadecimals, how would you go about parsing one of these sequences of hexadecimals into _characters_? (each hexa representing a raw byte, just like in the awk codebase we are talking about)

Another question: do you need first to parse the raw bytes into characters before recognizing graphemes, or can you do it all at once for both variable-length and fixed-length encodings?


> You are still presenting me with abstractly encoded data.

That’s the actual encoding for that grapheme as specified by the spec for UTF8, UTF16, and UTF32.

> \u and u+ are in a higher level of abstraction.

No, it's not; it's how you write escaped 16-bit and 32-bit hexadecimal values for UTF-16 and UTF-32 respectively. Notice there are 4 hex characters after \u and 8 after u+. Those are the "raw bytes" in hex.

> The only raw bytes I am seeing here are in the UTF-8 string, which you decided to serialize as hexadecimals

All three forms are "raw bytes" in hex form. \x is how you represent an escaped 8-bit UTF-8 byte in hex.

> Another question: do you need first to parse the raw bytes into characters before recognizing graphemes, or can you do it all at once for both variable-length and fixed-length encodings?

You need to "parse" (more like read for UTF-16 and UTF-32, as there's not much actual parsing beyond byte-order handling) the raw bytes into codepoints. To try to minimize confusion, consider Character and Grapheme as synonymous. They are made up of one or more codepoints. It really doesn't matter whether the encoding is variable-length or fixed-length; you still have to get the codepoints before you can determine characters/graphemes.
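
For concreteness, here is a minimal sketch in C of that raw-bytes-to-codepoints step for UTF-8 (illustrative only: it skips validation of continuation bytes, overlong forms, and surrogates, and it is not the awk code itself):

  #include <inttypes.h>
  #include <stdio.h>

  /* Decode one UTF-8 sequence starting at s into *cp.
     Returns the number of bytes consumed (1-4), or 0 on an invalid lead byte. */
  static int decode_utf8(const unsigned char *s, uint32_t *cp) {
      if (s[0] < 0x80) {                       /* 0xxxxxxx */
          *cp = s[0];
          return 1;
      }
      if ((s[0] & 0xE0) == 0xC0) {             /* 110xxxxx 10xxxxxx */
          *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
          return 2;
      }
      if ((s[0] & 0xF0) == 0xE0) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
          *cp = ((uint32_t)(s[0] & 0x0F) << 12) | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
          return 3;
      }
      if ((s[0] & 0xF8) == 0xF0) {             /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
          *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12)
              | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
          return 4;
      }
      return 0;                                /* continuation byte in lead position */
  }

  int main(void) {
      /* the female-golfer grapheme from above, as raw UTF-8 bytes */
      const unsigned char s[] = "\xf0\x9f\x8f\x8c\xef\xb8\x8f\xe2\x80\x8d\xe2\x99\x80\xef\xb8\x8f";
      for (size_t i = 0; s[i] != '\0'; ) {
          uint32_t cp;
          int n = decode_utf8(s + i, &cp);
          if (n == 0)
              break;
          printf("U+%04" PRIX32 " (%d bytes)\n", cp, n);
          i += n;
      }
      return 0;
  }

Run on that byte string, this prints U+1F3CC, U+FE0F, U+200D, U+2640 and U+FE0F, the same five codepoints as the UTF-16 and UTF-32 forms above; grapheme segmentation then groups them into one user-perceived character.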


I am asking you "how do you purify water"? And you're holding a bottle of Fiji and telling me "look, it's simple".

You're absolutely right about what is a character and what is a grapheme. I already said that, this subject is done. You're right, no need to come back to it. You win, I already yielded several comments ago.

Now, to the other subject: I would very much prefer if we talked only about bytes. Yes, talking only about bytes makes things harder. Somewhere down the line, there must be an implementation that deals with the byte sequence. I'm talking about that level, just above the assembler (in awk's case). There IS confusion at this level, and there is no way to avoid it except by abstracting it yourself, byte by byte (or 4 bytes at a time in UTF-32).


Take a step back and ask "why do I need to know the length of this thing?" If it's so you know how much storage to allocate, then the process/time is the same for both (the answer is: however many bytes you happen to have). If your array is bigger than the number of valid characters (code points) and you need to search through it to find the "end" (last valid character), you can do that with almost identical complexity (you don't actually need to iterate over every byte and code point with UTF-8, because of how elegantly the encoding was designed).

Why else might you need to know the length? If it's to know how much space to allocate in a GUI (or even on a console) then neither encoding is going to help.

Maybe it's because of some arbitrary limitation like "your name must be less than 50 characters" and I'll just say that if that's the case, you are doing it wrong (if you need to limit it for storage/efficiency purposes, fine, but you will probably be better off limiting by bytes and using UTF-8 since most people will be able to squeeze in more of their names).

I'm not saying there aren't reasons for needing to know the "length" (number of code points) of a string, and certainly many existing algorithms are written in a way that assumes calculating string length and indexing arbitrarily into the middle of a string are fast (O(1) for indexing). But in reality, for almost any real-world problem beyond "how much storage do I need", almost everything you need to do requires iterating over a string one code point at a time, which is O(n) for both encodings. The biggest difference is that UTF-8 may require more branching, but it's common enough that, between vectorization and generally better optimizations thanks to its popularity, UTF-8 will do just fine while usually using less storage, which can significantly benefit CPU cache locality.
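
To make the "you don't need to examine every byte's structure" point concrete, a quick sketch (counting code points, not graphemes, and assuming valid UTF-8): continuation bytes always match 10xxxxxx, so counting code points is just counting the bytes that are not continuation bytes.

  #include <stddef.h>

  /* Count code points (not graphemes) in a NUL-terminated UTF-8 string.
     Assumes the input is valid UTF-8. */
  static size_t utf8_codepoint_count(const char *s) {
      size_t count = 0;
      for (; *s != '\0'; s++) {
          if (((unsigned char)*s & 0xC0) != 0x80)   /* not a 10xxxxxx continuation byte */
              count++;
      }
      return count;
  }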


It needs the length for operations such as substring, or to apply length modifiers on regular expressions (such as \w{3,5}), which is a common thing in awk programs.

In fact, u8_rune as implemented in the branch we are discussing (https://github.com/onetrueawk/awk/compare/unicode-support) returns a length to be used as an offset later.

This is not me saying it; it's the author. There is a code comment there:

> For most of Awk, utf-8 strings just "work", since they look like null-terminated sequences of 8-bit bytes. Functions like length(), index(), and substr() have to operate in units of utf-8 characters. The u8_* functions in run.c handle this.

I know there might be different ways of doing it, but we're talking about a specific implementation.

I was wrong to assume he is storing stuff in UTF-32. He could have, but there was already code in place there to make the UTF-8 storage easier to implement.
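
For illustration only, here is a sketch of the kind of helper that substr()-style functions need (not the actual u8_* code from run.c): turning a character index into a byte offset means stepping over one UTF-8 character at a time.

  #include <stddef.h>

  /* Byte offset of the n-th UTF-8 character (0-based) in s, or the offset of
     the terminating NUL if the string has fewer characters. Illustrative only;
     the real implementation lives in the u8_* helpers in run.c. */
  static size_t utf8_char_offset(const char *s, size_t n) {
      size_t byte = 0;
      while (n > 0 && s[byte] != '\0') {
          byte++;                                       /* step past the lead byte */
          while (((unsigned char)s[byte] & 0xC0) == 0x80)
              byte++;                                   /* step past continuation bytes */
          n--;
      }
      return byte;
  }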


It's also the case that `char` in Rust is a 32-bit primitive [1]. Though `String` being UTF-8 is convenient until you need to do Unicode operations on actual chars...

It is surely better than dealing with UTF-16 "surrogate pairs" [2] in most pre-UTF-8 tech, where a `char` is 16 bits but can't fit all Unicode codepoints without another 16-bit `char`.

[1]: https://doc.rust-lang.org/std/primitive.char.html#representa...

[2]: https://stackoverflow.com/a/54922477/809572
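
For reference, reassembling a UTF-16 surrogate pair into a code point is a small but easy-to-get-wrong calculation; a sketch in C (it assumes hi is a valid high surrogate and lo a valid low surrogate):

  #include <stdint.h>

  /* Combine a UTF-16 high/low surrogate pair into a Unicode code point.
     Assumes hi is in [0xD800, 0xDBFF] and lo is in [0xDC00, 0xDFFF]. */
  static uint32_t from_surrogate_pair(uint16_t hi, uint16_t lo) {
      return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
  }

  /* e.g. 0xD83C, 0xDFCC -> 0x1F3CC, the golfer base code point mentioned above */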


Well, it's not like we don't have plenty of combining marks in Unicode itself, so I suspect this is something you have to deal with either way.


UTF-32 performance woes are exaggerated. UTF-8 decoding requires branching whereas UTF-32 does not, and in terms of storage, real-world text does not consume much space.
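
A sketch of the difference: stepping through UTF-32 is a plain fixed-width loop with no per-character branching (byte-order handling and validation left out), unlike the UTF-8 lead-byte branching sketched earlier.

  #include <stdint.h>
  #include <stddef.h>

  /* Sum of code points, purely to show the access pattern: one 32-bit read per
     code point, no per-character branching on byte patterns. */
  static uint64_t walk_utf32(const uint32_t *buf, size_t n_codepoints) {
      uint64_t sum = 0;
      for (size_t i = 0; i < n_codepoints; i++)
          sum += buf[i];
      return sum;
  }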


UTF-32 has the BOM, which adds complexity and at least a top-level branch. Maybe it's not so bad in practice, just duplicated generated code.

But as an internal representation, some of it can be ignored by choice.


I believe no distro actually ships this version of awk by default. They ship GNU awk, which has Unicode support anyway.


OpenBSD uses "The One True AWK."

  $ awk -V
  awk version 20211208

Kernighan's version is likely used in other places where the GPL is eschewed.


I think the other BSDs do too, including macOS.


Yes, here on macOS 11.6.8

  $ awk -version
  awk version 20200816


On a FreeBSD 12.3-RELEASE box, FWIW:

   # awk -version
   awk version 20210215



So it turns out the default on Debian is mawk, which does NOT support Unicode. Thanks for pointing that out. This simple test gives different results for gawk and mawk:

  $ echo 'ö' | awk '{print length}'


…only if the current locale is set to use UTF-8 (or some other variable-width encoding). Which nowadays the default locale usually does, but in principle it doesn’t need to be.


It's somewhat comforting to hear even Brian K. say he doesn't understand Git well.


Yeah, my thought on reading "I still don't have a proper understanding" is that I'm not sure anybody has that. It seems like you only know too little or too much.

I have only worked with one person, really, who knew it backwards and forwards. That seemed helpful at first, until he used his deep knowledge to try some sort of tricky fix-up and in the process lost some of my work. Given that a version control system's one job is to not lose work, I was very disappointed.


Local Man Deletes Files, Leaves Coworkers Very Disappointed


What context do you remember about this scenario that left the reflog a rektlog?

Was it not confined to your local machine?


This was 8 years ago, so I just remember us being in some sort of merge pickle, my colleague going, "Ah, git is easy if you just understand it as..." and then jumping in to fix it. Which he'd done a few times before successfully, as he really did understand it well. But this time he did something wrong and data was lost.

He was very apologetic about it, and it wasn't more than a day or two of work, so I was ok with it. But it was one of many experiences that have left me permanently suspicious of doing anything fancy with git. Which is fine, because for a lot of reasons mostly not to do with any particular VCS, I work to keep branching to an absolute minimum.


If he means it in the "I have decided not to take the time to learn this" then yes, it's comforting. Otherwise it's just sad -- I'm sure there are many people who would be willing to tutor him in it in exchange for only the honour of having tutored Kernighan.


i'm guessing his issue is not with the git porcelain. The actual internals of git are kind of messy.


I came to the comments to say that it’s reassuring. :)


Likewise.


And to be fair he's been a professor for years, not a developer.

When I took my undergrad roughly a decade ago, github was already a thing but our courses were still using subversion.

Academia is slow to adopt new tools.


Git really is a mess. The fact that commits, and not diffs, have hashes should be lampooned, despite arguably a few small benefits. Geniuses make mistakes too, and git is Linus's. The only reason git is respected is that it came from Linus. If it were from Microsoft, it would get all the criticism it deserves and then 20 times more.


I think it would be a little more accurate to say that git is kind of a weird default: it’s a revision control power-user tool with all the sharp corners we expect from such tools and is overkill for most users in most situations. It’s understandable that typical users resent the complexity and the foot cannons.

Whether it's git or vi/emacs or C or LaTeX or whatever, there are a bunch of us old-timers who went through the pain already and now notice the modest loss of capability on hg or VSCode or whatever.

But even as such an old-timer, I don’t think it’s a good idea to default everyone into this low-level shit. Mercurial and VSCode and Python are fine, you can do great work without “the bare metal”.

Edit: Clarified that it’s a “revision control” power-user tool, plenty of “power users” have more typical revision-control needs.


I never quite get this. Why would you want diffs to be the primary artifact?

I've always found git to be exceedingly simple in design. Though the CLI is quite messy, I grant you.


I liked this interview with Brian with Lex Fridman: https://www.youtube.com/watch?v=O9upVbGSBFo


+1. Also, I like your username.


Here is Brian Kernighan mentioning the Unicode work in an interview: https://www.youtube.com/watch?v=GNyQxXw_oMQ


For context, this is 37 years after awk was first released (1985).


While it's great that Brian's added this, didn't Plan 9's version of AWK support UTF-8?

I haven't used it, but my understanding was that all the Unix tools were upgraded to work with Unicode - I don't quite see why it's taken so long if the work was already done years ago.


I also became aware of this from Kernighan's interview on Computerphile. And I thought to myself "man, such an omnipresent and fundamental piece of software doesn't have UTF-8 support yet?" - and then I realized that, after >20 years of using awk in my scripts, I never had the urge to use it to parse UTF-8. But I guess that's just the luck of living in a country that uses a Latin alphabet, and I couldn't imagine how people could use it on documents in e.g. Chinese or Arabic. This was really an example of how discriminatory software is when it comes to languages.


What does adding Unicode support mean? Is it about making all the string indexing match codepoints instead of bytes, or is it more than that?


Extending ASCII using bytes on the fly for typesetting, bounded in a UTF-8 character space.


Brian Kernighan is 80 years old. Donald Knuth is 84. Both still write code and avoid most social/non-social media in general. It's amazing, given how often we are surrounded by people giving up coding at a rather young age for various reasons. Who else would you include in this class?


Is the use of tools like awk, m4, sed, etc. common and growing, or is it decreasing?

I've always found that for my personal use cases, writing a little Python snippet has always been much faster and simpler than understanding what CLI tool to use, and how to use it, to achieve the same thing.


Depends on the situation.

I've had to work in environments where awk was available by default, but Python was not (and was not going to be).

It can be good to have a screwdriver for every type of screw in the toolbox.


I only use awk; sed is unnecessary when you know awk has gsub or sub. Though I would say I'd really like Lua to be in there too, considering its small size.



I'd run across issues in handling Unicode content in a recent awk (gawk) project. This is excellent news.


K is a gift to humanity who keeps on giving. I am a big fan of his C book and the AWK book!

Long live, K!


So, has anyone seen/awk'd a unicode mul-tics TeX check box example yet?


Per discussion, header should be revised to "Awk kerning again a unicode mul-tics Tex thing"


Of course he did. Aho has better things to do and Weinberger is too rich to write code any more.


Guessing compiler stuff sells better than scripted stuff, per popularity of "Compilers: Principles, Techniques and Tools" 1st & 2nd editions.


i think Peter Weinberger is still writing code at google...


[off-topic]

Following the spirit of UNIX, I did a little analysis on the upvotes this post got over time (fish-shell):

  while true; curl -sL  'https://news.ycombinator.com/item?id=32534173' | pup  '#score_32534173 text{}' | awk -F'[^0-9]*' '{print $1}' | tee -a points; sleep 15s; end
(Initially I used `grep -Po '\d+'`, but switched to an awk solution due to... context!)

I started it approx. when I posted it. Now ~2 hours have passed since. Using `gnuplot`:

  f(x) = a*x+b; fit f(x) "points" via a,b; set terminal png size 1920,1080 enhanced font "Inconsolata,20" ;set output "HN-analysis.png" ;set grid; set ylabel "points";set key bottom right ; set xlabel "sample # (15s interval)"; plot 'points' w linesp lt 7 lw 3 lc rgb "orange", f(x) lc rgb 'blue' lw 2

We generate the plot: https://i.imgur.com/pS6AaI5.png

(The jump at sample #100 is due to a network error on my side.)

And here are the coefficients of a linear fit over the data (note that every 4 samples is 1 minute, so this post got ~1.52 upvotes per minute):

  a = 0.380809, b = 19.8437


Fun stuff ... now correct it by plotting the X-axis as actual elapsed epoch time since the start of sampling :-)

(For bonus fun, fit successive unbiased moving averages of a selectable width; for further fun, do this for non-uniform sample times, i.e. time-weighted; and look beyond simple moving averages to, say, the Savitzky–Golay filter family for both uniform and non-uniform sampling intervals.)


This is cool. Thanks for sharing how you did it. Nice to see awk being used in a post about awk!



I didn't get it :)

Are you suggesting another method for graphing?


GRAP was/is Bell Labs-developed software (by Jon L. Bentley & Brian W. Kernighan).

gnuplot was/is Berkeley/GNU-developed software.

Guess it's more of a '70s/'80s pre-open-source observational comment.


Getting a t-shirt with "awk < 1" printed on it, hand auto-grap-ed by Brian Kernighan, would be more memorable though.


> Once I figure out how… I will try to submit a pull request. I wish I understood git better, but in spite of your help, I still don't have a proper understanding, so this may take a while.

Even Kernighan struggles with git.


Torvalds is a better programmer than that.

Pull requests are a feature of GitHub, not a part of git.

https://docs.github.com/en/pull-requests/collaborating-with-...


Nitpick, but GitHub pull requests (which are really merge requests; you aren't pulling anything) are named after the actual pull requests used, e.g., between kernel maintainers ("Please, could you pull from ...").

Git has tooling to help with those, so it kinda is a git feature: git request-pull. https://git-scm.com/docs/git-request-pull


Thanks for clarifying. I learned something.

I guess it might be more accurate for me to think about pull requests as GitHub's reification of the social protocol of pulling among kernel developers.


I'd say it's a feature of all(?) DVCSes; almost all of them, or at least the ones that I know of, use PRs/forks.


This reminds me of the relevant xkcd: https://xkcd.com/1597/


The culture around p.r.s is truly a high barrier to entry for many people.

Figuring out how all of this works is, I find in practice, substantially more difficult than fixing many longstanding trivial bugs in a great deal of software.


What’s the alternative? The old way (which is still used by many projects) is to send patches to mailing lists, which I find more difficult: you need to learn how to generate the patch from your source code repo, send the patch as an e-mail (needing weird hacks like `git imap-send`), and then configure your MUA not to mangle it somehow. Then you also don’t have a centralized search/tracking interface.

Some good reasons not to use GitHub is because you’re familiar with standard/traditional tools, or because you prefer not to use centralized services. Both of those are fine reasons! But “the traditional way is easier” isn’t.


Seems like you're arguing against something they didn't say.

They just said, "The culture around p.r.s is truly a high barrier of entry for many people." That's true and it's important to acknowledge that on any project that wants contributions from as many people as possible. When I was at Code for America I saw smart people putting plenty of time into helping people get over that barrier.

That's certainly the case here. If a professor of CS receives help from an open-source maintainer who's been coding for decades and still doesn't feel confident, then it's safe to say a) there's room for improvement, and b) until we figure out what the improvement is, we should make sure that we're providing the necessary support to everybody we want to contribute.


An alternative universe: you have an email address you can send patch files to. You send the patch file (like, generate a patch file and then just manually attach it to an email). That automatically creates a PR-looking thing.

The fact that (for example) GitHub requires a fork for you to send a PR to a project (that you don’t have push access to) is soooo overkill.


What happens if multiple people contribute to the same PR (which is extremely common)? With a patch, that history is lost.


I mean… you can imagine a history being stitched together server-side. The ingest mechanism isn't the final result. It's just that "have to make a PR, have to make a branch name, etc." feels a bit silly.


> I mean… you can imagine a history being stitched together server-side.

And since Git has a built-in mechanism for tracking such things, called branches, we could ask users to create a new one. Then add a possibility for people to comment on the proposed change. Sounds like you're incrementally reinventing PRs.


But _we don't need to ask users to do this_. We can just do "the obvious thing".

We're talking about usability here; of course the fundamental features are available in Git. It's Git!

Here [0] is a more fully written-out version of this idea (by a developer of Mercurial).

[0] https://gregoryszorc.com/blog/2020/01/07/problems-with-pull-...


Oh, I thought being told to git solved the conflicts.


Of course not; the email can list the contributors, or not, at its own pleasure.


One would not even need to make a diff in theory, simply send the new version.

In many cases, it is sufficient to simply send an email with:

> There's a bug in the function `handle_request` in `src/network/core.rs` that can cause it to return a double answer when invoked during a leap second. This fixes that issue: ...

And just include the new function inside of the email body, without attachment.

That should be sufficient for many cases, but many projects do not allow this and demand p.r.s.



