Why did base64 win against uuencode? (retrocomputing.stackexchange.com)
225 points by egorpv 10 months ago | 103 comments



One reason that uuencode lost out to Base64 was that uuencode used spaces in its encoding. It was fairly common for Internet protocols in those days to mess with whitespace, so it was often necessary to patch up corrupted uuencode files by hand.

Base64, on the other hand, was carefully designed to survive everything from whitespace corruption to being passed through non-ASCII character sets. And then it became widely used as part of MIME.


And yet Internet protocols (HTTP, at least) sometimes don't play well with the equals signs that are part of base64. That little issue has caused lots of intermittent bugs for me over the years, either from forgetting to urlencode it or not urldecoding it at the right time.


So there are 7 Base64 encodings: one with "+ / =", one with "- _ =", one with "+ ," and no "="… https://en.wikipedia.org/wiki/Base64#Variants_summary_table


And decoders typically aren't interoperable, requiring you to use the specific decoder for that combination.


Which is silly, because there’s no good reason to, except for strict validation.


TIL.

And Python uses RFC 4648


Python might say that, but as is often the case it's not really true: it really mostly works off of RFC 2045

- the “default” encoder (“b64encode”) will pad the output

- although it will not line-break ("encodebytes" does that)

- the default decoder will error if the input is not padded

- the default decoder will ignore all non-encoding characters by default

Also both b64encode and encodebytes actually use binascii.b2a_base64, which claims conformance to RFC 3548, which attempts to unify 1421 and 2045. Except RFC 3548 requires rejecting non-encoding data, whereas (again) Python accepts and ignores it by default, in 2045 fashion.
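
A quick sketch of those behaviours (Python 3 standard library; the outputs in the comments are what I'd expect from a reasonably current version):

    import base64

    data = b"hi"

    # b64encode pads but does not wrap lines
    print(base64.b64encode(data))           # b'aGk='

    # encodebytes pads *and* inserts line breaks every 76 output characters
    print(base64.encodebytes(b"x" * 60))    # two lines, each ending in b'\n'

    # the default decoder rejects missing padding...
    try:
        base64.b64decode(b"aGk")
    except Exception as e:
        print("rejected:", e)               # binascii.Error: Incorrect padding

    # ...but silently discards non-alphabet bytes unless validate=True
    print(base64.b64decode(b"aG\nk="))      # b'hi'
    try:
        base64.b64decode(b"aG\nk=", validate=True)
    except Exception as e:
        print("rejected:", e)               # binascii.Error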


And slashes as well, which are magic characters in both URLs and file systems. That means you can't reliably use normal base64 for filenames, for instance. It might seem like a niche use case, but it's really not, because you can use it for content-based addressing. Git does this: it names all the blobs in the .git folder after their hash, but you can't encode the hash with regular base64.


There’s the URL- and filename-safe variant of Base64 [0]. Decoders can support it simultaneously and transparently.

[0] https://www.rfc-editor.org/rfc/rfc4648.html#section-5
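
A small illustration of the two alphabets (Python; decode_any is a hypothetical helper showing one way to accept both transparently):

    import base64

    blob = bytes([0xfb, 0xef, 0xff])
    print(base64.b64encode(blob))          # b'++//'
    print(base64.urlsafe_b64encode(blob))  # b'--__'

    def decode_any(s: bytes) -> bytes:
        # map the URL-safe characters back onto the standard alphabet first
        return base64.b64decode(s.translate(bytes.maketrans(b"-_", b"+/")))

    print(decode_any(b"--__") == blob)     # True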


you can also manually replace those characters with the URL-safe ones


Ditto the obnoxious "quoted-printable" mail encoding, which turns every = into =3D.

Still more robust than uuencode though.


It's basically the same as URL encoding, they just picked = instead of %


It is, plus extra segmenting with `=` escaped line breaks [1]:

> Lines of Quoted-Printable encoded data must not be longer than 76 characters. To satisfy this requirement without altering the encoded text, soft line breaks may be added as desired. A soft line break consists of an =

IIUC in Base64 you can throw whitespace in anywhere and it should be ignored. And in URL ("percent") encoding there is no insignificant whitespace possible (?), and the encoding of a space depends on the implementation (the dreaded `%20` vs ` ` vs `+` in application/x-www-form-urlencoded [2]).

[1] https://en.wikipedia.org/wiki/Quoted-printable [2] https://en.wikipedia.org/wiki/Percent-encoding
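
A quick way to see both behaviours is Python's quopri module (standard library); this is only a sketch, and line-wrapping details vary by implementation:

    import quopri

    text = b"x = y + z, " * 10                    # long enough to force a soft line break
    encoded = quopri.encodestring(text)
    print(encoded.decode())                       # every '=' becomes =3D; wrapped lines end in '='
    print(quopri.decodestring(encoded) == text)   # True -- round-trips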


I am using base62 for data that can be included in URIs.


all three symbols are some of the worst possible choices for compatibility with urls and many other things

.-_ would have been a better choice than +/=


base64 is older than URLs, though.


And now we can have whitespace in url queries but we are still using %20 everywhere because "that's standard"...


Try copy-pasting a link that has actual whitespace in its URL queries and see if it gets linkified correctly. Just because you can doesn't mean you should! A space is like the one delimiter that is applicable for separating out URLs from the context of a larger blob of text.


Browsers will often display %20 as a space, but that's not the same thing as spaces being legal within URLs.


You are right. It seems Firefox displays %20 as whitespace and converts whitespace to %20 when you use it. Chrome displays it as %20 but still converts whitespace to %20 if you try to use it.


Space is not legal at the HTTP request level, because the opening line uses space as a delimiter like:

    GET /your/path-to/the.file HTTP/1.1


Have fun with newline and spaces/tabs conversions when allowing whitespace in URLs.


...only for base64 to become fragmented into standard or URL due to character choice, and padded or not padded as a "cool trick to save some bytes".


Having lived through the transition, I can say personally it comes down to "packaging" - if MIME had adopted the UUENCODE format, I probably would have used it, but as materials which depended on base64 decode came my way, it became compelling to use it. Once it was ubiquitously available in e.g. ssl, it became trivial to decode a base64 encoded thing, no matter what. Not all systems had a functioning uudecode all the time. On DOS, for instance, you had to find one. If you're given base64 content, you install a base64 encode/decode package and then it's what you have.

There was also an extended period of time where people did uux much as they did shar: both of which are inviting somebody else's hands into your execution state and filestore.

We were also obsessed with efficiency. base64 was "sold" as a denser encoding. I can't say if it was true overall, but just as we discussed Lempel-Ziv and gzip tuning on usenet news, we discussed uuencode/base64 and other text wrapping.

Ned Freed, Nathaniel Borenstein, Patrik Falstrom and Robert Elz amongst others come to mind as people who worked on the baseXX encoding and discussed this on the lists at the time. Other alphabets were discussed.

uu* was the product of Mike Lesk a decade before, who was a lot quieter on the lists: He'd moved into different circles, was doing other things and not really that interested in the chatter around line encoding issues.


Here are Usenet comments from the 1994 comp.mail.mime thread "Q: why base64 ,not UUencode?"

1) https://www.usenetarchives.com/view.php?id=comp.mail.mime&mi...

> Some of the characters used by uuencode cannot be represented in some of the mail systems used to carry rfc 822 (and therefore MIME) mail messages. Using uuencode in these environments causes corruption of encoded data. The working group that developed MIME felt that reliability of the encoding scheme was more important that compatibility with uuencode.

In a followup (same link):

> "The only character translation problem I have encountered is that the back-quote (`) does not make it through all mailers and becomes a space ( )."

A followup from that at https://www.usenetarchives.com/view.php?id=comp.mail.mime&mi... says:

> The back-quote problem is only one of many. Several of the characters used by uuencode are not present in (for example) the EBCDIC character set. So a message transmitted over BITNET could get mangled -- especially for traffic between two different countries where they use different versions of EBCDIC, and therefore different translate tables between EBCDIC and ASCII. There are other character sets used by 822-based mail systems that impose similar restrictions, but EBCDIC is the most obvious one.

> We didn't use uuencode because several members of our working group had experience with cases where uuencoded files were garbaged in transit. It works fine for some people, but not for "everybody" (or even "nearly everybody").

> The "no standards for uuencode" wasn't really a problem. If we had wanted to use uuencode, we would have documented the format in the MIME RFC.

That last comment was from Keith Moore, "the author and co-author of several IETF RFCs related to the MIME and SMTP protocols for electronic mail, among others" says https://en.wikipedia.org/wiki/Keith_Moore .


After a given point Usenet was nearly 8-bit clean, and thus https://en.wikipedia.org/wiki/YEnc was also developed: it shifts every octet (I + 42, decimal, mod 256) and escapes the few results that still match reserved characters (CR, LF, 0x00, and = itself, the yEnc escape) - it seems that if the shifted value was among that set, then an = was output and the new output determined by O = (I+64) % 256 instead.
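
A minimal sketch of that transform as described (plain Python; it ignores yEnc's line wrapping, headers, and CRC):

    CRITICAL = {0x00, 0x0A, 0x0D, 0x3D}   # NUL, LF, CR, '='

    def yenc_encode(data: bytes) -> bytes:
        out = bytearray()
        for byte in data:
            o = (byte + 42) % 256
            if o in CRITICAL:
                out.append(0x3D)          # emit the '=' escape marker
                o = (o + 64) % 256
            out.append(o)
        return bytes(out)

    def yenc_decode(data: bytes) -> bytes:
        out = bytearray()
        it = iter(data)
        for b in it:
            if b == 0x3D:                 # escaped byte follows: undo the extra +64
                b = (next(it) - 64) % 256
            out.append((b - 42) % 256)
        return bytes(out)

    payload = bytes(range(256))
    assert yenc_decode(yenc_encode(payload)) == payload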


Yenc is still used a lot actually, for the purpose of what Usenet has de facto become, a piracy network :)


It's too bad yenc didn't take the place of base64 for email.


yEnc was rejected by the MIME standardization group for two main reasons, one good and one bad. The good reason was that it has some encoding pathologies, although these could have been fixed in the standardization process. The bad reason was "it's too hard to add a new Content-Transfer-Encoding because you have to change all the user agents" - given that by that time all the clients were changing to support yEnc anyway, it was quite clear that uptake of a new addition would likely have been fairly rapid.


> We were also obsessed with efficiency. base64 was "sold" as denser encoding. I can't say if it was true overall

uuencode has file headers/footers, like MIME. But the actual content encoding is basically base64 with a different alphabet; both add precisely 1/3 overhead (plus up to 2 padding bytes at the end).


uuencode has some additional overhead, namely 2 additional bytes per line, which means it varies from 60-70% (the latter being the best case), while base64 is 75% efficient in all cases.
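
For what it's worth, the line overhead of the two formats can be measured directly with the standard library (a rough sketch; exact ratios depend on wrapping and headers):

    import base64, binascii

    data = bytes(3000)

    b64 = base64.encodebytes(data)                     # Base64 wrapped at 76 chars
    uu = b"".join(binascii.b2a_uu(data[i:i + 45])      # one uuencoded line per 45 bytes
                  for i in range(0, len(data), 45))

    print(len(data) / len(b64))   # roughly 0.74
    print(len(data) / len(uu))    # roughly 0.73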


On a related note, I'm getting flashbacks to being on the web in the late 1990s, back when "Downloads!" was a reason to visit a particular website, and noticing that Windows users like myself could just download-and-run an .exe file, while the same downloads for Macintosh users would be a BinHex file that'd also be much larger than the Windows equivalent - and this wasn't over FTP or Telnet, but an in-browser HTTP download, just like today.

Can anyone explain why BinHex remained "popular" in online Mac communities through to the early 2000s? Why couldn't Macs download "real" binary files back then?


Classic Macintosh files were basically 2 separate files with the same name (data fork and resource fork). Additionally, there was important meta data (Finder Info, most importantly the file type, creator type). Since other file systems couldn't handle forks or finder info, it had to be encapsulated in some other format like binhex, macbinary, applesingle, or stuffit. The other 3 were binary so they would have been smaller. Why not them... shrug


I wasn't a macintosh user back in the day but for the file archives I frequented (apple II), sometimes files were in BINSCII which was a similar text encoding. The advantage being that they could be emailed inline, posted to usenet, didn't require an 8-bit connection (important back in the 80s), and could be transferred by screen scraping if there wasn't a better alternative.


So this is really random, but Kermit actually could route around your 7-bit issues. Everyone remembers it as a godawful protocol choice, because terminal programs usually implemented the most basic version of the protocol, but if configured correctly it was on par with or better than Zmodem.


Kermit was still useful into the 90s. I had a brief gig helping a friend's dad set up an ecom store. Their payment processor got nightly batches of authorizations and captures by modem. We were having issues getting our machine to talk with their machine, and surprisingly enough I was able to get an actual engineer on the phone. We used Kermit to debug the modem handshake interactively while we conversed, figuring out the magic Hayes incantation to get them happy.

A very different world than today.


What I saw most of the time was a file that was compressed with StuffIt, and then encoded with BinHex. They were usually about inverse in terms of efficiency, so what you saved with StuffIt you would then turn around and lose with BinHex. But the resulting file was roughly the same size as the original file set.


The most common Macintosh archive format was (eventually) StuffIt, but StuffIt Expander couldn't open a .sit file which was missing its resource fork, and when you downloaded a file from the internet, it only came with a data fork.

So a common hack was to binhex the .sit file. Binhex was originally designed to make files 7-bit clean, but had the side effect that it bundled the resource fork and the data fork together.

Later versions of StuffIt could open .sit files which lacked the resource fork just fine, but by then .zip was starting to become more common.


I could be remembering wrong, but didn't later versions of stuffit compress to a .sit file that had no resource fork, so it would stay fully-intact on any filesystem? I may be imagining that, but I remember hitting a certain version where "copying to Windows" would no longer ruin my .sit files... haha


I don't remember there ever being a resource fork, but I think around v4 it started allowing you to drag-n-drop any file to decompress, ignoring the missing/required finder info from previous versions


It's been a long time, but I don't think that's exactly true. The resource fork simply became optional, assisted by later versions of MacOS, which let applications open files which didn't have one.


Ahh hmm OK, I'll have to check into this some time. I feel like the filesize would stay the same (suggesting no loss of data), but it's totally possible I'm misremembering considering how long it has been, hahah :)


It's quite plausible that the "file size" simply didn't include the resource fork.


Funny because today I find the install process for Mac much simpler. Most installs are "drag this .app file to your Applications folder", meanwhile on Windows you download an installer that downloads another installer that does who-knows-what to your system and leaves ambiguously-named files and registry modifications all over the place.


There are plenty of portable windows applications (distributed as a zipped directory) and there are plenty of pkg macOS installers.

I don't really understand why macOS users like this "simple" installation, because when you "uninstall" the app, it leaves all the trash in your system without a chance to clean up. And implying that macOS application somehow will not do "who-knows-what" to your system is just wrong. Docker Desktop is "simple", yet the first thing it does after launch is installing "who-knows-what".


Windows uninstallers also leave all the trash in %AppData%. There’s no generic way to clean all the folders that a program decided to create. Only some uninstallers ask if you want to delete settings and caches.

Given that, dragging a ready-to-run file (folder) to /Apps symlink is much more convenient than “setting up your system for preparation of initializing of downloading of the installation process starter manager, please wait and press next sometimes”.


That's definitely true for more complex apps, but the fact that you can have the executable and all its resources in one `.app` file is so much simpler and easier for the everyday user. (Yes, I know it's a folder that the OS treats as an application, but to a user it looks like one file.)

I go back and forth between Windows/Mac/Linux on the daily (right tool for the right job) and each has some strengths. App packaging is far and away one of Mac's current strengths.

I maintained Nativefier (a now defunct open source project that would package web sites as Electron apps) and the ease of packaging an app was Mac > Windows > Linux.


If the installer on Windows is properly done, you actually know exactly what it does to your system (including registry modifications). This includes the ability to remove the application completely.

Whereas on macOS, installation is trivial, but then the application sets up stuff upon first run, and that is really opaque, with no way of properly uninstalling the app unless there is a dedicated uninstaller.


There are plenty of inscrutable installers for macOS software. DRM-riddled bullshit and enterprise crapware are a disease.

But yeah, the simple case is quite nice.


The one annoying thing macOS apps do is pollute /Library. Even apps that don’t explicitly write to this area end up with dozens of permafiles. Tons of stuff is spewed in there when you install an application that actually uses it. It’s like a directory version of a registry kitchen sink.


Spare a thought for us Windows users - we went from our pristine and oddly beautiful home directories in Windows 7, where everything was neatly squared-away to either AppData\Roaming or AppData\Local - to our post-Electron, lazily-ported software world where my home directory now has no-less than twenty Unix-style dot-directories littering my %USERPROFILE%

Incidentally, the worst offender is Microsoft themselves: it all got worse with .nuget, .vs, .azcopy, .azdata, .azure, .azuredatastudio, .dotnet, etc. I just don't understand it.


We Linux users suffer it. Supposedly, nowadays applications should store their files under ~/.config, ~/.local and ~/.cache, but you still find a million applications that create their own folders without following any standards. But at least file browsers hide those folders by default...


I'd have thought you could easily enable some fs-jail that maps any-and-every request matching /~/..+/i wherever you want?


I had never heard of it. Maybe it is possible, but I am too lazy to try it...


Do you have a link to documentation for that?


It's in the XDG Base Directory Specification [0] maintained by freedesktop.org [1] (formerly X Desktop Group)

0: https://specifications.freedesktop.org/basedir-spec/basedir-...

1: https://en.wikipedia.org/wiki/Freedesktop.org


Pristine? You mean the same home directory that contains the 80 character NTUSER files? ;)


Or the back-compat symlinks for NetHood, Start, Recent, SendTo, ah yes. I had a post-install VBScript that cleaned those out.

My current sad-thing I’m unhappy about is how the “My Documents” folder ended up being a second AppData folder, with lots of software storing settings, templates, project files, etc in that dir instead of AppData.

Windows absolutely needs application-silos to protect users from lazy apps. I hate to say it, but Apple was 100% right to make iPhone OS a file-system-free OS - we can’t do that on desktop, but gosh-darn-it, why is software so terrible? :(


My solution is to create another folder like “~/Documents/Projects” (because I have no free-standing documents really) and use it as “my” dir. All other paths are known to apps and will be abused.


I do the same thing (on win, mac, and linux). Except I call it "proj" because I'm lazy. In fact, I split it between github and proj because the former is already backed up, but the latter is not.


A thing I wonder: why is using = padding required in the most common base64 variant?

It's redundant since this info can be fully inferred from the length of the stream.

Even for concatenations it is not necessary to require it, since you must still know the length of each sub stream (and = does not always appear so is not a separator).

There's no way that using the = instead of per-byte length-checking gains any speed, since to prevent reading out of bounds you must check the per-byte length anyway; you can't trust input to be a multiple of 4 in length.

It could only make sense if it's somehow required to read 4 bytes at once and you can't possibly read less, but what platform is like that?
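
A small demonstration that the padding really is recoverable from the length alone (Python; b64decode_unpadded is a hypothetical helper):

    import base64

    def b64decode_unpadded(s: str) -> bytes:
        # re-derive the padding purely from the length, then decode
        return base64.b64decode(s + "=" * (-len(s) % 4))

    print(b64decode_unpadded("aGk"))    # b'hi'
    print(b64decode_unpadded("aGV5"))   # b'hey'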


from Wikipedia:

  The padding character is not essential for decoding, since the number of missing bytes can be inferred from the length of the encoded text. In some implementations, the padding character is mandatory, while for others it is not used. An exception in which padding characters are required is when multiple Base64 encoded files have been concatenated.


IMO padding is not necessary and just a relic of old implementations.


I think so too. It feels similar to how many specifications from the 90s use big endian 4-byte integers for many things (like png, riff, jpeg, ...) despite little endian CPUs being most common since the 80s already, and those specifications seemingly assuming that you would want to decode those 4-byte values with fread without any bounds checking or endianness dependency.
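
(For what it's worth, reading such a big-endian field portably is a one-liner today; a sketch with Python's struct, using a PNG chunk header whose IHDR data is 13 bytes long:)

    import struct

    chunk_header = b"\x00\x00\x00\x0dIHDR"        # 4-byte big-endian length, 4-byte type
    length, ctype = struct.unpack(">I4s", chunk_header)
    print(length, ctype)                          # 13 b'IHDR'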


Without padding, how would you encode, for example, a message with just a single zero? To be more precise, how do you distinguish it from two zeroes and three zeroes?


Both for encoding and decoding the padding is not needed. Without ='s, you get a uniquely different base64 encoding for NULL, 2 NULLs and 3 NULLs.

This shows the binary, base64 without padding and base64 with padding:

NULL --> AA --> AA==

NULL NULL --> AAA --> AAA=

NULL NULL NULL --> AAAA --> AAAA

As you can see, all the padding does is make the base64 length a multiple of 4. You already get uniquely distinguishable symbols for the 3 cases (one, two or three NULL symbols) without the ='s, so they are unnecessary


Oh right. Problems only show up when you concatenate two messages, because a single null is AA, but two nulls are AAA, not AAAA.


The output padding is only relevant for decoding. For encoding, since the Base64 alphabet is 6 bits wide, zero bits are appended when the input bit count is not a multiple of 6 (e.g. encoding two bytes (16 bits) needs two more bits to reach a multiple of 6 (18)).

Refer to the "examples" section of the wikipedia page


Perhaps to simplify implementations that read multiple characters at a time?

But I think it's likely just poor design taste.


> Even for concatenations it is not necessary to require it, since you must still know the length of each sub stream

I'm not sure I understand this part. You can decode aGVsbG8=IHdvcmxk, what do you need to know?


The = does not appear if the base64 data is a multiple of 4 length. So you wouldn't know if aGVsbG8I is one or two streams. The = is not a separator, only padding to make the base64 stream a multiple of 4 length for some reason.

I only mentioned the concatenation because Wikipedia claims this use case requires padding while in reality it doesn't.


Base64 doesn't have a concept of "stream". Conceptually, a base64-encoded string with padding is a concatenation of fragments that are always 4 characters long but can encode one to three bytes. Concatenating two base64-encoded strings with padding therefore doesn't destroy the fragment structure, and the result can be decoded into a byte sequence that is the concatenation of the two original input sequences. Without padding, fragments can also be 2 or 3 characters, and short fragments are not distinguishable from long fragments, so concatenation destroys the fragment structure.
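
A toy illustration of that fragment structure (decode_padded_stream is a hypothetical helper; real decoders differ in how lenient they are about padding in the middle of the input):

    import base64

    def decode_padded_stream(s: bytes) -> bytes:
        # padding keeps every 4-character fragment self-contained, so a
        # concatenation of padded streams decodes fragment by fragment
        return b"".join(base64.b64decode(s[i:i + 4]) for i in range(0, len(s), 4))

    a = base64.b64encode(b"hi")            # b'aGk='
    b = base64.b64encode(b"there")         # b'dGhlcmU='
    print(decode_padded_stream(a + b))     # b'hithere'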


Oh I see, so it's for concatenating multiple base64 fragments of the same single piece of data? But where is this used? Never seen that. JavaScript's base64 decoder gives an error for ='s in the middle (but I just found out the Linux base64 -d command supports it!)


I actually don't know if it's an intention, but it is the only explanation that makes sense. It should be noted that the original PEM specification (RFC 989) did have a similar use case where alternating encrypted and unencrypted bytes can be intermixed by `*` characters, but you are still required to pad each portion to 4n bytes (e.g. `c2VjcmV0LCA=*cHVibGlj*IGFuZCBzZWNyZXQgYWdhaW4=`). It is still the closest to what I think padding characters are required for.


It would decode correctly but you wouldn't know the boundary, if that matters. I see, thanks.


You know, when I first got into binary encoding into text I asked myself this very question but never put any effort into looking it up.

Now, 25+ years later, I have some answers - thanks!


There is still one sort-of efficient way of embedding binary content in an HTML file. You must save the file as UTF-16. A JavaScript string from a UTF-16 HTML file can contain anything except these: \0, \r, \n, \\, ", and unpaired surrogates (0xD800-0xDFFF).

If you escape any disallowed character in the usual way for a string ("\0", "\r", "\n", "\\", "\"", "\uD800") then there is no decoding process; all the data in the string will be correct.

If you throw data that is compressed in there, you're unlikely to get very many zeroes, so you can just hope that there aren't too many unmatched surrogate pairs in your binary data, because those get inflated to 6 times their size.

Note that this operates on 16-bit values. In order to see a null, \r, \n, \\ and ", the most significant byte must also be zero, and in order for your data to contain a surrogate pair, you're looking at the two bytes taken together. When the data is compressed, the patterns are less likely.



Not listed was a clever encoding for MS-DOS files, XXBUG[1]. DOS had a rudimentary debugger and memory editor. (It even stuck around all the way to Windows XP but didn't survive the transition to 64-bit.) Because it had the ability to write to disk you could convert any file to hexadecimal bytes and sprinkle some control commands about to create a script for DEBUG.EXE. The text-encoded file could then be sent anywhere without needing to download a decoder program first.

[1] http://justsolve.archiveteam.org/wiki/XXBUG


but uuencode makes for fun: https://github.com/bnjf/compre.sh


Ascii85 survived by hiding in popular bloat.


I first met base64 in ASP.NET viewstate.


Base64 is very bizarre in general. Why did they use such a weird pattern of symbols instead of a contiguous section, or at least segments ordered from low->high (on that note, ASCII is also quite strange, I'm guessing due to some backwards compatibility idiocy that seemed like it made sense at some point (or maybe changing case was super important to a lot of workloads or something, making a compelling reason to fuck over the future in favor of optimisation now))?


The original specification is in RFC 989 [0] from 1987, called “Printable Encoding”, where it explains “The bits resulting from the encryption operation are encoded into characters which are universally representable at all sites, though not necessarily with the same bit patterns […] each group of 6 bits is used as an index into an array of 64 printable characters; the character referenced by the index is placed in the output string. These characters, identified in Table 1, are selected so as to be universally representable, and the set excludes characters with particular significance to SMTP (e.g., ".", "<CR>", "<LF>").”

Using the array-indexing method, the noncontiguity of the characters doesn’t matter, and the processing is also independent of the character encoding (e.g. works exactly the same way in EBCDIC).

[0] https://www.rfc-editor.org/rfc/rfc989.html#page-9
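
A sketch of that array-indexing approach (Python; encode_group is a hypothetical helper) - the ordering lives entirely in the lookup table, so contiguity in ASCII or EBCDIC never matters:

    B64_ALPHABET = (b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    b"abcdefghijklmnopqrstuvwxyz"
                    b"0123456789+/")

    def encode_group(b1: int, b2: int, b3: int) -> bytes:
        group = (b1 << 16) | (b2 << 8) | b3
        # each 6-bit value is only ever used as an index into the table
        return bytes(B64_ALPHABET[(group >> shift) & 0x3F]
                     for shift in (18, 12, 6, 0))

    print(encode_group(*b"Man"))    # b'TWFu'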


The comments point out conversion issues with EBCDIC. You can't use ASCII characters like @ which are not in EBCDIC.

https://datatracker.ietf.org/doc/html/rfc2045#section-6.8 says:

   This subset has the important property that it is represented
   identically in all versions of ISO 646, including US-ASCII, and all
   characters in the subset are also represented identically in all
   versions of EBCDIC. Other popular encodings, such as the encoding
   used by the uuencode utility, Macintosh binhex 4.0 [RFC-1741], and
   the base85 encoding specified as part of Level 2 PostScript, do not
   share these properties, and thus do not fulfill the portability
   requirements a binary transport encoding for mail must meet.
If you want to learn why ASCII is the way it is, try "The Evolution of Character Codes, 1874-1968" at https://archive.org/details/enf-ascii/mode/2up by Eric Fischer (an HN'er). My reading is contiguous A-Z was meant for better compatibility with 6-bit use.


I thought the ASCII upper-case <-> lower-case being a bit operation as being clever.


> I thought the ASCII upper-case <-> lower-case being a bit operation as being clever.

From "Things Every Hacker Once Knew" (2017), has an entire section on ASCII and the clever bit-fiddling that occurs:

* http://www.catb.org/~esr/faqs/things-every-hacker-once-knew/...

* Discussion from ~2 months ago: https://news.ycombinator.com/item?id=37701117


In the context of a terminal, the Control key is also a bitwise operation.

Shifted numerals were nearly a bitwise operation as well, but we didn't end up using that keyboard layout.
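
The bit tricks in question, sketched in Python (assuming plain ASCII input):

    print(chr(ord('a') & ~0x20))   # 'A' -- clearing bit 5 upper-cases a letter
    print(chr(ord('A') | 0x20))    # 'a' -- setting bit 5 lower-cases it
    print(ord('C') & 0x1F)         # 3   -- Ctrl-C: the Control key masks off the top bits (& 0x1F)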



Yep! There's even a term for it -- this is called a bit-paired keyboard.


Yes, though in principle you could interleave AaBbCc and so on, which would also be a single bit difference, and the naive collation would be more like what people expect.

The design considerations at https://ia800606.us.archive.org/17/items/enf-ascii-1972-1975... show that 6-bit support was more important than naive collation support:

> A6.4 It is expected that devices having the capability of printing only 64 graphic symbols will continue to be important. It may be desirable to arrange these devices to print one symbol for the bit pattern of both upper and lower case of a given alphabetic letter. To facilitate this, there should be a single-bit difference between the upper and lower case representations of any given letter. Combined with the requirement that a given case of the alphabet be contiguous, this dictated the assignment of the alphabet, as shown in columns 4 through 7.

I just found and skimmed Bob Bemer's "A Story of ASCII", which includes personal recollections of the history. It seems that the 6-bit subset was firmed up first. From https://archive.org/details/ascii-bemer/page/n17/mode/2up?q=... :

> This is reflected in the set I proposed to X3 on 1961 September 18 (Table 3, column 3), and these three characters remained in the set from that time on. The lower case alphabet was also shown, but for some time this was resisted, lest the communications people find a need for more than the two columns then allocated for control functions.

but serious discussion of lower case wasn't taken up until later. From https://archive.org/details/ascii-bemer/page/n25/mode/2up?q=... :

> ISO/TC97/SC2 held its next meeting in 1963 October, at which time it was decided to add the lower case alphabet.

and at https://archive.org/details/ascii-bemer/page/n27/mode/2up?q=... :

> At the 1963 May meeting in Geneva, CCITT endorsed the principle of the 7-bit code for any new telegraph alphabet, and expressed general but preliminary agreement with the ISO work. It further requested the placement of the lower case alphabet in the unassigned area.

Bemer did not like interleaving lower- and upper-case. From https://archive.org/details/ascii-bemer/page/n5/mode/2up?q=l... :

> I had a great opportunity to start on the standards road when invited by Dr. Werner Buchholz to do the main design of the 120-character set [9,24] for the Stretch computer (the IBM 7030). I had help, but the mistakes are all mine (such as the interspersal of the upper and lower case alphabets). ...

> he didn't make the same mistake I made for STRETCH by interspersing both cases of the alphabet!


Base64 and ASCII both made perfect sense in terms of their requirements, and the future, while not fully anticipated at the time, is doing just fine, with ASCII being now incorporated into largely future-proof UTF-8.

Considerably stranger in regard to contiguity was EBCDIC, but it too made sense in terms of its technological requirements, which centered around Hollerith punch cards. https://en.wikipedia.org/wiki/EBCDIC

There are numerous other examples where a lack of knowledge of the technological landscape of the past leads some people to project unwarranted assumptions of incompetence onto the engineers who lived under those constraints.

(Hmmm ... perhaps I should have read this person's profile before commenting.)


P.S. He absolutely did attack the competence of past engineers. And "questioning" backwards compatibility with ASCII is even worse ... there was no point in time when a conversion would not have been an impossible barrier.

And the performance claims are absurd, e.g.,

"A simple and extremely common int->hex string conversion takes twice as many instructions as it would if ASCII was optimized for computability."

WHICH conversion, uppercase hex or lowercase hex? You can't have both. And it's ridiculous to think that the character set encoding should have been optimized for either one or that it would have made a measurable net difference if it had been. And instruction counts don't determine speed on modern hardware. And if this were such a big deal, the conversion could be microcoded. But it's not--there's no critical path with significant amounts of binary to ASCII hex conversion.

"There are also inconsistencies like front and back braces/(angle)brackets/parens not being convertible like the alphabet is."

That is not a usable conversion. Anyone who has actually written parsers knows that the encodings of these characters are not relevant ... nothing would have been saved in parsing "loops". Notably, programming language parsers consume tokens produced by the lexer, and the lexer processes each punctuation character separately. Anything that could be gained by grouping punctuation encodings can be done via the lexer's mapping from ASCII to token values. (I have actually done this to reduce the size of bit masks that determine whether any member of a set of tokens has been encountered. I've even, in my weaker moments, hacked the encodings so that <>, {}, [], and () are paired--but this is pointless premature optimization.)

Again, this fellow's profile is accurate.


Show me a quote. Where did I attack the competence of past engineers? Quote it for me or please just stop lying. I never attacked anyone. I even (somewhat obliquely) referred to several reasons they may have had to make decisions that confound me. Are you mad that I think backwards compatibility is a poor decision? That's not an attack against any engineers, it's just a matter of opinion. Your weird passive-aggressive behavior is just baffling here.


Here is a quote: "that seemed like it made sense at some point (or maybe changing case was super important to a lot of workloads or something, making a compelling reason to fuck over the future in favor of optimisation now))?"

You used "that seemed like it made sense" when you could have written "that made sense." The additional "seemed like" implies the past engineers were unable to see something they should have.

You used "fuck over the future in favor of optimisation now" implying the engineers were overly short-sighted or used poor judgement when balancing the diverse needs of an interchange code.


Hindsight is 20/20. Something that seemed like a good decision at the time may have been a good decision for the time, but not necessarily a great decision half a century later. That has nothing to do with engineering competency, only fortune telling competency.

I get that people here don't like profanity, but I don't see any slight in describing engineering decisions like optimizing for common workloads today over hypothetical loads tomorrow as 'fucking over the future'. Slightly hyperbolic, sure, but it's one of the most common decisions made in designing systems, and commonly causes lots of issues down the line. I don't see where saying something is a mistake that looks obvious in retrospect is a slight. Most things look obvious in retrospect.


Again, "seemed like it made sense" expresses doubt, in the way that "it seems safe" expresses doubt that it actually is safe.

If you really meant your comment now, there was no reason to add "seemed like it" in your earlier text.

> I don't see any slight

You can see things however you want. The trick is to make others understand the difference between what you say and that utterances of an ignorant blowhard, "full of sound and fury, signifying nothing."

You don't seem to understand the historical context, your issues don't make sense, your improvements seem pointless at best, and you have very firm and hyperbolic viewpoints. That does not come across as 20/20 hindsight.


P.S I'm not the one lying here. Not only are there lies, strawmen, and all sorts of projection, but my substantive points are ignored.

"some backwards compatibility idiocy that seemed like it made sense at some point"

Is obviously an attack on their judgment.

"a compelling reason to fuck over the future in favor of optimisation now"

Talk about passive-aggressive! Of course the person who wrote this does not think that there was any such "compelling reason", which leaves us with the extremely hostile accusation.

And as I've noted, the arguments that these decisions were idiotic or effed over the future are simply incorrect.


I never questioned the competence of past engineers, I question the use of backwards compatibility.

Hardware has advanced, but software depends on standards and conventions formulated for far less capable hardware, and that's a problem.

The efficiency of string processing/generation is hugely important in terms of global energy consumption.

A simple and extremely common int->hex string conversion takes twice as many instructions as it would if ASCII was optimized for computability.

Bounds-checking for the English alphabet requires either an upfront normalization or twice the checking, so 50-100% more instructions for that.

There are also inconsistencies like front and back braces/(angle)brackets/parens not being convertible like the alphabet is.

[({< <-> >})] would have been just as or more useful than the alphabet being convertible and saved a few instructions in common parsing loops.


> takes twice as many instructions

What is your preferred system? How does it affect other needs, like collation, or testing if something is upper-case vs. lower-case, or ease of supporting case-insensitivity?

Have you measured the performance difference? https://johnnylee-sde.github.io/Fast-unsigned-integer-to-hex... shows a branchless UlongToHexString which is essentially as fast as a lookup table and faster than the "naive" implementation.

> Bounds-checking for the English alphabet

In the following it goes from 2 assembly instructions to three:

  int is_letter(char c) {
    c |= 0x20;  // normalize to lowercase
    return ('a' <= c) && (c <= 'z');
  }
Yes, that's 50% more assembly, to add a single bit-wise or, when testing a single character.

But, seriously, when is this useful? English words include an apostrophe, names like the English author Brontë use diacritics, and æ is still (rarely) used, like in the "Endowed Chair for Orthopædic Investigation" at https://orthop.washington.edu/research/ourlabs/collagen/peop... .

And when testing multiple characters at a time, there are clever optimizations like those used in UlongToHexString. SIMD within a register (SWAR) is quite powerful, eg, 8 characters could be or'ed at once in 64 bits, and of course the CPU can do a lot of work to pipeline things, so 50% more single-clock-tick instructions does not mean %50 more work.

> like front and back braces/(angle)brackets/parens not being convertible

I have never needed that operation. Why do you need it?

Usually when I find a "(" I know I need a ")", and if I also allow a "[" then I need an if-statement anyway since A(8) and A[8] are different things, and both paths implicitly know what to expect.

> and saved a few instructions in common parsing loops.

Parsing needs to know what specific character comes next, and they are very rarely limited to only those characters. The ones I've looked use a DFA, eg, via a switch statement or lookup table.

I can't figure out what advantage there is to that ordering, that is, I can't see why there would be any overall savings.

Especially in a language like C++ with > and >> and >>= and A<B<int>> and -> where only some of them are balanced.


Look at ASCII mapped out with four bits across and four bits down and the logic may suddenly snap into place. Also remember that it was implemented by mechanical printing terminals.


> I'm guessing due to some backwards compatibility idiocy that seemed like it made sense at some point ... > ... making a compelling reason to fuck over the future in favor of optimisation now

> I never questioned the competence of past engineers

False just based on your opening volley of toxic spew. Backwards compatibility is an engineering decision and it was made by very competent people to interoperate with a large number of systems. The future has never been fucked over.

You seem to not understand how ASCII is encoded. It is primarily based on bit-groups where the numeric ranges for character groupings can be easily determined using very simple (and fast) bit-wise operations. All of the basic C functions to test single-byte characters such as `isalpha()`, `isdigit()`, `islower()`, `isupper()`, etc. use this fact. You can then optimize these into grouped instructions and pipeline them. Pull up `man ascii` and pay attention to the hex encodings at the start of all the major symbol groups. This is still useful today!

No, the biggest fuckage of the internet age has been Unicode which absolutely destroys this mapping. We no longer have any semblance of a 1:1 translation between any set of input bytes and any other set of character attributes. And this is just required to get simple language idioms correct. The best you can do is use bit-groupings to determine encoding errors (ala UTF-8) or stick with a larger translation table that includes surrogates (UTF-16, UTF-32, etc). They will all suffer the same "performance" problem called the "real world".


What do you find strange about ASCII?



