The slides-interspersed-with-text format could be seen as a more heavily illustrated, easy-reading sort of paper: more similar to the talk than a traditional accompanying paper would be, but not just the slides. I still think it's non-trivial work to make a good version, though, which is why people often just throw up the slides; that's almost literally no additional work.
You basically have to take the same source material and think how you would write it up into a blog post, which is not 100% the same as thinking of how to give it as a talk. Though I guess a first-cut solution could be to just record the talk and type up a transcript in between the slides.
They've become part of the language, and it would be culturally insensitive to give them a big "fuck you, us westerners don't need your damn icons; either keep those to yourselves and hence lose the ability to interoperate with everyone else, or stop using them even though they've been a big part of your culture for 10+ years".
Unicode is supposed to be the central hub for all graphemes, but I would certainly like to see some argument that emoji are actually graphemes. For one thing it's quite bizarre that Unicode suddenly appears to be specifying colors as well as shapes, which is one bright line I'd be concerned about. (There may have been colored characters before, but they would have been much more exceptional, and I'm not aware of any.) Unicode was ambitious as the all-grapheme repository; it's simply guaranteed to fail if it tries to become the repository of every vaguely iconic, smiley-ish little picture in the world.
Unfortunately, entrenched business and political interests have made encoding flags and emoticons a higher priority.
Encoding Miis in Unicode, brilliant! Nintendo should really try to get a spot on the committee.
Every character is assigned a Unicode code point. The Unicode Consortium defines a grapheme as a "user-perceived character", usually made up of one Unicode code point, but sometimes two or more. A base character can be followed by one or more non-spacing marks, together forming a "grapheme"; the most common of these combinations have a "canonical mapping" to a single precomposed code point, but they need not.
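To make that concrete, here's a minimal Python sketch (the language choice is just for illustration) of a grapheme built from a base character plus a combining mark, and its canonical mapping to a single precomposed code point:

    import unicodedata

    # 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one grapheme
    decomposed = "e\u0301"
    composed = unicodedata.normalize("NFC", decomposed)  # apply the canonical mapping

    print(len(decomposed), len(composed))  # 2 1
    print(composed == "\u00e9")            # True: the precomposed form is U+00E9

Not every base-plus-mark combination has such a precomposed form; those simply stay as multiple code points.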
Why should a brief overview go back even to the pre-ISO-8859 days, except to mention ASCII? None of those older encodings are directly relevant: the world we're dealing with now on the Web begins with ASCII, moves through a Pre-Unicode Period, and finishes up in the Land of Unicode, where it's at least possible to do things Right. All history tells a narrative; when it comes to character encodings, that one is a good default unless you really think your audience cares about why FORTRAN was spelled that way back in the Before Time.
Tom Jennings has an interesting history:
I know nothing about the implementation of early web browsers/gopher/etc., but I doubt there was ever anything on the web that used ASCII. 7-bit email may have been around at the time, but I would guess Tim Berners-Lee just used whatever character set his system used by default (corrections welcome; being snarky isn't the only reason I write this).
All headers, HTTP, email, or otherwise, are 99% or more ASCII. HTML markup is over 99% ASCII for most documents, especially the complex ones.
ASCII is the only text encoding you can guarantee everything on the Web (and the Internet in general, really) knows how to speak. Finally, guess what the UTF-8 encoding of every code point in the range U+0000 to U+007F inclusive is byte-for-byte identical to: ASCII.
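A quick illustration of that last point (a Python sketch, nothing more): the UTF-8 encoding of pure-ASCII text is exactly the same bytes as its ASCII encoding.

    header = "GET /index.html HTTP/1.1"
    print(header.encode("utf-8") == header.encode("ascii"))  # True
    print(header.encode("utf-8"))                            # b'GET /index.html HTTP/1.1'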
No, that's not what I meant. I meant that all of the essential bits are ASCII, all of the software that generates those important pieces has to know ASCII, and it's entirely possible for software that speaks only ASCII to handle it, as long as the filenames (the main source of non-ASCII characters) being served are also ASCII.
Read the HTTP specification sometime.
I've never really understood the advantage of one over the other, but this section on Wikipedia http://en.wikipedia.org/wiki/Endianness#Optimization helps explain that there are optimizations that can be made at the hardware level when performing arithmetic on little-endian values.
Is the argument for using LE in Unicode similar?
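For what it's worth, the layout difference itself is easy to see by packing the same integer both ways (a Python sketch, not anything Unicode-specific): little-endian puts the least significant byte at the lowest address, which is what the hardware-arithmetic argument in that Wikipedia section relies on.

    import struct

    value = 0x12345678
    print(struct.pack("<I", value).hex())  # '78563412'  little-endian: low byte first
    print(struct.pack(">I", value).hex())  # '12345678'  big-endian: high byte first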
ASCII was actually invented in the US in the 1960s, as a
standardised way of encoding text on a computer. ASCII
defined 128 characters, that's a character for half of the
256 possible bytes in an 8-bit computer system.
Someone who doesn't know the difference between utf-16, utf-8, etc. might not know which to use.
(Yes, you can use heuristics to guess which endianness is in use. The problem is that while this is trivial for Western languages I don't even know how you'd begin when presented with arbitrary text using an East Asian or African written language.)
Not actually that hard.
Consider a document which is encoded in either a) ASCII as you know it, or b) ASCII where the top 4 bits and bottom 4 bits of each byte are transposed. How would you tell the difference? Well, one can imagine building a histogram of the nibble values for each half of the bytes and comparing them to what you'd expect from naturally occurring English text. The half with most of its entries at 0x5, 0x6, and 0x7 is the high-order half.
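A rough Python sketch of that histogram idea (the exact nibble set to score on is my assumption about what "expectations for English text" would look like):

    from collections import Counter

    def high_nibble_half(data: bytes) -> str:
        """Guess whether the nibbles of each byte are in normal or transposed order."""
        hi = Counter(b >> 4 for b in data)      # histogram of the top 4 bits of each byte
        lo = Counter(b & 0x0F for b in data)    # histogram of the bottom 4 bits
        letter_nibbles = {0x4, 0x5, 0x6, 0x7}   # high nibbles of ASCII letters
        hi_score = sum(hi[n] for n in letter_nibbles)
        lo_score = sum(lo[n] for n in letter_nibbles)
        return "normal ASCII" if hi_score >= lo_score else "nibbles look transposed"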
If you don't know what naturally occurring e.g. Japanese looks like in Unicode code points, take this on faith: flipping the byte order does not give you a document which looks even plausibly correct. (Also, crucially, Japanese with the order flipped doesn't resemble any sensible document in any language -- you end up with Unicode code points drawn from a mishmash of unrelated blocks.)
P.S. Why care about that algorithm? Here's a hypothetical: you're a forensic investigator or system administrator who, given a hard drive which has been damaged, needs to extract as much information as possible from it. The BOM may well not be in the same undamaged sector as the bytes you are reading right now, and it may be impossible to stitch the sectors back together without first reading the text. How would you determine a) whether an arbitrary stream of bytes was likely a textual document, b) what the encoding was, c) what endianness it was, if appropriate, and d) what human language it was written in?
1. Try both byte orders
2. If one produces valid text and the other does not, choose that one (this will get you the correct answer almost every time, even if the source text is Chinese)
3. If both happen to produce valid text, use the one with the smallest number of scripts
(Note that this just determines byte order, while Patrick was talking about the more ambitious task of heuristically determining whether a random string of bytes is text and if so what encoding it is. My point is just that you really don't need to be told the order of the bytes in most cases.)
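Here's a rough Python sketch of those three steps; counting distinct 256-code-point blocks is my crude stand-in for "number of scripts":

    def guess_utf16_byte_order(data: bytes) -> str:
        """Try both byte orders and keep the one that decodes cleanly and compactly."""
        scores = {}
        for order in ("utf-16-be", "utf-16-le"):
            try:
                text = data.decode(order)           # step 1: try this byte order
            except UnicodeDecodeError:
                continue                            # step 2: not valid text, discard it
            blocks = {ord(ch) >> 8 for ch in text}  # step 3: how many blocks does it touch?
            scores[order] = len(blocks)
        if not scores:
            return "not valid UTF-16 in either order"
        return min(scores, key=scores.get)

For ordinary prose in one language, the wrong order usually either fails to decode (unpaired surrogates) or scatters code points across many unrelated blocks, which is exactly the "mishmash" described above.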
Try saving a text file in Windows XP Notepad with the words "Bush hid the facts" and nothing else. Close it and open the file again. WTF Chinese characters! Conspiracy!
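That bug is easy to reproduce outside Notepad; this Python one-liner just reinterprets the same ASCII bytes as UTF-16LE, which is what Notepad's detection heuristic (IsTextUnicode) famously decided the file was:

    data = b"Bush hid the facts"     # 18 ASCII bytes, conveniently an even number
    print(data.decode("utf-16-le"))  # nine CJK-looking ideographs instead of English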
The UTF-16 encoding scheme may or may not begin with a BOM.
However, when there is no BOM, and in the absence of a
higher-level protocol, the byte order of the UTF-16
encoding scheme is big-endian.
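In Python terms (just an illustration of the rule quoted above): the byte-order-specific codecs emit no BOM, the generic one prepends a BOM in the platform's native order, and a reader that sees no BOM is supposed to fall back to big-endian.

    print("hi".encode("utf-16-be"))  # b'\x00h\x00i'  (no BOM, big-endian)
    print("hi".encode("utf-16-le"))  # b'h\x00i\x00'  (no BOM, little-endian)
    print("hi".encode("utf-16"))     # BOM first, then native order, e.g. b'\xff\xfeh\x00i\x00'

    def sniff_utf16(data: bytes) -> str:
        """Honor a BOM if present; otherwise assume big-endian, per the rule above."""
        if data.startswith(b"\xfe\xff"):
            return "utf-16-be"
        if data.startswith(b"\xff\xfe"):
            return "utf-16-le"
        return "utf-16-be"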
Another thing I noticed in the article: the encodings issue was historically even more complicated. Unicode previously covered only what is now the Basic Multilingual Plane (BMP), which meant every code point could be encoded as a single 2-byte value, a.k.a. UCS-2. When more code points were added, exceeding UCS-2's range, it mutated into the variable-length UTF-16 encoding. Had Unicode been introduced into Windows and Java later, who knows whether they'd have ended up using UTF-16.
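A small Python illustration of that history: a BMP character still fits in one 16-bit unit, while anything above U+FFFF needs a surrogate pair, which is exactly the variable-length part that UCS-2 lacked.

    bmp = "\u00e9"          # U+00E9, inside the Basic Multilingual Plane
    astral = "\U0001F600"   # U+1F600, outside the BMP

    print(bmp.encode("utf-16-be").hex())     # '00e9'      -> one 16-bit code unit
    print(astral.encode("utf-16-be").hex())  # 'd83dde00'  -> surrogate pair D83D DE00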