Why aren't ◎ܫ◎ and ☺ valid Javascript variable names? (stackoverflow.com)
86 points by mambodog 2013 days ago | 20 comments

Answer: Javascript allows variables to begin with unicode letters, but these characters are classified as symbols, not letters.

Internet Explorer 9 doesn't seem to follow the above rule, and permits unicode symbols at the beginning of variable names.

The ECMAscript standard is actually a bit more prohibitive: you can only use Unicode letters or (non-space) combining symbols in variable names.

IE has relaxed this a little bit, by allowing variables to contain (but not start with) symbols.
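A rough sketch of the rule described above, assuming the ES5-era categories (letters and letter numbers to start a name; combining marks, digits, and connector punctuation allowed after that). The function name `looksLikeIdentifier` is made up for illustration, the regexes need an engine with Unicode property escapes, and reserved words and `\uXXXX` escapes are ignored:

```javascript
// Approximation of the ES5 identifier rules via Unicode general categories.
// Assumption: an engine that supports \p{...} with the "u" flag (ES2018+).
const ID_START = /^[$_\p{L}\p{Nl}]$/u;
const ID_PART = /^[$_\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\u200C\u200D]$/u;

// Hypothetical helper: does not check reserved words or escape sequences.
function looksLikeIdentifier(name) {
  const chars = Array.from(name); // split by code point, not UTF-16 unit
  if (chars.length === 0) return false;
  return ID_START.test(chars[0]) && chars.slice(1).every((c) => ID_PART.test(c));
}

console.log(looksLikeIdentifier('π'));     // true  (Greek letter)
console.log(looksLikeIdentifier('zähler')); // true  (ä is a letter)
console.log(looksLikeIdentifier('☺'));     // false (classified as a symbol)
console.log(looksLikeIdentifier('◎ܫ◎'));   // false (◎ is a symbol, although ܫ is a letter)
```

This only mirrors the spec text; it says nothing about what a particular browser actually accepts.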

Serious question: Do people around the world really use non-English code to develop websites? If <div> and <span> have to be in English, CSS styles are in English, and libraries like jQuery have functions in English, are there developers somewhere who absolutely need the ability to write JS in Kannada or French? If not, why is Unicode even allowed as a variable name? It seems like a potential vulnerability waiting to be exploited.

An example from Perl: you can use unicode characters in the source code with

use utf8;

This pragma tells perl to parse the source as UTF-8. It allows us to use unicode characters in strings like this:

my $exampletext = 'Hallöchen Unicode! Ça va bien?';

And we can use unicode characters in identifiers:

my $γ = 2.5;

my $π = 3.14159;

my $θ = $γ / $π;

(This is not possible for subroutine and package names)

Using greek characters as identifiers could be useful for some people.

> my $θ = $γ / $π;

> (This is not possible for subroutine and package names)

> Using greek characters as identifiers could be useful for some people.

That makes perfect sense and is a very good use-case for unicode variable names. Thanks!

English letters seem to be most common in variable names, but dashes and underscores are often used as word delimiters. Numbers have their place, too.

I see other keyboard symbols used in a lot of variable names — especially $. And, CSS uses . and # as prefixes.

I once saw a JavaScript library whose main object was named ƒ. Some symbols and accented characters are surprisingly easy to type, especially on international keyboards.

If a character is printable and doesn't conflict with symbols used in the language, how would you decide if it should be allowed?

> If a character is printable and doesn't conflict with symbols used in the language, how would you decide if it should be allowed?

I don't know what should be allowed or not, hence this question. I've only ever programmed in English. I have written tons of localization/Unicode software but the code itself was English. I can imagine there being lots of non-web programming languages in any number of natural languages but my question pertained to the web. If the code that runs the client end of a website must be in English, why allow it to accept non-ASCII characters in variable names? I guess what you're saying is, why not?

>I guess what you're saying is, why not?

Yes, I will even try to give an argument for "why yes".

I see no compelling reason why the code for a website, which might consist of HTML, JavaScript, and CSS, plus server-side technologies like PHP, should generally use only words from the English language. For example, it doesn't matter at all whether you use English words to name CSS classes. class="content" looks as nice to the browser as class="inhalt", which is the German word for "content". Some people who are not native speakers of English will prefer variable names that they can understand more easily. Allowing characters beyond ASCII could be a step toward making the world easier for these developers. After all, the web was made for people who want to share content, and it should be made as easy as possible.

If JavaScript source code is delivered to a browser, the only thing that matters is that the script works correctly. The details of the variable names used in the script are only of interest to people who read and maintain the source code. Of course, in an (international) open-source project, code written in a language that is unknown or exotic to the other developers would exclude many of them from participating in the project or reusing the code in their own projects. But for other websites or projects, a developer might like to pick whatever language makes his life easier.

Maybe a variable called "counter" will be understood by more developers around the world than a variable called "zaehler". But those who understand "zaehler" might even prefer to write it as "zähler"; perhaps they even have the character "ä" on their keyboard.

Large parts of programs appear to be written in English, since the programming language itself is using many words from English, not only on the web. For example, we may have "while" and "for" loops, "if" conditions and things like "unless". This could give the impression that there is only one human language used in programming (for the web or elsewhere).

If the support of Unicode characters later turns out to be exploitable for malicious purposes, ... well, this may happen with any technology we use for creating the web. The only way to absolute security, it seems to me, is to use and do nothing at all. Then there is no web - boring.

We were joking about using the Euro sign (€) as a shortcut for a JavaScript library we built (in a European Union-funded research project), but were disappointed to notice that $ was the only currency symbol special-cased to be accepted.
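One way to check this kind of thing is to ask the engine itself rather than read the spec. A minimal sketch, assuming an environment where `new Function` is allowed (no restrictive CSP); `parsesAsVariableName` is a made-up helper, and it only parses the declaration without running it:

```javascript
// Hypothetical probe: let the engine's own parser decide whether
// "var <name>;" is syntactically valid. Nothing is executed.
// Caveat: it accepts anything that fits after "var", e.g. "a, b".
function parsesAsVariableName(name) {
  try {
    new Function(`var ${name};`);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else (e.g. a CSP error) is not a parse result
  }
}

console.log(parsesAsVariableName('$')); // true
console.log(parsesAsVariableName('€')); // false
console.log(parsesAsVariableName('ƒ')); // true
```

The results reflect whichever engine runs the probe, so a browser that deviates from the standard would report its own behaviour here.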

Kind of. I've seen plenty of code with Portuguese words as variable and function names, but they tend to drop the non-ASCII parts like accents (`´^) and cedillas (¸).

There are also ports of some languages (I think I've seen Pascal and PHP) with the actual keywords translated to other languages.

How could it be exploited?

Don't know why you were down-voted, it is a perfectly good question. Personally I don't know but then again I'm not samy.pl. I know there have been unicode vulnerabilities in IIS in the past but what I'd be more concerned with would be XSS. While most sites sanitize data to prevent XSS, maybe someone could come up with a dirty input that doesn't get filtered and runs an exploit.

For this specific issue to be relevant, your sanitizer would have to assume that a particular string is not a valid javascript identifier when in fact it is, and that changes the semantic meaning from something that is safe to embed directly into something that is not.

Unless you're trying to parse Javascript and see whether it's safe to embed, it's not really relevant. XSS attacks are dealt with by escaping the special characters that can change the context in which something is interpreted (e.g. " changing the context from "contents of a string" to "javascript code", or < changing the context from "text on a page" to "an html tag").
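To make the escaping point concrete, here is a minimal sketch for the "text on a page" context only. `escapeHtml` and `HTML_ESCAPES` are illustrative names, not from any particular library, and other contexts (attribute values, inline scripts, URLs) need their own escaping rules:

```javascript
// Characters that can change the parsing context inside HTML text.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Hypothetical helper: escapes only the context-changing characters
// and leaves every other character, Unicode or not, untouched.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtml('<script>var ☺ = "x";</script>'));
// &lt;script&gt;var ☺ = &quot;x&quot;;&lt;/script&gt;
```

Note that the ☺ passes through unchanged: whether it is a valid identifier never enters into it, because the markup around it has already been neutralised.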

Javascript will let it go through (as it is a valid identifier), and on the server PHP, for example, will not catch it.

A lot of Chinese websites have code with Chinese variable names, etc.

For a bit more in-depth understanding have a look at some of the problems as noted in Google's closure library (goog.i18n.uChar).

Any unicode character can be used as a JavaScript identifier, for example:

     var 按 = function () {
         return '\uD869\uDED6';
     };

works with no issues.

A (quite large) subset of Unicode characters, to be specific. Of course some "letters" are hardly recognizable as letters for many people (for example, Vai syllabary) but they are used as letters in at least one script.

There are a few in the higher planes that may give you problems (for example Phoenician, etc.). It is also important that the browser does not do a replacement. You need a font like code2000 or its latest version.
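For what it's worth, the '\uD869\uDED6' string from the example above is one of those higher-plane characters: two escapes that form a single code point. A small sketch, assuming an engine with `String.prototype.codePointAt` (ES2015+); `describeString` is a made-up helper:

```javascript
// Hypothetical helper: compare UTF-16 code units with actual code points.
function describeString(s) {
  return {
    utf16Units: s.length,               // what .length counts
    codePoints: Array.from(s).length,   // iteration is per code point
    first: 'U+' + s.codePointAt(0).toString(16).toUpperCase(),
  };
}

console.log(describeString('\uD869\uDED6'));
// { utf16Units: 2, codePoints: 1, first: 'U+2A6D6' }
console.log(describeString('按'));
// { utf16Units: 1, codePoints: 1, first: 'U+6309' }
```

Whether the character then shows up as a glyph or as a replacement box is a font question, separate from how the string is stored.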

because unicode is tedious, and wtf needs to use shit like ◎ܫ◎ as a variable name?

Damn you IE, always being different!

    (╯`□′)╯ ┴—┴
