Hacker News new | past | comments | ask | show | jobs | submit login

I have never seen this in Java! Is there any use cases where it could be useful?





Javac uses the platform encoding [0] by default to interpret Java source files. This means that Java source code files are inherently non-portable. When Java was first developed (and for a long time after), this was the default situation for any kind of plain text files. The escape sequence syntax allows to transform [1] Java source code into a portable (that is, ASCII-only) representation that is completely equivalent to the original, and also to convert it back to any platform encoding.

Source control clients could apply this automatically upon checkin/checkout, so that clients with different platform encodings can work together. Alternatively, IDEs could do this when saving/loading Java source files. That never quite caught on, and the general advice was to stick to ASCII, at least outside comments.

[0] Since JDK 18, the default encoding defaults to UTF-8. This probably also extends to javac, though I haven’t verified it.

[1] https://docs.oracle.com/javase/8/docs/technotes/tools/window...


I don't know about usefulness but it does let us write identifiers using Unicode characters. For example:

  public class Foo {
      public static void main(String[] args) {
          double \u03c0 = 3.14159265;
          System.out.println("\u03c0 = " + \u03c0);
      }
  }
Output:

  $ javac Foo.java && java Foo
  π = 3.14159265
Of course, nowadays we can simply write this with any decent editor:

  public class Foo {
      public static void main(String[] args) {
          double π = 3.14159265;
          System.out.println("π = " + π);
      }
  }
Support for Unicode escape sequences is a result of how the Java Language Specification (JLS) defines InputCharacter. Quoting from Section 3.4 of JLS <https://docs.oracle.com/javase/specs/jls/se23/jls23.pdf>:

  InputCharacter:
    UnicodeInputCharacter but not CR or LF
UnicodeInputCharacter is defined as the following in section 3.3:

  UnicodeInputCharacter:
    UnicodeEscape
    RawInputCharacter

  UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

  UnicodeMarker:
    u {u}

  HexDigit:
    (one of)
    0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

  RawInputCharacter:
    any Unicode character
As a result the lexical analyser honours Unicode escape sequences absolutely anywhere in the program text. For example, this is a valid Java program:

  public class Bar {
      public static void \u006d\u0061\u0069\u006e(String[] args) {
          System.out.println("hello, world");
      }
  }
Here is the output:

  $ javac Bar.java && java Bar
  hello, world
However, this is an incorrect Java program:

  public class Baz {
      // This comment contains \u6d.
      public static void main(String[] args) {
          System.out.println("hello, world");
      }
  }
Here is the error:

  $ javac Baz.java
  Baz.java:2: error: illegal unicode escape
      // This comment contains \u6d.
                                   ^
  1 error
Yes, this is an error even if the illegal Unicode escape sequence occurs in a comment!

I wonder if full unicode range was accepted because some companies are writing code in non-english.



Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: