My name
is
Jon Skeet

Unicode escape behavior in Java programs

A few days ago, I was asked about this program's output:

public static void main(String[] args) {
    // \u0022 is the Unicode escape for double quote (")
    System.out.println("a\u0022.length() + \u0022b".length());
}

My first thought was that this program should print the length of a\u0022.length() + \u0022b, which is 16, but surprisingly it printed 2. I know \u0022 is the Unicode escape for ", but I assumed this " would be escaped and represent only a literal " character, with no special meaning. In reality, Java parsed the statement as follows:

System.out.println("a".length() + "b".length());

I can't wrap my head around this weird behavior. Why don't Unicode escapes behave like normal escape sequences?

Update: Apparently, this was one of the brain teasers in the book Java Puzzlers: Traps, Pitfalls, and Corner Cases by Joshua Bloch and Neal Gafter. More specifically, the question relates to Puzzle 14: Escape Rout.

Why don't Unicode escapes behave like normal escape sequences?

Basically, they're processed at a different point in reading the input - in lexing rather than parsing, if I've got my terminology right. They're not escape sequences in character literals or string literals, they're escape sequences for the whole source file. That's why the \u0022 in your example ends the string literal: by the time the compiler looks for string literals, it has already become a plain ". Any character that's not part of a Unicode escape sequence can be replaced with its Unicode escape sequence. So you can write programs entirely in ASCII which actually have variable, method and class names that are non-ASCII...
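To make that concrete, here's a small sketch (the class and method names are mine, purely for illustration). It shows the puzzle expression, an identifier written partly as an escape, and what happens when the backslash itself is escaped so that no Unicode escape is recognized:

```java
class UnicodeEscapeDemo {
    // Each \u0022 below becomes a real double quote before the compiler
    // looks for string literals, so this is "a".length() + "b".length().
    static int puzzleLength() {
        return "a\u0022.length() + \u0022b".length();
    }

    // \u0061 is the letter 'a', so this declares a local variable named abc.
    static int escapedIdentifier() {
        int \u0061bc = 42;
        return abc;
    }

    // A doubled backslash is an ordinary string escape for one backslash,
    // so the literal holds eight characters: a, backslash, u, 0, 0, 2, 2, b.
    static int literalBackslashLength() {
        return "a\\u0022b".length();
    }

    public static void main(String[] args) {
        System.out.println(puzzleLength());
        System.out.println(escapedIdentifier());
        System.out.println(literalBackslashLength());
    }
}
```

Running main should print 2, 42 and 8. If you actually want a double quote inside a string literal, use the ordinary escape sequence \" rather than the Unicode escape.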

Fundamentally I believe this was a design mistake in Java, as it can cause some very weird effects (e.g. if you have the escape sequence for a line break within a // comment...) but it is what it is...
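Here's a sketch of the comment effect (the class name is hypothetical). The line that looks like a single // comment contains \u000a, the escape for a line feed. That escape is translated before comments are recognized, so the comment ends there and the rest of the line is compiled as code:

```java
class CommentEscapeDemo {
    /*
     * The "comment" inside this method contains the Unicode escape for a
     * line feed. The escape ends the comment early, so the increment that
     * follows it on the same line is a real statement.
     */
    static int run() {
        int count = 0;
        // This looks like a harmless comment, but \u000a count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

This prints 1 rather than 0, because count++ is executed.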

This is detailed in section 3.3 of the JLS:

A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.

...

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.

This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
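As a rough illustration of the two quoted passages, here's a simplified sketch of the translation step and of the Unicode-to-ASCII transformation. The names EscapeSketch, startsEscape, translate and toAscii are my own, and this is not how any real compiler is implemented. It relies on the JLS rule that a backslash only starts a Unicode escape when it is preceded by an even number of contiguous backslashes, and it ignores corner cases such as a non-ASCII character directly after an odd run of backslashes.

```java
class EscapeSketch {
    // True if the backslash at index i can start a Unicode escape: it must be
    // followed by 'u' and preceded by an even number of contiguous backslashes.
    static boolean startsEscape(String s, int i) {
        if (s.charAt(i) != '\\' || i + 1 >= s.length() || s.charAt(i + 1) != 'u') {
            return false;
        }
        int preceding = 0;
        for (int k = i - 1; k >= 0 && s.charAt(k) == '\\'; k--) {
            preceding++;
        }
        return preceding % 2 == 0;
    }

    // Translation step: replace each escape with the UTF-16 code unit it names.
    static String translate(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            if (!startsEscape(src, i)) {
                out.append(src.charAt(i++));
                continue;
            }
            int j = i + 1;
            while (j < src.length() && src.charAt(j) == 'u') {
                j++; // one or more u's are allowed
            }
            if (j + 4 > src.length()) {
                throw new IllegalArgumentException("Malformed escape at index " + i);
            }
            int value = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = Character.digit(src.charAt(k), 16);
                if (digit < 0) {
                    throw new IllegalArgumentException("Malformed escape at index " + i);
                }
                value = value * 16 + digit;
            }
            out.append((char) value);
            i = j + 4;
        }
        return out.toString();
    }

    // Unicode-to-ASCII transformation: existing escapes gain an extra u,
    // and non-ASCII code units become escapes with a single u.
    static String toAscii(String src) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (startsEscape(src, i)) {
                out.append("\\u"); // the original u is copied on the next iteration
            } else if (c > 0x7f) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "String caf\u00e9 = \"\\u0041\";";
        String ascii = toAscii(source);
        System.out.println(ascii);
        System.out.println(translate(ascii).equals(translate(source)));
    }
}
```

The second line printed should be true: translating the ASCII form gives the same character sequence as translating the original, which is the sense in which both versions represent the same program. Restoring the original Unicode source (dropping one u from multi-u escapes) is left out of the sketch.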

See more on this question at Stack Overflow