My name is Jon Skeet

What is the proper way to get a char's code point?

I need to do some processing on code points, with special handling for a carriage return. I have a function that takes a char's code point, and if it is \r it needs to behave differently. I've got this:

if (codePoint == Character.codePointAt(new char[] {'\r'}, 0)) {

but that is very ugly and certainly not the right way to do it. What is the correct method of doing this?

(I know that I could hardcode the number 13 (the decimal value of \r) and use that, but doing so would make it unclear what I am doing...)

If you know that all your input is going to be in the Basic Multilingual Plane (U+0000 to U+FFFF) then you can just use:

char character = 'x';
int codePoint = character;

That uses the implicit conversion from char to int, as specified in JLS 5.1.2:

19 specific conversions on primitive types are called the widening primitive conversions:

...

char to int, long, float, or double

...

A widening conversion of a char to an integral type T zero-extends the representation of the char value to fill the wider format.
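Applied to the check in the question, the same widening conversion means a char literal can be compared with an int code point directly. A minimal sketch, assuming a hypothetical class CarriageReturnCheck and helper isCarriageReturn (neither name comes from the question):

```java
class CarriageReturnCheck {
    // Hypothetical helper: the char literal '\r' is widened to int (value 13)
    // before the comparison, so Character.codePointAt is not needed here.
    static boolean isCarriageReturn(int codePoint) {
        return codePoint == '\r';
    }

    public static void main(String[] args) {
        System.out.println(isCarriageReturn(13));                    // true
        System.out.println(isCarriageReturn('\n'));                  // false
        System.out.println(isCarriageReturn("\r\n".codePointAt(0))); // true
    }
}
```

This keeps the intent visible in the source ('\r' rather than 13) without building a temporary char array.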

However, a char is only a UTF-16 code unit. The point of Character.codePointAt is that it copes with code points outside the BMP, each of which is represented by a surrogate pair: two UTF-16 code units that together encode a single code point.

From JLS 3.1:

The Unicode standard was originally designed as a fixed-width 16-bit character encoding. It has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, using the hexadecimal U+n notation. Characters whose code points are greater than U+FFFF are called supplementary characters. To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16. In this encoding, supplementary characters are represented as pairs of 16-bit code units, the first from the high-surrogates range, (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.
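To make the surrogate ranges concrete, here is a small sketch that splits a code point into its UTF-16 code units. The class SurrogatePairDemo and the helper unitValues are hypothetical names chosen for illustration, and U+1F600 is just one arbitrary supplementary character:

```java
class SurrogatePairDemo {
    // Hypothetical helper: returns the UTF-16 code units of a code point,
    // each widened from char to int.
    static int[] unitValues(int codePoint) {
        char[] units = Character.toChars(codePoint);
        int[] values = new int[units.length];
        for (int i = 0; i < units.length; i++) {
            values[i] = units[i]; // implicit char-to-int widening
        }
        return values;
    }

    public static void main(String[] args) {
        // A BMP character is a single code unit equal to its code point.
        System.out.println(unitValues('x').length); // 1

        // A supplementary character needs a high and a low surrogate.
        int[] pair = unitValues(0x1F600);
        System.out.println(Integer.toHexString(pair[0])); // d83d
        System.out.println(Integer.toHexString(pair[1])); // de00

        // Character.codePointAt joins the pair back into one code point.
        char[] units = Character.toChars(0x1F600);
        System.out.println(Integer.toHexString(Character.codePointAt(units, 0))); // 1f600
    }
}
```

Converting only the first char of such a pair to int would give 0xD83D, which is a surrogate value and not the code point of the character.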

If you need to be able to cope with that more complicated situation, you'll need the more complicated code.
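One possible shape for that more complicated code, sketched under the assumption that the input arrives as a String: walk it by code point and not by char, so that a surrogate pair is treated as one unit. The class CodePointWalker and the helper countCarriageReturns are hypothetical names used only for this example:

```java
class CodePointWalker {
    // Hypothetical helper: counts carriage returns while stepping through
    // the text one code point at a time.
    static int countCarriageReturns(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint == '\r') {
                count++;
            }
            // Advance by 1 for a BMP character, 2 for a surrogate pair.
            i += Character.charCount(codePoint);
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it occupies two chars in the String.
        String text = "a\r\n" + new String(Character.toChars(0x1F600)) + "\r";
        System.out.println(countCarriageReturns(text)); // 2
    }
}
```

Since \r itself is in the BMP, the comparison against the char literal still works; only the stepping logic has to be aware of surrogate pairs.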

See more on this question at Stack Overflow