My name is Jon Skeet

How to get UTF-8 codepoints of a C# string?

I have a German string in C#:

string s = "Menü";

I would like to get UTF-8 codepoints:

expected result:

\x4D\x65\x6E\xC3\xBC

The expected result has been verified via an online UTF-8 encoder/decoder and via Unicode code converter v8.1.

I tried a lot of conversion methods but I cannot get the expected result.

UPDATE:

Funny, the problem was not in the source code but in the wrong encoding in the input file :-) These answers helped me a lot.

There's no such thing as "UTF-8 codepoints" - there are UTF-8 code units, or Unicode code points.

In the string Menü, there are 4 code points:

U+004D
U+0065
U+006E
U+00FC

For BMP characters (i.e. those in the range U+0000 to U+FFFF) it's as simple as iterating over the char values in a string. For non-BMP characters it's slightly trickier, because each one is stored as a surrogate pair of two char values. StringInfo looks helpful here, but it iterates over text elements, which group combining characters with their base character, so it doesn't yield individual code points. It's not terribly hard to spot surrogate pairs in a string, but I don't think there's a very simple way of iterating over all the code points in a string.
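As a rough sketch of the surrogate-pair approach described above, something like the following should work. The class and method names (CodePointDemo, CodePoints) are made up for illustration; only char.ConvertToUtf32 and char.IsHighSurrogate are real framework calls. The ü is written as \u00FC so that the encoding of the source file doesn't matter.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointDemo
{
    // Hypothetical helper: yields each Unicode code point in the string.
    public static IEnumerable<int> CodePoints(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            // Returns the char value itself for BMP characters, or the combined
            // code point for a valid surrogate pair starting at index i.
            // Throws ArgumentException if the surrogate at i is unpaired.
            int codePoint = char.ConvertToUtf32(text, i);

            // A surrogate pair occupies two char values, so skip the low surrogate.
            if (char.IsHighSurrogate(text, i))
            {
                i++;
            }

            yield return codePoint;
        }
    }

    public static void Main()
    {
        foreach (int codePoint in CodePoints("Men\u00FC"))
        {
            Console.WriteLine($"U+{codePoint:X4}");
        }
    }
}
```

For "Menü" this prints the four code points listed above, one per line. Newer versions of .NET may offer something more direct, but the loop above only relies on long-standing char methods.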

Finding the UTF-8 code units (i.e. the UTF-8-encoded representation of a string as bytes) is simple:

byte[] bytes = Encoding.UTF8.GetBytes(text);

That will give you the five bytes you listed in your question: 0x4d, 0x65, 0x6e, 0xc3, 0xbc.
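If you want those bytes in the \xNN form shown in the question, a minimal sketch could look like this. Utf8EscapeDemo and ToHexEscapes are hypothetical names chosen for the example; the only framework pieces used are Encoding.UTF8.GetBytes, LINQ's Select, and string.Concat.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8EscapeDemo
{
    // Hypothetical helper: encodes the text as UTF-8 and formats each byte as \xNN.
    public static string ToHexEscapes(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);

        // X2 formats each byte as two uppercase hex digits.
        return string.Concat(bytes.Select(b => $"\\x{b:X2}"));
    }

    public static void Main()
    {
        // \u00FC is ü; the escape avoids depending on the source file's encoding.
        Console.WriteLine(ToHexEscapes("Men\u00FC")); // prints \x4D\x65\x6E\xC3\xBC
    }
}
```

If the output differs from what you expect, check how the input text was decoded before it became a string; GetBytes can only encode the characters it was given.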
