I have a German string in C#
string s = "Menü";
I would like to get UTF-8 codepoints:
expected result:
\x4D\x65\x6E\xC3\xBC
The expected result has been verified with an online UTF-8 encoder/decoder and with Unicode code converter v8.1.
I tried a lot of conversion methods but I cannot get the expected result.
UPDATE:
Funny, the problem was not in the source code but in the wrong encoding in the input file :-) These answers helped me a lot.
There's no such thing as "UTF-8 codepoints" - there are UTF-8 code units, or Unicode code points.
In the string Menü, there are 4 code points: U+004D (M), U+0065 (e), U+006E (n) and U+00FC (ü).
For BMP characters (i.e. those in the range U+0000 to U+FFFF) it's as simple as iterating over the char values in a string. For non-BMP characters it's slightly trickier. StringInfo looks helpful here, but it iterates over text elements, which include combining characters, rather than over individual code points. It's not terribly hard to spot surrogate pairs in a string, but I don't think there's a very simple way of iterating over all the code points in a string.
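As a rough illustration of spotting surrogate pairs by hand, here is a minimal sketch. The class and method names (CodePoints, Enumerate) are made up for this example. It assumes that a lone, unpaired surrogate should simply be returned as its raw char value, which may not be what you want for malformed input.

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Hypothetical helper: yields each Unicode code point in text.
    // A valid surrogate pair is combined into one code point; any other
    // char (including an unpaired surrogate) is yielded as-is.
    public static IEnumerable<int> Enumerate(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                yield return char.ConvertToUtf32(text, i);
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                yield return text[i];
            }
        }
    }
}
```

Calling CodePoints.Enumerate("Men\u00FC") yields 0x4D, 0x65, 0x6E and 0xFC. Writing the ü as the escape \u00FC keeps the example independent of the source file's encoding. On .NET Core 3.0 and later, string.EnumerateRunes() covers similar ground, though it handles unpaired surrogates differently.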
Finding the UTF-8 code units, i.e. the UTF-8-encoded representation of a string as bytes, is simple:
byte[] bytes = Encoding.UTF8.GetBytes(text);
That will give you the five bytes you listed in your question: 0x4d, 0x65, 0x6e, 0xc3, 0xbc.
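If you want those bytes rendered in the \xNN form shown in the question, something along these lines should work. This is only a sketch. The names Utf8Escape and ToHexEscapes are invented for the example, and the uppercase two-digit hex format is assumed from the expected output in the question.

```csharp
using System;
using System.Text;

public static class Utf8Escape
{
    // Hypothetical helper: encodes text as UTF-8 and renders every byte
    // as \xNN with two uppercase hex digits.
    public static string ToHexEscapes(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        var sb = new StringBuilder(bytes.Length * 4);
        foreach (byte b in bytes)
        {
            sb.Append("\\x").Append(b.ToString("X2"));
        }
        return sb.ToString();
    }
}
```

Utf8Escape.ToHexEscapes("Men\u00FC") returns \x4D\x65\x6E\xC3\xBC. If you get something else for text read from a file, check which encoding the file was decoded with before suspecting this step.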