Why are Encoding.UTF8.GetString and Encoding.UTF8.GetBytes not inverses of each other?

I am probably missing something, but I do not understand why Encoding.UTF8.GetString and Encoding.UTF8.GetBytes do not work as inverse transformations of each other.

In the following example, myOriginalBytes and asBytes are not equal; even their lengths differ. Could anyone explain what I am missing?

byte[] myOriginalBytes = GetRandomByteArray();
var asString = Encoding.UTF8.GetString(myOriginalBytes);
var asBytes = Encoding.UTF8.GetBytes(asString);
Jon Skeet

They're inverses if you start with a valid UTF-8 byte sequence, but they're not if you just start with an arbitrary byte sequence.

Let's take a concrete and very simple example: a single byte, 0xff. That's not a valid UTF-8 encoding of any text. So if you have:

byte[] bytes = { 0xff };
string text = Encoding.UTF8.GetString(bytes);

... you'll end up with text being a single character, U+FFFD, the "Unicode replacement character" which is used to indicate that there was an error decoding the binary data. You'll end up with that replacement character for any invalid sequence - so you'd get the same text if you started with 0x80 for example. Clearly if multiple binary inputs are decoded to the same textual output, it can't possibly be a fully-reversible transform.
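To make the collision concrete, here is a minimal sketch, assuming .NET 6 or later so that top-level statements compile. The variable names are purely illustrative. It decodes two different invalid inputs, shows that both become U+FFFD, and then shows why the re-encoded array has a different length from the original. The last part uses a UTF8Encoding constructed with throwOnInvalidBytes set to true, which is one way to fail loudly instead of silently substituting.

```csharp
using System;
using System.Text;

// Two different invalid inputs: neither 0xFF nor a lone 0x80 is valid UTF-8.
byte[] first = { 0xFF };
byte[] second = { 0x80 };

string firstText = Encoding.UTF8.GetString(first);
string secondText = Encoding.UTF8.GetString(second);

// Both decode to the same single replacement character, U+FFFD.
Console.WriteLine(firstText == "\uFFFD");   // True
Console.WriteLine(firstText == secondText); // True

// Re-encoding U+FFFD produces its own three-byte encoding,
// not the one byte we started with, so the lengths differ.
byte[] roundTripped = Encoding.UTF8.GetBytes(firstText);
Console.WriteLine(BitConverter.ToString(roundTripped)); // EF-BF-BD

// A strict encoding throws on invalid input instead of substituting U+FFFD.
var strictUtf8 = new UTF8Encoding(
    encoderShouldEmitUTF8Identifier: false,
    throwOnInvalidBytes: true);

bool rejected = false;
try
{
    strictUtf8.GetString(first);
}
catch (DecoderFallbackException)
{
    rejected = true;
}
Console.WriteLine(rejected); // True
```
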

If you have arbitrary binary data, you should not use Encoding to get text from it - you should use Convert.ToBase64String or maybe hex. Encoding is for data that is naturally textual.
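As a rough illustration of that advice (again assuming .NET 6+ top-level statements, with made-up sample bytes), both Base64 and hex give a lossless round trip for arbitrary bytes. The hex conversion here is written by hand with BitConverter and LINQ; newer runtimes also offer built-in helpers, but this sketch does not rely on them.

```csharp
using System;
using System.Linq;

// Arbitrary bytes, deliberately including values that are not valid UTF-8.
byte[] original = { 0xFF, 0x80, 0x00, 0x41, 0xC3 };

// Base64: every byte sequence maps to a distinct string and back.
string base64 = Convert.ToBase64String(original);
byte[] fromBase64 = Convert.FromBase64String(base64);
Console.WriteLine(original.SequenceEqual(fromBase64)); // True

// Hex: two characters per byte.
string hex = BitConverter.ToString(original).Replace("-", "");
byte[] fromHex = Enumerable.Range(0, hex.Length / 2)
    .Select(i => Convert.ToByte(hex.Substring(i * 2, 2), 16))
    .ToArray();
Console.WriteLine(hex);                             // FF800041C3
Console.WriteLine(original.SequenceEqual(fromHex)); // True
```
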

If you go in the opposite direction, like this:

string text = GetRandomText();
byte[] bytes = Encoding.UTF8.GetBytes(text);
string text2 = Encoding.UTF8.GetString(bytes);

... I'd expect text2 to be equal to text with the exception of odd situations where you've got invalid text to start with, e.g. with "half" a surrogate pair.
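A small sketch of both cases, under the same .NET 6+ assumption and with illustrative sample strings: well-formed text survives the round trip, while a string holding only a high surrogate (half of a pair) does not, because the encoder substitutes the bytes for U+FFFD.

```csharp
using System;
using System.Text;

// Well-formed text, including a character outside the BMP (a full surrogate pair).
string valid = "h\u00E9llo \U0001F600";
string validRoundTrip = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(valid));
Console.WriteLine(valid == validRoundTrip); // True

// Invalid text: a high surrogate with no low surrogate following it.
string lonely = ((char)0xD800).ToString();
byte[] lonelyBytes = Encoding.UTF8.GetBytes(lonely);
string lonelyRoundTrip = Encoding.UTF8.GetString(lonelyBytes);

// The encoder emitted the bytes for U+FFFD, so the original char is lost.
Console.WriteLine(BitConverter.ToString(lonelyBytes)); // EF-BF-BD
Console.WriteLine(lonely == lonelyRoundTrip);          // False
```
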

See more on this question at Stack Overflow