My name
is
Jon Skeet

Java: Different BigDecimal values converted to UTF-8 Strings have the same value

As part of having fun with Avro, I discovered the following:

new String(new BigDecimal("1.28").unscaledValue().toByteArray(), Charset.forName("UTF-8"))
.equals(
new String(new BigDecimal("1.29").unscaledValue().toByteArray(), Charset.forName("UTF-8")))
-> true !!!!!!!!


DatatypeConverter.printBase64Binary(new BigDecimal("1.28").unscaledValue().toByteArray())
.equals(
DatatypeConverter.printBase64Binary(new BigDecimal("1.29").unscaledValue().toByteArray()))
-> false (as expected)

but

new String(new BigDecimal("1.26").unscaledValue().toByteArray(), Charset.forName("UTF-8"))
.equals(
new String(new BigDecimal("1.27").unscaledValue().toByteArray(), Charset.forName("UTF-8")))
-> false (as expected)

Can someone explain to me what is going on? It seems like 1.27 is the cutoff. Ideally, I need

new String(new BigDecimal("1.28").unscaledValue().toByteArray(), Charset.forName("UTF-8"))

to work for every BigDecimal value.

Can someone explain to me what is going on?

Yes, you're misusing your data. The result of unscaledValue().toByteArray() (that is, BigInteger.toByteArray()) is the two's-complement binary form of a number, not a UTF-8-encoded representation of a string, so you shouldn't try to convert it to a string that way.

Different byte arrays may be "decoded" via UTF-8 to the same string if they contain invalid sequences, because each invalid sequence is replaced by the same replacement character. If you look at the result of new BigDecimal("1.28").unscaledValue().toByteArray() and likewise for 1.29, you'll find that neither is valid UTF-8, so both decode to strings containing the Unicode replacement character U+FFFD (often displayed as "?"). However, if you're doing this at all then you're doing it wrong.

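The two byte arrays in question are { 0x00, 0x80 } and { 0x00, 0x81 }. In each, the first byte is decoded to U+0000. The second byte (0x80 or 0x81) has the form 10xxxxxx, which in UTF-8 is a continuation byte that is only valid after a lead byte. There is no lead byte here, so the decoder replaces it with U+FFFD (often displayed as "?"). So both strings are "\0\uFFFD". By contrast, 1.26 and 1.27 give the single bytes 0x7E and 0x7F, which are valid single-byte UTF-8 (ASCII) and decode to different characters, which is why 1.27 looks like a cutoff.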
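To check the byte values and the decoding behaviour described above, here is a minimal sketch, assuming Java 8 or later. The class name Utf8DecodeDemo and its helper methods are illustrative only and are not part of the original question.

```java
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named for illustration only.
class Utf8DecodeDemo {
    // Raw two's-complement bytes of the unscaled value, e.g. "1.28" -> 128.
    static byte[] unscaledBytes(String decimal) {
        return new BigDecimal(decimal).unscaledValue().toByteArray();
    }

    // Decodes those bytes as if they were UTF-8 text (which they are not).
    static String decodeAsUtf8(String decimal) {
        return new String(unscaledBytes(decimal), StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated hex for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1.26", "1.27", "1.28", "1.29"}) {
            System.out.println(s + " -> " + hex(unscaledBytes(s)));
        }
        // 0x7E and 0x7F are valid single-byte UTF-8, so the strings differ.
        System.out.println(decodeAsUtf8("1.26").equals(decodeAsUtf8("1.27"))); // false
        // 0x80 and 0x81 are both malformed here, so both become U+FFFD.
        System.out.println(decodeAsUtf8("1.28").equals(decodeAsUtf8("1.29"))); // true
        System.out.println("\0\uFFFD".equals(decodeAsUtf8("1.28")));           // true
    }
}
```

The String(byte[], Charset) constructor silently replaces malformed input, which is why no exception is thrown and the information in the second byte is simply lost.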

If you want to convert a BigDecimal to a string, just call toString(). If you want to represent arbitrary binary data as a string, use base64 or hex, or some similar encoding scheme designed to represent arbitrary binary data as strings. UTF-8 is designed to represent arbitrary text data as binary data.
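As a rough illustration of those two options, the sketch below uses toString() for the text form and java.util.Base64 (Java 8 or later) for the binary form. The class BigDecimalTextDemo and its methods are hypothetical names. It also assumes the scale is stored alongside the Base64 text, since the unscaled bytes alone do not carry it.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Base64;

// Hypothetical demo class, named for illustration only.
class BigDecimalTextDemo {
    // Text form: toString() round-trips through the BigDecimal(String) constructor.
    static String toText(BigDecimal value) {
        return value.toString();
    }

    // Binary form as text: Base64 of the unscaled bytes.
    // Assumption: the caller keeps the scale separately.
    static String toBase64(BigDecimal value) {
        return Base64.getEncoder().encodeToString(value.unscaledValue().toByteArray());
    }

    // Rebuilds the value from the Base64 text plus the separately stored scale.
    static BigDecimal fromBase64(String encoded, int scale) {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("1.28");
        BigDecimal b = new BigDecimal("1.29");
        System.out.println(toText(a));   // 1.28
        System.out.println(toBase64(a)); // AIA=
        System.out.println(toBase64(b)); // AIE=
        // Unlike the UTF-8 decoding attempt, this round trip is lossless.
        System.out.println(fromBase64(toBase64(a), a.scale()).equals(a)); // true
    }
}
```

Either form keeps 1.28 and 1.29 distinct for every value, because both encodings are defined for arbitrary input rather than only for valid UTF-8 byte sequences.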

See more on this question at Stack Overflow