Why does LINQ to XML not escape characters like '\x1A'?

I get an exception if an XElement's content includes characters such as '\x1A', '\x1B', '\x1C', '\x1D', '\x1E' or '\x1F'.

using System;
using System.Collections.Generic;
using System.Xml.Linq;

namespace LINQtoXMLInvalidChars
{
    class Program
    {
        private static readonly IReadOnlyCollection<char> InvalidCharactersInXml = new List<char>
        {
            '<',
            '>',
            '&',
            '\'',
            '\"',
            '\x1A',
            '\x1B',
            '\x1C',
            '\x1D',
            '\x1E',
            '\x1F'
        };

        static void Main()
        {
            foreach (var c in InvalidCharactersInXml)
            {
                // Construction never throws; LINQ to XML accepts any character here.
                var xEl = new XElement("tag", "Character: " + c);
                var xDoc = new XDocument(new XDeclaration("1.0", "utf-8", null), xEl);

                try
                {
                    Console.Write("Writing " + c + ": ");
                    Console.WriteLine(xDoc);
                }
                catch (Exception e)
                {
                    Console.WriteLine("Oops.    " + e.Message);
                }
            }

            Console.ReadKey();
        }
    }
}

In an answer from Jon Skeet to the question String escape into XML, I read:

You set the text in a node, and it will automatically escape anything it needs to.

So now I'm confused. Am I misunderstanding something?

Some background information: the string content of the XElement comes from the end user. I see two options for making my application robust: 1) Base64-encode the string before passing it to the XElement, or 2) narrow the accepted set of characters to, e.g., alphanumeric characters.
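
For illustration, here is a minimal sketch of the first option (the class and method names are mine, purely illustrative): Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=', all of which are safe in XML text.

using System;
using System.Text;
using System.Xml.Linq;

static class Base64ContentSketch
{
    // Encode arbitrary user text so the element content is always XML-safe.
    public static XElement Wrap(string userText) =>
        new XElement("tag", Convert.ToBase64String(Encoding.UTF8.GetBytes(userText)));

    // Recover the original text when reading the document back.
    public static string Unwrap(XElement element) =>
        Encoding.UTF8.GetString(Convert.FromBase64String(element.Value));
}

The obvious trade-off is that the stored documents are no longer human-readable.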

Jon Skeet

Most of those characters simply aren't valid in XML 1.0 at all. Personally, I wish LINQ to XML would refuse to produce a document that it wouldn't later be able to parse, but basically you should avoid those characters.

I would also recommend avoiding \x as an escape sequence anyway, preferring \u - the fact that \x will take "up to" 4 hex digits can be very confusing.
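
To illustrate that pitfall (this example is mine, not part of the original answer):

using System;

string surprising = "\x1Bad";  // a single character, U+1BAD: \x greedily consumes up to four hex digits
string intended = "\u001Bad";  // three characters: U+001B, 'a', 'd' (\u always takes exactly four digits)

Console.WriteLine(surprising.Length); // 1
Console.WriteLine(intended.Length);   // 3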

From the XML 1.0 spec:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Now U+000D and U+000A are interesting cases - they won't be escaped in text nodes; they'll just be included verbatim. Whether or not that's then present when you parse the node will depend on parse settings (and whether there are non-whitespace characters around it).
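
A quick demonstration of that round-trip subtlety (my sketch, relying on the default save and parse settings):

using System;
using System.Xml.Linq;

var element = new XElement("tag", "line1\r\nline2");
string xml = element.ToString();  // the CR/LF pair is written verbatim, not escaped

// XML parsers normalize CRLF to LF on input, so the CR does not survive the round trip:
var parsed = XElement.Parse(xml);
Console.WriteLine(parsed.Value.Contains("\r")); // False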

In terms of how to handle this in your case, you have a few options:

  • Performing your own encoding/escaping. This is generally somewhat painful, and will lead to XML documents which are hard to read compared with regular ones. You could potentially do this only when required, adding an attribute to the element to say that you've done it, for example.
  • Detect and remove characters which are invalid in XML
  • Detect and reject strings containing characters which are invalid in XML

We can't really tell which of these is most appropriate in your scenario.
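
As a concrete sketch of the last two options (the code below is illustrative, not from the original answer): the framework's XmlConvert.IsXmlChar method, available since .NET 4.0, tests a char against exactly the Char production quoted above. Note that it returns false for surrogate halves; to accept supplementary characters you would pair it with XmlConvert.IsXmlSurrogatePair.

using System;
using System.Linq;
using System.Xml;

static class XmlTextGuard
{
    // Detect and remove: silently drop characters that are invalid in XML 1.0.
    public static string RemoveInvalidXmlChars(string input) =>
        new string(input.Where(XmlConvert.IsXmlChar).ToArray());

    // Detect and reject: refuse the input outright if it contains an invalid character.
    public static void ThrowIfInvalidXmlChars(string input)
    {
        if (!input.All(XmlConvert.IsXmlChar))
            throw new ArgumentException("Input contains characters that are not valid in XML 1.0.", nameof(input));
    }
}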


See more on this question at Stack Overflow