My name
is
Jon Skeet

Invalid Hexadecimal Characters in XML

I have an XML with invalid hexadecimal characters. I've read this, this and this and any other links given but failed to make it work.

I'm using XmlReader - XmlDocument, XDocument and XmlTextReader are not my options, because there are XML files with more than 500GB size and 500 million in volume. XMLReader is my best choice because of its "forward" approach, and not loading into the memory all of the XML details. Also, because of this, I can't have the XML file recreated or loaded just to replace the invalid characters.

Here's the code that I'm working on:

case XmlNodeType.Element:
if (xmlReader.Name.Equals("ROW"))
{
    DataRow dataRow = xmlDataTable.NewRow();
    XmlReader row = XmlReader.Create(xmlReader.ReadSubtree(), new XmlReaderSettings { CheckCharacters = false
                                                                            , ValidationType = ValidationType.None });

    // iterate on elements inside ROW
    // these are the column items
    if (row != null)
    {
        while (row.Read())
        {

            if (row.IsStartElement())
            {

                if (!row.Name.Equals("ROW"))
                {

                    string columnName = row.Name;
                    //row = XmlReader.Create(CleanInvalidXmlChars(row.ReadInnerXml()));

                    row.Read();
                    string value = CleanInvalidXmlChars(row.Value.ToString());

                    // all other logics ...

The exception raises on the row.Read(); statement. Here's a sample XML file I'm reading:

<?xml version="1.0" encoding="UTF-8"?>
<MFAINSBRP>
<ROW>
    <INSTITUTION_CODE>828  </INSTITUTION_CODE>
    <BRANCH_CODE>GJ102</BRANCH_CODE>
    <BRANCH_NAME>                                   </BRANCH_NAME>
    <BRANCH_NAME_FRENCH>                                   </BRANCH_NAME_FRENCH>
    <LANGUAGE_CODE>E</LANGUAGE_CODE>
    <ADDR_NO>815412</ADDR_NO>
    <FAX_AREA>0</FAX_AREA>
    <FAX_PHONE>0</FAX_PHONE>
    <AREA_CODE>0</AREA_CODE>
    <PHONE_NO>0</PHONE_NO>
    <STATUS>A</STATUS>
    <PHONE_EXT>0</PHONE_EXT>
</ROW>
<!--ALL OTHER RECORDS-->
</MFAINSBRP>

Right now, I'm stuck on making this work.

EDIT:

The sample XML file is the record that makes my code break. I copied at pasted it here from Notepad++ but it doesn't show the invalid characters. Here's the image of how it looks in Notepad++:

How I create the xmlReader object is just this simple statement:

using (xmlReader = XmlReader.Create(filePath, new XmlReaderSettings { CheckCharacters = false }))

It's unclear to me why CheckCharacters = false isn't fixing the problem for you, and as I've mentioned the far, far better fix is to get the data in a clean fashion to start with.

However, you can work around this by replacing each invalid character with a replacement in the TextReader that the XmlReader uses. Here's a short but complete example:

using System;
using System.IO;
using System.Xml;

class Test
{
    static void Main()
    {
        var text = "<foo>\0</foo>";
        var reader = XmlReader.Create(
             new XmlReplacingReader(new StringReader(text), ' '));
        while (reader.Read())
        {
            Console.WriteLine(reader.NodeType);
        }
    }
}

public sealed class XmlReplacingReader : TextReader
{
    private readonly TextReader original;
    private readonly char replacementChar;

    public XmlReplacingReader(TextReader original, char replacementChar)
    {
        this.original = original;
        this.replacementChar = replacementChar;
    }

    override public int Peek()
    {
        int ret = original.Peek();
        return MaybeReplace(ret);
    }

    override public int Read()
    {
        int ret = original.Read();
        return MaybeReplace(ret);        
    }

    override public int Read(char[] buffer, int index, int count)
    {
        int ret = original.Read(buffer, index, count);
        for (int i = 0; i < ret; i++)
        {
            buffer[i + index] = MaybeReplace(buffer[i + index]);
        }
        return ret;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            original.Dispose();
        }
    }

    public override void Close()
    {
        original.Close();
    }

    private int MaybeReplace(int x)
    {
        return x < 0 ? x : MaybeReplace((char) x);
    }

    private char MaybeReplace(char c)
    {
        return (c >= ' ' || c == '\r' || c == '\n' || c == '\t') ? c : replacementChar;
    }
}

This relies on you being able to create a TextReader for the file, of course - which you can do with File.OpenText if you know the encoding. If you need to handle other encodings, you may need a more cunning solution, but this should get you started.

Note that this approach replaces the invalid characters. If you want to remove them instead, it becomes harder and probably less efficient, as the bulk Read method would need to find out whether or not it needs to remove characters, do the removal, and then return a different value. The code would be a lot trickier - I'm hoping you don't need it.

See more on this question at Stackoverflow