My name is Jon Skeet

C# remove special characters from string

I have the following string, which represents an XML document:

string xmlStr7 = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Response xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">\r\n  <Market>en-US</Market>\r\n  <AnswerSet ID=\"0\">\r\n    <Answers>\r\n      <Answer ID=\"0\">\r\n        <Choices>\r\n          <Choice ID=\"2\" />\r\n          <Choice ID=\"8\" />\r\n        </Choices>\r\n      </Answer>\r\n      <Answer ID=\"1\">\r\n        <Choices>\r\n          <Choice ID=\"1\" />\r\n          <Choice ID=\"4\" />\r\n        </Choices>\r\n      </Answer>\r\n      <Answer ID=\"2\">\r\n        <Choices>\r\n          <Choice ID=\"1\" />\r\n          <Choice ID=\"7\" />\r\n        </Choices>\r\n      </Answer>\r\n      <Answer ID=\"3\">\r\n        <Choices>\r\n          <Choice ID=\"4\" />\r\n        </Choices>\r\n      </Answer>\r\n    </Answers>\r\n  </AnswerSet>\r\n</Response>";

I want to parse it into an XDocument object, and in order to do so I must get rid of all the newlines and unnecessary spaces (otherwise I get a parsing error). I removed the special characters manually and saw that parsing works when I use the following string:

string xmlStr2 = "<?xml version=\"1.0\" encoding=\"utf-8\"?><Response xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><Market>en-US</Market><AnswerSet ID=\"0\"><Answers><Answer ID=\"0\"><Choices><Choice ID=\"2\" /><Choice ID=\"8\" /></Choices></Answer><Answer ID=\"1\"><Choices><Choice ID=\"1\" /><Choice ID=\"4\" /></Choices></Answer><Answer ID=\"2\"><Choices><Choice ID=\"1\" /><Choice ID=\"7\" /></Choices></Answer><Answer ID=\"3\"><Choices><Choice ID=\"4\" /></Choices></Answer></Answers></AnswerSet></Response>";

I use the following code to achieve this programmatically:

public static string replaceSubString(string st)
{
    string pattern = ">\\s+<";
    string replacement = "><";
    Regex rgx = new Regex(pattern);
    string result = rgx.Replace(st, replacement);
    return result;
}

By calling this method I expect to get a string that I will be able to parse to an XDocument object:

string newStr = replaceSubString(xmlStr7);
XDocument xmlDoc7 = XDocument.Parse(newStr);

However, this does not work. In addition, there seems to be a difference between this string and xmlStr2, from which I removed all the special characters manually: comparing the two shows they are not equal, and newStr is one character longer than xmlStr2. I can't see the difference by printing both strings; they look identical. Could anyone help?

Your string starts with a byte order mark (U+FEFF).
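
One way to confirm an invisible leading character like this is to print the first few UTF-16 code units in hex rather than printing the string itself. The following is only a sketch: DescribePrefix is a hypothetical helper name, and the sample string is a stand-in for the data in the question.

```csharp
using System;
using System.Linq;

// Hypothetical helper: renders the first `count` UTF-16 code units as U+XXXX.
static string DescribePrefix(string text, int count) =>
    string.Join(" ", text.Take(count).Select(c => $"U+{(int)c:X4}"));

// Stand-in for the question's data: a BOM followed by the XML declaration.
string withBom = "\ufeff<?xml version=\"1.0\"?><Response />";

// The BOM does not show up when the string is printed, but it does here.
Console.WriteLine(DescribePrefix(withBom, 3));   // U+FEFF U+003C U+003F
```

The extra code unit also accounts for a string that is one character longer than an apparently identical one.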

Ideally, you shouldn't get that into your string to start with, but if you do have it, you should just strip it:

string text = ...;
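// Use an ordinal comparison: a culture-sensitive StartsWith can treat U+FEFF
// as ignorable and report a match for any string.
if (text.StartsWith("\ufeff", StringComparison.Ordinal))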
{
    text = text.Substring(1);
}
XDocument doc = XDocument.Parse(text);

Interestingly, XDocument.Load(Stream) can handle a BOM at the start of the data, but XDocument.Load(TextReader) can't. Presumably the expectation is that a reader will strip the BOM when it reads it anyway.
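
The difference can be sketched as follows. The XML content is a shortened stand-in for the question's document, and the variable names are only illustrative. The sketch loads the same BOM-prefixed data once from a Stream, once through a StreamReader (which consumes the BOM while decoding), and once via XDocument.Parse on a string that still contains U+FEFF.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Shortened stand-in for the question's XML, with a leading U+FEFF.
string withBom = "\ufeff<Response><Market>en-US</Market></Response>";

// Encoding U+FEFF as UTF-8 produces the familiar EF BB BF byte sequence.
byte[] bytes = Encoding.UTF8.GetBytes(withBom);

// 1. Loading from a Stream: the XML reader detects and skips the BOM itself.
XDocument fromStream = XDocument.Load(new MemoryStream(bytes));
Console.WriteLine(fromStream.Root?.Name);

// 2. Loading via a StreamReader: the reader strips the BOM while decoding,
//    so the XML parser never sees it.
using var reader = new StreamReader(new MemoryStream(bytes));
XDocument fromReader = XDocument.Load(reader);
Console.WriteLine(fromReader.Root?.Name);

// 3. Parsing the string that still starts with U+FEFF: expected to throw,
//    as described in the question.
bool parseFailed = false;
try
{
    XDocument.Parse(withBom);
}
catch (XmlException)
{
    parseFailed = true;
}
Console.WriteLine(parseFailed);
```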

It's not clear where your data is coming from, but if you have it in a binary format somewhere (e.g. as a byte[] or a Stream) then I suggest you just load that instead of converting it to a string and then parsing the string. That will remove this problem and save you from the possibility of applying the wrong encoding.
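
As a rough illustration of that suggestion, the sketch below loads XML straight from a byte[] and lets the XML parser work out the encoding from the BOM and the XML declaration. LoadXml is a hypothetical helper name, and the UTF-16 sample data is made up for the example; it is not the data from the question.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: loads XML from raw bytes without decoding to a string first.
static XDocument LoadXml(byte[] data)
{
    using var stream = new MemoryStream(data);
    return XDocument.Load(stream);
}

// Made-up sample: UTF-16 (little-endian) data with a BOM and a non-ASCII character.
string xml = "\ufeff<?xml version=\"1.0\" encoding=\"utf-16\"?><Market>café</Market>";
byte[] utf16Bytes = Encoding.Unicode.GetBytes(xml);

// The parser detects UTF-16 on its own; no encoding is chosen by the caller.
XDocument doc = LoadXml(utf16Bytes);
Console.WriteLine(doc.Root?.Value);
```

Decoding the same bytes with a hard-coded Encoding.UTF8 before parsing would garble the text, which is the kind of mistake loading from the binary form avoids.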

See more on this question at Stack Overflow