Load a large text file into a string

I'm looking to load a 150 MB text file into a string. The file is UTF-16 encoded, so it will produce a string that's about 150 MB in memory. All the methods I have tried result in an OutOfMemoryException.

I know this is a huge string, and certainly not how I'd like to do things. But there's really not much I can do about that at the moment without a lot of really deep changes to an application about to head out the door. The file does not have an evenly distributed set of lines in it. One line can contain 80% or so of the entire file size.

Here's what I've tried:

Method 1

// Both of these throw Out of Memory exception
var s = File.ReadAllText(path);
var s = File.ReadAllText(path, Encoding.Unicode);

Method 2

var sb = new StringBuilder();

// I've also tried a few other iterations on this with other types of streams
using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
  string line;
  while ((line = sr.ReadLine()) != null)
  {
    sb.AppendLine(line);
  }
}

// This throws an exception
sb.ToString();

Method 3

using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (StreamReader sr = new StreamReader(fs, Encoding.Unicode))
{
  int initialSize = (int)(fs.Length / 2);  // Comes to a value of 73285158 with my test file
  var sb = new StringBuilder(initialSize); // This throws an exception

  string line;
  while ((line = sr.ReadLine()) != null)
  {
    sb.AppendLine(line);
  }

  sb.ToString();
}

So, what can I do to load this file into a string variable?

Edit: Added additional attempts to resolve issue based on comments.

Jon Skeet

Both of your attempts so far treat the file as if it were UTF-8. At best, that's going to take twice as much memory, and the data is very likely to be invalid as UTF-8 anyway. You should try specifying the encoding:

var text = File.ReadAllText(path, Encoding.Unicode);

If that doesn't work, you could try a variant of your second snippet, but passing the encoding to the StreamReader (and probably dropping the BufferedStream, as I don't think it'll help you here), and also giving the StringBuilder an initial capacity equal to half the size of the file in bytes.
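A minimal sketch of that variant, assuming the file really is little-endian UTF-16 (Encoding.Unicode). The names LargeTextLoader and LoadUtf16 are made up for illustration, and reading through a fixed-size char buffer instead of ReadLine is a substitution on my part, so that a single very long line is not first built up as a separate string:

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper; the names are illustrative, not from the original post.
static class LargeTextLoader
{
    public static string LoadUtf16(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var sr = new StreamReader(fs, Encoding.Unicode))
        {
            // UTF-16 stores each char in two bytes, so half the file
            // length in bytes is a good estimate of the char count.
            int capacity = (int)Math.Min(fs.Length / 2, int.MaxValue);
            var sb = new StringBuilder(capacity);

            // Copy the decoded text across in fixed-size chunks.
            var buffer = new char[64 * 1024];
            int read;
            while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
            {
                sb.Append(buffer, 0, read);
            }
            return sb.ToString();
        }
    }
}
```

Calling LargeTextLoader.LoadUtf16(path) returns the whole text. Note that this still needs large contiguous allocations, both for the builder and for the final string.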

EDIT: If this line is throwing an exception:

var sb = new StringBuilder(initialSize);

... then you don't have a chance. You are unable to allocate enough contiguous memory.

You may find that you're able to use a List<string> instead:

var lines = File.ReadLines(path, Encoding.Unicode).ToList();

... in that at least you've then got lots of little objects. It will take more memory, but it won't require as much contiguous memory. That's assuming you really need the whole file in memory at a time. If you can possibly stream the data instead, that would be a much better bet.
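If streaming is an option, one possible shape is to hand the text to a callback in fixed-size chunks rather than line by line, since a single line here can hold most of the file. This is only a sketch: ChunkedReader, ForEachChunk and the 64 K char buffer size are illustrative assumptions, not anything from the original code.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper; the names and buffer size are illustrative.
static class ChunkedReader
{
    // Decodes the file as UTF-16 and passes each chunk to handleChunk
    // as (buffer, validCharCount). Returns the total number of chars read.
    public static long ForEachChunk(string path, Action<char[], int> handleChunk, int bufferChars = 64 * 1024)
    {
        long total = 0;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var sr = new StreamReader(fs, Encoding.Unicode))
        {
            // The same buffer is reused for every chunk, so the callback
            // must copy anything it wants to keep.
            var buffer = new char[bufferChars];
            int read;
            while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
            {
                handleChunk(buffer, read);
                total += read;
            }
        }
        return total;
    }
}
```

For example, ChunkedReader.ForEachChunk(path, (buf, n) => Process(buf, n)) would call a Process method of your own for each chunk. Memory use stays bounded by the buffer size, whatever the length of the longest line.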

In a small console app I'm able to read a file of the same size with no problem using File.ReadAllText, with both the 32-bit and 64-bit CLR... so it may be a matter of your physical memory and what else you're doing in the program.
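To compare your application's environment with a bare console app, a rough probe along these lines may help. MemoryProbe and Describe are hypothetical names; the sketch only reports whether the process is 64-bit, how big the file is, and how much managed memory is already in use before the load is attempted.

```csharp
using System;
using System.IO;

// Hypothetical diagnostic helper; the names are illustrative.
static class MemoryProbe
{
    public static string Describe(string path)
    {
        long fileBytes = new FileInfo(path).Length;

        // GC.GetTotalMemory(false) gives an estimate of the managed heap
        // size without forcing a collection first.
        long managedBytes = GC.GetTotalMemory(false);

        return string.Format(
            "64-bit process: {0}, file bytes: {1}, managed heap bytes: {2}",
            Environment.Is64BitProcess, fileBytes, managedBytes);
    }
}
```

Logging MemoryProbe.Describe(path) just before the read, in both the real application and a test console app, shows whether the two are running under comparable conditions.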


See more on this question at Stackoverflow