I have a text-based database that represents logs, sorted by timestamp. For testing purposes my database has approximately 10,000 lines, but this number can be larger. It has the following format:
primary_key, source_file, line_num
1, cpu.txt, 2
2, ram.txt, 3
3, cpu.txt, 3
I query the database, and as I read the results I want to append the actual data to a string which I can then display. The actual data in the above example would be the contents of line 2 from cpu.txt, followed by the contents of line 3 from ram.txt, and so on. The line contents can be quite long.
An important note is that the line numbers per file are all in order. That is to say, the next time I encounter a cpu.txt entry in the database it will have line 4 as the line number. However, I might see a cpu.txt entry only after thousands of other entries from ram.txt, harddrive.txt, graphics.txt, etc.
I have thought about using something along the lines of the following code:
StringBuilder odbcResults = new StringBuilder();
OdbcDataReader dbReader = com.ExecuteReader(); // query the database
while (dbReader.Read())
{
    string fileName = dbReader[1].ToString(); // source file
    int fileLineNum = int.Parse(dbReader[2].ToString()); // line number in source file
    // Skip(n).First() returns the line at 0-based index n
    odbcResults.Append(File.ReadLines(fileName).Skip(fileLineNum).First());
}
However, won't File.ReadLines() open the file and dispose of its TextReader on every iteration? That doesn't seem very efficient.
I also had this idea of keeping a StreamReader for every file I need to read in a Dictionary:
Dictionary<string, StreamReader> fileReaders = new Dictionary<string, StreamReader>();
StringBuilder odbcResults = new StringBuilder();
OdbcDataReader dbReader = com.ExecuteReader();
while (dbReader.Read())
{
    string fileName = dbReader[1].ToString(); // source file
    int fileLineNum = int.Parse(dbReader[2].ToString()); // line number in source file
    if (!fileReaders.ContainsKey(fileName))
    {
        fileReaders.Add(fileName, new StreamReader(fileName));
    }
    StreamReader fileReader = fileReaders[fileName];
    // don't have to worry about positioning? lines are consumed consecutively
    odbcResults.Append(fileReader.ReadLine());
}
// can't forget to properly Close() and Dispose() of all fileReaders
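For reference, the cleanup at the end might look something like this (ideally placed in a finally block, so the readers are released even if the loop above throws):

foreach (StreamReader fileReader in fileReaders.Values)
{
    fileReader.Dispose(); // Dispose() also closes the underlying stream
}
fileReaders.Clear();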
Do you agree with any of the above examples or is there an even better way?
For the second example, I am working on the assumption that the StreamReader will remember its last position - I believe this is tracked by its BaseStream.
I have read over "How do I read a specified line in a text file?", "Read text file at specific line", and "StreamReader and seeking" (the first answer there links to a custom StreamReader class with positioning capabilities, but I only know the line number I need to be on, not a byte offset), but I don't think they answer my question specifically.
It sounds like you're going to want to have in memory (for display in textboxes) everything that the user selects - so that's a natural boundary for what's feasible anyway. I suggest the following approach:

1. Read all of the database metadata entries into a list, in result order.
2. Create a "final data array" of the same size, to hold the line contents.
3. Collect the distinct set of filenames from the metadata entries.
4. For each filename, open the file and make a single forward pass over it, going through the metadata entries in order: whenever an entry refers to the current file, read ahead to the requested line and store it in the corresponding slot of the array.
At that point, the "final data array" should be fully populated. You only need to have one file open at a time, and you never need to read the whole file. I think this is simpler than having a dictionary of open files - aside from anything else, it means you can use a using statement for each file, rather than having to handle all the closing more manually.
It does mean having all the database metadata entries in memory at once, but presumably each metadata entry is smaller than the result data, which you need to have in memory anyway by the end in order to display the result to the user.
Even though you'll be going over the database metadata entries multiple times, that will all happen in memory. It should be insignificant compared with the IO to the file system or the database.
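Here is a minimal sketch of that approach. It assumes the same com command and column layout as in the question, and that the line numbers in the database are 1-based; the BuildResults method name is illustrative, not from the original post.

using System;
using System.Collections.Generic;
using System.Data.Odbc;
using System.IO;
using System.Linq;
using System.Text;

class LogAssembler
{
    static string BuildResults(OdbcCommand com)
    {
        // Step 1: read all the metadata entries into memory, in result order.
        var entries = new List<Tuple<string, int>>(); // (fileName, lineNum)
        using (OdbcDataReader dbReader = com.ExecuteReader())
        {
            while (dbReader.Read())
            {
                entries.Add(Tuple.Create(dbReader[1].ToString(),
                                         int.Parse(dbReader[2].ToString())));
            }
        }

        // Step 2: the "final data array", parallel to the entries list.
        string[] results = new string[entries.Count];

        // Steps 3-4: one file open at a time; a single forward pass per file
        // is enough because line numbers per file are in ascending order.
        foreach (string fileName in entries.Select(e => e.Item1).Distinct())
        {
            using (StreamReader reader = new StreamReader(fileName))
            {
                int currentLine = 0; // 1-based number of the line currently held
                string line = null;
                for (int i = 0; i < entries.Count; i++)
                {
                    if (entries[i].Item1 != fileName)
                    {
                        continue;
                    }
                    while (currentLine < entries[i].Item2)
                    {
                        line = reader.ReadLine();
                        currentLine++;
                    }
                    results[i] = line;
                }
            }
        }

        // Stitch the populated array back into one displayable string.
        StringBuilder odbcResults = new StringBuilder();
        foreach (string line in results)
        {
            odbcResults.AppendLine(line);
        }
        return odbcResults.ToString();
    }
}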
An alternative would be to group the metadata entries by filename as you read them, maintaining the index as part of the metadata entry.
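As a sketch of that alternative (reusing the entries list and results array from above, with the same assumptions):

// Group the entries by filename, remembering each entry's original index
// so the results still land in the right slots.
var groups = entries
    .Select((e, index) => new { FileName = e.Item1, LineNum = e.Item2, Index = index })
    .GroupBy(x => x.FileName);

foreach (var group in groups)
{
    using (StreamReader reader = new StreamReader(group.Key))
    {
        int currentLine = 0;
        string line = null;
        foreach (var entry in group) // still in ascending line-number order
        {
            while (currentLine < entry.LineNum)
            {
                line = reader.ReadLine();
                currentLine++;
            }
            results[entry.Index] = line;
        }
    }
}

This walks the metadata list only once instead of once per file, at the cost of building the groups up front.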
See more on this question at Stack Overflow