yield return memory optimization

And yet another question about yield return

So I need to execute different SQL scripts remotely. The scripts are in TFS, so I get them from TFS automatically, and the process iterates through all the files, reading their content into memory and sending the content to the remote SQL servers.

So far the process works flawlessly. But now some of the scripts will contain bulk inserts, increasing the size of a script to 500,000 MB or more.

So I built the code "thinking" that I was reading the content of one file at a time into memory, but now I have second thoughts.

This is what I have (oversimplified):

    public IEnumerable<SqlScriptSummary> Find(string scriptsPath)
    {
        if (!Directory.Exists(scriptsPath))
        {
            throw new DirectoryNotFoundException(scriptsPath);
        }

        var path = new DirectoryInfo(scriptsPath);

        // EnumerateFiles is lazy, and so is Select: each file's content is
        // read only when the caller moves to that element of the sequence.
        return path.EnumerateFiles("*.sql", SearchOption.TopDirectoryOnly)
            .Select(x =>
            {
                var script = new SqlScriptSummary
                {
                    Name = x.Name,
                    FullName = x.FullName,
                    Content = File.ReadAllText(x.FullName, Encoding.Default)
                };

                return script;
            });
    }

....

    public void ExecuteScripts(string scriptsPath)
    {
        foreach (var script in Find(scriptsPath))
        {
            _scriptRunner.Run(script.Content);
        }
    }

My understanding is that EnumerateFiles will yield return one file at a time, which is what made me "think" that I was loading one file at a time into memory.

But...

Once I'm iterating them in the ExecuteScripts method, what happens to the script variable used in the foreach loop after it goes out of scope? Is it disposed, or does it remain in memory?

  • If it remains in memory, that means that even though I'm using iterators (and yield return internally), once I've iterated through all of them they are all still in memory, right? So in the end it would be like using ToList, just with lazy execution. Is that right? (See the sketch after this list.)

  • If the script variable is disposed when it goes out of scope, then I think I would be fine.
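
To make the first bullet concrete, here is a minimal sketch of the two cases I'm comparing, reusing the Find and _scriptRunner members from above:

    // Case 1 - lazy iteration, as in ExecuteScripts: is anything keeping
    // each script alive after its iteration finishes?
    foreach (var script in Find(scriptsPath))
    {
        _scriptRunner.Run(script.Content);
    }

    // Case 2 - ToList: the list clearly keeps every SqlScriptSummary
    // (and its Content) reachable until the list itself is unreachable.
    var scripts = Find(scriptsPath).ToList();
    foreach (var script in scripts)
    {
        _scriptRunner.Run(script.Content);
    }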

How could I redesign the code to optimize memory consumption, e.g. to force only one script's content to be loaded into memory at a time?

Additional questions:

  • How can I test (unit/integration test) that I'm loading just one script at a time into memory?

  • How can I test (unit/integration test) whether each script is released from memory?

Jon Skeet

Once I'm iterating them, what happens to the script variable used in the foreach loop after it goes out of scope? Is it disposed, or does it remain in memory?

If you mean in the ExecuteScripts method - there's nothing to dispose, unless SqlScriptSummary implements IDisposable, which seems unlikely. However, there are two different things here:

  • The script variable goes out of scope after the foreach loop, and can't act as a GC root.
  • Each object that the script variable has referred to will be eligible for garbage collection when nothing else refers to it... including script on the next iteration.

So yes, basically that should be absolutely fine. You'll be loading one file at a time, and I can't see any reason why there'd be more than one file's content in memory at a time, in terms of objects that the GC can't collect. (The GC itself is lazy, so it's unlikely that there'd be exactly one script in memory at a time, but you don't need to worry about that side of things, as your code makes sure that it doesn't keep live references to more than one script at a time.)
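
If you want to observe that eligibility directly, here's a rough sketch reusing the Find and _scriptRunner members from the question. Bear in mind that WeakReference checks like this are only meaningful in a Release build run outside a debugger, where the JIT doesn't extend local lifetimes:

    WeakReference previous = null;

    foreach (var script in Find(scriptsPath))
    {
        if (previous != null)
        {
            // Force a collection so IsAlive reflects current reachability.
            GC.Collect();
            GC.WaitForPendingFinalizers();

            // Typically prints False: nothing keeps the previous script alive.
            Console.WriteLine("Previous script still alive: " + previous.IsAlive);
        }

        _scriptRunner.Run(script.Content);
        previous = new WeakReference(script);
    }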

The way you can test that you're only loading a single script at a time is to try it with a large directory of large scripts (that don't actually do anything). If you can process more scripts than you have memory, you're fine :)
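
As a rough sketch of that kind of test (the file count and sizes here are invented - scale them so the total comfortably exceeds the memory available to the process):

    // Write out more dummy script data than the process could hold at once,
    // then run it all. If ExecuteScripts completes without an
    // OutOfMemoryException, the scripts can't all be kept alive together.
    var dir = Directory.CreateDirectory(
        Path.Combine(Path.GetTempPath(), "script-stress"));
    try
    {
        // 20 files of ~200 MB of SQL comment each - about 4 GB in total
        // (this also needs that much temporary disk space).
        var padding = "-- " + new string('x', 200 * 1024 * 1024);
        for (var i = 0; i < 20; i++)
        {
            File.WriteAllText(
                Path.Combine(dir.FullName, "script" + i + ".sql"), padding);
        }

        ExecuteScripts(dir.FullName);
    }
    finally
    {
        dir.Delete(recursive: true);
    }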


See more on this question at Stack Overflow