Windows – Editing the first/last lines of a 1GB+ text file on Windows without loading the entire file into memory

csv powershell text-editing windows

I have some flat-text data files ("CSV") of up to 3 GB each and simply need to remove the first 3 lines of text and add an empty line at the end. Since I have a lot of these files, I would like to find a fast way of doing this.

The problem with these first lines is that they are not CSV data, but random text that doesn't follow the column format. Because of this, SQL Server's Bulk Insert statement can't process these files.

One option would be to use a PowerShell script, but using Get-Content or streams would always involve reading the entire file and writing it out again in full. Is there a way to modify the file directly on disk, without loading it entirely into memory and recreating the file?

Preferably, I'm looking for a PowerShell way to do this, although third-party tools might also be interesting…

Best Answer

Removing content from the beginning of a file requires rewriting the file.

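You can use tail -n +4 input.csv > output.csv to remove the first three lines; -n +4 tells tail to start output at line 4. This streams the file rather than loading it into memory, and took 105 seconds for a 15 GB Wikipedia dump on my low-end server, i.e. about 150 MB per second. On Windows, tail is available through Cygwin, for example.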
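Where Cygwin is not an option, the same streaming rewrite can be sketched in a few lines of Python. This is only an illustration, not part of the original answer: the function name strip_header_lines, its parameters, and the file names are hypothetical, and it assumes the files use Windows-style CRLF line endings and already end with a line terminator. It still rewrites the file, as noted above, but memory use stays near one chunk regardless of file size.

```python
import shutil


def strip_header_lines(src_path, dst_path, skip_lines=3,
                       chunk_size=1024 * 1024, trailer=b"\r\n"):
    """Copy src_path to dst_path without its first skip_lines lines,
    then append one extra line terminator (hypothetical helper)."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Consume the non-CSV header lines; readline() reads only up to
        # the next newline, so the rest of the file is left untouched.
        for _ in range(skip_lines):
            src.readline()
        # Stream the remainder in fixed-size chunks, so only about
        # chunk_size bytes are held in memory at any time.
        shutil.copyfileobj(src, dst, chunk_size)
        # Assumption: the data already ends with a line terminator, so
        # one more terminator produces the requested empty last line.
        dst.write(trailer)
```

A call such as strip_header_lines("input.csv", "output.csv") would then process one file; looping over a directory of files is left to the caller.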
