Sort large CSV files (90GB), Disk quota exceeded

Tags: disk-usage, large files, parallelism, sort

Here is what I do right now:

sort -T /some_dir/ --parallel=4 -uo file_sort.csv -k 1,3 file_unsort.csv

The file is 90GB, and I got this error message:

sort: close failed: /some_dir/sortmdWWn4: Disk quota exceeded

Previously I didn't use the -T option, and apparently the default tmp directory was not large enough to handle this. The directory I'm pointing to now has roughly 200GB of free space. Is that still not enough for the sort temp files?

I don't know whether the --parallel option affects this or not.

Best Answer

The problem is that you seem to have a disk quota set up and your user doesn't have the right to take up so much space in /some_dir. And no, the --parallel option shouldn't affect this.
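
You can usually confirm this by checking your quota and the actual free space on the filesystem holding /some_dir (assuming the quota tools are installed; the output format varies between systems):

## show your user's quota usage and limits, in human-readable sizes
quota -s
## show how much space is actually free on that filesystem
df -h /some_dir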

As a workaround, you can split the file into smaller pieces, sort each of those separately on the same key, and then merge them back into a single file:

## split the file into 100M pieces named fileChunkaa, fileChunkab, ...
## (-C rather than -b, so that no CSV line is split across two pieces)
split -C 100M file_unsort.csv fileChunk
## sort each piece on the same key and delete the unsorted copy
for f in fileChunk*; do sort -k 1,3 "$f" > "$f".sorted && rm "$f"; done
## merge the sorted pieces
sort -T /some_dir/ --parallel=4 -muo file_sort.csv -k 1,3 fileChunk*.sorted
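
Once the merge has finished, you can check that the result really is ordered on the key and only then remove the intermediate pieces, for example:

## exits with a non-zero status if file_sort.csv is out of order
sort -c -k 1,3 file_sort.csv && rm fileChunk*.sorted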

The magic is GNU sort's -m option (from info sort):

‘-m’
‘--merge’
    Merge the given files by sorting them as a group.  Each input file
    must always be individually sorted.  It always works to sort
    instead of merge; merging is provided because it is faster, in the
    case where it works.
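
As a small illustration of what -m does (throwaway example files, not part of the workflow above):

printf '%s\n' a c > part1   ## already sorted
printf '%s\n' b d > part2   ## already sorted
sort -m part1 part2         ## prints a, b, c, d without doing a full sort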

That will require ~180G of free space for a 90G file, since the original file and all the pieces exist side by side. However, the actual sorting won't take much temporary space, because you're only ever sorting 100M chunks, and the final -m pass merges rather than re-sorts.
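
If the quota on /some_dir is the main constraint, it may also be worth letting GNU sort compress its temporary files with --compress-program (a sketch, assuming gzip is available), which can shrink them considerably for text data:

sort -T /some_dir/ --parallel=4 --compress-program=gzip -uo file_sort.csv -k 1,3 file_unsort.csv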
