How to sort an access log efficiently in blocks

large-files, sort

Access logs are more or less sorted by time, but to aggregate connections by time (with uniq -c), you need to sort them a bit more. For a huge access log, sort is very inefficient, because it buffers and sorts the whole file before printing anything.
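
For example, a whole-file pipeline like the one below (the field positions are only an illustration, assuming the common log format where the fourth space-separated field holds the timestamp) prints nothing until sort has consumed the entire file:

# counts requests per minute; sort blocks until all input is read
cut -d' ' -f4 access.log | cut -d: -f1-3 | sort | uniq -c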

Do you know of any option for sort, or any version of sort, that could sort only a given number of lines at a time, then print that block?

I have searched for the following keywords: "streaming sort", "block sort", "approximate sort". I have read the whole manual through, to no avail. Setting the buffer size (-S) did not influence this.

Best Answer

Try split --filter:

split --lines 1000 --filter 'sort ... | sed ... | uniq -c' access.log

This will split access.log into chunks of 1000 lines and pipe each chunk through the given filter.
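
As a concrete sketch under the same assumption as above (common log format, timestamp in the fourth space-separated field; the extraction step is illustrative, not prescribed by the answer), counting requests per minute within each 1000-line block could look like:

# each block is sorted and aggregated on its own, so results start
# appearing after the first 1000 lines rather than after the whole file
split --lines 1000 --filter 'cut -d" " -f4 | cut -d: -f1-3 | sort | uniq -c' access.log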

If you want to save the results for each chunk separately, you can use $FILE in the filter command and optionally specify a prefix (the default is x):

split --lines 1000 --filter '... | uniq -c >$FILE' access.log myanalysis-

This will generate a file myanalysis-aa containing the result of processing the first chunk, myanalysis-ab for the second chunk, etc.
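
Note that a key straddling a chunk boundary has its count split across two output files. If that matters for your analysis, the per-chunk counts can be re-summed afterwards; a minimal sketch, assuming uniq -c lines of the form "COUNT KEY" with no spaces in the key:

# sum the per-chunk counts by key, then sort by key for readability
cat myanalysis-* | awk '{count[$2] += $1} END {for (k in count) print count[k], k}' | sort -k2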

The --filter option to split was introduced in GNU coreutils 8.13 (released in September 2011).
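
If you are unsure whether your split is new enough, checking the installed version is sufficient:

split --version | head -n 1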
