How to sort an access log efficiently in blocks

large-files, sort

Access logs are more or less sorted by time, but to aggregate connections by time (with uniq -c), you need to sort them a bit more. For a huge access log, sort is very inefficient, because it buffers and sorts the whole file before printing anything.
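
For example, a whole-file pipeline like the one below (the field positions are only an illustration, assuming the common log format where the fourth space-separated field holds the timestamp) prints nothing until sort has consumed the entire file:

# counts requests per minute; sort blocks until all input is read
cut -d' ' -f4 access.log | cut -d: -f1-3 | sort | uniq -c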

Do you know of any option for sort, or any version of sort, that could sort only a given number of lines at a time, then print that block?

I have searched for the following keywords: "streaming sort", "block sort", "approximate sort". I have read the whole manual through, to no avail. Setting the buffer size (-S) did not influence this.

Best Answer

Try split --filter:

split --lines 1000 --filter 'sort ... | sed ... | uniq -c' access.log

This will split access.log into chunks of 1000 lines and pipe each chunk through the given filter.
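
As a concrete sketch under the same assumption as above (common log format, timestamp in the fourth space-separated field; the extraction step is illustrative, not prescribed by the answer), counting requests per minute within each 1000-line block could look like:

# each block is sorted and aggregated on its own, so results start
# appearing after the first 1000 lines rather than after the whole file
split --lines 1000 --filter 'cut -d" " -f4 | cut -d: -f1-3 | sort | uniq -c' access.log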

If you want to save the results for each chunk separately, you can use $FILE in the filter command and optionally specify a prefix (the default is x):

split --lines 1000 --filter '... | uniq -c >$FILE' access.log myanalysis-

This will generate a file myanalysis-aa containing the result of processing the first chunk, myanalysis-ab for the second chunk, etc.
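
Note that a key straddling a chunk boundary has its count split across two output files. If that matters for your analysis, the per-chunk counts can be re-summed afterwards; a minimal sketch, assuming uniq -c lines of the form "COUNT KEY" with no spaces in the key:

# sum the per-chunk counts by key, then sort by key for readability
cat myanalysis-* | awk '{count[$2] += $1} END {for (k in count) print count[k], k}' | sort -k2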

The --filter option to split was introduced in GNU coreutils 8.13 (released in September 2011).
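
If you are unsure whether your split is new enough, checking the installed version is sufficient:

split --version | head -n 1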
