What should the buffer size be for the sort command?


I have a machine with 2 TB of RAM, and I am running a sort command on a 150 GB file with the buffer size set to 1000 GB. After doing a bit of research on Google, I found this piece of advice: "the bigger the buffer size, the better the performance". This is the command that I ran:

sort -rk2 --buffer-size=1000G master_matrix_unsorted.csv > master_matrix_sorted.csv

But this is taking a long time, and I have no clue about the progress of the task.

Any idea what the best buffer size for this operation would be? I am planning to re-run this task with a new buffer size.

Best Answer

You don't specify the OS or the sort implementation; I assume you mean GNU sort. You also don't say how long "a lot of time" is, or how long you expect it to take. Most important, you don't mention the I/O subsystem's capability, which will be the governing factor.

An ordinary SATA drive delivers roughly 150 MB/s. At that rate your 150 GB file will take about 1000 seconds just to read, roughly 17 minutes. Try $ time cat master_matrix_unsorted.csv > /dev/null to see. If ~17 minutes (or whatever time cat shows) is acceptable, you might be able to get sort(1) to finish in about 3× that time, because the output has to be written, too.
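A minimal way to take that measurement (a sketch; the cache-drop line is Linux-specific, needs root, and is only there so the number reflects the disk rather than the page cache):

# Drop the page cache so the timing measures the disk, not RAM:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time cat master_matrix_unsorted.csv > /dev/null
# file size / real time ~= sequential read throughput; no sort of this
# file can finish faster than this read.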

Your best bet for speedup would seem to be --parallel, because your data fit in memory and you have spare processors. According to the info page, --buffer-size won't matter, because

... this option affects only the initial buffer size. The buffer grows beyond SIZE if `sort' encounters input lines larger than SIZE.

That said, a quick search indicates GNU sort uses a merge sort, which is amenable to parallelization.
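A sketch of the re-run I would try, assuming GNU coreutils sort; the thread count and temporary directory below are placeholders for your machine:

# --parallel caps sort's worker threads; pick a value near your core count.
# -T points sort's spill files at fast storage (it may still spill to disk
#    even with a large buffer). If column 2 is numeric, -k2,2nr (numeric,
#    reverse, that field only) is likely what you want instead of -rk2.
sort --parallel=16 --buffer-size=200G -T /fast/tmp \
     -rk2 master_matrix_unsorted.csv > master_matrix_sorted.csv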

If you really want to know how GNU sort determines buffer sizes and what algorithm it uses for parallel sorting, the coreutils source code and accompanying documentation are readily available.

But if I were you I wouldn't bother. Whatever you're doing with master_matrix_unsorted.csv, sort(1) is surely not up to the task.

First, a CSV file will, one day, trip you up, because CSV syntax is far beyond sort's ken: a quoted field can contain the delimiter, or even a newline. Second, it is the slowest possible way, because sort(1) is forced to compare entire rows of indeterminate length, not just the second column. Third, when you're done, what will you have? A sorted CSV file. Is that really better? Why does the order matter so much?
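To make that first point concrete, here is a contrived example (the data are hypothetical) where sort's comma splitting disagrees with a CSV parser:

# Mixed quoted/unquoted names: sort keys on the raw bytes, so the leading
# quote makes "Zed sort before Brown, and the embedded comma shifts Zed's
# count into sort's field 4 instead of field 3.
printf '%s\n' 'a,"Zed, Z",10' 'b,Brown,2' | sort -t, -k2,2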

Sorting sounds like one step along the way toward a goal, one that likely includes some kind of computation on the data, and that computation will require numbers in binary format. If that's the case, you might as well get the CSV file into a more tractable, computable, binary format first, say into a DBMS. You may find that sorting it turns out to be unnecessary to the ultimate goal.
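A minimal sketch of that route, assuming SQLite is an acceptable DBMS and that the file's first line is a header row; the table name matrix and column name col2 are hypothetical stand-ins for whatever your header actually says:

# Import the CSV once; SQLite parses quoting correctly and stores a
# binary, indexable copy.
sqlite3 matrix.db <<'EOF'
.mode csv
.import master_matrix_unsorted.csv matrix
CREATE INDEX matrix_col2 ON matrix(col2);
EOF
# Descending order on demand, without re-sorting the whole file.
# .import creates TEXT columns, hence the CAST for numeric ordering.
sqlite3 -csv matrix.db 'SELECT * FROM matrix ORDER BY CAST(col2 AS REAL) DESC;'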
