Tcpdump has the option -B to set the capture buffer size. The value is passed to libpcap (the library tcpdump uses to do the actual packet capturing) via the pcap_set_buffer_size() function. The tcpdump manpage does not say what units -B uses, but from the source it appears to be KiB.

The manual page of pcap_set_buffer_size() does not specify the default buffer size (used if this function is not called), but again, judging from the libpcap source, it seems to be 2 MiB, at least on Linux (it is most likely system dependent).
With regard to packet buffering and dropping, you should also pay attention to setting the snaplen (-s) parameter accordingly. From man tcpdump:
    -s   Snarf snaplen bytes of data from each packet rather than the default of
         65535 bytes. Packets truncated because of a limited snapshot are
         indicated in the output with ``[|proto]'', where proto is the name of
         the protocol level at which the truncation has occurred. Note that
         taking larger snapshots both increases the amount of time it takes to
         process packets and, effectively, decreases the amount of packet
         buffering. This may cause packets to be lost. You should limit snaplen
         to the smallest number that will capture the protocol information
         you're interested in. Setting snaplen to 0 sets it to the default of
         65535, for backwards compatibility with recent older versions of
         tcpdump.
This means that with a fixed buffer size, you can increase the number of packets that fit into the buffer (and thus are not dropped) by decreasing the snaplen.
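As a back-of-the-envelope sketch, using the 2 MiB figure that appears to be the Linux default (the interface and file names in the comment are placeholders, not from the question):

```shell
# Worst case: every packet occupies a full snaplen worth of buffer space.
buffer_kib=2048                        # 2 MiB, in KiB as -B expects it
buffer_bytes=$((buffer_kib * 1024))
echo "snaplen 65535: $((buffer_bytes / 65535)) packets buffered"
echo "snaplen 128:   $((buffer_bytes / 128)) packets buffered"
# A capture using both knobs might look like (eth0 is a placeholder):
#   tcpdump -i eth0 -B $buffer_kib -s 128 -w capture.pcap
```

So cutting snaplen from the default to 128 bytes buys roughly 500x more packets in the same buffer.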
You don't specify the OS and the sort implementation; I assume you mean GNU sort. You also don't say how long "a lot of time" is, or how long you expect it to take. Most important, you don't mention the I/O subsystem capability, which will be the governing factor.
An ordinary SATA drive delivers ~150 MB/s. At that rate your 150 GB file will take 1000 seconds just to read, roughly 17 minutes. Try

    $ time cat filename >/dev/null

to see. If ~17 minutes (or whatever time cat shows) is OK, you might be able to get sort(1) to work in about 3x that time, because the output has to be written, too.
Your best bet for a speedup would seem to be --parallel, because your data fit in memory and you have spare processors. According to the info page, --buffer-size won't matter, because

    ... this option affects only the initial buffer size. The buffer grows
    beyond SIZE if `sort' encounters input lines larger than SIZE.

and a quick search indicates GNU sort uses merge sort, which is amenable to parallelization.
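A minimal sketch of a key-based parallel sort (the thread count, field number, and output file name are assumptions for illustration; --parallel is GNU-specific):

```shell
# Tiny stand-in input so the command is runnable as-is.
printf 'a,3\nb,1\nc,2\n' > /tmp/demo.csv
# Sort numerically on the second comma-separated field, two worker threads.
sort --parallel=2 -t, -k2,2n /tmp/demo.csv
# prints: b,1  c,2  a,3 (one per line)
# For the real file, something like:
#   sort --parallel=8 -t, -k2,2n master_matrix_unsorted.csv > sorted.csv
```

Note that -k2,2n compares only the second field, though sort still shuffles whole lines around.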
If you really want to know how GNU sort determines buffer sizes and what algorithm it uses for parallel sorting, the coreutils source code and accompanying documentation are readily available.
But if I were you I wouldn't bother. Whatever you're doing with master_matrix_unsorted.csv
, sort(1) is surely not up to the task.
First, a CSV file will, one day, trip you up because the CSV syntax is far beyond sort's ken. Second, it is the slowest possible way, because sort(1) is forced to sort entire rows (of indeterminate length), not just the second column. Third, when you're done, what will you have? A sorted CSV file. Is that really better? Why does the order matter so very much?
Sorting sounds like one step along the way toward a goal that likely includes some kind of computation on the data, which computation will require numbers in binary format. If that's the case, you might as well get the CSV file into a more tractable, computable, binary format first in, say, a DBMS. You may find that sorting it turns out to be unnecessary to the ultimate goal.
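For instance, with sqlite3 (assuming it is installed, and that the CSV's first row is a header; the file, table, and column names here are made up for the sketch):

```shell
# Tiny stand-in for the real CSV so the example runs end to end.
printf 'id,value\na,3\nb,1\nc,2\n' > /tmp/matrix.csv
rm -f /tmp/matrix.db
sqlite3 /tmp/matrix.db <<'EOF'
.mode csv
.import /tmp/matrix.csv matrix
CREATE INDEX idx_value ON matrix(value);  -- ordered reads without a full sort
EOF
# "Sorted" output on demand; no sorted copy of the file needed:
sqlite3 /tmp/matrix.db 'SELECT id FROM matrix ORDER BY value;'
```

Once the data is indexed, ordering becomes a property of the query rather than of the file, which is usually what the downstream computation actually wants.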
If you do not want an absolute limit but just want to pressure the kernel into flushing the buffers faster, look at vm.vfs_cache_pressure. It ranges from 0 to 200; move it towards 200 for higher pressure (the default is 100). You can also analyze your memory usage with the slabtop command; in your case, the dentry and *_inode_cache values should be high.

If you want an absolute limit, look into cgroups: place the Ceph OSD server in a cgroup and cap the maximum memory it can use by setting the memory.limit_in_bytes parameter for that cgroup.

References:
[1] - GlusterFS Linux Kernel Tuning
[2] - RHEL 6 Resource Management Guide
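Both approaches can be sketched at the shell. These commands need root, and the cgroup name ceph-osd, the 4 GiB cap, and the PID placeholder are illustrative, not prescriptive:

```shell
# Soft approach: raise reclaim pressure on the dentry/inode caches
# (takes effect immediately; not persistent across reboots).
sysctl -w vm.vfs_cache_pressure=150

# Hard limit via a cgroup-v1 memory controller ("ceph-osd" is made up):
mkdir -p /sys/fs/cgroup/memory/ceph-osd
echo $((4 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/ceph-osd/memory.limit_in_bytes
# Move the OSD process into the cgroup (replace <osd-pid> with the real PID):
echo <osd-pid> > /sys/fs/cgroup/memory/ceph-osd/tasks
```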