`scp` itself has no such feature. With GNU parallel you can use the `sem` command (short for semaphore) to arbitrarily limit concurrent processes:

```
sem --id scp -j 50 scp ...
```
For all processes started with the same `--id`, this applies a limit of 50 concurrent instances. An attempt to start a 51st process will wait (indefinitely) until one of the other processes exits. Add `--fg` to keep the process in the foreground (the default is to run it in the background, but that doesn't behave quite the same as a shell background process).
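As a minimal sketch of queueing many transfers this way (the host and paths are illustrative):

```
# Queue one transfer per file; at most 50 scp processes run at once.
for f in /data/outgoing/*; do
  sem --id scp -j 50 scp "$f" user@backuphost:/incoming/
done
# Block until every transfer queued under this id has finished.
sem --id scp --wait
```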
Note that the state is stored in `${HOME}/.parallel/`, so this won't work quite as hoped if you have multiple users running `scp`; you may need a lower limit for each user. (It should also be possible to override the `HOME` environment variable when invoking `sem`, make sure `umask` permits group write, and modify the permissions so the users share state. I have not tested this heavily though, YMMV.)
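An untested sketch of that shared-state idea (the directory name is an assumption):

```
# All users point sem at the same state directory so they share one semaphore.
export HOME=/var/lib/scp-sem   # must exist and be group-writable
umask 002                      # let the group write the semaphore files
sem --id scp -j 50 scp bigfile user@host:/incoming/
```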
`parallel` requires only `perl` and a few standard modules.
You might also consider using `scp -l N`, where N is a bandwidth limit in Kbit/s; selecting a specific cipher (for speed, depending on your security requirements); or disabling compression (especially if the data is already compressed) to further reduce CPU impact.
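For instance (the cipher choice is illustrative; run `ssh -Q cipher` to see what your build supports):

```
# Cap bandwidth at ~8 Mbit/s, pick a cheap cipher, and skip compression.
scp -l 8192 -c aes128-ctr -o Compression=no bigfile user@host:/incoming/
```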
For `scp`, ssh is effectively a pipe, and an `scp` instance runs on each end (the receiving end runs with the undocumented `-t` option). Regarding `MaxSessions`: this won't help, because "sessions" are multiplexed over a single SSH connection. Despite copious misinformation to the contrary, `MaxSessions` limits only the multiplexing of sessions per TCP connection, nothing else.
The PAM module `pam_limits` supports limiting concurrent logins, so if OpenSSH is built with PAM and `UsePAM yes` is present in `sshd_config`, you can set limits by username, group membership, and more. You can then set a hard `maxlogins` in `/etc/security/limits.conf` to cap the number of logins. However, this counts all logins per user, not just those made over `ssh`, and not just `scp`, so you might run into trouble unless you have a dedicated `scp` user ID. Once enabled, it will also apply to interactive ssh sessions. One way around this is to copy or symlink the `sshd` binary, calling it `sshd-scp`; then you can use a separate PAM configuration file, i.e. `/etc/pam.d/sshd-scp` (OpenSSH calls `pam_start()` with the "service name" set to the name of the binary it was invoked as). You'll need to run this on a separate port (or IP), and using a separate `sshd_config` is probably a good idea too.
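A rough sketch of the pieces involved (the paths, port, and limit of 50 are assumptions, and your distribution's PAM defaults may differ):

```
# Duplicate the binary so pam_start() sees the service name "sshd-scp".
cp /usr/sbin/sshd /usr/sbin/sshd-scp
cp /etc/pam.d/sshd /etc/pam.d/sshd-scp   # make sure it includes pam_limits.so

# /etc/security/limits.conf: cap the dedicated transfer account at 50 logins.
#   scpuser  hard  maxlogins  50

# Run the copy on its own port with its own config.
/usr/sbin/sshd-scp -f /etc/ssh/sshd_config-scp -p 2222
```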
If you implement this, then `scp` will fail (exit code 254) when the limit is reached, so you'll have to deal with that in your transfer process.
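One hedged way to deal with it in a shell-based transfer script (names are illustrative):

```
# Retry while the login limit is hit (exit 254); give up on any other failure.
until scp bigfile scpuser@host:/incoming/; do
  [ $? -eq 254 ] || break
  sleep 10
done
```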
(Other options include `ionice` and `cpulimit`, but these may cause `scp` sessions to time out or hang for long periods, causing more problems.)
The old-school way of doing something similar is to use `atd` and `batch`, but that doesn't offer tuning of concurrency: it queues processes and starts them when the system load is below a specific threshold. A newer variation on that is Task Spooler, which supports queueing and running jobs in a more configurable sequential/parallel way, with runtime reconfiguration (e.g. changing queued jobs and concurrency settings), though it offers no load- or CPU-related control itself.
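For illustration (the Task Spooler binary is `tsp` on Debian-derived systems, `ts` elsewhere):

```
# batch: start the job only once the load average drops below the threshold.
echo "scp bigfile user@host:/incoming/" | batch

# Task Spooler: queue transfers, allowing 5 to run concurrently.
tsp -S 5
tsp scp bigfile user@host:/incoming/
```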
The short answer is:

```
ulimit -m 1000000
ulimit -v 1000000
```

which will limit each process to 1 GB RAM (the values are in KB).
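To apply that per job under GNU Parallel, a sketch (`my_command` is a placeholder):

```
# Each job runs in its own shell, so the limit applies to that job's process.
parallel 'ulimit -v 1000000; my_command {}' ::: input1 input2 input3
```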
Limiting the memory the "right" way is in practice extremely complicated. Let us say you have 1 GB of RAM, you start a process every 10 seconds, and each process uses 1 MB more every second. After 140 seconds you will have something like this (each row is one process and its memory use in MB):
```
 10 ██▎
 20 ██████▍
 30 ██████████▌
 40 ██████████████▋
 50 ██████████████████▊
 60 ██████████████████████▉
 70 ███████████████████████████
 80 ███████████████████████████████▏
 90 ███████████████████████████████████▎
100 ██████████████████████████████████████▍
110 ██████████████████████████████████████████▌
120 ██████████████████████████████████████████████▋
130 ██████████████████████████████████████████████████▊
140 ██████████████████████████████████████████████████████▉
```
This sums up to 1050 MB of RAM, so now you need to kill something. What is the right job to kill? Is it the one at 140 MB (assuming it ran amok)? Is it the one at 10 MB (because it has run for the least amount of time)?
In my experience, jobs where memory is an issue are typically either very predictable (e.g. transforming a bitmap) or very unpredictable. For the very predictable ones you can do the computation beforehand and see how many jobs can be run at once.
For the unpredictable ones, you ideally want the system to start a few jobs that take up a lot of memory, and when those are done, to start more jobs that take up less memory. But you do not know beforehand which jobs will take a lot, which will take a little, and which will run amok. Some jobs' normal life cycle is to run with little memory for a long time and then balloon to a much bigger size later on. It is very hard to tell the difference between those jobs and jobs that run amok.
When someone points me to a well-thought-out way to do this that will make sense for many applications, GNU Parallel will probably be extended with it.
Best Answer
Split the URLs into one file per host, then run `parallel -j5` on each file. Or sort the URLs and insert a `'\0'` delimiter whenever a new host is met, then split on `'\0'` (removing it) and pass each block to a new instance of parallel.
Edit: I think this will work:
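Here is a sketch of the command, assuming the URLs are in a file named `urls.txt` and an unbounded outer `parallel` (the `-j0` is an assumption):

```
cat urls.txt | parallel -j0 -q sem --fg --id '{= m://([^/]+):; $_=$1 =}' -j5 wget {}
```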
- `sem` is part of GNU Parallel (it is a shorthand for `parallel --semaphore`).
- `{= m://([^/]+):; $_=$1 =}` grabs the hostname.
- `-j5` tells `sem` to make a counting semaphore with 5 slots.
- `--fg` forces `sem` to not spawn the job in the background. By using the hostname as ID you will get a counting semaphore for each hostname.
- `-q` is needed for `parallel` if some of your URLs contain special shell chars (such as `&`). They need to be protected from shell expansion because `sem` will also shell expand them.