Bash – What happens if I start too many background jobs

background-process bash expect jobs telnet

I need to do some work on 700 network devices using an expect script. I can get it done sequentially, but so far the runtime is around 24 hours. This is mostly due to the time it takes to establish a connection and the delay in the output from these devices (old ones). I'm able to establish two connections and have them run in parallel just fine, but how far can I push that?

I don't imagine I could do all 700 of them at once; surely there's some limit to the number of telnet connections my VM can manage.

If I did try to start 700 of them in some sort of loop like this:

for node in `ls ~/sagLogs/`; do  
    foo &  
done

With

  • CPU: 12 CPUs × Intel(R) Xeon(R) CPU E5649 @ 2.53GHz

  • Memory: 47.94 GB

My question is:

  1. Could all 700 instances possibly run concurrently?
  2. How far could I get until my server reaches its limit?
  3. When that limit is reached, will it just wait to begin the next iteration of foo, or will the box crash?

I'm running in a corporate production environment unfortunately, so I can't exactly just try and see what happens.

Best Answer

Could all 700 instances possibly run concurrently?

That depends on what you mean by concurrently. If we're being picky, then no, they can't unless you have 700 threads of execution on your system that you can utilize (so probably not). Realistically though, yes, they probably can, provided you have enough RAM and/or swap space on the system. UNIX and its various children are remarkably good at managing huge levels of concurrency; that's part of why they're so popular for large-scale HPC usage.
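As a quick sanity check of the "threads of execution" point, a couple of standard commands show what the box actually has to schedule onto; this is purely illustrative and nothing in the answer depends on it:

# Number of hardware threads the kernel can schedule at once
nproc
# The same information with a bit more detail
lscpu | grep -E '^(CPU\(s\)|Model name)'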

How far could I get until my server reaches its limit?

This is impossible to answer concretely without a whole lot more info. Pretty much, you need to have enough memory to meet:

  • The entire run-time memory requirements of one job, times 700 (a rough way to measure the per-job figure is sketched just after this list).
  • The memory requirements of bash to manage that many jobs (bash is not horrible about this, but the job control isn't exactly memory efficient).
  • Any other memory requirements on the system.
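To put a rough number on the first item, you could measure one representative run and multiply by 700. A minimal sketch, assuming GNU time is available as /usr/bin/time and that foo here stands in for one full expect run against a single device:

# Peak resident memory of one representative job, in kilobytes
/usr/bin/time -v foo 2>&1 | grep 'Maximum resident set size'
# Multiply that figure by 700, add headroom for bash and everything else
# on the box, and compare against the 47.94 GB available.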

Assuming you meet that (and with only 50 GB of RAM, that's not a given), you still have to deal with other issues:

  • How much CPU time is going to be wasted by bash on job control? Probably not much, but with hundreds of jobs, it could be significant.
  • How much network bandwidth is this going to need? Just opening all those connections may swamp your network for a couple of minutes depending on your bandwidth and latency.
  • Many other things I probably haven't thought of.

When that limit is reached, will it just wait to begin the next iteration of foo, or will the box crash?

It depends on which limit is hit. If it's memory, something will die on the system (more specifically, get killed by the kernel in an attempt to free up memory), or the system itself may crash (it's not unusual to configure systems to intentionally crash when they run out of memory). If it's CPU time, things will just keep going without issue; it'll simply be hard to do much else on the system. If it's the network, though, you might crash other systems or services.
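One concrete limit worth checking before trying anything is the per-user process cap: if you hit it, the extra forks simply fail with "Resource temporarily unavailable" rather than taking the box down. A minimal sketch of the checks, using standard tools:

# Maximum number of processes your user may run (forks beyond this fail)
ulimit -u
# Current free memory and swap
free -h
# Whether the kernel is configured to panic instead of invoking the OOM
# killer when memory runs out (0 = kill a process, 1 or 2 = panic)
cat /proc/sys/vm/panic_on_oom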


What you really need here is not to run all the jobs at the same time. Instead, split them into batches, run all the jobs within a batch at the same time, let them finish, then start the next batch.

GNU Parallel (https://www.gnu.org/software/parallel/) can be used for this, but it's less than ideal at that scale in a production environment (if you go with it, don't get too aggressive; as mentioned, you might swamp the network and affect systems you otherwise would not be touching). I would really recommend looking into a proper network orchestration tool like Ansible (https://www.ansible.com/), as that will not only solve your concurrency issues (Ansible does batching like I mentioned above automatically), but also give you a lot of other useful features to work with (such as idempotent execution of tasks, nice status reports, and native integration with a very large number of other tools).
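For illustration, here is a minimal sketch of the batching idea. It assumes foo acts on one node name passed as its first argument (the question's loop doesn't show how foo learns which device to work on, so that part is an assumption), and the limit of 20 simultaneous jobs is an arbitrary starting point to tune against what your network tolerates:

# With GNU Parallel: at most 20 jobs at a time, a new one starting as each
# one finishes
ls ~/sagLogs/ | parallel -j 20 foo {}

# Roughly the same idea in plain bash: launch a batch of 20, wait for the
# whole batch to finish, then start the next one
batch=20
count=0
for node in ~/sagLogs/*; do
    foo "$(basename "$node")" &
    (( ++count % batch == 0 )) && wait
done
wait    # catch the final, possibly partial batch

The plain-bash version stalls at the end of each batch waiting for the slowest device, which is exactly the kind of scheduling GNU Parallel and Ansible handle for you automatically.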