How to display curl’s individual Exit Status from multiple requests

Tags: bash, curl, url

My question is simple – is there a way to display curl's individual Exit Status for each URL when curl is doing multiple requests?

Let's imagine that I need to check sites a.com, b.com, c.com and see their:

  • HTTP return code
  • if HTTP return code is 000, I need to display curl's exit code.

NOTE – a.com, b.com, c.com are used as an example in this code/question. In the real script, I do have a list of valid URLs – more than 400 of them with non-overlapping patterns – and they return a variety of HTTP codes – 200/4xx/5xx as well as 000.

The 000 is the case when curl could not make a connection at all, but it provides an Exit Code to explain what prevented it from establishing the connection. In my case there are a number of such exit codes as well – 6, 7, 35, 60.
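
For a single URL it's easy to get both values; roughly like this (a sketch, using the placeholder a.com from above and the same curl options as in my code below):

code=$(curl -s --location -o /dev/null -w "%{response_code}" https://a.com)
ec=$?
echo "HTTP: $code, curl exit code: $ec"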

I tried to run the following code

unset a
unset rep
a=($(curl -s --location -o /dev/null -w "%{response_code}\n" {https://a.com,https://b.com,https://c.com}))
rep+=("$?")    # exit status of the whole curl invocation, not of each individual URL
printf '%s\n' "${a[@]}"
echo
printf '%s\n' "${rep[@]}"

While the above code returns the HTTP return code for each individual request, the Exit Code is displayed only from the last request.

000
000
000

60

I do need the ability to log individual Exit Code when I supply multiple URLs to curl.
Is there a workaround/solution for this problem?

Some additional information: currently I put all my URLs in an array and loop through it, checking each URL separately. However, going through 400 URLs takes 1-2 hours, and I need to somehow speed up the process.
I did try curl's -Z (--parallel) option. While it sped up the process by about 40-50%, it didn't help: in addition to showing only the last Exit Status as mentioned above, the Exit Status in this case is always displayed as 0, which is not correct.
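
For reference, the sequential check I use now looks roughly like this (a simplified sketch; urls stands for my array of 400+ URLs):

for url in "${urls[@]}"; do
   code=$(curl -s --location -o /dev/null -w "%{response_code}" "$url")
   ec=$?
   printf '%s %s %s\n' "$url" "$code" "$ec"   # one request at a time, hence the long runtime
done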

P.S. I am open to using any other command-line tool if it can resolve the above problem – parallel checking of tens/hundreds of URLs, logging their HTTP codes and, if the connection can't be established, additional information like curl's Exit Codes.

Thanks.

Best Answer

Analysis

The exit code is named "exit code" because it is returned when a command exits. If you run just one curl then it will exit exactly once.

curl, when given one or more URLs, might provide a way to retrieve a code equivalent to the exit code of a separate curl handling just the current URL; it would be something similar to the %{response_code} you used. Unfortunately it seems there is no such functionality (yet; maybe it will be added). To get N exit codes you need N curl processes. You need to run something like this N times:

curl … ; echo "$?"
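
For illustration, with the options from the question filled in, each of those N runs would look something like this (url holding one of your URLs):

curl -s --location -o /dev/null -w "%{response_code}\n" "$url"; echo "$?"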

I understand your N is about 400; you tried this in a loop and it took hours. Well, spawning 400 curls (even with 400 echos, if echo weren't a builtin; and even with 400 (sub)shells, if needed) is not that time-consuming. The culprit is the fact that you run all of these synchronously (didn't you?).


Simple loop and its problems

It's possible to loop and run the snippet asynchronously:

for url in … ; do
   ( curl … ; echo "$?" ) &
done
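
For concreteness, a filled-in version of that loop could look like this (a sketch; urls is assumed to be an array holding your URLs):

for url in "${urls[@]}"; do
   (
      curl -s --location -o /dev/null -w "%{response_code}\n" "$url"
      echo "$?"
   ) &
done
wait   # wait for all background subshells to finish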

There are several problems with this simple approach though:

  1. You cannot easily limit the number of curls that run simultaneously, there is no queue. This can be very bad in terms of performance and available resources.
  2. Concurrent output from two or more commands (e.g. from two or more curls) may get interleaved, possibly mid-line.
  3. Even if output from each command separately looks fine, curl or echo from another subshell may cut in between curl and its corresponding echo.
  4. There is no guarantee a subshell invoked earlier starts (or ends) printing before a subshell invoked later.

parallel

The right tool is parallel. The basic variant of the tool (from moreutils, at least in Debian) solves (1). It probably solves (2) in some circumstances; this is irrelevant anyway, because this variant does not solve (3) or (4).

GNU parallel solves all these problems.

  • It solves (1) by design.

  • It solves (2) and (3) with its --group option:

    --group
    Group output. Output from each job is grouped together and is only printed when the command is finished. Stdout (standard output) first followed by stderr (standard error). […]

    (source)

    which is the default, so usually you don't have to use it explicitly.

  • It solves (4) with its --keep-order option:

    --keep-order
    -k
    Keep sequence of output same as the order of input. Normally the output of a job will be printed as soon as the job completes. […] -k only affects the order in which the output is printed - not the order in which jobs are run.

    (source)

In Debian GNU parallel is in a package named parallel. The rest of this answer uses GNU parallel.
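
If it's not installed yet, installing that package should be enough; on Debian or a derivative, something like this (assuming apt) will do:

sudo apt install parallel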


Basic solution

<urls parallel -j 40 -k 'curl -s --location -o /dev/null -w "%{response_code}\n" {}; echo "$?"'

where urls is a file with URLs, and -j 40 means we allow up to 40 parallel jobs (adjust it to your needs and resources). In this case it's safe to embed {} in the shell code; it's the exception explicitly mentioned in this answer: Never embed {} in the shell code!
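
If you don't have such a file yet, one way to build it for a quick test (here with the example hosts from the question) is:

printf '%s\n' https://a.com https://b.com https://c.com > urls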

The output will be like

404
0
200
0
000
7
…

Note that the single-quoted string is the shell code. Within it you can implement some logic, e.g. so that exit code 0 is never printed. If I were you, I would print it anyway, on the same line, in the leading position:

<urls parallel -j 40 -k '
   out="$(
      curl -s --location -o /dev/null -w "%{response_code}" {}
   )"
   printf "%s %s\n" "$?" "$out"'

Now even if some curl is manually killed before it prints, you will get something in the first column. This is useful for parsing (we'll return to it). Example:

0 404
0 200
7 000
…
143 
…

where 143 means curl was terminated (see Default exit code when process is terminated).


With arrays

If your URLs are in an array named urls, avoid this syntax:

parallel … ::: "${urls[@]}"    # don't

parallel is an external command. If the array is large enough, you will hit the "argument list too long" error. Use this instead:

printf '%s\n' "${urls[@]}" | parallel …

It will work because in Bash printf is a builtin and therefore everything before | is handled internally by Bash.

To get from the urls array to the a and rep arrays, proceed like this:

unset a
unset rep
while read -r repx ax; do
   rep+=("$repx")
   a+=("$ax")
done < <(printf '%s\n' "${urls[@]}" \
         | parallel -j 40 -k '
              out="$(
                 curl -s --location -o /dev/null -w "%{response_code}" {}
              )"
         printf "%s %s\n" "$?" "$out"')
printf '%s\n' "${a[@]}"
echo
printf '%s\n' "${rep[@]}"

Notes

  • If we generated exit codes in the second column (which is easier: you don't need a helper variable like out) and adjusted our read accordingly, so it's read -r ax repx, then a line <empty ax><space>143 would save 143 into ax, because read ignores leading spaces (it's complicated). By reversing the order we avoid that bug in our code: a line like 143<space><empty ax> is handled properly by read -r repx ax (see the small demo after these notes).

  • You will hopefully be able to check 400 URLs in a few minutes. The duration depends on how many jobs you allow in parallel (parallel -j …), but also on:

    • how fast the servers respond;
    • how much data and how fast curls download;
    • options like --connect-timeout and --max-time (consider using them).
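
To see the read behaviour from the first note in isolation, here is a small standalone demo (the values mimic the 143<space><empty response> case):

line='143 '                                   # exit code first, then an empty response column
read -r repx ax <<< "$line"
printf 'repx=[%s] ax=[%s]\n' "$repx" "$ax"    # repx=[143] ax=[]

line=' 143'                                   # reversed columns: empty response, then exit code
read -r ax repx <<< "$line"
printf 'ax=[%s] repx=[%s]\n' "$ax" "$repx"    # ax=[143] repx=[] – 143 lands in the wrong variable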