Shell – Standard Out Append to File Size Limitations

Tags: curl, gnu-parallel, shell-script, stdout

I'm pulling VIN specifications from the National Highway Traffic Safety Administration API for approximately 25,000,000 VIN numbers. This is a great deal of data, and as I'm not transforming the data in any way, curl seemed like a more efficient and lightweight way of accomplishing the task than Python (seeing as Python's GIL makes parallel processing a bit of a pain).

In the code below, vins.csv is a file containing a large sample of the 25M VINs, broken into chunks of 100 VINs per line. These are passed to GNU Parallel, which is using 4 cores. Everything is dumped into nhtsa_vin_data.csv at the end.

$ cat vins.csv | parallel -j10% curl -s --data "format=csv" \
   --data "data={1}" https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/ \
      >> /nas/BIGDATA/kemri/nhtsa_vin_data.csv
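
The question doesn't show how vins.csv was chunked into 100-VIN lines; assuming a raw file with one VIN per line (raw_vins.txt is a made-up name), a minimal sketch of that prep step could look like this:

# hypothetical prep step: join every 100 VINs onto one semicolon-separated line
$ xargs -n 100 < raw_vins.txt | tr ' ' ';' > vins.csv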

This process was writing about 3,000 VINs a minute at the beginning and has been getting progressively slower with time (currently around 1,200/minute).

My questions

  • Is there anything in my command that would be subject to increasing overhead as nhtsa_vin_data.csv grows in size?
  • Is this related to how Linux handles >> operations?
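
On the second point: >> opens the file with O_APPEND, and the append itself shouldn't become more expensive as the target file grows. A rough sanity check (not from the original post, just an illustrative test) is to time the same append against an empty file and an already-large one:

$ : > small.csv
$ time head -c 100M /dev/zero >> small.csv

$ head -c 5G /dev/zero > big.csv           # pre-grow the target file
$ time head -c 100M /dev/zero >> big.csv   # should take roughly the same time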

UPDATE #1 – SOLUTIONS

First solution per @slm – use parallel's tmp-file options to write each curl output to its own .par file, then combine them at the end:

$ cat vins.csv | parallel \
--tmpdir /home/kemri/vin_scraper/temp_files \
--files \
-j10% curl -s \
--data "format=csv" \
--data "data={1}" https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/ > /dev/null

cat <(head -1 $(ls *.par|head -1)) <(tail -q -n +2 *.par) > all_data.csv

Second solution per @oletange – use --line-buffer to buffer output to memory instead of disk:

$ cat test_new_mthd_vins.csv | parallel \
    --line-buffer \
    -j10% curl -s \
    --data "format=csv" \
    --data "data={1}" https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/ \
    >> /home/kemri/vin_scraper/temp_files/nhtsa_vin_data.csv
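
To get a feel for what --line-buffer changes (output is flushed as complete lines rather than held back until a job finishes, so lines from different jobs can interleave but are never split mid-line), a toy run like this is enough; the sleep durations are arbitrary:

$ parallel --line-buffer 'echo {} start; sleep {}; echo {} end' ::: 3 1 2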

Performance considerations

I find both of the suggested solutions very useful and interesting, and I'll definitely be using both versions in the future (both for comparing performance and for additional API work). Hopefully I'll be able to run some tests to see which one performs better for my use case.

Additionally, running some sort of throughput test like @oletange and @slm suggested would be wise, since there's a non-negligible chance that the NHTSA API itself is the bottleneck here.
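
A simple first pass at such a test (just a sketch, reusing the three sample VINs that appear in the answer below) is to time a single batch request in isolation and compare it with the per-batch time implied by the overall throughput:

$ time curl -s --data "format=csv" \
    --data "data=1HGCR3F95FA017875;1HGCR3F83HA034135;3FA6P0T93GR335818" \
    https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/ > /dev/null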

Best Answer

My suspicion is that the >> is causing you contention on the file nhtsa_vin_data.csv among the curl commands that parallel is forking off to collect the API data.

I would adjust your application like this:

$ cat p.bash
#!/bin/bash

cat vins.csv | parallel --will-cite -j10% --progress --tmpdir . --files \
   curl -s --data "format=csv" \
     --data "data={1}" https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/

This will give each of your curl commands its own isolated file to write its data to.

Example

I took the 3 VINs you provided me, 1HGCR3F95FA017875;1HGCR3F83HA034135;3FA6P0T93GR335818;, and put them into a file called vins.csv. I then replicated them a number of times so that the file ended up with these characteristics:

VINs per line
$ tail -1 vins.csv | grep -o ';' | wc -l
26
Number of lines
$ wc -l vins.csv
15 vins.csv
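
The replication step itself isn't shown in the answer; a rough sketch of how a file with similar characteristics could be built (the repeat counts below are approximations, not the exact ones used):

$ vins='1HGCR3F95FA017875;1HGCR3F83HA034135;3FA6P0T93GR335818;'
$ for i in $(seq 15); do for j in $(seq 9); do printf '%s' "$vins"; done; printf '\n'; done > vins.csv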

I then ran my script using this data:

$ ./p.bash

Computers / CPU cores / Max jobs to run
1:local / 1 / 1

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:1/0/100%/0.0s ./pard9QD3.par
local:1/1/100%/10.0s ./paruwK9L.par
local:1/2/100%/8.5s ./parT6rCS.par
local:1/3/100%/7.3s ./pardzT2g.par
local:1/4/100%/6.8s ./parDAsaO.par
local:1/5/100%/6.8s ./par9X2Na.par
local:1/6/100%/6.7s ./par6aRla.par
local:1/7/100%/6.7s ./parNR_r4.par
local:1/8/100%/6.4s ./parVoa9k.par
local:1/9/100%/6.1s ./parXJQTc.par
local:1/10/100%/6.0s ./parDZZrp.par
local:1/11/100%/6.0s ./part0tlA.par
local:1/12/100%/5.9s ./parydQlI.par
local:1/13/100%/5.8s ./par4hkSL.par
local:1/14/100%/5.8s ./parbGwA2.par
local:0/15/100%/5.4s

Putting things together

When the above is done running, you can then cat all the files together to get a single .csv file, like so:

$ cat *.par > all_data.csv

Use care when doing this, since every file has its own header row for the CSV data it contains. To strip the duplicate headers from the result files:

$ cat <(head -1 $(ls *.par|head -1)) <(tail -q -n +2 *.par) > all_data.csv
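
An equivalent way to do the header de-duplication (not from the original answer, just a common awk idiom) is to keep the first line of the first file and skip the first line of every other file:

$ awk 'FNR==1 && NR!=1 {next} {print}' *.par > all_data.csv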

Your slowing performance

In my testing it does look like the DOT website throttles queries as you continue to hit their API. The timings I saw in my experiments, though over a small sample, were decreasing as each query was sent to the API's website.

My performance on my laptop was as follows:

$ seq 5 | parallel --will-cite --line-buffer 'yes {} | head -c 1G' | pv >> /dev/null
   5GiB 0:00:51 [99.4MiB/s] [                <=>       ]

NOTE: The above was borrowed from Ole Tange's answer and modified. It writes 5GB of data through parallel and pipes it to pv >> /dev/null. pv is used so we can monitor the throughput through the pipe and arrive at a MB/s type of measurement.

My laptop was able to muster ~100MB/s of throughput.
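
If the NAS mount from the question is a suspect, the same pattern can be pointed at a file on that share instead of /dev/null (the path below is just the directory from the question plus a throwaway file name):

$ seq 5 | parallel --will-cite --line-buffer 'yes {} | head -c 1G' | \
    pv > /nas/BIGDATA/kemri/throughput_test.tmp

This writes about 5GiB to the share, so remove throughput_test.tmp afterwards.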

FAQ for NHTSA API


For the ‘Decode VIN (flat format) in a Batch’ is there a sample on making this query by URL, similar to the other actions?

For this particular API you just have to put a set of VINs within the box that are separated by a “;”. You can also indicate the model year prior to the “;” separated by a “,”. There is an upper limit on the number of VINs you can put through this service.

Example in the box is the sample: 5UXWX7C5*BA,2011; 5YJSA3DS*EF

Source: https://vpic.nhtsa.dot.gov/MfrPortal/home/faq searched for "rate"

The above mentions that there's an upper limit when using the API:

There is an upper limit on the number of VINs you can put through this service.
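
Putting the FAQ's format together with the curl call used above, each batch entry is a VIN optionally followed by a comma and a model year, with entries separated by semicolons; a sketch using the FAQ's own sample values:

$ curl -s --data "format=csv" \
    --data "data=5UXWX7C5*BA,2011;5YJSA3DS*EF" \
    https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/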
