I have a csv file with the format:
Input.csv:
TIMESTAMP,Data1,Data2,Data3,Data4
"2021-01-03 00:00:00",80953,3.243183,2.943338,358.0123
"2021-01-03 00:01:00",80954,2.173187,1.990327,344.5851
...
"2021-01-03 23:59:00",80957,4.04172,3.82053,355.5481
"2021-01-04 00:00:00",80955,3.700353,3.593842,346.2665
...
"2021-01-04 23:59:00",80956,3.125094,2.922542,350.9915
"2021-01-05 00:00:00",80957,4.04172,3.82053,355.5481
...
"2021-01-05 23:59:00",80956,3.125094,2.922542,350.9915
etc...
The file contains per minute data for multiple days and gets updated every minute.
I would like to write a bash script that creates multiple csv files based on previous day's TIMESTAMP's Column in input.csv as follows:
cat 20210103000000.csv
TIMESTAMP,Data1,Data2,Data3,Data4
"2021-01-03 00:00:00",80953,3.243183,2.943338,358.0123
cat 20210103000100.csv
TIMESTAMP,Data1,Data2,Data3,Data4
"2021-01-03 00:01:00",80954,2.173187,1.990327,344.5851
… so forth, up to last minute for that day
cat 20210103235900.csv
TIMESTAMP,Data1,Data2,Data3,Data4
"2021-01-03 23:59:00",80957,4.04172,3.82053,355.5481
If data for a certain time, e.g. "2021-01-03 17:06:00", is missing/does not exist, then the following file must be created:
20210103170600.csv:
TIMESTAMP,Data1,Data2,Data3,Data4
"2021-01-03 17:06:00",0,0,0,0
The solution on this article How to split a CSV file per initial column (with headers)?
awk -F ',' 'NR==1{h=$0; next};!seen[$1]++{f=$1".csv"; print h > f};{f=$1".csv"; print >> f; close(f)}' input.csv
partly solves my problem but it create files for all the data contained in the input.csv file and doesn't take into account missing records.
Best Answer
Try:
Using GNU
awk
for the strftime() and mktime() functions to reduce the execution time for generating the timestamps instead of calling externaldate
command and also store the files in separate day directory and remove all double quotes:As in GNU
awk
documentation:systime()
Return the current time of day as the number of seconds since the Epoch (1970-01-01 00:00:00 UTC on POSIX systems). Let's print it:mktime(timestamp)
Turn the timestamp in the format of theYYYY MM DD HH MM SS
into the epoch time.Let's print it;
strftime(format, timestamp)
: Format timestamp according to the specification in format. The timestamp should be in the epoch type.Let's format a timestamp:
Remember all 3 above
awk
time functions.Now let's see what they do one by as used in the answer:
Notice the
86400
is the number of seconds in each day or 24hours; at above we saidsystime()
return current time of day as the number of seconds since the Epoch, so if we minus seconds of a day from current time it gives us the time with yesterday date.Let's convert it to human readable to see what is that:
now it's clear what timestamp it is, we used Hour/Min/Sec to "00" because we need this timestamp as start point and we store it into
start
variable in the code.then we used a for-loop to generate the rest of the timestamps from the timestamp in the
start
variable as following:Notice the number
1440
? that is the number of minutes in a day or 24hours (24*60=1440); butmktime()
accept timestamp as epoch and in seconds, so we multiply each minute to 60 to get timestamps in seconds then turn it into this format%F %H:%M
(F
ull format of the date same as%Y-%m-%d
,H
our andM
inute) and save into an awk array we name ittimestamp[...]
; now we have yesterday's date of all timestamps minutely.you can even print them to see what they are:
Below gsub() function removes all quotes from the current line:
then we backup the input file's first line which is the header line into
hdr
variable as we need the header line to be added into every file we generate; then also we create a directory with yesterday date as well and it will take the format ofdir_%Y%m%d
; below code block runs once only when it's first input lineNR==1 { "run these" }
:With the system() function we are calling the external command
mkdir
to create that directory.Going into next block, run the following block only if timestamp from the first column was seen in the
timestamp
array(substr($1,1,16) in timestamp) { "run these" }
; substr(string, start [, length ]) function return a length-character-long substring of string, starting at character number start.cp=$1
: we copy first column intocp
variable, we will use value in thecp
for later processing.gsub(/[-: ]|00$/, "", cp);
; strip characters-
,:
and Space from thecp
variable as well as the trailing double zero "00"s.print hdr ORS $0 >(yday"/"cp".csv");
:print
the header line which we keep it in thehdr
var, anORS
(that's a newline character for Output Record Separator by default) and entire line$0
into the relateddirectory/fileName.csv
.close(yday"/"cp".csv");
: close() the file after write.delete timestamp[substr($1,1,16)]
: and delete that timestamp from the array.and in the
END { "run these" }
block we print to the files for those timestamps that didn't exist in the input file.To process multiple files and split each input file into individual day directory.