How to Split a File by Keyword Boundaries


I have a vcf file that contains numerous vcards.

When I import the vcf file into Outlook, it seems to import only the first vcard.

Hence I want to split them up.

Given that a vcard starts with

BEGIN:VCARD

and ends with

END:VCARD

What is the best way to split each vcard into its own file?

Thanks

UPDATE

Thanks for all the responses. As with most questions of this nature, there are various ways to skin a cat. Here's the reasoning behind the one I chose.

ROUND-UP

Here's a roundup of what I liked from each answer and what drove me to select one of them.

  • csplit: I really, really liked the conciseness of this method. I just wish it could also set the file extension.
  • gawk: It did everything I asked of it.
  • parallel: Worked. But I had to install new things. (It also decided to make a new /bin dir in my home dir.)
  • perl: I liked that it created vcf files based on the contact's name. But the -o option didn't really work.

Conclusion

  • So the first one to go was perl because it was a bit broken
  • Next was parallel because I had to install new things
  • Next was csplit, because as far as I can see it can't create extensions on the output files
  • So the award goes to gawk, for being a utility that's readily available, and versatile enough that I can chop and change the file name a bit. Bonus marks for cmp too 🙂

Best Answer

You can use awk for the job:

$ curl -O https://raw.githubusercontent.com/qtproject/qt-mobility\
/d7f10927176b8c3603efaaceb721b00af5e8605b/demos/qmlcontacts/contents/\
example.vcf

$ gawk ' /BEGIN:VCARD/ { close(fn); ++a; fn=sprintf("card_%02d.vcf", a); 
        print "Writing: ", fn } { print $0 > fn; } ' example.vcf
Writing:  card_01.vcf
Writing:  card_02.vcf
Writing:  card_03.vcf
Writing:  card_04.vcf
Writing:  card_05.vcf
Writing:  card_06.vcf
Writing:  card_07.vcf
Writing:  card_08.vcf
Writing:  card_09.vcf

$ cat card_0* > all.vcf
$ cmp example.vcf all.vcf
$ echo $?
0

Details

The awk line works like this: a is a counter that is incremented on each BEGIN:VCARD line; at the same time the output filename is constructed using sprintf and stored in fn. Every input line ($0) is then written to the current file (named fn).
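The same logic, laid out one statement per line with comments, may be easier to follow. This is only a sketch and not the one-liner above: the two-card sample.vcf, the scratch directory, the ^ anchor on the pattern, and the guard for lines before the first card are all assumptions added for illustration. It uses only POSIX awk features, so plain awk should behave like gawk here.

```shell
# Work in a scratch directory so no existing files are touched (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Tiny made-up input with two cards, standing in for example.vcf.
printf '%s\n' \
  'BEGIN:VCARD' 'VERSION:3.0' 'FN:Alice Example' 'END:VCARD' \
  'BEGIN:VCARD' 'VERSION:3.0' 'FN:Bob Example' 'END:VCARD' > sample.vcf

awk '
/^BEGIN:VCARD/ {
    if (fn != "") close(fn)           # finish the previous card file, if any
    ++a                               # card counter
    fn = sprintf("card_%02d.vcf", a)  # card_01.vcf, card_02.vcf, ...
    print "Writing: ", fn
}
fn != "" { print $0 > fn }            # ignore anything before the first card
' sample.vcf
```

Changing the format string passed to sprintf is all it takes to change the naming scheme or the file extension.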

The final echo $? prints 0, which means that cmp found no differences, i.e. the single files concatenated are identical to the original example.vcf.

Note that output redirection in awk works differently than in the shell: with > fn, awk first checks whether the file is already open. If it is, awk appends to it; if it is not, awk opens and truncates it.
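A small experiment makes the difference visible. The file names awk.txt and sh.txt and the scratch directory are made up for this sketch.

```shell
scratch=$(mktemp -d)

# One awk run: the file is opened (and truncated) on the first "> out",
# and the second "> out" writes to the already open file.
awk -v out="$scratch/awk.txt" 'BEGIN { print "first" > out; print "second" > out }'

# Shell: every ">" reopens and truncates the file.
echo "first"  > "$scratch/sh.txt"
echo "second" > "$scratch/sh.txt"

wc -l < "$scratch/awk.txt"   # 2 lines: both writes survived
wc -l < "$scratch/sh.txt"    # 1 line: only the last write survived
```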

Because of this redirection logic, awk keeps every implicitly opened file open until it is told otherwise. We therefore have to close the files explicitly; otherwise the call would hit the open-file limit when the input file contains many records.
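To get a feeling for the limit, one can print it with ulimit -n and then split an input with more cards than that. The sketch below generates 1500 made-up cards (the count, the many.vcf name, the scratch directory, and the four-digit numbering are assumptions) and splits them while closing each file. Whether a version without close() actually fails depends on the awk implementation and on the limit, so only the closing variant is shown.

```shell
big=$(mktemp -d)

# Per-process limit on open files; the value varies between systems.
ulimit -n

# Generate a synthetic input with 1500 cards.
awk 'BEGIN {
    for (i = 1; i <= 1500; i++)
        printf "BEGIN:VCARD\nFN:Contact %d\nEND:VCARD\n", i
}' > "$big/many.vcf"

# Split it, closing each output file before the next one is opened,
# so only one output file is open at any time.
awk -v dir="$big" '
/^BEGIN:VCARD/ {
    if (fn != "") close(fn)
    fn = sprintf("%s/card_%04d.vcf", dir, ++n)
}
fn != "" { print $0 > fn }
' "$big/many.vcf"

# Count the resulting files.
set -- "$big"/card_*.vcf
echo "$# files written"
```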
