Shell – Script to extract selected entries from a bibtex file

sed, shell-script, text-processing

I have a large bibtex file with many entries where each entry has the general structure

@ARTICLE{AuthorYear,
item = {...},
item = {...},
item = {...},
etc
}

(in some cases ARTICLE might be a different word e.g. BOOK)

What I would like to do is write a simple script (preferably just a shell script) to extract entries with given AuthorYear and put those in a new .bib file.

I imagine I can recognize the first line of an entry by its AuthorYear key and the last line by the lone closing }, and perhaps use sed to extract the entry, but I don't know exactly how to do this. Can someone tell me how I would achieve this?

It should probably be something like

sed -n "/AuthorYear/,/\}/p" file.bib

But that stops due to the closing } in the first item of the entry thus giving this output:

@ARTICLE{AuthorYear,
item = {...},

So I need to recognize whether the } is the only character on a line, and only have sed stop reading when that is the case.

Best Answer

The following Python script does the desired filtering.

#!/usr/bin/env python3
import re

# Bibliography entries to retrieve
# Multiple pattern compilation from: http://stackoverflow.com/a/11693340/147021
pattern_strings = ['Author2010', 'Author2012']
pattern_string = '|'.join(pattern_strings)
patterns = re.compile(pattern_string)


# Output goes to stdout; redirect it to create the new .bib file
with open('bibliography.bib', 'r') as bib_file:
    keep_printing = False
    for line in bib_file:
        if patterns.search(line):
            # Beginning of an entry
            keep_printing = True

        if keep_printing:
            # Lines of the selected entry, including the closing brace
            print(line, end='')

        if keep_printing and line.strip() == '}':
            # End of the entry which began earlier; a blank line separates entries
            print()
            keep_printing = False

Personally, I prefer moving to a scripting language once the filtering logic becomes complex. If nothing else, that tends to make the logic easier to read.
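As a variation on the script above, the same line-by-line logic can be wrapped in a function that looks for the citation key only on an entry's opening line, so a key that also appears inside a field (a crossref, say) does not start a spurious match. This is only a minimal sketch for Python 3: the name extract_entries and the sample data are made up for illustration, and it assumes, as in the question, that every entry ends with a } alone on its line.

```python
import re


def extract_entries(lines, keys):
    """Return the lines of all BibTeX entries whose citation key is in `keys`.

    Assumes each entry starts with a line like '@TYPE{Key,' and ends
    with a line containing only '}'.
    """
    if not keys:
        return []
    # Match the key only in the entry header, e.g. '@ARTICLE{Author2010,'
    header = re.compile(
        r'^\s*@\w+\s*\{\s*(?:' + '|'.join(map(re.escape, keys)) + r')\s*,'
    )
    selected = []
    inside = False
    for line in lines:
        if not inside and header.match(line):
            inside = True
        if inside:
            selected.append(line)
            if line.strip() == '}':
                # A lone closing brace ends the current entry
                inside = False
    return selected


# Hypothetical sample data, for illustration only
sample = """\
@ARTICLE{Author2010,
title = {First},
}
@BOOK{Other2011,
crossref = {Author2010},
}
@ARTICLE{Author2012,
title = {Third},
}
""".splitlines(keepends=True)

result = ''.join(extract_entries(sample, ['Author2010', 'Author2012']))
print(result, end='')
```

To produce the new .bib file, an open file object can be passed as lines and the returned list written out with writelines().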
