Transform Bibliographic References for Use with LaTeX

latex, macos, text processing

I was given a long Word document which I have to port to LaTeX. In the document, all citations appear in the classic author-year form. Something like

Lorem ipsum dolor (Sit, 1998) amet, consectetur adipiscing (Slit 2000, Sed and So 2002, Eiusmod et al. 1976).
Tempor incididunt ut labore et dolore magna aliqua (Ut et al. 1312)

These references need to be replaced by the matching key as it appears in a list of bib references. In other words, the text should translate to

Lorem ipsum dolor \cite{sit1998} amet, consectetur adipiscing \cite{slit2000,sed2002,eiusmod1976}.
Tempor incididunt ut labore et dolore magna aliqua \cite{ut1312}

That means:

extract all the strings that are composed of name(s) and year enclosed in parentheses
strip that string of spaces, second names (everything after the first name) and capital letters
use the resulting string to form the new \cite{string}

I understand that this may be quite a complex task. I was wondering whether someone has already written a script for this specific task. Alternatively, any partial suggestion is also welcome. I am currently working on macOS.

Best Answer

The following awk program should work. It looks for ( ... ) elements in each line and checks if they fit the "author(s), year" or "author(s)1 year1, author(s)2 year2, ..." pattern. If so, it creates a citation command and replaces the ( ... ) group; otherwise it leaves the group as it is.

#!/usr/bin/awk -f


# This small function creates an 'authorYYYY'-style string from
# separate author and year fields. We split the "author" field
# additionally at each space in order to strip leading/trailing
# whitespace and further authors.
function contract(author, year)
{
    split(author,auth_fields," ");
    auth=tolower(auth_fields[1]);
    return sprintf("%s%4d",auth,year);
}



# This function checks if two strings correspond to "author name(s)" and
# "year", respectively.
function check_entry(string1, string2)
{
    if (string1 ~ /^ *([[:alpha:].-]+ *)+$/ && string2 ~ /^ *[[:digit:]]{4} *$/) return 1;
    return 0;
}




# This function creates a 'citation' command from a raw element. If the
# raw element does not conform to the reference syntax of 'author, year' or
# 'author1 year1,author2 year2, ...', we should leave it "as is", and return
# a "0" as indicator.
function create_cite(raw_elem)
{
    cite_argument=""

    # Split at ','. The single elements are either name(list) and year,
    # or space-separated name(list)-year statements.
    n_fields=split(raw_elem,sgl_elem,",");

    if (n_fields == 2 && check_entry(sgl_elem[1],sgl_elem[2]))
    {
        cite_argument=contract(sgl_elem[1],sgl_elem[2]);
    }
    else
    {
        for (k=1; k<=n_fields; k++)
        {
            n_subfield=split(sgl_elem[k],subfield," ");

            if (check_entry(subfield[1],subfield[n_subfield]))
            {
                new_elem=contract(subfield[1],subfield[n_subfield]);
                if (cite_argument == "")
                {
                    cite_argument=new_elem;
                }
                else
                {
                    cite_argument=sprintf("%s,%s",cite_argument,new_elem);
                }
            }
            else
            {
                return 0;
            }
        }
    }


    cite=sprintf("\\cite{%s}",cite_argument);
    return cite;
}




# Actual program
# For each line, create a "working copy" so we can replace '(...)' pairs
# already processed with different text (here: 'X ... Y'); otherwise 'sub'
# would always stumble across the same opening parentheses.
# For each '( ... )' found, check if it fits the pattern. If so, we replace
# it with a 'cite' command; otherwise we leave it as it is.

{
    working_copy=$0;

    # Allow for unmatched ')' at the beginning of the line:
    # if a ')' was found before the first '(', mark it as processed
    i=index(working_copy,"(");
    j=index(working_copy,")");
    if (i>0 && j>0 && j<i) sub(/\)/,"Y",working_copy);

    while (i=index(working_copy,"("))
    {
        sub(/\(/,"X",working_copy); # mark this '(' as "already processed"

        j=index(working_copy,")");
        if (!j)
        {
            continue;
        }
        sub(/\)/,"Y",working_copy); # mark this ')', too


        elem=substr(working_copy,i+1,j-i-1);

        replacement=create_cite(elem);
        if (replacement != "0")
        {
            elem="\\(" elem "\\)"
            sub(elem,replacement);
        }

    }
    print $0;
}

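Save the program as transform_citation.awk and call it with the command below; the converted text is written to standard output.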

~$ awk -f transform_citation.awk input.tex

Note that the program expects the input to be "reasonably" well-formed, i.e. all parentheses on a line should be matched pairs (although one unmatched closing parenthesis at the beginning of the line is allowed for, and unmatched opening parentheses are ignored).

Note also that some of the syntax above requires GNU awk. To be portable to other implementations, replace

if (string1 ~ /^ *([[:alpha:].-]+ *)+$/ && string2 ~ /^ *[[:digit:]]{4} *$/) return 1;

with

if (string1 ~ /^ *([a-zA-Z.-]+ *)+$/ && string2 ~ /^ *[0123456789][0123456789][0123456789][0123456789] *$/) return 1;

and ensure you have set the collation locale to C.
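
As a cross-check for the awk output, here is a minimal Python sketch of the same idea. It is not a port of the program above: the names entry_key, make_cite and transform are made up for this illustration, every word of the author part is checked (not just the first one), and nested parentheses are not handled.

```python
import re

WORD = re.compile(r"[A-Za-z.-]+")   # one author word, e.g. "Sed", "al."
YEAR = re.compile(r"[0-9]{4}")      # a four-digit year


def entry_key(names, year):
    """Return an 'authorYYYY' key, or None if the pieces do not look
    like author name(s) and a year. (Hypothetical helper.)"""
    words = names.split()
    year = year.strip()
    if not words or not all(WORD.fullmatch(w) for w in words):
        return None
    if not YEAR.fullmatch(year):
        return None
    return words[0].lower() + year


def make_cite(group):
    """Build a \\cite command from the text inside one '( ... )' pair,
    or return None if the text is not a citation list."""
    fields = group.split(",")
    # Case 1: a single "Author, Year" citation
    if len(fields) == 2:
        key = entry_key(fields[0], fields[1])
        if key is not None:
            return "\\cite{" + key + "}"
    # Case 2: "Author1 Year1, Author2 Year2, ..."
    keys = []
    for field in fields:
        names, _, year = field.strip().rpartition(" ")
        key = entry_key(names, year)
        if key is None:
            return None          # leave the whole group untouched
        keys.append(key)
    return "\\cite{" + ",".join(keys) + "}"


def transform(line):
    """Replace every citation-like '( ... )' group in one line."""
    def repl(match):
        cite = make_cite(match.group(1))
        return cite if cite is not None else match.group(0)
    return re.sub(r"\(([^()]*)\)", repl, line)


if __name__ == "__main__":
    print(transform("Lorem ipsum dolor (Sit, 1998) amet (see below)."))
```

Applied to the two sample lines from the question, transform produces the \cite commands shown there, and groups such as "(see below)" are left as they are.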

Related Solutions

Vim latex: disable quickfix

Note: I do not have vim-latex installed. I have just looked at likely parts of the plugin's code.

What behavior do you want to avoid?

The documentation says that you can set g:Tex_GotoError to 0 to prevent automatically jumping to the first error after using \ll to compile.

let g:Tex_GotoError = 0

Also, the code indicates that you can inhibit the log file preview-mode window (below the quickfix window) by setting Tex_ShowErrorContext:

let g:Tex_ShowErrorContext = 0

I did not see an option for controlling whether the quickfix window itself is left open, however. You can close it manually with :cclose (which can be shortened to :ccl).

For both of the above variables, you can use a buffer-local (b:…) or window-local (w:…) variable instead of the global (g:…) one if you do not want to change the behavior globally.

Printing Latex source with a2ps

The message (in English "egrep: Invalid range end") comes from a bug in a2ps.

Its /usr/bin/texi2dvi4a2ps shell script calls egrep wrongly:

Instead of

echo "$command_line_filename" | egrep '^(/|[A-z]:/)' >/dev/null \
|| command_line_filename="./$command_line_filename"

it should be

echo "$command_line_filename" | egrep '^(/|[A-Za-z]:/)' >/dev/null \
|| command_line_filename="./$command_line_filename"

As the bug is in a shell script, you can fix it easily by just editing the file.

The pattern checks whether the filename is absolute (starts with a /, relevant on Unix-like systems) or starts with a drive name (e.g. C:, relevant only for Windows systems). Otherwise, ./ is prepended to the filename.
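
To see what the two character ranges accept, here is a small Python sketch of the same check. It is only an illustration: the script itself does this in shell, the function name normalize is hypothetical, and Python's re module compares ranges by code point, so it does not reproduce the egrep error. It does show that in ASCII order the range A-z also covers the punctuation between Z and a (such as _), which is not what a drive letter test wants.

```python
import re

# The range from the original script and the corrected one
BUGGY = re.compile(r"^(/|[A-z]:/)")
FIXED = re.compile(r"^(/|[A-Za-z]:/)")


def normalize(filename, pattern=FIXED):
    """Prepend './' unless the name is absolute or starts with a
    drive letter. (Hypothetical helper mirroring the shell logic.)"""
    if pattern.search(filename):
        return filename
    return "./" + filename


if __name__ == "__main__":
    for name in ("paper.tex", "/tmp/paper.tex", "C:/paper.tex"):
        print(name, "->", normalize(name))
    # '_' lies between 'Z' and 'a' in ASCII, so the buggy range
    # treats "_:/x" as if it started with a drive letter.
    print(normalize("_:/x", BUGGY), "vs", normalize("_:/x"))
```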

Feel free to report this bug upstream or to the distribution you use.
