Regex to LaTeX – Converting Question Formats

perl, regular expression

1. Lorem ipsun la la la?
1. Sopor
2. Stupor
3. Torpor

2. A patient has Lorem?
1. Sopor
2. Stupor
3. Somnolentia 
4. La
5. Coma

3. Doesn't Response to strong external irritants is short. Tendon, pupillary, corneal reflexes are retained. Doesn't Response to strong external irritants is short. Tendon, pupillary, corneal reflexes are retained. What disorder of consciousness does he have?
1. Stupor
2. Sopor
3. Somnolentia 
4. Euphoria
5. Coma

... [777 questions]

which I want to turn into

1. Lorem ipsun la la la?
\begin{enumerate}
\item Sopor
\item Stupor
\item Torpor
\end{enumerate}    

2. A patient has Lorem?
\begin{enumerate}
\item Sopor
\item Stupor
\item Somnolentia 
\item La
\item Coma
\end{enumerate}

3. Doesn't Response to strong external irritants is short. Tendon, pupillary, corneal reflexes are retained. Doesn't Response to strong external irritants is short. Tendon, pupillary, corneal reflexes are retained. What disorder of consciousness does he have?
\begin{enumerate}
\item Stupor
\item Sopor
\item Somnolentia 
\item Euphoria
\item Coma
\end{enumerate}

Some notes

  • There are between 3 and 5 answer options per question.
  • I am not sure in which order it is best to make these changes.
  • A list matching 1.[ A-Za-z0-9.]*5.\n should be replaced with \n\begin{enumerate}[match]\end{enumerate}\n
    • if that does not match, a list matching 1.[ A-Za-z0-9.]*4.\n should be replaced with \n\begin{enumerate}[match]\end{enumerate}\n
    • if that does not match either, a list matching 1.[ A-Za-z0-9.]*3.\n should be replaced with \n\begin{enumerate}[match]\end{enumerate}\n

What command-line tools should I use to do this? Perl comes to my mind, but I'm not sure.


I just noticed that running cat on the file gives output that differs both from what I expect and from what the viewer shows. I am currently using the newest OS X with Perl v5.16.2.

This is the test file.

Input, command and output

$ cat questions_copy.tex 
1. Lorem ipsun la la la?
1. Sopor
2. Stupor
3. Torpor

2. A patient has Lorem?
1. Sopor
2. Stupor
3. Somnolentia 
4. La
5. Coma

% STRANGE cat output here - Not correct!
3. Doesn't Response to strong external irritants is short. Tendon, pupillary, corneal reflexes are retained. Doesn't Response to strong external irritants is short. Tendon, 3. Somnolentia eal reflexes are retained. What disorder of consciousness does he have?
5. Comaoria
% Perl makes the same mistake
$ perl -000pe 's/\n/\n\\begin{enumerate}\n/; s/\n\d./\n\\item /g; s/$/\\end{enumerate}\n/' questions_copy.tex 
1. Lorem ipsun la la la?
\begin{enumerate}
\item  Sopor
\item  Stupor
\item  Torpor
\end{enumerate}

2. A patient has Lorem?
\begin{enumerate}
\item  Sopor
\item  Stupor
\item  Somnolentia 
\item  La
\item  Coma
\end{enumerate}

3. Doesn't Response to strong external irritants is short. Tendon, pupillary, corneal reflexes are retained. Doesn't Response to strong external irritants is short. Tendon, 3. Somnolentia eal reflexes are retained. What disorder of consciousness does he have?
\begin{enumerate}
5. Coma\end{enumerate}

$ 

Best Answer

Here's one way. This assumes that questions are separated by consecutive newlines (\n\n).

$ perl -000pe 's/\n/\n\\begin{enumerate}\n/; 
                s/\n\d./\n\\item /g; s/$/\\end{enumerate}\n/' file 
1. Lorem ipsun la la la?
\begin{enumerate}
\item  Sopor
\item  Stupor
\item  Torpor
\end{enumerate}

2. A patient has Lorem?
\begin{enumerate}
\item  Sopor
\item  Stupor
\item  Somnolentia 
\item  La
\item  Coma\end{enumerate}

Explanation

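  • -000 : activate Perl's paragraph mode. Records ("lines") are then delimited by blank lines (\n\n) instead of single newlines, so each question together with its options is treated as one record.
  • -pe : read each record of the input file and print it (-p) after applying the script passed with -e.
  • s/\n/\n\\begin{enumerate}\n/ : replace the first newline of the record (the one that ends the question) with a newline, \begin{enumerate}, and another newline.
  • s/\n\d./\n\\item /g : replace every (g) newline that is followed by a digit and one more character with a newline and \item followed by a space. The unescaped . matches any character, here the dot after the number. The space that followed the dot is kept, which is why the output has two spaces after \item.
  • s/$/\\end{enumerate}\n/ : insert \end{enumerate} and a newline at the end of the record ($). Since $ also matches just before a final newline, a last record that ends in a single newline gets \end{enumerate} appended directly to its last option, as in the second question above.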