Linux – An efficient way to copy text from a PDF without the line breaks

copy/paste, linux, microsoft-excel, pdf

I need to get thousands of snippets of text from PDFs to a spreadsheet. They are short, seldom more than 2-3 rows, but each line break creates a new cell, and I have to repair that manually, which costs lots of time.

Because I have so many of them, the "paste into Word and do a find-and-replace" workaround is just too time-consuming for me. Is there a way to make the line breaks disappear on copy? Maybe there is a viewer which offers a special copy mode for this, or has a plugin?

The documents are scientific articles. The text arrangement is quite linear. You can assume that the text I'm copying is not inside a table or a float, and not rotated or anything. (If such a thing happens, I think I'll deal with it manually.) The text is frequently set in two columns, but I have no trouble marking just the text I need from its column. I don't need to preserve any special formatting. I'm willing to try a solution which removes all unprintable characters, for example. The texts are in English, so it is OK if the solution only works with ASCII or strips everything except alphanumeric ASCII from the copied text.

I have a very strong preference for a solution which will work on Linux, possibly some kind of Okular plugin. But if there happens to be a Windows-only solution, I want to hear about it too. I have a license for a somewhat recent Acrobat Pro on the Windows machine.

Best Answer

I had a similar problem while I was working on a text-to-speech script a while ago. My script would try to break up the text input into chunks by looking for newlines. With PDF files this resulted in a mess, because every visual line of copied text ends with a newline.

So what I did was compose a few sed and tr commands that only treat newlines preceded by a full stop as actual line breaks. It wasn't very pretty but it worked.

Using this snippet I wrote a small script for you that I hope will help:

#!/bin/bash

# title: copy_without_linebreaks
# author: Glutanimate (github.com/glutanimate)
# license: MIT license

# Reads the currently selected text and removes
# newlines that aren't preceded by a full stop

SelectedText="$(xsel)"

ModifiedText="$(printf '%s\n' "$SelectedText" | \
    sed 's/\.$/.|/' | sed 's/^[[:space:]]*$/|/' | tr '\n' ' ' | tr '|' '\n')"

#   - first sed command: append a '|' delimiter after every full stop that
#     ends a line (the full stop itself is kept)
#   - second sed command: replace empty lines with the same delimiter (e.g.
#     to separate headings from body text)
#   - tr commands: turn the remaining newlines into spaces, then turn each
#     delimiter into a newline
# Caveat: a literal '|' in the selection is also turned into a newline.
# This is less than elegant but it works.

printf '%s\n' "$ModifiedText" | xsel -bi

The script uses xsel to read the currently highlighted text (the X primary selection) and then modifies it with the sed and tr pipeline I mentioned above. The processed text is then passed back to the clipboard via xsel -bi.
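If you want to see what the pipeline does before binding it to a hotkey, here is a small sketch that runs the same sed/tr chain on text from stdin, so it does not need xsel or an X session. The function names join_pdf_lines and flatten_all are made up for this demo. flatten_all is a variation that is not part of the script: it assumes you want every snippet as one single line, which may suit the "one snippet per cell" case better.

```shell
#!/bin/bash

# Same sed/tr pipeline as in the script, reading stdin instead of xsel.
# join_pdf_lines is a hypothetical name used only for this demo.
join_pdf_lines() {
    sed 's/\.$/.|/' | sed 's/^[[:space:]]*$/|/' | tr '\n' ' ' | tr '|' '\n'
}

# Variation (not in the script above): drop every line break and squeeze
# runs of whitespace, so a snippet ends up as a single line.
flatten_all() {
    tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

sample='first line
second line.
third line'

printf '%s\n' "$sample" | join_pdf_lines
echo
printf '%s\n' "$sample" | flatten_all
echo
```

The first call keeps the break after "second line." and joins everything else, so it prints two lines; the second of them starts with a space, because the newline that followed the full stop has become a space. The second call prints the whole sample on one line.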

Here's how you can use the script in your scenario:

  1. Make sure you have xsel installed (sudo apt-get install xsel on (K)Ubuntu).
  2. Save the script as copy_without_linebreaks or something similar and make it executable.
  3. Assign the script to a hotkey of your choice in your window manager preferences.
  4. Highlight some text and press the hotkey.
  5. The clipboard should now contain the modified text.
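
Steps 1 and 2 can be sketched as a small helper. This is only an illustration: prepare_script is a hypothetical name, and the default location ~/.local/bin is an assumption, so use whatever directory you actually saved the script in. It does not install anything; it only reminds you about xsel and sets the executable bit.

```shell
#!/bin/bash

# Hypothetical helper covering steps 1 and 2 of the list above.
# $1: directory containing copy_without_linebreaks
#     (defaults to ~/.local/bin, which is only an assumption).
prepare_script() {
    local target_dir="${1:-$HOME/.local/bin}"
    local script_path="$target_dir/copy_without_linebreaks"

    # Step 1: the script needs xsel; print the install hint if it is missing.
    if ! command -v xsel >/dev/null 2>&1; then
        echo "xsel not found; on (K)Ubuntu run: sudo apt-get install xsel" >&2
    fi

    # Step 2: the script must already be saved; then make it executable.
    if [ ! -f "$script_path" ]; then
        echo "Save the script as $script_path first" >&2
        return 1
    fi
    chmod +x "$script_path" && echo "Now bind $script_path to a hotkey"
}

# Example call (uncomment after saving the script):
# prepare_script "$HOME/.local/bin"
```

The hotkey binding itself (step 3) depends on your desktop environment, so it is left to the window manager's own settings dialog.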