I need to get thousands of snippets of text from PDFs into a spreadsheet. They are short, seldom more than 2-3 lines, but each line break creates a new cell, and I have to repair that manually, which costs lots of time.
Because I have so many of them, the "paste into Word and do a find-and-replace" workaround is just too time-consuming for me. Is there a way to have the line breaks disappear on copy? Maybe there is a viewer which offers a special copy mode for this, or has a plugin?
The documents are scientific articles. The text arrangement is quite linear. You can assume that the text I'm copying is not inside a table or a float, and not rotated or anything. (If such a thing happens, I think I'll deal with it manually). The text is frequently set in two columns, but I have no trouble marking just the text I need from its column. I don't need to preserve any special formatting. I'm willing to try a solution which removes all unprintable characters, for example. The texts are in English, it is OK if the solution only works in ASCII/strips all non-alphanumeric ASCII of the copied text.
I have a very strong preference for a solution which will work on Linux, possibly some kind of Okular plugin. But if there happens to be a Windows-only solution, I want to hear about it too. I have a license for a somewhat recent Acrobat Pro on the Windows machine.
Best Answer
I had a similar problem while I was working on a text to speech script a while ago. My script would try to break up the text input into chunks by looking for newlines. With PDF files this would result in a mess because of the way each line ends with a newline.
So what I did was compose a few sed and tr commands to only consider newlines ending with a full stop as actual line breaks. It wasn't very pretty but it worked.
Using this snippet I wrote a small script for you that I hope will help:
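A minimal sketch of what such a script can look like, assuming a Linux system with xsel installed and a running X session. The function name join_wrapped_lines and the \001 placeholder byte are illustrative choices, and the DISPLAY check is an added safeguard so the filter can also be tried in a plain terminal without touching the clipboard.

```shell
#!/bin/sh
# Sketch: join hard-wrapped lines, keeping only the line breaks that
# directly follow a full stop.

join_wrapped_lines() {
    # Placeholder byte (assumed never to occur in the copied text).
    marker=$(printf '\001')

    tr -d '\r' |                                # drop carriage returns
    sed -e 's/[[:space:]]*$//' \
        -e "s/\\.\$/.${marker}/" |              # tag lines that end with a full stop
    tr '\n\001' ' \n' |                         # newline -> space, placeholder -> newline
    sed -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d'                              # trim each line, drop empty ones
}

# Clipboard round trip: read the highlighted text (primary selection),
# clean it, and put the result on the clipboard for pasting.
if [ -n "${DISPLAY:-}" ] && command -v xsel >/dev/null 2>&1; then
    xsel -o | join_wrapped_lines | xsel -bi
fi
```

The filter can be checked on its own by piping text into it, for example `printf 'first line\nsecond line.\n' | join_wrapped_lines`. If every snippet should end up in a single spreadsheet cell regardless of full stops, the placeholder step could be dropped so that all newlines become spaces.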
The script uses xsel to parse the currently highlighted text and then modifies it with the sed and tr command line I mentioned above. The processed text is then passed back to the clipboard via xsel -bi.
Here's how you can use the script in your scenario:
1. Make sure you have xsel installed (sudo apt-get install xsel on (K)Ubuntu).
2. Save the script as copy_without_linebreaks or something similar and make it executable.