Merging:
The easiest way to merge two files is to copy and paste. Notepad++ has no built-in file merging feature.
You can, however, install a plugin for this. See Combining files in Notepad++.
Another solution would be the command line's copy command. See Need to combine lots of files in a directory.
Replacing line breaks:
Removing duplicates is trickier than removing short words because Notepad++'s search does not match across multiple lines at once, so the line breaks first have to be converted into something else.
To achieve this, perform an Extended replace, finding all \r\n (DOS line breaks) and replacing them with # (or any other character that does not appear in your list).
If the last line was not blank, append a # to the end of the resulting string.
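For readers who would rather script this step, here is a minimal Python sketch of the same conversion. The function name join_lines and its sep parameter are made up for illustration, and the sketch assumes the list uses \r\n line breaks as described above.

```python
def join_lines(text: str, sep: str = "#") -> str:
    # The separator must not already occur in the list,
    # otherwise entries would be split in the wrong places later.
    if sep in text:
        raise ValueError(f"separator {sep!r} already appears in the text")
    # Same effect as the Extended replace: every CRLF becomes the separator.
    joined = text.replace("\r\n", sep)
    # If the last line was not blank, close the final entry with a separator.
    if not joined.endswith(sep):
        joined += sep
    return joined


print(join_lines("apple\r\nbanana\r\napple"))  # apple#banana#apple#
```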
Removing duplicates:
Now perform a Regular expression replace:
Find what: ([^#]+)#(.*#)\1#
Replace with: \1#\2
If there were duplicates within a single file, you might have to run this more than once.
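The "more than once" part can be sketched in Python by repeating the same replace until the text stops changing. This uses the pattern from the answer unchanged; the function name remove_duplicates is hypothetical, and the input is assumed to be the #-joined string from the previous step.

```python
import re

# Same pattern as in the answer: an entry, at least one other entry, then the entry again.
DUPLICATE = re.compile(r"([^#]+)#(.*#)\1#")


def remove_duplicates(joined: str) -> str:
    # Each pass drops the later copy of a repeated entry; loop until nothing changes.
    while True:
        reduced = DUPLICATE.sub(r"\1#\2", joined)
        if reduced == joined:
            return reduced
        joined = reduced


print(remove_duplicates("apple#banana#apple#cherry#"))  # apple#banana#cherry#
```

One thing this sketch makes easy to check: the pattern needs at least one other entry between the two copies, so directly adjacent duplicates such as a#a# appear to be left alone.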
Removing words of 4 or fewer characters:
This one is easy. Just perform a Regular expression replace:
Find what: #.?.?.?.?#
Replace with: #
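As a rough check of this pattern outside the editor, the replace can be tried in Python. The function name remove_short_words is invented for this sketch. It repeats the replace until nothing changes, because two short entries in a row share a # and a single pass only catches the first of them.

```python
import re

# Same pattern as in the answer: a '#', up to four characters, then another '#'.
SHORT_ENTRY = re.compile(r"#.?.?.?.?#")


def remove_short_words(joined: str) -> str:
    # Repeat the replace: neighbouring short entries share a delimiter,
    # so one pass can leave every second one behind.
    while True:
        reduced = SHORT_ENTRY.sub("#", joined)
        if reduced == joined:
            return reduced
        joined = reduced


print(remove_short_words("zebra#cat#a#horse#"))  # zebra#horse#
```

Note that the very first entry has no # in front of it, so in this sketch a short first entry is never matched.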
Line breaks:
Now you can get rid of the line break hack. Just perform an Extended replace, replacing all # with \r\n (DOS line break).
Finally, delete the last line, as it will be blank.
Best Answer
This answer is inspired by a YouTube video. Updated to maintain original sort order, if that is important.
Notepad++ has a built-in TextFX tool that sorts selected lines alphabetically. This tool can be hijacked to sort by the length of the lines by placing spaces on the left of each line, and making sure that all the lines are the same length.
"The Zoo" comes alphabetically before "Their House" because the space is treated as a character and comes before "i".
Similarly, "__X" (pretending the underscores are really spaces) will come alphabetically before "_XX". The idea in this answer is to add spaces and line numbers so that "__________092dog" will be sorted above "_003alligator".
I'll use the following as example data:
Step 1. Add line numbers.
(Note added by barlop: we will not be sorting by these line numbers; we are sorting by the length of the lines. The line numbers are added so that we know the natural order, so that when, for example, two or more lines are of equal length, we can sort those lines according to that natural order.)
Assuming your text file only has the data in it, place the text cursor (the vertical line) at the very first position of the file. Then in the Edit menu select Column Editor... (Alt+C). Choose "Number to Insert" and start with 1, increase by 1, and include leading zeros. Note that this will retain the original ordering when sorting from shortest string to longest string. Reverse all lines first if you want to sort longest to shortest.
Step 2. Pad all lines with leading spaces.
Place the text cursor (the vertical line) at the very first position of the file. Then in the Edit menu select Column Editor... (Alt+C). Insert enough spaces so that the shortest line of data will be padded out to the length of the longest line of data. If your shortest line has 4 characters, and your longest 44, then make sure you insert at least 40 spaces.
Step 3. Trim lines to a uniform length.
Use the following Regular Expression Find/Replace (Ctrl+H) to match the right-hand characters equalling or exceeding the length of your longest data line.
Replace all with $1. That will trim everything except the right-most 50 characters of every line. If your data is longer (or shorter) than 50, adjust the {50} in the Regular Expression.
(Note added by barlop: the idea here is that the shortest lines have the most spaces at the beginning.)
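The find pattern for this step is not shown above. As an illustration only, the following Python sketch uses one pattern that fits the description (keep the right-most N characters of each line, replace with the first group). The pattern ^.*(.{N})$ and the name trim_to_width are assumptions made for this sketch, not the answer's original expression.

```python
import re


def trim_to_width(text: str, width: int) -> str:
    # ASSUMPTION: reconstructed pattern, not the one from the original answer.
    # Group 1 captures the right-most `width` characters of each line;
    # everything to the left of it on that line is discarded.
    # Lines are assumed to end in "\n"; shorter lines do not match and stay as they are.
    pattern = re.compile(rf"^.*(.{{{width}}})$", re.MULTILINE)
    return pattern.sub(r"\1", text)


padded = " " * 10 + "001dog\n" + " " * 10 + "002alligator"
print(trim_to_width(padded, 12))
```

After trimming to 12 characters, the shorter entry keeps six leading spaces and the longer one keeps none, which is what lets an alphabetical sort order them by length.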
Step 4. Sort the lines.
Select all of the text (Ctrl+A). Via the TextFX menu, go to TextFX > TextFX Tools > Sort lines case sensitive (at column). Your data should now be in length order, from shortest to longest. If you want them in order from longest to shortest, uncheck the TextFX > TextFX Tools > + Sort ascending option before sorting. Note how line numbers are reversed as well.
Step 5. Remove leading spaces.
Use another Regular Expression Find/Replace (Ctrl+H) to match the leading spaces.
That's a space between the caret and asterisk. Replace all with nothing. That will remove all leading spaces and the inserted line numbers, if you had 4-digit line numbers. Replace the {4} with the correct number of digits in your line numbers.
MACRO
I recorded the above steps using Notepad++'s macro feature, and it doesn't work. I'm not sure which step fails, and I haven't diagnosed why. You could probably use AutoHotKey to automate this if you do it repeatedly.
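If the goal is only the sorted output rather than staying inside Notepad++, the five steps can be mirrored in a short script. The following is a sketch in Python, not a recording of what the editor does; the function name sort_by_length is invented here, and it assumes the lines have already been read into a list.

```python
def sort_by_length(lines: list[str]) -> list[str]:
    if not lines:
        return []
    # Step 1: prefix each line with a zero-padded line number to remember the original order.
    digits = len(str(len(lines)))
    numbered = [f"{i:0{digits}d}{line}" for i, line in enumerate(lines, start=1)]
    # Steps 2 and 3: left-pad with spaces so every line has the same width
    # (this stands in for "insert many spaces, then trim to the right-most N characters").
    width = max(len(entry) for entry in numbered)
    padded = [entry.rjust(width) for entry in numbered]
    # Step 4: a plain case-sensitive sort. Spaces sort before digits, so shorter lines
    # come first, and equal-length lines fall back to their line numbers.
    padded.sort()
    # Step 5: drop the leading spaces, then the line-number prefix.
    return [entry.lstrip(" ")[digits:] for entry in padded]


print(sort_by_length(["alligator", "dog", "cat", "horse"]))
# ['dog', 'cat', 'horse', 'alligator']
```

In Python itself the same result comes from sorted(lines, key=len), because the built-in sort is stable; the longer version is only there to follow the editor steps one by one.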