Linux – Sort all files in a folder independently, with an output file for each

bash, command-line, cygwin, linux, sorting

I have several folders that contain numerous text files, ranging from tens to hundreds. These text files are simple databases containing millions of lines, with each line holding a single record. However, the records in them are unsorted and contain many duplicates. I'd like to sort and de-duplicate each file individually (i.e. independently of the others), but to my understanding, sort only produces a concatenated output of all its input files – that is, even if given multiple files, it will produce one output file containing the combined results of all of them.

How can I sort all files in the current folder to produce an individually sorted output file for each one? I'd also like the output files to be written to a subfolder within the current directory. A for loop is the obvious solution to me, but I'm asking here in case there's some simpler way to do this with sort that I've not come across or missed. My bash knowledge is also very lacking, so if a for loop is the simplest solution, I'd appreciate someone providing the best way to go about it rather than me spending many days hacking something together that would still fall short of what I want to do.

Best Answer

Yes, you can do this with for. Even if there were "some simpler way to do this with sort" (I don't think there is), the loop is also quite simple:

# cd to the directory you want to process

mkdir sorted
for file in *; do
   printf 'Processing %s\n' "$file"
   [ -f "$file" ] && sort -u "$file" > "./sorted/$file"
done

Notes:

  • for file in * doesn't process files in subdirectories.
  • printf is only there to report progress. Strictly it should be placed after the [ ... ] test (see below), but I don't want to overcomplicate the code. You can simply remove the printf line if you want the whole thing to be silent.
  • [ -f "$file" ] checks that $file is a regular file. With the most general pattern (i.e. *) we need this condition at least to keep sort from being run with the sorted directory as an argument (that would throw an error, harmless but inelegant). Most likely the test won't be needed if you use a more specific glob like *.txt or *.db instead of * (e.g. to skip a stray desktop.ini file that shouldn't be processed); in that case you can omit [ ... ] && and start the line with sort (though leaving the line intact shouldn't hurt). A variant along these lines is sketched after this list.
  • sort supports various options and you may want to use some of them, depending on how you need to sort.

  • sort -u removes duplicates as part of the sort itself; since you're already running sort, it's a less redundant alternative to piping the output through the separate uniq command.
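
Putting several of these notes together, here's a variant sketch; the *.txt glob and the comma-separated sort key mentioned in the comment are purely illustrative assumptions, not something from the question:

mkdir sorted
for file in *.txt; do                  # assumed extension, adjust to your data
   [ -f "$file" ] || continue          # skips the literal '*.txt' if nothing matches
   printf 'Processing %s\n' "$file"    # progress report, now after the test
   sort -u "$file" > "./sorted/$file"  # e.g. sort -t ',' -k 2 -u for CSV-like records
done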

If you needed to pick files according to conditions more complex than a simple glob, find might be a better starting point. For your current task, for should be fine.
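
For example, here's a minimal sketch of the find-driven approach, assuming bash and GNU or BSD find; the size condition is just a placeholder for whatever criterion you actually need:

mkdir -p sorted   # -p: no error if the directory already exists
# -print0 with read -d '' keeps filenames containing spaces or newlines intact
find . -maxdepth 1 -type f -size +1M -print0 |
while IFS= read -r -d '' file; do
   sort -u "$file" > "./sorted/${file##*/}"   # strip the leading ./ from the path
done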
