Linux – Sort all files in a folder independently, with an output file for each

bash, command-line, cygwin, linux, sorting

I have several folders that contain numerous text files, ranging from tens to hundreds. These text files are simple databases containing millions of lines, with each line holding a single record. However, the records in them are unsorted and contain many duplicates. I'd like to sort and de-duplicate each file individually (i.e. independently of the others), but to my understanding, sort only produces a concatenated output of all its input files – that is, even if given multiple files, it will produce one output file containing the combined results of all of them.

How can I sort all files in the current folder to produce an individually sorted output file for each one? I'd also like the output files to be written to a subfolder within the current directory. A for loop is the obvious solution to me, but I'm asking here in case there's some simpler way to do this with sort that I've not come across or missed. My bash knowledge is also very lacking, so if a for loop is the simplest solution, I'd appreciate someone providing the best way to go about it rather than me spending many days hacking something together that would still fall short of what I want to do.

Best Answer

Yes, you can do this with for. Even if there were "some simpler way to do this with sort" (I don't think there is), the loop is also quite simple:

# cd to the directory you want to process

mkdir sorted
for file in *; do
   printf 'Processing %s\n' "$file"
   [ -f "$file" ] && sort -u "$file" > "./sorted/$file"
done

Notes:

  • for file in * doesn't process files in subdirectories.
  • printf is only there to report progress. Strictly it should be placed after the [ ... ] test (see below), but I don't want to overcomplicate the code. You can simply remove the printf line if you want the whole thing to be silent.
  • [ -f "$file" ] checks that $file is a regular file. With the most general pattern (i.e. *) we need this condition at least to keep sort from being run with the sorted directory as an argument (that would throw an error, harmless but inelegant). Most likely the test won't be needed if you use a more specific glob like *.txt or *.db instead of * (e.g. to skip a stray desktop.ini file that shouldn't be processed); in that case you can omit [ ... ] && and start the line with sort (though leaving the line intact shouldn't hurt). A variant along these lines is sketched after this list.
  • sort supports various options and you may want to use some of them, depending on how you need to sort.

  • sort -u removes duplicates as part of the sort itself; since you're already running sort, it's a less redundant alternative to piping the output through the separate uniq command.
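
Putting several of these notes together, here's a variant sketch; the *.txt glob and the comma-separated sort key mentioned in the comment are purely illustrative assumptions, not something from the question:

mkdir sorted
for file in *.txt; do                  # assumed extension, adjust to your data
   [ -f "$file" ] || continue          # skips the literal '*.txt' if nothing matches
   printf 'Processing %s\n' "$file"    # progress report, now after the test
   sort -u "$file" > "./sorted/$file"  # e.g. sort -t ',' -k 2 -u for CSV-like records
done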

If you needed to pick files according to conditions more complex than a simple glob, find might be a better starting point. For your current task, for should be fine.
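
For example, here's a minimal sketch of the find-driven approach, assuming bash and GNU or BSD find; the size condition is just a placeholder for whatever criterion you actually need:

mkdir -p sorted   # -p: no error if the directory already exists
# -print0 with read -d '' keeps filenames containing spaces or newlines intact
find . -maxdepth 1 -type f -size +1M -print0 |
while IFS= read -r -d '' file; do
   sort -u "$file" > "./sorted/${file##*/}"   # strip the leading ./ from the path
done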
