I have several folders, each containing numerous text files, ranging from tens to hundreds. These text files are simple databases containing millions of lines, with each line holding a single record. However, the records are unsorted and contain many duplicates. I'd like to sort and de-duplicate each file individually (i.e. independently of the others), but to my understanding, sort can only produce a concatenated output of all its input files – that is, even if given multiple files, it will produce one output containing the combined results of all of them.
How can I sort all files in the current folder so that each one gets its own individually sorted output file? I'd also like the output files to go into a subfolder within the current directory. A for loop is the obvious solution to me, but I'm asking here in case there's some simpler way to do this with sort that I've not come across or missed. My bash knowledge is also very lacking, so if a for loop is the simplest solution, I'd appreciate someone providing the best way to go about it rather than me spending many days hacking together something that would still fall short of what I want to do.
Best Answer
Yes, you can do this with for. Even if there were some simpler way to do this with sort (but I don't think so), a for loop is also quite simple: loop over the files, and for each regular file run sort -u on it, writing the result into the subfolder.

Notes:
- for file in * doesn't process files in subdirectories.
- printf is only there to report progress. Strictly it should be placed after [ ... ] (see below), but I don't want to overcomplicate the code. You can simply remove the printf line if you want the whole thing to be silent.
- [ -f "$file" ] checks whether $file is a regular file. With the most general pattern (i.e. *) we need this condition at least to avoid running sort with the sorted directory as an argument (that would throw an error – harmless, but inelegant). Most likely this test won't be needed if you use a more specific glob like *.txt or *.db instead of * (e.g. to skip a stray desktop.ini file that shouldn't be processed). In that case you can omit [ ... ] && and start the line with sort (though leaving the line intact shouldn't hurt).
- sort supports various options and you may want to use some of them, depending on how you need to sort.
- sort -u de-duplicates entries as it sorts them; when you're already using sort, it's a less redundant alternative to piping through the uniq command.
- If you needed to pick files according to conditions more complex than a simple glob, find might be a better starting point. For your current task, for should be fine.
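Putting the notes above together, the loop presumably looked something like the sketch below. This is a reconstruction, not the original snippet; the sorted output-folder name follows the notes. The temporary directory and sample files are demo scaffolding so the sketch runs as-is.

```shell
# Demo setup (hypothetical sample data) so the sketch is self-contained.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'b\na\nb\n' > one.txt
printf 'z\nz\ny\n' > two.txt

mkdir -p sorted                        # subfolder for the output files
for file in *
do
    printf 'Processing %s\n' "$file"   # progress report only
    # Skip anything that isn't a regular file (e.g. the sorted directory),
    # then sort and de-duplicate into the subfolder.
    [ -f "$file" ] && sort -u "$file" > "sorted/$file"
done
```

With a more specific glob such as *.txt, the [ -f "$file" ] && guard could be dropped, as the notes mention.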
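For the find-based route mentioned above, one possible shape (an illustration with assumed names, not part of the original answer) selects only regular *.txt files in the current directory and applies the same sort -u treatment:

```shell
# Hypothetical find variant: demo setup in a temporary directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'b\na\nb\n' > data.txt

mkdir -p sorted
# -maxdepth 1 keeps it to the current directory; -type f replaces the
# [ -f ... ] guard; the embedded sh loop writes each result into sorted/.
find . -maxdepth 1 -type f -name '*.txt' -exec sh -c '
  for f; do sort -u "$f" > "sorted/${f##*/}"; done
' sh {} +
```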