I am looking for a way to determine file types in a folder with thousands of files. The file names do not reveal much and have no extensions, but the files are of different types. Specifically, I am trying to determine whether a file is a sqlite database.
When using the file command, it determines the type of 2-3 files per second. This seems like a good way to address the problem, except it is too slow.
Then I tried opening each file with sqlite3 and checking to see if I get an error. That way, I can check 4-5 files per second. Much better, but I think that there might be a better way to do this.
Best Answer
2-3 files per second tested with file seems very slow to me. The file command actually performs a number of different tests to try and determine the file type. Since you are looking for one particular type of file (sqlite), and you don't care about identifying all the others, you can experiment on a known sqlite file to determine which test actually identifies it. You can then exclude the others using the -e flag, and run against your full file set. See the man page:
Edit: I tried some tests myself. Summary:
Excluding the other tests speeds up file by about 15% when testing for sqlite. Which is something, but not the huge improvement I expected.
... file, or...?
For a 16MB sqlite DB file, I did:
Timing on the command line:
Trying the different test excludes, and assuming the determination is made based on a single test, it is the 'soft' (i.e. magic file lookup) test which identifies the file. Accordingly, I modified the file command to exclude all the other tests:
Running this 1000 times:
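Since the magic lookup is the test that matters here, a similar check can be sketched without file at all. SQLite 3 databases begin with the header string "SQLite format 3", so comparing the first bytes of each file against that string is enough for a first pass. Note that this is a swapped-in technique, not the file command timed above, and the function names is_sqlite3 and list_sqlite3 are made up for illustration. It only inspects the header, so it does not prove the database is intact.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file starts with the SQLite 3 header string.
is_sqlite3() {
    # Read the first 15 bytes; strip NUL bytes so the shell does not
    # complain when the input turns out to be binary.
    header=$(head -c 15 -- "$1" 2>/dev/null | tr -d '\000')
    [ "$header" = "SQLite format 3" ]
}

# Hypothetical helper: print every regular file in a directory
# (default: current directory) whose header matches.
list_sqlite3() {
    for f in "${1:-.}"/*; do
        [ -f "$f" ] && is_sqlite3 "$f" && printf '%s\n' "$f"
    done
    return 0
}
```

Calling list_sqlite3 /path/to/folder prints one matching path per line. Anything it reports can still be confirmed afterwards with file or sqlite3 if a stricter check is needed.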