Testing 2-3 files per second with the file command seems very slow to me. The file command actually performs a number of different tests to try to determine the file type. Since you are looking for one particular type of file (sqlite), and you don't care about identifying all the others, you can experiment on a known sqlite file to determine which test actually identifies it. You can then exclude the others using the -e flag, and run against your full file set. See the man page:
    -e, --exclude testname
        Exclude the test named in testname from the list of tests made to
        determine the file type. Valid test names are:

        apptype   EMX application type (only on EMX).
        text      Various types of text files (this test will try to guess the
                  text encoding, irrespective of the setting of the ‘encoding’
                  option).
        encoding  Different text encodings for soft magic tests.
        tokens    Looks for known tokens inside text files.
        cdf       Prints details of Compound Document Files.
        compress  Checks for, and looks inside, compressed files.
        elf       Prints ELF file details.
        soft      Consults magic files.
        tar       Examines tar files.
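To find out which test identifies a given file type, one option is to run file once per test with every other test excluded. The following is a minimal sketch of that idea, not a definitive tool. The helper names exclude_all_but and probe_tests are made up for illustration, and the test-name list is an assumption taken from the command used in this answer, so check it against the man page of your own version of file.

```shell
#!/bin/bash
# Test names as used in the file command later in this answer, plus 'soft'.
# Assumption: your version of file accepts these names; check its man page.
TESTS=(apptype ascii encoding tokens cdf compress elf soft tar)

# Print the "-e name" flags that exclude every test except the one to keep.
exclude_all_but() {
    local keep=$1 t
    local -a flags=()
    for t in "${TESTS[@]}"; do
        [[ $t == "$keep" ]] || flags+=(-e "$t")
    done
    # printf rather than echo, because echo would swallow a leading -e.
    printf '%s\n' "${flags[*]}"
}

# Run file once per test, with only that test enabled, on a known sample.
probe_tests() {
    local sample=$1 t
    for t in "${TESTS[@]}"; do
        printf '%-9s ' "$t"
        # Word splitting of the flag list is intentional here.
        file -b $(exclude_all_but "$t") "$sample"
    done
}
```

Calling probe_tests on a known-good sample (for example a hypothetical known.db) prints one line per test name, so you can see which single test still reports the type you are after.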
Edit: I tried some tests myself. Summary:
- Applying my advice with the right flags can speed up file by about 15% when testing for sqlite. That is something, but not the huge improvement I expected.
- Your file tests are really slow. I did 500 on a standard machine in the time you did 2-3. Are you on slow hardware, checking enormous files, running an ancient version of file, or...?
- You must keep the 'soft' test to successfully identify a file as sqlite.
For a 16MB sqlite DB file, I did:
#!/bin/bash
for i in {1..1000}
do
    file sqllite_file.db | tail > out
done
Timing on the command line:
~/tmp$ time ./test_file_times.sh; cat out
real 0m2.424s
user 0m0.040s
sys 0m0.288s
sqllite_file.db: SQLite 3.x database
Trying the different test excludes, and assuming the determination is made based on a single test, it is the 'soft' (i.e. magic file lookup) test which identifies the file. Accordingly, I modified the file command to exclude all the other tests:
file -e apptype -e ascii -e encoding -e tokens -e cdf -e compress -e elf -e tar sqllite_file.db | tail > out
Running this 1000 times:
~/tmp$ time ./test_file_times.sh; cat out
real 0m2.119s
user 0m0.060s
sys 0m0.280s
sqllite_file.db: SQLite 3.x database
People had been writing scripts (and possibly C programs) to run file on a file, capturing the output with $(file foobar) or popen(), and doing a string match check to see whether the output from file contained (or ended with) the word “text”.
Then the developers of the Berkeley Software Distribution (at the University of California, Berkeley) did as described and caused all those scripts and programs not to recognize shell script files as text files.
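The kind of check described above can be sketched in a few lines of shell. This is only an illustration of the pattern, not a reconstruction of any particular historical script. The function names are hypothetical, and the description strings in the usage note are examples chosen for illustration.

```shell
#!/bin/bash
# Succeed if a file(1) description ends with the word "text".
desc_ends_with_text() {
    case $1 in
        *text) return 0 ;;
        *)     return 1 ;;
    esac
}

# Succeed if the description contains "text" anywhere.
desc_contains_text() {
    case $1 in
        *text*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Hypothetical wrapper showing how such a script might call file:
# -b prints the description without the leading file name.
is_text_file() {
    desc_ends_with_text "$(file -b "$1")"
}
```

With a description such as "ASCII text" both checks succeed. A description such as "shell script" fails both, which is how a change in the wording of file's output can silently break callers that rely on string matching.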
Best Answer
The magic(5) manual page says only (referring to this as a datatype):
and libmagic's association of ID3 tags with MP3 has been noticed (e.g., in the discussion "libmagic for MP3 can go horribly wrong") since the feature was added in 2008:
The ID3 format stores the tag length as a special 32-bit integer (which is the length you are asking about):
The ID3v2 tag size is stored as a 32 bit synchsafe integer (section 6.2), making a total of 28 effective bits (representing up to 256MB).
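To make the synchsafe encoding concrete: each of the four bytes contributes only its low 7 bits, which gives 4 × 7 = 28 effective bits. The sketch below decodes such a value in shell. The function names are made up for illustration. It assumes the standard ID3v2 header layout, with the size in bytes 6 to 9, and it does not check that the file actually begins with "ID3".

```shell
#!/bin/bash
# Decode a 4-byte synchsafe integer given as four decimal byte values.
# Each byte contributes its low 7 bits, most significant byte first.
synchsafe_to_int() {
    local b0=$1 b1=$2 b2=$3 b3=$4
    echo $(( (b0 & 0x7f) << 21 | (b1 & 0x7f) << 14 | (b2 & 0x7f) << 7 | (b3 & 0x7f) ))
}

# Read the size field from a file assumed to start with an ID3v2 header.
id3v2_tag_size() {
    # od prints bytes 6-9 as unsigned decimals; word splitting is
    # intentional so they become four separate arguments.
    synchsafe_to_int $(od -An -tu1 -j6 -N4 "$1")
}
```

For example, synchsafe_to_int 0 0 2 1 prints 257, and the largest value, synchsafe_to_int 127 127 127 127, prints 268435455, which is 2^28 - 1 and matches the "up to 256MB" limit quoted above.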