Shell – check patterns that don’t exist in sqlite

I described a similar situation with plain text files in "Grep huge number of patterns from huge file". Many people there said I should migrate my data to a sqlite database, so that is what I'm doing now:

I have a file from which I extract about 10,000 patterns. Then I check whether the database contains each pattern. If it doesn't, I need to save the pattern to an external file for further processing:

for id in $(grep '^[0-9]' keys); do
  if [[ -z $(sqlite3 db.sqlite "select id from main where id = $id") ]]; then
    echo "$id" >>file
  fi
done

Since I'm new to SQL, I couldn't find a simple way to do this. Also, this loop is too slow to be useful: it is 20 times slower than what I achieved with awk in the question mentioned above.

Since the database is huge, keeps growing, and I run this loop very frequently, is it possible to make this faster?

Best Answer

For each pattern, you're invoking a new instance of the sqlite program which connects to the database anew. That's a waste. You should build a single query that looks for any of the keys, then execute that one query. Database clients are good at executing large queries.

If the matching lines in the keys file only contain digits, then you can build the query as follows:

{
  echo 'select id from main where id in ('
  <keys grep -x '[0-9][0-9]*' |     # retain only lines containing only digits
  sed -e '1! s/^/, /'               # add ", " at the beginning of every line except the first
  echo ');'
} | sqlite3 db.sqlite

For more general input data, you get the idea: use text transformations to build a single large query. Be careful to validate your input; here we make sure that what gets injected into the query is syntactically valid. There's a corner case in the example above: if no line in the file matches, the IN list is empty. SQLite accepts an empty list, but standard SQL and most other database engines do not; if that matters to you, you'll need to treat this case specially. Here's more complex code that takes care of the empty case:

<keys grep -x '[0-9][0-9]*' |
if read first; then {
    echo 'select id from main where id in (' "$first"
    sed -e 's/^/, /'
    echo ');'
  } | sqlite3 db.sqlite
fi
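
Note that the queries above return the ids that are present in the database, while the question asks for the ones that are missing. Here is a minimal sketch of one way to get the missing ids in a single sqlite3 call, assuming id is stored as an integer in main: load the keys into a temporary table and let SQL compute the set difference with EXCEPT. The function name build_missing_query and the table name wanted are made up for this illustration.

```shell
# Hypothetical helper: print SQL that lists the keys from file "$1"
# that are absent from table "main" (column "id"), as in the question.
build_missing_query() {
  echo 'CREATE TEMP TABLE wanted(id);'
  echo 'BEGIN;'
  # Keep only all-digit lines, so nothing else is injected into the SQL,
  # and turn each one into an INSERT statement.
  grep -x '[0-9][0-9]*' "$1" |
    sed -e 's/.*/INSERT INTO wanted VALUES (&);/'
  echo 'COMMIT;'
  # EXCEPT returns the distinct ids in "wanted" that are not in "main".
  echo 'SELECT id FROM wanted EXCEPT SELECT id FROM main;'
}
```

To run it against the database from the question: build_missing_query keys | sqlite3 db.sqlite >>file. Wrapping the inserts in one transaction should keep the load fast, and an empty key list is harmless here because no IN list is involved.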

Related Solutions

Ssh – perform remote sqlite command

It all comes down to quoting. Try this one:

ssh aaron@10.1.150.53 'sqlite3 /home/aaron/dbname.db "UPDATE data SET \
LastStart = DATETIME('\''NOW'\'') WHERE TaskName = '\''taskname'\''"'

P.S. You need to quote NOW, otherwise sqlite will try to find a column with that name. But a plain ' would simply end the single-quoted ssh argument, and a ' cannot be escaped inside single quotes. Therefore '\'' is used: the first ' closes the ssh quote, \' is the literal quote you need to pass to sqlite, and the last ' opens the ssh quote again.

P.P.S. Alternatively, you can invert the quotes like this:

ssh aaron@10.1.150.53 "sqlite3 /home/aaron/dbname.db \"UPDATE data SET \
LastStart = DATETIME('NOW') WHERE TaskName = 'taskname'\""
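
A further option, sketched here and not part of the answer above, is to avoid nested quoting altogether by sending the SQL on standard input: sqlite3 reads its commands from stdin when no SQL argument is given, and a quoted here-document is passed through untouched by the local shell. The function name remote_sql is hypothetical; the host and path are the ones from the examples above.

```shell
# Hypothetical helper: run SQL read from stdin on a remote sqlite3 database.
# usage: remote_sql user@host /path/to/db <<'EOF' ... EOF
# Assumes the database path contains no spaces or shell metacharacters,
# since ssh hands its arguments to the remote shell as one string.
remote_sql() {
  ssh "$1" sqlite3 "$2"
}

# Example call (commented out so that pasting this does not open a connection):
# remote_sql aaron@10.1.150.53 /home/aaron/dbname.db <<'EOF'
# UPDATE data SET LastStart = DATETIME('NOW') WHERE TaskName = 'taskname';
# EOF
```

Because the here-document delimiter is quoted ('EOF'), the single quotes around NOW and taskname reach sqlite exactly as written.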

Fast way to determine if a file is a SQLite database

2-3 files per second tested with file seems very slow to me. file actually performs a number of different tests to try and determine the file type. Since you are looking for one particular type of file (sqlite), and you don't care about identifying all the others, you can experiment on a known sqlite file to determine which test actually identifies it. You can then exclude the others using the -e flag, and run against your full file set. See the man page:

 -e, --exclude testname
         Exclude the test named in testname from the list of tests made to
         determine the file type. Valid test names are:

         apptype
            EMX application type (only on EMX).
         text
            Various types of text files (this test will try to guess the
            text encoding, irrespective of the setting of the ‘encoding’
            option).
         encoding
            Different text encodings for soft magic tests.
         tokens
            Looks for known tokens inside text files.
         cdf
            Prints details of Compound Document Files.
         compress
            Checks for, and looks inside, compressed files.
         elf
            Prints ELF file details.
         soft
            Consults magic files.
         tar
            Examines tar files.

Edit: I tried some tests myself. Summary:

Applying my advice with the right flags speeds up file by about 15% when testing for sqlite. That is something, but not the huge improvement I expected.
Your file tests are really slow. I did 500 on a standard machine in the time you did 2-3. Are you on slow hardware, or checking enormous files, running an ancient version of file, or...?
You must keep the 'soft' test to successfully identify a file as sqlite.

For a 16MB sqlite DB file, I did:

#!/bin/bash
for  i in {1..1000}
do
    file sqllite_file.db | tail > out
done

Timing on the command line:

~/tmp$ time ./test_file_times.sh; cat out

real    0m2.424s
user    0m0.040s
sys     0m0.288s
sqllite_file.db: SQLite 3.x database

Trying the different test excludes, and assuming the determination is made based on a single test, it is the 'soft' (i.e. magic file lookup) test which identifies the file. Accordingly, I modified the file command to exclude all the other tests:

file -e apptype -e ascii -e encoding -e tokens -e cdf -e compress -e elf -e tar sqllite_file.db | tail > out

Running this 1000 times:

~/tmp$ time ./test_file_times.sh; cat out

real    0m2.119s
user    0m0.060s
sys     0m0.280s
sqllite_file.db: SQLite 3.x database
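
If the goal is only to tell SQLite 3 files apart from everything else, a cheaper check than file may be possible, because a SQLite 3 database begins with the 16-byte header string "SQLite format 3" followed by a NUL byte. The following is a minimal sketch of that idea, not the approach timed above; the function name is_sqlite3 is made up, and head -c is assumed to be available, as it is in GNU and BSD head.

```shell
# Succeed if file "$1" starts with the SQLite 3 header string.
# Only the 15 printable bytes are compared; the 16th byte (NUL) is skipped
# because shell variables cannot hold NUL bytes.
is_sqlite3() {
  [ "$(head -c 15 "$1")" = 'SQLite format 3' ]
} 2>/dev/null   # silence errors for unreadable files and NUL-byte warnings

# Example: list the SQLite databases in the current directory.
for f in ./*; do
  if is_sqlite3 "$f"; then
    printf '%s\n' "$f"
  fi
done
```

This still starts one head process per file, but it reads only 15 bytes and skips all of the other tests that file performs.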
