SQLite Database Design – Storing Pandas DataFrames with Metadata

database-design, python, sqlalchemy, sqlite

I need some help with designing a database. My aim is to persistently store a number of pandas DataFrames in a searchable way, and from what I've read SQLite is perfect for this task.

Each DataFrame contains about a million rows of particle movement data like this:

              z            y            x  frame  particle
0     49.724138    45.642857   813.035714      0         0
3789  14.345679  2820.537500  4245.162500      0         1
3788  10.692308  2819.210526  1646.842105      0         2
3787  34.100000  2817.700000  1375.300000      0         3
3786   8.244898  2819.729167  1047.375000      0         4

Using SQLAlchemy I can already store each DataFrame as a table in a new database:

from sqlalchemy import create_engine
import pandas as pd


engine = create_engine("sqlite:////mnt/storage/test.db")

exp1.to_sql("exp1", engine, if_exists="replace")
exp2.to_sql("exp2", engine, if_exists="replace")
exp3.to_sql("exp3", engine, if_exists="replace")

But this is too basic. How can I store each DataFrame/experiment with a couple of metadata fields like Name, Date in such a way that later on it's possible to return all experiments conducted by a certain person, or on a specific date?

I will add more columns over time. Assuming each DataFrame/experiment has a column Temperature, how could I retrieve all experiments where the mean temperature is below or above an arbitrary threshold?

Best Answer

You've created 3 separate tables. If you want to unify the data, you shouldn't be forcibly overwriting the target tables with if_exists="replace".

From the .to_sql() documentation:

  • replace: Drop the table before inserting new values.

  • append: Insert new values to the existing table.

Assuming your similarly named DataFrames have the same schema, you can edit the last three lines as follows.

exp1.to_sql("exp", engine, if_exists="append")
exp2.to_sql("exp", engine, if_exists="append")
exp3.to_sql("exp", engine, if_exists="append")

This will insert all three datasets into a single table named exp instead of three separate tables.
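If the DataFrames are gathered in a dict (or list), the same appends can be written as a loop - a minimal sketch with an in-memory engine and made-up stand-in frames, just to illustrate the pattern:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite for illustration; swap in your file-based URL.
engine = create_engine("sqlite://")

# Hypothetical stand-ins for exp1/exp2/exp3.
frames = {
    "exp1": pd.DataFrame({"Name": ["Peter"], "Temperature": [50]}),
    "exp2": pd.DataFrame({"Name": ["Paul"], "Temperature": [56]}),
    "exp3": pd.DataFrame({"Name": ["Mary"], "Temperature": [55]}),
}

# Append every frame to one shared table instead of one table each.
for name, df in frames.items():
    df.to_sql("exp", engine, if_exists="append", index=False)

print(pd.read_sql("SELECT COUNT(*) AS n FROM exp", engine)["n"][0])  # 3
```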

If each CSV can't be distinguished from the others by its own contents - for example, if exp1.csv looks like this...

Name,Date,Temperature
Peter,2020-01-01,50
Paul,2020-01-01,55
Mary,2020-01-01,53
Jane,2020-01-01,49

...then you can add the experiment identifier to each DataFrame as a new column. For example...

>>> exp1['ExpName'] = 'exp1'
>>> exp1
    Name        Date  Temperature ExpName
0  Peter  2020-01-01           50    exp1
1   Paul  2020-01-01           55    exp1
2   Mary  2020-01-01           53    exp1
3   Jane  2020-01-01           49    exp1
>>>

...which will allow you to group by experiment in any follow-on SQL you may run against your database.
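The question also asks about metadata fields like Name and Date being searchable (e.g. "all experiments conducted by a certain person"). One common pattern is a second metadata table keyed by the same ExpName identifier, joined at query time. A sketch - the table name experiments and the columns Operator/Date here are assumptions, not anything from the original post:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # in-memory for illustration

# Hypothetical metadata table: one row per experiment.
meta = pd.DataFrame({
    "ExpName": ["exp1", "exp2"],
    "Operator": ["Peter", "Paul"],
    "Date": ["2020-01-01", "2020-01-02"],
})
meta.to_sql("experiments", engine, if_exists="replace", index=False)

# Measurement rows tagged with the same ExpName key.
data = pd.DataFrame({
    "ExpName": ["exp1", "exp1", "exp2"],
    "Temperature": [50, 52, 56],
})
data.to_sql("exp", engine, if_exists="replace", index=False)

# Join metadata to measurements, e.g. all experiments run by Peter.
query = """
    SELECT e.ExpName, m.Operator, m.Date
    FROM exp AS e
    JOIN experiments AS m ON e.ExpName = m.ExpName
    WHERE m.Operator = 'Peter'
    GROUP BY e.ExpName
"""
print(pd.read_sql(query, engine))
```

The same join filtered on m.Date instead of m.Operator answers the "on a specific date" part of the question.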

...how could I retrieve all experiments where the mean temperature value is below or above an arbitrary threshold?

...well given an arbitrary additional two datasets of...

➜  /tmp cat exp2.csv
Name,Date,Temperature
Peter,2020-01-02,51
Paul,2020-01-02,56
Mary,2020-01-02,54
Jane,2020-01-02,50
➜  /tmp cat exp3.csv
Name,Date,Temperature
Peter,2020-01-02,52
Paul,2020-01-02,57
Mary,2020-01-02,55
Jane,2020-01-02,51
➜  /tmp

...to which you likewise appended the expN identifier in the DataFrame, you would run the following SQL to retrieve the experiments whose average temperature is below 53:

SELECT 
    ExpName,
    AVG(Temperature)
FROM exp
GROUP BY ExpName
HAVING AVG(Temperature) < 53;

Which I'll leave to you to plug into SQLAlchemy as you like :)
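For completeness, wiring that query through pandas/SQLAlchemy could look like the sketch below. It rebuilds the unified exp table from the sample CSV values above, using an in-memory engine (point it at your test.db in practice):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # in-memory; use your file URL in practice

# The unified table, rebuilt from the three sample datasets above.
pd.DataFrame({
    "ExpName": ["exp1"] * 4 + ["exp2"] * 4 + ["exp3"] * 4,
    "Temperature": [50, 55, 53, 49, 51, 56, 54, 50, 52, 57, 55, 51],
}).to_sql("exp", engine, if_exists="replace", index=False)

query = """
    SELECT ExpName, AVG(Temperature) AS AvgTemp
    FROM exp
    GROUP BY ExpName
    HAVING AVG(Temperature) < 53
"""
result = pd.read_sql(query, engine)
print(result)  # exp1 (avg 51.75) and exp2 (avg 52.75); exp3 (53.75) is excluded
```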