Postgresql – Storing repetetive file metadata in databse

database-designpostgresql

I have a situation where I want to store data like the following.

Path: String
isDirectory: Bool
isSymLink: Bool
linksTo: File 
lastmodifiedData: Long
size: Long,
permissions: String

I will be storing this kind of data for many differet directory structures that very often have the same file metadata.

I do not want to have many repeated entries in the database, but I also do not want to have a 5 column Primary key.

I was considering a table, lets call it FILE_BASE

Path: String
isDirectory: Bool
isSymLink: Bool
linksTo: File

and another called FILE_INSTANCE

directory-root:String
fileBase:            // should this be an ID, or a multi-column FK
lastmodifiedData: Long
size: Long, 
permissions: String

I am trying to understand the best way to go about creating a key for the first table that can be referenced in the second table.

If I use a composite key, then basically all of the columns in the first table are part of the primary key.

If I use surrogate keys, then I need to do a query before every insert to determine if the file(s) base that i want to add already exists in the table, and recover the surrogate ID. If I am dealing with thousands of files per insert, then this could get cumbersome quickly.

I could generate the ID's myself using some sort of UUID algorithm on a concatenation of the table columns, but that could potentially generate some pretty big ID's for many of the file bases, depending on their path in the file system.

I feel like this is just a trade-off between space and time complexity. I also feel like I am overthinking this, as I do not have a lot of database experience. Luckily this is just a test project that will likely get thrown away anyway, but I could use some input, or suggestions on reading so that I can understand what a good implementation would look like.

Is there something obvious that I am missing here?

–Edited: Added directory root identifier to second table.

Best Answer

I will be storing this kind of data for many different directory structures that very often have the same file metadata. I do not want to have many repeated entries in the database, but I also do not want to have a 5 column Primary key.

In order to normalize, you need to do is accommodate for LVM, and partition information. Then you should only have to keep inode, and meta data, (stat and lsattr if you want to be super anal). You may or may not want emulate hierarchy in path. We do in this example. The repetitive meta-data isn't an issue.

CREATE TABLE physical_volume (
  pv_id      int PRIMARY KEY,
  hw_uud     uuid
);

CREATE TABLE volume_group (
  vg_id int PRIMARY KEY
);

CREATE TABLE pv_vg (
  pv_id int REFERENCES physical_volume,
  vg_id int REFERENCES volume_group,
);

CREATE TABLE logical_volume (
  lv_id int PRIMARY KEY,
  name  text
);

CREATE TABLE partition (
  partition_id int PRIMARY KEY
  fs           text,
  mount_attr   jsonb,
);

CREATE TABLE files (
  file_id       int PRIMARY KEY
  partition_id  int REFERENCES partition
  inode         int,
  .. repetitive metadata
  UNIQUE (partition_id, inode)
);

CREATE TABLE path (
  path_id            int PRIMARY KEY,
  parent_path        int REFERENCES path
  file_id            int REFERENCES files
  name               text
);

Related Solutions

Postgresql – How to concatenate psql variables

\set path '/tmp/'
\set file 'foo'
\set pf :path:file \echo :pf
/tmp/foo

Why does this work? I quote the manual here:

\set [ name [ value [ ... ] ] ]

Sets the internal variable name to value or, if more than one value is given, to the concatenation of all of them. [...]

Emphasis mine.

PostgreSQL Surrogate Key – Resolving Surrogate Key vs Natural Key Argument

A proper Postgres schema for your example could look like this:

CREATE TABLE selected_task(
  selected_task_id serial PRIMARY KEY  -- surrogate PK
, user_id          int REFERENCES users(user_id)
, task_number      int
, name             text
, jobsite_id       int REFERENCES site(jobsite_id)
);

CREATE TABLE sub_task(
  sub_task_id      serial PRIMARY KEY
, selected_task_id int REFERENCES selected_task  -- this is how
, assignee         int REFERENCES users(assignee)
, time_allowed     int
, report_to        int REFERENCES groups(report_to)
);

Use a serial column as surrogate PK.
You can't use reserved words like user or group as identifier.
Some other improvements. Detailed instructions here:
- How to implement a many-to-many relationship in PostgreSQL?

`INSERT` in both tables at the same time?

Use a data-modifying CTE with the RETURNING clause, or two INSERT statements in a transaction with lastval() (which is the Postgres equivalent for SCOPE_IDENTITY() in SQL server @Aaron mentioned in his comment):

Related answer with code example for either method.

Inserting into related tables

Best Answer

Related Solutions

Postgresql – How to concatenate psql variables

PostgreSQL Surrogate Key – Resolving Surrogate Key vs Natural Key Argument

INSERT in both tables at the same time?

Related Question

`INSERT` in both tables at the same time?