Postgresql – Storing repetetive file metadata in databse

database-designpostgresql

I have a situation where I want to store data like the following.

Path: String
isDirectory: Bool
isSymLink: Bool
linksTo: File 
lastmodifiedData: Long
size: Long,
permissions: String

I will be storing this kind of data for many differet directory structures that very often have the same file metadata.

I do not want to have many repeated entries in the database, but I also do not want to have a 5 column Primary key.

I was considering a table, lets call it FILE_BASE

Path: String
isDirectory: Bool
isSymLink: Bool
linksTo: File 

and another called FILE_INSTANCE

directory-root:String
fileBase:            // should this be an ID, or a multi-column FK
lastmodifiedData: Long
size: Long, 
permissions: String

I am trying to understand the best way to go about creating a key for the first table that can be referenced in the second table.

If I use a composite key, then basically all of the columns in the first table are part of the primary key.

If I use surrogate keys, then I need to do a query before every insert to determine if the file(s) base that i want to add already exists in the table, and recover the surrogate ID. If I am dealing with thousands of files per insert, then this could get cumbersome quickly.

I could generate the ID's myself using some sort of UUID algorithm on a concatenation of the table columns, but that could potentially generate some pretty big ID's for many of the file bases, depending on their path in the file system.

I feel like this is just a trade-off between space and time complexity. I also feel like I am overthinking this, as I do not have a lot of database experience. Luckily this is just a test project that will likely get thrown away anyway, but I could use some input, or suggestions on reading so that I can understand what a good implementation would look like.

Is there something obvious that I am missing here?

–Edited: Added directory root identifier to second table.

Best Answer

I will be storing this kind of data for many different directory structures that very often have the same file metadata. I do not want to have many repeated entries in the database, but I also do not want to have a 5 column Primary key.

In order to normalize, you need to do is accommodate for LVM, and partition information. Then you should only have to keep inode, and meta data, (stat and lsattr if you want to be super anal). You may or may not want emulate hierarchy in path. We do in this example. The repetitive meta-data isn't an issue.

CREATE TABLE physical_volume (
  pv_id      int PRIMARY KEY,
  hw_uud     uuid
);

CREATE TABLE volume_group (
  vg_id int PRIMARY KEY
);

CREATE TABLE pv_vg (
  pv_id int REFERENCES physical_volume,
  vg_id int REFERENCES volume_group,
);

CREATE TABLE logical_volume (
  lv_id int PRIMARY KEY,
  name  text
);

CREATE TABLE partition (
  partition_id int PRIMARY KEY
  fs           text,
  mount_attr   jsonb,
);

CREATE TABLE files (
  file_id       int PRIMARY KEY
  partition_id  int REFERENCES partition
  inode         int,
  .. repetitive metadata
  UNIQUE (partition_id, inode)
);

CREATE TABLE path (
  path_id            int PRIMARY KEY,
  parent_path        int REFERENCES path
  file_id            int REFERENCES files
  name               text
);