Review database schema / thought process

database-designschema

Please review my thoughts below. It is written in an authoritative style because it is the most succinct, not because I am competent. Please poke as many holes as possible.

I am trying to create a database to store graphs. Each graph is represented as a set of Nodes and Relations:

Node (Table)
id (PK)

Relation (Table)
id (PK)
from (FK Node)
to (FK Node)

It should be possible to create multiple graphs, and no information is shared between the graphs. This gives me the following options:

  1. Create a new database for each graph
    • Simple: Pick the right database at the start of the request, and we are done.
    • Scalable: Possible to split the graphs easily across different servers.
  2. Create a new set of tables for each graph, i.e., Node_{graphid} and Relations_{graphid}
    • Simple: Need only a single database.
    • Complex: Must always transform the table names.
  3. Modify the schemes above to have a graphid
    • Simple: No need to transform table names.
    • Complex: Queries are more complex, since we must always account for the graphid.
    • Expensive: Need to index on graphid, or do some other tricks?

A new database for each graph seems the best choice, and is assumed in the discussion below.

Users will create and share graphs, which leads to two levels of access:

Access (Table)
id (PK)
graph (database id for graph)
edit (FK Group)         # allows editing of graph
view (FK Group)         # allows viewing of graph

A user can be part of many groups:

User (Table)
id (PK)

Group (Table)
id (PK)

User_Group (Table)
id (PK, is this really needed?)
user (FK User)
group (FK Group)

Some graphs will be public. Can use either:

  1. NULL means public.
  2. Add every user to public group by default.

Trade offs (approached from NULL means public):

  • Saves storage: but a laughable amount (assuming 1 million users: 1e6 * (8 + 8) / (1024 * 1024) = 15 MB)
  • Complex: Some queries need to check for NULL explicitly (while the same code would handle both public and restricted group otherwise)
  • Simple: Easier to deal with users that do not have an account (only access to graphs with NULL group).
  • Faster: To answer the query: ''Can the user view this particular graph?'', we will not always have to check the User_Group table.
  • Implicit information: NULL has a special meaning that I will need to document somewhere.

Gracefully handling users without account (otherwise I have to always keep in mind what particular ID belongs to the public group), is enough to sway this in favor of 'NULL means public'.

The backend will be written in Python, with either PostgreSQL or MySQL as database management system. Ideally users will be able to edit graphs concurrently.

Best Answer

Recommendations are as follows:

  1. For access control use the groups instead of individual users. Create a group called public to which all users belong therefore any graphs assigned to public will be visible. Look to Role Based Access Control (RBAC) for more on this
  2. Each graph can be accessed by one or more groups, which allows you to have graphs that are visible to different cross sections of users (through groups). This means your access table will be graphid, groupid, edit, view with a unique key constraint on graphid and groupid.
  3. Answering the question can a user edit this graph will require a join from User_Group to Access table
  4. There is no special meaning for nulls, as none exist
  5. Handling users with no account is easier since no authentication is provided
  6. For the graphs, you need a graph table, and also a graphid added to the node and relation tables to simplify queries.