MySQL – Designing Table for Document Annotation with Future Columns

Tags: database-design, eav, mysql, rdbms

I am working on a web application that allows students and researchers to annotate PDF documents shallowly by (manually) counting interesting phenomena. Right now there are about 45 of these phenomena, most of which are nominal values, booleans or integers. The database is a MySQL database, and the number of documents will be between 100 and 500, so the database is quite small.

The application adds these documents and their respective annotations to a corpus which will also be searchable (and filterable based on a large subset of these phenomena).

In the future, some phenomena may be added, and existing ones may be disabled or replaced by automatic tagging.

This could be done by, for example:

1. Altering the table and adding columns via the web application

Ultimately the table might become too big (column-wise), but it is easy to implement, even if it seems clumsy to have a table with 60 columns. I'm not sure if it is as inefficient as it seems.
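For illustration, this is roughly the kind of statement the application would issue whenever a new phenomenon is added (table and column names here are hypothetical):

    ALTER TABLE document_annotation
        ADD COLUMN has_marginalia TINYINT(1) NOT NULL DEFAULT 0;

For what it's worth, 60 columns is nowhere near MySQL's limits (InnoDB allows roughly 1,000 columns per table); the main cost of this approach is operational, since every new phenomenon requires a DDL change.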

2. Storing the annotation part of the web application externally

For example, as JSON documents or in a nonrelational database such as MongoDB. This intuitively makes sense, since what I need for the annotations is a flexible (and thus nonrelational/schemaless) solution. But wiring two databases together might unnecessarily increase the complexity of the application. Also, foreign keys would presumably become impossible this way (?).

3. Storing the annotations as one blob in a MySQL column

I have virtually no experience with BLOBs, as I've always avoided them; I only came across this suggestion in a different question.

4. Storing all/new attributes in an EAV table in the MySQL database

Seems like a hacky way to mimic nonrelational behaviour in a relational database. I also read that it could cause performance issues.
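For concreteness, a minimal EAV layout (all names hypothetical) would store one row per document/phenomenon pair:

    CREATE TABLE annotation (
        document_id   INT NOT NULL,
        phenomenon_id INT NOT NULL,
        value         VARCHAR(255),  -- every type coerced into one string column
        PRIMARY KEY (document_id, phenomenon_id),
        FOREIGN KEY (document_id)   REFERENCES document (id),
        FOREIGN KEY (phenomenon_id) REFERENCES phenomenon (id)
    );

The performance complaints usually stem from filtering: every phenomenon filtered on adds another self-join against this table, and the single value column loses native typing.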

I'd love some advice on which method is better or if there is something entirely different I should have a look at.

Best Answer

JSON documents are ideal for this purpose. MySQL can store JSON documents via its JSON datatype, and it has a variety of functions for extracting information from such documents. However, MySQL doesn't currently have an effective native way of indexing JSON documents for fast search. Googling for "mysql json index" will turn up several strategies for doing so, some using triggers and external tables, others using indexes on generated (computed) columns.
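To sketch the generated-column strategy (names hypothetical; the JSON type requires MySQL 5.7.8 or later), you extract each frequently filtered phenomenon into a generated column and index that:

    CREATE TABLE document (
        id          INT AUTO_INCREMENT PRIMARY KEY,
        annotations JSON,
        -- extracted from the JSON document so it can carry a regular index
        page_count  INT GENERATED ALWAYS AS
            (CAST(annotations->'$.page_count' AS UNSIGNED)) STORED,
        INDEX idx_page_count (page_count)
    );

    -- this filter can now use idx_page_count
    SELECT id FROM document WHERE page_count > 10;

The drawback is that you are back to one schema change per searchable phenomenon, which partly defeats the point of the JSON column.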

PostgreSQL, however, has very good support for JSON datatypes, including indexing, opening documents up as recordsets (LEFT JOIN LATERAL in combination with jsonb_each is very powerful), and GIN indexes that support JSON documents (for both keys and contents). Unless you're heavily invested in MySQL already, PostgreSQL would be a great option for your application. Likewise, consider PostgreSQL before reaching for MongoDB if you would be adopting MongoDB for this purpose alone. Additionally, PostgreSQL can access MySQL tables via Foreign Data Wrappers, it has good support for regex text manipulation, and you can do all sorts of magic on JSON documents through standard Perl libraries using PL/Perl stored procedures.
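A minimal PostgreSQL sketch of that approach (names hypothetical):

    CREATE TABLE document (
        id          serial PRIMARY KEY,
        annotations jsonb
    );

    -- GIN index supporting containment filters, e.g. annotations @> '{"language": "en"}'
    CREATE INDEX idx_document_annotations
        ON document USING GIN (annotations);

    -- open each document's annotations up as key/value rows
    SELECT d.id, a.key, a.value
    FROM document d
    LEFT JOIN LATERAL jsonb_each(d.annotations) AS a ON true;

The containment query form (WHERE annotations @> '{"language": "en"}'::jsonb) is what the GIN index accelerates, which covers the "filter the corpus on a subset of phenomena" requirement directly.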