MySQL – Integrity constraints in a relational database – should we overlook them?

database-design, denormalization, MySQL, normalization, relational-theory

I’m in an ongoing discussion with the developers of the company where I work because they say it is better to get rid of relationship enforcement (via FOREIGN KEY constraint definitions) in a relational database in order to speed up large queries and gain better performance.

The platform under consideration is MySQL 5.x, and no FOREIGN KEY has been set up; even some PRIMARY KEY constraints of the relevant tables are missing, which, at least to me, is not reasonable. Maybe they’re right and I’m wrong, but I don’t have enough arguments to make my case.

This has been the preferred approach for three years now. I’m new in this company (only one month) but, as the product “works”, there is hesitation to enhance the database; nevertheless, the first thing I noticed is one page taking 1 minute to load (yes, 60 seconds!).

One of the claims behind the current state of affairs is that a “denormalized” database is faster than a normalized one, but I don’t believe that’s true.

Most of the relevant queries include JOIN operations, which makes them run very, very, very slow with large amounts of data (the database contains millions of rows).

Commonly, the handling of “CRUD” operations is implemented at the application program code level; for example, in order to DELETE some data FROM, let’s say, TableA:

  • it is necessary to first check on the fly whether there is some relationship between the rows of TableA and TableB,
  • if said relationship is “detected”, then the app program code won’t allow the pertinent row(s) to be DELETEd, but
  • if for some reason the app program code fails, the DELETE operation will “succeed”, regardless of any relationship between the involved rows and tables.

Question

Could you help me to elaborate a good, accurate and solid answer to enrich the debate?


Note: Maybe something like this has been asked (and answered) before, but I couldn’t find anything by means of Google.

Best Answer

If, as stated in your post, the intention is to create a relational database (RDB for brevity) and, therefore, it is expected that it functions as such, the short answer is:

  • No, you should not overlook data integrity constraints.

The primary objective should be to manage the pertinent data as what it is, a quite valuable organizational asset, and a reliable way to achieve that objective is to employ technical means that are grounded in sound theory.

Thus, as a database professional, you can take advantage of the state-of-the-art and elegant relational model mechanisms supplied by Dr. E. F. Codd to enforce business rules, and avoid the (technical and organizational) problems that would eventually arise if they are not utilized.

In this respect, I will share (a) my overall take on constraints and also (b) several considerations about the state of affairs of the database and the work environment at issue as follows.

FOREIGN KEY constraints, data relationships and referential integrity

A RDB must reflect the characteristics of the business context of interest with high accuracy, which definitely requires an in-depth conceptual-level analysis led by a modeler or designer who follows best practices, with the indispensable assistance of the business experts. That analysis must yield the correct identification and formulation of the applicable business rules.

Consequently, if such a modeler has identified that there exist interrelationships between the data of relevance, he or she must configure the corresponding logical-level restrictions so that the database management system (DBMS) can guarantee that the data remains consistent with the exact characteristics and rules determined in the analysis referred to above at all times.

Regarding the database under discussion, one can infer that the pertinent interrelationships have been identified, since you mention that there is a procedural (and easy to circumvent) attempt to enforce them from outside of the DBMS facilities, by dint of application program code (which is a pre-relational approach) that in any case has to “touch” the database to try to validate the wholeness of said interrelationships.

However, as you know, that is not the optimal technique to protect referential integrity, because relational science has prescribed a very powerful instrument for this purpose, i.e., FOREIGN KEY (FK) constraints. These constraints are very easy to create (via the superior declarative approach), as they are single sentences that avoid resorting to unnecessary and error-prone ad hoc procedures. It is very useful to note that the execution speed of FK constraints has been highly optimized by specialized programmers (and the major platform vendors have worked on it for decades now).
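For instance, a minimal sketch of how the TableA/TableB rule described in the question could be declared (the table and column names are hypothetical placeholders; note that in MySQL 5.x only the InnoDB storage engine enforces FOREIGN KEYs):

    -- Hypothetical sketch: TableB rows reference TableA rows, so the DBMS itself
    -- rejects a DELETE on TableA that would leave orphaned rows in TableB.
    CREATE TABLE TableA (
      TableAId INT NOT NULL,
      CONSTRAINT PK_TableA PRIMARY KEY (TableAId)
    ) ENGINE = InnoDB;

    CREATE TABLE TableB (
      TableBId INT NOT NULL,
      TableAId INT NOT NULL,
      CONSTRAINT PK_TableB PRIMARY KEY (TableBId),
      CONSTRAINT FK_TableB_TableA FOREIGN KEY (TableAId)
        REFERENCES TableA (TableAId)
        ON DELETE RESTRICT   -- no app-level check needed; the offending DELETE simply fails
        ON UPDATE CASCADE
    ) ENGINE = InnoDB;

With the constraint in place, the three-step application-level routine described in the question becomes unnecessary, and the rule can no longer be bypassed by a buggy code path.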

Furthermore, since a RDB must be an independent (self-protective, self-describing, etc.) software component that is capable of being accessed by multiple application programs (desktop, automatic, web, mobile, combinations thereof), it should not be “coupled” with the code of any of these apps.

Likewise, the data, being a significant organizational resource, naturally tends to outlive application programs, application programmers, application development platforms and programming paradigms.

PRIMARY KEY constraints and implications of duplicate rows

When, conceptually speaking, a particular kind of thing has been deemed of significance in a business environment, a database modeler has to (1) determine its relevant characteristics (i.e., its properties), (2) confirm said kind of thing as a prototype of entity instances (i.e., an entity type), and (3) represent it by way of a table made up of one or more columns in a logical design.

Then, just as it is paramount to distinguish each individual instance of a given entity type in the real-world business, each corresponding row enclosed in a table must be uniquely distinguishable as well. If a table does not have any KEY declared, it will eventually retain duplicates, and if two or more rows hold exactly the same values, then they all carry the same meaning: they all represent the same fact.

On that point, duplicate rows should be discarded for multiple reasons. From a theoretical perspective, the designer has to make sure that each row is always unique so that the tables work as relationally as the SQL data sub-language permits (which has important repercussions on data manipulation operations). Besides, from an informational perspective, if multiple rows represent the same fact, recording them is not only superfluous but harmful, as exemplified below:

  • Suppose that someone has inserted two identical rows in a certain table.
  • Later, someone else comes and updates only one occurrence of the duplicates. As a consequence, the other occurrence is not up-to-date anymore.
  • Subsequently, another person updates the occurrence that had not been modified so far. In this manner, both duplicates have undergone different changes at distinct points in time.
  • After that, when someone is interested in selecting the information conveyed by the rows in question, he or she can find two different “versions” of it.
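In SQL terms, this scenario can be reproduced on any keyless table; a minimal sketch with hypothetical names:

    -- Hypothetical keyless table: nothing stops the second, identical INSERT.
    CREATE TABLE ContactInfo (
      PersonName  VARCHAR(100),
      PhoneNumber VARCHAR(20)
    ) ENGINE = InnoDB;

    INSERT INTO ContactInfo VALUES ('Ana Gomez', '555-0100');
    INSERT INTO ContactInfo VALUES ('Ana Gomez', '555-0100');  -- accepted: the same fact twice

    -- Only one of the two occurrences is modified, so the rows now contradict each other.
    UPDATE ContactInfo
      SET PhoneNumber = '555-0199'
      WHERE PersonName = 'Ana Gomez'
      LIMIT 1;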

In this way:

  • Which “version” can be considered the correct, reliable one?
  • Which one reflects the real world accurately?

As you know, this phenomenon can even have legal implications, a circumstance that surely is of enormous importance.

Besides, the time and effort that has to be spent handling such contradictions (perhaps through some kind of “update synchronization”) would be better devoted to tasks that actually produce value for your organization. So, retaining contradictory rows ought to be avoided by design to keep the consistency of the database intact.

That is why the identification of a PRIMARY KEY (PK) and the declaration of the respective constraint should always be performed by the database designer. It must also be mentioned that a table may have more than one column or combination of columns holding values that uniquely identify every row; as a consequence, besides setting up a PK constraint (ideally established as PRIMARY for pragmatic reasons), the designer must declare one or more ALTERNATE KEYs as well (usually defined via one or more UNIQUE plus NOT NULL constraints) when applicable (which is quite common).
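By way of illustration, a minimal sketch (hypothetical table and column names) of a PK plus an ALTERNATE KEY, both enforced declaratively by the DBMS:

    -- Hypothetical table: a surrogate PRIMARY KEY plus a natural ALTERNATE KEY.
    CREATE TABLE Customer (
      CustomerId   INT          NOT NULL,
      CustomerCode VARCHAR(20)  NOT NULL,
      CustomerName VARCHAR(100) NOT NULL,
      CONSTRAINT PK_Customer PRIMARY KEY (CustomerId),
      CONSTRAINT AK_Customer_CustomerCode UNIQUE (CustomerCode)  -- duplicates become impossible
    ) ENGINE = InnoDB;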

Another advantageous property of PKs is that, when “migrated” to other tables to take part in single or composite FKs, they can help to enforce the cardinality ratios of the relationships/associations that exist among the data. All this, yes, by means of simple and efficient declarative settings that the DBMS enforces invariably.
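For example, in a hypothetical many-to-many association (the Customer and Product tables are assumed to exist), the two migrated FKs can form a composite PK, which declaratively caps the cardinality at one row per pair:

    -- Hypothetical associative table: the composite PK is made up of the two
    -- migrated FKs, so each (CustomerId, ProductId) pair can appear at most once.
    CREATE TABLE CustomerProduct (
      CustomerId INT NOT NULL,
      ProductId  INT NOT NULL,
      CONSTRAINT PK_CustomerProduct PRIMARY KEY (CustomerId, ProductId),
      CONSTRAINT FK_CustomerProduct_Customer FOREIGN KEY (CustomerId)
        REFERENCES Customer (CustomerId),
      CONSTRAINT FK_CustomerProduct_Product FOREIGN KEY (ProductId)
        REFERENCES Product (ProductId)
    ) ENGINE = InnoDB;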

(Current) CHECK constraints and single-row validation

Let us not forget about the relevance of (current) CHECK constraints that, by declaratively restricting the valid set of column values of a row (which may appear simple, but is in fact a fundamental feature of a relational DBMS), also help to make certain that the rules of the business context are always reflected with precision.

As you marked your question with the MySQL tag, starting with version 8.0.16 (see also this MySQL server team blog post) such a platform finally enforces this type of constraint. In this regard, it must be mentioned that all prior versions of this DBMS did permit declaring CHECK restrictions but ignored their enforcement, a situation that, understandably, was reported as a bug back in 2004.
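As an illustration, a minimal sketch (hypothetical table and rules) that MySQL 8.0.16 or later does enforce:

    -- Hypothetical single-row rules, declared once and guaranteed by the DBMS
    -- (earlier MySQL versions parse CHECK clauses but silently ignore them).
    CREATE TABLE Payment (
      PaymentId INT            NOT NULL,
      Amount    DECIMAL(10, 2) NOT NULL,
      Method    VARCHAR(10)    NOT NULL,
      CONSTRAINT PK_Payment PRIMARY KEY (PaymentId),
      CONSTRAINT CK_Payment_Amount CHECK (Amount > 0),
      CONSTRAINT CK_Payment_Method CHECK (Method IN ('CASH', 'CARD', 'TRANSFER'))
    ) ENGINE = InnoDB;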

So, if using legacy versions, you would have to take care of this factor by other means, e.g., ACID TRANSACTIONS, TRIGGERS, or other methods within the DBMS itself (see this answer by @ypercubeᵀᴹ for information on this subject) so that the data continues to be consistent.
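For instance, on a legacy 5.x server (5.5 or later, which supports SIGNAL), a trigger-based sketch could emulate the Amount rule of the hypothetical Payment table above; an analogous BEFORE UPDATE trigger would be needed as well:

    -- Hypothetical trigger emulating a CHECK constraint on legacy MySQL.
    DELIMITER //
    CREATE TRIGGER TRG_Payment_BI BEFORE INSERT ON Payment
    FOR EACH ROW
    BEGIN
      IF NEW.Amount <= 0 THEN
        SIGNAL SQLSTATE '45000'
          SET MESSAGE_TEXT = 'Payment.Amount must be greater than zero';
      END IF;
    END//
    DELIMITER ;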

ASSERTION constraints: setting up further multi-row and multi-table business rules declaratively

One aspect that for whatever reasons is very poorly supported —if at all— by the different SQL DBMSs, including MySQL, is enabling multi-row and multi-table constraints in a declarative fashion —beyond PKs and FKs, evidently—.

For its part, the SQL standard has included ASSERTIONs for many years now. I do not know which rules of your business environment would benefit from that logical-level validation approach but, as a database designer, I consider that it would be pretty handy to constrain data with one or more ASSERTIONs when required, although I have to mention that, from the point of view of the DBMS developers, this paramount kind of tool has been difficult to implement at the physical level of abstraction.
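To give an idea of the declarative power involved, a purely illustrative sketch in standard SQL syntax (the table and rule are hypothetical, and no mainstream DBMS, MySQL included, currently accepts this statement):

    -- Standard SQL ASSERTION (illustrative only; not supported by MySQL):
    -- a multi-row rule stating that the combined budget of all projects
    -- must never exceed a fixed limit.
    CREATE ASSERTION Projects_Total_Budget_Limit
      CHECK (
        (SELECT COALESCE(SUM(Budget), 0) FROM Project) <= 1000000.00
      );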

It appears that the Oracle vendor and/or developers have been evaluating ASSERTION support since 2016, and that would make that DBMS more relationally compliant and, hence, more robust and competitive. I guess that, if (i) their customers keep pushing and (ii) Oracle succeeds in the implementation, then (iii) other DBMS vendors/communities will have to enable them too, and their usage will start to spread. Certainly, that would be a huge step forward in the data management field and, being one of the most distinctive tools envisioned by Dr. Codd, I personally hope that we will see it happening soon.

Data consistency and the decision-making process

As discussed above, one of the most important aspects of a RDB is that it guarantees by itself the consistency of the data it retains, and said consistency is only met when the RDB complies with the integrity constraints declared by the modeler.

In this respect, it is compulsory to have base tables (those established in the DDL structure) whose integrity is protected in order to be able to create derived tables (e.g., a SELECT statement or view that retrieves columns from multiple tables) that are trustworthy, because derived tables necessarily have to be produced in terms of base tables.

It is pretty obvious that people use information as the main tool in the organizational (and in the ordinary) decision-making process. Then, if the information presented by a database is not coherent and accurate, the decisions based on such information will not be sound (to say the least). That is why a RDB must be carefully designed and implemented: it should be built to become a reliable resource that can assist its users to make well-founded decisions.

“Denormalization”

Alas, “a ‘denormalized’ database is faster than a normalized one” is a widely spread misconception, although it is also an “argument” that can be refuted on logical, physical and pragmatic grounds.

Firstly, denormalization implies necessarily that a base table has been previously normalized (by virtue of a formal, science-based, procedure fulfilled at the logical level of abstraction of a database).

So, assuming that said table was in actual fact normalized correctly, “denormalizing” it (which, in contrast to the formal meaning of the word, involves appending to it columns that belong in, and are also part of, other tables in an ad hoc fashion) might aid, e.g., in speeding up (at the physical level) the processing of only one or a few particular SELECT statements, while such a course of action might, at the same time, be undermining the execution of many other associated data manipulation operations (e.g., several INSERT, UPDATE, DELETE and SELECT statements, or combinations thereof enclosed within single or multiple ACID TRANSACTIONS).

In addition, denormalization (be it formal or informal) would introduce update/modification anomalies that deteriorate the coherence of the database, a problem that “may” be handled by complex, costly and error-prone procedures, when all this can be prevented from the very beginning.

Physical-level scaffoldings supporting normalized and “denormalized” tables

A logical (abstract) layout (SQL-DDL design) that is meant to be utilized in the real world clearly holds physical (concrete) repercussions that must be considered.

In this manner, a “denormalized” table would necessarily be “wider” (holding additional columns), which means that its rows would necessarily be heavier (requiring more and larger physical-level components), so the underlying computing processes (e.g., those that have to do with the hard drive or memory) can easily turn slower.

In contrast, a normalized table, which is of course “narrower” (having fewer columns), would be a “lighter” element (served by fewer and smaller physical components) that “behaves faster”, which would speed up the series of actions related to, e.g., data writing and reading.

That being so, it is very convenient to (a) normalize the relevant tables formally and prudently, keeping them as such, and then (b) to make use of any physical level resource that can optimize data retrieval and modification speed, e.g., implementing a careful and efficient indexing strategy, enabling proper software and hardware server configurations, upgrading network bandwidth capabilities, etc.

The functioning of the database under consideration and approaching your colleagues

The following paragraphs of your question have to do with the speed of the data retrieval operations:

[A]s the product “works”, there is hesitation to enhance the database; nevertheless, the first thing I noticed is one page taking 1 minute to load (yes, 60 seconds!).

If loading a certain page takes that long, it is evident that the users of the system are not receiving an appropriate service; therefore, even when it “works”, its functioning does not seem optimal at all, a point that demonstrates that your intention to make the entire computerized information system (database and apps) more efficient is well founded, and shows a very constructive attitude.

Then, even though science definitely supports you, and hence you should maintain a firm posture, I suggest approaching the situation in a diplomatic way since, at the end of the day, your employers, coworkers and yourself are JOINing efforts in order to make the whole organization more successful. Thus, one argument you should stress is that, while they are doing other things more than well, improving general and specific data management practices can considerably help in producing more organizational and individual growth.

Most of the relevant queries include JOIN operations, which makes them run very, very, very slow with large amounts of data (the database contains millions of rows).

It is worth noting that the JOIN operator is an essential and powerful element of the relational manipulation of data. Then, although more robust platforms serve it with comparatively faster executions, the circumstance you describe is most probably a symptom of an inefficient design (at the conceptual, logical and physical levels of abstraction). So, my first-sight estimations are:

  • The INDEX settings may require improvement (see the sketch after this list).
  • The PK and FK column type and size definitions need to be reviewed (and I totally agree with @Rick James regarding his PK considerations, as composite KEYs tend to be much more efficient than appended surrogates in the appropriate cases).
  • Further (formal, science-based) normalization might help to alleviate these problems, on account of the fact that, in the right circumstances (i.e., carried out in a well-designed RDB), JOINs are executed very fast.
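Regarding the first point, a minimal sketch (hypothetical table and column names) of a composite index supporting a frequent JOIN-plus-filter pattern:

    -- Hypothetical composite index: it covers the FK column used by the JOIN and a
    -- column that is frequently filtered on, helping the optimizer avoid full scans.
    CREATE INDEX IX_Orders_CustomerId_OrderDate
      ON Orders (CustomerId, OrderDate);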

Moreover, yes, as @TommCatt mentions in his answer, sometimes a (logical) rewrite of a query modifies its (physical) execution plan, accelerating data reading/writing, which is a factor that should decidedly be taken into account.
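As a generic, hypothetical illustration of that kind of rewrite (older MySQL 5.x optimizers, in particular, tended to execute IN subqueries inefficiently before semi-join optimizations arrived in 5.6), assuming an Orders table and a Customer table with a Country column:

    -- Original form: IN with a subquery (may yield a poor plan on older optimizers).
    SELECT o.OrderId, o.OrderDate
    FROM Orders AS o
    WHERE o.CustomerId IN (SELECT c.CustomerId
                           FROM Customer AS c
                           WHERE c.Country = 'US');

    -- Logically equivalent rewrite as an explicit JOIN, which can produce a better plan
    -- (Customer.CustomerId is a KEY, so the JOIN introduces no duplicate rows here).
    SELECT o.OrderId, o.OrderDate
    FROM Orders AS o
    INNER JOIN Customer AS c
      ON c.CustomerId = o.CustomerId
    WHERE c.Country = 'US';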