Fundamental question that I cannot seem to find an answer on. I have a database that stores line items from receipts along with the username and receipt number.
Right now the lineItem
column is just a long string of data separated by commas (original file appears to have been just an Excel file). This information is parsed in a PHP script for viewing on the front end.
The table looks like this:
|----------|----------|----------|
|lineItem |receiptID |customerID|
|----------|----------|----------|
|CD, DVD, |001 |User01 |
|----------|----------|----------|
|CD, CD, |002 |User02 |
|DVD, usb, | | |
|----------|----------|----------|
Ultimately, is this bad practice? Should the lineItem
values be linked to related values in another table instead maybe?
Best Answer
Why storing data as a string a problem :
Storing multiple datums as (in this case comma separated) strings is bad practice because:
- First reason:
It breaches
Codd's second rule
(called the"Guaranteed Access Rule"
) which states thatEach and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.
So, if you wish to refer to
user02
's USBlineItem
, you have to do further processing beyond simply knowing the table name, thePRIMARY KEY
and the column name.From
here
:Among the conventional database models, the Relational Model of data has a simple, sound mathematical foundation based on the notions of set theory.
And from the wiki on theRelational Model
(RM) we have:So, basically, the only data model with a sound mathematical foundation is the relational one.
Most relational databases use the SQL data definition and query language; these systems implement what can be regarded as an engineering approximation to the relational model.
[ibid].Codd derived his rules as a guide to a practical implementation of his relational calculus - given that it is the only model with a sound mathematical foundation, it seems that it would be a bad idea to breach any of them.
Caveat: now, if for example, you will NEVER EVER want to break out the line items into their individual components, then storing it as one "unit" would be acceptable, but I can see many instances of where you would want to split it up into its component parts (see the fifth reason below).
An example of where you might want to store data in .csv form might be storing somebody's name and title for an academic journal - it might be stored thus:
and that's the only way you're ever going to print/process/transmit/store this information, then it is a datum, not comma separated variables -
datum
ordata
is very much a contextual concept.- Second reason:
As mentioned in comments, your
lineItem
table isn't even in first normal form (see the chart here -Atomic columns (cells have single value)
. This is obviously related to the point above.Database normalisation
isThese "forms" flowed from the RM/Relational Calculus and Codd's rules as a way of ensuring that data remains consistent, which is obviously of paramount importance in any database system - in simple terms, it's how we ensure that the definitive archetype of a given datum is stored in one place and one place only.
- Third reason:
You have no way of controlling what data is entered into that field - i.e. you have no way of controlling
Declarative Referential Integrity
(DRI). This means, for example, that there's nothing to stop you referring to products which don't exist (say,DVDx
).DRI is one of the most important benefits of using the RM - it means that internal data consistency can be maintained, the benefits of which you will greatly appreciate if you've ever you've had the misfortune to have worked with a system where this has broken down.
In point two, we said that Normal Forms were to
ensure that the definitive archetype of a given datum is stored in one place and one place only
- DRI ensures that all other references to that datum point to that one place and nowhere else.- Fourth reason:
SQL is not designed for parsing strings - it can be done, it's just messy, time-consuming and error-prone. Many proprietary extensions have been developed by the various RDBMS providers to try and overcome this lacuna, but it's still far easier to deal with properly normalised tables (see the SQL below).
- Fifth reason:
Apart from the "theoretical" (more or less) reasons for not doing this, is the massive PRACTICAL problem of not being able to assign individual quantities and prices to items under your schema - suppose I'm doing my Christmas shopping and I wanted the new "U2 CD" for 3 of my friends who are U2 fanatics? No way of telling the system that there are 3 U2 CD's other than having a field value like this:
'"U2 CD", "U2 CD", "U2 CD" "UB40 CD", "U2 DVD", "Kingston USB 32GB"' -- note repetition of "U2 CD".
Suppose you want to know the number of USB's sold? The number of USB's per client? The number per client district/area/country - depending on the scale of your operation (see SQL below)? Suppose I want to know how much was spent on USB drives last week - absolutely no way of getting any of that information! The list goes on...
How to deal with the problem:
So, having dealt with the first part of your question, we can now progress to the second part -
Should the lineItem values be linked to relational values in another table instead maybe?
.- First solution (extra fields):
Here is a further example of the problems associated with storing strings. In this case, adding fields to a given record is the solution - i.e. splitting the string into its component parts and making each one a field! Very helpful for enforcing DRI and controlling data correctness if one has reference tables for (in this case) ZIP codes, street names &c...
- Second solution (extra records - 1-to-many relationships):
In this particular case of your question, what we have here is a classic
1-many relationship
- also known as parent-child wherereceipt
is the parent andline_item
is the child.Your table structure is this:
What you should have is something like this (see the fiddle here - all data and tables are also given at the bottom of this answer):
and your data will (rather cryptically) look like this:
The
receipt_id
fields and theitem_id
fields point to thePRIMARY KEY
s of their respective tables - and there is no redundant, extraneous information in the table - nocustomer_id
stored multiple times for example! This way of modelling allows one to write queries of the form:Result:
See the fiddle (or below) for the all the DDL and DML! I challenge you to do this trivially with a .csv string containing your
line_item
products - especially in MySQL! It probably would be feasible in PostgreSQL using something likearray_to_table
after feeding the string into an array but I leave that as an exercise for you!So, for a 1-many relationship, you add items to your
line_item
table - one item for each element in your .csv string - 1receipt
parent record can have1
to many (possibly a very large number) ofline_item
children.Now, the
item
table is also a parent ofline_item
and in its case, there can be0
to many children if, for example, if an item hasn't sold at all there will be no references to it in theline_item
table.- Third solution (extra tables - many-to-many relationships):
There is an appropriate circumstance when "values SHOULD be linked to relational values in another table" (as you hint at in your question) and this is when there is an
m-to-n
relationship - otherwise know as amany-to-many
relationship.Consider the old favourite Databases-101 example of students and courses and the many courses taken by many students! See the fiddle here - I haven't populated the tables this time. I've used PostgreSQL for the fiddle (my favourite server) but a bit of tweaking will get it to work on any reasonable RDBMS.
Create tables course and student:
This is where a
JOIN
ing table (akalinking
table(more formally known as an [
Associative Entity`]13 - as an aside, there are 17 different names for this type of table on that page) comes in.So, you deal with this by creating the
Associative Entity
- yourJOIN
ing table:I then add a
PRIMARY KEY
- I kept it out of the table definition to illustrate the point, but it could (and normally would) be part of the table creation DDL.So now,
a given student can only be enroll in a given course once and
a given course can only have the same student enrolled once
There are many other situations where this construct is useful - basically, it's the only way to meaningfully model many real-life situations.
An example from my own career:
Think of a
flight
table containing aflight_id
field, a list of departure and arrival airports and the relevant times and then also acrew
table with crew members and acrew_id
field (and other details obviously).Having the
flight_id
andcrew_id
fields in aJOIN
ing table proved to be very useful for the system - it really helped with scheduling and rostering which was a mess with the other system - constant conflicts in both. It takes time and experience to recognise when which schema design is suitable for which scenario, but 1-many (extra records in existing table) and many-many (extraJOIN
ing table) is a good rule of thumb!p.s. welcome to the forum!
_____________ Full DDL and DML _______________
Customer table:
Customer data:
item table:
CREATE TABLE item ( item_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, item_code VARCHAR (25) NOT NULL UNIQUE, item_desc VARCHAR (200) NOT NULL, item_price DECIMAL(10, 2), item_supplier INTEGER NOT NULL -- refers to supplier table - not shown! -- -- other fields of interest to you -- );
item data:
receipt table:
CREATE TABLE receipt -- often called "orders" but receipt is OK ( receipt_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, customer_id INTEGER NOT NULL, -- refer to customer table - see below
receipt_total DECIMAL(10, 2), -- kept updated by a trigger (not shown) -- can be calculated on the fly or -- possibly a generated field receipt_dt TIMESTAMP NOT NULL, -- date and time of sale receipt_asst INTEGER, -- refers to the sales assistant table - not shown CONSTRAINT rec_cust_fk FOREIGN KEY (customer_id) REFERENCES customer (customer_id) );
receipt data:
line_item table:
line_item data:
Query:
Result: