MySQL – How to Keep Updates (Diffs) of an Entity in the Database

MySQLpostgresqlrelational-theory

What is the best way to keep updates (diffs) of the some entity in the database? At StackOverflow we can edit questions and answers. And then we can look at any revision of the question or answer we want. For example: revisions of some random question. Maybe someone knows how it realized in StackOverflow?

To be clear in my case I have some entity (article) with some fields (name, description, content). Many users can edit the same article. I want to keep history of the article updates (something like version control) and I want to keep only diffs, not the whole content of the updated article. By the way I use PostgreSQL, but can migrate to any other database.

Here is some requirements. You don't need to fully satisfy them. But if you do it will be much better. Nevertheless any answer is much appreciated. So I want to have ability:

to keep only diffs to not waste my space for no purpose.
to fetch any revision (version) of some article. But fetching last revision of the article must be really quick. Fetching speed of other revisions is not so important.
to fetch any diff (and list of diffs) of some article. Article can have changes in fields: header, description or content (like StackOverflow have changes in header and content), so it must be taken into account.

Best Answer

Check out Drupal and RDF. Check out what StackExhange (etc) uses. And study how GIT works.

I recommend keeping the current value in the primary table, and put the history in other table(s). That is, plan for efficient access to the current value, at the expense of history.

If you are talking about "articles" of, say, 1K characters, then compress them. Compress old copies; optionally compress the 'current' copy. But do the compression in the client, so as to offload the database server. This will save space. If you are talking about ancillary data (dates, names, etc), don't bother compressing.

Do not save the entire set each time there is a change. That is, if you change the date, don't store another, identical, copy of the article body. Too bulky.

You mentioned diffs -- Do you intend to do diffs on paragraphs? If so, do you have a good algorithms for such? (My previous comments do not assume "diff".)

Oh, well, I guess I am not adding much to your stated requirements.

Related Solutions

MySQL – Update date_modified Field Only When Other Fields Change Using CASE

Below is my final working solution. It turns out that moving this...

date_modified = (CASE
                WHEN name <> values(name)
                OR description <> values(description)
                OR status <> values(status)
                OR type <> values(type)
                OR priority <> values(priority)
                  THEN UTC_TIMESTAMP()
                  ELSE date_modified
            END),

to the top of the list, above these fields...

name='$name',
description='$description',
status='$status',
priority='$priority',
type='$type',

So the final solution looks like this below...

$sql = "
    INSERT INTO
        $this->tasksDbTableName(task_id, project_id, name, description, status, priority, type, date_entered, date_modified, sort_order, heading)
    VALUES
        ('$taskId', '$projectId', '$name', '$description', '$status', '$priority', '$type', UTC_TIMESTAMP(), UTC_TIMESTAMP(), '$sort_order', '$heading')
    ON DUPLICATE KEY UPDATE
        date_modified = (CASE
            WHEN name <> values(name)
            OR description <> values(description)
            OR status <> values(status)
            OR type <> values(type)
            OR priority <> values(priority)
              THEN UTC_TIMESTAMP()
              ELSE date_modified
        END),
        name='$name',
        description='$description',
        status='$status',
        priority='$priority',
        type='$type',
        sort_order='$sort_order',
        heading='$heading'";

Just moving my CASE Statement to the top above the other set fields, made it start working correctly!

It almost seems as if these DB column fields name description status priority type were getting Updated before my Case statement could run which resulted in them always appearing to be the same as the DB fields and never any different so my Case statement was about useless. Moving it to the top, it now works 100% correctly.

Hopefully this will help someone with a similar problem someday as many people across 3 sites tried to fix it without success. Just moving it to the top did the trick though so the Order in this case very much matters! ] Thanks to all that have contributed their time to try and find a solution, I always appreciate the help I receive on the StackExchange network sites!

PostgreSQL – Ignoring Substring in the Middle of a String

One solution is to use regexps to remove that part of the filename. Assuming the separators are known

regexp_replace(filename, '^(.*?)[-_][rR][0-9]+(\.[^.]+)$', E'\\1\\2')

This would assume separators - or _, then r or R followed by at least one number, a dot and then something but never more dots. The part with r/R+numers would be removed.

The efficiency of this depends on the amount of data and how the searches are done. It could be used in an index to speed up searching. For example

CREATE INDEX filename_name_idx ON documents (regexp_replace(filename, '^(.*?)[-_][rR][0-9]+(\.[^,]+)', E'\\1\\2'));

Then doing a search

SELECT filename FROM documents
  WHERE regexp_replace(filename,
    '^(.*?)[-_][rR][0-9]+(\.[^,]+)', E'\\1\\2') = 'LLE-MET-AP-0000-PLA-COB.pdf';

would use index (if the query optimizer deems it faster).

Do note that the regexp part has to be exactly the same in the index and the search, otherwise the index cannot be used. You can also combine it with LOWER() if case insensitive searching is required.

The same regexp can be used to remove the revision identifier from the search string also, if there is a need to search for matching files for, e.g., LLE-MET-AP-0000-PLA-COB-R00.pdf

Making a view with this would make it a lot nicer to use and the actual regexp would be in the database and not in the application layer.

Best Answer

Related Solutions

MySQL – Update date_modified Field Only When Other Fields Change Using CASE

PostgreSQL – Ignoring Substring in the Middle of a String

Related Question