How to prove the lack of implicit order in a database

data integrityorder-by

Recently I was explaining to colleagues the importance of having a column by which to sort data in a database table if it is necessary to do so, for example for chronologically-ordered data. This proved somewhat difficult because they could simply re-run their query seemingly endlessly and it would always return the same set of rows in the same order.

I have noticed this before and all I could really do is insist that they trust me and not simply assume that a database table will behave like a traditional CSV or Excel file.

For example, executing the (PostgreSQL) query

create table mytable (
    id INTEGER PRIMARY KEY,
    data TEXT
);
INSERT INTO mytable VALUES
    (0, 'a'),
    (1, 'b'),
    (2, 'c'),
    (3, 'd'),
    (4, 'e'),
    (5, 'f'),
    (6, 'g'),
    (7, 'h'),
    (8, 'i'),
    (9, 'j');

will create a table with a clear conceptual order. Selecting that same data in the simplest way would be:

SELECT * FROM mytable;

Always gives me the following results:

 id | data 
----+------
  0 | a
  1 | b
  2 | c
  3 | d
  4 | e
  5 | f
  6 | g
  7 | h
  8 | i
  9 | j
(10 rows)

I can do this over and over again and it will always return to me the same data in the same order. However I know that this implicit order can be broken, I have seen it before, particularly in large datasets, where some random value will get apparently thrown up into the "wrong" place when selected. But it has occurred to me that I don't know how this happens or how to reproduce it. I find it difficult to get results on Google because the search query tends to just return general help on sorting result sets.

So, my questions are essentially these:

  1. How can I demonstrably and concretely prove that the return order of rows from a query without an ORDER BY statement is not reliable, preferably by causing and showing a breakdown of the implicit order even when the table in question is not updated or edited?

  2. Does it make any difference at all if the data is only inserted once en masse and then never updated again?

I would prefer a postgres-based answer since that is the one I am most familiar with but I'm more interested in the theory itself.

Best Answer

I see three ways to try to convince them:

  1. Let them try the same query but with bigger table (more number of rows) or when the table is being updated between executions. Or new rows are inserted and some old ones are deleted. Or an index is added or removed between executions. Or the table is vacuumed (in Postgres). Or indexes are rebuilt (in SQL Server). Or the table is changed from clustered to a heap. Or the database service is restarted.

  2. You can suggest that they prove that different executions will return the same order. Can they prove it? Can they provide a series of tests that proves that any query will give the result in the same order, no matter how many times it is executed?

  3. Provide the documentation of various DBMS in that matter. For example:

PostgreSQL:

Sorting Rows

After a query has produced an output table (after the select list has been processed) it can optionally be sorted. If sorting is not chosen, the rows will be returned in an unspecified order. The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on. A particular output ordering can only be guaranteed if the sort step is explicitly chosen.

SQL Server:

SELECT - ORDER BY Clause (Transact-SQL)

Sorts data returned by a query in SQL Server. Use this clause to:

Order the result set of a query by the specified column list and, optionally, limit the rows returned to a specified range. The order in which rows are returned in a result set are not guaranteed unless an ORDER BY clause is specified.

Oracle:

order_by_clause

Use the ORDER BY clause to order rows returned by the statement. Without an order_by_clause, no guarantee exists that the same query executed more than once will retrieve rows in the same order.