Recently I was explaining to colleagues the importance of having a column by which to sort data in a database table if it is necessary to do so, for example for chronologically-ordered data. This proved somewhat difficult because they could simply re-run their query seemingly endlessly and it would always return the same set of rows in the same order.
I have noticed this before and all I could really do is insist that they trust me and not simply assume that a database table will behave like a traditional CSV or Excel file.
For example, executing the (PostgreSQL) query
create table mytable (
id INTEGER PRIMARY KEY,
data TEXT
);
INSERT INTO mytable VALUES
(0, 'a'),
(1, 'b'),
(2, 'c'),
(3, 'd'),
(4, 'e'),
(5, 'f'),
(6, 'g'),
(7, 'h'),
(8, 'i'),
(9, 'j');
will create a table with a clear conceptual order. Selecting that same data in the simplest way would be:
SELECT * FROM mytable;
Always gives me the following results:
id | data
----+------
0 | a
1 | b
2 | c
3 | d
4 | e
5 | f
6 | g
7 | h
8 | i
9 | j
(10 rows)
I can do this over and over again and it will always return to me the same data in the same order. However I know that this implicit order can be broken, I have seen it before, particularly in large datasets, where some random value will get apparently thrown up into the "wrong" place when selected. But it has occurred to me that I don't know how this happens or how to reproduce it. I find it difficult to get results on Google because the search query tends to just return general help on sorting result sets.
So, my questions are essentially these:
-
How can I demonstrably and concretely prove that the return order of rows from a query without an
ORDER BY
statement is not reliable, preferably by causing and showing a breakdown of the implicit order even when the table in question is not updated or edited? -
Does it make any difference at all if the data is only inserted once en masse and then never updated again?
I would prefer a postgres-based answer since that is the one I am most familiar with but I'm more interested in the theory itself.
Best Answer
I see three ways to try to convince them:
Let them try the same query but with bigger table (more number of rows) or when the table is being updated between executions. Or new rows are inserted and some old ones are deleted. Or an index is added or removed between executions. Or the table is vacuumed (in Postgres). Or indexes are rebuilt (in SQL Server). Or the table is changed from clustered to a heap. Or the database service is restarted.
You can suggest that they prove that different executions will return the same order. Can they prove it? Can they provide a series of tests that proves that any query will give the result in the same order, no matter how many times it is executed?
Provide the documentation of various DBMS in that matter. For example:
PostgreSQL:
SQL Server:
Oracle: