How to prove the lack of implicit order in a database

data integrity, order-by

Recently I was explaining to colleagues the importance of having a column by which to sort data in a database table if it is necessary to do so, for example for chronologically-ordered data. This proved somewhat difficult because they could simply re-run their query seemingly endlessly and it would always return the same set of rows in the same order.

I have noticed this before and all I could really do is insist that they trust me and not simply assume that a database table will behave like a traditional CSV or Excel file.

For example, executing the (PostgreSQL) query

CREATE TABLE mytable (
    id INTEGER PRIMARY KEY,
    data TEXT
);
INSERT INTO mytable VALUES
    (0, 'a'),
    (1, 'b'),
    (2, 'c'),
    (3, 'd'),
    (4, 'e'),
    (5, 'f'),
    (6, 'g'),
    (7, 'h'),
    (8, 'i'),
    (9, 'j');

will create a table with a clear conceptual order. Selecting that same data in the simplest way,

SELECT * FROM mytable;

always gives me the following results:

 id | data 
----+------
  0 | a
  1 | b
  2 | c
  3 | d
  4 | e
  5 | f
  6 | g
  7 | h
  8 | i
  9 | j
(10 rows)

I can do this over and over again and it will always return the same data in the same order. However, I know that this implicit order can be broken. I have seen it before, particularly in large datasets, where some seemingly random row turns up in the "wrong" place when selected. But it has occurred to me that I don't know how this happens or how to reproduce it. I find it difficult to get results on Google because the search query tends to just return general help on sorting result sets.

So, my questions are essentially these:

1. How can I demonstrably and concretely prove that the return order of rows from a query without an ORDER BY clause is not reliable, preferably by causing and showing a breakdown of the implicit order even when the table in question is not updated or edited?
2. Does it make any difference at all if the data is only inserted once en masse and then never updated again?

I would prefer a Postgres-based answer, since that is the system I am most familiar with, but I'm more interested in the theory itself.

Best Answer

I see three ways to try to convince them:

1. Let them try the same query on a bigger table (one with many more rows), or while something changes between executions: rows are updated, new rows are inserted and some old ones are deleted, an index is added or removed, the table is vacuumed (in Postgres), indexes are rebuilt (in SQL Server), the table is changed from a clustered table to a heap, or the database service is restarted.
2. Suggest that they prove that different executions will always return the same order. Can they? Can they provide a series of tests that proves that any query will give its result in the same order, no matter how many times it is executed?
3. Point them to the documentation of various DBMSs on the matter. For example:

PostgreSQL:

Sorting Rows

After a query has produced an output table (after the select list has been processed) it can optionally be sorted. If sorting is not chosen, the rows will be returned in an unspecified order. The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on. A particular output ordering can only be guaranteed if the sort step is explicitly chosen.

SQL Server:

SELECT - ORDER BY Clause (Transact-SQL)

Sorts data returned by a query in SQL Server. Use this clause to:

Order the result set of a query by the specified column list and, optionally, limit the rows returned to a specified range. The order in which rows are returned in a result set are not guaranteed unless an ORDER BY clause is specified.

Oracle:

order_by_clause

Use the ORDER BY clause to order rows returned by the statement. Without an order_by_clause, no guarantee exists that the same query executed more than once will retrieve rows in the same order.
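
The first suggestion can be sketched in PostgreSQL against the question's mytable. This is only a rough demonstration, not a proof: it assumes a freshly created and populated table, and the index name mytable_data_idx is made up for the example. An UPDATE writes a new row version, which usually lands at a different physical position, and CLUSTER rewrites the table in index order without changing any data.

```sql
-- Assumes the mytable from the question, freshly created and populated.

-- 1. An UPDATE writes a new row version instead of overwriting in place.
UPDATE mytable SET data = 'B' WHERE id = 1;

-- No ORDER BY: row 1 typically no longer appears between rows 0 and 2.
SELECT * FROM mytable;

-- 2. Physically rewrite the table in the order of an index.
--    mytable_data_idx is a hypothetical name introduced for this sketch.
CREATE INDEX mytable_data_idx ON mytable (data DESC);
CLUSTER mytable USING mytable_data_idx;

-- No rows were edited by CLUSTER, yet the unordered result
-- typically comes back in a different order than before.
SELECT * FROM mytable;

-- With ORDER BY, the order is guaranteed regardless of physical layout.
SELECT * FROM mytable ORDER BY id;
```

The exact positions you see are an implementation detail and may differ between versions and settings, which is precisely the point. Note that CLUSTER takes an exclusive lock on the table, so try this on a scratch database. For large tables that are never modified, PostgreSQL's synchronize_seqscans setting is another documented source of changing order, since a sequential scan may start in the middle of the table when another scan of it is already running.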

Related Solutions

MySQL – Want to order on two columns but not the usual way

SELECT *
FROM Mytable
ORDER BY
   userID, Date

I assume Date is really a date/time type and not varchar...

Edit, after clarification:

Untested

SELECT
    M.*
FROM
    ( -- one row for each user
    SELECT MIN(Date) AS FirstUserDate, userID
    FROM MyTable
    GROUP BY userID
    ) foo
    JOIN
    MyTable M ON foo.userID = M.userID
ORDER BY
    foo.FirstUserDate, M.userID, M.Date;

SQL Server – ORDER BY Clause Ignored in View with WHERE Clause

The query you posted is not valid for creating a view; running CREATE VIEW xy AS for this query will result in an error. Are you using a TOP clause?

A view, being a table expression (a set), can't have the order defined, since that would be against the principles of a relational model (there is no order for rows in a relational table - a set is an unordered collection of tuples). Same goes for other table expressions - derived tables, CTEs etc.

From BOL article about the ORDER BY clause:

The ORDER BY clause is not valid in views, inline functions, derived tables, and subqueries, unless either the TOP or OFFSET and FETCH clauses are also specified. When ORDER BY is used in these objects, the clause is used only to determine the rows returned by the TOP clause or OFFSET and FETCH clauses. The ORDER BY clause does not guarantee ordered results when these constructs are queried, unless ORDER BY is also specified in the query itself.

Long story short: Use the ORDER BY clause in the outer query that references the view. Do not use it in the view itself. Even using it with TOP(100) PERCENT (or, on SQL Server 2012, the OFFSET-FETCH equivalent) does not guarantee presentation order; it just means you'll get the top 100% of the rows, in any order.
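
As a rough illustration of that advice, here is a sketch in T-SQL. The names dbo.Orders, dbo.RecentOrders, and their columns are hypothetical and not taken from the original question. The view only filters; the ordering lives in the query that reads from it.

```sql
-- Hypothetical view: it filters rows but defines no ordering.
CREATE VIEW dbo.RecentOrders AS
SELECT OrderID, CustomerID, OrderDate
FROM dbo.Orders
WHERE OrderDate >= '20240101';
GO  -- batch separator for SSMS/sqlcmd; CREATE VIEW must start its own batch

-- The outer query is the only place where ORDER BY
-- guarantees the presentation order of the result.
SELECT OrderID, CustomerID, OrderDate
FROM dbo.RecentOrders
ORDER BY OrderDate, OrderID;
```

Adding OrderID as a tie-breaker keeps the order deterministic when several rows share the same OrderDate.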
