Sql-server – Update a column for every row of a large table to a unique value in a set range

performancequery-performancesql serversql-server-2008-r2

We have a SQL Server table with over 3 million rows in it. Each row represents a product. These products are provided by about 500 suppliers, and are updated each night via a file from each supplier.

These products are displayed by our e-commerce site, and because of the nature of the business, products are always ordered by the product 'last updated' date. This is a column on the products table called LastUpdated.

This gives suppliers whose products were updated last an unfair advantage.

To get around this problem, we shuffle the products by setting the LastUpdated datetime column to a random datetime value over a 4 hour period. The problem is that this process takes about 15 hours. We must also ensure that the suppliers are spread out evenly, so suppliers' products are not bunched together when queried by the website.

We can't change the way the website queries the data (always by LastUpdated DESC), because of the way other parts of the business work, so we can't randomise when querying the data.

We are looking for a faster way to update the LastUpdated column, for every table row, to a unique datetime covering a 4 hour window, that also prevents supplier products from being bunched together when ordering by LastUpdated.

Example

Example of data before and after datetime shuffle

Note: Some suppliers have a different number of products, so any bunching of supplier records must be as random as possible towards the bottom of the queried results set.

To clarify, we have a very slow query using cursors to do this product shuffling, but we feel there is a better way to do this, that does not take 15 hours.

Update: Vladimir's answer is excellent, as is the comment from Shoeless, but what about part of the question ensuring suppliers products are not bunched together when randomising?

Update: Some people have questioned why the LastUpdated order by is so important. Suppliers pay our website to display these products (approx. 10,000 products per subscription), but they can also pay for a limited number of these products to remain at the top of the results for a user's search criteria. If multiple suppliers want this top spot, then we have to rotate the order of these at the top of the results. We do this base on a queue system, based on an updated date. There could be 30 suppliers all paying for the same top spot for the chosen criteria.

Best Answer

Four hours is only 14,400 seconds and you have 3M rows. If we generate random timestamp with 1 second precision within 4 hour interval we'll get approximately 3,000,000 / 14,400 ~= 208 rows for each second. To reduce the number of rows that have exactly the same timestamp we should use fractions of the second. If you use datetime type, its precision is about 3 milliseconds, which should be good enough.

So, for each row of the table I'll generate a random timestamp within a 4 hour interval with a millisecond precision, which will be rounded to the precision of the datetime type.

I will use CRYPT_GEN_RANDOM function for it.

For example,

(CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5)

This generates 4 random bytes as varbinary. We have to explicitly cast them to int first. Then result is transformed into a float number between 0 and 1. This can be miltiplied by the size of the window in milliseconds and added to the timestamp that defines the start of the window.

Sample data

DECLARE @T TABLE 
(ID int, Supplier nvarchar(50), ProductCode char(5), LastUpdated datetime);

INSERT INTO @T (ID, Supplier, ProductCode, LastUpdated) VALUES
(1, 'Acme',  '00001', '2016-03-23 06:00:00'),
(2, 'Acme',  '00002', '2016-03-23 06:00:00'),
(3, 'XYZ',   '00023', '2016-03-23 05:30:00'),
(4, 'XYZ',   '00055', '2016-03-23 05:30:00'),
(5, 'Q & B', '00453', '2016-03-23 04:15:00'),
(6, 'Q & B', '00045', '2016-03-23 04:15:00');

SELECT * FROM @T ORDER BY ID;


+----+----------+-------------+-------------------------+
| ID | Supplier | ProductCode |       LastUpdated       |
+----+----------+-------------+-------------------------+
|  1 | Acme     |       00001 | 2016-03-23 06:00:00.000 |
|  2 | Acme     |       00002 | 2016-03-23 06:00:00.000 |
|  3 | XYZ      |       00023 | 2016-03-23 05:30:00.000 |
|  4 | XYZ      |       00055 | 2016-03-23 05:30:00.000 |
|  5 | Q & B    |       00453 | 2016-03-23 04:15:00.000 |
|  6 | Q & B    |       00045 | 2016-03-23 04:15:00.000 |
+----+----------+-------------+-------------------------+

Query

DECLARE @VarWindowStart datetime = '2016-03-23 06:00:00';
DECLARE @VarWindowSizeMilliseconds int = 4 * 3600 * 1000;

UPDATE @T
SET LastUpdated = 
    DATEADD(millisecond, 
    (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) * @VarWindowSizeMilliseconds, 
    @VarWindowStart);

SELECT * 
FROM @T
ORDER BY LastUpdated;

Result

+----+----------+-------------+-------------------------+
| ID | Supplier | ProductCode |       LastUpdated       |
+----+----------+-------------+-------------------------+
|  2 | Acme     |       00002 | 2016-03-23 06:02:57.260 |
|  6 | Q & B    |       00045 | 2016-03-23 06:48:07.203 |
|  3 | XYZ      |       00023 | 2016-03-23 06:52:38.813 |
|  1 | Acme     |       00001 | 2016-03-23 07:20:16.750 |
|  4 | XYZ      |       00055 | 2016-03-23 07:23:21.140 |
|  5 | Q & B    |       00453 | 2016-03-23 08:59:53.303 |
+----+----------+-------------+-------------------------+

I'm pretty sure it will take less than 15 hours to process 3M rows using this query.


To clarify, this method doesn't guarantee that generated values in LastUpdated will be unique. There is a pretty good chance that for 3M rows there will be few duplicates and I'm too lazy now to calculate the probability, but I think that for your purposes the result will be good enough.

You can use datetime2(7) with a microsecond precision, which would substantially reduce the probability of collisions. There still will be a chance of collision with this method. If this is not acceptable, you should use some other method.


Update

I will try to guess what you mean by

We must also ensure that the suppliers are spread out evenly, so suppliers' products are not bunched together

The query above generates random number for each row independently. As a result, if your table has, for example, supplier A with 100 rows and supplier B with 10 rows, you would get ~10 rows from A, then 1 row from B, then another ~10 rows from A, then 1 row from B, etc (with various random fluctuations, of course).

This result will look like rows from A are clustered together at the top (and at the bottom, which is not so important).

If you want to guarantee that when table has N suppliers, the first N rows are all from N different suppliers, the next N rows are again from N different suppliers, etc., then below is one way to achieve it.

In the example with suppliers A and B you may want to have at first 2 random rows from A & B, then another 2 random rows from A & B, and so on for 10 pairs of rows, then the rest 90 rows from A in random order. This result will look thoroughly shuffled at the top between suppliers, while it will have only supplier A at the bottom.

I'll extend your sample data slightly.

DECLARE @T TABLE 
(ID int, Supplier nvarchar(50), ProductCode char(5), LastUpdated datetime);

INSERT INTO @T (ID, Supplier, ProductCode, LastUpdated) VALUES
(1, 'Acme',  '00001', '2016-03-23 06:00:00'),
(2, 'Acme',  '00002', '2016-03-23 06:00:00'),
(3, 'XYZ',   '00023', '2016-03-23 05:30:00'),
(4, 'XYZ',   '00055', '2016-03-23 05:30:00'),
(5, 'Q & B', '00453', '2016-03-23 04:15:00'),
(6, 'Q & B', '00045', '2016-03-23 04:15:00'),
(10, 'ABC',   '00010', '2016-03-23 06:30:00'),
(11, 'ABC',   '00011', '2016-03-23 06:30:00'),
(12, 'ABC',   '00012', '2016-03-23 06:30:00'),
(13, 'ABC',   '00013', '2016-03-23 06:30:00'),
(20, 'QWE',   '00020', '2016-03-23 02:30:00'),
(21, 'QWE',   '00021', '2016-03-23 02:30:00'),
(22, 'QWE',   '00022', '2016-03-23 02:30:00'),
(23, 'QWE',   '00023', '2016-03-23 02:30:00'),
(24, 'QWE',   '00024', '2016-03-23 02:30:00'),
(25, 'QWE',   '00025', '2016-03-23 02:30:00'),
(26, 'QWE',   '00026', '2016-03-23 02:30:00');

Query

WITH
CTE_Rnd
AS
(
    SELECT
        ID
        ,Supplier
        ,ProductCode
        ,LastUpdated
        ,CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5 AS RandomNo
    FROM @T
)
SELECT
    ID
    ,Supplier
    ,ProductCode AS Code
    ,LastUpdated
    ,RandomNo
    ,ROW_NUMBER() OVER (PARTITION BY Supplier ORDER BY RandomNo) AS RowNo
FROM CTE_Rnd
ORDER BY RowNo, RandomNo
;

Here we assign a random number for each row and partition all rows by supplier, so that we get one random row from each supplier first, then another random row from each supplier again, and so on.

Result

+----+----------+-------+-------------------------+----------------+-------+
| ID | Supplier | Code  |       LastUpdated       |    RandomNo    | RowNo |
+----+----------+-------+-------------------------+----------------+-------+
|  5 | Q & B    | 00453 | 2016-03-23 04:15:00.000 | 0.027009220917 |     1 |
| 10 | ABC      | 00010 | 2016-03-23 06:30:00.000 | 0.091210213162 |     1 |
| 24 | QWE      | 00024 | 2016-03-23 02:30:00.000 | 0.128781275971 |     1 |
|  3 | XYZ      | 00023 | 2016-03-23 05:30:00.000 | 0.248359622609 |     1 |
|  1 | Acme     | 00001 | 2016-03-23 06:00:00.000 | 0.347667260759 |     1 |
| 13 | ABC      | 00013 | 2016-03-23 06:30:00.000 | 0.154523770012 |     2 |
| 20 | QWE      | 00020 | 2016-03-23 02:30:00.000 | 0.211252812229 |     2 |
|  2 | Acme     | 00002 | 2016-03-23 06:00:00.000 | 0.580557575235 |     2 |
|  6 | Q & B    | 00045 | 2016-03-23 04:15:00.000 | 0.921966472273 |     2 |
|  4 | XYZ      | 00055 | 2016-03-23 05:30:00.000 | 0.952229986538 |     2 |
| 11 | ABC      | 00011 | 2016-03-23 06:30:00.000 | 0.425760054944 |     3 |
| 23 | QWE      | 00023 | 2016-03-23 02:30:00.000 | 0.455146492682 |     3 |
| 21 | QWE      | 00021 | 2016-03-23 02:30:00.000 | 0.485142913877 |     4 |
| 12 | ABC      | 00012 | 2016-03-23 06:30:00.000 | 0.554860722984 |     4 |
| 25 | QWE      | 00025 | 2016-03-23 02:30:00.000 | 0.668510247526 |     5 |
| 22 | QWE      | 00022 | 2016-03-23 02:30:00.000 | 0.815049941724 |     6 |
| 26 | QWE      | 00026 | 2016-03-23 02:30:00.000 | 0.818780491668 |     7 |
+----+----------+-------+-------------------------+----------------+-------+

As you can see top 5 rows are all from different suppliers in random order. Next 5 rows are again from different suppliers in random order. Then we have only supliers ABC and QWE left, and finally only supplier QWE.

I will leave it to you to figure out how to combine two numbers from RowNo and RandomNo columns into one datetime value in LastUpdated. RowNo could become number of minutes and RandomNo could become seconds plus milliseconds.