Postgresql – Storing logs on per day basis in PostgreSQL

clustering, postgresql

The requirement here is to store historical logs.

For the sake of simplicity I'll assume this sample scenario: we sell some products on our website and need to keep records of the sales of every product made by each salesman on a per-day basis. The number of salesmen and products is known and constant.

Now, say we have 3 products for sale: a laptop, a coffee mug and a pen.
Here is the table I use to keep a record of today's sales (live records, which are updated throughout the day):

CREATE TABLE IF NOT EXISTS sales_record (
    id SERIAL,
    salesman_id INT NOT NULL,
    sold_laptop INT NOT NULL,
    sold_mugs INT NOT NULL,
    sold_pen INT NOT NULL,
    PRIMARY KEY (id)
);

And another table to keep records of old data

CREATE TABLE IF NOT EXISTS sales_record_log (
    id SERIAL,
    salesman_id INT NOT NULL,
    sold_laptop INT NOT NULL,
    sold_mugs INT NOT NULL,
    sold_pen INT NOT NULL,
    record_for_day TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);

Having this up and running, all we need to do is keep updating the records, incrementing the proper column by 1 for every sale.
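For example, when salesman 2 (an arbitrary id) sells a laptop, the live row is bumped like this:

UPDATE sales_record
    SET sold_laptop = sold_laptop + 1
    WHERE salesman_id = 2;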

The problem with this setup is that at the end of every day we have to move the data from the live table to the log table and reset the sold_* counters in the live table, like this:

INSERT INTO sales_record_log
    (salesman_id, sold_laptop, sold_mugs, sold_pen)
        SELECT salesman_id, sold_laptop, sold_mugs, sold_pen
        FROM sales_record;

UPDATE sales_record SET sold_laptop = 0, sold_pen = 0, sold_mugs = 0;

To automate the logging, data movement and clean-up, I was looking at a possible solution using table clustering in PostgreSQL and triggers that automatically update a descendant table whenever a new record is inserted into the parent table. However, I can't find a way to synchronize the proper records on every UPDATE of the live table: the trigger would have to check whether the log table already has a record for today, update the proper column if it does, and otherwise insert a new row and then update the value.
Keep in mind that a salesman may not make any sale for several days; when he makes a sale after, say, 5 days, the log table will have no records for him covering those 5 days in which no sale was made.
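The rough shape of what I was trying looks like this (an untested sketch; the function and trigger names are placeholders, it assumes the two-table layout above, and it still does not handle concurrent inserts or the no-sale days):

CREATE OR REPLACE FUNCTION sync_sales_log() RETURNS trigger AS
$$
BEGIN
    -- Try to update today's row in the log table first
    UPDATE sales_record_log
        SET sold_laptop = NEW.sold_laptop,
            sold_mugs = NEW.sold_mugs,
            sold_pen = NEW.sold_pen
        WHERE salesman_id = NEW.salesman_id
          AND record_for_day::date = CURRENT_DATE;

    -- No row for today yet, so create one
    IF NOT FOUND THEN
        INSERT INTO sales_record_log
            (salesman_id, sold_laptop, sold_mugs, sold_pen)
        VALUES
            (NEW.salesman_id, NEW.sold_laptop, NEW.sold_mugs, NEW.sold_pen);
    END IF;

    RETURN NEW;
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER sales_record_sync
    AFTER UPDATE ON sales_record
    FOR EACH ROW EXECUTE PROCEDURE sync_sales_log();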

I have a strong feeling that there is a much better setup than this, but I am clueless. Can anyone please help me set it up, or point me in the right direction if I am doing it the wrong way?

Edit


I decided to maintain both the live and the old records in the same table, using the old-records table structure with a slight modification to the PRIMARY KEY, i.e.

CREATE TABLE IF NOT EXISTS sales_record (
    id SERIAL,
    salesman_id INT NOT NULL,
    sold_laptop INT NOT NULL,
    sold_mugs INT NOT NULL,
    sold_pen INT NOT NULL,
    record_for_day DATE NOT NULL DEFAULT CURRENT_DATE,
    PRIMARY KEY (salesman_id, record_for_day)
);

Please note the changes to the PRIMARY KEY and to the data type of the record_for_day column.
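With everything in one table, the "live" data is simply today's rows, and history is everything else. For illustration (the date range below is arbitrary):

-- Today's live totals, updated throughout the day
SELECT salesman_id, sold_laptop, sold_mugs, sold_pen
FROM sales_record
WHERE record_for_day = CURRENT_DATE;

-- Historical totals for one salesman over some range
SELECT record_for_day, sold_laptop, sold_mugs, sold_pen
FROM sales_record
WHERE salesman_id = 2
  AND record_for_day BETWEEN DATE '2013-02-01' AND DATE '2013-02-18'
ORDER BY record_for_day;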

Now I insert the data with this UPSERT-like function:

CREATE FUNCTION update_log(salesman INT,
                           laptops INT DEFAULT 0,
                           mugs INT DEFAULT 0,
                           pens INT DEFAULT 0,
                           for_day DATE DEFAULT CURRENT_DATE)
RETURNS VOID AS
$$
BEGIN
    LOOP
        -- First try to bump the existing row for this salesman and day
        UPDATE sales_record SET
            sold_laptop = sold_laptop + laptops,
            sold_mugs = sold_mugs + mugs,
            sold_pen = sold_pen + pens
            WHERE salesman_id = salesman AND record_for_day = for_day;

        IF found THEN
            RETURN;
        END IF;

        -- No row yet for this salesman and day, so insert one
        BEGIN
            INSERT INTO sales_record
                (salesman_id, sold_laptop, sold_mugs, sold_pen, record_for_day)
            VALUES
                (salesman, laptops, mugs, pens, for_day);
            RETURN;

            EXCEPTION WHEN unique_violation THEN
                -- Reached only if someone else inserted the row in the meantime;
                -- in that case, loop and try the update again
        END;
    END LOOP;
END;
$$
LANGUAGE plpgsql;

Trying it out:

SELECT * FROM sales_record;

0 Records

SELECT update_log(2, 1, 0, 0, CURRENT_DATE);

SELECT * FROM sales_record;

+---------------------------------------------------------------------+
| id  | salesman_id | sold_laptop | sold_mugs | sold_pen |   for_day  |
|---------------------------------------------------------------------|
| 1   |    2        |    1        |    0      |   0      | 18/02/2013 |
+---------------------------------------------------------------------+

SELECT update_log(2, 0, 1, 1, CURRENT_DATE);

+---------------------------------------------------------------------+
| id  | salesman_id | sold_laptop | sold_mugs | sold_pen |   for_day  |
|---------------------------------------------------------------------|
| 1   |    2        |    1        |    1      |   1      | 18/02/2013 |
+---------------------------------------------------------------------+

SELECT update_log(32, 0, 1, 1, CURRENT_DATE);

SELECT * FROM sales_record;

+---------------------------------------------------------------------+
| id  | salesman_id | sold_laptop | sold_mugs | sold_pen |   for_day  |
|---------------------------------------------------------------------|
| 2   |    32       |    0        |    1      |   1      | 18/02/2013 |
|---------------------------------------------------------------------|
| 1   |    2        |    1        |    1      |   1      | 18/02/2013 |
+---------------------------------------------------------------------+

This is better than cron, triggers or rules, but still not satisfactory.
I know I am being greedy, but who doesn't want more?

It still looks like a hack to me, and it might fail at some point, though I have not tested it thoroughly.
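For what it is worth, on PostgreSQL 9.5 or newer the whole retry loop can be replaced with a single INSERT ... ON CONFLICT statement keyed on the same primary key; a sketch with made-up values:

INSERT INTO sales_record AS sr
    (salesman_id, sold_laptop, sold_mugs, sold_pen, record_for_day)
VALUES
    (2, 1, 0, 0, CURRENT_DATE)
ON CONFLICT (salesman_id, record_for_day) DO UPDATE
    SET sold_laptop = sr.sold_laptop + EXCLUDED.sold_laptop,
        sold_mugs = sr.sold_mugs + EXCLUDED.sold_mugs,
        sold_pen = sr.sold_pen + EXCLUDED.sold_pen;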

I am still open to suggestions.

Best Answer

The best option IMO is to use a cron job to generate the logs for 'yesterday'::date. You could also use triggers before insert/update/delete to keep the other table up to date, including for the current day, but that adds complexity and overhead and gets pretty complicated. Generate your historical logs once the data won't change anymore.

In that case you write an SQL query and run it via psql from cron.
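For the original two-table layout, the script run from cron could be a transaction as simple as this (a sketch; it assumes the job runs shortly after midnight, and the file would be invoked with something like psql -f from the crontab):

BEGIN;

-- Copy yesterday's totals into the log table
INSERT INTO sales_record_log
    (salesman_id, sold_laptop, sold_mugs, sold_pen, record_for_day)
    SELECT salesman_id, sold_laptop, sold_mugs, sold_pen, CURRENT_DATE - 1
    FROM sales_record;

-- Reset the live counters for the new day
UPDATE sales_record SET sold_laptop = 0, sold_mugs = 0, sold_pen = 0;

COMMIT;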

I would also add a trigger denying updates or deletes to records already covered by your historical data, if you can.
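A sketch of such a guard, assuming the single-table layout from the edit above (the names are made up):

CREATE OR REPLACE FUNCTION forbid_history_changes() RETURNS trigger AS
$$
BEGIN
    -- Reject changes to rows that belong to a closed (past) day
    IF OLD.record_for_day < CURRENT_DATE THEN
        RAISE EXCEPTION 'historical sales rows are read-only';
    END IF;
    RETURN COALESCE(NEW, OLD);   -- NEW is NULL for DELETE
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER sales_record_history_guard
    BEFORE UPDATE OR DELETE ON sales_record
    FOR EACH ROW EXECUTE PROCEDURE forbid_history_changes();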

This gives you a few benefits:

  1. It is more obvious when it breaks

  2. It is simpler, with simpler failure cases

Now, as per your concerns:

  1. You say you need rows for every day. This can be handled in a number of relatively easy ways in PostgreSQL; remember that dates support integer math, so you can take a base date and add a series to it to generate a date series. That makes it easy to fill in the missing days when you are generating rows per day, per day of week, etc. (see the sketch after this list).

  2. You say you can't guarantee things won't change. The key question here is what your change window is; do your historical reports only after that window has closed. For example, if the window is a month, you can generate the logs for a given month once a further month has passed (i.e. generate all dates in January during early March). You can then rely on a view to combine the newer, live rows with the older, logged rows. You can also have a trigger which ensures that the date of a row inserted into the orders table is newer than the newest date in the log table.
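As an illustration of point 1, a per-salesman, per-day report with zero-filled gaps can be built with generate_series; this sketch assumes the single-table layout from the edit and an arbitrary date range:

SELECT s.salesman_id,
       d.day::date AS record_for_day,
       COALESCE(sr.sold_laptop, 0) AS sold_laptop,
       COALESCE(sr.sold_mugs, 0) AS sold_mugs,
       COALESCE(sr.sold_pen, 0) AS sold_pen
FROM (SELECT DISTINCT salesman_id FROM sales_record) s
CROSS JOIN generate_series(DATE '2013-02-01',
                           DATE '2013-02-28',
                           INTERVAL '1 day') AS d(day)
LEFT JOIN sales_record sr
       ON sr.salesman_id = s.salesman_id
      AND sr.record_for_day = d.day::date
ORDER BY s.salesman_id, d.day;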

In my experience, worrying about keeping this as a live summary usually isn't necessary. Small organizations (with small data sets) tend to close out their books at least once a year, and live reporting is an option there. Larger organizations with larger data sets tend to close out receivables and payables (i.e. invoices) once a month or so, so the only areas that have to be reported live (because they are subject to adjustment or revision) are open orders (which can be revised) and invoices, which may need to be reviewed occasionally (they should never be revised, but may have adjustments issued against them which might or might not need to be tracked in such a system).