Postgresql – postgres fetch record from partitions without the check constraints

partitioning postgresql

I am using Postgres 9.4.
I have a messages table with 100M records. Each message belongs to one feed_id, and I decided to partition the table by posted_at.

 Table "public.messages"
            Column            |            Type             | Modifiers
------------------------------+-----------------------------+-----------
 id_str                       | character varying(255)      | not null
 feed_id                      | integer                     |
 replied_parent_id            | character varying(255)      |
 shared_parent_id             | character varying(255)      |
 message                      | character varying(255)      |
 posted_at                    | timestamp without time zone |
Triggers:
    messages_insert_trigger BEFORE INSERT ON messages FOR EACH ROW EXECUTE PROCEDURE create_messages_partition_and_insert()
Number of child tables: 231 (Use \d+ to list them.)
Child tables: messages_y2008_w45,
          messages_y2015_w01,
          messages_y2015_w04,
          messages_y2015_w03,
          messages_y2015_w02,
          messages_y2009_w23,
          ....

\d messages_y2008_w45
Table "public.messages_y2008_w45"
-- SAME AS messages
Indexes:
    "messages_y2008_w45_pkey" PRIMARY KEY, btree (id_str)
    "index_messages_y2008_w45_on_posted_at" btree (posted_at DESC)
Check constraints:
    "messages_y2008_w45_posted_at_check" CHECK (posted_at >= '2008-11-03 00:00:00'::timestamp without time zone AND posted_at < '2008-11-10 00:00:00'::timestamp without time zone)

Inherits: messages

There is a problem when I try to query parent messages. A parent message can have multiple child messages that were posted on different dates.
I fetch the child messages using a date range and select the shared_parent_id from them. Then I try to fetch the parent messages, but I don't have a date range for them: a parent message can fall outside the child messages' date range. What I could do is add a parent_posted_at column to the messages table, but updating all the existing messages would take very long.
It is also not ideal to store parent info in the child messages.

Is there any better way?

My question may be more about the design of the DB: should I use partitioning in this case, or maybe partition differently?

EDIT:

Here are some example rows. I want all May 2015 messages, so I select [3,4,5,6]. But [5,6] refer to [1,2].

To select the May messages I have the date range, so the query fits the check constraints. After that I have to find [1,2], and the problem is that I do not know the parents' posted_at; I only have the ids. So the lookup will scan through all the partition tables. Maybe I should not partition by posted_at, and should partition by feed_id instead?

 id_str | feed_id | replied_parent_id | shared_parent_id |  message |      posted_at
--------+---------+-------------------+------------------+----------+---------------------
      1 |      26 |                   |                  | message1 | 2015-03-03 15:55:55
      2 |      26 |                   |                  | message2 | 2015-04-03 15:55:40
      3 |      26 |                   |                  | message3 | 2015-05-03 15:54:41
      4 |      26 |                   |                  | message4 | 2015-05-03 15:54:34
      5 |      26 |                   |                1 |          | 2015-05-03 15:54:15
      6 |      26 |                   |                2 |          | 2015-05-03 15:54:00

(6 rows)

Best Answer

A few details are still unclear, but let's see what we can do now. First, about 100M rows in 231 partitions does not sound ideal. The individual tables are too small and their number is too high. I cannot name the threshold, but at some point query planning might get too expensive. I think it is quite possible that yearly partitions would be enough. Alternatively, if you really want to fetch a whole month at a time, create monthly partitions.
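As a rough sketch of what a monthly partition could look like under the inheritance-based scheme used in 9.4: the table and index names below are hypothetical (modelled on the existing weekly names), and the insert trigger function would need to be adapted to route rows to monthly tables, which is not shown here.

```sql
-- Hypothetical monthly partition for May 2015 (name is illustrative only).
CREATE TABLE messages_y2015_m05 (
    CHECK (posted_at >= TIMESTAMP '2015-05-01 00:00:00'
       AND posted_at <  TIMESTAMP '2015-06-01 00:00:00')
) INHERITS (messages);

-- Primary keys and indexes are not inherited, so each child needs its own.
ALTER TABLE messages_y2015_m05 ADD PRIMARY KEY (id_str);
CREATE INDEX index_messages_y2015_m05_on_posted_at
    ON messages_y2015_m05 (posted_at DESC);

-- Inspect the plan to see which child tables are actually scanned.
-- Constraint exclusion relies on constraint_exclusion being 'partition'
-- (the default) or 'on'.
EXPLAIN
SELECT id_str, shared_parent_id
  FROM messages
 WHERE posted_at >= TIMESTAMP '2015-05-01 00:00:00'
   AND posted_at <  TIMESTAMP '2015-06-01 00:00:00';
```

If the plan lists only the child tables whose check constraints overlap the requested range, constraint exclusion is working as intended.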

Now to the actual problem.

It is not quite clear to me why you have rows in the parent table. The usual way of partitioning is that the parent is empty, and every row is redirected to one of the children.

At the same time, if you have an index on posted_at of the parent table (as you have on the children), finding rows in the parent based on the timestamp is easy.

On the other hand, while I'm not sure which column shared_parent_id refers to, you can define an index on it, too - looking rows up based on this will be easy, too.
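Because indexes are not inherited by child tables in 9.4, an index on shared_parent_id would have to be created on each table separately. One possible way to do that, assuming the parent is public.messages as shown in the question, is to loop over the children listed in pg_inherits:

```sql
-- Index on the parent table itself (only covers rows stored in the parent).
CREATE INDEX ON messages (shared_parent_id);

-- Sketch: create the same index on every direct child of public.messages.
DO $$
DECLARE
    child regclass;
BEGIN
    FOR child IN
        SELECT inhrelid::regclass
          FROM pg_inherits
         WHERE inhparent = 'public.messages'::regclass
    LOOP
        -- regclass renders as a properly quoted table name;
        -- the index name is left for PostgreSQL to generate.
        EXECUTE format('CREATE INDEX ON %s (shared_parent_id)', child);
    END LOOP;
END
$$;
```

On a busy system each CREATE INDEX blocks writes to the table it runs on, so this is something to try on a copy first.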

The only thing that still has to be added is telling your query to look for parents in the parent table only. Let's have a look at a possible query:

WITH child_messages AS (
    SELECT shared_parent_id, {other interesting columns}
      FROM messages
     WHERE posted_at {matches your needs}
)
SELECT *
  FROM child_messages

UNION ALL

SELECT shared_parent_id, {other interesting columns}
  FROM ONLY messages -- this way it does not go to the children
 WHERE {unclear column} IN (SELECT shared_parent_id FROM child_messages);

The WITH query may pick up rows from the parent, too. You may or may not want this, so adjust the query accordingly.

Furthermore, the performance might not be ideal. In that case there is room for tweaking the query (e.g. a JOIN instead of the IN(), moving the query in the WITH clause into a subquery, and so on).
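A JOIN-based variant of the second half of the query above might look like the following sketch. It assumes that shared_parent_id holds the id_str of the referenced row (which the example rows in the question suggest, but which is not confirmed) and uses May 2015 as an illustrative date range.

```sql
SELECT p.*
  FROM ONLY messages AS p        -- ONLY: parent table only, as in the query above
  JOIN (
        -- Inlined subquery instead of the WITH clause, so the planner
        -- can optimize it together with the outer query.
        SELECT DISTINCT shared_parent_id
          FROM messages
         WHERE posted_at >= TIMESTAMP '2015-05-01 00:00:00'
           AND posted_at <  TIMESTAMP '2015-06-01 00:00:00'
           AND shared_parent_id IS NOT NULL
       ) AS c
    ON p.id_str = c.shared_parent_id;
```

If the referenced rows live in the child tables rather than in the parent table, drop the ONLY keyword; the lookup then goes through the id_str primary key index of each child. Comparing the EXPLAIN ANALYZE output of both forms is the way to tell which one performs better on your data.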

And a final note: varchar(255) is usually a sign of a value whose length is not known in advance. If you really want to constrain it, you may want to choose a meaningful limit. Otherwise, an unlimited varchar (or text) has a slight performance advantage in PostgreSQL over the limited ones. Furthermore, from your example it seems that shared_parent_id is a number (integer), so use the best-fitting type.
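For illustration only, and assuming the message column has no meaningful upper bound, the type change could be written as below. Whether the id columns can safely become integers depends on the real data, so that conversion is deliberately left out.

```sql
-- Run against the parent; the column type change cascades to all child tables.
ALTER TABLE messages
    ALTER COLUMN message TYPE text;
```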