Give this a try:
SELECT
    t1.UserName AS "Account/UserName",
    t1.Hash     AS "Account/Hash",
    t2.Phone    AS "Personal/Phone",
    t2.City     AS "Personal/City"
FROM table1 t1
INNER JOIN table2 t2
    ON t1.rid = t2.rid
FOR XML PATH ('UserDetails');
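For reference, with one row per user, the slash-separated column aliases should produce XML shaped roughly like this (all values are placeholders):

```xml
<UserDetails>
  <Account>
    <UserName>alice</UserName>
    <Hash>9f86d08...</Hash>
  </Account>
  <Personal>
    <Phone>555-0100</Phone>
    <City>Springfield</City>
  </Personal>
</UserDetails>
```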
I have added a second code sample in response to your comment. I have no idea if this will work, or whether it is even the best way to accomplish what you want. Please let me know.
SELECT
    t1.UserName AS "Account/UserName",
    t1.Hash     AS "Account/Hash",
    (
        SELECT
            t2.Phone AS "Personal/Phone",
            t2.City  AS "Personal/City"
        FROM table2 t2
        WHERE t2.rid = t1.rid
        FOR XML PATH ('Personals'), TYPE -- TYPE keeps the nested XML from being entitized
    )
FROM table1 t1
FOR XML PATH ('UserDetails');
First off, can that be right? You write:

I want to fetch all data with the updated_at field with a date of a few days ago.

But your WHERE condition is:

(date(updated_at)) < (date(now()) - 7)

Shouldn't that be > instead?
Indexes
For optimal performance, you could ...
- partition your indexes
- exclude irrelevant rows from the indexes
- automatically recreate indexes at off-hours with an updated predicate.
Your indexes could look like:
CREATE INDEX objects_id_up_1_idx ON objects ((updated_at::date) DESC NULLS LAST)
WHERE id BETWEEN 0 AND 999999
AND updated_at > '2012-10-01 0:0'::timestamp; -- some minimum date

CREATE INDEX objects_id_up_2_idx ON objects ((updated_at::date) DESC NULLS LAST)
WHERE id BETWEEN 1000000 AND 1999999
AND updated_at > '2012-10-01 0:0'::timestamp; -- some minimum date
-- etc.
(Assuming updated_at is type timestamp. With timestamptz, the cast to date is not IMMUTABLE and you need to define the time zone first ...)
The second condition excludes irrelevant rows from the index right away, which should make it smaller and faster - depending on your actual data distribution. In accordance with my preliminary comment, I am assuming you want newer rows.
The condition also automatically excludes NULL values in updated_at - which you seem to allow in the table and obviously want to exclude in the query.

The usefulness of the index deteriorates over time. The query always retrieves the latest entries. Recreate the index with an updated WHERE clause periodically. This requires an exclusive lock on the table, so do it at off hours. There is also CREATE INDEX CONCURRENTLY to minimize the duration of locks:
CREATE INDEX CONCURRENTLY objects_id_up_201211_idx ...; -- create new idx
DROP INDEX CONCURRENTLY objects_id_up_201210_idx; -- then drop old
DROP INDEX allows CONCURRENTLY since Postgres 9.2.
To further optimize, you could use CLUSTER like we mentioned in the comments. But you need a full index for that; it doesn't work with a partial index. You would create one temporarily:
CREATE INDEX objects_full_idx ON objects (id/1000000, (updated_at::date) DESC NULLS LAST);
This form of the full index matches the sort order of above partial indexes.
CLUSTER objects USING objects_full_idx;
ANALYZE objects;
This will take a while, since the table is rewritten physically. It is also effectively a VACUUM FULL. It needs an exclusive write lock on the table, so do it at off-hours - provided you can afford that at all. Again, there are less invasive alternatives: pg_repack or pg_squeeze.
You can then drop the index again (if it's unused). It's a one-time effect. I would at least try this once to see how much your queries benefit from it. The effect deteriorates with subsequent write operations. You could repeat this procedure at off hours if you see a substantial effect.
If your table receives a lot of write operations, you have to weigh cost and benefit for this step. For many UPDATEs, consider setting a FILLFACTOR lower than 100. Do that before you CLUSTER.
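A minimal sketch of that step, assuming the objects table and full index from above (the value 90 is illustrative; the right fillfactor depends on your update pattern):

```sql
-- Leave 10% of each heap page free so updated row versions
-- can stay on the same page (enables HOT updates)
ALTER TABLE objects SET (fillfactor = 90);

-- Rewrite the table so existing pages honor the new setting
CLUSTER objects USING objects_full_idx;
ANALYZE objects;
```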
Query
SELECT count(*)
FROM objects
WHERE id BETWEEN 0 AND 999999 -- match conditions of partial index!
AND updated_at > '2012-10-01 0:0'::timestamp
AND updated_at::date > (now()::date - 7);
More
There are more advanced techniques for index partitioning which, among other things, provide example code for automatic index (re-)creation.
Make sure that autovacuum is running properly. The huge gain from CLUSTER you have reported may be due in part to the implicit vacuuming that you get from CLUSTER. Maybe this is set up by Heroku automatically, not sure.

The settings in your question look good. So that's probably not an issue here and CLUSTER really was that effective.
Declarative partitioning has finally matured in Postgres 12. I would consider using that now instead of manual index partitioning (or at least additionally): range partitioning with updated_at as the partition key. There are also multiple improvements to general performance, big data, and B-tree index performance in particular.
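A minimal sketch of what that could look like, assuming the objects table from above (the column list and partition bounds are illustrative only):

```sql
CREATE TABLE objects (
    id         bigint,
    updated_at timestamp
) PARTITION BY RANGE (updated_at);

-- One partition per month; pruning skips partitions
-- that cannot match the WHERE clause on updated_at
CREATE TABLE objects_2012_10 PARTITION OF objects
    FOR VALUES FROM ('2012-10-01') TO ('2012-11-01');

CREATE TABLE objects_2012_11 PARTITION OF objects
    FOR VALUES FROM ('2012-11-01') TO ('2012-12-01');
```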
Best Answer
The naive approach would be to wrap your query in another SELECT and use replace() on <, and do the same for >. An easier and more elegant method would be to do it in the language that should deliver/process/... the XML.
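A minimal sketch of the naive approach, assuming a hypothetical column xml_col in a table t (note that in full XML escaping, & would have to be replaced first, before < and >):

```sql
SELECT REPLACE(REPLACE(xml_col, '<', '&lt;'), '>', '&gt;') AS escaped_xml
FROM t;
```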