Postgresql – Full text search in XML documents

full-text-searchindex-tuningpostgresqlpostgresql-9.2xml

I would like to do full-text search in XML documents.

However, I can't get matches for attribute values (I get a syntax error if I provide the full attribute value including the : and / characters in the example below). Also, I don't know how to specify my own delimiters for the parser.

Below is SSCCE:

I would like to be able to select for the last component of the attribute value: attributevalue and for the parser to recognize : and / as delimiters and so yield boo1, boo2, some and attributevalue as lexemes.

DROP TABLE IF EXISTS xmldocument;
CREATE TABLE IF NOT EXISTS xmldocument (
i SERIAL NOT NULL,
content XML NOT NULL
);
ALTER TABLE xmldocument ADD PRIMARY KEY (i);

INSERT INTO xmldocument (content) VALUES
('<a>foo</a>')
,('<a boo=''boo1:boo2:boo3/boo4/some/attributevalue''>foo</a>')
;


-- matches both records as expected
SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'foo'; 

-- no match
SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'attributevalue'; 

-- no match
SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'boo2'; 

 -- no match
 SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'boo4'; 

As to why I don't just use ILIKE % etc. The reason is that I need to optimize by using a GIN index and I don't think it is possible or meaningful to build an index on simple VARCHAR values.

* update after accepting the answer*

Based on the accepted answer, the following queries all work as expected:

SELECT * FROM fts.xmldocument WHERE 
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('foo');

SELECT * FROM fts.xmldocument WHERE
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('attributevalue');

SELECT * FROM fts.xmldocument WHERE
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('boo2');

SELECT * FROM fts.xmldocument WHERE
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('boo4');

Best Answer

If you don't want to write your own parser, the quick and dirty solution would be to replace < and > with some other punctuation, so that the existing parsers don't decide to discard them as html tags.

SELECT * 
FROM xmldocument 
WHERE to_tsvector(regexp_replace(content::text,'[<>]',' ','g')) @@ 'boo2';

As to why I don't just use ILIKE % etc. the reason is that I need to optimize by using a GIN index and I don't think it is possible or meaningful to build an index on simple VARCHAR values.

If you use the pg_trgm extension, you can build a gin index on VARCHAR values which will optimize ILIKE queries. How effective it is depends on the size of your indexed documents, and the size of your query. I'd recommend trying it and seeing how it does for you.

Version 1.2 of pg_trgm (to be included in PostgreSQL 9.6, but it is fairly easy to back-port into 9.4 and 9.5 if you are willing to compile some code) will be much more effective with large queries.