Postgresql – Filtering out duplicate domains from URL column using Postgres full-text search parsers

full-text-searchparsepostgresql

I have a PostgreSQL database containing pages and links downloaded by a web crawler, with the following tables:

pages
----------
id:          Integer (primary key)
url:         String  (unique)
title:       String
text:        String
html:        String
last_visit:  DateTime
word_pos:    TSVECTOR


links
----------
id         Integer (primary key)
source:    String
target:    String  
link_text: String
UNIQUE(source,target)


crawls
---------
id:         Integer (primary key)
query:      String


crawl_results
-------------
id:       Integer (primary key)
score:    Integer (constraint 0<=score<=1)
crawl_id: Integer (foreign key, crawls.id)
page_id:  Integer (foreign key, pages.id)

The source and target fields in the links table contain URLs. I am running the following query to extract scored links from the top-ranking search results, for pages that haven't been fetched yet:

WITH top_results AS 
    (SELECT page_id, score FROM crawl_results 
     WHERE crawl_id=$1 
     ORDER BY score LIMIT 100)
SELECT top_results.score, l.target
FROM top_results 
    JOIN pages p ON top_results.page_id=p.id
    JOIN links l on p.url=l.source 
WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)

However, I would like to filter these results so that only one row is returned for a given domain (the one with the lowest score). So for instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8, 'http://www.foo.com/zor'), I only want the first because it has same domain foo.com and has the lower score.

I was able to find documentation for the builtin full text search parsers, which can parse URLS and extract the hostname. For instance, I can extract the hostname from a URL as follows:

SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;

    token    
-------------
 www.foo.com
(1 row)

However, I can't figure out how I would integrate this into the above query to filter out duplicate domains from the results. And because this is the docs for "testing and debugging text search", I don't know if this use of ts_parse() is even related to how the URL parser is intended to be used in practice.

How would I use the host parser in my query above to return one row per domain? Also, how would I appropriately index the links table for host and url lookup?

Best Answer

After asking around on IRC, the Postgres mailing list, and a few other places, this is the best I've been able to come up with. I'm going to put it up in case it helps someone else with similar problem,and also because I'm hoping that someone can tell me how to improve on it. I first created the following views:

    CREATE VIEW scored_pages AS (
        SELECT crawl_results.crawl_id, crawl_results.score, 
            crawl_results.page_id, pages.url
        FROM crawl_results 
        JOIN pages ON crawl_results.page_id = pages.id 
    );


    CREATE VIEW scored_links AS (
        SELECT scored_pages.score, links.source, 
            links.target, links.link_text
        FROM links
        JOIN scored_pages ON scored_pages.url = links.source
    );

Then, using these views, I did the following query to extract the links from the lowest scored pages in the results:

SELECT score, host, target 
FROM (
        SELECT DISTINCT ON (token) token AS host, score, target 
        FROM scored_links,
        LATERAL ts_parse('default', target) WHERE tokid = 6
        ORDER BY token, score
) as x
WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE target=pp.url)
ORDER BY score;

I would appreciate any feedback on how to improve this, or anything I'm doing wrong. In the meantime, I'll leave this question open in case someone comes by later with a better answer.

Related Solutions

Sql-server – Full Text search returns duplicate keys

It turns out that the problem was cause by a bad full text search index. It had to be rebuild with do not track changes, and the correct word breaker for the given language.

Postgresql – Postgres copy data with \xYY as plain string instead of interpreting as encoded string

The backslash \ is the default escape character in the (default) text format of COPY. The format you describe is recognized as (quoting the manual here)

backslash sequences are recognized by COPY FROM
...
\xdigits | Backslash x followed by one or two hex digits specifies the character with that numeric code

.. which seems to be just as intended.
To avoid this effect you could use the CSV format instead, where the backslash has no special meaning:

COPY tbl FROM '/path/to/file/log.sql' (FORMAT csv);

Be sure to read the manual to learn about additional differences.

Alternatively you could replace every \ with \\ in your file with a sed script or something.

If the above doesn't solve your problem, please supply:

PostgreSQL version
Database encoding
Demo data
Definition of the table you COPY to
The exact COPY invocation you used

Best Answer

Related Solutions

Sql-server – Full Text search returns duplicate keys

Postgresql – Postgres copy data with \xYY as plain string instead of interpreting as encoded string

Related Question