Postgresql – Filtering out duplicate domains from URL column using Postgres full-text search parsers

full-text-searchparsepostgresql

I have a PostgreSQL database containing pages and links downloaded by a web crawler, with the following tables:

pages
----------
id:          Integer (primary key)
url:         String  (unique)
title:       String
text:        String
html:        String
last_visit:  DateTime
word_pos:    TSVECTOR


links
----------
id         Integer (primary key)
source:    String
target:    String  
link_text: String
UNIQUE(source,target)


crawls
---------
id:         Integer (primary key)
query:      String


crawl_results
-------------
id:       Integer (primary key)
score:    Integer (constraint 0<=score<=1)
crawl_id: Integer (foreign key, crawls.id)
page_id:  Integer (foreign key, pages.id)

The source and target fields in the links table contain URLs. I am running the following query to extract scored links from the top-ranking search results, for pages that haven't been fetched yet:

WITH top_results AS 
    (SELECT page_id, score FROM crawl_results 
     WHERE crawl_id=$1 
     ORDER BY score LIMIT 100)
SELECT top_results.score, l.target
FROM top_results 
    JOIN pages p ON top_results.page_id=p.id
    JOIN links l on p.url=l.source 
WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)

However, I would like to filter these results so that only one row is returned for a given domain (the one with the lowest score). So for instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8, 'http://www.foo.com/zor'), I only want the first because it has same domain foo.com and has the lower score.

I was able to find documentation for the builtin full text search parsers, which can parse URLS and extract the hostname. For instance, I can extract the hostname from a URL as follows:

SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;

    token    
-------------
 www.foo.com
(1 row)

However, I can't figure out how I would integrate this into the above query to filter out duplicate domains from the results. And because this is the docs for "testing and debugging text search", I don't know if this use of ts_parse() is even related to how the URL parser is intended to be used in practice.

How would I use the host parser in my query above to return one row per domain? Also, how would I appropriately index the links table for host and url lookup?

Best Answer

After asking around on IRC, the Postgres mailing list, and a few other places, this is the best I've been able to come up with. I'm going to put it up in case it helps someone else with similar problem,and also because I'm hoping that someone can tell me how to improve on it. I first created the following views:

    CREATE VIEW scored_pages AS (
        SELECT crawl_results.crawl_id, crawl_results.score, 
            crawl_results.page_id, pages.url
        FROM crawl_results 
        JOIN pages ON crawl_results.page_id = pages.id 
    );


    CREATE VIEW scored_links AS (
        SELECT scored_pages.score, links.source, 
            links.target, links.link_text
        FROM links
        JOIN scored_pages ON scored_pages.url = links.source
    );

Then, using these views, I did the following query to extract the links from the lowest scored pages in the results:

SELECT score, host, target 
FROM (
        SELECT DISTINCT ON (token) token AS host, score, target 
        FROM scored_links,
        LATERAL ts_parse('default', target) WHERE tokid = 6
        ORDER BY token, score
) as x
WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE target=pp.url)
ORDER BY score;

I would appreciate any feedback on how to improve this, or anything I'm doing wrong. In the meantime, I'll leave this question open in case someone comes by later with a better answer.