Mysql – How to query Wikipedia’s database to retrieve the main topic and the surface forms of a given title

MySQLoptimizationselectsubquery

First, Wikipedia's database scheme diagram is available here. I have used one of their dumps locally to work with it.

Now, I need a single MySQL query to check both page and redirect tables with a given input 'Dog_breeder' or 'Dog_breeding' for example. I want to get the list of both types of pages linked to this input.

For example, Dog_breeding is a main topic and the below pages will redirect to it:

  • Dog_Breeder
  • Dog_breeder
  • Dog_breeders
  • Stud_dog

In the database, this is stored in this fashion:

|---------------------------|       |---------------------------|
| table: page               |       | table: redirect           |
|---------------------------|       |---------------------------|
| page_id   | page_title    |       | rd_from   | rd_title      |
|---------------------------|       |---------------------------|
| 476072    | Dog_breeding  |       | 796236    | Dog_breeding  |
| 796236    | Dog_breeder   |       | 16274701  | Dog_breeding  |
| 16274701  | Dog_breeders  |       | 37353772  | Dog_breeding  |
| 37353772  | Stud_dog      |       | 40738352  | Dog_breeding  |
| 40738352  | Dog_Breeder   |       |---------------------------|
| 11606599  | Dog_breeders  |       
| 22534134  | Dog_breeders  |       
|---------------------------|       

So all pages are in the page table and then redirections are stored in the redirect table (rd_from is the page that will be redirected to the main rd_title page)

I have the query to retrieve all redirect entries of a main topic:

SELECT 
    page_id, 
    page_title, 
    rd_from,
    rd_title
FROM 
    redirect 
    JOIN page ON page_id = rd_from
WHERE rd_title = 'Dog_breeding'

But two problems:

  1. it is not giving me the main page itself ('Dog_breeding')
  2. if I put WHERE rd_title = 'Dog_breeder' it finds nothing…

I would like to get something like the below regardless of the input being Dog_breeding or Dog_breeder:

page_id     | page_title    | rd_from   | rd_title  
-------------------------------------------------------
476072      | Dog_breeding  |           | 
796236      | Dog_breeder   | 796236    | Dog_breeding
16274701    | Dog_breeders  | 16274701  | Dog_breeding
37353772    | Stud_dog      | 37353772  | Dog_breeding
40738352    | Dog_Breeder   | 40738352  | Dog_breeding

I guess I could do it with PHP and several back and forth with the Database (check if it's a main topic, if not grab the main topic, then query like I did above) but I feel like there is a one-query solution?

Best Answer

Initially it seems pretty easy to solve the question, by adding a new criteria to the WHERE clause.

SELECT 
    page.page_id, 
    page.page_title, 
    redirect.rd_from,
    redirect.rd_title
FROM 
    redirect 
    JOIN page ON page.page_id = redirect.rd_from
WHERE 
    redirect.rd_title = 'Dog_breeding' or page.page_title = 'Dog_breeding'

But as Lazhar has pointed out, this only works if the requested text belongs to the main wiki page, not for the redirected pages.

After taking a look at table definitions:

MediaWiki - Manual:page table

page_is_redirect A value of 1 here indicates the article is a redirect; it is 0 in all other cases

MediaWiki - Manual:redirect table

Contains for each page that is currently a redirect (i.e., not for old revisions which are redirects) the id of the source page and the information about target. The target page may or may not exist.

I've decided to use this field to solve the problem.

First we need to know the title of the main page, the page that is not a redirection. In this question: Dog_breeding

select page.page_title as title from page where page.page_title = 'Dog_breeders' and page_is_redirect = 0
union
select redirect.rd_title as title 
from redirect 
where redirect.rd_from in (select page.page_id from page where page.page_title = 'Dog_breeders' and page_is_redirect = 1) 

Then using the first approach:

SELECT 
    page.page_id, 
    page.page_title, 
    redirect.rd_from,
    redirect.rd_title
FROM 
    page 
    left JOIN redirect ON page.page_id = redirect.rd_from
    join (select page.page_title as title from page where page.page_title = 'Dog_breeders' and page_is_redirect = 0
          union
          select redirect.rd_title as title 
          from redirect 
          where redirect.rd_from in (select page.page_id from page where page.page_title = 'Dog_breeders' and page_is_redirect = 1) 
          ) cteTitle 
WHERE 
    redirect.rd_title = cteTitle.title or page.page_title = cteTitle.title;

This is the result:

+----------+--------------+----------+--------------+
|  page_id |  page_title  |  rd_from |   rd_title   |
+----------+--------------+----------+--------------+
|  796236  |  Dog_breeder |  796236  | Dog_breeding |
+----------+--------------+----------+--------------+
| 16274701 | Dog_breeders | 16274701 | Dog_breeding |
+----------+--------------+----------+--------------+
| 37353772 |   Stud_dog   | 37353772 | Dog_breeding |
+----------+--------------+----------+--------------+
| 40738352 |  Dog_Breeder | 40738352 | Dog_breeding |
+----------+--------------+----------+--------------+
|  476072  | Dog_breeding |   NULL   |     NULL     |
+----------+--------------+----------+--------------+

Can check it here: http://rextester.com/OQMV97513

NOTE: There is another solution using pagelinks table in SO