Mysql – get duplicate “base” URLs with theSQL

MySQL

I can't quite seem to figure this out, and hoping someone can help. I was trying something like this, but doesn't seem to work:

SELECT * 
FROM testtable 
GROUP BY SUBSTRING_INDEX(urlTest,'/',3) 
HAVING COUNT(*)>1

Here is a (simplified) table structure with what I am using:

CREATE TABLE `testtable` (
  `field1` int(11) NOT NULL,
  `field2` datetime NOT NULL,
  `urlTest` varchar(255) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

INSERT INTO `testtable` (`field1`, `field2`, `urlTest`) VALUES
(2, '2010-01-01', 'http://test1.com/somethingelse.php?id=1'),
(5, '2010-01-01', 'http://test1.com/'),
(6, '2012-02-02', 'http://test1.com/newscript/something'),
(7, '2013-02-02', 'http://test2.com/newscript/something'),
(8, '2014-02-02', 'http://test3.com/newscript/something'),
(9, '2015-02-02', 'http://test3.com/');

ALTER TABLE `testtable`
  ADD PRIMARY KEY (`field1`);

The output I want to get is to return any "base" urls that are identical with an 'identical' date, that I can manually inspect.

In other words, although "test1.com/somethingelse.php?id=1" and "test1.com" are "unique" url entries, the "base" url (i.e., "test1.com") is identical/a duplicate (because the dates -> 2010-01-01 <- are also identical). However, test1.com/newscript/something would not count as a 'duplicate' – because although the 'base' url is the same – it has a 'unique' date to the rest of the 'test1.com' urls (i.e., "2012-02-02")

So the output I'd like to get is:

2, 2010-01-01, 'http://test1.com/somethingelse.php?id=1'
5, 2010-01-01, 'http://test1.com/'

(because only test1.com has the same "base" unique URL (test1.com) as well as has the same timestamp. "test3.com" – while the same "base" url – since the actual timestamps are different, would be considered 'unique' urls)

How would I accomplish this?

Thanks!

Best Answer

Selecting all fields makes no sense - server will output one RANDOM field value from all values in the group. Use

SELECT testtable.*
FROM testtable
JOIN ( SELECT SUBSTRING_INDEX(urlTest,'/',3) AS domain,
              field2
       FROM testtable 
       GROUP BY SUBSTRING_INDEX(urlTest,'/',3), field2 
       HAVING COUNT(*) > 1 ) domains
WHERE SUBSTRING_INDEX(testtable.urlTest,'/',3) = domains.domain
  AND testtable.field2 = domains.field2

fiddle