MySQL – Remove duplicate terms from column

Tags: duplication, mysql, mysql-workbench

This is the tags column, where each term should be listed only once. I would like to remove all repeated terms from this column. For example, given the content below:

Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, Features, , Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, Cuisines, , Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, Services, , Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, Meal For Two, , Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, Last Order/Booking (P.M), , Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, BookUrTableId,

Should show as below:

Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, East, West, Malad East, Dafftary Road, Mumbai, Maharashtra, Features, Cuisines, Services, Meal For Two, Last Order/Booking (P.M), BookUrTableId

When I tried to run a query and insert the result into a table, it failed with an error because the content of this column was too long, and the query stopped executing.

I would also prefer to do this in MySQL rather than in PHP or any other programming language.

Please help.

Best Answer

As has been mentioned, the database is perhaps not the best place to accomplish this; however, it can be done using a stored function. You wouldn't want to rely on this in a live environment, but it works.

If the following seems a bit primitive, that's an indication of the somewhat poor match between tool and task.

DELIMITER //

DROP FUNCTION IF EXISTS `dedup_csv` //
CREATE FUNCTION `dedup_csv` (in_str LONGTEXT) RETURNS LONGTEXT
DETERMINISTIC
NO SQL
BEGIN

-- http://dba.stackexchange.com/questions/87144/remove-duplicate-terms-from-column
-- given a comma-separated string of values, return a comma-separated string of
-- unique values found in the list

DECLARE out_str LONGTEXT DEFAULT NULL; -- pending output
DECLARE next_str TEXT DEFAULT NULL;    -- next element under consideration

dedup:
LOOP

  IF CHAR_LENGTH(TRIM(in_str)) = 0 OR in_str IS NULL THEN
    LEAVE dedup; -- no more data to consider
  END IF;

  SET next_str = SUBSTRING_INDEX(in_str,',',1);                   -- find the next element
  SET in_str = SUBSTRING(in_str FROM (CHAR_LENGTH(next_str) + 1 + 1)); -- remove that element

  SET in_str = TRIM(in_str), next_str = TRIM(next_str); -- trim the new and the rest

  IF FIND_IN_SET(next_str,out_str) OR CHAR_LENGTH(next_str) = 0 THEN -- if empty or already found
    ITERATE dedup;
  END IF;

  SET out_str = CONCAT_WS(',',out_str,next_str); -- append the new to pending output 

END LOOP;

RETURN out_str;

END //

DELIMITER ;

See this live demo: http://sqlfiddle.com/#!9/f32ee/1

The output from this function, given the input above:

Lakshmi Bar & Restaurant,Dining & Leisure,Restaurants,Restaurants- Multi Cuisine,Hotel,
East#West,Malad East,Dafftary Road,Mumbai,Maharashtra,Features,Cuisines,Services,Meal For Two,
Last Order/Booking (P.M),BookUrTableId
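
Note that the expected output in the question also splits East#West into two terms, which the function alone does not do. A minimal sketch, assuming `#` is meant to act as a second separator: replace it with a comma before deduplicating. The shortened literal below is only an illustration, not the full column value.

```sql
-- Assumption: '#' should be treated as another separator, as the question's
-- expected output suggests. REPLACE() turns it into a comma first.
SELECT dedup_csv(REPLACE('Hotel, , East#West, Malad East, Hotel', '#', ',')) AS deduped;
-- returns: Hotel,East,West,Malad East
```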

Leading and trailing spaces and empty values are cleaned up automatically.

mysql> SELECT dedup_csv('foo,bar,,,   ,baz,foo,  foo  , bar') AS deduped;
+-------------+
| deduped     |
+-------------+
| foo,bar,baz |
+-------------+
1 row in set (0.00 sec)
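
To apply the function to a stored column, something along these lines should work. The table name `listings` and the column name `tags` are hypothetical, since the question does not give them. The function returns NULL when the input contains no non-empty terms, so this sketch falls back to an empty string in that case.

```sql
-- Hypothetical names: table `listings`, column `tags`.
-- Preview a few rows before changing anything.
SELECT tags, dedup_csv(tags) AS deduped
FROM listings
LIMIT 10;

-- Rewrite the column in place. COALESCE guards against the NULL that
-- dedup_csv returns for input with no non-empty terms.
UPDATE listings
SET tags = COALESCE(dedup_csv(tags), '')
WHERE tags IS NOT NULL;
```

On a large table it is probably worth running the UPDATE in batches, given the per-row cost discussed below.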

Again, the fact that the database is not the ideal tool shows in the benchmarks, which depend heavily on input length and number of elements.

On the short and simple string shown above, the actual runtime on MySQL 5.5 on a 1 GHz Opteron 1218 is ~ 150 µsec, while the long string in the original question requires about 2400 µsec to process... which works out to a fairly substantial amount of time for large data sets.
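
One rough way to take this kind of measurement on your own server is MySQL's built-in BENCHMARK() function, which evaluates an expression repeatedly while the client reports the total elapsed time. The repeat count of 10000 is an arbitrary choice, and this is not necessarily how the figures above were obtained.

```sql
-- BENCHMARK(count, expr) evaluates expr `count` times and always returns 0.
-- Divide the elapsed time reported by the client by 10000 for a per-call estimate.
SELECT BENCHMARK(10000, dedup_csv('foo,bar,,,   ,baz,foo,  foo  , bar'));
```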

Related Solutions

MySQL – Remove duplicate rows in a MySQL table that does not contain a primary key

In the spirit of @yercube's answer, I have an answer that has an added twist.

CREATE TABLE stage
(
    id int not null auto_increment,
    name varchar(20),
    primary key (id)
);
CREATE TABLE stage2 LIKE stage;
INSERT INTO stage (name) SELECT name FROM item;
INSERT INTO stage2 (id) SELECT min_id FROM
(SELECT MIN(id) min_id,name FROM stage GROUP BY name) A;
UPDATE stage2 A INNER JOIN stage B USING (id) SET A.name=B.name;
TRUNCATE TABLE item;
INSERT INTO item (name) SELECT name FROM stage2;
DROP TABLE stage;
DROP TABLE stage2;

This will load stage2 with the first occurrence of each name from item, zap the item table, and load the unique occurrences back.
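
A quick way to check the result, assuming only the `item` table and `name` column used above, is to list the names that occur more than once. Run it before and after the cleanup; it should return no rows once the duplicates are gone.

```sql
-- Names that still appear more than once in item.
SELECT name, COUNT(*) AS occurrences
FROM item
GROUP BY name
HAVING COUNT(*) > 1;
```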

If you look back at @yercube's answer and compare it to mine, his is much simpler because:

@yercube uses one temp table, while I use two
I had to create a column for iteration control, @yercube did not need to
@yercube has fewer steps
both answers achieve the same thing

I do not expect my answer to be accepted. The sole purpose of my answer was to demonstrate that other answers lose the concise clarity needed to solve your problem. Again, hats off to @yercube.

PostgreSQL – How to Remove Duplicate Records in a Join Table

In my experience (and as shown in many tests) NOT IN as demonstrated by @gsiems is rather slow and scales terribly. The inverse IN is typically faster (where you can reformulate that way, like in this case), but this query with EXISTS (doing exactly what you asked) should be much faster yet - with big tables by orders of magnitude:

DELETE FROM questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  q1.ctid < q.ctid
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );

This deletes every row for which another row with the same (tag_id, question_id) and a smaller ctid exists. (Effectively, it keeps the first instance according to the physical order of tuples.) I am using ctid in the absence of a better alternative, since your table does not seem to have a PK or any other unique (set of) column(s).

ctid is the internal tuple identifier present in every row and necessarily unique.
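
Before running the DELETE, the same predicate can be reused in a SELECT to see how many rows it would remove, and ctid can be selected like any other column to inspect it. This is only a sketch for inspection, using the table and columns from the query above.

```sql
-- Count the rows the DELETE would remove.
SELECT count(*) AS rows_to_delete
FROM   questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  q1.ctid < q.ctid
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );

-- Look at ctid values next to the data; duplicates sort together.
SELECT ctid, question_id, tag_id
FROM   questions_tags
ORDER  BY question_id, tag_id, ctid
LIMIT  10;
```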

Test

I ran a test case with this table matched to your question and 100k rows:

CREATE TABLE questions_tags(
  question_id integer NOT NULL
, tag_id      integer NOT NULL
);

INSERT INTO questions_tags (question_id, tag_id)
SELECT (random()* 100)::int, (random()* 100)::int
FROM   generate_series(1, 100000);

ANALYZE questions_tags;

Indexes do not help in this case.

Results

NOT IN
The SQLfiddle times out.
Tried the same locally but I canceled it, too, after several minutes.

EXISTS
Finishes in half a second in this SQLfiddle.

Alternatives

If you are going to delete most of the rows, it will be faster to select the survivors into another table, drop the original, and rename the survivors' table. Careful: this has implications if you have views or foreign keys (or other dependencies) defined on the original.
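
A minimal sketch of that approach, assuming no views, foreign keys, or other dependencies exist on the table. The name `questions_tags_dedup` is made up for illustration. CREATE TABLE AS does not copy constraints or indexes, so the NOT NULL constraints are restored explicitly here and any indexes would have to be recreated as well.

```sql
BEGIN;

-- Keep one row per (question_id, tag_id) combination.
CREATE TABLE questions_tags_dedup AS
SELECT DISTINCT question_id, tag_id
FROM   questions_tags;

-- CREATE TABLE AS does not carry over NOT NULL constraints.
ALTER TABLE questions_tags_dedup
   ALTER COLUMN question_id SET NOT NULL
 , ALTER COLUMN tag_id      SET NOT NULL;

DROP TABLE questions_tags;
ALTER TABLE questions_tags_dedup RENAME TO questions_tags;

COMMIT;
```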

If you have dependencies and want to keep them, you could:

Drop all foreign keys and indexes - for performance.
SELECT survivors to a temporary table.
TRUNCATE the original.
Re-INSERT survivors.
Re-CREATE indexes and foreign keys. Views can just stay, they have no impact on performance.
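
The steps above can be sketched as follows. The temporary table name `survivors` is an assumption, and the index and foreign key statements are left as comments because their names and definitions depend on your schema.

```sql
BEGIN;

-- Step 1: drop foreign keys and indexes here (schema-specific), e.g.
-- ALTER TABLE ... DROP CONSTRAINT ...;  DROP INDEX ...;

-- Step 2: copy one row per (question_id, tag_id) to a temporary table
-- that disappears automatically at COMMIT.
CREATE TEMP TABLE survivors ON COMMIT DROP AS
SELECT DISTINCT question_id, tag_id
FROM   questions_tags;

-- Step 3: empty the original table.
TRUNCATE questions_tags;

-- Step 4: put the survivors back.
INSERT INTO questions_tags (question_id, tag_id)
SELECT question_id, tag_id
FROM   survivors;

-- Step 5: recreate indexes and foreign keys here (schema-specific).

COMMIT;
```

Wrapping everything in one transaction means other sessions never see the table empty, at the cost of holding an exclusive lock on it until COMMIT.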