Mysql – Remove duplicate terms from column

duplicationMySQLmysql-workbench

This the tags column where all terms should be listed only once. Would like to remove all additional words from this column. For example the below contant:

Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, Features, , Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, Cuisines, , Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, Services, , Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, Meal For Two, , Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, Last Order/Booking (P.M), , Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, , East#West, Malad East, Dafftary Road, Mumbai, Maharashtra, BookUrTableId,

Should show as below:

Lakshmi Bar & Restaurant, Dining & Leisure, Restaurants, Restaurants- Multi Cuisine, Hotel, East, West, Malad East, Dafftary Road, Mumbai, Maharashtra, Features, Cuisines, Services, Meal For Two, Last Order/Booking (P.M), BookUrTableId

Where i tried to run the query and insert this into a table it gave an error as the content in this column was going too long and stops the execution of the query.

Also it would be preferred if i can get this done in mysql rather than using php or any other programming language.

Please help.

Best Answer

As has been mentioned, the database is perhaps not the best place to accomplish this; however, it can be done, using a stored function. You wouldn't want to rely on this in a live environment, bit it works.

If the following seems a bit primitive, that's an indication of the somewhat poor match between tool and task.

DELIMITER //

DROP FUNCTION IF EXISTS `dedup_csv` //
CREATE FUNCTION `dedup_csv` (in_str LONGTEXT) RETURNS LONGTEXT
DETERMINISTIC
NO SQL
BEGIN

-- http://dba.stackexchange.com/questions/87144/remove-duplicate-terms-from-column
-- given a comma-separated string of values, return a comma-separated string of
-- unique values found in the list

DECLARE out_str LONGTEXT DEFAULT NULL; -- pending output
DECLARE next_str TEXT DEFAULT NULL;    -- next element under consideration

dedup:
LOOP

  IF CHAR_LENGTH(TRIM(in_str)) = 0 OR in_str IS NULL THEN
    LEAVE dedup; -- no more data to consider
  END IF;

  SET next_str = SUBSTRING_INDEX(in_str,',',1);                   -- find the next element
  SET in_str = SUBSTRING(in_str FROM (CHAR_LENGTH(next_str) + 1 + 1)); -- remove that element

  SET in_str = TRIM(in_str), next_str = TRIM(next_str); -- trim the new and the rest

  IF FIND_IN_SET(next_str,out_str) OR CHAR_LENGTH(next_str) = 0 THEN -- if empty or already found
    ITERATE dedup;
  END IF;

  SET out_str = CONCAT_WS(',',out_str,next_str); -- append the new to pending output 

END LOOP;

RETURN out_str;

END //

DELIMITER ;

See this live demo: http://sqlfiddle.com/#!9/f32ee/1

The output from this function, given the input above:

Lakshmi Bar & Restaurant,Dining & Leisure,Restaurants,Restaurants- Multi Cuisine,Hotel,
East#West,Malad East,Dafftary Road,Mumbai,Maharashtra,Features,Cuisines,Services,Meal For Two,
Last Order/Booking (P.M),BookUrTableId

Leading and trailing spaces and empty values are cleaned up automatically.

mysql> SELECT dedup_csv('foo,bar,,,   ,baz,foo,  foo  , bar') AS deduped;
+-------------+
| deduped     |
+-------------+
| foo,bar,baz |
+-------------+
1 row in set (0.00 sec)

Again, the database not being the ideal tool is reflected in the benchmarks, which are highly dependent on input length and number of elements.

On the short and simple string shown above, the actual runtime on MySQL 5.5 on a 1 GHz Opteron 1218 is ~ 150 µsec, while the long string in the original question requires about 2400 µsec to process... which works out to a fairly substantial amount of time for large data sets.