I would strongly recommend that you treat your database basically the same way as you treat your application code. You can script your database out to its component parts, check those into source control, and then use the same labels and versions there that you use for your apps.
To get the objects into source control there are a number of tools you can use. Microsoft has a tool nicknamed Data Dude that works with Visual Studio, and they're also preparing to release a new tool called SQL Server Database Tools (SSDT), again working with Visual Studio. My company, Red Gate Software, makes a tool that works with SSMS called SQL Source Control.
In terms of process, I wrote several chapters for the book Red Gate Guide to Team Development. It's available as a free download (or, if you want to kill a tree, you can purchase one from Amazon). I go into a lot more detail there about working with databases in development teams.
I've done this in SQL Server using a modified version of Dice's Coefficient. Basically, you store a precomputed table of q-grams from your lookup data set. This table should have the primary key of the source record, the resulting q-gram, and the number of times that q-gram appears when breaking up the string.
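As a minimal Python sketch of what that precomputed table holds (the table and column names here are hypothetical; the real schema depends on your data), each lookup string expands into one row per distinct q-gram, carrying the source record's key and the q-gram's occurrence count:

```python
from collections import Counter

# Hypothetical lookup data: primary key -> lookup string
customers = {1: 'aaaa', 2: 'abab'}

def qgram_rows(pk, s, q=2):
    """Expand one lookup string into (pk, qgram, cardinality) rows,
    where cardinality is how often that q-gram occurs in the string."""
    counts = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    return [(pk, gram, n) for gram, n in counts.items()]

# Build the whole precomputed table
table = [row for pk, name in customers.items()
         for row in qgram_rows(pk, name)]
print(table)  # [(1, 'aa', 3), (2, 'ab', 2), (2, 'ba', 1)]
```

In a real system you'd materialize these rows into a table indexed on the q-gram column, so lookups become index seeks.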
Then, when you want to do a lookup, break your input string into a similar list of q-grams (temporary table, subquery, etc.) and join those to your lookup table on the q-gram, grouping by the primary key from the lookup row (which, in the lookup table, is actually a foreign key rather than a primary key). For each matched q-gram, count the number of matches as the minimum cardinality between the input value and the lookup-table row: for example, the string 'aaaa' contains 'aa' 3 times and 'aaaaa' contains 'aa' 4 times, so you would have 3 matches. Double this number to get 6.
Once you've got this doubled number of matches, divide by the total number of q-grams obtained from both the input string, and the rows in the lookup table associated with a given primary key. So 'aaaa' would have 3 q-grams, and 'aaaaa' would have 4. You would calculate the similarity as 6/7. You can then filter this similarity quotient based on your desired threshold.
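The calculation above can be sketched in a few lines of Python (the answer's SQL does this set-based, but the arithmetic is the same), reproducing the 'aaaa' vs. 'aaaaa' example:

```python
from collections import Counter

def qgrams(s, q=2):
    """Split a string into overlapping q-grams with their counts."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def dice_similarity(a, b, q=2):
    """Dice's coefficient over q-gram multisets:
    2 * matches / (total q-grams in a + total q-grams in b),
    where each shared q-gram counts min(count_a, count_b) times."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    matches = sum(min(ga[g], gb[g]) for g in ga if g in gb)
    total = sum(ga.values()) + sum(gb.values())
    return 2 * matches / total if total else 0.0

print(dice_similarity('aaaa', 'aaaaa'))  # 6/7, about 0.857
```

You would then keep only the rows whose score clears your chosen threshold.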
The nice part about this algorithm is that it's fast, since after splitting your input string, it's just a bunch of index lookups. The downside is that it doesn't put any emphasis on ordering of q-grams, so having two words out of order won't affect the match quality (assuming you're not letting q-grams cross word boundaries).
Following is some sample SQL Server code to illustrate the concept, but it makes use of features not found in MySQL. You'd probably need to rewrite some of it using subqueries rather than common table expressions.
CREATE PROCEDURE FuzzyCustomerLookup
    @name varchar(8000),
    @threshold decimal(10,4) = 0.0,
    @limit_results int = 2147483647
AS
WITH inp_q AS ( -- Split input string into q-grams
    SELECT
        q.qgram,
        COUNT(*) AS cardinality
    FROM dbo.Qgrams(@name) q -- Function to split string into q-grams
    GROUP BY q.qgram
),
matches AS ( -- Count matches of input string against each lookup row
    SELECT
        c.row_id,
        SUM(CASE WHEN c.cardinality < i.cardinality THEN c.cardinality ELSE i.cardinality END) AS matches
    FROM CustomerQgrams c -- Precomputed table of q-grams
    INNER JOIN inp_q i
        ON c.qgram = i.qgram
    GROUP BY c.row_id
),
dice AS ( -- Calculate match quality
    SELECT
        matches.row_id,
        CAST(matches.matches * 2 AS decimal(10,4)) / (CAST(i.cardinality AS decimal(10,4)) + CAST(c.cardinality AS decimal(10,4))) AS similarity
    FROM matches
    INNER JOIN (
        SELECT row_id, SUM(cardinality) AS cardinality
        FROM CustomerQgrams
        GROUP BY row_id
    ) c
        ON matches.row_id = c.row_id
    CROSS JOIN (
        SELECT SUM(cardinality) AS cardinality FROM inp_q
    ) i
)
SELECT TOP (@limit_results)
    ch.CustomerNo,
    ch.CustomerName,
    dice.similarity
FROM dice
INNER JOIN Customers ch
    ON dice.row_id = ch.row_id
WHERE dice.similarity >= @threshold
ORDER BY dice.similarity DESC;
Third-party Sybase tools seem to be sparse unless you're willing to spend money.
The usual vendor suspects have Sybase offerings: