Mysql – Find the user with closest taste of mine

MySQLoptimizationquery-performance

I trying to write in SQL the following requirement:

Each user can rate as much films as they like. (1= I don't like, 5=I love it)
And the system should provide to the current user a list of films that he doesn't rate yet and that other users like too.

To simplify, I have this table and the following data:

create table wich (
  uid int,
  film varchar(50),
  rate int
);

insert into wich values 
(1, 'usual suspect', 5), (1, 'gataca', 4), (1, 'goldeneye', 2),
(2, 'usual suspect', 4), (2, 'gataca', 5), (2, 'i am a legend', 4), (2, 'the hobbit', 1),
(3, 'usual suspect', 1), (3, 'gataca', 5), (3, 'goldeneye', 5),
(4, 'usual suspect', 5), (4, 'goldeneye', 5),
(5, 'usual suspect', 5), (5, 'gataca', 4), (5, 'goldeneye', 5),
(6, 'usual suspect', 4), (6, 'gataca', 3), (6, 'goldeneye', 3), (6, 'shrek', 4);

This can be read as: User 1 like 'usual suspect' and 'gataca' but not 'goldeneye'.

So what I want know is to find the user(s) that have the closest tastes of the current user and get a list of films that the current user could probably like.

Following, what I have done so far:

Step 1: for every films rated by user 1, if user1 and user2 rate the film in the same way, compute a score (should be 0 if the difference is greater than 2)
and it's great if both like a film or both doesn't like a film.

select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
        from wich w1
        inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
        where w1.uid=1

Step 2: compute a score by user

select scores.uid, SUM(scores.score) * 100 / sum(scores.nb) as score
from (
    select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
    from wich w1
    inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
    where w1.uid=1
) scores
group by uid
ORDER BY score desc

step 3: find films that user1 has not yet rated that he could like

select w3.uid, w3.film,w3.rate, matches.score
from (
    select scores.uid, SUM(scores.score) * 100 / sum(scores.nb) as score
    from (
        select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
        from wich w1
        inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
        where w1.uid=1
    ) scores
    group by uid
    ORDER BY score desc
) as matches
inner join wich w3 on matches.uid=w3.uid and w3.rate >= 3
left join wich w4 on w4.uid=1 and w3.film=w4.film
where w4.uid is null
order by w3.rate desc, matches.score desc;

This seems to work, but I am not sure that this will still respond in a small amount of time if each user rate of lot of films and the table become become bigger.

What do you think ?

Is there a better way to accomplish such a thing ?

A working fiddle: http://sqlfiddle.com/#!2/89c57/1/0

Edit:

In my real table, the film column is an integer and I have a second table with films data (title, description, year, …)

create table wich (
  uid int,
  film int,
  rate int
);

The example given here is just to simplify the problem with just one table.

Best Answer

You can make a table which holds the users with similar taste for every user.

CREATE TABLE `similar_users` (
    `uid1` INT,
    `uid2` INT,
    `score` INT,
    PRIMARY KEY (`uid1`, `uid2`)
)

The algorithm for calculating the score can be anything, as long as a higher score means a closer taste.

You you should limit the amount of similar users to a fixed number and also exclude the ones with a score less than a certain value (the ones with a not so similar tastes).

The contents of the table will be calculated periodically (e.g. every day).

Then step 3 from your solution becomes something like this:

SELECT w.film
FROM similar_users su
JOIN wich w ON w.uid = su.uid2 AND w.rate > 2
LEFT JOIN wich wn ON wn.film = w.film AND wn.uid = {current_user}
WHERE su.uid1 = {current_user}
ORDER BY su.score * w.rate DESC

Related Solutions

Mysql – How to create a user-defined aggregate function in MySQL (redux)

OK, it's almost working; the only thing that is wrong is I evidently don't know how to coerce the constant argument args->args[1] to (any) type of int value. If I code target_depth to a fixed integer, the program works. It's only when I try to read it in from the query that I get the "ERROR 2013 (HY000): Lost connection to MySQL server during query" error. Any help with reading in the target depth for UDF_SMALL (2nd lowest value, 3rd lowest value, etc.) would be greatly appreciated.

/******************************************************************************
 ** udf_small.c
 **
 ** This MySQL user-defined function (UDF) sorts through an aggregate dataset, 
 ** then returns the ith smallest element each group in the GROUP BY statement.
 **
 ** Author:        Jeffrey Rolland
 ** Creation Date: 12/07/2014
 ** Modifications: None
 ******************************************************************************/

/******************************************************************************
** A dynamically loadable file should be compiled shared.
** (something like: gcc -shared -o my_func.so -I /usr/includes/mysql/ my_func.c).
** You can easily get all switches right by doing:
** cd sql ; make udf_example.o
** Take the compile line that make writes, remove the '-c' near the end of
** the line and add -shared -o udf_example.so to the end of the compile line.
** The resulting library (udf_example.so) should be copied to some dir
** searched by ld. (/usr/lib/mysql/plugin/ ?)
** If you are using gcc, then you should be able to create the udf_example.so
** by simply doing 'make udf_example.so'.
**
** After the library is made one must notify mysqld about the new
** functions with the commands:
**
** CREATE FUNCTION metaphon RETURNS STRING SONAME "udf_example.so";
** CREATE FUNCTION myfunc_double RETURNS REAL SONAME "udf_example.so";
** CREATE FUNCTION myfunc_int RETURNS INTEGER SONAME "udf_example.so";
** CREATE FUNCTION sequence RETURNS INTEGER SONAME "udf_example.so";
** CREATE FUNCTION lookup RETURNS STRING SONAME "udf_example.so";
** CREATE FUNCTION reverse_lookup RETURNS STRING SONAME "udf_example.so";
** CREATE AGGREGATE FUNCTION avgcost RETURNS REAL SONAME "udf_example.so";
** CREATE FUNCTION myfunc_argument_name RETURNS STRING SONAME "udf_example.so";
**
** After this the functions will work exactly like native MySQL functions.
** Functions should be created only once.
**
** The functions can be deleted by:
**
** DROP FUNCTION metaphon;
** DROP FUNCTION myfunc_double;
** DROP FUNCTION myfunc_int;
** DROP FUNCTION lookup;
** DROP FUNCTION reverse_lookup;
** DROP FUNCTION avgcost;
** DROP FUNCTION myfunc_argument_name;
**
** The CREATE FUNCTION and DROP FUNCTION update the func@mysql table. All
** Active function will be reloaded on every restart of server
** (if --skip-grant-tables is not given)
**
** If you ge problems with undefined symbols when loading the shared
** library, you should verify that mysqld is compiled with the -rdynamic
** option.
**
** If you can't get AGGREGATES to work, check that you have the column
** 'type' in the mysql.func table. If not, run 'mysql_upgrade'.
**
**************************************************************************/

/*************************************************************************
** Syntax for the new aggregate commands are:
** CREATE AGGREGATE FUNCTION <function_name> RETURNS {STRING|REAL|INTEGER}
** SONAME <name_of_shared_library>
**
** Syntax for avgcost: AVGCOST( t.quantity, t.price )
** with t.quantity=integer, t.price=double
**
** (This example is provided by Andreas F. Bobak <bobak@relog.ch>)
**************************************************************************/

#ifdef STANDARD
    /* STANDARD is defined, don't use any mysql functions */
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #ifdef __WIN__
        typedef unsigned __int64 ulonglong; /* Microsofts 64 bit types */
        typedef __int64 longlong;
    #else
        typedef unsigned long long ulonglong;
        typedef long long longlong;
    #endif /*__WIN__*/
#else
    #include <my_global.h>
    #include <my_sys.h>
    #if defined(MYSQL_SERVER)
        #include <m_string.h>   /* To get strmov() */
    #else
        /* when compiled as standalone */
        #include <string.h>
        #define strmov(a,b) stpcpy(a,b)
        #define bzero(a,b) memset(a,0,b)
    #endif
#endif
#include <mysql.h>
#include <ctype.h>
#ifdef _WIN32
    /* inet_aton needs winsock library */
    #pragma comment(lib, "ws2_32")
#endif
#ifdef HAVE_DLOPEN
    #if !defined(HAVE_GETHOSTBYADDR_R) || !defined(HAVE_SOLARIS_STYLE_GETHOST)
        static pthread_mutex_t LOCK_hostname;
    #endif
#endif
#include <stdbool.h>

my_bool udf_small_init( UDF_INIT* initid, UDF_ARGS* args, char* message );
void udf_small_deinit( UDF_INIT* initid );
void udf_small_reset( UDF_INIT* initid, UDF_ARGS* args, char* is_null, char *error );
void udf_small_clear( UDF_INIT* initid, UDF_ARGS* args, char* is_null, char *error );
void udf_small_add( UDF_INIT* initid, UDF_ARGS* args, char* is_null, char *error );
double udf_small( UDF_INIT* initid, UDF_ARGS* args, char* is_null, char *error );

struct SortArray
{
    ulonglong length;
    double *data;
};

/*************************************************************************
** Example of init function
** Arguments:
** my_bool maybe_null 1 if function can return NULL
** Default value is 1 if any of the arguments
** is declared maybe_null
**
**
** initid Points to a structure that the init function should fill.
** This argument is given to all other functions.
**
** unsigned int decimals Number of decimals.
** Default value is max decimals in any of the
** arguments.
**
** unsigned int max_length Length of string result.
** The default value for integer functions is 21
** The default value for real functions is 13+
** default number of decimals.
** The default value for string functions is
** the longest string argument.
**
** char *ptr; A pointer that the function can use.
**
**
** args Points to a structure which contains:
** unsigned int arg_count Number of arguments
**
** enum Item_result *arg_type Types for each argument.
** Types are STRING_RESULT, REAL_RESULT
** and INT_RESULT.
**
** char **args Pointer to constant arguments.
** Contains 0 for not constant argument.
**
** unsigned long *lengths; max string length for each argument
** char *maybe_null Information of which arguments
** may be NULL
**
**
** message Error message that should be passed to the user on fail.
** The message buffer is MYSQL_ERRMSG_SIZE big, but one should
** try to keep the error message less than 80 bytes long!
**
** This function should return 1 if something goes wrong. In this case
** message should contain something usefull!
**************************************************************************/
my_bool udf_small_init( UDF_INIT* initid, UDF_ARGS* args, char* message )
{
    struct SortArray *sort = NULL;
    ulonglong target_depth;
    ulonglong i;

    if (args->arg_count != 2)
    {
        strcpy(message, "Wrong number of arguments: SMALL() requires two arguments");
        return 1;
    }
    if ((args->arg_type[0] != REAL_RESULT) || (args->arg_type[1] != INT_RESULT) )
    {
        strcpy(message, "Wrong argument type: SMALL() requires an REAL and a INT");
        return 1;
    }
    if (*((longlong*)args->args[1]) < 1)
    {
        strcpy(message, "Wrong argument value: SMALL() requires second parameter that is positive");
        return 1;
    }

    initid->maybe_null = 0; /* The result may be null */
    initid->decimals = 10; /* We want 10 decimals in the result */
    /* initid->max_length = 10; */ /* 10 digits + . + 10 decimals */

    target_depth = *((longlong*)args->args[1]);

    sort = malloc((sizeof(struct SortArray)));

    if(sort == NULL)
    {
        strcpy(message, "Couldn't allocate memory");
        return 1;
    }   

    initid->ptr = (char*)sort;

    sort->length = 0;   

    sort->data = malloc((target_depth)*(sizeof(double)));

    if(sort->data == NULL)
    {
        strcpy(message, "Couldn't allocate memory");
        return 1;
    }   

    for(i = 0; i < target_depth; i++)
    {
        sort->data[i] = 999999999;
    }

    return 0;
}

/****************************************************************************
** Deinit function. This should free all resources allocated by
** this function.
**
** Arguments:
** initid Return value from xxxx_init
****************************************************************************/
void udf_small_deinit( UDF_INIT* initid )
{
    free(initid->ptr);  
}

/****************************************************************************
** Small Aggregate Function.
**
** There are 3 extra functions for an aggregate function, xxx_reset, 
** xxx_clear, and xxx_add
****************************************************************************/

/****************************************************************************
** xxx_reset
**
** Arguments:
** initid Structure filled by xxx_init
**
** args The same structure as to xxx_init. This structure
** contains values for all parameters.
** Note that the functions MUST check and convert all
** to the type it wants! Null values are represented by
** a NULL pointer
**
** is_null If the result is null, one should store 1 here.
**
** message Error message that should be passed to the user on fail.
** The message buffer is MYSQL_ERRMSG_SIZE big, but one should
** try to keep the error message less than 80 bytes long!.
**
** xxx_reset gets done every time we begin processing the records for a new 
** group. It calls xxx_clear and xxx_add
****************************************************************************/

/* This is only for MySQL 4.0 compability */
void udf_small_reset(UDF_INIT* initid, UDF_ARGS* args, char* is_null, char* message)
{
    udf_small_clear(initid, args, is_null, message);
    udf_small_add(initid, args, is_null, message);
}

/****************************************************************************
** xxx_clear
**
** Arguments:
** initid Structure filled by xxx_init
**
** args The same structure as to xxx_init. This structure
** contains values for all parameters.
** Note that the functions MUST check and convert all
** to the type it wants! Null values are represented by
** a NULL pointer
**
** is_null If the result is null, one should store 1 here.
**
** message Error message that should be passed to the user on fail.
** The message buffer is MYSQL_ERRMSG_SIZE big, but one should
** try to keep the error message less than 80 bytes long!.
**
** xxx_clear resets the processing variables to their initial values for a 
** new group
****************************************************************************/

/* This is needed to get things to work in MySQL 4.1.1 and above */
void udf_small_clear(UDF_INIT* initid, UDF_ARGS* args, char* is_null __attribute__((unused)), char* message __attribute__((unused)))
{
    struct SortArray *sort = NULL;
    ulonglong target_depth;
    ulonglong i;

    sort = (struct SortArray*)initid->ptr;

    target_depth = *((longlong*)args->args[1]);

    /* if(sort == NULL)
    {
        strcpy(message, "Initid Pointer is NULL");
        exit(1);
    }

    if(sort->data == NULL)
    {
        strcpy(message, "Couldn't allocate memory");
        exit(1);
    } */

    sort->length = 0;

    for(i = 0; i < target_depth; i++)
    {
        sort->data[i] = 999999999;
    }
}

/****************************************************************************
** xxx_add
**
** Arguments:
** initid Structure filled by xxx_init
**
** args The same structure as to xxx_init. This structure
** contains values for all parameters.
** Note that the functions MUST check and convert all
** to the type it wants! Null values are represented by
** a NULL pointer
**
** is_null If the result is null, one should store 1 here.
**
** message Error message that should be passed to the user on fail.
** The message buffer is MYSQL_ERRMSG_SIZE big, but one should
** try to keep the error message less than 80 bytes long!.
**
** xxx_add is the main processing workhorse of a anggregate UDF. It processes
** each new record in the group.
****************************************************************************/

void udf_small_add(UDF_INIT* initid, UDF_ARGS* args, char* is_null __attribute__((unused)), char* message __attribute__((unused)))
{
    if (args->args[0] && args->args[1])
    {
        struct SortArray *sort = NULL;
        ulonglong target_depth;
        double col_value;
        ulonglong counter;
        ulonglong i, j;
        bool done;

        sort = (struct SortArray*)initid->ptr;

        target_depth = *((longlong*)args->args[1]); 

        col_value = *((double*)args->args[0]);  

        /* if(sort == NULL)
        {
            strcpy(message, "Initid pointer is NULL");
            exit(1);
        } */

        i = target_depth;
        counter = 1;
        done = FALSE;

        /* sort->data[0] = col_value + 1.0;
        sort->data[1] = col_value;

        col_value = 0.05; */

        while ((!(done)) && (i > 0))
        {
            if(col_value <= sort->data[i-1])
            {
                for(j = 0; j < i-1; j++)
                {
                    sort->data[j] = sort->data[j+1];
                }
                sort->data[i-1] = col_value;
                done = TRUE;
            }
            else
            {
                i--;
                counter++;
            }
        }

        /* if(counter >= sort->length)
        {
            sort->length = counter;
        } */

    }
}

/***************************************************************************
** UDF double function.
**
** Arguments:
** initid Structure filled by xxx_init
**
** args The same structure as to xxx_init. This structure
** contains values for all parameters.
** Note that the functions MUST check and convert all
** to the type it wants! Null values are represented by
** a NULL pointer
**
** is_null If the result is null, one should store 1 here.
**
** error If something goes fatally wrong one should store 1 here.
**
** This function should return a double. It returns a value at the end of 
** each group.
***************************************************************************/

double udf_small( UDF_INIT* initid, UDF_ARGS* args __attribute__((unused)), char* is_null, char* error __attribute__((unused)))
{
    struct SortArray *sort = NULL;
    ulonglong target_depth;

    sort = (struct SortArray*)initid->ptr;
    target_depth = *((longlong*)args->args[1]);

    sort->length = target_depth; /* Kill me once working */

    if(sort != NULL)
    {
        if(sort->length == target_depth)
        {
            *is_null = 0;
            return sort->data[0];
        }   
    }
    else
    {
        *is_null = 1;
        return  10.0;
    }
}

How to Rank Rows Based on Generated Scores in MySQL

You want 1, 2, 3, ... tacked onto the output?

SELECT @rank := @rank + 1,
       s.*
    FROM ( SELECT @rank := 0 ) AS init
    JOIN (
        SELECT ... -- your existing SELECT
         ) s;

Best Answer

Related Solutions

Mysql – How to create a user-defined aggregate function in MySQL (redux)

How to Rank Rows Based on Generated Scores in MySQL

Related Question