Inspired by Matthew Podwysocki’s Functional Programming and Collective Intelligence post I decided to recreate some of the similarity algorithms discussed in the Programming Collective Intelligence in Erlang.
Collective Intelligence is a shared intelligence that emerges from the collaboration or competition amongst many individuals, it is used by online retailers such as Amazon.com and Netflix to recommend books and movies. Today we’ll use these algorithms to compare user preference, and in subsequent posts apply see how these algorithms are used in recommendation systems and clustering algorithms.
Before we continue, there are two unfamiliar concepts which we need to review: list Comprehensions and dictionaries.
List comprehensions are expressions (which are very common to functional programming languages) that allow you to create lists without having to use funs, maps, or filters. This helps you to write programs which are shorter and easier to debug. In Erlang, the syntax is:
[ F(X) || X <- L]
which means, create a list of F(X) where F is a function and X is an element taken from the list L. Additionally, in Erlang we can use qualifiers to filter out elements of the list L.
Dictionaries (Hashes/Maps) are not directly supported by Erlang and to use them in your code you must use the dict module. The module includes the function dict:from_list which converts a list of two element tuples into a dictionary; it also includes the function dict:fetch which takes the Key and the dictionary as an argument and returns the value which matches the key.
Similarity
Below is a list of movie ratings by user
|
|
Lady in The Water |
Snakes on a Plane |
Just My Luck |
Superman Returns |
You, me and Dupree |
The Night Listener |
| Lisa Rose | 2.5 | 3.5 | 3.0 | 3.5 | 2.5 | 3.0 |
| Mick LaSalle | 3.0 | 4.0 | 2.0 | 3.0 | 2.0 | 3.0 |
| Jack Mathews | 3.0 | 4.0 | 5.0 | 3.5 | 3.0 | |
| Sheheryar | 2.5 | 3.0 | 3.0 |
which we can store in Erlang as a dictionary:
dict:from_list([
{
"Lisa Rose",
dict:from_list( [{"Lady in the Water", 2.5},{"Snakes on a Plane", 3.5},
{"Just My Luck", 3.0},{"Superman Returns", 3.5},
{"You, Me and Dupree", 2.5},{"The Night Listener", 3.0}])
},
{"Mick LaSalle",
dict:from_list( [{"Lady in the Water", 3.0},{"Snakes on a Plane", 4.0},
{"Just My Luck", 2.0},{"Superman Returns", 3.0},
{"You, Me and Dupree", 2.0},{"The Night Listener", 3.0}])
},
{"Jack Matthews",
dict:from_list( [{"Lady in the Water", 3.0},{"Snakes on a Plane", 4.0},
{"Superman Returns", 5.0},{"You, Me and Dupree", 3.5},
{"The Night Listener", 3.0}])
},
{"Shey",
dict:from_list( [{"Snakes on a Plane", 2.5},{"Superman Returns", 3.0},
{"You, Me and Dupree", 3.0}])}
]).
To determine how similar two users preferences are we need to compute a similarity score. An easy way to calculate similarity is to use the scaled Euclidean Distance formula which we learned back in high school geometry. The function returns a value between 0 and 1, where a value of 1 means that two people have identical tastes.
sim_distance(List1, List2) when length(List1) /= 0, length(List1) == length(List1) ->
Sum_of_squares = lists:sum([ math:pow(X-Y,2) || {X,Y} <- lists:zip(List1,List2) ]),
1.0 / (1.0 + math:sqrt(Sum_of_squares)).
The function accepts two lists, each lists contains the user’s rating of a movie, where the ith element of each lists represents the user’s rating for the same movie. Note the use of the lists:zip function which takes two lists of equal length and returns one list of tuples with two elements each, where the first element of each tuple is taken from the first list and the second element is taken from corresponding element in the second list. Finally, the function also has two guard expressions that prevent the user from calling the function with empty lists or lists of unequal length. Another simple distance metric is the Manhattan distance, it represents the absolute differences between coordinates of a pair of objects:
sim_manhattan(X,Y) when length(X) /= 0, length(X) == length(Y) ->
Sum = lists:sum([ abs(Xn-Yn) || {Xn,Yn} <- lists:zip(X,Y)]),
1.0/ (1.0 + Sum).
In this scenario, a slightly better way to determine the similarity between two objects is to use a Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line and it gives better results in when the data isn’t well normalized– for example, if critics’ movie rankings are routinely harsher than average. You can read about other other distance metrics for comparing items on Wikipedia.
sim_pearson(X, Y) when length(X) == length(Y) ->
Len = length(X),
%% sum
SumX = lists:sum(X),
SumY = lists:sum(Y),
%% sum of squares
SumXSq=lists:sum([math:pow(N,2) || N <- X]),
SumYSq=lists:sum([math:pow(N,2) || N <- Y]),
%% sum of products
Sum_products = lists:sum( [ A*B || {A,B} <- lists:zip(X,Y)]),
%% Calculate Pearson Score
Numer = Sum_products - (SumX*SumY/Len),
case math:sqrt( (SumXSq - math:pow(SumX,2) /Len) * (SumYSq - math:pow(SumY,2)/Len)) of
0.0 ->
0.0;
Denom ->
Numer/Denom %% Don't put a comma or semi-colon here
end.
A complete listing of the code is available online; you can experiment with the different distance metrics using the main function, but due to some limitations of the command line interface, you’ll have to assign the function to a variable before passing it to the main function:
1> c(pci).
2> SimPearson = fun(X,Y) -> pci:sim_pearson(X,Y) end.
3> pci:main("Mick LaSalle", "Lisa Rose", SimPearson).
While this code is (mostly) functional, we haven’t taken advantage of Erlang’s distributed processes, soon we’ll explore the K-Means clustering algorithm which is very well suited to parallel implementations.
Your comments and Feedback are appreciated.
