Archive for the ‘datamining’ tag

Inspired by Matthew Podwysocki’s Functional Programming and Collective Intelligence post I decided to recreate some of the similarity algorithms discussed in the Programming Collective Intelligence in Erlang.

Collective Intelligence is a shared intelligence that emerges from the collaboration or competition amongst many individuals, it is used by online retailers such as Amazon.com and Netflix to recommend books and movies.  Today we’ll use these algorithms to compare user preference, and in subsequent posts apply see how these algorithms are used in recommendation systems and clustering algorithms.

Before we continue, there are two unfamiliar concepts which we need to review: list Comprehensions and dictionaries.

List comprehensions are expressions (which are very common to functional programming languages) that allow you to create lists without having to use funs, maps, or filters. This helps you to write programs which are shorter and easier to debug. In Erlang, the syntax is:

[ F(X) || X <- L]

which means, create a list of F(X) where F is a function and X is an element taken from the list L. Additionally, in Erlang we can use qualifiers to filter out elements of the list L.

Dictionaries (Hashes/Maps) are not directly supported by Erlang and to use them in your code you must use the dict module. The module includes the function dict:from_list which converts a list of two element tuples into a dictionary; it also includes the function dict:fetch which takes the Key and the dictionary as an argument and returns the value which matches the key.

Similarity

Below is a list of movie ratings by user

Lady in The Water

Snakes on a Plane

Just My Luck

Superman Returns

You, me and Dupree

The Night Listener

Lisa Rose 2.5 3.5 3.0 3.5 2.5 3.0
Mick LaSalle 3.0 4.0 2.0 3.0 2.0 3.0
Jack Mathews 3.0 4.0 5.0 3.5 3.0
Sheheryar 2.5 3.0 3.0

which we can store in Erlang as a dictionary:

  dict:from_list([
	{
	"Lisa Rose",
	dict:from_list( [{"Lady in the Water", 2.5},{"Snakes on a Plane", 3.5},
				 {"Just My Luck", 3.0},{"Superman Returns", 3.5},
				 {"You, Me and Dupree", 2.5},{"The Night Listener", 3.0}])
	},
	{"Mick LaSalle",
	dict:from_list( [{"Lady in the Water", 3.0},{"Snakes on a Plane", 4.0},
				 {"Just My Luck", 2.0},{"Superman Returns", 3.0},
				 {"You, Me and Dupree", 2.0},{"The Night Listener", 3.0}])
	},
	{"Jack Matthews",
	dict:from_list( [{"Lady in the Water", 3.0},{"Snakes on a Plane", 4.0},
				 {"Superman Returns", 5.0},{"You, Me and Dupree", 3.5},
				 {"The Night Listener", 3.0}])
	},
	{"Shey",
	dict:from_list( [{"Snakes on a Plane", 2.5},{"Superman Returns", 3.0},
				 {"You, Me and Dupree", 3.0}])}
	]).

To determine how similar two users preferences are we need to compute a similarity score. An easy way to calculate similarity is to use the scaled Euclidean Distance formula which we learned back in high school geometry. The function returns a value between 0 and 1, where a value of 1 means that two people have identical tastes.

sim_distance(List1, List2) when length(List1) /= 0, length(List1) == length(List1)  ->
  Sum_of_squares = lists:sum([ math:pow(X-Y,2) || {X,Y} <- lists:zip(List1,List2) ]),
  1.0 / (1.0 + math:sqrt(Sum_of_squares)).

The function accepts two lists, each lists contains the user’s rating of a movie, where the ith element of each lists represents the user’s rating for the same movie. Note the use of the lists:zip function which takes two lists of equal length and returns one list of tuples with two elements each, where the first element of each tuple is taken from the first list and the second element is taken from corresponding element in the second list. Finally, the function also has two guard expressions that prevent the user from calling the function with empty lists or lists of unequal length. Another simple distance metric is the Manhattan distance, it represents the absolute differences between coordinates of a pair of objects:

sim_manhattan(X,Y) when length(X) /= 0,  length(X) == length(Y)  ->
  Sum = lists:sum([ abs(Xn-Yn) || {Xn,Yn} <- lists:zip(X,Y)]),
  1.0/ (1.0 + Sum).

In this scenario, a slightly better way to determine the similarity between two objects is to use a Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line and it gives better results in when the data isn’t well normalized– for example, if critics’ movie rankings are routinely harsher than average. You can read about other other distance metrics for comparing items on Wikipedia.

sim_pearson(X, Y) when length(X) == length(Y)  ->
  Len = length(X),

  %% sum
  SumX = lists:sum(X),
  SumY = lists:sum(Y),

  %% sum of squares
  SumXSq=lists:sum([math:pow(N,2) || N <- X]),
  SumYSq=lists:sum([math:pow(N,2) || N <- Y]),

  %% sum of products
  Sum_products = lists:sum( [ A*B || {A,B} <- lists:zip(X,Y)]),

  %% Calculate Pearson Score
  Numer = Sum_products - (SumX*SumY/Len),

  case math:sqrt( (SumXSq - math:pow(SumX,2) /Len) * (SumYSq - math:pow(SumY,2)/Len)) of
    0.0 ->
      0.0;
    Denom ->
      Numer/Denom %% Don't put a comma or semi-colon here
  end.

A complete listing of the code is available online; you can experiment with the different distance metrics using the main function, but due to some limitations of the command line interface, you’ll have to assign the function to a variable before passing it to the main function:

1> c(pci).
2> SimPearson = fun(X,Y) -> pci:sim_pearson(X,Y) end.
3> pci:main("Mick LaSalle", "Lisa Rose", SimPearson).

While this code is (mostly) functional, we haven’t taken advantage of Erlang’s distributed processes, soon we’ll explore the K-Means clustering algorithm which is very well suited to parallel implementations.

Your comments and Feedback are appreciated.

Introduction to Data Mining by by Pang-Ning Tan, Michael Steinbach, Vipin Kumar

I’ve always maintained that stocks follow repeatable patterns in the short term and a computer should be able to exploit that fairly easily, which is why I’ve always preferred the idea of using pattern recognition over technical analysis in my next automated trading platform.

I needed to have a grasp of the basics of machine learning and data mining which is why I started reading Programming Collective Intelligence.

There are a couple of ideas that I’ve been throwing around in my head—applying pattern recognition and machine learning to financial data—I’m going to use Orange to play with these ideas a little further and I hope, eventually develop a trading strategy.

Having finished PCI, I’m also ready for something more theoretical, so I’ve bought a copy of  Introduction to Data Mining– I’ll read the book over the next few weeks while following along Stanford’s Stats 202 course syllabus, watching their videos and hopefully, form a deeper understanding of which machine learning algorithms to apply to this domain.

I’ve only read 6 chapters of Programming Collective Intelligence and I’ve learned more from it than two AI courses at the University of Houston.  The way the subject matter is presented in university courses is dreadful and obfuscated, filled with mathematical theory without any practical examples to make understanding AI more tractable.

This book covers a broad range of topics: optimization, searching, classification and decision trees, all which the author, Toby Segaran, explains in plain English, with very readable code examples that illustrate the application of these concepts to real world problems. 

Toby has done a great job explaining the source code included in this book, they’re very easy to understand.  I’ve never programmed in Python before, but I find myself learning Python using his examples.

In all, I highly recommend this book to all developers interested in machine learning or data mining.

Search