Recommendations Worth a Million

Recommendations Worth a Million: An Introduction to Clustering
15.071x The Analytics Edge
Clapper image is in the public domain. Source: Pixabay.

Netflix
Online DVD rental and streaming video service.
More than 40 million subscribers worldwide.
$3.6 billion in revenue.
A key aspect is being able to offer customers accurate movie recommendations based on a customer's own preferences and viewing history.
Photo of Netflix envelopes by BlueMint on Wikimedia Commons. License: CC BY.

The Netflix Prize
From 2006 to 2009, Netflix ran a contest asking the public to submit algorithms to predict user ratings for movies.
A training data set of ~100,000,000 ratings and a test data set of ~3,000,000 ratings were provided.
Netflix offered a grand prize of $1,000,000 USD to the team that could beat Netflix's own algorithm, Cinematch, by more than 10%, measured in RMSE.
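The contest target is stated in RMSE (root mean squared error). As a minimal sketch of how RMSE and the percentage improvement over a baseline could be computed, the following uses made-up ratings; the function names `rmse` and `percent_improvement` are hypothetical and not taken from the contest.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def percent_improvement(baseline_rmse, new_rmse):
    """Relative RMSE reduction of a new model over a baseline, in percent."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


# Made-up ratings, for illustration only.
actual = [4, 3, 5, 2]
baseline_pred = [3, 3, 4, 4]  # hypothetical baseline predictions
new_pred = [4, 3, 4, 3]       # hypothetical challenger predictions

baseline = rmse(baseline_pred, actual)
challenger = rmse(new_pred, actual)
print(round(baseline, 3), round(challenger, 3))
print(round(percent_improvement(baseline, challenger), 1), "% improvement")
```

Under this definition, a 10% improvement means the challenger's RMSE is at most 90% of the baseline's RMSE on the same test ratings.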

Contest Rules
If the grand prize was not yet reached, progress prizes of $50,000 USD per year would be awarded for the best result so far, as long as it had >1% improvement over the previous year.
Teams must submit code and a description of the algorithm to be awarded any prizes.
If any team met the 10% improvement goal, last call would be issued and 30 days would remain for all teams to submit their best algorithm.

Initial Results
The contest went live on October 2, 2006.
By October 8, a team submitted an algorithm that beat Cinematch.
By October 15, there were three teams with algorithms beating Cinematch.
One of these solutions beat Cinematch by >1%, qualifying for a progress prize.

Progress During the Contest
By June 2007, over 20,000 teams had registered from over 150 countries.
The 2007 progress prize went to team BellKor, with an 8.43% improvement on Cinematch.
In the following year, several teams from across the world joined forces.

Competition Intensifies
The 2008 progress prize went to team BellKor in BigChaos, which contained researchers from the original BellKor team as well as the team BigChaos.
This was the last progress prize because another 1% improvement would reach the grand prize goal of 10%.

Last Call Announced
On June 26, 2009, the team BellKor's Pragmatic Chaos submitted a 10.05% improvement over Cinematch.
Screenshot of Netflix prize is in the public domain. Source: Wikimedia Commons.

Predicting the Best User Ratings
Netflix was willing to pay over $1M for the best user rating algorithm, which shows how critical the recommendation system was to their business.
What data could be used to predict user ratings?
Every movie in Netflix's database has the ranking from all users who have ranked that movie.
We also know facts about the movie itself: actors, director, genre classifications, year released, etc.

Using Other Users' Rankings
Ratings for Men in Black, Apollo 13, Top Gun, and Terminator (a user who has not rated a movie has no entry for it):
Amy: 5 4 5 4
Bob: 3 2 5
Carl: 5 4 4
Dan: 4 2
Consider suggesting to Carl that he watch Men in Black, since Amy rated it highly and Carl and Amy seem to have similar preferences.
This technique is called Collaborative Filtering.
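The reasoning behind the suggestion to Carl can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Amy's row is complete on the slide, but which movies the other users left unrated is assumed here (Carl is assumed not to have rated Men in Black, since it is being suggested to him), and the similarity measure (mean absolute rating gap on co-rated movies) is one simple choice among many.

```python
MOVIES = ["Men in Black", "Apollo 13", "Top Gun", "Terminator"]

# None marks an unrated movie. Amy's row is from the slide; where the blanks
# fall in the other rows is an assumption made for this illustration.
RATINGS = {
    "Amy":  [5, 4, 5, 4],
    "Bob":  [3, None, 2, 5],
    "Carl": [None, 5, 4, 4],
    "Dan":  [4, 2, None, None],
}


def mean_abs_difference(a, b):
    """Average rating gap over movies both users rated (None if no overlap)."""
    gaps = [abs(x - y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum(gaps) / len(gaps) if gaps else None


def most_similar_user(target, ratings):
    """Return the other user whose co-rated movies are closest to the target's."""
    best_user, best_gap = None, None
    for user, row in ratings.items():
        if user == target:
            continue
        gap = mean_abs_difference(ratings[target], row)
        if gap is not None and (best_gap is None or gap < best_gap):
            best_user, best_gap = user, gap
    return best_user


def recommend(target, ratings, movies, min_rating=4):
    """Suggest movies the nearest neighbour rated highly and the target has not rated."""
    neighbour = most_similar_user(target, ratings)
    if neighbour is None:
        return []
    return [
        movie
        for movie, mine, theirs in zip(movies, ratings[target], ratings[neighbour])
        if mine is None and theirs is not None and theirs >= min_rating
    ]


print(most_similar_user("Carl", RATINGS))   # Amy
print(recommend("Carl", RATINGS, MOVIES))   # ['Men in Black']
```

Note that the sketch never looks at what the movies are about; it relies only on the pattern of ratings, which is the defining trait of collaborative filtering.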

Using Movie Information
We saw that Amy liked Men In Black:
It was directed by Barry Sonnenfeld.
Classified in the genres of action, adventure, sci-fi and comedy.
It stars actor Will Smith.
Consider recommending to Amy:
Barry Sonnenfeld's movie Get Shorty.
Jurassic Park, which is in the genres of action, adventure, and sci-fi.
Will Smith's movie Hitch.
This technique is called Content Filtering.
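A rough sketch of the same idea in Python follows. The attribute sets contain only what the slide states, so the catalogue is deliberately incomplete, and scoring a candidate by the number of attributes it shares with the liked movie is an assumption chosen for simplicity, not a description of any production system.

```python
FIELDS = ("director", "genres", "stars")

# Attributes are limited to what the slide states about each movie.
LIKED = {
    "title": "Men in Black",
    "director": {"Barry Sonnenfeld"},
    "genres": {"action", "adventure", "sci-fi", "comedy"},
    "stars": {"Will Smith"},
}

CANDIDATES = [
    {"title": "Get Shorty", "director": {"Barry Sonnenfeld"},
     "genres": set(), "stars": set()},
    {"title": "Jurassic Park", "director": set(),
     "genres": {"action", "adventure", "sci-fi"}, "stars": set()},
    {"title": "Hitch", "director": set(),
     "genres": set(), "stars": {"Will Smith"}},
]


def shared_attributes(liked, candidate):
    """Return the non-empty overlaps between two movies, keyed by field."""
    overlaps = {}
    for field in FIELDS:
        common = liked[field] & candidate[field]
        if common:
            overlaps[field] = common
    return overlaps


def recommend_similar(liked, candidates):
    """Rank candidates by how many attributes they share with the liked movie."""
    scored = []
    for candidate in candidates:
        shared = shared_attributes(liked, candidate)
        score = sum(len(values) for values in shared.values())
        if score > 0:
            scored.append((score, candidate["title"], shared))
    # Highest score first; ties broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(title, shared) for _, title, shared in scored]


for title, shared in recommend_similar(LIKED, CANDIDATES):
    print(title, shared)
```

Unlike the collaborative approach, this needs no ratings from other users, only descriptive information about the items.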

Strengths and Weaknesses
Collaborative Filtering Systems:
Can accurately suggest complex items without understanding the nature of the items.
Requires a lot of data about the user to make accurate recommendations.
Millions of items need lots of computing power.
Content Filtering:
Requires very little data to get started.
Can be limited in scope.

Hybrid Recommendation Systems
Netflix uses both collaborative and content filtering.
For example, consider a collaborative filtering approach where we determine that Amy and Carl have similar preferences.
We could then do content filtering, where we would find that Terminator, which both Amy and Carl liked, is classified in almost the same set of genres as Starship Troopers.
Recommend Starship Troopers to both Amy and Carl, even though neither of them has seen it before.

MovieLens Data
www.movielens.org is a movie recommendation website run by the GroupLens Research Lab at the University of Minnesota.
They collect user preferences about movies and do collaborative filtering to make recommendations.
We will use their movie database to do content filtering using a technique called clustering.

MovieLens Item Dataset
Movies in the dataset are categorized as belonging to different genres:
(Unknown), Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western.
Each movie may belong to many genres.
Can we systematically find groups of movies with similar sets of genres?

Why Clustering?
Unsupervised learning: the goal is to segment the data into similar groups instead of prediction.
Can also cluster data into similar groups and then build a predictive model for each group.
Be careful not to overfit your model! This works best with large datasets.
Clustering image is in the public domain. Source: Wikimedia Commons.

Types of Clustering Methods
There are many different algorithms for clustering.
They differ in what makes a cluster and how to find them.
We will cover hierarchical clustering here and K-means in the next lecture.

Distance Between Points
Need to define distance between two data points.
Most popular is Euclidean distance.
Distance between points i and j is
$$d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ik} - x_{jk})^2}$$
where k is the number of independent variables.

Distance Example
The movie Toy Story is categorized as Animation, Comedy, and Children's.
Toy Story: (0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
The movie Batman Forever is categorized as Action, Adventure, Comedy, and Crime.
Batman Forever: (0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)

Distance Between Points
Toy Story: (0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
Batman Forever: (0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)
The two vectors differ in five genres, so the Euclidean distance between them is $\sqrt{5} \approx 2.24$.
Other popular distance metrics:
Manhattan Distance: sum of absolute values instead of squares.
Maximum Coordinate Distance: only consider the measurement for which the data points deviate the most.
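The three distance metrics can be checked on the two example movies with a short Python sketch. The genre order follows the MovieLens genre list given earlier in the deck; the helper names are hypothetical.

```python
import math

# Genre order as listed for the MovieLens item dataset.
GENRES = [
    "Unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def genre_vector(genres):
    """Encode a set of genre names as a 0/1 vector in GENRES order."""
    return [1 if genre in genres else 0 for genre in GENRES]


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def max_coordinate(a, b):
    """Largest absolute difference over all coordinates."""
    return max(abs(x - y) for x, y in zip(a, b))


toy_story = genre_vector({"Animation", "Children's", "Comedy"})
batman_forever = genre_vector({"Action", "Adventure", "Comedy", "Crime"})

print(round(euclidean(toy_story, batman_forever), 3))  # 2.236
print(manhattan(toy_story, batman_forever))            # 5
print(max_coordinate(toy_story, batman_forever))       # 1
```

Because every variable is 0 or 1, the squared Euclidean distance and the Manhattan distance both simply count the genres in which the two movies differ.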

Distance Between Clusters
Minimum Distance: the distance between clusters is the distance between the points that are the closest.
Clustering image is in the public domain. Source: Wikimedia Commons.

Distance Between Clusters
Maximum Distance: the distance between clusters is the distance between the points that are the farthest.

Distance Between Clusters
Centroid Distance: the distance between the centroids of the clusters.
The centroid is the point whose value in each component is the average of that component over all data points in the cluster.
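The three ways of measuring distance between clusters can be compared side by side. The sketch below uses two small made-up clusters of 2-D points and Euclidean distance between individual points; the function names are hypothetical.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_distance(cluster_a, cluster_b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(euclidean(p, q) for p, q in product(cluster_a, cluster_b))


def maximum_distance(cluster_a, cluster_b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(euclidean(p, q) for p, q in product(cluster_a, cluster_b))


def centroid(cluster):
    """Component-wise average of all points in the cluster."""
    return tuple(sum(column) / len(cluster) for column in zip(*cluster))


def centroid_distance(cluster_a, cluster_b):
    """Distance between the centroids of the two clusters."""
    return euclidean(centroid(cluster_a), centroid(cluster_b))


# Made-up points, for illustration only.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 2.0)]

print(minimum_distance(cluster_a, cluster_b))            # 3.0
print(round(maximum_distance(cluster_a, cluster_b), 3))  # 5.385
print(centroid_distance(cluster_a, cluster_b))           # 4.0
```

The same pair of clusters yields three different distances, so the choice of cluster distance can change which clusters are merged first.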

Normalize Data
Distance is highly influenced by the scale of the variables, so it is customary to normalize first.
In our movie dataset, all genre variables are on the same scale, so normalization is not necessary.
However, if we included a variable such as Box Office Revenue, we would need to normalize.
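One common way to normalize, assumed here for illustration since the slide does not name a method, is to subtract the mean of each variable and divide by its standard deviation. The revenue figures below are made up.

```python
import statistics


def normalize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot normalize a constant variable")
    return [(value - mean) / sd for value in values]


# Hypothetical box office revenues in dollars, for illustration only.
revenue = [1_000_000, 250_000_000, 90_000_000, 40_000_000]
scaled = normalize(revenue)

print([round(value, 2) for value in scaled])
```

Without this step, a revenue difference of millions of dollars would swamp the 0/1 genre differences in any Euclidean distance; after it, each variable contributes on a comparable scale.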

Hierarchical
Start with each data point in its own cluster.
Combine the two nearest clusters (Euclidean, Centroid).
Repeat, combining the two nearest clusters at each step, until all data points are in a single cluster.
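The merging procedure can be sketched in plain Python. This is a minimal, unoptimized illustration using Euclidean distance between centroids, with made-up 2-D points; the function `agglomerate` and its `num_clusters` parameter are hypothetical names, and a real analysis would normally rely on a library implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise average of a list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def agglomerate(points, num_clusters=1):
    """Merge the two nearest clusters until num_clusters remain.

    Returns (clusters, merges): clusters is a list of lists of point indices,
    merges records (members_a, members_b, distance) for every merge performed.
    """
    clusters = [[index] for index in range(len(points))]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        # Compare every pair of clusters by the distance between centroids.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                distance = euclidean(
                    centroid([points[i] for i in clusters[a]]),
                    centroid([points[i] for i in clusters[b]]),
                )
                if best is None or distance < best[0]:
                    best = (distance, a, b)
        distance, a, b = best
        merges.append((clusters[a], clusters[b], distance))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Made-up points, for illustration only.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerate(points, num_clusters=3)
print(sorted(sorted(c) for c in clusters))  # [[0, 1], [2, 3], [4]]
```

The distances recorded in `merges` are the heights at which clusters join, which is the information a dendrogram displays.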

Display Cluster Process
The height of the vertical lines represents the distance between points or clusters.
Data points are listed along the bottom.
[Figure: Cluster Dendrogram; vertical axis: Height; five data points along the bottom]

Select Clusters
[Figure: Cluster Dendrogram; vertical axis: Height; ten data points along the bottom]

Meaningful Clusters?
Look at statistics (mean, min, max, ...) for each cluster and each variable.
See if the clusters have a feature in common that was not used in the clustering (like an outcome).
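A per-cluster summary of this kind could be computed as in the sketch below. The rows, variable names, and cluster labels are made up for illustration, and `cluster_summary` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def cluster_summary(rows, labels, variable_names):
    """Compute mean, min and max of every variable within every cluster."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    summary = {}
    for label, members in grouped.items():
        # zip(*members) turns the list of rows into one tuple per variable.
        summary[label] = {
            name: {"mean": mean(column), "min": min(column), "max": max(column)}
            for name, column in zip(variable_names, zip(*members))
        }
    return summary


# Hypothetical 0/1 genre flags and cluster assignments, for illustration only.
variables = ["Action", "Comedy", "Romance"]
rows = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 1)]
labels = [1, 1, 2, 2]

for label, stats in cluster_summary(rows, labels, variables).items():
    print(label, {name: values["mean"] for name, values in stats.items()})
```

For 0/1 genre variables, the mean within a cluster is the share of that cluster's movies belonging to the genre, which makes it easy to describe a cluster as, say, mostly action or mostly romance.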

Beyond Movies: Mass Personalization
"If I have 3 million customers on the web, I should have 3 million stores on the web."
Jeff Bezos, CEO of Amazon.com
Recommendation systems build models about users' preferences to personalize the user experience.
Help users find items they might not have searched for:
A new favorite band.
An old friend who uses the same social media network.
A book or song they are likely to enjoy.

Cornerstone of these Top Businesses

Recommendation Method Used
Collaborative Filtering: Amazon.com, Last.fm, Spotify, Facebook, LinkedIn, Google News, MySpace, Netflix.
Content Filtering: Pandora, IMDB, Rotten Tomatoes, Jinni, Rovi Corporation, See This Next, MovieLens, Netflix.

The Netflix Prize: The Final 30 Days
29 days after last call was announced, on July 25, 2009, the team The Ensemble submitted a 10.09% improvement.
When Netflix stopped accepting submissions the next day, BellKor's Pragmatic Chaos had submitted a 10.09% improvement solution and The Ensemble had submitted a 10.10% improvement solution.
Netflix would now test the algorithms on a private test set and announce the winners.

Winners are Declared!
On September 18, 2009, a winning team was announced.
BellKor's Pragmatic Chaos won the competition and the $1,000,000 grand prize.
Photo of Team Bellkor Pragmatic Chaos by Mike K on Flickr. License: CC BY-NC.

The Edge of Recommendation Systems
In today's digital age, businesses often have hundreds of thousands of items to offer their customers.
Excellent recommendation systems can make or break these businesses.
Clustering algorithms, which are tailored to find similar customers or similar items, form the backbone of many of these recommendation systems.

MIT OpenCourseWare
https://ocw.mit.edu/
15.071 Analytics Edge, Spring 2017
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.