Potential and Limitations of Commercial Sentiment Detection Tools
Fatih Uzdilli, joint work with Mark Cieliebak and Oliver Dürr
03.12.2013 @ ESSEM 13
About Me
Fatih Uzdilli
Institute of Applied Information Technology (InIT), ZHAW, Winterthur, Switzerland
More about me: home.zhaw.ch/~uzdi
- Research interests: Information Retrieval, Machine Learning, Sentiment Analysis
- Background: Software Engineer, Social Media Monitoring, Search Technologies
Abstract
- Evaluation of 9 commercial sentiment tools on approx. 30,000 short texts.
- The best commercial tools reach an accuracy of only about 60%.
- Combining all tools with a Random Forest meta-classifier improved the accuracy.
Motivation
- Scientific results for sentiment detection: «very good performance: > 80% accuracy»
- Blog posts about commercial tools: «very poor quality, unusable»
How good is commercial sentiment detection?
Is there potential for improvement?
Evaluation Setup
- 7 public text corpora: single statements; different media types (tweets, news, reviews, speech transcripts); 28,653 texts in total
- 9 commercial APIs: stand-alone; free for this evaluation; accept arbitrary text
- Output labels: POSITIVE, NEGATIVE, OTHER (neutral / mixed)
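The setup above boils down to sending each text to each API and comparing the returned label with the corpus annotation. A minimal sketch of that evaluation loop, with a toy corpus and a stand-in `fake_api` function in place of a real commercial API (both are assumptions for illustration):

```python
def fake_api(text):
    # Stand-in for a commercial sentiment API call (hypothetical interface).
    # Labels: "+" = positive, "-" = negative, "o" = other (neutral/mixed).
    return "+" if "great" in text else "-"

# Toy corpus: (text, gold annotation) pairs, keyed by corpus name.
corpora = {
    "C1(Tweets)": [
        ("great phone!", "+"),
        ("battery died fast", "-"),
        ("just bought it", "o"),
    ],
}

def accuracy_per_corpus(corpora, classify):
    """Fraction of texts where the tool's label matches the gold annotation."""
    scores = {}
    for name, texts in corpora.items():
        correct = sum(1 for text, gold in texts if classify(text) == gold)
        scores[name] = correct / len(texts)
    return scores

print(accuracy_per_corpus(corpora, fake_api))
```

Running the same loop once per API over all seven corpora yields the per-tool, per-corpus accuracies shown in the following charts.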
Tool Accuracy
[Bar chart: accuracy (0.2–0.8) per corpus for C1(Tweets), C2(Quotations), C3(Reviews), C4(Headlines), C5(Reviews), C6(Reviews), C7(News). Averages across corpora: best tool per corpus 61%, average of all tools 52%, worst tool per corpus 40%.]
Tool Accuracy (continued)
[Same bar chart with two single tools tracked across all corpora: the overall best tool averages 59%, the overall worst tool 45%.]
Further Findings
- Longer texts are harder to classify.
- Some corpus annotations may be erroneous.
Can a Meta-Classifier Do Better?
1st approach: Majority classifier — the sentiment with the most votes is chosen.
Illustration:

         api1  api2  api3  api4  api5  api6  api7  Majority
Text 1    +     +     -     o     -     +     o       +
Text 2    -     +     +     -     -     -     -       -
Text 3    -     o     +     +     +     +     -       +
...
Text n    o     o     +     o     -     o     o       o
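The voting rule in the illustration can be sketched in a few lines. The tie-breaking fallback to "o" is an assumption, since the slide does not specify how ties are resolved:

```python
from collections import Counter

def majority_vote(labels):
    """Return the label most tools agree on; ties fall back to 'o' (other)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "o"  # no clear majority (assumed tie-breaking rule)
    return counts[0][0]

# Rows from the illustration: per-text outputs of api1..api7.
print(majority_vote(["+", "+", "-", "o", "-", "+", "o"]))  # Text 1
print(majority_vote(["-", "+", "+", "-", "-", "-", "-"]))  # Text 2
```

For Text 1, "+" wins with 3 of 7 votes; no training data is needed, which makes this baseline attractive despite its simplicity.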
Majority Classifier Beats Average
[Bar chart: the majority classifier added to the per-corpus comparison; its accuracy lies above the average of all tools.]
2nd Approach: Random Forest

         api1  api2  api3  ...  api n  annotation
Text 1    +     -     +          o        +
Text 2    -     +     o          +        -
Text 3    -     o     -          +        -
Text 4    +     o     +          -        +
Text 5    +     o     +          o        o
Text 6    +     o     o          -        o
Text 7    +     -     +          o     unknown
Text 8    +     +     o          -     unknown
Text 9    o     -     +          o     unknown

The annotated texts (1–6) are used to train a Random Forest classifier on the tools' outputs; it then predicts the unknown texts (7–9) as +, +, o.
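A minimal sketch of this meta-classifier, using scikit-learn's `RandomForestClassifier` and the toy data from the table above. The paper does not specify the feature encoding or forest parameters; the +/-/o-to-integer mapping and `n_estimators=100` are assumptions for illustration:

```python
from sklearn.ensemble import RandomForestClassifier

# Assumed label encoding: positive=1, negative=-1, other=0.
ENC = {"+": 1, "-": -1, "o": 0}

# Rows = texts, columns = outputs of api1..api4 (toy data mirroring the slide).
X_train = [[ENC[c] for c in row] for row in
           [["+", "-", "+", "o"], ["-", "+", "o", "+"], ["-", "o", "-", "+"],
            ["+", "o", "+", "-"], ["+", "o", "+", "o"], ["+", "o", "o", "-"]]]
y_train = [ENC[y] for y in ["+", "-", "-", "+", "o", "o"]]

# Texts with unknown sentiment: only the tools' outputs are available.
X_new = [[ENC[c] for c in row] for row in
         [["+", "-", "+", "o"], ["+", "+", "o", "-"], ["o", "-", "+", "o"]]]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_new))
```

Unlike simple majority voting, the forest can learn that some tools are more reliable than others (or reliable only for certain label combinations), which is what lets it beat the best single tool.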
Random Forest Beats Best Single Tool
[Bar chart: the Random Forest combination added to the per-corpus comparison; its accuracy exceeds the majority classifier and the best single tool.]
Summary
- Best single tool: 59% accuracy
- Random Forest combination: up to 9 percentage points improvement