Global Journal of Engineering Science and Research Management


A KERNEL BASED APPROACH: USING MOVIE SCRIPT FOR ASSESSING BOX OFFICE PERFORMANCE
Mr. K. R. Dabhade *1, Ms. S. S. Ponde 2
*1 Computer Science Department, D.I.E.M.S.
2 Asst. Prof., Computer Science Department, D.I.E.M.S.

KEYWORDS: Semantic variables, green-lighting, kernel-based approach.

ABSTRACT
We present a method to predict the box office performance of a movie at the time of green-lighting, when only the budget and the script are available. Textual features are extracted from the script at three levels: screenwriting domain knowledge (genre and content variables), natural language processing techniques (bag of words), and human-coded semantic variables. These textual variables define a distance metric between scripts, which serves as input to a kernel-based approach for assessing box office performance.

INTRODUCTION
Movie studios must choose among thousands of scripts to decide which ones to turn into movies. Despite the large amounts of money invested in movies, this process, known in the industry as green-lighting, is largely guesswork based on experts' experience and intuition [1][3]. The proposed system [5] offers a new approach to help studios evaluate scripts and thereby make more profitable green-lighting decisions. It combines screenwriting domain knowledge, natural language processing techniques, and statistical learning methods to forecast a movie's return on investment [4] based only on the textual information available in the movie script [5]. The model is tested in a holdout decision task, which shows that it can significantly improve a studio's gross return on investment. When deciding which scripts to turn into movies (i.e., green-lighting), movie studios and filmmakers must assess the box office potential of a movie based only on its script and allocated production budget, because most post-production drivers of box office performance (e.g., actors, director, MPAA rating) are unknown at the point of green-lighting, when financial commitments have to be made.
Movie producers usually rely on a comps-based approach [3] to assess the box office potential of a new script: they identify past movies that are similar to the script and use the box office performance of those movies as benchmarks for its revenue potential. The question, then, is how similarity between movie scripts should be measured. For instance, should one focus on the theme, on the actual words and language used, or on the structure of the scenes and dialogues? Our goal is to answer these questions and to develop a decision aid that helps studios make green-lighting decisions. We develop a method, based on text mining and the kernel approach, that identifies the comps of a new script from its content and textual features, and thereby assesses its revenue potential. The research contribution is threefold. First, we collect and analyze actual movie scripts. Second, we show that the kernel approach outperforms both regression and tree-based methods in the context of assessing box office performance. Third, the estimated feature weights provide insight into which textual features require particular attention when identifying useful comps for a new script. The next section gives an overview of the script data set and how textual information is extracted from scripts; Section 3 describes the kernel-based approach and how the feature weights are estimated. We then compare our method with other benchmark methods and present a hypothetical portfolio selection scenario in which the proposed method yields a lower mean squared error.

TEXTUAL FEATURES FROM MOVIE SCRIPTS
The data comprise more than 300 movie scripts available online; for each movie we record the U.S. box office revenue and production budget from IMDB (the Internet Movie Database).
Genre and Content Variables: The first layer of textual information summarizes a script by its overall theme. We consider eight genres, based on the category of the movie, and the content variables describe details of the storyline, such as whether the story has a happy or a sad ending. The eight genres are as follows:

Romance (ROM), Thriller (THR), Drama (DRA), Comedy (COM), Horror (HOR), Family (FAM), Action (ACT), and Sci-Fi (SCI). For each script, a set of simple yes/no questions about its storyline, identified by scriptwriting experts, is answered for each genre.

Semantic Variables: The semantic variables form the second layer of textual information: they capture the structure of a script and provide a preview of how the final movie will look. A script is organized into interior/exterior scenes, and each scene is composed of characters' dialogues. We define variables at two levels: (i) the scene level, where we obtain the total number of scenes in the movie and the way each character interacts with co-actors; and (ii) the dialogue level, where we obtain the manner in which characters communicate. The variables are: i) number of scenes (NSCENE); ii) percentage of interior scenes (INTPREC); iii) number of dialogues (NDIAG); iv) average dialogue length (AVGDIAGLEN); v) the concentration index of dialogues (DIAGCONC). We use the Herfindahl-Hirschman (HH) index to compute the dialogue concentration; its value lies between 0 and 1, and a higher index indicates that the dialogue is concentrated among a few characters.

Bag-of-Words Variables: The bag of words is the third layer of textual information, obtained using natural language processing techniques. The words used in a script and the frequencies of their usage are the backbone of the storyline. We extract the bag of words from the scripts in the following steps: (i) we eliminate all punctuation, stop words, and standard English names; (ii) a stemming algorithm is used to reduce words to their simplest form. Even after stemming and stop-word removal, thousands of unique words appear in one or more scripts. Hence we compute an importance index for each word.
The importance index is

    I_i = d_i / N_i,  i ∈ D    (1)

where d_i denotes the number of scripts that contain the i-th word and N_i is the total frequency of occurrence of the i-th word. We keep the few hundred words with the highest importance as important words, and finally perform latent semantic analysis [6] to further reduce the dimensionality of the word-document matrix. Based on singular-value decomposition (SVD), it allows us to index each script by a small set of scores.

Summary and Potential Data Limitations: Summary statistics are computed for each variable in the data set. All textual variables, together with the (log-) production budget, are used as predictors in the kernel-based approach that forecasts box office performance.

A KERNEL-BASED APPROACH TO FORECAST BOX OFFICE PERFORMANCE
The kernel-based method uses a distance metric to identify the similarity between a new observation and each observation in the training database. The approach is free of functional-form assumptions, which gives it the flexibility to capture complex relationships between the textual features of a script and box office performance. We therefore consider the kernel-based approach an appropriate way to model the relationship between a script's textual variables and its box office results. A further advantage is that it is business-friendly: its predictions can be communicated directly to studio managers in terms of comps.
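As a minimal sketch (with illustrative names, not the paper's code), the two word- and dialogue-level quantities above might be computed as follows; note that I_i = d_i / N_i is an assumed reading of the importance-index formula:

```python
from collections import Counter

def importance_index(tokenized_scripts):
    """Importance index per word over a list of tokenized scripts."""
    doc_freq = Counter()    # d_i: number of scripts containing word i
    total_freq = Counter()  # N_i: total occurrences of word i
    for tokens in tokenized_scripts:
        counts = Counter(tokens)
        for word, c in counts.items():
            doc_freq[word] += 1
            total_freq[word] += c
    # Assumed reading of the importance-index formula: I_i = d_i / N_i.
    return {w: doc_freq[w] / total_freq[w] for w in total_freq}

def hh_index(dialogue_counts):
    """DIAGCONC: Herfindahl-Hirschman concentration of dialogues.

    Sum of squared dialogue shares; close to 1 when a few characters
    dominate, and equal to 1/k when k characters speak equally often.
    """
    total = sum(dialogue_counts)
    return sum((c / total) ** 2 for c in dialogue_counts)
```

For example, a script where one character delivers 90 of 100 dialogue lines yields an HH index of 0.815, while four equally talkative characters yield 0.25.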

Textual Variable   Max    Mean   SD     Min
GENRE_DRA          1.00   0.55   0.44   0.0
GENRE_ROM          1.00   0.41   0.33   0.0
GENRE_COM          1.00   0.28   0.36   0.0
GENRE_HOR          1.00   0.25   0.41   0.0
Table 2.1: Summary statistics of variables

KERNEL BASED APPROACH
We use the following notation: scripts in the training sample are indexed by i = 1, ..., N. Each script is described by J distinct features, denoted x_i, along with a response variable y_i. We define the response variable for each movie as its (transformed) return on investment (ROI). Specifically:

    y_i = log(BOX OFFICE_i / BUDGET_i)    (2)

We specify the (transformed) ROI as the response variable in the kernel-based method because this specification confers several statistical advantages. First, the distribution of y_i is much closer to normality than that of box office revenues, which have a heavy right tail. The features we consider are the textual variables extracted from each script, along with its production budget. The distance metric between two observations is defined as a weighted Euclidean distance:

    d(x_i, x_l) = [ Σ_{j=1}^{J} v_j (x_ij − x_lj)^2 ]^{1/2}    (3)

where v = (v_1, ..., v_J) is a vector of feature weights. Following the conceptual argument above, we set the kernel bandwidth by appealing to studios' domain knowledge. Studio managers typically look at no more than 10 comps when making a green-lighting decision. We therefore select the bandwidth such that any comp beyond the 10th receives minimal weight; this is achieved by choosing it so that, on average, the 10th comp receives a weight proportional to the density of a standard normal distribution at two standard deviations from the mode. Hence the 11th and further comps have negligible weights.

Feature weight calibration (v): As a starting point, a reasonable default choice for the calibrated feature weights v is to put equal weight on every variable, i.e., v_j = 1.
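A minimal sketch of the kernel predictor described above, assuming a Gaussian kernel and the weighted Euclidean distance between feature vectors (function and variable names are illustrative, not the paper's code):

```python
import math

def kernel_predict(x_new, X_train, y_train, v, bandwidth):
    """Kernel-weighted average prediction for a new script.

    x_new: feature vector of the new script
    X_train, y_train: training features and responses
                      y_i = log(box_office_i / budget_i)
    v: per-feature weights (all 1.0 for the Kernel-I default)
    bandwidth: controls how quickly distant comps lose weight
    """
    def dist(a, b):
        # Weighted Euclidean distance between two feature vectors.
        return math.sqrt(sum(vj * (ai - bi) ** 2
                             for vj, ai, bi in zip(v, a, b)))

    # Gaussian kernel weight for each training script ("comp"):
    # comps far beyond the bandwidth get negligible weight.
    w = [math.exp(-0.5 * (dist(x_new, x) / bandwidth) ** 2)
         for x in X_train]
    # Prediction is the kernel-weighted average of the responses.
    return sum(wi * yi for wi, yi in zip(w, y_train)) / sum(w)
```

With a small bandwidth the prediction collapses onto the nearest comp, mirroring the "top-10 comps" calibration discussed above.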
We refer to this as the Kernel-I approach. We evaluate its predictive performance against the Kernel-II approach, which uses calibrated feature weights. The proposed approach is based on cross-validation to calibrate the feature weights v for Kernel-II. We define the leave-one-out mean squared error (LOOMSE), a key component of our objective function. Let i = 1, ..., n (n = 265) index the scripts in the training sample, and let ẑ_i(v) be the predicted value of the log box office revenue of the i-th script when all scripts except the i-th are used as training data; z_i denotes the actual log box office revenue of the i-th script. Then

    LOOMSE(v) = (1/n) Σ_{i=1}^{n} (z_i − ẑ_i(v))^2    (4)

Portfolio Selection: We now demonstrate the potential economic significance of the proposed method by conducting a hypothetical portfolio selection exercise that compares the performance of the comps-based approach with our proposed Kernel-I/II methods. We consider the following setting: suppose we would like to pick r scripts to form a movie portfolio.
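Before turning to the portfolio exercise, the leave-one-out criterion above can be sketched as follows (a simplified version that takes any predictor as a function argument; names are illustrative):

```python
def loomse(X, y, v, bandwidth, predict):
    """Leave-one-out mean squared error.

    For each script i, predict its response using every other script
    as training data, then average the squared prediction errors.
    `predict(x_new, X_train, y_train, v, bandwidth)` is any kernel
    predictor with this signature.
    """
    n = len(X)
    total = 0.0
    for i in range(n):
        # Hold out script i; train on the rest.
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        z_hat = predict(X[i], X_rest, y_rest, v, bandwidth)
        total += (y[i] - z_hat) ** 2
    return total / n
```

Calibrating Kernel-II then amounts to searching for the weight vector v that minimizes this quantity.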

First, based on the predicted box office revenue and the given production budget, we compute the predicted ROI of each of the 35 scripts in the holdout sample. The holdout scripts are then ranked by predicted ROI, and the r scripts with the highest predicted ROI are selected. We vary r from 5 to 20 and compare the ROIs of the portfolios selected by the comps-based method, Kernel-I, and Kernel-II, respectively. The results are shown in Fig. 2. While there is considerable variability in portfolio ROIs ((total box office − total budget) / total budget) across all methods, the portfolios selected by the Kernel-I and Kernel-II approaches consistently provide higher returns than those selected by the comps-based method. When r = 10 scripts are selected to form a portfolio, the selections by Kernel-I and Kernel-II yield portfolio ROIs of 130.3 percent (box office = $1184.7M; budget = $514.5M) and 134.6 percent (box office = $1236.3M; budget = $527.0M), respectively, while the selection by the comps-based method yields an ROI of only 76.4 percent (box office = $307.8M; budget = $174.5M). Across the different values of r (from 5 to 20), the median ROI of the portfolios selected by Kernel-I and Kernel-II is around 134.0 and 134.1 percent, respectively, while the median ROI of the portfolios selected by the comps-based method is only around 83.9 percent. Thus the improvement in prediction accuracy afforded by the Kernel-I/II methods is also economically significant.

                 Kernel-I   Kernel-II
Bag of Words     0.4300     0.4021
Semantics        0.4440     0.4219
Original result  0.4096     0.3822
Table 3: Holdout predictive performance (in terms of MSE) for Kernel-I and Kernel-II

CONCLUSION
This paper presents a methodology based on the kernel approach for predicting the box office potential of movie scripts at the point of green-lighting. The method achieves the lowest mean squared error among the compared approaches, making it possible to assess box office performance using only the movie script and the production budget.
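As a concrete sketch of the portfolio selection exercise described above (field names are illustrative, not the paper's data format):

```python
def select_portfolio(scripts, r):
    """Rank holdout scripts by predicted ROI and keep the top r.

    scripts: list of dicts with 'pred_box' (predicted box office),
             'actual_box', and 'budget' (same monetary units).
    """
    ranked = sorted(scripts,
                    key=lambda s: s["pred_box"] / s["budget"],
                    reverse=True)
    return ranked[:r]

def portfolio_roi(selected):
    """Realized ROI: (total actual box office - total budget) / total budget."""
    box = sum(s["actual_box"] for s in selected)
    budget = sum(s["budget"] for s in selected)
    return (box - budget) / budget
```

Plugging in the paper's r = 10 Kernel-I totals (box office $1184.7M on a $514.5M budget) reproduces the reported 130.3 percent portfolio ROI.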
REFERENCES
1. J. Eliashberg, S.K. Hui, and Z.J. Zhang, "From Story Line to Box Office: A New Approach for Green-Lighting Movie Scripts," Management Science, vol. 53, no. 6, pp. 881-893, 2007.
2. Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data.
3. Mining Online Reviews for Predicting Sales Performance in the Movie Domain.
4. Online Review Mining for Forecasting Sales.
5. H. Chipman, E. George, and R. McCulloch, "BART: Bayesian Additive Regression Trees," The Annals of Applied Statistics, vol. 4, no. 1, pp. 266-298, 2010.
6. S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman, "Indexing by Latent Semantic Analysis," J. Am. Soc. for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
7. H. Mukerjee, "Nearest Neighbor Regression with Heavy-Tailed Errors," The Annals of Statistics, vol. 21, no. 2, pp. 681-693, 1993.
8. J. Eliashberg, S.K. Hui, and Z.J. Zhang, "From Story Line to Box Office: A New Approach for Green-Lighting Movie Scripts," Management Science, vol. 53, no. 6, pp. 881-893, 2007.
9. J. Eliashberg, C. Weinberg, and S. Hui, "Decision Models for the Movie Industry," Handbook of Marketing Decision Models, pp. 437-468, Springer, 2008.
10. E.J. Epstein, The Big Picture: The New Logic of Money and Power in Hollywood. Random House, 2005.

11. S. Field, Screenplay: The Foundations of Screenwriting, third ed., Dell Publishing, 1994.