Saliency and Task-Based Eye Movement Prediction and Guidance


Saliency and Task-Based Eye Movement Prediction and Guidance

by

Srinivas Sridharan

A dissertation proposal submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the B. Thomas Golisano College of Computing and Information Sciences, Rochester Institute of Technology

February 2015

B. THOMAS GOLISANO COLLEGE OF COMPUTING AND INFORMATION SCIENCES
ROCHESTER INSTITUTE OF TECHNOLOGY
ROCHESTER, NEW YORK

CERTIFICATE OF APPROVAL

Ph.D. DEGREE PROPOSAL

The Ph.D. Degree Proposal of Srinivas Sridharan has been examined and approved by the dissertation committee as satisfactory for the dissertation required for the Ph.D. degree in Computing and Information Sciences.

Dr. Reynold J. Bailey, Dissertation Advisor
Coordinator, Ph.D. Degree Program
Dr. Joe M. Geigel
Dr. Anne Haake
Dr. Linwei Wang

Saliency and Task-Based Eye Movement Prediction and Guidance

by Srinivas Sridharan

Submitted to the B. Thomas Golisano College of Computing and Information Sciences in partial fulfillment of the requirements for the Doctor of Philosophy Degree at the Rochester Institute of Technology

Abstract

The ability to predict and guide viewer attention has important applications in computer graphics, image and scene understanding, object detection, visual search and training. Human eye movements have long interested researchers as they provide insight into the cognitive processes involved in task performance. Researchers are also interested in understanding what guides viewer attention in a scene. It has been shown that saliency in the image, scene context, and the task at hand play a significant role in guiding attention. Many computational models have been proposed to predict regions in the scene that are most likely to attract human attention. These models primarily deal with bottom-up visual attention and typically involve free viewing of the scene. In this proposal we would like to develop a more comprehensive computational model for visual attention that uses scene context, scene saliency, the task at hand, and eye movement data to predict future eye movements of the viewer. We would also like to explore the possibility of guiding viewer attention about the scene in a subtle manner based on the predicted gaze obtained from the model. Finally, we would like to tackle the challenging inverse problem: to infer the task being performed by the viewer based on scene information and eye movement data.

Contents

List of Tables
List of Figures

1 Introduction

2 Saliency and Task Based Eye Movement Prediction
  2.1 Problem Definition
  2.2 Research Objectives and Contributions
  2.3 Background and Related Work
    2.3.1 Bottom-up Saliency Based Visual Attention
    2.3.2 Top-Down Cognition Based Visual Attention
  2.4 Proposed Approach
    2.4.1 Scene Context Extraction
    2.4.2 Saliency Map Generation
  2.5 Comprehensive Model
    2.5.1 Training Phase
    2.5.2 Testing Phase
  2.6 Evaluation Measurement
    2.6.1 Kullback-Leibler (KL) Divergence
    2.6.2 Normalized Scanpath Saliency (NSS)
    2.6.3 Linear Correlation Coefficient (LCC)

3 Adaptive Subtle Gaze Guidance Using Estimated Gaze
  3.1 Problem Definition
  3.2 Research Objectives and Contributions
  3.3 Background and Related Work
    3.3.1 Subtle Gaze Direction
  3.4 Proposed Approach
    3.4.1 Adaptive Subtle Gaze Direction Using Estimated Gaze
  3.5 Evaluation
    3.5.1 User Study

4 Task Inference Problem
  4.1 Problem Definition
  4.2 Research Objectives and Contributions
  4.3 Background and Related Work
  4.4 Approach
  4.5 Evaluation
    4.5.1 User Study

5 Timeline

6 Conclusion

A Eye Tracking Datasets

Bibliography

List of Tables

A.1 Eye tracking datasets over still images. The D, T and d columns stand for viewing distance in centimeters, stimulus presentation time in seconds and screen size in inches, respectively. Reproduced from [1]

List of Figures

2.1 (A) This figure illustrates the information preserved by the global features for two images. (B) The average of the output magnitude of the multiscale-oriented filters on a polar plot. (C) The coefficients (global features) obtained by projecting the averaged output filters onto the first 20 principal components. (D) Noise images with filtered outputs at 1, 2, 4 and 8 cycles per image, representing the gist of the scene and maintaining the spatial organization and texture characteristics of the original image. The texture contained in this representation is still relevant for scene categorization (e.g., open, closed, indoor, outdoor, natural or urban scenes). Reproduced from [2]

2.2 (a) Schematic representation of the Koch and Ullman model to compute saliency using primitive feature maps and the center-surround neurophysiological properties of the human eye

2.3 (b) Flowchart of the model developed by Itti to compute a saliency map based on the Koch and Ullman model. This flowchart shows the filtering process involved, the extraction of feature maps, center-surround normalization and also the methods to combine feature maps to obtain the saliency map. Reproduced from [3]

2.4 Schematic diagram of the model for predicting task-based eye movements. The training phase shows the selected training images, eye tracking data on the training images, task-based feature extraction, the saliency map for each image and training fixations extracted from the eye tracking data. We combine all the features to obtain the training feature set and then perform PCA/ICA reduction if necessary to reduce the feature size. A Trainer is implemented to train on this feature set. The second half of the image shows the testing phase, where similar testing features are extracted as in the training phase and then the Trainer is used to predict the eye position. This new predicted eye position can then be compared to testing fixations which serve as ground-truth data

3.1 Figure shows a mammogram image. The large red circle shows the area marked by the expert as an irregularity

3.2 Hypothetical image with current fixation region F and predetermined region of interest A. Inset illustrates the geometric dot product used to compute θ

3.3 Gaze distributions for an image under static and modulated conditions. Input image (top). Gaze distribution for static image (bottom left). Gaze distribution for modulated image (bottom right). White crosses indicate locations preselected by researchers for modulation

3.4 Image on left is the image viewed by the subject when assigned the task of counting the number of deer in the scene. The red circles in the image indicate the viewer's fixation data. The image on right shows the corresponding task-based saliency map, highlighting task-relevant regions to direct the viewer's attention

4.1 Figure showing Image A, a street view image with eye tracking data where the task provided to the viewer was to locate the cars in the scene. Image B shows a similar image that can be classified as a street image and also has cars in the image to be task relevant

4.2 Figure shows the experiment conducted by Yarbus in 1967. The image on the top-left shows the picture of Family and an unexpected visitor, and the scanpaths of a subject for each task in the experiment while viewing the stimulus image

4.3 Image (A) shows the eye movements of a subject when given the task to check the rear view mirrors. Image (B) shows the eye movements of the subject when given the task to check the gauges on the dashboard. The circles in red/green indicate the fixations made by the subject when performing the task. The number inside each circle shows the order of fixations. Note that the subject also gathers information about the road and the GPS when performing the task at hand.

Chapter 1

Introduction

Predicting human gaze behavior and guiding viewer attention in a given scene are very challenging tasks. Humans perform a wide variety of tasks such as navigation, reading, playing sports, and interacting with objects in the environment. Each task performed depends on input from the environment and from memory about the task. Attention research has been concerned with understanding the input stimuli, neural mechanisms, information processing, memory, and motor signals involved in task performance. Eye movements provide information about the regions attended in an image and give insight into the underlying cognitive processes [4]. Saliency in the image has been shown to guide attention, e.g. regions with high local contrast, high edge density, or bright colors (bottom-up effect) [9, 3, 10]. Humans are also immediately drawn to faces or regions with contextual information (top-down effect) [8]. Finally, the pattern of eye movements depends not only on the scene being viewed but also on the viewer's intent or the task assigned [5, 6, 7]. Researchers continue to debate whether it is salient features, contextual information or both that ultimately drive attention during free viewing (no task specified) of static images [11, 12, 13, 14]. There are many computational models that predict regions that are most likely to attract viewer attention in a scene. These computational models are designed based on bottom-up attention, top-down attention or a combination of both. However, many of these models only consider free viewing and as such do not

take into account the impact of any specific task on eye movements. In the proposed work we plan to:

1. Develop and evaluate a comprehensive model of human visual attention prediction that incorporates:
   - Scene context (GIST, SIFT, SURF, Bag of Words, etc.)
   - Bottom-up scene saliency
   - Task at hand
   - Eye movement data across multiple subjects

2. Develop and evaluate a novel adaptive approach to guide viewer attention about a scene that requires no permanent or overt changes to the scene being viewed and has minimal impact on the viewing experience.

3. Develop a framework for task inference based on scene information and eye movement data. This framework attempts to differentiate eye movements made for task performance from eye movements made to gather information in the scene.

In the next three chapters each of these research objectives is explained in more detail; specifically, we provide the problem definition, background and related work, proposed approach, and evaluation measures. Saliency and Task Based Eye Movement Prediction is presented in chapter 2, Adaptive Subtle Gaze Guidance Using Estimated Gaze is presented in chapter 3, and the Task Inference Problem is presented in chapter 4. The timeline for the proposed work is presented in chapter 5. Chapter 6 presents the conclusion and highlights potential future work that is beyond the scope of this proposal. An appendix is provided listing several research datasets (images and corresponding eye movements) that will be utilized over the course of this work.

Chapter 2

Saliency and Task Based Eye Movement Prediction

2.1 Problem Definition

Predicting the gaze behavior of a human in a given scene is a very challenging task. There are multiple factors that influence human gaze behavior: the salient features in the scene, the task at hand and prior knowledge of the scene are some of the factors that strongly influence it. Visual saliency based models predict regions of interest that attract the gaze of a subject based on image features such as contrast, color, orientation, etc. [15, 16, 17, 18]. There are other top-down computational models that combine saliency maps and scene context; some top-down models use face detection, object detection, and image blobs with visual saliency to gather visual attention details in the scene [19, 20, 21]. The task being undertaken has a very strong influence on the deployment of attention [5]. It has been shown that humans process visual information in a need-based manner: we look for things that are relevant to the current task and pay less attention to irrelevant objects in the scene. Researchers have shown that there is a high correlation between visual cognition and eye movements when dealing with complex tasks [22]. When subjects are asked to perform a visually guided task, their fixations were found to be on

task-relevant locations. This finding was established using the block-copying task, where subjects were asked to assemble building blocks, and it was shown that the subjects' eye movements reveal the algorithm used for completing the task [23]. Others have studied gaze behavior while performing tasks in natural environments such as driving, sports, walking, etc. [6, 24, 22, 25]. The view of many is that both bottom-up and top-down factors are combined to direct our attention, and there have been many computational models using Bayesian approaches to integrate top-down and bottom-up salient cues [26]. Eye-tracking technology helps to estimate the visual attention of the subject while performing a task. Eye trackers provide fixation and saccade information in real time that can give insight into top-down, task-based visual attention, while the scene features provide the bottom-up saliency map. Many gaze prediction algorithms have been proposed based on image scene features and visual saliency maps [27, 28]. These computational models lack two key factors in gaze prediction: 1) they seldom account for the top-down visual attention that can be obtained by considering the scene's context, and 2) the training data used to develop these models were obtained during free viewing and so do not take into account the impact of specific tasks on eye movements. Hence, there is a need for a comprehensive computational model of human visual attention prediction that can identify regions in the scene most likely to be attended for a given task at hand.

2.2 Research Objectives and Contributions

The goal of this aspect of the proposed work is to develop a comprehensive model of human visual attention prediction that incorporates scene context, bottom-up scene saliency, the task at hand and eye movement data obtained across multiple subjects to build a task-based saliency map. The task-based saliency map will predict regions in the scene that attract the viewer's attention while performing an assigned task. The model is further trained to predict viewer gaze on new (related) stimuli images. Such a model will allow researchers to gain more insight into task solving behavior

and also to predict the task solving approach under different input conditions. The model can then be used to understand a subject's gaze behavior for a given task and compare it with other tasks on similar stimuli. The proposed model will also help to address the time-consuming burden of creating manual annotations of the regions of images used in perception-related experiments. Our proposed model will be able to aid people performing repeated image search tasks by suggesting regions of interest based on the predicted gaze. While gaze target prediction techniques are prone to false positives, they can still be very valuable in providing additional suggestions for viewing. For example, consider a radiologist searching for abnormal regions in a mammogram. At the end of the task, our prediction system can suggest other regions to look at which he/she might have missed. In this manner, the technology is seen as providing assistance rather than attempting to replace the expert. This model can also be used to study differences in visual attention between subjects. Hence, for a given task and gaze behavior it could be possible to differentiate experts from non-experts.

2.3 Background and Related Work

When we look around us, we perceive some objects in the scene to be more interesting than others. Certain objects in the scene pop out and grab our attention over others. The drawing of our attention in this fashion is termed bottom-up or saliency-based visual attention. Our focused attention can be thought of as a rapidly shifting spotlight, and the areas focused on are the salient regions in the scene. These salient regions can be represented as a 2-dimensional saliency map that captures regions of high attention. However, human visual attention is not simply a feed-forward, spatially selective filtering process. There is also cognitive feedback to the visual system to focus attention in a top-down manner. For example, there may be contextually relevant areas of the image (such as faces) that also draw our attention. Several computational models have been proposed to model bottom-up attention, top-down attention, or both, to understand visual attention.

2.3.1 Bottom-up Saliency Based Visual Attention

Saliency-based attention models are classified based on the saliency computation mechanism. Most saliency-based models intend to highlight regions of interest that attract attention in the scene. Bottom-up visual saliency models can be broadly classified in several ways [29]:

Cognitive Models: Models that closely relate the psychological and neurophysiological attributes of the human visual system to compute saliency. These models account for contrast sensitivity functions, perceptual decomposition, visual masking, and center-surround interactions [15, 30].

Bayesian Models: Models using Bayes' rule to detect objects of interest or salient regions by probabilistically combining extracted scene features with known prior knowledge of the scene or scene context [17, 31].

Decision Theoretic Models: Visual attention is believed to produce decisions on the state of the scene being viewed such that there is an optimal decision based on minimizing the probability of error. Hence salient features can be defined as the classes best recognized over all other visual classes available [32, 33].

Information Theoretic Models: Information theoretic models define saliency as the regions that maximize the information sampled from a given scene. The most informative regions are selected from all possible regions available in the scene [34, 35].

Graphical Models: Visual attention models are computed based on eye movements. These eye movements can be treated as a time series, and the hidden variables influencing them can be modeled using Hidden Markov Models, Dynamic Bayesian Networks and Conditional Random Fields [36, 18].

Spectral Analysis Models: A digitized scene viewed in the spatial domain can be converted to the frequency domain, and saliency models are derived based on the premise that similar regions in the frequency domain imply redundancy. Such models are simpler to explain and compute but do not necessarily explain psychological and neurophysiological attributes of the human visual system [37, 38, 39].

2.3.2 Top-Down Cognition Based Visual Attention

Top-down models, on the other hand, are goal-driven or task-driven, compared to bottom-up cues which are mainly based on characteristics of the visual scene [40]. Top-down visual attention models are determined by cognitive factors such as knowledge of the scene, expectations, rewards, tasks at hand and goals. Bottom-up attention, being feed-forward, tends to be both involuntary and fast. Top-down attention, in contrast, is slow, task driven and voluntary; it is also referred to as closed-loop [41, 42]. Two major sources that influence top-down or cognition based visual attention have been explored. The first is scene context: the layout of a scene has been shown to influence eye movements and visual attention; for example, people are naturally attracted to faces and regions relevant to the scene. The second is task-based: certain complex tasks such as driving or reading strongly influence when and where humans fixate in the scene. It has been proposed that humans are able to exhibit interest towards targets by relatively changing the gains on different basic features which attract attention [43]. For example, when asked to look for a specific colored object, a higher gain will be assigned to searching for that particular color among the other available colors in the scene. A computational model was developed that integrates

the visual cues for target detection by maximizing the signal-to-noise ratio of target vs. background [44]. An evolutionary algorithm was also developed to search the basic saliency model parameter-space for the target objects [20]. In comparison to gain adjustment for fixed feature detection, other top-down attention models were suggested in which preferred features were obtained by tuning the width of feature detectors [45]. These models study the role of object features in visual search and are similar to the techniques of object detection in computer vision. However, these models are based on human visual attention, as compared to computer vision models which use predefined feature templates for detecting/tracking cars, humans or faces [46, 47].

2.4 Proposed Approach

Both top-down and bottom-up computational models provide a single saliency map which indicates regions in the image that are most likely to be attended. These saliency maps are generated for free viewing, and feature detectors are tuned for specific image attributes. A major drawback of such a saliency map is that the location with the highest saliency value does not necessarily translate to the region most attended; it has been shown that the majority of fixations are towards task-relevant locations [22]. It is also very difficult to predict the order of attention from a single saliency map. The saliency map does, however, provide a mask that eliminates locations least likely to be attended and highlights locations most likely to attract the viewer's attention. In this aspect of the proposed work we aim to develop a comprehensive model of human visual attention that uses scene context, the saliency map, the task at hand and eye movement data to obtain a task-based saliency map. The model is further trained to test if it is capable of predicting human fixations in other similar images for the same task.
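Before detailing each component in the following subsections, the sketch below illustrates, under simplifying assumptions, how the pieces might fit together: scene context (gist and local features), the bottom-up saliency map, a task label and the n initial fixations are concatenated into a feature vector, reduced with PCA (or ICA), and used to train one of the candidate learners mentioned in Section 2.5. The helper name build_feature_vector, the array sizes and the placeholder data are illustrative assumptions rather than the final design.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVR

rng = np.random.default_rng(0)

def build_feature_vector(gist, local_feats, saliency, task_id, init_fixations):
    """Concatenate scene context, bottom-up saliency, task encoding and the
    n initial fixations into one feature vector (see Sec. 2.5.1)."""
    return np.concatenate([gist.ravel(), local_feats.ravel(), saliency.ravel(),
                           [float(task_id)], init_fixations.ravel()])

# Placeholder data standing in for precomputed gist descriptors, local feature
# histograms, saliency maps, task labels and recorded eye tracking data.
n_samples = 50
X = np.stack([build_feature_vector(rng.random(512),       # gist descriptor
                                   rng.random(200),       # local feature histogram
                                   rng.random((32, 32)),  # downsampled saliency map
                                   rng.integers(0, 3),    # task id
                                   rng.random((5, 2)))    # 5 initial fixations (x, y)
              for _ in range(n_samples)])
y = rng.random(n_samples)  # stand-in for fixation density at a candidate region

reducer = PCA(n_components=20)   # PCA/ICA step to reduce dimensionality
model = SVR().fit(reducer.fit_transform(X), y)  # one of several candidate learners

# Prediction on a new image/subject reuses the same feature construction:
x_new = build_feature_vector(rng.random(512), rng.random(200),
                             rng.random((32, 32)), 1, rng.random((5, 2)))
print(model.predict(reducer.transform(x_new[None, :])))
```

The same scaffolding would accept any of the learners listed later (linear model, neural network, Gaussian mixture, SVM); only the estimator object changes.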

2.4.1 Scene Context Extraction

Scene context plays a vital role in attracting visual attention to specific regions in the scene. Humans have a high degree of accuracy in describing a scene or image even with viewing times as low as 80 ms. This ability enables humans to capture enough information to obtain a rough representation or gist of the scene [48], and to quickly classify the scene, e.g. as indoor vs. outdoor, urban vs. rural, or natural vs. man-made. It has been shown that semantic associations play a vital role in guiding visual attention. When searching for shoes, for example, humans are more likely to look for them on the floor than on top of a table or on the ceiling [49, 50]. Several models utilizing low-level features have been presented to obtain the gist of the scene. A computational model was proposed that is based on the spatial envelope, using a low-dimensional representation of the scene. The model generates a multidimensional space, reduced by applying principal component analysis and independent component analysis, in which scenes sharing membership in semantic categories are projected [2]. Gabor filters were used on input images to extract a selected number of universal textons (from the training set using K-means clustering) [51]. Researchers have also used the biological center-surround features (receptive fields) from the orientation, color and intensity channels for modeling gist [52]. Gist representation is a well-known topic in computer vision as it provides global scene information, which is especially useful for searching scene databases with many images. It has also been used to limit the region for object search in a scene rather than processing it in its entirety. The most important use of gist representation is in the modeling of top-down attention [53, 54]. In this proposal we use the gist of the scene [2] to obtain a low-dimensional representation of the scene that does not require explicit segmentation of image regions and objects. Gist refers to the meaningful information that an observer can identify from a glimpse of the scene. We use the gist description to include the semantic label of the scene with a few objects and their surface characteristics and layout. It represents the global properties of the space that the scene subtends and does not necessarily

include the individual objects that the scene contains. Every scene is defined by eight categories, namely naturalness, openness, expansion, depth, roughness, complexity, ruggedness and symmetry. Each scene is described as a vector of meaningful values indicating the image's degree of naturalness, openness, roughness, expansion, mean depth, etc. The gist of the scene will help classify similar images and also provide a global low-dimensional representation for image groups. Figure 2.1 shows two representative sample images, a polar plot showing the average responses of multiscale-oriented filters on these images obtained by applying principal component analysis (global feature templates), the global features projected onto the first 20 principal components, and a low-frequency representation (noise image) representing the gist, maintaining the spatial organization and texture characteristics of the original image.

Figure 2.1: (A) This figure illustrates the information preserved by the global features for two images. (B) The average of the output magnitude of the multiscale-oriented filters on a polar plot. (C) The coefficients (global features) obtained by projecting the averaged output filters onto the first 20 principal components. (D) Noise images with filtered outputs at 1, 2, 4 and 8 cycles per image, representing the gist of the scene and maintaining the spatial organization and texture characteristics of the original image. The texture contained in this representation is still relevant for scene categorization (e.g., open, closed, indoor, outdoor, natural or urban scenes). Reproduced from [2]

Within-image local features and between-image features can be obtained after computing the gist for each image. Feature detection algorithms such as the Scale-Invariant

Feature Transform (SIFT) [55], Speeded Up Robust Features (SURF) [56], Maximally Stable Extremal Regions (MSER) [57], the Histogram of Oriented Gradients (HOG), etc. can be used to identify key local features in the scene. Local feature detection and matching algorithms help identify regions that are similar within an image and also regions that are similar between classified images. This will further enable us to build scene context (region based) with similar features and group them into a labeled category. A list of such categories (e.g. Bag-of-Words) can be used to associate regions of the scene with a task at hand.

2.4.2 Saliency Map Generation

Most attention models are directly or indirectly inspired by the physiological or neurophysiological properties of the human eye. The basic model proposed by Itti et al. [3] uses four assumptions. First, visual input is represented in the form of a topographic feature map. The feature maps are constructed based on the idea of center-surround representation of features at different spatial scales and competition among features for visual attention. The second assumption is that these feature maps are combined to give a single local saliency map of any location with respect to its neighborhood. Third, the maximum of the saliency map is the most salient location at a given time, and it also helps determine the next location for an attention shift. Fourth, attention is shifted to different parts of the stimuli based on the saliency map, and the order of attention shifts is represented by the decreasing order of saliency in the map. Figure 2.2 shows the schematic representation proposed by Koch and Ullman and Figure 2.3 shows the model proposed by Itti et al. In the early model proposed by Koch and Ullman [58], low-level features of the visual system such as color, intensity and orientation were computed to obtain a set of pre-attentive feature maps based on the retinal input to the eye. The activity of all these feature maps was combined for a given location. This combination of feature maps provides the topographic saliency map. A simple winner-take-all network was designed to detect the most salient location. The second part of the image shows the schematic

Figure 2.2: (a) Schematic representation of the Koch and Ullman model to compute saliency using primitive feature maps and the center-surround neurophysiological properties of the human eye.

Figure 2.3: (b) Flowchart of the model developed by Itti to compute a saliency map based on the Koch and Ullman model. This flowchart shows the filtering process involved, the extraction of feature maps, center-surround normalization and also the methods to combine feature maps to obtain the saliency map. Reproduced from [3]

diagram used for the study, which was built on the Koch and Ullman architecture and provides a complete implementation of all stages. Multi-scale spatial images (eight spatial scales per channel) are computed and the center-surround differences for each feature (3 features) are computed to obtain the local spatial feature maps (42 feature maps). A lateral inhibition scheme is used to initiate competition for saliency within each feature map. These individual feature maps are then combined to form a single conspicuity map for each feature type. The conspicuity maps are then combined to obtain a single topographic saliency map.

2.5 Comprehensive Model

We propose a comprehensive computational model of human visual attention prediction that can identify the regions in the scene most likely to be attended from the scene context, saliency map, task at hand and eye movement data of the subject. A set of images from publicly available image databases is chosen and the gist and saliency maps for these images are pre-computed. The task to be performed when viewing this set of images is determined ahead of time. Subjects' eye movements are recorded for these images while performing the given task for a specified period of time. The images are then randomly divided into training and testing datasets. Figure 2.4 shows the schematic representation of the proposed model. The model is divided into a training phase and a testing phase, and each phase is explained in detail below.

2.5.1 Training Phase

The stimuli images are randomly separated into training and testing images. Fixation data is gathered from subjects viewing the images in the training set using a remote eye tracker. A fixation map (averaged across all subjects) is then created. The eye tracking data is then split into two groups: the n initial fixations, which serve as a feature vector for the model, and the remaining fixations, which are used as data to train the model. The

Figure 2.4: Schematic diagram of the model for predicting task-based eye movements. The training phase shows the selected training images, eye tracking data on the training images, task-based feature extraction, the saliency map for each image and training fixations extracted from the eye tracking data. We combine all the features to obtain the training feature set and then perform PCA/ICA reduction if necessary to reduce the feature size. A Trainer is implemented to train on this feature set. The second half of the image shows the testing phase, where similar testing features are extracted as in the training phase and then the Trainer is used to predict the eye position. This new predicted eye position can then be compared to testing fixations which serve as ground-truth data.

saliency map obtained from the saliency-based visual attention model is also used as a feature vector. The gist and local image-based features are also extracted and provided as input to the model. The task at hand is encoded as an independent variable to the model. Using the gist, local image features, saliency map, task at hand and eye tracking data, the final feature vector is generated. This feature vector will be of very high dimensionality, hence the feature space is reduced using techniques such as principal component analysis (PCA) or independent component analysis (ICA). A model is then trained (linear model, neural network,

Gaussian mixture, support vector machines) on the reduced features. The learning algorithm is thus trained on the saliency, scene context (gist plus additional local features) and the n initial eye movements. The final learning model (Trainer) will assign weights to features based on the training fixations. To counter over-learning bias, the training images are split randomly using an 80/20 rule: the model will learn on 80% of the images in the training dataset and will be tested on the remaining 20% of the images to re-parameterize the model.

2.5.2 Testing Phase

The stimuli images which have not been used for training are used as testing images. Fixation data from several subjects is also gathered for these images using a remote eye tracker. The eye tracking data is preprocessed and split into two groups: the n initial fixations, which serve as a feature to the model (similar to the training phase), and the remaining fixations, which act as ground-truth data. Similar to the training phase, saliency-based features are extracted; the saliency map is obtained and used as a feature to the model. The gist and local image-based features are also extracted as inputs. The final features are reduced in dimension using PCA/ICA and provided as input to the learned model. The output of the model is the predicted gaze position (point-based or region-based). This predicted gaze position is then compared to the ground-truth data (remaining fixations).

2.6 Evaluation Measurement

Many attention and gaze prediction models are validated against eye tracking data of human observers. Eye movements provide an easy mechanism to understand the cognitive process involved in image perception and how eye movements vary with task. We can compare the predicted gaze obtained from the model to the eye movement data obtained from a human observer viewing the scene. The evaluation can be classified as 1) point-based, 2) region-based, and 3) subjective evaluation. In the point-based approach the predicted gaze points are compared with

the ground-truth eye tracking gaze points and a distance measure can be obtained. In the region-based approach, instead of evaluating a single gaze point, we compare the estimated gaze region to the region of fixations from multiple subjects. Subjective scores can also be obtained from experts to evaluate the estimated gaze on a Likert scale. However, subjective evaluation is time-consuming, error-prone and not quantitative compared to methods 1 and 2. In the literature the following are the most widely used evaluation techniques.

2.6.1 Kullback-Leibler (KL) Divergence

KL divergence, also known as information divergence, is a measure of the difference between two probability distributions P and Q and is denoted D_KL(P || Q). In the context of saliency and gaze prediction it is used as a distance metric between distributions: P is the discrete probability distribution of the predicted gaze and Q is the ground-truth distribution. Models that can predict human fixations exhibit higher KL divergence, since human subjects will fixate on a few regions (those with maximum response) and will avoid most of the regions with lower response from the model [59]. KL divergence is sensitive to any differences between distributions and is invariant to reparameterizations, thereby not affecting the scoring.

2.6.2 Normalized Scanpath Saliency (NSS)

The normalized scanpath saliency is defined as the response value at a given position in the predicted gaze map, normalized to have zero mean and unit standard deviation:

NSS = (S(x, y) − µ_S) / σ_S

NSS is computed once for each fixation and subsequently the mean and standard error are computed across the set of NSS scores. When the value of NSS is 1, it indicates that the subject's eye fixations fall in a region where the predicted gaze is one standard deviation above average. We can also say that an NSS value of 1 shows that the predicted region is more probable to be fixated than any other region in the image [60], whereas an NSS value of 0 shows that the model does not perform any better than randomly picking a fixation

location in the scene.

2.6.3 Linear Correlation Coefficient (LCC)

The linear correlation coefficient measures the strength of the linear relationship between two measured variables. The LCC measure is widely used to compare two images for registration, feature matching, object recognition and disparity measurement.

LCC(Q, P) = Σ_{x,y} (Q(x, y) − µ_Q)(P(x, y) − µ_P) / sqrt(σ²_Q · σ²_P)    (2.1)

In equation 2.1, P and Q represent the predicted fixation region and the ground-truth subject fixations in the region around (x, y), respectively, and µ and σ² represent the mean and variance of the pixel values in that region. The advantage of using LCC is that it is bounded, in comparison to KL divergence, and it is easier to compute than NSS or AUC. A correlation value of +1 or −1 indicates that there is a perfect linear relationship between the two variables and a value of 0 indicates no correlation.
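As a concrete illustration of these three measures, the sketch below implements them with NumPy for a predicted saliency map, a ground-truth fixation-density map, and a list of recorded fixation coordinates. The toy inputs, the whole-map (rather than region-based) application of LCC, and the small epsilon used to stabilize the KL computation are assumptions made for the example.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) between two discrete distributions; here P is the
    predicted gaze map and Q the ground-truth map, each normalized to sum to 1."""
    p = p.ravel() / p.sum()
    q = q.ravel() / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def nss(saliency_map, fixations):
    """Mean normalized scanpath saliency: response of the zero-mean,
    unit-variance prediction map at each recorded fixation (x, y)."""
    s = (saliency_map - saliency_map.mean()) / saliency_map.std()
    return float(np.mean([s[y, x] for x, y in fixations]))

def lcc(p, q):
    """Linear correlation coefficient between predicted and ground-truth maps."""
    p = p.ravel() - p.mean()
    q = q.ravel() - q.mean()
    return float(np.sum(p * q) / np.sqrt(np.sum(p**2) * np.sum(q**2)))

# Toy example: a 64x64 predicted map evaluated against three fixations and a
# fixation-density map (both would normally come from the eye tracker).
pred = np.random.rand(64, 64)
gt_map = np.random.rand(64, 64)
fix = [(10, 20), (32, 32), (50, 12)]
print(kl_divergence(pred, gt_map), nss(pred, fix), lcc(pred, gt_map))
```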

Chapter 3

Adaptive Subtle Gaze Guidance Using Estimated Gaze

3.1 Problem Definition

The previous chapter focused on the problem of gaze prediction. In this chapter we focus on the related problem of gaze guidance. When viewing traditional static images, the viewer's gaze pattern is guided by a variety of influences (bottom-up and top-down). For example, the pattern of eye movements may depend on the viewer's intent or task [5, 6]. Image content also plays a significant role: it is natural for humans to be immediately drawn to faces or other informative regions of an image [8]. Additionally, research has shown that our gaze is drawn to regions of high local contrast or high edge density [9, 10]. Although traditional images are limited to these passive modes of influencing gaze patterns, digital media offers the opportunity for active gaze control. The ability to direct a viewer's attention has important applications in computer graphics, data visualization, image analysis, and training. Existing computer-based gaze manipulation techniques, which direct a viewer's attention about a display, have been shown to be effective for spatial learning, search task completion, and medical training applications. We propose a novel mechanism for guiding visual attention

about a scene. Our proposed approach guides the viewer in a manner that has minimal impact on the viewing experience. It also requires no permanent alterations to the scene to highlight areas of interest. Previous work on guiding visual attention typically involved having the researchers manually select the relevant regions of the scene. This process is slow and tedious. We propose to overcome this issue by combining our gaze guidance technique with our gaze prediction framework. While gaze prediction techniques are prone to false positives, they can still be very valuable in providing additional suggestions for viewing.

3.2 Research Objectives and Contributions

Our proposed gaze guidance mechanism will be developed with the following goals in mind:

- It should perform in real time.
- It should adapt to image/scene content as well as the viewing configuration.
- It should adapt to the task assigned to the viewer.
- The technique should be subtle and have minimal impact on the viewing experience.

The proposed model is adaptive (real-time) in selecting task-relevant regions in the image based on the regions not previously fixated by the user and the task-based saliency map. These predicted regions from the model are used to actively map regions in the scene to guide the viewer's attention. The adaptive model can highlight task-relevant regions that have not been viewed, or other salient regions in the image, to assist the viewer in task completion. An adaptive gaze guidance technique will enable researchers to quickly and accurately direct the viewer's attention to unattended relevant regions in the image. Such a model is novel as it selects regions of interest in the image to guide a viewer based

on the current viewing pattern. The location and order of fixations of no two viewers are the same, hence manually pre-selecting regions to guide attention (as done in previous work) is not ideal. A gaze guidance model of this nature eliminates the need for manual intervention and also adapts in real time for each image being viewed. The model can learn over time and also provide assistance to the viewer in real time while the task at hand is being performed. Our adaptive subtle gaze guidance technique can also be deployed in psychophysical experiments involving short-term information recall, learning, visual search and problem solving tasks.

3.3 Background and Related Work

Jonides [62] explored the differences between voluntary and involuntary attention shifts and referred to cues which trigger involuntary eye movements as pull cues. Computer-based techniques for providing these pull cues are often overt. These include simulating the depth-of-field effect from traditional photography to bring different areas of an image in or out of focus, or directly marking up the image to highlight areas of interest [63, 64]. The issue with these types of approaches is that they require permanent, overt changes to the image, which impacts the overall viewing experience and may even hide or obscure important information in the image. Figure 3.1, for example, shows a mammogram with a red circle highlighted to visually identify an abnormal region in the image. Actively guiding the viewer's attention to relevant information has been shown to improve problem solving [64, 65]. Guiding attention has been shown to enhance spatial learning by improving the recollection of the location, size and shape of objects in images [66, 67, 68]. It has also been shown to improve training, learning and education [71, 72, 73, 74]. Gaze manipulation strategies have also been used to improve performance on visual search tasks by either guiding attention to previously unattended regions [69] or guiding attention directly to the relevant regions in a scene [70]. Subtle techniques have been proposed to guide the viewer's attention effectively to regions of interest in a scene using remote eye trackers [61].

Figure 3.1: Figure shows a mammogram image. The large red circle shows the area marked by the expert as an irregularity.

Our proposed approach is based on the Subtle Gaze Direction (SGD) technique, which works by briefly introducing motion cues (image-space modulations) to the peripheral region of the field of view [61]. Since the human visual system is highly sensitive to motion, these brief modulations serve as excellent pull cues. To achieve subtlety, these modulations are presented only to the peripheral regions of the field of view. This is determined using a real-time eye tracking device: the eye tracker provides the current gaze position, thereby giving an accurate location of where the subject is foveated. These peripheral modulations are terminated before the viewer can scrutinize them with their high-acuity foveal vision.

3.3.1 Subtle Gaze Direction

Figure 3.2 shows a hypothetical image; suppose the goal is to direct the viewer's gaze to some predetermined area of interest A. Let F be the position of the last recorded

Figure 3.2: Hypothetical image with current fixation region F and predetermined region of interest A. Inset illustrates the geometric dot product used to compute θ.

fixation, let V be the velocity of the current saccade, let W be the vector from F to A, and let θ be the angle between V and W. Modulations are performed on the pixel region A. Once the modulation commences, the saccadic velocity is monitored using feedback from an eye tracker and the angle θ is continually updated using the geometric interpretation of the dot product. A small value of θ indicates that the center of gaze is moving towards the modulated region. In such cases, the modulation is terminated immediately. This contributes to the overall subtlety of the technique. By repeating this process for other predetermined areas of interest, the viewer's gaze is directed about the scene. A user study conducted with 10 participants showed that the activation time (from the start of the modulation to the detection of movement towards the modulation) was within 0.5 seconds for nearly 75% of the target regions, indicating that participants responded to the majority of modulations. Nearly 70% of the fixations were within one perceptual span of the modulation and 93% were within two perceptual spans. Finally, figure 3.3 shows that it is possible to guide the viewer's attention to regions of interest in a subtle manner. The user study shows that it is possible to guide a subject's attention to relevant regions of the scene. However, while these observations show that the SGD

Figure 3.3: Gaze distributions for an image under static and modulated conditions. Input image (top). Gaze distribution for static image (bottom left). Gaze distribution for modulated image (bottom right). White crosses indicate locations preselected by researchers for modulation.

technique is successful in directing gaze, it does not necessarily mean that the viewer fully processed the visual details of the modulated regions or remembered them. To better understand the impact of Subtle Gaze Direction on short-term spatial information recall and its applicability to training scenarios, we have already conducted several studies. See [68, 71, 72, 75] for more details.
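The termination test at the heart of SGD, checking whether the current saccade is heading toward the modulated region by monitoring the angle θ between the saccade velocity V and the vector W from the last fixation F to the target A, can be sketched as follows. The angular threshold and the coordinate conventions are illustrative assumptions; the actual implementation in [61] may differ.

```python
import numpy as np

def should_terminate_modulation(fixation, gaze_velocity, target, angle_thresh_deg=20.0):
    """Return True when the saccade is heading toward the modulated region A.

    fixation         -- last recorded fixation F, (x, y) in screen pixels
    gaze_velocity    -- current saccade velocity vector V from the eye tracker
    target           -- center of the modulated region A, (x, y)
    angle_thresh_deg -- illustrative threshold on the angle theta between
                        V and W = A - F (the actual value is an assumption)
    """
    v = np.asarray(gaze_velocity, dtype=float)
    w = np.asarray(target, dtype=float) - np.asarray(fixation, dtype=float)
    denom = np.linalg.norm(v) * np.linalg.norm(w)
    if denom == 0.0:
        return False
    cos_theta = np.clip(np.dot(v, w) / denom, -1.0, 1.0)
    theta = np.degrees(np.arccos(cos_theta))
    return theta < angle_thresh_deg  # small theta: gaze moving toward A, stop modulating

# Example: gaze at (400, 300) saccading up and to the right toward a target at (600, 200).
print(should_terminate_modulation((400, 300), (180, -95), (600, 200)))
```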

3.4 Proposed Approach

Our approach combines the subtle gaze direction technique with the saliency and task-based eye movement prediction model (chapter 2) to actively and adaptively guide the viewer's attention to task-relevant regions in the scene. By combining the two methods we can guide the viewer's attention in real time based on the predicted gaze obtained from the comprehensive model, and also achieve subtlety to ensure that there is minimal impact on the overall viewing experience.

3.4.1 Adaptive Subtle Gaze Direction Using Estimated Gaze

The biggest challenge for gaze guidance is that the next fixation of the viewer is not available ahead of time; it has to be estimated based on the direction of eye movement (saccade velocity) with the help of an eye tracker. In previous work, the regions to which the subject's gaze is to be guided were pre-computed manually and the sequence of regions was fixed ahead of time. This approach is both time consuming and cumbersome, since each viewer's scanpath is unique and changes based on the task at hand. Our saliency and task-based eye movement prediction model can be used to automatically generate task-relevant regions for the gaze guidance technique.

Figure 3.4: Image on left is the image viewed by the subject when assigned the task of counting the number of deer in the scene. The red circles in the image indicate the viewer's fixation data. The image on right shows the corresponding task-based saliency map, highlighting task-relevant regions to direct the viewer's attention.
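A minimal sketch of the adaptive selection step described in the next paragraph is given below: task-relevant regions taken from the predicted task-based saliency map are ordered by saliency value, regions the viewer has already scrutinized long enough are retired, and the highest-priority remaining region becomes the next SGD modulation target. The region representation, dwell-time threshold and radius are assumptions made for illustration.

```python
import numpy as np

def next_guidance_target(regions, fixations, dwell_ms=250, radius_px=60):
    """regions: list of (x, y, saliency) from the task-based saliency map.
    fixations: list of (x, y, duration_ms) recorded so far.
    Returns the highest-saliency region not yet sufficiently scrutinized."""
    def attended(region):
        rx, ry, _ = region
        dwell = sum(d for (fx, fy, d) in fixations
                    if np.hypot(fx - rx, fy - ry) < radius_px)
        return dwell >= dwell_ms

    pending = [r for r in regions if not attended(r)]
    pending.sort(key=lambda r: r[2], reverse=True)   # highest saliency first
    return pending[0][:2] if pending else None       # (x, y) to modulate, or done

# Example: two task-relevant regions; only the first has been fixated long enough,
# so the second becomes the next SGD modulation target.
regions = [(120, 80, 0.9), (420, 310, 0.7)]
fixations = [(118, 83, 400)]
print(next_guidance_target(regions, fixations))      # -> (420, 310)
```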

Figure 3.4 shows an image viewed by the subject when the task assigned was to count the number of deer in the scene. The corresponding task-based saliency map is shown on the right, highlighting the regions in the image that are task relevant. The subject is eye tracked during the task and our proposed model predicts the gaze of the user based on the series of fixations recorded. The intensity map (right image) highlights the priority and task relevance of regions in the scene. Task-relevant regions are placed in a queue based on their saliency value and are moved to the end or popped once the subject has scrutinized the region for a desired duration of time. The model will then be able to guide the subject using SGD to these task-relevant regions if they were previously unattended. The viewer's gaze is directed to task-relevant regions by presenting a brief luminance modulation to the peripheral region of the field of view. The modulation is terminated as soon as the direction of the saccade is towards the region of interest. This approach makes sure that our model is able to subtly guide viewer attention to task-relevant regions that were previously unattended by the subject, and ensures that maximum visual coverage is achieved for successful completion of the task.

3.5 Evaluation

3.5.1 User Study

The goal of the user study will be to test the effectiveness of adaptive subtle gaze guidance using the estimated gaze from the proposed model. Participants are chosen randomly and eye tracked while viewing a collection of static images. All participants are chosen to have normal or corrected-to-normal vision with no cases of color blindness. Each participant will undergo a brief calibration procedure to ensure proper eye tracking. The images are pre-processed and the saliency map, gist and local image features are computed along with the previously recorded eye movement data, as mentioned in chapter 2. After viewing the scene for a short period of time, the model gathers eye movement data of the subject in real time and attempts to guide their attention to task-relevant regions that are

unattended. This ensures that all task-relevant regions are attended and the image is sufficiently scrutinized to successfully complete the task at hand. The relevant regions are highlighted by briefly projecting motion cues (image-space modulations) to the peripheral region of the field of view. Eye tracking data and scene stimuli from each subject are recorded and the accuracy of performance will be computed against a control group that is not guided using the adaptive subtle gaze guidance technique. The following methods will be used to evaluate performance:

Activation Time. Activation time is defined as the time elapsed between the start of the modulation and the detection of movement in the direction of the modulation. As shown for the subtle gaze direction technique [61], the criteria for terminating the modulation were met within 0.5 seconds for approximately 75 percent of the target regions and within 1 second for approximately 90 percent of the target regions, indicating that the participants responded to the majority of the modulations. The adaptive subtle gaze guidance should be tested to ensure that its activation time is similar to or better than that of the SGD technique. In the SGD technique, modulated regions were manually pre-selected to ensure the fast onset and termination of visual cues. In the case of adaptive subtle gaze guidance, the model has to predict the next possible fixation and also keep an account of the sequence of fixations previously made by the viewer. The model has to run in real time and accurately predict the next task-relevant gaze location to decide whether the viewer's attention is to be guided to this new location.

Accuracy Measurement. For tasks involving problem solving, training or visual search it is important to measure the accuracy of performance. The adaptive gaze guided group is compared with a static group to see if they performed significantly better on the given task at hand. The accuracy of the groups is evaluated using the following:

Binary Classification Statistics.

Binary classification statistics [76] can be used to establish measures of accuracy as well as sensitivity and specificity. To calculate these properties it is necessary to categorize the test outcomes as true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Sensitivity is computed as follows:

Sensitivity = #TP / (#TP + #FN) × 100    (3.1)

Specificity is defined as follows:

Specificity = #TN / (#TN + #FP) × 100    (3.2)

The sensitivity and specificity values can then be combined to produce a binary classification based measure of accuracy as follows:

Accuracy = (#TP + #TN) / (#TP + #TN + #FN + #FP) × 100    (3.3)

The accuracy value can be compared between the adaptive subtle gaze guided and control groups; higher accuracy would indicate better performance on the task at hand.

Area Under Curve (AUC). The Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC) curve can be used with a binary classifier system with a variable threshold. The AUC, or area under the ROC curve, is used to assess the performance of the adaptive gaze guidance technique. A value of 1 indicates perfect classification of task-relevant regions. The ROC curve is effectively used to test whether the regions selected by the viewers are classified better than random. This measure will ensure that the performance of the control group or the adaptive gaze guided group is not due to random chance. The AUC or ROC, along with the accuracy measure from

binary classification, will provide a complete picture of group performance.

Levenshtein Distance. Levenshtein distance [77, 78] is a string metric, developed in the fields of information theory and computer science, used to compute differences between sequences. Levenshtein distance provides an appropriate measure for comparing sequences in tasks that require an ordered viewing sequence. To accurately compare sequences using Levenshtein distance, the correct (intended) viewing order of each image is converted into a string sequence. All responses from each participant are also converted to an appropriate string sequence in order to facilitate comparison to the correct sequence. Since the number of relevant regions varies across the images, we normalize the distance measure computed for each image by dividing by the number of correct regions for the task. Each correct region is assigned a label. Suppose, for the eight relevant regions in the scene, the correct viewing order is [ABCDEFGH]. A Levenshtein distance of 0 ([ABCDEFGH]) would indicate no difference, whereas a distance of 8 would indicate the maximum difference.
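The normalized sequence comparison described above can be sketched as follows; the helper names and the example orderings are illustrative, and the normalization by the number of correct regions follows the description above.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_viewing_distance(correct_order, observed_order):
    """Levenshtein distance between the intended and observed viewing order,
    normalized by the number of correct regions for the task."""
    return levenshtein(correct_order, observed_order) / len(correct_order)

# Example: eight labeled regions with intended viewing order ABCDEFGH.
print(normalized_viewing_distance("ABCDEFGH", "ABCDEFGH"))  # 0.0: perfect match
print(normalized_viewing_distance("ABCDEFGH", "ABDCEFGH"))  # 0.25: regions C and D swapped
```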

Chapter 4

Task Inference Problem

4.1 Problem Definition

It has been shown that the task at hand greatly influences visual attention. The best-known example of task-based top-down attention was provided by Yarbus in 1967 [5, 79]. Eye movements convey vital information about the cognitive processes involved when performing tasks such as driving, reading, visual search and scene understanding. Eye movements reveal the shift in attention, and a sequence of eye movements relates strongly to the task at hand. The difficulty and complexity of the task also significantly influence eye movements. This is based on the assumption that eye movements and visual cognition are highly correlated [22]. Eye movements can be used both as data to understand the underlying cognitive process and to validate computational models of visual attention. Thus, eye movements are used to better represent the task at hand, and the fixations extracted are used as features for the computational model described in chapter 2. However, the inverse of this process, determining the task at hand from eye movement data, is very difficult. Eye movements are made to perform the task at hand, but also to gather additional information in the scene while performing the task. Salient regions in the image that are task irrelevant also attract visual attention. The process of differentiating eye movements as task-based versus information-based is the holy grail of eye tracking.

Hence the task inference problem can be defined as identifying the task performed by the user while viewing the scene, with the help of image features and real-time eye movement data. A generic model to predict the task at hand from arbitrary eye movement data is far from reach. The problem therefore needs to be simplified: for a given set of stimuli images and relevant tasks, is it possible to identify the task based on eye movement data? It is also important to extend the idea to any new image which can be classified into an existing image group in the dataset and has relevance to the task defined for that image group. For example, if image A in figure 4.1 belongs to the street image group and the task is to locate the cars in the image, then another image B can be a new stimulus image that can be classified as a street image and to which the provided task is applicable.

Figure 4.1: Figure showing Image A, a street view image with eye tracking data where the task provided to the viewer was to locate the cars in the scene. Image B shows a similar image that can be classified as a street image and also has cars in the image to be task relevant.

The problem can now be defined as follows: given a set of p images in a group (i_1, ..., i_p ∈ I) and eye movement data e_i for each image in the group under n assigned tasks (t_1, ..., t_n ∈ T), the model should be able to identify the task performed by the viewer for a new stimulus image i_new which can be classified into I and to which the tasks t_1, ..., t_n are relevant.
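A minimal sketch of this formulation is shown below: scanpaths recorded under the known tasks t_1, ..., t_n are converted into fixed-length feature vectors and a simple classifier predicts the task for eye movement data recorded on a new image from the same group. The scanpath featurization, the toy data and the choice of a nearest-neighbor classifier are illustrative assumptions, not the proposed model.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def scanpath_features(fixations):
    """fixations: array of (x, y, duration) rows. Returns a simple summary
    vector: spatial mean and spread, mean dwell time, and scanpath length."""
    f = np.asarray(fixations, dtype=float)
    step = np.linalg.norm(np.diff(f[:, :2], axis=0), axis=1).sum()
    return np.array([f[:, 0].mean(), f[:, 1].mean(),
                     f[:, 0].std(), f[:, 1].std(),
                     f[:, 2].mean(), step])

# Toy training data: scanpaths labeled with the task id under which they were recorded.
rng = np.random.default_rng(1)
scanpaths = [rng.random((8, 3)) * [800, 600, 400] for _ in range(30)]
tasks = rng.integers(0, 3, size=30)            # n = 3 assigned tasks

X = np.stack([scanpath_features(sp) for sp in scanpaths])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, tasks)

# Eye movement data e_new recorded on a new image i_new classified into the same group:
new_scanpath = rng.random((8, 3)) * [800, 600, 400]
print(clf.predict(scanpath_features(new_scanpath)[None, :]))
```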

4.2 Research Objectives and Contributions

The objective of this chapter is to develop a framework for predicting the task being undertaken based on scene context, bottom-up scene saliency, and eye movement data. The model is initially trained on a set of training stimulus images to compute task-based saliency maps for the task at hand across multiple human subjects. A human observer (not part of the training phase) is then presented with a new image that is similar to the training images (by scene classification) and is relevant to the task at hand. The model must then accurately predict, in real time, the task performed by the user on the stimulus image based on the gathered eye movement data. Such a model can provide vital information about the viewer's intent in repeated image search tasks. For example, TSA experts look for specific (hazardous or harmful) objects in an image, and this search process is highly repetitive. The model can be tuned to specific image groups, enabling assistance in the visual search process, and the idea extends to many other image search tasks. It can also be used in training and learning environments to better understand a viewer's eye movements. Finally, this model can provide a rich dataset of stimulus images and corresponding task-dependent eye movements, which can serve as ground truth for other visual attention models. This dataset can also be used for validation and for empirical and performance studies of different computational saliency models.

4.3 Background and Related Work

Yarbus showed that eye movements depend not only on the scene presented but also on the current task at hand. Subjects were asked to view a picture (a room with a family and an unexpected visitor entering the room) under different task conditions, which included guessing the ages of the people, the material circumstances of the family, and the family's reaction to the visitor, as well as free viewing of the image. Figure 4.2 shows the scanpaths of a subject for the various tasks while viewing the same stimulus image. Attention in humans has also been broadly differentiated by its attributes, namely covert and overt attention.

Figure 4.2: The experiment conducted by Yarbus in 1967. The image at the top left shows the picture of a family and an unexpected visitor; the remaining images show the scanpaths of a subject for each task in the experiment while viewing the same stimulus image.

Overt attention is the process of directing the fovea towards a desired object or stimulus in order to fixate on it and gather information. Covert attention, on the other hand, is the process of gathering information about surrounding objects while focusing on an object, without necessarily making an eye movement. An example of covert attention occurs while driving: the driver, while focusing on the road, covertly keeps track of the gauges, road signs, and traffic lights. The purpose of covert attention is to quickly gather information about interesting objects or features in the scene other than the one currently fixated. Covert attention is attributed to the physiology of the eye, which

maps slow saccades to other locations in the scene to gather information for the next fixation [80]. However, researchers are still trying to understand the complex interactions between overt and covert attention. Many computational models try to find the regions that attract eye fixations and thereby explain the process of overt attention. However, there are no computational frameworks that explain the reasons for and mechanisms of covert attention, and there is also no known measure of covert attention. Thus, visual saliency models capture the likelihood that a region in the scene will be attended to, but cannot explain whether the information gathered there is obtained through covert or overt attention. Most models target very specific tasks, such as locating humans, which requires them to detect human faces [81, 47], skin color [82], or skeletal structure and posture. There have also been approaches that detect specific features such as skin, faces, horizontal and vertical lines, curvature, corners and crosses, shape, texture, and depth. These features make it possible to differentiate salient regions and to group similar regions together. Other approaches use object detection and scene classification techniques to identify images of interest, and other models predict the gaze of subjects for a very specific task within a controlled setup [53, 83]. However, there are no known models that use eye movement data to predict the task being performed by the user on similar images.

4.4 Approach

Task inference using only eye movements is a very complex problem, and it has been shown that it is extremely difficult to differentiate task-related fixations from other fixations in the scene. The comprehensive computational model of human visual attention proposed in chapter 2 uses saliency, gist, local image-based features, and eye movement data, and also encodes the task at hand, to predict the viewer's gaze. The proposed model narrows down the regions of interest in the scene obtained from the saliency map or from eye movement data alone. When the task-based saliency map of the model is generated from the eye movement data of multiple subjects, the task-relevant regions are isolated, as sketched below.
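A minimal sketch of that aggregation step is given below, assuming fixations from all training subjects for one task have already been pooled into a list of (x, y) positions. The smoothing width and the use of scipy's Gaussian filter are assumptions, not details specified in this proposal.

import numpy as np
from scipy.ndimage import gaussian_filter

def task_saliency_map(pooled_fixations, image_shape, sigma=30):
    """Build a task-based saliency (fixation density) map from pooled fixations."""
    h, w = image_shape
    density = np.zeros((h, w), dtype=np.float64)
    for x, y in pooled_fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < h and 0 <= xi < w:
            density[yi, xi] += 1.0                     # accumulate discrete fixation counts
    density = gaussian_filter(density, sigma=sigma)    # smooth into a continuous map
    peak = density.max()
    return density / peak if peak > 0 else density     # normalize to [0, 1]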

Thus, combining the predicted task-relevant gaze positions with real-time eye movement data enables the model to run as a controlled feedback loop that predicts the task being performed by the user. For example, in a driving scenario where the task is to locate speed signs, the fixations will fall on a speed sign (if one is present) in the scene; if there are multiple speed signs, attention will shift from one sign to another. Figure 4.3 shows eye-tracked images of a person driving a virtual truck. In image A the subject is given the task of monitoring traffic using the rear-view mirrors, whereas in image B the task is to monitor the gauges and other instruments while driving. In both images it can be clearly seen that there are fixations on the task-relevant regions as well as other fixations made to gather additional information. The proposed model will predict the task-relevant regions for all of the tasks specified for the image, and will infer the task performed by the user based on real-time eye movement data.
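One plausible way to realize this feedback loop, sketched under the assumption that one task-based saliency map per candidate task is available from the training phase, is to score each task by how well the incoming fixations fall on its map and to report the running best guess. The mean z-scored map value used here is an illustrative scoring rule, not the proposal's final formulation.

import numpy as np

def score_task(task_map, fixations):
    """Mean z-scored map value at the fixated pixels (higher means a better fit)."""
    z = (task_map - task_map.mean()) / (task_map.std() + 1e-9)
    h, w = z.shape
    values = [z[int(y), int(x)] for x, y in fixations
              if 0 <= int(y) < h and 0 <= int(x) < w]
    return float(np.mean(values)) if values else float("-inf")

def infer_task(task_maps, fixation_stream):
    """task_maps: {task name: 2-D task-based saliency map}.
    fixation_stream yields (x, y) fixations as they arrive from the eye tracker;
    after each new fixation the current best task hypothesis is emitted."""
    seen = []
    for fixation in fixation_stream:
        seen.append(fixation)
        scores = {task: score_task(m, seen) for task, m in task_maps.items()}
        yield max(scores, key=scores.get)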

4.5 Evaluation

Eye tracking has been used extensively to validate visual attention and gaze guidance models, and researchers manually record and inspect eye movement data to understand the cognitive processes involved in performing a task. The proposed model will predict the task being performed by the user from the image saliency map, gist, local image-based features, and pre-computed eye movement data, so its task-inference performance must be evaluated. The accuracy of the model on a trial is 100% if it predicts the task performed correctly and 0% if it fails to do so.

User Study

The goal of the user study is to test the accuracy of the model in predicting the task performed by the user, given the image saliency map, gist, local image-based features, and pre-computed eye movement data. The model will already have been evaluated on gaze prediction as described in chapter 2. A user study will be conducted to evaluate how quickly and accurately the model performs task inference. Participants will be chosen randomly and eye tracked while viewing a collection of static images on which the model has not been trained. All participants will have normal or corrected-to-normal vision with no color blindness. Each participant will be assigned a specific task while viewing an image for a specified period of time, and the model will attempt to infer the task being performed. At the end of the study, the task assigned to the subject will be compared with the task inferred by the model, and the speed and accuracy of the model will be assessed for each image and for each image group overall. The subject's eye movement data will be recorded, and the resulting fixation map will be compared to the task-based saliency map computed by the model. The evaluation measures discussed in chapter 2, section 2.6 can be used to compare the fixation distribution to the task-based saliency map, and a binary classification test can also be performed as described in chapter 3, section 3.5.
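As a rough illustration, the per-trial accuracy and two of the chapter 2 measures could be computed as follows. This is a sketch for exposition; the exact preprocessing (map normalization, fixation filtering) is an assumption rather than the procedure fixed by this proposal, and fixations are assumed to lie within the map bounds.

import numpy as np

def task_inference_accuracy(predicted_tasks, assigned_tasks):
    """Fraction of trials on which the inferred task matches the assigned task."""
    correct = sum(p == a for p, a in zip(predicted_tasks, assigned_tasks))
    return correct / len(assigned_tasks)

def kl_divergence(fixation_map, task_saliency_map, eps=1e-12):
    """KL divergence between the fixation distribution and the task-based map."""
    p = fixation_map / (fixation_map.sum() + eps)
    q = task_saliency_map / (task_saliency_map.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def nss(task_saliency_map, fixations):
    """Normalized scanpath saliency: mean z-scored map value at fixated pixels."""
    z = (task_saliency_map - task_saliency_map.mean()) / (task_saliency_map.std() + 1e-9)
    return float(np.mean([z[int(y), int(x)] for x, y in fixations]))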

Figure 4.3: Image (A) shows the eye movements of a subject given the task of checking the rear-view mirrors. Image (B) shows the eye movements of the subject given the task of checking the gauges on the dashboard. The red/green circles indicate the fixations made by the subject while performing the task, and the number inside each circle shows the order of the fixations. Note that the subject also gathers information about the road and the GPS while performing the task at hand.
