Composition Context Photography


UNIVERSITY OF CALIFORNIA
Santa Barbara

Composition Context Photography

A Dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science

by

Daniel André Vaquero

Committee in Charge:
Professor Matthew Turk, Chair
Professor Tobias Höllerer
Professor Yuan-Fang Wang
Doctor Kari Pulli

September 2012

The Dissertation of Daniel André Vaquero is approved:

Prof. Tobias Höllerer
Prof. Yuan-Fang Wang
Dr. Kari Pulli
Prof. Matthew Turk, Committee Chairperson

September 2012

Composition Context Photography

Copyright 2012 by Daniel André Vaquero

To my family, and to everyone else who strives to make the Universe a better place.

Acknowledgements

I am extremely grateful to my advisor Matthew Turk for his insightful advice, wisdom on making important decisions, numerous conversations, and financial support during my studies. Matthew gave me the freedom and encouragement to pursue my own ideas, and provided an excellent environment for research in the lab. I also would like to thank my committee for valuable discussions and feedback. I feel very fortunate for having the opportunity to collaborate with Kari Pulli since my internship with his group at Nokia. The work conducted by Kari and his colleagues was a source of inspiration for defining my research topic, and I do not think I would have thought about it if I had not been part of that group. I also wish to thank Tobias Höllerer for the conversations about the human factors involved in my research, and Yuan-Fang Wang for the encouragement, insights, and expertise in computer vision. Besides the internship at Nokia, where I also had invaluable advice from Natasha Gelfand and Marius Tico, I had the opportunity to do research internships at Mitsubishi Electric Research Labs (MERL) and IBM Research. I wish to thank Ramesh Raskar for his mentorship and collaboration on the multiflash imaging project, and for the inspiration to set ambitious goals. It was also a privilege to collaborate with and have the guidance of Rogerio Feris, initially during

the multiflash project and then at a research internship at IBM. In addition to our research collaborations, Rogerio provided me with innumerable tips about life as a PhD student and career advice. I am very happy to have his friendship. I also would like to thank George Legrady for the recent collaboration, which gave me the opportunity to further my understanding of the arts. I am grateful to Andrew Adams and Eino-Ville Talvala for advice on implementing digital zoom on the N900 Frankencamera. I also thank Kari Pulli and Nokia for providing a Nokia N900 and an experimental inertial sensor box. I am very thankful to my colleagues at the Four Eyes Lab, for being good friends and providing a great research atmosphere. In particular, I would like to thank Steffen Gauglitz for the image alignment code used in my dissertation research. I also really appreciate the help of the staff at the Computer Science and Media Arts and Technology departments, who never hesitated to assist me. I am grateful to Lisa Kaz for a fellowship during my initial year at UCSB. I would like to thank all my friends for making my stay in Santa Barbara more enjoyable. I also thank my family for the continuous encouragement; they have always supported me, even from far away. Finally, I feel extremely blessed for having met my partner in life María Inés Canto Carrillo during my time in Santa Barbara. This was the best part of my PhD, and I sincerely wish that the connection we have will continue throughout our lives.

Curriculum Vitæ

Daniel André Vaquero

Education
2012 PhD in Computer Science, University of California, Santa Barbara, USA
Master of Science in Computer Science, University of São Paulo, Brazil
Bachelor of Science in Computer Science, University of São Paulo, Brazil

Experience
2009 Research Intern, Nokia Research Center, Palo Alto, California
Research / Global Technology Services Intern, IBM T.J. Watson Research Center, Hawthorne, New York
Visiting Research Intern, Mitsubishi Electric Research Labs, Cambridge, Massachusetts

Selected Awards
2011 Semi-Finalist, ACM Student Research Competition, SIGGRAPH 2011
IBM First Plateau Invention Achievement Award

2011 IBM First Patent Application Invention Achievement Award
2006 Lisa Kaz Graduate Fellowship, UC Santa Barbara
2004 MSc Scholarship, FAPESP, Brazil
2003 Undergraduate Research Scholarship, CNPq, Brazil

Selected Publications

D. Vaquero and M. Turk. The Composition Context in Point-and-Shoot Photography. ACM SIGGRAPH 2011 Posters, Vancouver, Canada, August 7-11, 2011.

C. Chen, D. Vaquero, and M. Turk. Illumination Demultiplexing from a Single Image. International Conference on Computer Vision (ICCV 2011), Barcelona, Spain, November 6-13, 2011.

D. Vaquero, N. Gelfand, M. Tico, K. Pulli, and M. Turk. Generalized Autofocus. IEEE Workshop on Applications of Computer Vision (WACV 2011), Kona, Hawaii, January 5-7, 2011.

D. Vaquero, M. Turk, K. Pulli, M. Tico, and N. Gelfand. A Survey of Image Retargeting Techniques. SPIE Applications of Digital Image Processing XXXIII, Andrew G. Tescher, Editor, Proc. SPIE 7798, San Diego, California, August 1-5, 2010.

A. Adams, E. Talvala, S. H. Park, D. Jacobs, B. Ajdin, N. Gelfand, J. Dolson, D. Vaquero, J. Baek, M. Tico, H. Lensch, W. Matusik, K. Pulli, M. Horowitz, M. Levoy. The Frankencamera: An Experimental Platform for Computational Photography. ACM Transactions on Graphics (SIGGRAPH 2010), Los Angeles, California, July 25-29, 2010.

D. A. Vaquero, R. S. Feris, D. Tran, L. Brown, A. Hampapur and M. Turk. Attribute-Based People Search in Surveillance Environments. IEEE Workshop on Applications of Computer Vision (WACV 2009), Snowbird, Utah, December 7-8, 2009.

D. A. Vaquero, R. Raskar, R. S. Feris and M. Turk. A Projector-Camera Setup for Geometry-Invariant Frequency Demultiplexing. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, Florida, June 22-24, 2009.

D. A. Vaquero, R. S. Feris, M. Turk and R. Raskar. Characterizing the Shadow Space of Camera-Light Pairs. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), Anchorage, Alaska, June 24-26, 2008.

Abstract

Composition Context Photography

Daniel André Vaquero

In the digital point-and-shoot photography paradigm, users typically point the camera at the subjects, adjust framing, and trigger functions such as autofocus in preparation for taking a picture. This dissertation explores the opportunities for expanding the outcome of the point-and-shoot photography process by utilizing additional information collected while the user is framing a picture. We propose a novel framework for digital photography that explores contextual information gathered while a photographer is composing and framing a picture. We call the information given by a video taken from the camera's viewpoint while framing, accompanied by per-frame capture parameters and inertial sensor data, the composition context of a photograph. In practice, this video can be recorded by saving the images displayed at the viewfinder during framing. We started our investigation by conducting an exploratory study with users to gather composition context data in real-world photographic situations. In the study, we collected a large database of pictures with associated composition context data. An analysis of this dataset confirms that a significant portion of the composition context is highly correlated with the picture taken. In other words,

part of the composition context can be understood as imaging the same scene pictured in the photograph with variations in capture parameters, such as field of view, exposure time, and focus. By exploring this correlation and variability, we then show how computer vision and computational photography techniques can be applied to provide a wide range of interesting and compelling photo suggestions as a result of the picture-taking process. These include image composites, such as panoramas and high dynamic range images, and individual frames selected from the composition context. We demonstrate that our framework can be integrated into new camera functionality. A composition context camera preserves the interface of current point-and-shoot cameras, but it uses the information in the composition context to compute additional interesting photo choices. We expect this capability to expand the photographic possibilities for casual and amateur users, who often rely on automatic camera modes.

Professor Matthew Turk
Dissertation Committee Chair

Contents

Acknowledgements v
Curriculum Vitæ vii
Abstract x
List of Figures xv
List of Tables xix

1 Introduction
  Problem
  Approach
  Thesis Statement
  Contributions
  Outline

2 Related Work
  Computational Photography
  Programmable Cameras
  Use of Contextual Information
  Contextual Information in Photography
  Supporting Computer Vision Algorithms
  Image Alignment
  Moving Object Detection
  Interestingness Measures
  Computational Aesthetics
  Attention-Based Saliency

3 Exploratory Study
  Motivation
  Capture Application
  Data Collection
  Dataset Statistics and Analysis
  Discussion

4 Composition Context Photography
  The Composition Context
  Composition Context Acquisition
  Composition Context Camera
  Generation of Photo Suggestions
  Image Alignment and Estimation of Moving Areas
  Frame Selection and Image Composites
  Practical Considerations

5 Selection of Interesting Frames
  Overview
  Attention Map and Measure
  Quality Measures
  Camera Motion
  Moving Object Area
  Rule-Of-Thirds
  Spatial Distribution of Edges
  Color Simplicity
  Frame Selection
  Experimental Results
  Discussion

6 Image Composites
  Composite Types
  Panoramas
  Collages
  Extended Dynamic Range
  All-in-Focus Imaging
  Flash/No-Flash Imaging
  Synthetic Shutter Speed Photography
  Motion Summaries
  Synthetic Panning
  Moving Object Removal
  6.2 Identifying Composites
  Panoramas and Collages
  Image Stack Composites
  Further Refinement of Input Frames
  Problem and Motivation
  Solution
  Experimental Results
  Discussion

7 Conclusions and Future Work
  Summary
  Composition Context Framework
  Composition Context Dataset
  Suggestion of Interesting Frames
  Composite Identification and Generation
  Complete System Prototype
  Remarks
  Research Agenda
  Extensive Assessment
  Active Approaches
  Generalization of the Input and Output
  Human Factors
  Core Computer Vision Problems
  Quality of the Generated Suggestions
  On-Camera Integration
  Domain-Specific Composition Context

Bibliography 159

List of Figures

1.1 Capturing the best moment. (a) Final picture taken by the user. The butterfly is almost gone. (b-f) A few frames from the composition context. The frame in (e) has a much better view of the butterfly.
Differences between the final image and composition context frames. (a) Final image taken by the photographer using parameters automatically determined by the camera. (b-l) A few frames from the composition context. There are significant variations in point of view (a-c), focal length (d), exposure time (e-g), focus (h), and moving subjects (i-l).
Image composites obtained by fusing information in multiple composition context frames. (a) panorama; (b) collage; (c) all-in-focus image; (d) extended dynamic range; (e) synthetic long exposure while zooming; (f) panning; (g) motion summary.
Overview of the composition context photography framework. The user frames and takes a picture as with a point-and-shoot camera, but a collection of interesting photo variations, computed using information from the composition context, may be obtained in addition to the actual photograph.
Our contextual information capture prototype. (a) Nokia N900 with attached sensor box; (b) Point-and-shoot camera application, including a viewfinder; (c) Camera buttons used for digital zoom and autofocus / shutter trigger.
Some of the participants in the user study framing their photographs.
Some of the pictures taken by the participants of the exploratory study.

3.4 Duration of the framing procedure. (a) Distribution of framing times. (b) Distribution of framing times considering only periods when the viewfinder overlaps with the final image.
Variations of capture parameters for viewfinder frames that overlap with the final image during the framing procedure. (a) Variations in focus. (b) Variations in exposure (exposure time multiplied by sensor gain). (c) Variations in zoom.
Contextual information we propose to collect, given by viewfinder frames, their capture parameters, and inertial sensor data.
Image alignment (adapted from Brown and Lowe [16], 2003 IEEE). (a-b) Images to be aligned; (c) Images aligned according to a homography.
Filtering frames by camera motion. The idea is to eliminate frames with large camera motion, which are likely to be blurry and contain rolling shutter artifacts.
Filtering frames by exposure and focus to avoid extreme variations in intensity and blur.
Filtering frames by alignment. We observed that frames with larger displacements from the original were more likely to contain registration failures. In this visualization, we added the two frames after registration.
Intensity normalization.
Background model. The black areas near edges have been disregarded due to alignment errors.
Background subtraction procedure. Pixels that significantly differ from the background model are marked as foreground.
Morphological filtering.
Filtering by connected component size. Small blobs are discarded.
Final results after refinement using the GrabCut segmentation algorithm.
Attention map. Brighter values correspond to larger attention values. In this example, the user spent most of the time fixating on the final picture's region on the left, but there was also a brief period of fixation to the right.
The rule of thirds (figure adapted from Mai et al. [63], 2011 IEEE). Important subjects of a photograph should be placed along the thirds lines or near their intersections.

5.3 Spatial distribution of edges (figure adapted from Ke et al. [55], 2006 IEEE). It is suggested that cluttered scenes (on the left) tend to be less aesthetically pleasing than simple scenes (on the right).
Color simplicity (figure adapted from Ke et al. [55], 2006 IEEE). The image on the left has a hue count of 3, while the image on the right has a hue count of 11. The image on the left is arguably simpler than the one on the right with respect to the distribution of colors.
Individual frame suggestion algorithm. First, interesting frames are selected using the attention map. The frames are then grouped by proximity, and quality measures are optimized within each group to generate a set of suggestions and labels.
Statistics related to the generated suggestions. (a) Number of frames in the composition context that have been registered to the final image using the alignment algorithm. (b) Number of suggestion frame groups found by our suggestion algorithm. (c) Number of frames per group (group size). (d) Number of suggestions generated per group of frames. (e) Total number of suggestions generated for each final picture in the dataset.
A few examples of suggested frames from our dataset. Each pair displays the final picture and one of the suggestions. The suggestions provide interesting alternatives, such as different views, focus at different depths, different exposures, and moving subjects at different moments in time.
All-in-focus imaging (image from Vaquero et al. [108], 2011 IEEE). A focal stack (a-c) is captured by focusing at different distances, and the images are fused to obtain an all-in-focus result (d). This example was captured by manually focusing a Canon 40D camera and fused using the algorithm in Agarwala et al. [5].
Synthetic Shutter Speed Imaging (figure adapted from Telleen et al. [100], 2007 The Eurographics Association and Blackwell Publishing). (a) Long exposure handheld; (b) Short exposure; (c) Synthetic shutter speed. By aligning and adding multiple short-exposure frames, a long exposure image is simulated. This image is less likely to suffer from blur due to camera shake.
Input frame selection process. Groups of frames suitable to be provided as input to different image compositing algorithms are identified and selected.

6.4 How the number of input images affects focal stack fusion (figure adapted from Vaquero et al. [108], 2011 IEEE). (a) Fusion result for 24 input images. (b) Fusion result for 3 images, selected by eliminating redundancy. (c-e) Details of the 3 selected input images.
Registration issues due to parallax (figure adapted from Vaquero et al. [108], 2011 IEEE). (a-b) two frames from the 24-image focal stack after registration. (c) the yellow square regions from (a) and (b). Notice the different distances between the brown and green cans due to parallax, caused by handshake.
Distribution of the number of identified input frames per composite. (a) Panoramas and collages. (b) High Dynamic Range. (c) Synthetic Long Exposure. (d) All-in-focus.
Distribution of the number of identified input frames per composite based on moving objects. (a) Motion summaries. (b) Synthetic Panning. (c) Moving Object Removal.
Examples of panoramas and collages created using the composition context.
Examples of HDR images and flash/no-flash composites created using the composition context.
Examples of synthetic long exposure composites created using the composition context.
Examples of motion summaries and synthetic panning composites created using the composition context.
Examples of moving object removal using the composition context to fill gaps.

List of Tables

6.1 Number of generated image composites using our dataset, and average number of identified input frames per composite type.

Chapter 1 Introduction

Capturing compelling pictures is a complex task. The choice of adequate capture parameters is very important for successful photography, as it influences the overall quality and composition of the image. The size of the lens aperture, the sensor sensitivity (ISO), and the shutter speed determine the amount of light recorded during a given exposure, making the picture darker or brighter. The lens aperture also relates to the depth of field. A shallow depth of field allows for keeping a foreground subject in focus while blurring distracting background details, while a large depth of field renders sharp scenes with substantial depth variation, enabling the creation of storytelling compositions that include subjects at different depths. In scenes with moving subjects, fast shutter speeds allow for freezing motion, such as water drops in a waterfall. On the other hand, moving objects create motion blur at slow shutter speeds, rendering interesting effects, such as a waterfall or a river with a cotton candy appearance.
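As a back-of-the-envelope summary of this trade-off (standard exposure arithmetic, not a result from this dissertation): recorded brightness grows with shutter time and ISO and falls with the square of the f-number, so different parameter combinations can record the same amount of light while changing depth of field and motion blur.

```latex
% Standard exposure relations (illustrative; notation is ours, not the dissertation's).
% B: recorded brightness, t: shutter time, S: ISO sensitivity, N: f-number.
B \;\propto\; \frac{t \cdot S}{N^{2}}, \qquad
\mathrm{EV} \;=\; \log_{2}\!\frac{N^{2}}{t}.
% Example: f/4 at 1/25 s and f/2.8 at 1/50 s give the same exposure at a fixed ISO;
% the one-stop-smaller aperture is offset by doubling the shutter time.
```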

Another important element of compelling pictures is an interesting composition. It is generally difficult to define what the elements present in a captivating picture are, but common characteristics include: is pleasant to the viewer, evokes a message intended by the photographer, captures the moment, and has high aesthetic value. Quoting Ansel Adams, "There are no rules for good photographs, there are only good photographs." Nevertheless, photographers have knowledge of a range of artistic composition rules that help to shape photographs to have more visual impact. Criteria such as the rule of thirds, balance, harmony, and color simplicity are often found in photography books and tutorials about photo composition. They serve as a set of guidelines for framing a photograph, which can be useful to consistently capture compelling shots. At the same time, pictures that deviate from the norm can also be found to be appealing. Framing the desired picture is typically accomplished by looking at the camera's viewfinder and adjusting capture parameters such as point of view, zoom, aperture size, and focus. This is achieved by panning and rotating the camera while pointing at the scene, and using the diverse camera controls, such as a focus ring or autofocus, aperture selection, and zoom. Current digital cameras have functionalities for automatically estimating optimal parameters for a given scene at the moment of capture, in order to make the photographic experience more approachable to non-experts. These cameras, often referred to as point-and-shoot cameras,

attempt to reduce the photographic process to simply pointing the camera at the scene of interest (after adjusting the point of view to capture the desired composition) and shooting a photograph. Examples of functionalities present in point-and-shoot cameras include autofocus mechanisms, which can be used to automatically determine a distance to focus the lens so that the overall sharpness of the image is maximized. Light metering is also performed to find the optimal aperture size and shutter speed to obtain a well-exposed picture, and to trigger the flash if necessary. An autoexposure routine can dynamically adjust the viewfinder brightness in response to the environment. While professional photographers still tend to favor a manual or semi-automatic selection of parameters, point-and-shoot cameras are very popular and made photography accessible to the general public. In this dissertation, we argue that the process of framing in preparation for taking a picture is a rich source of information that can be explored to generate variations of the captured image, which could potentially be more compelling than the captured image itself. In traditional point-and-shoot cameras, the multiple viewfinder frames are simply displayed to the user and then discarded. Once the photographer triggers the shutter, the photo is captured and saved to the camera's memory storage. In contrast, we propose to explore information available in viewfinder frames and their capture parameters, obtained during the framing

procedure, to generate additional photo suggestions that can be presented to the user as the outcome of triggering the shutter. We call this information the composition context of the picture. To illustrate the usefulness of the composition context, let us begin with an example. A common problem in photography for moving subjects is that it can be difficult to trigger the shutter at the right moment [20]. The presence of moving subjects during framing can provide different compositions. Recording the composition context can be helpful, since one or more of the viewfinder frames might have captured the desired moment. Imagine that the photographer is about to take a picture of a flower (Figure 1.1). During framing, a butterfly arrives and lands on the flower, but immediately leaves. The photographer triggers the shutter as fast as possible to try to capture the butterfly, but the butterfly leaves and the final picture barely captures it (Figure 1.1(a)). However, looking at the composition context frames, one of them indeed captured a better view of the butterfly (Figure 1.1(e)). The composition context is also useful in other photographic situations. Imagine that the user is framing a picture of a building moments before sunset. The user moves the camera while adjusting the composition in the viewfinder, leading to different views of the scene. The autoexposure algorithm attempts to adjust the exposure time to match the scene, which includes a bright sky and a somewhat

Figure 1.1: Capturing the best moment. (a) Final picture taken by the user. The butterfly is almost gone. (b-f) A few frames from the composition context. The frame in (e) has a much better view of the butterfly.

dark building. Also, the user adjusts the magnification level by zooming in and out, and triggers autofocus before taking the picture. A person walks by while the user is framing. There are several capture parameters varying during the framing process, such as zoom, focus, exposure, and field of view, which can be explored to generate additional pictures. To illustrate this, we simulated the framing process by panning the camera left, right, and then back to pointing at the desired view. The user then zoomed in and zoomed out, and adjusted the focus. To simulate the auto-exposure mechanism, we captured a few images with varying exposure times. A moving subject crossed the scene walking from left to right, and the user then

captured the final picture. This picture is shown in Figure 1.2(a), and some of the simulated viewfinder frames are presented in Figure 1.2(b-l). There are reasons for keeping some of the viewfinder frames instead of the final picture. The automatic choice of parameters made the sky appear quite saturated in Figure 1.2(a), so the image in Figure 1.2(e) might be more desirable and compelling, even though the building appears slightly darker; Figure 1.2(d) shows a larger zoom level, and the posts are easier to recognize; and Figures 1.2(i-l) show a pedestrian in different parts of the scene. The multiple frames in Figure 1.2 image the same scene under different capture parameters. In some cases, it is possible to explore this variation in parameters to create image composites from multiple frames. Such composites aim to preserve the best aspects of each input image and usually depict the scene in a way that is not possible with a single captured image. Figure 1.3 shows some composites created from viewfinder frames captured for the example in Figure 1.2, manually created using the Adobe Photoshop software. Figure 1.3(a-b) shows two extended field of view suggestions: a panorama and a collage. Figure 1.3(c) shows an all-in-focus picture. Figure 1.3(d) presents an extended dynamic range version of the photograph, and Figure 1.3(e) simulates a long exposure while the zoom ring is rotated by averaging multiple images taken with different focal lengths. The moving subject also allows for interesting motion effects, such as panning (Figure 1.3(f))

Figure 1.2: Differences between the final image and composition context frames. (a) Final image taken by the photographer using parameters automatically determined by the camera. (b-l) A few frames from the composition context. There are significant variations in point of view (a-c), focal length (d), exposure time (e-g), focus (h), and moving subjects (i-l).

and motion summaries (Figure 1.3(g)). These are interesting alternative depictions of the scene in Figure 1.2(a). These examples illustrate that information in the composition context can be useful for suggesting additional pictures, either by selecting individual frames from the composition context or by combining information in multiple frames to generate compelling pictures. To summarize, Figure 1.4 illustrates the complete process of generating photo suggestions using composition context data.

1.1 Problem

We propose the following research problem: from a picture and its composition context, characterized by viewfinder frames, their capture parameters, and inertial sensor data, the goal is to automatically generate interesting and compelling alternative versions of the picture by utilizing information in the composition context.

1.2 Approach

We provide a solution to the aforementioned problem by addressing the complete process for generating alternative photos, from the capture of composition context data to the creation of suggestions.
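To make the inputs of the problem statement above concrete, the sketch below spells out one possible in-memory representation of a composition context record; the field names are our own illustration, not the data format actually used in the dissertation's prototype.

```python
# Illustrative sketch of the composition context associated with one photograph.
# Field names are hypothetical; the actual capture application may store data differently.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class ViewfinderFrame:
    image: np.ndarray        # viewfinder image (H x W x 3)
    timestamp: float         # seconds since framing started
    exposure_time: float     # seconds
    gain: float              # sensor gain (ISO-like)
    focus: float             # lens focus setting
    zoom: float              # digital zoom factor

@dataclass
class InertialSample:
    timestamp: float
    accel: np.ndarray        # 3-axis accelerometer reading
    gyro: np.ndarray         # 3-axis gyroscope reading

@dataclass
class CompositionContext:
    frames: List[ViewfinderFrame] = field(default_factory=list)
    imu: List[InertialSample] = field(default_factory=list)
    final_picture: np.ndarray = None   # the photograph actually taken
```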

Figure 1.3: Image composites obtained by fusing information in multiple composition context frames. (a) panorama; (b) collage; (c) all-in-focus image; (d) extended dynamic range; (e) synthetic long exposure while zooming; (f) panning; (g) motion summary.

Figure 1.4: Overview of the composition context photography framework. The user frames and takes a picture as with a point-and-shoot camera, but a collection of interesting photo variations, computed using information from the composition context, may be obtained in addition to the actual photograph.

We design a camera application that resembles a point-and-shoot camera in automatic mode but silently records composition context data while the user is framing a photo. We also show how to use this data in conjunction with several computer vision and computational photography algorithms to generate a wide range of photographic suggestions. Our approach employs image alignment and moving object detection in conjunction with an analysis of capture parameters to automatically select individual frames to be suggested to the user and to identify groups of frames to be provided as inputs to image fusion algorithms. These create compelling composites such as panoramas, collages, high dynamic range images, synthetic panning, and all-in-focus images.

1.3 Thesis Statement

This dissertation introduces a framework for utilizing contextual information in point-and-shoot photography, obtained while the user is framing a picture, to generate additional interesting photo suggestions. It demonstrates that useful and compelling suggestions can be created in real-world photographic scenarios by showing how several computer vision and computational photography algorithms can be employed to select interesting viewfinder frames or to combine them into compelling image composites.

1.4 Contributions

The major contributions of this dissertation are:

- We define the concept of composition context in point-and-shoot photography, and propose a unifying framework where the composition context enables the generation of suggestions of interesting alternative versions of the picture taken. This extends the range of photographic possibilities for a given scene with minimal additional effort from the photographer.
- We present methods that enable the use of composition context information to generate picture suggestions:
  - A technique for choosing frames of interest to present to the user, selected from the composition context. The technique takes into account several measures of interestingness, such as computational aesthetics features, amount of camera motion, presence of moving subjects in the scene, and user fixation during composition.
  - A technique for creating image composites from composition context data. It addresses how to choose proper frames from the composition context to be provided as input to several computational photography algorithms, such as panoramic stitching, high dynamic range imaging, synthetic shutter speed photography, and focal stack fusion.

- We demonstrate that passively recorded composition context data in point-and-shoot photography contains enough information to enable the generation of interesting photo suggestions in addition to the actual picture taken by the user.
- We show that a system that implements the techniques mentioned above is able to generate compelling photo suggestions starting from real-world composition context data gathered during an exploratory study with users.

1.5 Outline

The remainder of this dissertation is organized as follows. We begin by reviewing related work in Chapter 2. We then present the results of an exploratory study with users in Chapter 3. In the study, we gathered a large collection of composition context instances in common photographic scenarios. The findings provide additional motivation and information to design our composition context photography framework, which we introduce in Chapter 4. Within the framework, we propose two different ways of generating suggestions: selection of individual interesting frames (Chapter 5) and creation of image composites from multiple frames (Chapter 6). Finally, a summary of the outcomes and a research agenda for further investigation are discussed in Chapter 7.

Chapter 2 Related Work

In this chapter, we review work related to the proposed composition context photography framework. We start with an overview of computational photography, including recent developments on programmable cameras that made the research in this dissertation possible. We then discuss applications of context, and recent uses of contextual information in photography. In addition, we briefly survey core computer vision algorithms for image alignment and moving object detection, which are an important part of our framework. Finally, we review areas relevant to automatically determining interesting frames in composition context data, such as computational aesthetics and attention-based visual saliency.

2.1 Computational Photography

Computational Photography [84] is an emerging research area that aims to extend or enhance the capabilities of digital photography. It is marked by the

convergence of computer vision, graphics, and applied optics. Computational photography techniques result in an ordinary photograph, but one that could not have been captured by a traditional camera. The approaches introduce computational elements into one or more steps of the photographic process, including illumination, optics, sensors, and processing. A class of methods in computational photography can be grouped under the umbrella of Epsilon Photography [81]: capturing multiple images of the scene with different capture parameters (often with small variations, which is where the term epsilon comes from), and then combining the images to obtain an image that could not have been taken with a single shot. Examples of techniques in this category include extending the field of view, i.e., capturing several images from different and overlapping fields of view and then mosaicking the images to create a panorama [79, 99, 111] or a collage [71, 12]; all-in-focus imaging, i.e., capturing a stack of images focused at different distances, and then fusing them to generate an image that appears sharp in its entirety [5]; high dynamic range (HDR) imaging [28], which produces images with extended dynamic range by capturing images with different exposure times and combining them, allowing for simultaneous representation of bright and dark regions in the same image; and flash/no-flash imaging [33, 77], where variations in illumination allow for preserving the natural colors of an image taken without flash while keeping the high signal-to-noise ratio of a flash image.
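As a concrete instance of this capture-then-combine pattern, the following sketch fuses an exposure stack with OpenCV's built-in MTB alignment and Mertens exposure fusion; this is our own illustration, not code from the works cited above.

```python
# Sketch: fuse an exposure stack into a single well-exposed image (epsilon photography).
# Uses OpenCV's MTB alignment and Mertens exposure fusion; illustrative only.
import cv2
import numpy as np

def fuse_exposure_stack(images):
    """images: list of 8-bit BGR frames of the same scene at different exposures."""
    aligned = [img.copy() for img in images]
    cv2.createAlignMTB().process(aligned, aligned)      # compensate small camera shifts
    fused = cv2.createMergeMertens().process(aligned)   # float32 result in [0, 1]
    return np.clip(fused * 255, 0, 255).astype(np.uint8)

# Example usage with three bracketed shots:
# stack = [cv2.imread(p) for p in ["under.jpg", "mid.jpg", "over.jpg"]]
# cv2.imwrite("fused.jpg", fuse_exposure_stack(stack))
```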

Photography in motion scenarios can be both challenging and rewarding. There are two kinds of motion that influence the captured images: motion of the camera during an exposure, and motion of subjects in the scene being photographed. Assuming a static camera on a tripod and a scene with moving subjects, the photographer decides whether to use a very fast shutter speed in order to freeze the action, or a slower shutter speed that renders blur trails due to moving objects. The synthetic shutter speed method [100] aligns several short-exposure images and adds them to simulate a long exposure image. When moving objects are present in the images, this also enables synthesis of motion blur due to object movement. When the camera is handheld, camera motion due to handshake can lead to blur, and deblurring images captured in this scenario is an active research area [35, 53, 102, 117]. On the other hand, intentional motion of the camera is also explored for artistic effects. The panning technique consists of moving the camera while tracking a moving subject during the exposure time. If the tracking is successful, the subject will be imaged at the same sensor region during the entire exposure time, and will appear sharp, while the remainder of the image will be blurred due to the camera's movement [76].

The Casio Exilim EX-F1 camera includes a mode that attempts to synthesize the panning effect from several short-exposure frames, by aligning the frames to a moving object and averaging them. In our work, we propose to record and explore composition context data in standard point-and-shoot cameras. The most useful computational photography approaches for us are the ones that do not require special modifications to the camera hardware. Therefore, methods based on aligning and combining frames, such as panoramic stitching, collages, all-in-focus imaging, HDR, flash/no-flash, and synthetic shutter speed are the most suitable. However, there are several other computational photography methods that propose modifications to the camera hardware (e.g., [59, 68, 82, 83]). Designing hardware changes to better explore composition context data can be a promising future research topic.

Programmable Cameras

Many computational photography techniques follow a similar pattern: take multiple images with varying capture parameters, and combine them to create a new picture. This is a simple procedure, but in practice it can be difficult to implement and experiment with. Current cameras lack a flexible programmable interface for systematically varying capture parameters for multiple images. This often leads to configurations that are not portable and experiments restricted to laboratory settings.

Besides, there are closed components in current cameras, such as the auto-focusing mechanism. It is impossible to access the procedure or to modify it, thus forcing researchers who want to experiment with it to build their own custom devices. The recently proposed Frankencamera architecture [4] aims to address these issues and facilitate research and experimentation in computational photography. The architecture includes the hardware, an open source software stack, and a C++ API, called FCam [3], that allow developers to flexibly write programs to control the camera and perform processing on the camera itself. The FCam API provides flexible control over the imaging pipeline, and every frame requested from the camera comes back tagged with the parameters used during its capture. Two hardware realizations of the architecture have been originally proposed: the F2 Frankencamera, a custom camera built from off-the-shelf components; and the Nokia N900 smartphone, which has been modified to include the Frankencamera software stack. It has been demonstrated that the Frankencamera facilitates the implementation of many computational photography techniques, such as high dynamic range viewfinding and capture, automated acquisition of extended dynamic range panoramas, low-light viewfinding and capture, and handshake detection using inertial sensors. The additional flexibility provided by the architecture enables research related to camera components that have traditionally been closed.

For example, having access to the focus procedure allows experimentation with different autofocus strategies [108]. Also, standard consumer cameras do not provide access to their viewfinder frames, making it infeasible for us to record them as required by our composition context photography framework (Chapter 4). The Frankencamera made the research in this dissertation possible, by allowing access to viewfinder frames and their capture parameters. The Frankencamera also provides a portable platform, encouraging data collection and testing in unconstrained real-world scenarios. After the publication of the Frankencamera work, there has been growing interest in programmable cameras. The FCam API is now supported by the Nokia N9 smartphone and the NVIDIA development platform for Tegra 3. An extension of the FCam for multiple cameras [106] was introduced as well, and a programmable stereo camera control system was proposed [46].

2.2 Use of Contextual Information

The use of contextual information to achieve a goal or improve the solution to a given task appears in many subject areas in computer science. Dey introduced the following definition: "Context is any information that can be used to characterise the situation of an entity. An entity is a person, place, or object that is considered

relevant to the interaction between a user and an application, including the user and applications themselves" [30]. The principle of locality [29], or locality of reference, observes that the same value or related storage locations are frequently accessed, both in the spatial dimension, i.e., data that lies close in storage tends to be more frequently accessed, and in the temporal dimension, i.e., if a given data position is accessed at a certain time, it is likely that it is going to be accessed again in the near future. The principle has applications in optimization techniques such as caching. In wearable computing systems, context is explored as part of the "always on, always running" characteristic [86]. A wearable is always working and sensing, and acting to respond to newly acquired information. The system described in [86] overlays information on top of a view of the world based on what it senses. With the inclusion of GPS systems in current cameras and smartphones, pictures can be tagged with location coordinates, a process called geotagging [105]. The location context can then be exploited in different applications, such as photo browsing [104] and augmented reality [58]. In his book Honest Signals [73], Alex Pentland discusses the use of a sociometer to collect nonverbal cues in social contexts, such as measurements of the time users spend talking face-to-face, speech feature analysis, body movement, and physical proximity to other people. He proposes to analyze patterns in this data

to infer intentions and goals in social interactions, which can be used to predict the outcome of situations such as dates, job interviews, and business transactions.

Contextual Information in Photography

There has been increasing interest in using contextual information in photography, by collecting extra information before and after a picture is taken. A major motivation is to increase the chances of capturing the best moment in scenes with motion, as skillfully showcased by the photographer Henri Cartier-Bresson in his Decisive Moment book [20]. The idea of capturing a burst of images and later selecting the best one [21] is often used by professional photographers when taking pictures of scenes with moving subjects [76]. A few modern digital cameras now have built-in functionality for recording bursts of images. Variations of the idea for specific use cases are also present. When the goal is to minimize blur due to handshake, the lucky imaging application presented in the Frankencamera paper [4] records a burst of images and motion sensor data during the exposure time of each image; the shot that minimizes the motion sensor output is then saved, as it is less likely to suffer from blur due to camera shake. In a portrait photography scenario, research on automatically estimating candid portraits has been reported [37]. The idea is to capture multiple frames of the face, and then use a machine learning approach to select candid portraits.
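The burst-selection idea can be illustrated with a very small sketch; the variance-of-Laplacian sharpness score used here is a common heuristic of our choosing, not necessarily the criterion used by the systems cited above.

```python
# Sketch: pick the sharpest frame from a burst using variance of the Laplacian.
# A simple stand-in for "best shot" selection; real systems use richer criteria.
import cv2

def sharpest_frame(frames):
    """frames: list of BGR images from a burst; returns (index, frame)."""
    def sharpness(img):
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        return cv2.Laplacian(gray, cv2.CV_64F).var()
    scores = [sharpness(f) for f in frames]
    best = max(range(len(frames)), key=lambda i: scores[i])
    return best, frames[best]
```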

A few commercial photography applications that include contextual information have recently been announced. The BlackBerry 10 mobile operating system from Research In Motion is expected to include an image editing feature that would allow semi-automatic manipulation of photographs by letting the user tap on certain areas and replace them with corresponding content from frames captured before or after a picture is taken [8]. A similar idea is explored by Scalado in its Rewind product [94]. The Samsung Galaxy S III smartphone features a Best Shot mode [90], which captures a burst of 8 frames at 6 frames per second, analyzes them, and suggests the best picture for the user to save. Finally, the Remove [93] application by Scalado is a semi-automatic solution for removing unwanted moving objects from a picture taken in a situation where the camera is held still during framing, and multiple moving objects are in the field of view. The user selects unwanted objects, which are segmented and inpainted using information from frames captured before the picture is taken. We believe that the commercial interest in applications that explore contextual information is a good indicator that this is a promising field for research. We provide a unified framework for incorporating contextual information in photography, and these commercial applications deal with specific use cases that are encompassed as subdomains of our framework.

In our work, we explore contextual data passively collected during the framing procedure in point-and-shoot photography. While there has been research on what kinds of subjects users prefer to photograph, and what they do with the images once they are captured [107], to the best of our knowledge no visual data collection studies on the process of framing an image using a camera have been performed. Holleis et al. [47] presented a preliminary study that suggested that data collected from inertial sensors during framing could be applied to enhance the photographic experience. For example, the camera could automatically switch the display on when looking at the photo after a shot was taken, or tips could be provided to photographers by comparing the handling of the camera between amateurs and professionals. Håkansson et al. [44] studied the implications on user behavior induced by a camera that captures sound and motion information as contextual elements and uses this information to manipulate the photographs taken. However, our work includes the unique aspect of collecting and using a viewfinder image stream as contextual information to improve or create new photographs. Bourke et al. [14] use GPS and compass output, and parameters determined by the auto-exposure algorithm (ISO, aperture size, and exposure time) to recommend matches for a taken picture, which are retrieved from online databases. An interface to aid in reframing a selected recommendation is also

proposed. Nevertheless, it also does not take viewfinder frames into account when making recommendations.

2.3 Supporting Computer Vision Algorithms

In this section, we provide a brief overview of previous work related to the computer vision algorithms that support our framework: image alignment and moving object detection.

Image Alignment

In the image alignment problem, also known as image registration, the goal is, given two images with overlapping structures, to find a transformation that aligns their content. This is a core area of computer vision research, and surveys have been published [15, 98]. A popular approach is to first extract keypoints from the images, match them to find a set of correspondences, and then use the relationships between corresponding keypoints to estimate a geometric transformation that warps one image onto the other [40]. Approaches can vary according to the kind of keypoints extracted, the matching strategy, and the type of geometric transformation used.
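The keypoint-based pipeline described above can be instantiated with standard OpenCV components; ORB features and RANSAC homography estimation are our illustrative choices, and the surveyed methods differ in the features and transformation models they use.

```python
# Sketch: align image 'src' to image 'dst' via keypoint matching and a homography.
import cv2
import numpy as np

def align_homography(src, dst, min_matches=20):
    gray_src = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY)
    gray_dst = cv2.cvtColor(dst, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(gray_src, None)
    k2, d2 = orb.detectAndCompute(gray_dst, None)
    if d1 is None or d2 is None:
        return None                                     # not enough texture
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)
    if len(matches) < min_matches:
        return None                                     # alignment failure
    pts_src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    pts_dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(pts_src, pts_dst, cv2.RANSAC, 3.0)
    if H is None:
        return None
    h, w = dst.shape[:2]
    return cv2.warpPerspective(src, H, (w, h))          # src warped into dst's frame
```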

The Viewfinder Alignment [2] technique has recently been introduced for aligning viewfinder frames in mobile device camera applications. It provides a fast algorithm based on alignment of edge and corner features to estimate a similarity transformation between frames, given by two-dimensional translation, rotation, and scaling. This assumes that variations are quite small between subsequent frames and perspective effects are negligible. Also targeting the mobile device scenario, Wagner et al. [111] proposed a method for image alignment and creation of panoramas. Corner features are extracted, and correspondences are found and used for estimating a homography transformation between frames. Aligning sequences of images captured with varying parameters, such as focus and exposure time, can be particularly challenging due to the differences introduced by the parameter values. Ward introduced a registration method specially suited for exposure stacks [112], with the ultimate goal of creating a high dynamic range image. A method for non-linear registration of image stacks for computational photography in the presence of scene changes has also been recently proposed [48].

Moving Object Detection

Analyzing images to extract regions that correspond to moving objects in a scene is another classic computer vision problem. This is typically done on video

data, by analyzing differences between frames to infer moving areas. Defining what kinds of changes are relevant depends on the particular application at hand [80]; in a surveillance scenario, changes due to illumination or weather may not be relevant, while, in our scenario of viewfinder frames, changes induced by variation of camera parameters should not trigger change detection in the scene. Background subtraction techniques [78, 54, 96, 56, 101, 62, 118] assume a static video camera, and first build a background model from multiple frames. To detect moving objects, new frames are then compared against the background model, and pixels that significantly differ from the background are classified as being part of moving objects. It is challenging to create a background model robust to scene changes, such as illumination and weather conditions. The models can vary from simply computing the median intensity value for each pixel over the duration of the video to multimodal models such as mixtures of Gaussians. A limitation of background subtraction approaches is that when moving objects have intensities similar to the background they may not be detected. Traditionally, background subtraction methods have been applied in video surveillance to automatically detect events of interest and trigger alerts. Also relevant to our framework is the generalized background subtraction problem, where the goal still is to detect moving objects in video, but without the assumption of a static camera. As the camera can freely move, it is much more

challenging to model the background. A simple way to address this problem is by first aligning the video frames using a registration algorithm, and then performing background subtraction in the same way as with a static camera (e.g., [23]). More sophisticated methods have also been proposed [116, 95, 57, 41]. While the techniques show promising performance in a few videos presented in the experiments, it is still very challenging to solve this problem in unconstrained scenarios. Moving object detection finds several applications in image compositing. In [85], algorithms for depicting motion in static images are presented. In [23], a narrative is created from a video by aligning the background areas, stitching them together as a panorama, and detecting and adding moving foreground objects. When combining multiple images in computational photography techniques, a weighted average of pixels from different images can generate ghost-like artifacts in the presence of moving objects. To overcome this, some approaches attempt to detect moving regions and selectively choose pixels from different images to be combined, in order to avoid averaging pixels from moving objects with pixels from static structures [5, 27, 39, 79]. Once moving objects are detected, image inpainting [9] techniques can be applied to remove them from the image while filling the holes left by the objects with appropriate content.
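A minimal sketch of the align-then-subtract strategy mentioned above is given below; it assumes the frames have already been warped into a common reference view (for example, with a homography-based registration as sketched earlier) and uses a simple median background model, which is illustrative rather than robust.

```python
# Sketch: simple moving-object detection on frames already aligned to a common view.
# Median background model + thresholded difference; illustrative, not robust.
import cv2
import numpy as np

def moving_object_masks(aligned_frames, thresh=30, min_area=500):
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY).astype(np.float32)
             for f in aligned_frames]
    background = np.median(np.stack(grays), axis=0)      # per-pixel median background
    masks = []
    for g in grays:
        fg = (np.abs(g - background) > thresh).astype(np.uint8) * 255
        fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
        # discard small connected components (likely alignment noise)
        n, labels, stats, _ = cv2.connectedComponentsWithStats(fg)
        clean = np.zeros_like(fg)
        for i in range(1, n):
            if stats[i, cv2.CC_STAT_AREA] >= min_area:
                clean[labels == i] = 255
        masks.append(clean)
    return masks
```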

2.4 Interestingness Measures

We now review attempts to automatically predict the interestingness of images. This is a very difficult problem due to the inherent subjectivity factor, i.e., images that look appealing to some people may not be of interest to other people. Also, the solution may depend on high-level semantic understanding of images, which is currently an open research direction. A user study where users were given a photo album and asked which pictures they would keep [92] shows that high-level, semantic, and subjective factors (such as composition, presence of people, close-up, and action) are considered more relevant than objective attributes such as noise, contrast, and colorfulness. Determining interestingness in images is still a long-term challenge. Automatically rating interestingness is relevant to our framework, since it would enable automatic suggestion of the most interesting frames from the composition context as additional results of the point-and-shoot process. There have been many attempts at quantifying interestingness according to different criteria. One example is the work of Isola et al. [50], which tries to determine which images would most likely be remembered by viewers after seeing them. Computational aesthetics [52], reviewed in more detail in the next section, aims to quantitatively model and measure aesthetic criteria in images, in order to answer whether an

image is aesthetically pleasing. We also briefly review the area of attention-based visual saliency, which estimates the parts of an image that draw the viewers' attention.

Computational Aesthetics

In recent years, the problem of automatically quantifying the aesthetic quality of photographs using computer programs has been gaining popularity. The goal is to, given a photograph as input, output a score in the [0, 1] interval that represents how aesthetically appealing the picture is. Or, in a simpler version of the problem, a binary label is given to images to classify them as good versus bad, or professional versus snapshot. A few approaches have been proposed making use of machine learning, with classifiers trained from examples taken from Internet websites such as Flickr and photo.net, where visitors can provide ratings to the uploaded pictures. The average rating of a picture is considered as the ground truth about its aesthetic appeal. In [25, 55], the authors propose to extract a set of features from the images and design a classifier that discriminates between high quality or professional photos and low quality photos or snapshots. The features are inspired by compositional rules of thumb in photography, such as the rule of thirds, color balance, and overall image complexity. A review of aesthetics-based features can be found in [75].
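Two of these rule-of-thumb features can be made concrete with short functions; the definitions below are simplified stand-ins of our own, not the exact features used in the cited classifiers.

```python
# Sketch: two simple composition features, in the spirit of rule-of-thirds and
# hue-count measures used in aesthetics classifiers (simplified, illustrative).
import cv2
import numpy as np

def rule_of_thirds_score(subject_center, image_size):
    """Normalized distance of the subject center to the nearest thirds intersection
    (0 = exactly on an intersection). subject_center = (x, y), image_size = (w, h)."""
    w, h = image_size
    x, y = subject_center
    points = [(w * i / 3.0, h * j / 3.0) for i in (1, 2) for j in (1, 2)]
    d = min(np.hypot(x - px, y - py) for px, py in points)
    return d / np.hypot(w, h)

def hue_count(img_bgr, sat_min=50, val_min=40, bin_frac=0.05):
    """Number of significantly occupied hue bins; fewer hues = simpler color palette."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    mask = (hsv[..., 1] > sat_min) & (hsv[..., 2] > val_min)
    hues = hsv[..., 0][mask]
    if hues.size == 0:
        return 0
    hist, _ = np.histogram(hues, bins=20, range=(0, 180))
    return int(np.sum(hist > bin_frac * hist.max()))
```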

Marchesotti et al. [64] design classifiers trained from generic image features (such as a bag-of-visual-words model [24] of SIFT features [61] and color histograms) and claim that they outperform photography-inspired features. Su et al. [97] also use a bag-of-visual-words model based on low-level color, texture, saliency, and edge features to aesthetically classify scenic photographs; they also claim better results over methods that use features inspired by photography rules. On the other hand, Dhar et al. [31] use high-level attribute classifiers, such as obeys the rule-of-thirds, contains people, contains animals, portrait, mountain, and sunset, as predictors of aesthetic value and interestingness. Classifiers based on low-level features are trained for detecting these attributes, and then a classifier based on the output of the attribute detections is trained to classify the image as interesting or not interesting. Nishiyama et al. [69] explore bags of local color harmony features to predict aesthetics. Bhattacharya et al. [11] proposed a regression model based on aesthetics for rating photographs, and an interactive technique to adjust image composition. The model is based on the rule-of-thirds (for single subject photographs) and the golden rule (for landscape photographs). The interactive tool allows for segmentation of the main subject and horizon lines, and suggestions of alternative compositions are presented. Computational aesthetics has also found applications in different related areas. An autonomous robot photographer [18] detects faces using a video camera and

automatically frames shots by reasoning about face placement using the rule of thirds. In [70], an aesthetic quality classifier is applied to achieve automatic image cropping. It is proposed that the classifier would be applied to every sub-window of an image, and the sub-window that maximizes the aesthetic score would be cropped. In [60], the problem is to, given an image, determine the best combination of cropping and retargeting operators in order to maximize the aesthetic value of the result. Features inspired by the rule of thirds, diagonals, visual balance, and size of prominent objects are extracted, and an objective function based on such features is optimized to find the best result. Image aesthetics has also been applied to image ranking and retrieval [115]. The user can adjust the weights of individual features (e.g., rule-of-thirds and color) or choose example pictures for the system to retrieve similar images. In [114], a system to provide feedback to photographers after they take a picture is proposed. The picture is sent to a server and is analyzed to obtain scores of aesthetics and color harmony, and to retrieve a set of similarly composed professional photographs. ACQUINE [26] is an online web-based system that implements a lightweight version of [25] for assigning aesthetic scores to pictures. Users can upload pictures and obtain a rating in response.

Attention-Based Saliency

Attention-based visual saliency [13] aims to identify areas that draw human attention in images.¹ The output of saliency algorithms is typically a saliency map, which assigns a saliency value to every pixel indicating levels of attention. There are two categories of approaches to automatically estimate saliency: bottom-up methods and top-down methods. Bottom-up methods are based on low-level features such as edge orientation, color, and intensities, while top-down methods make use of semantic information, such as the locations of important objects (e.g., faces, bodies, and text), structures, and symmetries. The basic idea is to estimate areas that stand out from the rest of the image.

¹ The text in this section has been adapted from Vaquero et al. [110].

A classic approach for computing bottom-up saliency was proposed by Itti et al. [51]. It is inspired by the human visual system, and is based on low-level features: color, intensity, and orientation. A multiresolution pyramid of the image is built, and significant changes in the features are searched for and combined into a single high-resolution map. Achanta and Süsstrunk [1] proposed a saliency measure based on comparing pixels in a blurred version of the image to the average color of the original image in the Lab color space. Goferman et al. [42] proposed a method that, besides finding salient areas, also includes regions near the salient objects that are important to give them context.
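As an illustration of how simple the Achanta and Süsstrunk measure is to compute, the following sketch (our own illustrative OpenCV code, not part of the cited work) builds a saliency map by comparing each pixel of a blurred version of the image to the mean Lab color:

```cpp
#include <opencv2/opencv.hpp>

// Illustrative sketch in the spirit of Achanta and Suesstrunk:
// saliency(x, y) = squared distance between the image's mean Lab color
// and the blurred Lab value at (x, y).
cv::Mat computeSaliencyMap(const cv::Mat& bgr) {
    cv::Mat lab, blurred;
    cv::cvtColor(bgr, lab, cv::COLOR_BGR2Lab);
    lab.convertTo(lab, CV_32FC3);
    cv::GaussianBlur(lab, blurred, cv::Size(5, 5), 0);
    cv::Scalar meanLab = cv::mean(lab);          // average Lab color of the image

    cv::Mat saliency(bgr.size(), CV_32F);
    for (int y = 0; y < lab.rows; ++y) {
        for (int x = 0; x < lab.cols; ++x) {
            cv::Vec3f p = blurred.at<cv::Vec3f>(y, x);
            float dL = p[0] - static_cast<float>(meanLab[0]);
            float da = p[1] - static_cast<float>(meanLab[1]);
            float db = p[2] - static_cast<float>(meanLab[2]);
            saliency.at<float>(y, x) = dL * dL + da * da + db * db;
        }
    }
    cv::normalize(saliency, saliency, 0.0, 1.0, cv::NORM_MINMAX);  // map to [0, 1]
    return saliency;
}
```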

Gaze tracker systems can also be employed to build a saliency map. Santella et al. [91] used an eye gaze tracker to estimate the regions where users focus their attention, marking them as high-saliency areas. In our framework, information in the composition context can also be useful for estimating an attention map of the scene, by identifying how long the user spent pointing the camera at specific regions (Chapter 5).

Chapter 3

Exploratory Study

In order to obtain initial hands-on experience and better understand contextual information in photography, we performed an exploratory study with users to gather and analyze context data in real-world scenarios. The study provided insights to guide the design of our framework. This chapter describes our setup for collecting data, the procedures followed by the study participants, and an analysis of the dataset we gathered. In a preliminary stage, we reported results in an abstract and poster presentation at the ACM SIGGRAPH conference [109].

3.1 Motivation

We were interested in collecting videos of viewfinder frames during the process of framing a picture, to obtain insights on what information could be leveraged in this context and to better understand how users approach the overall picture framing procedure in point-and-shoot photography. We are not aware of any

other studies that record viewfinder frames during the framing process. Videos composed of the images taken by the camera while the photographer is framing could provide interesting information to aid in the design of our framework. We wanted to understand how capture parameters vary in this scenario, and to obtain an indication of how long users generally take to frame a photograph. Also, we wanted to know how often the images presented on the camera's viewfinder indeed overlap with the actual picture taken¹, and thus are highly correlated to it. To answer these questions, we conducted an exploratory study with users, since the answers directly depend on human factors that relate to how people use cameras to take pictures. We aim to obtain knowledge about the actions performed by users while framing photos using a point-and-shoot camera in automatic mode, in preparation for taking a photograph.

¹ From now on, we refer to this picture as the final picture, as it is the final output of the point-and-shoot photography process.

3.2 Capture Application

In order to collect the aforementioned contextual information, we need a way to effectively record viewfinder frames during the framing procedure using a point-and-shoot camera. The idea is to passively record viewfinder frames, their capture parameters, and inertial sensor data while the user is framing a picture.

However, conventional cameras are not suitable for this task, since in such cameras viewfinder frames are typically displayed and then discarded, and there is no way to access them. The Frankencamera architecture for programmable cameras [4] addresses this limitation and provides a platform for research in computational photography. It includes hardware, an open-source software stack, and a C++ API that allow us to write programs that control the camera and perform on-camera processing. The FCam API [3] gives us flexible control over the imaging pipeline, and every frame requested from the camera comes back tagged with the parameters used during its capture. We used the Nokia N900 Frankencamera as our hardware platform for implementing a point-and-shoot camera that is able to record contextual information during framing. The Nokia phone runs the Linux-based Maemo operating system, has a 600 MHz OMAP 3 processor, 256 MB of RAM, and a 5 MP camera. The N900 includes accelerometers, but does not have gyroscopes. To overcome this limitation, we attached an experimental sensor box, kindly provided by the Nokia Research Center in Palo Alto, that contains accelerometers and gyroscopes, among other sensors. The box communicates with the phone through a Bluetooth connection. Figure 3.1(a) shows the N900 with the attached sensor box.

Figure 3.1: Our contextual information capture prototype. (a) Nokia N900 with attached sensor box; (b) point-and-shoot camera application, including a viewfinder; (c) camera buttons used for digital zoom and autofocus / shutter trigger.

The FCam software package includes an example program that aims to emulate a basic point-and-shoot camera in automatic mode. To create a point-and-shoot camera with contextual information recording capabilities, we started from this example. It continuously captures images and displays them on a viewfinder, and includes algorithms for dynamic and automatic adjustment of exposure time and sensor gain to match the overall brightness of the scene. The application

also has an autofocus procedure, which can be triggered by pressing the shutter button halfway. Once the shutter button is fully pressed, a high-resolution (5 MP) photograph is captured with automatically determined settings (exposure time, gain, focus, and flash) based on the viewfinder frames and the autofocus procedure. The viewfinder frames returned by the FCam library come tagged with the parameters used for capture (exposure time, focal length, aperture size, sensor gain, and focal distance²). Timestamps are also provided, making it feasible to synchronize frames with inertial sensor data collected from the sensor box. Therefore, for every frame that is captured and displayed on the viewfinder, metadata consisting of capture parameters and inertial sensor outputs is available.

To record the viewfinder images, we modified the program so that every captured viewfinder frame is added to a circular buffer before being displayed. The buffer holds the viewfinder frames and capture parameters for the most recent n seconds. As we are interested in measuring how long users take to frame photographs, we opted for setting n to the maximum possible length of time that the Nokia N900 could support. Thus, we set n = 18 with viewfinder frames of 320 x 240 resolution, due to the memory limitations of the device. However, other values could have been used (see Section 4.5 for a discussion).

² The aperture size and focal length are fixed on the Nokia N900 at f/2.8 and 5.2 mm, respectively.
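A minimal sketch of this circular buffer follows (illustrative types and field names, not the actual FCam-based implementation): each new viewfinder frame and its metadata overwrite the oldest entry, so the buffer always holds the most recent n seconds of composition context.

```cpp
#include <vector>
#include <cstddef>

// Illustrative per-frame record; the real application stores the viewfinder
// image together with the capture-parameter tags and inertial sensor samples
// described in the text.
struct ContextFrame {
    std::vector<unsigned char> pixels;  // 320 x 240 viewfinder image
    float exposureSeconds;
    float gain;
    float focusDiopters;
    float zoom;
    double timestamp;                   // used to synchronize with inertial data
};

// Fixed-capacity circular buffer: always holds the most recent frames.
class ContextRingBuffer {
public:
    explicit ContextRingBuffer(std::size_t capacity)
        : frames_(capacity), next_(0), count_(0) {}

    void push(const ContextFrame& f) {
        frames_[next_] = f;                     // overwrite the oldest slot
        next_ = (next_ + 1) % frames_.size();
        if (count_ < frames_.size()) ++count_;
    }

    std::size_t size() const { return count_; }

private:
    std::vector<ContextFrame> frames_;
    std::size_t next_;
    std::size_t count_;
};

// At 25 frames per second and n = 18 seconds, the buffer holds 450 entries:
// ContextRingBuffer buffer(450);
```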

Once the shutter is triggered, the resolution is switched to the full sensor resolution (2592 x 1968 pixels), a photograph is captured and stored, and the viewfinder frames are saved to disk alongside their capture parameters and inertial sensor data. Note that the contextual information is not constrained to instants before the capture of the photograph, but can also include instants after the shutter is triggered. We recorded frames only before the capture of the photograph due to a limitation of our implementation platform: there is a resolution switch lag of about 700 ms when changing from full resolution to viewfinder resolution, making it impractical to record useful viewfinder frames after the capture of the photograph.

The example point-and-shoot camera application provided with the FCam also does not include controls for zoom. To increase the similarity of our point-and-shoot camera application to current consumer cameras, which typically let users manipulate zoom, we modified the FCam library and API for the N900 to allow specification of a digital zoom rectangle that determines the sensor region to be cropped and scaled. We achieved this by performing the following modifications: in the V4L2Sensor module, calls to the Video4Linux2 ioctls VIDIOC_CROPCAP, VIDIOC_G_CROP, and VIDIOC_S_CROP were added to query and set the sensor crop rectangle; the Daemon module was changed to update digital zoom rectangle information on captured Frames, so that they are properly tagged with the zoom level effective during capture; and a control field for digital zoom was added to the N900 Shot. We then modified the point-and-shoot camera application to link the N900 zoom keys to changes in the requested zoom levels, by setting the rectangle to be used for the viewfinder's Shot.
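For illustration, the crop-based digital zoom can be expressed with the standard Video4Linux2 calls named above; this is a simplified sketch rather than the actual V4L2Sensor code, and error handling and device-specific details are omitted.

```cpp
#include <linux/videodev2.h>
#include <sys/ioctl.h>
#include <cstring>

// Query the sensor's default capture rectangle and request a centered crop
// whose size corresponds to the desired digital zoom factor.
bool setDigitalZoom(int fd, float zoomFactor) {
    v4l2_cropcap cropcap;
    std::memset(&cropcap, 0, sizeof(cropcap));
    cropcap.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    if (ioctl(fd, VIDIOC_CROPCAP, &cropcap) < 0) return false;  // crop capabilities

    v4l2_crop crop;
    std::memset(&crop, 0, sizeof(crop));
    crop.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    crop.c.width  = static_cast<__u32>(cropcap.defrect.width  / zoomFactor);
    crop.c.height = static_cast<__u32>(cropcap.defrect.height / zoomFactor);
    crop.c.left   = cropcap.defrect.left + (cropcap.defrect.width  - crop.c.width)  / 2;
    crop.c.top    = cropcap.defrect.top  + (cropcap.defrect.height - crop.c.height) / 2;
    return ioctl(fd, VIDIOC_S_CROP, &crop) >= 0;                // apply the crop
}
```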

The resulting point-and-shoot camera, implemented on a Nokia N900 running our customized version of the FCam library, can then be used to passively (that is, without special actions from the user) record contextual information obtained before the user captures a photograph. The user frames a picture using the viewfinder, and may trigger autofocus and change zoom levels if desired; once the user fully presses the shutter button, a full-resolution photograph accompanied by its contextual information is saved, and the program returns to viewfinder mode so that the process can be repeated. Figure 3.1(b) shows the simple interface of our point-and-shoot camera, including the viewfinder. Figure 3.1(c) shows the buttons used for digital zoom and autofocus.

3.3 Data Collection

We used the point-and-shoot camera described in the previous section to collect contextual data. We recruited 45 volunteers to participate in our study, 24 male and 21 female, with ages ranging from 18 to 30 years old.

Figure 3.2: Some of the participants in the user study framing their photographs.

The group consisted of undergraduate and graduate students from various disciplines at the University of California, Santa Barbara, and the study took place between April and June. A study session was conducted with a single user at a time and lasted about 45 minutes. Each user received $5 for their participation. The session consisted of filling out a pre-study questionnaire about photography knowledge, followed by a quick familiarization procedure to make sure users were familiar with the camera controls (no data was recorded at this point), the actual data collection session, and filling out a post-study questionnaire.

In a data collection session, participants were provided with our camera and requested to walk around the UCSB campus and take pictures of different scenes. The users were videotaped when they authorized it, but this was not a requirement for participation. Figure 3.2 shows some of the participants framing photographs

during the data collection session. They were instructed to try their best to capture compelling images, relying on their own sense of what makes a good picture. We did not tell them that the camera was recording additional contextual information; they were only told to take pictures as they normally would with other cameras.

The users were requested to take three different photographs for each of seven different categories that represent common photographic situations:

- an office environment;
- a close-up scene;
- a building;
- a sign;
- an open area;
- a posed picture of a person or group of people;
- a moving subject.

The participants were free to choose the actual scenes to be photographed, as long as they corresponded to the requested categories. Once these 21 pictures had been taken, the users were then requested to take at least five additional pictures of scenes chosen at their discretion. Figure 3.3 displays some of the

Figure 3.3: Some of the pictures taken by the participants of the exploratory study.

photographs taken during the study. While we recognize that the range of real-world photographic scenarios is very large, and our sample group represents just a small portion of the general population, care has been taken to collect a dataset that captures some day-to-day photography situations. The scene categories have been selected to represent common photography scenarios. We also held study sessions at different times of day and under different weather conditions.

The pre-study questionnaire was composed of questions that attempt to gauge the user's experience as a photographer. First, the users were asked to rate their level of experience as a photographer

on an integer scale from 1 (not at all experienced) to 7 (very experienced). The average response was 4, with a standard deviation of 1.33, a minimum of 1, and a maximum of 7. The second question referred to how long the participant had been a user of digital cameras. The average was 5.75 years, with a standard deviation of 2.71, a minimum of 0, and a maximum of 11 years. Twenty users reported that they take pictures on a weekly basis, 19 do so on a monthly basis, 3 do so on a daily basis, and 2 rarely take pictures. Finally, when asked whether they use the camera's automatic mode most of the time, 28 answered yes, 13 said no, and 4 did not know what the automatic mode is.

In the post-study questionnaire, the users were asked to rate the similarity of the interface of our camera prototype to three types of cameras: digital point-and-shoot, cameraphone, and digital SLR (single lens reflex). The ratings were given on an integer scale ranging from 1 (not at all similar) to 7 (very similar), and the users were asked to leave the response blank in case they were not familiar with a given type of camera. The statistics of the answers were as follows: for the digital point-and-shoot camera, the average rating was 5.37, with a standard deviation of 1.3 and minimum and maximum responses of 2 and 7, respectively; for the cameraphone, the average rating was 5.57, with a standard deviation of 1.28 and minimum and maximum responses of 2 and 7, respectively; and for the digital SLR, the average rating was 2.94, with a standard deviation of 1.57 and minimum and maximum responses of 1 and 7, respectively. All 45 respondents rated the

similarity of our camera to point-and-shoot cameras and cameraphones, but 11 of them left the answer for the digital SLR blank. These ratings are a good indication that the interface of our contextual information capture prototype successfully mimics the user interface of point-and-shoot cameras and cameraphones.

3.4 Dataset Statistics and Analysis

The data collection sessions from our exploratory study resulted in a large database of photos with associated contextual information. A total of 1213 full-resolution (5 MP) pictures were collected. Each picture is associated with its contextual information, given by a low-resolution video (320 x 240) of viewfinder frames, their per-frame capture parameters, and inertial sensor data (accelerometer and gyroscope outputs).

We now present statistics from the dataset. To compute these statistics, we manually watched every video in the dataset and took notes on different observed characteristics. We started by annotating, for each video, the time intervals where the user was framing the photograph. This is difficult to define since we do not have access to information about the user's intentions, but for this purpose we considered the frames for which the user seemed to be deliberately pointing the camera at a scene. We also annotated the time intervals whose views overlap with the one in the final picture.

Given these annotations, we computed statistics on the duration of the framing procedure. Figure 3.4(a) shows a distribution of framing times for the pictures in the dataset, defining the beginning of the framing procedure as the first frame for which the user seemed to be framing, and the end as the time when the final picture is taken. On average, the participants spent 9.89 seconds framing a picture. Due to the memory limitations of the implementation platform, framing durations longer than 18 seconds have been recorded as 18 seconds. It is also interesting to analyze the amount of time during which the user was framing and the viewfinder image overlapped with the final image. Figure 3.4(b) shows a distribution of these times. On average, the camera displayed viewfinder frames that overlapped with the final picture during 9.15 seconds of the framing time.

We also measured the ranges of variations in capture parameters during the framing procedure. For the viewfinder frames that overlap with the final picture (Figure 3.4(b)), we computed the minimum and maximum values of the following parameters: focus, exposure (the exposure time multiplied by the sensor gain), and zoom. The focus variation was measured by the difference between the maximum and minimum focus values, in diopters. The N900 lens can focus from 5 cm to infinity (20 to 0 diopters), and the variations in our contextual data come from the autofocusing procedure.

Figure 3.4: Duration of the framing procedure. (a) Distribution of framing times (average: 9.89 s). (b) Distribution of framing times considering only periods when the viewfinder overlaps with the final image (average: 9.15 s).

The exposure variation was computed in stops, by dividing the maximum exposure by the minimum exposure (an increase of one stop corresponds to doubling the light intensity), and was induced by the autoexposure algorithm, which adjusts the exposure according to the scene being viewed. The zoom variation was given as a magnification change, by dividing the maximum zoom value by the minimum zoom value. The maximum allowed magnification change was 3x, implemented as digital zoom, and the changes in zoom are directly initiated by the user through plus and minus buttons that increase or decrease the magnification by 0.067x. Figure 3.5 shows the distribution of the variations for the images in our dataset.
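With illustrative notation (E for exposure, defined as exposure time multiplied by sensor gain; f for focal distance in diopters; z for zoom magnification), the three variation measures can be written as

```latex
\Delta_{\mathrm{focus}} = f_{\max} - f_{\min} \quad \text{(diopters)}, \qquad
\Delta_{\mathrm{exposure}} = \log_2\!\frac{E_{\max}}{E_{\min}} \quad \text{(stops)}, \qquad
\Delta_{\mathrm{zoom}} = \frac{z_{\max}}{z_{\min}} \quad \text{(magnification change)}
```

where the stop count is the base-2 logarithm of the exposure ratio, since one stop corresponds to doubling the light intensity.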

Figure 3.5: Variations of capture parameters for viewfinder frames that overlap with the final image during the framing procedure. (a) Variations in focus. (b) Variations in exposure (exposure time multiplied by sensor gain). (c) Variations in zoom.

We have also taken notes about other aspects of the contextual videos. Foreground moving objects or subjects were present within the camera's field of view during framing in 643 of the 1213 videos, and in 193 videos there were background moving objects, such as tree branches and waves. There was noticeable depth complexity in 763 videos, causing parallax effects when the camera moves. Flash was used in 43 of the 1213 photos (it was automatically triggered depending

on the light level sensed by the auto-exposure algorithm), and zoom was used in 633 of the final pictures.

There are interesting qualitative observations about the contextual videos. The time interval right before the picture is taken often corresponds to the user slowly framing and adjusting the composition, and there might be other intervals like that in other parts of the video. During framing, it is also natural to perform sudden movements to quickly change views. There was at least one interval of smooth camera motion in 1194 videos, and 823 videos contained at least one sudden camera movement. In 1154 videos there was an interval of camera stillness (disregarding motion due to handshake) before the final picture was taken, while in the remainder of the videos the user suddenly moved the camera right before triggering the shutter (often to catch a moving subject). However, we cannot expect the entire recorded video to consist of smooth framing. Since we are recording the most recent 18 seconds before the final picture is taken, there might be periods of time when the user is erratically moving the camera. Erratic movements typically happen when framing is not in progress, and the camera records random scenes that are not of interest. For example, if the user takes a picture at one site and then walks to another site to take a second picture, part of the moving period between sites may be captured, and random scenes may be recorded at that time (such as the user's feet, or parts of the surroundings such

as the ground or the sky). The frames in these erratic intervals often have a large amount of blur due to camera motion as well.

During framing, there are also varying behaviors in the dataset. In part of the videos, the photographer picks a scene to photograph and adjusts framing around that area without many variations in point of view. However, there are people who attempt to find a scene by looking at the viewfinder, possibly fixating on more than one scene before deciding on one of them to be in the picture. Some people also attempt views from different angles for the chosen scene. In 841 videos, the user seemed to stop walking to adjust the framing while standing in the same place, while in the remainder of the videos the user also walked during framing. There are people who seem to first choose the photographed scene mentally and then point the camera straight at it, while others use the viewfinder to browse the environment and find the scene of interest. Due to the camera interface, most people ended up triggering autofocus right before capture (since halfway pressing the shutter automatically triggers it, which is standard behavior in many cameras), unless they first focused on something and held the shutter button pressed halfway while moving the camera toward a different scene chosen to be in the final image. Some people preferred to zoom into the object of interest, while others walked closer to obtain the same view.

Landscape was the most popular orientation for the final picture, with 867 instances; 339 pictures were taken in portrait orientation; and 7 pictures had a tilted orientation (neither portrait nor landscape). In 207 videos, the user attempted to frame the picture in both portrait and landscape orientation before capturing the photograph. The shutter lag of 700 ms present in our prototype influenced the behavior when capturing moving subjects; oftentimes the moving subject was missed due to the shutter lag. Some users compensated for that by predicting the motion and pointing the camera toward the area where they expected the moving subject to be when the shutter was actually triggered. In those cases, the final image and the viewfinder frames typically do not overlap. Finally, while viewfinder frames may exhibit some level of blur due to camera motion, there are many frames that are sharply captured in periods of stillness or slow motion. We also observed that the amount of camera motion due to handshake can vary significantly among users.

3.5 Discussion

The analysis presented in the previous section indicates that part of the gathered contextual information exhibits useful properties. A large number of frames is highly correlated to the final picture, due to the overlaps between views.

The variations in capture parameters present in the overlapping frames can be understood as imaging the same scene as the final picture under different capture parameters. The duration of the framing procedure may also inform the decision on how much contextual information to capture, by choosing an optimal duration given hardware costs and constraints. In the next chapters, we will introduce our composition context photography framework, which explores the correlation and variations present in the contextual information to generate photo suggestions.

There are possibilities for further studies using the collected dataset. A more sophisticated analysis of framing behavior could be conducted by correlating answers to the questionnaires with the observed user actions. A study could, for example, verify whether there are differences in behavior among users of different experience levels. Also, it could be interesting to investigate whether different scene types induce different behavior. For example, we observed that the viewfinder frames are more likely to contain large variations in field of view when the users were photographing open areas; this is most likely to happen for users who typically frame by looking through the viewfinder and scan the surroundings before deciding on a final picture.

As discussed in Section 3.2, our prototype did not include frames after the photo is taken due to the resolution switch lag present in our implementation platform. In the future, it would be interesting to experiment with this in order

to gather statistics on useful time periods. It is expected that a much shorter recording duration is reasonable for frames after the picture is taken, since the user typically moves the camera away from the scene of interest after taking the shot. A workaround for a future experiment would be to also take the final picture in low resolution, avoiding the resolution switch lag.

Chapter 4

Composition Context Photography

In this chapter, we describe the concept of composition context photography. We discuss the elements pertaining to the design of our framework, including what kind of context information we collect, the output of the process, and an overview of a general solution that addresses the complete point-and-shoot process using the composition context, from its acquisition to the computational generation of photo suggestions.

4.1 The Composition Context

As presented in Chapter 3, a lot of information is available in the process of framing a picture using a point-and-shoot camera. The user is typically adjusting the composition and capture parameters for a certain period of time before the

shutter is triggered. Another observation is that the images presented on the camera's viewfinder to aid the photographer in framing the photograph are probably highly correlated to the final picture. It is likely that the field of view of many of these viewfinder frames will overlap with the region captured by the final picture, and we can understand this process as imaging the same scene under potentially different capture parameters. These parameters may be automatically varied by the camera, such as the exposure time, which is usually controlled by an autoexposure routine that dynamically adjusts the brightness of the viewfinder images in response to the sensed environment; or they may be triggered by the user, such as an autofocus routine that varies the focal distance in order to optimize sharpness for certain areas of the image.

Traditional point-and-shoot cameras display viewfinder frames to aid the photographer in composing the picture, and then save the captured photo to memory once the shutter is triggered. In automatic mode, the camera adjusts the capture parameters for the final picture using measurements collected during framing. In contrast, we propose to explore information available in viewfinder frames and their capture parameters, obtained during the framing procedure, to generate additional photo suggestions to be presented to the user as the outcome of triggering the shutter, thus extending the result from just a single picture to a collection of pictures. For the purposes of this dissertation, we define the composition context

of a photograph as the viewfinder frames displayed by the camera (for a period of time before the shutter is triggered, and possibly for another short period after the shutter is triggered), their capture parameters, and inertial sensor data. However, there are other contextual elements that could also fit within our framework, such as geographical location, obtained from, for example, a GPS receiver; photos previously saved onto the camera's memory; and user-specified preferences. Exploring those constitutes interesting future work.

More specifically, in our approach we record the following contextual information:

- Viewfinder frames: the sequence of images shown to the user through the camera's viewfinder while framing a photograph.

- Per-frame capture parameters. For every viewfinder frame, we record:
  - the focal length of the lens;
  - the focal distance;
  - the exposure time;
  - the gain (ISO) level;
  - the amount of digital zoom, if used;
  - the aperture size.

Figure 4.1: Contextual information we propose to collect, given by viewfinder frames, their capture parameters, and inertial sensor data.

- Inertial sensor data. We obtain sensor outputs from accelerometers and gyroscopes for the duration of the framing period. Most high-end cameraphones and some digital cameras already integrate inertial sensors, and we think it is reasonable to assume that they will be part of consumer cameras in the near future.

Figure 4.1 illustrates the type of contextual information we collect.
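As a small illustration of how the inertial stream can be associated with individual frames using the timestamps mentioned in Section 3.2, the following sketch (illustrative types and names, not the actual recording code) selects the gyroscope samples that fall within a frame's exposure interval:

```cpp
#include <vector>

// One inertial sample as streamed from the external sensor box (illustrative
// layout; timestamps are assumed to share the camera's clock).
struct GyroSample {
    double t;             // seconds
    float wx, wy, wz;     // angular velocity (rad/s) about each axis
};

// Collect the gyroscope samples recorded while a given viewfinder frame was
// being exposed, so that per-frame camera motion can later be estimated.
std::vector<GyroSample> samplesDuringExposure(const std::vector<GyroSample>& gyro,
                                              double frameTimestamp,
                                              double exposureSeconds) {
    std::vector<GyroSample> window;
    for (const GyroSample& s : gyro) {
        if (s.t >= frameTimestamp && s.t <= frameTimestamp + exposureSeconds)
            window.push_back(s);
    }
    return window;
}
```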

4.2 Composition Context Acquisition

In this work, we opted to make minimal changes to the point-and-shoot photography process. We propose to passively record the composition context while the user is framing the picture and adjusting capture parameters. In this way, the variations in capture parameters present in the composition context are either induced by the user, such as variations in point of view, zoom, and focus, or induced by the camera's automatic control of parameters, such as auto-exposure and autofocus.

Alternatively, one could consider introducing active variations of capture parameters into the framing process to induce larger variability in the composition context. However, it is not natural to vary point of view in this case, since it would require some sort of physical movement of the camera. Another undesirable effect is that displaying variations in the viewfinder that were not requested by the user could be distracting or make the user experience worse. Capturing additional frames with exposure times outside the range normally covered by the auto-exposure algorithm could generate motion blur and frame lag, and physically changing the focus setting without user input could introduce undesirable blur. Hence, actively increasing variability in capture parameters without adding possibly disturbing changes to the viewfinder interface is challenging and out of the scope of this work.

However, we argue that the information and variability present in a passively recorded composition context are still useful for creating alternative pictures as the result of the point-and-shoot process. In Section 4.4, we present an overview of how this information can be employed to generate alternative photo suggestions. In Chapters 5 and 6 we present evidence that our framework enables the creation

of a wide range of photo suggestions in real-world situations, by applying our proposed techniques to the dataset collected in our exploratory study.

Also, an advantage of passively recording composition context data is that, from the user's point of view, nothing is changed in the framing process. This could ease adoption, as the users would not need to learn a new mode of interaction with a camera. In addition, the users could still purposely increase the variability of capture parameters in the composition context to a certain extent, by actively panning, zooming, and adjusting focus before taking their pictures. However, we present experimental results that suggest that the natural process of framing a picture already contains useful variability in many cases.

4.3 Composition Context Camera

We propose that our composition context photography framework be integrated into a point-and-shoot camera, which would explore additional information present in the composition context to computationally generate a wide range of photo suggestions to be presented to the photographer once the shutter is triggered. As a result of the picture-taking process, the photographer would obtain not just a single picture of the scene, but possibly additional pictures computed using information from the composition context.

As previously discussed, passively recording composition context information has the benefit of preserving the standard user interface of a point-and-shoot camera. From the user's point of view, our technique follows a workflow very similar to traditional point-and-shoot photography. The main steps are:

1. Turn the camera on;
2. Frame the picture by looking at the viewfinder and adjusting the composition, possibly controlling parameters such as zoom, and triggering autofocus;
3. Trigger the shutter, acquiring a photograph;
4. Review the picture taken, and the additional pictures computed from the composition context;
5. Save the desired pictures to the camera's memory.

This process is the standard for point-and-shoot photography; the only differences are in steps 4 and 5. Instead of simply saving a single picture, the camera would present a collection of suggested variations of the captured scene in addition to the final picture. The user would then select one or more pictures, and these pictures would be saved to the camera's memory.

From the perspective of the camera, we need to introduce the recording of composition context data, and computation to process the recorded data and generate additional pictures:

1. Wait for the user to turn the camera on;
2. Continuously capture and display viewfinder images. Use them for automatic metering, and passively record viewfinder frames and their capture parameters;
3. When the user triggers the image capture, acquire a photograph;
4. Analyze and combine the captured image and the recorded context information to generate suggestions of additional pictures of the scene;
5. Present the image taken by the user and the suggestions, so that the user can choose the desired ones;
6. Save the selected images to the camera's memory.

Compared to standard point-and-shoot photography, the main differences are that we record additional information during the picture framing process, and we use this information once the picture is taken to possibly suggest variations of the captured photograph.

4.4 Generation of Photo Suggestions

Once a picture is taken and its composition context is recorded, we propose to apply several computer vision and computational photography algorithms to analyze and process the data in order to generate and suggest variations of the captured image. We now describe how the composition context data can be processed to create suggestions.

We decided to produce still images as outputs, instead of three-dimensional information or videos. This allows for easy sharing, distribution, publishing, and visualization through conventional means. Adding 3D information to the output would require the use of specialized tools or viewers, and is beyond the scope of this work. However, generating videos or three-dimensional models from composition context data may well be a promising future research direction.

Our photo suggestion procedure starts with a preprocessing stage, in which computer vision algorithms for image alignment and estimation of moving areas are applied. Using results from this stage, we generate two types of photo suggestions: individual frames that maximize a measure of interestingness, and image composites created by aligning and combining multiple frames captured with different parameters.

Image Alignment and Estimation of Moving Areas

In a composition context photography scenario, viewfinder frames that possibly overlap with the final picture are available. It is useful to know the relative pose of each viewfinder frame with respect to the final picture, so that pixel correspondences can be established. In other words, aligning the viewfinder images to the final image allows us to tell, for a given pixel in the final image, which pixels in the viewfinder images represent the same spatial region in the world. This correspondence also allows us to estimate regions of the scene that might be more interesting to the user, by locating areas where many frames overlap. These correspond to areas at which the user spent more time pointing the camera while framing.

Moving objects or subjects are also common in photography scenarios. People blink, animals run away, and lights flash. Famous photographers such as Henri Cartier-Bresson mastered the skill of capturing the right moment [20]. In addition, moving areas change the spatial composition of a scene. We estimate moving areas by aligning frames, building a background model, and comparing each frame against the background model.

We now describe the details of the image alignment and moving area estimation algorithms we use. We note that both are still very challenging problems in computer vision, and the subject of ongoing research efforts. For the purposes of this dissertation, we implemented existing and/or

simple algorithms that suffice for demonstrating our concept. Incorporating more sophisticated techniques to deal with the same problems is compatible with our framework and can be the topic of further research.

Image Alignment

The problem of image alignment, also referred to as image registration, consists of, given two images A and B, finding a geometric transformation that, when applied to image A, warps it so that its features align with image B. Figure 4.2, adapted from Brown and Lowe [16], illustrates the problem.

For this step, we used an algorithm similar to [111], which shows good results for videos taken in mobile scenarios. It first computes feature points using the FAST corner detector [87] on the first frame, and tracks those corners onto the second frame by maximizing a normalized cross-correlation score over neighboring locations. Once feature correspondences are found, random sample consensus (RANSAC) [36] is used to eliminate outliers. The remaining correspondences are then used to estimate an affine homography [45]

$$H_{AB} = \begin{bmatrix} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{bmatrix} \qquad (4.1)$$

that warps image A onto image B. We constrained the transformation to be an affine homography (instead of estimating a full homography with 8 degrees of freedom) because affine homographies led to more stable warpings than full homographies for the images collected using our camera prototype.
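A compact sketch of this kind of alignment using OpenCV is shown below; for brevity it tracks the FAST corners with pyramidal Lucas-Kanade optical flow instead of the normalized cross-correlation search used in our implementation, and estimates the affine part of Equation (4.1) with RANSAC.

```cpp
#include <opencv2/opencv.hpp>

// Sketch: align grayscale frame A to frame B with a RANSAC-estimated affine
// transform (the 2 x 3 part of the affine homography in Equation (4.1)).
bool alignAffine(const cv::Mat& grayA, const cv::Mat& grayB, cv::Mat& affine2x3) {
    std::vector<cv::KeyPoint> keypoints;
    cv::FAST(grayA, keypoints, /*threshold=*/20, /*nonmaxSuppression=*/true);

    std::vector<cv::Point2f> ptsA, ptsB;
    cv::KeyPoint::convert(keypoints, ptsA);
    if (ptsA.size() < 3) return false;            // need at least 3 correspondences

    std::vector<unsigned char> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(grayA, grayB, ptsA, ptsB, status, err);

    std::vector<cv::Point2f> srcPts, dstPts;
    for (size_t i = 0; i < status.size(); ++i) {
        if (status[i]) { srcPts.push_back(ptsA[i]); dstPts.push_back(ptsB[i]); }
    }
    if (srcPts.size() < 3) return false;

    // RANSAC rejects outlier tracks before the least-squares affine fit.
    affine2x3 = cv::estimateAffine2D(srcPts, dstPts, cv::noArray(), cv::RANSAC);
    return !affine2x3.empty();
}

// To warp A into B's coordinate frame:
//   cv::Mat warped; cv::warpAffine(imageA, warped, affine2x3, imageB.size());
```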

Figure 4.2: Image alignment (adapted from Brown and Lowe [16], © 2003 IEEE). (a-b) Images to be aligned; (c) images aligned according to a homography.

In our application, it is useful to align each viewfinder frame to the final picture, in order to use information from viewfinder frames to create variations of

the picture. In other words, the picture F is placed at the origin of the coordinate system, and homographies H_FVi that align each viewfinder frame V_i to F should be estimated.

In a viewfinder scenario, it is expected that the motion between frames will be small, and the overlapping regions should usually be large [2]. When this assumption is true, it is generally possible to find an alignment between every two subsequent frames. However, as discussed in Chapter 3, in practice this does not hold for parts of the recorded composition context data. While there are time intervals when the assumption is satisfied (intervals while the user is slowly framing and adjusting the composition), there may also be sudden movements resulting from the process of framing, and periods of time when the user is erratically moving the camera. Due to these practical complications, there is a possibility that image alignment will fail for parts of the composition context. As a workaround, we propose to extract only the frames from the composition context that are reasonably correlated to the final picture; these are the ones that will likely be useful for our end goal of generating additional photo suggestions. We define this set of frames as the successfully aligned viewfinder frames that either directly overlap with the final picture, or that do not directly overlap with the final picture but can be successfully aligned to another viewfinder frame that does. For the first type of frames, a homography that maps the viewfinder frame onto the final image

constitutes the alignment transformation. For the second type, if V_i is the viewfinder frame in question and V_j is another viewfinder frame that aligns to the final picture F, the alignment is given by a composition of the homographies that warp V_i onto V_j and V_j onto F. Finding such alignments can be done by first trying to align every viewfinder frame to the final picture, and then repeating the process to align viewfinder frames that did not register with the final picture to viewfinder frames that were successfully aligned.
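In symbols, writing T_{X→Y} for the transformation that warps frame X into frame Y's coordinate system (illustrative notation), the indirect alignment is simply the composition

```latex
T_{V_i \rightarrow F} \;=\; T_{V_j \rightarrow F} \circ T_{V_i \rightarrow V_j},
```

which, for the affine homographies of Equation (4.1), corresponds to multiplying the two matrices.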

Detection of Moving Areas

The problem of detecting image regions corresponding to areas of movement in the scene is known as background subtraction [78]. Typically, it uses a static camera at a fixed point of view to capture video, builds a background model from the video, and then compares new frames against the background model to determine regions of change, which serve as estimates of foreground or areas in motion. This problem is already very challenging when using a static camera, given that dynamic variations of the background can occur due to illumination or factors such as snow and rain. In our scenario, we also would like to detect moving areas in the scene, but the assumption of having a fixed static camera is invalid, making the problem much more difficult to solve. This version of the problem, which considers a moving camera, is referred to as a generalized background subtraction problem. A few approaches have been proposed to address it [116, 95, 57, 41]. Nevertheless, while there has been good progress to date, the presented results suggest that the algorithms still do not perform satisfactorily in general scenarios such as the composition context.

It is not within the scope of this work to improve on existing solutions to the generalized background subtraction problem, or to implement sophisticated existing algorithms. However, we still would like to provide hints on the usefulness of moving object detection for creating new pictures using composition context data. As a component of our proof-of-concept system, we opted to design a simple approach that has its limitations, but is still useful. Our approach is inspired by a simple method [23], which consists of aligning and differencing frames. However, our method also includes other steps to address practical issues found in our particular scenario.

Our algorithm for moving object detection consists of the following steps. It processes a video given by the final picture (as the first frame) followed by the viewfinder frames from the composition context, and outputs, for every frame, a binary mask image that indicates the pixels corresponding to detected moving object areas.

1. Filter out frames with large camera motion: blur due to camera motion and rolling shutter distortion [38] can cause undesirable artifacts when building a

background model. We discard viewfinder frames that have a considerable amount of camera motion, estimated from the collected gyroscope data. Similarly to the Lucky Imaging application discussed in the Frankencamera paper [4], we measure the gyroscope output for the duration of the exposure of every frame. Frames whose sensor output is larger than a threshold are dropped. Figure 4.3 illustrates the idea.

2. Filter out frames by focus and exposure: viewfinder frames whose focal distance or exposure (defined as the exposure time multiplied by the gain level) considerably differ from the focal distance and exposure with which the final picture was captured are also discarded. Those variations induce intensity and blur changes in the frames that can be problematic in the background subtraction process, and are difficult to compensate for. Figure 4.4 shows examples of frames eliminated by this filter.

3. Align images: the final image is aligned to every viewfinder frame using the algorithm described in the previous section. For the purpose of moving object detection, we keep only the frames that directly overlap with the final image after alignment.

4. Filter out frames by alignment: in practice, we observed that viewfinder frames which, after warping by the homography, had a shape that

considerably differed from the original rectangular shape of the final picture were more likely to generate artifacts in the moving object detection output due to registration mistakes. Therefore, we introduced a filtering step to keep only the frames whose corners were contained within a given radius of each of the final picture's corners, after warping by the homography. After this step, we have a stack of registered frames with small variations in point of view, exposure, and focus, and very small camera motion per frame. Figure 4.5 shows alignment results kept or discarded by this criterion.

5. Normalize intensities: there may still be small variations in exposure present in the registered image stack. To ameliorate this, we normalize the pixel intensities. First, we select an anchor frame in the image stack by picking the frame that maximizes the number of pixels that are not overexposed or underexposed (this is estimated by counting the number of pixels with intensities between 5 and 250). Then, we compute the mean intensity for every image (also disregarding very dark or very bright pixels). Finally, every image is normalized by multiplying each pixel by the ratio of the anchor's mean to the image's mean. This procedure attenuates variations in intensity due to small changes in exposure. Figure 4.6 displays the selected anchor frame, and the intensity normalization result for one of the frames.

6. Compute background image: from the stack of aligned and normalized images, we create a background model of the scene. Our background model is simply obtained by computing, for every pixel, its median value over the corresponding pixels in the stack, resulting in a single image. In this stage, errors in the alignment of the image stack lead to inaccuracies in the background image. Most alignment errors we found are on the order of 1 to 2 pixels, so problems tended to appear near the edges of objects. To alleviate those, we introduced a heuristic that first applies the Canny edge detector [19] to find edges in every image in the stack, then morphologically dilates the edge map by a 3 x 3 box, and disregards any pixels in the dilated edge map when computing the background model. The problems near edges will still be present when the alignment errors are larger than 2 pixels, and this also prevents detection of very thin moving objects and structures. However, this heuristic eliminates many false positive detections near edges, which are considerable in our scenario due to alignment errors. Figure 4.7 shows two examples of background models.

7. Compare to background image: for every image in the registered stack, we compare the intensity of each of its pixels against the intensity of the corresponding pixel in the background model. Pixels that considerably differ from the background model are then marked as foreground. This is

performed by thresholding the absolute value of the intensity difference. To avoid problems near edges due to alignment, pixels near edges are immediately marked as background using the same edge estimation procedure from the background model step. The result of this step is a stack of binary mask images that depict foreground areas. Figure 4.8 illustrates the background subtraction procedure.

8. Morphological filtering: the masks for foreground objects may still contain false positives due to misalignment or noise. We use morphological operators to filter out thin structures and close small holes. We perform, in sequence: a morphological opening by a 3-pixel horizontal line, a morphological closing by a 3-pixel horizontal line, a morphological opening by a 3-pixel vertical line, and a morphological closing by a 3-pixel vertical line. Finally, we perform a morphological closing by an 11 x 11 square to close larger holes and connect blobs. Figure 4.9 illustrates the results of morphological filtering.

9. Connected component filtering: we keep only objects with area larger than a threshold (in our implementation, 350 pixels), eliminating small detections but keeping larger blobs. This is performed by finding connected components on the binary images and filtering them by pixel count.

Figure 4.10 shows an example where a small blob that passed through previous filtering steps is now eliminated.

10. Segmentation refinement: finally, the remaining objects might have inaccuracies near their edges due to the heuristics utilized to filter noise and small structures. To refine the boundaries of segmented objects, we use the GrabCut segmentation algorithm [88]. For a given detected object, we find its smallest enclosing box and scale it by a factor of 2. We use this scaled rectangle as input to the GrabCut algorithm. The pixels inside the rectangle are marked as definite foreground if they belong to the detected object, and as probable background otherwise. Pixels outside of the rectangle are marked as definite background. After one iteration of the GrabCut algorithm, the detected object pixels are set to be the pixels now classified as either probable foreground or definite foreground. Figure 4.11 shows two examples of refinement.

The proposed method has a few limitations, such as detecting moving objects only in frames that have capture parameters similar to the final picture's parameters, missing thin objects or structures, missing moving objects with intensities similar to the background, inaccurately finding boundaries of objects that are not properly detected using the GrabCut method, and possibly outputting false positives when the alignment algorithm does not perform correctly.

Figure 4.3: Filtering frames by camera motion. The idea is to eliminate frames with large camera motion, which are likely to be blurry and contain rolling shutter artifacts.

Figure 4.4: Filtering frames by exposure and focus to avoid extreme variations in intensity and blur (panels: final image, different exposure, different focus).

Figure 4.5: Filtering frames by alignment. We observed that frames with larger displacements from the original were more likely to contain registration failures. In this visualization, we added the two frames after registration.

Figure 4.6: Intensity normalization (panels: anchor, before normalization, after normalization).

Figure 4.7: Background model. The black areas near edges have been disregarded due to alignment errors.

Figure 4.8: Background subtraction procedure. Pixels that significantly differ from the background model are marked as foreground.

Figure 4.9: Morphological filtering.

Figure 4.10: Filtering by connected component size. Small blobs are discarded.

However, the method does a good job at detecting large objects, which we explore in other components of our framework. The variation of capture parameters inherently present in our scenario provides a challenge for moving object detection, and we believe that interesting future work to stabilize appearance changes induced by these variations could be explored, along the lines of [34].
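A condensed sketch of the core of steps 6 through 9 is given below (OpenCV, with illustrative thresholds; the edge-masking heuristic and the exact sequence of line-shaped structuring elements from step 8 are simplified here): a median background is computed from the aligned, normalized grayscale stack, each frame is differenced against it, and the resulting mask is cleaned up morphologically and by connected-component size.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>
#include <algorithm>

// Median background model from a stack of aligned, normalized grayscale frames.
cv::Mat medianBackground(const std::vector<cv::Mat>& stack) {
    cv::Mat bg(stack[0].size(), CV_8U);
    std::vector<unsigned char> values(stack.size());
    for (int y = 0; y < bg.rows; ++y) {
        for (int x = 0; x < bg.cols; ++x) {
            for (size_t i = 0; i < stack.size(); ++i)
                values[i] = stack[i].at<unsigned char>(y, x);
            std::nth_element(values.begin(), values.begin() + values.size() / 2, values.end());
            bg.at<unsigned char>(y, x) = values[values.size() / 2];
        }
    }
    return bg;
}

// Foreground mask for one frame: threshold the difference to the background,
// clean it up morphologically, and keep only large connected components.
cv::Mat foregroundMask(const cv::Mat& frame, const cv::Mat& background,
                       int diffThreshold = 30, int minArea = 350) {
    cv::Mat diff, mask;
    cv::absdiff(frame, background, diff);
    cv::threshold(diff, mask, diffThreshold, 255, cv::THRESH_BINARY);

    cv::Mat small = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN, small);     // remove thin false positives
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE,
                     cv::getStructuringElement(cv::MORPH_RECT, cv::Size(11, 11)));

    cv::Mat labels, stats, centroids;
    int n = cv::connectedComponentsWithStats(mask, labels, stats, centroids);
    cv::Mat filtered = cv::Mat::zeros(mask.size(), CV_8U);
    for (int i = 1; i < n; ++i) {                            // label 0 is the background
        if (stats.at<int>(i, cv::CC_STAT_AREA) >= minArea)
            filtered.setTo(255, labels == i);
    }
    return filtered;
}
```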

Figure 4.11: Final results after refinement using the GrabCut segmentation algorithm.

Frame Selection and Image Composites

In this section, we provide an overview of the methods we use for the generation of photo suggestions from composition context data. We create two types of suggestions: interesting frames, which are single frames selected from the viewfinder stream; and image composites, which are generated by fusing information present in multiple composition context frames.

Selection of Interesting Frames

As illustrated in Chapter 1, automatically selecting and recommending frames from the composition context can be useful to provide interesting alternative moments or views of the scene. Automatically determining what is interesting in the video is a long-term challenge, but in Chapter 5 we describe a possible structure for a solution along these lines. We present multiple measures of the quality or interestingness of frames that aim to capture different aspects of good images. As a measure of quality, a sharpness indicator predicts how blurry the image is given inertial sensor data. Measures that attempt to model different aspects of aesthetics are also included, such as an estimator for the use of the rule of thirds and measures of the simplicity of structures and colors. In addition, we attempt to capture interestingness in the scene. The presence of moving objects in an otherwise static composition context may be an indicator of an interesting event. From the composition context video, it is also possible to create a map of the scene that highlights areas where the user spent more time framing, and thus could be an indicator of what draws attention. We finally combine these techniques into a solution that generates individual frame recommendations.
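One possible way to combine such measures, sketched below with illustrative weights and field names (the actual combination strategy is described in Chapter 5), is to normalize each measure to [0, 1] and rank frames by a weighted sum:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-frame quality measures, each normalized to [0, 1].
struct FrameScores {
    int frameIndex;
    float sharpness;      // e.g., low gyroscope motion during exposure
    float ruleOfThirds;
    float simplicity;
    float attention;      // overlap with regions where the user dwelled
};

// Rank frames by a weighted sum of the individual measures (weights illustrative).
std::vector<int> rankFrames(std::vector<FrameScores> scores) {
    auto total = [](const FrameScores& s) {
        return 0.4f * s.sharpness + 0.2f * s.ruleOfThirds +
               0.2f * s.simplicity + 0.2f * s.attention;
    };
    std::sort(scores.begin(), scores.end(),
              [&](const FrameScores& a, const FrameScores& b) {
                  return total(a) > total(b);
              });
    std::vector<int> order;
    for (const FrameScores& s : scores) order.push_back(s.frameIndex);
    return order;
}
```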

Image Composites

Many computational photography techniques actively control capture parameters, such as illumination, shutter speed, focal distance, aperture size, and field of view, while capturing multiple images. After alignment, the images are then combined to create a picture that could not have been captured with a single shot. Examples of such techniques include all-in-focus imaging [5], where the sharpest regions of every image in a focal stack are combined into a picture that is entirely in focus; high dynamic range imaging [28], where images taken with different exposure times are combined to produce an image with extended dynamic range; multi-flash imaging [83], where the illumination position is varied among different frames in order to reconstruct depth edges based on shadow information; and panoramic stitching [99], where a panorama of a scene is mosaicked from multiple overlapping views.

Inspired by these techniques, the second type of photo suggestion we propose to create is given by image composites from multiple frames of the composition context. As discussed in Chapter 3 and Section 4.1, parts of the composition context can be understood as imaging the scene with different capture parameters. After image alignment, we propose to use selected frames from the resulting image stacks as inputs to computational photography algorithms, in order to create

composites that could not have been captured with a single shot. These can provide compelling alternatives to the final image, as discussed in Chapter 1. When applying image compositing algorithms to composition context data, it is important to select relevant frames as input. Simply providing the entire composition context video may require a huge amount of processing time and memory to run the algorithms, and artifacts resulting from poor registration can also impact the quality of the results. In Chapter 6, we will introduce a method for automatic identification of groups of composition context frames that can be given as inputs to image compositing algorithms. Once the input frames are selected, we use computational photography algorithms to create panoramas [99], collages [71], all-in-focus images [5], extended dynamic range images [28], synthetic long exposure images [100], and flash/no-flash composites [77, 33]. We also show that detected moving objects can be exploited to simulate effects such as panning, by aligning moving objects across multiple frames and then averaging the images, creating background blur while keeping the moving object in focus. A moving subject could be repeated in multiple places in the scene along its trajectory, to depict motion in a static image. And viewfinder frames could be used to inpaint regions corresponding to moving objects in the final picture, removing undesired moving areas.

4.5 Practical Considerations

As defined, the composition context spans a period of time before and after a picture is taken. In theory, an arbitrary number of frames before and after taking the picture could be considered. Ideally, we would like to record frames for a duration that is enough to capture the entire framing process for most users. However, in practice, hardware limitations such as the amount of available memory and processing power constrain the length of the composition context that can be used. On the Nokia N900, we had to constrain the length of the recorded composition context so that the circular buffer would fit into the available memory. Our implementation recorded the most recent 18 seconds of viewfinder frames before a picture is taken, at a rate of 25 frames per second with a resolution of 320x240 pixels. Our exploratory study (see Chapter 3) indicates that, for the photos taken by our participants, a length of 16 seconds is enough to record the entire framing procedure in 80% of the cases. However, there are cases when users took longer than 18 seconds to frame. Increasing the resolution of the viewfinder frames is possible as well, but this would come at the expense of frame rate or length of the recorded composition context.
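As a rough sanity check of these constraints, the following back-of-the-envelope computation estimates the memory footprint of the circular buffer, assuming uncompressed RGB frames (the actual pixel format used by the viewfinder may differ):

```python
# Estimate of the circular buffer footprint for 18 s of viewfinder frames.
# Assumes 3 bytes per pixel (uncompressed RGB); an assumption, not the
# prototype's actual format.
seconds = 18
fps = 25
width, height, bytes_per_pixel = 320, 240, 3

frames = seconds * fps                                # 450 frames
frame_bytes = width * height * bytes_per_pixel
total_mb = frames * frame_bytes / (1024 ** 2)
print(f"{frames} frames, ~{total_mb:.0f} MB")          # ~99 MB of frame data
```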

We envision that the entire composition context photography pipeline, from capture to generation of suggestions, could be integrated into a camera. Our proof-of-concept prototype performs the collection of context data on camera and generates the photo suggestions on a laptop computer, but having the entire pipeline integrated into a camera would allow the photographer to choose among suggested images right after capture, without the need to download the images and the composition contexts to their computers for later processing. Optimizing the suggestion generation algorithms to run on mobile devices is another interesting topic for further investigation. Also relevant are compressing the composition context while streaming viewfinder frames, so that the recording can be done more efficiently, and accelerating the generation of suggestions by exploiting unused cycles during viewfinding (for example, performing image alignment). The resolution of the viewfinder images on our implementation platform was low compared to the resolution of the final picture. For the purpose of proving the concept of composition context photography, we downsampled the full resolution image to match the viewfinder's resolution, and generated results with the same resolution as the viewfinder. However, this is merely a limitation of our hardware platform: there are cameras nowadays that can stream high resolution video at acceptable frame rates. Alternatives to increase the resolution of the results could be to apply superresolution methods [49] using data from the composition context,

or to combine the full resolution picture with the low resolution viewfinder frames to synthesize a full resolution video of viewfinder frames [10, 43].

Chapter 5

Selection of Interesting Frames

We now present a method for automatically selecting individual frames from composition context videos, which could then be suggested to the user as part of the result of the point-and-shoot process. Our method attempts to pick frames that are likely to be interesting to the user. We select them by using an attention map created from the composition context in combination with several quality measures. Automatically determining what is interesting in the video is very challenging due to the subjectivity involved and the difficulty of automatically inferring high-level semantic interpretations. However, our approach provides a possible structure for a longer-term solution, which we believe to be a valuable initial step in this direction.

5.1 Overview

Our approach uses multiple measures of quality, which aim to capture different aspects of good images. Each measure provides a score that could then be used to rank the images and pick the best one. For example, the sharpest frame could be picked by maximizing a sharpness measure. Computational aesthetics measures can be used to predict how well each image complies with specific aesthetic guidelines in photography. However, these measures alone are not sufficient to quantify interestingness. As an example, a composition context frame that unintentionally captured the ground while the photographer was walking before beginning to frame a picture may receive a high aesthetic score if the foot of the photographer happened to be placed near one of the rule-of-thirds vertices and had high contrast with respect to the ground. To address these issues, we strategically exploit information about user attention that can be extracted from the composition context. Inspired by eye gaze tracking methods for predicting attention, we propose a method to compute an attention map of the scene based on the views seen during the framing process. The attention map is such that scene areas at which the user spent more time pointing the camera have higher values. In this way, it is possible to estimate the most interesting parts of the scene and constrain the optimization of quality measures

to frames that capture these regions. This avoids suggesting frames for regions at which the user did not significantly look, and is desirable assuming that the user spends more time pointing the camera at regions that are indeed interesting to them. A suggestion algorithm should also avoid recommending frames that look too similar to the final image. To address this, we propose to use the per-frame capture parameter data stored with the composition context to prevent the algorithm from suggesting frames taken from approximately the same view as the final image with very similar capture parameters. Effectively, the results of the suggestion algorithm consist of frames that image the most interesting areas or significantly overlap with them, differ from the final image, and maximize a measure of photographic quality.

5.2 Attention Map and Measure

We now introduce a method for creating an attention map given composition context data and assigning attention scores to composition context frames. Inspired by eye gaze tracking techniques [32], we create a map that associates scene locations with attention scores, which are indicators of how long the user spent pointing the camera at a particular location.

Our algorithm for computing the attention map of a scene given the composition context works as follows. First, we attempt to align all viewfinder frames to the final picture, as described in Chapter 4. Given the homographies, we can then compute the dimensions of the smallest rectangle L that contains the largest frame after warping (notice that after warping by a homography, the original image dimensions may not be large enough to entirely contain the warped image shape). All images are then warped and padded to be of the size of L, i.e., the images will have a portion that corresponds to the actual warped image shape and empty areas that fill space so that the final image is of the size of L (the images in Figure 4.6 illustrate this). We then create an empty image M with the size of L, which will be our attention map. For every pixel in every warped image in the registered stack, we add 1 to that pixel in M if the pixel is part of the warped image. Therefore, the attention map is an accumulator of the effective image regions after warping. Figure 5.1 displays an example of an attention map, where the user initially pointed the camera at the scene present in the final picture, then quickly rotated it to the right and fixated for a brief period of time, and then moved back to point at the original scene location. We can see the attention focused on the left, but there is also a small peak to the right indicating the brief period that the user spent fixating there.
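The accumulation step, together with the per-frame attention score used in the next paragraphs, can be sketched as follows. The code assumes the homographies already map each viewfinder frame into the padded coordinate system of size L, and it omits the details of computing the padding itself.

```python
import cv2
import numpy as np

def attention_map(frames, homographies, out_size):
    """Accumulate warped-frame coverage into an attention map of size out_size = (W, H)."""
    W, H = out_size
    att = np.zeros((H, W), dtype=np.float32)
    for frame, Hmg in zip(frames, homographies):
        # Warp a mask of ones to find which pixels the frame covers after warping.
        ones = np.ones(frame.shape[:2], dtype=np.uint8)
        coverage = cv2.warpPerspective(ones, Hmg, (W, H))
        att += coverage  # add 1 wherever the warped frame actually covers the scene
    return att

def attention_score(att, Hmg, frame_shape, out_size):
    """Average attention value inside one warped frame's footprint (the ATT score)."""
    W, H = out_size
    ones = np.ones(frame_shape[:2], dtype=np.uint8)
    coverage = cv2.warpPerspective(ones, Hmg, (W, H)) > 0
    return float(att[coverage].mean()) if coverage.any() else 0.0
```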

Figure 5.1: Attention map. Brighter values correspond to larger attention values. In this example, the user spent most of the time fixating at the final picture's region on the left, but there was also a brief period of fixation to the right.

Given the attention map, we compute an attention score ATT(i) for every viewfinder frame i by averaging the attention map values inside the warped frame shape. The composition context also contains information that could be used to compute a path of fixations across the scene over time [89]. This could be useful for a more detailed understanding of framing behavior patterns, which is another interesting topic for future research.

5.3 Quality Measures

We now present a few quality measures for images. First, we describe a sharpness measure that attempts to minimize the amount of blur due to camera motion.

This is estimated by using gyroscope data collected during the exposure time of each picture. We then propose a measure based on the size of detected moving object areas, as this can be correlated with interestingness. Finally, we describe three computational aesthetics measures that aim to capture different photography guidelines: the rule of thirds, structural simplicity (avoiding clutter), and color simplicity (avoiding many colors in an image). We note that these measures were chosen as examples of quality criteria that can be optimized, but additional measures could have been considered. For example, noise and well-exposedness could also be taken into account, and other photography-inspired measures could be included. Criteria that attempt to capture important semantic information (such as the presence of faces or specific objects) could also be used depending on the target scenario.

Camera Motion

In the Frankencamera work [4], a lucky imaging application for minimizing blur due to camera motion was presented. The motivation is that when taking long exposure pictures while holding a camera with the hands, camera motion due to handshake can cause blur. However, if multiple shots are taken, the photographer can get lucky and perhaps one or more of the shots will be sharp due to the exposures happening during periods of stillness. In the paper, gyroscope data was

collected during the exposure of each frame and analyzed to select the frame with the least amount of motion. In our composition context scenario, some frames can also contain blur due to camera motion during framing, while other frames might have been captured at periods of stillness. While camera motion is sometimes used to achieve artistic effects, in most cases we would like to suggest frames with the least amount of blur due to camera motion. We apply a solution similar to the lucky imaging application. A camera motion score CM(i) is computed for every frame i in the composition context using the gyroscope data collected during the exposure period of i. Assuming that the gyroscope returns values in the [-1, 1] range for each of its three axes, given by gyr.x, gyr.y, and gyr.z, we calculate for every gyroscope sample the average absolute motion energy:

E_GYR = (|gyr.x| + |gyr.y| + |gyr.z|) / 3    (5.1)

If there are n gyroscope samples collected during the exposure time of a frame i, the average gyroscope motion energy during the exposure is computed as:

Ē_GYR(i) = (1/n) Σ_{j=1}^{n} E_GYR(j)    (5.2)

Finally, the camera motion score for frame i is given by:

CM(i) = 1 - Ē_GYR(i)    (5.3)
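A minimal sketch of this computation, assuming each gyroscope sample is a normalized (x, y, z) triple associated with the frame's exposure interval:

```python
def camera_motion_score(gyro_samples):
    """Camera motion score CM(i) from gyroscope samples taken during a frame's exposure.

    Each sample is a (gyr_x, gyr_y, gyr_z) tuple, assumed normalized to [-1, 1]
    as in Equations 5.1-5.3.
    """
    if not gyro_samples:
        return 1.0  # assumption: no samples available -> treat the frame as still
    energies = [(abs(x) + abs(y) + abs(z)) / 3.0 for (x, y, z) in gyro_samples]
    mean_energy = sum(energies) / len(energies)   # Ē_GYR(i)
    return 1.0 - mean_energy                      # CM(i)
```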

Moving Object Area

In video surveillance, detection of moving objects is typically used to find events of interest and trigger alerts. In our scenario, we propose a quality measure that uses the overall moving object area as an indicator of interestingness. The motivation behind this measure is that the presence of moving objects might mean that something interesting is happening in the scene, as in the butterfly example from Figure 1.1. The measure counts the number of pixels in the image that correspond to moving areas (detected by an algorithm such as the moving object detector from Section 4.4.1) and divides it by the total number of pixels in the image:

MO(i) = #moving object pixels(i) / (width(i) · height(i))    (5.4)

According to this measure, static scenes have a quality score of zero, while scenes with moving objects have a quality proportional to the size of the moving areas. In the future, it would be interesting to design more sophisticated criteria that take into account the motion patterns of objects over time by analyzing the trajectories resulting from tracking individual objects.
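The corresponding computation is straightforward given a binary moving-object mask:

```python
import numpy as np

def moving_object_score(moving_mask):
    """MO(i): fraction of pixels flagged as moving (Equation 5.4).

    `moving_mask` is a binary mask produced by the moving object detector.
    """
    h, w = moving_mask.shape[:2]
    return float(np.count_nonzero(moving_mask)) / (w * h)
```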

Figure 5.2: The rule of thirds (figure adapted from Mai et al. [63], © 2011 IEEE). Important subjects of a photograph should be placed along the thirds lines or near their intersections.

Rule-Of-Thirds

The rule of thirds [63] is a photographic composition guideline that states that the important subjects of a photograph should be placed along the thirds lines or near their intersections. The thirds lines and vertices are defined by dividing the image into nine regions, as illustrated by Figure 5.2. The dividing lines are referred to as thirds lines, and the thirds vertices, also known as power points, are at the intersections of the thirds lines. The rule of thirds is a popular photography guideline, and several computational aesthetics approaches include features inspired by it (e.g., [115, 63, 25, 11], to cite a few). Despite its popularity, it is still challenging to automatically detect the usage of the rule of thirds in images due to the need to locate important subjects, which oftentimes requires semantic understanding of the image content.

Mai et al. classify images as obeying versus not obeying the rule of thirds based on saliency maps and an objectness map [7], and achieve an accuracy of around 80% on their test set. We use a quality measure based on the rule of thirds, inspired by ideas from [115, 63, 11]. We first compute a saliency map of the image i using the method from [1]. We then estimate important subjects by thresholding the saliency map. Pixels with saliency greater than twice the average saliency in the map are deemed salient. Let S be the set of salient pixels in i, and let dist(j, closest(j)) denote the Euclidean distance between pixel j and the rule-of-thirds vertex which is closest to j. We calculate a rule of thirds score as:

RT(i) = 1 - (1 / (max_dist · |S|)) Σ_{j∈S} dist(j, closest(j)) · saliency(j)    (5.5)

where max_dist is a normalization factor given by the maximum possible Euclidean distance that any pixel can have to its closest rule-of-thirds vertex, and saliency(j) is the saliency value of the pixel j in the [0, 1] interval. In this measure, salient pixels that are close to rule-of-thirds vertices contribute less to the sum, while salient pixels far from the vertices contribute more to the sum.

Spatial Distribution of Edges

Ke et al. [55] suggest that simplicity is an important aspect of aesthetically pleasing images. One of their proposed aesthetic features aims to capture

Figure 5.3: Spatial distribution of edges (figure adapted from Ke et al. [55], © 2006 IEEE). It is suggested that cluttered scenes (on the left) tend to be less aesthetically pleasing than simple scenes (on the right).

simplicity by computing the spatial distribution of high-frequency edges in an image. The intuition is that cluttered images probably have edges uniformly distributed across the entire image area, while simple images (e.g., containing a single subject against a uniform background) are more likely to have their high-frequency edges concentrated on the subject of interest. Figure 5.3, from Ke et al. [55], illustrates this. It displays the edge maps of a cluttered scene (on the left) and a simpler scene (on the right). The paper argues that images like the one on the right are more likely to be aesthetically pleasing than images such as the one on the left. We use an aesthetic measure to capture the spatial distribution as in Ke et al. [55]. However, instead of using a Laplacian operator to compute edges, we apply the Canny edge detector [19], since the Laplacian resulted in extremely

noisy edge maps for the images collected using our prototype. Once the edge map is found, the method computes the area of the bounding box that encloses the top 96.04% of the edge energy. The idea is that cluttered backgrounds should result in a large bounding box, while well defined subjects would produce a smaller bounding box. This is achieved by projecting the edge map E onto the x and y axes:

P_x(u) = Σ_y E(u, y)    (5.6)

P_y(v) = Σ_x E(x, v)    (5.7)

Define the widths in pixels of 98% of the mass of the projections P_x and P_y as w_x and w_y. The edge simplicity quality measure for image i is given by

ES(i) = 1 - (w_x · w_y) / (width(i) · height(i))    (5.8)

The edge simplicity quality measure for the edge maps in Figure 5.3 is equal to 0.06 and 0.44, respectively.

Color Simplicity

Another feature proposed by Ke et al. [55] that tries to capture simplicity is the hue count. The idea is that the number of unique hues in professional photographs is typically lower than in snapshots, although each color may contain variability in tones (brightness and saturation levels). We compute the hue count of an image as

Figure 5.4: Color simplicity (figure adapted from Ke et al. [55], © 2006 IEEE). The image on the left has a hue count of 3, while the image on the right has a hue count of 11. The image on the left is arguably simpler than the one on the right with respect to the distribution of colors.

in [55]. First, the image is converted to the HSV (hue-saturation-value) colorspace. Then, only pixels with brightness in the range [0.15, 0.95] and saturation greater than 0.2 are considered to build a 20-bin histogram h of hue values. Let m be the maximum value of the histogram. The hue count N is given by the number of bins which have values greater than αm. The parameter α is set to 0.05 in the paper. Using this algorithm, the hue counts for the pictures in Figure 5.4 are 3 and 11, respectively. Our color simplicity quality measure for image i is then given by:

CS(i) = (20 - N(i)) / 19    (5.9)
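The two simplicity measures above can be sketched as follows; the Canny thresholds and the exact handling of the 98% mass width are illustrative choices, not the precise parameters of our implementation.

```python
import cv2
import numpy as np

def edge_simplicity_score(gray):
    """ES(i): Equations 5.6-5.8. `gray` is an 8-bit grayscale image."""
    edges = cv2.Canny(gray, 100, 200).astype(np.float64)  # thresholds are illustrative
    h, w = edges.shape

    def mass_width(projection):
        """Width (in pixels) covering the central 98% of the projection's mass."""
        total = projection.sum()
        if total == 0:
            return 0
        c = np.cumsum(projection) / total
        lo = np.searchsorted(c, 0.01)   # trim 1% of the mass on each side
        hi = np.searchsorted(c, 0.99)
        return hi - lo + 1

    wx = mass_width(edges.sum(axis=0))  # P_x(u)
    wy = mass_width(edges.sum(axis=1))  # P_y(v)
    # 98% per axis corresponds to the 96.04% of total edge energy mentioned above.
    return 1.0 - (wx * wy) / float(w * h)

def color_simplicity_score(bgr, alpha=0.05):
    """CS(i): hue count with a 20-bin histogram (Equation 5.9)."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hue = hsv[..., 0] * 2.0             # OpenCV hue is 0..179 -> map to degrees
    sat = hsv[..., 1] / 255.0
    val = hsv[..., 2] / 255.0
    valid = (val >= 0.15) & (val <= 0.95) & (sat > 0.2)
    hist, _ = np.histogram(hue[valid], bins=20, range=(0, 360))
    n_hues = int(np.count_nonzero(hist > alpha * hist.max())) if hist.max() > 0 else 0
    return (20 - n_hues) / 19.0
```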

5.4 Frame Selection

Now we describe how to combine the attention map and the quality measures to automatically select suggestions from the composition context. The idea is to select frames that the user had interest in while framing (this is estimated from the attention map), are not too similar to the final picture, and maximize the measures of quality presented in the previous section, either as individual measures or as a combination of measures. This results in photo suggestions that can be presented with labels such as "the sharpest", "the one with simplest colors", or "the one with the best composition". We start by computing the attention map and an attention score, as described in Section 5.2, for every viewfinder frame that can be aligned to the final picture through a homography transformation. We then keep only the frames with attention scores greater than a threshold. We used a threshold of 50, as this corresponds to frames at which the user spent roughly at least 2 seconds looking, since our videos are captured at 25 frames per second. In addition, we filter out frames that after alignment have roughly the same viewpoint (as in Figure 4.5), focus, zoom, and exposure as the final image, to avoid suggesting frames too similar to the final image.
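A sketch of this candidate filtering step is given below; the frame metadata attributes and the exposure/focus tolerances are assumptions made for illustration, since the similarity test is described only qualitatively above.

```python
def candidate_frames(frames, att_scores, final_params,
                     att_thresh=50, exp_tol=0.25, focus_tol=0.10):
    """Keep frames the user attended to that differ from the final picture.

    Each frame is assumed to carry capture metadata (exposure, focus, zoom) and a
    precomputed flag indicating whether it overlaps the final picture's viewpoint;
    the tolerance values here are illustrative assumptions.
    """
    kept = []
    for f, att in zip(frames, att_scores):
        if att < att_thresh:                      # roughly < 2 s of fixation at 25 fps
            continue
        same_view = f.overlaps_final_view         # assumed precomputed flag
        same_exposure = abs(f.exposure - final_params.exposure) <= exp_tol * final_params.exposure
        same_focus = abs(f.focus - final_params.focus) <= focus_tol * final_params.focus
        same_zoom = f.zoom == final_params.zoom
        if same_view and same_exposure and same_focus and same_zoom:
            continue                              # too similar to the final picture
        kept.append(f)
    return kept
```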

At this point, the remaining frames are the ones at which the user spent significant time pointing the camera and that have capture parameters that differ from the final image; we put those in a list. We group the frames in this list into subsets by using the following clustering algorithm: find the frame m with maximum attention score; for every frame in the list, find a homography that directly aligns it to m; the frames that after alignment significantly overlap with m are grouped together and removed from the list of frames. This is repeated until no frames are left in the list; every frame will then belong to a group. We defined frames a and b to significantly overlap when the intersection of a and b after warping corresponds to either more than 30% of a's area or more than 30% of b's area. Once the groups of frames are determined, we then filter out the ones with fewer than 15 frames, as we observed that errors in alignment may induce the creation of very small groups that actually have similar views to frames in the larger groups. Finally, for each group we generate a separate set of suggestions. For every frame i in a group, the following suggestions are generated:

- least camera motion: the frame that maximizes CM(i);

- best agreement with the rule of thirds: the frame that maximizes RT(i);

- least cluttered: the frame that maximizes ES(i);

- simplest colors: the frame that maximizes CS(i);

- best overall frame: the frame that maximizes CM(i) + RT(i) + ES(i) + CS(i) + MO(i).

If there is at least one frame in the group such that MO(i) > 0, the frame that maximizes MO(i) is also suggested as the one that "maximizes action". If the same frame maximizes more than one criterion, the duplicates are omitted and the frame is presented with all labels that describe the maximized criteria. In this way, there may be one to six photo suggestions generated per group. Figure 5.5 illustrates the suggestion algorithm step by step. In this example, two groups that correspond to different views of the scene have been selected, and the best frames according to different criteria have been suggested for each group.

5.5 Experimental Results

We have implemented the suggestion generation algorithm presented in this chapter. The algorithm was then applied to the 1213 composition context instances from the dataset described in Chapter 3. We now present statistics on the image suggestions and the generation process. First, to provide an indication of how many composition context frames are usually candidates for being suggested, we counted the number of frames for each

Figure 5.5: Individual frame suggestion algorithm. First, interesting frames are selected using the attention map. The frames are then grouped by proximity, and quality measures are optimized within each group to generate a set of suggestions and labels.

composition context video such that the registration algorithm was able to find a homography transformation that aligns them to the final image. Figure 5.6(a) shows a histogram of this frame count for the composition context videos in the dataset. In 91 of the 1213 videos the alignment algorithm was not able to find any correspondences with the final image, but for the rest of the videos there was

at least one corresponding frame. On average, 221 frames have been successfully registered per video. We also measured the number of suggestion groups created by our frame clustering algorithm. Figure 5.6(b) shows a histogram of the number of groups per video. In most cases, there is a single group, which corresponds to the final picture. This is analogous to a unimodal attention map. However, there are cases when more groups are created due to the user fixating at different areas during framing, or due to the registration algorithm failing to register images. A nice property is that each group usually contains similar views, and, in the cases when registration fails due to parallax, each group contains similar views of close objects within the group, but different views of those objects among different groups. We also counted the number of frames in each group, i.e., the group sizes. Figure 5.6(c) shows a histogram; on average, each group contained 150 frames. The number of frame recommendations varies between 1 and 6 per group, depending on the number of equivalent recommendations for different quality criteria and the presence or absence of moving objects. Figure 5.6(d) displays the distribution of the number of recommendations per group. The average was about 4.5. Finally, we also measured the number of recommendations per final picture, which depends on the number of groups and the number of

Figure 5.6: Statistics related to the generated suggestions. (a) Number of frames in the composition context that have been registered to the final image using the alignment algorithm. (b) Number of suggestion frame groups found by our suggestion algorithm. (c) Number of frames per group (group size). (d) Number of suggestions generated per group of frames. (e) Total number of suggestions generated for each final picture in the dataset.

recommendations in each group. Figure 5.6(e) shows the histogram of the total number of recommendations per final picture, which has an average value of 6.3. Figure 5.7 displays a few examples of images from our dataset and useful suggestions recommended by our algorithm. The examples include alternative views of the scene (examples in the first row), alternative orientation (second row, left), different exposure and moving objects (second row, right), focus at different distances (flowers), and moving subjects at different moments in time (smile and bicycles).

5.6 Discussion

The problem of automatically determining interesting frames in the composition context is very challenging, and we believe that it can be a topic for a longer-term research agenda. We have provided an initial method that includes three aspects that we consider to be important in a solution to the problem. It is important to determine regions in the scene that should indeed be of interest to the user (e.g., regions related to the final picture rather than erratically imaged scenes such as the ground). To address this, we have proposed the use of an attention map computed from the composition context data. The second important aspect of the solution is to avoid suggesting frames that look too similar to the final

Figure 5.7: A few examples of suggested frames from our dataset. Each pair displays the final picture and one of the suggestions. The suggestions provide interesting alternatives, such as different views, focus at different depths, different exposures, and moving subjects at different moments in time.

image. We have utilized the capture parameters present in the composition context to prevent this. The third aspect is to be able to, among a possibly large set of individual frames, select the one which indeed is the most satisfying to the user as a photo suggestion. As a first step toward this goal, we have applied a few simple

quality measures that attempt to quantify different characteristics of good images, but further research would also involve the creation of more sophisticated measures tailored to the particular user's preferences, taking subjectivity into account. For the first aspect, it would be interesting to pursue the inclusion of semantic information, although providing reliable semantic interpretation of general scenes is still a long-standing problem in computer vision. Another alternative would be to include domain information about the specific picture being taken. For example, if it is known beforehand (or inferred from the final picture) that the user is trying to capture a portrait, only composition context frames that contain the face of the subject would be considered as candidates for suggestion. Related to the second aspect, while capture parameters provide good indicators for detecting frames similar to the final image, there are cases where the parameters alone are not sufficient to predict whether the images will look similar. An example is the focus parameter. Cameras with different focus settings can still produce similar images depending on the depths of the objects in the scene and the size of the depth of field. An image duplicate detection algorithm that takes into account capture parameters could be investigated. For the third aspect, a solution to the problem in our scenario may involve fine-grained comparison between images involving high-level semantic elements,

since there is large correlation among the candidate frames and the decision of which frames are better than others depends on fine detail in many cases, such as the blink of an eye or the orientation of a moving object. This is different from the general scenario addressed by the computational aesthetics approaches, which consider pictures taken under fairly different conditions and scenes. Within a composition context video, it is possible that most or all pictures there are not worth recommending, and it would be interesting to investigate the possibility of taking that into account in the recommendation process. In our approach, we consider the composition context frames in their original form as candidates for suggestion. It would be interesting to investigate the feasibility of augmenting the candidate pool with additional images, such as enhanced versions of the captured frames. For example, frames could be deblurred using inertial sensor information [53] or additional compositions could be attempted [60].

Chapter 6

Image Composites

In this chapter, we introduce a method for using composition context data to automatically generate image composites, i.e., photographs created by fusing information from multiple images. We'll describe the composites we generate, which are created from images taken with different capture parameters or containing moving objects. Then we'll show how to identify groups of input frames in composition context data from which to create these composites. Finally, we discuss the benefits of providing a minimal set of inputs that contains the necessary information for each composite algorithm, in order to avoid fusion artifacts and minimize the use of computational resources. We illustrate the usefulness of the composition context for creating image composites by applying our techniques to the dataset collected in our exploratory study (Chapter 3). To generate the composites, we applied existing off-the-shelf open-source implementations of compositing algorithms, or

we implemented simple methods that suffice for demonstrating that composites can be created from composition context data.

6.1 Composite Types

We start by describing the types of image composites we generate, and the approaches and implementations we employed to create them. The general idea is, given a set of input frames, to fuse information in the multiple frames to generate a new image. We consider two main classes of composites: composites that receive input images captured with different settings, such as field of view, focus, and exposure; and composites that exploit the presence of moving objects to depict motion or manipulate the image.

Panoramas

There are situations when the limited field of view of a camera prevents the photographer from capturing a large scene in a single photograph. A possible approach for overcoming this limitation is to capture multiple pictures from different and overlapping fields of view, and then stitch the pictures together into a panorama that effectively extends the field of view. Figure 1.3(a) shows an example. Typical approaches for panorama creation first align the different images,

project them onto a surrounding sphere or cylinder, and blend the different images into a panorama, possibly accounting for differences in color and illumination across the several images and reducing artifacts due to moving objects or registration mistakes. We have used the panorama stitching implementation from the OpenCV open-source computer vision library [72] to generate panoramas; its pipeline is very similar to the one in [17].

Collages

An alternative way of obtaining an extended field of view consists of creating a collage of multiple pictures with different fields of view. This is achieved by registering the pictures and pasting them on top of each other after alignment. Figure 1.3(b) displays a collage. Differently from panoramas, the images are not blended together, and registration mistakes and differences between the images are evident. However, a collage can also be seen as a useful or artistic way of representing a scene [12, 71]. To create collages, given multiple input images, we first align them. We then copy the registered images on top of each other in the order they have been presented. This creates interesting collages and is sufficient to demonstrate that collages can be created from the composition context. However, more sophisticated techniques for determining the order of inputs could also have been used [71].
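For reference, OpenCV's high-level stitching interface can be invoked as in the sketch below; this is a readily available way to reproduce the kind of pipeline described above, not necessarily the exact configuration used in our experiments, and the binding names may vary slightly across OpenCV versions.

```python
import cv2

def stitch_panorama(images):
    """Stitch a list of selected composition context frames into a panorama."""
    stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
    status, pano = stitcher.stitch(images)
    if status != cv2.Stitcher_OK:
        # Stitching can fail if frames lack overlap or feature matches.
        raise RuntimeError(f"stitching failed with status {status}")
    return pano
```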

Extended Dynamic Range

Cameras have limited dynamic range, which means that at times it is not possible to capture the entire range of light intensities in a scene with a single image. For example, imagine that the photographer is trying to take a picture of a somewhat dark room with a corner well illuminated by light coming in through a window. If a long exposure time is used, the details in the dark areas will be visible in the picture, but the bright areas will appear as white (saturated). On the other hand, using a short exposure time will clearly show the details in the bright areas, but there will not be enough light to make anything in the dark areas visible. An approach to extend the dynamic range is to capture multiple pictures with different exposure times, and then fuse the images by picking pixels from the images where they appear best exposed. Figure 1.3(d) shows an example of a high dynamic range image. As the picture was taken before sunset, the sky is much brighter than the buildings; by combining multiple exposures, it is possible to see the blue sky and the colors of the buildings in the same image. To create an extended dynamic range image given a stack of aligned images taken with different exposures, we have used the Enfuse 4.0 open-source implementation [66] of the Exposure Fusion algorithm [65].
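OpenCV also ships an implementation of the same Exposure Fusion algorithm (MergeMertens), which can serve as a drop-in illustration of the fusion step when Enfuse is not available:

```python
import cv2

def fuse_exposures(aligned_images):
    """Fuse an aligned exposure stack with Exposure Fusion (Mertens et al.).

    The dissertation used the Enfuse implementation; OpenCV's MergeMertens is a
    readily available substitute used here only for illustration.
    """
    merge = cv2.createMergeMertens()
    fused = merge.process(aligned_images)        # floating-point result, roughly in [0, 1]
    return cv2.convertScaleAbs(fused * 255.0)    # back to 8-bit for saving/display
```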

All-in-Focus Imaging

Cameras have limited depth of field, which means that objects located outside of a certain range of depths appear blurry in a photograph. If it is desired that the entire image appears sharp, a possible solution is to capture multiple images while varying focus, so that objects at different depths appear sharp in different images, and then fuse the images by picking the sharpest areas to create an all-in-focus image. Figure 6.1, from Vaquero et al. [108], illustrates this process.

Figure 6.1: All-in-focus imaging (image from Vaquero et al. [108], © 2011 IEEE). A focal stack (a-c) is captured by focusing at different distances, and the images are fused to obtain an all-in-focus result (d). This example was captured by manually focusing a Canon 40D camera and fused using the algorithm in Agarwala et al. [5].

To create all-in-focus images given a stack of aligned images taken with varying focus settings, we also used the Enfuse 4.0 open-source software [66] and specified that sharpness should be optimized rather than exposure in the Exposure Fusion algorithm [65].

Flash/No-Flash Imaging

Photography in low-light environments is challenging due to the lack of light, typically requiring long exposure times to obtain acceptable results. An alternative is to use a short exposure and boost the sensor gain; however, this leads to increased noise in the image. Cameras also have flashes that can be used to introduce illumination into a scene, thus allowing the use of shorter exposure times in dark environments. However, a drawback of images taken with flash is that oftentimes the colors appear unnatural due to the artificially introduced illumination. The main idea in flash/no-flash imaging [33, 77] is to capture two short-exposure images, one with flash and another with boosted sensor gain, and combine the two to create a flash/no-flash composite that preserves the natural appearance of the no-flash image but keeps the noise levels as low as in the flash image. Flash/no-flash photography has also been used to remove glass reflections created by the flash [6]. We generate flash/no-flash composites when the final image is taken with a flash. This is done by combining the final image with one of the composition context frames that aligns to the final image. To simulate a flash/no-flash composite method, we used the Exposure Fusion algorithm [65] to fuse both images in the same way as a multiple exposure stack is fused, also using the Enfuse 4.0 implementation [66] for that.

Synthetic Shutter Speed Photography

Capturing long exposure images with a handheld camera is difficult due to camera motion caused by handshake, which generates blur. To address this problem, Telleen et al. [100] proposed a method to simulate a long exposure image using a sequence of short exposure images. The method aligns the several short exposure images and adds them to obtain a long exposure image; due to the alignment operation, the effects of camera shake are reduced. Figure 6.2 illustrates this.

Figure 6.2: Synthetic Shutter Speed Imaging (figure adapted from Telleen et al. [100], © 2007 The Eurographics Association and Blackwell Publishing). (a) Long exposure handheld; (b) Short exposure; (c) Synthetic shutter speed. By aligning and adding multiple short-exposure frames, a long exposure image is simulated. This image is less likely to suffer from blur due to camera shake.

We created synthetic long exposure composites from composition context data, using a simplified version of the algorithm in [100]. We first align the viewfinder frames to the final picture to create a registered image stack, and then add the intensities of each pixel across the stack and remap the values to the valid intensity range to create the final image.
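A minimal sketch of this accumulation, assuming the frames have already been registered to the final picture:

```python
import numpy as np

def synthetic_long_exposure(aligned_stack):
    """Simulate a long exposure from an aligned stack of short-exposure frames.

    Simplified sketch of the approach described above: accumulate intensities
    and remap the result back to displayable 8-bit values.
    """
    acc = np.zeros(aligned_stack[0].shape, dtype=np.float64)
    for frame in aligned_stack:
        acc += frame.astype(np.float64)
    acc -= acc.min()
    if acc.max() > 0:
        acc *= 255.0 / acc.max()
    return acc.astype(np.uint8)
```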

Motion Summaries

In scenes with moving objects, a way to represent motion in a single image is by displaying the trajectory of moving objects over time. Figure 1.3(g) shows an example, where the person walking from left to right is repeated at different points of his trajectory. More sophisticated methods for representing motion in a single image have also been proposed [85]. When moving objects or subjects are detected in the composition context, we create a composite that attempts to depict motion. We use a simple method: given the aligned images with corresponding moving object masks (detected using the method from Section 4.4.1), we sequentially copy the moving object regions from the composition context onto the final image, without overwriting previously copied pixels from other frames or pixels that have been marked as moving objects in the final image. The frames with moving objects are processed backwards in time, i.e., the first frame to be processed is the closest frame in time to the final image.

Synthetic Panning

In photography, intentional motion of the camera may be used for artistic effects. An example is the panning technique, which consists of moving the camera while tracking a moving subject during the exposure time. If the tracking is successful, the subject will be imaged at the same sensor region during the

entire exposure time, and will appear sharp. The remainder of the image will be blurred due to the camera's movement [76]. Figure 1.3(f) shows a synthetic example, obtained by aligning the images to the position of the moving subject and averaging them. We also create synthetic panning composites from frames in the composition context. To simplify the process, we consider only the frames that have exactly one detected moving blob, and we assume that a single moving object appears in the field of view during framing. In the future, this assumption could be eliminated by tracking moving objects across frames. Given the input frames, all containing a single moving blob, the centroid of the moving object is found, and the images are aligned so that the centroids are at the same position. Finally, the images are added and remapped to the valid intensity interval.

Moving Object Removal

The last type of composite we create aims to remove moving objects from the final pictures. In busy locations, oftentimes pictures are captured with unwanted subjects who entered the field of view at the exact moment when the shutter is triggered. If moving objects are detected in the final image, we attempt to remove them. Given the final image and a stack of aligned images, we search the images in the stack for pixels corresponding to the moving regions in the final image and copy them over the final image if they are not part of a moving object in the stack

image. The copied pixels are marked as inpainted and the process is repeated for the next stack image until all pixels in moving areas have been inpainted or we reach the end of the stack of images. This simple approach demonstrates the concept, but artifacts may be present due to improper segmentation of the moving objects or differences in appearance between the final image and the composition context images. To address those, more sophisticated blending strategies could be applied (e.g., Poisson blending [74]).

6.2 Identifying Composites

The algorithms for creating image composites described in the previous section receive multiple images as inputs and output a single composite. However, simply providing the entire composition context video as input is not ideal, since many frames in the video are not appropriate inputs. For example, to create a high dynamic range photograph, frames that capture scenes different from the final image should not be included, and frames with variations of other parameters (such as focus) may impact the quality of the resulting image. Similarly, to create panoramas, we should avoid including frames that cannot be aligned to the final image, and disconnected frames that have been randomly captured (such as the ground or sky captured during erratic camera movements).

We now introduce an approach to identify groups of frames in the composition context that can be provided as inputs to the composite algorithms. Inspired by the work of Brown and Lowe [16], which searches image collections to identify groups of frames that can be stitched as panoramas, we extend this idea to find inputs for other image composites within the composition context. In addition to frames that can constitute a panorama containing the final image, we find stacks of frames for high dynamic range, all-in-focus imaging, synthetic shutter speed, and moving object composites based on the final image. Figure 6.3 exemplifies the process. Our approach leverages the capture parameter metadata included in the composition context to identify these input groups. For image composites, we opted for using only frames that are likely to exhibit insignificant blur due to camera motion, by filtering out frames with large motion energy using gyroscope data.

Panoramas and Collages

For composites that extend the field of view, we first find all composition context frames that can be connected to the final image through a homography transformation, as described in Chapter 4. Also, frames with differences in capture parameters other than field of view tend to make stitching more challenging and prone to artifacts, so we remove frames with exposure that differs by more

Figure 6.3: Input frame selection process. Groups of frames suitable to be provided as input to different image compositing algorithms are identified and selected.

than 25% of the final image's exposure, frames with focus that differs by more than 10% of the final image's focus, frames with zoom different from the final image's, and frames with moving objects. Finally, we determine whether or not the resulting set of frames significantly extends the field of view. We use these frames to compute the attention map as in Section 5.2, and compare the size of the attention map with the size of a viewfinder frame. If the attention map is at least 120 pixels wider or taller than the viewfinder, we determine that a panorama and a collage should be generated from the selected frames. Otherwise, we do not generate a panorama or a collage for this composition context video.

Image Stack Composites

The remaining composite types are based on stacks of images captured from the same point of view with different parameters or with motion present in the scene. Therefore, for these composites we first select a group of frames that has been captured from approximately the same point of view as the final picture, and then filter frames out based on their camera parameters. To determine whether a frame has been captured from the same view as the final image, we analyze, after warping, how close its shape is to the shape of the final image, as illustrated by Figure 4.5. Frames that are not close enough are discarded. We also require the zoom factor of the remaining frames to be the same as the zoom factor of the

final picture. The frames that pass these filters are put in a stack S. For each composite type, we select frames from S based on their capture parameters.

High Dynamic Range Images. For extended dynamic range images, we keep only the images in S whose focus does not differ by more than 10% from the final image's focus and that do not contain moving objects. We find the frames of maximum and minimum exposure in the remaining stack, and determine that an extended dynamic range composite should be created only if the ratio between the maximum and minimum exposures is greater than 1.5, that is, the exposure varies by at least a factor of 1.5.

All-In-Focus Images. To select a stack of inputs for an all-in-focus composite, we keep only the images in S whose exposure does not differ by more than 25% from the final image's exposure and that do not contain moving objects. To determine whether an all-in-focus image should be created, we count the number of frames in the remaining stack focused closer than 15 cm from the camera, and the number of frames focused farther than 15 cm. If there is at least one frame in each category, we generate an all-in-focus composite using the remaining stack as input; otherwise, we do not generate an all-in-focus image. This criterion was implemented due to the characteristics of our capture device, whose optics determine that differences in blur due to focus are only practically noticeable between

near focus or far focus. With different optics, additional focus levels should be taken into account.

Flash/No-Flash. We create flash/no-flash composites when the final image has been taken using the flash and there is at least one frame in S that does not differ in focus from the final image by more than 10%. Among these frames, we pick the closest frame in time to the final image to be the no-flash input.

Synthetic Long Exposure. For synthetic long exposure images, we keep only the images in S whose focus does not differ by more than 10% from the final image's focus. If the remaining stack is not empty, we create a synthetic long exposure image by combining the final image with the images in the remaining stack.

Motion Summaries. As inputs to the motion summaries algorithm, we select only the frames from S that contain moving objects, plus the final image if it contains moving objects. If there are at least 2 images with moving objects, we create a motion summary composite.

Synthetic Panning. From the stack selected as input to the motion summaries algorithm, we pick only the images that have exactly one moving blob. If there are at least 2 images in the resulting stack, we create a synthetic panning composite.
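As an illustration of how these rules translate into code, the sketch below implements the exposure-stack case (extended dynamic range) described earlier in this section; the frame attribute names are assumptions about how the per-frame metadata might be represented.

```python
def select_hdr_stack(stack, final_frame, focus_tol=0.10, min_ratio=1.5):
    """Pick frames from the stack S that are suitable for an extended dynamic range composite.

    Follows the criteria described above: similar focus, no moving objects, and at
    least a 1.5x spread between minimum and maximum exposure.
    """
    candidates = [f for f in stack
                  if abs(f.focus - final_frame.focus) <= focus_tol * final_frame.focus
                  and not f.has_moving_objects]
    if not candidates:
        return None
    exposures = [f.exposure for f in candidates]
    lo, hi = min(exposures), max(exposures)
    if lo <= 0 or hi / lo <= min_ratio:
        return None   # not enough exposure variation to justify an HDR composite
    return candidates
```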

Moving Object Removal. Finally, we create a moving object removal composite if there are moving objects detected in the final image. We provide as inputs the final image and the stack S (from which non-moving-object pixels are selected while creating the composite).

The result of this input frame selection and composite identification procedure is given by groups of frames to be provided as input to different composite algorithms; however, these algorithms are invoked only for the types of composites for which enough variation in the composition context has been detected to justify the creation of the composite in question. For example, if the user did not move the camera enough while framing to create a panorama, a panorama suggestion is not created; and if there was not enough variation of exposure due to the auto-exposure algorithm, a high dynamic range suggestion is not generated.

6.3 Further Refinement of Input Frames

In this section, we discuss the problem of, given a group of frames selected as input to an image compositing algorithm (as described in the previous section), how to further refine it in order to eliminate redundancy in the data and provide only the most essential images for the composite algorithm to fuse. The motivation is that using fewer images may reduce the requirements on memory and processing

power, and may also help minimize artifacts in the output due to poor registration. For example, when capturing a panorama, if the area covered by one of the frames is already covered by two other frames, only the two other frames could be provided as input, as the first frame contains a redundant view. We discuss a possible solution to the problem of creating all-in-focus images, which we presented as part of the Generalized Autofocus method [108]. The material in the following sections has been adapted from [108], © 2011 IEEE. We leave the solutions to analogous problems for the other composite types to be investigated in future work.

Problem and Motivation

Given the focal stack selected from the composition context to be the input for an all-in-focus fusion algorithm, the problem we address in this section is: given the objects in the scene that appear sharp in at least one of the images, automatically select a minimal set of images from the stack, focused at different depths, such that all of these objects are in focus in at least one image. Eliminating redundant images is beneficial to reduce time and memory requirements for the fusion step. Also, fusing more images than needed increases the likelihood of stitching artifacts, potentially lowering the quality of the result.

Let us give an example. We have collected a stack of images from a scene with three depth layers using a handheld Nokia N900 Frankencamera by varying focus. 24 images have been captured with a resolution of 5 megapixels each, while varying focus from 5 cm to infinity. After image registration, we attempted to fuse the focal stack into a single all-in-focus image using the Exposure Fusion algorithm [65]. Figure 6.4(a) shows the fused result obtained by providing all 24 images as input. The fusion process took 290 seconds on a Linux PC (Intel Core 2 Duo, 2 GHz, with 2 GB of RAM). On the other hand, if we fuse only three images from the stack, which are sufficient to represent all depth layers in focus (Figure 6.4(c-e)), we obtain the result in Figure 6.4(b). The processing time is reduced to seconds. Regarding the visual quality of the output, using the 24 images (Figure 6.4(a)) resulted in several artifacts due to the camera being handheld during image capture, which led to camera motion and accentuated parallax for close objects. This is very difficult to compensate for using image registration; Figure 6.5 illustrates this problem. On the other hand, registration mistakes are likely to be minimized on a smaller set of images, resulting in fewer artifacts (Figure 6.4(b)). This suggests that reducing the number of input images to a minimum may also be beneficial in terms of quality.

Figure 6.4: How the number of input images affects focal stack fusion (figure adapted from Vaquero et al. [108], © 2011 IEEE). (a) Fusion result for 24 input images. (b) Fusion result for 3 images, selected by eliminating redundancy. (c-e) Details of the 3 selected input images.

Solution

Our solution for selecting the minimal set of input images consists of two steps: sharpness analysis, which aims to determine, for each image in the stack, the regions that are sharp and the regions that are blurry; and selection of the minimal set of images, which is done by a plane-sweep algorithm.

Figure 6.5: Registration issues due to parallax (figure adapted from Vaquero et al. [108], © 2011 IEEE). (a-b) Two frames from the 24-image focal stack after registration. (c) The yellow square regions from (a) and (b). Notice the different distances between the brown and green cans due to parallax, caused by handshake.

Sharpness Analysis

The goal of the sharpness analysis procedure is to localize the sharp regions in each image. A sharpness map is an image such that each pixel value is in the [0, 1] range, indicating the level of sharpness of the objects at the region represented by the pixel. In our Generalized Autofocus work, we used sharpness maps computed by the camera hardware, by dividing the image into a grid, running the images through a [-1 2 -1] filter, and summing the absolute value of the responses for all pixels in each block. Given a sharpness map for each input image, the sharpness analysis procedure consists of two classification steps:

1. Foreground/background classification. This step segments the space into foreground (areas that appear sharp on at least one of the images in the

146 Chapter 6. Image Composites stack but are blurry on the others) and background (areas that have similar sharpness in every image of the stack, such as a white wall). 2. Sharp/blurry classification. For a given foreground pixel, this procedure attempts to determine the images on the stack in which the pixel appears sharp, and the images in which the pixel is blurred. Let T be the number of images in the input stack. For a pixel (j, k) inthe map, Sp(i) denotesitssharpnessatimagei in the stack. In order to perform foreground/background classification, the standard deviation of the sharpness values {Sp(i)}, i 2{1,...,T} is computed. The pixel is classified as foreground if > t 1, and as background otherwise. In the Generalized Autofocus implementation, the threshold t 1 =0.05 was experimentally determined to yield good results. The next step consists of performing sharp/blurry classification for the pixels determined to be part of the foreground. For a pixel (j, k)intheforeground,letm be an image that maximizes its sharpness, i.e., Sp(M) Sp(i), 8i 2{1,...,T}. We classify a pixel at level (i) assharp/blurrybythefollowingcriteria: (i) issharpifi = M; (i) issharpifi<m, Sp(M) Sp(i) <t 2,and(i +1)issharp; (i) issharpifi>m, Sp(M) Sp(i) <t 2,and(i 1) is sharp; 127

The threshold t2 allows for some tolerance on the level of blurriness when deciding whether or not a pixel is sharp, by also labeling pixels with sharpness values close to the maximum sharpness in the stack as sharp. In the Generalized Autofocus work, the threshold t2 was experimentally set to 0.2.

Choice of the Minimal Set of Images

The sharpness analysis step outputs a binary sharpness map for each image in the stack, such that sharp pixels have value 1, and blurry pixels have value 0. It also returns a binary foreground mask that represents the foreground and background regions in the scene. Background pixels are not relevant in the computation of the final all-in-focus result; they could be picked from any image, since they all look similar. Hence, only foreground pixels are considered. The goal is to select a collection of images in the stack such that the union of their sharp foreground pixels is equal to all foreground pixels, and the size of this collection is minimized. This problem is known as the set cover problem [22], which in its abstract form is NP-complete. However, we observe that when the input frames are sorted by the camera focus parameter, i.e., going from near focus to far focus, two key properties that cannot be assumed in general set cover problems are present in our problem:

- There is an ordering between sets, given by the focus parameter;
- For a given pixel in the sharpness map, its sharpness varies as a function of the images in the stack. This function is unimodal [67]; therefore, it becomes a box function once it is thresholded by the sharpness analysis procedure.

We exploit this additional knowledge about the problem domain, and propose a plane-sweep algorithm for selecting the minimal subset of required images. The idea is to simulate the lens sweeping through the different depths of the scene, by analyzing the images in the sequence they were captured. A set A of active pixels and a set P of processed pixels are kept in memory. The set A contains pixels that appear sharp in the current image; P contains the pixels for which an image has already been included in the final result and that do not need to be examined again. For every new image, the following sets of pixels are computed from P^C (the complement of P):

- F0: pixels that were sharp in the previous image, but became blurry in the current image;
- F1: pixels that were blurry in the previous image, but became sharp in the current image.

If F0 ≠ ∅, an image must be added to the result, as the pixels in F0 will not be sharp in any further images on the stack. The previous image is added to the result, the pixels in A are added to P (as they have now been covered), and A is emptied. If F1 ≠ ∅, then the new sharp pixels are added to the active set.

A pseudocode version of the algorithm is shown below. It is linear in the number of captured images, as each image in the stack is analyzed only once. Each binary sharpness map in the stack S can be viewed as a set, such that a pixel of a given sharpness map is in the set if and only if it is sharp.

function minimageset(stack of binary sharpness maps S, foreground mask M)
    // consider only foreground pixels
    for each set B in S; do
        B = intersection(B, M);
    end
    S = S.add(empty);              // sentinel set
    set Active = empty;            // current list of sharp pixels
    set Processed = empty;         // pixels for which an image has been chosen
    list selectedimages = empty;   // result
    i = 1;
    for each set B in S; do
        // sharp pixels for which an image has not yet been chosen
        set Sharp = B - Processed;
        set F0 = Active - Sharp;
        set F1 = Sharp - Active;
        // active sharp pixels that became blurry
        if F0 != empty; then
            // add the previous image to the result
            selectedimages.add(i-1);
            // mark the active pixels as processed
            Processed = union(Processed, Active);
            Active = empty;
        end
        // new sharp pixels
        if F1 != empty; then
            // add the new sharp pixels to the active set
            Active = union(Active, F1);
        end
        i = i + 1;
    end
    return selectedimages;
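To make the two steps concrete, the following Python sketch implements both the sharpness-analysis thresholds and the plane-sweep selection described above. It is a minimal illustration under several assumptions: the per-frame sharpness statistics are recomputed in software with a [−1 2 −1] filter rather than read from the camera hardware, the maps are stored as NumPy arrays, and all helper names are ours rather than part of the Generalized Autofocus code.

import cv2
import numpy as np

def sharpness_map(gray, grid=(16, 16)):
    # Approximate the camera statistics: high-pass filter the frame with
    # [-1 2 -1] and sum the absolute responses over a coarse block grid.
    kernel = np.array([[-1.0, 2.0, -1.0]])
    resp = np.abs(cv2.filter2D(gray.astype(np.float32), -1, kernel))
    h, w = gray.shape
    bh, bw = h // grid[0], w // grid[1]
    blocks = resp[:bh * grid[0], :bw * grid[1]].reshape(grid[0], bh, grid[1], bw)
    return blocks.sum(axis=(1, 3))

def classify_sharpness(sharpness, t1=0.05, t2=0.2):
    # sharpness: T x H x W array of per-block values, normalized to [0, 1].
    T = sharpness.shape[0]
    foreground = sharpness.std(axis=0) > t1          # foreground/background step
    best = sharpness.max(axis=0)
    m = sharpness.argmax(axis=0)                     # index M of the sharpest image
    within = (best - sharpness) < t2                 # Sp(M) - Sp(i) < t2
    binary = np.zeros(sharpness.shape, dtype=bool)
    rows, cols = np.indices(m.shape)
    binary[m, rows, cols] = True                     # i = M is always sharp
    for i in range(1, T):                            # propagate towards far focus
        binary[i] |= binary[i - 1] & within[i] & (i > m)
    for i in range(T - 2, -1, -1):                   # propagate towards near focus
        binary[i] |= binary[i + 1] & within[i] & (i < m)
    return binary & foreground, foreground

def min_image_set(binary_maps, foreground):
    # Plane-sweep selection; mirrors the pseudocode above with boolean arrays.
    active = np.zeros(foreground.shape, dtype=bool)      # sharp, not yet covered
    processed = np.zeros(foreground.shape, dtype=bool)   # covered by a chosen image
    selected = []
    sentinel = np.zeros(foreground.shape, dtype=bool)    # flushes the last group
    for i, b in enumerate(list(binary_maps) + [sentinel]):
        sharp = b & foreground & ~processed
        f0 = active & ~sharp                 # were sharp, became blurry
        f1 = sharp & ~active                 # newly sharp pixels
        if f0.any():
            selected.append(i - 1)           # the previous image covers them
            processed |= active
            active = np.zeros(foreground.shape, dtype=bool)
        if f1.any():
            active |= f1
    return selected

# Hypothetical usage on a registered focal stack of grayscale frames:
# maps = np.stack([sharpness_map(f) for f in frames]).astype(np.float32)
# maps /= maps.max()                       # normalize the whole stack to [0, 1]
# binary, fg = classify_sharpness(maps)
# print(min_image_set(binary, fg))         # 0-based indices of the selected frames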

6.4 Experimental Results

In order to get an indication of how many image composites can be suggested based on a passively recorded composition context in real-world photography scenarios, we applied our composite input frame identification algorithms to the data collected during our exploratory study, and used these frames to generate variations of the final image. Our method was able to identify input frames for at least one type of image composite in 1105 of the 1213 composition context videos. Table 6.1 shows the number of composites generated by type, as well as the average number of identified input frames for each of the composites.

Table 6.1: Number of generated image composites using our dataset, and average number of identified input frames per composite type.

    Composite type              Generated    Average #inputs per composite
    Panorama
    Collage
    HDR
    Synthetic Long Exposure
    All-in-Focus
    Flash/No-Flash              29           N/A
    Motion Summaries
    Synthetic Panning
    Moving Object Removal

Figures 6.6 and 6.7 display histograms of the number of input frames per created composite for different suggestion types.

[Figure 6.6: four histograms; #input frames (x-axis) vs. #composites (y-axis).]
Figure 6.6: Distribution of the number of identified input frames per composite. (a) Panoramas and collages. (b) High Dynamic Range. (c) Synthetic Long Exposure. (d) All-in-focus.

[Figure 6.7: three histograms; #input frames (x-axis) vs. #composites (y-axis).]
Figure 6.7: Distribution of the number of identified input frames per composite based on moving objects. (a) Motion summaries. (b) Synthetic Panning. (c) Moving Object Removal.

Figure 6.8 shows the panoramas and collages generated from the composition context for a few pictures in the dataset. They effectively extend the field of view of the final picture, and the collages provide an artistic effect in some cases. Figure 6.9 displays a few high dynamic range and flash/no-flash composites. The dark areas are better exposed in the results. The created all-in-focus composites show many artifacts due to misalignment, and have not been included here. Using a technique such as [108] would potentially address this problem.

In Figure 6.10, synthetic long exposure composites are shown. The composites are typically brighter than the final images, as expected; when moving objects are present, they appear blurred, similarly to motion blur when a long exposure is used. In the second row, the results suggest that synthetic long exposure composites can be useful in low-light situations. Finally, in the third row, when the image alignment registered the stack of images to a particular object, we obtained an effect similar to panning: the aligned object appears sharp in the image, while the rest of the image is blurry.

We have also created composites based on moving objects. Figure 6.11 shows some results of motion summaries and synthetic panning. In Figure 6.12, results for moving object removal are shown. Artifacts appear due to segmentation errors and the blending method used (we simply copied pixels from viewfinder images to fill the holes), but more sophisticated algorithms such as Poisson blending would result in more uniform merging. However, our results demonstrate that the composition context is useful for filling gaps left by moving objects with the proper content, avoiding the need to use inpainting algorithms that guess the content of holes from their neighborhood. Some of these composites show stitching artifacts, but this is due to the simple algorithms used and the basic image alignment and moving object detection methods. Using more sophisticated techniques would ameliorate this. In any case, these examples confirm that the composition context contains information to create interesting image composites, which could be provided as compelling alternatives to the final image.

6.5 Discussion

We have proposed to apply several image compositing algorithms to frames in the composition context to generate photo suggestions. However, simply providing the entire composition context video as input can be inefficient and ineffective; a very important problem is to identify appropriate inputs for each compositing algorithm. To address this problem, we introduced a method to identify suitable groups of input frames in the composition context to be provided as inputs to the image fusion algorithms. Our solution leverages the information about capture

[Figure 6.8 image grid; columns: final picture, panorama, collage]
Figure 6.8: Examples of panoramas and collages created using the composition context.

[Figure 6.9 image grid; columns: final picture, HDR; final picture, flash/no-flash]
Figure 6.9: Examples of HDR images and flash/no-flash composites created using the composition context.

parameters collected with the composition context. We also discussed the problem of further reducing the size of the identified groups to allow more efficient fusion and reduce the likelihood of stitching artifacts. We presented a solution for the case of all-in-focus images, which can then serve as inspiration for future research on minimizing the input stack size for the other composite types. The types of photo suggestions that the technique can generate for a given picture depend on the amount of variation of parameters in the composition context.

[Figure 6.10 image grid; columns: final picture, synthetic long exposure]
Figure 6.10: Examples of synthetic long exposure composites created using the composition context.

This relates to the user's actions while framing the picture, and the scene being photographed. We expect that users spend a few seconds panning and zooming near the scene of interest, and they trigger an autofocus routine. The camera movement while framing should contain a few intervals of fixation, and smooth and slow motion. These assumptions seem to be valid in many situations, as suggested by our study results. The scene being photographed also influences which kinds of suggestions can be generated. For example, high dynamic range suggestions are only created when the auto-exposure algorithm generates sufficiently large variations in exposure time. Motion-based composites are only created when moving objects are present within the camera's field of view.

[Figure 6.11 image grid; columns: final picture, motion summaries, synthetic panning]
Figure 6.11: Examples of motion summaries and synthetic panning composites created using the composition context.

[Figure 6.12 image grid; columns: final picture, moving object removal]
Figure 6.12: Examples of moving object removal using the composition context to fill gaps.
