High Performance Imaging Using Large Camera Arrays

Presentation of the original paper by Bennett Wilburn, Neel Joshi, Vaibhav Vaish, Eino-Ville Talvala, Emilio Antunez, Adam Barth, Andrew Adams, Mark Horowitz, and Marc Levoy. Presented by Jeffrey Sharkey for CS 525.
Introduction

This paper proposes that large arrays of inexpensive cameras offer better performance and more features than an expensive single-camera design. The authors build on previous research by surveying several existing system designs and pointing out the weaknesses of each. They then design and build a new system that overcomes these weaknesses. After building the system, they use it to compare results in three specific areas of high-performance photography: (1) high-resolution video, (2) high-resolution HDR video, and (3) high-speed video. They also explain how they solved several problems typically encountered when using camera arrays.

Motivation

From economics, we know that things are often cheaper in bulk. This also holds for most electronic components, such as digital cameras: regardless of device complexity, large-volume items are almost always cheaper. Take Google, for example, which pioneered the use of simple off-the-shelf computers to build large computing clusters; a single computer with equivalent processing power would have cost hundreds of times what they actually paid. In the same way, the authors propose that an array of inexpensive digital cameras can deliver performance equal to, or even greater than, a single expensive digital camera.

History

The idea of using multiple cameras in a single scene was pioneered in 1996 by Dayton Taylor, who applied it in the film industry to create an effect known as virtual camera movement. You've probably seen this effect in several movies and commercials, but it's best known for its appearance in The Matrix series. By simultaneously triggering a series of still cameras, you can effectively freeze the scene in time. This allows a virtual camera to move through space while the time dimension stays frozen, something not physically possible with a single camera.
However, Taylor's approach used only static camera images, limiting virtual camera motion to the instant of capture. In 1997, the Virtualized Reality Project improved on Taylor's work by using continuous video streams instead of static images. At the time, however, recording massive amounts of video was only feasible using VCRs for storage; computers had only enough speed and storage space to record about 9 seconds of video, and the project required one dedicated PC for every three cameras in the array. The project experimented with both recording methods and concluded that computers must improve significantly before becoming a feasible solution.
Design and Build

The goal of their project was to build an array of inexpensive digital cameras, along with its supporting infrastructure, to meet and exceed the challenges faced by previous projects. It was important that they choose off-the-shelf technology, but they quickly discovered that most commercially available systems didn't offer enough flexibility for their project. So instead they built their system from raw components, maintaining a strict goal of leveraging existing standards while staying away from customized hardware.

For the actual image sensors, they chose CMOS technology over CCD sensors, mostly because CMOS offers a direct digital output and better input controls than CCD. The CMOS industry has also historically focused on high-volume markets, yielding lower prices than CCD sensors.

Instead of sending the raw video streams directly to a computer, the authors added local processing boards. These boards have multiple camera inputs, a Sony MPEG-2 encoder, and an FPGA to help with video preprocessing. The compressed output is sent over a standard IEEE-1394 FireWire bus to a host computer. These local processing boards are important for two reasons. First, the FPGA can rapidly perform preprocessing such as rotation, skew, and color adjustments; because it is field-programmable, these parameters can be adjusted at each calibration, and FPGAs can work with raw video much faster than a traditional computer. Second, compressing the video with MPEG-2 allows more cameras on each FireWire bus. To ensure that video quality isn't lost, a local sanity check maintains buffers of both raw and compressed frames for comparison purposes. Even at high bitrates, MPEG-2 still offers a 17:1 compression ratio.

Finally, they needed to synchronize the timing of each camera in the array.
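To see why MPEG-2 compression matters for bus capacity, here is a back-of-the-envelope sketch. The sensor format (640x480 pixels at 8 bits per pixel) is an assumption of this sketch, not a figure stated in this summary; the 17:1 ratio comes from the paper, and 400 Mbit/s is FireWire's nominal IEEE-1394a bandwidth.

```python
# Back-of-the-envelope: how many camera streams fit on one FireWire bus?
# ASSUMPTION: 640x480 sensor, 8 bits per pixel, 30 frames per second
# (the sensor format is not stated in this summary).
RAW_BPS = 640 * 480 * 8 * 30      # raw bits per second per camera
COMPRESSION = 17                  # MPEG-2 ratio cited in the paper
FIREWIRE_BPS = 400_000_000        # IEEE-1394a nominal bandwidth

compressed_bps = RAW_BPS / COMPRESSION
cameras_per_bus = int(FIREWIRE_BPS // compressed_bps)

print(f"raw:        {RAW_BPS / 1e6:.1f} Mbit/s per camera")
print(f"compressed: {compressed_bps / 1e6:.1f} Mbit/s per camera")
print(f"cameras per bus (bandwidth only): {cameras_per_bus}")
```

Note that this bandwidth-only estimate is far above the roughly 30 cameras per root node the system actually supports, suggesting that protocol overhead and host-side processing, not raw bus bandwidth, set the practical limit.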
Because FireWire doesn't guarantee timing resolution strict enough for their requirements, they built a second, dedicated timing system: a master clock running at 27 MHz that fed uniform triggers to all local processing boards. They showed that this design has no more than 200 nanoseconds of clock skew, far below their minimum timing resolution of 200 microseconds.

When finished, they had an array of 100 inexpensive cameras that could capture video at 30 frames per second. The raw video is compressed by a local processing board, then routed back to a host PC through a tree-like structure of other processing boards. Each root node, usually a host PC, can support about 30 cameras.

High-performance Applications

After building the camera array, they wanted to show how it could be used in three specific high-performance applications.

High-resolution Video

The first application they examined was high-resolution video. By pointing the cameras at a common center with about 50% overlap between adjacent cameras, they created a virtual camera with a field of view of about 30 degrees horizontally and 15 degrees vertically. In addition, most points in the scene are covered by four cameras, allowing noise to be reduced by averaging values. The process of stitching (or mosaicing) together each set of frames was automated using a program called Autostitch, based on previous research at the University of British Columbia. The program uses scale-invariant features to match frames together, then uses homographies to stitch them. Homographies are essentially projective transformations that map points between two planes; in this case, they map pixels from the individual camera images into the coordinate space of the final image.
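The homography step can be sketched in a few lines: a 3x3 matrix maps homogeneous pixel coordinates from one camera's image plane into the mosaic's coordinate space. The matrix values below (a small rotation plus a translation) are made up for illustration, not taken from the paper.

```python
import numpy as np

def apply_homography(H, points):
    """Map Nx2 pixel coordinates through a 3x3 homography H."""
    pts = np.hstack([points, np.ones((len(points), 1))])  # to homogeneous
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                 # back to 2D

# Hypothetical homography: rotate by 2 degrees, then translate.
theta = np.deg2rad(2.0)
H = np.array([[np.cos(theta), -np.sin(theta), 120.0],
              [np.sin(theta),  np.cos(theta),  -5.0],
              [0.0,            0.0,             1.0]])

# Map the corners of an assumed 640x480 camera frame into mosaic space.
corners = np.array([[0, 0], [639, 0], [639, 479], [0, 479]], float)
print(apply_homography(H, corners))
```

Autostitch estimates one such matrix per camera from the matched scale-invariant features; once estimated, warping every pixel through it places each frame into the shared mosaic.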
One major challenge they faced was color calibration. Because the array used many inexpensive sensors, each camera had a different color response. They first turned off each camera's automatic white balancing, which is normally useful but introduces an unknown variable into the system. They found that the sensors had a roughly linear color response except at extreme values, and showed this was acceptable for their approach, because they only needed uniform results across cameras, not exact color accuracy.

High-resolution HDR Video

Because their system had a large number of cameras, they could use varying exposure times to create high-resolution HDR video. High-dynamic-range (HDR) imaging captures a wider range of brightness than a single exposure can record; the result is then tone-mapped so that lighter and darker areas fit a displayable range. It can bring out details in dark areas while still showing detail in light areas without over-exposure.
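A minimal sketch of how differently exposed frames can be merged, assuming the roughly linear sensor response noted above: divide each pixel by its exposure time to estimate radiance, and down-weight values near saturation or the noise floor. The hat-shaped weighting function is a common generic choice, not a method specified in this summary.

```python
import numpy as np

def combine_exposures(frames, exposure_times):
    """Merge frames (float arrays in [0, 1]) into a radiance estimate.

    Assumes a linear response: pixel value ~ radiance * exposure_time.
    """
    num = np.zeros_like(frames[0])
    den = np.zeros_like(frames[0])
    for img, t in zip(frames, exposure_times):
        w = 1.0 - np.abs(2.0 * img - 1.0)  # hat weight: trust mid-tones
        num += w * img / t                 # back out radiance per frame
        den += w
    return num / np.maximum(den, 1e-6)

# Toy scene: four true radiance values imaged at four exposure times,
# with bright pixels clipping at 1.0 (simulated over-exposure).
L = np.array([0.05, 0.4, 2.0, 8.0])
times = [1.0, 0.25, 0.0625, 0.015625]
frames = [np.clip(L * t, 0.0, 1.0) for t in times]
print(combine_exposures(frames, times))  # recovers approximately L
```

Each radiance value is well exposed in at least one frame, so the weighted merge recovers detail that any single exposure would clip or bury in noise.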
To build HDR images, we need identical images with varying exposure times. This is simple for static scenes, where we can take multiple pictures, but difficult for video. A common approach is to split the light from a single lens into three or more channels, each processed by a separate video camera at a different exposure timing. However, these optics are very expensive and offer little control over the entire scene. The authors point out that individual cameras in their array can be configured with different exposure timings; because each camera has an overlapping field of view, multiple exposures from neighboring cameras can be combined into HDR video. In the paper, they show the results of dynamically choosing between four exposure timings: based on the overall brightness of the current frame from a camera, its exposure timing is adjusted before capturing the next frame.

By comparing pixel intensities against a reference image taken by a high-quality Canon camera, they found that the Canon's resolution was roughly 1.5 times better. However, they countered that their system scales nicely and can increase its resolution simply by adding more cameras. They also found that the contrast of their array was worse than the reference Canon's, which they attributed to the cheap lenses they were using; contrast can always be improved with post-processing.

High-speed Video

Because their system can be precisely timed, it can create high-speed video by staggering the trigger times across the array. Using only 52 completely overlapped cameras, they can simulate a 1560-frames-per-second camera. To their knowledge, no non-classified single camera can meet these specifications. In addition, the speed increases linearly with the number of cameras, effectively giving unlimited resolution in the time domain.
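The staggered-trigger arithmetic is straightforward and can be sketched as follows: n cameras each run at the base frame rate, with start times offset by equal fractions of one frame period, so their frames interleave into a single high-speed stream.

```python
def stagger_offsets(n_cameras, base_fps):
    """Compute trigger delays that interleave n cameras into one stream.

    Each camera runs at base_fps; offsetting camera i's trigger by
    i * (frame_period / n_cameras) yields evenly spaced samples in time.
    """
    period = 1.0 / base_fps            # one frame period, in seconds
    step = period / n_cameras          # trigger offset between cameras
    effective_fps = base_fps * n_cameras
    offsets = [i * step for i in range(n_cameras)]
    return effective_fps, offsets

fps, offsets = stagger_offsets(52, 30)
print(fps)          # 52 cameras at 30 fps -> 1560 fps effective
print(offsets[:3])  # first three trigger delays, in seconds
```

This is why the effective rate scales linearly with camera count: each added camera subdivides the frame period one step further, at no cost to any individual camera.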
Another feature of a camera array is that each frame can be integrated for longer than the effective frame period, meaning it can capture more light per unit time than is possible with a single high-speed camera.

However, one of the major problems encountered when recording high-speed video from multiple cameras is parallax. The situation is more complex when an object in the scene is moving in addition to the motion of the observation point. In this case, we can stagger the trigger timings of adjacent cameras so that the object appears stationary. The worst-case parallax occurs at the near and far depth planes, and the worst-case temporal motion occurs when the object moves at the maximum velocity on the near-depth plane. We can approximate this temporal motion with a vector along the axis of focus, based on the maximum velocity. Relating this back to the distance between adjacent cameras, we can find the best timing stagger to counteract the motion. Thus, by staggering the timing triggers, we can increase the sampling rate along the time axis at no additional cost. This is beneficial in scenes with high motion, because it reduces interpolation artifacts, and it creates no additional penalties in low-motion scenes. They conclude that staggered timing should always be used in high-speed video applications.

Conclusion

The authors have continued existing work by creating a new system that breaks past the limitations of previous research. They provide excellent details on the architecture of their system, along with rationale for their design decisions. Finally, they examine several applications of camera arrays while also handling the unique problems their array approach raises.