arxiv: v1 [cs.cv] 24 Nov 2017


End-to-End Deep HDR Imaging with Large Foreground Motions

Shangzhe Wu, Jiarui Xu, Yu-Wing Tai, Chi-Keung Tang
Hong Kong University of Science and Technology, Tencent Youtu
arXiv:1711.08937v1 [cs.CV] 24 Nov 2017

Figure 1. Our goal is to produce an HDR image from a stack of LDR images that can be corrupted by large foreground motions. The left shows the three input LDR images and our HDR image after tone mapping. On the right, the first two columns show that the optical flow alignment used by Kalantari [13] introduces severe geometric distortions and color artifacts, which are unfortunately preserved in the final HDR results. The last three columns compare the results produced by other methods with that of our end-to-end approach, where no optical flow alignment is used. Our simple Unet with residual blocks produces a high-quality, ghost-free HDR image in the presence of large-scale saturation and foreground motions.

Abstract

This paper proposes the first end-to-end deep framework for high dynamic range (HDR) imaging of dynamic scenes with large-scale foreground motions. In state-of-the-art deep HDR imaging such as [13], the problem is formulated as an image composition problem: the input images are first aligned using optical flow, which is still error-prone under occlusion and large motions. In our end-to-end approach, HDR imaging is formulated as an image translation problem and no optical flow is used. Moreover, our simple translation network can automatically hallucinate plausible HDR details in the presence of total occlusion, saturation and under-exposure, which are otherwise almost impossible to recover with conventional optimization approaches. We perform extensive qualitative and quantitative comparisons to show that our end-to-end HDR approach produces excellent results in which color artifacts and geometric distortions are significantly reduced compared with existing state-of-the-art methods.

1. Introduction

Off-the-shelf digital cameras typically fail to capture the entire dynamic range of a 3D scene. To produce high dynamic range (HDR) images, custom capture and special devices have been proposed [25, 7, 24], but they are too heavy and/or too expensive for capturing fleeting moments to cherish, which are typically photographed using a cellphone camera. The other, more practical HDR approach is to merge several low dynamic range (LDR) images captured at different exposures. If the LDR images are perfectly aligned, in other words if no camera motion or object motion is observed, the merging problem is considered almost solved [17, 2]. However, foreground and background misalignments are unavoidable in the presence of large-scale foreground motions in addition to small camera motions. While the latter can be resolved to a large extent by a homography transformation without introducing artifacts and large distortions [26], foreground motions make the composition a nontrivial process. Many solutions proposed to tackle this issue are prone to introducing artifacts or ghosting in the final HDR image [14, 31, 13], or fail to incorporate misaligned HDR content by simply rejecting the pixels in misaligned regions as outliers [16, 8, 19]; see Figure 1.

Recent works have proposed to learn this composition process using deep neural networks [13, 4]. The authors of [13] first use optical flow to align the input LDR images, and then feed the aligned LDRs into a convolutional neural network (CNN) to produce the final HDR image. Optical flow is notoriously unreliable, especially for images captured at different exposure levels, and inevitably introduces artifacts and distortions in the presence of large object motions. While [13] claims that the network is able to resolve these issues in the merging process, failure cases still exist, as shown in Figure 1, where color artifacts and geometric distortions are clearly visible in the final results.

A more recent work [4] attempts to reconstruct an HDR image from a single LDR image using a CNN. It demonstrates that the network can hallucinate details in regions where the input LDR exhibits only a very weak response. However, one intrinsic limitation of that approach is its total reliance on a single input LDR image, which often fails to capture any response in some regions of a highly contrastive scene. Therefore, we intend to explore better solutions for merging HDR content from multiple LDR images. Although [4] was not available while we were designing our framework, we include extensive comparisons in later sections.

We regard merging multiple LDR images with different exposures into an HDR image as an image translation problem, which has been actively studied in recent years. [10] proposed a powerful solution for learning a mapping between images in two domains using a Generative Adversarial Network (GAN). Meanwhile, CNNs have been demonstrated to have the ability to learn misalignment [3] and hallucinate missing details [30].
Inspired by these, we believe that optical flow may be overkill for HDR imaging. In this paper, we propose a simple end-to-end network that can learn to translate multiple LDR images into a ghost-free HDR image even in the presence of large foreground motions.

In summary, our method has the following advantages. First, it is trained end-to-end without optical flow alignment, thus intrinsically avoiding artifacts and distortions caused by erroneous flows. Second, similar to [4], our network can hallucinate plausible details that are totally missing, or whose presence is extremely weak, in all LDR inputs. This is particularly desirable when dealing with large foreground motions, because usually some content is not captured in any of the LDRs due to saturation and occlusion. Finally, since the task is formulated as a translation problem, the same framework can be easily extended to more LDR inputs, and possibly to any specified reference image. We perform extensive qualitative and quantitative comparisons, and show that our simple network outperforms the state-of-the-art approaches in HDR synthesis, including both deep learning based and optimization based methods.

2. Related Work

Over the past decades, many research works have been dedicated to the problem of HDR imaging. As mentioned above, one practical solution is to compose an HDR image from a stack of LDR images. Early works such as [17, 2] produce excellent results for static scenes and static cameras.

To deal with camera motions, previous works [14, 26, 11] register the LDR images before merging them into the final HDR image. Since many image registration algorithms depend on the brightness consistency assumption, the brightness changes are often addressed by mapping the images to another domain, such as the luminance domain or the gradient domain, before estimating the transformation.

Compared to camera motions, object motions are much harder to handle. Some methods reject the moving pixels using weightings in the merging process [16, 8]. Another approach is to detect and resolve ghosting after the merging [5, 20]. Such methods simply ignore the misaligned pixels, and fail to fully utilize the available content to generate an HDR image.

There are also more complicated methods [14, 31] that rely on optical flow or its variants to establish dense correspondence between image pixels. However, optical flow often results in artifacts and distortions when handling large displacements, introducing extra complication in the merging step. Among the works in this category, [13] produces perhaps the best results, and is highly related to our work. The authors proposed a CNN that learns to merge LDR images aligned using optical flow into the final HDR image. Our method differs from theirs in that we do not use optical flow for alignment, which intrinsically avoids the artifacts and distortions that are present in their results. We provide concrete comparisons in the later sections.

Another approach to dense correspondence is the patch-based system [23, 9]. Although these methods produce excellent results, they take much longer to run and often fail in the presence of large motions and large saturated regions.

Producing an HDR image typically also involves tonemapping and dynamic range compression in order to display the final HDR image. Our work is focused on the composition process. On the other hand, there are more expensive solutions that use special devices to capture a higher dynamic range [25, 7, 24] and directly produce HDR images. For a complete review of the problem, readers may refer to [21].

Figure 2. (a) Framework. (b) Merger. Our framework is composed of three components: encoder, merger and decoder. Different exposure inputs are passed to different encoders, and concatenated before going through the merger and decoder. We experimented with two network structures for the merger, a straight-forward Unet and a ResNet. We use skip-connections both between the encoders and the decoder, and between the mirrored layers within the merger. The output HDR of the decoder is tonemapped before it can be displayed.

3. Approach

We formulate the problem of HDR imaging as an image translation problem. Similar to [13], given a set of LDR images $\{I_1, I_2, \ldots, I_k\}$, we define a reference image $I_r$. In our experiments, we use three LDRs, and set the middle exposure shot as the reference. Since ours is a translation problem, the same network can be extended with more LDR inputs, and possibly with any specified reference image. We provide results in Sections […] and […] to substantiate such robustness.

Specifically, our goal is to learn a mapping from a stack of LDR images $\{I_1, I_2, I_3\}$ to a ghost-free HDR image $H$ that is aligned with the reference LDR input $I_r$ (same as $I_2$), and contains the maximum possible HDR content. This content either comes directly from the LDR inputs, or is hallucinated from the surrounding regions when certain content is completely missing.

We focus on handling large foreground motions, and assume that the input LDR images, which are typically taken in a burst on a handheld device, have small background motions, since the latter can be easily handled using a homography transformation.

We capitalize on CNNs to learn such a mapping. As shown in Figure 2, our framework consists of three components: encoder, merger and decoder. We experimented with two similar architectures for the merger, while the overall structure is a symmetric translation network. The first architecture is a straight-forward Unet proposed in [22]. The second one is a Unet with residual blocks [6] in the middle layers, similar to the Image Transformation Networks proposed in [12]. In this paper, we name the second one ResNet, as opposed to the first one, the straight-forward Unet. We compare their performance in later sections.

Unet

The encoder-decoder structure is a common tool for tackling translation problems. Unet is essentially an encoder-decoder architecture with skip-connections that forward the output of each encoder layer directly to the input of the corresponding decoder layer. In [10], Unet is used as a generator network in an adversarial setting to learn image-to-image translation. Through a wide range of translation scenarios, such as aerial image to map and edge to photo, they demonstrate that such a structure significantly facilitates training and produces outputs strictly aligned with the inputs. This is a highly desirable feature in our situation, since we require our output HDR image to be aligned with the reference LDR input.

However, unlike [10], we do not need a discriminator network, because the mapping from LDR to HDR is relatively easy to learn compared to other scenarios in [10] where the two image domains are much more distinct, such as edge to photo. Despite our intuition, it can be insightful to see how the network performs with a discriminator network. We also experimented with training with a discriminator, and provide discussion in Section […].

Encoding, Merging and Decoding

With the hourglass Unet structure, the translation process can be conceptually divided into three parts: encoding, merging and decoding.

Encoding. Since we have multiple exposure shots, intuitively we may want separate branches to extract different types of information from different exposure inputs. Instead of duplicating the whole network, we separate the first two input layers as encoders for each exposure input.
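The separate-encoder idea can be illustrated with a shape-only sketch that tracks feature-map sizes rather than running any convolution. It borrows the layer settings stated later under Implementation Details (stride-2 encoding layers, channels starting at 64 and doubling up to 512, a 2-layer encoder per exposure, 6-channel inputs). The 256×256 patch size, the "same"-style padding that halves the spatial size exactly, and the function names are assumptions made for illustration only; the resulting concatenated channel count follows from those assumptions and is not a figure confirmed by the paper.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def encoder_shapes(input_shape: Shape, num_layers: int = 2,
                   base_channels: int = 64, stride: int = 2,
                   max_channels: int = 512) -> List[Shape]:
    """Output shape after each stride-`stride` encoding layer.

    Assumes 'same'-style padding, so each layer divides the spatial
    size by the stride (rounded up), and doubles the channel count
    from `base_channels` up to `max_channels`.
    """
    height, width, _ = input_shape
    channels = base_channels
    shapes: List[Shape] = []
    for _ in range(num_layers):
        height = -(-height // stride)  # ceiling division
        width = -(-width // stride)
        shapes.append((height, width, channels))
        channels = min(channels * 2, max_channels)
    return shapes


def concat_branches(branch_outputs: List[Shape]) -> Shape:
    """Concatenate branch outputs along the channel dimension."""
    height, width, _ = branch_outputs[0]
    for h, w, _ in branch_outputs:
        if (h, w) != (height, width):
            raise ValueError("branches must share the same spatial size")
    return (height, width, sum(c for _, _, c in branch_outputs))


# Hypothetical 256x256 patch with 6 channels per exposure.
patch: Shape = (256, 256, 6)
per_branch = encoder_shapes(patch)            # one 2-layer encoder
branch_outputs = [per_branch[-1]] * 3         # low, middle, high exposure
merger_input = concat_branches(branch_outputs)
print(per_branch, merger_input)
```

Only the first two layers are replicated per exposure, so the cost of the extra branches stays small relative to duplicating the whole network.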

Merging and Decoding. After extracting the content, the network learns to merge it, mostly in the middle layers, and to decode it into an HDR output, mostly in the last layers. To facilitate the merging, we adopt the Image Transformation Networks [12], which replace the middle layers of the Unet with several residual blocks [6]. Such a structure is also used in [29] for image translation. Compared to the straight-forward Unet, ResNet generally yields better performance.

Loss Function

Given a set of LDR images $\mathcal{I} = \{I_1, I_2, I_3\}$ sorted by their exposure biases as input, which is in the LDR domain, we first map them to $\mathcal{H} = \{H_1, H_2, H_3\}$ in the HDR domain. We use simple gamma encoding for this mapping:

$$H_i = \frac{I_i^{\gamma}}{t_i}, \quad \gamma > 1 \tag{1}$$

where $t_i$ is the exposure time of image $I_i$. Note that we use $H$ to denote the target HDR image, and $H_i$ to denote the LDR inputs mapped to the HDR domain. We then concatenate $\mathcal{I}$ and $\mathcal{H}$ along the channel dimension into a 6-channel input. This is also suggested in [13]. While the LDRs facilitate detection of misalignment and saturation, the exposure-adjusted HDRs improve the robustness of the network across LDRs with various exposure levels. Our network $f$ is thus defined as:

$$\hat{H} = f(\mathcal{I}, \mathcal{H}) \tag{2}$$

where $\hat{H}$ is the output HDR image.

HDR images are usually displayed after tonemapping. We compute the loss function on the tonemapped HDR images, which is more effective than computing it directly in the HDR domain. The authors of [13] proposed to use the µ-law, which is commonly used for range compression in audio processing:

$$T(H) = \frac{\log(1 + \mu H)}{\log(1 + \mu)} \tag{3}$$

where $H$ is the output HDR image, and $\mu$ is a parameter controlling the level of compression. We set $\mu$ to […]. Although there are other powerful tonemappers, most of them are complicated and not fully differentiable, which makes them unsuitable for training a neural network.

Finally, our loss function is defined as:

$$L_{Unet} = \left\| T(\hat{H}) - T(H) \right\|_2 \tag{4}$$

where $H$ is the ground truth HDR image.

4. Datasets

We use the dataset provided by [13] for training. While other HDR datasets are available, many of them do not have ground truth HDR images, or contain only a few scenes. This dataset consists of 89 scenes and contains ground truth HDR images. As described in [13], for each scene, 3 different exposure shots were taken while the object was moving, and another 3 shots were taken while the object remained static. The static sets are used to produce the ground truth HDR with reference to the middle exposure shot. This reference image then replaces the middle exposure shot in the dynamic sets. All images are resized to […]. Both dynamic and static sets consist of LDR images with exposure biases of {−2.0, 0.0, +2.0} or {−3.0, 0.0, +3.0}.

Table 1. Comparison of average running time in seconds on the testing set under CPU environment. Columns: Sen [23], Hu [9], Kalantari [13], HDRCNN [4], Ours (Unet), Ours (ResNet).

Data Preparation

To keep our network focused on handling foreground motions, we first align the background using a simple homography transformation, which does not introduce artifacts and distortions. We also tried to train the network without background alignment, but we found this too confusing for the network to learn. Examples are provided in Section […].

Data Augmentation and Patch Generation

The dataset was split into 74 training examples and 15 testing examples by [13]. For efficient training, instead of feeding the original full-size images into our model, we crop the images into […] patches with a stride of 64, which produces around […] patches. We then perform data augmentation (flipping and rotation), which further increases the training data by 8 times.

However, a large portion of these patches contain only background regions. Therefore, to keep the training focused on foreground motions, we detect large-motion patches by thresholding the structural similarity between different exposure shots, and replicate these patches in the training set.

5. Experiments and Results

We trained two networks (Unet and ResNet), and present the implementation details, results, evaluation and comparisons with the state-of-the-art methods in this section.

Implementation Details

Given three input LDRs, each is mapped to the HDR domain and then concatenated with its HDR-domain version into a 6-channel input before being fed into the network. Each of the three 6-channel inputs is passed to a 2-layer encoder separately, and the encoder outputs are then concatenated along the channel dimension for merging.
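The input construction described above, together with Eqs. (1), (3) and (4), can be sketched per pixel in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the paper only constrains γ to be greater than 1, and the value of µ is not available in this copy, so `GAMMA` and `MU` below are arbitrary illustrative numbers.

```python
import math
from typing import List, Sequence


def ldr_to_hdr_domain(ldr: Sequence[float], exposure_time: float,
                      gamma: float) -> List[float]:
    """Eq. (1): H_i = I_i^gamma / t_i, element-wise on values in [0, 1]."""
    if exposure_time <= 0:
        raise ValueError("exposure_time must be positive")
    return [(value ** gamma) / exposure_time for value in ldr]


def six_channel_pixel(ldr_rgb: Sequence[float], exposure_time: float,
                      gamma: float) -> List[float]:
    """Concatenate an RGB pixel with its HDR-domain version (3 + 3 channels)."""
    return list(ldr_rgb) + ldr_to_hdr_domain(ldr_rgb, exposure_time, gamma)


def mu_law(h: float, mu: float) -> float:
    """Eq. (3): T(H) = log(1 + mu * H) / log(1 + mu)."""
    return math.log(1.0 + mu * h) / math.log(1.0 + mu)


def tonemapped_l2(pred: Sequence[float], target: Sequence[float],
                  mu: float) -> float:
    """Eq. (4): Euclidean norm of the difference of tonemapped values."""
    return math.sqrt(sum((mu_law(p, mu) - mu_law(t, mu)) ** 2
                         for p, t in zip(pred, target)))


# Illustrative values only; neither is taken from the paper.
GAMMA = 2.0
MU = 10.0

pixel = [0.5, 0.25, 1.0]
features = six_channel_pixel(pixel, exposure_time=4.0, gamma=GAMMA)
print(features)
print(tonemapped_l2([0.2, 0.4], [0.25, 0.35], MU))
```

Because the µ-law is a smooth, monotonic function of the HDR value, the loss stays differentiable, which is the property the text cites as the reason for preferring it over more elaborate tonemappers.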


Figure 3. (a), (b) Results on the testing set, which has no overlap with the training set. In the upper half of each example, the left column shows the input LDRs, the middle is our tonemapped HDR result, and the last three columns show three zoomed-in LDR regions marked in the HDR image. The lower half shows the comparison of our results against other methods. The numbers in brackets at the bottom indicate the PSNR of the tonemapped images. Our method outperforms all others, especially in saturated regions with foreground motions, where the optical flow used in [13] results in artifacts and distortion, HDRCNN [4] produces blurry results, and all other methods result in artifacts. Note that our method can hallucinate plausible details in saturated regions, such as (a) I, (b) I and (b) III, where the input LDRs have no or extremely weak responses. Other methods do not achieve this, except HDRCNN [4], although there the effect does not seem as apparent.

The encoding layers use 5×5 kernels with a stride of 2, while the decoding layers are deconvolution layers that use 5×5 kernels with a stride of 1/2. The output of the last deconvolution layer is connected to a flat convolution layer with a 5×5 kernel to produce the final HDR. All layers are followed by batch normalization (except the first layer and the output layer), and by leaky ReLU (encoding layers) or ReLU (decoding layers). The channel numbers are doubled each layer from 64 to 512 during encoding and halved from 512 to 64 during decoding.

For the straight-forward Unet, input patches are passed through 8 encoding layers to produce a […] block, followed by 8 decoding layers and an output layer to produce an HDR patch. Our ResNet differs only in that after 3 encoding layers, the block is passed through 9 residual blocks with 3×3 kernels, followed by 3 decoding layers and an output layer.

Running Time

We compare the running time with other methods in Table 1. Although our network is trained on a GPU, the conventional optimization methods are optimized for a CPU environment. Thus, we evaluated the running time under a CPU environment, on a PC with an i7-4790k (4.0GHz) and 32GB RAM. Note that optical flow is used in [13], which takes 59.4s on average. When running on a GPU (Titan X Pascal), our Unet and ResNet take 0.225s and 0.239s respectively to process 3 LDR images of size […].

Evaluation and Comparison

We perform qualitative and quantitative evaluations, and compare results with the state-of-the-art methods, including two patch-based methods [23, 9], a motion rejection method [19], the flow-based method with a CNN merger [13], and the most recent single-image HDR imaging method [4]. For all methods, we used the code provided by the authors. Note that all the HDR images are displayed after tonemapping using Photomatix [1], which is different from the tonemapper used in training.

Table 2. Quantitative comparisons of the results on our test set. Rows: PSNR-T, PSNR-L, HDR-VDP-2. Columns: Sen [23], Hu [9], Kalantari [13], Ours (Unet), Ours (ResNet). The first row is PSNR computed using tonemapped outputs and ground truth, the second row is PSNR computed using linear images and ground truth, and the third row is HDR-VDP-2 [18] scores. All values are the average across the 15 testing images in the original test set.


Figure 4. (a), (b) Results on Sen's dataset [23]. Our method produces better details in highlight regions, such as (a) I and (b) II, while others are prone to introducing artifacts. In under-exposed moving regions, such as (a) II, our results appear less noisy compared to Sen [23], Kalantari [13] and HDRCNN [4], and contain sharper edges compared to Hu [9]. In saturated regions where object motion is present, such as (b) I, the optical flow used in [13] tends to introduce distortions, while others may result in artifacts.

Figure 5. (a), (b) Results on Tursun's dataset [28]. (a) is a complex, dynamic scene, while scene (b) is highly contrastive. While other methods except HDRCNN [4] fail to recover the highlights, such as (a) I, (a) II and (b) I, both our Unet and ResNet produce appealing results, clearer than those of HDRCNN [4]. Moreover, as shown in (b) I, our ResNet tends to produce more details, while our straight-forward Unet tends to reject these details. In dark regions like (b) II, both Unet and ResNet produce less noisy results compared to other methods.

5.3.1 Qualitative Results

Test Set. Figure 3 shows examples of testing results, which are not seen in training. In regions where no object motion is present, all methods produce comparably good results, although learning-based methods generally recover the details better, especially in highly saturated highlight regions, where other methods may produce artifacts. However, when large object motion is present in saturated regions, all other methods tend to result in artifacts, distortions and ghosting, as shown in Figure 3. Specifically, in the flow-based method [13], these artifacts and distortions are introduced during optical flow alignment, and are not successfully removed in the merging process. Both of our networks produce comparably good results, although in general ResNet seems to produce more consistent results, with a higher capacity for hallucinating visually plausible details.

Other Datasets. To examine the robustness of our network, we also tested it on two other HDR datasets, Sen's dataset [23] and Tursun's dataset [28], and show results in Figure 4 and Figure 5 respectively. The former consists of 34 diverse scenes that are captured with 3 or more exposure shots. Most of these sets contain small foreground motions, and exposure separations vary across the scenes. The latter consists of 27 complex scenes, each captured in 9 images with one stop separation. Most of them exhibit larger foreground motions. Our method produces appealing results and outperforms others on these datasets, which are different from the training dataset; this indicates that our method does not overfit the training data. Even though our network is trained only on LDR inputs with 2 or 3 stop separations, it is robust across various input LDRs, and can handle larger exposure separations.

Quantitative Results

Table 2 shows a quantitative comparison of our networks with state-of-the-art methods. Note that all results are calculated on the original test set, which has no overlap with the training set. We compute the PSNR between the generated HDR and the ground truth HDR, both before and after tonemapping using the µ-law. We also compute HDR-VDP-2 [18], a metric specifically designed for measuring the visual quality of HDR images. For the two parameters used to compute the HDR-VDP-2 scores, we set the diagonal display size to 24 inches, and the viewing distance to 0.5 meter. We did not compare with [19] and [4], since the former is optimized for more than 5 LDR inputs and the latter produces unbounded HDR results. While [13] results in higher PSNR scores, our methods result in higher HDR-VDP-2 scores. Besides, our ResNet yields higher scores than the straight-forward Unet.

Figure 6. Results on Tursun's dataset [27] with different numbers of input LDRs. The leftmost column shows 5 LDR images. The integers in brackets indicate the number of LDR images used to produce the HDR. With only the middle exposure input, HDRCNN [4] produces blurry results. Kalantari [13] produces color artifacts. Sen [23] tends to miss the details, and the results with 5 inputs do not seem to be better than those with 3 inputs. Comparing our results, the use of 5 input LDRs enhances the details in saturated and under-exposed regions.

Hallucination

One important feature of our method is the capability of hallucinating missing details that cannot be recovered using conventional optimization approaches. Examples are shown in Figure 3. This is highly desirable in HDR imaging, since in dynamic scenes, content in over-exposed or under-exposed regions may often be missing in all other LDRs due to total occlusions by object motions. In our experiments, however, there are cases where massive regions of content are completely missing, which makes the hallucination extremely difficult. We believe this is almost impossible to achieve without a substantial high-level understanding of the complex scenes. A much easier solution is to use more input LDRs to capture a more complete dynamic range. In Section 5.3.6, we show that our framework can be extended to 5 LDR inputs, which may help reduce such failures.

Figure 7. This example illustrates the effectiveness of merging. While the reference LDR exhibits large saturation, after merging the content from the other two inputs, the network generates appealing HDR results. The residual image in the last column indicates the difference between the input HDR and the final HDR.

Test of Effectiveness

Intuitively, if the reference input LDR (middle exposure) contains complete information, no additional information is needed from the other input LDRs, which makes the merging trivial. To test whether our network effectively utilizes the content presented by all input LDRs, we examine the difference between the final output HDR $\hat{H}$ and the middle exposure input mapped to the HDR domain, $H_2$. Examples are shown in Figure 7.

Figure 8. This example demonstrates our network's capability to generate HDR images with a different reference image specified by the middle input. The first row shows three LDR inputs, and the second row shows the corresponding HDR results with reference to each input.

Different Reference Image

One potential advantage of our image translation formulation, although we do not have perfect datasets to test it, is the flexibility in selecting the reference image, and even in generating multiple HDR images with reference to all the different input LDRs in one single network. However, instead of showing such results, which are beyond what we can achieve with the current datasets, we demonstrate our network's capacity to produce satisfying results with respect to the low or high exposure shots when given only 2 exposure shots, such as {Low-Low-Medium} in Figure 8. This also shows the robustness of our network in handling different combinations of exposure shots.


Figure 9. This example illustrates the effect of background alignment. Without background alignment on the input images, our network tends to produce blurry edges where the background is largely misaligned.

Figure 10. Comparison of results with and without a discriminator. Using a discriminator, we could not produce results comparable to those of a simple ResNet.

5.3.6 More Input LDRs

Since our problem is formulated as image translation, the same framework can naturally support more than 3 input LDRs. This can be more desirable, since more LDRs capture a more complete range of content in the scene, especially in the presence of large-scale foreground motions. One challenge, however, is the lack of HDR datasets suitable for training a deep network, which requires a sufficient number of images from a variety of scenes, and a ground truth HDR image for each LDR stack. Most of the existing datasets do not have ground truth. Although the dataset provided by [15] contains static sets of LDRs for each corresponding dynamic set, it has only 4 distinct scenes, and contains only roughly rigid object motions. We decided to use the dataset provided by [23], which contains a sufficient range of diverse scenes, and used their produced results as ground truth for training. Although this is not an ideal dataset, since their results are not yet perfect, it is sufficient to serve our purpose of proving the extensibility of our framework and its potential to tackle the challenge of large-scale saturation. We test our framework using this dataset with 5 LDR inputs. Figure 6 compares our results with the results produced by state-of-the-art methods.

6. Discussions

6.1. Background Alignment

In all our experiments and comparisons, since we are focused on handling large foreground motions, we align the backgrounds of the LDR inputs using a homography transformation.
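For readers unfamiliar with the term, a homography is a 3×3 matrix that maps pixel coordinates of one image to another through homogeneous coordinates. The sketch below only shows how such a matrix is applied to a point; the paper does not state how the matrix is estimated, so the matrices here are hypothetical values chosen for illustration.

```python
from typing import List, Tuple

Matrix3 = List[List[float]]


def apply_homography(h: Matrix3, x: float, y: float) -> Tuple[float, float]:
    """Map pixel (x, y) through a 3x3 homography using homogeneous coordinates."""
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return (xh / w, yh / w)


# Hypothetical example 1: a pure translation by (+3, -2) pixels,
# the kind of small camera shift expected in a handheld burst.
shift: Matrix3 = [[1.0, 0.0, 3.0],
                  [0.0, 1.0, -2.0],
                  [0.0, 0.0, 1.0]]
print(apply_homography(shift, 10.0, 10.0))

# Hypothetical example 2: a non-unit last row rescales the result,
# which is what distinguishes a homography from an affine map.
scaled: Matrix3 = [[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 2.0]]
print(apply_homography(scaled, 10.0, 10.0))
```

A single matrix of this kind describes a global warp of the whole frame, so it can compensate camera motion of the background but cannot model independently moving foreground objects.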
Without background alignment, we found that although the network is able to learn to merge some contents, it is highly unassertive at producing sharp edges, as shown in Figure 9. This can be due to the confusion caused by the background motion, which CNN is generally weak at dealing with. However, such issues can be easily resolved using simple homography transformation that almost perfectly aligns the background in most cases, and does not introduce artifacts and distortions. Recall that in practice LDR inputs for generating HDR are captured within a split second using handheld devices Training with GAN Although we believe including a discriminator in training can be an overkill in HDR imaging, and training with Generative Adversarial Networks predictably requires extra caution and much longer training time, it may be insightful to know how our network performs in an adversarial setting. To test this, we used the framework proposed in [10], which also uses Unet as the generator network. In our experiments, besides the fact that the training process was very unstable and harder to control, as have been suggested by other researchers, we are not able to produce comparable results to those produced by Unet without a discriminator, despite extensive parameters tuning and network modifications. See Figure 10. More interestingly, we found that in our situation the discriminator often loses the game in an early stage and outputs a classification score of 0.5 for both real and generated images. 7. Conclusion and Future Work In this paper, we demonstrate that the problem of HDR imaging can be formulated as an image translation problem, which can be tackled using deep CNNs. We conduct extensive experiments to show that a simple translation network can produce compelling results, outperforming the state-of-the-arts, including both learning-based and conventional optimization approaches. 
Furthermore, the CNN can also hallucinate plausible details in largely saturated regions in the presence of foreground motions. We also show that our method is robust across various LDR inputs, and can be extended to more inputs and to different reference images. While our advantages are clear, our method is not yet a perfect solution. We also observe challenges in recovering massive saturated regions from a minimal number of input LDRs. In the future, we will attempt to incorporate high-level knowledge to facilitate such recovery, and devise a more powerful solution.

We gratefully acknowledge all authors for providing the code and datasets that were instrumental to our work in this paper.
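
As an illustration of the background alignment step discussed in Section 6.1, the following is a minimal sketch of a homography fit. The paper does not state how its homography is estimated, so everything here is an assumption: the sketch supposes that four background point correspondences between a non-reference LDR and the reference LDR are already available (for example from a feature matcher, which is not shown), fixes the bottom-right entry of the matrix to 1, and solves the resulting 8x8 linear system. The function names `solve_linear`, `estimate_homography` and `apply_homography` are hypothetical.

```python
# Minimal sketch only: illustrative helper names, not the authors' code.

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def estimate_homography(src, dst):
    """Estimate a 3x3 homography H (with H[2][2] = 1) from 4 point pairs.

    Each pair (x, y) -> (u, v) contributes two linear equations:
      h11 x + h12 y + h13 - u (h31 x + h32 y) = u
      h21 x + h22 y + h23 - v (h31 x + h32 y) = v
    """
    if len(src) != 4 or len(dst) != 4:
        raise ValueError("exactly 4 correspondences are expected")
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        A.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(H, point):
    """Map a 2D point through H using homogeneous coordinates."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return (u, v)


# Usage: corners of a background region in one LDR and where they
# appear in the reference LDR (made-up coordinates for illustration).
src = [(0.0, 0.0), (100.0, 0.0), (100.0, 80.0), (0.0, 80.0)]
dst = [(5.0, -3.0), (104.0, 1.5), (103.0, 82.0), (4.0, 79.0)]
H = estimate_homography(src, dst)
warped_corner = apply_homography(H, src[2])  # close to dst[2]
```

In practice one would usually fit the homography to many correspondences with a robust estimator and then resample the whole image with it; the four-point solve above only shows the underlying geometry.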

References

[1] Photomatix.
[2] P. E. Debevec and J. Malik. Recovering high dynamic range radiance maps from photographs. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 97, New York, NY, USA. ACM Press/Addison-Wesley Publishing Co.
[3] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In IEEE ICCV.
[4] G. Eilertsen, J. Kronander, G. Denes, R. Mantiuk, and J. Unger. HDR image reconstruction from a single exposure using deep CNNs. ACM TOG, 36(6).
[5] O. Gallo, N. Gelfand, W.-C. Chen, M. Tico, and K. Pulli. Artifact-free high dynamic range imaging. In 2009 IEEE International Conference on Computational Photography (ICCP), pages 1-7, April.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR.
[7] F. Heide, M. Steinberger, Y.-T. Tsai, M. Rouf, D. Pająk, D. Reddy, O. Gallo, J. Liu, W. Heidrich, K. Egiazarian, J. Kautz, and K. Pulli. FlexISP: A flexible camera image processing framework. ACM TOG, 33(6), December.
[8] Y. S. Heo, K. M. Lee, S. U. Lee, Y. Moon, and J. Cha. Ghost-Free High Dynamic Range Imaging. Springer Berlin Heidelberg, Berlin, Heidelberg.
[9] J. Hu, O. Gallo, K. Pulli, and X. Sun. HDR deghosting: How to deal with saturation? In IEEE CVPR.
[10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In IEEE CVPR.
[11] K. Jacobs, C. Loscos, and G. Ward. Automatic high-dynamic range image generation for dynamic scenes. IEEE Computer Graphics and Applications, 28(2):84-93, March.
[12] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution.
[13] N. K. Kalantari and R. Ramamoorthi. Deep high dynamic range imaging of dynamic scenes. ACM TOG, 36(4).
[14] S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High dynamic range video. ACM TOG, 22(3), July.
[15] K. Karaduzovic-Hadziabdic, J. H. Telalovic, and R. Mantiuk. Subjective and objective evaluation of multi-exposure high dynamic range image deghosting methods. In Proceedings of the 37th Annual Conference of the European Association for Computer Graphics: Short Papers, EG 16, pages 29-32, Goslar, Germany. Eurographics Association.
[16] E. A. Khan, A. O. Akyuz, and E. Reinhard. Ghost removal in high dynamic range images. In 2006 International Conference on Image Processing, Oct.
[17] S. Mann and R. W. Picard. On being undigital with digital cameras: Extending dynamic range by combining differently exposed pictures. In Proceedings of Imaging Science and Technology.
[18] R. Mantiuk, K. J. Kim, A. G. Rempel, and W. Heidrich. HDR-VDP-2: A calibrated visual metric for visibility and quality predictions in all luminance conditions. ACM TOG, 30(4):40:1-40:14, July.
[19] T. H. Oh, J. Y. Lee, Y. W. Tai, and I. S. Kweon. Robust high dynamic range imaging by rank minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(6), June.
[20] S. Raman and S. Chaudhuri. Reconstruction of high contrast images for dynamic scenes. The Visual Computer, 27(12), Dec.
[21] E. Reinhard, G. Ward, S. Pattanaik, and P.. High Dynamic Range Imaging: Acquisition, Display, and Image-Based Lighting (The Morgan Kaufmann Series in Computer Graphics). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[22] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. Springer International Publishing, Cham.
[23] P. Sen, N. K. Kalantari, M. Yaesoubi, S. Darabi, D. B. Goldman, and E. Shechtman. Robust Patch-Based HDR Reconstruction of Dynamic Scenes. ACM TOG, 31(6):203:1-203:11.
[24] A. Serrano, F. Heide, D. Gutierrez, G. Wetzstein, and B. Masia. Convolutional sparse coding for high dynamic range imaging. Computer Graphics Forum, 35(2).
[25] M. D. Tocci, C. Kiser, N. Tocci, and P. Sen. A versatile HDR video production system. ACM TOG, 30(4):41:1-41:10, July.
[26] A. Tomaszewska and R. Mantiuk. Image registration for multi-exposure high dynamic range image acquisition. In International Conference in Central Europe on Computer Graphics and Visualization, WSCG 07.
[27] O. T. Tursun, A. O. Akyüz, A. Erdem, and E. Erdem. The state of the art in HDR deghosting: A survey and evaluation. Computer Graphics Forum, 34(2).
[28] O. T. Tursun, A. O. Akyüz, A. Erdem, and E. Erdem. An objective deghosting quality metric for HDR images. Computer Graphics Forum, 35(2), May.
[29] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE ICCV.
[30] S. Zhu, S. Liu, C. C. Loy, and X. Tang. Deep cascaded bi-network for face hallucination. In ECCV.
[31] H. Zimmer, A. Bruhn, and J. Weickert. Freehand HDR imaging of moving scenes with simultaneous resolution enhancement. Computer Graphics Forum, 30(2).
