RAISE - A Raw Images Dataset for Digital Image Forensics

Duc-Tien Dang-Nguyen 1, Cecilia Pasquini 2, Valentina Conotter 2, Giulia Boato 2
1 DIEE - University of Cagliari, Italy; 2 DISI - University of Trento, Italy
ductien.dangnguyen@diee.unica.it, cecilia.pasquini@unitn.it, {conotter, boato}@disi.unitn.it

ABSTRACT

Digital forensics is a relatively new research area which aims at authenticating digital media by detecting possible digital forgeries. Indeed, the ever-increasing availability of multimedia data on the web, coupled with the great advances achieved by computer graphics tools, makes the modification of an image and the creation of visually compelling forgeries an easy task for any user. This in turn creates the need for reliable tools to validate the trustworthiness of the represented information. In this context, we present RAISE, a large dataset of 8156 high-resolution raw images, depicting various subjects and scenarios, properly annotated and available together with accompanying metadata. Such a wide collection of untouched and diverse data is intended to become a powerful resource for, but not limited to, forensic researchers, by providing a common benchmark for a fair comparison, testing and evaluation of existing and next-generation forensic algorithms. In this paper we describe how RAISE has been collected and organized, discuss how digital image forensics and many other multimedia research areas may benefit from this new publicly available benchmark dataset, and test a very recent forensic technique for JPEG compression detection.

Categories and Subject Descriptors: H.3.7 [Digital Libraries]: Collection, dissemination; H.2.8 [Database Applications]: Image databases
General Terms: Experimentation; Security
Keywords: Data Set, Raw Images, Benchmark, Image Forensics

1. INTRODUCTION

Thanks to the worldwide spread of smart portable devices, every day an enormous amount of digital multimedia information is created, stored and shared among peer users by means of social media, web portals and apps. Such tools enable the diffusion of user-generated multimedia content in an extremely easy, fast and wide manner, providing the chance for anybody to disseminate images, videos and audio tracks. However, the authenticity and trustworthiness of such multimedia data cannot be taken for granted, as low-cost and user-friendly editing software nowadays allows even non-expert users to easily manipulate such material. Considering also the more intuitive and immediate impact of visual data with respect to textual documents, the potential diffusion of distorted or completely fake multimedia content on websites, in news media, advertisements and legal proceedings represents an urgent issue to be addressed. Thus, the trustworthiness of such data has to be seriously assessed, so as to avoid an illegitimate exploitation, be it malicious or not, of the semantic message it may convey.

Figure 1: (a) Original version. (b) Modified version. Both photos depict the construction site of the hydroelectric dam of Belo Monte, Brazil. The modified version on the right was used by the German news magazine Der Spiegel in 2013, in an article evaluating the environmental impact of the structure.

Fig. 1 serves as an example of how digital modification may influence users' opinions on a given topic. In particular, panel (a) shows the original image of the construction site of the Belo Monte Dam in Brazil, while panel (b) shows its digitally altered version, in which the visual impact of the image is clearly altered to look more degraded, alluding to the social and environmental consequences of the construction of the hydroelectric dam.

Recently, the debate about the role of manipulated visual content online has significantly attracted the attention of the international community 1, questioning the way the use of manipulated images impacts users' perceptions [5, 20].

1 www.spiegel.de/international/world/growing-concern-that-news-photos-are-being-excessively-manipulated-a-898509.html

Over the last decade, the field of digital multimedia forensics has emerged as a valuable means for assessing and verifying

image trustworthiness, by developing passive techniques for the blind authentication of images (as opposed to active techniques like [2]), videos and audio recordings. The idea behind digital forensics is that different kinds of processing leave subtle, specific numerical and statistical traces in the digital data, which can be detected and used as evidence of possible forgeries [8]. Although research on audio and video forensics is also gaining increasing attention, the study of images plays a central role in this field, addressing different goals, such as the detection of potential modifications applied to the content (e.g., splicing, enhancement, compression), the identification of the source device that created it (e.g., camera brand and model), or the differentiation between content physically recorded by a device and content generated by means of computer graphics software (see, e.g., [4, 6, 15, 17] and references therein).

Given the wide variety of potential manipulations and the actual lack of a universal forensic framework, each forensic technique usually targets a single class of processing operations and needs to be carefully assessed under a number of controlled experimental settings (where the order and parameters of the different operations have to be known), in order to properly understand its potential effectiveness in a real-world scenario. Thus, images in raw uncompressed formats are extensively used in forensic research, since their pristine condition is a valuable and essential starting point for building the required testing sets, as is the case for the widely used UCID database [19]. To this end, in this paper we present RAISE (RAw ImageS dataset), a new collection of 8156 raw images covering a wide variety of both semantic content and technical parameters, which is publicly available and can be downloaded in full or in smaller subsets.
With this release, we aim to provide researchers in image forensics with a valuable benchmarking tool for the experimental evaluation of novel forensic techniques. Unlike other existing datasets targeting a specific forensic task (such as splicing [12], copy-move detection [1, 3] or source identification [10]), where the images are processed accordingly, RAISE is conceived as a general-purpose resource for the image forensics field, since raw images retain all the information related to the acquisition process and can be exploited to test any kind of processing (or chain of operators) according to the experimental setting needed. Even though the RAISE dataset has been primarily designed for digital forensics investigators, we foresee its potential use within many other research fields where large and diverse datasets are needed to carry out meaningful investigations on multimedia.

The structure of the paper is as follows: Section 2 reviews the existing datasets used in image forensics, while Section 3 presents the new RAISE dataset in detail, describing the collection and organization of the images together with their basic statistics. Section 4 presents an exemplifying experimental validation, testing a very recent JPEG compression detector, while Section 5 draws some concluding remarks.

2. RELATED WORK

Digital image forensics has been developing at breakneck speed over the last decade, while a common benchmarking image dataset for algorithm evaluation and fair comparison is still lagging behind. Several general image benchmark datasets are available in the literature (e.g., MIRFlickr [13]), but unfortunately none of them is suitable for forensic purposes, due to the uncontrolled nature of the provided images (e.g., origin, compression, applied processing).
In the forensic literature a few benchmarking image datasets have been proposed so far, each one designed for the detection of a specific type of manipulation. For instance, a couple of image datasets have been developed to support copy-move forgery detection. This manipulation is a common form of tampering that creates a spliced image by copying a portion of its content and pasting it within the same or into a different image, in order to remove or conceal objects. The Columbia Uncompressed Image Splicing Detection Evaluation Dataset [12] provides 183 authentic and 180 spliced images in uncompressed format (BMP or TIFF), whose sizes range from 757×568 to 1152×768 pixels. Since the splicing procedure consists of quasi-randomly copying a part of an image and pasting it into a different one without any post-processing, the spliced images turn out to be semantically meaningless and visually not compelling. The CASIA dataset [9] tried to overcome this issue by providing more realistic spliced images; however, most of them are of size 384×256 pixels, and thus too small for a realistic test case. Amerini et al. released two benchmark datasets of realistic spliced images [1]: MICC F220 consists of 220 images, while MICC F2000 contains 2000 images, all of 2048×1536 pixels. Since the processing operations employed for creating the forgeries are limited to rotation and scaling, and the source files are not available, the applicability of such datasets to realistic test scenarios is limited. Finally, in 2012 a complete benchmark for copy-move detection was proposed in [3]: the authors provided code for generating as many semantically meaningful spliced images as needed, together with the corresponding ground truth. In this way, they were able to create a common benchmark for a fair comparison and an in-depth evaluation of the state-of-the-art forensic techniques devoted to discovering copy-move forgeries.
Other proposed forensic databases have slightly different goals, such as the identification of the sensor fingerprint in order to link an image to the camera that took it. To support such research, Goljan et al. proposed a large-scale database [11], essential in order to include in the analysis all camera devices currently on the market; to the best of our knowledge, however, this dataset is not publicly available. Similarly, the Dresden Image Database [10] targets the source camera identification problem by providing replicas of 83 images, ranging from 3072×2304 to 4352×3264 pixels, taken with 73 different devices, for a total of 16961 JPEG images, 1491 RAW (unprocessed) images, 1491 RAW images processed in Lightroom 2.5 and 1491 RAW images processed in DCRaw 9.3. In the literature, this dataset has also been exploited for general digital forgery detection, such as single/multiple JPEG compression, CFA artifacts and median filtering. However, it lacks diversity in the image content, since the RAW images always depict the same set of predefined scenes, just taken with each of the selected cameras.

Figure 2: (a) Original photo. (b) Modified photo. Panel (a) shows a raw image taken from the RAISE dataset, while panel (b) shows its digitally modified version.

To overcome this issue, in recent years the forensic community has focused its attention on the UCID database [19] for the evaluation of digital forgery detection algorithms. This dataset includes 1338 images stored in TIFF format, of size either 512×384 or 384×512 pixels. The UCID dataset was originally designed for the evaluation of image retrieval techniques, and thus provides predefined query images with the corresponding model images that should be retrieved (released as ground truth). UCID is currently the most widely used benchmarking dataset for forensic purposes requiring uncompressed test images. Also, within the European project REWIND [18], a set of 200 uncompressed images taken with a Nikon camera has been released (among several other sets for splicing detection, copy-move forgery and recaptured videos, all including only a few samples). Unfortunately, the limited number of provided images (in both the UCID and REWIND datasets) and the poor resolution of the common UCID database make these benchmarks not fully reliable for extensive forensic evaluations.

3. DATASET DESCRIPTION

In this work, we propose a challenging real-world image dataset, primarily designed for the evaluation of digital forgery detection algorithms. The RAISE dataset is intended to become a powerful shared resource for forensic researchers and investigators, allowing for a fair comparison of existing and forthcoming algorithms. Indeed, custom-made datasets undermine and strongly influence the measured performance of an algorithm, often making reproducibility and comparison of results difficult. This in turn creates a huge obstacle to the technical and scientific growth of the field.
Besides this, the RAISE dataset is also intended to overcome the copyright and privacy problems related to the use of commonly available images downloaded from the Web. Please note that such issues are not specific to digital image forensics research, but common to many other research fields where large and diverse data collections are needed in order to carry out meaningful investigations on multimedia. Thus, we expect that the proposed collection of images will also be useful and constitute a valuable resource within other domains dealing with image analysis and processing (e.g., multimedia retrieval, denoising).

Digital image forensics aims at discovering the processing history of a content. The basic assumption is that any operator applied to an image leaves subtle traces in the statistics of the image. To the extent that such traces can be detected, they can be used as evidence of image forgeries. Recently, the forensic community has started focusing its attention on the detection of chains of operators [18].

Table 1: Main properties of the RAISE dataset.

  Property       Value                            # of images
  Resolution     3008×2000                        76
                 4288×2848                        2276
                 4928×3264                        5804
  Image quality  Compressed RAW 12-bit            2352
                 Lossless Compressed RAW 14-bit   5804
  Camera model   Nikon D40                        76
                 Nikon D90                        2276
                 Nikon D7000                      5804
  Color space    sRGB                             5950
                 Adobe RGB                        2206

Fig. 2 serves as an example of how different processing operators can be applied to an image in order to change its content. In particular, panel (a) shows an original untouched image taken from the RAISE dataset, while panel (b) shows the same image processed with different operators, such as filtering, copy-pasting, geometric transformations, blending and color adjustments. Such a scenario is challenging, since any post-processing can influence and perturb the traces left by previously applied operators.
In this context, it becomes clear that a benchmark collection of controlled, never-processed images is needed as a starting point for building a reliable test scenario, so as to guarantee a stable and objective study and evaluation of the performance of the various tools. As described in Section 2, existing forensic databases lack both an adequate number of images to guarantee an extensive validation setting and a realistic resolution. The RAISE dataset has been built to cope with these issues, providing 8156 RAW images, uncompressed and guaranteed to be camera-native (i.e., never touched nor processed). In the following, we describe in detail how the images have been collected, together with their statistics and associated metadata. Finally, we illustrate the organization of the image collection, explaining the image annotation that is intended to add significant value to the proposed dataset and widen its applicability to future algorithms in image forensics, as well as to other multimedia research fields.

3.1 Dataset collection

The images in the RAISE dataset have been collected from 4 photographers over a period of 3 years (2011-2014), capturing different scenes and moments in over 80 places in Europe. Three different cameras were employed: a Nikon D40, a Nikon D90 and a Nikon D7000. The original number of captured images was 26964, but we manually revised them and discarded the images that were meaningless for the defined purpose. First, we deleted all the mostly dark and blurry images, such as those taken at night (mainly caused by too fast a shutter speed or by vibration), since they cannot be used to extract any significant statistics for forensic purposes. Then, we discarded near-duplicate images (i.e., images shot less than 1 second apart), since the RAISE dataset is intended to be as diverse as possible.
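A near-duplicate filter of this kind could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual selection script: it assumes capture timestamps (e.g., parsed from EXIF DateTimeOriginal) are already available as datetime objects, and keeps only the first shot of each burst.

```python
from datetime import datetime

def drop_near_duplicates(shots, min_gap_seconds=1.0):
    """Keep only shots spaced at least min_gap_seconds apart.

    `shots` is a list of (filename, capture_time) pairs; the first shot
    of each burst is kept and the rest of the burst is discarded.
    """
    kept = []
    last_time = None
    for name, taken in sorted(shots, key=lambda s: s[1]):
        if last_time is None or (taken - last_time).total_seconds() >= min_gap_seconds:
            kept.append(name)
            last_time = taken
    return kept

# Hypothetical filenames and capture times, for illustration only.
shots = [
    ("a.nef", datetime(2013, 5, 1, 12, 0, 0)),
    ("b.nef", datetime(2013, 5, 1, 12, 0, 0, 400000)),  # burst duplicate, 0.4 s later
    ("c.nef", datetime(2013, 5, 1, 12, 0, 5)),
]
print(drop_near_duplicates(shots))  # ['a.nef', 'c.nef']
```

Note that the gap is measured against the last *kept* shot, so an entire rapid burst collapses to its first frame; comparing consecutive shots instead would be an equally plausible design.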
To this aim, we also automatically analyzed the sets of shots of the same scene taken with different exposure values for composing High-Dynamic-Range (HDR) photos, and kept only the shot with exposure value equal to 0 in

each HDR set. Finally, for copyright and privacy reasons, we discarded personal pictures, such as those depicting the photographers themselves. At the end of this selection process, the publicly available RAISE dataset contains a total of 8156 photos, saved in an uncompressed format (RAW) as natively provided by the employed cameras. We underline that the RAISE dataset is much larger than the existing forensic benchmarks, thus opening the way for extensive analyses of existing and next-generation forensic techniques.

3.2 Dataset statistics and metadata

All the images in the RAISE dataset are taken at very high resolution (3008×2000, 4288×2848 and 4928×3264 pixels), overcoming by far the resolution limitations of currently published datasets. All photos are stored in uncompressed formats, at high quality (Compressed RAW 12-bit and Lossless Compressed RAW 14-bit). This information is summarized in Table 1 in terms of numbers of images, together with the employed camera models and color spaces. In addition, much other information about the images is available, since we allow users to download all the metadata associated with each image in the dataset. Among these are the sensor focal lengths, flash modes (e.g., i-TTL-BL, i-TTL, i-TTL-BL -3.0), white balance settings (e.g., Shade, Direct sunlight, Cloudy, Color Temp, Incandescent) and the lenses used (e.g., 35mm f/1.8D, 50mm f/1.8D, 18-55mm f/3.5-5.6G), which can be seen as a valuable resource, for instance for studying the stability of lens aberrations across different scenes.

3.3 Dataset annotation

On the one hand, an extensive evaluation of any forensic algorithm needs a diverse benchmark in order to avoid content-dependent results. On the other hand, some specific techniques may require the analysis of specific image content. In order to fulfill both requirements, we added extra value to the proposed RAISE dataset by supplying all the images with tags.
We labeled each image with 7 possible categories, namely "outdoor", "indoor", "landscape", "nature", "people", "objects" and "buildings". In particular, "outdoor" indicates all images taken outside, as opposed to "indoor", which includes pictures taken inside a building. The tag "landscape" denotes images depicting a section or expanse of rural inland or coastal scenery, usually extensive; "nature" denotes content representing a detail of a natural environment, or animals as the main subject, usually taken as a close-up picture; "people" indicates the presence of one or more persons within the image; "objects" refers to any image containing a physical object in a relevant position within the picture; finally, "buildings" indicates an enclosed construction on a plot of land, generally used for any of a wide variety of activities, such as living or manufacturing. Figure 3 shows some examples of images from each category, and Table 2 reports all the categories along with the total number of images falling into each of them.

Table 2: Categories used for labeling the RAISE dataset.

  Category   # of images
  Outdoor    6954
  Indoor     1202
  Landscape  2522
  Nature     1167
  People     1097
  Objects    860
  Buildings  2475

Please notice that the categories are not mutually exclusive (except "indoor" and "outdoor"), so each image may have multiple labels. These tags are a valuable resource for many digital forensics areas. For instance, "outdoor"-tagged images may be helpful for studying light-based forensic algorithms (e.g., [14]), while "buildings"-tagged images could be of great interest for all geometry-based forensic techniques relying on the analysis of vanishing points, as well as for more general photogrammetry studies (e.g., [7]). In fact, we foresee the utility of the RAISE tagged images not being limited to the forensic domain, but extending to a wider range of investigations on multimedia processing and understanding.
As an example, one could think of a visual privacy application for the "people"-tagged images, where personal visual information should be effectively obscured (see the MediaEval 2014 task on visual privacy 2).

3.4 Dataset organization

RAISE is available for scientific purposes at the following URL: http://mmlab.science.unitn.it/raise/. We implemented an attractive and user-friendly interface to allow all interested researchers to easily download the RAISE dataset. The website shows some examples from the image collection and allows for an easy and fast download of the whole dataset via FTP, in either TIFF or NEF format. Guidelines are also provided to support the download procedure step by step. Since the whole dataset is quite large in terms of storage space (over 350 GB), we also made smaller subsets available, namely RAISE-1k, RAISE-2k, RAISE-4k and RAISE-6k, containing 1000, 2000, 4000 and 6000 images, respectively. Each subset is constructed so that the number of images belonging to each category is proportional to the total number of images in that category in the entire database. Users can choose to download either the entire RAISE dataset or a subset of it, optionally restricted to a specific image category of interest. Metadata are also provided, together with the embedded annotations, as previously described. Please note that all the images in the RAISE dataset are stored under a unique identifier (generated by the MD5 function) in order to ensure image integrity.

2 http://www.multimediaeval.org/mediaeval2014/visualprivacy2014
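A proportional subset construction of this kind can be sketched as follows. This is a hypothetical illustration rather than the procedure actually used to build RAISE-1k/2k/4k/6k: it assumes disjoint category pools (the real RAISE categories overlap), and uses a largest-remainder rule so that the per-category quotas sum exactly to the target size.

```python
import math
import random

def proportional_subset(pools, target, seed=0):
    """Sample `target` items from category pools, proportionally to pool sizes.

    `pools` maps category name -> list of image identifiers (assumed disjoint).
    Quotas are floored, then the categories with the largest fractional
    remainders receive the leftover slots, so quotas sum to `target`.
    """
    total = sum(len(ids) for ids in pools.values())
    exact = {cat: target * len(ids) / total for cat, ids in pools.items()}
    quota = {cat: math.floor(v) for cat, v in exact.items()}
    leftover = target - sum(quota.values())
    for cat in sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)[:leftover]:
        quota[cat] += 1
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    return {cat: rng.sample(pools[cat], quota[cat]) for cat in pools}

# Toy pools with made-up sizes, for illustration only.
pools = {"landscape": list(range(600)), "people": list(range(600, 1000))}
subset = proportional_subset(pools, target=100)
print({cat: len(ids) for cat, ids in subset.items()})  # {'landscape': 60, 'people': 40}
```

The fixed random seed makes the drawn subset reproducible, which matters when a subset is meant to serve as a shared benchmark.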

Figure 3: Examples of images present in RAISE, one per category (outdoor, indoor, landscape, nature, people, objects, buildings). The 8156 images have been carefully tagged by experts using the 7 defined categories.

4. VALIDATION

In this section, we report an experimental validation on the images contained in RAISE, as a demonstration of one of its possible uses within an image forensics framework. In particular, we focus on the analysis of images stored in uncompressed formats (such as TIFF, BMP or PNG), with the aim of detecting traces of a previous JPEG compression. This problem arises, for instance, when the image under forensic analysis supposedly comes from a device set to provide raw images (such as professional or semi-professional cameras), which are usually converted to uncompressed formats (like TIFF) for wide distribution. In this case, the presence of JPEG compression traces would suggest that the image was taken with a different camera or has already been processed by someone. In this framework, RAISE represents a suitable dataset for testing a recent forensic technique designed to identify traces of previous compression [16]. As extensively exploited in the image forensics literature, the core of the JPEG compression algorithm is the quantization of the DCT coefficients (obtained by applying an 8×8 block DCT in the pixel domain), with the width of the quantization interval depending on the quality factor used. Accordingly, the method proposed in [16] is based on the extraction of a single feature from the DCT domain, namely the Benford-Fourier (BF) coefficient, which captures the difference between the DCT coefficients of an image that has never been compressed and those of an image that previously underwent compression. In particular, in [16] the distribution of the BF coefficient magnitude under the hypothesis of no previous compression is theoretically derived, as reported in Fig. 4.
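The quantization step at the core of JPEG, whose statistical footprint the detector of [16] exploits, can be illustrated with a minimal sketch (this is not the BF detector itself): each coefficient of an 8×8 block DCT is rounded to a multiple of its quantization step, so the DCT coefficients of a decompressed image cluster around those multiples instead of varying smoothly.

```python
import math

def dct2_8x8(block):
    # Orthonormal 2-D type-II DCT of an 8x8 block, as used inside JPEG.
    n = 8
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            cu = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
            cv = math.sqrt(1.0 / n) if v == 0 else math.sqrt(2.0 / n)
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = cu * cv * s
    return out

def quantize(coeffs, q):
    # JPEG-style quantization, simplified to a single step q for all
    # frequencies; a real encoder uses a per-frequency quantization table
    # whose entries scale with the quality factor.
    return [[q * round(c / q) for c in row] for row in coeffs]

flat = [[1.0] * 8 for _ in range(8)]
coeffs = dct2_8x8(flat)
print(round(coeffs[0][0], 6))   # DC term of a constant block: 8.0; AC terms ~0
print(quantize(coeffs, 5)[0])   # every value snapped to a multiple of 5
```

After decompression, recomputing the block DCT of such an image yields coefficients concentrated near multiples of the quantization step; the BF coefficient of [16] is a single statistic sensitive to exactly this kind of departure from the never-compressed distribution.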
This knowledge is then exploited to design a binary hypothesis testing framework, where the pristine condition of the image is the null hypothesis and the occurrence of a previous compression is the alternative hypothesis. Since the analytical expression of the pdf is available, it is possible to determine an acceptance region such that the false alarm probability (the probability that a pristine image is classified as previously compressed) equals a chosen value. In other words, lower and upper thresholds on the magnitude of the BF coefficient can be determined automatically: if this value lies beyond the two thresholds, the null hypothesis is rejected and the image is classified as previously compressed. As such, there is no need for a training phase; it is sufficient to fix a false

alarm probability value.

Table 3: Confusion matrix for the classification between never compressed (NC) and previously compressed (PC) images, testing the JPEG compression detector [16] on RAISE.

        NC     PC
  NC    98.6   1.4
  PC    0.1    99.9

In order to test the performance of the technique, we divided the RAISE dataset (the TIFF version) into two halves and created two classes, labeled NC (never compressed) and PC (previously compressed), respectively. For the NC-labeled images, the BF coefficient was computed directly from the TIFF version; the PC-labeled images, on the other hand, were JPEG compressed with a random quality factor in {50, 51, ..., 90}, re-saved as TIFF, and the BF coefficient was then computed. Following the hypothesis testing framework proposed in [16], the images were classified as NC or PC by thresholding the magnitude of the BF coefficient. Specifically, the threshold was automatically determined from the statistical model, with the predicted false alarm probability set to a chosen value (0.01 in our tests). Table 3 reports the resulting confusion matrix. Based on this, we can confirm the very good classification performance on the RAISE dataset, proving the effectiveness of the JPEG detector of [16] also on diverse, high-quality images.

5. CONCLUSION

In this paper we have presented RAISE, a new dataset of raw images, proposed as a valuable support tool for benchmarking within the research field of image forensics, as well as other image processing fields. The collection and organization of the database have been thoroughly described, together with some basic statistics, and the results of an exemplifying experimental validation have been reported.
Figure 4: Histogram of the BF coefficients computed on 1000 TIFF images selected from RAISE, together with the corresponding pdf theoretically predicted by the method in [16] (red curve).

With this work, we intend to provide researchers in image forensics with a common benchmark for an in-depth evaluation and comparison of state-of-the-art and future forensic techniques, in order to better assess their reliability in real-world applications. Moreover, the RAISE collection, by name and by nature, may grow in the future to include new images and categories, with the goal of meeting new potential needs of the multimedia research community.

6. REFERENCES

[1] I. Amerini et al. A SIFT-based forensic method for copy-move attack detection and transformation recovery. IEEE Transactions on Information Forensics and Security, 6(3):1099-1110, 2011.
[2] F. Battisti et al. Watermarking and encryption of color images in the Fibonacci domain. In SPIE Image Processing: Algorithms and Systems, volume 6812, 2008.
[3] V. Christlein et al. An evaluation of popular copy-move forgery detection approaches. IEEE Transactions on Information Forensics and Security, 7(6):1841-1854, 2012.
[4] V. Conotter and G. Boato. Analysis of sensor fingerprint for source camera identification. IEEE Electronic Letters, 47(25):1366-1367, 2011.
[5] V. Conotter et al. A crowdsourced data set of edited images online. In ACM Workshop on Crowdsourcing for Multimedia, 2014.
[6] V. Conotter et al. Physiologically-based detection of computer generated faces in video. In IEEE International Conference on Image Processing, 2014.
[7] A. Criminisi. Single-view metrology: Algorithms and applications. In Pattern Recognition, volume 2449 of Lecture Notes in Computer Science, pages 224-239. Springer Berlin Heidelberg, 2002.
[8] E. Delp et al. Special issue on digital forensics. IEEE Signal Processing Magazine, 2(26), 2009.
[9] J. Dong et al.
CASIA image tampering detection evaluation database. In IEEE International Conference on Signal and Information Processing, pages 422-426, 2013. Online at http://forensics.idealtest.org.
[10] T. Gloe and R. Böhme. The Dresden Image Database for benchmarking digital image forensics. In ACM Symposium on Applied Computing, volume 2, pages 1585-1591, 2010. Online at https://forensics.inf.tu-dresden.de/ddimgdb.
[11] M. Goljan et al. Large scale test of sensor fingerprint camera identification. In SPIE Media Forensics and Security, volume 7254, 2009.
[12] Y.-F. Hsu and S.-F. Chang. Detecting image splicing using geometry invariants and camera characteristics consistency. In IEEE International Conference on Multimedia and Expo, pages 549-552, 2006.
[13] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In ACM International Conference on Multimedia Information Retrieval, pages 39-43, 2008. Online at http://press.liacs.nl/mirflickr/.
[14] P. Kakar and N. Sudha. Verifying temporal data in geotagged images via sun azimuth estimation. IEEE Transactions on Information Forensics and Security, 7(3):1029-1039, 2012.
[15] J. Lukas et al. Digital camera identification from sensor noise. IEEE Transactions on Information Forensics and Security, 1(2):205-214, 2006.
[16] C. Pasquini et al. A Benford-Fourier JPEG compression detector. In IEEE International Conference on Image Processing, pages 5322-5326, 2014.
[17] A. Piva. An overview on image forensics. ISRN Signal Processing, 2013:1-22, 2013.
[18] REWIND. Reverse engineering of audio-visual content data - datasets. Online at http://www.rewindproject.eu/.
[19] G. Schaefer and M. Stich. UCID - an uncompressed colour image database. In SPIE Conference on Storage and Retrieval Methods and Applications for Multimedia, pages 472-480, San Jose (CA), January 2004. Online at http://homepages.lboro.ac.uk/~cogs/datasets/ucid/ucid.html.

[20] P. Zontone et al. Impact of contrast modification on human feeling: an objective and subjective assessment. In IEEE International Conference on Image Processing, 2010.