Information Hiding: Steganography & Steganalysis
Steganography ("covered writing")
- From Herodotus to Thatcher.
- Messages should be undetectable.
- Messages are concealed in media files.
- Perceptually insignificant data is common in media representations.
Steganalysis
- Detecting the presence of a message.
- Statistically based.
- Extraction of the message itself is secondary.
Steganography: Definition
- Simmons 1983: the prisoners' problem.
- USA/USSR non-proliferation treaty compliance checking.
- Alice and Bob are prisoners; Wendy is a warden.
- Alice and Bob are allowed to exchange messages, say images, but Wendy checks all messages.
- Alice and Bob try to hide information in their messages so that Wendy cannot detect it.
- Wendy cannot arbitrarily suppress all messages; the prisoners' human rights cannot be violated without some proof of illegal activity.
Hiding by Matching
- Input: LSB sequence c_1, c_2, c_3, ..., c_n.
- Message bits m_1, m_2, m_3, ..., m_k, with k < n.
- Look for a good approximate match.
- Theorem: if the number of matching bits is to exceed chance, then the cover must be exponentially longer than the message.
- Hiding by matching is very wasteful (very few bits can be hidden this way).
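The search for an approximate match can be sketched as a sliding-window comparison. This is only one reading of the slide: the function name `best_match` and the choice of Hamming agreement as the score are assumptions, not part of the original material.

```python
def best_match(cover_bits, message_bits):
    """Return (offset, score) of the cover window that agrees with the
    message in the most bit positions. Hypothetical helper for illustration."""
    k = len(message_bits)
    best_offset, best_score = 0, -1
    for offset in range(len(cover_bits) - k + 1):
        window = cover_bits[offset:offset + k]
        # Count positions where the cover LSB already equals the message bit.
        score = sum(c == m for c, m in zip(window, message_bits))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


cover_lsbs = [0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
message = [1, 0, 1, 1]
offset, score = best_match(cover_lsbs, message)
```

A random window agrees with the message in about half of its positions, so finding a window that matches all k bits by chance needs a cover on the order of 2^k bits, which is the waste the theorem refers to.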
LSB Methods (least significant bit)
- The given cover is an image.
- Image represented by pixel values:
  - raw images: each pixel is a byte (gray value)
  - raw images: each pixel is a byte (color index in a palette)
  - raw images: each pixel is three bytes (r, g, b values)
- Image represented by a sequence of JPEG coefficients.
- LSBs of pixel values or JPEG coefficients can be altered freely.
- There are many LSBs in an image.
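A minimal sketch of LSB replacement on raw gray-value bytes, assuming the message is written sequentially into the first bytes of the cover. The function names `embed_lsb` and `extract_lsb` are illustrative, not taken from the slides.

```python
def embed_lsb(cover: bytes, bits: list[int]) -> bytes:
    """Overwrite the LSB of the first len(bits) cover bytes with message bits."""
    if len(bits) > len(cover):
        raise ValueError("message is longer than the cover")
    stego = bytearray(cover)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the message bit.
        stego[i] = (stego[i] & 0xFE) | (bit & 1)
    return bytes(stego)


def extract_lsb(stego: bytes, n_bits: int) -> list[int]:
    """Read back the LSBs of the first n_bits bytes."""
    return [b & 1 for b in stego[:n_bits]]


cover = bytes([120, 121, 122, 123, 200, 201, 17, 16])
message = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed_lsb(cover, message)
```

Each pixel value changes by at most one gray level, which is why the change is perceptually insignificant.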
[Figure: Original image]
[Figure: 80% JPEG]
[Figure: JPEG inserted into the original image]
Image File Formats
- An image is a two-dimensional array of image points, or pixels.
- Gray-level images: each pixel is described by one number corresponding to its brightness.
- Color images: each pixel is described by three numbers corresponding to the brightnesses of the three primary colors, e.g. red, green, and blue.
- A typical image file has two parts: the header and the raster data.
- The header contains the magic number identifying the format, the image dimensions, and other format-specific information that describes how the raster data relates to image points or pixels.
- The raster data is a sequence of numbers that contains specific information about colors and brightnesses of image points.
Raw Image File Formats (e.g. BMP, PGM, and TIFF)
- The header contains the magic number and image dimensions.
- The raster data is a sequence of numbers corresponding to either one or three color values at each pixel.
- Raw (uncompressed) images are quite large and require a lot of storage space.
- Some space can be saved by applying lossless compression to the raster data (e.g. TIFF and PNG formats).
Palette Images
- Frequently, raw images contain much more information than required.
- For images saved for human viewing, the number of possible colors can be reduced.
- In some formats (e.g. GIF and PNG) the image is saved using a reduced color set.
- Each pixel value is represented by a single number corresponding to an index in the color palette that is stored in the image header.
- The palette contains the information needed to restore the colors of all image pixels.
- The image file can be made even smaller by lossless compression of the raster data; e.g., run-length encoding represents a sequence of several pixels of the same color (a run) by just two numbers, the length and the color of the run.
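Run-length encoding of palette indices can be sketched as follows. The (length, color) pair layout follows the slide; the function names are illustrative, and real formats pack these pairs in format-specific ways.

```python
def rle_encode(pixels):
    """Collapse consecutive equal palette indices into (length, color) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][1] == p:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, p])       # start a new run
    return [(n, c) for n, c in runs]


def rle_decode(runs):
    """Expand (length, color) pairs back into the pixel sequence."""
    out = []
    for n, c in runs:
        out.extend([c] * n)
    return out


row = [7, 7, 7, 2, 2, 5]
encoded = rle_encode(row)
```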
Lossy Compression: JPEG
- Removes some image details to obtain considerable savings of storage space without much loss of image quality.
- The savings are based on the fact that humans are more sensitive to changes in lower spatial frequencies than in higher ones.
- In addition, it is believed that humans perceive brightness more accurately than chromaticity.
- JPEG uses the YCrCb format to save brightness information (Y) in full resolution and chromaticity information (Cr and Cb) in half resolution.
- At the encoder side each channel is divided into 8 × 8 blocks and transformed using the two-dimensional discrete cosine transform (DCT).
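The color conversion and chroma reduction can be sketched as below. The conversion constants are the ones commonly used with JFIF files, and reading "half resolution" as 2 × 2 averaging is an assumption; neither detail is stated in the slides, and the function names are illustrative.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr) using JFIF-style constants."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def subsample_2x2(channel):
    """Halve the resolution of a chroma channel by averaging 2x2 blocks.
    Assumes the channel has even height and width."""
    h, w = len(channel), len(channel[0])
    return [
        [(channel[i][j] + channel[i][j + 1]
          + channel[i + 1][j] + channel[i + 1][j + 1]) / 4
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


y, cb, cr = rgb_to_ycbcr(128, 128, 128)   # a neutral gray pixel
small = subsample_2x2([[1, 3], [5, 7]])
```

A neutral gray pixel maps to Y equal to its gray value and both chroma components at the midpoint 128, which shows that the chroma channels carry only color deviation.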
JPEG: DCT Transform
Let $f(i, j)$, $i, j = 0, \ldots, N-1$, be an $N \times N$ image block in any of the channels, and let $F(u, v)$, $u, v = 0, \ldots, N-1$, be its DCT transform. The relationship between $f(i, j)$ and $F(u, v)$ is given by

\[
F(u, v) = \frac{2}{N}\, C(u)\, C(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(i, j) \cos\left(\frac{\pi u (2i+1)}{2N}\right) \cos\left(\frac{\pi v (2j+1)}{2N}\right)
\]

\[
f(i, j) = \frac{2}{N} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} C(u)\, C(v)\, F(u, v) \cos\left(\frac{\pi u (2i+1)}{2N}\right) \cos\left(\frac{\pi v (2j+1)}{2N}\right)
\]

where $C(u) = 1/\sqrt{2}$ for $u = 0$ and $C(u) = 1$ otherwise.
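The transform pair can be checked numerically with a direct, unoptimized implementation of the two sums, using the normalization 2/N and C(0) = 1/sqrt(2). This is a sketch for illustration; real codecs use fast factorizations, and the function names are assumptions.

```python
import math


def c(k):
    """Normalization factor C(k) from the DCT definition."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def basis(k, x, n):
    """Cosine basis term cos(pi * k * (2x + 1) / (2n))."""
    return math.cos(math.pi * k * (2 * x + 1) / (2 * n))


def dct2(f):
    """Forward 2-D DCT of an n x n block, by direct summation."""
    n = len(f)
    return [[(2 / n) * c(u) * c(v) * sum(
                f[i][j] * basis(u, i, n) * basis(v, j, n)
                for i in range(n) for j in range(n))
             for v in range(n)] for u in range(n)]


def idct2(F):
    """Inverse 2-D DCT of an n x n coefficient block, by direct summation."""
    n = len(F)
    return [[(2 / n) * sum(
                c(u) * c(v) * F[u][v] * basis(u, i, n) * basis(v, j, n)
                for u in range(n) for v in range(n))
             for j in range(n)] for i in range(n)]


flat = [[8.0] * 8 for _ in range(8)]      # constant 8 x 8 block
F = dct2(flat)
restored = idct2(F)
```

For a constant block all energy lands in F(0, 0), here N times the block value, and every other coefficient is zero up to rounding error.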
JPEG: Quantization
- A DCT of an 8 × 8 block of integers is an 8 × 8 block of real numbers.
- The coefficient F(0, 0) is the DC coefficient, and all others are called the AC coefficients.
- JPEG divides the coefficients by values from a quantization table to replace the real-valued coefficients by integers.
- It is expected that many coefficients for higher values of u + v become zero and that only a fraction of all coefficients will remain nonzero.
- The coefficients are reordered into a linear array by placing higher-frequency coefficients (higher values of u + v) at the end of the array; those coefficients are most likely to be zeroes.
- Huffman coding is applied to all coefficients from all blocks in the image; zero-valued coefficients are encoded separately using special markers and their count for additional savings.
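The reordering by increasing u + v can be sketched as a zigzag scan. The slide only states the ordering by u + v; the alternating direction along each diagonal is the conventional JPEG zigzag and is an assumption here, as are the function names.

```python
def zigzag_order(n=8):
    """List (u, v) index pairs diagonal by diagonal, in increasing u + v."""
    order = []
    for s in range(2 * n - 1):
        diag = [(u, s - u) for u in range(n) if 0 <= s - u < n]
        if s % 2 == 0:
            diag.reverse()            # even diagonals are walked upward
        order.extend(diag)
    return order


def zigzag_scan(block):
    """Flatten a square block of quantized coefficients in zigzag order."""
    return [block[u][v] for u, v in zigzag_order(len(block))]


def trailing_zeros(seq):
    """Count the zero coefficients at the end of the scanned array."""
    count = 0
    for x in reversed(seq):
        if x != 0:
            break
        count += 1
    return count


block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0] = 10, -3, 2   # only low frequencies survive
scanned = zigzag_scan(block)
```

With the nonzero values grouped at the front, the long tail of zeros can be encoded by its count rather than value by value.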
- A header consists of the image type and dimensions, compression parameters, and the quantization table.
- It is combined with the Huffman-encoded coefficients, packed as a sequence of bits, to form a JPEG-encoded image.
- On the decoder side the integer-valued coefficients are restored by Huffman decoding.
- The quantization is reversed and the inverse DCT is applied to obtain the image.
- Huffman coding is lossless; the losses occur in the quantization process.
JPEG: Quantization Table
(u,v)    0    1    2    3    4    5    6    7
  0     16   11   10   16   24   40   51   61
  1     12   12   14   19   26   58   60   55
  2     14   13   16   24   40   57   69   56
  3     14   17   22   29   51   87   80   62
  4     18   22   37   56   68  109  103   77
  5     24   35   55   64   81  104  113   92
  6     49   64   78   87  103  121  120  101
  7     72   92   95   98  112  100  103   99
Default JPEG quantization table. The coefficients are divided by their corresponding values and then rounded to the nearest integer.
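Quantization and its reversal with this table can be sketched as below. The names `quantize` and `dequantize` are illustrative, and Python's built-in `round` resolves exact ties to the nearest even integer, which may differ from a particular codec's rounding rule.

```python
# Default quantization table from the slide, indexed as Q[u][v].
Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quantize(F):
    """Divide each DCT coefficient by its table entry and round to an integer."""
    return [[round(F[u][v] / Q[u][v]) for v in range(8)] for u in range(8)]


def dequantize(q):
    """Decoder side: multiply back. The rounding loss is not recoverable."""
    return [[q[u][v] * Q[u][v] for v in range(8)] for u in range(8)]


F = [[0.0] * 8 for _ in range(8)]
F[0][0], F[0][1], F[7][7] = 160.0, -33.0, 40.0
q = quantize(F)
restored = dequantize(q)
```

The high-frequency coefficient F(7, 7) = 40 is divided by 99 and rounds to zero, so it is lost after dequantization, while the low-frequency values survive.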
Secret Key Based Steganography
- Pure steganography: the system depends on the secrecy of the method; there is no key involved.
  - Not desirable (Kerckhoffs's principle).
- Compression + encryption of the message.
- Secret key based steganography.
- Public/private key steganography.
Lossy vs. Lossless Steganography
- Lossless steganography: modify lossless compression methods.
  - An example would be modifying the run-length encoding process to embed messages.
  - During the encoding process the method checks all runs longer than one pixel.
  - Suppose that a run of ten pixels is considered and that one bit needs to be embedded.
  - To embed a one, the run is split into two parts whose lengths add to ten, say nine and one; to embed a zero, the run is left unmodified.
  - The receiver checks all runs. Two consecutive runs of the same color are decoded as a one.
  - A run longer than one pixel, preceded and followed by runs of different colors, is decoded as a zero.
  - Clearly, this technique relies on obscurity, since detecting a file with information embedded by this technique is not hard.
- Lossy steganography: replace LSBs (least significant bits), modify PoVs (pairs of values).
- We are interested in lossy steganography.
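The run-splitting scheme can be sketched on a list of (length, color) runs. This sketch assumes the input runs are canonical (adjacent runs differ in color), always splits a run as (length - 1, 1), and assumes the receiver knows the number of embedded bits; the function names are illustrative.

```python
def embed_in_runs(runs, bits):
    """Split a run longer than one pixel to embed a 1; leave it intact for a 0."""
    out, pending = [], list(bits)
    for length, color in runs:
        if length > 1 and pending:
            if pending.pop(0) == 1:
                out.extend([(length - 1, color), (1, color)])
            else:
                out.append((length, color))
        else:
            out.append((length, color))   # single pixels carry no bit
    if pending:
        raise ValueError("not enough runs longer than one pixel")
    return out


def extract_from_runs(runs, n_bits):
    """Two adjacent same-color runs decode as 1; a lone long run decodes as 0."""
    bits, i = [], 0
    while i < len(runs) and len(bits) < n_bits:
        length, color = runs[i]
        if i + 1 < len(runs) and runs[i + 1][1] == color:
            bits.append(1)
            i += 2
        elif length > 1:
            bits.append(0)
            i += 1
        else:
            i += 1
    return bits


def expand(runs):
    """Decode runs to pixels, to confirm the image itself is unchanged."""
    return [color for length, color in runs for _ in range(length)]


runs = [(10, 3), (1, 5), (4, 3), (2, 7)]
stego_runs = embed_in_runs(runs, [1, 0, 1])
```

The decoded pixels are identical before and after embedding, but adjacent runs of the same color never occur in a normal encoder's output, which is why the scheme is easy to detect.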
Embedding by Modifying Carrier Bits
- The first approach identifies the carrier bits, i.e. the bits that will encode a message, and modifies them to encode the message.
- These carrier bits could be one or more LSBs of selected bytes of raster data; the selection process itself can use a key to select these bytes in pseudo-random order.
- Also, the raster data can be either raw image bytes (brightnesses and colors) or JPEG coefficients.
- Embedding is done by modifying the carrier bits suitably to encode the message.
- The message can be decoded from the carrier bits alone, i.e., the receiver identifies the carrier bits and extracts the message using the key and the algorithm.
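Key-driven selection of carrier bytes can be sketched with a seeded pseudo-random generator shared by sender and receiver. Python's `random` module is used here only for illustration and is not cryptographically secure; the function names and the integer key are assumptions.

```python
import random


def carrier_positions(key, cover_len, n_bits):
    """Derive n_bits distinct byte positions from the shared key."""
    return random.Random(key).sample(range(cover_len), n_bits)


def embed_keyed(cover: bytes, bits, key) -> bytes:
    """Write message bits into the LSBs of key-selected cover bytes."""
    stego = bytearray(cover)
    for pos, bit in zip(carrier_positions(key, len(cover), len(bits)), bits):
        stego[pos] = (stego[pos] & 0xFE) | (bit & 1)
    return bytes(stego)


def extract_keyed(stego: bytes, n_bits, key):
    """Regenerate the same positions from the key and read the LSBs."""
    return [stego[pos] & 1 for pos in carrier_positions(key, len(stego), n_bits)]


cover = bytes(range(64))
message = [1, 0, 0, 1, 1, 1, 0, 1]
stego = embed_keyed(cover, message, key=2024)
```

Because both sides derive the same positions from the key, the receiver needs only the stego data, the key, and the message length.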
These techniques can be compared using the following criteria (Westfeld, F5):
- The embedding rate: the number of embedded bits per carrier bit.
- The embedding efficiency: the expected number of embedded message bits per modified carrier bit.
- The change rate: the average percentage of modified carrier bits.
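The three criteria can be computed for a single embedding as below. Note that the efficiency defined above is an expected value, whereas this sketch measures it empirically for one cover and one message; the function name is illustrative.

```python
def embedding_metrics(cover_bits, stego_bits, n_message_bits):
    """Return (embedding rate, embedding efficiency, change rate in percent)."""
    n_carrier = len(cover_bits)
    n_changed = sum(c != s for c, s in zip(cover_bits, stego_bits))
    rate = n_message_bits / n_carrier
    # Efficiency is undefined when nothing changed; report infinity in that case.
    efficiency = n_message_bits / n_changed if n_changed else float("inf")
    change_rate = 100.0 * n_changed / n_carrier
    return rate, efficiency, change_rate


cover_bits = [0, 1, 1, 0, 1, 0, 0, 1]
# Message 1, 1, 0, 0 written into the first four carrier bits.
stego_bits = [1, 1, 0, 0, 1, 0, 0, 1]
rate, efficiency, change_rate = embedding_metrics(cover_bits, stego_bits, 4)
```

For plain LSB replacement with a random message, about half of the used carrier bits already hold the right value, so the efficiency is about two message bits per change.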