A2-RL: Aesthetics Aware Reinforcement Learning for Automatic Image Cropping


Debang Li, Huikai Wu, Junge Zhang, Kaiqi Huang
NLPR, Institute of Automation, Chinese Academy of Sciences
Homepage: debangli.github.io/a2rl

Abstract

Image cropping aims at improving the aesthetic quality of images by adjusting their composition. Most previous methods rely on the sliding window mechanism, which requires fixed aspect ratios, cannot produce cropping regions of arbitrary size, and usually generates tens of thousands of candidate windows, making it very time-consuming. Motivated by these challenges, and inspired by how humans crop images, we formulate aesthetic image cropping as a sequential decision-making process and propose an Aesthetics Aware Reinforcement Learning (A2-RL) framework to address this problem. In particular, the proposed method develops an aesthetics aware reward function which especially benefits image cropping. Similar to human decision making, and to better utilize historical experience, we use an LSTM-based state representation that includes both the current and historical observations. We train the agent with an actor-critic architecture in an end-to-end manner. The agent is evaluated on several popular cropping databases that are unseen during training. Experimental results show that our method achieves state-of-the-art performance with far fewer candidate windows and much less time than previous methods.

1 Introduction

Image cropping is a common task in image editing which aims to extract well-composed regions from ill-composed images. It can improve the visual quality of images, because composition plays an important role in image quality. An excellent automatic image cropping algorithm can give editors professional advice and save them a lot of time (Kao, He, and Huang 2017). In the past decades, many researchers have devoted their efforts to proposing novel methods (Yan et al. 2013; Fang et al. 2014; Chen et al. 2017a) for automatic image cropping. Most of these methods follow a three-step pipeline: 1) densely extract candidates with the sliding window method on the input image, 2) extract carefully designed features from each region, and 3) use a classifier or ranker to grade each window and find the best region.
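To make the cost of this pipeline concrete, the sketch below enumerates sliding-window candidates the way such methods typically do. It is a minimal illustration; the scales, aspect ratios, and stride are assumptions of ours, not the exact settings of any cited method.

```python
def sliding_window_candidates(img_w, img_h,
                              scales=(0.5, 0.6, 0.7, 0.8, 0.9),
                              aspect_ratios=(1.0, 4 / 3, 3 / 4, 16 / 9),
                              stride_frac=0.05):
    """Enumerate crop candidates (x, y, w, h) over a fixed grid.

    Every candidate is tied to one of the predefined aspect ratios,
    which is exactly the limitation discussed above: windows of other
    shapes can never be proposed.
    """
    candidates = []
    for s in scales:
        for ar in aspect_ratios:
            h = int(img_h * s)
            w = int(h * ar)
            if w > img_w or h > img_h:
                continue
            step_x = max(1, int(img_w * stride_frac))
            step_y = max(1, int(img_h * stride_frac))
            for y in range(0, img_h - h + 1, step_y):
                for x in range(0, img_w - w + 1, step_x):
                    candidates.append((x, y, w, h))
    return candidates

# Even this modest grid yields thousands of windows on a 1000x800 image,
# each of which must be scored by an aesthetics model.
print(len(sliding_window_candidates(1000, 800)))
```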
Although these works have achieved quite good performance, they may not find the best result due to the limitations of the sliding window method, which requires fixed aspect ratios and cannot cover cropping regions of arbitrary size. What is more, these sliding window based methods usually need tens of thousands of windows, which is very time-consuming. Although one can set several different aspect ratios and densely extract candidates, this inevitably costs a lot of time and is still unable to cover all cases. We also believe that the sliding window method differs from the human cropping process and is not natural for aesthetic quality evaluation. According to our observation, a human usually takes a whole view of the input image and makes sequential decisions to find the best region, almost never searching patch by patch as the sliding window method does. Based on the above observations, in this paper we formulate automatic image cropping as a sequential decision-making process and propose an Aesthetics Aware Reinforcement Learning (A2-RL) model to solve it. The sequential decision-making based automatic cropping process is illustrated in Figure 1. To our knowledge, we are the first to put forward a reinforcement learning based method for automatic image cropping.

Figure 1: Illustration of the sequential decision-making based automatic cropping process. The cropping agent starts from the whole image and takes actions to find the best cropping window in the input image. At each step, it takes an action (yellow and red arrows) and transforms the previous window (dashed yellow rectangle) into a new one (red rectangle). The agent takes a termination action and stops at step T. We also use the VFN (Chen et al. 2017b) to evaluate the input image and the cropped image; the higher score of the cropped image validates the capability of our agent.

The A2-RL model can finish the cropping process within several steps, or at most a dozen or so, and can return results of arbitrary shape, which overcomes the disadvantages of the sliding window method. In particular, the A2-RL model develops a novel aesthetics aware reward function which especially benefits image cropping. Inspired by human decision making, historical experience is also explored to assist the current decision through an LSTM-based state representation. We test the model on three popular cropping databases (Yan et al. 2013; Fang et al. 2014; Chen et al. 2017a) that are unseen during training, and the experimental results demonstrate that our method obtains state-of-the-art cropping performance with far fewer candidate windows and much less time than related methods.

2 Related Work

Image cropping aims at improving the composition of images, which is very important for their aesthetic quality. There are a number of previous works on aesthetic quality assessment. Many early works (Ke, Tang, and Jing 2006; Datta et al. 2006; Luo, Wang, and Tang 2011; Dhar, Ordonez, and Berg 2011) focus on designing handcrafted features based on intuitions from human perception or photographic rules. For example, low-level features such as colorfulness and the rule of thirds are proposed to discriminate whether an image is pleasing or not (Datta et al. 2006). High-level attributes, including composition and content, are also used to describe images (Dhar, Ordonez, and Berg 2011). Recently, thanks to the fast development of deep learning and newly proposed large-scale databases (Murray, Marchesotti, and Perronnin 2012), many new works (Kong et al. 2016; Mai, Jin, and Liu 2016; Deng, Loy, and Tang 2017) accomplish aesthetic quality assessment with convolutional neural networks.

Previous automatic image cropping methods can be divided into two classes: attention-based and aesthetics-based methods. The basic approach of attention-based methods (Suh et al. 2003; Stentiford 2007; Park et al. 2012; Chen et al. 2016) is to find the most visually salient regions in the original image. Attention-based methods can find cropping windows that draw more attention from people, but they may not generate very pleasing cropping windows, because they hardly consider image composition (Chen et al. 2017a). Aesthetics-based image cropping methods, by contrast, aim to find the most pleasing cropping window in the original image. As these methods use aesthetic quality as the criterion, they use almost the same features as aesthetic quality assessment. Some of these works (Nishiyama et al. 2009; Fang et al. 2014) use aesthetic quality classifiers to discriminate the quality of candidate windows. Other works use a RankSVM (Chen et al. 2017a) or a RankNet (Chen et al. 2017b) to grade each candidate window. There are also change-based methods (Yan et al. 2013), which compare original images with cropped images so as to discard distracting regions and retain high-quality ones. Most current aesthetics-based methods (Fang et al. 2014; Chen et al. 2017a; 2017b) still rely on the sliding window method to obtain candidate windows. As discussed above, the sliding window method uses fixed aspect ratios and cannot produce windows of arbitrary size. What is more, these methods need tens of thousands of candidates to finish the cropping process. In this paper, we propose a reinforcement learning based strategy to search for the cropping window.
With this method, we can find the final result with only several candidates, or a dozen or so, of arbitrary size. Reinforcement learning based strategies have been successfully applied in many domains of computer vision, including object detection (Caicedo and Lazebnik 2015; Jie et al. 2016), image captioning (Ren et al. 2017) and visual relationship detection (Liang, Lee, and Xing 2017). The active object localization method (Caicedo and Lazebnik 2015) achieves the best performance among detection algorithms that do not use region proposals. The Tree-RL method (Jie et al. 2016) uses reinforcement learning to obtain region proposals and achieves results comparable to RPN (Ren et al. 2015) with far fewer region proposals. To the best of our knowledge, we are the first to put forward a deep reinforcement learning based method for automatic image cropping.

3 Aesthetics Aware Reinforcement Learning

When a person crops an image, he may first take a look at the whole image and pick a patch as an initial result. After that, he may keep searching for better cropping windows based on the whole image and previous cropping results until he obtains a satisfactory one. Inspired by this observation, we think automatic image cropping can be formulated as a sequential decision-making process. In such a process, an agent interacts with the environment and takes a series of actions to optimize a target. As illustrated in Figure 2, for our problem the agent receives an observation from the input image and the cropping window. It then samples an action from the action space according to the observation and historical experience. The sampled action is executed by the agent to manipulate the shape and position of the cropping window. After each action, the agent receives a reward according to the aesthetic score of the cropped image. Its target is to find the most pleasing window in the original image by maximizing the accumulated reward. In this section, we first introduce the state space, the action space and the aesthetics aware reward of our A2-RL model, then we detail the architecture of the aesthetics aware reinforcement learning (A2-RL) model and the training process.

3.1 State and Action Space

At each step, the agent decides which action to execute according to the current state. The state must provide the agent with comprehensive information for better decisions. As the A2-RL model formulates automatic image cropping as a sequential decision-making process, the state can be represented as s_t = {o_1, o_2, ..., o_{t-1}, o_t}, where o_t is the current observation of the agent. This formulation is similar to the human decision-making process, which considers both the current observation and historical experience. The A2-RL model uses the features of the cropping window and the input image as the current observation o_t, so the agent can learn about both global and local information. Both the local feature and the global feature are extracted from the FC6 layer of the pre-trained View Finding Network (VFN) (Chen et al. 2017b), which is modified from the original AlexNet (Krizhevsky, Sutskever, and Hinton 2012). In the A2-RL model, we use an LSTM unit to memorize the historical observations {o_1, o_2, ..., o_{t-1}} and combine them with the current observation o_t to form the state s_t.
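As a concrete illustration of this state representation, the minimal sketch below runs an LSTM over the concatenation of a retained global feature and the per-step local feature, so the hidden state summarizes {o_1, ..., o_t}. The 1000-dimensional features and 1024-dimensional state match the description in Figure 2; the module name and the single LSTM cell are our own assumptions for illustration.

```python
import torch
import torch.nn as nn

class HistoryAwareState(nn.Module):
    """Fold the observation history {o_1, ..., o_t} into a fixed-size state.

    Each observation o_t is the concatenation of the (retained) global
    image feature and the local feature of the current crop window.
    """

    def __init__(self, feat_dim=1000, state_dim=1024):
        super().__init__()
        self.lstm = nn.LSTMCell(2 * feat_dim, state_dim)

    def forward(self, f_global, f_local, hidden=None):
        o_t = torch.cat([f_global, f_local], dim=-1)  # current observation
        h_t, c_t = self.lstm(o_t, hidden)             # carries history forward
        return h_t, (h_t, c_t)                        # h_t acts as the state s_t

# One step: features for the whole image and the current crop (batch of 1).
state_fn = HistoryAwareState()
f_global = torch.randn(1, 1000)   # extracted once and retained
f_local = torch.randn(1, 1000)    # re-extracted after every action
s_t, hidden = state_fn(f_global, f_local)
```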

Figure 2: Illustration of the A2-RL model architecture. In the forward pass, the feature of the cropped window is extracted by the VFN (Chen et al. 2017b) and concatenated with the feature of the whole image, which is extracted and retained beforehand. The concatenated feature vector is then fed into the actor-critic branch, which has two outputs: the actor output is used to sample an action from the action space so as to manipulate the cropping window, and the critic output estimates the expected reward at the current state. In addition, the feature of the cropping window is fed into an image quality evaluation branch, whose output is the aesthetic score of the current cropping window and is stored to compute rewards for actions. In this model, both the global feature and the local feature are 1000-dimensional vectors; the three fully-connected layers and the LSTM layer all output 1024-dimensional feature vectors.

We choose 14 pre-defined actions to form the action space, which can be divided into four groups: scaling actions, position translation actions, aspect ratio translation actions and a termination action. The first three groups aim to adjust the size, position and shape of the cropping window, and include 5, 4 and 4 actions respectively. These three groups follow definitions similar to those in (Jie et al. 2016), but with a different scale: all of these actions adjust the shape and position by 0.05 times the original image size, which can capture more accurate cropping windows than a larger step would. The termination action is a trigger: when it is chosen, the agent stops the cropping process and outputs the current cropping window as the final result. Theoretically, the agent can reach windows of any size and position on the original image. The observation and action space are illustrated in Figure 2 for an intuitive overview.
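The sketch below spells out one plausible encoding of these 14 actions as window updates of 0.05 times the original image size. The paper only specifies the group sizes and the step scale, so the exact action layout and direction conventions here are assumptions of ours.

```python
ALPHA = 0.05  # every action moves or resizes by 0.05x the original image size

def apply_action(window, action, img_w, img_h):
    """Apply one of the 14 actions to a crop window (x1, y1, x2, y2).

    Returns the new window and whether the termination action was chosen.
    Assumed, illustrative layout: 0-4 scaling, 5-8 position translation,
    9-12 aspect-ratio change, 13 termination.
    """
    x1, y1, x2, y2 = window
    dx, dy = ALPHA * img_w, ALPHA * img_h
    if action == 13:                                  # termination: stop here
        return window, True
    if action == 0:   x1, y1, x2, y2 = x1 + dx, y1 + dy, x2 - dx, y2 - dy
    elif action == 1: x1, y1 = x1 + dx, y1 + dy      # shrink from top-left
    elif action == 2: x2, y1 = x2 - dx, y1 + dy      # shrink from top-right
    elif action == 3: x1, y2 = x1 + dx, y2 - dy      # shrink from bottom-left
    elif action == 4: x2, y2 = x2 - dx, y2 - dy      # shrink from bottom-right
    elif action == 5: x1, x2 = x1 - dx, x2 - dx      # move left
    elif action == 6: x1, x2 = x1 + dx, x2 + dx      # move right
    elif action == 7: y1, y2 = y1 - dy, y2 - dy      # move up
    elif action == 8: y1, y2 = y1 + dy, y2 + dy      # move down
    elif action == 9:  y1 = y1 + dy                  # shorter (from top)
    elif action == 10: y2 = y2 - dy                  # shorter (from bottom)
    elif action == 11: x1 = x1 + dx                  # narrower (from left)
    elif action == 12: x2 = x2 - dx                  # narrower (from right)
    # clamp to the image and ignore actions that would collapse the window
    x1, y1 = max(0.0, x1), max(0.0, y1)
    x2, y2 = min(float(img_w), x2), min(float(img_h), y2)
    if x2 - x1 < 1 or y2 - y1 < 1:
        return window, False
    return (x1, y1, x2, y2), False
```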
3.2 Aesthetics Aware Reward

The goal of the A2-RL model is to find the most pleasing cropping window in the original input image, so the reward function must guide the agent toward a cropping window with better aesthetic quality at each step. In the A2-RL model, the VFN (Chen et al. 2017b) is used to give each cropped image a quality score. When an action is executed, the difference between the aesthetic scores of the previous and the current cropping window is used to compute the reward for that action. More precisely, if the aesthetic score of the current window is higher than that of the previous one, the agent receives a positive reward; if the score becomes lower, the agent gets a negative reward. For an input image I, we denote the score given by the VFN as S_VFN(I). The current cropped image and the previous cropped image are denoted as I_t and I_{t-1} respectively, and we use a sign function to compute the basic reward for the current action:

    r_{basic,t} = sign(S_VFN(I_t) - S_VFN(I_{t-1}))    (1)

This is the foundation of our aesthetics aware reward function. We also add other heuristic constraints for better cropping policies. We consider the aspect ratios of well-composed images to be limited to a particular range: in the A2-RL model, if the aspect ratio of the current window is lower than 0.5 or higher than 2, the agent receives a negative reward as a punishment for the corresponding action. We empirically set this negative reward to -5, because we want the agent to learn a strict rule that never lets such a situation happen. To speed up the cropping process, we give the agent an additional negative reward -t at every step, where t is the number of steps the agent has taken since the beginning; this lowers the reward when the agent takes too many steps. Let ar denote the aspect ratio of the current window. The whole reward r_t for the agent taking action a_t under state s_t can be formulated as:

    r_t(s_t, a_t) = -5,                 if ar < 0.5 or ar > 2
    r_t(s_t, a_t) = r_{basic,t} - t,    otherwise    (2)
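Equations (1) and (2) translate directly into code. A minimal sketch follows, assuming the caller supplies the VFN scores of the previous and current crops; the function name and argument layout are illustrative.

```python
import math

def step_reward(prev_score, score, window, t):
    """Aesthetics aware reward of Eq. (2).

    prev_score, score: S_VFN of the previous and current crops.
    window: (x1, y1, x2, y2) of the current crop; t: steps taken so far.
    """
    w, h = window[2] - window[0], window[3] - window[1]
    ar = w / h  # aspect ratio of the current window
    if ar < 0.5 or ar > 2.0:
        return -5.0                       # hard punishment for extreme shapes
    # sign of the aesthetic score change, Eq. (1)
    r_basic = math.copysign(1.0, score - prev_score) if score != prev_score else 0.0
    return r_basic - t                    # step penalty discourages long episodes
```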

3.3 A2-RL Model

With the state space, action space and reward function defined, we now introduce the architecture of our Aesthetics Aware Reinforcement Learning (A2-RL) model, illustrated in Figure 2. The A2-RL model starts with a 5-layer convolution block and a fully-connected layer for feature representation. The model then splits into two branches: an actor-critic branch and an image quality evaluation branch. The actor-critic branch is composed of three fully-connected layers and an LSTM layer, where the LSTM layer memorizes the historical observations. The actor-critic branch has two outputs. The first is the policy output, also named the Actor; the second is the value output, also named the Critic. The policy output is a fourteen-dimensional vector, each dimension corresponding to the probability of taking the relevant action. The value output is an estimate of the current state, i.e., the expected accumulated reward in that situation. The image quality evaluation branch outputs an aesthetic quality score for the cropped image, which is used to compute the reward.

In the image cropping process, the A2-RL model provides the agent with the probability of each action under the current state. As shown in Figure 2, the model first feeds the cropped image into the feature representation unit and extracts its feature. This feature is then combined with the global feature, which was extracted in the first forward pass and retained for the following process. The combined feature vector is fed into the actor-critic branch. According to the policy output, the agent samples the relevant action and adjusts the size and position of the cropping window correspondingly. For example, in Figure 2, the agent executes the sampled action to shrink the cropping window from the left and top by 0.05 times the image size. The forward pass continues until the termination action is sampled.

3.4 Training the A2-RL Model

To make the A2-RL model learn good cropping policies, our training process is based on the asynchronous advantage actor-critic (A3C) algorithm (Mnih et al. 2016). Different from the original A3C, we replace the asynchronous mechanism with mini-batches to increase diversity. In the training stage, we use the advantage function (Mnih et al. 2016) and an entropy regularization term (Williams and Peng 1991) to form the optimization objective of the policy output. We use R_t to denote the accumulated reward at step t:

    R_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v)

where \gamma is the discount factor, V(s_t; \theta_v) is the value output under state s_t, and k ranges from 0 to t_max, the maximum number of steps before updating. The optimization objective of the policy output is to maximize the advantage function R_t - V(s_t; \theta_v) and the entropy of the policy output H(\pi(s_t; \theta)), where \pi(s_t; \theta) is the probability distribution of the policy output and H is the entropy function. The entropy term increases the diversity of actions, which helps the agent learn flexible policies. The optimization objective of the value output is to minimize (R_t - V(s_t; \theta_v))^2 / 2. The gradients of the actor-critic branch can therefore be formulated as

    \nabla_\theta \log \pi(a_t | s_t; \theta) (R_t - V(s_t; \theta_v)) + \beta \nabla_\theta H(\pi(s_t; \theta))

for the actor and

    \nabla_{\theta_v} (R_t - V(s_t; \theta_v))^2 / 2

for the critic, where \beta controls the influence of the entropy term. The whole training procedure of the A2-RL model is described in Algorithm 1, where T_max and t_max denote the maximum numbers of steps the agent takes before termination and before updating the network, respectively.
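The sketch below shows how these two gradients are typically realized with automatic differentiation: a minimal single-rollout version under our own simplifying assumptions (the logits and values come from the actor-critic branch, the rewards from the reward function above, and the default gamma and beta values are placeholders, since the paper's numeric settings are not given here).

```python
import torch
import torch.nn.functional as F

def actor_critic_loss(policy_logits, actions, values, rewards, R,
                      gamma=0.99, beta=0.01):
    """n-step actor-critic objective in the style used to train A2-RL.

    policy_logits: (T, A) actor outputs for the T steps of one rollout.
    actions: length-T sampled action indices; values: (T,) critic outputs.
    rewards: list of T rewards; R: detached bootstrap value (0 on termination).
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)          # H(pi(s_t; theta))
    policy_loss, value_loss = 0.0, 0.0
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R                      # discounted return R_t
        advantage = R - values[i]                       # R_t - V(s_t; theta_v)
        # minimize -log pi * advantage - beta * H  (maximizes the objective)
        policy_loss = policy_loss \
            - log_probs[i, actions[i]] * advantage.detach() \
            - beta * entropy[i]
        value_loss = value_loss + 0.5 * advantage.pow(2)  # (R_t - V)^2 / 2
    return policy_loss + value_loss

# Usage: loss = actor_critic_loss(logits, acts, vals, rews, R); loss.backward()
```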
Algorithm 1: Training procedure of the A2-RL model
Input: original image I
 1: f_local = f_global = Feature_extractor(I)
 2: previous_score = S_VFN(I)
 3: T = 0
 4: repeat
 5:   t = 0
 6:   repeat
 7:     o_t = concat(f_global, f_local)
 8:     s_t = LSTM_AC(o_t)  // actor-critic branch with LSTM
 9:     Perform a_t according to the policy \pi(a_t | s_t; \theta) and obtain the cropped image I_c
10:     f_local = Feature_extractor(I_c)  // FC6 feature of the crop
11:     score = S_VFN(I_c)
12:     r_t = reward(previous_score, score, I_c, T)
13:     t = t + 1; T = T + 1
14:     previous_score = score
15:   until t == t_max or a_t is the termination action
16:   R = 0 if a_t is the termination action, else V(s_t; \theta_v)
17:   for i in {t-1, ..., 0} do
18:     R = r_i + \gamma R
19:     d\theta = d\theta + \nabla_\theta \log \pi(a_i | s_i; \theta)(R - V(s_i; \theta_v)) + \beta \nabla_\theta H(\pi(s_i; \theta))
20:     d\theta_v = d\theta_v + \nabla_{\theta_v} (R - V(s_i; \theta_v))^2 / 2
21:   end for
22:   Update \theta with d\theta and \theta_v with d\theta_v
23: until T >= T_max or a_t is the termination action

4 Experiments

4.1 Experimental Settings

In this section, we introduce how we obtain the training data, the implementation details of the training procedure, and the evaluation data and metrics.

Training Data. To train our network, we select images from a large-scale aesthetics image database named AVA (Murray, Marchesotti, and Perronnin 2012), which consists of a large number of training and testing images, all labeled with aesthetic scores from one to ten by several people. As the score distribution is extremely unbalanced, we simply divide the images into three classes, low, middle and high quality, corresponding to scores from one to four, four to seven and seven to ten respectively. We choose about 3000 images from each class to compose the training set, so there are 9000 images in the training set in total. With this training data, the A2-RL model can learn cropping policies for images of diverse quality, which helps the model generalize well to different images.

Implementation Details. The RMSProp (Tieleman and Hinton 2012) algorithm is used to optimize the A2-RL model. We set the batch size to 32; the learning rate, the entropy weight \beta and the reward discount factor \gamma are kept fixed in all experiments. T_max and t_max are set to 50 and 10 respectively. We also choose 900 images from the AVA database as a validation set, constructed in the same way as the training set. As the A2-RL model aims to produce a cropping window with a higher aesthetic score than the original image, on the validation set we use the improvement in aesthetic score from the original image to the cropped image as the metric. We train the network on the training set for 20 epochs, validate the model after every epoch, and choose the model with the best average improvement on the validation set as our final A2-RL model.

Evaluation Data and Metrics. To evaluate the cropping ability of our agent, we test it on three automatic image cropping databases that are unseen during training: the CUHK Image Cropping Database (Yan et al. 2013), the Flickr Cropping Database (Chen et al. 2017a) and the Human Cropping Database (Fang et al. 2014). The first two databases use the same evaluation metrics, while the last one uses a different metric; we adopt the same metrics as the original papers for fair comparison. The CUHK Image Cropping Database has 950 testing images, each annotated by three different expert photographers. The Flickr Cropping Database contains 348 testing images, each with a single annotation. On these two databases, previous works (Yan et al. 2013; Chen et al. 2017a; 2017b) measure cropping accuracy with the average intersection-over-union (IoU) and the average boundary displacement. Denoting the ground truth window of image i as W_i^g and the cropping window as W_i^c, the average IoU over N images is

    (1/N) \sum_{i=1}^{N} area(W_i^g \cap W_i^c) / area(W_i^g \cup W_i^c)    (3)

The average boundary displacement computes the average distance between the four edges of the ground truth window and those of the cropping window. For image i, we denote the four edges of the ground truth window as B_i^g(l), B_i^g(r), B_i^g(u), B_i^g(b), and correspondingly the four edges of the cropping window as B_i^c(l), B_i^c(r), B_i^c(u), B_i^c(b). The average boundary displacement over N images is

    (1/N) \sum_{i=1}^{N} \sum_{j \in \{l,r,u,b\}} |B_i^g(j) - B_i^c(j)| / 4    (4)

The Human Cropping Database contains 500 testing images, each annotated by ten people. Because it has more annotations per image than the first two databases, the evaluation metric is slightly different: previous works (Fang et al. 2014; Kao, He, and Huang 2017) on this database use the top-k maximum intersection-over-union (IoU). This metric is similar to the average IoU above, except that it computes the IoU between a proposed cropping window and each of the ten ground truth windows and keeps the maximum; top-k means that the k best cropping windows are used to compute the result.
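For completeness, here is one straightforward implementation of these metrics. Windows are (x1, y1, x2, y2) boxes; the helper names are illustrative, and the top-k variant assumes the proposals are already sorted by score.

```python
def iou(a, b):
    """Intersection-over-union of two windows (x1, y1, x2, y2), Eq. (3)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda w: (w[2] - w[0]) * (w[3] - w[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def boundary_displacement(a, b):
    """Average displacement of the four edges between two windows, Eq. (4)."""
    return sum(abs(a[i] - b[i]) for i in range(4)) / 4.0

def topk_max_iou(proposals, ground_truths, k):
    """Top-k maximum IoU: best IoU between the k best (score-sorted)
    proposals and any of the ground-truth windows of one image."""
    return max(iou(p, g) for p in proposals[:k] for g in ground_truths)
```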
Table 1: Cropping accuracy on the Flickr Cropping Database (Chen et al. 2017a), reported as average IoU and average boundary displacement error for eDN, RankSVM+DeCAF_7, VFN+SW, A2-RL w/o LSTM and A2-RL (ours). The best results are highlighted in bold.

4.2 Evaluation of Cropping Accuracy

In this section, we compare the cropping accuracy of our A2-RL model with conventional sliding window based cropping methods to validate its effectiveness. As the sliding window based VFN (Chen et al. 2017b) achieves the best results among methods that do not use supervised cropping data, we mainly compare our A2-RL model with this cropping method. Our A2-RL model uses an actor-critic based reinforcement learning method to search for the best cropping window sequentially with only a few candidate windows, whereas the original VFN-based cropping method uses sliding windows to densely extract candidates. We also compare with several other baselines on these databases.

CUHK and Flickr Cropping Databases. As the previous VFN method (Chen et al. 2017b) is only evaluated on the CUHK and Flickr cropping databases, we also mainly compare our framework with VFN on these two databases. Notably, the original VFN uses not only sliding window candidates but also the ground truth windows of the test images as candidates, which leads to remarkably high performance on these two databases. Since the A2-RL model aims to search for the best cropping window, and in practice no ground truth window is available to a cropping algorithm, in this experiment we do not use any ground truth window for either framework, for fair comparison. It is also worth mentioning that the A2-RL model has never seen images from either database during training. Besides the two frameworks discussed above, we also compare with some other cropping methods. We choose the best attention-based method, eDN, reported in (Chen et al. 2017a), on behalf of the attention-based cropping algorithms. This method computes image saliency maps with the algorithm from (Vig, Dorr, and Cox 2014) and searches for the best cropping window by maximizing the difference in average saliency between the cropping window and the remaining region. We also choose the best result (RankSVM+DeCAF_7) reported in (Chen et al. 2017a) as another baseline; in this method, the aesthetic feature DeCAF_7 is extracted from AlexNet and a RankSVM is trained to find the best cropping window among all candidates. For all the sliding window based methods, including eDN, RankSVM+DeCAF_7 and VFN+SW (sliding window), the results are reported with the same sliding window setting as (Chen et al. 2017a).

Table 2: Cropping accuracy on the CUHK Image Cropping Database (Yan et al. 2013), reported per annotation (I, II, III) as average IoU and average boundary displacement error for eDN, RankSVM+DeCAF_7, LearnChange, VFN+SW, A2-RL w/o LSTM and A2-RL (ours). The best results are highlighted in bold.

Table 3: Cropping accuracy on the Human Cropping Database (Fang et al. 2014), reported as top-1 to top-5 maximum IoU for (Fang et al. 2014), (Kao, He, and Huang 2017), A2-RL w/o LSTM and A2-RL (ours). The best results are highlighted in bold.

Experiments on the Flickr Cropping Database are shown in Table 1. VFN+SW and A2-RL are the two main frameworks under comparison: VFN+SW is the original VFN framework (Chen et al. 2017b), which uses sliding windows (SW) to densely extract crop candidates, while A2-RL is our reinforcement learning based framework. Similarly, we list the cropping accuracy of the above methods on the CUHK Image Cropping Database in Table 2. As there are three annotations for each image, following previous works (Yan et al. 2013; Chen et al. 2017a; 2017b) we list the result for each annotation separately; all symbols in Table 2 have the same meaning as in Table 1. In addition, we report the best result of (Yan et al. 2013), the work in which this database was proposed. Notably, that method is trained with supervised cropping data on this database, so the comparison is not entirely fair to us; as the method is change-based, we denote it as LearnChange in Table 2. From Tables 1 and 2, we can see that our A2-RL model consistently outperforms the other cropping algorithms on both databases, which demonstrates its effectiveness.

Human Cropping Database. We also evaluate our A2-RL model on the Human Cropping Database. Following previous works (Fang et al. 2014; Kao, He, and Huang 2017) on this database, the top-k maximum intersection-over-union (IoU) is used as the cropping accuracy metric, as discussed above. We choose the two state-of-the-art methods (Fang et al. 2014; Kao, He, and Huang 2017) as our baselines. The cropping accuracy of our A2-RL model and these baselines is listed in Table 3. As the A2-RL model finds one cropping window at a time, we let the agent search the input image k times for the top-k IoU. From Table 3, we can see that our A2-RL model again outperforms the state-of-the-art methods.

4.3 Evaluation of Time Efficiency

In this section, we study the time efficiency of our A2-RL model. We compare the A2-RL model with the sliding window based VFN model on the Flickr Cropping Database; the experimental results are shown in Table 4. Avg Steps and Avg Time denote the average number of steps and the average time each method needs to finish the cropping process on a single image. We augment the number of sliding windows in this experiment: VFN+SW, VFN+SW+ and VFN+SW++ in Table 4 correspond to different numbers of candidate windows, where VFN+SW follows the original setting (Chen et al. 2017b). Notably, all results in Table 4 are evaluated on the same machine, which has a single NVIDIA GeForce Titan X Pascal GPU with 12 GB of memory and an Intel Core i7-6800K CPU with 6 cores.

Table 4: Time efficiency comparison on the Flickr Cropping Database, reported as average IoU, average displacement, average steps and average time (s) for VFN+SW, VFN+SW+, VFN+SW++ and A2-RL (ours). The best results are highlighted in bold.
From Table 4, we can see that the cropping accuracy improves as we augment the number of sliding windows, but the time consumed grows substantially as well. Unsurprisingly, our A2-RL model needs far fewer steps and costs much less time than the other methods: the average number of steps the A2-RL model takes is more than ten times smaller than that of the sliding window based method, yet the A2-RL model still achieves better cropping accuracy. This result shows the capability of our RL-based model: with the novel aesthetics aware reward and the history-preserving state representation, the model learns to use as few actions as possible to obtain a more pleasing image.

Figure 3: Image cropping examples on the Flickr Cropping Database (Chen et al. 2017a): (a) VFN+Sliding Window, (b) A2-RL (ours), (c) ground truth. Best viewed in color.

4.4 Experiment Analysis

In this section, we analyze the experimental results and study our model.

RL Search vs. Sliding Window. From Tables 1, 2 and 4, we can see that A2-RL consistently beats VFN+SW in both cropping accuracy and time efficiency. The main difference between the two methods is the way the cropping candidates are obtained, so we conclude that our proposed RL-based search is superior to the sliding window strategy. Although the sliding window approach can extract candidates densely, it still fails to find very accurate candidates because of its fixed aspect ratios; in contrast, our A2-RL model can find cropping windows of arbitrary size.

Observation + Historical Experience vs. Observation Only. We use an LSTM unit to memorize the historical observations {o_1, o_2, ..., o_{t-1}} and combine them with the current observation o_t to form the state s_t. Here we study the effect of this LSTM unit. We remove the LSTM unit from the A2-RL model, so the agent uses only the current observation o_t as the state s_t when making decisions. We train a new agent under this setting and evaluate it on the three databases above. The results are shown in Tables 1, 2 and 3, where the new agent is denoted as A2-RL w/o LSTM. We find that the cropping accuracy of the new model is much lower than that of the original A2-RL model, which demonstrates the importance of historical experience.

4.5 Qualitative Analysis

We show cropping results of different methods on several images from the Flickr Cropping Database in Figure 3. We can see that the A2-RL model finds better cropping windows, with arbitrary aspect ratios, than VFN with sliding windows, which illustrates the capability of our model intuitively. More qualitative results are given in the supplementary material due to the page limit.

5 Conclusion

In this paper, we formulated aesthetic image cropping as a sequential decision-making process and proposed a novel Aesthetics Aware Reinforcement Learning (A2-RL) model to address this problem. With the aesthetics aware reward and the LSTM-based state representation, which includes both the current and historical observations, the A2-RL model learns good policies for automatic image cropping. The agent finishes the cropping process within several steps, or at most a dozen or so, and obtains cropping windows of arbitrary size. Experiments on several unseen databases show that our model achieves state-of-the-art cropping accuracy with far fewer candidate windows and much less time.

References

Caicedo, J. C., and Lazebnik, S. 2015. Active object localization with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision.

Chen, J.; Bai, G.; Liang, S.; and Li, Z. 2016. Automatic image cropping: A computational complexity study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Chen, Y.-L.; Huang, T.-W.; Chang, K.-H.; Tsai, Y.-C.; Chen, H.-T.; and Chen, B.-Y. 2017a. Quantitative analysis of automatic image cropping algorithms: A dataset and comparative study. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).
Chen, Y.-L.; Klopp, J.; Sun, M.; Chien, S.-Y.; and Ma, K.-L. 2017b. Learning to compose with professional photographs on the web. arXiv preprint.
Datta, R.; Joshi, D.; Li, J.; and Wang, J. Z. 2006. Studying aesthetics in photographic images using a computational approach. In European Conference on Computer Vision. Springer.
Deng, Y.; Loy, C. C.; and Tang, X. 2017. Image aesthetic assessment: An experimental survey. IEEE Signal Processing Magazine 34(4).
Dhar, S.; Ordonez, V.; and Berg, T. L. 2011. High level describable attributes for predicting aesthetics and interestingness. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Fang, C.; Lin, Z.; Mech, R.; and Shen, X. 2014. Automatic image cropping using visual composition, boundary simplicity and content preservation models. In Proceedings of the 22nd ACM International Conference on Multimedia.
Jie, Z.; Liang, X.; Feng, J.; Jin, X.; Lu, W.; and Yan, S. 2016. Tree-structured reinforcement learning for sequential object localization. In Advances in Neural Information Processing Systems.
Kao, Y.; He, R.; and Huang, K. 2017. Automatic image cropping with aesthetic map and gradient energy map. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Ke, Y.; Tang, X.; and Jing, F. 2006. The design of high-level features for photo quality assessment. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1.
Kong, S.; Shen, X.; Lin, Z.; Mech, R.; and Fowlkes, C. 2016. Photo aesthetics ranking network with attributes and content adaptation. In European Conference on Computer Vision. Springer.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
Liang, X.; Lee, L.; and Xing, E. P. 2017. Deep variation-structured reinforcement learning for visual relationship and attribute detection. arXiv preprint.
Luo, W.; Wang, X.; and Tang, X. 2011. Content-based photo quality assessment. In 2011 IEEE International Conference on Computer Vision (ICCV).
Mai, L.; Jin, H.; and Liu, F. 2016. Composition-preserving deep photo aesthetics assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.
Murray, N.; Marchesotti, L.; and Perronnin, F. 2012. AVA: A large-scale database for aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Nishiyama, M.; Okabe, T.; Sato, Y.; and Sato, I. 2009. Sensation-based photo cropping. In Proceedings of the 17th ACM International Conference on Multimedia. ACM.
Park, J.; Lee, J.-Y.; Tai, Y.-W.; and Kweon, I. S. 2012. Modeling photo composition and its application to photo rearrangement. In IEEE International Conference on Image Processing (ICIP).
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks.
In Advances in Neural Information Processing Systems (NIPS).
Ren, Z.; Wang, X.; Zhang, N.; Lv, X.; and Li, L.-J. 2017. Deep reinforcement learning-based image captioning with embedding reward. arXiv preprint.
Stentiford, F. 2007. Attention based auto image cropping. In Workshop on Computational Attention and Applications, ICVS, volume 1.
Suh, B.; Ling, H.; Bederson, B. B.; and Jacobs, D. W. 2003. Automatic thumbnail cropping and its effectiveness. In Proceedings of the 16th Annual ACM Symposium on User Interface Software and Technology.
Tieleman, T., and Hinton, G. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4(2).
Vig, E.; Dorr, M.; and Cox, D. 2014. Large-scale optimization of hierarchical features for saliency prediction in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Williams, R. J., and Peng, J. 1991. Function optimization using connectionist reinforcement learning algorithms. Connection Science 3(3).
Yan, J.; Lin, S.; Bing Kang, S.; and Tang, X. 2013. Learning the change for automatic image cropping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.


More information

Deep Neural Network Architectures for Modulation Classification

Deep Neural Network Architectures for Modulation Classification Deep Neural Network Architectures for Modulation Classification Xiaoyu Liu, Diyu Yang, and Aly El Gamal School of Electrical and Computer Engineering Purdue University Email: {liu1962, yang1467, elgamala}@purdue.edu

More information

Global Contrast Enhancement Detection via Deep Multi-Path Network

Global Contrast Enhancement Detection via Deep Multi-Path Network Global Contrast Enhancement Detection via Deep Multi-Path Network Cong Zhang, Dawei Du, Lipeng Ke, Honggang Qi School of Computer and Control Engineering University of Chinese Academy of Sciences, Beijing,

More information

tsushi Sasaki Fig. Flow diagram of panel structure recognition by specifying peripheral regions of each component in rectangles, and 3 types of detect

tsushi Sasaki Fig. Flow diagram of panel structure recognition by specifying peripheral regions of each component in rectangles, and 3 types of detect RECOGNITION OF NEL STRUCTURE IN COMIC IMGES USING FSTER R-CNN Hideaki Yanagisawa Hiroshi Watanabe Graduate School of Fundamental Science and Engineering, Waseda University BSTRCT For efficient e-comics

More information

Hand Gesture Recognition by Means of Region- Based Convolutional Neural Networks

Hand Gesture Recognition by Means of Region- Based Convolutional Neural Networks Contemporary Engineering Sciences, Vol. 10, 2017, no. 27, 1329-1342 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/ces.2017.710154 Hand Gesture Recognition by Means of Region- Based Convolutional

More information

Mobile Cognitive Indoor Assistive Navigation for the Visually Impaired

Mobile Cognitive Indoor Assistive Navigation for the Visually Impaired 1 Mobile Cognitive Indoor Assistive Navigation for the Visually Impaired Bing Li 1, Manjekar Budhai 2, Bowen Xiao 3, Liang Yang 1, Jizhong Xiao 1 1 Department of Electrical Engineering, The City College,

More information

Tiny ImageNet Challenge Investigating the Scaling of Inception Layers for Reduced Scale Classification Problems

Tiny ImageNet Challenge Investigating the Scaling of Inception Layers for Reduced Scale Classification Problems Tiny ImageNet Challenge Investigating the Scaling of Inception Layers for Reduced Scale Classification Problems Emeric Stéphane Boigné eboigne@stanford.edu Jan Felix Heyse heyse@stanford.edu Abstract Scaling

More information

arxiv: v2 [cs.cv] 28 Mar 2017

arxiv: v2 [cs.cv] 28 Mar 2017 License Plate Detection and Recognition Using Deeply Learned Convolutional Neural Networks Syed Zain Masood Guang Shu Afshin Dehghan Enrique G. Ortiz {zainmasood, guangshu, afshindehghan, egortiz}@sighthound.com

More information

Counterfeit Bill Detection Algorithm using Deep Learning

Counterfeit Bill Detection Algorithm using Deep Learning Counterfeit Bill Detection Algorithm using Deep Learning Soo-Hyeon Lee 1 and Hae-Yeoun Lee 2,* 1 Undergraduate Student, 2 Professor 1,2 Department of Computer Software Engineering, Kumoh National Institute

More information

arxiv: v1 [cs.cv] 27 Nov 2016

arxiv: v1 [cs.cv] 27 Nov 2016 Real-Time Video Highlights for Yahoo Esports arxiv:1611.08780v1 [cs.cv] 27 Nov 2016 Yale Song Yahoo Research New York, USA yalesong@yahoo-inc.com Abstract Esports has gained global popularity in recent

More information

A New Framework for Supervised Speech Enhancement in the Time Domain

A New Framework for Supervised Speech Enhancement in the Time Domain Interspeech 2018 2-6 September 2018, Hyderabad A New Framework for Supervised Speech Enhancement in the Time Domain Ashutosh Pandey 1 and Deliang Wang 1,2 1 Department of Computer Science and Engineering,

More information

Video Encoder Optimization for Efficient Video Analysis in Resource-limited Systems

Video Encoder Optimization for Efficient Video Analysis in Resource-limited Systems Video Encoder Optimization for Efficient Video Analysis in Resource-limited Systems R.M.T.P. Rajakaruna, W.A.C. Fernando, Member, IEEE and J. Calic, Member, IEEE, Abstract Performance of real-time video

More information

Integrated Digital System for Yarn Surface Quality Evaluation using Computer Vision and Artificial Intelligence

Integrated Digital System for Yarn Surface Quality Evaluation using Computer Vision and Artificial Intelligence Integrated Digital System for Yarn Surface Quality Evaluation using Computer Vision and Artificial Intelligence Sheng Yan LI, Jie FENG, Bin Gang XU, and Xiao Ming TAO Institute of Textiles and Clothing,

More information

Synthetic View Generation for Absolute Pose Regression and Image Synthesis: Supplementary material

Synthetic View Generation for Absolute Pose Regression and Image Synthesis: Supplementary material Synthetic View Generation for Absolute Pose Regression and Image Synthesis: Supplementary material Pulak Purkait 1 pulak.cv@gmail.com Cheng Zhao 2 irobotcheng@gmail.com Christopher Zach 1 christopher.m.zach@gmail.com

More information

Learning to Understand Image Blur

Learning to Understand Image Blur Learning to Understand Image Blur Shanghang Zhang, Xiaohui Shen, Zhe Lin, Radomír Měch, João P. Costeira, José M. F. Moura Carnegie Mellon University Adobe Research ISR - IST, Universidade de Lisboa {shanghaz,

More information

arxiv: v2 [cs.cv] 11 Oct 2016

arxiv: v2 [cs.cv] 11 Oct 2016 Xception: Deep Learning with Depthwise Separable Convolutions arxiv:1610.02357v2 [cs.cv] 11 Oct 2016 François Chollet Google, Inc. fchollet@google.com Monday 10 th October, 2016 Abstract We present an

More information

Tutorial of Reinforcement: A Special Focus on Q-Learning

Tutorial of Reinforcement: A Special Focus on Q-Learning Tutorial of Reinforcement: A Special Focus on Q-Learning TINGWU WANG, MACHINE LEARNING GROUP, UNIVERSITY OF TORONTO Contents 1. Introduction 1. Discrete Domain vs. Continous Domain 2. Model Based vs. Model

More information

Lesson 08. Convolutional Neural Network. Ing. Marek Hrúz, Ph.D. Katedra Kybernetiky Fakulta aplikovaných věd Západočeská univerzita v Plzni.

Lesson 08. Convolutional Neural Network. Ing. Marek Hrúz, Ph.D. Katedra Kybernetiky Fakulta aplikovaných věd Západočeská univerzita v Plzni. Lesson 08 Convolutional Neural Network Ing. Marek Hrúz, Ph.D. Katedra Kybernetiky Fakulta aplikovaných věd Západočeská univerzita v Plzni Lesson 08 Convolution we will consider 2D convolution the result

More information

ROAD RECOGNITION USING FULLY CONVOLUTIONAL NEURAL NETWORKS

ROAD RECOGNITION USING FULLY CONVOLUTIONAL NEURAL NETWORKS Bulletin of the Transilvania University of Braşov Vol. 10 (59) No. 2-2017 Series I: Engineering Sciences ROAD RECOGNITION USING FULLY CONVOLUTIONAL NEURAL NETWORKS E. HORVÁTH 1 C. POZNA 2 Á. BALLAGI 3

More information

Supplementary Material: Deep Photo Enhancer: Unpaired Learning for Image Enhancement from Photographs with GANs

Supplementary Material: Deep Photo Enhancer: Unpaired Learning for Image Enhancement from Photographs with GANs Supplementary Material: Deep Photo Enhancer: Unpaired Learning for Image Enhancement from Photographs with GANs Yu-Sheng Chen Yu-Ching Wang Man-Hsin Kao Yung-Yu Chuang National Taiwan University 1 More

More information

Playing Atari Games with Deep Reinforcement Learning

Playing Atari Games with Deep Reinforcement Learning Playing Atari Games with Deep Reinforcement Learning 1 Playing Atari Games with Deep Reinforcement Learning Varsha Lalwani (varshajn@iitk.ac.in) Masare Akshay Sunil (amasare@iitk.ac.in) IIT Kanpur CS365A

More information

SIMULATION-BASED MODEL CONTROL USING STATIC HAND GESTURES IN MATLAB

SIMULATION-BASED MODEL CONTROL USING STATIC HAND GESTURES IN MATLAB SIMULATION-BASED MODEL CONTROL USING STATIC HAND GESTURES IN MATLAB S. Kajan, J. Goga Institute of Robotics and Cybernetics, Faculty of Electrical Engineering and Information Technology, Slovak University

More information

Attention-based Multi-Encoder-Decoder Recurrent Neural Networks

Attention-based Multi-Encoder-Decoder Recurrent Neural Networks Attention-based Multi-Encoder-Decoder Recurrent Neural Networks Stephan Baier 1, Sigurd Spieckermann 2 and Volker Tresp 1,2 1- Ludwig Maximilian University Oettingenstr. 67, Munich, Germany 2- Siemens

More information

Hash Function Learning via Codewords

Hash Function Learning via Codewords Hash Function Learning via Codewords 2015 ECML/PKDD, Porto, Portugal, September 7 11, 2015. Yinjie Huang 1 Michael Georgiopoulos 1 Georgios C. Anagnostopoulos 2 1 Machine Learning Laboratory, University

More information

Playing CHIP-8 Games with Reinforcement Learning

Playing CHIP-8 Games with Reinforcement Learning Playing CHIP-8 Games with Reinforcement Learning Niven Achenjang, Patrick DeMichele, Sam Rogers Stanford University Abstract We begin with some background in the history of CHIP-8 games and the use of

More information

Visual Quality Assessment for Projected Content

Visual Quality Assessment for Projected Content Visual Quality Assessment for Projected Content Hoang Le, Carl Marshall 2, Thong Doan, Long Mai, Feng Liu Portland State University 2 Intel Corporation Portland, OR USA Hillsboro, OR USA {hoanl, thong,

More information

Impact of Automatic Feature Extraction in Deep Learning Architecture

Impact of Automatic Feature Extraction in Deep Learning Architecture Impact of Automatic Feature Extraction in Deep Learning Architecture Fatma Shaheen, Brijesh Verma and Md Asafuddoula Centre for Intelligent Systems Central Queensland University, Brisbane, Australia {f.shaheen,

More information

Compact Deep Convolutional Neural Networks for Image Classification

Compact Deep Convolutional Neural Networks for Image Classification 1 Compact Deep Convolutional Neural Networks for Image Classification Zejia Zheng, Zhu Li, Abhishek Nagar 1 and Woosung Kang 2 Abstract Convolutional Neural Network is efficient in learning hierarchical

More information

Park Smart. D. Di Mauro 1, M. Moltisanti 2, G. Patanè 2, S. Battiato 1, G. M. Farinella 1. Abstract. 1. Introduction

Park Smart. D. Di Mauro 1, M. Moltisanti 2, G. Patanè 2, S. Battiato 1, G. M. Farinella 1. Abstract. 1. Introduction Park Smart D. Di Mauro 1, M. Moltisanti 2, G. Patanè 2, S. Battiato 1, G. M. Farinella 1 1 Department of Mathematics and Computer Science University of Catania {dimauro,battiato,gfarinella}@dmi.unict.it

More information

Convolutional Neural Networks for Small-footprint Keyword Spotting

Convolutional Neural Networks for Small-footprint Keyword Spotting INTERSPEECH 2015 Convolutional Neural Networks for Small-footprint Keyword Spotting Tara N. Sainath, Carolina Parada Google, Inc. New York, NY, U.S.A {tsainath, carolinap}@google.com Abstract We explore

More information

Palmprint Recognition Based on Deep Convolutional Neural Networks

Palmprint Recognition Based on Deep Convolutional Neural Networks 2018 2nd International Conference on Computer Science and Intelligent Communication (CSIC 2018) Palmprint Recognition Based on Deep Convolutional Neural Networks Xueqiu Dong1, a, *, Liye Mei1, b, and Junhua

More information

Derek Allman a, Austin Reiter b, and Muyinatu Bell a,c

Derek Allman a, Austin Reiter b, and Muyinatu Bell a,c Exploring the effects of transducer models when training convolutional neural networks to eliminate reflection artifacts in experimental photoacoustic images Derek Allman a, Austin Reiter b, and Muyinatu

More information

Classification of Digital Photos Taken by Photographers or Home Users

Classification of Digital Photos Taken by Photographers or Home Users Classification of Digital Photos Taken by Photographers or Home Users Hanghang Tong 1, Mingjing Li 2, Hong-Jiang Zhang 2, Jingrui He 1, and Changshui Zhang 3 1 Automation Department, Tsinghua University,

More information

Size Does Matter: How Image Size Affects Aesthetic Perception?

Size Does Matter: How Image Size Affects Aesthetic Perception? Size Does Matter: How Image Size Affects Aesthetic Perception? Wei-Ta Chu, Yu-Kuang Chen, and Kuan-Ta Chen Department of Computer Science and Information Engineering, National Chung Cheng University Institute

More information

Recognition: Overview. Sanja Fidler CSC420: Intro to Image Understanding 1/ 78

Recognition: Overview. Sanja Fidler CSC420: Intro to Image Understanding 1/ 78 Recognition: Overview Sanja Fidler CSC420: Intro to Image Understanding 1/ 78 Textbook This book has a lot of material: K. Grauman and B. Leibe Visual Object Recognition Synthesis Lectures On Computer

More information

Convolutional Neural Network-based Steganalysis on Spatial Domain

Convolutional Neural Network-based Steganalysis on Spatial Domain Convolutional Neural Network-based Steganalysis on Spatial Domain Dong-Hyun Kim, and Hae-Yeoun Lee Abstract Steganalysis has been studied to detect the existence of hidden messages by steganography. However,

More information

Scalable systems for early fault detection in wind turbines: A data driven approach

Scalable systems for early fault detection in wind turbines: A data driven approach Scalable systems for early fault detection in wind turbines: A data driven approach Martin Bach-Andersen 1,2, Bo Rømer-Odgaard 1, and Ole Winther 2 1 Siemens Diagnostic Center, Denmark 2 Cognitive Systems,

More information