{"title": "DeepExposure: Learning to Expose Photos with Asynchronously Reinforced Adversarial Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2149, "page_last": 2159, "abstract": "The accurate exposure is the key of capturing high-quality photos in computational photography, especially for mobile phones that are limited by sizes of camera modules. Inspired by luminosity masks usually applied by professional photographers, in this paper, we develop a novel algorithm for learning local exposures with deep reinforcement adversarial learning. To be specific, we segment an image into sub-images that can reflect variations of dynamic range exposures according to raw low-level features. Based on these sub-images, a local exposure for each sub-image is automatically learned by virtue of policy network sequentially while the reward of learning is globally designed for striking a balance of overall exposures. The aesthetic evaluation function is approximated by discriminator in generative adversarial networks. The reinforcement learning and the adversarial learning are trained collaboratively by asynchronous deterministic policy gradient and generative loss approximation. To further simply the algorithmic architecture, we also prove the feasibility of leveraging the discriminator as the value function. Further more, we employ each local exposure to retouch the raw input image respectively, thus delivering multiple retouched images under different exposures which are fused with exposure blending. 
The extensive experiments verify that our algorithms are superior to state-of-the-art methods in terms of quantitative accuracy and visual illustration.", "full_text": "DeepExposure: Learning to Expose Photos with\nAsynchronously Reinforced Adversarial Learning\n\nRunsheng Yu\u2217\nXiaomi AI Lab\n\nSouth China Normal University\n\nrunshengyu@gmail.com\n\nWenyu Liu \u2217\nXiaomi AI Lab\nPeking University\n\nliuwenyu@pku.edu.cn\n\nYasen Zhang\nXiaomi AI Lab\n\nzhangyasen@xiaomi.com\n\nZhi Qu\n\nXiaomi AI Lab\n\nquzhi@xiaomi.com\n\nDeli Zhao\n\nXiaomi AI Lab\n\nzhaodeli@xiaomi.com\n\nBo Zhang\n\nXiaomi AI Lab\n\nzhangbo@xiaomi.com\n\nAbstract\n\nAccurate exposure is the key to capturing high-quality photos in computational\nphotography, especially for mobile phones that are limited by the sizes of their camera\nmodules. Inspired by the luminosity masks usually applied by professional\nphotographers, in this paper we develop a novel algorithm for learning local exposures\nwith deep reinforcement adversarial learning. To be specific, we segment an image into\nsub-images that can reflect variations of dynamic range exposures according to raw\nlow-level features. Based on these sub-images, a local exposure for each sub-image\nis automatically learned sequentially by virtue of a policy network, while the reward\nof learning is globally designed to strike a balance of overall exposures. The\naesthetic evaluation function is approximated by the discriminator in generative\nadversarial networks. The reinforcement learning and the adversarial learning are trained\ncollaboratively by asynchronous deterministic policy gradient and generative loss\napproximation. To further simplify the algorithmic architecture, we also prove the\nfeasibility of leveraging the discriminator as the value function. 
Furthermore,\nwe employ each local exposure to retouch the raw input image respectively, thus\ndelivering multiple retouched images under different exposures which are fused\nwith exposure blending. The extensive experiments verify that our algorithms are\nsuperior to state-of-the-art methods in terms of quantitative accuracy and visual\nillustration.\n\n1 Introduction\n\nRetouching raw low-quality photos into high-quality ones will greatly increase the aesthetic experience\nof our vision. Due to the requirement of photographic expertise, photo quality enhancement is\nbeyond the capability of non-professional users, thus leading to the new trend of automatic techniques\nof image retouching.\nThe traditional methods for automatic image retouching include retinex, a theory based on human\nimage perception [21], transform methods that use enhancement-parametric operators to retouch images [1],\nand exposure/contrast fusion [10, 23, 25]. But these methods have their own limitations: most of\nthem are incapable of comprehending semantic information or object relationships in images well.\nWith the prevalence of deep learning, many researchers focus on applying this method to the image\nretouching area. In general, the image retouching approaches based on deep learning fall into three\ncategories:\n\n\u2217Joint first authors.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f1) The transfer methods: Many researchers regard image retouching as style transfer, including\ndomain transfer and image-to-image translation [17, 18, 16, 15, 22, 5]. In light of the principle of\nthe generative adversarial network [12], these methods provide a novel perspective on image quality\nenhancement. However, one of the challenges is to derive photo-realistic effects for these generation-based\napproaches. 
2) Retouching-driven methods: These methods focus on generating retouched\nphotos directly from input low-quality images [11, 6, 4, 39, 36]. Various losses are deliberately\ndesigned to enhance images from different perspectives, e.g. SSIM loss, texture loss and color loss. 3)\nThe sequence-based methods: The sequence-based methods generate an operation sequence\nwhich can be clearly understood by humans [14, 38, 29, 7, 37, 40]. One kind of these approaches\nis to employ reinforcement learning [14, 38, 29]. Hu et al. [14] and Park et al. [29] utilized\ndeep reinforcement learning (DRL) to generate human-understandable operation sequences while\nYang et al. [38] applied DRL to generate personalized real-time exposure control.\nWhen retouching a photo, we frequently encounter the difficulty that some parts of a photo are too\ndark while other parts are too bright due to the limitation of photographic techniques or hardware,\nespecially for small camera modules embedded in mobile phones. This exposure issue cannot be\neasily addressed by global adjustment since the required adjustment operations vary in different\nareas. Professional photographers always employ exposure blending with luminosity masks to\nperform image post-processing [20]. That is, they create different layer masks for different objects\nand adjust each of the objects independently. By virtue of this skill, they can flexibly cope with scenarios\nwith a wide range of illumination distributions. Nevertheless, it may not be easily used in\nautomatic retouching: since the different objects in an image are semantically correlated, we need\nto carefully consider the inherent relationship of illuminations and colors for each local exposure operation\nwhen designing algorithms. In addition, it is of great significance to find an appropriate\nmetric to evaluate whether a photo is aesthetically good or not. 
However, the traditional image\nevaluation metrics may not work well in this situation [32].\nIn this paper, we develop a reinforced adversarial learning framework to solve these problems.\nDeep reinforcement learning is exploited to learn multiple local exposure operations, and an\nadversarial learning method is harnessed to approximate the Aesthetic Evaluation (AE) function, i.e.\nan evaluation method to judge the subjective quality of an image. Both reinforcement learning and\nadversarial learning can be trained together as a whole pipeline by asynchronous deterministic policy\ngradient and generative loss approximation.\nOur main contributions are summarized as follows:\n\n1. Based on deep reinforcement learning, we propose an exposure-blending-based framework\n\nwhich can flexibly retouch local areas of images with only the exposure operation.\n\n2. There exists a non-differentiable operation in the whole process. We leverage the generative\nloss approximation to make it differentiable and asynchronous learning to make the learning\nprocess stable.\n\n3. By asynchronously reinforced adversarial learning, we propose an approach to training\nboth exposure operations and the aesthetic evaluation function. The asynchronous update of\npolicy gradients aids the algorithm in tuning sequential exposures while adversarial learning\nfacilitates learning the aesthetic evaluation function.\n\n4. The whole pipeline proceeds with image-unpaired training and can be efficiently performed\nwith super-resolution in practice. Our algorithm does not directly generate any pixels and can\nwell preserve the details of the original image.\n\n5. 
For computers of limited memory when training, we devise an algorithm to reuse the\ndiscriminator as the value function, effectively reducing memory occupation and accelerating\nthe training speed.\n\n2 Methodology\n\nIn this section, we present the details of our algorithms in five sub-sections: the problem\nformulation, the asynchronous deterministic policy gradient, the adversarial learning for the AE function,\nthe generative loss approximation, and the alternative form of the value function.\n\n\fFigure 1: The schematic illustration of our algorithm. Firstly, we harness image segmentation to\nobtain sub-images. For different sub-images, we use different exposures according to the policy\nnetwork, and they are fused together to form the final high-quality image.\n\n2.1 Problem formulation\n\nLet at denote the t-th exposure operation. For reinforcement learning, at is the action at step t. Thus,\nA = {a0, a1, . . . , aT} forms the action space, which is the sequential exposure operations in the\nscenario of image retouching. The local exposure image retouching can be formulated as follows:\n\narg max_A \u03c6(PT(s0, A)), (1)\n\nPT(s0, A) = EB \u25e6 aT \u25e6 aT\u22121 \u25e6 aT\u22122 \u00b7\u00b7\u00b7 \u25e6 a0 \u25e6 s0, Sf = PT(s0, A), (2)\n\nwhere \u03c6(\u00b7) is the aesthetic evaluation function, \u25e6 is the function composition operation, EB is\nthe exposure blending manipulation, and s0 is the first state; Sf is the final image after fusion. Our\nalgorithm aims to find the optimal operations that maximize the \u03c6(\u00b7) function. 
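To make the formulation concrete, here is a minimal sketch of Eqs. (1)-(2): a toy gain-based exposure operation composed sequentially, a plain average standing in for the EB fusion step, and a hypothetical AE function phi. All three operators are illustrative assumptions, not the paper's exact filters:

```python
import numpy as np

def apply_exposure(image, ev):
    # Toy exposure operation (an assumption): scale linear intensities by 2**ev.
    return np.clip(image * (2.0 ** ev), 0.0, 1.0)

def exposure_blend(images):
    # Placeholder for the EB fusion step: a plain average of the exposed images.
    return np.mean(images, axis=0)

def retouch(s0, actions, phi):
    # Eq. (2): compose a_T ∘ ... ∘ a_0 ∘ s0 sequentially, then fuse with EB.
    # Eq. (1): score the fused image S_f with the AE function phi.
    states = [s0]
    for ev in actions:
        states.append(apply_exposure(states[-1], ev))
    s_f = exposure_blend(states[1:])
    return s_f, phi(s_f)

# Hypothetical AE function that simply prefers mid-gray images.
phi = lambda img: -float(np.mean((img - 0.5) ** 2))
s0 = np.full((4, 4, 3), 0.25)
s_f, score = retouch(s0, [0.5, 0.5, -0.5], phi)
```

The learning problem is then to pick the action sequence that maximizes `score`, which is what the policy network below is trained to do.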
In order to solve\nthis problem, there are two sub-problems to be considered: 1) how to derive the optimal sequential\noperations A through learning methods, and 2) how to model the AE function \u03c6(\u00b7).\nSince each exposure operation needs to consider the original low-resolution input image s0, the\nlocal fusion image s^l_t at step t, and the t-th segmented sub-image segt, we define the state st as st =\n{s^l_t, segt, s0} \u2208 S, where the sub-images stem from image segmentation {seg0, seg1, . . . , segT} =\nsegment(s0) and S is the state space. In order to express the probability of an interaction, we\nutilize the transition function to describe the exposure operation as st+1 = at \u25e6 st = p(st, at).\nEB denotes a function that integrates all the global filter images [S^g_0, S^g_1, . . . , S^g_n] together,\nEB : [S^g_0, S^g_1, . . . , S^g_n] \u2192 Sf.\nFor sub-problem 1, since the operation at is calculated according to its previous state st, we can\nregard sub-problem 1 as an agent-environment interaction problem, i.e. the Markov Decision Process\n(MDP) problem. Here, we take reinforcement learning as one of the feasible tools to deal with\nthis problem.\nAccording to the nature of reinforcement learning, it is plausible to determine the reward function\nwith the AE function to evaluate how action at performs. One thing that needs to be considered is that, due\nto the limitations of the image retouching area, it is hard to observe the intermediate process and we\nonly have the terminating results from which to obtain rewards, also known as the sparse reward issue. Thus,\nthe AE function applies only at the final step. Here, we define the reward function rt(st, at) as\n\nrt(st, at) = 0 if t \u2260 T, and rt(st, at) = \u03c6(PT(s0, A)) if t = T. (3)\n\nFrom Eq. 
(3) we can see that the reinforcement learning employed in our image retouching framework\ndoes not need to consider the intermediate rewards, thereby simplifying the overall reinforcement\nlearning procedure. Therefore, we can write the summation of discounted rewards r^\u03b3_0 (or return\nfunction) as r^\u03b3_0 = \u2211_{t'=0}^{T} \u03b3^{t'} r_{t'}(s_{t'}, a_{t'}) = rT(sT, aT) = \u03c6(PT(s0, A)). To proceed, we use the\nadvantage actor-critic framework as our basic reinforcement learning model [35].\n\n\fSince the exposure operations are decided by the current state, we can define the policy \u03c0 : S \u2192 P(A)\nand the discounted state visitation distribution \u03c1\u03c0 to model this process. With these definitions, we can\ncast the maximization of the AE function as the optimization\n\narg max_\u03c0 J(\u03c0) = arg max_\u03c0 E_{s\u223c\u03c1\u03c0, PT\u223c\u03c0}[rT | \u03c0]. (4)\n\nSimilarly, we use the value function\n\nV\u03c0(s) = E_{s\u223c\u03c1\u03c0, PT\u223c\u03c0}[rT] (5)\n\nto evaluate the states. Also, we harness the state-action value function Q\u03c0(st, at) =\nE_{s\u223c\u03c1\u03c0, a\u223cat, PT\u223c\u03c0}[rt(st, at) + \u03b3V\u03c0(p(st, at))] and its normalized form A\u03c0(s, a) = Q\u03c0(s, a) \u2212\nV\u03c0(s) to determine the action at at state st and to reduce high variability. The value function can be\nregarded as the proxy of the AE function.\nThe AE function and the value function can be estimated by the temporal difference method [33], which can be\nformulated as\n\nLV = E_{s\u223c\u03c1\u03c0, a\u223c\u03c0(s)}[(1/2) \u03b4^2], \u03b4 = rt + \u03b3V\u03c0(p(st, at)) \u2212 V\u03c0(st). (6)\n\nThrough the equation above, we can use the AE function to guide the value function V by minimizing\nEq. 
(6).\nThe value function V and the policy function \u03c0 can be approximated by neural networks V\u03c9 and \u03c0\u03b8\nrespectively, where \u03c9 and \u03b8 are learnable parameters. Thus we can take advantage of learning methods\nto approximate these two functions. Since the operation is continuous, we employ the deterministic\npolicy gradient (DPG) theorem [31] to update our model:\n\n\u2207\u03b8J(\u03c0\u03b8) = E_{s\u223c\u03c1\u03c0}[\u2207\u03b8\u03c0\u03b8(s) \u2207a A\u03c0(s, a; \u03b8t) | a = \u03c0(s)], (7)\n\n\u03b8t+1 = \u03b8t + \u03b2 E_{s\u223c\u03c1\u03c0}[\u2207\u03b8\u03c0\u03b8(s) \u2207a A\u03c0(s, a; \u03b8) | a = \u03c0(s)], (8)\n\nand \u03c9t+1 = \u03c9t + \u03b1 E_{s\u223c\u03c1\u03c0}[[r(t) + \u03b3V\u03c0(p(st, at); \u03c9) \u2212 V\u03c0(st; \u03c9)] \u2207\u03c9 V\u03c0(st; \u03c9)], (9)\n\nwhere A can be calculated by the normalized-form equation mentioned above. Here, we can use the V\nfunction to get all the equations mentioned above.\n\n2.2 Asynchronous deterministic policy gradient\n\nFrom a common point of view, a variety of reinforcement learning algorithms including DPG\nneed the assumption that the samples are independently and identically distributed [24]. However,\nthe sequential data from practical tasks usually violate this assumption, which is also known as\nhigh temporal correlation. The technique of experience replay can elaborately circumvent the\nproblem [27]. But for our case, we need to accomplish the whole process without interruption, and\nit is memory-consuming to store transitions (st, at, rt, st+1) since each state contains many\nimages. Therefore, the commonly used experience replay or out-of-order training method [14] may\nnot be suitable here. Under this circumstance, we choose to update our actor network and critic\nnetwork by virtue of asynchronous updating. 
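The idea can be sketched in a few lines: instead of a per-transition update drawn from a replay buffer, the gradients from N parallel retouching trajectories of length T are accumulated and averaged into one synchronized step. This is a schematic numpy stand-in for the batched updates given just below; the names and shapes are hypothetical:

```python
import numpy as np

def asynchronous_update(theta, per_step_grads, beta=1e-4):
    # per_step_grads[i, t] is the policy gradient collected at step t of
    # trajectory i. Average over all N*T steps (the 1/(TN) double sum), then
    # apply a single gradient-ascent step instead of N*T correlated updates.
    n, t = per_step_grads.shape[:2]
    g = per_step_grads.reshape(n * t, -1).mean(axis=0)
    return theta + beta * g

rng = np.random.default_rng(0)
theta = np.zeros(8)                 # toy policy parameters
grads = rng.normal(size=(4, 5, 8))  # N=4 trajectories, T=5 steps each
theta = asynchronous_update(theta, grads)
```

Averaging over whole trajectories before touching the parameters is what weakens the temporal correlation between consecutive updates.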
The update formulas can be approximated as\n\nE_{s\u223c\u03c1\u03c0}[\u2207\u03b8\u03c0\u03b8(s) \u2207a A\u03c0(s, a) | a = \u03c0(s)] \u2248 (1/(TN)) \u2211_{i=0}^{N} \u2211_{t=0}^{T} \u2207\u03b8\u03c0\u03b8(sit) \u2207a_{it} A\u03c0\u03b8(sit, ait = \u03c0\u03b8(sit)), (10)\n\nE_{s\u223c\u03c1\u03c0}[[r + \u03b3V\u03c0(p(s, a); \u03c9) \u2212 V\u03c0(s; \u03c9)] \u2207\u03c9 V\u03c0(s; \u03c9)] \u2248 (1/(TN)) \u2211_{i=0}^{N} \u2211_{t=0}^{T} [ri(t) + \u03b3V\u03c0(p(sit, ait); \u03c9) \u2212 V\u03c0(sit; \u03c9)] \u2207\u03c9 V\u03c0(sit; \u03c9), (11)\n\nwhere N is the mini-batch size and T is the sequence length. With this asynchronous update method,\nwe can reduce the effect of high temporal correlations. Eq. (10) reveals how to calculate the gradient\nof parameter \u03b8 while Eq. (11) shows how to calculate the gradient of parameter \u03c9. The retouching\nprocesses can also be done by N threads in parallel; our asynchronous update method is then the\ncontinuous-control counterpart of the asynchronous policy gradient framework in [26].\nThis asynchronous update method can have the same effect as the replay buffer. More information\ncan be found in Appendix A.\n\n2.3 Adversarial learning for the AE function\n\nIn the preceding section, we propose a method to find the optimal sequential operations. Nevertheless,\nthere still exists one problem: how to get the reward function without knowing the AE function \u03c6(\u00b7).\nOne simple way is to learn this AE function by neural networks. Inspired by the generative adversarial\nnetwork [12], we treat the AE function as the discriminator and learn it through adversarial learning. In\nthis case, we use the Wasserstein GAN as our adversarial learning framework [2].\nLet pd denote the distribution of the expert-retouched images and pa the distribution of our algorithm-retouched images. 
According to [2], we define the loss of the discriminator as\n\nLD = E_{\u02dcSf\u223cpd}[D\u03b2(\u02dcSf)] \u2212 E_{Sf\u223cpa}[D\u03b2(Sf)] + \u03bb E_{\u02c6Sf\u223cp\u02c6Sf}[(\u2016\u2207\u02c6Sf D\u03b2(\u02c6Sf)\u2016_2 \u2212 1)^2], (12)\n\nwhere \u03b2 is the parameter of the discriminator, \u02c6Sf = \u03b5Sf + (1 \u2212 \u03b5)\u02dcSf, and \u03b5 \u2208 [0, 1]. The\ngradient penalty is applied to ensure that D\u03b2 is Lipschitz-continuous [13]. The discriminator is\ndesigned to discriminate whether the photos are retouched by an expert or by our own method:\nD\u03b2(\u02dcSf) = \u03c6(PT(s0, A)). Thus, it can be leveraged as the AE function.\n\n2.4 Generative loss approximation\n\nNormally, the algorithmic framework of adversarial learning needs to combine a generator with\nthe discriminator end-to-end. For our problem, however, the loss function of the original generator\ncannot be adopted because the EB step in our pipeline is non-differentiable. For this reason, we opt to\napproximate the original generative loss gradient through the DPG in Eq. (7), i.e.\n\n\u2207\u03b8J(\u03c0\u03b8) \u2248 C \u2207\u03b8 LG, (13)\n\nwhere LG is the loss of the generator function and C is a positive constant. If C conforms to the learning rate, the gradient\ndescent between the DPG and GAN losses is equivalent. This\napproximation helps us solve the non-differentiability problem. 
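For illustration, the loss of Eq. (12) can be written out on a toy linear critic, whose input gradient is exactly its weight vector, so the gradient penalty has a closed form. Everything here, including the linear D and the flattened "images", is a simplifying assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=12)  # weights of a toy linear critic

def D(x):
    # Linear critic D(x) = w . x; for this choice, grad_x D(x) = w everywhere.
    return float(w @ x)

def critic_loss(expert, fake, lam=10.0):
    # Eq. (12): D(expert) - D(fake) + lambda * gradient penalty, where the
    # penalty is evaluated at an interpolate x_hat = eps*fake + (1-eps)*expert
    # as in WGAN-GP [13]. For a linear D the input gradient at x_hat is w,
    # so the penalty reduces to (||w||_2 - 1)^2 regardless of eps.
    eps = rng.uniform()
    x_hat = eps * fake + (1.0 - eps) * expert
    grad_norm = np.linalg.norm(w)  # ||grad_x D(x_hat)||_2 for this linear D
    return D(expert) - D(fake) + lam * (grad_norm - 1.0) ** 2

expert = rng.uniform(size=12)  # stand-in for an expert-retouched image
fake = rng.uniform(size=12)    # stand-in for an algorithm-retouched image
loss = critic_loss(expert, fake)
```

In the real pipeline D is a convolutional network and the penalty gradient is obtained by automatic differentiation rather than in closed form.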
More details can be found in Appendix C.\n\n2.5 Alternative form of the value function\n\nTo diminish the sparse reward problem, we can leverage the discriminator D\u03b2 to replace V\u03c9 by\nsolving the value function LV directly:\n\nV(st) = D(PT(s0, A)) * \u03b3^{T\u2212t}, Q(st, at) = D(PT(s0, A)) * \u03b3^{T\u2212t+1}, (14)\n\nand A\u03c0(st, at) = D(PT(s0, A)) * (\u03b3 \u2212 1) * \u03b3^{T\u2212t}. (15)\n\nOne more question remains to be answered: since PT(s0, A) contains future information that is\ndifficult to obtain, we cannot attain PT(s0, A) directly. But if the sub-images are not too\nmany, we can use the intermediate state St to approximate the final step, i.e. St \u2248 PT(s0, A).\nThis approximation is plausible in our scenario because the discriminator admits an approximately\nexponential decay of importance with respect to the time dimension of parameter updating. The\ncloser the time step is to the final step, the larger the weight the reward obtains. From this perspective,\nthe formulation is conformal to intuition as well.\nThe surrogate of the value function reduces memory occupation and the training time since one of the\ndeep neural networks no longer needs training. In fact, this form is akin to the reward-shaping\nmethod [28], and more details can be found in Appendix B.\n\n3 Exposure blending\n\nOur algorithm yields an exact exposure value for each segment from the policy network. The\nexposure value will be applied to the entire image, thus resulting in multiple retouched images of\n\n\fFigure 2: The pipeline of our algorithm. The first input state s0 consists of two parts: the first\nsegmentation sub-image seg0 as well as the raw low-resolution image S0. 
The policy network\ncalculates the exposure value e, and the action is the whole process that generates the locally exposed\nimage Sl as well as the globally exposed image Sg (details can be found in Algorithm 3 in Appendix\nE). The value function is used to evaluate the action. We only update the value and policy gradients\nwhen finishing a mini-batch of image retouching with the asynchronous update method. The\nprocessed images are stored in memory. The discriminator is trained by randomly selecting a\nbatch of the algorithm-retouched images and the expert-retouched data unpairedly, and it guides\nthe value function to update.\n\ndifferent exposures for one input image, with some visual artifacts. To enhance the final visual effect,\nwe harness the blending approach of High Dynamic Range (HDR) Imaging for exposure fusion [25].\nWe find the well-exposed areas and blend them together through pyramidal image decomposition:\n\nL(S^{ij}_o)_k = \u2211_{l=1}^{n} Gauss(w^l_{ij})_k L(S^g_{ijl})_k, (16)\n\nwhere L(S^{ij}_o)_k is the k-level Laplacian pyramid decomposition at pixel (i, j), Gauss(w^l_{ij})_k is the\nk-level Gaussian pyramid of weights at pixel (i, j), and L(S^g_{ijl})_k is the k-level Laplacian pyramid\ndecomposition of the l-th exposed image at pixel (i, j). In fact, this method can be regarded as\npseudo multiple exposure fusion. More details can be found in Appendix D.\n\n4 The algorithmic pipeline\n\nThe whole pipeline of our algorithm is presented as follows:\nAs shown in Figure 2, we first use the image segmentation method to segment the whole image into\nseveral sub-images (all the training images are of size 64 \u00d7 64 \u00d7 3 in our experiment). During the\naction-generating stage t, we concatenate the input low-resolution image S0, the sub-image segt, and\nthe direct fusion image S^l_t as the state st. Then, a policy network is exploited to compute different\nexposures that are applied on the image locally and globally. 
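The pyramid blending of Eq. (16) can be sketched compactly. Here a crude 2x2 average-pool downsample and nearest-neighbour upsample stand in for proper Gaussian filtering, and the weight maps are uniform toy values rather than real well-exposedness measures:

```python
import numpy as np

def down(img):  # 2x2 average pooling (stand-in for Gaussian blur + decimate)
    return img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2).mean(axis=(1, 3))

def up(img):    # nearest-neighbour upsample (stand-in for Gaussian expand)
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def gaussian_pyr(img, levels):
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(down(pyr[-1]))
    return pyr

def laplacian_pyr(img, levels):
    g = gaussian_pyr(img, levels)
    return [g[k] - up(g[k + 1]) for k in range(levels - 1)] + [g[-1]]

def fuse(exposures, weights, levels=3):
    # Eq. (16): per level k, sum Gauss(w^l)_k * L(S^g_l)_k over exposures l,
    # then collapse the blended Laplacian pyramid back to an image.
    weights = [w / (sum(weights) + 1e-8) for w in weights]  # normalize per pixel
    blended = None
    for img, w in zip(exposures, weights):
        lp, gp = laplacian_pyr(img, levels), gaussian_pyr(w, levels)
        contrib = [gp[k] * lp[k] for k in range(levels)]
        blended = contrib if blended is None else [b + c for b, c in zip(blended, contrib)]
    out = blended[-1]
    for k in range(levels - 2, -1, -1):
        out = up(out) + blended[k]
    return out

dark, bright = np.full((8, 8), 0.2), np.full((8, 8), 0.9)
fused = fuse([dark, bright], [np.ones((8, 8)), np.ones((8, 8))])
```

With uniform weights this reduces to a plain average; in practice the Gaussian-smoothed weight maps make the blend favour each image only where it is well exposed, which suppresses seams between regions.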
The local filter can be formed as\nS^l_{t+1} = S^l_t \u2299 bg_mask + et * segt, where segt is the t-th sub-image, bg_mask is the background\nmask which does not need exposure at this step, et is the corresponding exposure value, \u2299 denotes the\nelement-wise product, and * is scalar-matrix multiplication. The global filter performs S^g_{t+1} = S^g_0 * et,\nand all the global filters operate on the original image S0. When finishing st+1, we apply the value\nfunction to evaluate the quality of this step by calculating the one-step gradient using Eq. (8). After\nall the sub-images are processed, we blend all the images of different exposures and the input image\ntogether. The exposure fusion is made with Eq. (16).\nTo update the policy network and value network through Eq. (10) and Eq. (11), we repeat trials of 8\nmini-batches to collect robust gradients. For the discriminator, we randomly choose a mini-batch\nof machine-retouched and expert-retouched photos to train the discriminator through Eq. (12). Due\nto the advantage of GAN, the training process can be unpaired for the two kinds of input images. This\nunpaired training can avoid the difficulty of acquiring paired data in real environments. So as to\nmake the discriminator more reliable, we take a method similar to [14]: the contrast, saturation\nand illumination features are extracted and then concatenated with the retouched RGB image\ntogether, finally forming a (3 + 3)-channel image as the input of the discriminator. 
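The local and global filters described above can be sketched directly on a toy grayscale image. Treating the exposure purely as a scalar gain is a simplification of the paper's exposure operation:

```python
import numpy as np

def local_filter(s_l, seg, e):
    # S^l_{t+1} = S^l_t ⊙ bg_mask + e_t * seg_t: keep the background of the
    # running local image and re-expose only the current sub-image region.
    bg_mask = (seg == 0).astype(s_l.dtype)  # assumes seg is zero outside its region
    return s_l * bg_mask + e * seg

def global_filter(s0, e):
    # S^g_{t+1} = S^g_0 * e_t: one exposure applied to the whole original image.
    return s0 * e

s0 = np.full((4, 4), 0.5)
seg = np.zeros((4, 4))
seg[:2, :] = s0[:2, :]              # top half is the current sub-image
s_l = local_filter(s0, seg, e=1.5)  # top half re-exposed, bottom half kept
s_g = global_filter(s0, e=1.5)      # whole image re-exposed
```

The local filter drives the intermediate states seen by the policy, while the per-step global filters produce the full-frame exposures that are later fused with Eq. (16).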
Moreover, as we\nfind in our experiment, the discriminator reward can be applied directly to the final local direct\nfusion image S^l_T to improve the effectiveness of our model.\nIn the test stage, as Figure 1 in Appendix F shows, a raw image I0 of arbitrary size is resized to\n64 \u00d7 64. The resized image is fed into the policy network to derive exposure values and global\nfilters, as well as local filters. These variables are used to retouch the resized image for intermediate\ncomputations. For the test image I of original size, only global filters are employed. After all\nexposure values are solved from the sub-images and applied to generate the re-exposed images Ii, the\nfinal retouched image of original size is blended from {I0, I1, . . . , It} using Eq. (16).\nThe pseudo-codes are presented in Appendix E.\n\n5 Experiment\n\n5.1 Implementation details\n\nWe train our model on MIT-Adobe FiveK [3], a dataset which contains 5,000 RAW photos and\ncorresponding retouched ones edited by five experts for each photo. To perform a fair comparison with\nstate-of-the-art algorithms, we follow the experimental protocol presented by Hu et al. [14]. We\nseparate the dataset into three subsets: 2,000 input unretouched images, 2,000 retouched images by\nexpert C, and 1,000 input RAW images for testing. Unless noted otherwise, all the images in the training\nand testing stages are scaled down to 500 pixels along the long edge.\nThe architecture of our networks is detailed in Figure 2 in Appendix F. Specifically, our model is\ndifferentiable according to the form of the value function. If the value function is calculated directly,\nit is approximated by the discriminator, hereinafter referred to as DeepExposure II.\nOtherwise, the value function is learned with the neural networks that we depict in Appendix F, hereinafter\nreferred to as DeepExposure I. All the networks are optimized by Adam [19].\nHere we present some details of the different networks. 
For the discriminator network, the original learning rate is 5 \u00d7 10\u22125 with an exponential decay to 10\u22123 of the original value. The batch size\nfor adversarial learning is 8. For the policy network, the original learning rate is 1.5 \u00d7 10\u22125 with an\nexponential decay to 10\u22123 of the original value. The Ornstein-Uhlenbeck process [34] is used to\nperform the exploration 2. The mini-batch size for the policy network is 8. The parameters are not\nupdated until a collection of gradients is obtained. For the value network, if it is DeepExposure I, the\noriginal learning rate is 5 \u00d7 10\u22124 with an exponential decay to 10\u22123 of the original value. Otherwise,\nwe do not use the value network (DeepExposure II). The \u03b3 parameter is set to 0.99.\nFor image segmentation, we take advantage of the graph-based method to segment images [8]. Since\nthis segmentation is performed according to texture and color in an unsupervised manner, it\nprovides the policy network with low-level information.\nThe codes are run on a P40 Tesla GPU. DeepExposure I takes about 320 min to converge while\nDeepExposure II takes 280 min. All the networks are implemented via TensorFlow.\n\n5.2 Experimental results\n\nThe quantitative results of our models are obtained on the test dataset of MIT-Adobe FiveK. The\ncompared baseline and state-of-the-art methods include the sequence-based method Exposure [14],\nthe unpaired style transfer method CycleGAN [41], the fusion-based retouching method FI [10],\nand the paired image enhancement method DPED [16].\nFrom Table 1, we can see that our method consistently outperforms the involved algorithms. It is\nworth noting that compared with the Exposure algorithm, which is also established on reinforcement\nlearning, our algorithm attains better scores in both MSE and PSNR using only one filter. This\nsuccess exhibits the power of local operations in the asynchronous mode. 
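The exploration noise mentioned above can be simulated with an Euler-Maruyama discretization of the Ornstein-Uhlenbeck process defined in footnote 2, using the stated hyper-parameters; the unit step size dt = 1 is an assumption of this sketch:

```python
import numpy as np

def ou_noise(steps, zeta=0.015, mu=0.0, sigma=0.03, dt=1.0, seed=0):
    # dx_t = zeta*(mu - x_t) dt + sigma dW_t, discretized as
    # x_{t+1} = x_t + zeta*(mu - x_t)*dt + sigma*sqrt(dt)*N(0, 1).
    rng = np.random.default_rng(seed)
    x = np.zeros(steps)
    for t in range(1, steps):
        x[t] = x[t - 1] + zeta * (mu - x[t - 1]) * dt \
               + sigma * np.sqrt(dt) * rng.normal()
    return x

noise = ou_noise(100)  # temporally correlated perturbations for exposure values
```

Unlike independent Gaussian noise, consecutive OU samples are correlated, which gives smoother exploration of the continuous exposure action space.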
As an adaptive exposure\nfusion method, our method also outperforms FI, the state-of-the-art algorithm for single image fusion,\nindicating that learning-based exposure with reinforcement is better than empirically crafted exposure.\n\n2 The Ornstein-Uhlenbeck process is dxt = \u03b6(\u00b5 \u2212 xt) dt + \u03c3 dWt, where Wt is the Wiener process and\n\u03b6, \u00b5, \u03c3 are the hyper-parameters. We set the hyper-parameters as \u03b6 = 0.015, \u00b5 = 0, and \u03c3 = 0.03.\n\n\fTable 1: Quantitative comparison of the compared algorithms on the MIT-Adobe 5K test dataset. For MSE\n(Mean Squared Error), the smaller the number the better. For PSNR (Peak Signal-to-Noise Ratio), the\nlarger the number the better. The best results are highlighted in bold fonts. DE is our method, the\nabbreviation of DeepExposure. DeepExposure I (DE I) learns both the value and discriminator\nnetworks. DeepExposure II (DE II) employs the alternative form of the value function.\n\nMetric | Exposure [14] | CycleGAN [41] | DPED [16] | FI [10] | DE I | DE II\nMSE | 101.10 | 97.99 | 99.04 | 105.2 | 95.44 | 96.42\nPSNR | 28.12 | 28.27 | 28.20 | 27.92 | 28.38 | 28.33\n\nOne more thing is that our algorithm is an unpaired one but superior to the paired method DPED,\nshowing the effectiveness of our method in the unpaired setting.\nFigure 3 illustrates some imagery examples of our algorithms and other state-of-the-art methods.\nBesides the methods mentioned above, we also compare Deep Photo Enhancer [5] and Deep Guided\nFilter [36], which are the latest relevant work3. We can find that among all compared approaches, our\nmethod restores the details of the original images better and enhances the saturation more effectively.\nDue to the limitation of space, we cannot show more results of our experiments. We summarize some\nkey features here: 1) As the Exposure approach does, our method can deal with higher-resolution\nimages than other methods. The detailed experiments can be found in Appendix G1. 
2) Our algorithm\ndoes learn to adapt various exposure parameters to sub-images of various styles. The detailed\nexperiments can be found in Appendix G2. We also demonstrate more retouching results in Appendix\nG3.\n\n6 Conclusion\n\nWe develop a reinforced adversarial learning algorithm to learn the optimal exposure operations for\nretouching low-quality images. Compared with other methods, our algorithm can restore most of the\ndetails and styles of the original images while enhancing brightness and colors. Moreover, our method\nbridges deep-learning methods and traditional filtering methods: deep-learning methods serve to\nlearn the parameters of filters, which makes the filtering of traditional methods more precise, and traditional\nmethods reduce the training time of deep-learning methods because filtering pixels is much faster\nthan generating pixels with neural networks.\nOur algorithm relies on image segmentation to mimic the layering of illumination masks. Therefore,\nusing semantic segmentation algorithms instead of unsupervised ones may improve the capability of\nlearning exposures. One natural extension of our algorithm is to combine it with other usual filters,\nsuch as the contrast filter and the tone filter, which is left for future work. One limitation is that, due to\nthe sparse reward, the local exposure might not be an exactly accurate value. Thus other novel methods\nlike curriculum learning [9] or curiosity-driven learning [30] will be explored in the future.\n\n7 Acknowledgments\n\nThe authors would like to thank Haorui Zhang, Kaiyuan Huang, Suming Yu, and Zhenyu Shi for\nproviding guidance on professional photography and reinforcement learning. We also thank\nYuanming Hu for giving us lots of guidance. We are grateful to the anonymous reviewers for the\ninsightful comments.\n\n3 Due to the limited availability of the codes during this work, we were not able to access all the source\ncodes. 
Therefore, for different methods we adopt different strategies: for Exposure, FI, and DPED, we use the already-trained models to produce the results; for CycleGAN, we retrain it on the MIT-Adobe 5K training dataset; for Deep Guided Filter and Deep Photo Enhancer, we use the demos from their own websites (http://wuhuikai.me/DeepGuidedFilterProject/ and http://www.cmlab.csie.ntu.edu.tw/project/Deep-Photo-Enhancer/, respectively) to derive the imagery results.

Figure 3: Retouched images of different algorithms. From left to right, top to bottom: original input image, our DeepExposure I, DeepExposure II, Exposure [14], FI [10], Expert C, DPED [16], CycleGAN [41], Deep Photo Enhancer [5], and Deep Guided Filter [36].

References

[1] Sabzali Aghagolzadeh and Okan K Ersoy. Transform image enhancement. Optical Engineering, 31(3):614–627, 1992.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), pages 214–223, 2017.

[3] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 97–104. IEEE, 2011.

[4] Jianrui Cai, Shuhang Gu, and Lei Zhang. Learning a deep single image contrast enhancer from multi-exposure images. IEEE Transactions on Image Processing, 27(4):2049–2062, 2018.

[5] Yu-Sheng Chen, Yu-Ching Wang, Man-Hsin Kao, and Yung-Yu Chuang. Deep photo enhancer: Unpaired learning for image enhancement from photographs with GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[6] Gabriel Eilertsen, Joel Kronander, Gyorgy Denes, Rafał K Mantiuk, and Jonas Unger. HDR image reconstruction from a single exposure using deep CNNs.
ACM Transactions on Graphics, 36(6):178, 2017.

[7] Hui Fang and Meng Zhang. Creatism: A deep-learning photographer capable of creating professional work. arXiv preprint arXiv:1707.03491, 2017.

[8] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision (IJCV), 59(2):167–181, 2004.

[9] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300, 2017.

[10] Xueyang Fu, Delu Zeng, Yue Huang, Yinghao Liao, Xinghao Ding, and John Paisley. A fusion-based enhancing method for weakly illuminated images. Signal Processing, 129:82–96, 2016.

[11] Michaël Gharbi, Jiawen Chen, Jonathan T Barron, Samuel W Hasinoff, and Frédo Durand. Deep bilateral learning for real-time image enhancement. ACM Transactions on Graphics (TOG), 36(4):118, 2017.

[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.

[13] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NIPS), pages 5769–5779, 2017.

[14] Yuanming Hu, Hao He, Chenxi Xu, Baoyuan Wang, and Stephen Lin. Exposure: A white-box photo post-processing framework. arXiv preprint arXiv:1709.09602, 2017.

[15] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool. WESPE: Weakly supervised photo enhancer for digital cameras. arXiv preprint arXiv:1709.01118, 2017.

[16] Andrey Ignatov, Nikolay Kobyshev, Kenneth Vanhoey, Radu Timofte, and Luc Van Gool. DSLR-quality photos on mobile devices with deep convolutional networks.
In International Conference on Computer Vision (ICCV), 2017.

[17] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.

[18] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.

[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] Tony Kuyper. Luminosity masks. goodlight.us/writing/luminositymasks/luminositymasks-1.html.

[21] Jia Li. Application of image enhancement method for digital images based on Retinex theory. Optik-International Journal for Light and Electron Optics, 124(23):5986–5988, 2013.

[22] Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, and Jan Kautz. A closed-form solution to photorealistic image stylization. arXiv preprint arXiv:1802.06474, 2018.

[23] Zhengguo Li, Zhe Wei, Changyun Wen, and Jinghong Zheng. Detail-enhanced multi-scale exposure fusion. IEEE Transactions on Image Processing, 26(3):1243–1252, 2017.

[24] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[25] Tom Mertens, Jan Kautz, and Frank Van Reeth. Exposure fusion: A simple and practical alternative to high dynamic range photography. In Computer Graphics Forum, volume 28, pages 161–171. Wiley Online Library, 2009.

[26] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning.
In International Conference on Machine Learning (ICML), pages 1928–1937, 2016.

[27] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[28] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), volume 99, pages 278–287, 1999.

[29] Jongchan Park, Joon-Young Lee, Donggeun Yoo, and In So Kweon. Distort-and-recover: Color enhancement using deep reinforcement learning. arXiv preprint arXiv:1804.04450, 2018.

[30] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), 2017.

[31] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), 2014.

[32] Hossein Talebi and Peyman Milanfar. NIMA: Neural image assessment. IEEE Transactions on Image Processing, 2018.

[33] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.

[34] George E Uhlenbeck and Leonard S Ornstein. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.

[35] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.

[36] Huikai Wu, Shuai Zheng, Junge Zhang, and Kaiqi Huang.
Fast end-to-end trainable guided filter. arXiv preprint arXiv:1803.05619, 2018.

[37] Jianzhou Yan, Stephen Lin, Sing Bing Kang, and Xiaoou Tang. A learning-to-rank approach for image color enhancement. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[38] Huan Yang, Baoyuan Wang, Noranart Vesdapunt, Minyi Guo, and Sing Bing Kang. Personalized attention-aware exposure control using reinforcement learning. arXiv preprint arXiv:1803.02269, 2018.

[39] Xin Yang, Ke Xu, Yibing Song, Qiang Zhang, Xiaopeng Wei, and Rynson Lau. Image correction via deep reciprocating HDR transformation. arXiv preprint arXiv:1804.04371, 2018.

[40] Lu Yuan and Jian Sun. Automatic exposure correction of consumer photographs. In European Conference on Computer Vision (ECCV), pages 771–785. Springer, 2012.

[41] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), pages 2242–2251, 2017.