<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/latest.js?config=AM_CHTML"></script>
<link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet"
integrity="sha384-EVSTQN3/azprG1Anm3QDgpJLIm9Nao0Yz1ztcQTwFspd3yD65VohhpuuCOmLASjC" crossorigin="anonymous">
<link rel="stylesheet" href="https://www.w3schools.com/w3css/4/w3.css">
<style>
a {
color: #212529;
text-decoration: none;
font-style: italic;
}
a:hover {
color: rgb(49, 117, 205);
}
p {
text-align: justify;
}
</style>
<title>Representative Color Transform for Image Enhancement</title>
</head>
<body>
<div id="header" class="h-25 pt-4" style="background-image: url('images/background.jpg'); color: whitesmoke;">
<div class="w-50 mx-auto text-center">
<h1>Representative Color Transform for Image Enhancement</h1>
<h5>An unofficial implementation of the paper by Kim et al.: “Representative Color Transform for Image Enhancement”</h5>
<p class="text-center">by <a style="color: white;" href="https://github.com/ThanosM97">Athanasios
Masouris</a> and <a style="color: white;" href="https://github.com/stypoumic">Stylianos
Poulakakis-Daktylidis</a></p>
</div>
</div>
<section class="w-50 mx-auto">
<!-- INTRODUCTION & PROBLEM STATEMENT -->
<h2 class="mt-5">Introduction & Problem Statement</h2>
<p>
In the modern digital era, humanity is estimated to snap as many pictures every two minutes as were taken
in the entire 19th century. Nevertheless, these photographs are often of low quality and limited dynamic
range, with under- or over-exposed lighting conditions. Additionally, in professional photography the go-to
output format is RAW rather than JPEG or PNG; it preserves all of the dynamic information in the photograph,
but at the cost of often darker images that require additional processing. Consequently, image enhancement
and refinement techniques have become increasingly prominent for improving the visual aesthetics of photos.
</p>
<p>
Naturally, many attempts have been proposed over the years to address the issue of image refinement, making
considerable progress in that regard.
In particular, contemporary research follows two distinct main approaches, namely the encoder-decoder
structure (<a href="#2">Chen et al. (2018)</a>, <a href="#3">Yan et al. (2016)</a>,
<a href="#4">Yang et al. (2020)</a>, <a href="#5">Kim et al. (2020)</a>) and the performance of global
enhancements through intensity transformations (<a href="#6">Deng et al. (2018)</a>, <a href="#7">Kim et al.
(2020)</a>,
<a href="#8">Park et al. (2018)</a>, <a href="#9">Hu et al. (2018)</a>, <a href="#10">Kosugi et al.
(2020)</a>, <a href="#11">Guo et al. (2020)</a>), shown in <a href="#fig1">Figures 1a</a> and <a
href="#fig1">1b</a> respectively. However, the encoder-decoder structure
has some limitations in that details of the input image are not preserved and the input is restricted to
fixed sizes, whereas global approaches do
not consider all channels simultaneously and rely on pre-defined color spaces and operations, which may be
insufficient for estimating arbitrary (and highly non-linear)
mappings between low- and high-quality images.
</p>
<p>
On the contrary, the recent work of <a href="#1">Kim et al. (2021)</a> successfully addresses most of these
limitations by utilizing <i>Representative Color Transforms (RCT)</i>.
The proposed method demonstrates an increased capacity for color transformations by utilizing adaptive
representative colors derived from the input image
and is applied independently to each pixel, hence allowing the enhancement of images of arbitrary size
without the need for resizing. These advantages motivated us to reproduce their state-of-the-art
architecture in the context of this project. An additional incentive was the lack of an official code
implementation for this work, which gave us the opportunity to gain hands-on experience by building our
own unofficial implementation.
</p>
<figure id="fig1" class="text-center">
<img src="images/image_refinement_approaches.PNG" width="80%" style="cursor:zoom-in" alt="likelihoods"
class="figure-img"
onclick="document.getElementById('modal').style.display='block'; document.getElementById('modal-img').src='images/image_refinement_approaches.PNG';">
<figcaption class="figure-caption text-center">Figure 1: Outlines of image enhancement approaches: (a)
encoder-decoder, (b) intensity transformation, and (c) representative
color transform models, adapted from <a href="#1">Kim et al. (2021)</a>.</figcaption>
</figure>
<p>
In <a href="#1">Kim et al. (2021)</a> a novel image enhancement approach is introduced, namely
Representative Color Transforms, yielding large capacities for color transformations.
The overall proposed network comprises four components, namely encoder, feature fusion, global RCT, and local
RCT, and is depicted in <a href="#fig1">Figure 1c</a>. First, the encoder is utilized
for extracting high-level context information, which is in turn leveraged for determining representative
and transformed (in RGB) colors for the input image. Subsequently,
an attention mechanism is used for mapping each pixel color in the input image to the representative colors,
by computing their similarity. The last step involves the
application of representative color transforms using both coarse- and fine-scale features from the feature
fusion component to obtain enhanced images from the global and
local RCT modules, which are combined to produce the final image.
</p>
<!-- IMPLEMENTATION -->
<h2 class="mt-4">Implementation</h2>
<p>
RCTNet consists of 4 main components, namely encoder, feature fusion, global RCT, and local RCT, with its
overall architecture being depicted in <a href="#fig2">Figure 2</a>.
</p>
<figure id="fig2" class="text-center">
<img src="images/architecture.PNG" width="100%" style="cursor:zoom-in" alt="likelihoods" class="figure-img"
onclick="document.getElementById('modal').style.display='block'; document.getElementById('modal-img').src='images/architecture.PNG';">
<figcaption class="figure-caption text-center">Figure 2: An overview of the proposed RCTNet, adapted from <a
href="#1">Kim et al. (2021)</a>.</figcaption>
</figure>
<h4 class="mt-4">Encoder</h4>
<p>
In computer vision, encoders are generally used to extract high-level feature maps from an input image using
convolutional neural networks. The image is passed through the successive convolutional layers of the
encoder, with each consecutive layer extracting higher-level features thanks to its increased receptive
field. In the case of RCTNet, however, instead of only using the highest-level feature maps, multi-scale
features are extracted from the last 4 layers of the encoder. The encoder comprises a stack of 6
<i>conv-bn-swish</i> blocks, where <i>conv-bn-swish</i> denotes a block consisting of a convolution,
followed by batch normalization and a swish activation layer. The convolutional layers of the first 5 blocks
use a `3 \times 3` kernel, while the last block uses a `1 \times 1` kernel, followed by a global average
pooling layer.
</p>
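<p>
For concreteness, the sketch below shows one possible PyTorch implementation of the <i>conv-bn-swish</i>
block and the six-block encoder. The channel widths and strides are our own assumptions (the text above only
fixes the kernel sizes and the final pooling), and all module names are illustrative rather than part of any
official code.
</p>
<pre class="bg-light p-3"><code>
import torch
import torch.nn as nn

class ConvBNSwish(nn.Module):
    """Convolution, followed by batch normalization and a Swish (SiLU) activation."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()  # Swish with beta = 1

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Encoder(nn.Module):
    """Six conv-bn-swish blocks; the outputs of the last four form the multi-scale features."""
    def __init__(self, channels=(16, 32, 64, 128, 128, 128)):  # channel widths: our assumption
        super().__init__()
        blocks, in_ch = [], 3
        for i, out_ch in enumerate(channels):
            k = 1 if i == 5 else 3   # 3x3 kernels for the first 5 blocks, 1x1 for the last
            s = 1 if i == 5 else 2   # stride-2 downsampling for the first 5 blocks: our assumption
            blocks.append(ConvBNSwish(in_ch, out_ch, kernel_size=k, stride=s))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling after the last block

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        feats[-1] = self.gap(feats[-1])
        return feats[-4:]                    # multi-scale features from the last 4 blocks
</code></pre>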
<h4 class="mt-4">Feature Fusion</h4>
<p>
The feature fusion module aggregates the multi-scale feature maps, and by extension information from
different contexts.
More specifically, feature maps of the coarsest encoder layers exploit their larger receptive fields to
encapsulate global contexts, while features from lower levels
preserve detailed local contexts. RCTNet's feature fusion component is constructed by bidirectional
cross-scale connections, as in <a href="#12">Tan et al. (2020)</a>, with each single input
node in <a href="#fig2">Figure 2</a> corresponding to one <i>conv-bn-swish</i> block. For nodes with
multiple inputs a feature fusion layer precedes the <i>conv-bn-swish</i> block, with its output
being defined as:
</p>
<div class="text-center my-2"> ` O = \sum_{i=1}^{M} \frac{w_i}{\epsilon + \sum_{j} w_j} I_i `</div>
<p>
where `I_i` are the `M` input feature maps and `w_i` are learnable weights for each input. All nodes have 128
convolutional filters with a `3 \times 3` kernel, except for the coarsest-level nodes (red nodes), which use
a `1 \times 1` kernel instead.
</p>
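<p>
A minimal PyTorch sketch of this weighted fusion layer is given below; it follows the fast normalized fusion
of <a href="#12">Tan et al. (2020)</a>, and using a ReLU to keep the weights non-negative is an
implementation choice on our part. In RCTNet, the output of such a layer is then passed through the node's
<i>conv-bn-swish</i> block.
</p>
<pre class="bg-light p-3"><code>
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFeatureFusion(nn.Module):
    """Fuses M same-shaped feature maps with learnable, normalized, non-negative weights."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.weights)            # keep the fusion weights non-negative
        w = w / (self.eps + w.sum())        # normalize as in the formula above
        return sum(wi * xi for wi, xi in zip(w, inputs))
</code></pre>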
<h4 class="mt-4">Image Feature Map</h4>
<p>
An additional independent <i>conv-bn-swish</i> block is applied to the input image, thus extracting the
image feature map: `F \in \mathbb{R}^{H \times W \times C}` , with the value of 16 being
selected for the feature dimension `C`.
</p>
<h4 class="mt-4">Global RCT</h4>
<p>
The Global RCT component takes as input the feature map (spatial resolution: `1 \times 1`) of the feature
fusion's coarsest level (last red node), utilizing its global
context to determine representative color features (`R_G \in \mathbb{R}^{C \times N_G}`) and transformed
colors in RGB (`T_G \in \mathbb{R}^{3 \times N_G}`) through two distinct <i>conv-bn-swish</i> blocks.
The selected values for the feature dimension `C` and the number of global representative colors `N_G` are
16 and 64, respectively. Each of the `N_G` vectors of `T_G` (`t_i`) corresponds to the transformed RGB values
of the `i^{th}` representative color.
</p>
<p>
The next step involves the application of the RCT transform, which takes as inputs the reshaped input
features `F_r \in \mathbb{R}^{HW \times C}`, the
representative color features `R_G` and the transformed colors `T_G` and produces an enhanced image `Y_G`.
Since only `N_G` representative colors are included in `T_G`,
the first step of RCT involves mapping the pixel colors of the original image to the representative colors,
which requires computing the similarity between pixel and representative colors. This is done with a
scaled dot-product attention mechanism:
</p>
<div class="text-center my-2"> ` A = softmax(\frac{F_r R_G}{\sqrt(C)}) \in \mathbb{R}^{HW \times N} `</div>
<p>
where each attention weight `a_{ij}` corresponds to the similarity between the `j^{th}` representative color
and the `i^{th}` pixel. Subsequently,
the enhanced image `Y_G` is produced as:
</p>
<div class="text-center my-2"> ` Y = A T^T `</div>
<p>
<i>i.e.</i>, for the `i^{th}` pixel, the products of its attention weights with the corresponding transformed
representative colors are summed over `j` to determine the pixel's enhanced RGB values.
</p>
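<p>
In code, the RCT transform reduces to a scaled dot-product attention followed by a matrix product. The
PyTorch-style sketch below is our own reading of the above; the function name and tensor layout are
illustrative choices, not part of any official API.
</p>
<pre class="bg-light p-3"><code>
import torch

def rct_transform(feature_map, rep_colors, transformed_colors):
    """Apply the representative color transform.
    feature_map:        (B, C, H, W)   image features F
    rep_colors:         (B, C, N)      representative color features R
    transformed_colors: (B, 3, N)      transformed RGB colors T
    Returns an enhanced image of shape (B, 3, H, W).
    """
    b, c, h, w = feature_map.shape
    f = feature_map.flatten(2).transpose(1, 2)                   # (B, HW, C) reshaped features F_r
    attn = torch.softmax(f @ rep_colors / c ** 0.5, dim=-1)      # (B, HW, N) pixel-to-color similarities
    y = attn @ transformed_colors.transpose(1, 2)                # (B, HW, 3) weighted transformed colors
    return y.transpose(1, 2).reshape(b, 3, h, w)
</code></pre>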
<h4 class="mt-4">Local RCT</h4>
<p>
The Local RCT component takes as input the feature map (spatial resolution: `32 \times 32`) of the feature
fusion's finest level
(last blue node), this time utilizing local context to determine representative color
features (`R_L \in \mathbb{R}^{32 \times 32 \times C \times N_L}`) and
transformed colors in RGB (`T_L \in \mathbb{R}^{32 \times 32 \times 3 \times N_L}`) through two distinct
<i>conv-bn-swish</i> blocks. The selected values for the
feature dimension `C` and the number of local representative colors `N_L` are both 16.
</p>
<p>
Subsequently, the local RCT module takes as inputs `R_L` and `T_L` and produces different sets of
representative features and transformed colors for
different areas of the input image. To achieve that, a `31 \times 31` uniform mesh grid is set on the input
image, thus producing `32 \times 32` corner points in the image (each corresponding to
one of the `32 \times 32` spatial positions of `R_L` and `T_L`), as shown in <a href="#fig3">Figure 3</a>
for a `5 \times 5` mesh grid example. Each grid position `B_k` is related to four corner
points, thus four sets of representative features and transformed colors, which are concatenated to produce
`R_k` and `T_k`. A grid image feature `F_k` is
also extracted from `F` (described in the Image Feature Map section) by cropping the corresponding
grid region. Finally, `F_k`, `R_k`, and `T_k` are fed to the RCT transform described in the Global
RCT section to yield the local enhanced image region `Y_k`. This process is replicated for all grid
positions to produce the final enhanced image `Y_L`.
</p>
<figure id="fig3" class="text-center">
<img src="images/local_RCT.PNG" width="50%" style="cursor:zoom-in" alt="likelihoods" class="figure-img"
onclick="document.getElementById('modal').style.display='block'; document.getElementById('modal-img').src='images/local_RCT.PNG';">
<figcaption class="figure-caption text-center">Figure 3: An illustration of Local RC, adapted from <a
href="#1">Kim et al. (2021)</a>.</figcaption>
</figure>
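<p>
The sketch below illustrates, in naive loop form, how the local RCT could be applied. The tensor layout, the
mapping from pixels to grid cells, and the function names are our own assumptions, and a practical
implementation would vectorize these operations; it reuses the `rct_transform` helper sketched in the Global
RCT section.
</p>
<pre class="bg-light p-3"><code>
import torch

def local_rct(F, R_L, T_L, rct_transform):
    """Naive sketch of the local RCT.
    F:   (B, C, H, W)        image feature map
    F:   R_L (B, C, N_L, G, G) and T_L (B, 3, N_L, G, G), with G = 32 corner points per side.
    """
    b, c, h, w = F.shape
    g = R_L.shape[-1] - 1                              # number of grid cells per side (e.g. 31)
    ys = torch.linspace(0, h, g + 1).long().tolist()   # cell boundaries in pixel coordinates
    xs = torch.linspace(0, w, g + 1).long().tolist()
    Y_L = F.new_zeros(b, 3, h, w)
    for i in range(g):
        for j in range(g):
            # gather the 4 corner points of cell (i, j) and concatenate their color sets
            corners = [(i, j), (i, j + 1), (i + 1, j), (i + 1, j + 1)]
            R_k = torch.cat([R_L[..., y, x] for y, x in corners], dim=-1)  # (B, C, 4*N_L)
            T_k = torch.cat([T_L[..., y, x] for y, x in corners], dim=-1)  # (B, 3, 4*N_L)
            F_k = F[:, :, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]                # crop of the grid region
            Y_L[:, :, ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = rct_transform(F_k, R_k, T_k)
    return Y_L
</code></pre>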
<h4 class="mt-4">Global-Local RCT Fusion</h4>
<p>
Finally, the enhanced images obtained from the global `Y_G` and local `Y_L` RCT components are combined to
produce the final enhanced image `\tilde{Y}` as:
</p>
<div class="text-center my-2"> ` \tilde{Y} = \alpha Y_G + \beta Y_L`</div>
<p>
where `\alpha` and `\beta` are non-negative learnable weights.
</p>
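<p>
This final combination is straightforward to express in code; a minimal sketch is given below, where
enforcing non-negativity with a ReLU is one possible choice on our part.
</p>
<pre class="bg-light p-3"><code>
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalFusion(nn.Module):
    """Combines the global and local RCT outputs with learnable non-negative weights."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, y_global, y_local):
        # ReLU keeps both weights non-negative (one way to impose the constraint)
        return F.relu(self.alpha) * y_global + F.relu(self.beta) * y_local
</code></pre>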
<h4 class="mt-4">Loss Function</h4>
<p>
The loss function comprises two terms. The first term is the mean absolute error (<i>L1 loss</i>)
between the predicted and ground-truth enhanced images. The second term is the sum of the <i>L1</i> losses
between the feature representations
extracted for the predicted and ground-truth images from the `2^{nd}`, `4^{th}`, and `6^{th}` layer of a
<i>VGG-16</i> [<a href="#13">Simonyan et al. (2014)</a>] network, pretrained
on ImageNet [<a href="#14">Russakovsky et al. (2015)</a>]. Consequently, given the predicted high-quality
image `\tilde{Y}` and the ground-truth high-quality image `Y`, the loss function is given as:
</p>
<div class="text-center my-2"> ` \mathcal{L} = || \tilde{Y} - Y ||_1 + \lambda \sum_{k=2,4,6} ||
\phi^k(\tilde{Y}) - \phi^k(Y) ||_1`</div>
<p>
where the hyperparameter `\lambda` was set to 0.04 to balance the two terms.
</p>
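<p>
A PyTorch sketch of this loss is given below. Which activations of torchvision's VGG-16 correspond to the
paper's “2nd, 4th, and 6th layer” is our own interpretation (we tap the outputs right after the 2nd, 4th,
and 6th convolutions), and ImageNet normalization of the VGG inputs is omitted for brevity.
</p>
<pre class="bg-light p-3"><code>
import torch
import torch.nn as nn
from torchvision import models

class RCTLoss(nn.Module):
    """L1 reconstruction loss plus a VGG-16 perceptual term weighted by lambda = 0.04."""
    def __init__(self, lam=0.04):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.taps = (3, 8, 13)   # activations right after the 2nd, 4th and 6th convolutions (our reading)
        self.lam = lam
        self.l1 = nn.L1Loss()

    def forward(self, pred, target):
        loss = self.l1(pred, target)
        xp, xt = pred, target
        for i, layer in enumerate(self.vgg):
            xp, xt = layer(xp), layer(xt)
            if i in self.taps:
                loss = loss + self.lam * self.l1(xp, xt)
            if i == max(self.taps):
                break                # no need to run the deeper VGG layers
        return loss
</code></pre>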
<!-- RESULTS -->
<h2 class="mt-4">Experiments</h2>
<h4 class="mt-4">Dataset</h4>
<p>
The <i>LOw-Light (LOL)</i> dataset [<a href="#15">Wei et al. (2018)</a>] for image enhancement in low-light
scenarios was used for the purposes of our
experiment. It is composed of a training partition containing 485 pairs of low- and normal-light images,
and a test partition containing 15 such pairs.
All the images have a resolution of `400 \times 600`. For the purposes of training, all images were randomly cropped and rotated by a multiple of 90 degrees.
</p>
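<p>
The paired augmentation can be sketched as follows; the crop size is our own choice, since the text above
does not fix it, and a real data loader would wrap this function in a dataset class.
</p>
<pre class="bg-light p-3"><code>
import random
import torch
import torchvision.transforms.functional as TF

def augment_pair(low, high, crop_size=256):
    """Apply the same random crop and random 90-degree rotation to a low/normal-light
    image pair, given as tensors of shape (C, H, W). The crop size is illustrative."""
    h, w = low.shape[-2], low.shape[-1]
    top = random.randint(0, h - crop_size)
    left = random.randint(0, w - crop_size)
    low = TF.crop(low, top, left, crop_size, crop_size)
    high = TF.crop(high, top, left, crop_size, crop_size)
    k = random.randint(0, 3)                  # rotate both images by k * 90 degrees
    low = torch.rot90(low, k, dims=(1, 2))
    high = torch.rot90(high, k, dims=(1, 2))
    return low, high
</code></pre>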
<h4 class="mt-4">Evaluation Metrics</h4>
<p>
The perceived enhancement of an image by different methods can be subjective. Therefore, it is essential to
establish metrics that allow the comparison of different image enhancement algorithms in terms of the
produced image quality. For the quantitative
evaluation of RCTNet we leveraged two distinct evaluation
metrics, which are well-established for assessing image enhancement models, namely <i>peak signal-to-noise
ratio (PSNR)</i> and <i>Structural SIMilarity (SSIM)</i>.
</p>
<p>
PSNR corresponds to the ratio between the maximum possible power of a signal and the power of the noise that
corrupts it, and is expressed on a logarithmic decibel scale. In the image domain, it relates the maximum
pixel value of the ground-truth enhanced image (`Y`) to the distortion between `Y` and the enhanced image
prediction (`\tilde{Y}`) produced by the network, as:
</p>
<div class="text-center my-2"> ` PSNR = 20log_{10}(\frac{max(Y)}{MSE(Y,\tilde(Y))}) `</div>
<p>
where MSE is the mean squared error between the ground-truth and predicted images. Therefore, higher
PSNR values correspond to a better reconstruction of the degraded images. For colour images, the MSE is
averaged across the individual channels. Nevertheless, PSNR is limited in that it relies solely on numerical
pixel-value comparisons, disregarding perceptual factors of the human visual system, which brings us to SSIM.
</p>
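<p>
In code, the metric reduces to a few lines; the sketch below assumes images given as NumPy arrays scaled to
`[0, 1]`, so that `max(Y)` corresponds to the data range of 1.
</p>
<pre class="bg-light p-3"><code>
import numpy as np

def psnr(y_true, y_pred, max_val=1.0):
    """PSNR in dB between a ground-truth and a predicted image (NumPy arrays in [0, 1]).
    The MSE is averaged over all pixels and colour channels."""
    mse = np.mean((y_true.astype(np.float64) - y_pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 20.0 * np.log10(max_val / np.sqrt(mse))
</code></pre>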
<p>
SSIM, introduced by <a href="#16">Wang et al. (2004)</a>, attempts to replicate the behaviour of the human
visual perception system, which is highly capable of identifying structural information
in a scene, and by extension differences between the predicted and ground-truth enhanced versions of an
image. The value ranges from `-1` to `1`, where `1`
corresponds to identical images. SSIM extracts three key features from an image, namely luminance, contrast,
and structure, applies a comparison function to each of these features for the two given images, and finally
combines the comparisons into a single score.
</p>
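<p>
Rather than re-implementing these comparison functions, one can rely on an existing implementation; the
sketch below uses scikit-image (the `channel_axis` argument requires a recent version of the library).
</p>
<pre class="bg-light p-3"><code>
from skimage.metrics import structural_similarity

def ssim_rgb(y_true, y_pred):
    """SSIM between two RGB images given as (H, W, 3) arrays in [0, 1];
    scikit-image averages the score over the colour channels."""
    return structural_similarity(y_true, y_pred, channel_axis=-1, data_range=1.0)
</code></pre>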
<h4 class="mt-4">Quantitative Evaluation</h4>
<p>
The results, in terms of the PSNR and SSIM evaluation metrics, calculated for our implementation of RCTNet
are depicted in <a href="#tab1">Table 1</a>, along with results of competing image
enhancement methods and the official implementation of RCTNet, as reported in <a href="#1">Kim et al.
(2021)</a>. It is evident that our results fall short of those reported for the official implementation on both examined metrics.
</p>
<div class="center d-flex justify-content-around">
<div>
<table id="tab1" style="width: 75%; margin-left: auto; margin-right: auto;"
class="table table-border text-center">
<caption class="figure-caption text-center" width="100%">Table 1: Quantitative comparison on the LoL
dataset [<a href="#16">Wang et al. (2004)</a>].
The best results are boldfaced and the second best ones are
underlined. Our results correspond to the mean value of 100 random seed executions (*).
</caption>
<thead>
<tr>
<th>Method</th>
<th>PSNR</th>
<th>SSIM</th>
</tr>
</thead>
<tbody>
<tr>
<td>NPE [<a href="#17">Wang et al. (2013)</a>]</td>
<td>16.97</td>
<td>0.589</td>
</tr>
<tr>
<td>LIME [<a href="#18">Guo et al. (2016)</a>]</td>
<td>15.24</td>
<td>0.470</td>
</tr>
<tr>
<td>SRIE [<a href="#19">Fu et al. (2016)</a>]</td>
<td>17.34</td>
<td>0.686</td>
</tr>
<tr>
<td>RRM [<a href="#20">Li et al. (2016)</a>]</td>
<td>17.34</td>
<td>0.686</td>
</tr>
<tr>
<td>SICE [<a href="#21">Cai et al. (2018)</a>]</td>
<td>19.40</td>
<td>0.690</td>
</tr>
<tr>
<td>DRD [<a href="#15">Wei et al. (2018)</a>]</td>
<td>16.77</td>
<td>0.559</td>
</tr>
<tr>
<td>KinD [<a href="#22">Zhang et al. (2019)</a>]</td>
<td>20.87</td>
<td>0.802</td>
</tr>
<tr>
<td>DRBN [<a href="#4">Yang et al. (2020)</a>]</td>
<td>20.13</td>
<td><b>0.830</b></td>
</tr>
<tr>
<td>ZeroDCE [<a href="#11">Guo et al. (2020)</a>]</td>
<td>14.86</td>
<td>0.559</td>
</tr>
<tr>
<td>EnlightenGAN [<a href="#23">Jiang et al. (2021)</a>]</td>
<td>15.34</td>
<td>0.528</td>
</tr>
<tr>
<td>RCTNet [<a href="#1">Kim et al. (2021)</a>]</td>
<td><u>22.67</u></td>
<td>0.788</td>
</tr>
<tr>
<td>RCTNet (ours)* </td>
<td>19.96</td>
<td>0.768</td>
</tr>
<tr>
<td>RCTNet + BF [<a href="#1">Kim et al. (2021)</a>]</td>
<td><b>22.81</b></td>
<td><u>0.827</u></td>
</tr>
</tbody>
</table>
</div>
</div>
<p>
Interestingly, the results of <a href="#tab1">Table 1</a> deviate significantly when the augmentations proposed by the authors
(random cropping and random rotation by a multiple of 90 degrees) are also applied during evaluation. This finding
indicates that the model favours augmented images, since during training we applied the augmentation operations to all
input images in every epoch. While the authors refer to the same augmentations, they do not specify the frequency with which those
augmentations were performed. This phenomenon becomes more evident from the quantitative results obtained when augmentations are applied
to the test images, as shown in <a href="#tab2">Table 2</a>. Furthermore, the innate randomness of the augmentation operations leads to a high variance in
both metrics, and thus a less robust model. To account for this variance, we executed our evaluation with 100 randomly selected seeds.
The mean, standard deviation, maximum, and minimum values of both evaluation metrics, when augmentations are also included
in the test set, are shown in <a href="#tab2">Table 2</a>. Additionally, Figures <a href="#fig4">4a</a> and <a href="#fig4">4b</a> plot the density distributions of
PSNR and SSIM, respectively, illustrating the observed high variance for both metrics.
</p>
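<p>
The per-seed aggregation behind Table 2 can be expressed as follows; `evaluate` is a hypothetical helper that
runs the test set under a given random seed and returns the (PSNR, SSIM) pair.
</p>
<pre class="bg-light p-3"><code>
import numpy as np

def summarize_over_seeds(evaluate, num_seeds=100):
    """Collect per-seed (PSNR, SSIM) results and report mean, std, max, and min."""
    psnr_vals, ssim_vals = [], []
    for seed in range(num_seeds):
        p, s = evaluate(seed=seed)        # hypothetical evaluation routine
        psnr_vals.append(p)
        ssim_vals.append(s)
    stats = {}
    for name, vals in (("PSNR", np.array(psnr_vals)), ("SSIM", np.array(ssim_vals))):
        stats[name] = dict(mean=vals.mean(), std=vals.std(), max=vals.max(), min=vals.min())
    return stats
</code></pre>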
<div class="center d-flex justify-content-around">
<div>
<table id="tab2" style="width: 70%; margin-left: auto; margin-right: auto;"
class="table table-border text-center">
<caption class="figure-caption text-center">Table 2: Mean, standard deviation, maximum, and
minimum values for PSNR and SSIM, for 100 executions with different random seeds, when augmentations are also included in the test set.
</caption>
<thead>
<tr>
<th>Evaluation Metric</th>
<th>Mean</th>
<th>Standard Deviation</th>
<th>Max</th>
<th>Min</th>
</tr>
</thead>
<tbody>
<tr>
<td>PSNR</td>
<td>20.522</td>
<td>0.594</td>
<td>22.003</td>
<td>18.973</td>
</tr>
<tr>
<td>SSIM</td>
<td>0.816</td>
<td>0.009</td>
<td>0.839</td>
<td>0.787</td>
</tr>
</tbody>
</table>
</div>
</div>
<figure id="fig4" class="text-center">
<img src="images/PSNR_distribution.png" width="80%" height="auto" class="figure-img"
style="margin-left: auto;margin-right: auto;cursor:zoom-in;"
alt="PSNR density distribution"
onclick="document.getElementById('modal').style.display='block'; document.getElementById('modal-img').src='images/PSNR_distribution.png';" />
<img src="images/SSIM_distribution.png" width="80%" height="auto"
alt="SSIM density distribution"
style="margin-left: auto;margin-right: auto;cursor:zoom-in;" clas="figure-img"
onclick="document.getElementById('modal').style.display='block'; document.getElementById('modal-img').src='images/SSIM_distribution.png';" />
<figcaption class="figure-caption text-center">Figure 4: Density distributions of the
measured values for (a) PSNR and (b) SSIM after 100 executions with different random
seeds, when augmentations are also included in the test set.</figcaption>
</figure>
<h4 class="mt-4">Qualitative Evaluation</h4>
<p>
In <a href="#tab3">Table 3</a> some image enhancement results of the implemented RCTNet are
shown, compared to the low-light input images and the
ground-truth normal-light output images. From these examples it becomes evident that RCTNet
has successfully learned how to enhance low-light
images, achieving comparable results to the ground-truth images in terms of exposure and
color tones. Nevertheless, the produced images are slightly less saturated, and noise is more
prominent. We conjecture that training the network for more
epochs could alleviate some of these limitations. It is also observed that
RCTNet fails to extract certain representative colors that are only available in small
regions of the input image (<i>e.g.</i> the green color for the `4^{th}` image).
</p>
<div class="center d-flex justify-content-around">
<div>
<table id="tab3" style="width: 100%; margin-left: auto; margin-right: auto;"
class="table table-border text-center">
<caption class="figure-caption text-center">Table 3: Qualitative comparison on the
LoL dataset for an RCTNet trained for 500 epochs.
</caption>
<thead>
<tr>
<th>Input</th>
<th>RCTNet</th>
<th>Ground-Truth</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="images/1_low.png" alt="" width="100%" /></td>
<td><img src="images/1_prod.png" alt="" width="100%" /></td>
<td><img src="images/1_high.png" alt="" width="100%" /></td>
</tr>
<tr>
<td><img src="images/2_low.png" alt="" width="100%" /></td>
<td><img src="images/2_prod.png" alt="" width="100%" /></td>
<td><img src="images/2_high.png" alt="" width="100%" /></td>
</tr>
<tr>
<td><img src="images/3_low.png" alt="" width="100%" /></td>
<td><img src="images/3_prod.png" alt="" width="100%" /></td>
<td><img src="images/3_high.png" alt="" width="100%" /></td>
</tr>
<tr>
<td><img src="images/4_low.png" alt="" width="100%" /></td>
<td><img src="images/4_prod.png" alt="" width="100%" /></td>
<td><img src="images/4_high.png" alt="" width="100%" /></td>
</tr>
<tr>
<td><img src="images/5_low.png" alt="" width="100%" /></td>
<td><img src="images/5_prod.png" alt="" width="100%" /></td>
<td><img src="images/5_high.png" alt="" width="100%" /></td>
</tr>
</tbody>
</table>
</div>
</div>
<!-- CONCLUSIONS -->
<h2 class="mt-4">Conclusions</h2>
<p>
In conclusion, our analysis did not reproduce the results presented in the original paper.
The qualitative evaluation on the LOL dataset demonstrated our implementation's capability of learning
to successfully enhance low-light images, with color tones matching those of the ground-truth enhanced
images. The observed dissimilarity in terms of color saturation could possibly be addressed by
tuning certain hyperparameters of the model or training for more epochs. Regarding our quantitative
findings, the measured values of both PSNR and SSIM were lower for our implementation than
the ones reported for the original implementation. Nevertheless, these discrepancies could be attributed to
the frequency with which the image augmentations were performed in our implementation during training.
</p>
<div id="modal" class="w3-modal" onclick="this.style.display='none'">
<span class="w3-button w3-hover-red w3-xlarge w3-display-topright">×</span>
<div class="w3-modal-content w3-animate-zoom">
<img id="modal-img" src="" style="width:100%">
</div>
</div>
<!-- REFERENCES -->
<h2 class="mt-5">References</h2>
<p id="1">[1] Kim, H., Choi, S. M., Kim, C. S., & Koh, Y. J. (2021). Representative
Color Transform for Image Enhancement. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (pp. 4459-4468).</p>
<p id="2">[2] Chen, Y. S., Wang, Y. C., Kao, M. H., & Chuang, Y. Y. (2018). Deep photo
enhancer: Unpaired learning for image enhancement from photographs with gans. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp.
6306-6314).</p>
<p id="3">[3] Yan, Z., Zhang, H., Wang, B., Paris, S., & Yu, Y. (2016). Automatic photo
adjustment using deep neural networks. ACM Transactions on Graphics (TOG), 35(2),
1-15.</p>
<p id="4">[4] Yang, W., Wang, S., Fang, Y., Wang, Y., & Liu, J. (2020). From fidelity to
perceptual quality: A semi-supervised approach for low-light image enhancement. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
(pp. 3063-3072)</p>
<p id="5">[5] Kim, H. U., Koh, Y. J., & Kim, C. S. (2020, August). PieNet: Personalized
image enhancement network. In European Conference on Computer Vision (pp. 374-390).
Springer, Cham.</p>
<p id="6">[6] Deng, Y., Loy, C. C., & Tang, X. (2018, October). Aesthetic-driven image
enhancement by adversarial learning. In Proceedings of the 26th ACM international
conference on Multimedia (pp. 870-878).</p>
<p id="7">[7] Kim, H. U., Koh, Y. J., & Kim, C. S. (2020, August). Global and local
enhancement networks for paired and unpaired image enhancement. In European
Conference on Computer Vision (pp. 339-354). Springer, Cham.</p>
<p id="8">[8] Park, J., Lee, J. Y., Yoo, D., & Kweon, I. S. (2018). Distort-and-recover:
Color enhancement using deep reinforcement learning. In Proceedings of the IEEE
conference on computer vision and pattern recognition (pp. 5928-5936).</p>
<p id="9">[9] Hu, Y., He, H., Xu, C., Wang, B., & Lin, S. (2018). Exposure: A white-box
photo post-processing framework. ACM Transactions on Graphics (TOG), 37(2), 1-17.
</p>
<p id="10">[10] Kosugi, S., & Yamasaki, T. (2020, April). Unpaired image enhancement
featuring reinforcement-learning-controlled image editing software. In Proceedings
of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp.
11296-11303).</p>
<p id="11">[11] Guo, C., Li, C., Guo, J., Loy, C. C., Hou, J., Kwong, S., & Cong, R.
(2020). Zero-reference deep curve estimation for low-light image enhancement. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(pp. 1780-1789).</p>
<p id="12">[12] Tan, M., Pang, R., & Le, Q. V. (2020). Efficientdet: Scalable and
efficient object detection. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition (pp. 10781-10790).</p>
<p id="13">[13] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556.</p>
<p id="14">[14] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ...
& Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge.
International journal of computer vision, 115(3), 211-252.</p>
<p id="15">[15] Wei, C., Wang, W., Yang, W., & Liu, J. (2018). Deep retinex
decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560.</p>
<p id="16">[16] Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image
quality assessment: from error visibility to structural similarity. IEEE
transactions on image processing, 13(4), 600-612.</p>
<p id="17">[17] Wang, S., Zheng, J., Hu, H. M., & Li, B. (2013). Naturalness preserved
enhancement algorithm for non-uniform illumination images. IEEE transactions on
image processing, 22(9), 3538-3548.</p>
<p id="18">[18] Guo, X., Li, Y., & Ling, H. (2016). LIME: Low-light image enhancement
via illumination map estimation. IEEE Transactions on image processing, 26(2),
982-993.</p>
<p id="19">[19] Fu, X., Zeng, D., Huang, Y., Zhang, X. P., & Ding, X. (2016). A weighted
variational model for simultaneous reflectance and illumination estimation. In
Proceedings of the IEEE conference on computer vision and pattern recognition (pp.
2782-2790).</p>
<p id="20">[20] Li, C. Y., Guo, J. C., Cong, R. M., Pang, Y. W., & Wang, B. (2016).
Underwater image enhancement by dehazing with minimum information loss and histogram
distribution prior. IEEE Transactions on Image Processing, 25(12), 5664-5677.</p>
<p id="21">[21] Cai, J., Gu, S., & Zhang, L. (2018). Learning a deep single image
contrast enhancer from multi-exposure images. IEEE Transactions on Image Processing,
27(4), 2049-2062.</p>
<p id="22">[22] Zhang, Y., Zhang, J., & Guo, X. (2019, October). Kindling the darkness:
A practical low-light image enhancer. In Proceedings of the 27th ACM international
conference on multimedia (pp. 1632-1640).</p>
<p id="23">[23] Jiang, Y., Gong, X., Liu, D., Cheng, Y., Fang, C., Shen, X., ... & Wang,
Z. (2021). Enlightengan: Deep light enhancement without paired supervision. IEEE
Transactions on Image Processing, 30, 2340-2349.</p>
</section>
</body>
</html>