<script src="http://www.google.com/jsapi" type="text/javascript"></script>
<script type="text/javascript">google.load("jquery", "1.3.2");</script>
<link href="https://fonts.googleapis.com/css2?family=Open+Sans&display=swap"
rel="stylesheet">
<link rel="stylesheet" type="text/css" href="./resources/style.css" media="screen"/>
<html lang="en">
<head>
<title>Self-supervised Object-Centric Learning for Videos</title>
<meta property="og:image" content="Path to my teaser.jpg"/>
<meta property="og:title" content="SOLV" />
<meta property="og:description" content="SOLV" />
<meta property="twitter:card" content="SOLV" />
<meta property="twitter:title" content="SOLV" />
<meta property="twitter:description" content="SOLV" />
<meta property="twitter:image" content="Path to my teaser.jpg" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- Google Analytics -->
<script async
src="https://www.googletagmanager.com/gtag/js?id=UA-97476543-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'UA-97476543-1');
</script>
</head>
<body>
<div class="container">
<div class="title">
Self-supervised Object-Centric Learning for Videos
</div>
<br><br>
<div class="author">
<a href="https://github.com/gorkaydemir" target="_blank">Gorkay Aydemir</a><sup>1</sup>
</div>
<div class="author">
<a href="https://weidixie.github.io" target="_blank">Weidi Xie</a><sup>3, 4</sup>
</div>
<div class="author">
<a href="https://mysite.ku.edu.tr/fguney/" target="_blank">Fatma Guney</a><sup>1, 2</sup>
</div>
<br><br>
<div class="affiliation"><sup>1 </sup><a href="https://cs.ku.edu.tr" target="_blank">Department of Computer Engineering, Koc University</a></div>
<div class="affiliation"><sup>2 </sup><a href="https://ai.ku.edu.tr" target="_blank">KUIS AI Center</a></div>
<div class="affiliation"><sup>3 </sup><a href="https://en.sjtu.edu.cn" target="_blank">CMIC, Shanghai Jiao Tong University</a></div>
<div class="affiliation"><sup>4 </sup><a href="https://www.shlab.org.cn" target="_blank">Shanghai AI Laboratory</a></div>
<div class="venue">
NeurIPS 2023
</div>
<br><br>
<div class="links">Paper <a href="https://arxiv.org/abs/2310.06907" target="_blank"> [arXiv]</a></div>
<div class="links">Code <a href="https://github.com/gorkaydemir/SOLV" target="_blank"> [GitHub]</a></div>
<div class="links">Cite <a href="./resources/bibtex.txt" target="_blank"> [BibTeX]</a></div>
<br>
<br>
<br>
<div class="row">
<div class="cropped">
<img style="width: 80%;" src="./resources/result_gifs/teaser1.gif"/>
</div>
<div class="cropped">
<img style="width: 80%;" src="./resources/result_gifs/teaser2.gif"/>
</div>
</div><br>
<div class="box"><b> <FONT COLOR="RED">TL;DR</FONT></b> we introduce <b>SOLV</b> (Self-supervised Object Centric Learning for Videos), a self-supervised model capable of discovering multiple objects in real-world video sequences without using additional modalities. </div>
<br>
<br>
<hr>
<h1>Abstract</h1>
<p style="width: 80%;">
Unsupervised multi-object segmentation has shown impressive results on images by utilizing powerful semantics learned from self-supervised pretraining.
An additional modality such as depth or motion is often used to facilitate the segmentation in video sequences. However, the performance improvements observed in synthetic sequences, which rely on the robustness of an additional cue, do not translate to more challenging real-world scenarios.
In this paper, we propose the first fully unsupervised method for segmenting multiple objects in real-world sequences.
Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames.
From these temporally-aware slots,
the training objective is to reconstruct the middle frame in a high-level semantic feature space.
We propose a masking strategy by dropping a significant portion of tokens in the feature space for efficiency and regularization.
Additionally, we address over-clustering by merging slots based on similarity. Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
</p>
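<p style="width: 80%;">
To make the token-dropping and slot-merging ideas above concrete, the PyTorch-style sketch below illustrates one plausible way to implement them. The function names, drop ratio, and similarity threshold are illustrative assumptions for this page, not the exact settings or code of SOLV.
</p>
<pre><code>import torch
import torch.nn.functional as F

def drop_tokens(features, drop_ratio=0.5):
    # features: (num_tokens, dim) patch tokens of one frame.
    # Randomly keep a subset of tokens for efficiency and regularization;
    # the drop ratio here is an illustrative value.
    num_keep = max(1, int(features.shape[0] * (1.0 - drop_ratio)))
    keep_idx = torch.randperm(features.shape[0])[:num_keep]
    return features[keep_idx], keep_idx

def merge_similar_slots(slots, sim_threshold=0.9):
    # slots: (num_slots, dim). Greedily merge slots whose cosine similarity
    # exceeds a threshold to counter over-clustering (illustrative variant).
    normed = F.normalize(slots, dim=-1)
    sim = normed @ normed.t()
    merged, used = [], set()
    for i in range(slots.shape[0]):
        if i in used:
            continue
        group = [j for j in range(slots.shape[0])
                 if j not in used and sim[i, j] >= sim_threshold]
        used.update(group)
        merged.append(slots[group].mean(dim=0))
    return torch.stack(merged)
</code></pre>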
<br><br>
<hr>
<h1>Method Overview</h1>
<img style="width: 80%;" src="./resources/pipeline.png"
alt="Method overview figure"/>
<br><br>
<p style="width: 80%;">
In this study, we introduce SOLV, an autoencoder-based model designed for object-centric learning in videos.
Our model consists of three components: <br>
(i) <b>Visual Encoder</b> for extracting features from each frame; <br>
(ii) <b>Spatial-temporal Binding</b> module for generating temporally-aware object-centric representations; <br>
(iii) <b>Visual Decoder</b> for estimating segmentation masks and feature reconstructions for the central frame.
</p>
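<p style="width: 80%;">
As a rough illustration of how these three components fit together, the PyTorch sketch below runs the pipeline on precomputed frame tokens and reconstructs the central frame's features. The stand-in modules, dimensions, and attention layers are assumptions made for readability and do not reflect the released implementation.
</p>
<pre><code>import torch
import torch.nn as nn
import torch.nn.functional as F

class SOLVSketch(nn.Module):
    # Minimal stand-in for the three-stage pipeline; the real modules differ.
    def __init__(self, feat_dim=384, num_slots=8):
        super().__init__()
        # (i) visual encoder stand-in: in practice a frozen self-supervised
        #     ViT that produces per-frame patch tokens.
        self.visual_encoder = nn.Linear(feat_dim, feat_dim)
        # (ii) spatial-temporal binding stand-in: attention that binds the
        #      tokens of all frames to a fixed set of learnable slots.
        self.slots = nn.Parameter(torch.randn(num_slots, feat_dim))
        self.binding = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # (iii) visual decoder stand-in: maps slots back to token features;
        #       its attention over slots acts as a soft segmentation.
        self.visual_decoder = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (T, N, D) precomputed tokens for T frames of a clip.
        T, N, D = frame_tokens.shape
        feats = self.visual_encoder(frame_tokens)                               # (i)
        all_tokens = feats.reshape(1, T * N, D)
        slots, _ = self.binding(self.slots.unsqueeze(0), all_tokens, all_tokens)  # (ii)
        center = feats[T // 2].unsqueeze(0)                                     # central-frame target
        recon, masks = self.visual_decoder(center, slots, slots)                # (iii)
        loss = F.mse_loss(recon, center)                                        # reconstruction objective
        return loss, masks

# Usage on random features, shapes only:
# loss, masks = SOLVSketch()(torch.randn(5, 196, 384))
</code></pre>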
<br>
<hr>
<h1>Quantitative Results</h1>
<!-- <img style="width: 38%;" src="./resources/movi_table.png" alt="MOVi-E SOTA"/><br><br>
<p style="width: 80%;">
<b>Quantitative Results on MOVi-E.</b> This table shows results in comparison to the previous work in terms of FG-ARI on MOVi-E.
</p>
<br><br>
<img style="width: 74%;" src="./resources/ytvis_table.png" alt="Interaction SOTA"/><br><br>
<p style="width: 80%;">
<b>Quantitative Results on Real-World Data.</b> These results show the video multi-object evaluation results on the validation split of DAVIS17 and a subset of the YTVIS19 train split.
</p> -->
<div class="row_v2">
<img class="img-center" style="width: 30%;" src="./resources/movi_table.png" alt="MOVi-E SOTA"/><br><br>
<img class="img-center" style="width: 10%;"><br><br>
<img class="img-center" style="width: 60%;height: 70%;" src="./resources/ytvis_table.png" alt="Real-World SOTA"/><br><br>
</div>
<br>
<div class="row">
<p style="width: 30%;">
<b>Quantitative Results on MOVi-E.</b> Comparison to previous work in terms of FG-ARI on MOVi-E.
</p>
<p style="width: 10%;"> </p>
<p style="width: 60%;">
<b>Quantitative Results on Real-World Data.</b> Video multi-object evaluation on the validation split of DAVIS17 and a subset of the YTVIS19 training split.
</p>
</div>
<br>
<hr>
<h1>Qualitative Results</h1>
<div class="row">
<div class="column">
<div class="cropped_v2"><img style="width: 100%;" src="./resources/result_gifs/ours_fg_116.gif"/></div>
<div class="cropped_v2"><img style="width: 100%;" src="./resources/result_gifs/ours_fg_131.gif"/></div>
<div class="cropped_v2"><img style="width: 100%;" src="./resources/result_gifs/ours_fg_143.gif"/></div>
<div class="cropped_v2"><img style="width: 100%;" src="./resources/result_gifs/ours_fg_199.gif"/></div>
<div class="cropped_v2"><img style="width: 100%;" src="./resources/result_gifs/ours_fg_244.gif"/></div>
</div>
<div class="column">
<div class="cropped_v2"><img style="width: 100%;" src="./resources/result_gifs/gt_116.gif"/></div>
<div class="cropped_v2"><img style="width: 100%;" src="./resources/result_gifs/gt_131.gif"/></div>
<div class="cropped_v2"><img style="width: 100%;" src="./resources/result_gifs/gt_143.gif"/></div>
<div class="cropped_v2"><img style="width: 100%;" src="./resources/result_gifs/gt_199.gif"/></div>
<div class="cropped_v2"><img style="width: 100%;" src="./resources/result_gifs/gt_244.gif"/></div>
</div>
</div>
<p style="width: 80%;">
<b> Qualitative results on YTVIS19.</b>
We visualize our multi-object video segmentation results on YTVIS19 after Hungarian matching is applied.
</p>
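<p style="width: 80%;">
For reference, the snippet below shows one common way to perform this matching: computing an IoU table between predicted and ground-truth masks and solving the assignment with SciPy's linear_sum_assignment. The mask-IoU cost is an assumption made for illustration and is not necessarily the exact criterion used for these figures.
</p>
<pre><code>import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_masks, gt_masks):
    # pred_masks: (P, H, W) boolean predicted instance masks.
    # gt_masks:   (G, H, W) boolean ground-truth instance masks.
    # Returns (pred_idx, gt_idx) pairs that maximize total IoU, aligning
    # discovered slots with annotated instances before visualization.
    P, G = pred_masks.shape[0], gt_masks.shape[0]
    iou = np.zeros((P, G))
    for p in range(P):
        for g in range(G):
            inter = np.logical_and(pred_masks[p], gt_masks[g]).sum()
            union = np.logical_or(pred_masks[p], gt_masks[g]).sum()
            iou[p, g] = inter / union if union else 0.0
    rows, cols = linear_sum_assignment(-iou)  # maximize IoU via negated cost
    return list(zip(rows.tolist(), cols.tolist()))
</code></pre>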
<hr>
<h1>Paper</h1>
<div class="paper-info"style="width: 80%;">
<h3>Self-supervised Object-Centric Learning for Videos</h3>
<p>Görkay Aydemir, Weidi Xie and Fatma Güney</p>
<p>NeurIPS 2023</p>
<pre><code>@InProceedings{Aydemir2023NeurIPS,
author = {Aydemir, G\"orkay and Xie, Weidi and G\"uney, Fatma},
title = {{S}elf-supervised {O}bject-Centric {L}earning for {V}ideos},
booktitle = {Advances in Neural Information Processing Systems},
year = {2023}}
</code></pre>
</div>
<br><br>
</div>
</body>
</html>