<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-176270417-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-176270417-1');
</script>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta name="generator" content="pandoc">
<title>ravikiran</title>
<script async="" src="./ravikiran_files/analytics.js"></script><script src="./ravikiran_files/jquery-1.js"></script>
<link rel="stylesheet" href="./ravikiran_files/bootstrap.css">
<link rel="stylesheet" href="./ravikiran_files/bootstrap-theme.css">
<script src="./ravikiran_files/bootstrap.js"></script>
<style>
/* http://stackoverflow.com/questions/18325779/bootstrap-3-collapse-show-state-with-chevron-icon */
.panel-heading .accordion-toggle:before {
font-family: 'Glyphicons Halflings';
content: "\e114";
float: left;
color: black;
padding-right: 6px;
}
.panel-heading .accordion-toggle.collapsed:before {
content: "\e080";
}
table, th, td {
border: 0px solid black;
border-collapse: collapse;
}
h4 {
color: green;
}
</style>
<style type="text/css">
</style>
</head>
<body>
<nav class="navbar navbar-default navbar-static-top" role="navigation">
<div class="container">
<ul class="nav navbar-nav">
<li><a href="index.html"><font style="font-size:20px; font-weight:500;" color="#003380">Ravi Kiran Sarvadevabhatla</font></a></li>
<li><a href="publications.html"><font size="4px" color="#E62E00"><b>Publications</b></font></a></li>
<li><a href="research.html"><font size="4px" color="#E62E00"><b>Current Research</b></font></a></li>
<li><a href="past-research.html"><font size="4px" color="#E62E00"><b>Past Research</b></font></a></li>
</ul>
</div>
</nav>
<div class="container">
<a name="publications"></a>
<p style="margin:-15px 0px 0px 0px;"></p>
<br>
<br>
<font size="4">Full list also on <a href="https://scholar.google.co.in/citations?user=oLJTcXIAAAAJ&hl=en">Google scholar</a></font>
<br>
<br>
<p style="margin:-2.5px 0px 0px 0px;"></p>
<br>
<h2>2024</h2>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/linetr-method.png"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
LineTR: Unified Text Line Segmentation for Challenging Palm Leaf Manuscripts
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Vaibhav Agrawal, Niharika Vadlamudi, Amal Joseph, Muhammad Waseem, Sreenya Chitluri, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
International Conference on Pattern Recognition (ICPR)
</strong></em>,
2024
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
Re-imagining text line localization in challenging documents. Instead of a pixel-based segmentation paradigm, LineTR uses a parametric representation of a line, leveraging its inductive priors.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://drive.google.com/file/d/1Z2nMoeyhxy_HgHTrmcT_UD2LBV-Y0FhJ/view?usp=drive_link" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#Qoid9agabs" href="#Qoid9agabs-list">Abstract</a>
<a href="https://ihdia.iiit.ac.in/LineTR/" target="_blank" class="buttonPP">Project page</a>
<div id="Qoid9ag-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="Qoid9agabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
Historical manuscripts pose significant challenges for line segmentation due to their diverse sizes, scripts, and appearances. Traditional methods often rely on dataset-specific processing or training per-dataset models, limiting scalability and maintainability. To this end, we propose LineTR, a single model for all dataset collections. LineTR is a two-stage approach. The first stage predicts text-strike-through lines called scribbles and a novel text-energy map of the input document image. The second stage is a seam-generation network which uses these to get precise polygons around the text-lines. Text-line segmentation has been mainly approached as a dense-prediction task, which is ineffective, as the inductive prior of a line is not utilized, and this leads to poor segmentation performance. Thus, our key insight is to parametrize a text-line, preserving these inductive priors. To avoid resizing the document, the input image is first broken down into context-adapted patches, and each patch is processed by the stage-1 network independently. The patch-level outputs are combined using a dataset-agnostic post-processing pipeline. Notably, we show that carefully choosing the patch size to capture enough context is crucial for generalization, as document images come in arbitrary resolutions. LineTR has been evaluated extensively through experiments and qualitative comparisons. Additionally, our method exhibits strong zero-shot generalization to unseen document collections.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
<img width="100%" src="./ravikiran_files/new_burst.png">
</div>
<video width="400" height="300" controls>
<source src="https://ihdia.iiit.ac.in/LineTR/static/videos/1711_final_2mins.mp4" type="video/mp4">
</video>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/olaf.jpg"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Pranav Gupta, Rishubh Singh, Pradeep Shenoy, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
European Conference on Computer Vision (ECCV)
</strong></em>,
2024
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
A plug-and-play recipe for improved multi-object multi-part segmentation. Our recipe leads to significant gains (up to 4.0 mIoU) across multiple architectures and across multiple challenging segmentation datasets.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/04338.pdf" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#JPaTu6Tabs" href="#JPaTu6Tabs-list">Abstract</a>
<a href="https://olafseg.github.io/" target="_blank" class="buttonPP">Project page</a>
<div id="JPaTu6T-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="JPaTu6Tabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
Multi-object multi-part scene segmentation is a challenging task whose complexity scales exponentially with part granularity and number of scene objects. To address the task, we propose a plug-and-play approach termed OLAF. First, we augment the input (RGB) with channels containing object-based structural cues (fg/bg mask, boundary edge mask). We propose a weight adaptation technique which enables regular (RGB) pre-trained models to process the augmented (5-channel) input in a stable manner during optimization. In addition, we introduce an encoder module termed LDF to provide low-level dense feature guidance. This assists segmentation, particularly for smaller parts. OLAF enables significant mIoU gains of 3.3 (Pascal-Parts-58), 3.5 (Pascal-Parts-108) over the SOTA model. On the most challenging variant (Pascal-Parts-201), the gain is 4.0. Experimentally, we show that OLAF's broad applicability enables gains across multiple architectures (CNN, U-Net, Transformer) and datasets.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
<img width="100%" src="./ravikiran_files/new_burst.png">
</div>
<video width="400" height="300" controls>
<source src="https://olafseg.github.io/assets/OLAF_video.mp4" type="video/mp4">
</video>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/alert_prediction.png"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
Enhancing Road Safety: Predictive Modeling of Accident-Prone Zones with ADAS-Equipped Vehicle Fleet Data
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Ravi Shankar Mishra, Dev Singh Thakur, Anbumani Subramanian, Mukti Advani, S Velmurugan, Juby Jose, CV Jawahar, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
IEEE Intelligent Vehicles Symposium (IV)
</strong></em>,
2024
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
A novel approach to identifying possible early accident-prone zones in a large city-scale road network using geo-tagged collision alert data from a vehicle fleet.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10588591" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#GwFyk7Aabs" href="#GwFyk7Aabs-list">Abstract</a>
<a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10588591" target="_blank" class="buttonPP">Project page</a>
<div id="GwFyk7A-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="GwFyk7Aabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
This work presents a novel approach to identifying possible early accident-prone zones in a large city-scale road network using geo-tagged collision alert data from a vehicle fleet. The alert data has been collected for a year from 200 city buses installed with the Advanced Driver Assistance System (ADAS). To the best of our knowledge, no research paper has used ADAS alerts to identify early accident-prone zones. A nonparametric technique called Kernel Density Estimation (KDE) is employed to model the distribution of alert data across stratified time intervals. A novel recall-based measure is introduced to assess the degree of support provided by our density-based approach for existing, manually determined accident-prone zones (‘blackspots’) provided by civic authorities. This shows that our KDE approach significantly outperforms existing approaches in terms of the recall-based measure. We also introduce a novel linear assignment Earth Mover Distance based measure to predict previously unidentified accident-prone zones. The results and findings support the feasibility of utilizing alert data from vehicle fleets to aid civic planners in assessing accident-zone trends and deploying traffic calming measures, thereby improving overall road safety and saving lives.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
<img width="100%" src="./ravikiran_files/new_burst.png">
</div>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/idd-x.png"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
IDD-X: A Multi-View Dataset for Ego-relative Important Object Localization and Explanation in Dense and Unstructured Traffic
<b><font color="red">[ORAL]</font></b>
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Chirag Parikh, Rohit Saluja, C. V. Jawahar, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
IEEE International Conference on Robotics and Automation (ICRA)
</strong></em>,
2024
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
A large-scale dataset and deep networks for understanding and explainability of Indian road driving scenarios.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="http://arxiv.org/abs/2404.08561" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#Ol03Us4abs" href="#Ol03Us4abs-list">Abstract</a>
<a href="https://idd-x.github.io/" target="_blank" class="buttonPP">Project page</a>
<div id="Ol03Us4-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="Ol03Us4abs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
Intelligent vehicle systems require a deep understanding of the interplay between road conditions, surrounding entities, and the ego vehicle's driving behavior for safe and efficient navigation. This is particularly critical in developing countries where traffic situations are often dense and unstructured with heterogeneous road occupants. Existing datasets, predominantly geared towards structured and sparse traffic scenarios, fall short of capturing the complexity of driving in such environments. To fill this gap, we present IDD-X, a large-scale dual-view driving video dataset. With 697K bounding boxes, 9K important object tracks, and 1-12 objects per video, IDD-X offers comprehensive ego-relative annotations for multiple important road objects covering 10 categories and 19 explanation label categories. The dataset also incorporates rearview information to provide a more complete representation of the driving environment. We also introduce custom-designed deep networks aimed at multiple important object localization and per-object explanation prediction. Overall, our dataset and introduced prediction models form the foundation for studying how road conditions and surrounding entities affect driving behavior in complex traffic situations.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
<img width="100%" src="./ravikiran_files/new_burst.png">
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href=""><img width="100%" src="./ravikiran_files/Oral-session-icon.png"></div>
<iframe width="400" height="300" src="https://www.youtube.com/embed/wtXng1S496w" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/madverse.jpg"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
MAdVerse: A Hierarchical Dataset of Multi-Lingual Ads from Diverse Sources and Categories
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Amruth Sagar, Rishabh Srivastava, Rakshitha R. T., Venkata Kesav Venna, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
IEEE Winter Conference on Applications of Computer Vision (WACV)
</strong></em>,
2024
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
A hierarchical, multi-source, multi-lingual compilation of more than 50,000 ads from the web, social media websites, and e-newspapers.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://openaccess.thecvf.com/content/WACV2024/papers/Sagar_MAdVerse_A_Hierarchical_Dataset_of_Multi-Lingual_Ads_From_Diverse_Sources_WACV_2024_paper.pdf" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#AvZxtwHabs" href="#AvZxtwHabs-list">Abstract</a>
<a href="https://madverse24.github.io/" target="_blank" class="buttonPP">Project page</a>
<div id="AvZxtwH-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="AvZxtwHabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
The convergence of computer vision and advertising has sparked substantial interest lately. Existing advertisement datasets often derive from subsets of established data with highly specialized annotations or feature diverse annotations without a cohesive taxonomy among ad images. Notably, no datasets encompass diverse advertisement styles or semantic grouping at various levels of granularity for a better understanding of ads. Our work addresses this gap by introducing MAdVerse, an extensive, multilingual compilation of more than 50,000 ads from the web, social media websites, and e-newspapers. Advertisements are hierarchically grouped with uniform granularity into 11 categories, divided into 51 sub-categories, and 524 fine-grained brands at leaf level, each featuring ads in various languages. Furthermore, we provide comprehensive baseline classification results for various pertinent prediction tasks within the realm of advertising analysis. Specifically, these tasks include hierarchical ad classification, source classification, multilingual classification, and inducing hierarchy in existing ad datasets.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
<img width="100%" src="./ravikiran_files/new_burst.png">
</div>
<iframe width="400" height="300" src="https://www.youtube.com/embed/j_NC3JxCVxs" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<h2>2023</h2>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/interact2023.png"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
"Draw Fast, Guess Slow": Characterizing Interactions in Cooperative Partially Observable Settings with Online Pictionary as a Case Study
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Kiruthika Kannan, Anandhini Rajendran, Vinoo Alluri, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
Human-Computer Interaction – INTERACT
</strong></em>,
2023
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
Analyzing player types, gameplay styles, and the effect of target word difficulty in Pictionary.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://link.springer.com/chapter/10.1007/978-3-031-42286-7_16" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#Ez01tpwabs" href="#Ez01tpwabs-list">Abstract</a>
<div id="Ez01tpw-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="Ez01tpwabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
Cooperative human-human communication becomes challenging when restrictions such as difference in communication modality and limited time are imposed. We use the popular cooperative social game Pictionary as an online multimodal test bed to explore the dynamics of human-human interactions in such settings. As a part of our study, we identify attributes of player interactions that characterize cooperative gameplay. We found stable and role-specific playing style components that are independent of game difficulty. In terms of gameplay and the larger context of cooperative partially observable communication, our results suggest that too much interaction or unbalanced interaction negatively impacts game success. Additionally, the playing style components discovered via our analysis align with select player personality types proposed in existing frameworks for multiplayer games.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/seamformer.jpeg"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
SeamFormer: High Precision Text Line Segmentation for Handwritten Documents
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Niharika Vadlamudi, Rahul Krishna, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
The 17th International Conference on Document Analysis and Recognition (ICDAR)
</strong></em>,
2023
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
A novel approach involving a multi-task Transformer and image seam generation using custom energy maps for high precision line segmentation.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://link.springer.com/chapter/10.1007/978-3-031-41685-9_20" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#iLmd5QYabs" href="#iLmd5QYabs-list">Abstract</a>
<a href="https://ihdia.iiit.ac.in/seamformer/" target="_blank" class="buttonPP">Project page</a>
<div id="iLmd5QY-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="iLmd5QYabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
Historical manuscripts often contain dense unstructured text lines. The large diversity in sizes, scripts and appearance makes precise text line segmentation extremely challenging. Existing line segmentation approaches often associate diacritic elements incorrectly to text lines and also address the above-mentioned challenges inadequately. To tackle these issues, we introduce SeamFormer, a novel approach for high precision text line segmentation in handwritten manuscripts. In the first stage of our approach, a multi-task Transformer deep network outputs coarse line identifiers which we term ‘scribbles’ and the binarized manuscript image. In the second stage, a scribble-conditioned seam generation procedure utilizes outputs from the first stage and feature maps derived from the manuscript image to generate tight-fitting line segmentation polygons. In the process, we incorporate a novel diacritic feature map which enables improved diacritic and text line associations. Via experiments and evaluations on new and existing challenging palm leaf manuscript datasets, we show that SeamFormer outperforms competing approaches and generates precise text line segmentations.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
<iframe width="400" height="300" src="https://www.youtube.com/embed/xUV1kNKgeko" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/cloudfog.png"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
A Cloud-Fog Architecture for Video Analytics on Large Scale Camera Networks Using Semantic Scene Analysis
<b><font color="red">[ORAL]</font></b>
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Kunal Jain, Kishan Sairam Adapa, Kunwar Grover, Ravi Kiran Sarvadevabhatla, Suresh Purini
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
</strong></em>,
2023
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
A scalable distributed video analytics framework that can process thousands of video streams from sources such as CCTV cameras using semantic scene analysis.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://ieeexplore.ieee.org/document/10171548" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#wLbqhOAabs" href="#wLbqhOAabs-list">Abstract</a>
<a href="" target="_blank" class="buttonPP">Project page</a>
<div id="wLbqhOA-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="wLbqhOAabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
This paper proposes a scalable distributed video analytics framework that can process thousands of video streams from sources such as CCTV cameras using semantic scene analysis. The main idea is to deploy deep learning pipelines on the fog nodes and generate semantic scene description records (SDRs) of video feeds from the associated CCTV cameras. These SDRs are transmitted to the cloud instead of video frames, saving network bandwidth. Using these SDRs stored on the cloud database, we can answer many complex queries and perform rich video analytics with extremely low latencies. There is no need to scan and process the video streams again on a per-query basis. The software architecture on the fog nodes allows for integrating new deep learning pipelines dynamically into the existing system, thereby supporting novel analytics and queries. We demonstrate the effectiveness of the system by proposing a novel distributed algorithm for real-time vehicle pursuit. The proposed algorithm involves asking multiple spatio-temporal queries in an adaptive fashion to reduce the query processing time and is robust to inaccuracies in the deployed deep learning pipelines and camera failures.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href=""><img width="100%" src="./ravikiran_files/Oral-session-icon.png"></div>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/f3.png"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
F3: Fair and Federated Face Attribute Classification with Heterogeneous Data
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Samhita Kanaparthy, Manisha Padala, Sankarshan Damle, Ravi Kiran Sarvadevabhatla, Sujit Gujar
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
The Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
</strong></em>,
2023
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
A novel federated framework for fair facial attribute classification.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://arxiv.org/pdf/2109.02351.pdf" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#nYzzAVqabs" href="#nYzzAVqabs-list">Abstract</a>
<a href="https://arxiv.org/pdf/2109.02351.pdf" target="_blank" class="buttonPP">Project page</a>
<div id="nYzzAVq-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="nYzzAVqabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
Fairness across different demographic groups is an essential criterion for face-related tasks, Face Attribute Classification (FAC) being a prominent example. Apart from this trend, Federated Learning (FL) is increasingly gaining traction as a scalable paradigm for distributed training. Existing FL approaches require data homogeneity to ensure fairness. However, this assumption is too restrictive in real-world settings. We propose F3, a novel FL framework for fair FAC under data heterogeneity. F3 adopts multiple heuristics to improve fairness across different demographic groups without requiring the data homogeneity assumption. We demonstrate the efficacy of F3 by reporting empirically observed fairness measures and accuracy guarantees on popular face datasets. Our results suggest that F3 strikes a practical balance between accuracy and fairness for FAC.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/overview.gif"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Action Generation
<b><font color="red">[ORAL]</font></b>
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Sai Shashank Kalakonda, Shubh Maheshwari, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
IEEE International Conference on Multimedia & Expo (ICME)
</strong></em>,
2023
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
We show how Large Language Models such as GPT can be used to enable better quality and generalized human action generation.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://arxiv.org/abs/2211.15603" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#teBiRvRabs" href="#teBiRvRabs-list">Abstract</a>
<a href="https://actiongpt.github.io/" target="_blank" class="buttonPP">Project page</a>
<div id="teBiRvR-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="teBiRvRabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
We introduce Action-GPT, a plug-and-play framework for incorporating Large Language Models (LLMs) into text-based action generation models. Action phrases in current motion capture datasets contain minimal and to-the-point information. By carefully crafting prompts for LLMs, we generate richer and fine-grained descriptions of the action. We show that utilizing these detailed descriptions instead of the original action phrases leads to better alignment of text and motion spaces. We introduce a generic approach compatible with stochastic (e.g. VAE-based) and deterministic (e.g. MotionCLIP) text-to-motion models. In addition, the approach enables multiple text descriptions to be utilized. Our experiments show (i) noticeable qualitative and quantitative improvement in the quality of synthesized motions, (ii) benefits of utilizing multiple LLM-generated descriptions, (iii) suitability of the prompt function, and (iv) zero-shot generation capabilities of the proposed approach.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/Architecture_merged.jpeg"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
DSAG: A Scalable Deep Framework for Action-Conditioned Multi-Actor Full Body Motion Synthesis
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Debtanu Gupta, Shubh Maheshwari, Sai Shashank Kalakonda, Manasvi Vaidyula, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
IEEE Winter Conference on Applications of Computer Vision (WACV)
</strong></em>,
2023
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
Scaling human action generation to multiple action categories and action durations, with fine-grained finger-level realism.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://drive.google.com/file/d/1JvniutD5LdjLjRtsZZq464_m2IN0RNHr/view" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#jnIpFJSabs" href="#jnIpFJSabs-list">Abstract</a>
<a href="https://skeleton.iiit.ac.in/dsag" target="_blank" class="buttonPP">Project page</a>
<div id="jnIpFJS-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="jnIpFJSabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
We introduce DSAG, a controllable deep neural framework for action-conditioned generation of full body multi-actor variable duration actions. To compensate for incompletely detailed finger joints in existing large-scale datasets, we introduce full body dataset variants with detailed finger joints. To overcome shortcomings in existing generative approaches, we introduce dedicated representations for encoding finger joints. We also introduce novel spatiotemporal transformation blocks with multi-head self attention and specialized temporal processing. The design choices enable generations for a large range in body joint counts (24 - 52), frame rates (13 - 50), global body movement (in-place, locomotion) and action categories (12 - 120), across multiple datasets (NTU-120, HumanAct12, UESTC, Human3.6M). Our experimental results demonstrate DSAG's significant improvements over the state-of-the-art and its suitability for action-conditioned generation at scale.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
<iframe width="400" height="300" src="https://www.youtube.com/embed/Ax9SYJnMTj4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<br>
<h2>2022</h2>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/fgvd.png"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
A Fine-Grained Vehicle Detection (FGVD) Dataset for Unconstrained Roads
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Prafful Kumar Khoba, Chirag Parikh, Rohit Saluja, Ravi Kiran Sarvadevabhatla, C. V. Jawahar
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP)
</strong></em>,
2022
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
A large-scale fine-grained vehicle dataset for Indian roads.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://arxiv.org/pdf/2212.14569" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#KVIUTIlabs" href="#KVIUTIlabs-list">Abstract</a>
<a href="https://github.com/iHubData-Mobility/public-FGVD" target="_blank" class="buttonPP">Project page</a>
<div id="KVIUTIl-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="KVIUTIlabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
Previous fine-grained datasets mainly focus on classification and are often captured in a controlled setup, with the camera focusing on the objects. We introduce the first Fine-Grained Vehicle Detection (FGVD) dataset in the wild, captured from a moving camera mounted on a car. It contains 5502 scene images with 210 unique fine-grained labels of multiple vehicle types organized in a three-level hierarchy. While previous classification datasets also include makes for different kinds of cars, the FGVD dataset introduces new class labels for categorizing two-wheelers, autorickshaws, and trucks. The FGVD dataset is challenging as it has vehicles in complex traffic scenarios with intra-class and inter-class variations in types, scale, pose, occlusion, and lighting conditions. Current object detectors like YOLOv5 and Faster R-CNN perform poorly on our dataset due to a lack of hierarchical modeling. Along with providing baseline results for existing object detectors on the FGVD dataset, we also present the results of a combination of an existing detector and the recent Hierarchical Residual Network (HRN) classifier for the FGVD task. Finally, we show that FGVD vehicle images are the most challenging to classify among the fine-grained datasets.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
<iframe width="400" height="300" src="https://www.youtube.com/embed/-JENxAmXX6c" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/Multiclass_Samples.jpeg"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Nikhil Bansal, Kartik Gupta, Kiruthika Kannan, Sivani Pentapati, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
ACM Multimedia (ACMMM)
</strong></em>,
2022
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
We introduce a system for detecting atypical whiteboard content in a Pictionary game setting. We also introduce a first-of-its-kind dataset of atypical hand-drawn sketches.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://arxiv.org/pdf/2211.05429" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#k3PjVl2abs" href="#k3PjVl2abs-list">Abstract</a>
<a href="https://drawm0n.github.io/" target="_blank" class="buttonPP">Project page</a>
<div id="k3PjVl2-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="k3PjVl2abs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
Pictionary, the popular sketch-based guessing game, provides an opportunity to analyze shared goal cooperative game play in restricted communication settings. However, some players occasionally draw atypical sketch content. While such content is occasionally relevant in the game context, it sometimes represents a rule violation and impairs the game experience. To address such situations in a timely and scalable manner, we introduce DrawMon, a novel distributed framework for automatic detection of atypical sketch content in concurrently occurring Pictionary game sessions. We build specialized online interfaces to collect game session data and annotate atypical sketch content, resulting in AtyPict, the first ever atypical sketch content dataset. We use AtyPict to train CanvasNet, a deep neural atypical content detection network. We utilize CanvasNet as a core component of DrawMon. Our analysis of post deployment game session data indicates DrawMon's effectiveness for scalable monitoring and atypical sketch content detection. Beyond Pictionary, our contributions also serve as a design guide for customized atypical content response systems involving shared and interactive whiteboards.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
<iframe width="400" height="300" src="https://www.youtube.com/embed/LAYk2XGwCoI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/uvrsabi.png"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
UAV-based Visual Remote Sensing for Automated Building Inspection
<b><font color="red">[ORAL]</font></b>
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Kushagra Srivastava, Dhruv Patel, Aditya Kumar Jha, Mohhit Kumar Jha, Jaskirat Singh, Ravi Kiran Sarvadevabhatla, Pradeep Kumar Ramancharla, Harikumar Kandath, K. Madhava Krishna
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
CVCIE Workshop at ECCV
</strong></em>,
2022
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
UVRSABI is a software suite which processes drone-based imagery. It aids assessment of earthquake risk for buildings at scale.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://arxiv.org/pdf/2209.13418" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#pTG5p7Iabs" href="#pTG5p7Iabs-list">Abstract</a>
<a href="https://uvrsabi.github.io/" target="_blank" class="buttonPP">Project page</a>
<div id="pTG5p7I-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="pTG5p7Iabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
Unmanned Aerial Vehicle (UAV) based remote sensing systems incorporating computer vision have demonstrated potential for assisting building construction and disaster management, such as damage assessment during earthquakes. The vulnerability of a building to earthquakes can be assessed through inspection that takes into account the expected damage progression of the associated component and the component's contribution to structural system performance. Most of these inspections are done manually, leading to high utilization of manpower, time, and cost. This paper proposes a methodology to automate these inspections through UAV-based image data collection and a software library for post-processing that helps in estimating the seismic structural parameters. The key parameters considered here are the distances between adjacent buildings, building plan-shape, building plan area, objects on the rooftop and rooftop layout. The accuracy of the proposed methodology in estimating the above-mentioned parameters is verified through field measurements taken using a distance measuring sensor and also from the data obtained through Google Earth.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href=""><img width="100%" src="./ravikiran_files/Oral-session-icon.png"></div>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/psumnet.png"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Neel Trivedi, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
ECCV International Workshop and Challenge on People Analysis
</strong></em>,
2022
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
PSUMNet is a deep net for scalable & efficient action recognition. It outperforms competing methods which use 100%-400% more params. PSUMNet is an attractive choice for deployment on compute-restricted embedded and edge devices.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://arxiv.org/pdf/2208.05775" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#nKWN8dpabs" href="#nKWN8dpabs-list">Abstract</a>
<a href="https://skeleton.iiit.ac.in/psumnet" target="_blank" class="buttonPP">Project page</a>
<div id="nKWN8dp-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="nKWN8dpabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
Pose-based action recognition is predominantly tackled by approaches which treat the input skeleton in a monolithic fashion, i.e. joints in the pose tree are processed as a whole. However, such approaches ignore the fact that action categories are often characterized by localized action dynamics involving only small subsets of part joint groups involving hands (e.g. `Thumbs up') or legs (e.g. `Kicking'). Although part-grouping based approaches exist, each part group is not considered within the global pose frame, causing such methods to fall short. Further, conventional approaches employ independent modality streams (e.g. joint, bone, joint velocity, bone velocity) and train their network multiple times on these streams, which massively increases the number of training parameters. To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose a global frame based part stream approach as opposed to conventional modality based streams. Within each part stream, the associated data from multiple modalities is unified and consumed by the processing pipeline. Experimentally, PSUMNet achieves state of the art performance on the widely used NTURGB+D 60/120 dataset and dense joint skeleton dataset NTU 60-X/120-X. PSUMNet is highly efficient and outperforms competing methods which use 100%-400% more parameters. PSUMNet also generalizes to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet's scalability, performance and efficiency makes it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/float-mainfig.jpeg"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
FLOAT: Factorized Learning of Object Attributes for Improved Multi-object Multi-part Scene Parsing
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Rishubh Singh, Pranav Gupta, Pradeep Shenoy, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
Computer Vision and Pattern Recognition (CVPR)
</strong></em>,
2022
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
A simple but effective trick for scalable multi-object multi-part segmentation: transform label-text attributes into spatial maps and have a deep network predict them.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://openaccess.thecvf.com/content/CVPR2022/html/Singh_FLOAT_Factorized_Learning_of_Object_Attributes_for_Improved_Multi-Object_Multi-Part_CVPR_2022_paper.html" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#hdNMwuMabs" href="#hdNMwuMabs-list">Abstract</a>
<a href="https://floatseg.github.io/" target="_blank" class="buttonPP">Project page</a>
<div id="hdNMwuM-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="hdNMwuMabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
Multi-object multi-part scene parsing is a challenging task which requires detecting multiple object classes in a scene and segmenting the semantic parts within each object. In this paper, we propose FLOAT, a factorized label space framework for scalable multi-object multi-part parsing. Our framework involves independent dense prediction of object category and part attributes which increases scalability and reduces task complexity compared to the monolithic label space counterpart. In addition, we propose an inference-time 'zoom' refinement technique which significantly improves segmentation quality, especially for smaller objects/parts. Compared to state of the art, FLOAT obtains an absolute improvement of 2.0% for mean IOU (mIOU) and 4.8% for segmentation quality IOU (sqIOU) on the Pascal-Part-58 dataset. For the larger Pascal-Part-108 dataset, the improvements are 2.1% for mIOU and 3.9% for sqIOU. We incorporate previously excluded part attributes and other minor parts of the Pascal-Part dataset to create the most comprehensive and challenging version which we dub Pascal-Part-201. FLOAT obtains improvements of 8.6% for mIOU and 7.5% for sqIOU on the new dataset, demonstrating its parsing effectiveness across a challenging diversity of objects and parts.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
<iframe width="400" height="300" src="https://www.youtube.com/embed/gDcEnhJZVCc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/moto_violations.jpeg"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
Detecting, Tracking and Counting Motorcycle Rider Traffic Violations on Unconstrained Roads
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Aman Goyal, Dev Agarwal, Anbumani Subramanian, C.V. Jawahar, Ravi Kiran Sarvadevabhatla, Rohit Saluja
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
5th Workshop and Prize Challenge: Bridging the Gap between Computational Photography and Visual Recognition (UG2+), CVPR
</strong></em>,
2022
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
A curriculum learning approach for detecting, tracking, and counting motorcycle riding violations in videos taken from a vehicle-mounted dashboard camera.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://arxiv.org/pdf/2204.08364" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#q5OPqx0abs" href="#q5OPqx0abs-list">Abstract</a>
<a href="https://github.com/ihubdata-mobility/public-motorcycle-violations" target="_blank" class="buttonPP">Project page</a>
<div id="q5OPqx0-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="q5OPqx0abs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
In many Asian countries with unconstrained road traffic conditions, driving violations such as not wearing helmets and triple-riding are a significant source of fatalities involving motorcycles. Identifying and penalizing such riders is vital in curbing road accidents and improving citizens' safety. With this motivation, we propose an approach for detecting, tracking, and counting motorcycle riding violations in videos taken from a vehicle-mounted dashboard camera. We employ a curriculum learning-based object detector to better tackle challenging scenarios such as occlusions. We introduce a novel trapezium-shaped object boundary representation to increase robustness and tackle the rider-motorcycle association. We also introduce an amodal regressor that generates bounding boxes for the occluded riders. Experimental results on a large-scale unconstrained driving dataset demonstrate the superiority of our approach compared to existing approaches and other ablative variants.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
<iframe width="400" height="300" src="https://www.youtube.com/embed/ypqGihjh-CQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>
<br>
<ul class="list-group">
<li class="list-group-item" style="padding:0 0 0.2% 1.1%;border:1px solid">
<div class="media">
<div class="pull-left text-center" style="padding:9px 1.5% 0 0; width:45%;" href="">
<img width="100%" src="./ravikiran_files/tal-01.png"></div>
<div class="pull-left text-left" style="padding:0% 0% 0% 0.1%; width:45%;">
<p style="margin:-3px 0px 0px 0px;"></p>
<h4 style="font-size:14.1px; line-height:120%">
Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization
<b><font color=red>[ORAL]</font></b>
</h4>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">
Anurag Bagchi, Jazib Mahmood, Dolton Fernandes, Ravi Kiran Sarvadevabhatla
</p>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px">In <em><strong>
17th International Conference on Computer Vision Theory and Applications (VISAPP)
</strong></em>,
2022
</p>
<br>
<p style="margin:-10.0px 0px 0px 0px;"></p>
<p style="font-size:13.4px; color:#E62E00;">
A simple yet effective approach for incorporating audio to improve temporal action localization in videos.
</p>
<p style="margin:-9.0px 0px 0px 0px;"></p>
<a href="https://arxiv.org/pdf/2106.14118.pdf" target="_blank" class="buttonTT">Paper</a>
<a class="buttonAA" data-toggle="collapse" data-parent="#WWDWUqpabs" href="#WWDWUqpabs-list">Abstract</a>
<a href="https://github.com/skelemoa/tal-hmo" target="_blank" class="buttonPP">Project page</a>
<div id="WWDWUqp-list" class="panel-collapse collapse out" style="background-color:#FFE0C2; padding:0% 3% 1% 3%; border-radius:10px;">
<p style="font-size:15px;">
</p>
</div>
<div id="WWDWUqpabs-list" class="panel-collapse collapse out" style="background-color:#ADEBFF; padding:9px 2.5% 3px 2.5%; border-radius:10px;">
<p style="margin:0px 0px 0px 0px;"></p>
<p style="font-size:14.1px;">
State of the art architectures for untrimmed video Temporal Action Localization (TAL) have only considered RGB and Flow modalities, leaving the information-rich audio modality totally unexploited. Audio fusion has been explored for the related but arguably easier problem of trimmed (clip-level) action recognition. However, TAL poses a unique set of challenges. In this paper, we propose simple but effective fusion-based approaches for TAL. To the best of our knowledge, our work is the first to jointly consider audio and video modalities for supervised TAL. We experimentally show that our schemes consistently improve performance for state of the art video-only TAL approaches. Specifically, they help achieve new state of the art performance on large-scale benchmark datasets - ActivityNet-1.3 (54.34 mAP@0.5) and THUMOS14 (57.18 mAP@0.5). Our experiments include ablations involving multiple fusion schemes, modality combinations and TAL architectures.
</p>
<p style="margin:0px 0px 0px 0px;"></p>
</div>
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href="">
</div>
<div class="pull-right text-center" style="padding:4px 0.5% 0 0; width:5.0%;" href=""><img width="100%" src="./ravikiran_files/Oral-session-icon.png"></div>
<iframe width="400" height="300" src="https://www.youtube.com/embed/SLT85c785LY" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</li>
</ul>
<p style="margin:-13.5px 0px 0px 0px;"></p>