From ec7b2935b37bc73d9c8463248729b4d89e5486d6 Mon Sep 17 00:00:00 2001 From: AlongWY Date: Thu, 27 Jul 2023 05:20:51 +0000 Subject: [PATCH] deploy: 72066be21ad467c8ffc76b74c152b38decf3f0ac --- .nojekyll | 0 cache.json | 1 + favicon.ico | Bin 0 -> 15086 bytes index.css | 355 + index.html | 75199 ++++++++++++++++++++++++++++++++++++++++++++++++++ index.js | 39 + 6 files changed, 75594 insertions(+) create mode 100644 .nojekyll create mode 100644 cache.json create mode 100644 favicon.ico create mode 100644 index.css create mode 100644 index.html create mode 100644 index.js diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/cache.json b/cache.json new file mode 100644 index 00000000..356d9606 --- /dev/null +++ b/cache.json @@ -0,0 +1 @@ +{"2023-07-19T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2307.10172v1","updated":"2023-07-19T17:57:53Z","published":"2023-07-19T17:57:53Z","title":"DialogStudio: Towards Richest and Most Diverse Unified Dataset\n Collection for Conversational AI","summary":" Despite advancements in conversational AI, language models encounter\nchallenges to handle diverse conversational tasks, and existing dialogue\ndataset collections often lack diversity and comprehensiveness. To tackle these\nissues, we introduce DialogStudio: the largest and most diverse collection of\ndialogue datasets, unified under a consistent format while preserving their\noriginal information. Our collection encompasses data from open-domain\ndialogues, task-oriented dialogues, natural language understanding,\nconversational recommendation, dialogue summarization, and knowledge-grounded\ndialogues, making it an incredibly rich and diverse resource for dialogue\nresearch and model training. To further enhance the utility of DialogStudio, we\nidentify the licenses for each dataset and design domain-aware prompts for\nselected dialogues to facilitate instruction-aware fine-tuning. Furthermore, we\ndevelop conversational AI models using the dataset collection, and our\nexperiments in both zero-shot and few-shot learning scenarios demonstrate the\nsuperiority of DialogStudio. To improve transparency and support dataset and\ntask-based research, as well as language model pre-training, all datasets,\nlicenses, codes, and models associated with DialogStudio are made publicly\naccessible at https://github.com/salesforce/DialogStudio\n","authors":["Jianguo Zhang","Kun Qian","Zhiwei Liu","Shelby Heinecke","Rui Meng","Ye Liu","Zhou Yu","Silvio Savarese","Caiming Xiong"],"pdf_url":"https://arxiv.org/pdf/2307.10172v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10169v1","updated":"2023-07-19T17:55:13Z","published":"2023-07-19T17:55:13Z","title":"Challenges and Applications of Large Language Models","summary":" Large Language Models (LLMs) went from non-existent to ubiquitous in the\nmachine learning discourse within a few years. Due to the fast pace of the\nfield, it is difficult to identify the remaining challenges and already\nfruitful application areas. In this paper, we aim to establish a systematic set\nof open problems and application successes so that ML researchers can\ncomprehend the field's current state more quickly and become productive.\n","authors":["Jean Kaddour","Joshua Harris","Maximilian Mozes","Herbie Bradley","Roberta Raileanu","Robert McHardy"],"pdf_url":"https://arxiv.org/pdf/2307.10169v1.pdf","comment":"72 pages. v01. Work in progress. 
Feedback and comments are highly\n appreciated!"},{"id":"http://arxiv.org/abs/2307.10168v1","updated":"2023-07-19T17:54:43Z","published":"2023-07-19T17:54:43Z","title":"LLMs as Workers in Human-Computational Algorithms? Replicating\n Crowdsourcing Pipelines with LLMs","summary":" LLMs have shown promise in replicating human-like behavior in crowdsourcing\ntasks that were previously thought to be exclusive to human abilities. However,\ncurrent efforts focus mainly on simple atomic tasks. We explore whether LLMs\ncan replicate more complex crowdsourcing pipelines. We find that modern LLMs\ncan simulate some of crowdworkers' abilities in these \"human computation\nalgorithms,\" but the level of success is variable and influenced by requesters'\nunderstanding of LLM capabilities, the specific skills required for sub-tasks,\nand the optimal interaction modality for performing these sub-tasks. We reflect\non human and LLMs' different sensitivities to instructions, stress the\nimportance of enabling human-facing safeguards for LLMs, and discuss the\npotential of training humans and LLMs with complementary skill sets. Crucially,\nwe show that replicating crowdsourcing pipelines offers a valuable platform to\ninvestigate (1) the relative strengths of LLMs on different tasks (by\ncross-comparing their performances on sub-tasks) and (2) LLMs' potential in\ncomplex tasks, where they can complete part of the tasks while leaving others\nto humans.\n","authors":["Tongshuang Wu","Haiyi Zhu","Maya Albayrak","Alexis Axon","Amanda Bertsch","Wenxing Deng","Ziqi Ding","Bill Guo","Sireesh Gururaja","Tzu-Sheng Kuo","Jenny T. Liang","Ryan Liu","Ihita Mandal","Jeremiah Milbauer","Xiaolin Ni","Namrata Padmanabhan","Subhashini Ramkumar","Alexis Sudjianto","Jordan Taylor","Ying-Jui Tseng","Patricia Vaidos","Zhijin Wu","Wei Wu","Chenyang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.10168v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10156v1","updated":"2023-07-19T17:37:03Z","published":"2023-07-19T17:37:03Z","title":"Exploring Transformer Extrapolation","summary":" Length extrapolation has attracted considerable attention recently since it\nallows transformers to be tested on longer sequences than those used in\ntraining. Previous research has shown that this property can be attained by\nusing carefully designed Relative Positional Encodings (RPEs). While these\nmethods perform well on a variety of corpora, the conditions for length\nextrapolation have yet to be investigated. This paper attempts to determine\nwhat types of RPEs allow for length extrapolation through a thorough\nmathematical and empirical analysis. We discover that a transformer is certain\nto possess this property as long as the series that corresponds to the RPE's\nexponential converges. Two practices are derived from the conditions and\nexamined in language modeling tasks on a variety of corpora. As a bonus from\nthe conditions, we derive a new Theoretical Receptive Field (TRF) to measure\nthe receptive field of RPEs without taking any training steps. Extensive\nexperiments are conducted on the Wikitext-103, Books, Github, and WikiBook\ndatasets to demonstrate the viability of our discovered conditions. We also\ncompare TRF to Empirical Receptive Field (ERF) across different models, showing\nconsistently matched trends on the aforementioned datasets. 
The code is\navailable at https://github.com/OpenNLPLab/Rpe.\n","authors":["Zhen Qin","Yiran Zhong","Hui Deng"],"pdf_url":"https://arxiv.org/pdf/2307.10156v1.pdf","comment":"Zhen Qin and Yiran Zhong contribute equally to this paper; Yiran\n Zhong is the corresponding author. The code is available at\n https://github.com/OpenNLPLab/Rpe"},{"id":"http://arxiv.org/abs/2307.09288v2","updated":"2023-07-19T17:08:59Z","published":"2023-07-18T14:31:57Z","title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","summary":" In this work, we develop and release Llama 2, a collection of pretrained and\nfine-tuned large language models (LLMs) ranging in scale from 7 billion to 70\nbillion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for\ndialogue use cases. Our models outperform open-source chat models on most\nbenchmarks we tested, and based on our human evaluations for helpfulness and\nsafety, may be a suitable substitute for closed-source models. We provide a\ndetailed description of our approach to fine-tuning and safety improvements of\nLlama 2-Chat in order to enable the community to build on our work and\ncontribute to the responsible development of LLMs.\n","authors":["Hugo Touvron","Louis Martin","Kevin Stone","Peter Albert","Amjad Almahairi","Yasmine Babaei","Nikolay Bashlykov","Soumya Batra","Prajjwal Bhargava","Shruti Bhosale","Dan Bikel","Lukas Blecher","Cristian Canton Ferrer","Moya Chen","Guillem Cucurull","David Esiobu","Jude Fernandes","Jeremy Fu","Wenyin Fu","Brian Fuller","Cynthia Gao","Vedanuj Goswami","Naman Goyal","Anthony Hartshorn","Saghar Hosseini","Rui Hou","Hakan Inan","Marcin Kardas","Viktor Kerkez","Madian Khabsa","Isabel Kloumann","Artem Korenev","Punit Singh Koura","Marie-Anne Lachaux","Thibaut Lavril","Jenya Lee","Diana Liskovich","Yinghai Lu","Yuning Mao","Xavier Martinet","Todor Mihaylov","Pushkar Mishra","Igor Molybog","Yixin Nie","Andrew Poulton","Jeremy Reizenstein","Rashi Rungta","Kalyan Saladi","Alan Schelten","Ruan Silva","Eric Michael Smith","Ranjan Subramanian","Xiaoqing Ellen Tan","Binh Tang","Ross Taylor","Adina Williams","Jian Xiang Kuan","Puxin Xu","Zheng Yan","Iliyan Zarov","Yuchen Zhang","Angela Fan","Melanie Kambadur","Sharan Narang","Aurelien Rodriguez","Robert Stojnic","Sergey Edunov","Thomas Scialom"],"pdf_url":"https://arxiv.org/pdf/2307.09288v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10098v1","updated":"2023-07-19T16:13:13Z","published":"2023-07-19T16:13:13Z","title":"Gradient Sparsification For Masked Fine-Tuning of Transformers","summary":" Fine-tuning pretrained self-supervised language models is widely adopted for\ntransfer learning to downstream tasks. Fine-tuning can be achieved by freezing\ngradients of the pretrained network and only updating gradients of a newly\nadded classification layer, or by performing gradient updates on all\nparameters. Gradual unfreezing makes a trade-off between the two by gradually\nunfreezing gradients of whole layers during training. This has been an\neffective strategy to trade-off between storage and training speed with\ngeneralization performance. However, it is not clear whether gradually\nunfreezing layers throughout training is optimal, compared to sparse variants\nof gradual unfreezing which may improve fine-tuning performance. In this paper,\nwe propose to stochastically mask gradients to regularize pretrained language\nmodels for improving overall fine-tuned performance. 
We introduce GradDrop and\nvariants thereof, a class of gradient sparsification methods that mask\ngradients during the backward pass, acting as gradient noise. GradDrop is\nsparse and stochastic unlike gradual freezing. Extensive experiments on the\nmultilingual XGLUE benchmark with XLMR-Large show that GradDrop is competitive\nagainst methods that use additional translated data for intermediate\npretraining and outperforms standard fine-tuning and gradual unfreezing. A\npost-analysis shows how GradDrop improves performance with languages it was not\ntrained on, such as under-resourced languages.\n","authors":["James O' Neill","Sourav Dutta"],"pdf_url":"https://arxiv.org/pdf/2307.10098v1.pdf","comment":"Accepted to IJCNN 2023"},{"id":"http://arxiv.org/abs/2307.10088v1","updated":"2023-07-19T15:57:24Z","published":"2023-07-19T15:57:24Z","title":"Android in the Wild: A Large-Scale Dataset for Android Device Control","summary":" There is a growing interest in device-control systems that can interpret\nhuman natural language instructions and execute them on a digital device by\ndirectly controlling its user interface. We present a dataset for\ndevice-control research, Android in the Wild (AITW), which is orders of\nmagnitude larger than current datasets. The dataset contains human\ndemonstrations of device interactions, including the screens and actions, and\ncorresponding natural language instructions. It consists of 715k episodes\nspanning 30k unique instructions, four versions of Android (v10-13),and eight\ndevice types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It\ncontains multi-step tasks that require semantic understanding of language and\nvisual context. This dataset poses a new challenge: actions available through\nthe user interface must be inferred from their visual appearance. And, instead\nof simple UI element-based actions, the action space consists of precise\ngestures (e.g., horizontal scrolls to operate carousel widgets). We organize\nour dataset to encourage robustness analysis of device-control systems, i.e.,\nhow well a system performs in the presence of new task descriptions, new\napplications, or new platform versions. We develop two agents and report\nperformance across the dataset. The dataset is available at\nhttps://github.com/google-research/google-research/tree/master/android_in_the_wild.\n","authors":["Christopher Rawles","Alice Li","Daniel Rodriguez","Oriana Riva","Timothy Lillicrap"],"pdf_url":"https://arxiv.org/pdf/2307.10088v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.11596v3","updated":"2023-07-19T15:25:37Z","published":"2023-01-27T08:45:53Z","title":"ThoughtSource: A central hub for large language model reasoning data","summary":" Large language models (LLMs) such as GPT-4 have recently demonstrated\nimpressive results across a wide range of tasks. LLMs are still limited,\nhowever, in that they frequently fail at complex reasoning, their reasoning\nprocesses are opaque, they are prone to 'hallucinate' facts, and there are\nconcerns about their underlying biases. Letting models verbalize reasoning\nsteps as natural language, a technique known as chain-of-thought prompting, has\nrecently been proposed as a way to address some of these issues. Here we\npresent ThoughtSource, a meta-dataset and software library for chain-of-thought\n(CoT) reasoning. 
The goal of ThoughtSource is to improve future artificial\nintelligence systems by facilitating qualitative understanding of CoTs,\nenabling empirical evaluations, and providing training data. This first release\nof ThoughtSource integrates six scientific/medical, three general-domain and\nfive math word question answering datasets.\n","authors":["Simon Ott","Konstantin Hebenstreit","Valentin Liévin","Christoffer Egeberg Hother","Milad Moradi","Maximilian Mayrhauser","Robert Praas","Ole Winther","Matthias Samwald"],"pdf_url":"https://arxiv.org/pdf/2301.11596v3.pdf","comment":"Revision: added datasets, minor restructuring"},{"id":"http://arxiv.org/abs/2307.10025v1","updated":"2023-07-19T15:09:50Z","published":"2023-07-19T15:09:50Z","title":"An Empirical Study on Fertility Proposals Using Multi-Grined Topic\n Analysis Methods","summary":" Fertility issues are closely related to population security, in 60 years\nChina's population for the first time in a negative growth trend, the change of\nfertility policy is of great concern to the community. 2023 ``two sessions\"\nproposal ``suggests that the country in the form of legislation, the birth of\nthe registration of the cancellation of the marriage restriction\" This topic\nwas once a hot topic on the Internet, and ``unbundling\" the relationship\nbetween birth registration and marriage has become the focus of social debate.\nIn this paper, we adopt co-occurrence semantic analysis, topic analysis and\nsentiment analysis to conduct multi-granularity semantic analysis of microblog\ncomments. It is found that the discussion on the proposal of ``removing\nmarriage restrictions from birth registration\" involves the individual, society\nand the state at three dimensions, and is detailed into social issues such as\npersonal behaviour, social ethics and law, and national policy, with people's\nsentiment inclined to be negative in most of the topics. Based on this, eight\nproposals were made to provide a reference for governmental decision making and\nto form a reference method for researching public opinion on political issues.\n","authors":["Yulin Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.10025v1.pdf","comment":"7 pages, 4 figures, 1 table"},{"id":"http://arxiv.org/abs/2307.09456v2","updated":"2023-07-19T14:27:57Z","published":"2023-07-18T17:35:45Z","title":"A comparative analysis of SRGAN models","summary":" In this study, we evaluate the performance of multiple state-of-the-art SRGAN\n(Super Resolution Generative Adversarial Network) models, ESRGAN, Real-ESRGAN\nand EDSR, on a benchmark dataset of real-world images which undergo degradation\nusing a pipeline. Our results show that some models seem to significantly\nincrease the resolution of the input images while preserving their visual\nquality, this is assessed using Tesseract OCR engine. We observe that EDSR-BASE\nmodel from huggingface outperforms the remaining candidate models in terms of\nboth quantitative metrics and subjective visual quality assessments with least\ncompute overhead. Specifically, EDSR generates images with higher peak\nsignal-to-noise ratio (PSNR) and structural similarity index (SSIM) values and\nare seen to return high quality OCR results with Tesseract OCR engine. 
These\nfindings suggest that EDSR is a robust and effective approach for single-image\nsuper-resolution and may be particularly well-suited for applications where\nhigh-quality visual fidelity is critical and optimized compute.\n","authors":["Fatemeh Rezapoor Nikroo","Ajinkya Deshmukh","Anantha Sharma","Adrian Tam","Kaarthik Kumar","Cleo Norris","Aditya Dangi"],"pdf_url":"https://arxiv.org/pdf/2307.09456v2.pdf","comment":"9 pages, 6 tables, 2 figures"},{"id":"http://arxiv.org/abs/2307.09998v1","updated":"2023-07-19T14:13:02Z","published":"2023-07-19T14:13:02Z","title":"Generating Mathematical Derivations with Large Language Models","summary":" The derivation of mathematical results in specialised fields using Large\nLanguage Models (LLMs) is an emerging research direction that can help identify\nmodels' limitations, and potentially support mathematical discovery. In this\npaper, we leverage a symbolic engine to generate derivations of equations at\nscale, and investigate the capabilities of LLMs when deriving goal equations\nfrom premises. Specifically, we employ in-context learning for GPT and\nfine-tune a range of T5 models to compare the robustness and generalisation of\npre-training strategies to specialised models. Empirical results show that\nfine-tuned FLAN-T5-large (MathT5) outperforms GPT models on all static and\nout-of-distribution test sets in terms of absolute performance. However, an\nin-depth analysis reveals that the fine-tuned models are more sensitive to\nperturbations involving unseen symbols and (to a lesser extent) changes to\nequation structure. In addition, we analyse 1.7K equations and over 200\nderivations to highlight common reasoning errors such as the inclusion of\nincorrect, irrelevant, and redundant equations, along with the tendency to skip\nderivation steps. Finally, we explore the suitability of existing metrics for\nevaluating mathematical derivations finding evidence that, while they capture\ngeneral properties such as sensitivity to perturbations, they fail to highlight\nfine-grained reasoning errors and essential differences between models.\nOverall, this work demonstrates that training models on synthetic data can\nimprove their mathematical capabilities beyond larger architectures.\n","authors":["Jordan Meadows","Marco Valentino","Andre Freitas"],"pdf_url":"https://arxiv.org/pdf/2307.09998v1.pdf","comment":"13 pages"},{"id":"http://arxiv.org/abs/2303.15056v2","updated":"2023-07-19T14:10:55Z","published":"2023-03-27T09:59:48Z","title":"ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks","summary":" Many NLP applications require manual data annotations for a variety of tasks,\nnotably to train classifiers or evaluate the performance of unsupervised\nmodels. Depending on the size and degree of complexity, the tasks may be\nconducted by crowd-workers on platforms such as MTurk as well as trained\nannotators, such as research assistants. Using a sample of 2,382 tweets, we\ndemonstrate that ChatGPT outperforms crowd-workers for several annotation\ntasks, including relevance, stance, topics, and frames detection. Specifically,\nthe zero-shot accuracy of ChatGPT exceeds that of crowd-workers for four out of\nfive tasks, while ChatGPT's intercoder agreement exceeds that of both\ncrowd-workers and trained annotators for all tasks. Moreover, the\nper-annotation cost of ChatGPT is less than $0.003 -- about twenty times\ncheaper than MTurk. 
These results show the potential of large language models\nto drastically increase the efficiency of text classification.\n","authors":["Fabrizio Gilardi","Meysam Alizadeh","Maël Kubli"],"pdf_url":"https://arxiv.org/pdf/2303.15056v2.pdf","comment":"Gilardi, Fabrizio, Meysam Alizadeh, and Ma\\\"el Kubli. 2023. \"ChatGPT\n Outperforms Crowd Workers for Text-Annotation Tasks\". Proceedings of the\n National Academy of Sciences 120(30): e2305016120"},{"id":"http://arxiv.org/abs/2210.14037v2","updated":"2023-07-19T13:43:07Z","published":"2022-10-25T14:13:53Z","title":"Revisiting Softmax for Uncertainty Approximation in Text Classification","summary":" Uncertainty approximation in text classification is an important area with\napplications in domain adaptation and interpretability. One of the most widely\nused uncertainty approximation methods is Monte Carlo (MC) Dropout, which is\ncomputationally expensive as it requires multiple forward passes through the\nmodel. A cheaper alternative is to simply use the softmax based on a single\nforward pass without dropout to estimate model uncertainty. However, prior work\nhas indicated that these predictions tend to be overconfident. In this paper,\nwe perform a thorough empirical analysis of these methods on five datasets with\ntwo base neural architectures in order to identify the trade-offs between the\ntwo. We compare both softmax and an efficient version of MC Dropout on their\nuncertainty approximations and downstream text classification performance,\nwhile weighing their runtime (cost) against performance (benefit). We find\nthat, while MC dropout produces the best uncertainty approximations, using a\nsimple softmax leads to competitive and in some cases better uncertainty\nestimation for text classification at a much lower computational cost,\nsuggesting that softmax can in fact be a sufficient uncertainty estimate when\ncomputational resources are a concern.\n","authors":["Andreas Nugaard Holm","Dustin Wright","Isabelle Augenstein"],"pdf_url":"https://arxiv.org/pdf/2210.14037v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09959v1","updated":"2023-07-19T13:01:03Z","published":"2023-07-19T13:01:03Z","title":"GUIDO: A Hybrid Approach to Guideline Discovery & Ordering from Natural\n Language Texts","summary":" Extracting workflow nets from textual descriptions can be used to simplify\nguidelines or formalize textual descriptions of formal processes like business\nprocesses and algorithms. The task of manually extracting processes, however,\nrequires domain expertise and effort. While automatic process model extraction\nis desirable, annotating texts with formalized process models is expensive.\nTherefore, there are only a few machine-learning-based extraction approaches.\nRule-based approaches, in turn, require domain specificity to work well and can\nrarely distinguish relevant and irrelevant information in textual descriptions.\nIn this paper, we present GUIDO, a hybrid approach to the process model\nextraction task that first, classifies sentences regarding their relevance to\nthe process model, using a BERT-based sentence classifier, and second, extracts\na process model from the sentences classified as relevant, using dependency\nparsing. The presented approach achieves significantly better results than a\npure rule-based approach. GUIDO achieves an average behavioral similarity score\nof $0.93$. 
Still, in comparison to purely machine-learning-based approaches,\nthe annotation costs stay low.\n","authors":["Nils Freyer","Dustin Thewes","Matthias Meinecke"],"pdf_url":"https://arxiv.org/pdf/2307.09959v1.pdf","comment":"Preprint of the short paper presented at the 12th International\n Conference on Data Science, Technology and Applications"},{"id":"http://arxiv.org/abs/2307.02486v2","updated":"2023-07-19T12:25:35Z","published":"2023-07-05T17:59:38Z","title":"LongNet: Scaling Transformers to 1,000,000,000 Tokens","summary":" Scaling sequence length has become a critical demand in the era of large\nlanguage models. However, existing methods struggle with either computational\ncomplexity or model expressivity, rendering the maximum sequence length\nrestricted. To address this issue, we introduce LongNet, a Transformer variant\nthat can scale sequence length to more than 1 billion tokens, without\nsacrificing the performance on shorter sequences. Specifically, we propose\ndilated attention, which expands the attentive field exponentially as the\ndistance grows. LongNet has significant advantages: 1) it has a linear\ncomputation complexity and a logarithm dependency between any two tokens in a\nsequence; 2) it can be served as a distributed trainer for extremely long\nsequences; 3) its dilated attention is a drop-in replacement for standard\nattention, which can be seamlessly integrated with the existing\nTransformer-based optimization. Experiments results demonstrate that LongNet\nyields strong performance on both long-sequence modeling and general language\ntasks. Our work opens up new possibilities for modeling very long sequences,\ne.g., treating a whole corpus or even the entire Internet as a sequence.\n","authors":["Jiayu Ding","Shuming Ma","Li Dong","Xingxing Zhang","Shaohan Huang","Wenhui Wang","Nanning Zheng","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2307.02486v2.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2307.09923v1","updated":"2023-07-19T11:54:46Z","published":"2023-07-19T11:54:46Z","title":"Large Language Models can accomplish Business Process Management Tasks","summary":" Business Process Management (BPM) aims to improve organizational activities\nand their outcomes by managing the underlying processes. To achieve this, it is\noften necessary to consider information from various sources, including\nunstructured textual documents. Therefore, researchers have developed several\nBPM-specific solutions that extract information from textual documents using\nNatural Language Processing techniques. These solutions are specific to their\nrespective tasks and cannot accomplish multiple process-related problems as a\ngeneral-purpose instrument. However, in light of the recent emergence of Large\nLanguage Models (LLMs) with remarkable reasoning capabilities, such a\ngeneral-purpose instrument with multiple applications now appears attainable.\nIn this paper, we illustrate how LLMs can accomplish text-related BPM tasks by\napplying a specific LLM to three exemplary tasks: mining imperative process\nmodels from textual descriptions, mining declarative process models from\ntextual descriptions, and assessing the suitability of process tasks from\ntextual descriptions for robotic process automation. 
We show that, without\nextensive configuration or prompt engineering, LLMs perform comparably to or\nbetter than existing solutions and discuss implications for future BPM research\nas well as practical usage.\n","authors":["Michael Grohs","Luka Abb","Nourhan Elsayed","Jana-Rebecca Rehse"],"pdf_url":"https://arxiv.org/pdf/2307.09923v1.pdf","comment":"Accepted at NLP4BPM workshop at BPM 2023"},{"id":"http://arxiv.org/abs/2307.09885v1","updated":"2023-07-19T10:28:59Z","published":"2023-07-19T10:28:59Z","title":"Test-takers have a say: understanding the implications of the use of AI\n in language tests","summary":" Language tests measure a person's ability to use a language in terms of\nlistening, speaking, reading, or writing. Such tests play an integral role in\nacademic, professional, and immigration domains, with entities such as\neducational institutions, professional accreditation bodies, and governments\nusing them to assess candidate language proficiency. Recent advances in\nArtificial Intelligence (AI) and the discipline of Natural Language Processing\nhave prompted language test providers to explore AI's potential applicability\nwithin language testing, leading to transformative activity patterns\nsurrounding language instruction and learning. However, with concerns over AI's\ntrustworthiness, it is imperative to understand the implications of integrating\nAI into language testing. This knowledge will enable stakeholders to make\nwell-informed decisions, thus safeguarding community well-being and testing\nintegrity. To understand the concerns and effects of AI usage in language\ntests, we conducted interviews and surveys with English test-takers. To the\nbest of our knowledge, this is the first empirical study aimed at identifying\nthe implications of AI adoption in language tests from a test-taker\nperspective. Our study reveals test-taker perceptions and behavioral patterns.\nSpecifically, we identify that AI integration may enhance perceptions of\nfairness, consistency, and availability. Conversely, it might incite mistrust\nregarding reliability and interactivity aspects, subsequently influencing the\nbehaviors and well-being of test-takers. These insights provide a better\nunderstanding of potential societal implications and assist stakeholders in\nmaking informed decisions concerning AI usage in language testing.\n","authors":["Dawen Zhang","Thong Hoang","Shidong Pan","Yongquan Hu","Zhenchang Xing","Mark Staples","Xiwei Xu","Qinghua Lu","Aaron Quigley"],"pdf_url":"https://arxiv.org/pdf/2307.09885v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09416v2","updated":"2023-07-19T08:27:50Z","published":"2023-07-18T16:33:30Z","title":"Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation\n Evaluation","summary":" Research in Image Generation has recently made significant progress,\nparticularly boosted by the introduction of Vision-Language models which are\nable to produce high-quality visual content based on textual inputs. Despite\nongoing advancements in terms of generation quality and realism, no methodical\nframeworks have been defined yet to quantitatively measure the quality of the\ngenerated content and the adherence with the prompted requests: so far, only\nhuman-based evaluations have been adopted for quality satisfaction and for\ncomparing different generative methods. We introduce a novel automated method\nfor Visual Concept Evaluation (ViCE), i.e. 
to assess consistency between a\ngenerated/edited image and the corresponding prompt/instructions, with a\nprocess inspired by the human cognitive behaviour. ViCE combines the strengths\nof Large Language Models (LLMs) and Visual Question Answering (VQA) into a\nunified pipeline, aiming to replicate the human cognitive process in quality\nassessment. This method outlines visual concepts, formulates image-specific\nverification questions, utilizes the Q&A system to investigate the image, and\nscores the combined outcome. Although this brave new hypothesis of mimicking\nhumans in the image evaluation process is in its preliminary assessment stage,\nresults are promising and open the door to a new form of automatic evaluation\nwhich could have significant impact as the image generation or the image target\nediting tasks become more and more sophisticated.\n","authors":["Federico Betti","Jacopo Staiano","Lorenzo Baraldi","Lorenzo Baraldi","Rita Cucchiara","Nicu Sebe"],"pdf_url":"https://arxiv.org/pdf/2307.09416v2.pdf","comment":"Accepted as oral at ACM MultiMedia 2023 (Brave New Ideas track)"},{"id":"http://arxiv.org/abs/2307.09813v1","updated":"2023-07-19T08:02:20Z","published":"2023-07-19T08:02:20Z","title":"DAPrompt: Deterministic Assumption Prompt Learning for Event Causality\n Identification","summary":" Event Causality Identification (ECI) aims at determining whether there is a\ncausal relation between two event mentions. Conventional prompt learning\ndesigns a prompt template to first predict an answer word and then maps it to\nthe final decision. Unlike conventional prompts, we argue that predicting an\nanswer word may not be a necessary prerequisite for the ECI task. Instead, we\ncan first make a deterministic assumption on the existence of causal relation\nbetween two events and then evaluate its rationality to either accept or reject\nthe assumption. The design motivation is to try the most utilization of the\nencyclopedia-like knowledge embedded in a pre-trained language model. In light\nof such considerations, we propose a deterministic assumption prompt learning\nmodel, called DAPrompt, for the ECI task. In particular, we design a simple\ndeterministic assumption template concatenating with the input event pair,\nwhich includes two masks as predicted events' tokens. We use the probabilities\nof predicted events to evaluate the assumption rationality for the final event\ncausality decision. Experiments on the EventStoryLine corpus and\nCausal-TimeBank corpus validate our design objective in terms of significant\nperformance improvements over the state-of-the-art algorithms.\n","authors":["Wei Xiang","Chuanhong Zhan","Bang Wang"],"pdf_url":"https://arxiv.org/pdf/2307.09813v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09793v1","updated":"2023-07-19T07:17:43Z","published":"2023-07-19T07:17:43Z","title":"On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large\n Language Models","summary":" Since late 2022, Large Language Models (LLMs) have become very prominent with\nLLMs like ChatGPT and Bard receiving millions of users. Hundreds of new LLMs\nare announced each week, many of which are deposited to Hugging Face, a\nrepository of machine learning models and datasets. To date, nearly 16,000 Text\nGeneration models have been uploaded to the site. Given the huge influx of\nLLMs, it is of interest to know which LLM backbones, settings, training\nmethods, and families are popular or trending. However, there is no\ncomprehensive index of LLMs available. 
We take advantage of the relatively\nsystematic nomenclature of Hugging Face LLMs to perform hierarchical clustering\nand identify communities amongst LLMs using n-grams and term frequency-inverse\ndocument frequency. Our methods successfully identify families of LLMs and\naccurately cluster LLMs into meaningful subgroups. We present a public web\napplication to navigate and explore Constellation, our atlas of 15,821 LLMs.\nConstellation rapidly generates a variety of visualizations, namely\ndendrograms, graphs, word clouds, and scatter plots. Constellation is available\nat the following link: https://constellation.sites.stanford.edu/.\n","authors":["Sarah Gao","Andrew Kean Gao"],"pdf_url":"https://arxiv.org/pdf/2307.09793v1.pdf","comment":"14 pages, 6 figures, 1 table"},{"id":"http://arxiv.org/abs/2307.09782v1","updated":"2023-07-19T06:58:03Z","published":"2023-07-19T06:58:03Z","title":"ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization\n Using Floating-Point Formats","summary":" In the complex domain of large language models (LLMs), striking a balance\nbetween computational efficiency and maintaining model quality is a formidable\nchallenge. Navigating the inherent limitations of uniform quantization,\nparticularly when dealing with outliers, and motivated by the launch of\nNVIDIA's H100 hardware, this study delves into the viability of floating-point\n(FP) quantization, particularly focusing on FP8 and FP4, as a potential\nsolution. Our comprehensive investigation reveals that for LLMs, FP8 activation\nconsistently outshines its integer (INT8) equivalent, with the performance edge\nbecoming more noticeable in models possessing parameters beyond one billion.\nFor weight quantization, our findings indicate that FP4 exhibits comparable, if\nnot superior, performance to INT4, simplifying deployment on FP-supported\nhardware like H100. To mitigate the overhead from precision alignment caused by\nthe disparity between weights and activations, we propose two scaling\nconstraints for weight quantization that negligibly impact the performance\ncompared to the standard W4A8 model. We additionally enhance our quantization\nmethods by integrating the Low Rank Compensation (LoRC) strategy, yielding\nimprovements especially in smaller models. The results of our investigation\nemphasize the immense potential of FP quantization for LLMs, paving the way for\nhigh-efficiency deployment in resource-limited settings.\n","authors":["Xiaoxia Wu","Zhewei Yao","Yuxiong He"],"pdf_url":"https://arxiv.org/pdf/2307.09782v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.01692v4","updated":"2023-07-19T06:48:35Z","published":"2022-12-03T21:14:32Z","title":"Can In-context Learners Learn a Reasoning Concept from Demonstrations?","summary":" Language models exhibit an emergent ability to learn a new task from a small\nnumber of input-output demonstrations. 
However, recent work shows that\nin-context learners largely rely on their pre-trained knowledge, such as the\nsentiment of the labels, instead of learning new associations from the input.\nWe argue that the commonly-used few-shot evaluation using a random selection of\nin-context demonstrations can not disentangle models' reliance on such biases,\nas most of the randomly-selected demonstrations do not present relations\ninformative for prediction beyond exposing the task's input-output\ndistribution.\n Therefore, to evaluate models' in-context learning ability independent of\nmodels' memory, we introduce a Concept-sharing few-shot learning method\nchoosing the demonstrations that share an underlying concept with the predicted\nsample. We extract a set of such concepts from available human explanations and\nmeasure how much models can benefit from presenting these concepts in few-shot\ndemonstrations.\n We find that most of the recent in-context learners can not consistently\nbenefit from the demonstrated concepts, irrespective of the model size.\nHowever, we note that T0 models are more sensitive to exhibited concepts,\nbenefiting from concept-sharing demonstrations in 7 out of 8 evaluation\nscenarios.\n","authors":["Michal Štefánik","Marek Kadlčík"],"pdf_url":"https://arxiv.org/pdf/2212.01692v4.pdf","comment":"Awarded Best Paper at ACL 2023 Natural Language Reasoning and\n Structured Explanations (NLRSE) workshop"},{"id":"http://arxiv.org/abs/2307.08621v2","updated":"2023-07-19T05:56:42Z","published":"2023-07-17T16:40:01Z","title":"Retentive Network: A Successor to Transformer for Large Language Models","summary":" In this work, we propose Retentive Network (RetNet) as a foundation\narchitecture for large language models, simultaneously achieving training\nparallelism, low-cost inference, and good performance. We theoretically derive\nthe connection between recurrence and attention. Then we propose the retention\nmechanism for sequence modeling, which supports three computation paradigms,\ni.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel\nrepresentation allows for training parallelism. The recurrent representation\nenables low-cost $O(1)$ inference, which improves decoding throughput, latency,\nand GPU memory without sacrificing performance. The chunkwise recurrent\nrepresentation facilitates efficient long-sequence modeling with linear\ncomplexity, where each chunk is encoded parallelly while recurrently\nsummarizing the chunks. Experimental results on language modeling show that\nRetNet achieves favorable scaling results, parallel training, low-cost\ndeployment, and efficient inference. The intriguing properties make RetNet a\nstrong successor to Transformer for large language models. Code will be\navailable at https://aka.ms/retnet.\n","authors":["Yutao Sun","Li Dong","Shaohan Huang","Shuming Ma","Yuqing Xia","Jilong Xue","Jianyong Wang","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2307.08621v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.10551v3","updated":"2023-07-19T05:52:32Z","published":"2022-12-20T18:54:08Z","title":"Lego-MT: Learning Detachable Models for Massively Multilingual Machine\n Translation","summary":" Multilingual neural machine translation (MNMT) aims to build a unified model\nfor many language directions. Existing monolithic models for MNMT encounter two\nchallenges: parameter interference among languages and inefficient inference\nfor large models. 
In this paper, we revisit the classic multi-way structures\nand develop a detachable model by assigning each language (or group of\nlanguages) to an individual branch that supports plug-and-play training and\ninference. To address the needs of learning representations for all languages\nin a unified space, we propose a novel efficient training recipe, upon which we\nbuild an effective detachable model, Lego-MT. For a fair comparison, we collect\ndata from OPUS and build a translation benchmark covering 433 languages and\n1.3B parallel data. Experiments show that Lego-MT with 1.2B parameters brings\nan average gain of 3.2 spBLEU. It even outperforms M2M-100 with 12B parameters.\nThe proposed training recipe brings a 28.2$\\times$ speedup over the\nconventional multi-way training method.\\footnote{\n\\url{https://github.com/CONE-MT/Lego-MT}.}\n","authors":["Fei Yuan","Yinquan Lu","WenHao Zhu","Lingpeng Kong","Lei Li","Yu Qiao","Jingjing Xu"],"pdf_url":"https://arxiv.org/pdf/2212.10551v3.pdf","comment":"ACL 2023 Findings"},{"id":"http://arxiv.org/abs/2303.12135v4","updated":"2023-07-19T05:30:31Z","published":"2023-03-21T18:48:11Z","title":"Understand Legal Documents with Contextualized Large Language Models","summary":" The growth of pending legal cases in populous countries, such as India, has\nbecome a major issue. Developing effective techniques to process and understand\nlegal documents is extremely useful in resolving this problem. In this paper,\nwe present our systems for SemEval-2023 Task 6: understanding legal texts (Modi\net al., 2023). Specifically, we first develop the Legal-BERT-HSLN model that\nconsiders the comprehensive context information in both intra- and\ninter-sentence levels to predict rhetorical roles (subtask A) and then train a\nLegal-LUKE model, which is legal-contextualized and entity-aware, to recognize\nlegal entities (subtask B). Our evaluations demonstrate that our designed\nmodels are more accurate than baselines, e.g., with an up to 15.0% better F1\nscore in subtask B. We achieved notable performance in the task leaderboard,\ne.g., 0.834 micro F1 score, and ranked No.5 out of 27 teams in subtask A.\n","authors":["Xin Jin","Yuchen Wang"],"pdf_url":"https://arxiv.org/pdf/2303.12135v4.pdf","comment":"SemEval 2023"},{"id":"http://arxiv.org/abs/2306.07848v5","updated":"2023-07-19T04:56:33Z","published":"2023-06-13T15:28:10Z","title":"GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio\n Pretraining for Speech Emotion Recognition","summary":" Contrastive learning based cross-modality pretraining methods have recently\nexhibited impressive success in diverse fields. In this paper, we propose\nGEmo-CLAP, a kind of gender-attribute-enhanced contrastive language-audio\npretraining (CLAP) method for speech emotion recognition. Specifically, a novel\nemotion CLAP model (Emo-CLAP) is first built, utilizing various self-supervised\npre-trained models. Second, considering the importance of gender attribute in\nspeech emotion modeling, the soft label based GEmo-CLAP (SL-GEmo-CLAP) and\nmulti-task learning based GEmo-CLAP (ML-GEmo-CLAP) are further proposed to\nintegrate the emotion and gender information of speech signals, forming more\nreasonable objectives. Extensive experiments on IEMOCAP show that our proposed\ntwo GEmo-CLAP models consistently outperform the baseline Emo-CLAP with\ndifferent pre-trained models, while also achieving the best recognition\nperformance compared with recent state-of-the-art methods. 
Noticeably, the\nproposed WavLM-based ML-GEmo-CLAP obtains the best UAR of 80.16\\% and WAR of\n82.06\\%.\n","authors":["Yu Pan","Lei Ma"],"pdf_url":"https://arxiv.org/pdf/2306.07848v5.pdf","comment":"5 pages"},{"id":"http://arxiv.org/abs/2307.09744v1","updated":"2023-07-19T04:25:21Z","published":"2023-07-19T04:25:21Z","title":"Enhancing conversational quality in language learning chatbots: An\n evaluation of GPT4 for ASR error correction","summary":" The integration of natural language processing (NLP) technologies into\neducational applications has shown promising results, particularly in the\nlanguage learning domain. Recently, many spoken open-domain chatbots have been\nused as speaking partners, helping language learners improve their language\nskills. However, one of the significant challenges is the high word-error-rate\n(WER) when recognizing non-native/non-fluent speech, which interrupts\nconversation flow and leads to disappointment for learners. This paper explores\nthe use of GPT4 for ASR error correction in conversational settings. In\naddition to WER, we propose to use semantic textual similarity (STS) and next\nresponse sensibility (NRS) metrics to evaluate the impact of error correction\nmodels on the quality of the conversation. We find that transcriptions\ncorrected by GPT4 lead to higher conversation quality, despite an increase in\nWER. GPT4 also outperforms standard error correction methods without the need\nfor in-domain training data.\n","authors":["Long Mai","Julie Carson-Berndsen"],"pdf_url":"https://arxiv.org/pdf/2307.09744v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09455v2","updated":"2023-07-19T04:13:11Z","published":"2023-07-18T17:29:23Z","title":"Pseudo Outlier Exposure for Out-of-Distribution Detection using\n Pretrained Transformers","summary":" For real-world language applications, detecting an out-of-distribution (OOD)\nsample is helpful to alert users or reject such unreliable samples. However,\nmodern over-parameterized language models often produce overconfident\npredictions for both in-distribution (ID) and OOD samples. In particular,\nlanguage models suffer from OOD samples with a similar semantic representation\nto ID samples since these OOD samples lie near the ID manifold. A rejection\nnetwork can be trained with ID and diverse outlier samples to detect test OOD\nsamples, but explicitly collecting auxiliary OOD datasets brings an additional\nburden for data collection. In this paper, we propose a simple but effective\nmethod called Pseudo Outlier Exposure (POE) that constructs a surrogate OOD\ndataset by sequentially masking tokens related to ID classes. The surrogate OOD\nsample introduced by POE shows a similar representation to ID data, which is\nmost effective in training a rejection network. Our method does not require any\nexternal OOD data and can be easily implemented within off-the-shelf\nTransformers. 
A comprehensive comparison with state-of-the-art algorithms\ndemonstrates POE's competitiveness on several text classification benchmarks.\n","authors":["Jaeyoung Kim","Kyuheon Jung","Dongbin Na","Sion Jang","Eunbin Park","Sungchul Choi"],"pdf_url":"https://arxiv.org/pdf/2307.09455v2.pdf","comment":"12 pages, 2 figures"},{"id":"http://arxiv.org/abs/2307.09706v1","updated":"2023-07-19T01:37:31Z","published":"2023-07-19T01:37:31Z","title":"RaTE: a Reproducible automatic Taxonomy Evaluation by Filling the Gap","summary":" Taxonomies are an essential knowledge representation, yet most studies on\nautomatic taxonomy construction (ATC) resort to manual evaluation to score\nproposed algorithms. We argue that automatic taxonomy evaluation (ATE) is just\nas important as taxonomy construction. We propose RaTE, an automatic label-free\ntaxonomy scoring procedure, which relies on a large pre-trained language model.\nWe apply our evaluation procedure to three state-of-the-art ATC algorithms with\nwhich we built seven taxonomies from the Yelp domain, and show that 1) RaTE\ncorrelates well with human judgments and 2) artificially degrading a taxonomy\nleads to decreasing RaTE score.\n","authors":["Tianjian Gao","Phillipe Langlais"],"pdf_url":"https://arxiv.org/pdf/2307.09706v1.pdf","comment":"15th International Conference on Computational Semantics (IWCS),\n Association for Computational Linguistics (ACL)"},{"id":"http://arxiv.org/abs/2307.03135v2","updated":"2023-07-19T01:28:30Z","published":"2023-07-06T17:05:26Z","title":"Distilling Large Vision-Language Model with Out-of-Distribution\n Generalizability","summary":" Large vision-language models have achieved outstanding performance, but their\nsize and computational requirements make their deployment on\nresource-constrained devices and time-sensitive tasks impractical. Model\ndistillation, the process of creating smaller, faster models that maintain the\nperformance of larger models, is a promising direction towards the solution.\nThis paper investigates the distillation of visual representations in large\nteacher vision-language models into lightweight student models using a small-\nor mid-scale dataset. Notably, this study focuses on open-vocabulary\nout-of-distribution (OOD) generalization, a challenging problem that has been\noverlooked in previous model distillation literature. We propose two principles\nfrom vision and language modality perspectives to enhance student's OOD\ngeneralization: (1) by better imitating teacher's visual representation space,\nand carefully promoting better coherence in vision-language alignment with the\nteacher; (2) by enriching the teacher's language representations with\ninformative and finegrained semantic attributes to effectively distinguish\nbetween different labels. We propose several metrics and conduct extensive\nexperiments to investigate their techniques. The results demonstrate\nsignificant improvements in zero-shot and few-shot student performance on\nopen-vocabulary out-of-distribution classification, highlighting the\neffectiveness of our proposed approaches. 
Code released at\nhttps://github.com/xuanlinli17/large_vlm_distillation_ood\n","authors":["Xuanlin Li","Yunhao Fang","Minghua Liu","Zhan Ling","Zhuowen Tu","Hao Su"],"pdf_url":"https://arxiv.org/pdf/2307.03135v2.pdf","comment":"Published at International Conference on Computer Vision (ICCV) 2023"},{"id":"http://arxiv.org/abs/2307.09705v1","updated":"2023-07-19T01:22:40Z","published":"2023-07-19T01:22:40Z","title":"CValues: Measuring the Values of Chinese Large Language Models from\n Safety to Responsibility","summary":" With the rapid evolution of large language models (LLMs), there is a growing\nconcern that they may pose risks or have negative social impacts. Therefore,\nevaluation of human values alignment is becoming increasingly important.\nPrevious work mainly focuses on assessing the performance of LLMs on certain\nknowledge and reasoning abilities, while neglecting the alignment to human\nvalues, especially in a Chinese context. In this paper, we present CValues, the\nfirst Chinese human values evaluation benchmark to measure the alignment\nability of LLMs in terms of both safety and responsibility criteria. As a\nresult, we have manually collected adversarial safety prompts across 10\nscenarios and induced responsibility prompts from 8 domains by professional\nexperts. To provide a comprehensive values evaluation of Chinese LLMs, we not\nonly conduct human evaluation for reliable comparison, but also construct\nmulti-choice prompts for automatic evaluation. Our findings suggest that while\nmost Chinese LLMs perform well in terms of safety, there is considerable room\nfor improvement in terms of responsibility. Moreover, both the automatic and\nhuman evaluation are important for assessing the human values alignment in\ndifferent aspects. The benchmark and code is available on ModelScope and\nGithub.\n","authors":["Guohai Xu","Jiayi Liu","Ming Yan","Haotian Xu","Jinghui Si","Zhuoran Zhou","Peng Yi","Xing Gao","Jitao Sang","Rong Zhang","Ji Zhang","Chao Peng","Fei Huang","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.09705v1.pdf","comment":"Working in Process"},{"id":"http://arxiv.org/abs/2307.09702v1","updated":"2023-07-19T01:14:49Z","published":"2023-07-19T01:14:49Z","title":"Efficient Guided Generation for LLMs","summary":" In this article we describe an efficient approach to guiding language model\ntext generation with regular expressions and context-free grammars. Our\napproach adds little to no overhead to the token sequence generation process,\nand makes guided generation feasible in practice. An implementation is provided\nin the open source Python library Outlines.\n","authors":["Brandon T. Willard","Rémi Louf"],"pdf_url":"https://arxiv.org/pdf/2307.09702v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09701v1","updated":"2023-07-19T01:05:33Z","published":"2023-07-19T01:05:33Z","title":"Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation","summary":" Rising computational demands of modern natural language processing (NLP)\nsystems have increased the barrier to entry for cutting-edge research while\nposing serious environmental concerns. Yet, progress on model efficiency has\nbeen impeded by practical challenges in model evaluation and comparison. For\nexample, hardware is challenging to control due to disparate levels of\naccessibility across different institutions. 
Moreover, improvements in metrics\nsuch as FLOPs often fail to translate to progress in real-world applications.\nIn response, we introduce Pentathlon, a benchmark for holistic and realistic\nevaluation of model efficiency. Pentathlon focuses on inference, which accounts\nfor a majority of the compute in a model's lifecycle. It offers a\nstrictly-controlled hardware platform, and is designed to mirror real-world\napplications scenarios. It incorporates a suite of metrics that target\ndifferent aspects of efficiency, including latency, throughput, memory\noverhead, and energy consumption. Pentathlon also comes with a software library\nthat can be seamlessly integrated into any codebase and enable evaluation. As a\nstandardized and centralized evaluation platform, Pentathlon can drastically\nreduce the workload to make fair and reproducible efficiency comparisons. While\ninitially focused on natural language processing (NLP) models, Pentathlon is\ndesigned to allow flexible extension to other fields. We envision Pentathlon\nwill stimulate algorithmic innovations in building efficient models, and foster\nan increased awareness of the social and environmental implications in the\ndevelopment of future-generation NLP models.\n","authors":["Hao Peng","Qingqing Cao","Jesse Dodge","Matthew E. Peters","Jared Fernandez","Tom Sherborne","Kyle Lo","Sam Skjonsberg","Emma Strubell","Darrell Plessas","Iz Beltagy","Evan Pete Walsh","Noah A. Smith","Hannaneh Hajishirzi"],"pdf_url":"https://arxiv.org/pdf/2307.09701v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08272v2","updated":"2023-07-19T23:52:23Z","published":"2023-07-17T06:36:53Z","title":"ChatGPT is Good but Bing Chat is Better for Vietnamese Students","summary":" This study examines the efficacy of two SOTA large language models (LLMs),\nnamely ChatGPT and Microsoft Bing Chat (BingChat), in catering to the needs of\nVietnamese students. Although ChatGPT exhibits proficiency in multiple\ndisciplines, Bing Chat emerges as the more advantageous option. We conduct a\ncomparative analysis of their academic achievements in various disciplines,\nencompassing mathematics, literature, English language, physics, chemistry,\nbiology, history, geography, and civic education. The results of our study\nsuggest that BingChat demonstrates superior performance compared to ChatGPT\nacross a wide range of subjects, with the exception of literature, where\nChatGPT exhibits better performance. Additionally, BingChat utilizes the more\nadvanced GPT-4 technology in contrast to ChatGPT, which is built upon GPT-3.5.\nThis allows BingChat to improve to comprehension, reasoning and generation of\ncreative and informative text. Moreover, the fact that BingChat is accessible\nin Vietnam and its integration of hyperlinks and citations within responses\nserve to reinforce its superiority. In our analysis, it is evident that while\nChatGPT exhibits praiseworthy qualities, BingChat presents a more apdated\nsolutions for Vietnamese students.\n","authors":["Xuan-Quy Dao","Ngoc-Bich Le"],"pdf_url":"https://arxiv.org/pdf/2307.08272v2.pdf","comment":"13 pages; 6 figures"},{"id":"http://arxiv.org/abs/2307.10490v1","updated":"2023-07-19T23:03:20Z","published":"2023-07-19T23:03:20Z","title":"(Ab)using Images and Sounds for Indirect Instruction Injection in\n Multi-Modal LLMs","summary":" We demonstrate how images and sounds can be used for indirect prompt and\ninstruction injection in multi-modal LLMs. 
An attacker generates an adversarial\nperturbation corresponding to the prompt and blends it into an image or audio\nrecording. When the user asks the (unmodified, benign) model about the\nperturbed image or audio, the perturbation steers the model to output the\nattacker-chosen text and/or make the subsequent dialog follow the attacker's\ninstruction. We illustrate this attack with several proof-of-concept examples\ntargeting LLaVa and PandaGPT.\n","authors":["Eugene Bagdasaryan","Tsung-Yin Hsieh","Ben Nassi","Vitaly Shmatikov"],"pdf_url":"https://arxiv.org/pdf/2307.10490v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10488v1","updated":"2023-07-19T22:48:02Z","published":"2023-07-19T22:48:02Z","title":"SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot\n Neural Sparse Retrieval","summary":" Traditionally, sparse retrieval systems relied on lexical representations to\nretrieve documents, such as BM25, dominated information retrieval tasks. With\nthe onset of pre-trained transformer models such as BERT, neural sparse\nretrieval has led to a new paradigm within retrieval. Despite the success,\nthere has been limited software supporting different sparse retrievers running\nin a unified, common environment. This hinders practitioners from fairly\ncomparing different sparse models and obtaining realistic evaluation results.\nAnother missing piece is, that a majority of prior work evaluates sparse\nretrieval models on in-domain retrieval, i.e. on a single dataset: MS MARCO.\nHowever, a key requirement in practical retrieval systems requires models that\ncan generalize well to unseen out-of-domain, i.e. zero-shot retrieval tasks. In\nthis work, we provide SPRINT, a unified Python toolkit based on Pyserini and\nLucene, supporting a common interface for evaluating neural sparse retrieval.\nThe toolkit currently includes five built-in models: uniCOIL, DeepImpact,\nSPARTA, TILDEv2 and SPLADEv2. Users can also easily add customized models by\ndefining their term weighting method. Using our toolkit, we establish strong\nand reproducible zero-shot sparse retrieval baselines across the\nwell-acknowledged benchmark, BEIR. Our results demonstrate that SPLADEv2\nachieves the best average score of 0.470 nDCG@10 on BEIR amongst all neural\nsparse retrievers. In this work, we further uncover the reasons behind its\nperformance gain. We show that SPLADEv2 produces sparse representations with a\nmajority of tokens outside of the original query and document which is often\ncrucial for its performance gains, i.e. a limitation among its other sparse\ncounterparts. We provide our SPRINT toolkit, models, and data used in our\nexperiments publicly here at https://github.com/thakur-nandan/sprint.\n","authors":["Nandan Thakur","Kexin Wang","Iryna Gurevych","Jimmy Lin"],"pdf_url":"https://arxiv.org/pdf/2307.10488v1.pdf","comment":"Accepted at SIGIR 2023 (Resource Track)"},{"id":"http://arxiv.org/abs/2307.10485v1","updated":"2023-07-19T22:43:57Z","published":"2023-07-19T22:43:57Z","title":"FinGPT: Democratizing Internet-scale Data for Financial Large Language\n Models","summary":" Large language models (LLMs) have demonstrated remarkable proficiency in\nunderstanding and generating human-like texts, which may potentially\nrevolutionize the finance industry. However, existing LLMs often fall short in\nthe financial field, which is mainly attributed to the disparities between\ngeneral text data and financial text data. 
Unfortunately, there is only a\nlimited number of financial text datasets available (quite small size), and\nBloombergGPT, the first financial LLM (FinLLM), is close-sourced (only the\ntraining logs were released). In light of this, we aim to democratize\nInternet-scale financial data for LLMs, which is an open challenge due to\ndiverse data sources, low signal-to-noise ratio, and high time-validity. To\naddress the challenges, we introduce an open-sourced and data-centric\nframework, \\textit{Financial Generative Pre-trained Transformer (FinGPT)}, that\nautomates the collection and curation of real-time financial data from >34\ndiverse sources on the Internet, providing researchers and practitioners with\naccessible and transparent resources to develop their FinLLMs. Additionally, we\npropose a simple yet effective strategy for fine-tuning FinLLM using the\ninherent feedback from the market, dubbed Reinforcement Learning with Stock\nPrices (RLSP). We also adopt the Low-rank Adaptation (LoRA, QLoRA) method that\nenables users to customize their own FinLLMs from open-source general-purpose\nLLMs at a low cost. Finally, we showcase several FinGPT applications, including\nrobo-advisor, sentiment analysis for algorithmic trading, and low-code\ndevelopment. FinGPT aims to democratize FinLLMs, stimulate innovation, and\nunlock new opportunities in open finance. The codes are available at\nhttps://github.com/AI4Finance-Foundation/FinGPT and\nhttps://github.com/AI4Finance-Foundation/FinNLP\n","authors":["Xiao-Yang Liu","Guoxuan Wang","Daochen Zha"],"pdf_url":"https://arxiv.org/pdf/2307.10485v1.pdf","comment":"43 pages, 9 tables, and 3 figures"},{"id":"http://arxiv.org/abs/2307.10476v1","updated":"2023-07-19T22:14:58Z","published":"2023-07-19T22:14:58Z","title":"What can we learn from Data Leakage and Unlearning for Law?","summary":" Large Language Models (LLMs) have a privacy concern because they memorize\ntraining data (including personally identifiable information (PII) like emails\nand phone numbers) and leak it during inference. A company can train an LLM on\nits domain-customized data which can potentially also include their users' PII.\nIn order to comply with privacy laws such as the \"right to be forgotten\", the\ndata points of users that are most vulnerable to extraction could be deleted.\nWe find that once the most vulnerable points are deleted, a new set of points\nbecome vulnerable to extraction. So far, little attention has been given to\nunderstanding memorization for fine-tuned models. In this work, we also show\nthat not only do fine-tuned models leak their training data but they also leak\nthe pre-training data (and PII) memorized during the pre-training phase. The\nproperty of new data points becoming vulnerable to extraction after unlearning\nand leakage of pre-training data through fine-tuned models can pose significant\nprivacy and legal concerns for companies that use LLMs to offer services. 
We\nhope this work will start an interdisciplinary discussion within AI and law\ncommunities regarding the need for policies to tackle these issues.\n","authors":["Jaydeep Borkar"],"pdf_url":"https://arxiv.org/pdf/2307.10476v1.pdf","comment":"5 pages, 8 figures, accepted to the first GenLaw workshop at ICML'23,\n Hawai'i"},{"id":"http://arxiv.org/abs/2307.10475v1","updated":"2023-07-19T22:14:49Z","published":"2023-07-19T22:14:49Z","title":"Findings of Factify 2: Multimodal Fake News Detection","summary":" With social media usage growing exponentially in the past few years, fake\nnews has also become extremely prevalent. The detrimental impact of fake news\nemphasizes the need for research focused on automating the detection of false\ninformation and verifying its accuracy. In this work, we present the outcome of\nthe Factify 2 shared task, which provides a multi-modal fact verification and\nsatire news dataset, as part of the DeFactify 2 workshop at AAAI'23. The data\ncalls for a comparison based approach to the task by pairing social media\nclaims with supporting documents, with both text and image, divided into 5\nclasses based on multi-modal relations. In the second iteration of this task we\nhad over 60 participants and 9 final test-set submissions. The best\nperformances came from the use of DeBERTa for text and Swinv2 and CLIP for\nimage. The highest F1 score averaged for all five classes was 81.82%.\n","authors":["S Suryavardan","Shreyash Mishra","Megha Chakraborty","Parth Patwa","Anku Rani","Aman Chadha","Aishwarya Reganti","Amitava Das","Amit Sheth","Manoj Chinnakotla","Asif Ekbal","Srijan Kumar"],"pdf_url":"https://arxiv.org/pdf/2307.10475v1.pdf","comment":"Defactify2 @AAAI 2023"},{"id":"http://arxiv.org/abs/2307.10472v1","updated":"2023-07-19T22:03:40Z","published":"2023-07-19T22:03:40Z","title":"Can Instruction Fine-Tuned Language Models Identify Social Bias through\n Prompting?","summary":" As the breadth and depth of language model applications continue to expand\nrapidly, it is increasingly important to build efficient frameworks for\nmeasuring and mitigating the learned or inherited social biases of these\nmodels. In this paper, we present our work on evaluating instruction fine-tuned\nlanguage models' ability to identify bias through zero-shot prompting,\nincluding Chain-of-Thought (CoT) prompts. Across LLaMA and its two instruction\nfine-tuned versions, Alpaca 7B performs best on the bias identification task\nwith an accuracy of 56.7%. We also demonstrate that scaling up LLM size and\ndata diversity could lead to further performance gain. This is a\nwork-in-progress presenting the first component of our bias mitigation\nframework. We will keep updating this work as we get more results.\n","authors":["Omkar Dige","Jacob-Junqi Tian","David Emerson","Faiza Khan Khattak"],"pdf_url":"https://arxiv.org/pdf/2307.10472v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10457v1","updated":"2023-07-19T21:00:16Z","published":"2023-07-19T21:00:16Z","title":"Improving Pre-trained Language Models' Generalization","summary":" The reusability of state-of-the-art Pre-trained Language Models (PLMs) is\noften limited by their generalization problem, where their performance\ndrastically decreases when evaluated on examples that differ from the training\ndataset, known as Out-of-Distribution (OOD)/unseen examples. This limitation\narises from PLMs' reliance on spurious correlations, which work well for\nfrequent example types but not for general examples. 
To address this issue, we\npropose a training approach called Mask-tuning, which integrates Masked\nLanguage Modeling (MLM) training objectives into the fine-tuning process to\nenhance PLMs' generalization. Comprehensive experiments demonstrate that\nMask-tuning surpasses current state-of-the-art techniques and enhances PLMs'\ngeneralization on OOD datasets while improving their performance on\nin-distribution datasets. The findings suggest that Mask-tuning improves the\nreusability of PLMs on unseen data, making them more practical and effective\nfor real-world applications.\n","authors":["Somayeh Ghanbarzadeh","Hamid Palangi","Yan Huang","Radames Cruz Moreno","Hamed Khanpour"],"pdf_url":"https://arxiv.org/pdf/2307.10457v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10443v1","updated":"2023-07-19T20:17:37Z","published":"2023-07-19T20:17:37Z","title":"Integrating a Heterogeneous Graph with Entity-aware Self-attention using\n Relative Position Labels for Reading Comprehension Model","summary":" Despite the significant progress made by transformer models in machine\nreading comprehension tasks, they still face limitations in handling complex\nreasoning tasks due to the absence of explicit knowledge in the input sequence.\nThis paper proposes a novel attention pattern to overcome this limitation,\nwhich integrates reasoning knowledge derived from a heterogeneous graph into\nthe transformer architecture using a graph-enhanced self-attention mechanism.\nThe proposed attention pattern comprises three key elements: global-local\nattention for word tokens, graph attention for entity tokens that exhibit\nstrong attention towards tokens connected in the graph as opposed to those\nunconnected, and the consideration of the type of relationship between each\nentity token and word token. This results in optimized attention between the\ntwo if a relationship exists. The pattern is coupled with special relative\nposition labels, allowing it to integrate with LUKE's entity-aware\nself-attention mechanism. The experimental findings corroborate that our model\noutperforms both the cutting-edge LUKE-Graph and the baseline LUKE model on the\nReCoRD dataset that focuses on commonsense reasoning.\n","authors":["Shima Foolad","Kourosh Kiani"],"pdf_url":"https://arxiv.org/pdf/2307.10443v1.pdf","comment":"submitted for Knowledge-Based Systems Journal"},{"id":"http://arxiv.org/abs/2307.10442v1","updated":"2023-07-19T20:16:46Z","published":"2023-07-19T20:16:46Z","title":"Thrust: Adaptively Propels Large Language Models with External Knowledge","summary":" Although large-scale pre-trained language models (PTLMs) are shown to encode\nrich knowledge in their model parameters, the inherent knowledge in PTLMs can\nbe opaque or static, making external knowledge necessary. However, the existing\ninformation retrieval techniques could be costly and may even introduce noisy\nand sometimes misleading knowledge. To address these challenges, we propose the\ninstance-level adaptive propulsion of external knowledge (IAPEK), where we only\nconduct the retrieval when necessary. To achieve this goal, we propose\nmeasuring whether a PTLM contains enough knowledge to solve an instance with a\nnovel metric, Thrust, which leverages the representation distribution of a\nsmall number of seen instances. 
Extensive experiments demonstrate that thrust\nis a good measurement of PTLM models' instance-level knowledgeability.\nMoreover, we can achieve significantly higher cost-efficiency with the Thrust\nscore as the retrieval indicator than the naive usage of external knowledge on\n88% of the evaluated tasks with 26% average performance improvement. Such\nfindings shed light on the real-world practice of knowledge-enhanced LMs with a\nlimited knowledge-seeking budget due to computation latency or costs.\n","authors":["Xinran Zhao","Hongming Zhang","Xiaoman Pan","Wenlin Yao","Dong Yu","Jianshu Chen"],"pdf_url":"https://arxiv.org/pdf/2307.10442v1.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2301.13816v4","updated":"2023-07-19T19:55:31Z","published":"2023-01-31T18:02:26Z","title":"Execution-based Code Generation using Deep Reinforcement Learning","summary":" The utilization of programming language (PL) models, pre-trained on\nlarge-scale code corpora, as a means of automating software engineering\nprocesses has demonstrated considerable potential in streamlining various code\ngeneration tasks such as code completion, code translation, and program\nsynthesis. However, current approaches mainly rely on supervised fine-tuning\nobjectives borrowed from text generation, neglecting unique sequence-level\ncharacteristics of code, including but not limited to compilability as well as\nsyntactic and functional correctness. To address this limitation, we propose\nPPOCoder, a new framework for code generation that synergistically combines\npre-trained PL models with Proximal Policy Optimization (PPO) which is a widely\nused deep reinforcement learning technique. By utilizing non-differentiable\nfeedback from code execution and structure alignment, PPOCoder seamlessly\nintegrates external code-specific knowledge into the model optimization\nprocess. It's important to note that PPOCoder is a task-agnostic and\nmodel-agnostic framework that can be used across different code generation\ntasks and PLs. Extensive experiments on three code generation tasks demonstrate\nthe effectiveness of our proposed approach compared to SOTA methods, achieving\nsignificant improvements in compilation success rates and functional\ncorrectness across different PLs.\n","authors":["Parshin Shojaee","Aneesh Jain","Sindhu Tipirneni","Chandan K. Reddy"],"pdf_url":"https://arxiv.org/pdf/2301.13816v4.pdf","comment":"Published in Transactions on Machine Learning Research (TMLR), 2023"},{"id":"http://arxiv.org/abs/2307.10432v1","updated":"2023-07-19T19:40:34Z","published":"2023-07-19T19:40:34Z","title":"PharmacyGPT: The AI Pharmacist","summary":" In this study, we introduce PharmacyGPT, a novel framework to assess the\ncapabilities of large language models (LLMs) such as ChatGPT and GPT-4 in\nemulating the role of clinical pharmacists. Our methodology encompasses the\nutilization of LLMs to generate comprehensible patient clusters, formulate\nmedication plans, and forecast patient outcomes. We conduct our investigation\nusing real data acquired from the intensive care unit (ICU) at the University\nof North Carolina Chapel Hill (UNC) Hospital. Our analysis offers valuable\ninsights into the potential applications and limitations of LLMs in the field\nof clinical pharmacy, with implications for both patient care and the\ndevelopment of future AI-driven healthcare solutions. 
By evaluating the\nperformance of PharmacyGPT, we aim to contribute to the ongoing discourse\nsurrounding the integration of artificial intelligence in healthcare settings,\nultimately promoting the responsible and efficacious use of such technologies.\n","authors":["Zhengliang Liu","Zihao Wu","Mengxuan Hu","Bokai Zhao","Lin Zhao","Tianyi Zhang","Haixing Dai","Xianyan Chen","Ye Shen","Sheng Li","Brian Murray","Tianming Liu","Andrea Sikora"],"pdf_url":"https://arxiv.org/pdf/2307.10432v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.09826v2","updated":"2023-07-19T19:30:52Z","published":"2023-04-16T11:22:59Z","title":"Fairness in AI and Its Long-Term Implications on Society","summary":" Successful deployment of artificial intelligence (AI) in various settings has\nled to numerous positive outcomes for individuals and society. However, AI\nsystems have also been shown to harm parts of the population due to biased\npredictions. AI fairness focuses on mitigating such biases to ensure AI\ndecision making is not discriminatory towards certain groups. We take a closer\nlook at AI fairness and analyze how lack of AI fairness can lead to deepening\nof biases over time and act as a social stressor. More specifically, we discuss\nhow biased models can lead to more negative real-world outcomes for certain\ngroups, which may then become more prevalent by deploying new AI models trained\non increasingly biased data, resulting in a feedback loop. If the issues\npersist, they could be reinforced by interactions with other risks and have\nsevere implications on society in the form of social unrest. We examine current\nstrategies for improving AI fairness, assess their limitations in terms of\nreal-world deployment, and explore potential paths forward to ensure we reap\nAI's benefits without causing society's collapse.\n","authors":["Ondrej Bohdal","Timothy Hospedales","Philip H. S. Torr","Fazl Barez"],"pdf_url":"https://arxiv.org/pdf/2304.09826v2.pdf","comment":"Stanford Existential Risks Conference 2023"},{"id":"http://arxiv.org/abs/2306.17582v2","updated":"2023-07-19T19:30:28Z","published":"2023-02-20T06:39:06Z","title":"ChatGPT for Robotics: Design Principles and Model Abilities","summary":" This paper presents an experimental study regarding the use of OpenAI's\nChatGPT for robotics applications. We outline a strategy that combines design\nprinciples for prompt engineering and the creation of a high-level function\nlibrary which allows ChatGPT to adapt to different robotics tasks, simulators,\nand form factors. We focus our evaluations on the effectiveness of different\nprompt engineering techniques and dialog strategies towards the execution of\nvarious types of robotics tasks. We explore ChatGPT's ability to use free-form\ndialog, parse XML tags, and to synthesize code, in addition to the use of\ntask-specific prompting functions and closed-loop reasoning through dialogues.\nOur study encompasses a range of tasks within the robotics domain, from basic\nlogical, geometrical, and mathematical reasoning all the way to complex domains\nsuch as aerial navigation, manipulation, and embodied agents. We show that\nChatGPT can be effective at solving several of such tasks, while allowing users\nto interact with it primarily via natural language instructions. 
In addition to\nthese studies, we introduce an open-sourced research tool called PromptCraft,\nwhich contains a platform where researchers can collaboratively upload and vote\non examples of good prompting schemes for robotics applications, as well as a\nsample robotics simulator with ChatGPT integration, making it easier for users\nto get started with using ChatGPT for robotics.\n","authors":["Sai Vemprala","Rogerio Bonatti","Arthur Bucker","Ashish Kapoor"],"pdf_url":"https://arxiv.org/pdf/2306.17582v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10323v1","updated":"2023-07-19T07:20:30Z","published":"2023-07-19T07:20:30Z","title":"IncDSI: Incrementally Updatable Document Retrieval","summary":" Differentiable Search Index is a recently proposed paradigm for document\nretrieval, that encodes information about a corpus of documents within the\nparameters of a neural network and directly maps queries to corresponding\ndocuments. These models have achieved state-of-the-art performances for\ndocument retrieval across many benchmarks. These kinds of models have a\nsignificant limitation: it is not easy to add new documents after a model is\ntrained. We propose IncDSI, a method to add documents in real time (about\n20-50ms per document), without retraining the model on the entire dataset (or\neven parts thereof). Instead we formulate the addition of documents as a\nconstrained optimization problem that makes minimal changes to the network\nparameters. Although orders of magnitude faster, our approach is competitive\nwith re-training the model on the whole dataset and enables the development of\ndocument retrieval systems that can be updated with new information in\nreal-time. Our code for IncDSI is available at\nhttps://github.com/varshakishore/IncDSI.\n","authors":["Varsha Kishore","Chao Wan","Justin Lovelace","Yoav Artzi","Kilian Q. Weinberger"],"pdf_url":"https://arxiv.org/pdf/2307.10323v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.00370v2","updated":"2023-07-19T06:55:04Z","published":"2023-07-01T15:44:53Z","title":"Improving Text Matching in E-Commerce Search with A Rationalizable,\n Intervenable and Fast Entity-Based Relevance Model","summary":" Discovering the intended items of user queries from a massive repository of\nitems is one of the main goals of an e-commerce search system. Relevance\nprediction is essential to the search system since it helps improve\nperformance. When online serving a relevance model, the model is required to\nperform fast and accurate inference. Currently, the widely used models such as\nBi-encoder and Cross-encoder have their limitations in accuracy or inference\nspeed respectively. In this work, we propose a novel model called the\nEntity-Based Relevance Model (EBRM). We identify the entities contained in an\nitem and decompose the QI (query-item) relevance problem into multiple QE\n(query-entity) relevance problems; we then aggregate their results to form the\nQI prediction using a soft logic formulation. The decomposition allows us to\nuse a Cross-encoder QE relevance module for high accuracy as well as cache QE\npredictions for fast online inference. Utilizing soft logic makes the\nprediction procedure interpretable and intervenable. We also show that\npretraining the QE module with auto-generated QE data from user logs can\nfurther improve the overall performance. The proposed method is evaluated on\nlabeled data from e-commerce websites. 
Empirical results show that it achieves\npromising improvements with computation efficiency.\n","authors":["Jiong Cai","Yong Jiang","Yue Zhang","Chengyue Jiang","Ke Yu","Jianhui Ji","Rong Xiao","Haihong Tang","Tao Wang","Zhongqiang Huang","Pengjun Xie","Fei Huang","Kewei Tu"],"pdf_url":"https://arxiv.org/pdf/2307.00370v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10314v1","updated":"2023-07-19T03:31:41Z","published":"2023-07-19T03:31:41Z","title":"Mood Classification of Bangla Songs Based on Lyrics","summary":" Music can evoke various emotions, and with the advancement of technology, it\nhas become more accessible to people. Bangla music, which portrays different\nhuman emotions, lacks sufficient research. The authors of this article aim to\nanalyze Bangla songs and classify their moods based on the lyrics. To achieve\nthis, this research has compiled a dataset of 4000 Bangla song lyrics, genres,\nand used Natural Language Processing and the Bert Algorithm to analyze the\ndata. Among the 4000 songs, 1513 songs are represented for the sad mood, 1362\nfor the romantic mood, 886 for happiness, and the rest 239 are classified as\nrelaxation. By embedding the lyrics of the songs, the authors have classified\nthe songs into four moods: Happy, Sad, Romantic, and Relaxed. This research is\ncrucial as it enables a multi-class classification of songs' moods, making the\nmusic more relatable to people's emotions. The article presents the automated\nresult of the four moods accurately derived from the song lyrics.\n","authors":["Maliha Mahajebin","Mohammad Rifat Ahmmad Rashid","Nafees Mansoor"],"pdf_url":"https://arxiv.org/pdf/2307.10314v1.pdf","comment":"Presented at International Conference on. Inventive Communication and\n Computational Technologies 2023"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2307.10173v1","updated":"2023-07-19T17:58:03Z","published":"2023-07-19T17:58:03Z","title":"DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity\n Human-centric Rendering","summary":" Realistic human-centric rendering plays a key role in both computer vision\nand computer graphics. Rapid progress has been made in the algorithm aspect\nover the years, yet existing human-centric rendering datasets and benchmarks\nare rather impoverished in terms of diversity, which are crucial for rendering\neffect. Researchers are usually constrained to explore and evaluate a small set\nof rendering problems on current datasets, while real-world applications\nrequire methods to be robust across different scenarios. In this work, we\npresent DNA-Rendering, a large-scale, high-fidelity repository of human\nperformance data for neural actor rendering. DNA-Rendering presents several\nalluring attributes. First, our dataset contains over 1500 human subjects, 5000\nmotion sequences, and 67.5M frames' data volume. Second, we provide rich assets\nfor each subject -- 2D/3D human body keypoints, foreground masks, SMPLX models,\ncloth/accessory materials, multi-view images, and videos. These assets boost\nthe current method's accuracy on downstream rendering tasks. Third, we\nconstruct a professional multi-view system to capture data, which contains 60\nsynchronous cameras with max 4096 x 3000 resolution, 15 fps speed, and stern\ncamera calibration steps, ensuring high-quality resources for task training and\nevaluation. 
Along with the dataset, we provide a large-scale and quantitative\nbenchmark in full-scale, with multiple tasks to evaluate the existing progress\nof novel view synthesis, novel pose animation synthesis, and novel identity\nrendering methods. In this manuscript, we describe our DNA-Rendering effort as\na revealing of new observations, challenges, and future directions to\nhuman-centric rendering. The dataset, code, and benchmarks will be publicly\navailable at https://dna-rendering.github.io/\n","authors":["Wei Cheng","Ruixiang Chen","Wanqi Yin","Siming Fan","Keyu Chen","Honglin He","Huiwen Luo","Zhongang Cai","Jingbo Wang","Yang Gao","Zhengming Yu","Zhengyu Lin","Daxuan Ren","Lei Yang","Ziwei Liu","Chen Change Loy","Chen Qian","Wayne Wu","Dahua Lin","Bo Dai","Kwan-Yee Lin"],"pdf_url":"https://arxiv.org/pdf/2307.10173v1.pdf","comment":"This paper is accepted by ICCV2023. Project page:\n https://dna-rendering.github.io/"},{"id":"http://arxiv.org/abs/2112.06809v8","updated":"2023-07-19T17:50:21Z","published":"2021-12-13T17:11:32Z","title":"Persistent Animal Identification Leveraging Non-Visual Markers","summary":" Our objective is to locate and provide a unique identifier for each mouse in\na cluttered home-cage environment through time, as a precursor to automated\nbehaviour recognition for biological research. This is a very challenging\nproblem due to (i) the lack of distinguishing visual features for each mouse,\nand (ii) the close confines of the scene with constant occlusion, making\nstandard visual tracking approaches unusable. However, a coarse estimate of\neach mouse's location is available from a unique RFID implant, so there is the\npotential to optimally combine information from (weak) tracking with coarse\ninformation on identity. To achieve our objective, we make the following key\ncontributions: (a) the formulation of the object identification problem as an\nassignment problem (solved using Integer Linear Programming), and (b) a novel\nprobabilistic model of the affinity between tracklets and RFID data. The latter\nis a crucial part of the model, as it provides a principled probabilistic\ntreatment of object detections given coarse localisation. Our approach achieves\n77% accuracy on this animal identification problem, and is able to reject\nspurious detections when the animals are hidden.\n","authors":["Michael P. J. Camilleri","Li Zhang","Rasneer S. Bains","Andrew Zisserman","Christopher K. I. Williams"],"pdf_url":"https://arxiv.org/pdf/2112.06809v8.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10166v1","updated":"2023-07-19T17:50:03Z","published":"2023-07-19T17:50:03Z","title":"Adversarial Latent Autoencoder with Self-Attention for Structural Image\n Synthesis","summary":" Generative Engineering Design approaches driven by Deep Generative Models\n(DGM) have been proposed to facilitate industrial engineering processes. In\nsuch processes, designs often come in the form of images, such as blueprints,\nengineering drawings, and CAD models depending on the level of detail. DGMs\nhave been successfully employed for synthesis of natural images, e.g.,\ndisplaying animals, human faces and landscapes. However, industrial design\nimages are fundamentally different from natural scenes in that they contain\nrich structural patterns and long-range dependencies, which are challenging for\nconvolution-based DGMs to generate. 
Moreover, DGM-driven generation process is\ntypically triggered based on random noisy inputs, which outputs unpredictable\nsamples and thus cannot perform an efficient industrial design exploration. We\ntackle these challenges by proposing a novel model Self-Attention Adversarial\nLatent Autoencoder (SA-ALAE), which allows generating feasible design images of\ncomplex engineering parts. With SA-ALAE, users can not only explore novel\nvariants of an existing design, but also control the generation process by\noperating in latent space. The potential of SA-ALAE is shown by generating\nengineering blueprints in a real automotive design task.\n","authors":["Jiajie Fan","Laure Vuaille","Hao Wang","Thomas Bäck"],"pdf_url":"https://arxiv.org/pdf/2307.10166v1.pdf","comment":"18 pages, 8 figures"},{"id":"http://arxiv.org/abs/2307.10165v1","updated":"2023-07-19T17:46:55Z","published":"2023-07-19T17:46:55Z","title":"Drone navigation and license place detection for vehicle location in\n indoor spaces","summary":" Millions of vehicles are transported every year, tightly parked in vessels or\nboats. To reduce the risks of associated safety issues like fires, knowing the\nlocation of vehicles is essential, since different vehicles may need different\nmitigation measures, e.g. electric cars. This work is aimed at creating a\nsolution based on a nano-drone that navigates across rows of parked vehicles\nand detects their license plates. We do so via a wall-following algorithm, and\na CNN trained to detect license plates. All computations are done in real-time\non the drone, which just sends position and detected images that allow the\ncreation of a 2D map with the position of the plates. Our solution is capable\nof reading all plates across eight test cases (with several rows of plates,\ndifferent drone speeds, or low light) by aggregation of measurements across\nseveral drone journeys.\n","authors":["Moa Arvidsson","Sithichot Sawirot","Cristofer Englund","Fernando Alonso-Fernandez","Martin Torstensson","Boris Duran"],"pdf_url":"https://arxiv.org/pdf/2307.10165v1.pdf","comment":"Published at VIII International Workshop on Artificial Intelligence\n and Pattern Recognition, IWAIPR"},{"id":"http://arxiv.org/abs/2307.10160v1","updated":"2023-07-19T17:42:36Z","published":"2023-07-19T17:42:36Z","title":"Robust Driving Policy Learning with Guided Meta Reinforcement Learning","summary":" Although deep reinforcement learning (DRL) has shown promising results for\nautonomous navigation in interactive traffic scenarios, existing work typically\nadopts a fixed behavior policy to control social vehicles in the training\nenvironment. This may cause the learned driving policy to overfit the\nenvironment, making it difficult to interact well with vehicles with different,\nunseen behaviors. In this work, we introduce an efficient method to train\ndiverse driving policies for social vehicles as a single meta-policy. By\nrandomizing the interaction-based reward functions of social vehicles, we can\ngenerate diverse objectives and efficiently train the meta-policy through\nguiding policies that achieve specific objectives. We further propose a\ntraining strategy to enhance the robustness of the ego vehicle's driving policy\nusing the environment where social vehicles are controlled by the learned\nmeta-policy. 
Our method successfully learns an ego driving policy that\ngeneralizes well to unseen situations with out-of-distribution (OOD) social\nagents' behaviors in a challenging uncontrolled T-intersection scenario.\n","authors":["Kanghoon Lee","Jiachen Li","David Isele","Jinkyoo Park","Kikuo Fujimura","Mykel J. Kochenderfer"],"pdf_url":"https://arxiv.org/pdf/2307.10160v1.pdf","comment":"ITSC 2023"},{"id":"http://arxiv.org/abs/2307.10159v1","updated":"2023-07-19T17:39:39Z","published":"2023-07-19T17:39:39Z","title":"FABRIC: Personalizing Diffusion Models with Iterative Feedback","summary":" In an era where visual content generation is increasingly driven by machine\nlearning, the integration of human feedback into generative models presents\nsignificant opportunities for enhancing user experience and output quality.\nThis study explores strategies for incorporating iterative human feedback into\nthe generative process of diffusion-based text-to-image models. We propose\nFABRIC, a training-free approach applicable to a wide range of popular\ndiffusion models, which exploits the self-attention layer present in the most\nwidely used architectures to condition the diffusion process on a set of\nfeedback images. To ensure a rigorous assessment of our approach, we introduce\na comprehensive evaluation methodology, offering a robust mechanism to quantify\nthe performance of generative visual models that integrate human feedback. We\nshow that generation results improve over multiple rounds of iterative feedback\nthrough exhaustive analysis, implicitly optimizing arbitrary user preferences.\nThe potential applications of these findings extend to fields such as\npersonalized content creation and customization.\n","authors":["Dimitri von Rütte","Elisabetta Fedele","Jonathan Thomm","Lukas Wolf"],"pdf_url":"https://arxiv.org/pdf/2307.10159v1.pdf","comment":"14 pages, 7 figures"},{"id":"http://arxiv.org/abs/2307.10157v1","updated":"2023-07-19T17:38:26Z","published":"2023-07-19T17:38:26Z","title":"Leveraging Visemes for Better Visual Speech Representation and Lip\n Reading","summary":" Lip reading is a challenging task that has many potential applications in\nspeech recognition, human-computer interaction, and security systems. However,\nexisting lip reading systems often suffer from low accuracy due to the\nlimitations of video features. In this paper, we propose a novel approach that\nleverages visemes, which are groups of phonetically similar lip shapes, to\nextract more discriminative and robust video features for lip reading. We\nevaluate our approach on various tasks, including word-level and sentence-level\nlip reading, and audiovisual speech recognition using the Arman-AV dataset, a\nlarge-scale Persian corpus. Our experimental results show that our viseme based\napproach consistently outperforms the state-of-the-art methods in all these\ntasks. 
The proposed method reduces the lip-reading word error rate (WER) by\n9.1% relative to the best previous method.\n","authors":["Javad Peymanfard","Vahid Saeedi","Mohammad Reza Mohammadi","Hossein Zeinali","Nasser Mozayani"],"pdf_url":"https://arxiv.org/pdf/2307.10157v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10135v1","updated":"2023-07-19T17:00:45Z","published":"2023-07-19T17:00:45Z","title":"An Improved NeuMIP with Better Accuracy","summary":" Neural reflectance models are capable of accurately reproducing the\nspatially-varying appearance of many real-world materials at different scales.\nHowever, existing methods have difficulties handling highly glossy materials.\nTo address this problem, we introduce a new neural reflectance model which,\ncompared with existing methods, better preserves not only specular highlights\nbut also fine-grained details. To this end, we enhance the neural network\nperformance by encoding input data to frequency space, inspired by NeRF, to\nbetter preserve the details. Furthermore, we introduce a gradient-based loss\nand employ it in multiple stages, adaptive to the progress of the learning\nphase. Lastly, we utilize an optional extension to the decoder network using\nthe Inception module for more accurate yet costly performance. We demonstrate\nthe effectiveness of our method using a variety of synthetic and real examples.\n","authors":["Bowen Xue","Shuang Zhao","Henrik Wann Jensen","Zahra Montazeri"],"pdf_url":"https://arxiv.org/pdf/2307.10135v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10129v1","updated":"2023-07-19T16:51:59Z","published":"2023-07-19T16:51:59Z","title":"General vs. Long-Tailed Age Estimation: An Approach to Kill Two Birds\n with One Stone","summary":" Facial age estimation has received a lot of attention for its diverse\napplication scenarios. Most existing studies treat each sample equally and aim\nto reduce the average estimation error for the entire dataset, which can be\nsummarized as General Age Estimation. However, due to the long-tailed\ndistribution prevalent in the dataset, treating all samples equally will\ninevitably bias the model toward the head classes (usually the adult with a\nmajority of samples). Driven by this, some works suggest that each class should\nbe treated equally to improve performance in tail classes (with a minority of\nsamples), which can be summarized as Long-tailed Age Estimation. However,\nLong-tailed Age Estimation usually faces a performance trade-off, i.e.,\nachieving improvement in tail classes by sacrificing the head classes. In this\npaper, our goal is to design a unified framework to perform well on both tasks,\nkilling two birds with one stone. To this end, we propose a simple, effective,\nand flexible training paradigm named GLAE, which is two-fold. Our GLAE provides\na surprising improvement on Morph II, reaching the lowest MAE and CMAE of 1.14\nand 1.27 years, respectively. Compared to the previous best method, MAE dropped\nby up to 34%, which is an unprecedented improvement, and for the first time,\nMAE is close to 1 year old. 
Extensive experiments on other age benchmark\ndatasets, including CACD, MIVIA, and Chalearn LAP 2015, also indicate that GLAE\noutperforms the state-of-the-art approaches significantly.\n","authors":["Zenghao Bao","Zichang Tan","Jun Li","Jun Wan","Xibo Ma","Zhen Lei"],"pdf_url":"https://arxiv.org/pdf/2307.10129v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10123v1","updated":"2023-07-19T16:42:52Z","published":"2023-07-19T16:42:52Z","title":"Two Approaches to Supervised Image Segmentation","summary":" Though performed almost effortlessly by humans, segmenting 2D gray-scale or\ncolor images in terms of their constituent regions of interest\n(e.g.~background, objects or portions of objects) constitutes one of the\ngreatest challenges in science and technology as a consequence of the involved\ndimensionality reduction(3D to 2D), noise, reflections, shades, and occlusions,\namong many other possible effects. While a large number of interesting\napproaches have been respectively suggested along the last decades, it was\nmainly with the more recent development of deep learning that more effective\nand general solutions have been obtained, currently constituting the basic\ncomparison reference for this type of operation. Also developed recently, a\nmultiset-based methodology has been described that is capable of encouraging\nperformance that combines spatial accuracy, stability, and robustness while\nrequiring minimal computational resources (hardware and/or training and\nrecognition time). The interesting features of the latter methodology mostly\nfollow from the enhanced selectivity and sensitivity, as well as good\nrobustness to data perturbations and outliers, allowed by the coincidence\nsimilarity index on which the multiset approach to supervised image\nsegmentation is based. After describing the deep learning and multiset\napproaches, the present work develops two comparison experiments between them\nwhich are primarily aimed at illustrating their respective main interesting\nfeatures when applied to the adopted specific type of data and parameter\nconfigurations. While the deep learning approach confirmed its potential for\nperforming image segmentation, the alternative multiset methodology allowed for\nencouraging accuracy while requiring little computational resources.\n","authors":["Alexandre Benatti","Luciano da F. Costa"],"pdf_url":"https://arxiv.org/pdf/2307.10123v1.pdf","comment":"37 pages, 18 figures"},{"id":"http://arxiv.org/abs/2103.03328v3","updated":"2023-07-19T16:19:53Z","published":"2021-03-04T20:58:22Z","title":"Evaluation of Complexity Measures for Deep Learning Generalization in\n Medical Image Analysis","summary":" The generalization performance of deep learning models for medical image\nanalysis often decreases on images collected with different devices for data\nacquisition, device settings, or patient population. A better understanding of\nthe generalization capacity on new images is crucial for clinicians'\ntrustworthiness in deep learning. Although significant research efforts have\nbeen recently directed toward establishing generalization bounds and complexity\nmeasures, still, there is often a significant discrepancy between the predicted\nand actual generalization performance. 
As well, related large empirical studies\nhave been primarily based on validation with general-purpose image datasets.\nThis paper presents an empirical study that investigates the correlation\nbetween 25 complexity measures and the generalization abilities of supervised\ndeep learning classifiers for breast ultrasound images. The results indicate\nthat PAC-Bayes flatness-based and path norm-based measures produce the most\nconsistent explanation for the combination of models and data. We also\ninvestigate the use of multi-task classification and segmentation approach for\nbreast images, and report that such learning approach acts as an implicit\nregularizer and is conducive toward improved generalization.\n","authors":["Aleksandar Vakanski","Min Xian"],"pdf_url":"https://arxiv.org/pdf/2103.03328v3.pdf","comment":"15 pages, 4 figures"},{"id":"http://arxiv.org/abs/2307.10097v1","updated":"2023-07-19T16:12:37Z","published":"2023-07-19T16:12:37Z","title":"Boundary-Refined Prototype Generation: A General End-to-End Paradigm for\n Semi-Supervised Semantic Segmentation","summary":" Prototype-based classification is a classical method in machine learning, and\nrecently it has achieved remarkable success in semi-supervised semantic\nsegmentation. However, the current approach isolates the prototype\ninitialization process from the main training framework, which appears to be\nunnecessary. Furthermore, while the direct use of K-Means algorithm for\nprototype generation has considered rich intra-class variance, it may not be\nthe optimal solution for the classification task. To tackle these problems, we\npropose a novel boundary-refined prototype generation (BRPG) method, which is\nincorporated into the whole training framework. Specifically, our approach\nsamples and clusters high- and low-confidence features separately based on a\nconfidence threshold, aiming to generate prototypes closer to the class\nboundaries. Moreover, an adaptive prototype optimization strategy is introduced\nto make prototype augmentation for categories with scattered feature\ndistributions. Extensive experiments on the PASCAL VOC 2012 and Cityscapes\ndatasets demonstrate the superiority and scalability of the proposed method,\noutperforming the current state-of-the-art approaches. The code is available at\nxxxxxxxxxxxxxx.\n","authors":["Junhao Dong","Zhu Meng","Delong Liu","Zhicheng Zhao","Fei Su"],"pdf_url":"https://arxiv.org/pdf/2307.10097v1.pdf","comment":"53 pages, 7 figures"},{"id":"http://arxiv.org/abs/2303.13479v2","updated":"2023-07-19T16:11:13Z","published":"2023-03-23T17:48:12Z","title":"IST-Net: Prior-free Category-level Pose Estimation with Implicit Space\n Transformation","summary":" Category-level 6D pose estimation aims to predict the poses and sizes of\nunseen objects from a specific category. Thanks to prior deformation, which\nexplicitly adapts a category-specific 3D prior (i.e., a 3D template) to a given\nobject instance, prior-based methods attained great success and have become a\nmajor research stream. However, obtaining category-specific priors requires\ncollecting a large amount of 3D models, which is labor-consuming and often not\naccessible in practice. This motivates us to investigate whether priors are\nnecessary to make prior-based methods effective. Our empirical study shows that\nthe 3D prior itself is not the credit to the high performance. 
The keypoint\nactually is the explicit deformation process, which aligns camera and world\ncoordinates supervised by world-space 3D models (also called canonical space).\nInspired by these observations, we introduce a simple prior-free implicit space\ntransformation network, namely IST-Net, to transform camera-space features to\nworld-space counterparts and build correspondence between them in an implicit\nmanner without relying on 3D priors. Besides, we design camera- and world-space\nenhancers to enrich the features with pose-sensitive information and\ngeometrical constraints, respectively. Albeit simple, IST-Net achieves\nstate-of-the-art performance based-on prior-free design, with top inference\nspeed on the REAL275 benchmark. Our code and models are available at\nhttps://github.com/CVMI-Lab/IST-Net.\n","authors":["Jianhui Liu","Yukang Chen","Xiaoqing Ye","Xiaojuan Qi"],"pdf_url":"https://arxiv.org/pdf/2303.13479v2.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2307.10094v1","updated":"2023-07-19T16:01:09Z","published":"2023-07-19T16:01:09Z","title":"Make-A-Volume: Leveraging Latent Diffusion Models for Cross-Modality 3D\n Brain MRI Synthesis","summary":" Cross-modality medical image synthesis is a critical topic and has the\npotential to facilitate numerous applications in the medical imaging field.\nDespite recent successes in deep-learning-based generative models, most current\nmedical image synthesis methods rely on generative adversarial networks and\nsuffer from notorious mode collapse and unstable training. Moreover, the 2D\nbackbone-driven approaches would easily result in volumetric inconsistency,\nwhile 3D backbones are challenging and impractical due to the tremendous memory\ncost and training difficulty. In this paper, we introduce a new paradigm for\nvolumetric medical data synthesis by leveraging 2D backbones and present a\ndiffusion-based framework, Make-A-Volume, for cross-modality 3D medical image\nsynthesis. To learn the cross-modality slice-wise mapping, we employ a latent\ndiffusion model and learn a low-dimensional latent space, resulting in high\ncomputational efficiency. To enable the 3D image synthesis and mitigate\nvolumetric inconsistency, we further insert a series of volumetric layers in\nthe 2D slice-mapping model and fine-tune them with paired 3D data. This\nparadigm extends the 2D image diffusion model to a volumetric version with a\nslightly increasing number of parameters and computation, offering a principled\nsolution for generic cross-modality 3D medical image synthesis. We showcase the\neffectiveness of our Make-A-Volume framework on an in-house SWI-MRA brain MRI\ndataset and a public T1-T2 brain MRI dataset. Experimental results demonstrate\nthat our framework achieves superior synthesis results with volumetric\nconsistency.\n","authors":["Lingting Zhu","Zeyue Xue","Zhenchao Jin","Xian Liu","Jingzhen He","Ziwei Liu","Lequan Yu"],"pdf_url":"https://arxiv.org/pdf/2307.10094v1.pdf","comment":"Accepted by International Conference on Medical Image Computing and\n Computer Assisted Intervention (MICCAI 2023). 10 pages, 4 figures"},{"id":"http://arxiv.org/abs/2207.00419v3","updated":"2023-07-19T16:00:08Z","published":"2022-06-18T00:26:52Z","title":"Self-Supervised Learning for Videos: A Survey","summary":" The remarkable success of deep learning in various domains relies on the\navailability of large-scale annotated datasets. 
However, obtaining annotations\nis expensive and requires great effort, which is especially challenging for\nvideos. Moreover, the use of human-generated annotations leads to models with\nbiased learning and poor domain generalization and robustness. As an\nalternative, self-supervised learning provides a way for representation\nlearning which does not require annotations and has shown promise in both image\nand video domains. Different from the image domain, learning video\nrepresentations are more challenging due to the temporal dimension, bringing in\nmotion and other environmental dynamics. This also provides opportunities for\nvideo-exclusive ideas that advance self-supervised learning in the video and\nmultimodal domain. In this survey, we provide a review of existing approaches\non self-supervised learning focusing on the video domain. We summarize these\nmethods into four different categories based on their learning objectives: 1)\npretext tasks, 2) generative learning, 3) contrastive learning, and 4)\ncross-modal agreement. We further introduce the commonly used datasets,\ndownstream evaluation tasks, insights into the limitations of existing works,\nand the potential future directions in this area.\n","authors":["Madeline C. Schiappa","Yogesh S. Rawat","Mubarak Shah"],"pdf_url":"https://arxiv.org/pdf/2207.00419v3.pdf","comment":"ACM CSUR (December 2022). Project Link: https://bit.ly/3Oimc7Q"},{"id":"http://arxiv.org/abs/2307.04838v2","updated":"2023-07-19T15:59:03Z","published":"2023-07-10T18:15:03Z","title":"CREPE: Learnable Prompting With CLIP Improves Visual Relationship\n Prediction","summary":" In this paper, we explore the potential of Vision-Language Models (VLMs),\nspecifically CLIP, in predicting visual object relationships, which involves\ninterpreting visual features from images into language-based relations. Current\nstate-of-the-art methods use complex graphical models that utilize language\ncues and visual features to address this challenge. We hypothesize that the\nstrong language priors in CLIP embeddings can simplify these graphical models\npaving for a simpler approach. We adopt the UVTransE relation prediction\nframework, which learns the relation as a translational embedding with subject,\nobject, and union box embeddings from a scene. We systematically explore the\ndesign of CLIP-based subject, object, and union-box representations within the\nUVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate\nEstimation). CREPE utilizes text-based representations for all three bounding\nboxes and introduces a novel contrastive training strategy to automatically\ninfer the text prompt for union-box. Our approach achieves state-of-the-art\nperformance in predicate estimation, mR@5 27.79, and mR@20 31.95 on the Visual\nGenome benchmark, achieving a 15.3\\% gain in performance over recent\nstate-of-the-art at mR@20. This work demonstrates CLIP's effectiveness in\nobject relation prediction and encourages further research on VLMs in this\nchallenging domain.\n","authors":["Rakshith Subramanyam","T. S. Jayram","Rushil Anirudh","Jayaraman J. Thiagarajan"],"pdf_url":"https://arxiv.org/pdf/2307.04838v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.07894v3","updated":"2023-07-19T15:57:12Z","published":"2023-06-13T16:39:39Z","title":"iSLAM: Imperative SLAM","summary":" Simultaneous localization and mapping (SLAM) stands as one of the critical\nchallenges in robot navigation. 
Recent advancements suggest that methods based\non supervised learning deliver impressive performance in front-end odometry,\nwhile traditional optimization-based methods still play a vital role in the\nback-end for minimizing estimation drift. In this paper, we found that such\ndecoupled paradigm can lead to only sub-optimal performance, consequently\ncurtailing system capabilities and generalization potential. To solve this\nproblem, we proposed a novel self-supervised learning framework, imperative\nSLAM (iSLAM), which fosters reciprocal correction between the front-end and\nback-end, thus enhancing performance without necessitating any external\nsupervision. Specifically, we formulate a SLAM system as a bi-level\noptimization problem so that the two components are bidirectionally connected.\nAs a result, the front-end model is able to learn global geometric knowledge\nobtained through pose graph optimization by back-propagating the residuals from\nthe back-end. This significantly improves the generalization ability of the\nentire system and thus achieves the accuracy improvement up to 45%. To the best\nof our knowledge, iSLAM is the first SLAM system showing that the front-end and\nback-end can learn jointly and mutually contribute to each other in a\nself-supervised manner.\n","authors":["Taimeng Fu","Shaoshu Su","Chen Wang"],"pdf_url":"https://arxiv.org/pdf/2306.07894v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10062v1","updated":"2023-07-19T15:33:11Z","published":"2023-07-19T15:33:11Z","title":"Unsupervised Accuracy Estimation of Deep Visual Models using\n Domain-Adaptive Adversarial Perturbation without Source Samples","summary":" Deploying deep visual models can lead to performance drops due to the\ndiscrepancies between source and target distributions. Several approaches\nleverage labeled source data to estimate target domain accuracy, but accessing\nlabeled source data is often prohibitively difficult due to data\nconfidentiality or resource limitations on serving devices. Our work proposes a\nnew framework to estimate model accuracy on unlabeled target data without\naccess to source data. We investigate the feasibility of using pseudo-labels\nfor accuracy estimation and evolve this idea into adopting recent advances in\nsource-free domain adaptation algorithms. Our approach measures the\ndisagreement rate between the source hypothesis and the target pseudo-labeling\nfunction, adapted from the source hypothesis. We mitigate the impact of\nerroneous pseudo-labels that may arise due to a high ideal joint hypothesis\nrisk by employing adaptive adversarial perturbation on the input of the target\nmodel. Our proposed source-free framework effectively addresses the challenging\ndistribution shift scenarios and outperforms existing methods requiring source\ndata and labels for training.\n","authors":["JoonHo Lee","Jae Oh Woo","Hankyu Moon","Kwonho Lee"],"pdf_url":"https://arxiv.org/pdf/2307.10062v1.pdf","comment":"Accepted to ICCV 2023"},{"id":"http://arxiv.org/abs/2307.10046v1","updated":"2023-07-19T15:22:06Z","published":"2023-07-19T15:22:06Z","title":"Divert More Attention to Vision-Language Object Tracking","summary":" Multimodal vision-language (VL) learning has noticeably pushed the tendency\ntoward generic intelligence owing to emerging large foundation models. However,\ntracking, as a fundamental vision problem, surprisingly enjoys less bonus from\nrecent flourishing VL learning. 
We argue that the reasons are two-fold: the\nlack of large-scale vision-language annotated videos and ineffective\nvision-language interaction learning of current works. These nuisances motivate\nus to design more effective vision-language representation for tracking,\nmeanwhile constructing a large database with language annotation for model\nlearning. Particularly, in this paper, we first propose a general attribute\nannotation strategy to decorate videos in six popular tracking benchmarks,\nwhich contributes a large-scale vision-language tracking database with more\nthan 23,000 videos. We then introduce a novel framework to improve tracking by\nlearning a unified-adaptive VL representation, where the cores are the proposed\nasymmetric architecture search and modality mixer (ModaMixer). To further\nimprove VL representation, we introduce a contrastive loss to align different\nmodalities. To thoroughly evidence the effectiveness of our method, we\nintegrate the proposed framework on three tracking methods with different\ndesigns, i.e., the CNN-based SiamCAR, the Transformer-based OSTrack, and the\nhybrid structure TransT. The experiments demonstrate that our framework can\nsignificantly improve all baselines on six benchmarks. Besides empirical\nresults, we theoretically analyze our approach to show its rationality. By\nrevealing the potential of VL representation, we expect the community to divert\nmore attention to VL tracking and hope to open more possibilities for future\ntracking with diversified multimodal messages.\n","authors":["Mingzhe Guo","Zhipeng Zhang","Liping Jing","Haibin Ling","Heng Fan"],"pdf_url":"https://arxiv.org/pdf/2307.10046v1.pdf","comment":"16 pages, 9 figures"},{"id":"http://arxiv.org/abs/2307.10036v1","updated":"2023-07-19T15:19:02Z","published":"2023-07-19T15:19:02Z","title":"Class Attention to Regions of Lesion for Imbalanced Medical Image\n Recognition","summary":" Automated medical image classification is the key component in intelligent\ndiagnosis systems. However, most medical image datasets contain plenty of\nsamples of common diseases and just a handful of rare ones, leading to major\nclass imbalances. Currently, it is an open problem in intelligent diagnosis to\neffectively learn from imbalanced training data. In this paper, we propose a\nsimple yet effective framework, named \\textbf{C}lass \\textbf{A}ttention to\n\\textbf{RE}gions of the lesion (CARE), to handle data imbalance issues by\nembedding attention into the training process of \\textbf{C}onvolutional\n\\textbf{N}eural \\textbf{N}etworks (CNNs). The proposed attention module helps\nCNNs attend to lesion regions of rare diseases, therefore helping CNNs to learn\ntheir characteristics more effectively. In addition, this attention module\nworks only during the training phase and does not change the architecture of\nthe original network, so it can be directly combined with any existing CNN\narchitecture. The CARE framework needs bounding boxes to represent the lesion\nregions of rare diseases. To alleviate the need for manual annotation, we\nfurther developed variants of CARE by leveraging the traditional saliency\nmethods or a pretrained segmentation model for bounding box generation. Results\nshow that the CARE variants with automated bounding box generation are\ncomparable to the original CARE framework with \\textit{manual} bounding box\nannotations. 
A series of experiments on an imbalanced skin image dataset and a\npneumonia dataset indicates that our method can effectively help the network\nfocus on the lesion regions of rare diseases and remarkably improves the\nclassification performance of rare diseases.\n","authors":["Jia-Xin Zhuang","Jiabin Cai","Jianguo Zhang","Wei-shi Zheng","Ruixuan Wang"],"pdf_url":"https://arxiv.org/pdf/2307.10036v1.pdf","comment":"Accepted by Neurocomputing on July 2023. 37 pages"},{"id":"http://arxiv.org/abs/2307.06385v2","updated":"2023-07-19T14:51:37Z","published":"2023-07-12T18:13:58Z","title":"Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event\n Localization","summary":" Audio-Visual Event Localization (AVEL) is the task of temporally localizing\nand classifying \\emph{audio-visual events}, i.e., events simultaneously visible\nand audible in a video. In this paper, we solve AVEL in a weakly-supervised\nsetting, where only video-level event labels (their presence/absence, but not\ntheir locations in time) are available as supervision for training. Our idea is\nto use a base model to estimate labels on the training data at a finer temporal\nresolution than at the video level and re-train the model with these labels.\nI.e., we determine the subset of labels for each \\emph{slice} of frames in a\ntraining video by (i) replacing the frames outside the slice with those from a\nsecond video having no overlap in video-level labels, and (ii) feeding this\nsynthetic video into the base model to extract labels for just the slice in\nquestion. To handle the out-of-distribution nature of our synthetic videos, we\npropose an auxiliary objective for the base model that induces more reliable\npredictions of the localized event labels as desired. Our three-stage pipeline\noutperforms several existing AVEL methods with no architectural changes and\nimproves performance on a related weakly-supervised task as well.\n","authors":["Kalyan Ramakrishnan"],"pdf_url":"https://arxiv.org/pdf/2307.06385v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10011v1","updated":"2023-07-19T14:49:14Z","published":"2023-07-19T14:49:14Z","title":"Towards Fair Face Verification: An In-depth Analysis of Demographic\n Biases","summary":" Deep learning-based person identification and verification systems have\nremarkably improved in terms of accuracy in recent years; however, such\nsystems, including widely popular cloud-based solutions, have been found to\nexhibit significant biases related to race, age, and gender, a problem that\nrequires in-depth exploration and solutions. This paper presents an in-depth\nanalysis, with a particular emphasis on the intersectionality of these\ndemographic factors. Intersectional bias refers to the performance\ndiscrepancies w.r.t. the different combinations of race, age, and gender\ngroups, an area relatively unexplored in current literature. Furthermore, the\nreliance of most state-of-the-art approaches on accuracy as the principal\nevaluation metric often masks significant demographic disparities in\nperformance. To counter this crucial limitation, we incorporate five additional\nmetrics in our quantitative analysis, including disparate impact and\nmistreatment metrics, which are typically ignored by the relevant\nfairness-aware approaches. Results on the Racial Faces in-the-Wild (RFW)\nbenchmark indicate pervasive biases in face recognition systems, extending\nbeyond race, with different demographic factors yielding significantly\ndisparate outcomes. 
In particular, Africans demonstrate an 11.25% lower True\nPositive Rate (TPR) compared to Caucasians, while only a 3.51% accuracy drop is\nobserved. Even more concerning, the intersections of multiple protected groups,\nsuch as African females over 60 years old, demonstrate a +39.89% disparate\nmistreatment rate compared to the highest Caucasians rate. By shedding light on\nthese biases and their implications, this paper aims to stimulate further\nresearch towards developing fairer, more equitable face recognition and\nverification systems.\n","authors":["Ioannis Sarridis","Christos Koutlis","Symeon Papadopoulos","Christos Diou"],"pdf_url":"https://arxiv.org/pdf/2307.10011v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10008v1","updated":"2023-07-19T14:45:11Z","published":"2023-07-19T14:45:11Z","title":"MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions","summary":" Audio-driven portrait animation aims to synthesize portrait videos that are\nconditioned by given audio. Animating high-fidelity and multimodal video\nportraits has a variety of applications. Previous methods have attempted to\ncapture different motion modes and generate high-fidelity portrait videos by\ntraining different models or sampling signals from given videos. However,\nlacking correlation learning between lip-sync and other movements (e.g., head\npose/eye blinking) usually leads to unnatural results. In this paper, we\npropose a unified system for multi-person, diverse, and high-fidelity talking\nportrait generation. Our method contains three stages, i.e., 1) Mapping-Once\nnetwork with Dual Attentions (MODA) generates talking representation from given\naudio. In MODA, we design a dual-attention module to encode accurate mouth\nmovements and diverse modalities. 2) Facial composer network generates dense\nand detailed face landmarks, and 3) temporal-guided renderer syntheses stable\nvideos. Extensive evaluations demonstrate that the proposed system produces\nmore natural and realistic video portraits compared to previous methods.\n","authors":["Yunfei Liu","Lijian Lin","Fei Yu","Changyin Zhou","Yu Li"],"pdf_url":"https://arxiv.org/pdf/2307.10008v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.09456v2","updated":"2023-07-19T14:27:57Z","published":"2023-07-18T17:35:45Z","title":"A comparative analysis of SRGAN models","summary":" In this study, we evaluate the performance of multiple state-of-the-art SRGAN\n(Super Resolution Generative Adversarial Network) models, ESRGAN, Real-ESRGAN\nand EDSR, on a benchmark dataset of real-world images which undergo degradation\nusing a pipeline. Our results show that some models seem to significantly\nincrease the resolution of the input images while preserving their visual\nquality, this is assessed using Tesseract OCR engine. We observe that EDSR-BASE\nmodel from huggingface outperforms the remaining candidate models in terms of\nboth quantitative metrics and subjective visual quality assessments with least\ncompute overhead. Specifically, EDSR generates images with higher peak\nsignal-to-noise ratio (PSNR) and structural similarity index (SSIM) values and\nare seen to return high quality OCR results with Tesseract OCR engine. 
These\nfindings suggest that EDSR is a robust and effective approach for single-image\nsuper-resolution and may be particularly well-suited for applications where\nhigh-quality visual fidelity is critical and optimized compute.\n","authors":["Fatemeh Rezapoor Nikroo","Ajinkya Deshmukh","Anantha Sharma","Adrian Tam","Kaarthik Kumar","Cleo Norris","Aditya Dangi"],"pdf_url":"https://arxiv.org/pdf/2307.09456v2.pdf","comment":"9 pages, 6 tables, 2 figures"},{"id":"http://arxiv.org/abs/2307.10003v1","updated":"2023-07-19T14:23:26Z","published":"2023-07-19T14:23:26Z","title":"TbExplain: A Text-based Explanation Method for Scene Classification\n Models with the Statistical Prediction Correction","summary":" The field of Explainable Artificial Intelligence (XAI) aims to improve the\ninterpretability of black-box machine learning models. Building a heatmap based\non the importance value of input features is a popular method for explaining\nthe underlying functions of such models in producing their predictions.\nHeatmaps are almost understandable to humans, yet they are not without flaws.\nNon-expert users, for example, may not fully understand the logic of heatmaps\n(the logic in which relevant pixels to the model's prediction are highlighted\nwith different intensities or colors). Additionally, objects and regions of the\ninput image that are relevant to the model prediction are frequently not\nentirely differentiated by heatmaps. In this paper, we propose a framework\ncalled TbExplain that employs XAI techniques and a pre-trained object detector\nto present text-based explanations of scene classification models. Moreover,\nTbExplain incorporates a novel method to correct predictions and textually\nexplain them based on the statistics of objects in the input image when the\ninitial prediction is unreliable. To assess the trustworthiness and validity of\nthe text-based explanations, we conducted a qualitative experiment, and the\nfindings indicated that these explanations are sufficiently reliable.\nFurthermore, our quantitative and qualitative experiments on TbExplain with\nscene classification datasets reveal an improvement in classification accuracy\nover ResNet variants.\n","authors":["Amirhossein Aminimehr","Pouya Khani","Amirali Molaei","Amirmohammad Kazemeini","Erik Cambria"],"pdf_url":"https://arxiv.org/pdf/2307.10003v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10001v1","updated":"2023-07-19T14:21:11Z","published":"2023-07-19T14:21:11Z","title":"As large as it gets: Learning infinitely large Filters via Neural\n Implicit Functions in the Fourier Domain","summary":" Motivated by the recent trend towards the usage of larger receptive fields\nfor more context-aware neural networks in vision applications, we aim to\ninvestigate how large these receptive fields really need to be. To facilitate\nsuch study, several challenges need to be addressed, most importantly: (i) We\nneed to provide an effective way for models to learn large filters (potentially\nas large as the input data) without increasing their memory consumption during\ntraining or inference, (ii) the study of filter sizes has to be decoupled from\nother effects such as the network width or number of learnable parameters, and\n(iii) the employed convolution operation should be a plug-and-play module that\ncan replace any conventional convolution in a Convolutional Neural Network\n(CNN) and allow for an efficient implementation in current frameworks. 
To\nfacilitate such models, we propose to learn not spatial but frequency\nrepresentations of filter weights as neural implicit functions, such that even\ninfinitely large filters can be parameterized by only a few learnable weights.\nThe resulting neural implicit frequency CNNs are the first models to achieve\nresults on par with the state-of-the-art on large image classification\nbenchmarks while executing convolutions solely in the frequency domain and can\nbe employed within any CNN architecture. They allow us to provide an extensive\nanalysis of the learned receptive fields. Interestingly, our analysis shows\nthat, although the proposed networks could learn very large convolution\nkernels, the learned filters practically translate into well-localized and\nrelatively small convolution kernels in the spatial domain.\n","authors":["Julia Grabinski","Janis Keuper","Margret Keuper"],"pdf_url":"https://arxiv.org/pdf/2307.10001v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08913v2","updated":"2023-07-19T14:18:00Z","published":"2023-07-18T01:16:23Z","title":"Towards the Sparseness of Projection Head in Self-Supervised Learning","summary":" In recent years, self-supervised learning (SSL) has emerged as a promising\napproach for extracting valuable representations from unlabeled data. One\nsuccessful SSL method is contrastive learning, which aims to bring positive\nexamples closer while pushing negative examples apart. Many current contrastive\nlearning approaches utilize a parameterized projection head. Through a\ncombination of empirical analysis and theoretical investigation, we provide\ninsights into the internal mechanisms of the projection head and its\nrelationship with the phenomenon of dimensional collapse. Our findings\ndemonstrate that the projection head enhances the quality of representations by\nperforming contrastive loss in a projected subspace. Therefore, we propose an\nassumption that only a subset of features is necessary when minimizing the\ncontrastive loss of a mini-batch of data. Theoretical analysis further suggests\nthat a sparse projection head can enhance generalization, leading us to\nintroduce SparseHead - a regularization term that effectively constrains the\nsparsity of the projection head, and can be seamlessly integrated with any\nself-supervised learning (SSL) approaches. Our experimental results validate\nthe effectiveness of SparseHead, demonstrating its ability to improve the\nperformance of existing contrastive methods.\n","authors":["Zeen Song","Xingzhe Su","Jingyao Wang","Wenwen Qiang","Changwen Zheng","Fuchun Sun"],"pdf_url":"https://arxiv.org/pdf/2307.08913v2.pdf","comment":"9 pages,3 figures"},{"id":"http://arxiv.org/abs/2307.09997v1","updated":"2023-07-19T14:10:55Z","published":"2023-07-19T14:10:55Z","title":"TUNeS: A Temporal U-Net with Self-Attention for Video-based Surgical\n Phase Recognition","summary":" To enable context-aware computer assistance in the operating room of the\nfuture, cognitive systems need to understand automatically which surgical phase\nis being performed by the medical team. The primary source of information for\nsurgical phase recognition is typically video, which presents two challenges:\nextracting meaningful features from the video stream and effectively modeling\ntemporal information in the sequence of visual features. For temporal modeling,\nattention mechanisms have gained popularity due to their ability to capture\nlong-range dependencies. 
In this paper, we explore design choices for attention\nin existing temporal models for surgical phase recognition and propose a novel\napproach that does not resort to local attention or regularization of attention\nweights: TUNeS is an efficient and simple temporal model that incorporates\nself-attention at the coarsest stage of a U-Net-like structure. In addition, we\npropose to train the feature extractor, a standard CNN, together with an LSTM\non preferably long video segments, i.e., with long temporal context. In our\nexperiments, all temporal models performed better on top of feature extractors\nthat were trained with longer temporal context. On top of these contextualized\nfeatures, TUNeS achieves state-of-the-art results on Cholec80.\n","authors":["Isabel Funke","Dominik Rivoir","Stefanie Krell","Stefanie Speidel"],"pdf_url":"https://arxiv.org/pdf/2307.09997v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09994v1","updated":"2023-07-19T13:58:01Z","published":"2023-07-19T13:58:01Z","title":"Impact of Disentanglement on Pruning Neural Networks","summary":" Deploying deep learning neural networks on edge devices, to accomplish task\nspecific objectives in the real-world, requires a reduction in their memory\nfootprint, power consumption, and latency. This can be realized via efficient\nmodel compression. Disentangled latent representations produced by variational\nautoencoder (VAE) networks are a promising approach for achieving model\ncompression because they mainly retain task-specific information, discarding\nuseless information for the task at hand. We make use of the Beta-VAE framework\ncombined with a standard criterion for pruning to investigate the impact of\nforcing the network to learn disentangled representations on the pruning\nprocess for the task of classification. In particular, we perform experiments\non MNIST and CIFAR10 datasets, examine disentanglement challenges, and propose\na path forward for future works.\n","authors":["Carl Shneider","Peyman Rostami","Anis Kacem","Nilotpal Sinha","Abd El Rahman Shabayek","Djamila Aouada"],"pdf_url":"https://arxiv.org/pdf/2307.09994v1.pdf","comment":"Presented in ISCS23"},{"id":"http://arxiv.org/abs/2307.08347v2","updated":"2023-07-19T13:55:32Z","published":"2023-07-17T09:38:41Z","title":"M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models\n and Latent Space Geometry Optimization","summary":" Medical vision-language models enable co-learning and integrating features\nfrom medical imaging and clinical text. However, these models are not easy to\ntrain and the latent representation space can be complex. Here we propose a\nnovel way for pre-training and regularising medical vision-language models. The\nproposed method, named Medical vision-language pre-training with Frozen\nlanguage models and Latent spAce Geometry optimization (M-FLAG), leverages a\nfrozen language model for training stability and efficiency and introduces a\nnovel orthogonality loss to harmonize the latent space geometry. We demonstrate\nthe potential of the pre-trained model on three downstream tasks: medical image\nclassification, segmentation, and object detection. Extensive experiments\nacross five public datasets demonstrate that M-FLAG significantly outperforms\nexisting medical vision-language pre-training approaches and reduces the number\nof parameters by 78\\%. 
Notably, M-FLAG achieves outstanding performance on the\nsegmentation task while using only 1\\% of the RSNA dataset, even outperforming\nImageNet pre-trained models that have been fine-tuned using 100\\% of the data.\n","authors":["Che Liu","Sibo Cheng","Chen Chen","Mengyun Qiao","Weitong Zhang","Anand Shah","Wenjia Bai","Rossella Arcucci"],"pdf_url":"https://arxiv.org/pdf/2307.08347v2.pdf","comment":"Accepted by MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.09988v1","updated":"2023-07-19T13:49:12Z","published":"2023-07-19T13:49:12Z","title":"TinyTrain: Deep Neural Network Training at the Extreme Edge","summary":" On-device training is essential for user personalisation and privacy. With\nthe pervasiveness of IoT devices and microcontroller units (MCU), this task\nbecomes more challenging due to the constrained memory and compute resources,\nand the limited availability of labelled user data. Nonetheless, prior works\nneglect the data scarcity issue, require excessively long training time (e.g. a\nfew hours), or induce substantial accuracy loss ($\\geq$10\\%). We propose\nTinyTrain, an on-device training approach that drastically reduces training\ntime by selectively updating parts of the model and explicitly coping with data\nscarcity. TinyTrain introduces a task-adaptive sparse-update method that\ndynamically selects the layer/channel based on a multi-objective criterion that\njointly captures user data, the memory, and the compute capabilities of the\ntarget device, leading to high accuracy on unseen tasks with reduced\ncomputation and memory footprint. TinyTrain outperforms vanilla fine-tuning of\nthe entire network by 3.6-5.0\\% in accuracy, while reducing the backward-pass\nmemory and computation cost by up to 2,286$\\times$ and 7.68$\\times$,\nrespectively. Targeting broadly used real-world edge devices, TinyTrain\nachieves 9.5$\\times$ faster and 3.5$\\times$ more energy-efficient training over\nstatus-quo approaches, and 2.8$\\times$ smaller memory footprint than SOTA\napproaches, while remaining within the 1 MB memory envelope of MCU-grade\nplatforms.\n","authors":["Young D. Kwon","Rui Li","Stylianos I. Venieris","Jagmohan Chauhan","Nicholas D. Lane","Cecilia Mascolo"],"pdf_url":"https://arxiv.org/pdf/2307.09988v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09981v1","updated":"2023-07-19T13:40:45Z","published":"2023-07-19T13:40:45Z","title":"Lazy Visual Localization via Motion Averaging","summary":" Visual (re)localization is critical for various applications in computer\nvision and robotics. Its goal is to estimate the 6 degrees of freedom (DoF)\ncamera pose for each query image, based on a set of posed database images.\nCurrently, all leading solutions are structure-based that either explicitly\nconstruct 3D metric maps from the database with structure-from-motion, or\nimplicitly encode the 3D information with scene coordinate regression models.\nOn the contrary, visual localization without reconstructing the scene in 3D\noffers clear benefits. It makes deployment more convenient by reducing database\npre-processing time, releasing storage requirements, and remaining unaffected\nby imperfect reconstruction, etc. In this technical report, we demonstrate that\nit is possible to achieve high localization accuracy without reconstructing the\nscene from the database. The key to achieving this owes to a tailored motion\naveraging over database-query pairs. 
Experiments show that our visual\nlocalization proposal, LazyLoc, achieves comparable performance against\nstate-of-the-art structure-based methods. Furthermore, we showcase the\nversatility of LazyLoc, which can be easily extended to handle complex\nconfigurations such as multi-query co-localization and camera rigs.\n","authors":["Siyan Dong","Shaohui Liu","Hengkai Guo","Baoquan Chen","Marc Pollefeys"],"pdf_url":"https://arxiv.org/pdf/2307.09981v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09362v2","updated":"2023-07-19T13:21:30Z","published":"2023-07-18T15:46:21Z","title":"Disentangle then Parse:Night-time Semantic Segmentation with\n Illumination Disentanglement","summary":" Most prior semantic segmentation methods have been developed for day-time\nscenes, while typically underperforming in night-time scenes due to\ninsufficient and complicated lighting conditions. In this work, we tackle this\nchallenge by proposing a novel night-time semantic segmentation paradigm, i.e.,\ndisentangle then parse (DTP). DTP explicitly disentangles night-time images\ninto light-invariant reflectance and light-specific illumination components and\nthen recognizes semantics based on their adaptive fusion. Concretely, the\nproposed DTP comprises two key components: 1) Instead of processing\nlighting-entangled features as in prior works, our Semantic-Oriented\nDisentanglement (SOD) framework enables the extraction of reflectance component\nwithout being impeded by lighting, allowing the network to consistently\nrecognize the semantics under cover of varying and complicated lighting\nconditions. 2) Based on the observation that the illumination component can\nserve as a cue for some semantically confused regions, we further introduce an\nIllumination-Aware Parser (IAParser) to explicitly learn the correlation\nbetween semantics and lighting, and aggregate the illumination features to\nyield more precise predictions. Extensive experiments on the night-time\nsegmentation task with various settings demonstrate that DTP significantly\noutperforms state-of-the-art methods. Furthermore, with negligible additional\nparameters, DTP can be directly used to benefit existing day-time methods for\nnight-time segmentation.\n","authors":["Zhixiang Wei","Lin Chen","Tao Tu","Huaian Chen","Pengyang Ling","Yi Jin"],"pdf_url":"https://arxiv.org/pdf/2307.09362v2.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2305.09946v2","updated":"2023-07-19T13:15:08Z","published":"2023-05-17T04:56:11Z","title":"AdaMSS: Adaptive Multi-Modality Segmentation-to-Survival Learning for\n Survival Outcome Prediction from PET/CT Images","summary":" Survival prediction is a major concern for cancer management. Deep survival\nmodels based on deep learning have been widely adopted to perform end-to-end\nsurvival prediction from medical images. Recent deep survival models achieved\npromising performance by jointly performing tumor segmentation with survival\nprediction, where the models were guided to extract tumor-related information\nthrough Multi-Task Learning (MTL). However, these deep survival models have\ndifficulties in exploring out-of-tumor prognostic information. In addition,\nexisting deep survival models are unable to effectively leverage multi-modality\nimages. Empirically-designed fusion strategies were commonly adopted to fuse\nmulti-modality information via task-specific manually-designed networks, thus\nlimiting the adaptability to different scenarios. 
In this study, we propose an\nAdaptive Multi-modality Segmentation-to-Survival model (AdaMSS) for survival\nprediction from PET/CT images. Instead of adopting MTL, we propose a novel\nSegmentation-to-Survival Learning (SSL) strategy, where our AdaMSS is trained\nfor tumor segmentation and survival prediction sequentially in two stages. This\nstrategy enables the AdaMSS to focus on tumor regions in the first stage and\ngradually expand its focus to include other prognosis-related regions in the\nsecond stage. We also propose a data-driven strategy to fuse multi-modality\ninformation, which realizes adaptive optimization of fusion strategies based on\ntraining data during training. With the SSL and data-driven fusion strategies,\nour AdaMSS is designed as an adaptive model that can self-adapt its focus\nregions and fusion strategy for different training stages. Extensive\nexperiments with two large clinical datasets show that our AdaMSS outperforms\nstate-of-the-art survival prediction methods.\n","authors":["Mingyuan Meng","Bingxin Gu","Michael Fulham","Shaoli Song","Dagan Feng","Lei Bi","Jinman Kim"],"pdf_url":"https://arxiv.org/pdf/2305.09946v2.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2305.18060v2","updated":"2023-07-19T13:13:39Z","published":"2023-05-29T12:53:54Z","title":"Mining Negative Temporal Contexts For False Positive Suppression In\n Real-Time Ultrasound Lesion Detection","summary":" During ultrasonic scanning processes, real-time lesion detection can assist\nradiologists in accurate cancer diagnosis. However, this essential task remains\nchallenging and underexplored. General-purpose real-time object detection\nmodels can mistakenly report obvious false positives (FPs) when applied to\nultrasound videos, potentially misleading junior radiologists. One key issue is\ntheir failure to utilize negative symptoms in previous frames, denoted as\nnegative temporal contexts (NTC). To address this issue, we propose to extract\ncontexts from previous frames, including NTC, with the guidance of inverse\noptical flow. By aggregating extracted contexts, we endow the model with the\nability to suppress FPs by leveraging NTC. We call the resulting model\nUltraDet. The proposed UltraDet demonstrates significant improvement over\nprevious state-of-the-arts and achieves real-time inference speed. We release\nthe code, checkpoints, and high-quality labels of the CVA-BUS dataset in\nhttps://github.com/HaojunYu1998/UltraDet.\n","authors":["Haojun Yu","Youcheng Li","QuanLin Wu","Ziwei Zhao","Dengbo Chen","Dong Wang","Liwei Wang"],"pdf_url":"https://arxiv.org/pdf/2305.18060v2.pdf","comment":"10 pages, 4 figures, MICCAI 2023 Early Accept"},{"id":"http://arxiv.org/abs/2001.05887v4","updated":"2023-07-19T12:58:18Z","published":"2020-01-16T15:24:26Z","title":"MixPath: A Unified Approach for One-shot Neural Architecture Search","summary":" Blending multiple convolutional kernels is proved advantageous in neural\narchitecture design. However, current two-stage neural architecture search\nmethods are mainly limited to single-path search spaces. How to efficiently\nsearch models of multi-path structures remains a difficult problem. In this\npaper, we are motivated to train a one-shot multi-path supernet to accurately\nevaluate the candidate architectures. Specifically, we discover that in the\nstudied search spaces, feature vectors summed from multiple paths are nearly\nmultiples of those from a single path. Such disparity perturbs the supernet\ntraining and its ranking ability. 
Therefore, we propose a novel mechanism\ncalled Shadow Batch Normalization (SBN) to regularize the disparate feature\nstatistics. Extensive experiments prove that SBNs are capable of stabilizing\nthe optimization and improving ranking performance. We call our unified\nmulti-path one-shot approach as MixPath, which generates a series of models\nthat achieve state-of-the-art results on ImageNet.\n","authors":["Xiangxiang Chu","Shun Lu","Xudong Li","Bo Zhang"],"pdf_url":"https://arxiv.org/pdf/2001.05887v4.pdf","comment":"ICCV2023"},{"id":"http://arxiv.org/abs/2307.09947v1","updated":"2023-07-19T12:41:54Z","published":"2023-07-19T12:41:54Z","title":"U-CE: Uncertainty-aware Cross-Entropy for Semantic Segmentation","summary":" Deep neural networks have shown exceptional performance in various tasks, but\ntheir lack of robustness, reliability, and tendency to be overconfident pose\nchallenges for their deployment in safety-critical applications like autonomous\ndriving. In this regard, quantifying the uncertainty inherent to a model's\nprediction is a promising endeavour to address these shortcomings. In this\nwork, we present a novel Uncertainty-aware Cross-Entropy loss (U-CE) that\nincorporates dynamic predictive uncertainties into the training process by\npixel-wise weighting of the well-known cross-entropy loss (CE). Through\nextensive experimentation, we demonstrate the superiority of U-CE over regular\nCE training on two benchmark datasets, Cityscapes and ACDC, using two common\nbackbone architectures, ResNet-18 and ResNet-101. With U-CE, we manage to train\nmodels that not only improve their segmentation performance but also provide\nmeaningful uncertainties after training. Consequently, we contribute to the\ndevelopment of more robust and reliable segmentation models, ultimately\nadvancing the state-of-the-art in safety-critical applications and beyond.\n","authors":["Steven Landgraf","Markus Hillemann","Kira Wursthorn","Markus Ulrich"],"pdf_url":"https://arxiv.org/pdf/2307.09947v1.pdf","comment":"10 pages, 3 figures, 7 tables, 1 algorithm"},{"id":"http://arxiv.org/abs/2307.09944v1","updated":"2023-07-19T12:39:40Z","published":"2023-07-19T12:39:40Z","title":"ProtoCaps: A Fast and Non-Iterative Capsule Network Routing Method","summary":" Capsule Networks have emerged as a powerful class of deep learning\narchitectures, known for robust performance with relatively few parameters\ncompared to Convolutional Neural Networks (CNNs). However, their inherent\nefficiency is often overshadowed by their slow, iterative routing mechanisms\nwhich establish connections between Capsule layers, posing computational\nchallenges resulting in an inability to scale. In this paper, we introduce a\nnovel, non-iterative routing mechanism, inspired by trainable prototype\nclustering. This innovative approach aims to mitigate computational complexity,\nwhile retaining, if not enhancing, performance efficacy. Furthermore, we\nharness a shared Capsule subspace, negating the need to project each\nlower-level Capsule to each higher-level Capsule, thereby significantly\nreducing memory requisites during training. Our approach demonstrates superior\nresults compared to the current best non-iterative Capsule Network and tests on\nthe Imagewoof dataset, which is too computationally demanding to handle\nefficiently by iterative approaches. 
Our findings underscore the potential of\nour proposed methodology in enhancing the operational efficiency and\nperformance of Capsule Networks, paving the way for their application in\nincreasingly complex computational scenarios.\n","authors":["Miles Everett","Mingjun Zhong","Georgios Leontidis"],"pdf_url":"https://arxiv.org/pdf/2307.09944v1.pdf","comment":"8 pages, 5 figures, 4 tables"},{"id":"http://arxiv.org/abs/2307.09936v1","updated":"2023-07-19T12:21:39Z","published":"2023-07-19T12:21:39Z","title":"AGAR: Attention Graph-RNN for Adaptative Motion Prediction of Point\n Clouds of Deformable Objects","summary":" This paper focuses on motion prediction for point cloud sequences in the\nchallenging case of deformable 3D objects, such as human body motion. First, we\ninvestigate the challenges caused by deformable shapes and complex motions\npresent in this type of representation, with the ultimate goal of understanding\nthe technical limitations of state-of-the-art models. From this understanding,\nwe propose an improved architecture for point cloud prediction of deformable 3D\nobjects. Specifically, to handle deformable shapes, we propose a graph-based\napproach that learns and exploits the spatial structure of point clouds to\nextract more representative features. Then we propose a module able to combine\nthe learned features in an adaptative manner according to the point cloud\nmovements. The proposed adaptative module controls the composition of local and\nglobal motions for each point, enabling the network to model complex motions in\ndeformable 3D objects more effectively. We tested the proposed method on the\nfollowing datasets: MNIST moving digits, the Mixamo human bodies motions, JPEG\nand CWIPC-SXR real-world dynamic bodies. Simulation results demonstrate that\nour method outperforms the current baseline methods given its improved ability\nto model complex movements as well as preserve point cloud shape. Furthermore,\nwe demonstrate the generalizability of the proposed framework for dynamic\nfeature learning, by testing the framework for action recognition on the\nMSRAction3D dataset and achieving results on-par with state-of-the-art methods\n","authors":["Pedro Gomes","Silvia Rossi","Laura Toni"],"pdf_url":"https://arxiv.org/pdf/2307.09936v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09933v1","updated":"2023-07-19T12:15:06Z","published":"2023-07-19T12:15:06Z","title":"Spuriosity Didn't Kill the Classifier: Using Invariant Predictions to\n Harness Spurious Features","summary":" To avoid failures on out-of-distribution data, recent works have sought to\nextract features that have a stable or invariant relationship with the label\nacross domains, discarding the \"spurious\" or unstable features whose\nrelationship with the label changes across domains. However, unstable features\noften carry complementary information about the label that could boost\nperformance if used correctly in the test domain. Our main contribution is to\nshow that it is possible to learn how to use these unstable features in the\ntest domain without labels. In particular, we prove that pseudo-labels based on\nstable features provide sufficient guidance for doing so, provided that stable\nand unstable features are conditionally independent given the label. 
Based on\nthis theoretical insight, we propose Stable Feature Boosting (SFB), an\nalgorithm for: (i) learning a predictor that separates stable and\nconditionally-independent unstable features; and (ii) using the stable-feature\npredictions to adapt the unstable-feature predictions in the test domain.\nTheoretically, we prove that SFB can learn an asymptotically-optimal predictor\nwithout test-domain labels. Empirically, we demonstrate the effectiveness of\nSFB on real and synthetic data.\n","authors":["Cian Eastwood","Shashank Singh","Andrei Liviu Nicolicioiu","Marin Vlastelica","Julius von Kügelgen","Bernhard Schölkopf"],"pdf_url":"https://arxiv.org/pdf/2307.09933v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09931v1","updated":"2023-07-19T12:12:17Z","published":"2023-07-19T12:12:17Z","title":"DISA: DIfferentiable Similarity Approximation for Universal Multimodal\n Registration","summary":" Multimodal image registration is a challenging but essential step for\nnumerous image-guided procedures. Most registration algorithms rely on the\ncomputation of complex, frequently non-differentiable similarity metrics to\ndeal with the appearance discrepancy of anatomical structures between imaging\nmodalities. Recent Machine Learning based approaches are limited to specific\nanatomy-modality combinations and do not generalize to new settings. We propose\na generic framework for creating expressive cross-modal descriptors that enable\nfast deformable global registration. We achieve this by approximating existing\nmetrics with a dot-product in the feature space of a small convolutional neural\nnetwork (CNN) which is inherently differentiable can be trained without\nregistered data. Our method is several orders of magnitude faster than local\npatch-based metrics and can be directly applied in clinical settings by\nreplacing the similarity measure with the proposed one. Experiments on three\ndifferent datasets demonstrate that our approach generalizes well beyond the\ntraining data, yielding a broad capture range even on unseen anatomies and\nmodality pairs, without the need for specialized retraining. We make our\ntraining code and data publicly available.\n","authors":["Matteo Ronchetti","Wolfgang Wein","Nassir Navab","Oliver Zettinig","Raphael Prevost"],"pdf_url":"https://arxiv.org/pdf/2307.09931v1.pdf","comment":"This preprint was submitted to MICCAI 2023. The Version of Record of\n this contribution will be published in Springer LNCS"},{"id":"http://arxiv.org/abs/2307.09929v1","updated":"2023-07-19T12:11:15Z","published":"2023-07-19T12:11:15Z","title":"Measuring and Modeling Uncertainty Degree for Monocular Depth Estimation","summary":" Effectively measuring and modeling the reliability of a trained model is\nessential to the real-world deployment of monocular depth estimation (MDE)\nmodels. However, the intrinsic ill-posedness and ordinal-sensitive nature of\nMDE pose major challenges to the estimation of uncertainty degree of the\ntrained models. On the one hand, utilizing current uncertainty modeling methods\nmay increase memory consumption and are usually time-consuming. On the other\nhand, measuring the uncertainty based on model accuracy can also be\nproblematic, where uncertainty reliability and prediction accuracy are not well\ndecoupled. 
In this paper, we propose to model the uncertainty of MDE models\nfrom the perspective of the inherent probability distributions originating from\nthe depth probability volume and its extensions, and to assess it more fairly\nwith more comprehensive metrics. By simply introducing additional training\nregularization terms, our model, with surprisingly simple formations and\nwithout requiring extra modules or multiple inferences, can provide uncertainty\nestimations with state-of-the-art reliability, and can be further improved when\ncombined with ensemble or sampling methods. A series of experiments demonstrate\nthe effectiveness of our methods.\n","authors":["Mochu Xiang","Jing Zhang","Nick Barnes","Yuchao Dai"],"pdf_url":"https://arxiv.org/pdf/2307.09929v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.04639v2","updated":"2023-07-19T12:08:51Z","published":"2023-07-10T15:35:31Z","title":"Multimodal brain age estimation using interpretable adaptive\n population-graph learning","summary":" Brain age estimation is clinically important as it can provide valuable\ninformation in the context of neurodegenerative diseases such as Alzheimer's.\nPopulation graphs, which include multimodal imaging information of the subjects\nalong with the relationships among the population, have been used in literature\nalong with Graph Convolutional Networks (GCNs) and have proved beneficial for a\nvariety of medical imaging tasks. A population graph is usually static and\nconstructed manually using non-imaging information. However, graph construction\nis not a trivial task and might significantly affect the performance of the\nGCN, which is inherently very sensitive to the graph structure. In this work,\nwe propose a framework that learns a population graph structure optimized for\nthe downstream task. An attention mechanism assigns weights to a set of imaging\nand non-imaging features (phenotypes), which are then used for edge extraction.\nThe resulting graph is used to train the GCN. The entire pipeline can be\ntrained end-to-end. Additionally, by visualizing the attention weights that\nwere the most important for the graph construction, we increase the\ninterpretability of the graph. We use the UK Biobank, which provides a large\nvariety of neuroimaging and non-imaging phenotypes, to evaluate our method on\nbrain age regression and classification. The proposed method outperforms\ncompeting static graph approaches and other state-of-the-art adaptive methods.\nWe further show that the assigned attention scores indicate that there are both\nimaging and non-imaging phenotypes that are informative for brain age\nestimation and are in agreement with the relevant literature.\n","authors":["Kyriaki-Margarita Bintsi","Vasileios Baltatzis","Rolandos Alexandros Potamias","Alexander Hammers","Daniel Rueckert"],"pdf_url":"https://arxiv.org/pdf/2307.04639v2.pdf","comment":"Accepted at MICCAI 2023"},{"id":"http://arxiv.org/abs/2303.06635v2","updated":"2023-07-19T12:05:29Z","published":"2023-03-12T11:23:56Z","title":"Schema Inference for Interpretable Image Classification","summary":" In this paper, we study a novel inference paradigm, termed as schema\ninference, that learns to deductively infer the explainable predictions by\nrebuilding the prior deep neural network (DNN) forwarding scheme, guided by the\nprevalent philosophical cognitive concept of schema. 
We strive to reformulate\nthe conventional model inference pipeline into a graph matching policy that\nassociates the extracted visual concepts of an image with the pre-computed\nscene impression, by analogy with human reasoning mechanism via impression\nmatching. To this end, we devise an elaborated architecture, termed as\nSchemaNet, as a dedicated instantiation of the proposed schema inference\nconcept, that models both the visual semantics of input instances and the\nlearned abstract imaginations of target categories as topological relational\ngraphs. Meanwhile, to capture and leverage the compositional contributions of\nvisual semantics in a global view, we also introduce a universal Feat2Graph\nscheme in SchemaNet to establish the relational graphs that contain abundant\ninteraction information. Both the theoretical analysis and the experimental\nresults on several benchmarks demonstrate that the proposed schema inference\nachieves encouraging performance and meanwhile yields a clear picture of the\ndeductive process leading to the predictions. Our code is available at\nhttps://github.com/zhfeing/SchemaNet-PyTorch.\n","authors":["Haofei Zhang","Mengqi Xue","Xiaokang Liu","Kaixuan Chen","Jie Song","Mingli Song"],"pdf_url":"https://arxiv.org/pdf/2303.06635v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.07591v3","updated":"2023-07-19T12:04:59Z","published":"2023-06-13T07:35:28Z","title":"I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models","summary":" Modern image-to-text systems typically adopt the encoder-decoder framework,\nwhich comprises two main components: an image encoder, responsible for\nextracting image features, and a transformer-based decoder, used for generating\ncaptions. Taking inspiration from the analysis of neural networks' robustness\nagainst adversarial perturbations, we propose a novel gray-box algorithm for\ncreating adversarial examples in image-to-text models. Unlike image\nclassification tasks that have a finite set of class labels, finding visually\nsimilar adversarial examples in an image-to-text task poses greater challenges\nbecause the captioning system allows for a virtually infinite space of possible\ncaptions. In this paper, we present a gray-box adversarial attack on\nimage-to-text, both untargeted and targeted. We formulate the process of\ndiscovering adversarial perturbations as an optimization problem that uses only\nthe image-encoder component, meaning the proposed attack is language-model\nagnostic. Through experiments conducted on the ViT-GPT2 model, which is the\nmost-used image-to-text model in Hugging Face, and the Flickr30k dataset, we\ndemonstrate that our proposed attack successfully generates visually similar\nadversarial examples, both with untargeted and targeted captions. Notably, our\nattack operates in a gray-box manner, requiring no knowledge about the decoder\nmodule. We also show that our attacks fool the popular open-source platform\nHugging Face.\n","authors":["Raz Lapid","Moshe Sipper"],"pdf_url":"https://arxiv.org/pdf/2306.07591v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2110.03544v2","updated":"2023-07-19T11:46:34Z","published":"2021-10-07T15:06:52Z","title":"RAR: Region-Aware Point Cloud Registration","summary":" This paper concerns the research problem of point cloud registration to find\nthe rigid transformation to optimally align the source point set with the\ntarget one. 
Learning robust point cloud registration models with deep neural\nnetworks has emerged as a powerful paradigm, offering promising performance in\npredicting the global geometric transformation for a pair of point sets.\nExisting methods firstly leverage an encoder to regress a latent shape\nembedding, which is then decoded into a shape-conditioned transformation via\nconcatenation-based conditioning. However, different regions of a 3D shape vary\nin their geometric structures which makes it more sense that we have a\nregion-conditioned transformation instead of the shape-conditioned one. In this\npaper we present a \\underline{R}egion-\\underline{A}ware point cloud\n\\underline{R}egistration, denoted as RAR, to predict transformation for\npairwise point sets in the self-supervised learning fashion. More specifically,\nwe develop a novel region-aware decoder (RAD) module that is formed with an\nimplicit neural region representation parameterized by neural networks. The\nimplicit neural region representation is learned with a self-supervised 3D\nshape reconstruction loss without the need for region labels. Consequently, the\nregion-aware decoder (RAD) module guides the training of the region-aware\ntransformation (RAT) module and region-aware weight (RAW) module, which predict\nthe transforms and weights for different regions respectively. The global\ngeometric transformation from source point set to target one is then formed by\nthe weighted fusion of region-aware transforms. Compared to the\nstate-of-the-art approaches, our experiments show that our RAR achieves\nsuperior registration performance over various benchmark datasets (e.g.\nModelNet40).\n","authors":["Yu Hao","Yi Fang"],"pdf_url":"https://arxiv.org/pdf/2110.03544v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2006.06200"},{"id":"http://arxiv.org/abs/2307.09915v1","updated":"2023-07-19T11:35:21Z","published":"2023-07-19T11:35:21Z","title":"Embedded Heterogeneous Attention Transformer for Cross-lingual Image\n Captioning","summary":" Cross-lingual image captioning is confronted with both cross-lingual and\ncross-modal challenges for multimedia analysis. The crucial issue in this task\nis to model the global and local matching between the image and different\nlanguages. Existing cross-modal embedding methods based on Transformer\narchitecture oversight the local matching between the image region and\nmonolingual words, not to mention in the face of a variety of differentiated\nlanguages. Due to the heterogeneous property of the cross-modal and\ncross-lingual task, we utilize the heterogeneous network to establish\ncross-domain relationships and the local correspondences between the image and\ndifferent languages. In this paper, we propose an Embedded Heterogeneous\nAttention Transformer (EHAT) to build reasoning paths bridging cross-domain for\ncross-lingual image captioning and integrate into transformer. The proposed\nEHAT consists of a Masked Heterogeneous Cross-attention (MHCA), Heterogeneous\nAttention Reasoning Network (HARN) and Heterogeneous Co-attention (HCA). HARN\nas the core network, models and infers cross-domain relationship anchored by\nvision bounding box representation features to connect two languages word\nfeatures and learn the heterogeneous maps. MHCA and HCA implement cross-domain\nintegration in the encoder through the special heterogeneous attention and\nenable single model to generate two language captioning. 
We test on MSCOCO\ndataset to generate English and Chinese, which are most widely used and have\nobvious difference between their language families. Our experiments show that\nour method even achieve better than advanced monolingual methods.\n","authors":["Zijie Song","Zhenzhen Hu","Richang Hong"],"pdf_url":"https://arxiv.org/pdf/2307.09915v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.03238v3","updated":"2023-07-19T11:20:12Z","published":"2023-05-05T01:40:00Z","title":"Reduction of Class Activation Uncertainty with Background Information","summary":" Multitask learning is a popular approach to training high-performing neural\nnetworks with improved generalization. In this paper, we propose a background\nclass to achieve improved generalization at a lower computation compared to\nmultitask learning to help researchers and organizations with limited\ncomputation power. We also present a methodology for selecting background\nimages and discuss potential future improvements. We apply our approach to\nseveral datasets and achieved improved generalization with much lower\ncomputation. We also investigate class activation mappings (CAMs) of the\ntrained model and observed the tendency towards looking at a bigger picture in\na few class classification problems with the proposed model training\nmethodology. Applying transformer with the proposed background class, we\nreceive state-of-the-art (SOTA) performance on STL-10, Caltech-101, and\nCINIC-10 datasets. Example scripts are available in the `CAM' folder of the\nfollowing GitHub Repository: github.com/dipuk0506/UQ\n","authors":["H M Dipu Kabir"],"pdf_url":"https://arxiv.org/pdf/2305.03238v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09906v1","updated":"2023-07-19T11:10:26Z","published":"2023-07-19T11:10:26Z","title":"Implicit Identity Representation Conditioned Memory Compensation Network\n for Talking Head video Generation","summary":" Talking head video generation aims to animate a human face in a still image\nwith dynamic poses and expressions using motion information derived from a\ntarget-driving video, while maintaining the person's identity in the source\nimage. However, dramatic and complex motions in the driving video cause\nambiguous generation, because the still source image cannot provide sufficient\nappearance information for occluded regions or delicate expression variations,\nwhich produces severe artifacts and significantly degrades the generation\nquality. To tackle this problem, we propose to learn a global facial\nrepresentation space, and design a novel implicit identity representation\nconditioned memory compensation network, coined as MCNet, for high-fidelity\ntalking head generation.~Specifically, we devise a network module to learn a\nunified spatial facial meta-memory bank from all training samples, which can\nprovide rich facial structure and appearance priors to compensate warped source\nfacial features for the generation. Furthermore, we propose an effective query\nmechanism based on implicit identity representations learned from the discrete\nkeypoints of the source image. It can greatly facilitate the retrieval of more\ncorrelated information from the memory bank for the compensation. Extensive\nexperiments demonstrate that MCNet can learn representative and complementary\nfacial memory, and can clearly outperform previous state-of-the-art talking\nhead generation methods on VoxCeleb1 and CelebV datasets. 
Please check our\n\\href{https://github.com/harlanhong/ICCV2023-MCNET}{Project}.\n","authors":["Fa-Ting Hong","Dan Xu"],"pdf_url":"https://arxiv.org/pdf/2307.09906v1.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2305.09211v3","updated":"2023-07-19T10:52:30Z","published":"2023-05-16T06:40:04Z","title":"CB-HVTNet: A channel-boosted hybrid vision transformer network for\n lymphocyte assessment in histopathological images","summary":" Transformers, due to their ability to learn long range dependencies, have\novercome the shortcomings of convolutional neural networks (CNNs) for global\nperspective learning. Therefore, they have gained the focus of researchers for\nseveral vision related tasks including medical diagnosis. However, their\nmulti-head attention module only captures global level feature representations,\nwhich is insufficient for medical images. To address this issue, we propose a\nChannel Boosted Hybrid Vision Transformer (CB HVT) that uses transfer learning\nto generate boosted channels and employs both transformers and CNNs to analyse\nlymphocytes in histopathological images. The proposed CB HVT comprises five\nmodules, including a channel generation module, channel exploitation module,\nchannel merging module, region-aware module, and a detection and segmentation\nhead, which work together to effectively identify lymphocytes. The channel\ngeneration module uses the idea of channel boosting through transfer learning\nto extract diverse channels from different auxiliary learners. In the CB HVT,\nthese boosted channels are first concatenated and ranked using an attention\nmechanism in the channel exploitation module. A fusion block is then utilized\nin the channel merging module for a gradual and systematic merging of the\ndiverse boosted channels to improve the network's learning representations. The\nCB HVT also employs a proposal network in its region aware module and a head to\neffectively identify objects, even in overlapping regions and with artifacts.\nWe evaluated the proposed CB HVT on two publicly available datasets for\nlymphocyte assessment in histopathological images. The results show that CB HVT\noutperformed other state of the art detection models, and has good\ngeneralization ability, demonstrating its value as a tool for pathologists.\n","authors":["Momina Liaqat Ali","Zunaira Rauf","Asifullah Khan","Anabia Sohail","Rafi Ullah","Jeonghwan Gwak"],"pdf_url":"https://arxiv.org/pdf/2305.09211v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09893v1","updated":"2023-07-19T10:45:49Z","published":"2023-07-19T10:45:49Z","title":"Learning from Abstract Images: on the Importance of Occlusion in a\n Minimalist Encoding of Human Poses","summary":" Existing 2D-to-3D pose lifting networks suffer from poor performance in\ncross-dataset benchmarks. Although the use of 2D keypoints joined by\n\"stick-figure\" limbs has shown promise as an intermediate step, stick-figures\ndo not account for occlusion information that is often inherent in an image. In\nthis paper, we propose a novel representation using opaque 3D limbs that\npreserves occlusion information while implicitly encoding joint locations.\nCrucially, when training on data with accurate three-dimensional keypoints and\nwithout part-maps, this representation allows training on abstract synthetic\nimages, with occlusion, from as many synthetic viewpoints as desired. 
The\nresult is a pose defined by limb angles rather than joint positions\n$\\unicode{x2013}$ because poses are, in the real world, independent of cameras\n$\\unicode{x2013}$ allowing us to predict poses that are completely independent\nof camera viewpoint. The result provides not only an improvement in\nsame-dataset benchmarks, but a \"quantum leap\" in cross-dataset benchmarks.\n","authors":["Saad Manzur","Wayne Hayes"],"pdf_url":"https://arxiv.org/pdf/2307.09893v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2307.09892v1","updated":"2023-07-19T10:44:44Z","published":"2023-07-19T10:44:44Z","title":"3Deformer: A Common Framework for Image-Guided Mesh Deformation","summary":" We propose 3Deformer, a general-purpose framework for interactive 3D shape\nediting. Given a source 3D mesh with semantic materials, and a user-specified\nsemantic image, 3Deformer can accurately edit the source mesh following the\nshape guidance of the semantic image, while preserving the source topology as\nrigid as possible. Recent studies of 3D shape editing mostly focus on learning\nneural networks to predict 3D shapes, which requires high-cost 3D training\ndatasets and is limited to handling objects involved in the datasets. Unlike\nthese studies, our 3Deformer is a non-training and common framework, which only\nrequires supervision of readily-available semantic images, and is compatible\nwith editing various objects unlimited by datasets. In 3Deformer, the source\nmesh is deformed utilizing the differentiable renderer technique, according to\nthe correspondences between semantic images and mesh materials. However,\nguiding complex 3D shapes with a simple 2D image incurs extra challenges, that\nis, the deform accuracy, surface smoothness, geometric rigidity, and global\nsynchronization of the edited mesh should be guaranteed. To address these\nchallenges, we propose a hierarchical optimization architecture to balance the\nglobal and local shape features, and propose further various strategies and\nlosses to improve properties of accuracy, smoothness, rigidity, and so on.\nExtensive experiments show that our 3Deformer is able to produce impressive\nresults and reaches the state-of-the-art level.\n","authors":["Hao Su","Xuefeng Liu","Jianwei Niu","Ji Wan","Xinghao Wu"],"pdf_url":"https://arxiv.org/pdf/2307.09892v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09886v1","updated":"2023-07-19T10:31:35Z","published":"2023-07-19T10:31:35Z","title":"A reinforcement learning approach for VQA validation: an application to\n diabetic macular edema grading","summary":" Recent advances in machine learning models have greatly increased the\nperformance of automated methods in medical image analysis. However, the\ninternal functioning of such models is largely hidden, which hinders their\nintegration in clinical practice. Explainability and trust are viewed as\nimportant aspects of modern methods, for the latter's widespread use in\nclinical communities. As such, validation of machine learning models represents\nan important aspect and yet, most methods are only validated in a limited way.\nIn this work, we focus on providing a richer and more appropriate validation\napproach for highly powerful Visual Question Answering (VQA) algorithms. To\nbetter understand the performance of these methods, which answer arbitrary\nquestions related to images, this work focuses on an automatic visual Turing\ntest (VTT). 
That is, we propose an automatic adaptive questioning method, that\naims to expose the reasoning behavior of a VQA algorithm. Specifically, we\nintroduce a reinforcement learning (RL) agent that observes the history of\npreviously asked questions, and uses it to select the next question to pose. We\ndemonstrate our approach in the context of evaluating algorithms that\nautomatically answer questions related to diabetic macular edema (DME) grading.\nThe experiments show that such an agent has similar behavior to a clinician,\nwhereby asking questions that are relevant to key clinical concepts.\n","authors":["Tatiana Fountoukidou","Raphael Sznitman"],"pdf_url":"https://arxiv.org/pdf/2307.09886v1.pdf","comment":"16 pages (+ 23 pages supplementary material)"},{"id":"http://arxiv.org/abs/2307.09880v1","updated":"2023-07-19T10:23:28Z","published":"2023-07-19T10:23:28Z","title":"A3D: Adaptive, Accurate, and Autonomous Navigation for Edge-Assisted\n Drones","summary":" Accurate navigation is of paramount importance to ensure flight safety and\nefficiency for autonomous drones. Recent research starts to use Deep Neural\nNetworks to enhance drone navigation given their remarkable predictive\ncapability for visual perception. However, existing solutions either run DNN\ninference tasks on drones in situ, impeded by the limited onboard resource, or\noffload the computation to external servers which may incur large network\nlatency. Few works consider jointly optimizing the offloading decisions along\nwith image transmission configurations and adapting them on the fly. In this\npaper, we propose A3D, an edge server assisted drone navigation framework that\ncan dynamically adjust task execution location, input resolution, and image\ncompression ratio in order to achieve low inference latency, high prediction\naccuracy, and long flight distances. Specifically, we first augment\nstate-of-the-art convolutional neural networks for drone navigation and define\na novel metric called Quality of Navigation as our optimization objective which\ncan effectively capture the above goals. We then design a deep reinforcement\nlearning based neural scheduler at the drone side for which an information\nencoder is devised to reshape the state features and thus improve its learning\nability. To further support simultaneous multi-drone serving, we extend the\nedge server design by developing a network-aware resource allocation algorithm,\nwhich allows provisioning containerized resources aligned with drones' demand.\nWe finally implement a proof-of-concept prototype with realistic devices and\nvalidate its performance in a real-world campus scene, as well as a simulation\nenvironment for thorough evaluation upon AirSim. Extensive experimental results\nshow that A3D can reduce end-to-end latency by 28.06% and extend the flight\ndistance by up to 27.28% compared with non-adaptive solutions.\n","authors":["Liekang Zeng","Haowei Chen","Daipeng Feng","Xiaoxi Zhang","Xu Chen"],"pdf_url":"https://arxiv.org/pdf/2307.09880v1.pdf","comment":"Accepted by IEEE/ACM Transactions on Networking"},{"id":"http://arxiv.org/abs/2304.06403v2","updated":"2023-07-19T10:12:58Z","published":"2023-04-13T11:10:16Z","title":"Leveraging triplet loss for unsupervised action segmentation","summary":" In this paper, we propose a novel fully unsupervised framework that learns\naction representations suitable for the action segmentation task from the\nsingle input video itself, without requiring any training data. 
Our method is a\ndeep metric learning approach rooted in a shallow network with a triplet loss\noperating on similarity distributions and a novel triplet selection strategy\nthat effectively models temporal and semantic priors to discover actions in the\nnew representational space. Under these circumstances, we successfully recover\ntemporal boundaries in the learned action representations with higher quality\ncompared with existing unsupervised approaches. The proposed method is\nevaluated on two widely used benchmark datasets for the action segmentation\ntask and it achieves competitive performance by applying a generic clustering\nalgorithm on the learned representations.\n","authors":["E. Bueno-Benito","B. Tura","M. Dimiccoli"],"pdf_url":"https://arxiv.org/pdf/2304.06403v2.pdf","comment":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern\n Recognition (CVPR) Workshops, 2023, pp. 4921-4929"},{"id":"http://arxiv.org/abs/2304.05417v2","updated":"2023-07-19T10:01:29Z","published":"2023-04-11T18:00:02Z","title":"The MONET dataset: Multimodal drone thermal dataset recorded in rural\n scenarios","summary":" We present MONET, a new multimodal dataset captured using a thermal camera\nmounted on a drone that flew over rural areas, and recorded human and vehicle\nactivities. We captured MONET to study the problem of object localisation and\nbehaviour understanding of targets undergoing large-scale variations and being\nrecorded from different and moving viewpoints. Target activities occur in two\ndifferent land sites, each with unique scene structures and cluttered\nbackgrounds. MONET consists of approximately 53K images featuring 162K manually\nannotated bounding boxes. Each image is timestamp-aligned with drone metadata\nthat includes information about attitudes, speed, altitude, and GPS\ncoordinates. MONET is different from previous thermal drone datasets because it\nfeatures multimodal data, including rural scenes captured with thermal cameras\ncontaining both person and vehicle targets, along with trajectory information\nand metadata. We assessed the difficulty of the dataset in terms of transfer\nlearning between the two sites and evaluated nine object detection algorithms\nto identify the open challenges associated with this type of data. Project\npage: https://github.com/fabiopoiesi/monet_dataset.\n","authors":["Luigi Riz","Andrea Caraffa","Matteo Bortolon","Mohamed Lamine Mekhalfi","Davide Boscaini","André Moura","José Antunes","André Dias","Hugo Silva","Andreas Leonidou","Christos Constantinides","Christos Keleshis","Dante Abate","Fabio Poiesi"],"pdf_url":"https://arxiv.org/pdf/2304.05417v2.pdf","comment":"Published in Computer Vision and Pattern Recognition (CVPR) Workshops\n 2023 - 6th Multimodal Learning and Applications Workshop"},{"id":"http://arxiv.org/abs/2307.09861v1","updated":"2023-07-19T09:45:06Z","published":"2023-07-19T09:45:06Z","title":"BSDM: Background Suppression Diffusion Model for Hyperspectral Anomaly\n Detection","summary":" Hyperspectral anomaly detection (HAD) is widely used in Earth observation and\ndeep space exploration. A major challenge for HAD is the complex background of\nthe input hyperspectral images (HSIs), resulting in anomalies confused in the\nbackground. On the other hand, the lack of labeled samples for HSIs leads to\npoor generalization of existing HAD methods. This paper starts the first\nattempt to study a new and generalizable background learning problem without\nlabeled samples. 
We present a novel solution BSDM (background suppression\ndiffusion model) for HAD, which can simultaneously learn latent background\ndistributions and generalize to different datasets for suppressing complex\nbackground. It is featured in three aspects: (1) For the complex background of\nHSIs, we design pseudo background noise and learn the potential background\ndistribution in it with a diffusion model (DM). (2) For the generalizability\nproblem, we apply a statistical offset module so that the BSDM adapts to\ndatasets of different domains without labeling samples. (3) For achieving\nbackground suppression, we innovatively improve the inference process of DM by\nfeeding the original HSIs into the denoising network, which removes the\nbackground as noise. Our work paves a new background suppression way for HAD\nthat can improve HAD performance without the prerequisite of manually labeled\ndata. Assessments and generalization experiments of four HAD methods on several\nreal HSI datasets demonstrate the above three unique properties of the proposed\nmethod. The code is available at https://github.com/majitao-xd/BSDM-HAD.\n","authors":["Jitao Ma","Weiying Xie","Yunsong Li","Leyuan Fang"],"pdf_url":"https://arxiv.org/pdf/2307.09861v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09857v1","updated":"2023-07-19T09:36:08Z","published":"2023-07-19T09:36:08Z","title":"Blind Image Quality Assessment Using Multi-Stream Architecture with\n Spatial and Channel Attention","summary":" BIQA (Blind Image Quality Assessment) is an important field of study that\nevaluates images automatically. Although significant progress has been made,\nblind image quality assessment remains a difficult task since images vary in\ncontent and distortions. Most algorithms generate quality without emphasizing\nthe important region of interest. In order to solve this, a multi-stream\nspatial and channel attention-based algorithm is being proposed. This algorithm\ngenerates more accurate predictions with a high correlation to human perceptual\nassessment by combining hybrid features from two different backbones, followed\nby spatial and channel attention to provide high weights to the region of\ninterest. Four legacy image quality assessment datasets are used to validate\nthe effectiveness of our proposed approach. Authentic and synthetic distortion\nimage databases are used to demonstrate the effectiveness of the proposed\nmethod, and we show that it has excellent generalization properties with a\nparticular focus on the perceptual foreground information.\n","authors":["Hassan Khalid","Nisar Ahmed"],"pdf_url":"https://arxiv.org/pdf/2307.09857v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.02203v3","updated":"2023-07-19T09:34:22Z","published":"2023-07-05T10:54:50Z","title":"Neural Fields for Interactive Visualization of Statistical Dependencies\n in 3D Simulation Ensembles","summary":" We present the first neural network that has learned to compactly represent\nand can efficiently reconstruct the statistical dependencies between the values\nof physical variables at different spatial locations in large 3D simulation\nensembles. Going beyond linear dependencies, we consider mutual information as\na measure of non-linear dependence. We demonstrate learning and reconstruction\nwith a large weather forecast ensemble comprising 1000 members, each storing\nmultiple physical variables at a 250 x 352 x 20 simulation grid. 
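The neural-fields entry above (Farokhmanesh et al.) uses mutual information as its measure of non-linear statistical dependence between values sampled across ensemble members at different grid points. As a point of reference for the kind of compute-intensive estimator the learned representation is meant to replace at runtime, here is a minimal NumPy sketch of a plain histogram-based mutual-information estimate between two such variables; the bin count, variable names, and toy data are illustrative and not taken from the paper.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based MI estimate between two 1-D samples (in nats)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                # joint probability table
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x
    py = pxy.sum(axis=0, keepdims=True)      # marginal of y
    nz = pxy > 0                             # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Toy "ensemble": 1000 members observing two correlated variables at two grid points.
rng = np.random.default_rng(0)
var_a = rng.normal(size=1000)
var_b = 0.8 * var_a + 0.2 * rng.normal(size=1000)
print(f"MI estimate: {mutual_information(var_a, var_b):.3f} nats")
```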
By\ncircumventing compute-intensive statistical estimators at runtime, we\ndemonstrate significantly reduced memory and computation requirements for\nreconstructing the major dependence structures. This enables embedding the\nestimator into a GPU-accelerated direct volume renderer and interactively\nvisualizing all mutual dependencies for a selected domain point.\n","authors":["Fatemeh Farokhmanesh","Kevin Höhlein","Christoph Neuhauser","Rüdiger Westermann"],"pdf_url":"https://arxiv.org/pdf/2307.02203v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09856v1","updated":"2023-07-19T09:30:00Z","published":"2023-07-19T09:30:00Z","title":"Hierarchical Spatio-Temporal Representation Learning for Gait\n Recognition","summary":" Gait recognition is a biometric technique that identifies individuals by\ntheir unique walking styles, which is suitable for unconstrained environments\nand has a wide range of applications. While current methods focus on exploiting\nbody part-based representations, they often neglect the hierarchical\ndependencies between local motion patterns. In this paper, we propose a\nhierarchical spatio-temporal representation learning (HSTL) framework for\nextracting gait features from coarse to fine. Our framework starts with a\nhierarchical clustering analysis to recover multi-level body structures from\nthe whole body to local details. Next, an adaptive region-based motion\nextractor (ARME) is designed to learn region-independent motion features. The\nproposed HSTL then stacks multiple ARMEs in a top-down manner, with each ARME\ncorresponding to a specific partition level of the hierarchy. An adaptive\nspatio-temporal pooling (ASTP) module is used to capture gait features at\ndifferent levels of detail to perform hierarchical feature mapping. Finally, a\nframe-level temporal aggregation (FTA) module is employed to reduce redundant\ninformation in gait sequences through multi-scale temporal downsampling.\nExtensive experiments on CASIA-B, OUMVLP, GREW, and Gait3D datasets demonstrate\nthat our method outperforms the state-of-the-art while maintaining a reasonable\nbalance between model accuracy and complexity.\n","authors":["Lei Wang","Bo Liu","Fangfang Liang","Bincheng Wang"],"pdf_url":"https://arxiv.org/pdf/2307.09856v1.pdf","comment":"Accepted to ICCV2023"},{"id":"http://arxiv.org/abs/2307.00574v2","updated":"2023-07-19T09:27:14Z","published":"2023-07-02T13:57:45Z","title":"Bidirectional Temporal Diffusion Model for Temporally Consistent Human\n Animation","summary":" We introduce a method to generate temporally coherent human animation from a\nsingle image, a video, or a random noise. This problem has been formulated as\nmodeling of an auto-regressive generation, i.e., to regress past frames to\ndecode future frames. However, such unidirectional generation is highly prone\nto motion drifting over time, generating unrealistic human animation with\nsignificant artifacts such as appearance distortion. We claim that\nbidirectional temporal modeling enforces temporal coherence on a generative\nnetwork by largely suppressing the motion ambiguity of human appearance. To\nprove our claim, we design a novel human animation framework using a denoising\ndiffusion model: a neural network learns to generate the image of a person by\ndenoising temporal Gaussian noises whose intermediate results are\ncross-conditioned bidirectionally between consecutive frames. 
In the\nexperiments, our method demonstrates strong performance compared to existing\nunidirectional approaches with realistic temporal coherence\n","authors":["Tserendorj Adiya","Sanghun Kim","Jung Eun Lee","Jae Shin Yoon","Hwasup Lim"],"pdf_url":"https://arxiv.org/pdf/2307.00574v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.07873v2","updated":"2023-07-19T09:23:43Z","published":"2023-07-15T19:20:49Z","title":"Why Does Little Robustness Help? Understanding Adversarial\n Transferability From Surrogate Training","summary":" Adversarial examples (AEs) for DNNs have been shown to be transferable: AEs\nthat successfully fool white-box surrogate models can also deceive other\nblack-box models with different architectures. Although a bunch of empirical\nstudies have provided guidance on generating highly transferable AEs, many of\nthese findings lack explanations and even lead to inconsistent advice. In this\npaper, we take a further step towards understanding adversarial\ntransferability, with a particular focus on surrogate aspects. Starting from\nthe intriguing little robustness phenomenon, where models adversarially trained\nwith mildly perturbed adversarial samples can serve as better surrogates, we\nattribute it to a trade-off between two predominant factors: model smoothness\nand gradient similarity. Our investigations focus on their joint effects,\nrather than their separate correlations with transferability. Through a series\nof theoretical and empirical analyses, we conjecture that the data distribution\nshift in adversarial training explains the degradation of gradient similarity.\nBuilding on these insights, we explore the impacts of data augmentation and\ngradient regularization on transferability and identify that the trade-off\ngenerally exists in the various training mechanisms, thus building a\ncomprehensive blueprint for the regulation mechanism behind transferability.\nFinally, we provide a general route for constructing better surrogates to boost\ntransferability which optimizes both model smoothness and gradient similarity\nsimultaneously, e.g., the combination of input gradient regularization and\nsharpness-aware minimization (SAM), validated by extensive experiments. In\nsummary, we call for attention to the united impacts of these two factors for\nlaunching effective transfer attacks, rather than optimizing one while ignoring\nthe other, and emphasize the crucial role of manipulating surrogate models.\n","authors":["Yechao Zhang","Shengshan Hu","Leo Yu Zhang","Junyu Shi","Minghui Li","Xiaogeng Liu","Wei Wan","Hai Jin"],"pdf_url":"https://arxiv.org/pdf/2307.07873v2.pdf","comment":"Accepted by IEEE Symposium on Security and Privacy (Oakland) 2024; 21\n pages, 12 figures, 13 tables"},{"id":"http://arxiv.org/abs/2208.10741v3","updated":"2023-07-19T09:15:05Z","published":"2022-08-23T05:27:32Z","title":"Hierarchically Decomposed Graph Convolutional Networks for\n Skeleton-Based Action Recognition","summary":" Graph convolutional networks (GCNs) are the most commonly used methods for\nskeleton-based action recognition and have achieved remarkable performance.\nGenerating adjacency matrices with semantically meaningful edges is\nparticularly important for this task, but extracting such edges is challenging\nproblem. To solve this, we propose a hierarchically decomposed graph\nconvolutional network (HD-GCN) architecture with a novel hierarchically\ndecomposed graph (HD-Graph). 
The proposed HD-GCN effectively decomposes every\njoint node into several sets to extract major structurally adjacent and distant\nedges, and uses them to construct an HD-Graph containing those edges in the\nsame semantic spaces of a human skeleton. In addition, we introduce an\nattention-guided hierarchy aggregation (A-HA) module to highlight the dominant\nhierarchical edge sets of the HD-Graph. Furthermore, we apply a new six-way\nensemble method, which uses only joint and bone stream without any motion\nstream. The proposed model is evaluated and achieves state-of-the-art\nperformance on four large, popular datasets. Finally, we demonstrate the\neffectiveness of our model with various comparative experiments.\n","authors":["Jungho Lee","Minhyeok Lee","Dogyoon Lee","Sangyoun Lee"],"pdf_url":"https://arxiv.org/pdf/2208.10741v3.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.06689v2","updated":"2023-07-19T09:09:42Z","published":"2023-07-13T11:21:58Z","title":"YOLIC: An Efficient Method for Object Localization and Classification on\n Edge Devices","summary":" In the realm of Tiny AI, we introduce \"You Only Look at Interested Cells\"\n(YOLIC), an efficient method for object localization and classification on edge\ndevices. Seamlessly blending the strengths of semantic segmentation and object\ndetection, YOLIC offers superior computational efficiency and precision. By\nadopting Cells of Interest for classification instead of individual pixels,\nYOLIC encapsulates relevant information, reduces computational load, and\nenables rough object shape inference. Importantly, the need for bounding box\nregression is obviated, as YOLIC capitalizes on the predetermined cell\nconfiguration that provides information about potential object location, size,\nand shape. To tackle the issue of single-label classification limitations, a\nmulti-label classification approach is applied to each cell, effectively\nrecognizing overlapping or closely situated objects. This paper presents\nextensive experiments on multiple datasets, demonstrating that YOLIC achieves\ndetection performance comparable to the state-of-the-art YOLO algorithms while\nsurpassing in speed, exceeding 30fps on a Raspberry Pi 4B CPU. All resources\nrelated to this study, including datasets, cell designer, image annotation\ntool, and source code, have been made publicly available on our project website\nat https://kai3316.github.io/yolic.github.io\n","authors":["Kai Su","Qiangfu Zhao","Yoichi Tomioka","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2307.06689v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09847v1","updated":"2023-07-19T09:09:24Z","published":"2023-07-19T09:09:24Z","title":"Cryo-forum: A framework for orientation recovery with uncertainty\n measure with the application in cryo-EM image analysis","summary":" In single-particle cryo-electron microscopy (cryo-EM), the efficient\ndetermination of orientation parameters for 2D projection images poses a\nsignificant challenge yet is crucial for reconstructing 3D structures. This\ntask is complicated by the high noise levels present in the cryo-EM datasets,\nwhich often include outliers, necessitating several time-consuming 2D clean-up\nprocesses. Recently, solutions based on deep learning have emerged, offering a\nmore streamlined approach to the traditionally laborious task of orientation\nestimation. These solutions often employ amortized inference, eliminating the\nneed to estimate parameters individually for each image. 
However, these methods\nfrequently overlook the presence of outliers and may not adequately concentrate\non the components used within the network. This paper introduces a novel\napproach that uses a 10-dimensional feature vector to represent the orientation\nand applies a Quadratically-Constrained Quadratic Program to derive the\npredicted orientation as a unit quaternion, supplemented by an uncertainty\nmetric. Furthermore, we propose a unique loss function that considers the\npairwise distances between orientations, thereby enhancing the accuracy of our\nmethod. Finally, we also comprehensively evaluate the design choices involved\nin constructing the encoder network, a topic that has not received sufficient\nattention in the literature. Our numerical analysis demonstrates that our\nmethodology effectively recovers orientations from 2D cryo-EM images in an\nend-to-end manner. Importantly, the inclusion of uncertainty quantification\nallows for direct clean-up of the dataset at the 3D level. Lastly, we package\nour proposed methods into a user-friendly software suite named cryo-forum,\ndesigned for easy accessibility by the developers.\n","authors":["Szu-Chi Chung"],"pdf_url":"https://arxiv.org/pdf/2307.09847v1.pdf","comment":"27 pages, 9 figures"},{"id":"http://arxiv.org/abs/2307.09841v1","updated":"2023-07-19T08:55:39Z","published":"2023-07-19T08:55:39Z","title":"Compressive Image Scanning Microscope","summary":" We present a novel approach to implement compressive sensing in laser\nscanning microscopes (LSM), specifically in image scanning microscopy (ISM),\nusing a single-photon avalanche diode (SPAD) array detector. Our method\naddresses two significant limitations in applying compressive sensing to LSM:\nthe time to compute the sampling matrix and the quality of reconstructed\nimages. We employ a fixed sampling strategy, skipping alternate rows and\ncolumns during data acquisition, which reduces the number of points scanned by\na factor of four and eliminates the need to compute different sampling\nmatrices. By exploiting the parallel images generated by the SPAD array, we\nimprove the quality of the reconstructed compressive-ISM images compared to\nstandard compressive confocal LSM images. Our results demonstrate the\neffectiveness of our approach in producing higher-quality images with reduced\ndata acquisition time and potential benefits in reducing photobleaching.\n","authors":["Ajay Gunalan","Marco Castello","Simonluca Piazza","Shunlei Li","Alberto Diaspro","Leonardo S. Mattos","Paolo Bianchini"],"pdf_url":"https://arxiv.org/pdf/2307.09841v1.pdf","comment":"Presented in ISCS23"},{"id":"http://arxiv.org/abs/2111.01396v2","updated":"2023-07-19T08:55:05Z","published":"2021-11-02T06:58:22Z","title":"Boundary Distribution Estimation for Precise Object Detection","summary":" In the field of state-of-the-art object detection, the task of object\nlocalization is typically accomplished through a dedicated subnet that\nemphasizes bounding box regression. This subnet traditionally predicts the\nobject's position by regressing the box's center position and scaling factors.\nDespite the widespread adoption of this approach, we have observed that the\nlocalization results often suffer from defects, leading to unsatisfactory\ndetector performance. In this paper, we address the shortcomings of previous\nmethods through theoretical analysis and experimental verification and present\nan innovative solution for precise object detection. 
Instead of solely focusing\non the object's center and size, our approach enhances the accuracy of bounding\nbox localization by refining the box edges based on the estimated distribution\nat the object's boundary. Experimental results demonstrate the potential and\ngeneralizability of our proposed method.\n","authors":["Peng Zhi","Haoran Zhou","Hang Huang","Rui Zhao","Rui Zhou","Qingguo Zhou"],"pdf_url":"https://arxiv.org/pdf/2111.01396v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09340v2","updated":"2023-07-19T08:55:01Z","published":"2023-03-16T14:21:45Z","title":"Improving Automated Hemorrhage Detection in Sparse-view Computed\n Tomography via Deep Convolutional Neural Network based Artifact Reduction","summary":" Purpose: Sparse-view computed tomography (CT) is an effective way to reduce\ndose by lowering the total number of views acquired, albeit at the expense of\nimage quality, which, in turn, can impact the ability to detect diseases. We\nexplore deep learning-based artifact reduction in sparse-view cranial CT scans\nand its impact on automated hemorrhage detection. Methods: We trained a U-Net\nfor artefact reduction on simulated sparse-view cranial CT scans from 3000\npatients obtained from a public dataset and reconstructed with varying levels\nof sub-sampling. Additionally, we trained a convolutional neural network on\nfully sampled CT data from 17,545 patients for automated hemorrhage detection.\nWe evaluated the classification performance using the area under the receiver\noperator characteristic curves (AUC-ROCs) with corresponding 95% confidence\nintervals (CIs) and the DeLong test, along with confusion matrices. The\nperformance of the U-Net was compared to an analytical approach based on total\nvariation (TV). Results: The U-Net performed superior compared to unprocessed\nand TV-processed images with respect to image quality and automated hemorrhage\ndiagnosis. With U-Net post-processing, the number of views can be reduced from\n4096 (AUC-ROC: 0.974; 95% CI: 0.972-0.976) views to 512 views (0.973;\n0.971-0.975) with minimal decrease in hemorrhage detection (P<.001) and to 256\nviews (0.967; 0.964-0.969) with a slight performance decrease (P<.001).\nConclusion: The results suggest that U-Net based artifact reduction\nsubstantially enhances automated hemorrhage detection in sparse-view cranial\nCTs. Our findings highlight that appropriate post-processing is crucial for\noptimal image quality and diagnostic accuracy while minimizing radiation dose.\n","authors":["Johannes Thalhammer","Manuel Schultheiss","Tina Dorosti","Tobias Lasser","Franz Pfeiffer","Daniela Pfeiffer","Florian Schaff"],"pdf_url":"https://arxiv.org/pdf/2303.09340v2.pdf","comment":"11 pages, 6 figures, 1 table"},{"id":"http://arxiv.org/abs/2210.06551v3","updated":"2023-07-19T08:54:27Z","published":"2022-10-12T19:46:25Z","title":"MotionBERT: A Unified Perspective on Learning Human Motion\n Representations","summary":" We present a unified perspective on tackling various human-centric video\ntasks by learning human motion representations from large-scale and\nheterogeneous data resources. Specifically, we propose a pretraining stage in\nwhich a motion encoder is trained to recover the underlying 3D motion from\nnoisy partial 2D observations. The motion representations acquired in this way\nincorporate geometric, kinematic, and physical knowledge about human motion,\nwhich can be easily transferred to multiple downstream tasks. 
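The MotionBERT entry above pretrains a motion encoder to recover the underlying 3D motion from noisy, partial 2D observations. The sketch below shows, under stated assumptions, what such a 2D-to-3D recovery objective can look like in PyTorch: a placeholder encoder (standing in for the DSTformer described next), randomly masked and perturbed 2D keypoints as input, and a per-joint position error as the loss. Shapes, masking ratio, and the toy encoder are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

# Placeholder encoder: maps masked/noisy 2D keypoints to 3D joints.
# Shapes are illustrative: (batch, frames, joints, 2) -> (batch, frames, joints, 3).
class ToyMotionEncoder(nn.Module):
    def __init__(self, joints=17, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, joints * 3),
        )

    def forward(self, kp2d):
        b, t, j, _ = kp2d.shape
        out = self.net(kp2d.reshape(b, t, j * 2))
        return out.reshape(b, t, j, 3)

encoder = ToyMotionEncoder()
kp3d_gt = torch.randn(8, 16, 17, 3)                                   # ground-truth 3D motion
kp2d = kp3d_gt[..., :2] + 0.02 * torch.randn_like(kp3d_gt[..., :2])   # noisy 2D observations
mask = (torch.rand(8, 16, 17, 1) > 0.15).float()                      # drop ~15% of joints ("partial")
pred3d = encoder(kp2d * mask)
loss = (pred3d - kp3d_gt).norm(dim=-1).mean()                         # per-joint position error
loss.backward()
```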
We implement the\nmotion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer)\nneural network. It could capture long-range spatio-temporal relationships among\nthe skeletal joints comprehensively and adaptively, exemplified by the lowest\n3D pose estimation error so far when trained from scratch. Furthermore, our\nproposed framework achieves state-of-the-art performance on all three\ndownstream tasks by simply finetuning the pretrained motion encoder with a\nsimple regression head (1-2 layers), which demonstrates the versatility of the\nlearned motion representations. Code and models are available at\nhttps://motionbert.github.io/\n","authors":["Wentao Zhu","Xiaoxuan Ma","Zhaoyang Liu","Libin Liu","Wayne Wu","Yizhou Wang"],"pdf_url":"https://arxiv.org/pdf/2210.06551v3.pdf","comment":"ICCV 2023 version"},{"id":"http://arxiv.org/abs/2307.09829v1","updated":"2023-07-19T08:34:25Z","published":"2023-07-19T08:34:25Z","title":"What do neural networks learn in image classification? A frequency\n shortcut perspective","summary":" Frequency analysis is useful for understanding the mechanisms of\nrepresentation learning in neural networks (NNs). Most research in this area\nfocuses on the learning dynamics of NNs for regression tasks, while little for\nclassification. This study empirically investigates the latter and expands the\nunderstanding of frequency shortcuts. First, we perform experiments on\nsynthetic datasets, designed to have a bias in different frequency bands. Our\nresults demonstrate that NNs tend to find simple solutions for classification,\nand what they learn first during training depends on the most distinctive\nfrequency characteristics, which can be either low- or high-frequencies.\nSecond, we confirm this phenomenon on natural images. We propose a metric to\nmeasure class-wise frequency characteristics and a method to identify frequency\nshortcuts. The results show that frequency shortcuts can be texture-based or\nshape-based, depending on what best simplifies the objective. Third, we\nvalidate the transferability of frequency shortcuts on out-of-distribution\n(OOD) test sets. Our results suggest that frequency shortcuts can be\ntransferred across datasets and cannot be fully avoided by larger model\ncapacity and data augmentation. We recommend that future research should focus\non effective training schemes mitigating frequency shortcut learning.\n","authors":["Shunxin Wang","Raymond Veldhuis","Christoph Brune","Nicola Strisciuglio"],"pdf_url":"https://arxiv.org/pdf/2307.09829v1.pdf","comment":"Accepted at ICCV2023"},{"id":"http://arxiv.org/abs/2307.09827v1","updated":"2023-07-19T08:32:59Z","published":"2023-07-19T08:32:59Z","title":"Online Continual Learning for Robust Indoor Object Recognition","summary":" Vision systems mounted on home robots need to interact with unseen classes in\nchanging environments. Robots have limited computational resources, labelled\ndata and storage capability. These requirements pose some unique challenges:\nmodels should adapt without forgetting past knowledge in a data- and\nparameter-efficient way. We characterize the problem as few-shot (FS) online\ncontinual learning (OCL), where robotic agents learn from a non-repeated stream\nof few-shot data updating only a few model parameters. Additionally, such\nmodels experience variable conditions at test time, where objects may appear in\ndifferent poses (e.g., horizontal or vertical) and environments (e.g., day or\nnight). 
To improve robustness of CL agents, we propose RobOCLe, which; 1)\nconstructs an enriched feature space computing high order statistical moments\nfrom the embedded features of samples; and 2) computes similarity between high\norder statistics of the samples on the enriched feature space, and predicts\ntheir class labels. We evaluate robustness of CL models to train/test\naugmentations in various cases. We show that different moments allow RobOCLe to\ncapture different properties of deformations, providing higher robustness with\nno decrease of inference speed.\n","authors":["Umberto Michieli","Mete Ozay"],"pdf_url":"https://arxiv.org/pdf/2307.09827v1.pdf","comment":"IROS 2023"},{"id":"http://arxiv.org/abs/2307.09416v2","updated":"2023-07-19T08:27:50Z","published":"2023-07-18T16:33:30Z","title":"Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation\n Evaluation","summary":" Research in Image Generation has recently made significant progress,\nparticularly boosted by the introduction of Vision-Language models which are\nable to produce high-quality visual content based on textual inputs. Despite\nongoing advancements in terms of generation quality and realism, no methodical\nframeworks have been defined yet to quantitatively measure the quality of the\ngenerated content and the adherence with the prompted requests: so far, only\nhuman-based evaluations have been adopted for quality satisfaction and for\ncomparing different generative methods. We introduce a novel automated method\nfor Visual Concept Evaluation (ViCE), i.e. to assess consistency between a\ngenerated/edited image and the corresponding prompt/instructions, with a\nprocess inspired by the human cognitive behaviour. ViCE combines the strengths\nof Large Language Models (LLMs) and Visual Question Answering (VQA) into a\nunified pipeline, aiming to replicate the human cognitive process in quality\nassessment. This method outlines visual concepts, formulates image-specific\nverification questions, utilizes the Q&A system to investigate the image, and\nscores the combined outcome. Although this brave new hypothesis of mimicking\nhumans in the image evaluation process is in its preliminary assessment stage,\nresults are promising and open the door to a new form of automatic evaluation\nwhich could have significant impact as the image generation or the image target\nediting tasks become more and more sophisticated.\n","authors":["Federico Betti","Jacopo Staiano","Lorenzo Baraldi","Lorenzo Baraldi","Rita Cucchiara","Nicu Sebe"],"pdf_url":"https://arxiv.org/pdf/2307.09416v2.pdf","comment":"Accepted as oral at ACM MultiMedia 2023 (Brave New Ideas track)"},{"id":"http://arxiv.org/abs/2205.11397v5","updated":"2023-07-19T08:25:37Z","published":"2022-05-23T15:42:12Z","title":"Super Vision Transformer","summary":" We attempt to reduce the computational costs in vision transformers (ViTs),\nwhich increase quadratically in the token number. We present a novel training\nparadigm that trains only one ViT model at a time, but is capable of providing\nimproved image recognition performance with various computational costs. Here,\nthe trained ViT model, termed super vision transformer (SuperViT), is empowered\nwith the versatile ability to solve incoming patches of multiple sizes as well\nas preserve informative tokens with multiple keeping rates (the ratio of\nkeeping tokens) to achieve good hardware efficiency for inference, given that\nthe available hardware resources often change from time to time. 
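The SuperViT entry above trains a single ViT that can operate at several token keeping rates. As a hedged illustration of the general token-keeping idea only (not the paper's actual selection rule), the sketch below scores patch tokens by the class token's attention and retains the top fraction of them:

```python
import torch

def keep_informative_tokens(tokens, cls_attention, keep_rate=0.5):
    """Keep the class token plus the top-`keep_rate` fraction of patch tokens.

    tokens:        (batch, 1 + num_patches, dim), class token first
    cls_attention: (batch, num_patches) attention from the class token to each patch
    """
    b, n, d = tokens.shape
    num_keep = max(1, int((n - 1) * keep_rate))
    idx = cls_attention.topk(num_keep, dim=1).indices      # (batch, num_keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, d)              # broadcast over feature dim
    kept_patches = torch.gather(tokens[:, 1:], dim=1, index=idx)
    return torch.cat([tokens[:, :1], kept_patches], dim=1)

tokens = torch.randn(2, 197, 384)          # e.g. ViT-S: 196 patch tokens + 1 class token
cls_attn = torch.rand(2, 196)
print(keep_informative_tokens(tokens, cls_attn, keep_rate=0.5).shape)  # torch.Size([2, 99, 384])
```

Varying `keep_rate` at inference time is what lets a single model trade accuracy for compute as hardware budgets change.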
Experimental\nresults on ImageNet demonstrate that our SuperViT can considerably reduce the\ncomputational costs of ViT models with even performance increase. For example,\nwe reduce 2x FLOPs of DeiT-S while increasing the Top-1 accuracy by 0.2% and\n0.7% for 1.5x reduction. Also, our SuperViT significantly outperforms existing\nstudies on efficient vision transformers. For example, when consuming the same\namount of FLOPs, our SuperViT surpasses the recent state-of-the-art (SOTA) EViT\nby 1.1% when using DeiT-S as their backbones. The project of this work is made\npublicly available at https://github.com/lmbxmu/SuperViT.\n","authors":["Mingbao Lin","Mengzhao Chen","Yuxin Zhang","Chunhua Shen","Rongrong Ji","Liujuan Cao"],"pdf_url":"https://arxiv.org/pdf/2205.11397v5.pdf","comment":"Accepted by International Journal of Computer Vision (IJCV) in the\n year of 2023"},{"id":"http://arxiv.org/abs/2307.09823v1","updated":"2023-07-19T08:21:01Z","published":"2023-07-19T08:21:01Z","title":"Multi-modal Learning based Prediction for Disease","summary":" Non alcoholic fatty liver disease (NAFLD) is the most common cause of chronic\nliver disease, which can be predicted accurately to prevent advanced fibrosis\nand cirrhosis. While, a liver biopsy, the gold standard for NAFLD diagnosis, is\ninvasive, expensive, and prone to sampling errors. Therefore, non-invasive\nstudies are extremely promising, yet they are still in their infancy due to the\nlack of comprehensive research data and intelligent methods for multi-modal\ndata. This paper proposes a NAFLD diagnosis system (DeepFLDDiag) combining a\ncomprehensive clinical dataset (FLDData) and a multi-modal learning based NAFLD\nprediction method (DeepFLD). The dataset includes over 6000 participants\nphysical examinations, laboratory and imaging studies, extensive\nquestionnaires, and facial images of partial participants, which is\ncomprehensive and valuable for clinical studies. From the dataset, we\nquantitatively analyze and select clinical metadata that most contribute to\nNAFLD prediction. Furthermore, the proposed DeepFLD, a deep neural network\nmodel designed to predict NAFLD using multi-modal input, including metadata and\nfacial images, outperforms the approach that only uses metadata. Satisfactory\nperformance is also verified on other unseen datasets. Inspiringly, DeepFLD can\nachieve competitive results using only facial images as input rather than\nmetadata, paving the way for a more robust and simpler non-invasive NAFLD\ndiagnosis.\n","authors":["Yaran Chen","Xueyu Chen","Yu Han","Haoran Li","Dongbin Zhao","Jingzhong Li","Xu Wang"],"pdf_url":"https://arxiv.org/pdf/2307.09823v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.08096v2","updated":"2023-07-19T08:19:58Z","published":"2023-03-14T17:33:39Z","title":"MELON: NeRF with Unposed Images in SO(3)","summary":" Neural radiance fields enable novel-view synthesis and scene reconstruction\nwith photorealistic quality from a few images, but require known and accurate\ncamera poses. Conventional pose estimation algorithms fail on smooth or\nself-similar scenes, while methods performing inverse rendering from unposed\nviews require a rough initialization of the camera orientations. The main\ndifficulty of pose estimation lies in real-life objects being almost invariant\nunder certain transformations, making the photometric distance between rendered\nviews non-convex with respect to the camera parameters. 
Using an equivalence\nrelation that matches the distribution of local minima in camera space, we\nreduce this space to its quotient set, in which pose estimation becomes a more\nconvex problem. Using a neural-network to regularize pose estimation, we\ndemonstrate that our method - MELON - can reconstruct a neural radiance field\nfrom unposed images with state-of-the-art accuracy while requiring ten times\nfewer views than adversarial approaches.\n","authors":["Axel Levy","Mark Matthews","Matan Sela","Gordon Wetzstein","Dmitry Lagun"],"pdf_url":"https://arxiv.org/pdf/2303.08096v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09822v1","updated":"2023-07-19T08:19:08Z","published":"2023-07-19T08:19:08Z","title":"A Siamese-based Verification System for Open-set Architecture\n Attribution of Synthetic Images","summary":" Despite the wide variety of methods developed for synthetic image\nattribution, most of them can only attribute images generated by models or\narchitectures included in the training set and do not work with unknown\narchitectures, hindering their applicability in real-world scenarios. In this\npaper, we propose a verification framework that relies on a Siamese Network to\naddress the problem of open-set attribution of synthetic images to the\narchitecture that generated them. We consider two different settings. In the\nfirst setting, the system determines whether two images have been produced by\nthe same generative architecture or not. In the second setting, the system\nverifies a claim about the architecture used to generate a synthetic image,\nutilizing one or multiple reference images generated by the claimed\narchitecture. The main strength of the proposed system is its ability to\noperate in both closed and open-set scenarios so that the input images, either\nthe query and reference images, can belong to the architectures considered\nduring training or not. Experimental evaluations encompassing various\ngenerative architectures such as GANs, diffusion models, and transformers,\nfocusing on synthetic face image generation, confirm the excellent performance\nof our method in both closed and open-set settings, as well as its strong\ngeneralization capabilities.\n","authors":["Lydia Abady","Jun Wang","Benedetta Tondi","Mauro Barni"],"pdf_url":"https://arxiv.org/pdf/2307.09822v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09821v1","updated":"2023-07-19T08:16:34Z","published":"2023-07-19T08:16:34Z","title":"Hierarchical Semantic Perceptual Listener Head Video Generation: A\n High-performance Pipeline","summary":" In dyadic speaker-listener interactions, the listener's head reactions along\nwith the speaker's head movements, constitute an important non-verbal semantic\nexpression together. The listener Head generation task aims to synthesize\nresponsive listener's head videos based on audios of the speaker and reference\nimages of the listener. Compared to the Talking-head generation, it is more\nchallenging to capture the correlation clues from the speaker's audio and\nvisual information. Following the ViCo baseline scheme, we propose a\nhigh-performance solution by enhancing the hierarchical semantic extraction\ncapability of the audio encoder module and improving the decoder part, renderer\nand post-processing modules. Our solution gets the first place on the official\nleaderboard for the track of listening head generation. 
This paper is a\ntechnical report of ViCo@2023 Conversational Head Generation Challenge in ACM\nMultimedia 2023 conference.\n","authors":["Zhigang Chang","Weitai Hu","Qing Yang","Shibao Zheng"],"pdf_url":"https://arxiv.org/pdf/2307.09821v1.pdf","comment":"ACM MM 2023"},{"id":"http://arxiv.org/abs/2307.09818v1","updated":"2023-07-19T08:06:37Z","published":"2023-07-19T08:06:37Z","title":"Deep unrolling Shrinkage Network for Dynamic MR imaging","summary":" Deep unrolling networks that utilize sparsity priors have achieved great\nsuccess in dynamic magnetic resonance (MR) imaging. The convolutional neural\nnetwork (CNN) is usually utilized to extract the transformed domain, and then\nthe soft thresholding (ST) operator is applied to the CNN-transformed data to\nenforce the sparsity priors. However, the ST operator is usually constrained to\nbe the same across all channels of the CNN-transformed data. In this paper, we\npropose a novel operator, called soft thresholding with channel attention\n(AST), that learns the threshold for each channel. In particular, we put\nforward a novel deep unrolling shrinkage network (DUS-Net) by unrolling the\nalternating direction method of multipliers (ADMM) for optimizing the\ntransformed $l_1$ norm dynamic MR reconstruction model. Experimental results on\nan open-access dynamic cine MR dataset demonstrate that the proposed DUS-Net\noutperforms the state-of-the-art methods. The source code is available at\n\\url{https://github.com/yhao-z/DUS-Net}.\n","authors":["Yinghao Zhang","Xiaodi Li","Weihang Li","Yue Hu"],"pdf_url":"https://arxiv.org/pdf/2307.09818v1.pdf","comment":"5 pages,3 figures,2 tables"},{"id":"http://arxiv.org/abs/2307.07813v3","updated":"2023-07-19T08:06:34Z","published":"2023-07-15T14:34:25Z","title":"TinyTracker: Ultra-Fast and Ultra-Low-Power Edge Vision In-Sensor for\n Gaze Estimation","summary":" Intelligent edge vision tasks encounter the critical challenge of ensuring\npower and latency efficiency due to the typically heavy computational load they\nimpose on edge platforms.This work leverages one of the first \"AI in sensor\"\nvision platforms, IMX500 by Sony, to achieve ultra-fast and ultra-low-power\nend-to-end edge vision applications. We evaluate the IMX500 and compare it to\nother edge platforms, such as the Google Coral Dev Micro and Sony Spresense, by\nexploring gaze estimation as a case study. We propose TinyTracker, a highly\nefficient, fully quantized model for 2D gaze estimation designed to maximize\nthe performance of the edge vision systems considered in this study.\nTinyTracker achieves a 41x size reduction (600Kb) compared to iTracker [1]\nwithout significant loss in gaze estimation accuracy (maximum of 0.16 cm when\nfully quantized). TinyTracker's deployment on the Sony IMX500 vision sensor\nresults in end-to-end latency of around 19ms. The camera takes around 17.9ms to\nread, process and transmit the pixels to the accelerator. The inference time of\nthe network is 0.86ms with an additional 0.24 ms for retrieving the results\nfrom the sensor. The overall energy consumption of the end-to-end system is 4.9\nmJ, including 0.06 mJ for inference. 
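Referring back to the DUS-Net entry above, which replaces the usual shared soft-thresholding operator with one whose threshold is learned per channel: the following is a minimal PyTorch sketch of per-channel soft thresholding driven by a small squeeze-and-excitation-style attention branch. It is a generic reconstruction of that idea, not the authors' exact AST module, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ChannelwiseSoftThreshold(nn.Module):
    """Soft thresholding y = sign(x) * max(|x| - tau_c, 0) with a learned tau per channel."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, channels, H, W)
        scale = x.abs().mean(dim=(2, 3))       # global average of |x| per channel
        tau = scale * self.attn(scale)         # per-channel threshold in (0, scale)
        tau = tau.unsqueeze(-1).unsqueeze(-1)  # broadcast over spatial dimensions
        return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

x = torch.randn(2, 16, 32, 32)
print(ChannelwiseSoftThreshold(16)(x).shape)   # torch.Size([2, 16, 32, 32])
```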
The end-to-end study shows that IMX500 is\n1.7x faster than CoralMicro (19ms vs 34.4ms) and 7x more power efficient (4.9mJ\nVS 34.2mJ)\n","authors":["Pietro Bonazzi","Thomas Ruegg","Sizhen Bian","Yawei Li","Michele Magno"],"pdf_url":"https://arxiv.org/pdf/2307.07813v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09815v1","updated":"2023-07-19T08:03:53Z","published":"2023-07-19T08:03:53Z","title":"LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network","summary":" Recovering sharp images from dual-pixel (DP) pairs with disparity-dependent\nblur is a challenging task.~Existing blur map-based deblurring methods have\ndemonstrated promising results. In this paper, we propose, to the best of our\nknowledge, the first framework to introduce the contrastive language-image\npre-training framework (CLIP) to achieve accurate blur map estimation from DP\npairs unsupervisedly. To this end, we first carefully design text prompts to\nenable CLIP to understand blur-related geometric prior knowledge from the DP\npair. Then, we propose a format to input stereo DP pair to the CLIP without any\nfine-tuning, where the CLIP is pre-trained on monocular images. Given the\nestimated blur map, we introduce a blur-prior attention block, a blur-weighting\nloss and a blur-aware loss to recover the all-in-focus image. Our method\nachieves state-of-the-art performance in extensive experiments.\n","authors":["Hao Yang","Liyuan Pan","Yan Yang","Miaomiao Liu"],"pdf_url":"https://arxiv.org/pdf/2307.09815v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09810v1","updated":"2023-07-19T07:58:21Z","published":"2023-07-19T07:58:21Z","title":"GenKL: An Iterative Framework for Resolving Label Ambiguity and Label\n Non-conformity in Web Images Via a New Generalized KL Divergence","summary":" Web image datasets curated online inherently contain ambiguous\nin-distribution (ID) instances and out-of-distribution (OOD) instances, which\nwe collectively call non-conforming (NC) instances. In many recent approaches\nfor mitigating the negative effects of NC instances, the core implicit\nassumption is that the NC instances can be found via entropy maximization. For\n\"entropy\" to be well-defined, we are interpreting the output prediction vector\nof an instance as the parameter vector of a multinomial random variable, with\nrespect to some trained model with a softmax output layer. Hence, entropy\nmaximization is based on the idealized assumption that NC instances have\npredictions that are \"almost\" uniformly distributed. However, in real-world web\nimage datasets, there are numerous NC instances whose predictions are far from\nbeing uniformly distributed. To tackle the limitation of entropy maximization,\nwe propose $(\\alpha, \\beta)$-generalized KL divergence,\n$\\mathcal{D}_{\\text{KL}}^{\\alpha, \\beta}(p\\|q)$, which can be used to identify\nsignificantly more NC instances. Theoretical properties of\n$\\mathcal{D}_{\\text{KL}}^{\\alpha, \\beta}(p\\|q)$ are proven, and we also show\nempirically that a simple use of $\\mathcal{D}_{\\text{KL}}^{\\alpha,\n\\beta}(p\\|q)$ outperforms all baselines on the NC instance identification task.\nBuilding upon $(\\alpha,\\beta)$-generalized KL divergence, we also introduce a\nnew iterative training framework, GenKL, that identifies and relabels NC\ninstances. 
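The GenKL entry above argues that the common recipe for flagging non-conforming (NC) web images, namely interpreting the softmax output as a multinomial parameter vector and flagging high-entropy, near-uniform predictions, misses NC instances whose predictions are far from uniform. For context, here is a minimal NumPy sketch of that entropy-maximization baseline; the proposed $(\alpha, \beta)$-generalized KL divergence itself is not reproduced here, and the threshold choice is illustrative.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy of each softmax prediction vector (rows of `probs`)."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p), axis=1)

# Three toy predictions over 4 classes.
preds = np.array([
    [0.97, 0.01, 0.01, 0.01],   # confident, ID-looking prediction
    [0.25, 0.25, 0.25, 0.25],   # near-uniform: flagged by entropy maximization
    [0.55, 0.43, 0.01, 0.01],   # ambiguous NC-style prediction that entropy may miss
])
scores = prediction_entropy(preds)
threshold = 0.9 * np.log(preds.shape[1])    # e.g. 90% of the maximum possible entropy
print(scores, scores > threshold)           # only the uniform row is flagged
```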
When evaluated on three web image datasets, Clothing1M,\nFood101/Food101N, and mini WebVision 1.0, we achieved new state-of-the-art\nclassification accuracies: $81.34\\%$, $85.73\\%$ and $78.99\\%$/$92.54\\%$\n(top-1/top-5), respectively.\n","authors":["Xia Huang","Kai Fong Ernest Chong"],"pdf_url":"https://arxiv.org/pdf/2307.09810v1.pdf","comment":"Published (with open access) at International Journal of Computer\n Vision (IJCV, 2023). 25 pages, 8 figures. Code is available at:\n https://github.com/codetopaper/GenKL"},{"id":"http://arxiv.org/abs/2307.09804v1","updated":"2023-07-19T07:47:23Z","published":"2023-07-19T07:47:23Z","title":"Fix your downsampling ASAP! Be natively more robust via Aliasing and\n Spectral Artifact free Pooling","summary":" Convolutional neural networks encode images through a sequence of\nconvolutions, normalizations and non-linearities as well as downsampling\noperations into potentially strong semantic embeddings. Yet, previous work\nshowed that even slight mistakes during sampling, leading to aliasing, can be\ndirectly attributed to the networks' lack in robustness. To address such issues\nand facilitate simpler and faster adversarial training, [12] recently proposed\nFLC pooling, a method for provably alias-free downsampling - in theory. In this\nwork, we conduct a further analysis through the lens of signal processing and\nfind that such current pooling methods, which address aliasing in the frequency\ndomain, are still prone to spectral leakage artifacts. Hence, we propose\naliasing and spectral artifact-free pooling, short ASAP. While only introducing\na few modifications to FLC pooling, networks using ASAP as downsampling method\nexhibit higher native robustness against common corruptions, a property that\nFLC pooling was missing. ASAP also increases native robustness against\nadversarial attacks on high and low resolution data while maintaining similar\nclean accuracy or even outperforming the baseline.\n","authors":["Julia Grabinski","Janis Keuper","Margret Keuper"],"pdf_url":"https://arxiv.org/pdf/2307.09804v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08723v2","updated":"2023-07-19T07:35:54Z","published":"2023-07-17T11:19:41Z","title":"Revisiting Scene Text Recognition: A Data Perspective","summary":" This paper aims to re-assess scene text recognition (STR) from a\ndata-oriented perspective. We begin by revisiting the six commonly used\nbenchmarks in STR and observe a trend of performance saturation, whereby only\n2.91% of the benchmark images cannot be accurately recognized by an ensemble of\n13 representative models. While these results are impressive and suggest that\nSTR could be considered solved, however, we argue that this is primarily due to\nthe less challenging nature of the common benchmarks, thus concealing the\nunderlying issues that STR faces. To this end, we consolidate a large-scale\nreal STR dataset, namely Union14M, which comprises 4 million labeled images and\n10 million unlabeled images, to assess the performance of STR models in more\ncomplex real-world scenarios. Our experiments demonstrate that the 13 models\ncan only achieve an average accuracy of 66.53% on the 4 million labeled images,\nindicating that STR still faces numerous challenges in the real world. By\nanalyzing the error patterns of the 13 models, we identify seven open\nchallenges in STR and develop a challenge-driven benchmark consisting of eight\ndistinct subsets to facilitate further progress in the field. 
Our exploration\ndemonstrates that STR is far from being solved and leveraging data may be a\npromising solution. In this regard, we find that utilizing the 10 million\nunlabeled images through self-supervised pre-training can significantly improve\nthe robustness of STR model in real-world scenarios and leads to\nstate-of-the-art performance.\n","authors":["Qing Jiang","Jiapeng Wang","Dezhi Peng","Chongyu Liu","Lianwen Jin"],"pdf_url":"https://arxiv.org/pdf/2307.08723v2.pdf","comment":"Accepted to ICCV2023"},{"id":"http://arxiv.org/abs/2210.16117v4","updated":"2023-07-19T07:34:37Z","published":"2022-10-28T13:25:59Z","title":"Improving the Transferability of Adversarial Attacks on Face Recognition\n with Beneficial Perturbation Feature Augmentation","summary":" Face recognition (FR) models can be easily fooled by adversarial examples,\nwhich are crafted by adding imperceptible perturbations on benign face images.\nThe existence of adversarial face examples poses a great threat to the security\nof society. In order to build a more sustainable digital nation, in this paper,\nwe improve the transferability of adversarial face examples to expose more\nblind spots of existing FR models. Though generating hard samples has shown its\neffectiveness in improving the generalization of models in training tasks, the\neffectiveness of utilizing this idea to improve the transferability of\nadversarial face examples remains unexplored. To this end, based on the\nproperty of hard samples and the symmetry between training tasks and\nadversarial attack tasks, we propose the concept of hard models, which have\nsimilar effects as hard samples for adversarial attack tasks. Utilizing the\nconcept of hard models, we propose a novel attack method called Beneficial\nPerturbation Feature Augmentation Attack (BPFA), which reduces the overfitting\nof adversarial examples to surrogate FR models by constantly generating new\nhard models to craft the adversarial examples. Specifically, in the\nbackpropagation, BPFA records the gradients on pre-selected feature maps and\nuses the gradient on the input image to craft the adversarial example. In the\nnext forward propagation, BPFA leverages the recorded gradients to add\nbeneficial perturbations on their corresponding feature maps to increase the\nloss. Extensive experiments demonstrate that BPFA can significantly boost the\ntransferability of adversarial attacks on FR.\n","authors":["Fengfan Zhou","Hefei Ling","Yuxuan Shi","Jiazhong Chen","Zongyi Li","Ping Li"],"pdf_url":"https://arxiv.org/pdf/2210.16117v4.pdf","comment":"\\c{opyright} 2023 IEEE. Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2302.05086v3","updated":"2023-07-19T07:31:35Z","published":"2023-02-10T07:08:13Z","title":"Making Substitute Models More Bayesian Can Enhance Transferability of\n Adversarial Examples","summary":" The transferability of adversarial examples across deep neural networks\n(DNNs) is the crux of many black-box attacks. Many prior efforts have been\ndevoted to improving the transferability via increasing the diversity in inputs\nof some substitute models. 
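The entry above contrasts its approach with prior work that improves transferability by diversifying the inputs fed to substitute models. As background only, the sketch below shows one common form of input diversity, a random resize-and-pad applied before the gradient step of an FGSM-style attack; it is not the Bayesian-substitute method this paper proposes, and the function names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def diverse_input(x, low=0.9, prob=0.7):
    """Randomly resize and zero-pad an image batch (a common 'input diversity' transform)."""
    if torch.rand(1).item() > prob:
        return x
    b, c, h, w = x.shape
    ratio = low + (1 - low) * torch.rand(1).item()
    new_h, new_w = int(h * ratio), int(w * ratio)
    resized = F.interpolate(x, size=(new_h, new_w), mode="bilinear", align_corners=False)
    pad_top = torch.randint(0, h - new_h + 1, (1,)).item()
    pad_left = torch.randint(0, w - new_w + 1, (1,)).item()
    return F.pad(resized, (pad_left, w - new_w - pad_left, pad_top, h - new_h - pad_top))

def fgsm_step(model, x, y, eps=2.0 / 255):
    """One FGSM-style step on a diversified input; `model` is any differentiable classifier."""
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(diverse_input(x_adv)), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x + eps * grad.sign()).clamp(0, 1)
```

Usage would be along the lines of `x_adv = fgsm_step(surrogate, images, labels)` for any PyTorch classifier `surrogate`.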
In this paper, by contrast, we opt for the diversity\nin substitute models and advocate to attack a Bayesian model for achieving\ndesirable transferability. Deriving from the Bayesian formulation, we develop a\nprincipled strategy for possible finetuning, which can be combined with many\noff-the-shelf Gaussian posterior approximations over DNN parameters. Extensive\nexperiments have been conducted to verify the effectiveness of our method, on\ncommon benchmark datasets, and the results demonstrate that our method\noutperforms recent state-of-the-arts by large margins (roughly 19% absolute\nincrease in average attack success rate on ImageNet), and, by combining with\nthese recent methods, further performance gain can be obtained. Our code:\nhttps://github.com/qizhangli/MoreBayesian-attack.\n","authors":["Qizhang Li","Yiwen Guo","Wangmeng Zuo","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2302.05086v3.pdf","comment":"Accepted by ICLR 2023, fix typos"},{"id":"http://arxiv.org/abs/2307.09795v1","updated":"2023-07-19T07:29:14Z","published":"2023-07-19T07:29:14Z","title":"From West to East: Who can understand the music of the others better?","summary":" Recent developments in MIR have led to several benchmark deep learning models\nwhose embeddings can be used for a variety of downstream tasks. At the same\ntime, the vast majority of these models have been trained on Western pop/rock\nmusic and related styles. This leads to research questions on whether these\nmodels can be used to learn representations for different music cultures and\nstyles, or whether we can build similar music audio embedding models trained on\ndata from different cultures or styles. To that end, we leverage transfer\nlearning methods to derive insights about the similarities between the\ndifferent music cultures to which the data belongs to. We use two Western music\ndatasets, two traditional/folk datasets coming from eastern Mediterranean\ncultures, and two datasets belonging to Indian art music. Three deep audio\nembedding models are trained and transferred across domains, including two\nCNN-based and a Transformer-based architecture, to perform auto-tagging for\neach target domain dataset. Experimental results show that competitive\nperformance is achieved in all domains via transfer learning, while the best\nsource dataset varies for each music culture. The implementation and the\ntrained models are both provided in a public repository.\n","authors":["Charilaos Papaioannou","Emmanouil Benetos","Alexandros Potamianos"],"pdf_url":"https://arxiv.org/pdf/2307.09795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09794v1","updated":"2023-07-19T07:25:33Z","published":"2023-07-19T07:25:33Z","title":"DiffDP: Radiotherapy Dose Prediction via a Diffusion Model","summary":" Currently, deep learning (DL) has achieved the automatic prediction of dose\ndistribution in radiotherapy planning, enhancing its efficiency and quality.\nHowever, existing methods suffer from the over-smoothing problem for their\ncommonly used L_1 or L_2 loss with posterior average calculations. To alleviate\nthis limitation, we innovatively introduce a diffusion-based dose prediction\n(DiffDP) model for predicting the radiotherapy dose distribution of cancer\npatients. Specifically, the DiffDP model contains a forward process and a\nreverse process. In the forward process, DiffDP gradually transforms dose\ndistribution maps into Gaussian noise by adding small noise and trains a noise\npredictor to predict the noise added in each timestep. 
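The DiffDP entry above follows the standard denoising-diffusion recipe in its forward process: corrupt the dose distribution map with Gaussian noise according to a timestep schedule and train a network to predict the injected noise. Below is a minimal PyTorch sketch of that training step with a placeholder convolutional noise predictor; the timestep embedding and the anatomy-conditioned structure encoder described next are omitted, and the schedule values are generic DDPM defaults rather than the paper's.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)       # cumulative product of (1 - beta_t)

noise_predictor = nn.Sequential(                    # placeholder for the real U-Net
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)

def diffusion_training_step(dose_maps):
    """One DDPM-style step: add noise at a random timestep, then predict that noise."""
    b = dose_maps.shape[0]
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(dose_maps)
    a = alpha_bar[t].view(b, 1, 1, 1)
    noisy = a.sqrt() * dose_maps + (1 - a).sqrt() * noise   # forward (noising) process
    pred_noise = noise_predictor(noisy)                     # timestep conditioning omitted
    return ((pred_noise - noise) ** 2).mean()

loss = diffusion_training_step(torch.rand(4, 1, 64, 64))    # toy 64x64 dose maps
loss.backward()
```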
In the reverse process,\nit removes the noise from the original Gaussian noise in multiple steps with\nthe well-trained noise predictor and finally outputs the predicted dose\ndistribution map. To ensure the accuracy of the prediction, we further design a\nstructure encoder to extract anatomical information from patient anatomy images\nand enable the noise predictor to be aware of the dose constraints within\nseveral essential organs, i.e., the planning target volume and organs at risk.\nExtensive experiments on an in-house dataset with 130 rectum cancer patients\ndemonstrate the s\n","authors":["Zhenghao Feng","Lu Wen","Peng Wang","Binyu Yan","Xi Wu","Jiliu Zhou","Yan Wang"],"pdf_url":"https://arxiv.org/pdf/2307.09794v1.pdf","comment":"to be published in MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.08015v2","updated":"2023-07-19T07:18:12Z","published":"2023-07-16T11:52:27Z","title":"Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via\n Geometry-Guided Cross-View Transformer","summary":" Image retrieval-based cross-view localization methods often lead to very\ncoarse camera pose estimation, due to the limited sampling density of the\ndatabase satellite images. In this paper, we propose a method to increase the\naccuracy of a ground camera's location and orientation by estimating the\nrelative rotation and translation between the ground-level image and its\nmatched/retrieved satellite image. Our approach designs a geometry-guided\ncross-view transformer that combines the benefits of conventional geometry and\nlearnable cross-view transformers to map the ground-view observations to an\noverhead view. Given the synthesized overhead view and observed satellite\nfeature maps, we construct a neural pose optimizer with strong global\ninformation embedding ability to estimate the relative rotation between them.\nAfter aligning their rotations, we develop an uncertainty-guided spatial\ncorrelation to generate a probability map of the vehicle locations, from which\nthe relative translation can be determined. Experimental results demonstrate\nthat our method significantly outperforms the state-of-the-art. Notably, the\nlikelihood of restricting the vehicle lateral pose to be within 1m of its\nGround Truth (GT) value on the cross-view KITTI dataset has been improved from\n$35.54\\%$ to $76.44\\%$, and the likelihood of restricting the vehicle\norientation to be within $1^{\\circ}$ of its GT value has been improved from\n$19.64\\%$ to $99.10\\%$.\n","authors":["Yujiao Shi","Fei Wu","Ankit Vora","Akhil Perincherry","Hongdong Li"],"pdf_url":"https://arxiv.org/pdf/2307.08015v2.pdf","comment":"Accepted to ICCV 2023"},{"id":"http://arxiv.org/abs/2307.09788v1","updated":"2023-07-19T07:11:45Z","published":"2023-07-19T07:11:45Z","title":"Density-invariant Features for Distant Point Cloud Registration","summary":" Registration of distant outdoor LiDAR point clouds is crucial to extending\nthe 3D vision of collaborative autonomous vehicles, and yet is challenging due\nto small overlapping area and a huge disparity between observed point\ndensities. In this paper, we propose Group-wise Contrastive Learning (GCL)\nscheme to extract density-invariant geometric features to register distant\noutdoor LiDAR point clouds. We mark through theoretical analysis and\nexperiments that, contrastive positives should be independent and identically\ndistributed (i.i.d.), in order to train densityinvariant feature extractors. 
We\npropose upon the conclusion a simple yet effective training scheme to force the\nfeature of multiple point clouds in the same spatial location (referred to as\npositive groups) to be similar, which naturally avoids the sampling bias\nintroduced by a pair of point clouds to conform with the i.i.d. principle. The\nresulting fully-convolutional feature extractor is more powerful and\ndensity-invariant than state-of-the-art methods, improving the registration\nrecall of distant scenarios on KITTI and nuScenes benchmarks by 40.9% and\n26.9%, respectively. The code will be open-sourced.\n","authors":["Quan Liu","Hongzi Zhu","Yunsong Zhou","Hongyang Li","Shan Chang","Minyi Guo"],"pdf_url":"https://arxiv.org/pdf/2307.09788v1.pdf","comment":"In Proceedings of the IEEE/CVF International Conference on Computer\n Vision (ICCV), 2023"},{"id":"http://arxiv.org/abs/2307.09787v1","updated":"2023-07-19T07:11:11Z","published":"2023-07-19T07:11:11Z","title":"DVPT: Dynamic Visual Prompt Tuning of Large Pre-trained Models for\n Medical Image Analysis","summary":" Limited labeled data makes it hard to train models from scratch in medical\ndomain, and an important paradigm is pre-training and then fine-tuning. Large\npre-trained models contain rich representations, which can be adapted to\ndownstream medical tasks. However, existing methods either tune all the\nparameters or the task-specific layers of the pre-trained models, ignoring the\ninput variations of medical images, and thus they are not efficient or\neffective. In this work, we aim to study parameter-efficient fine-tuning (PEFT)\nfor medical image analysis, and propose a dynamic visual prompt tuning method,\nnamed DVPT. It can extract knowledge beneficial to downstream tasks from large\nmodels with a few trainable parameters. Firstly, the frozen features are\ntransformed by an lightweight bottleneck layer to learn the domain-specific\ndistribution of downstream medical tasks, and then a few learnable visual\nprompts are used as dynamic queries and then conduct cross-attention with the\ntransformed features, attempting to acquire sample-specific knowledge that are\nsuitable for each sample. Finally, the features are projected to original\nfeature dimension and aggregated with the frozen features. This DVPT module can\nbe shared between different Transformer layers, further reducing the trainable\nparameters. To validate DVPT, we conduct extensive experiments with different\npre-trained models on medical classification and segmentation tasks. We find\nsuch PEFT method can not only efficiently adapt the pre-trained models to the\nmedical domain, but also brings data efficiency with partial labeled data. For\nexample, with 0.5\\% extra trainable parameters, our method not only outperforms\nstate-of-the-art PEFT methods, even surpasses the full fine-tuning by more than\n2.20\\% Kappa score on medical classification task. It can saves up to 60\\%\nlabeled data and 99\\% storage cost of ViT-B/16.\n","authors":["Along He","Kai Wang","Zhihong Wang","Tao Li","Huazhu Fu"],"pdf_url":"https://arxiv.org/pdf/2307.09787v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09781v1","updated":"2023-07-19T06:56:07Z","published":"2023-07-19T06:56:07Z","title":"Text2Layer: Layered Image Generation using Latent Diffusion Model","summary":" Layer compositing is one of the most popular image editing workflows among\nboth amateurs and professionals. 
Motivated by the success of diffusion models,\nwe explore layer compositing from a layered image generation perspective.\nInstead of generating an image, we propose to generate background, foreground,\nlayer mask, and the composed image simultaneously. To achieve layered image\ngeneration, we train an autoencoder that is able to reconstruct layered images\nand train diffusion models on the latent representation. One benefit of the\nproposed problem is to enable better compositing workflows in addition to the\nhigh-quality image output. Another benefit is producing higher-quality layer\nmasks compared to masks produced by a separate step of image segmentation.\nExperimental results show that the proposed method is able to generate\nhigh-quality layered images and initiates a benchmark for future work.\n","authors":["Xinyang Zhang","Wentian Zhao","Xin Lu","Jeff Chien"],"pdf_url":"https://arxiv.org/pdf/2307.09781v1.pdf","comment":"Preprint. Work in progress"},{"id":"http://arxiv.org/abs/2307.01533v2","updated":"2023-07-19T06:39:36Z","published":"2023-07-04T07:36:48Z","title":"Unsupervised Video Anomaly Detection with Diffusion Models Conditioned\n on Compact Motion Representations","summary":" This paper aims to address the unsupervised video anomaly detection (VAD)\nproblem, which involves classifying each frame in a video as normal or\nabnormal, without any access to labels. To accomplish this, the proposed method\nemploys conditional diffusion models, where the input data is the\nspatiotemporal features extracted from a pre-trained network, and the condition\nis the features extracted from compact motion representations that summarize a\ngiven video segment in terms of its motion and appearance. Our method utilizes\na data-driven threshold and considers a high reconstruction error as an\nindicator of anomalous events. This study is the first to utilize compact\nmotion representations for VAD and the experiments conducted on two large-scale\nVAD benchmarks demonstrate that they supply relevant information to the\ndiffusion model, and consequently improve VAD performances w.r.t the prior art.\nImportantly, our method exhibits better generalization performance across\ndifferent datasets, notably outperforming both the state-of-the-art and\nbaseline methods. The code of our method is available at\nhttps://github.com/AnilOsmanTur/conditioned_video_anomaly_diffusion\n","authors":["Anil Osman Tur","Nicola Dall'Asen","Cigdem Beyan","Elisa Ricci"],"pdf_url":"https://arxiv.org/pdf/2307.01533v2.pdf","comment":"Accepted to ICIAP 2023"},{"id":"http://arxiv.org/abs/2307.09769v1","updated":"2023-07-19T06:07:12Z","published":"2023-07-19T06:07:12Z","title":"Source-Free Domain Adaptation for Medical Image Segmentation via\n Prototype-Anchored Feature Alignment and Contrastive Learning","summary":" Unsupervised domain adaptation (UDA) has increasingly gained interests for\nits capacity to transfer the knowledge learned from a labeled source domain to\nan unlabeled target domain. However, typical UDA methods require concurrent\naccess to both the source and target domain data, which largely limits its\napplication in medical scenarios where source data is often unavailable due to\nprivacy concern. To tackle the source data-absent problem, we present a novel\ntwo-stage source-free domain adaptation (SFDA) framework for medical image\nsegmentation, where only a well-trained source segmentation model and unlabeled\ntarget data are available during domain adaptation. 
Specifically, in the\nprototype-anchored feature alignment stage, we first utilize the weights of the\npre-trained pixel-wise classifier as source prototypes, which preserve the\ninformation of source features. Then, we introduce the bi-directional transport\nto align the target features with class prototypes by minimizing its expected\ncost. On top of that, a contrastive learning stage is further devised to\nutilize those pixels with unreliable predictions for a more compact target\nfeature distribution. Extensive experiments on a cross-modality medical\nsegmentation task demonstrate the superiority of our method in large domain\ndiscrepancy settings compared with the state-of-the-art SFDA approaches and\neven some UDA methods. Code is available at\nhttps://github.com/CSCYQJ/MICCAI23-ProtoContra-SFDA.\n","authors":["Qinji Yu","Nan Xi","Junsong Yuan","Ziyu Zhou","Kang Dang","Xiaowei Ding"],"pdf_url":"https://arxiv.org/pdf/2307.09769v1.pdf","comment":"Accepted by MICCAI23"},{"id":"http://arxiv.org/abs/2009.06205v3","updated":"2023-07-19T06:05:27Z","published":"2020-09-14T05:23:58Z","title":"Joint Demosaicking and Denoising Benefits from a Two-stage Training\n Strategy","summary":" Image demosaicking and denoising are the first two key steps of the color\nimage production pipeline. The classical processing sequence has for a long\ntime consisted of applying denoising first, and then demosaicking. Applying the\noperations in this order leads to oversmoothing and checkerboard effects. Yet,\nit was difficult to change this order, because once the image is demosaicked,\nthe statistical properties of the noise are dramatically changed and hard to\nhandle by traditional denoising models. In this paper, we address this problem\nby a hybrid machine learning method. We invert the traditional color filter\narray (CFA) processing pipeline by first demosaicking and then denoising. Our\ndemosaicking algorithm, trained on noiseless images, combines a traditional\nmethod and a residual convolutional neural network (CNN). This first stage\nretains all known information, which is the key point to obtain faithful final\nresults. The noisy demosaicked image is then passed through a second CNN\nrestoring a noiseless full-color image. This pipeline order completely avoids\ncheckerboard effects and restores fine image detail. Although CNNs can be\ntrained to solve jointly demosaicking-denoising end-to-end, we find that this\ntwo-stage training performs better and is less prone to failure. It is shown\nexperimentally to improve on the state of the art, both quantitatively and in\nterms of visual quality.\n","authors":["Yu Guo","Qiyu Jin","Gabriele Facciolo","Tieyong Zeng","Jean-Michel Morel"],"pdf_url":"https://arxiv.org/pdf/2009.06205v3.pdf","comment":"28 pages, 40 figures"},{"id":"http://arxiv.org/abs/2307.09763v1","updated":"2023-07-19T05:46:56Z","published":"2023-07-19T05:46:56Z","title":"Towards Building More Robust Models with Frequency Bias","summary":" The vulnerability of deep neural networks to adversarial samples has been a\nmajor impediment to their broad applications, despite their success in various\nfields. Recently, some works suggested that adversarially-trained models\nemphasize the importance of low-frequency information to achieve higher\nrobustness. 
While several attempts have been made to leverage this frequency\ncharacteristic, they have all faced the issue that applying low-pass filters\ndirectly to input images leads to irreversible loss of discriminative\ninformation and poor generalizability to datasets with distinct frequency\nfeatures. This paper presents a plug-and-play module called the Frequency\nPreference Control Module that adaptively reconfigures the low- and\nhigh-frequency components of intermediate feature representations, providing\nbetter utilization of frequency in robust learning. Empirical studies show that\nour proposed module can be easily incorporated into any adversarial training\nframework, further improving model robustness across different architectures\nand datasets. Additionally, experiments were conducted to examine how the\nfrequency bias of robust models impacts the adversarial training process and\nits final robustness, revealing interesting insights.\n","authors":["Qingwen Bu","Dong Huang","Heming Cui"],"pdf_url":"https://arxiv.org/pdf/2307.09763v1.pdf","comment":"Accepted by ICCV23"},{"id":"http://arxiv.org/abs/2307.08779v2","updated":"2023-07-19T05:43:45Z","published":"2023-07-17T18:50:15Z","title":"Similarity Min-Max: Zero-Shot Day-Night Domain Adaptation","summary":" Low-light conditions not only hamper human visual experience but also degrade\nthe model's performance on downstream vision tasks. While existing works make\nremarkable progress on day-night domain adaptation, they rely heavily on domain\nknowledge derived from the task-specific nighttime dataset. This paper\nchallenges a more complicated scenario with border applicability, i.e.,\nzero-shot day-night domain adaptation, which eliminates reliance on any\nnighttime data. Unlike prior zero-shot adaptation approaches emphasizing either\nimage-level translation or model-level adaptation, we propose a similarity\nmin-max paradigm that considers them under a unified framework. On the image\nlevel, we darken images towards minimum feature similarity to enlarge the\ndomain gap. Then on the model level, we maximize the feature similarity between\nthe darkened images and their normal-light counterparts for better model\nadaptation. To the best of our knowledge, this work represents the pioneering\neffort in jointly optimizing both aspects, resulting in a significant\nimprovement of model generalizability. Extensive experiments demonstrate our\nmethod's effectiveness and broad applicability on various nighttime vision\ntasks, including classification, semantic segmentation, visual place\nrecognition, and video action recognition. Code and pre-trained models are\navailable at https://red-fairy.github.io/ZeroShotDayNightDA-Webpage/.\n","authors":["Rundong Luo","Wenjing Wang","Wenhan Yang","Jiaying Liu"],"pdf_url":"https://arxiv.org/pdf/2307.08779v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.09758v1","updated":"2023-07-19T05:41:14Z","published":"2023-07-19T05:41:14Z","title":"Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray\n Report Generation","summary":" Chest X-Ray (CXR) report generation is a promising approach to improving the\nefficiency of CXR interpretation. However, a significant increase in diagnostic\naccuracy is required before that can be realised. Motivated by this, we propose\na framework that is more inline with a radiologist's workflow by considering\nlongitudinal data. Here, the decoder is additionally conditioned on the report\nfrom the subject's previous imaging study via a prompt. 
We also propose a new\nreward for reinforcement learning based on CXR-BERT, which computes the\nsimilarity between reports. We conduct experiments on the MIMIC-CXR dataset.\nThe results indicate that longitudinal data improves CXR report generation.\nCXR-BERT is also shown to be a promising alternative to the current\nstate-of-the-art reward based on RadGraph. This investigation indicates that\nlongitudinal CXR report generation can offer a substantial increase in\ndiagnostic accuracy. Our Hugging Face model is available at:\nhttps://huggingface.co/aehrc/cxrmate and code is available at:\nhttps://github.com/aehrc/cxrmate.\n","authors":["Aaron Nicolson","Jason Dowling","Bevan Koopman"],"pdf_url":"https://arxiv.org/pdf/2307.09758v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09756v1","updated":"2023-07-19T05:40:38Z","published":"2023-07-19T05:40:38Z","title":"Generative Prompt Model for Weakly Supervised Object Localization","summary":" Weakly supervised object localization (WSOL) remains challenging when\nlearning object localization models from image category labels. Conventional\nmethods that discriminatively train activation models ignore representative yet\nless discriminative object parts. In this study, we propose a generative prompt\nmodel (GenPromp), defining the first generative pipeline to localize less\ndiscriminative object parts by formulating WSOL as a conditional image\ndenoising procedure. During training, GenPromp converts image category labels\nto learnable prompt embeddings which are fed to a generative model to\nconditionally recover the input image with noise and learn representative\nembeddings. During inference, enPromp combines the representative embeddings\nwith discriminative embeddings (queried from an off-the-shelf vision-language\nmodel) for both representative and discriminative capacity. The combined\nembeddings are finally used to generate multi-scale high-quality attention\nmaps, which facilitate localizing full object extent. Experiments on\nCUB-200-2011 and ILSVRC show that GenPromp respectively outperforms the best\ndiscriminative models by 5.2% and 5.6% (Top-1 Loc), setting a solid baseline\nfor WSOL with the generative model. Code is available at\nhttps://github.com/callsys/GenPromp.\n","authors":["Yuzhong Zhao","Qixiang Ye","Weijia Wu","Chunhua Shen","Fang Wan"],"pdf_url":"https://arxiv.org/pdf/2307.09756v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09755v1","updated":"2023-07-19T05:39:15Z","published":"2023-07-19T05:39:15Z","title":"Space Engage: Collaborative Space Supervision for Contrastive-based\n Semi-Supervised Semantic Segmentation","summary":" Semi-Supervised Semantic Segmentation (S4) aims to train a segmentation model\nwith limited labeled images and a substantial volume of unlabeled images. To\nimprove the robustness of representations, powerful methods introduce a\npixel-wise contrastive learning approach in latent space (i.e., representation\nspace) that aggregates the representations to their prototypes in a fully\nsupervised manner. However, previous contrastive-based S4 methods merely rely\non the supervision from the model's output (logits) in logit space during\nunlabeled training. In contrast, we utilize the outputs in both logit space and\nrepresentation space to obtain supervision in a collaborative way. 
The\nsupervision from two spaces plays two roles: 1) reduces the risk of\nover-fitting to incorrect semantic information in logits with the help of\nrepresentations; 2) enhances the knowledge exchange between the two spaces.\nFurthermore, unlike previous approaches, we use the similarity between\nrepresentations and prototypes as a new indicator to tilt training those\nunder-performing representations and achieve a more efficient contrastive\nlearning process. Results on two public benchmarks demonstrate the competitive\nperformance of our method compared with state-of-the-art methods.\n","authors":["Changqi Wang","Haoyu Xie","Yuhui Yuan","Chong Fu","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2307.09755v1.pdf","comment":"Accepted to ICCV 2023"},{"id":"http://arxiv.org/abs/2307.09749v1","updated":"2023-07-19T05:08:47Z","published":"2023-07-19T05:08:47Z","title":"Towards Robust Scene Text Image Super-resolution via Explicit Location\n Enhancement","summary":" Scene text image super-resolution (STISR), aiming to improve image quality\nwhile boosting downstream scene text recognition accuracy, has recently\nachieved great success. However, most existing methods treat the foreground\n(character regions) and background (non-character regions) equally in the\nforward process, and neglect the disturbance from the complex background, thus\nlimiting the performance. To address these issues, in this paper, we propose a\nnovel method LEMMA that explicitly models character regions to produce\nhigh-level text-specific guidance for super-resolution. To model the location\nof characters effectively, we propose the location enhancement module to\nextract character region features based on the attention map sequence. Besides,\nwe propose the multi-modal alignment module to perform bidirectional\nvisual-semantic alignment to generate high-quality prior guidance, which is\nthen incorporated into the super-resolution branch in an adaptive manner using\nthe proposed adaptive fusion module. Experiments on TextZoom and four scene\ntext recognition benchmarks demonstrate the superiority of our method over\nother state-of-the-art methods. Code is available at\nhttps://github.com/csguoh/LEMMA.\n","authors":["Hang Guo","Tao Dai","Guanghao Meng","Shu-Tao Xia"],"pdf_url":"https://arxiv.org/pdf/2307.09749v1.pdf","comment":"Accepted as IJCAI2023 paper"},{"id":"http://arxiv.org/abs/2307.09748v1","updated":"2023-07-19T04:59:58Z","published":"2023-07-19T04:59:58Z","title":"Watch out Venomous Snake Species: A Solution to SnakeCLEF2023","summary":" The SnakeCLEF2023 competition aims to the development of advanced algorithms\nfor snake species identification through the analysis of images and\naccompanying metadata. This paper presents a method leveraging utilization of\nboth images and metadata. Modern CNN models and strong data augmentation are\nutilized to learn better representation of images. To relieve the challenge of\nlong-tailed distribution, seesaw loss is utilized in our method. We also design\na light model to calculate prior probabilities using metadata features\nextracted from CLIP in post processing stage. Besides, we attach more\nimportance to venomous species by assigning venomous species labels to some\nexamples that model is uncertain about. Our method achieves 91.31% score of the\nfinal metric combined of F1 and other metrics on private leaderboard, which is\nthe 1st place among the participators. 
The code is available at\nhttps://github.com/xiaoxsparraw/CLEF2023.\n","authors":["Feiran Hu","Peng Wang","Yangyang Li","Chenlong Duan","Zijian Zhu","Fei Wang","Faen Zhang","Yong Li","Xiu-Shen Wei"],"pdf_url":"https://arxiv.org/pdf/2307.09748v1.pdf","comment":"This work was the winner solution of the SnakeCLEF2023 challenge"},{"id":"http://arxiv.org/abs/2307.09742v1","updated":"2023-07-19T04:07:33Z","published":"2023-07-19T04:07:33Z","title":"Improved Distribution Matching for Dataset Condensation","summary":" Dataset Condensation aims to condense a large dataset into a smaller one\nwhile maintaining its ability to train a well-performing model, thus reducing\nthe storage cost and training effort in deep learning applications. However,\nconventional dataset condensation methods are optimization-oriented and\ncondense the dataset by performing gradient or parameter matching during model\noptimization, which is computationally intensive even on small datasets and\nmodels. In this paper, we propose a novel dataset condensation method based on\ndistribution matching, which is more efficient and promising. Specifically, we\nidentify two important shortcomings of naive distribution matching (i.e.,\nimbalanced feature numbers and unvalidated embeddings for distance computation)\nand address them with three novel techniques (i.e., partitioning and expansion\naugmentation, efficient and enriched model sampling, and class-aware\ndistribution regularization). Our simple yet effective method outperforms most\nprevious optimization-oriented methods with much fewer computational resources,\nthereby scaling data condensation to larger datasets and models. Extensive\nexperiments demonstrate the effectiveness of our method. Codes are available at\nhttps://github.com/uitrbn/IDM\n","authors":["Ganlong Zhao","Guanbin Li","Yipeng Qin","Yizhou Yu"],"pdf_url":"https://arxiv.org/pdf/2307.09742v1.pdf","comment":"CVPR2023"},{"id":"http://arxiv.org/abs/2306.13074v3","updated":"2023-07-19T03:46:37Z","published":"2023-06-22T17:47:08Z","title":"Iterative Scale-Up ExpansionIoU and Deep Features Association for\n Multi-Object Tracking in Sports","summary":" Multi-object tracking algorithms have made significant advancements due to\nthe recent developments in object detection. However, most existing methods\nprimarily focus on tracking pedestrians or vehicles, which exhibit relatively\nsimple and regular motion patterns. Consequently, there is a scarcity of\nalgorithms that address the tracking of targets with irregular or non-linear\nmotion, such as multi-athlete tracking. Furthermore, popular tracking\nalgorithms often rely on the Kalman filter for object motion modeling, which\nfails to track objects when their motion contradicts the linear motion\nassumption of the Kalman filter. Due to this reason, we proposed a novel online\nand robust multi-object tracking approach, named Iterative Scale-Up\nExpansionIoU and Deep Features for multi-object tracking. Unlike conventional\nmethods, we abandon the use of the Kalman filter and propose utilizing the\niterative scale-up expansion IoU. This approach achieves superior tracking\nperformance without requiring additional training data or adopting a more\nrobust detector, all while maintaining a lower computational cost compared to\nother appearance-based methods. Our proposed method demonstrates remarkable\neffectiveness in tracking irregular motion objects, achieving a score of 76.9%\nin HOTA. 
It outperforms all state-of-the-art tracking algorithms on the\nSportsMOT dataset, covering various kinds of sport scenarios.\n","authors":["Hsiang-Wei Huang","Cheng-Yen Yang","Jiacheng Sun","Jenq-Neng Hwang","Chung-I Huang"],"pdf_url":"https://arxiv.org/pdf/2306.13074v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.07859v2","updated":"2023-07-19T03:04:50Z","published":"2023-07-15T17:45:17Z","title":"Unified Adversarial Patch for Cross-modal Attacks in the Physical World","summary":" Recently, physical adversarial attacks have been presented to evade\nDNNs-based object detectors. To ensure the security, many scenarios are\nsimultaneously deployed with visible sensors and infrared sensors, leading to\nthe failures of these single-modal physical attacks. To show the potential\nrisks under such scenes, we propose a unified adversarial patch to perform\ncross-modal physical attacks, i.e., fooling visible and infrared object\ndetectors at the same time via a single patch. Considering different imaging\nmechanisms of visible and infrared sensors, our work focuses on modeling the\nshapes of adversarial patches, which can be captured in different modalities\nwhen they change. To this end, we design a novel boundary-limited shape\noptimization to achieve the compact and smooth shapes, and thus they can be\neasily implemented in the physical world. In addition, to balance the fooling\ndegree between visible detector and infrared detector during the optimization\nprocess, we propose a score-aware iterative evaluation, which can guide the\nadversarial patch to iteratively reduce the predicted scores of the multi-modal\nsensors. We finally test our method against the one-stage detector: YOLOv3 and\nthe two-stage detector: Faster RCNN. Results show that our unified patch\nachieves an Attack Success Rate (ASR) of 73.33% and 69.17%, respectively. More\nimportantly, we verify the effective attacks in the physical world when visible\nand infrared sensors shoot the objects under various settings like different\nangles, distances, postures, and scenes.\n","authors":["Xingxing Wei","Yao Huang","Yitong Sun","Jie Yu"],"pdf_url":"https://arxiv.org/pdf/2307.07859v2.pdf","comment":"10 pages, 8 figures, accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2306.16197v3","updated":"2023-07-19T02:53:36Z","published":"2023-06-28T13:23:33Z","title":"Multi-IMU with Online Self-Consistency for Freehand 3D Ultrasound\n Reconstruction","summary":" Ultrasound (US) imaging is a popular tool in clinical diagnosis, offering\nsafety, repeatability, and real-time capabilities. Freehand 3D US is a\ntechnique that provides a deeper understanding of scanned regions without\nincreasing complexity. However, estimating elevation displacement and\naccumulation error remains challenging, making it difficult to infer the\nrelative position using images alone. The addition of external lightweight\nsensors has been proposed to enhance reconstruction performance without adding\ncomplexity, which has been shown to be beneficial. We propose a novel online\nself-consistency network (OSCNet) using multiple inertial measurement units\n(IMUs) to improve reconstruction performance. OSCNet utilizes a modal-level\nself-supervised strategy to fuse multiple IMU information and reduce\ndifferences between reconstruction results obtained from each IMU data.\nAdditionally, a sequence-level self-consistency strategy is proposed to improve\nthe hierarchical consistency of prediction results among the scanning sequence\nand its sub-sequences. 
Experiments on large-scale arm and carotid datasets with\nmultiple scanning tactics demonstrate that our OSCNet outperforms previous\nmethods, achieving state-of-the-art reconstruction performance.\n","authors":["Mingyuan Luo","Xin Yang","Zhongnuo Yan","Junyu Li","Yuanji Zhang","Jiongquan Chen","Xindi Hu","Jikuan Qian","Jun Cheng","Dong Ni"],"pdf_url":"https://arxiv.org/pdf/2306.16197v3.pdf","comment":"Accepted by MICCAI-2023"},{"id":"http://arxiv.org/abs/2307.09732v1","updated":"2023-07-19T02:49:44Z","published":"2023-07-19T02:49:44Z","title":"ClickSeg: 3D Instance Segmentation with Click-Level Weak Annotations","summary":" 3D instance segmentation methods often require fully-annotated dense labels\nfor training, which are costly to obtain. In this paper, we present ClickSeg, a\nnovel click-level weakly supervised 3D instance segmentation method that\nrequires one point per instance annotation merely. Such a problem is very\nchallenging due to the extremely limited labels, which has rarely been solved\nbefore. We first develop a baseline weakly-supervised training method, which\ngenerates pseudo labels for unlabeled data by the model itself. To utilize the\nproperty of click-level annotation setting, we further propose a new training\nframework. Instead of directly using the model inference way, i.e., mean-shift\nclustering, to generate the pseudo labels, we propose to use k-means with fixed\ninitial seeds: the annotated points. New similarity metrics are further\ndesigned for clustering. Experiments on ScanNetV2 and S3DIS datasets show that\nthe proposed ClickSeg surpasses the previous best weakly supervised instance\nsegmentation result by a large margin (e.g., +9.4% mAP on ScanNetV2). Using\n0.02% supervision signals merely, ClickSeg achieves $\\sim$90% of the accuracy\nof the fully-supervised counterpart. Meanwhile, it also achieves\nstate-of-the-art semantic segmentation results among weakly supervised methods\nthat use the same annotation settings.\n","authors":["Leyao Liu","Tao Kong","Minzhao Zhu","Jiashuo Fan","Lu Fang"],"pdf_url":"https://arxiv.org/pdf/2307.09732v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09729v1","updated":"2023-07-19T02:33:42Z","published":"2023-07-19T02:33:42Z","title":"NTIRE 2023 Quality Assessment of Video Enhancement Challenge","summary":" This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement\nChallenge, which will be held in conjunction with the New Trends in Image\nRestoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to\naddress a major challenge in the field of video processing, namely, video\nquality assessment (VQA) for enhanced videos. The challenge uses the VQA\nDataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211\nenhanced videos, including 600 videos with color, brightness, and contrast\nenhancements, 310 videos with deblurring, and 301 deshaked videos. The\nchallenge has a total of 167 registered participants. 61 participating teams\nsubmitted their prediction results during the development phase, with a total\nof 3168 submissions. A total of 176 submissions were submitted by 37\nparticipating teams during the final testing phase. Finally, 19 participating\nteams submitted their models and fact sheets, and detailed the methods they\nused. 
Some methods have achieved better results than baseline methods, and the\nwinning methods have demonstrated superior prediction performance.\n","authors":["Xiaohong Liu","Xiongkuo Min","Wei Sun","Yulun Zhang","Kai Zhang","Radu Timofte","Guangtao Zhai","Yixuan Gao","Yuqin Cao","Tengchuan Kou","Yunlong Dong","Ziheng Jia","Yilin Li","Wei Wu","Shuming Hu","Sibin Deng","Pengxiang Xiao","Ying Chen","Kai Li","Kai Zhao","Kun Yuan","Ming Sun","Heng Cong","Hao Wang","Lingzhi Fu","Yusheng Zhang","Rongyu Zhang","Hang Shi","Qihang Xu","Longan Xiao","Zhiliang Ma","Mirko Agarla","Luigi Celona","Claudio Rota","Raimondo Schettini","Zhiwei Huang","Yanan Li","Xiaotao Wang","Lei Lei","Hongye Liu","Wei Hong","Ironhead Chuang","Allen Lin","Drake Guan","Iris Chen","Kae Lou","Willy Huang","Yachun Tasi","Yvonne Kao","Haotian Fan","Fangyuan Kong","Shiqi Zhou","Hao Liu","Yu Lai","Shanshan Chen","Wenqi Wang","Haoning Wu","Chaofeng Chen","Chunzheng Zhu","Zekun Guo","Shiling Zhao","Haibing Yin","Hongkui Wang","Hanene Brachemi Meftah","Sid Ahmed Fezza","Wassim Hamidouche","Olivier Déforges","Tengfei Shi","Azadeh Mansouri","Hossein Motamednia","Amir Hossein Bakhtiari","Ahmad Mahmoudi Aznaveh"],"pdf_url":"https://arxiv.org/pdf/2307.09729v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09023v2","updated":"2023-07-19T02:30:48Z","published":"2023-07-18T07:25:38Z","title":"LA-Net: Landmark-Aware Learning for Reliable Facial Expression\n Recognition under Label Noise","summary":" Facial expression recognition (FER) remains a challenging task due to the\nambiguity of expressions. The derived noisy labels significantly harm the\nperformance in real-world scenarios. To address this issue, we present a new\nFER model named Landmark-Aware Net~(LA-Net), which leverages facial landmarks\nto mitigate the impact of label noise from two perspectives. Firstly, LA-Net\nuses landmark information to suppress the uncertainty in expression space and\nconstructs the label distribution of each sample by neighborhood aggregation,\nwhich in turn improves the quality of training supervision. Secondly, the model\nincorporates landmark information into expression representations using the\ndevised expression-landmark contrastive loss. The enhanced expression feature\nextractor can be less susceptible to label noise. Our method can be integrated\nwith any deep neural network for better training supervision without\nintroducing extra inference costs. We conduct extensive experiments on both\nin-the-wild datasets and synthetic noisy datasets and demonstrate that LA-Net\nachieves state-of-the-art performance.\n","authors":["Zhiyu Wu","Jinshi Cui"],"pdf_url":"https://arxiv.org/pdf/2307.09023v2.pdf","comment":"accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.09728v1","updated":"2023-07-19T02:29:57Z","published":"2023-07-19T02:29:57Z","title":"Uncertainty-Driven Multi-Scale Feature Fusion Network for Real-time\n Image Deraining","summary":" Visual-based measurement systems are frequently affected by rainy weather due\nto the degradation caused by rain streaks in captured images, and existing\nimaging devices struggle to address this issue in real-time. While most efforts\nleverage deep networks for image deraining and have made progress, their large\nparameter sizes hinder deployment on resource-constrained devices.\nAdditionally, these data-driven models often produce deterministic results,\nwithout considering their inherent epistemic uncertainty, which can lead to\nundesired reconstruction errors. 
Well-calibrated uncertainty can help alleviate\nprediction errors and assist measurement devices in mitigating risks and\nimproving usability. Therefore, we propose an Uncertainty-Driven Multi-Scale\nFeature Fusion Network (UMFFNet) that learns the probability mapping\ndistribution between paired images to estimate uncertainty. Specifically, we\nintroduce an uncertainty feature fusion block (UFFB) that utilizes uncertainty\ninformation to dynamically enhance acquired features and focus on blurry\nregions obscured by rain streaks, reducing prediction errors. In addition, to\nfurther boost the performance of UMFFNet, we fused feature information from\nmultiple scales to guide the network for efficient collaborative rain removal.\nExtensive experiments demonstrate that UMFFNet achieves significant performance\nimprovements with few parameters, surpassing state-of-the-art image deraining\nmethods.\n","authors":["Ming Tong","Xuefeng Yan","Yongzhen Wang"],"pdf_url":"https://arxiv.org/pdf/2307.09728v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09727v1","updated":"2023-07-19T02:28:41Z","published":"2023-07-19T02:28:41Z","title":"SAMConvex: Fast Discrete Optimization for CT Registration using\n Self-supervised Anatomical Embedding and Correlation Pyramid","summary":" Estimating displacement vector field via a cost volume computed in the\nfeature space has shown great success in image registration, but it suffers\nexcessive computation burdens. Moreover, existing feature descriptors only\nextract local features incapable of representing the global semantic\ninformation, which is especially important for solving large transformations.\nTo address the discussed issues, we propose SAMConvex, a fast coarse-to-fine\ndiscrete optimization method for CT registration that includes a decoupled\nconvex optimization procedure to obtain deformation fields based on a\nself-supervised anatomical embedding (SAM) feature extractor that captures both\nlocal and global information. To be specific, SAMConvex extracts per-voxel\nfeatures and builds 6D correlation volumes based on SAM features, and\niteratively updates a flow field by performing lookups on the correlation\nvolumes with a coarse-to-fine scheme. SAMConvex outperforms the\nstate-of-the-art learning-based methods and optimization-based methods over two\ninter-patient registration datasets (Abdomen CT and HeadNeck CT) and one\nintra-patient registration dataset (Lung CT). Moreover, as an\noptimization-based method, SAMConvex only takes $\\sim2$s ($\\sim5s$ with\ninstance optimization) for one paired images.\n","authors":["Zi Li","Lin Tian","Tony C. W. Mok","Xiaoyu Bai","Puyang Wang","Jia Ge","Jingren Zhou","Le Lu","Xianghua Ye","Ke Yan","Dakai Jin"],"pdf_url":"https://arxiv.org/pdf/2307.09727v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09724v1","updated":"2023-07-19T02:26:20Z","published":"2023-07-19T02:26:20Z","title":"AesPA-Net: Aesthetic Pattern-Aware Style Transfer Networks","summary":" To deliver the artistic expression of the target style, recent studies\nexploit the attention mechanism owing to its ability to map the local patches\nof the style image to the corresponding patches of the content image. 
However,\nbecause of the low semantic correspondence between arbitrary content and\nartworks, the attention module repeatedly abuses specific local patches from\nthe style image, resulting in disharmonious and evident repetitive artifacts.\nTo overcome this limitation and accomplish impeccable artistic style transfer,\nwe focus on enhancing the attention mechanism and capturing the rhythm of\npatterns that organize the style. In this paper, we introduce a novel metric,\nnamely pattern repeatability, that quantifies the repetition of patterns in the\nstyle image. Based on the pattern repeatability, we propose Aesthetic\nPattern-Aware style transfer Networks (AesPA-Net) that discover the sweet spot\nof local and global style expressions. In addition, we propose a novel\nself-supervisory task to encourage the attention mechanism to learn precise and\nmeaningful semantic correspondence. Lastly, we introduce the patch-wise style\nloss to transfer the elaborate rhythm of local patterns. Through qualitative\nand quantitative evaluations, we verify the reliability of the proposed pattern\nrepeatability that aligns with human perception, and demonstrate the\nsuperiority of the proposed framework.\n","authors":["Kibeom Hong","Seogkyu Jeon","Junsoo Lee","Namhyuk Ahn","Kunhee Kim","Pilhyeon Lee","Daesik Kim","Youngjung Uh","Hyeran Byun"],"pdf_url":"https://arxiv.org/pdf/2307.09724v1.pdf","comment":"Accepted by ICCV 2023. Code is available at this\n https://github.com/Kibeom-Hong/AesPA-Net"},{"id":"http://arxiv.org/abs/2212.04761v2","updated":"2023-07-19T02:20:18Z","published":"2022-12-09T10:37:22Z","title":"Leveraging Spatio-Temporal Dependency for Skeleton-Based Action\n Recognition","summary":" Skeleton-based action recognition has attracted considerable attention due to\nits compact representation of the human body's skeletal sructure. Many recent\nmethods have achieved remarkable performance using graph convolutional networks\n(GCNs) and convolutional neural networks (CNNs), which extract spatial and\ntemporal features, respectively. Although spatial and temporal dependencies in\nthe human skeleton have been explored separately, spatio-temporal dependency is\nrarely considered. In this paper, we propose the Spatio-Temporal Curve Network\n(STC-Net) to effectively leverage the spatio-temporal dependency of the human\nskeleton. Our proposed network consists of two novel elements: 1) The\nSpatio-Temporal Curve (STC) module; and 2) Dilated Kernels for Graph\nConvolution (DK-GC). The STC module dynamically adjusts the receptive field by\nidentifying meaningful node connections between every adjacent frame and\ngenerating spatio-temporal curves based on the identified node connections,\nproviding an adaptive spatio-temporal coverage. In addition, we propose DK-GC\nto consider long-range dependencies, which results in a large receptive field\nwithout any additional parameters by applying an extended kernel to the given\nadjacency matrices of the graph. 
Our STC-Net combines these two modules and\nachieves state-of-the-art performance on four skeleton-based action recognition\nbenchmarks.\n","authors":["Jungho Lee","Minhyeok Lee","Suhwan Cho","Sungmin Woo","Sungjun Jang","Sangyoun Lee"],"pdf_url":"https://arxiv.org/pdf/2212.04761v2.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.09721v1","updated":"2023-07-19T02:11:19Z","published":"2023-07-19T02:11:19Z","title":"Multi-Grained Multimodal Interaction Network for Entity Linking","summary":" Multimodal entity linking (MEL) task, which aims at resolving ambiguous\nmentions to a multimodal knowledge graph, has attracted wide attention in\nrecent years. Though large efforts have been made to explore the complementary\neffect among multiple modalities, however, they may fail to fully absorb the\ncomprehensive expression of abbreviated textual context and implicit visual\nindication. Even worse, the inevitable noisy data may cause inconsistency of\ndifferent modalities during the learning process, which severely degenerates\nthe performance. To address the above issues, in this paper, we propose a novel\nMulti-GraIned Multimodal InteraCtion Network $\\textbf{(MIMIC)}$ framework for\nsolving the MEL task. Specifically, the unified inputs of mentions and entities\nare first encoded by textual/visual encoders separately, to extract global\ndescriptive features and local detailed features. Then, to derive the\nsimilarity matching score for each mention-entity pair, we device three\ninteraction units to comprehensively explore the intra-modal interaction and\ninter-modal fusion among features of entities and mentions. In particular,\nthree modules, namely the Text-based Global-Local interaction Unit (TGLU),\nVision-based DuaL interaction Unit (VDLU) and Cross-Modal Fusion-based\ninteraction Unit (CMFU) are designed to capture and integrate the fine-grained\nrepresentation lying in abbreviated text and implicit visual cues. Afterwards,\nwe introduce a unit-consistency objective function via contrastive learning to\navoid inconsistency and model degradation. Experimental results on three public\nbenchmark datasets demonstrate that our solution outperforms various\nstate-of-the-art baselines, and ablation studies verify the effectiveness of\ndesigned modules.\n","authors":["Pengfei Luo","Tong Xu","Shiwei Wu","Chen Zhu","Linli Xu","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2307.09721v1.pdf","comment":"Accepted by KDD 2023"},{"id":"http://arxiv.org/abs/2307.09715v1","updated":"2023-07-19T01:57:31Z","published":"2023-07-19T01:57:31Z","title":"Semantic-Aware Dual Contrastive Learning for Multi-label Image\n Classification","summary":" Extracting image semantics effectively and assigning corresponding labels to\nmultiple objects or attributes for natural images is challenging due to the\ncomplex scene contents and confusing label dependencies. Recent works have\nfocused on modeling label relationships with graph and understanding object\nregions using class activation maps (CAM). However, these methods ignore the\ncomplex intra- and inter-category relationships among specific semantic\nfeatures, and CAM is prone to generate noisy information. To this end, we\npropose a novel semantic-aware dual contrastive learning framework that\nincorporates sample-to-sample contrastive learning (SSCL) as well as\nprototype-to-sample contrastive learning (PSCL). 
Specifically, we leverage\nsemantic-aware representation learning to extract category-related local\ndiscriminative features and construct category prototypes. Then based on SSCL,\nlabel-level visual representations of the same category are aggregated\ntogether, and features belonging to distinct categories are separated.\nMeanwhile, we construct a novel PSCL module to narrow the distance between\npositive samples and category prototypes and push negative samples away from\nthe corresponding category prototypes. Finally, the discriminative label-level\nfeatures related to the image content are accurately captured by the joint\ntraining of the above three parts. Experiments on five challenging large-scale\npublic datasets demonstrate that our proposed method is effective and\noutperforms the state-of-the-art methods. Code and supplementary materials are\nreleased on https://github.com/yu-gi-oh-leilei/SADCL.\n","authors":["Leilei Ma","Dengdi Sun","Lei Wang","Haifang Zhao","Bin Luo"],"pdf_url":"https://arxiv.org/pdf/2307.09715v1.pdf","comment":"8 pages, 6 figures, accepted by ECAI 23"},{"id":"http://arxiv.org/abs/2307.07928v2","updated":"2023-07-19T01:43:59Z","published":"2023-07-16T02:44:19Z","title":"Reinforced Disentanglement for Face Swapping without Skip Connection","summary":" The SOTA face swap models still suffer the problem of either target identity\n(i.e., shape) being leaked or the target non-identity attributes (i.e.,\nbackground, hair) failing to be fully preserved in the final results. We show\nthat this insufficient disentanglement is caused by two flawed designs that\nwere commonly adopted in prior models: (1) counting on only one compressed\nencoder to represent both the semantic-level non-identity facial\nattributes(i.e., pose) and the pixel-level non-facial region details, which is\ncontradictory to satisfy at the same time; (2) highly relying on long\nskip-connections between the encoder and the final generator, leaking a certain\namount of target face identity into the result. To fix them, we introduce a new\nface swap framework called 'WSC-swap' that gets rid of skip connections and\nuses two target encoders to respectively capture the pixel-level non-facial\nregion attributes and the semantic non-identity attributes in the face region.\nTo further reinforce the disentanglement learning for the target encoder, we\nemploy both identity removal loss via adversarial training (i.e., GAN) and the\nnon-identity preservation loss via prior 3DMM models like [11]. Extensive\nexperiments on both FaceForensics++ and CelebA-HQ show that our results\nsignificantly outperform previous works on a rich set of metrics, including one\nnovel metric for measuring identity consistency that was completely neglected\nbefore.\n","authors":["Xiaohang Ren","Xingyu Chen","Pengfei Yao","Heung-Yeung Shum","Baoyuan Wang"],"pdf_url":"https://arxiv.org/pdf/2307.07928v2.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.03135v2","updated":"2023-07-19T01:28:30Z","published":"2023-07-06T17:05:26Z","title":"Distilling Large Vision-Language Model with Out-of-Distribution\n Generalizability","summary":" Large vision-language models have achieved outstanding performance, but their\nsize and computational requirements make their deployment on\nresource-constrained devices and time-sensitive tasks impractical. 
Model\ndistillation, the process of creating smaller, faster models that maintain the\nperformance of larger models, is a promising direction towards the solution.\nThis paper investigates the distillation of visual representations in large\nteacher vision-language models into lightweight student models using a small-\nor mid-scale dataset. Notably, this study focuses on open-vocabulary\nout-of-distribution (OOD) generalization, a challenging problem that has been\noverlooked in previous model distillation literature. We propose two principles\nfrom vision and language modality perspectives to enhance student's OOD\ngeneralization: (1) by better imitating teacher's visual representation space,\nand carefully promoting better coherence in vision-language alignment with the\nteacher; (2) by enriching the teacher's language representations with\ninformative and finegrained semantic attributes to effectively distinguish\nbetween different labels. We propose several metrics and conduct extensive\nexperiments to investigate their techniques. The results demonstrate\nsignificant improvements in zero-shot and few-shot student performance on\nopen-vocabulary out-of-distribution classification, highlighting the\neffectiveness of our proposed approaches. Code released at\nhttps://github.com/xuanlinli17/large_vlm_distillation_ood\n","authors":["Xuanlin Li","Yunhao Fang","Minghua Liu","Zhan Ling","Zhuowen Tu","Hao Su"],"pdf_url":"https://arxiv.org/pdf/2307.03135v2.pdf","comment":"Published at International Conference on Computer Vision (ICCV) 2023"},{"id":"http://arxiv.org/abs/2307.09153v2","updated":"2023-07-19T01:27:17Z","published":"2023-07-18T11:24:42Z","title":"OPHAvatars: One-shot Photo-realistic Head Avatars","summary":" We propose a method for synthesizing photo-realistic digital avatars from\nonly one portrait as the reference. Given a portrait, our method synthesizes a\ncoarse talking head video using driving keypoints features. And with the coarse\nvideo, our method synthesizes a coarse talking head avatar with a deforming\nneural radiance field. With rendered images of the coarse avatar, our method\nupdates the low-quality images with a blind face restoration model. With\nupdated images, we retrain the avatar for higher quality. After several\niterations, our method can synthesize a photo-realistic animatable 3D neural\nhead avatar. The motivation of our method is deformable neural radiance field\ncan eliminate the unnatural distortion caused by the image2video method. Our\nmethod outperforms state-of-the-art methods in quantitative and qualitative\nstudies on various subjects.\n","authors":["Shaoxu Li"],"pdf_url":"https://arxiv.org/pdf/2307.09153v2.pdf","comment":"code: https://github.com/lsx0101/OPHAvatars"},{"id":"http://arxiv.org/abs/2307.09696v1","updated":"2023-07-19T00:41:39Z","published":"2023-07-19T00:41:39Z","title":"Towards Saner Deep Image Registration","summary":" With recent advances in computing hardware and surges of deep-learning\narchitectures, learning-based deep image registration methods have surpassed\ntheir traditional counterparts, in terms of metric performance and inference\ntime. However, these methods focus on improving performance measurements such\nas Dice, resulting in less attention given to model behaviors that are equally\ndesirable for registrations, especially for medical imaging. This paper\ninvestigates these behaviors for popular learning-based deep registrations\nunder a sanity-checking microscope. 
We find that most existing registrations\nsuffer from low inverse consistency and nondiscrimination of identical pairs\ndue to overly optimized image similarities. To rectify these behaviors, we\npropose a novel regularization-based sanity-enforcer method that imposes two\nsanity checks on the deep model to reduce its inverse consistency errors and\nincrease its discriminative power simultaneously. Moreover, we derive a set of\ntheoretical guarantees for our sanity-checked image registration method, with\nexperimental results supporting our theoretical findings and their\neffectiveness in increasing the sanity of models without sacrificing any\nperformance. Our code and models are available at\n\\url{https://github.com/tuffr5/Saner-deep-registration}.\n","authors":["Bin Duan","Ming Zhong","Yan Yan"],"pdf_url":"https://arxiv.org/pdf/2307.09696v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.09693v1","updated":"2023-07-19T00:36:05Z","published":"2023-07-19T00:36:05Z","title":"GlobalMapper: Arbitrary-Shaped Urban Layout Generation","summary":" Modeling and designing urban building layouts is of significant interest in\ncomputer vision, computer graphics, and urban applications. A building layout\nconsists of a set of buildings in city blocks defined by a network of roads. We\nobserve that building layouts are discrete structures, consisting of multiple\nrows of buildings of various shapes, and are amenable to skeletonization for\nmapping arbitrary city block shapes to a canonical form. Hence, we propose a\nfully automatic approach to building layout generation using graph attention\nnetworks. Our method generates realistic urban layouts given arbitrary road\nnetworks, and enables conditional generation based on learned priors. Our\nresults, including user study, demonstrate superior performance as compared to\nprior layout generation networks, support arbitrary city block and varying\nbuilding shapes as demonstrated by generating layouts for 28 large cities.\n","authors":["Liu He","Daniel Aliaga"],"pdf_url":"https://arxiv.org/pdf/2307.09693v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.10501v1","updated":"2023-07-19T23:57:39Z","published":"2023-07-19T23:57:39Z","title":"Eye Disease Classification Using Deep Learning Techniques","summary":" Eye is the essential sense organ for vision function. Due to the fact that\ncertain eye disorders might result in vision loss, it is essential to diagnose\nand treat eye diseases early on. By identifying common eye illnesses and\nperforming an eye check, eye care providers can safeguard patients against\nvision loss or blindness. Convolutional neural networks (CNN) and transfer\nlearning were employed in this study to discriminate between a normal eye and\none with diabetic retinopathy, cataract, or glaucoma disease. Using transfer\nlearning for multi-class classification, high accuracy was achieved at 94%\nwhile the traditional CNN achieved 84% rate.\n","authors":["Tareq Babaqi","Manar Jaradat","Ayse Erdem Yildirim","Saif H. Al-Nimer","Daehan Won"],"pdf_url":"https://arxiv.org/pdf/2307.10501v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10499v1","updated":"2023-07-19T23:55:15Z","published":"2023-07-19T23:55:15Z","title":"Mining Conditional Part Semantics with Occluded Extrapolation for\n Human-Object Interaction Detection","summary":" Human-Object Interaction Detection is a crucial aspect of human-centric scene\nunderstanding, with important applications in various domains. 
Despite recent\nprogress in this field, recognizing subtle and detailed interactions remains\nchallenging. Existing methods try to use human-related clues to alleviate the\ndifficulty, but rely heavily on external annotations or knowledge, limiting\ntheir practical applicability in real-world scenarios. In this work, we propose\na novel Part Semantic Network (PSN) to solve this problem. The core of PSN is a\nConditional Part Attention (CPA) mechanism, where human features are taken as\nkeys and values, and the object feature is used as query for the computation in\na cross-attention mechanism. In this way, our model learns to automatically\nfocus on the most informative human parts conditioned on the involved object,\ngenerating more semantically meaningful features for interaction recognition.\nAdditionally, we propose an Occluded Part Extrapolation (OPE) strategy to\nfacilitate interaction recognition under occluded scenarios, which teaches the\nmodel to extrapolate detailed features from partially occluded ones. Our method\nconsistently outperforms prior approaches on the V-COCO and HICO-DET datasets,\nwithout external data or extra annotations. Additional ablation studies\nvalidate the effectiveness of each component of our proposed method.\n","authors":["Guangzhi Wang","Yangyang Guo","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2307.10499v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2307.10495v1","updated":"2023-07-19T23:25:21Z","published":"2023-07-19T23:25:21Z","title":"Novel Batch Active Learning Approach and Its Application to Synthetic\n Aperture Radar Datasets","summary":" Active learning improves the performance of machine learning methods by\njudiciously selecting a limited number of unlabeled data points to query for\nlabels, with the aim of maximally improving the underlying classifier's\nperformance. Recent gains have been made using sequential active learning for\nsynthetic aperture radar (SAR) data arXiv:2204.00005. In each iteration,\nsequential active learning selects a query set of size one while batch active\nlearning selects a query set of multiple datapoints. While batch active\nlearning methods exhibit greater efficiency, the challenge lies in maintaining\nmodel accuracy relative to sequential active learning methods. We developed a\nnovel, two-part approach for batch active learning: Dijkstra's Annulus Core-Set\n(DAC) for core-set generation and LocalMax for batch sampling. The batch active\nlearning process that combines DAC and LocalMax achieves nearly identical\naccuracy as sequential active learning but is more efficient, proportional to\nthe batch size. As an application, a pipeline is built based on transfer\nlearning feature embedding, graph learning, DAC, and LocalMax to classify the\nFUSAR-Ship and OpenSARShip datasets. Our pipeline outperforms the\nstate-of-the-art CNN-based methods.\n","authors":["James Chapman","Bohan Chen","Zheng Tan","Jeff Calder","Kevin Miller","Andrea L. Bertozzi"],"pdf_url":"https://arxiv.org/pdf/2307.10495v1.pdf","comment":"16 pages, 7 figures, Preprint"},{"id":"http://arxiv.org/abs/2307.10487v1","updated":"2023-07-19T22:46:35Z","published":"2023-07-19T22:46:35Z","title":"Backdoor Attack against Object Detection with Clean Annotation","summary":" Deep neural networks (DNNs) have shown unprecedented success in object\ndetection tasks. However, it was also discovered that DNNs are vulnerable to\nmultiple kinds of attacks, including Backdoor Attacks. 
Through the attack, the\nattacker manages to embed a hidden backdoor into the DNN such that the model\nbehaves normally on benign data samples, but makes attacker-specified judgments\ngiven the occurrence of a predefined trigger. Although numerous backdoor\nattacks have been experimented on image classification, backdoor attacks on\nobject detection tasks have not been properly investigated and explored. As\nobject detection has been adopted as an important module in multiple\nsecurity-sensitive applications such as autonomous driving, backdoor attacks on\nobject detection could pose even more severe threats. Inspired by the inherent\nproperty of deep learning-based object detectors, we propose a simple yet\neffective backdoor attack method against object detection without modifying the\nground truth annotations, specifically focusing on the object disappearance\nattack and object generation attack. Extensive experiments and ablation studies\nprove the effectiveness of our attack on two benchmark object detection\ndatasets, PASCAL VOC07+12 and MSCOCO, on which we achieve an attack success\nrate of more than 92% with a poison rate of only 5%.\n","authors":["Yize Cheng","Wenbin Hu","Minhao Cheng"],"pdf_url":"https://arxiv.org/pdf/2307.10487v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10475v1","updated":"2023-07-19T22:14:49Z","published":"2023-07-19T22:14:49Z","title":"Findings of Factify 2: Multimodal Fake News Detection","summary":" With social media usage growing exponentially in the past few years, fake\nnews has also become extremely prevalent. The detrimental impact of fake news\nemphasizes the need for research focused on automating the detection of false\ninformation and verifying its accuracy. In this work, we present the outcome of\nthe Factify 2 shared task, which provides a multi-modal fact verification and\nsatire news dataset, as part of the DeFactify 2 workshop at AAAI'23. The data\ncalls for a comparison based approach to the task by pairing social media\nclaims with supporting documents, with both text and image, divided into 5\nclasses based on multi-modal relations. In the second iteration of this task we\nhad over 60 participants and 9 final test-set submissions. The best\nperformances came from the use of DeBERTa for text and Swinv2 and CLIP for\nimage. The highest F1 score averaged for all five classes was 81.82%.\n","authors":["S Suryavardan","Shreyash Mishra","Megha Chakraborty","Parth Patwa","Anku Rani","Aman Chadha","Aishwarya Reganti","Amitava Das","Amit Sheth","Manoj Chinnakotla","Asif Ekbal","Srijan Kumar"],"pdf_url":"https://arxiv.org/pdf/2307.10475v1.pdf","comment":"Defactify2 @AAAI 2023"},{"id":"http://arxiv.org/abs/2307.10471v1","updated":"2023-07-19T21:45:07Z","published":"2023-07-19T21:45:07Z","title":"Classification of Visualization Types and Perspectives in Patents","summary":" Due to the swift growth of patent applications each year, information and\nmultimedia retrieval approaches that facilitate patent exploration and\nretrieval are of utmost importance. Different types of visualizations (e.g.,\ngraphs, technical drawings) and perspectives (e.g., side view, perspective) are\nused to visualize details of innovations in patents. The classification of\nthese images enables a more efficient search and allows for further analysis.\nSo far, datasets for image type classification miss some important\nvisualization types for patents. Furthermore, related work does not make use of\nrecent deep learning approaches including transformers. 
In this paper, we adopt\nstate-of-the-art deep learning methods for the classification of visualization\ntypes and perspectives in patent images. We extend the CLEF-IP dataset for\nimage type classification in patents to ten classes and provide manual ground\ntruth annotations. In addition, we derive a set of hierarchical classes from a\ndataset that provides weakly-labeled data for image perspectives. Experimental\nresults have demonstrated the feasibility of the proposed approaches. Source\ncode, models, and dataset will be made publicly available.\n","authors":["Junaid Ahmed Ghauri","Eric Müller-Budack","Ralph Ewerth"],"pdf_url":"https://arxiv.org/pdf/2307.10471v1.pdf","comment":"Accepted in International Conference on Theory and Practice of\n Digital Libraries (TPDL) 2023 (They have the copyright to publish\n camera-ready version of this work)"},{"id":"http://arxiv.org/abs/2307.10455v1","updated":"2023-07-19T20:54:08Z","published":"2023-07-19T20:54:08Z","title":"A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect\n Dataset","summary":" In an effort to catalog insect biodiversity, we propose a new large dataset\nof hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is\ntaxonomically classified by an expert, and also has associated genetic\ninformation including raw nucleotide barcode sequences and assigned barcode\nindex numbers, which are genetically-based proxies for species classification.\nThis paper presents a curated million-image dataset, primarily to train\ncomputer-vision models capable of providing image-based taxonomic assessment,\nhowever, the dataset also presents compelling characteristics, the study of\nwhich would be of interest to the broader machine learning community. Driven by\nthe biological nature inherent to the dataset, a characteristic long-tailed\nclass-imbalance distribution is exhibited. Furthermore, taxonomic labelling is\na hierarchical classification scheme, presenting a highly fine-grained\nclassification problem at lower levels. Beyond spurring interest in\nbiodiversity research within the machine learning community, progress on\ncreating an image-based taxonomic classifier will also further the ultimate\ngoal of all BIOSCAN research: to lay the foundation for a comprehensive survey\nof global biodiversity. This paper introduces the dataset and explores the\nclassification task through the implementation and analysis of a baseline\nclassifier.\n","authors":["Zahra Gharaee","ZeMing Gong","Nicholas Pellegrino","Iuliia Zarubiieva","Joakim Bruslund Haurum","Scott C. Lowe","Jaclyn T. A. McKeown","Chris C. Y. Ho","Joschka McLeod","Yi-Yun C Wei","Jireh Agda","Sujeevan Ratnasingham","Dirk Steinke","Angel X. Chang","Graham W. Taylor","Paul Fieguth"],"pdf_url":"https://arxiv.org/pdf/2307.10455v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10440v1","updated":"2023-07-19T20:11:30Z","published":"2023-07-19T20:11:30Z","title":"Confidence Estimation Using Unlabeled Data","summary":" Overconfidence is a common issue for deep neural networks, limiting their\ndeployment in real-world applications. To better estimate confidence, existing\nmethods mostly focus on fully-supervised scenarios and rely on training labels.\nIn this paper, we propose the first confidence estimation method for a\nsemi-supervised setting, when most training labels are unavailable. 
We\nstipulate that even with limited training labels, we can still reasonably\napproximate the confidence of model on unlabeled samples by inspecting the\nprediction consistency through the training process. We use training\nconsistency as a surrogate function and propose a consistency ranking loss for\nconfidence estimation. On both image classification and segmentation tasks, our\nmethod achieves state-of-the-art performances in confidence estimation.\nFurthermore, we show the benefit of the proposed method through a downstream\nactive learning task. The code is available at\nhttps://github.com/TopoXLab/consistency-ranking-loss\n","authors":["Chen Li","Xiaoling Hu","Chao Chen"],"pdf_url":"https://arxiv.org/pdf/2307.10440v1.pdf","comment":"Accepted by ICLR'23"},{"id":"http://arxiv.org/abs/2105.11166v6","updated":"2023-07-19T19:32:53Z","published":"2021-05-24T09:16:04Z","title":"AirNet: Neural Network Transmission over the Air","summary":" State-of-the-art performance for many edge applications is achieved by deep\nneural networks (DNNs). Often, these DNNs are location- and time-sensitive, and\nmust be delivered over a wireless channel rapidly and efficiently. In this\npaper, we introduce AirNet, a family of novel training and transmission methods\nthat allow DNNs to be efficiently delivered over wireless channels under\nstringent transmit power and latency constraints. This corresponds to a new\nclass of joint source-channel coding problems, aimed at delivering DNNs with\nthe goal of maximizing their accuracy at the receiver, rather than recovering\nthem with high fidelity. In AirNet, we propose the direct mapping of the DNN\nparameters to transmitted channel symbols, while the network is trained to meet\nthe channel constraints, and exhibit robustness against channel noise. AirNet\nachieves higher accuracy compared to separation-based alternatives. We further\nimprove the performance of AirNet by pruning the network below the available\nbandwidth, and expanding it for improved robustness. We also benefit from\nunequal error protection by selectively expanding important layers of the\nnetwork. Finally, we develop an approach, which simultaneously trains a\nspectrum of DNNs, each targeting a different channel condition, resolving the\nimpractical memory requirements of training distinct networks for different\nchannel conditions.\n","authors":["Mikolaj Jankowski","Deniz Gunduz","Krystian Mikolajczyk"],"pdf_url":"https://arxiv.org/pdf/2105.11166v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.09826v2","updated":"2023-07-19T19:30:52Z","published":"2023-04-16T11:22:59Z","title":"Fairness in AI and Its Long-Term Implications on Society","summary":" Successful deployment of artificial intelligence (AI) in various settings has\nled to numerous positive outcomes for individuals and society. However, AI\nsystems have also been shown to harm parts of the population due to biased\npredictions. AI fairness focuses on mitigating such biases to ensure AI\ndecision making is not discriminatory towards certain groups. We take a closer\nlook at AI fairness and analyze how lack of AI fairness can lead to deepening\nof biases over time and act as a social stressor. More specifically, we discuss\nhow biased models can lead to more negative real-world outcomes for certain\ngroups, which may then become more prevalent by deploying new AI models trained\non increasingly biased data, resulting in a feedback loop. 
If the issues\npersist, they could be reinforced by interactions with other risks and have\nsevere implications on society in the form of social unrest. We examine current\nstrategies for improving AI fairness, assess their limitations in terms of\nreal-world deployment, and explore potential paths forward to ensure we reap\nAI's benefits without causing society's collapse.\n","authors":["Ondrej Bohdal","Timothy Hospedales","Philip H. S. Torr","Fazl Barez"],"pdf_url":"https://arxiv.org/pdf/2304.09826v2.pdf","comment":"Stanford Existential Risks Conference 2023"},{"id":"http://arxiv.org/abs/2307.10422v1","updated":"2023-07-19T19:19:13Z","published":"2023-07-19T19:19:13Z","title":"PreDiff: Precipitation Nowcasting with Latent Diffusion Models","summary":" Earth system forecasting has traditionally relied on complex physical models\nthat are computationally expensive and require significant domain expertise. In\nthe past decade, the unprecedented increase in spatiotemporal Earth observation\ndata has enabled data-driven forecasting models using deep learning techniques.\nThese models have shown promise for diverse Earth system forecasting tasks but\neither struggle with handling uncertainty or neglect domain-specific prior\nknowledge, resulting in averaging possible futures to blurred forecasts or\ngenerating physically implausible predictions. To address these limitations, we\npropose a two-stage pipeline for probabilistic spatiotemporal forecasting: 1)\nWe develop PreDiff, a conditional latent diffusion model capable of\nprobabilistic forecasts. 2) We incorporate an explicit knowledge control\nmechanism to align forecasts with domain-specific physical constraints. This is\nachieved by estimating the deviation from imposed constraints at each denoising\nstep and adjusting the transition distribution accordingly. We conduct\nempirical studies on two datasets: N-body MNIST, a synthetic dataset with\nchaotic behavior, and SEVIR, a real-world precipitation nowcasting dataset.\nSpecifically, we impose the law of conservation of energy in N-body MNIST and\nanticipated precipitation intensity in SEVIR. Experiments demonstrate the\neffectiveness of PreDiff in handling uncertainty, incorporating domain-specific\nprior knowledge, and generating forecasts that exhibit high operational\nutility.\n","authors":["Zhihan Gao","Xingjian Shi","Boran Han","Hao Wang","Xiaoyong Jin","Danielle Maddix","Yi Zhu","Mu Li","Yuyang Wang"],"pdf_url":"https://arxiv.org/pdf/2307.10422v1.pdf","comment":"Technical report"},{"id":"http://arxiv.org/abs/2307.10408v1","updated":"2023-07-19T18:37:57Z","published":"2023-07-19T18:37:57Z","title":"Explaining Autonomous Driving Actions with Visual Question Answering","summary":" The end-to-end learning ability of self-driving vehicles has achieved\nsignificant milestones over the last decade owing to rapid advances in deep\nlearning and computer vision algorithms. However, as autonomous driving\ntechnology is a safety-critical application of artificial intelligence (AI),\nroad accidents and established regulatory principles necessitate the need for\nthe explainability of intelligent action choices for self-driving vehicles. To\nfacilitate interpretability of decision-making in autonomous driving, we\npresent a Visual Question Answering (VQA) framework, which explains driving\nactions with question-answering-based causal reasoning. 
To do so, we first\ncollect driving videos in a simulation environment using reinforcement learning\n(RL) and extract consecutive frames from this log data uniformly for five\nselected action categories. Further, we manually annotate the extracted frames\nusing question-answer pairs as justifications for the actions chosen in each\nscenario. Finally, we evaluate the correctness of the VQA-predicted answers for\nactions on unseen driving scenes. The empirical results suggest that the VQA\nmechanism can provide support to interpret real-time decisions of autonomous\nvehicles and help enhance overall driving safety.\n","authors":["Shahin Atakishiyev","Mohammad Salameh","Housam Babiker","Randy Goebel"],"pdf_url":"https://arxiv.org/pdf/2307.10408v1.pdf","comment":"Accepted to the 2023 IEEE International Conference on Intelligent\n Transportation Systems (IEEE ITSC-2023)"},{"id":"http://arxiv.org/abs/2307.10404v1","updated":"2023-07-19T18:19:18Z","published":"2023-07-19T18:19:18Z","title":"Interpreting and Correcting Medical Image Classification with PIP-Net","summary":" Part-prototype models are explainable-by-design image classifiers, and a\npromising alternative to black box AI. This paper explores the applicability\nand potential of interpretable machine learning, in particular PIP-Net, for\nautomated diagnosis support on real-world medical imaging data. PIP-Net learns\nhuman-understandable prototypical image parts and we evaluate its accuracy and\ninterpretability for fracture detection and skin cancer diagnosis. We find that\nPIP-Net's decision making process is in line with medical classification\nstandards, while only provided with image-level class labels. Because of\nPIP-Net's unsupervised pretraining of prototypes, data quality problems such as\nundesired text in an X-ray or labelling errors can be easily identified.\nAdditionally, we are the first to show that humans can manually correct the\nreasoning of PIP-Net by directly disabling undesired prototypes. We conclude\nthat part-prototype models are promising for medical applications due to their\ninterpretability and potential for advanced model debugging.\n","authors":["Meike Nauta","Johannes H. Hegeman","Jeroen Geerdink","Jörg Schlötterer","Maurice van Keulen","Christin Seifert"],"pdf_url":"https://arxiv.org/pdf/2307.10404v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10387v1","updated":"2023-07-19T18:00:32Z","published":"2023-07-19T18:00:32Z","title":"POV-Surgery: A Dataset for Egocentric Hand and Tool Pose Estimation\n During Surgical Activities","summary":" The surgical usage of Mixed Reality (MR) has received growing attention in\nareas such as surgical navigation systems, skill assessment, and robot-assisted\nsurgeries. For such applications, pose estimation for hand and surgical\ninstruments from an egocentric perspective is a fundamental task and has been\nstudied extensively in the computer vision field in recent years. However, the\ndevelopment of this field has been impeded by a lack of datasets, especially in\nthe surgical field, where bloody gloves and reflective metallic tools make it\nhard to obtain 3D pose annotations for hands and objects using conventional\nmethods. To address this issue, we propose POV-Surgery, a large-scale,\nsynthetic, egocentric dataset focusing on pose estimation for hands with\ndifferent surgical gloves and three orthopedic surgical instruments, namely\nscalpel, friem, and diskplacer. 
Our dataset consists of 53 sequences and 88,329\nframes, featuring high-resolution RGB-D video streams with activity\nannotations, accurate 3D and 2D annotations for hand-object pose, and 2D\nhand-object segmentation masks. We fine-tune the current SOTA methods on\nPOV-Surgery and further show the generalizability when applying to real-life\ncases with surgical gloves and tools by extensive evaluations. The code and the\ndataset are publicly available at batfacewayne.github.io/POV_Surgery_io/.\n","authors":["Rui Wang","Sophokles Ktistakis","Siwei Zhang","Mirko Meboldt","Quentin Lohmeyer"],"pdf_url":"https://arxiv.org/pdf/2307.10387v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10373v1","updated":"2023-07-19T18:00:03Z","published":"2023-07-19T18:00:03Z","title":"TokenFlow: Consistent Diffusion Features for Consistent Video Editing","summary":" The generative AI revolution has recently expanded to videos. Nevertheless,\ncurrent state-of-the-art video models are still lagging behind image models in\nterms of visual quality and user control over the generated content. In this\nwork, we present a framework that harnesses the power of a text-to-image\ndiffusion model for the task of text-driven video editing. Specifically, given\na source video and a target text-prompt, our method generates a high-quality\nvideo that adheres to the target text, while preserving the spatial layout and\nmotion of the input video. Our method is based on a key observation that\nconsistency in the edited video can be obtained by enforcing consistency in the\ndiffusion feature space. We achieve this by explicitly propagating diffusion\nfeatures based on inter-frame correspondences, readily available in the model.\nThus, our framework does not require any training or fine-tuning, and can work\nin conjunction with any off-the-shelf text-to-image editing method. We\ndemonstrate state-of-the-art editing results on a variety of real-world videos.\nWebpage: https://diffusion-tokenflow.github.io/\n","authors":["Michal Geyer","Omer Bar-Tal","Shai Bagon","Tali Dekel"],"pdf_url":"https://arxiv.org/pdf/2307.10373v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10350v1","updated":"2023-07-19T17:47:12Z","published":"2023-07-19T17:47:12Z","title":"Improving Multimodal Datasets with Image Captioning","summary":" Massive web datasets play a key role in the success of large vision-language\nmodels like CLIP and Flamingo. However, the raw web data is noisy, and existing\nfiltering methods to reduce noise often come at the expense of data diversity.\nOur work focuses on caption quality as one major source of noise, and studies\nhow generated captions can increase the utility of web-scraped datapoints with\nnondescript text. Through exploring different mixing strategies for raw and\ngenerated captions, we outperform the best filtering method proposed by the\nDataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a\ncandidate pool of 128M image-text pairs. Our best approach is also 2x better at\nFlickr and MS-COCO retrieval. We then analyze what makes synthetic captions an\neffective source of text supervision. 
In experimenting with different image\ncaptioning models, we also demonstrate that the performance of a model on\nstandard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable\nindicator of the utility of the captions it generates for multimodal training.\nFinally, our experiments with using generated captions at DataComp's large\nscale (1.28B image-text pairs) offer insights into the limitations of synthetic\ntext, as well as the importance of image curation with increasing training data\nquantity.\n","authors":["Thao Nguyen","Samir Yitzhak Gadre","Gabriel Ilharco","Sewoong Oh","Ludwig Schmidt"],"pdf_url":"https://arxiv.org/pdf/2307.10350v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2307.09989v1","updated":"2023-07-19T13:49:35Z","published":"2023-07-19T13:49:35Z","title":"UniMatch: A Unified User-Item Matching Framework for the Multi-purpose\n Merchant Marketing","summary":" When doing private domain marketing with cloud services, the merchants\nusually have to purchase different machine learning models for the multiple\nmarketing purposes, leading to a very high cost. We present a unified user-item\nmatching framework to simultaneously conduct item recommendation and user\ntargeting with just one model. We empirically demonstrate that the above\nconcurrent modeling is viable via modeling the user-item interaction matrix\nwith the multinomial distribution, and propose a bidirectional bias-corrected\nNCE loss for the implementation. The proposed loss function guides the model to\nlearn the user-item joint probability $p(u,i)$ instead of the conditional\nprobability $p(i|u)$ or $p(u|i)$ through correcting both the users and items'\nbiases caused by the in-batch negative sampling. In addition, our framework is\nmodel-agnostic enabling a flexible adaptation of different model architectures.\nExtensive experiments demonstrate that our framework results in significant\nperformance gains in comparison with the state-of-the-art methods, with greatly\nreduced cost on computing resources and daily maintenance.\n","authors":["Qifang Zhao","Tianyu Li","Meng Du","Yu Jiang","Qinghui Sun","Zhongyao Wang","Hong Liu","Huan Xu"],"pdf_url":"https://arxiv.org/pdf/2307.09989v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09985v1","updated":"2023-07-19T13:44:32Z","published":"2023-07-19T13:44:32Z","title":"Our Model Achieves Excellent Performance on MovieLens: What Does it\n Mean?","summary":" A typical benchmark dataset for recommender system (RecSys) evaluation\nconsists of user-item interactions generated on a platform within a time\nperiod. The interaction generation mechanism partially explains why a user\ninteracts with (e.g.,like, purchase, rate) an item, and the context of when a\nparticular interaction happened. In this study, we conduct a meticulous\nanalysis on the MovieLens dataset and explain the potential impact on using the\ndataset for evaluating recommendation algorithms. We make a few main findings\nfrom our analysis. First, there are significant differences in user\ninteractions at the different stages when a user interacts with the MovieLens\nplatform. The early interactions largely define the user portrait which affect\nthe subsequent interactions. Second, user interactions are highly affected by\nthe candidate movies that are recommended by the platform's internal\nrecommendation algorithm(s). 
Removal of interactions that happen nearer to the\nlast few interactions of a user leads to increasing difficulty in learning user\npreference, thus deteriorating recommendation accuracy. Third, changing the\norder of user interactions makes it more difficult for sequential algorithms to\ncapture the progressive interaction process. Based on these findings, we\nfurther discuss the discrepancy between the interaction generation mechanism\nthat is employed by the MovieLens system and that of typical real world\nrecommendation scenarios. In summary, models that achieve excellent\nrecommendation accuracy on the MovieLens dataset may not demonstrate superior\nperformance in practice for at least two kinds of differences: (i) the\ndifferences in the contexts of user-item interaction generation, and (ii) the\ndifferences in user knowledge about the item collections.\n","authors":["Yu-chen Fan","Yitong Ji","Jie Zhang","Aixin Sun"],"pdf_url":"https://arxiv.org/pdf/2307.09985v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09834v1","updated":"2023-07-19T08:44:11Z","published":"2023-07-19T08:44:11Z","title":"Who Provides the Largest Megaphone? The Role of Google News in Promoting\n Russian State-Affiliated News Sources","summary":" The Internet has not only digitized but also democratized information access\nacross the globe. This gradual but path-breaking move to online information\npropagation has resulted in search engines playing an increasingly prominent\nrole in shaping access to human knowledge. When an Internet user enters a\nquery, the search engine sorts through the hundreds of billions of possible\nwebpages to determine what to show. Google dominates the search engine market,\nwith Google Search surpassing 80% market share globally every year of the last\ndecade. Only in Russia and China do Google competitors claim more market share,\nwith approximately 60% of Internet users in Russia preferring Yandex (compared\nto 40% in favor of Google) and more than 80% of China's Internet users\naccessing Baidu as of 2022. Notwithstanding this long-standing regional\nvariation in Internet search providers, there is limited research showing how\nthese providers compare in terms of propagating state-sponsored information.\nOur study fills this research gap by focusing on Russian cyberspace and\nexamining how Google and Yandex's search algorithms rank content from Russian\nstate-controlled media (hereon, RSM) outlets. This question is timely and of\npractical interest given widespread reports indicating that RSM outlets have\nactively engaged in promoting Kremlin propaganda in the lead-up to, and in the\naftermath of, the Russian invasion of Ukraine in February 2022.\n","authors":["Keeley Erhardt","Saurabh Khanna"],"pdf_url":"https://arxiv.org/pdf/2307.09834v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09775v1","updated":"2023-07-19T06:31:58Z","published":"2023-07-19T06:31:58Z","title":"DisCover: Disentangled Music Representation Learning for Cover Song\n Identification","summary":" In the field of music information retrieval (MIR), cover song identification\n(CSI) is a challenging task that aims to identify cover versions of a query\nsong from a massive collection. Existing works still suffer from high\nintra-song variances and inter-song correlations, due to the entangled nature\nof version-specific and version-invariant factors in their modeling. 
In this\nwork, we set the goal of disentangling version-specific and version-invariant\nfactors, which could make it easier for the model to learn invariant music\nrepresentations for unseen query songs. We analyze the CSI task in a\ndisentanglement view with the causal graph technique, and identify the\nintra-version and inter-version effects biasing the invariant learning. To\nblock these effects, we propose the disentangled music representation learning\nframework (DisCover) for CSI. DisCover consists of two critical components: (1)\nKnowledge-guided Disentanglement Module (KDM) and (2) Gradient-based\nAdversarial Disentanglement Module (GADM), which block intra-version and\ninter-version biased effects, respectively. KDM minimizes the mutual\ninformation between the learned representations and version-variant factors\nthat are identified with prior domain knowledge. GADM identifies\nversion-variant factors by simulating the representation transitions between\nintra-song versions, and exploits adversarial distillation for effect blocking.\nExtensive comparisons with best-performing methods and in-depth analysis\ndemonstrate the effectiveness of DisCover and the and necessity of\ndisentanglement for CSI.\n","authors":["Jiahao Xun","Shengyu Zhang","Yanting Yang","Jieming Zhu","Liqun Deng","Zhou Zhao","Zhenhua Dong","Ruiqi Li","Lichao Zhang","Fei Wu"],"pdf_url":"https://arxiv.org/pdf/2307.09775v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09751v1","updated":"2023-07-19T05:23:43Z","published":"2023-07-19T05:23:43Z","title":"Information Retrieval Meets Large Language Models: A Strategic Report\n from Chinese IR Community","summary":" The research field of Information Retrieval (IR) has evolved significantly,\nexpanding beyond traditional search to meet diverse user information needs.\nRecently, Large Language Models (LLMs) have demonstrated exceptional\ncapabilities in text understanding, generation, and knowledge inference,\nopening up exciting avenues for IR research. LLMs not only facilitate\ngenerative retrieval but also offer improved solutions for user understanding,\nmodel evaluation, and user-system interactions. More importantly, the\nsynergistic relationship among IR models, LLMs, and humans forms a new\ntechnical paradigm that is more powerful for information seeking. IR models\nprovide real-time and relevant information, LLMs contribute internal knowledge,\nand humans play a central role of demanders and evaluators to the reliability\nof information services. Nevertheless, significant challenges exist, including\ncomputational costs, credibility concerns, domain-specific limitations, and\nethical considerations. To thoroughly discuss the transformative impact of LLMs\non IR research, the Chinese IR community conducted a strategic workshop in\nApril 2023, yielding valuable insights. 
This paper provides a summary of the\nworkshop's outcomes, including the rethinking of IR's core values, the mutual\nenhancement of LLMs and IR, the proposal of a novel IR technical paradigm, and\nopen challenges.\n","authors":["Qingyao Ai","Ting Bai","Zhao Cao","Yi Chang","Jiawei Chen","Zhumin Chen","Zhiyong Cheng","Shoubin Dong","Zhicheng Dou","Fuli Feng","Shen Gao","Jiafeng Guo","Xiangnan He","Yanyan Lan","Chenliang Li","Yiqun Liu","Ziyu Lyu","Weizhi Ma","Jun Ma","Zhaochun Ren","Pengjie Ren","Zhiqiang Wang","Mingwen Wang","Jirong Wen","Le Wu","Xin Xin","Jun Xu","Dawei Yin","Peng Zhang","Fan Zhang","Weinan Zhang","Min Zhang","Xiaofei Zhu"],"pdf_url":"https://arxiv.org/pdf/2307.09751v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2208.06265v2","updated":"2023-07-19T05:08:06Z","published":"2022-08-10T08:28:46Z","title":"Trustworthy Recommender Systems","summary":" Recommender systems (RSs) aim to help users to effectively retrieve items of\ntheir interests from a large catalogue. For a quite long period of time,\nresearchers and practitioners have been focusing on developing accurate RSs.\nRecent years have witnessed an increasing number of threats to RSs, coming from\nattacks, system and user generated noise, system bias. As a result, it has\nbecome clear that a strict focus on RS accuracy is limited and the research\nmust consider other important factors, e.g., trustworthiness. For end users, a\ntrustworthy RS (TRS) should not only be accurate, but also transparent,\nunbiased and fair as well as robust to noise or attacks. These observations\nactually led to a paradigm shift of the research on RSs: from accuracy-oriented\nRSs to TRSs. However, researchers lack a systematic overview and discussion of\nthe literature in this novel and fast developing field of TRSs. To this end, in\nthis paper, we provide an overview of TRSs, including a discussion of the\nmotivation and basic concepts of TRSs, a presentation of the challenges in\nbuilding TRSs, and a perspective on the future directions in this area. We also\nprovide a novel conceptual framework to support the construction of TRSs.\n","authors":["Shoujin Wang","Xiuzhen Zhang","Yan Wang","Huan Liu","Francesco Ricci"],"pdf_url":"https://arxiv.org/pdf/2208.06265v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09688v1","updated":"2023-07-19T00:08:49Z","published":"2023-07-19T00:08:49Z","title":"Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for\n Recommendation and Text Generation","summary":" Modeling customer shopping intentions is a crucial task for e-commerce, as it\ndirectly impacts user experience and engagement. Thus, accurately understanding\ncustomer preferences is essential for providing personalized recommendations.\nSession-based recommendation, which utilizes customer session data to predict\ntheir next interaction, has become increasingly popular. However, existing\nsession datasets have limitations in terms of item attributes, user diversity,\nand dataset scale. As a result, they cannot comprehensively capture the\nspectrum of user behaviors and preferences. To bridge this gap, we present the\nAmazon Multilingual Multi-locale Shopping Session Dataset, namely Amazon-M2. It\nis the first multilingual dataset consisting of millions of user sessions from\nsix different locales, where the major languages of products are English,\nGerman, Japanese, French, Italian, and Spanish. 
Remarkably, the dataset can\nhelp us enhance personalization and understanding of user preferences, which\ncan benefit various existing tasks as well as enable new tasks. To test the\npotential of the dataset, we introduce three tasks in this work: (1)\nnext-product recommendation, (2) next-product recommendation with domain\nshifts, and (3) next-product title generation. With the above tasks, we\nbenchmark a range of algorithms on our proposed dataset, drawing new insights\nfor further research and practice. In addition, based on the proposed dataset\nand tasks, we hosted a competition in the KDD CUP 2023 and have attracted\nthousands of users and submissions. The winning solutions and the associated\nworkshop can be accessed at our website https://kddcup23.github.io/.\n","authors":["Wei Jin","Haitao Mao","Zheng Li","Haoming Jiang","Chen Luo","Hongzhi Wen","Haoyu Han","Hanqing Lu","Zhengyang Wang","Ruirui Li","Zhen Li","Monica Xiao Cheng","Rahul Goutam","Haiyang Zhang","Karthik Subbian","Suhang Wang","Yizhou Sun","Jiliang Tang","Bing Yin","Xianfeng Tang"],"pdf_url":"https://arxiv.org/pdf/2307.09688v1.pdf","comment":"Dataset for KDD Cup 2023, https://kddcup23.github.io/"},{"id":"http://arxiv.org/abs/2205.11498v2","updated":"2023-07-19T23:05:57Z","published":"2022-05-23T17:53:44Z","title":"Injecting Domain Adaptation with Learning-to-hash for Effective and\n Efficient Zero-shot Dense Retrieval","summary":" Dense retrieval overcome the lexical gap and has shown great success in\nad-hoc information retrieval (IR). Despite their success, dense retrievers are\nexpensive to serve across practical use cases. For use cases requiring to\nsearch from millions of documents, the dense index becomes bulky and requires\nhigh memory usage for storing the index. More recently, learning-to-hash (LTH)\ntechniques, for e.g., BPR and JPQ, produce binary document vectors, thereby\nreducing the memory requirement to efficiently store the dense index. LTH\ntechniques are supervised and finetune the retriever using a ranking loss. They\noutperform their counterparts, i.e., traditional out-of-the-box vector\ncompression techniques such as PCA or PQ. A missing piece from prior work is\nthat existing techniques have been evaluated only in-domain, i.e., on a single\ndataset such as MS MARCO. In our work, we evaluate LTH and vector compression\ntechniques for improving the downstream zero-shot retrieval accuracy of the\nTAS-B dense retriever while maintaining efficiency at inference. Our results\ndemonstrate that, unlike prior work, LTH strategies when applied naively can\nunderperform the zero-shot TAS-B dense retriever on average by up to 14%\nnDCG@10 on the BEIR benchmark. To solve this limitation, in our work, we\npropose an easy yet effective solution of injecting domain adaptation with\nexisting supervised LTH techniques. We experiment with two well-known\nunsupervised domain adaptation techniques: GenQ and GPL. 
Our domain adaptation\ninjection technique can improve the downstream zero-shot retrieval\neffectiveness for both BPR and JPQ variants of the TAS-B model by on average\n11.5% and 8.2% nDCG@10 while both maintaining 32$\\times$ memory efficiency and\n14$\\times$ and 2$\\times$ speedup respectively in CPU retrieval latency on BEIR.\nAll our code, models, and data are publicly available at\nhttps://github.com/thakur-nandan/income.\n","authors":["Nandan Thakur","Nils Reimers","Jimmy Lin"],"pdf_url":"https://arxiv.org/pdf/2205.11498v2.pdf","comment":"Accepted at ReNeuIR 2023 Workshop"},{"id":"http://arxiv.org/abs/2307.10488v1","updated":"2023-07-19T22:48:02Z","published":"2023-07-19T22:48:02Z","title":"SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot\n Neural Sparse Retrieval","summary":" Traditionally, sparse retrieval systems relied on lexical representations to\nretrieve documents, such as BM25, dominated information retrieval tasks. With\nthe onset of pre-trained transformer models such as BERT, neural sparse\nretrieval has led to a new paradigm within retrieval. Despite the success,\nthere has been limited software supporting different sparse retrievers running\nin a unified, common environment. This hinders practitioners from fairly\ncomparing different sparse models and obtaining realistic evaluation results.\nAnother missing piece is, that a majority of prior work evaluates sparse\nretrieval models on in-domain retrieval, i.e. on a single dataset: MS MARCO.\nHowever, a key requirement in practical retrieval systems requires models that\ncan generalize well to unseen out-of-domain, i.e. zero-shot retrieval tasks. In\nthis work, we provide SPRINT, a unified Python toolkit based on Pyserini and\nLucene, supporting a common interface for evaluating neural sparse retrieval.\nThe toolkit currently includes five built-in models: uniCOIL, DeepImpact,\nSPARTA, TILDEv2 and SPLADEv2. Users can also easily add customized models by\ndefining their term weighting method. Using our toolkit, we establish strong\nand reproducible zero-shot sparse retrieval baselines across the\nwell-acknowledged benchmark, BEIR. Our results demonstrate that SPLADEv2\nachieves the best average score of 0.470 nDCG@10 on BEIR amongst all neural\nsparse retrievers. In this work, we further uncover the reasons behind its\nperformance gain. We show that SPLADEv2 produces sparse representations with a\nmajority of tokens outside of the original query and document which is often\ncrucial for its performance gains, i.e. a limitation among its other sparse\ncounterparts. We provide our SPRINT toolkit, models, and data used in our\nexperiments publicly here at https://github.com/thakur-nandan/sprint.\n","authors":["Nandan Thakur","Kexin Wang","Iryna Gurevych","Jimmy Lin"],"pdf_url":"https://arxiv.org/pdf/2307.10488v1.pdf","comment":"Accepted at SIGIR 2023 (Resource Track)"},{"id":"http://arxiv.org/abs/2307.10479v1","updated":"2023-07-19T22:20:06Z","published":"2023-07-19T22:20:06Z","title":"Fast Approximate Nearest Neighbor Search with a Dynamic Exploration\n Graph using Continuous Refinement","summary":" For approximate nearest neighbor search, graph-based algorithms have shown to\noffer the best trade-off between accuracy and search time. 
We propose the\nDynamic Exploration Graph (DEG) which significantly outperforms existing\nalgorithms in terms of search and exploration efficiency by combining two new\nideas: First, a single undirected even regular graph is incrementally built by\npartially replacing existing edges to integrate new vertices and to update old\nneighborhoods at the same time. Secondly, an edge optimization algorithm is\nused to continuously improve the quality of the graph. Combining this ongoing\nrefinement with the graph construction process leads to a well-organized graph\nstructure at all times, resulting in: (1) increased search efficiency, (2)\npredictable index size, (3) guaranteed connectivity and therefore reachability\nof all vertices, and (4) a dynamic graph structure. In addition we investigate\nhow well existing graph-based search systems can handle indexed queries where\nthe seed vertex of a search is the query itself. Such exploration tasks,\ndespite their good starting point, are not necessarily easy. High efficiency in\napproximate nearest neighbor search (ANNS) does not automatically imply good\nperformance in exploratory search. Extensive experiments show that our new\nDynamic Exploration Graph outperforms existing algorithms significantly for\nindexed and unindexed queries.\n","authors":["Nico Hezel","Kai Uwe Barthel","Konstantin Schall","Klaus Jung"],"pdf_url":"https://arxiv.org/pdf/2307.10479v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10471v1","updated":"2023-07-19T21:45:07Z","published":"2023-07-19T21:45:07Z","title":"Classification of Visualization Types and Perspectives in Patents","summary":" Due to the swift growth of patent applications each year, information and\nmultimedia retrieval approaches that facilitate patent exploration and\nretrieval are of utmost importance. Different types of visualizations (e.g.,\ngraphs, technical drawings) and perspectives (e.g., side view, perspective) are\nused to visualize details of innovations in patents. The classification of\nthese images enables a more efficient search and allows for further analysis.\nSo far, datasets for image type classification miss some important\nvisualization types for patents. Furthermore, related work does not make use of\nrecent deep learning approaches including transformers. In this paper, we adopt\nstate-of-the-art deep learning methods for the classification of visualization\ntypes and perspectives in patent images. We extend the CLEF-IP dataset for\nimage type classification in patents to ten classes and provide manual ground\ntruth annotations. In addition, we derive a set of hierarchical classes from a\ndataset that provides weakly-labeled data for image perspectives. Experimental\nresults have demonstrated the feasibility of the proposed approaches. Source\ncode, models, and dataset will be made publicly available.\n","authors":["Junaid Ahmed Ghauri","Eric Müller-Budack","Ralph Ewerth"],"pdf_url":"https://arxiv.org/pdf/2307.10471v1.pdf","comment":"Accepted in International Conference on Theory and Practice of\n Digital Libraries (TPDL) 2023 (They have the copyright to publish\n camera-ready version of this work)"},{"id":"http://arxiv.org/abs/2109.12509v3","updated":"2023-07-19T21:28:52Z","published":"2021-09-26T06:54:26Z","title":"Deep Exploration for Recommendation Systems","summary":" Modern recommendation systems ought to benefit by probing for and learning\nfrom delayed feedback. Research has tended to focus on learning from a user's\nresponse to a single recommendation. 
Such work, which leverages methods of\nsupervised and bandit learning, forgoes learning from the user's subsequent\nbehavior. Where past work has aimed to learn from subsequent behavior, there\nhas been a lack of effective methods for probing to elicit informative delayed\nfeedback. Effective exploration through probing for delayed feedback becomes\nparticularly challenging when rewards are sparse. To address this, we develop\ndeep exploration methods for recommendation systems. In particular, we\nformulate recommendation as a sequential decision problem and demonstrate\nbenefits of deep exploration over single-step exploration. Our experiments are\ncarried out with high-fidelity industrial-grade simulators and establish large\nimprovements over existing algorithms.\n","authors":["Zheqing Zhu","Benjamin Van Roy"],"pdf_url":"https://arxiv.org/pdf/2109.12509v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10323v1","updated":"2023-07-19T07:20:30Z","published":"2023-07-19T07:20:30Z","title":"IncDSI: Incrementally Updatable Document Retrieval","summary":" Differentiable Search Index is a recently proposed paradigm for document\nretrieval, that encodes information about a corpus of documents within the\nparameters of a neural network and directly maps queries to corresponding\ndocuments. These models have achieved state-of-the-art performances for\ndocument retrieval across many benchmarks. These kinds of models have a\nsignificant limitation: it is not easy to add new documents after a model is\ntrained. We propose IncDSI, a method to add documents in real time (about\n20-50ms per document), without retraining the model on the entire dataset (or\neven parts thereof). Instead we formulate the addition of documents as a\nconstrained optimization problem that makes minimal changes to the network\nparameters. Although orders of magnitude faster, our approach is competitive\nwith re-training the model on the whole dataset and enables the development of\ndocument retrieval systems that can be updated with new information in\nreal-time. Our code for IncDSI is available at\nhttps://github.com/varshakishore/IncDSI.\n","authors":["Varsha Kishore","Chao Wan","Justin Lovelace","Yoav Artzi","Kilian Q. Weinberger"],"pdf_url":"https://arxiv.org/pdf/2307.10323v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.00370v2","updated":"2023-07-19T06:55:04Z","published":"2023-07-01T15:44:53Z","title":"Improving Text Matching in E-Commerce Search with A Rationalizable,\n Intervenable and Fast Entity-Based Relevance Model","summary":" Discovering the intended items of user queries from a massive repository of\nitems is one of the main goals of an e-commerce search system. Relevance\nprediction is essential to the search system since it helps improve\nperformance. When online serving a relevance model, the model is required to\nperform fast and accurate inference. Currently, the widely used models such as\nBi-encoder and Cross-encoder have their limitations in accuracy or inference\nspeed respectively. In this work, we propose a novel model called the\nEntity-Based Relevance Model (EBRM). We identify the entities contained in an\nitem and decompose the QI (query-item) relevance problem into multiple QE\n(query-entity) relevance problems; we then aggregate their results to form the\nQI prediction using a soft logic formulation. The decomposition allows us to\nuse a Cross-encoder QE relevance module for high accuracy as well as cache QE\npredictions for fast online inference. 
Utilizing soft logic makes the\nprediction procedure interpretable and intervenable. We also show that\npretraining the QE module with auto-generated QE data from user logs can\nfurther improve the overall performance. The proposed method is evaluated on\nlabeled data from e-commerce websites. Empirical results show that it achieves\npromising improvements with computation efficiency.\n","authors":["Jiong Cai","Yong Jiang","Yue Zhang","Chengyue Jiang","Ke Yu","Jianhui Ji","Rong Xiao","Haihong Tang","Tao Wang","Zhongqiang Huang","Pengjun Xie","Fei Huang","Kewei Tu"],"pdf_url":"https://arxiv.org/pdf/2307.00370v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10314v1","updated":"2023-07-19T03:31:41Z","published":"2023-07-19T03:31:41Z","title":"Mood Classification of Bangla Songs Based on Lyrics","summary":" Music can evoke various emotions, and with the advancement of technology, it\nhas become more accessible to people. Bangla music, which portrays different\nhuman emotions, lacks sufficient research. The authors of this article aim to\nanalyze Bangla songs and classify their moods based on the lyrics. To achieve\nthis, this research has compiled a dataset of 4000 Bangla song lyrics, genres,\nand used Natural Language Processing and the Bert Algorithm to analyze the\ndata. Among the 4000 songs, 1513 songs are represented for the sad mood, 1362\nfor the romantic mood, 886 for happiness, and the rest 239 are classified as\nrelaxation. By embedding the lyrics of the songs, the authors have classified\nthe songs into four moods: Happy, Sad, Romantic, and Relaxed. This research is\ncrucial as it enables a multi-class classification of songs' moods, making the\nmusic more relatable to people's emotions. The article presents the automated\nresult of the four moods accurately derived from the song lyrics.\n","authors":["Maliha Mahajebin","Mohammad Rifat Ahmmad Rashid","Nafees Mansoor"],"pdf_url":"https://arxiv.org/pdf/2307.10314v1.pdf","comment":"Presented at International Conference on. Inventive Communication and\n Computational Technologies 2023"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2307.10171v1","updated":"2023-07-19T17:57:27Z","published":"2023-07-19T17:57:27Z","title":"LightPath: Lightweight and Scalable Path Representation Learning","summary":" Movement paths are used widely in intelligent transportation and smart city\napplications. To serve such applications, path representation learning aims to\nprovide compact representations of paths that enable efficient and accurate\noperations when used for different downstream tasks such as path ranking and\ntravel cost estimation. In many cases, it is attractive that the path\nrepresentation learning is lightweight and scalable; in resource-limited\nenvironments and under green computing limitations, it is essential. Yet,\nexisting path representation learning studies focus on accuracy and pay at most\nsecondary attention to resource consumption and scalability.\n We propose a lightweight and scalable path representation learning framework,\ntermed LightPath, that aims to reduce resource consumption and achieve\nscalability without affecting accuracy, thus enabling broader applicability.\nMore specifically, we first propose a sparse auto-encoder that ensures that the\nframework achieves good scalability with respect to path length. Next, we\npropose a relational reasoning framework to enable faster training of more\nrobust sparse path encoders. 
We also propose global-local knowledge\ndistillation to further reduce the size and improve the performance of sparse\npath encoders. Finally, we report extensive experiments on two real-world\ndatasets to offer insight into the efficiency, scalability, and effectiveness\nof the proposed framework.\n","authors":["Sean Bin Yang","Jilin Hu","Chenjuan Guo","Bin Yang","Christian S. Jensen"],"pdf_url":"https://arxiv.org/pdf/2307.10171v1.pdf","comment":"This paper has been accepted by ACM SIGKDD-23"},{"id":"http://arxiv.org/abs/2212.07383v3","updated":"2023-07-19T17:56:01Z","published":"2022-12-14T18:08:42Z","title":"Sequential Kernelized Independence Testing","summary":" Independence testing is a classical statistical problem that has been\nextensively studied in the batch setting when one fixes the sample size before\ncollecting data. However, practitioners often prefer procedures that adapt to\nthe complexity of a problem at hand instead of setting sample size in advance.\nIdeally, such procedures should (a) stop earlier on easy tasks (and later on\nharder tasks), hence making better use of available resources, and (b)\ncontinuously monitor the data and efficiently incorporate statistical evidence\nafter collecting new data, while controlling the false alarm rate. Classical\nbatch tests are not tailored for streaming data: valid inference after data\npeeking requires correcting for multiple testing which results in low power.\nFollowing the principle of testing by betting, we design sequential kernelized\nindependence tests that overcome such shortcomings. We exemplify our broad\nframework using bets inspired by kernelized dependence measures, e.g., the\nHilbert-Schmidt independence criterion. Our test is also valid under\nnon-i.i.d., time-varying settings. We demonstrate the power of our approaches\non both simulated and real data.\n","authors":["Aleksandr Podkopaev","Patrick Blöbaum","Shiva Prasad Kasiviswanathan","Aaditya Ramdas"],"pdf_url":"https://arxiv.org/pdf/2212.07383v3.pdf","comment":"To appear at ICML 2023"},{"id":"http://arxiv.org/abs/2307.10169v1","updated":"2023-07-19T17:55:13Z","published":"2023-07-19T17:55:13Z","title":"Challenges and Applications of Large Language Models","summary":" Large Language Models (LLMs) went from non-existent to ubiquitous in the\nmachine learning discourse within a few years. Due to the fast pace of the\nfield, it is difficult to identify the remaining challenges and already\nfruitful application areas. In this paper, we aim to establish a systematic set\nof open problems and application successes so that ML researchers can\ncomprehend the field's current state more quickly and become productive.\n","authors":["Jean Kaddour","Joshua Harris","Maximilian Mozes","Herbie Bradley","Roberta Raileanu","Robert McHardy"],"pdf_url":"https://arxiv.org/pdf/2307.10169v1.pdf","comment":"72 pages. v01. Work in progress. Feedback and comments are highly\n appreciated!"},{"id":"http://arxiv.org/abs/2307.10167v1","updated":"2023-07-19T17:53:22Z","published":"2023-07-19T17:53:22Z","title":"VITS : Variational Inference Thomson Sampling for contextual bandits","summary":" In this paper, we introduce and analyze a variant of the Thompson sampling\n(TS) algorithm for contextual bandits. At each round, traditional TS requires\nsamples from the current posterior distribution, which is usually intractable.\nTo circumvent this issue, approximate inference techniques can be used and\nprovide samples with distribution close to the posteriors. 
However, current\napproximate techniques yield to either poor estimation (Laplace approximation)\nor can be computationally expensive (MCMC methods, Ensemble sampling...). In\nthis paper, we propose a new algorithm, Varational Inference Thompson sampling\nVITS, based on Gaussian Variational Inference. This scheme provides powerful\nposterior approximations which are easy to sample from, and is computationally\nefficient, making it an ideal choice for TS. In addition, we show that VITS\nachieves a sub-linear regret bound of the same order in the dimension and\nnumber of round as traditional TS for linear contextual bandit. Finally, we\ndemonstrate experimentally the effectiveness of VITS on both synthetic and real\nworld datasets.\n","authors":["Pierre Clavier","Tom Huix","Alain Durmus"],"pdf_url":"https://arxiv.org/pdf/2307.10167v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10163v1","updated":"2023-07-19T17:44:54Z","published":"2023-07-19T17:44:54Z","title":"Rethinking Backdoor Attacks","summary":" In a backdoor attack, an adversary inserts maliciously constructed backdoor\nexamples into a training set to make the resulting model vulnerable to\nmanipulation. Defending against such attacks typically involves viewing these\ninserted examples as outliers in the training set and using techniques from\nrobust statistics to detect and remove them.\n In this work, we present a different approach to the backdoor attack problem.\nSpecifically, we show that without structural information about the training\ndata distribution, backdoor attacks are indistinguishable from\nnaturally-occurring features in the data--and thus impossible to \"detect\" in a\ngeneral sense. Then, guided by this observation, we revisit existing defenses\nagainst backdoor attacks and characterize the (often latent) assumptions they\nmake and on which they depend. Finally, we explore an alternative perspective\non backdoor attacks: one that assumes these attacks correspond to the strongest\nfeature in the training data. Under this assumption (which we make formal) we\ndevelop a new primitive for detecting backdoor attacks. Our primitive naturally\ngives rise to a detection algorithm that comes with theoretical guarantees and\nis effective in practice.\n","authors":["Alaa Khaddaj","Guillaume Leclerc","Aleksandar Makelov","Kristian Georgiev","Hadi Salman","Andrew Ilyas","Aleksander Madry"],"pdf_url":"https://arxiv.org/pdf/2307.10163v1.pdf","comment":"ICML 2023"},{"id":"http://arxiv.org/abs/2307.10160v1","updated":"2023-07-19T17:42:36Z","published":"2023-07-19T17:42:36Z","title":"Robust Driving Policy Learning with Guided Meta Reinforcement Learning","summary":" Although deep reinforcement learning (DRL) has shown promising results for\nautonomous navigation in interactive traffic scenarios, existing work typically\nadopts a fixed behavior policy to control social vehicles in the training\nenvironment. This may cause the learned driving policy to overfit the\nenvironment, making it difficult to interact well with vehicles with different,\nunseen behaviors. In this work, we introduce an efficient method to train\ndiverse driving policies for social vehicles as a single meta-policy. By\nrandomizing the interaction-based reward functions of social vehicles, we can\ngenerate diverse objectives and efficiently train the meta-policy through\nguiding policies that achieve specific objectives. 
We further propose a\ntraining strategy to enhance the robustness of the ego vehicle's driving policy\nusing the environment where social vehicles are controlled by the learned\nmeta-policy. Our method successfully learns an ego driving policy that\ngeneralizes well to unseen situations with out-of-distribution (OOD) social\nagents' behaviors in a challenging uncontrolled T-intersection scenario.\n","authors":["Kanghoon Lee","Jiachen Li","David Isele","Jinkyoo Park","Kikuo Fujimura","Mykel J. Kochenderfer"],"pdf_url":"https://arxiv.org/pdf/2307.10160v1.pdf","comment":"ITSC 2023"},{"id":"http://arxiv.org/abs/2307.10155v1","updated":"2023-07-19T17:35:08Z","published":"2023-07-19T17:35:08Z","title":"Curvature-based Clustering on Graphs","summary":" Unsupervised node clustering (or community detection) is a classical graph\nlearning task. In this paper, we study algorithms, which exploit the geometry\nof the graph to identify densely connected substructures, which form clusters\nor communities. Our method implements discrete Ricci curvatures and their\nassociated geometric flows, under which the edge weights of the graph evolve to\nreveal its community structure. We consider several discrete curvature notions\nand analyze the utility of the resulting algorithms. In contrast to prior\nliterature, we study not only single-membership community detection, where each\nnode belongs to exactly one community, but also mixed-membership community\ndetection, where communities may overlap. For the latter, we argue that it is\nbeneficial to perform community detection on the line graph, i.e., the graph's\ndual. We provide both theoretical and empirical evidence for the utility of our\ncurvature-based clustering algorithms. In addition, we give several results on\nthe relationship between the curvature of a graph and that of its dual, which\nenable the efficient implementation of our proposed mixed-membership community\ndetection approach and which may be of independent interest for curvature-based\nnetwork analysis.\n","authors":["Yu Tian","Zachary Lubberts","Melanie Weber"],"pdf_url":"https://arxiv.org/pdf/2307.10155v1.pdf","comment":"65 pages, 19 figures"},{"id":"http://arxiv.org/abs/2307.04228v2","updated":"2023-07-19T17:24:29Z","published":"2023-07-09T16:44:37Z","title":"Efficient Bayesian travel-time tomography with geologically-complex\n priors using sensitivity-informed polynomial chaos expansion and deep\n generative networks","summary":" Monte Carlo Markov Chain (MCMC) methods commonly confront two fundamental\nchallenges: the accurate characterization of the prior distribution and the\nefficient evaluation of the likelihood. In the context of Bayesian studies on\ntomography, principal component analysis (PCA) can in some cases facilitate the\nstraightforward definition of the prior distribution, while simultaneously\nenabling the implementation of accurate surrogate models based on polynomial\nchaos expansion (PCE) to replace computationally intensive full-physics forward\nsolvers. When faced with scenarios where PCA does not offer a direct means of\neasily defining the prior distribution alternative methods like deep generative\nmodels (e.g., variational autoencoders (VAEs)), can be employed as viable\noptions. However, accurately producing a surrogate capable of capturing the\nintricate non-linear relationship between the latent parameters of a VAE and\nthe outputs of forward modeling presents a notable challenge. 
Indeed, while PCE\nmodels provide high accuracy when the input-output relationship can be\neffectively approximated by relatively low-degree multivariate polynomials,\nthis condition is typically unmet when utilizing latent variables derived from\ndeep generative models. In this contribution, we present a strategy that\ncombines the excellent reconstruction performances of VAE in terms of prior\nrepresentation with the accuracy of PCA-PCE surrogate modeling in the context\nof Bayesian ground penetrating radar (GPR) travel-time tomography. Within the\nMCMC process, the parametrization of the VAE is leveraged for prior exploration\nand sample proposal. Concurrently, modeling is conducted using PCE, which\noperates on either globally or locally defined principal components of the VAE\nsamples under examination.\n","authors":["Giovanni Angelo Meles","Macarena Amaya","Shiran Levy","Stefano Marelli","Niklas Linde"],"pdf_url":"https://arxiv.org/pdf/2307.04228v2.pdf","comment":"25 pages, 15 figures"},{"id":"http://arxiv.org/abs/2307.10142v1","updated":"2023-07-19T17:12:28Z","published":"2023-07-19T17:12:28Z","title":"Benchmarking Potential Based Rewards for Learning Humanoid Locomotion","summary":" The main challenge in developing effective reinforcement learning (RL)\npipelines is often the design and tuning of the reward functions. A well-designed\nshaping reward can lead to significantly faster learning. Naively formulated\nrewards, however, can conflict with the desired behavior and result in\noverfitting or even erratic performance if not properly tuned. In theory, the\nbroad class of potential based reward shaping (PBRS) can help guide the\nlearning process without affecting the optimal policy. Although several studies\nhave explored the use of potential based reward shaping to accelerate learning\nconvergence, most have been limited to grid-worlds and low-dimensional systems,\nand RL in robotics has predominantly relied on standard forms of reward\nshaping. In this paper, we benchmark standard forms of shaping with PBRS for a\nhumanoid robot. We find that in this high-dimensional system, PBRS has only\nmarginal benefits in convergence speed. However, the PBRS reward terms are\nsignificantly more robust to scaling than typical reward shaping approaches,\nand thus easier to tune.\n","authors":["Se Hwan Jeon","Steve Heim","Charles Khazoom","Sangbae Kim"],"pdf_url":"https://arxiv.org/pdf/2307.10142v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.08169v2","updated":"2023-07-19T16:45:18Z","published":"2022-09-16T20:52:39Z","title":"Value Summation: A Novel Scoring Function for MPC-based Model-based\n Reinforcement Learning","summary":" This paper proposes a novel scoring function for the planning module of\nMPC-based reinforcement learning methods to address the inherent bias of using\nthe reward function to score trajectories. The proposed method enhances the\nlearning efficiency of existing MPC-based MBRL methods using the discounted sum\nof values. The method utilizes optimal trajectories to guide policy learning\nand updates its state-action value function based on real-world and augmented\nonboard data. The learning efficiency of the proposed method is evaluated in\nselected MuJoCo Gym environments as well as in learning locomotion skills for a\nsimulated model of the Cassie robot. 
The results demonstrate that the proposed\nmethod outperforms the current state-of-the-art algorithms in terms of learning\nefficiency and average reward return.\n","authors":["Mehran Raisi","Amirhossein Noohian","Luc Mccutcheon","Saber Fallah"],"pdf_url":"https://arxiv.org/pdf/2209.08169v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09191v2","updated":"2023-07-19T16:24:31Z","published":"2023-07-17T13:17:26Z","title":"A benchmark of categorical encoders for binary classification","summary":" Categorical encoders transform categorical features into numerical\nrepresentations that are indispensable for a wide range of machine learning\nmodels. Existing encoder benchmark studies lack generalizability because of\ntheir limited choice of (1) encoders, (2) experimental factors, and (3)\ndatasets. Additionally, inconsistencies arise from the adoption of varying\naggregation strategies. This paper is the most comprehensive benchmark of\ncategorical encoders to date, including an extensive evaluation of 32\nconfigurations of encoders from diverse families, with 36 combinations of\nexperimental factors, and on 50 datasets. The study shows the profound\ninfluence of dataset selection, experimental factors, and aggregation\nstrategies on the benchmark's conclusions -- aspects disregarded in previous\nencoder benchmarks.\n","authors":["Federico Matteucci","Vadim Arzamasov","Klemens Boehm"],"pdf_url":"https://arxiv.org/pdf/2307.09191v2.pdf","comment":"Submitted to the 37th Conference on Neural Information Processing\n Systems (NeurIPS 2023) Track on Datasets and Benchmarks"},{"id":"http://arxiv.org/abs/2103.03328v3","updated":"2023-07-19T16:19:53Z","published":"2021-03-04T20:58:22Z","title":"Evaluation of Complexity Measures for Deep Learning Generalization in\n Medical Image Analysis","summary":" The generalization performance of deep learning models for medical image\nanalysis often decreases on images collected with different devices for data\nacquisition, device settings, or patient population. A better understanding of\nthe generalization capacity on new images is crucial for clinicians'\ntrustworthiness in deep learning. Although significant research efforts have\nbeen recently directed toward establishing generalization bounds and complexity\nmeasures, still, there is often a significant discrepancy between the predicted\nand actual generalization performance. As well, related large empirical studies\nhave been primarily based on validation with general-purpose image datasets.\nThis paper presents an empirical study that investigates the correlation\nbetween 25 complexity measures and the generalization abilities of supervised\ndeep learning classifiers for breast ultrasound images. The results indicate\nthat PAC-Bayes flatness-based and path norm-based measures produce the most\nconsistent explanation for the combination of models and data. 
We also\ninvestigate the use of multi-task classification and segmentation approach for\nbreast images, and report that such learning approach acts as an implicit\nregularizer and is conducive toward improved generalization.\n","authors":["Aleksandar Vakanski","Min Xian"],"pdf_url":"https://arxiv.org/pdf/2103.03328v3.pdf","comment":"15 pages, 4 figures"},{"id":"http://arxiv.org/abs/2306.13197v2","updated":"2023-07-19T16:19:24Z","published":"2023-06-22T20:42:50Z","title":"Pre or Post-Softmax Scores in Gradient-based Attribution Methods, What\n is Best?","summary":" Gradient based attribution methods for neural networks working as classifiers\nuse gradients of network scores. Here we discuss the practical differences\nbetween using gradients of pre-softmax scores versus post-softmax scores, and\ntheir respective advantages and disadvantages.\n","authors":["Miguel Lerma","Mirtha Lucas"],"pdf_url":"https://arxiv.org/pdf/2306.13197v2.pdf","comment":"8 pages, 2 figures, 2023 IEEE 13th International Conference on\n Pattern Recognition Systems (ICPRS)"},{"id":"http://arxiv.org/abs/2210.12547v2","updated":"2023-07-19T16:16:50Z","published":"2022-10-22T20:42:06Z","title":"SurCo: Learning Linear Surrogates For Combinatorial Nonlinear\n Optimization Problems","summary":" Optimization problems with nonlinear cost functions and combinatorial\nconstraints appear in many real-world applications but remain challenging to\nsolve efficiently compared to their linear counterparts. To bridge this gap, we\npropose $\\textbf{SurCo}$ that learns linear $\\underline{\\text{Sur}}$rogate\ncosts which can be used in existing $\\underline{\\text{Co}}$mbinatorial solvers\nto output good solutions to the original nonlinear combinatorial optimization\nproblem. The surrogate costs are learned end-to-end with nonlinear loss by\ndifferentiating through the linear surrogate solver, combining the flexibility\nof gradient-based methods with the structure of linear combinatorial\noptimization. We propose three $\\texttt{SurCo}$ variants:\n$\\texttt{SurCo}-\\texttt{zero}$ for individual nonlinear problems,\n$\\texttt{SurCo}-\\texttt{prior}$ for problem distributions, and\n$\\texttt{SurCo}-\\texttt{hybrid}$ to combine both distribution and\nproblem-specific information. We give theoretical intuition motivating\n$\\texttt{SurCo}$, and evaluate it empirically. Experiments show that\n$\\texttt{SurCo}$ finds better solutions faster than state-of-the-art and domain\nexpert approaches in real-world optimization problems such as embedding table\nsharding, inverse photonic design, and nonlinear route planning.\n","authors":["Aaron Ferber","Taoan Huang","Daochen Zha","Martin Schubert","Benoit Steiner","Bistra Dilkina","Yuandong Tian"],"pdf_url":"https://arxiv.org/pdf/2210.12547v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.05783v4","updated":"2023-07-19T16:14:31Z","published":"2023-02-11T21:07:30Z","title":"ConCerNet: A Contrastive Learning Based Framework for Automated\n Conservation Law Discovery and Trustworthy Dynamical System Prediction","summary":" Deep neural networks (DNN) have shown great capacity of modeling a dynamical\nsystem; nevertheless, they usually do not obey physics constraints such as\nconservation laws. This paper proposes a new learning framework named ConCerNet\nto improve the trustworthiness of the DNN based dynamics modeling to endow the\ninvariant properties. 
ConCerNet consists of two steps: (i) a contrastive\nlearning method to automatically capture the system invariants (i.e.\nconservation properties) along the trajectory observations; (ii) a neural\nprojection layer to guarantee that the learned dynamics models preserve the\nlearned invariants. We theoretically prove the functional relationship between\nthe learned latent representation and the unknown system invariant function.\nExperiments show that our method consistently outperforms the baseline neural\nnetworks in both coordinate error and conservation metrics by a large margin.\nWith neural network based parameterization and no dependence on prior\nknowledge, our method can be extended to complex and large-scale dynamics by\nleveraging an autoencoder.\n","authors":["Wang Zhang","Tsui-Wei Weng","Subhro Das","Alexandre Megretski","Luca Daniel","Lam M. Nguyen"],"pdf_url":"https://arxiv.org/pdf/2302.05783v4.pdf","comment":"Accepted by ICML 2023"},{"id":"http://arxiv.org/abs/2307.10098v1","updated":"2023-07-19T16:13:13Z","published":"2023-07-19T16:13:13Z","title":"Gradient Sparsification For Masked Fine-Tuning of Transformers","summary":" Fine-tuning pretrained self-supervised language models is widely adopted for\ntransfer learning to downstream tasks. Fine-tuning can be achieved by freezing\ngradients of the pretrained network and only updating gradients of a newly\nadded classification layer, or by performing gradient updates on all\nparameters. Gradual unfreezing makes a trade-off between the two by gradually\nunfreezing gradients of whole layers during training. This has been an\neffective strategy to trade-off between storage and training speed with\ngeneralization performance. However, it is not clear whether gradually\nunfreezing layers throughout training is optimal, compared to sparse variants\nof gradual unfreezing which may improve fine-tuning performance. In this paper,\nwe propose to stochastically mask gradients to regularize pretrained language\nmodels for improving overall fine-tuned performance. We introduce GradDrop and\nvariants thereof, a class of gradient sparsification methods that mask\ngradients during the backward pass, acting as gradient noise. GradDrop is\nsparse and stochastic unlike gradual freezing. Extensive experiments on the\nmultilingual XGLUE benchmark with XLMR-Large show that GradDrop is competitive\nagainst methods that use additional translated data for intermediate\npretraining and outperforms standard fine-tuning and gradual unfreezing. A\npost-analysis shows how GradDrop improves performance with languages it was not\ntrained on, such as under-resourced languages.\n","authors":["James O' Neill","Sourav Dutta"],"pdf_url":"https://arxiv.org/pdf/2307.10098v1.pdf","comment":"Accepted to IJCNN 2023"},{"id":"http://arxiv.org/abs/2307.10093v1","updated":"2023-07-19T16:00:29Z","published":"2023-07-19T16:00:29Z","title":"Revisiting invariances and introducing priors in Gromov-Wasserstein\n distances","summary":" Gromov-Wasserstein distance has found many applications in machine learning\ndue to its ability to compare measures across metric spaces and its invariance\nto isometric transformations. However, in certain applications, this invariance\nproperty can be too flexible, thus undesirable. Moreover, the\nGromov-Wasserstein distance solely considers pairwise sample similarities in\ninput datasets, disregarding the raw feature representations. 
We propose a new\noptimal transport-based distance, called Augmented Gromov-Wasserstein, that\nallows for some control over the level of rigidity to transformations. It also\nincorporates feature alignments, enabling us to better leverage prior knowledge\non the input data for improved performance. We present theoretical insights\ninto the proposed metric. We then demonstrate its usefulness for single-cell\nmulti-omic alignment tasks and a transfer learning scenario in machine\nlearning.\n","authors":["Pinar Demetci","Quang Huy Tran","Ievgen Redko","Ritambhara Singh"],"pdf_url":"https://arxiv.org/pdf/2307.10093v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.04838v2","updated":"2023-07-19T15:59:03Z","published":"2023-07-10T18:15:03Z","title":"CREPE: Learnable Prompting With CLIP Improves Visual Relationship\n Prediction","summary":" In this paper, we explore the potential of Vision-Language Models (VLMs),\nspecifically CLIP, in predicting visual object relationships, which involves\ninterpreting visual features from images into language-based relations. Current\nstate-of-the-art methods use complex graphical models that utilize language\ncues and visual features to address this challenge. We hypothesize that the\nstrong language priors in CLIP embeddings can simplify these graphical models\npaving for a simpler approach. We adopt the UVTransE relation prediction\nframework, which learns the relation as a translational embedding with subject,\nobject, and union box embeddings from a scene. We systematically explore the\ndesign of CLIP-based subject, object, and union-box representations within the\nUVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate\nEstimation). CREPE utilizes text-based representations for all three bounding\nboxes and introduces a novel contrastive training strategy to automatically\ninfer the text prompt for union-box. Our approach achieves state-of-the-art\nperformance in predicate estimation, mR@5 27.79, and mR@20 31.95 on the Visual\nGenome benchmark, achieving a 15.3\\% gain in performance over recent\nstate-of-the-art at mR@20. This work demonstrates CLIP's effectiveness in\nobject relation prediction and encourages further research on VLMs in this\nchallenging domain.\n","authors":["Rakshith Subramanyam","T. S. Jayram","Rushil Anirudh","Jayaraman J. Thiagarajan"],"pdf_url":"https://arxiv.org/pdf/2307.04838v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10088v1","updated":"2023-07-19T15:57:24Z","published":"2023-07-19T15:57:24Z","title":"Android in the Wild: A Large-Scale Dataset for Android Device Control","summary":" There is a growing interest in device-control systems that can interpret\nhuman natural language instructions and execute them on a digital device by\ndirectly controlling its user interface. We present a dataset for\ndevice-control research, Android in the Wild (AITW), which is orders of\nmagnitude larger than current datasets. The dataset contains human\ndemonstrations of device interactions, including the screens and actions, and\ncorresponding natural language instructions. It consists of 715k episodes\nspanning 30k unique instructions, four versions of Android (v10-13),and eight\ndevice types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It\ncontains multi-step tasks that require semantic understanding of language and\nvisual context. This dataset poses a new challenge: actions available through\nthe user interface must be inferred from their visual appearance. 
And, instead\nof simple UI element-based actions, the action space consists of precise\ngestures (e.g., horizontal scrolls to operate carousel widgets). We organize\nour dataset to encourage robustness analysis of device-control systems, i.e.,\nhow well a system performs in the presence of new task descriptions, new\napplications, or new platform versions. We develop two agents and report\nperformance across the dataset. The dataset is available at\nhttps://github.com/google-research/google-research/tree/master/android_in_the_wild.\n","authors":["Christopher Rawles","Alice Li","Daniel Rodriguez","Oriana Riva","Timothy Lillicrap"],"pdf_url":"https://arxiv.org/pdf/2307.10088v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10078v1","updated":"2023-07-19T15:51:25Z","published":"2023-07-19T15:51:25Z","title":"A Dual Formulation for Probabilistic Principal Component Analysis","summary":" In this paper, we characterize Probabilistic Principal Component Analysis in\nHilbert spaces and demonstrate how the optimal solution admits a representation\nin dual space. This allows us to develop a generative framework for kernel\nmethods. Furthermore, we show how it englobes Kernel Principal Component\nAnalysis and illustrate its working on a toy and a real dataset.\n","authors":["Henri De Plaen","Johan A. K. Suykens"],"pdf_url":"https://arxiv.org/pdf/2307.10078v1.pdf","comment":"ICML 2023 Workshop on Duality for Modern Machine Learning (DP4ML). 14\n pages (8 main + 5 appendix), 4 figures and 4 tables"},{"id":"http://arxiv.org/abs/2212.00736v2","updated":"2023-07-19T15:43:40Z","published":"2022-12-01T18:29:48Z","title":"An exponentially-growing family of universal quantum circuits","summary":" Quantum machine learning has become an area of growing interest but has\ncertain theoretical and hardware-specific limitations. Notably, the problem of\nvanishing gradients, or barren plateaus, renders the training impossible for\ncircuits with high qubit counts, imposing a limit on the number of qubits that\ndata scientists can use for solving problems. Independently, angle-embedded\nsupervised quantum neural networks were shown to produce truncated Fourier\nseries with a degree directly dependent on two factors: the depth of the\nencoding and the number of parallel qubits the encoding applied to. The degree\nof the Fourier series limits the model expressivity. This work introduces two\nnew architectures whose Fourier degrees grow exponentially: the sequential and\nparallel exponential quantum machine learning architectures. This is done by\nefficiently using the available Hilbert space when encoding, increasing the\nexpressivity of the quantum encoding. Therefore, the exponential growth allows\nstaying at the low-qubit limit to create highly expressive circuits avoiding\nbarren plateaus. 
Practically, parallel exponential architecture was shown to\noutperform the existing linear architectures by reducing their final mean\nsquare error value by up to 44.7% in a one-dimensional test problem.\nFurthermore, the feasibility of this technique was also shown on a trapped ion\nquantum processing unit.\n","authors":["Mo Kordzanganeh","Pavel Sekatski","Markus Pflitsch","Alexey Melnikov"],"pdf_url":"https://arxiv.org/pdf/2212.00736v2.pdf","comment":"14 pages, 7 figures"},{"id":"http://arxiv.org/abs/2307.10062v1","updated":"2023-07-19T15:33:11Z","published":"2023-07-19T15:33:11Z","title":"Unsupervised Accuracy Estimation of Deep Visual Models using\n Domain-Adaptive Adversarial Perturbation without Source Samples","summary":" Deploying deep visual models can lead to performance drops due to the\ndiscrepancies between source and target distributions. Several approaches\nleverage labeled source data to estimate target domain accuracy, but accessing\nlabeled source data is often prohibitively difficult due to data\nconfidentiality or resource limitations on serving devices. Our work proposes a\nnew framework to estimate model accuracy on unlabeled target data without\naccess to source data. We investigate the feasibility of using pseudo-labels\nfor accuracy estimation and evolve this idea into adopting recent advances in\nsource-free domain adaptation algorithms. Our approach measures the\ndisagreement rate between the source hypothesis and the target pseudo-labeling\nfunction, adapted from the source hypothesis. We mitigate the impact of\nerroneous pseudo-labels that may arise due to a high ideal joint hypothesis\nrisk by employing adaptive adversarial perturbation on the input of the target\nmodel. Our proposed source-free framework effectively addresses the challenging\ndistribution shift scenarios and outperforms existing methods requiring source\ndata and labels for training.\n","authors":["JoonHo Lee","Jae Oh Woo","Hankyu Moon","Kwonho Lee"],"pdf_url":"https://arxiv.org/pdf/2307.10062v1.pdf","comment":"Accepted to ICCV 2023"},{"id":"http://arxiv.org/abs/2307.10060v1","updated":"2023-07-19T15:30:06Z","published":"2023-07-19T15:30:06Z","title":"Accurate deep learning sub-grid scale models for large eddy simulations","summary":" We present two families of sub-grid scale (SGS) turbulence models developed\nfor large-eddy simulation (LES) purposes. Their development required the\nformulation of physics-informed robust and efficient Deep Learning (DL)\nalgorithms which, unlike state-of-the-art analytical modeling techniques can\nproduce high-order complex non-linear relations between inputs and outputs.\nExplicit filtering of data from direct simulations of the canonical channel\nflow at two friction Reynolds numbers $Re_\\tau\\approx 395$ and 590 provided\naccurate data for training and testing. The two sets of models use different\nnetwork architectures. One of the architectures uses tensor basis neural\nnetworks (TBNN) and embeds the simplified analytical model form of the general\neffective-viscosity hypothesis, thus incorporating the Galilean, rotational and\nreflectional invariances. The other architecture is that of a relatively simple\nnetwork, that is able to incorporate the Galilean invariance only. However,\nthis simpler architecture has better feature extraction capacity owing to its\nability to establish relations between and extract information from\ncross-components of the integrity basis tensors and the SGS stresses. 
Both sets\nof models are used to predict the SGS stresses for feature datasets generated\nwith different filter widths, and at different Reynolds numbers. It is shown\nthat due to the simpler model's better feature learning capabilities, it\noutperforms the invariance embedded model in statistical performance metrics.\nIn a priori tests, both sets of models provide similar levels of dissipation\nand backscatter. Based on the test results, both sets of models should be\nusable in a posteriori actual LESs.\n","authors":["Rikhi Bose","Arunabha M. Roy"],"pdf_url":"https://arxiv.org/pdf/2307.10060v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10053v1","updated":"2023-07-19T15:26:18Z","published":"2023-07-19T15:26:18Z","title":"Convergence Guarantees for Stochastic Subgradient Methods in Nonsmooth\n Nonconvex Optimization","summary":" In this paper, we investigate the convergence properties of the stochastic\ngradient descent (SGD) method and its variants, especially in training neural\nnetworks built from nonsmooth activation functions. We develop a novel\nframework that assigns different timescales to stepsizes for updating the\nmomentum terms and variables, respectively. Under mild conditions, we prove the\nglobal convergence of our proposed framework in both single-timescale and\ntwo-timescale cases. We show that our proposed framework encompasses a wide\nrange of well-known SGD-type methods, including heavy-ball SGD, SignSGD, Lion,\nnormalized SGD and clipped SGD. Furthermore, when the objective function adopts\na finite-sum formulation, we prove the convergence properties for these\nSGD-type methods based on our proposed framework. In particular, we prove that\nthese SGD-type methods find the Clarke stationary points of the objective\nfunction with randomly chosen stepsizes and initial points under mild\nassumptions. Preliminary numerical experiments demonstrate the high efficiency\nof our analyzed SGD-type methods.\n","authors":["Nachuan Xiao","Xiaoyin Hu","Kim-Chuan Toh"],"pdf_url":"https://arxiv.org/pdf/2307.10053v1.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2303.15592v2","updated":"2023-07-19T15:16:21Z","published":"2023-03-27T20:49:42Z","title":"Uncovering Bias in Personal Informatics","summary":" Personal informatics (PI) systems, powered by smartphones and wearables,\nenable people to lead healthier lifestyles by providing meaningful and\nactionable insights that break down barriers between users and their health\ninformation. Today, such systems are used by billions of users for monitoring\nnot only physical activity and sleep but also vital signs and women's and heart\nhealth, among others. Despite their widespread usage, the processing of\nsensitive PI data may suffer from biases, which may entail practical and\nethical implications. In this work, we present the first comprehensive\nempirical and analytical study of bias in PI systems, including biases in raw\ndata and in the entire machine learning life cycle. We use the most detailed\nframework to date for exploring the different sources of bias and find that\nbiases exist both in the data generation and the model learning and\nimplementation streams. 
According to our results, the most affected minority\ngroups are users with health issues, such as diabetes, joint issues, and\nhypertension, and female users, whose data biases are propagated or even\namplified by learning models, while intersectional biases can also be observed.\n","authors":["Sofia Yfantidou","Pavlos Sermpezis","Athena Vakali","Ricardo Baeza-Yates"],"pdf_url":"https://arxiv.org/pdf/2303.15592v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10026v1","updated":"2023-07-19T15:11:04Z","published":"2023-07-19T15:11:04Z","title":"Contextual Reliability: When Different Features Matter in Different\n Contexts","summary":" Deep neural networks often fail catastrophically by relying on spurious\ncorrelations. Most prior work assumes a clear dichotomy into spurious and\nreliable features; however, this is often unrealistic. For example, most of the\ntime we do not want an autonomous car to simply copy the speed of surrounding\ncars -- we don't want our car to run a red light if a neighboring car does so.\nHowever, we cannot simply enforce invariance to next-lane speed, since it could\nprovide valuable information about an unobservable pedestrian at a crosswalk.\nThus, universally ignoring features that are sometimes (but not always)\nreliable can lead to non-robust performance. We formalize a new setting called\ncontextual reliability which accounts for the fact that the \"right\" features to\nuse may vary depending on the context. We propose and analyze a two-stage\nframework called Explicit Non-spurious feature Prediction (ENP) which first\nidentifies the relevant features to use for a given context, then trains a\nmodel to rely exclusively on these features. Our work theoretically and\nempirically demonstrates the advantages of ENP over existing methods and\nprovides new benchmarks for contextual reliability.\n","authors":["Gaurav Ghosal","Amrith Setlur","Daniel S. Brown","Anca D. Dragan","Aditi Raghunathan"],"pdf_url":"https://arxiv.org/pdf/2307.10026v1.pdf","comment":"ICML 2023 Camera Ready Version"},{"id":"http://arxiv.org/abs/2307.10022v1","updated":"2023-07-19T15:05:55Z","published":"2023-07-19T15:05:55Z","title":"Europepolls: A Dataset of Country-Level Opinion Polling Data for the\n European Union and the UK","summary":" I propose an open dataset of country-level historical opinion polling data\nfor the European Union and the UK. The dataset aims to fill a gap in available\nopinion polling data for the European Union. Some existing datasets are\nrestricted to the past five years, limiting research opportunities. At the same\ntime, some larger proprietary datasets exist but are available only in a visual\npreprocessed time series format. Finally, while other large datasets for\nindividual countries might exist, these could be inaccessible due to language\nbarriers. The data was gathered from Wikipedia, and preprocessed using the\npandas library. Both the raw and the preprocessed data are in the .csv format.\nI hope that given the recent advances in LLMs and deep learning in general,\nthis large dataset will enable researchers to uncover complex interactions\nbetween multimodal data (news articles, economic indicators, social media) and\nvoting behavior. 
The raw data, the preprocessed data, and the preprocessing\nscripts are available on GitHub.\n","authors":["Konstantinos Pitas"],"pdf_url":"https://arxiv.org/pdf/2307.10022v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.15585v2","updated":"2023-07-19T15:00:06Z","published":"2023-03-27T20:28:26Z","title":"Beyond Accuracy: A Critical Review of Fairness in Machine Learning for\n Mobile and Wearable Computing","summary":" The field of mobile and wearable computing is undergoing a revolutionary\nintegration of machine learning. Devices can now diagnose diseases, predict\nheart irregularities, and unlock the full potential of human cognition.\nHowever, the underlying algorithms powering these predictions are not immune to\nbiases with respect to sensitive attributes (e.g., gender, race), leading to\ndiscriminatory outcomes. The goal of this work is to explore the extent to\nwhich the mobile and wearable computing community has adopted ways of reporting\ninformation about datasets and models to surface and, eventually, counter\nbiases. Our systematic review of papers published in the Proceedings of the ACM\nInteractive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) journal from\n2018-2022 indicates that, while there has been progress made on algorithmic\nfairness, there is still ample room for growth. Our findings show that only a\nsmall portion (5%) of published papers adheres to modern fairness reporting,\nwhile the overwhelming majority thereof focuses on accuracy or error metrics.\nTo generalize these results across venues of similar scope, we analyzed recent\nproceedings of ACM MobiCom, MobiSys, and SenSys, IEEE Pervasive, and IEEE\nTransactions on Mobile Computing, and found no deviation from our\nprimary result. In light of these findings, our work provides practical\nguidelines for the design and development of mobile and wearable technologies\nthat not only strive for accuracy but also fairness.\n","authors":["Sofia Yfantidou","Marios Constantinides","Dimitris Spathis","Athena Vakali","Daniele Quercia","Fahim Kawsar"],"pdf_url":"https://arxiv.org/pdf/2303.15585v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2202.09753v3","updated":"2023-07-19T14:55:17Z","published":"2022-02-20T07:42:00Z","title":"Finite-Time Analysis of Natural Actor-Critic for POMDPs","summary":" We consider the reinforcement learning problem for partially observed Markov\ndecision processes (POMDPs) with large or even countably infinite state spaces,\nwhere the controller has access to only noisy observations of the underlying\ncontrolled Markov chain. We consider a natural actor-critic method that employs\na finite internal memory for policy parameterization, and a multi-step temporal\ndifference learning algorithm for policy evaluation. We establish, to the best\nof our knowledge, the first non-asymptotic global convergence of actor-critic\nmethods for partially observed systems under function approximation. In\nparticular, in addition to the function approximation and statistical errors\nthat also arise in MDPs, we explicitly characterize the error due to the use of\nfinite-state controllers. This additional error is stated in terms of the total\nvariation distance between the traditional belief state in POMDPs and the\nposterior distribution of the hidden state when using a finite-state\ncontroller. Further, we show that this error can be made small in the case of\nsliding-block controllers by using larger block sizes.\n","authors":["Semih Cayci","Niao He","R. 
Srikant"],"pdf_url":"https://arxiv.org/pdf/2202.09753v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.06385v2","updated":"2023-07-19T14:51:37Z","published":"2023-07-12T18:13:58Z","title":"Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event\n Localization","summary":" Audio-Visual Event Localization (AVEL) is the task of temporally localizing\nand classifying \\emph{audio-visual events}, i.e., events simultaneously visible\nand audible in a video. In this paper, we solve AVEL in a weakly-supervised\nsetting, where only video-level event labels (their presence/absence, but not\ntheir locations in time) are available as supervision for training. Our idea is\nto use a base model to estimate labels on the training data at a finer temporal\nresolution than at the video level and re-train the model with these labels.\nI.e., we determine the subset of labels for each \\emph{slice} of frames in a\ntraining video by (i) replacing the frames outside the slice with those from a\nsecond video having no overlap in video-level labels, and (ii) feeding this\nsynthetic video into the base model to extract labels for just the slice in\nquestion. To handle the out-of-distribution nature of our synthetic videos, we\npropose an auxiliary objective for the base model that induces more reliable\npredictions of the localized event labels as desired. Our three-stage pipeline\noutperforms several existing AVEL methods with no architectural changes and\nimproves performance on a related weakly-supervised task as well.\n","authors":["Kalyan Ramakrishnan"],"pdf_url":"https://arxiv.org/pdf/2307.06385v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.07677v4","updated":"2023-07-19T14:45:15Z","published":"2021-06-14T18:01:08Z","title":"Planning to Fairly Allocate: Probabilistic Fairness in the Restless\n Bandit Setting","summary":" Restless and collapsing bandits are often used to model budget-constrained\nresource allocation in settings where arms have action-dependent transition\nprobabilities, such as the allocation of health interventions among patients.\nHowever, state-of-the-art Whittle-index-based approaches to this planning\nproblem either do not consider fairness among arms, or incentivize fairness\nwithout guaranteeing it. We thus introduce ProbFair, a probabilistically fair\npolicy that maximizes total expected reward and satisfies the budget constraint\nwhile ensuring a strictly positive lower bound on the probability of being\npulled at each timestep. We evaluate our algorithm on a real-world application,\nwhere interventions support continuous positive airway pressure (CPAP) therapy\nadherence among patients, as well as on a broader class of synthetic transition\nmatrices. We find that ProbFair preserves utility while providing fairness\nguarantees.\n","authors":["Christine Herlihy","Aviva Prins","Aravind Srinivasan","John P. Dickerson"],"pdf_url":"https://arxiv.org/pdf/2106.07677v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.11103v2","updated":"2023-07-19T14:42:10Z","published":"2023-03-20T13:40:11Z","title":"Sionna RT: Differentiable Ray Tracing for Radio Propagation Modeling","summary":" Sionna is a GPU-accelerated open-source library for link-level simulations\nbased on TensorFlow. Since release v0.14 it integrates a differentiable ray\ntracer (RT) for the simulation of radio wave propagation. 
This unique feature\nallows for the computation of gradients of the channel impulse response and\nother related quantities with respect to many system and environment\nparameters, such as material properties, antenna patterns, array geometries, as\nwell as transmitter and receiver orientations and positions. In this paper, we\noutline the key components of Sionna RT and showcase example applications such\nas learning radio materials and optimizing transmitter orientations by gradient\ndescent. While classic ray tracing is a crucial tool for 6G research topics\nlike reconfigurable intelligent surfaces, integrated sensing and\ncommunications, as well as user localization, differentiable ray tracing is a\nkey enabler for many novel and exciting research directions, for example,\ndigital twins.\n","authors":["Jakob Hoydis","Fayçal Aït Aoudia","Sebastian Cammerer","Merlin Nimier-David","Nikolaus Binder","Guillermo Marcus","Alexander Keller"],"pdf_url":"https://arxiv.org/pdf/2303.11103v2.pdf","comment":"5 pages, 5 figures, update reflects new features of Sionna RT\n introduced in release v0.15"},{"id":"http://arxiv.org/abs/2208.07734v6","updated":"2023-07-19T14:39:54Z","published":"2022-08-16T13:09:25Z","title":"Data Augmentation is a Hyperparameter: Cherry-picked Self-Supervision\n for Unsupervised Anomaly Detection is Creating the Illusion of Success","summary":" Self-supervised learning (SSL) has emerged as a promising alternative to\ncreate supervisory signals to real-world problems, avoiding the extensive cost\nof manual labeling. SSL is particularly attractive for unsupervised tasks such\nas anomaly detection (AD), where labeled anomalies are rare or often\nnonexistent. A large catalog of augmentation functions has been used for\nSSL-based AD (SSAD) on image data, and recent works have reported that the type\nof augmentation has a significant impact on accuracy. Motivated by those, this\nwork sets out to put image-based SSAD under a larger lens and investigate the\nrole of data augmentation in SSAD. Through extensive experiments on 3 different\ndetector models and across 420 AD tasks, we provide comprehensive numerical and\nvisual evidences that the alignment between data augmentation and\nanomaly-generating mechanism is the key to the success of SSAD, and in the lack\nthereof, SSL may even impair accuracy. To the best of our knowledge, this is\nthe first meta-analysis on the role of data augmentation in SSAD.\n","authors":["Jaemin Yoo","Tiancheng Zhao","Leman Akoglu"],"pdf_url":"https://arxiv.org/pdf/2208.07734v6.pdf","comment":"Accepted to Transactions on Machine Learning Research (TMLR)"},{"id":"http://arxiv.org/abs/2307.10003v1","updated":"2023-07-19T14:23:26Z","published":"2023-07-19T14:23:26Z","title":"TbExplain: A Text-based Explanation Method for Scene Classification\n Models with the Statistical Prediction Correction","summary":" The field of Explainable Artificial Intelligence (XAI) aims to improve the\ninterpretability of black-box machine learning models. Building a heatmap based\non the importance value of input features is a popular method for explaining\nthe underlying functions of such models in producing their predictions.\nHeatmaps are almost understandable to humans, yet they are not without flaws.\nNon-expert users, for example, may not fully understand the logic of heatmaps\n(the logic in which relevant pixels to the model's prediction are highlighted\nwith different intensities or colors). 
Additionally, objects and regions of the\ninput image that are relevant to the model prediction are frequently not\nentirely differentiated by heatmaps. In this paper, we propose a framework\ncalled TbExplain that employs XAI techniques and a pre-trained object detector\nto present text-based explanations of scene classification models. Moreover,\nTbExplain incorporates a novel method to correct predictions and textually\nexplain them based on the statistics of objects in the input image when the\ninitial prediction is unreliable. To assess the trustworthiness and validity of\nthe text-based explanations, we conducted a qualitative experiment, and the\nfindings indicated that these explanations are sufficiently reliable.\nFurthermore, our quantitative and qualitative experiments on TbExplain with\nscene classification datasets reveal an improvement in classification accuracy\nover ResNet variants.\n","authors":["Amirhossein Aminimehr","Pouya Khani","Amirali Molaei","Amirmohammad Kazemeini","Erik Cambria"],"pdf_url":"https://arxiv.org/pdf/2307.10003v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.10403v2","updated":"2023-07-19T14:23:17Z","published":"2022-05-20T18:44:06Z","title":"Tackling Provably Hard Representative Selection via Graph Neural\n Networks","summary":" Representative Selection (RS) is the problem of finding a small subset of\nexemplars from a dataset that is representative of the dataset. In this paper,\nwe study RS for attributed graphs, and focus on finding representative nodes\nthat optimize the accuracy of a model trained on the selected representatives.\nTheoretically, we establish a new hardness result for RS (in the absence of a\ngraph structure) by proving that a particular, highly practical variant of it\n(RS for Learning) is hard to approximate in polynomial time within any\nreasonable factor, which implies a significant potential gap between the\noptimum solution of widely-used surrogate functions and the actual accuracy of\nthe model. We then study the setting where a (homophilous) graph structure is\navailable, or can be constructed, between the data points. We show that with an\nappropriate modeling approach, the presence of such a structure can turn a hard\nRS (for learning) problem into one that can be effectively solved. To this end,\nwe develop RS-GNN, a representation learning-based RS model based on Graph\nNeural Networks. Empirically, we demonstrate the effectiveness of RS-GNN on\nproblems with predefined graph structures as well as problems with graphs\ninduced from node feature similarities, by showing that RS-GNN achieves\nsignificant improvements over established baselines on a suite of eight\nbenchmarks.\n","authors":["Mehran Kazemi","Anton Tsitsulin","Hossein Esfandiari","MohammadHossein Bateni","Deepak Ramachandran","Bryan Perozzi","Vahab Mirrokni"],"pdf_url":"https://arxiv.org/pdf/2205.10403v2.pdf","comment":"Accepted at the Transactions of Machine Learning Research (TMLR)\n Journal"},{"id":"http://arxiv.org/abs/2307.08913v2","updated":"2023-07-19T14:18:00Z","published":"2023-07-18T01:16:23Z","title":"Towards the Sparseness of Projection Head in Self-Supervised Learning","summary":" In recent years, self-supervised learning (SSL) has emerged as a promising\napproach for extracting valuable representations from unlabeled data. One\nsuccessful SSL method is contrastive learning, which aims to bring positive\nexamples closer while pushing negative examples apart. 
Many current contrastive\nlearning approaches utilize a parameterized projection head. Through a\ncombination of empirical analysis and theoretical investigation, we provide\ninsights into the internal mechanisms of the projection head and its\nrelationship with the phenomenon of dimensional collapse. Our findings\ndemonstrate that the projection head enhances the quality of representations by\nperforming contrastive loss in a projected subspace. Therefore, we propose an\nassumption that only a subset of features is necessary when minimizing the\ncontrastive loss of a mini-batch of data. Theoretical analysis further suggests\nthat a sparse projection head can enhance generalization, leading us to\nintroduce SparseHead - a regularization term that effectively constrains the\nsparsity of the projection head, and can be seamlessly integrated with any\nself-supervised learning (SSL) approaches. Our experimental results validate\nthe effectiveness of SparseHead, demonstrating its ability to improve the\nperformance of existing contrastive methods.\n","authors":["Zeen Song","Xingzhe Su","Jingyao Wang","Wenwen Qiang","Changwen Zheng","Fuchun Sun"],"pdf_url":"https://arxiv.org/pdf/2307.08913v2.pdf","comment":"9 pages,3 figures"},{"id":"http://arxiv.org/abs/2305.15851v2","updated":"2023-07-19T14:16:22Z","published":"2023-05-25T08:43:11Z","title":"On sampling determinantal and Pfaffian point processes on a quantum\n computer","summary":" DPPs were introduced by Macchi as a model in quantum optics the 1970s. Since\nthen, they have been widely used as models and subsampling tools in statistics\nand computer science. Most applications require sampling from a DPP, and given\ntheir quantum origin, it is natural to wonder whether sampling a DPP on a\nquantum computer is easier than on a classical one. We focus here on DPPs over\na finite state space, which are distributions over the subsets of\n$\\{1,\\dots,N\\}$ parametrized by an $N\\times N$ Hermitian kernel matrix. Vanilla\nsampling consists in two steps, of respective costs $\\mathcal{O}(N^3)$ and\n$\\mathcal{O}(Nr^2)$ operations on a classical computer, where $r$ is the rank\nof the kernel matrix. A large first part of the current paper consists in\nexplaining why the state-of-the-art in quantum simulation of fermionic systems\nalready yields quantum DPP sampling algorithms. We then modify existing quantum\ncircuits, and discuss their insertion in a full DPP sampling pipeline that\nstarts from practical kernel specifications. The bottom line is that, with $P$\n(classical) parallel processors, we can divide the preprocessing cost by $P$\nand build a quantum circuit with $\\mathcal{O}(Nr)$ gates that sample a given\nDPP, with depth varying from $\\mathcal{O}(N)$ to $\\mathcal{O}(r\\log N)$\ndepending on qubit-communication constraints on the target machine. We also\nconnect existing work on the simulation of superconductors to Pfaffian point\nprocesses, which generalize DPPs and would be a natural addition to the machine\nlearner's toolbox. Finally, the circuits are empirically validated on a\nclassical simulator and on 5-qubit machines.\n","authors":["Rémi Bardenet","Michaël Fanuel","Alexandre Feller"],"pdf_url":"https://arxiv.org/pdf/2305.15851v2.pdf","comment":"48 pages, 8 figures. 
Additional results about parity of cardinality\n of PfPP samples"},{"id":"http://arxiv.org/abs/2307.09994v1","updated":"2023-07-19T13:58:01Z","published":"2023-07-19T13:58:01Z","title":"Impact of Disentanglement on Pruning Neural Networks","summary":" Deploying deep learning neural networks on edge devices, to accomplish task\nspecific objectives in the real-world, requires a reduction in their memory\nfootprint, power consumption, and latency. This can be realized via efficient\nmodel compression. Disentangled latent representations produced by variational\nautoencoder (VAE) networks are a promising approach for achieving model\ncompression because they mainly retain task-specific information, discarding\nuseless information for the task at hand. We make use of the Beta-VAE framework\ncombined with a standard criterion for pruning to investigate the impact of\nforcing the network to learn disentangled representations on the pruning\nprocess for the task of classification. In particular, we perform experiments\non MNIST and CIFAR10 datasets, examine disentanglement challenges, and propose\na path forward for future works.\n","authors":["Carl Shneider","Peyman Rostami","Anis Kacem","Nilotpal Sinha","Abd El Rahman Shabayek","Djamila Aouada"],"pdf_url":"https://arxiv.org/pdf/2307.09994v1.pdf","comment":"Presented in ISCS23"},{"id":"http://arxiv.org/abs/2307.08347v2","updated":"2023-07-19T13:55:32Z","published":"2023-07-17T09:38:41Z","title":"M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models\n and Latent Space Geometry Optimization","summary":" Medical vision-language models enable co-learning and integrating features\nfrom medical imaging and clinical text. However, these models are not easy to\ntrain and the latent representation space can be complex. Here we propose a\nnovel way for pre-training and regularising medical vision-language models. The\nproposed method, named Medical vision-language pre-training with Frozen\nlanguage models and Latent spAce Geometry optimization (M-FLAG), leverages a\nfrozen language model for training stability and efficiency and introduces a\nnovel orthogonality loss to harmonize the latent space geometry. We demonstrate\nthe potential of the pre-trained model on three downstream tasks: medical image\nclassification, segmentation, and object detection. Extensive experiments\nacross five public datasets demonstrate that M-FLAG significantly outperforms\nexisting medical vision-language pre-training approaches and reduces the number\nof parameters by 78\\%. Notably, M-FLAG achieves outstanding performance on the\nsegmentation task while using only 1\\% of the RSNA dataset, even outperforming\nImageNet pre-trained models that have been fine-tuned using 100\\% of the data.\n","authors":["Che Liu","Sibo Cheng","Chen Chen","Mengyun Qiao","Weitong Zhang","Anand Shah","Wenjia Bai","Rossella Arcucci"],"pdf_url":"https://arxiv.org/pdf/2307.08347v2.pdf","comment":"Accepted by MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.09989v1","updated":"2023-07-19T13:49:35Z","published":"2023-07-19T13:49:35Z","title":"UniMatch: A Unified User-Item Matching Framework for the Multi-purpose\n Merchant Marketing","summary":" When doing private domain marketing with cloud services, the merchants\nusually have to purchase different machine learning models for the multiple\nmarketing purposes, leading to a very high cost. We present a unified user-item\nmatching framework to simultaneously conduct item recommendation and user\ntargeting with just one model. 
We empirically demonstrate that the above\nconcurrent modeling is viable via modeling the user-item interaction matrix\nwith the multinomial distribution, and propose a bidirectional bias-corrected\nNCE loss for the implementation. The proposed loss function guides the model to\nlearn the user-item joint probability $p(u,i)$ instead of the conditional\nprobability $p(i|u)$ or $p(u|i)$ through correcting both the users and items'\nbiases caused by the in-batch negative sampling. In addition, our framework is\nmodel-agnostic enabling a flexible adaptation of different model architectures.\nExtensive experiments demonstrate that our framework results in significant\nperformance gains in comparison with the state-of-the-art methods, with greatly\nreduced cost on computing resources and daily maintenance.\n","authors":["Qifang Zhao","Tianyu Li","Meng Du","Yu Jiang","Qinghui Sun","Zhongyao Wang","Hong Liu","Huan Xu"],"pdf_url":"https://arxiv.org/pdf/2307.09989v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09988v1","updated":"2023-07-19T13:49:12Z","published":"2023-07-19T13:49:12Z","title":"TinyTrain: Deep Neural Network Training at the Extreme Edge","summary":" On-device training is essential for user personalisation and privacy. With\nthe pervasiveness of IoT devices and microcontroller units (MCU), this task\nbecomes more challenging due to the constrained memory and compute resources,\nand the limited availability of labelled user data. Nonetheless, prior works\nneglect the data scarcity issue, require excessively long training time (e.g. a\nfew hours), or induce substantial accuracy loss ($\\geq$10\\%). We propose\nTinyTrain, an on-device training approach that drastically reduces training\ntime by selectively updating parts of the model and explicitly coping with data\nscarcity. TinyTrain introduces a task-adaptive sparse-update method that\ndynamically selects the layer/channel based on a multi-objective criterion that\njointly captures user data, the memory, and the compute capabilities of the\ntarget device, leading to high accuracy on unseen tasks with reduced\ncomputation and memory footprint. TinyTrain outperforms vanilla fine-tuning of\nthe entire network by 3.6-5.0\\% in accuracy, while reducing the backward-pass\nmemory and computation cost by up to 2,286$\\times$ and 7.68$\\times$,\nrespectively. Targeting broadly used real-world edge devices, TinyTrain\nachieves 9.5$\\times$ faster and 3.5$\\times$ more energy-efficient training over\nstatus-quo approaches, and 2.8$\\times$ smaller memory footprint than SOTA\napproaches, while remaining within the 1 MB memory envelope of MCU-grade\nplatforms.\n","authors":["Young D. Kwon","Rui Li","Stylianos I. Venieris","Jagmohan Chauhan","Nicholas D. Lane","Cecilia Mascolo"],"pdf_url":"https://arxiv.org/pdf/2307.09988v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.14518v2","updated":"2023-07-19T13:48:46Z","published":"2023-02-28T12:13:57Z","title":"Generalization Error Bounds for Noisy, Iterative Algorithms via Maximal\n Leakage","summary":" We adopt an information-theoretic framework to analyze the generalization\nbehavior of the class of iterative, noisy learning algorithms. This class is\nparticularly suitable for study under information-theoretic metrics as the\nalgorithms are inherently randomized, and it includes commonly used algorithms\nsuch as Stochastic Gradient Langevin Dynamics (SGLD). 
Herein, we use the\nmaximal leakage (equivalently, the Sibson mutual information of order infinity)\nmetric, as it is simple to analyze, and it implies both bounds on the\nprobability of having a large generalization error and on its expected value.\nWe show that, if the update function (e.g., gradient) is bounded in $L_2$-norm\nand the additive noise is isotropic Gaussian noise, then one can obtain an\nupper-bound on maximal leakage in semi-closed form. Furthermore, we demonstrate\nhow the assumptions on the update function affect the optimal (in the sense of\nminimizing the induced maximal leakage) choice of the noise. Finally, we\ncompute explicit tight upper bounds on the induced maximal leakage for other\nscenarios of interest.\n","authors":["Ibrahim Issa","Amedeo Roberto Esposito","Michael Gastpar"],"pdf_url":"https://arxiv.org/pdf/2302.14518v2.pdf","comment":"Updated to fix an error in Theorem 4 (asymptotic analysis)"},{"id":"http://arxiv.org/abs/2210.14037v2","updated":"2023-07-19T13:43:07Z","published":"2022-10-25T14:13:53Z","title":"Revisiting Softmax for Uncertainty Approximation in Text Classification","summary":" Uncertainty approximation in text classification is an important area with\napplications in domain adaptation and interpretability. One of the most widely\nused uncertainty approximation methods is Monte Carlo (MC) Dropout, which is\ncomputationally expensive as it requires multiple forward passes through the\nmodel. A cheaper alternative is to simply use the softmax based on a single\nforward pass without dropout to estimate model uncertainty. However, prior work\nhas indicated that these predictions tend to be overconfident. In this paper,\nwe perform a thorough empirical analysis of these methods on five datasets with\ntwo base neural architectures in order to identify the trade-offs between the\ntwo. We compare both softmax and an efficient version of MC Dropout on their\nuncertainty approximations and downstream text classification performance,\nwhile weighing their runtime (cost) against performance (benefit). We find\nthat, while MC dropout produces the best uncertainty approximations, using a\nsimple softmax leads to competitive and in some cases better uncertainty\nestimation for text classification at a much lower computational cost,\nsuggesting that softmax can in fact be a sufficient uncertainty estimate when\ncomputational resources are a concern.\n","authors":["Andreas Nugaard Holm","Dustin Wright","Isabelle Augenstein"],"pdf_url":"https://arxiv.org/pdf/2210.14037v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09977v1","updated":"2023-07-19T13:33:43Z","published":"2023-07-19T13:33:43Z","title":"Learner Referral for Cost-Effective Federated Learning Over Hierarchical\n IoT Networks","summary":" The paradigm of federated learning (FL) to address data privacy concerns by\nlocally training parameters on resource-constrained clients in a distributed\nmanner has garnered significant attention. Nonetheless, FL is not applicable\nwhen not all clients within the coverage of the FL server are registered with\nthe FL network. To bridge this gap, this paper proposes joint learner referral\naided federated client selection (LRef-FedCS), along with communications and\ncomputing resource scheduling, and local model accuracy optimization (LMAO)\nmethods. These methods are designed to minimize the cost incurred by the\nworst-case participant and ensure the long-term fairness of FL in hierarchical\nInternet of Things (HieIoT) networks. 
Utilizing the Lyapunov optimization\ntechnique, we reformulate the original problem into a stepwise joint\noptimization problem (JOP). Subsequently, to tackle the mixed-integer\nnon-convex JOP, we separately and iteratively address LRef-FedCS and LMAO\nthrough the centralized method and self-adaptive global best harmony search\n(SGHS) algorithm, respectively. To enhance scalability, we further propose a\ndistributed LRef-FedCS approach based on a matching game to replace the\ncentralized method described above. Numerical simulations and experimental\nresults on the MNIST/CIFAR-10 datasets demonstrate that our proposed LRef-FedCS\napproach could achieve a good balance between pursuing high global accuracy and\nreducing cost.\n","authors":["Yulan Gao","Ziqiang Ye","Yue Xiao","Wei Xiang"],"pdf_url":"https://arxiv.org/pdf/2307.09977v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.03587v2","updated":"2023-07-19T13:23:29Z","published":"2023-07-07T13:29:07Z","title":"BOF-UCB: A Bayesian-Optimistic Frequentist Algorithm for Non-Stationary\n Contextual Bandits","summary":" We propose a novel Bayesian-Optimistic Frequentist Upper Confidence Bound\n(BOF-UCB) algorithm for stochastic contextual linear bandits in non-stationary\nenvironments. This unique combination of Bayesian and frequentist principles\nenhances adaptability and performance in dynamic settings. The BOF-UCB\nalgorithm utilizes sequential Bayesian updates to infer the posterior\ndistribution of the unknown regression parameter, and subsequently employs a\nfrequentist approach to compute the Upper Confidence Bound (UCB) by maximizing\nthe expected reward over the posterior distribution. We provide theoretical\nguarantees of BOF-UCB's performance and demonstrate its effectiveness in\nbalancing exploration and exploitation on synthetic datasets and classical\ncontrol tasks in a reinforcement learning setting. Our results show that\nBOF-UCB outperforms existing methods, making it a promising solution for\nsequential decision-making in non-stationary environments.\n","authors":["Nicklas Werge","Abdullah Akgül","Melih Kandemir"],"pdf_url":"https://arxiv.org/pdf/2307.03587v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.09946v2","updated":"2023-07-19T13:15:08Z","published":"2023-05-17T04:56:11Z","title":"AdaMSS: Adaptive Multi-Modality Segmentation-to-Survival Learning for\n Survival Outcome Prediction from PET/CT Images","summary":" Survival prediction is a major concern for cancer management. Deep survival\nmodels based on deep learning have been widely adopted to perform end-to-end\nsurvival prediction from medical images. Recent deep survival models achieved\npromising performance by jointly performing tumor segmentation with survival\nprediction, where the models were guided to extract tumor-related information\nthrough Multi-Task Learning (MTL). However, these deep survival models have\ndifficulties in exploring out-of-tumor prognostic information. In addition,\nexisting deep survival models are unable to effectively leverage multi-modality\nimages. Empirically-designed fusion strategies were commonly adopted to fuse\nmulti-modality information via task-specific manually-designed networks, thus\nlimiting the adaptability to different scenarios. In this study, we propose an\nAdaptive Multi-modality Segmentation-to-Survival model (AdaMSS) for survival\nprediction from PET/CT images. 
Instead of adopting MTL, we propose a novel\nSegmentation-to-Survival Learning (SSL) strategy, where our AdaMSS is trained\nfor tumor segmentation and survival prediction sequentially in two stages. This\nstrategy enables the AdaMSS to focus on tumor regions in the first stage and\ngradually expand its focus to include other prognosis-related regions in the\nsecond stage. We also propose a data-driven strategy to fuse multi-modality\ninformation, which realizes adaptive optimization of fusion strategies based on\ntraining data during training. With the SSL and data-driven fusion strategies,\nour AdaMSS is designed as an adaptive model that can self-adapt its focus\nregions and fusion strategy for different training stages. Extensive\nexperiments with two large clinical datasets show that our AdaMSS outperforms\nstate-of-the-art survival prediction methods.\n","authors":["Mingyuan Meng","Bingxin Gu","Michael Fulham","Shaoli Song","Dagan Feng","Lei Bi","Jinman Kim"],"pdf_url":"https://arxiv.org/pdf/2305.09946v2.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2307.09964v1","updated":"2023-07-19T13:14:47Z","published":"2023-07-19T13:14:47Z","title":"Towards green AI-based software systems: an architecture-centric\n approach (GAISSA)","summary":" Nowadays, AI-based systems have achieved outstanding results and have\noutperformed humans in different domains. However, the processes of training AI\nmodels and inferring from them require high computational resources, which pose\na significant challenge in the current energy efficiency societal demand. To\ncope with this challenge, this research project paper describes the main\nvision, goals, and expected outcomes of the GAISSA project. The GAISSA project\naims at providing data scientists and software engineers tool-supported,\narchitecture-centric methods for the modelling and development of green\nAI-based systems. Although the project is in an initial stage, we describe the\ncurrent research results, which illustrate the potential to achieve GAISSA\nobjectives.\n","authors":["Silverio Martínez-Fernández","Xavier Franch","Francisco Durán"],"pdf_url":"https://arxiv.org/pdf/2307.09964v1.pdf","comment":"Accepted for publication as full paper - 2023 49th Euromicro\n Conference Series on Software Engineering and Advanced Applications (SEAA)"},{"id":"http://arxiv.org/abs/2210.06226v2","updated":"2023-07-19T13:08:21Z","published":"2022-10-12T14:15:39Z","title":"Alpha-divergence Variational Inference Meets Importance Weighted\n Auto-Encoders: Methodology and Asymptotics","summary":" Several algorithms involving the Variational R\\'enyi (VR) bound have been\nproposed to minimize an alpha-divergence between a target posterior\ndistribution and a variational distribution. Despite promising empirical\nresults, those algorithms resort to biased stochastic gradient descent\nprocedures and thus lack theoretical guarantees. In this paper, we formalize\nand study the VR-IWAE bound, a generalization of the Importance Weighted\nAuto-Encoder (IWAE) bound. We show that the VR-IWAE bound enjoys several\ndesirable properties and notably leads to the same stochastic gradient descent\nprocedure as the VR bound in the reparameterized case, but this time by relying\non unbiased gradient estimators. We then provide two complementary theoretical\nanalyses of the VR-IWAE bound and thus of the standard IWAE bound. Those\nanalyses shed light on the benefits or lack thereof of these bounds. 
Lastly, we\nillustrate our theoretical claims over toy and real-data examples.\n","authors":["Kamélia Daudel","Joe Benton","Yuyang Shi","Arnaud Doucet"],"pdf_url":"https://arxiv.org/pdf/2210.06226v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2001.05887v4","updated":"2023-07-19T12:58:18Z","published":"2020-01-16T15:24:26Z","title":"MixPath: A Unified Approach for One-shot Neural Architecture Search","summary":" Blending multiple convolutional kernels is proved advantageous in neural\narchitecture design. However, current two-stage neural architecture search\nmethods are mainly limited to single-path search spaces. How to efficiently\nsearch models of multi-path structures remains a difficult problem. In this\npaper, we are motivated to train a one-shot multi-path supernet to accurately\nevaluate the candidate architectures. Specifically, we discover that in the\nstudied search spaces, feature vectors summed from multiple paths are nearly\nmultiples of those from a single path. Such disparity perturbs the supernet\ntraining and its ranking ability. Therefore, we propose a novel mechanism\ncalled Shadow Batch Normalization (SBN) to regularize the disparate feature\nstatistics. Extensive experiments prove that SBNs are capable of stabilizing\nthe optimization and improving ranking performance. We call our unified\nmulti-path one-shot approach as MixPath, which generates a series of models\nthat achieve state-of-the-art results on ImageNet.\n","authors":["Xiangxiang Chu","Shun Lu","Xudong Li","Bo Zhang"],"pdf_url":"https://arxiv.org/pdf/2001.05887v4.pdf","comment":"ICCV2023"},{"id":"http://arxiv.org/abs/2307.09955v1","updated":"2023-07-19T12:51:28Z","published":"2023-07-19T12:51:28Z","title":"XSkill: Cross Embodiment Skill Discovery","summary":" Human demonstration videos are a widely available data source for robot\nlearning and an intuitive user interface for expressing desired behavior.\nHowever, directly extracting reusable robot manipulation skills from\nunstructured human videos is challenging due to the big embodiment difference\nand unobserved action parameters. To bridge this embodiment gap, this paper\nintroduces XSkill, an imitation learning framework that 1) discovers a\ncross-embodiment representation called skill prototypes purely from unlabeled\nhuman and robot manipulation videos, 2) transfers the skill representation to\nrobot actions using conditional diffusion policy, and finally, 3) composes the\nlearned skill to accomplish unseen tasks specified by a human prompt video. Our\nexperiments in simulation and real-world environments show that the discovered\nskill prototypes facilitate both skill transfer and composition for unseen\ntasks, resulting in a more general and scalable imitation learning framework.\nThe performance of XSkill is best understood from the anonymous website:\nhttps://xskillcorl.github.io.\n","authors":["Mengda Xu","Zhenjia Xu","Cheng Chi","Manuela Veloso","Shuran Song"],"pdf_url":"https://arxiv.org/pdf/2307.09955v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09943v1","updated":"2023-07-19T12:35:16Z","published":"2023-07-19T12:35:16Z","title":"Impatient Bandits: Optimizing for the Long-Term Without Delay","summary":" Recommender systems are a ubiquitous feature of online platforms.\nIncreasingly, they are explicitly tasked with increasing users' long-term\nsatisfaction. In this context, we study a content exploration task, which we\nformalize as a multi-armed bandit problem with delayed rewards. 
We observe that\nthere is an apparent trade-off in choosing the learning signal: Waiting for the\nfull reward to become available might take several weeks, hurting the rate at\nwhich learning happens, whereas measuring short-term proxy rewards reflects the\nactual long-term goal only imperfectly. We address this challenge in two steps.\nFirst, we develop a predictive model of delayed rewards that incorporates all\ninformation obtained to date. Full observations as well as partial (short or\nmedium-term) outcomes are combined through a Bayesian filter to obtain a\nprobabilistic belief. Second, we devise a bandit algorithm that takes advantage\nof this new predictive model. The algorithm quickly learns to identify content\naligned with long-term success by carefully balancing exploration and\nexploitation. We apply our approach to a podcast recommendation problem, where\nwe seek to identify shows that users engage with repeatedly over two months. We\nempirically validate that our approach results in substantially better\nperformance compared to approaches that either optimize for short-term proxies,\nor wait for the long-term outcome to be fully realized.\n","authors":["Thomas McDonald","Lucas Maystre","Mounia Lalmas","Daniel Russo","Kamil Ciosek"],"pdf_url":"https://arxiv.org/pdf/2307.09943v1.pdf","comment":"Presented at the 29th ACM SIGKDD Conference on Knowledge Discovery\n and Data Mining (KDD '23)"},{"id":"http://arxiv.org/abs/2307.09942v1","updated":"2023-07-19T12:35:09Z","published":"2023-07-19T12:35:09Z","title":"TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic\n Tree-Based Memory Network","summary":" Clinical trials are critical for drug development but often suffer from\nexpensive and inefficient patient recruitment. In recent years, machine\nlearning models have been proposed for speeding up patient recruitment via\nautomatically matching patients with clinical trials based on longitudinal\npatient electronic health records (EHR) data and eligibility criteria of\nclinical trials. However, they either depend on trial-specific expert rules\nthat cannot expand to other trials or perform matching at a very general level\nwith a black-box model where the lack of interpretability makes the model\nresults difficult to be adopted.\n To provide accurate and interpretable patient trial matching, we introduce a\npersonalized dynamic tree-based memory network model named TREEMENT. It\nutilizes hierarchical clinical ontologies to expand the personalized patient\nrepresentation learned from sequential EHR data, and then uses an attentional\nbeam-search query learned from eligibility criteria embedding to offer a\ngranular level of alignment for improved performance and interpretability. We\nevaluated TREEMENT against existing models on real-world datasets and\ndemonstrated that TREEMENT outperforms the best baseline by 7% in terms of\nerror reduction in criteria-level matching and achieves state-of-the-art\nresults in its trial-level matching ability. Furthermore, we also show TREEMENT\ncan offer good interpretability to make the model results easier for adoption.\n","authors":["Brandon Theodorou","Cao Xiao","Jimeng Sun"],"pdf_url":"https://arxiv.org/pdf/2307.09942v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.02486v2","updated":"2023-07-19T12:25:35Z","published":"2023-07-05T17:59:38Z","title":"LongNet: Scaling Transformers to 1,000,000,000 Tokens","summary":" Scaling sequence length has become a critical demand in the era of large\nlanguage models. 
However, existing methods struggle with either computational\ncomplexity or model expressivity, rendering the maximum sequence length\nrestricted. To address this issue, we introduce LongNet, a Transformer variant\nthat can scale sequence length to more than 1 billion tokens, without\nsacrificing the performance on shorter sequences. Specifically, we propose\ndilated attention, which expands the attentive field exponentially as the\ndistance grows. LongNet has significant advantages: 1) it has a linear\ncomputation complexity and a logarithm dependency between any two tokens in a\nsequence; 2) it can be served as a distributed trainer for extremely long\nsequences; 3) its dilated attention is a drop-in replacement for standard\nattention, which can be seamlessly integrated with the existing\nTransformer-based optimization. Experiments results demonstrate that LongNet\nyields strong performance on both long-sequence modeling and general language\ntasks. Our work opens up new possibilities for modeling very long sequences,\ne.g., treating a whole corpus or even the entire Internet as a sequence.\n","authors":["Jiayu Ding","Shuming Ma","Li Dong","Xingxing Zhang","Shaohan Huang","Wenhui Wang","Nanning Zheng","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2307.02486v2.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2302.07265v2","updated":"2023-07-19T12:18:34Z","published":"2023-02-14T18:59:02Z","title":"The Meta-Evaluation Problem in Explainable AI: Identifying Reliable\n Estimators with MetaQuantus","summary":" One of the unsolved challenges in the field of Explainable AI (XAI) is\ndetermining how to most reliably estimate the quality of an explanation method\nin the absence of ground truth explanation labels. Resolving this issue is of\nutmost importance as the evaluation outcomes generated by competing evaluation\nmethods (or ''quality estimators''), which aim at measuring the same property\nof an explanation method, frequently present conflicting rankings. Such\ndisagreements can be challenging for practitioners to interpret, thereby\ncomplicating their ability to select the best-performing explanation method. We\naddress this problem through a meta-evaluation of different quality estimators\nin XAI, which we define as ''the process of evaluating the evaluation method''.\nOur novel framework, MetaQuantus, analyses two complementary performance\ncharacteristics of a quality estimator: its resilience to noise and reactivity\nto randomness, thus circumventing the need for ground truth labels. We\ndemonstrate the effectiveness of our framework through a series of experiments,\ntargeting various open questions in XAI such as the selection and\nhyperparameter optimisation of quality estimators. Our work is released under\nan open-source license (https://github.com/annahedstroem/MetaQuantus) to serve\nas a development tool for XAI- and Machine Learning (ML) practitioners to\nverify and benchmark newly constructed quality estimators in a given\nexplainability context. With this work, we provide the community with clear and\ntheoretically-grounded guidance for identifying reliable evaluation methods,\nthus facilitating reproducibility in the field.\n","authors":["Anna Hedström","Philine Bommer","Kristoffer K. Wickstrøm","Wojciech Samek","Sebastian Lapuschkin","Marina M. -C. 
Höhne"],"pdf_url":"https://arxiv.org/pdf/2302.07265v2.pdf","comment":"35 pages, 15 figures, 5 tables"},{"id":"http://arxiv.org/abs/2307.09933v1","updated":"2023-07-19T12:15:06Z","published":"2023-07-19T12:15:06Z","title":"Spuriosity Didn't Kill the Classifier: Using Invariant Predictions to\n Harness Spurious Features","summary":" To avoid failures on out-of-distribution data, recent works have sought to\nextract features that have a stable or invariant relationship with the label\nacross domains, discarding the \"spurious\" or unstable features whose\nrelationship with the label changes across domains. However, unstable features\noften carry complementary information about the label that could boost\nperformance if used correctly in the test domain. Our main contribution is to\nshow that it is possible to learn how to use these unstable features in the\ntest domain without labels. In particular, we prove that pseudo-labels based on\nstable features provide sufficient guidance for doing so, provided that stable\nand unstable features are conditionally independent given the label. Based on\nthis theoretical insight, we propose Stable Feature Boosting (SFB), an\nalgorithm for: (i) learning a predictor that separates stable and\nconditionally-independent unstable features; and (ii) using the stable-feature\npredictions to adapt the unstable-feature predictions in the test domain.\nTheoretically, we prove that SFB can learn an asymptotically-optimal predictor\nwithout test-domain labels. Empirically, we demonstrate the effectiveness of\nSFB on real and synthetic data.\n","authors":["Cian Eastwood","Shashank Singh","Andrei Liviu Nicolicioiu","Marin Vlastelica","Julius von Kügelgen","Bernhard Schölkopf"],"pdf_url":"https://arxiv.org/pdf/2307.09933v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09931v1","updated":"2023-07-19T12:12:17Z","published":"2023-07-19T12:12:17Z","title":"DISA: DIfferentiable Similarity Approximation for Universal Multimodal\n Registration","summary":" Multimodal image registration is a challenging but essential step for\nnumerous image-guided procedures. Most registration algorithms rely on the\ncomputation of complex, frequently non-differentiable similarity metrics to\ndeal with the appearance discrepancy of anatomical structures between imaging\nmodalities. Recent Machine Learning based approaches are limited to specific\nanatomy-modality combinations and do not generalize to new settings. We propose\na generic framework for creating expressive cross-modal descriptors that enable\nfast deformable global registration. We achieve this by approximating existing\nmetrics with a dot-product in the feature space of a small convolutional neural\nnetwork (CNN) which is inherently differentiable can be trained without\nregistered data. Our method is several orders of magnitude faster than local\npatch-based metrics and can be directly applied in clinical settings by\nreplacing the similarity measure with the proposed one. Experiments on three\ndifferent datasets demonstrate that our approach generalizes well beyond the\ntraining data, yielding a broad capture range even on unseen anatomies and\nmodality pairs, without the need for specialized retraining. We make our\ntraining code and data publicly available.\n","authors":["Matteo Ronchetti","Wolfgang Wein","Nassir Navab","Oliver Zettinig","Raphael Prevost"],"pdf_url":"https://arxiv.org/pdf/2307.09931v1.pdf","comment":"This preprint was submitted to MICCAI 2023. 
The Version of Record of\n this contribution will be published in Springer LNCS"},{"id":"http://arxiv.org/abs/2307.04639v2","updated":"2023-07-19T12:08:51Z","published":"2023-07-10T15:35:31Z","title":"Multimodal brain age estimation using interpretable adaptive\n population-graph learning","summary":" Brain age estimation is clinically important as it can provide valuable\ninformation in the context of neurodegenerative diseases such as Alzheimer's.\nPopulation graphs, which include multimodal imaging information of the subjects\nalong with the relationships among the population, have been used in literature\nalong with Graph Convolutional Networks (GCNs) and have proved beneficial for a\nvariety of medical imaging tasks. A population graph is usually static and\nconstructed manually using non-imaging information. However, graph construction\nis not a trivial task and might significantly affect the performance of the\nGCN, which is inherently very sensitive to the graph structure. In this work,\nwe propose a framework that learns a population graph structure optimized for\nthe downstream task. An attention mechanism assigns weights to a set of imaging\nand non-imaging features (phenotypes), which are then used for edge extraction.\nThe resulting graph is used to train the GCN. The entire pipeline can be\ntrained end-to-end. Additionally, by visualizing the attention weights that\nwere the most important for the graph construction, we increase the\ninterpretability of the graph. We use the UK Biobank, which provides a large\nvariety of neuroimaging and non-imaging phenotypes, to evaluate our method on\nbrain age regression and classification. The proposed method outperforms\ncompeting static graph approaches and other state-of-the-art adaptive methods.\nWe further show that the assigned attention scores indicate that there are both\nimaging and non-imaging phenotypes that are informative for brain age\nestimation and are in agreement with the relevant literature.\n","authors":["Kyriaki-Margarita Bintsi","Vasileios Baltatzis","Rolandos Alexandros Potamias","Alexander Hammers","Daniel Rueckert"],"pdf_url":"https://arxiv.org/pdf/2307.04639v2.pdf","comment":"Accepted at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.09916v1","updated":"2023-07-19T11:40:15Z","published":"2023-07-19T11:40:15Z","title":"TimeTuner: Diagnosing Time Representations for Time-Series Forecasting\n with Counterfactual Explanations","summary":" Deep learning (DL) approaches are being increasingly used for time-series\nforecasting, with many efforts devoted to designing complex DL models. Recent\nstudies have shown that the DL success is often attributed to effective data\nrepresentations, fostering the fields of feature engineering and representation\nlearning. However, automated approaches for feature learning are typically\nlimited with respect to incorporating prior knowledge, identifying interactions\namong variables, and choosing evaluation metrics to ensure that the models are\nreliable. To improve on these limitations, this paper contributes a novel\nvisual analytics framework, namely TimeTuner, designed to help analysts\nunderstand how model behaviors are associated with localized correlations,\nstationarity, and granularity of time-series representations. The system mainly\nconsists of the following two-stage technique: We first leverage counterfactual\nexplanations to connect the relationships among time-series representations,\nmultivariate features and model predictions. 
Next, we design multiple\ncoordinated views including a partition-based correlation matrix and juxtaposed\nbivariate stripes, and provide a set of interactions that allow users to step\ninto the transformation selection process, navigate through the feature space,\nand reason the model performance. We instantiate TimeTuner with two\ntransformation methods of smoothing and sampling, and demonstrate its\napplicability on real-world time-series forecasting of univariate sunspots and\nmultivariate air pollutants. Feedback from domain experts indicates that our\nsystem can help characterize time-series representations and guide the feature\nengineering processes.\n","authors":["Jianing Hao","Qing Shi","Yilin Ye","Wei Zeng"],"pdf_url":"https://arxiv.org/pdf/2307.09916v1.pdf","comment":"11 pages, 9 figures, this paper has been accepted by VIS2024"},{"id":"http://arxiv.org/abs/2307.09912v1","updated":"2023-07-19T11:32:24Z","published":"2023-07-19T11:32:24Z","title":"Deep projection networks for learning time-homogeneous dynamical systems","summary":" We consider the general class of time-homogeneous dynamical systems, both\ndiscrete and continuous, and study the problem of learning a meaningful\nrepresentation of the state from observed data. This is instrumental for the\ntask of learning a forward transfer operator of the system, that in turn can be\nused for forecasting future states or observables. The representation,\ntypically parametrized via a neural network, is associated with a projection\noperator and is learned by optimizing an objective function akin to that of\ncanonical correlation analysis (CCA). However, unlike CCA, our objective avoids\nmatrix inversions and therefore is generally more stable and applicable to\nchallenging scenarios. Our objective is a tight relaxation of CCA and we\nfurther enhance it by proposing two regularization schemes, one encouraging the\northogonality of the components of the representation while the other\nexploiting Chapman-Kolmogorov's equation. We apply our method to challenging\ndiscrete dynamical systems, discussing improvements over previous methods, as\nwell as to continuous dynamical systems.\n","authors":["Vladimir R. Kostic","Pietro Novelli","Riccardo Grazzi","Karim Lounici","Massimiliano Pontil"],"pdf_url":"https://arxiv.org/pdf/2307.09912v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.06698v2","updated":"2023-07-19T11:23:07Z","published":"2023-07-13T11:54:32Z","title":"IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation","summary":" Knowledge Graph Embedding (KGE) models are used to learn continuous\nrepresentations of entities and relations. A key task in the literature is\npredicting missing links between entities. However, Knowledge Graphs are not\njust sets of links but also have semantics underlying their structure.\nSemantics is crucial in several downstream tasks, such as query answering or\nreasoning. We introduce the subgraph inference task, where a model has to\ngenerate likely and semantically valid subgraphs. We propose IntelliGraphs, a\nset of five new Knowledge Graph datasets. The IntelliGraphs datasets contain\nsubgraphs with semantics expressed in logical rules for evaluating subgraph\ninference. We also present the dataset generator that produced the synthetic\ndatasets. We designed four novel baseline models, which include three models\nbased on traditional KGEs. We evaluate their expressiveness and show that these\nmodels cannot capture the semantics. 
We believe this benchmark will encourage\nthe development of machine learning models that emphasize semantic\nunderstanding.\n","authors":["Thiviyan Thanapalasingam","Emile van Krieken","Peter Bloem","Paul Groth"],"pdf_url":"https://arxiv.org/pdf/2307.06698v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.09211v3","updated":"2023-07-19T10:52:30Z","published":"2023-05-16T06:40:04Z","title":"CB-HVTNet: A channel-boosted hybrid vision transformer network for\n lymphocyte assessment in histopathological images","summary":" Transformers, due to their ability to learn long range dependencies, have\novercome the shortcomings of convolutional neural networks (CNNs) for global\nperspective learning. Therefore, they have gained the focus of researchers for\nseveral vision related tasks including medical diagnosis. However, their\nmulti-head attention module only captures global level feature representations,\nwhich is insufficient for medical images. To address this issue, we propose a\nChannel Boosted Hybrid Vision Transformer (CB HVT) that uses transfer learning\nto generate boosted channels and employs both transformers and CNNs to analyse\nlymphocytes in histopathological images. The proposed CB HVT comprises five\nmodules, including a channel generation module, channel exploitation module,\nchannel merging module, region-aware module, and a detection and segmentation\nhead, which work together to effectively identify lymphocytes. The channel\ngeneration module uses the idea of channel boosting through transfer learning\nto extract diverse channels from different auxiliary learners. In the CB HVT,\nthese boosted channels are first concatenated and ranked using an attention\nmechanism in the channel exploitation module. A fusion block is then utilized\nin the channel merging module for a gradual and systematic merging of the\ndiverse boosted channels to improve the network's learning representations. The\nCB HVT also employs a proposal network in its region aware module and a head to\neffectively identify objects, even in overlapping regions and with artifacts.\nWe evaluated the proposed CB HVT on two publicly available datasets for\nlymphocyte assessment in histopathological images. The results show that CB HVT\noutperformed other state of the art detection models, and has good\ngeneralization ability, demonstrating its value as a tool for pathologists.\n","authors":["Momina Liaqat Ali","Zunaira Rauf","Asifullah Khan","Anabia Sohail","Rafi Ullah","Jeonghwan Gwak"],"pdf_url":"https://arxiv.org/pdf/2305.09211v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09896v1","updated":"2023-07-19T10:50:36Z","published":"2023-07-19T10:50:36Z","title":"Repeated Observations for Classification","summary":" We study the problem of nonparametric classification with repeated observations.\nLet $\\bX$ be the $d$ dimensional feature vector and let $Y$ denote the label\ntaking values in $\\{1,\\dots ,M\\}$. In contrast to the usual setup with large sample\nsize $n$ and relatively low dimension $d$, this paper deals with the situation\nwhen, instead of observing a single feature vector $\\bX$, we are given $t$\nrepeated feature vectors $\\bV_1,\\dots ,\\bV_t $. Some simple classification\nrules are presented such that the conditional error probabilities have an\nexponential rate of convergence as $t\\to\\infty$. 
In the analysis,\nwe investigate particular models like robust detection by nominal densities,\nprototype classification, linear transformation, linear classification,\nscaling.\n","authors":["Hüseyin Afşer","László Györfi","Harro Walk"],"pdf_url":"https://arxiv.org/pdf/2307.09896v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09883v1","updated":"2023-07-19T10:27:34Z","published":"2023-07-19T10:27:34Z","title":"Symmetric Equilibrium Learning of VAEs","summary":" We view variational autoencoders (VAE) as decoder-encoder pairs, which map\ndistributions in the data space to distributions in the latent space and vice\nversa. The standard learning approach for VAEs, i.e. maximisation of the\nevidence lower bound (ELBO), has an obvious asymmetry in that respect.\nMoreover, it requires a closed form a-priori latent distribution. This limits\nthe applicability of VAEs in more complex scenarios, such as general\nsemi-supervised learning and employing complex generative models as priors. We\npropose a Nash equilibrium learning approach that relaxes these restrictions\nand allows learning VAEs in situations where both the data and the latent\ndistributions are accessible only by sampling. The flexibility and simplicity\nof this approach allows its application to a wide range of learning scenarios\nand downstream tasks. We show experimentally that the models learned by this\nmethod are comparable to those obtained by ELBO learning and demonstrate its\napplicability for tasks that are not accessible by standard VAE learning.\n","authors":["Boris Flach","Dmitrij Schlesinger","Alexander Shekhovtsov"],"pdf_url":"https://arxiv.org/pdf/2307.09883v1.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.09882v1","updated":"2023-07-19T10:26:29Z","published":"2023-07-19T10:26:29Z","title":"Adversarial Likelihood Estimation with One-way Flows","summary":" Generative Adversarial Networks (GANs) can produce high-quality samples, but\ndo not provide an estimate of the probability density around the samples.\nHowever, it has been noted that maximizing the log-likelihood within an\nenergy-based setting can lead to an adversarial framework where the\ndiscriminator provides unnormalized density (often called energy). We further\ndevelop this perspective, incorporate importance sampling, and show that 1)\nWasserstein GAN performs a biased estimate of the partition function, and we\npropose instead to use an unbiased estimator; 2) when optimizing for\nlikelihood, one must maximize generator entropy. This is hypothesized to\nprovide a better mode coverage. Different from previous works, we explicitly\ncompute the density of the generated samples. This is the key enabler to\ndesigning an unbiased estimator of the partition function and computation of\nthe generator entropy term. The generator density is obtained via a new type of\nflow network, called one-way flow network, that is less constrained in terms of\narchitecture, as it does not require to have a tractable inverse function. Our\nexperimental results show that we converge faster, produce comparable sample\nquality to GANs with similar architecture, successfully avoid over-fitting to\ncommonly used datasets and produce smooth low-dimensional latent\nrepresentations of the training data.\n","authors":["Omri Ben-Dov","Pravir Singh Gupta","Victoria Abrevaya","Michael J. 
Black","Partha Ghosh"],"pdf_url":"https://arxiv.org/pdf/2307.09882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09866v1","updated":"2023-07-19T09:53:56Z","published":"2023-07-19T09:53:56Z","title":"Detecting Vulnerable Nodes in Urban Infrastructure Interdependent\n Network","summary":" Understanding and characterizing the vulnerability of urban infrastructures,\nwhich refers to the engineering facilities essential for the regular running of\ncities and that exist naturally in the form of networks, is of great value to\nus. Potential applications include protecting fragile facilities and designing\nrobust topologies, etc. Due to the strong correlation between different\ntopological characteristics and infrastructure vulnerability and their\ncomplicated evolution mechanisms, some heuristic and machine-assisted analysis\nfall short in addressing such a scenario. In this paper, we model the\ninterdependent network as a heterogeneous graph and propose a system based on\ngraph neural network with reinforcement learning, which can be trained on\nreal-world data, to characterize the vulnerability of the city system\naccurately. The presented system leverages deep learning techniques to\nunderstand and analyze the heterogeneous graph, which enables us to capture the\nrisk of cascade failure and discover vulnerable infrastructures of cities.\nExtensive experiments with various requests demonstrate not only the expressive\npower of our system but also transferring ability and necessity of the specific\ncomponents.\n","authors":["Jinzhu Mao","Liu Cao","Chen Gao","Huandong Wang","Hangyu Fan","Depeng Jin","Yong Li"],"pdf_url":"https://arxiv.org/pdf/2307.09866v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09862v1","updated":"2023-07-19T09:45:41Z","published":"2023-07-19T09:45:41Z","title":"Towards a population-informed approach to the definition of data-driven\n models for structural dynamics","summary":" Machine learning has affected the way in which many phenomena for various\ndomains are modelled, one of these domains being that of structural dynamics.\nHowever, because machine-learning algorithms are problem-specific, they often\nfail to perform efficiently in cases of data scarcity. To deal with such\nissues, combination of physics-based approaches and machine learning algorithms\nhave been developed. Although such methods are effective, they also require the\nanalyser's understanding of the underlying physics of the problem. The current\nwork is aimed at motivating the use of models which learn such relationships\nfrom a population of phenomena, whose underlying physics are similar. The\ndevelopment of such models is motivated by the way that physics-based models,\nand more specifically finite element models, work. Such models are considered\ntransferrable, explainable and trustworthy, attributes which are not trivially\nimposed or achieved for machine-learning models. For this reason,\nmachine-learning approaches are less trusted by industry and often considered\nmore difficult to form validated models. To achieve such data-driven models, a\npopulation-based scheme is followed here and two different machine-learning\nalgorithms from the meta-learning domain are used. The two algorithms are the\nmodel-agnostic meta-learning (MAML) algorithm and the conditional neural\nprocesses (CNP) model. The algorithms seem to perform as intended and\noutperform a traditional machine-learning algorithm at approximating the\nquantities of interest. 
Moreover, they exhibit behaviour similar to traditional\nmachine learning algorithms (e.g. neural networks or Gaussian processes),\nconcerning their performance as a function of the available structures in the\ntraining population.\n","authors":["G. Tsialiamanis","N. Dervilis","D. J. Wagg","K. Worden"],"pdf_url":"https://arxiv.org/pdf/2307.09862v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.07873v2","updated":"2023-07-19T09:23:43Z","published":"2023-07-15T19:20:49Z","title":"Why Does Little Robustness Help? Understanding Adversarial\n Transferability From Surrogate Training","summary":" Adversarial examples (AEs) for DNNs have been shown to be transferable: AEs\nthat successfully fool white-box surrogate models can also deceive other\nblack-box models with different architectures. Although a bunch of empirical\nstudies have provided guidance on generating highly transferable AEs, many of\nthese findings lack explanations and even lead to inconsistent advice. In this\npaper, we take a further step towards understanding adversarial\ntransferability, with a particular focus on surrogate aspects. Starting from\nthe intriguing little robustness phenomenon, where models adversarially trained\nwith mildly perturbed adversarial samples can serve as better surrogates, we\nattribute it to a trade-off between two predominant factors: model smoothness\nand gradient similarity. Our investigations focus on their joint effects,\nrather than their separate correlations with transferability. Through a series\nof theoretical and empirical analyses, we conjecture that the data distribution\nshift in adversarial training explains the degradation of gradient similarity.\nBuilding on these insights, we explore the impacts of data augmentation and\ngradient regularization on transferability and identify that the trade-off\ngenerally exists in the various training mechanisms, thus building a\ncomprehensive blueprint for the regulation mechanism behind transferability.\nFinally, we provide a general route for constructing better surrogates to boost\ntransferability which optimizes both model smoothness and gradient similarity\nsimultaneously, e.g., the combination of input gradient regularization and\nsharpness-aware minimization (SAM), validated by extensive experiments. In\nsummary, we call for attention to the united impacts of these two factors for\nlaunching effective transfer attacks, rather than optimizing one while ignoring\nthe other, and emphasize the crucial role of manipulating surrogate models.\n","authors":["Yechao Zhang","Shengshan Hu","Leo Yu Zhang","Junyu Shi","Minghui Li","Xiaogeng Liu","Wei Wan","Hai Jin"],"pdf_url":"https://arxiv.org/pdf/2307.07873v2.pdf","comment":"Accepted by IEEE Symposium on Security and Privacy (Oakland) 2024; 21\n pages, 12 figures, 13 tables"},{"id":"http://arxiv.org/abs/2307.09458v2","updated":"2023-07-19T09:22:02Z","published":"2023-07-18T17:39:04Z","title":"Does Circuit Analysis Interpretability Scale? Evidence from Multiple\n Choice Capabilities in Chinchilla","summary":" \\emph{Circuit analysis} is a promising technique for understanding the\ninternal mechanisms of language models. However, existing analyses are done in\nsmall models far from the state of the art. To address this, we present a case\nstudy of circuit analysis in the 70B Chinchilla model, aiming to test the\nscalability of circuit analysis. 
In particular, we study multiple-choice\nquestion answering, and investigate Chinchilla's capability to identify the\ncorrect answer \\emph{label} given knowledge of the correct answer \\emph{text}.\nWe find that the existing techniques of logit attribution, attention pattern\nvisualization, and activation patching naturally scale to Chinchilla, allowing\nus to identify and categorize a small set of `output nodes' (attention heads\nand MLPs).\n We further study the `correct letter' category of attention heads aiming to\nunderstand the semantics of their features, with mixed results. For normal\nmultiple-choice question answers, we significantly compress the query, key and\nvalue subspaces of the head without loss of performance when operating on the\nanswer labels for multiple-choice questions, and we show that the query and key\nsubspaces represent an `Nth item in an enumeration' feature to at least some\nextent. However, when we attempt to use this explanation to understand the\nheads' behaviour on a more general distribution including randomized answer\nlabels, we find that it is only a partial explanation, suggesting there is more\nto learn about the operation of `correct letter' heads on multiple choice\nquestion answering.\n","authors":["Tom Lieberum","Matthew Rahtz","János Kramár","Neel Nanda","Geoffrey Irving","Rohin Shah","Vladimir Mikulik"],"pdf_url":"https://arxiv.org/pdf/2307.09458v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.10404v4","updated":"2023-07-19T09:17:09Z","published":"2023-06-17T18:16:51Z","title":"The RL Perceptron: Generalisation Dynamics of Policy Learning in High\n Dimensions","summary":" Reinforcement learning (RL) algorithms have proven transformative in a range\nof domains. To tackle real-world domains, these systems often use neural\nnetworks to learn policies directly from pixels or other high-dimensional\nsensory input. By contrast, much theory of RL has focused on discrete state\nspaces or worst-case analysis, and fundamental questions remain about the\ndynamics of policy learning in high-dimensional settings. Here, we propose a\nsolvable high-dimensional model of RL that can capture a variety of learning\nprotocols, and derive its typical dynamics as a set of closed-form ordinary\ndifferential equations (ODEs). We derive optimal schedules for the learning\nrates and task difficulty - analogous to annealing schemes and curricula during\ntraining in RL - and show that the model exhibits rich behaviour, including\ndelayed learning under sparse rewards; a variety of learning regimes depending\non reward baselines; and a speed-accuracy trade-off driven by reward\nstringency. Experiments on variants of the Procgen game \"Bossfight\" and Arcade\nLearning Environment game \"Pong\" also show such a speed-accuracy trade-off in\npractice. 
Together, these results take a step towards closing the gap between\ntheory and practice in high-dimensional RL.\n","authors":["Nishil Patel","Sebastian Lee","Stefano Sarao Mannelli","Sebastian Goldt","Andrew Saxe"],"pdf_url":"https://arxiv.org/pdf/2306.10404v4.pdf","comment":"10 pages, 7 figures, Preprint"},{"id":"http://arxiv.org/abs/2305.07898v2","updated":"2023-07-19T09:15:20Z","published":"2023-05-13T11:42:40Z","title":"Network-GIANT: Fully distributed Newton-type optimization via harmonic\n Hessian consensus","summary":" This paper considers the problem of distributed multi-agent learning, where\nthe global aim is to minimize a sum of local objective (empirical loss)\nfunctions through local optimization and information exchange between\nneighbouring nodes. We introduce a Newton-type fully distributed optimization\nalgorithm, Network-GIANT, which is based on GIANT, a Federated learning\nalgorithm that relies on a centralized parameter server. The Network-GIANT\nalgorithm is designed via a combination of gradient-tracking and a Newton-type\niterative algorithm at each node with consensus based averaging of local\ngradient and Newton updates. We prove that our algorithm guarantees semi-global\nand exponential convergence to the exact solution over the network assuming\nstrongly convex and smooth loss functions. We provide empirical evidence of the\nsuperior convergence performance of Network-GIANT over other state-of-the-art\ndistributed learning algorithms such as Network-DANE and Newton-Raphson\nConsensus.\n","authors":["Alessio Maritan","Ganesh Sharma","Luca Schenato","Subhrakanti Dey"],"pdf_url":"https://arxiv.org/pdf/2305.07898v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09844v1","updated":"2023-07-19T09:03:41Z","published":"2023-07-19T09:03:41Z","title":"Reinforcement Learning for Credit Index Option Hedging","summary":" In this paper, we focus on finding the optimal hedging strategy of a credit\nindex option using reinforcement learning. We take a practical approach, where\nthe focus is on realism i.e. discrete time, transaction costs; even testing our\npolicy on real market data. We apply a state of the art algorithm, the Trust\nRegion Volatility Optimization (TRVO) algorithm and show that the derived\nhedging strategy outperforms the practitioner's Black & Scholes delta hedge.\n","authors":["Francesco Mandelli","Marco Pinciroli","Michele Trapletti","Edoardo Vittori"],"pdf_url":"https://arxiv.org/pdf/2307.09844v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09340v2","updated":"2023-07-19T08:55:01Z","published":"2023-03-16T14:21:45Z","title":"Improving Automated Hemorrhage Detection in Sparse-view Computed\n Tomography via Deep Convolutional Neural Network based Artifact Reduction","summary":" Purpose: Sparse-view computed tomography (CT) is an effective way to reduce\ndose by lowering the total number of views acquired, albeit at the expense of\nimage quality, which, in turn, can impact the ability to detect diseases. We\nexplore deep learning-based artifact reduction in sparse-view cranial CT scans\nand its impact on automated hemorrhage detection. Methods: We trained a U-Net\nfor artefact reduction on simulated sparse-view cranial CT scans from 3000\npatients obtained from a public dataset and reconstructed with varying levels\nof sub-sampling. 
Additionally, we trained a convolutional neural network on\nfully sampled CT data from 17,545 patients for automated hemorrhage detection.\nWe evaluated the classification performance using the area under the receiver\noperator characteristic curves (AUC-ROCs) with corresponding 95% confidence\nintervals (CIs) and the DeLong test, along with confusion matrices. The\nperformance of the U-Net was compared to an analytical approach based on total\nvariation (TV). Results: The U-Net performed superior compared to unprocessed\nand TV-processed images with respect to image quality and automated hemorrhage\ndiagnosis. With U-Net post-processing, the number of views can be reduced from\n4096 (AUC-ROC: 0.974; 95% CI: 0.972-0.976) views to 512 views (0.973;\n0.971-0.975) with minimal decrease in hemorrhage detection (P<.001) and to 256\nviews (0.967; 0.964-0.969) with a slight performance decrease (P<.001).\nConclusion: The results suggest that U-Net based artifact reduction\nsubstantially enhances automated hemorrhage detection in sparse-view cranial\nCTs. Our findings highlight that appropriate post-processing is crucial for\noptimal image quality and diagnostic accuracy while minimizing radiation dose.\n","authors":["Johannes Thalhammer","Manuel Schultheiss","Tina Dorosti","Tobias Lasser","Franz Pfeiffer","Daniela Pfeiffer","Florian Schaff"],"pdf_url":"https://arxiv.org/pdf/2303.09340v2.pdf","comment":"11 pages, 6 figures, 1 table"},{"id":"http://arxiv.org/abs/2307.09836v1","updated":"2023-07-19T08:47:41Z","published":"2023-07-19T08:47:41Z","title":"Near-Linear Time Projection onto the $\\ell_{1,\\infty}$ Ball; Application\n to Sparse Autoencoders","summary":" Looking for sparsity is nowadays crucial to speed up the training of\nlarge-scale neural networks. Projections onto the $\\ell_{1,2}$ and\n$\\ell_{1,\\infty}$ are among the most efficient techniques to sparsify and\nreduce the overall cost of neural networks. In this paper, we introduce a new\nprojection algorithm for the $\\ell_{1,\\infty}$ norm ball. The worst-case time\ncomplexity of this algorithm is $\\mathcal{O}\\big(nm+J\\log(nm)\\big)$ for a\nmatrix in $\\mathbb{R}^{n\\times m}$. $J$ is a term that tends to 0 when the\nsparsity is high, and to $nm$ when the sparsity is low. Its implementation is\neasy and it is guaranteed to converge to the exact solution in a finite time.\nMoreover, we propose to incorporate the $\\ell_{1,\\infty}$ ball projection while\ntraining an autoencoder to enforce feature selection and sparsity of the\nweights. Sparsification appears in the encoder to primarily do feature\nselection due to our application in biology, where only a very small part\n($<2\\%$) of the data is relevant. We show that both in the biological case and\nin the general case of sparsity that our method is the fastest.\n","authors":["Guillaume Perez","Laurent Condat","Michel Barlaud"],"pdf_url":"https://arxiv.org/pdf/2307.09836v1.pdf","comment":"22 pages, 8 figures"},{"id":"http://arxiv.org/abs/2307.09835v1","updated":"2023-07-19T08:46:47Z","published":"2023-07-19T08:46:47Z","title":"Deep Operator Network Approximation Rates for Lipschitz Operators","summary":" We establish universality and expression rate bounds for a class of neural\nDeep Operator Networks (DON) emulating Lipschitz (or H\\\"older) continuous maps\n$\\mathcal G:\\mathcal X\\to\\mathcal Y$ between (subsets of) separable Hilbert\nspaces $\\mathcal X$, $\\mathcal Y$. 
The DON architecture considered uses linear\nencoders $\\mathcal E$ and decoders $\\mathcal D$ via (biorthogonal) Riesz bases\nof $\\mathcal X$, $\\mathcal Y$, and an approximator network of an\ninfinite-dimensional, parametric coordinate map that is Lipschitz continuous on\nthe sequence space $\\ell^2(\\mathbb N)$. Unlike previous works ([Herrmann,\nSchwab and Zech: Neural and Spectral operator surrogates: construction and\nexpression rate bounds, SAM Report, 2022], [Marcati and Schwab: Exponential\nConvergence of Deep Operator Networks for Elliptic Partial Differential\nEquations, SAM Report, 2022]), which required for example $\\mathcal G$ to be\nholomorphic, the present expression rate results require mere Lipschitz (or\nH\\\"older) continuity of $\\mathcal G$. Key in the proof of the present\nexpression rate bounds is the use of either super-expressive activations (e.g.\n[Yarotski: Elementary superexpressive activations, Int. Conf. on ML, 2021],\n[Shen, Yang and Zhang: Neural network approximation: Three hidden layers are\nenough, Neural Networks, 2021], and the references there) which are inspired by\nthe Kolmogorov superposition theorem, or of nonstandard NN architectures with\nstandard (ReLU) activations as recently proposed in [Zhang, Shen and Yang:\nNeural Network Architecture Beyond Width and Depth, Adv. in Neural Inf. Proc.\nSys., 2022]. We illustrate the abstract results by approximation rate bounds\nfor emulation of a) solution operators for parametric elliptic variational\ninequalities, and b) Lipschitz maps of Hilbert-Schmidt operators.\n","authors":["Christoph Schwab","Andreas Stein","Jakob Zech"],"pdf_url":"https://arxiv.org/pdf/2307.09835v1.pdf","comment":"31 pages"},{"id":"http://arxiv.org/abs/2307.09829v1","updated":"2023-07-19T08:34:25Z","published":"2023-07-19T08:34:25Z","title":"What do neural networks learn in image classification? A frequency\n shortcut perspective","summary":" Frequency analysis is useful for understanding the mechanisms of\nrepresentation learning in neural networks (NNs). Most research in this area\nfocuses on the learning dynamics of NNs for regression tasks, while little for\nclassification. This study empirically investigates the latter and expands the\nunderstanding of frequency shortcuts. First, we perform experiments on\nsynthetic datasets, designed to have a bias in different frequency bands. Our\nresults demonstrate that NNs tend to find simple solutions for classification,\nand what they learn first during training depends on the most distinctive\nfrequency characteristics, which can be either low- or high-frequencies.\nSecond, we confirm this phenomenon on natural images. We propose a metric to\nmeasure class-wise frequency characteristics and a method to identify frequency\nshortcuts. The results show that frequency shortcuts can be texture-based or\nshape-based, depending on what best simplifies the objective. Third, we\nvalidate the transferability of frequency shortcuts on out-of-distribution\n(OOD) test sets. Our results suggest that frequency shortcuts can be\ntransferred across datasets and cannot be fully avoided by larger model\ncapacity and data augmentation. 
We recommend that future research should focus\non effective training schemes mitigating frequency shortcut learning.\n","authors":["Shunxin Wang","Raymond Veldhuis","Christoph Brune","Nicola Strisciuglio"],"pdf_url":"https://arxiv.org/pdf/2307.09829v1.pdf","comment":"Accepted at ICCV2023"},{"id":"http://arxiv.org/abs/2307.09823v1","updated":"2023-07-19T08:21:01Z","published":"2023-07-19T08:21:01Z","title":"Multi-modal Learning based Prediction for Disease","summary":" Non alcoholic fatty liver disease (NAFLD) is the most common cause of chronic\nliver disease, which can be predicted accurately to prevent advanced fibrosis\nand cirrhosis. While, a liver biopsy, the gold standard for NAFLD diagnosis, is\ninvasive, expensive, and prone to sampling errors. Therefore, non-invasive\nstudies are extremely promising, yet they are still in their infancy due to the\nlack of comprehensive research data and intelligent methods for multi-modal\ndata. This paper proposes a NAFLD diagnosis system (DeepFLDDiag) combining a\ncomprehensive clinical dataset (FLDData) and a multi-modal learning based NAFLD\nprediction method (DeepFLD). The dataset includes over 6000 participants\nphysical examinations, laboratory and imaging studies, extensive\nquestionnaires, and facial images of partial participants, which is\ncomprehensive and valuable for clinical studies. From the dataset, we\nquantitatively analyze and select clinical metadata that most contribute to\nNAFLD prediction. Furthermore, the proposed DeepFLD, a deep neural network\nmodel designed to predict NAFLD using multi-modal input, including metadata and\nfacial images, outperforms the approach that only uses metadata. Satisfactory\nperformance is also verified on other unseen datasets. Inspiringly, DeepFLD can\nachieve competitive results using only facial images as input rather than\nmetadata, paving the way for a more robust and simpler non-invasive NAFLD\ndiagnosis.\n","authors":["Yaran Chen","Xueyu Chen","Yu Han","Haoran Li","Dongbin Zhao","Jingzhong Li","Xu Wang"],"pdf_url":"https://arxiv.org/pdf/2307.09823v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09818v1","updated":"2023-07-19T08:06:37Z","published":"2023-07-19T08:06:37Z","title":"Deep unrolling Shrinkage Network for Dynamic MR imaging","summary":" Deep unrolling networks that utilize sparsity priors have achieved great\nsuccess in dynamic magnetic resonance (MR) imaging. The convolutional neural\nnetwork (CNN) is usually utilized to extract the transformed domain, and then\nthe soft thresholding (ST) operator is applied to the CNN-transformed data to\nenforce the sparsity priors. However, the ST operator is usually constrained to\nbe the same across all channels of the CNN-transformed data. In this paper, we\npropose a novel operator, called soft thresholding with channel attention\n(AST), that learns the threshold for each channel. In particular, we put\nforward a novel deep unrolling shrinkage network (DUS-Net) by unrolling the\nalternating direction method of multipliers (ADMM) for optimizing the\ntransformed $l_1$ norm dynamic MR reconstruction model. Experimental results on\nan open-access dynamic cine MR dataset demonstrate that the proposed DUS-Net\noutperforms the state-of-the-art methods. 
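The DUS-Net abstract above describes soft thresholding with channel attention (AST), which learns one threshold per channel of the CNN-transformed data. Below is a minimal PyTorch sketch of a per-channel learnable soft-thresholding layer; it omits the attention branch and is not the authors' AST implementation.

```python
import torch
import torch.nn as nn

class ChannelwiseSoftThreshold(nn.Module):
    """Soft thresholding with one learnable, non-negative threshold per channel.
    A rough sketch in the spirit of per-channel thresholds, not the paper's AST."""
    def __init__(self, num_channels: int):
        super().__init__()
        # parametrise thresholds through softplus so they stay positive
        self.raw_tau = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, ...); broadcast the threshold over spatial dims
        tau = nn.functional.softplus(self.raw_tau)
        tau = tau.view(1, -1, *([1] * (x.dim() - 2)))
        return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

# usage on CNN-transformed coefficients of shape (batch, channels, H, W)
layer = ChannelwiseSoftThreshold(num_channels=8)
coeffs = torch.randn(2, 8, 16, 16)
print(layer(coeffs).shape)
```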
The source code is available at\n\\url{https://github.com/yhao-z/DUS-Net}.\n","authors":["Yinghao Zhang","Xiaodi Li","Weihang Li","Yue Hu"],"pdf_url":"https://arxiv.org/pdf/2307.09818v1.pdf","comment":"5 pages,3 figures,2 tables"},{"id":"http://arxiv.org/abs/2307.09816v1","updated":"2023-07-19T08:05:46Z","published":"2023-07-19T08:05:46Z","title":"Manifold Learning with Sparse Regularised Optimal Transport","summary":" Manifold learning is a central task in modern statistics and data science.\nMany datasets (cells, documents, images, molecules) can be represented as point\nclouds embedded in a high dimensional ambient space, however the degrees of\nfreedom intrinsic to the data are usually far fewer than the number of ambient\ndimensions. The task of detecting a latent manifold along which the data are\nembedded is a prerequisite for a wide family of downstream analyses. Real-world\ndatasets are subject to noisy observations and sampling, so that distilling\ninformation about the underlying manifold is a major challenge. We propose a\nmethod for manifold learning that utilises a symmetric version of optimal\ntransport with a quadratic regularisation that constructs a sparse and adaptive\naffinity matrix, that can be interpreted as a generalisation of the\nbistochastic kernel normalisation. We prove that the resulting kernel is\nconsistent with a Laplace-type operator in the continuous limit, establish\nrobustness to heteroskedastic noise and exhibit these results in simulations.\nWe identify a highly efficient computational scheme for computing this optimal\ntransport for discrete data and demonstrate that it outperforms competing\nmethods in a set of examples.\n","authors":["Stephen Zhang","Gilles Mordant","Tetsuya Matsumoto","Geoffrey Schiebinger"],"pdf_url":"https://arxiv.org/pdf/2307.09816v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09810v1","updated":"2023-07-19T07:58:21Z","published":"2023-07-19T07:58:21Z","title":"GenKL: An Iterative Framework for Resolving Label Ambiguity and Label\n Non-conformity in Web Images Via a New Generalized KL Divergence","summary":" Web image datasets curated online inherently contain ambiguous\nin-distribution (ID) instances and out-of-distribution (OOD) instances, which\nwe collectively call non-conforming (NC) instances. In many recent approaches\nfor mitigating the negative effects of NC instances, the core implicit\nassumption is that the NC instances can be found via entropy maximization. For\n\"entropy\" to be well-defined, we are interpreting the output prediction vector\nof an instance as the parameter vector of a multinomial random variable, with\nrespect to some trained model with a softmax output layer. Hence, entropy\nmaximization is based on the idealized assumption that NC instances have\npredictions that are \"almost\" uniformly distributed. However, in real-world web\nimage datasets, there are numerous NC instances whose predictions are far from\nbeing uniformly distributed. To tackle the limitation of entropy maximization,\nwe propose $(\\alpha, \\beta)$-generalized KL divergence,\n$\\mathcal{D}_{\\text{KL}}^{\\alpha, \\beta}(p\\|q)$, which can be used to identify\nsignificantly more NC instances. 
Theoretical properties of\n$\\mathcal{D}_{\\text{KL}}^{\\alpha, \\beta}(p\\|q)$ are proven, and we also show\nempirically that a simple use of $\\mathcal{D}_{\\text{KL}}^{\\alpha,\n\\beta}(p\\|q)$ outperforms all baselines on the NC instance identification task.\nBuilding upon $(\\alpha,\\beta)$-generalized KL divergence, we also introduce a\nnew iterative training framework, GenKL, that identifies and relabels NC\ninstances. When evaluated on three web image datasets, Clothing1M,\nFood101/Food101N, and mini WebVision 1.0, we achieved new state-of-the-art\nclassification accuracies: $81.34\\%$, $85.73\\%$ and $78.99\\%$/$92.54\\%$\n(top-1/top-5), respectively.\n","authors":["Xia Huang","Kai Fong Ernest Chong"],"pdf_url":"https://arxiv.org/pdf/2307.09810v1.pdf","comment":"Published (with open access) at International Journal of Computer\n Vision (IJCV, 2023). 25 pages, 8 figures. Code is available at:\n https://github.com/codetopaper/GenKL"},{"id":"http://arxiv.org/abs/2307.09801v1","updated":"2023-07-19T07:40:51Z","published":"2023-07-19T07:40:51Z","title":"Graph Federated Learning Based on the Decentralized Framework","summary":" Graph learning has a wide range of applications in many scenarios, which\nrequire more need for data privacy. Federated learning is an emerging\ndistributed machine learning approach that leverages data from individual\ndevices or data centers to improve the accuracy and generalization of the\nmodel, while also protecting the privacy of user data. Graph-federated learning\nis mainly based on the classical federated learning framework i.e., the\nClient-Server framework. However, the Client-Server framework faces problems\nsuch as a single point of failure of the central server and poor scalability of\nnetwork topology. First, we introduce the decentralized framework to\ngraph-federated learning. Second, determine the confidence among nodes based on\nthe similarity of data among nodes, subsequently, the gradient information is\nthen aggregated by linear weighting based on confidence. Finally, the proposed\nmethod is compared with FedAvg, Fedprox, GCFL, and GCFL+ to verify the\neffectiveness of the proposed method. Experiments demonstrate that the proposed\nmethod outperforms other methods.\n","authors":["Peilin Liu","Yanni Tang","Mingyue Zhang","Wu Chen"],"pdf_url":"https://arxiv.org/pdf/2307.09801v1.pdf","comment":"12 pages, 4 figures, 4 tables"},{"id":"http://arxiv.org/abs/2307.09797v1","updated":"2023-07-19T07:31:37Z","published":"2023-07-19T07:31:37Z","title":"Probabilistic Forecasting with Coherent Aggregation","summary":" Obtaining accurate probabilistic forecasts while respecting hierarchical\ninformation is an important operational challenge in many applications, perhaps\nmost obviously in energy management, supply chain planning, and resource\nallocation. The basic challenge, especially for multivariate forecasting, is\nthat forecasts are often required to be coherent with respect to the\nhierarchical structure. In this paper, we propose a new model which leverages a\nfactor model structure to produce coherent forecasts by construction. 
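The coherent-forecasting abstract above builds coherence into the model: every aggregate is a fixed linear map of the base-level series, so aggregating samples drawn from a shared factor model is coherent by construction. A small NumPy sketch with a hypothetical three-aggregate hierarchy and Gaussian factors illustrates the point:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bottom, n_factors, n_samples = 4, 2, 1000

# hierarchy: total = b0+b1+b2+b3, plus two groups (b0+b1) and (b2+b3)
S = np.array([[1, 1, 1, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])

# hypothetical factor model: bottom series = loadings @ factors + noise
loadings = rng.normal(size=(n_bottom, n_factors))
factors = rng.normal(size=(n_factors, n_samples))      # shared factor draws
noise = 0.1 * rng.normal(size=(n_bottom, n_samples))
bottom_samples = loadings @ factors + noise            # (n_bottom, n_samples)

# aggregating the *samples* is coherent with the hierarchy by construction,
# because each aggregate is just a fixed linear map of the bottom-level draws
aggregate_samples = S @ bottom_samples
assert np.allclose(aggregate_samples[0], bottom_samples.sum(axis=0))
print(aggregate_samples.mean(axis=1))  # sample-based forecasts for the aggregates
```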
This is a\nconsequence of a simple (exchangeability) observation: permuting\nbase-level series in the hierarchy does not change their aggregates.\nOur model uses a convolutional neural network to produce parameters for the\nfactors, their loadings and base-level distributions; it produces samples which\ncan be differentiated with respect to the model's parameters; and it can\ntherefore optimize for any sample-based loss function, including the Continuous\nRanked Probability Score and quantile losses. We can choose arbitrary\ncontinuous distributions for the factor and the base-level distributions. We\ncompare our method to two previous methods which can be optimized end-to-end,\nwhile enforcing coherent aggregation. Our model achieves significant\nimprovements: between $11.8-41.4\%$ on three hierarchical forecasting datasets.\nWe also analyze the influence of parameters in our model with respect to\nbase-level distribution and number of factors.\n","authors":["Geoffrey Négiar","Ruijun Ma","O. Nangba Meetei","Mengfei Cao","Michael W. Mahoney"],"pdf_url":"https://arxiv.org/pdf/2307.09797v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.05086v3","updated":"2023-07-19T07:31:35Z","published":"2023-02-10T07:08:13Z","title":"Making Substitute Models More Bayesian Can Enhance Transferability of\n Adversarial Examples","summary":" The transferability of adversarial examples across deep neural networks\n(DNNs) is the crux of many black-box attacks. Many prior efforts have been\ndevoted to improving the transferability via increasing the diversity in inputs\nof some substitute models. In this paper, by contrast, we opt for the diversity\nin substitute models and advocate to attack a Bayesian model for achieving\ndesirable transferability. Deriving from the Bayesian formulation, we develop a\nprincipled strategy for possible finetuning, which can be combined with many\noff-the-shelf Gaussian posterior approximations over DNN parameters. Extensive\nexperiments have been conducted to verify the effectiveness of our method, on\ncommon benchmark datasets, and the results demonstrate that our method\noutperforms recent state-of-the-arts by large margins (roughly 19% absolute\nincrease in average attack success rate on ImageNet), and, by combining with\nthese recent methods, further performance gain can be obtained. Our code:\nhttps://github.com/qizhangli/MoreBayesian-attack.\n","authors":["Qizhang Li","Yiwen Guo","Wangmeng Zuo","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2302.05086v3.pdf","comment":"Accepted by ICLR 2023, fix typos"},{"id":"http://arxiv.org/abs/2307.09796v1","updated":"2023-07-19T07:30:01Z","published":"2023-07-19T07:30:01Z","title":"Forecasting Early with Meta Learning","summary":" In the early observation period of a time series, there might be only a few\nhistoric observations available to learn a model. However, in cases where an\nexisting prior set of datasets is available, Meta learning methods can be\napplicable. In this paper, we devise a Meta learning method that exploits\nsamples from additional datasets and learns to augment time series through\nadversarial learning as an auxiliary task for the target dataset. Our model\n(FEML) is equipped with a shared Convolutional backbone that learns features\nfor varying length inputs from different datasets and has dataset specific\nheads to forecast for different output lengths. 
We show that FEML can meta\nlearn across datasets and by additionally learning on adversarial generated\nsamples as auxiliary samples for the target dataset, it can improve the\nforecasting performance compared to single task learning, and various solutions\nadapted from Joint learning, Multi-task learning and classic forecasting\nbaselines.\n","authors":["Shayan Jawed","Kiran Madhusudhanan","Vijaya Krishna Yalavarthi","Lars Schmidt-Thieme"],"pdf_url":"https://arxiv.org/pdf/2307.09796v1.pdf","comment":"IJCNN 2023"},{"id":"http://arxiv.org/abs/2307.09795v1","updated":"2023-07-19T07:29:14Z","published":"2023-07-19T07:29:14Z","title":"From West to East: Who can understand the music of the others better?","summary":" Recent developments in MIR have led to several benchmark deep learning models\nwhose embeddings can be used for a variety of downstream tasks. At the same\ntime, the vast majority of these models have been trained on Western pop/rock\nmusic and related styles. This leads to research questions on whether these\nmodels can be used to learn representations for different music cultures and\nstyles, or whether we can build similar music audio embedding models trained on\ndata from different cultures or styles. To that end, we leverage transfer\nlearning methods to derive insights about the similarities between the\ndifferent music cultures to which the data belongs to. We use two Western music\ndatasets, two traditional/folk datasets coming from eastern Mediterranean\ncultures, and two datasets belonging to Indian art music. Three deep audio\nembedding models are trained and transferred across domains, including two\nCNN-based and a Transformer-based architecture, to perform auto-tagging for\neach target domain dataset. Experimental results show that competitive\nperformance is achieved in all domains via transfer learning, while the best\nsource dataset varies for each music culture. The implementation and the\ntrained models are both provided in a public repository.\n","authors":["Charilaos Papaioannou","Emmanouil Benetos","Alexandros Potamianos"],"pdf_url":"https://arxiv.org/pdf/2307.09795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09792v1","updated":"2023-07-19T07:17:06Z","published":"2023-07-19T07:17:06Z","title":"A Note on Hardness of Computing Recursive Teaching Dimension","summary":" In this short note, we show that the problem of computing the recursive\nteaching dimension (RTD) for a concept class (given explicitly as input)\nrequires $n^{\\Omega(\\log n)}$-time, assuming the exponential time hypothesis\n(ETH). This matches the running time $n^{O(\\log n)}$ of the brute-force\nalgorithm for the problem.\n","authors":["Pasin Manurangsi"],"pdf_url":"https://arxiv.org/pdf/2307.09792v1.pdf","comment":"To appear in IPL"},{"id":"http://arxiv.org/abs/2307.09782v1","updated":"2023-07-19T06:58:03Z","published":"2023-07-19T06:58:03Z","title":"ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization\n Using Floating-Point Formats","summary":" In the complex domain of large language models (LLMs), striking a balance\nbetween computational efficiency and maintaining model quality is a formidable\nchallenge. Navigating the inherent limitations of uniform quantization,\nparticularly when dealing with outliers, and motivated by the launch of\nNVIDIA's H100 hardware, this study delves into the viability of floating-point\n(FP) quantization, particularly focusing on FP8 and FP4, as a potential\nsolution. 
Our comprehensive investigation reveals that for LLMs, FP8 activation\nconsistently outshines its integer (INT8) equivalent, with the performance edge\nbecoming more noticeable in models possessing parameters beyond one billion.\nFor weight quantization, our findings indicate that FP4 exhibits comparable, if\nnot superior, performance to INT4, simplifying deployment on FP-supported\nhardware like H100. To mitigate the overhead from precision alignment caused by\nthe disparity between weights and activations, we propose two scaling\nconstraints for weight quantization that negligibly impact the performance\ncompared to the standard W4A8 model. We additionally enhance our quantization\nmethods by integrating the Low Rank Compensation (LoRC) strategy, yielding\nimprovements especially in smaller models. The results of our investigation\nemphasize the immense potential of FP quantization for LLMs, paving the way for\nhigh-efficiency deployment in resource-limited settings.\n","authors":["Xiaoxia Wu","Zhewei Yao","Yuxiong He"],"pdf_url":"https://arxiv.org/pdf/2307.09782v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09781v1","updated":"2023-07-19T06:56:07Z","published":"2023-07-19T06:56:07Z","title":"Text2Layer: Layered Image Generation using Latent Diffusion Model","summary":" Layer compositing is one of the most popular image editing workflows among\nboth amateurs and professionals. Motivated by the success of diffusion models,\nwe explore layer compositing from a layered image generation perspective.\nInstead of generating an image, we propose to generate background, foreground,\nlayer mask, and the composed image simultaneously. To achieve layered image\ngeneration, we train an autoencoder that is able to reconstruct layered images\nand train diffusion models on the latent representation. One benefit of the\nproposed problem is to enable better compositing workflows in addition to the\nhigh-quality image output. Another benefit is producing higher-quality layer\nmasks compared to masks produced by a separate step of image segmentation.\nExperimental results show that the proposed method is able to generate\nhigh-quality layered images and initiates a benchmark for future work.\n","authors":["Xinyang Zhang","Wentian Zhao","Xin Lu","Jeff Chien"],"pdf_url":"https://arxiv.org/pdf/2307.09781v1.pdf","comment":"Preprint. Work in progress"},{"id":"http://arxiv.org/abs/2212.01692v4","updated":"2023-07-19T06:48:35Z","published":"2022-12-03T21:14:32Z","title":"Can In-context Learners Learn a Reasoning Concept from Demonstrations?","summary":" Language models exhibit an emergent ability to learn a new task from a small\nnumber of input-output demonstrations. However, recent work shows that\nin-context learners largely rely on their pre-trained knowledge, such as the\nsentiment of the labels, instead of learning new associations from the input.\nWe argue that the commonly-used few-shot evaluation using a random selection of\nin-context demonstrations can not disentangle models' reliance on such biases,\nas most of the randomly-selected demonstrations do not present relations\ninformative for prediction beyond exposing the task's input-output\ndistribution.\n Therefore, to evaluate models' in-context learning ability independent of\nmodels' memory, we introduce a Concept-sharing few-shot learning method\nchoosing the demonstrations that share an underlying concept with the predicted\nsample. 
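The concept-sharing few-shot abstract above selects demonstrations that share an underlying concept with the predicted sample. A minimal sketch of such a selection step, assuming concept annotations are available as sets attached to each pool example (the field names and helper are invented for illustration):

```python
import random

def concept_sharing_demonstrations(pool, target_concepts, k=4, seed=0):
    """Pick k demonstrations whose annotated concepts overlap with the target's.
    `pool` is a list of dicts with 'input', 'output' and a set of 'concepts';
    these field names are hypothetical, not from the paper."""
    rng = random.Random(seed)
    sharing = [ex for ex in pool if ex["concepts"] & set(target_concepts)]
    rng.shuffle(sharing)
    return sharing[:k]

pool = [
    {"input": "2 apples + 3 apples?", "output": "5", "concepts": {"addition"}},
    {"input": "10 - 4?", "output": "6", "concepts": {"subtraction"}},
    {"input": "3 + 9?", "output": "12", "concepts": {"addition"}},
]
demos = concept_sharing_demonstrations(pool, target_concepts=["addition"], k=2)
prompt = "\n".join(f"Q: {d['input']}\nA: {d['output']}" for d in demos)
print(prompt + "\nQ: 7 + 5?\nA:")
```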
We extract a set of such concepts from available human explanations and\nmeasure how much models can benefit from presenting these concepts in few-shot\ndemonstrations.\n We find that most of the recent in-context learners can not consistently\nbenefit from the demonstrated concepts, irrespective of the model size.\nHowever, we note that T0 models are more sensitive to exhibited concepts,\nbenefiting from concept-sharing demonstrations in 7 out of 8 evaluation\nscenarios.\n","authors":["Michal Štefánik","Marek Kadlčík"],"pdf_url":"https://arxiv.org/pdf/2212.01692v4.pdf","comment":"Awarded Best Paper at ACL 2023 Natural Language Reasoning and\n Structured Explanations (NLRSE) workshop"},{"id":"http://arxiv.org/abs/2307.09779v1","updated":"2023-07-19T06:48:33Z","published":"2023-07-19T06:48:33Z","title":"Beyond Single-Feature Importance with ICECREAM","summary":" Which set of features was responsible for a certain output of a machine\nlearning model? Which components caused the failure of a cloud computing\napplication? These are just two examples of questions we are addressing in this\nwork by Identifying Coalition-based Explanations for Common and Rare Events in\nAny Model (ICECREAM). Specifically, we propose an information-theoretic\nquantitative measure for the influence of a coalition of variables on the\ndistribution of a target variable. This allows us to identify which set of\nfactors is essential to obtain a certain outcome, as opposed to\nwell-established explainability and causal contribution analysis methods which\ncan assign contributions only to individual factors and rank them by their\nimportance. In experiments with synthetic and real-world data, we show that\nICECREAM outperforms state-of-the-art methods for explainability and root cause\nanalysis, and achieves impressive accuracy in both tasks.\n","authors":["Michael Oesterle","Patrick Blöbaum","Atalanti A. Mastakouri","Elke Kirschbaum"],"pdf_url":"https://arxiv.org/pdf/2307.09779v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.03638v4","updated":"2023-07-19T06:43:10Z","published":"2022-06-08T01:50:08Z","title":"Alternately Optimized Graph Neural Networks","summary":" Graph Neural Networks (GNNs) have greatly advanced the semi-supervised node\nclassification task on graphs. The majority of existing GNNs are trained in an\nend-to-end manner that can be viewed as tackling a bi-level optimization\nproblem. This process is often inefficient in computation and memory usage. In\nthis work, we propose a new optimization framework for semi-supervised learning\non graphs. The proposed framework can be conveniently solved by the alternating\noptimization algorithms, resulting in significantly improved efficiency.\nExtensive experiments demonstrate that the proposed method can achieve\ncomparable or better performance with state-of-the-art baselines while it has\nsignificantly better computation and memory efficiency.\n","authors":["Haoyu Han","Xiaorui Liu","Haitao Mao","MohamadAli Torkamani","Feng Shi","Victor Lee","Jiliang Tang"],"pdf_url":"https://arxiv.org/pdf/2206.03638v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09771v1","updated":"2023-07-19T06:17:16Z","published":"2023-07-19T06:17:16Z","title":"A Novel Spatial-Temporal Variational Quantum Circuit to Enable Deep\n Learning on NISQ Devices","summary":" Quantum computing presents a promising approach for machine learning with its\ncapability for extremely parallel computation in high-dimension through\nsuperposition and entanglement. 
Despite its potential, existing quantum\nlearning algorithms, such as Variational Quantum Circuits(VQCs), face\nchallenges in handling more complex datasets, particularly those that are not\nlinearly separable. What's more, it encounters the deployability issue, making\nthe learning models suffer a drastic accuracy drop after deploying them to the\nactual quantum devices. To overcome these limitations, this paper proposes a\nnovel spatial-temporal design, namely ST-VQC, to integrate non-linearity in\nquantum learning and improve the robustness of the learning model to noise.\nSpecifically, ST-VQC can extract spatial features via a novel block-based\nencoding quantum sub-circuit coupled with a layer-wise computation quantum\nsub-circuit to enable temporal-wise deep learning. Additionally, a SWAP-Free\nphysical circuit design is devised to improve robustness. These designs bring a\nnumber of hyperparameters. After a systematic analysis of the design space for\neach design component, an automated optimization framework is proposed to\ngenerate the ST-VQC quantum circuit. The proposed ST-VQC has been evaluated on\ntwo IBM quantum processors, ibm_cairo with 27 qubits and ibmq_lima with 7\nqubits to assess its effectiveness. The results of the evaluation on the\nstandard dataset for binary classification show that ST-VQC can achieve over\n30% accuracy improvement compared with existing VQCs on actual quantum\ncomputers. Moreover, on a non-linear synthetic dataset, the ST-VQC outperforms\na linear classifier by 27.9%, while the linear classifier using classical\ncomputing outperforms the existing VQC by 15.58%.\n","authors":["Jinyang Li","Zhepeng Wang","Zhirui Hu","Prasanna Date","Ang Li","Weiwen Jiang"],"pdf_url":"https://arxiv.org/pdf/2307.09771v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.11434v2","updated":"2023-07-19T06:17:10Z","published":"2022-03-22T03:13:39Z","title":"Non-linear Embeddings in Hilbert Simplex Geometry","summary":" A key technique of machine learning and computer vision is to embed discrete\nweighted graphs into continuous spaces for further downstream processing.\nEmbedding discrete hierarchical structures in hyperbolic geometry has proven\nvery successful since it was shown that any weighted tree can be embedded in\nthat geometry with arbitrary low distortion. Various optimization methods for\nhyperbolic embeddings based on common models of hyperbolic geometry have been\nstudied. In this paper, we consider Hilbert geometry for the standard simplex\nwhich is isometric to a vector space equipped with the variation polytope norm.\nWe study the representation power of this Hilbert simplex geometry by embedding\ndistance matrices of graphs. Our findings demonstrate that Hilbert simplex\ngeometry is competitive to alternative geometries such as the Poincar\\'e\nhyperbolic ball or the Euclidean geometry for embedding tasks while being fast\nand numerically robust.\n","authors":["Frank Nielsen","Ke Sun"],"pdf_url":"https://arxiv.org/pdf/2203.11434v2.pdf","comment":"19 pages, 11 figures"},{"id":"http://arxiv.org/abs/2307.09768v1","updated":"2023-07-19T06:05:33Z","published":"2023-07-19T06:05:33Z","title":"How Curvature Enhance the Adaptation Power of Framelet GCNs","summary":" Graph neural network (GNN) has been demonstrated powerful in modeling\ngraph-structured data. 
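The Hilbert simplex geometry abstract above embeds graph distance matrices into the standard simplex equipped with the Hilbert metric. A small NumPy helper for the usual closed form of that distance, d(p, q) = log(max_i p_i/q_i) - log(min_i p_i/q_i), assuming strictly positive points of the simplex:

```python
import numpy as np

def hilbert_simplex_distance(p, q, eps=1e-12):
    """Hilbert metric between two points of the open probability simplex:
    d(p, q) = log(max_i p_i/q_i) - log(min_i p_i/q_i)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    log_ratio = np.log(p) - np.log(q)
    return float(log_ratio.max() - log_ratio.min())

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.4, 0.3])
print(hilbert_simplex_distance(p, q))   # > 0
print(hilbert_simplex_distance(p, p))   # 0: identical points are at distance zero
```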
However, despite many successful cases of applying GNNs\nto various graph classification and prediction tasks, whether the graph\ngeometrical information has been fully exploited to enhance the learning\nperformance of GNNs is not yet well understood. This paper introduces a new\napproach to enhance GNN by discrete graph Ricci curvature. Specifically, the\ngraph Ricci curvature defined on the edges of a graph measures how difficult\nthe information transits on one edge from one node to another based on their\nneighborhoods. Motivated by the geometric analogy of Ricci curvature in the\ngraph setting, we prove that by inserting the curvature information with\ndifferent carefully designed transformation function $\\zeta$, several known\ncomputational issues in GNN such as over-smoothing can be alleviated in our\nproposed model. Furthermore, we verified that edges with very positive Ricci\ncurvature (i.e., $\\kappa_{i,j} \\approx 1$) are preferred to be dropped to\nenhance model's adaption to heterophily graph and one curvature based graph\nedge drop algorithm is proposed. Comprehensive experiments show that our\ncurvature-based GNN model outperforms the state-of-the-art baselines in both\nhomophily and heterophily graph datasets, indicating the effectiveness of\ninvolving graph geometric information in GNNs.\n","authors":["Dai Shi","Yi Guo","Zhiqi Shao","Junbin Gao"],"pdf_url":"https://arxiv.org/pdf/2307.09768v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.14048v2","updated":"2023-07-19T06:02:38Z","published":"2023-06-24T20:11:14Z","title":"H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large\n Language Models","summary":" Large Language Models (LLMs), despite their recent impressive\naccomplishments, are notably cost-prohibitive to deploy, particularly for\napplications involving long-content generation, such as dialogue systems and\nstory writing. Often, a large amount of transient state information, referred\nto as the KV cache, is stored in GPU memory in addition to model parameters,\nscaling linearly with the sequence length and batch size. In this paper, we\nintroduce a novel approach for implementing the KV cache which significantly\nreduces its memory footprint. Our approach is based on the noteworthy\nobservation that a small portion of tokens contributes most of the value when\ncomputing attention scores. We call these tokens Heavy Hitters (H$_2$). Through\na comprehensive investigation, we find that (i) the emergence of H$_2$ is\nnatural and strongly correlates with the frequent co-occurrence of tokens in\nthe text, and (ii) removing them results in significant performance\ndegradation. Based on these insights, we propose Heavy Hitter Oracle (H$_2$O),\na KV cache eviction policy that dynamically retains a balance of recent and\nH$_2$ tokens. We formulate the KV cache eviction as a dynamic submodular\nproblem and prove (under mild assumptions) a theoretical guarantee for our\nnovel eviction algorithm which could help guide future work. We validate the\naccuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of\ntasks. Our implementation of H$_2$O with 20% heavy hitters improves the\nthroughput over three leading inference systems DeepSpeed Zero-Inference,\nHugging Face Accelerate, and FlexGen by up to 29$\\times$, 29$\\times$, and\n3$\\times$ on OPT-6.7B and OPT-30B. With the same batch size, H2O can reduce the\nlatency by up to 1.9$\\times$. 
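The H$_2$O abstract above evicts KV-cache entries while retaining a balance of recent tokens and "heavy hitter" tokens that carry large accumulated attention. The sketch below is a simplified, stand-alone version of that eviction idea over per-token attention mass, not the paper's implementation:

```python
import numpy as np

def h2_style_eviction(acc_attention, budget, recent=8):
    """Return indices of cache entries to keep: always the `recent` newest
    tokens, plus the highest accumulated-attention ("heavy hitter") tokens
    among the older ones, up to `budget` entries in total."""
    n = len(acc_attention)
    keep_recent = list(range(max(0, n - recent), n))
    older = np.arange(0, max(0, n - recent))
    n_heavy = max(0, budget - len(keep_recent))
    heavy = older[np.argsort(acc_attention[older])[::-1][:n_heavy]]
    return sorted(set(keep_recent) | set(heavy.tolist()))

# toy cache of 16 tokens with accumulated attention mass per token
scores = np.random.default_rng(0).random(16)
print(h2_style_eviction(scores, budget=10, recent=4))
```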
The code is available at\nhttps://github.com/FMInference/H2O.\n","authors":["Zhenyu Zhang","Ying Sheng","Tianyi Zhou","Tianlong Chen","Lianmin Zheng","Ruisi Cai","Zhao Song","Yuandong Tian","Christopher Ré","Clark Barrett","Zhangyang Wang","Beidi Chen"],"pdf_url":"https://arxiv.org/pdf/2306.14048v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09767v1","updated":"2023-07-19T05:58:21Z","published":"2023-07-19T05:58:21Z","title":"Sig-Splines: universal approximation and convex calibration of time\n series generative models","summary":" We propose a novel generative model for multivariate discrete-time time\nseries data. Drawing inspiration from the construction of neural spline flows,\nour algorithm incorporates linear transformations and the signature transform\nas a seamless substitution for traditional neural networks. This approach\nenables us to achieve not only the universality property inherent in neural\nnetworks but also introduces convexity in the model's parameters.\n","authors":["Magnus Wiese","Phillip Murray","Ralf Korn"],"pdf_url":"https://arxiv.org/pdf/2307.09767v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08621v2","updated":"2023-07-19T05:56:42Z","published":"2023-07-17T16:40:01Z","title":"Retentive Network: A Successor to Transformer for Large Language Models","summary":" In this work, we propose Retentive Network (RetNet) as a foundation\narchitecture for large language models, simultaneously achieving training\nparallelism, low-cost inference, and good performance. We theoretically derive\nthe connection between recurrence and attention. Then we propose the retention\nmechanism for sequence modeling, which supports three computation paradigms,\ni.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel\nrepresentation allows for training parallelism. The recurrent representation\nenables low-cost $O(1)$ inference, which improves decoding throughput, latency,\nand GPU memory without sacrificing performance. The chunkwise recurrent\nrepresentation facilitates efficient long-sequence modeling with linear\ncomplexity, where each chunk is encoded parallelly while recurrently\nsummarizing the chunks. Experimental results on language modeling show that\nRetNet achieves favorable scaling results, parallel training, low-cost\ndeployment, and efficient inference. The intriguing properties make RetNet a\nstrong successor to Transformer for large language models. Code will be\navailable at https://aka.ms/retnet.\n","authors":["Yutao Sun","Li Dong","Shaohan Huang","Shuming Ma","Yuqing Xia","Jilong Xue","Jianyong Wang","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2307.08621v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.02918v3","updated":"2023-07-19T05:51:00Z","published":"2023-03-06T06:28:20Z","title":"Graph Positional Encoding via Random Feature Propagation","summary":" Two main families of node feature augmentation schemes have been explored for\nenhancing GNNs: random features and spectral positional encoding. Surprisingly,\nhowever, there is still no clear understanding of the relation between these\ntwo augmentation schemes. Here we propose a novel family of positional encoding\nschemes which draws a link between the above two approaches and improves over\nboth. The new approach, named Random Feature Propagation (RFP), is inspired by\nthe power iteration method and its generalizations. 
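The RFP abstract above builds positional encodings by concatenating intermediate power-iteration steps applied to random node features. A minimal NumPy sketch with a fixed symmetric-normalised propagation operator (the paper also allows learned, graph-dependent operators):

```python
import numpy as np

def random_feature_propagation(adj, n_rand=8, n_steps=4, seed=0):
    """Power-iteration style positional features: start from random node
    features and concatenate the normalised intermediate propagation steps."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    deg = adj.sum(1)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    a_hat = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
    x = rng.normal(size=(n, n_rand))
    steps = [x]
    for _ in range(n_steps):
        x = a_hat @ x
        x = x / (np.linalg.norm(x, axis=0, keepdims=True) + 1e-12)  # column-normalise
        steps.append(x)
    return np.concatenate(steps, axis=1)   # (n_nodes, n_rand * (n_steps + 1))

adj = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
print(random_feature_propagation(adj).shape)   # (4, 40)
```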
It concatenates several\nintermediate steps of an iterative algorithm for computing the dominant\neigenvectors of a propagation matrix, starting from random node features.\nNotably, these propagation steps are based on graph-dependent propagation\noperators that can be either predefined or learned. We explore the theoretical\nand empirical benefits of RFP. First, we provide theoretical justifications for\nusing random features, for incorporating early propagation steps, and for using\nmultiple random initializations. Then, we empirically demonstrate that RFP\nsignificantly outperforms both spectral PE and random features in multiple node\nclassification and graph classification benchmarks.\n","authors":["Moshe Eliasof","Fabrizio Frasca","Beatrice Bevilacqua","Eran Treister","Gal Chechik","Haggai Maron"],"pdf_url":"https://arxiv.org/pdf/2303.02918v3.pdf","comment":"ICML 2023"},{"id":"http://arxiv.org/abs/2307.09762v1","updated":"2023-07-19T05:45:05Z","published":"2023-07-19T05:45:05Z","title":"Reinforcing POD based model reduction techniques in reaction-diffusion\n complex networks using stochastic filtering and pattern recognition","summary":" Complex networks are used to model many real-world systems. However, the\ndimensionality of these systems can make them challenging to analyze.\nDimensionality reduction techniques like POD can be used in such cases.\nHowever, these models are susceptible to perturbations in the input data. We\npropose an algorithmic framework that combines techniques from pattern\nrecognition (PR) and stochastic filtering theory to enhance the output of such\nmodels. The results of our study show that our method can improve the accuracy\nof the surrogate model under perturbed inputs. Deep Neural Networks (DNNs) are\nsusceptible to adversarial attacks. However, recent research has revealed that\nneural Ordinary Differential Equations (ODEs) exhibit robustness in specific\napplications. We benchmark our algorithmic framework with a Neural ODE-based\napproach as a reference.\n","authors":["Abhishek Ajayakumar","Soumyendu Raha"],"pdf_url":"https://arxiv.org/pdf/2307.09762v1.pdf","comment":"19 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.04603v3","updated":"2023-07-19T05:43:44Z","published":"2023-07-07T09:01:42Z","title":"Solvent: A Framework for Protein Folding","summary":" Consistency and reliability are crucial for conducting AI research. Many\nfamous research fields, such as object detection, have been compared and\nvalidated with solid benchmark frameworks. After AlphaFold2, the protein\nfolding task has entered a new phase, and many methods have been proposed based\non the components of AlphaFold2. A unified research framework for protein\nfolding, containing implementations and benchmarks, is important to consistently\nand fairly compare various approaches. To achieve this, we present Solvent, a\nprotein folding framework that supports significant components of\nstate-of-the-art models in the manner of an off-the-shelf interface. Solvent\ncontains different models implemented in a unified codebase and supports\ntraining and evaluation for defined models on the same dataset. We benchmark\nwell-known algorithms and their components and provide experiments that give\nhelpful insights into the protein structure modeling field. We hope that\nSolvent will increase the reliability and consistency of proposed models and\nimprove efficiency in both speed and cost, accelerating research on protein\nfolding modeling. 
The code is available at\nhttps://github.com/kakaobrain/solvent, and the project will continue to be\ndeveloped.\n","authors":["Jaemyung Lee","Kyeongtak Han","Jaehoon Kim","Hasun Yu","Youhan Lee"],"pdf_url":"https://arxiv.org/pdf/2307.04603v3.pdf","comment":"preprint, 8pages"},{"id":"http://arxiv.org/abs/2307.09759v1","updated":"2023-07-19T05:41:40Z","published":"2023-07-19T05:41:40Z","title":"Constructing Extreme Learning Machines with zero Spectral Bias","summary":" The phenomena of Spectral Bias, where the higher frequency components of a\nfunction being learnt in a feedforward Artificial Neural Network (ANN) are seen\nto converge more slowly than the lower frequencies, is observed ubiquitously\nacross ANNs. This has created technology challenges in fields where resolution\nof higher frequencies is crucial, like in Physics Informed Neural Networks\n(PINNs). Extreme Learning Machines (ELMs) that obviate an iterative solution\nprocess which provides the theoretical basis of Spectral Bias (SB), should in\nprinciple be free of the same. This work verifies the reliability of this\nassumption, and shows that it is incorrect. However, the structure of ELMs\nmakes them naturally amenable to implementation of variants of Fourier Feature\nEmbeddings, which have been shown to mitigate SB in ANNs. This approach is\nimplemented and verified to completely eliminate SB, thus bringing into\nfeasibility the application of ELMs for practical problems like PINNs where\nresolution of higher frequencies is essential.\n","authors":["Kaumudi Joshi","Vukka Snigdha","Arya Kumar Bhattacharya"],"pdf_url":"https://arxiv.org/pdf/2307.09759v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.12239v2","updated":"2023-07-19T05:32:04Z","published":"2023-05-20T17:13:06Z","title":"Off-Policy Average Reward Actor-Critic with Deterministic Policy Search","summary":" The average reward criterion is relatively less studied as most existing\nworks in the Reinforcement Learning literature consider the discounted reward\ncriterion. There are few recent works that present on-policy average reward\nactor-critic algorithms, but average reward off-policy actor-critic is\nrelatively less explored. In this work, we present both on-policy and\noff-policy deterministic policy gradient theorems for the average reward\nperformance criterion. Using these theorems, we also present an Average Reward\nOff-Policy Deep Deterministic Policy Gradient (ARO-DDPG) Algorithm. We first\nshow asymptotic convergence analysis using the ODE-based method. Subsequently,\nwe provide a finite time analysis of the resulting stochastic approximation\nscheme with linear function approximator and obtain an $\\epsilon$-optimal\nstationary policy with a sample complexity of $\\Omega(\\epsilon^{-2.5})$. We\ncompare the average reward performance of our proposed ARO-DDPG algorithm and\nobserve better empirical performance compared to state-of-the-art on-policy\naverage reward actor-critic algorithms over MuJoCo-based environments.\n","authors":["Naman Saxena","Subhojyoti Khastigir","Shishir Kolathaya","Shalabh Bhatnagar"],"pdf_url":"https://arxiv.org/pdf/2305.12239v2.pdf","comment":"Accepted at ICML 2023"},{"id":"http://arxiv.org/abs/2208.06265v2","updated":"2023-07-19T05:08:06Z","published":"2022-08-10T08:28:46Z","title":"Trustworthy Recommender Systems","summary":" Recommender systems (RSs) aim to help users to effectively retrieve items of\ntheir interests from a large catalogue. 
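The spectral-bias abstract above combines Extreme Learning Machines with Fourier feature embeddings. A compact sketch of that recipe, assuming a random sinusoidal feature map followed by the usual ELM closed-form (regularised least-squares) readout; the frequency scale and sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# target with a high-frequency component that iteratively trained nets fit slowly
x = np.linspace(0, 1, 400)[:, None]
y = np.sin(2 * np.pi * x) + 0.3 * np.sin(2 * np.pi * 25 * x)

# Fourier-feature "hidden layer": fixed random frequencies, no iterative training
B = rng.normal(scale=30.0, size=(1, 128))                  # hypothetical frequency scale
H = np.concatenate([np.sin(x @ B), np.cos(x @ B)], axis=1)

# ELM step: output weights from a single regularised least-squares solve
lam = 1e-6
W = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

pred = H @ W
print("train RMSE:", float(np.sqrt(np.mean((pred - y) ** 2))))
```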
For a quite long period of time,\nresearchers and practitioners have been focusing on developing accurate RSs.\nRecent years have witnessed an increasing number of threats to RSs, coming from\nattacks, system and user generated noise, system bias. As a result, it has\nbecome clear that a strict focus on RS accuracy is limited and the research\nmust consider other important factors, e.g., trustworthiness. For end users, a\ntrustworthy RS (TRS) should not only be accurate, but also transparent,\nunbiased and fair as well as robust to noise or attacks. These observations\nactually led to a paradigm shift of the research on RSs: from accuracy-oriented\nRSs to TRSs. However, researchers lack a systematic overview and discussion of\nthe literature in this novel and fast developing field of TRSs. To this end, in\nthis paper, we provide an overview of TRSs, including a discussion of the\nmotivation and basic concepts of TRSs, a presentation of the challenges in\nbuilding TRSs, and a perspective on the future directions in this area. We also\nprovide a novel conceptual framework to support the construction of TRSs.\n","authors":["Shoujin Wang","Xiuzhen Zhang","Yan Wang","Huan Liu","Francesco Ricci"],"pdf_url":"https://arxiv.org/pdf/2208.06265v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.01646v2","updated":"2023-07-19T04:59:35Z","published":"2023-07-04T10:58:42Z","title":"SwinGNN: Rethinking Permutation Invariance in Diffusion Models for Graph\n Generation","summary":" Diffusion models based on permutation-equivariant networks can learn\npermutation-invariant distributions for graph data. However, in comparison to\ntheir non-invariant counterparts, we have found that these invariant models\nencounter greater learning challenges since 1) their effective target\ndistributions exhibit more modes; 2) their optimal one-step denoising scores\nare the score functions of Gaussian mixtures with more components. Motivated by\nthis analysis, we propose a non-invariant diffusion model, called\n$\\textit{SwinGNN}$, which employs an efficient edge-to-edge 2-WL message\npassing network and utilizes shifted window based self-attention inspired by\nSwinTransformers. Further, through systematic ablations, we identify several\ncritical training and sampling techniques that significantly improve the sample\nquality of graph generation. At last, we introduce a simple post-processing\ntrick, $\\textit{i.e.}$, randomly permuting the generated graphs, which provably\nconverts any graph generative model to a permutation-invariant one. Extensive\nexperiments on synthetic and real-world protein and molecule datasets show that\nour SwinGNN achieves state-of-the-art performances. Our code is released at\nhttps://github.com/qiyan98/SwinGNN.\n","authors":["Qi Yan","Zhengyang Liang","Yang Song","Renjie Liao","Lele Wang"],"pdf_url":"https://arxiv.org/pdf/2307.01646v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.03597v4","updated":"2023-07-19T04:52:33Z","published":"2022-06-07T21:30:58Z","title":"Meta-Learning Parameterized Skills","summary":" We propose a novel parameterized skill-learning algorithm that aims to learn\ntransferable parameterized skills and synthesize them into a new action space\nthat supports efficient learning in long-horizon tasks. We propose to leverage\noff-policy Meta-RL combined with a trajectory-centric smoothness term to learn\na set of parameterized skills. 
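The SwinGNN abstract above notes that randomly permuting generated graphs provably turns any graph generative model into a permutation-invariant one. The post-processing step itself is tiny; a NumPy sketch:

```python
import numpy as np

def randomly_permute_graph(adj, node_feats=None, seed=None):
    """Post-process a generated graph by relabelling its nodes with a uniform
    random permutation, applied consistently to the adjacency matrix and,
    if given, the node features."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(adj.shape[0])
    adj_perm = adj[np.ix_(perm, perm)]
    feats_perm = None if node_feats is None else node_feats[perm]
    return adj_perm, feats_perm

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
feats = np.array([[1.0], [2.0], [3.0]])
print(randomly_permute_graph(adj, feats, seed=0))
```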
Our agent can use these learned skills to\nconstruct a three-level hierarchical framework that models a\nTemporally-extended Parameterized Action Markov Decision Process. We\nempirically demonstrate that the proposed algorithms enable an agent to solve a\nset of difficult long-horizon (obstacle-course and robot manipulation) tasks.\n","authors":["Haotian Fu","Shangqun Yu","Saket Tiwari","Michael Littman","George Konidaris"],"pdf_url":"https://arxiv.org/pdf/2206.03597v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09742v1","updated":"2023-07-19T04:07:33Z","published":"2023-07-19T04:07:33Z","title":"Improved Distribution Matching for Dataset Condensation","summary":" Dataset Condensation aims to condense a large dataset into a smaller one\nwhile maintaining its ability to train a well-performing model, thus reducing\nthe storage cost and training effort in deep learning applications. However,\nconventional dataset condensation methods are optimization-oriented and\ncondense the dataset by performing gradient or parameter matching during model\noptimization, which is computationally intensive even on small datasets and\nmodels. In this paper, we propose a novel dataset condensation method based on\ndistribution matching, which is more efficient and promising. Specifically, we\nidentify two important shortcomings of naive distribution matching (i.e.,\nimbalanced feature numbers and unvalidated embeddings for distance computation)\nand address them with three novel techniques (i.e., partitioning and expansion\naugmentation, efficient and enriched model sampling, and class-aware\ndistribution regularization). Our simple yet effective method outperforms most\nprevious optimization-oriented methods with much fewer computational resources,\nthereby scaling data condensation to larger datasets and models. Extensive\nexperiments demonstrate the effectiveness of our method. Codes are available at\nhttps://github.com/uitrbn/IDM\n","authors":["Ganlong Zhao","Guanbin Li","Yipeng Qin","Yizhou Yu"],"pdf_url":"https://arxiv.org/pdf/2307.09742v1.pdf","comment":"CVPR2023"},{"id":"http://arxiv.org/abs/2302.11665v2","updated":"2023-07-19T04:03:11Z","published":"2023-02-22T21:41:34Z","title":"AlpaServe: Statistical Multiplexing with Model Parallelism for Deep\n Learning Serving","summary":" Model parallelism is conventionally viewed as a method to scale a single\nlarge deep learning model beyond the memory limits of a single device. In this\npaper, we demonstrate that model parallelism can be additionally used for the\nstatistical multiplexing of multiple devices when serving multiple models, even\nwhen a single model can fit into a single device. Our work reveals a\nfundamental trade-off between the overhead introduced by model parallelism and\nthe opportunity to exploit statistical multiplexing to reduce serving latency\nin the presence of bursty workloads. We explore the new trade-off space and\npresent a novel serving system, AlpaServe, that determines an efficient\nstrategy for placing and parallelizing collections of large deep learning\nmodels across a distributed cluster. Evaluation results on production workloads\nshow that AlpaServe can process requests at up to 10x higher rates or 6x more\nburstiness while staying within latency constraints for more than 99% of\nrequests.\n","authors":["Zhuohan Li","Lianmin Zheng","Yinmin Zhong","Vincent Liu","Ying Sheng","Xin Jin","Yanping Huang","Zhifeng Chen","Hao Zhang","Joseph E. 
Gonzalez","Ion Stoica"],"pdf_url":"https://arxiv.org/pdf/2302.11665v2.pdf","comment":"OSDI 2023"},{"id":"http://arxiv.org/abs/2305.16165v2","updated":"2023-07-19T02:42:46Z","published":"2023-05-11T21:20:29Z","title":"A Conceptual Model for End-to-End Causal Discovery in Knowledge Tracing","summary":" In this paper, we take a preliminary step towards solving the problem of\ncausal discovery in knowledge tracing, i.e., finding the underlying causal\nrelationship among different skills from real-world student response data. This\nproblem is important since it can potentially help us understand the causal\nrelationship between different skills without extensive A/B testing, which can\npotentially help educators to design better curricula according to skill\nprerequisite information. Specifically, we propose a conceptual solution, a\nnovel causal gated recurrent unit (GRU) module in a modified deep knowledge\ntracing model, which uses i) a learnable permutation matrix for causal ordering\namong skills and ii) an optionally learnable lower-triangular matrix for causal\nstructure among skills. We also detail how to learn the model parameters in an\nend-to-end, differentiable way. Our solution placed among the top entries in\nTask 3 of the NeurIPS 2022 Challenge on Causal Insights for Learning Paths in\nEducation. We detail preliminary experiments as evaluated on the challenge's\npublic leaderboard since the ground truth causal structure has not been\npublicly released, making detailed local evaluation impossible.\n","authors":["Nischal Ashok Kumar","Wanyong Feng","Jaewook Lee","Hunter McNichols","Aritra Ghosh","Andrew Lan"],"pdf_url":"https://arxiv.org/pdf/2305.16165v2.pdf","comment":"16th International Conference on Educational Data Mining (EDM 2023)"},{"id":"http://arxiv.org/abs/2305.00909v4","updated":"2023-07-19T02:41:58Z","published":"2023-04-28T01:47:09Z","title":"Outline, Then Details: Syntactically Guided Coarse-To-Fine Code\n Generation","summary":" For a complicated algorithm, its implementation by a human programmer usually\nstarts with outlining a rough control flow followed by iterative enrichments,\neventually yielding carefully generated syntactic structures and variables in a\nhierarchy. However, state-of-the-art large language models generate codes in a\nsingle pass, without intermediate warm-ups to reflect the structured thought\nprocess of \"outline-then-detail\". Inspired by the recent success of\nchain-of-thought prompting, we propose ChainCoder, a program synthesis language\nmodel that generates Python code progressively, i.e. from coarse to fine in\nmultiple passes. We first decompose source code into layout frame components\nand accessory components via abstract syntax tree parsing to construct a\nhierarchical representation. We then reform our prediction target into a\nmulti-pass objective, each pass generates a subsequence, which is concatenated\nin the hierarchy. Finally, a tailored transformer architecture is leveraged to\njointly encode the natural language descriptions and syntactically aligned I/O\ndata samples. Extensive evaluations show that ChainCoder outperforms\nstate-of-the-arts, demonstrating that our progressive generation eases the\nreasoning procedure and guides the language model to generate higher-quality\nsolutions. 
Our codes are available at:\nhttps://github.com/VITA-Group/ChainCoder.\n","authors":["Wenqing Zheng","S P Sharan","Ajay Kumar Jaiswal","Kevin Wang","Yihan Xi","Dejia Xu","Zhangyang Wang"],"pdf_url":"https://arxiv.org/pdf/2305.00909v4.pdf","comment":"Accepted in ICML 2023"},{"id":"http://arxiv.org/abs/2307.09706v1","updated":"2023-07-19T01:37:31Z","published":"2023-07-19T01:37:31Z","title":"RaTE: a Reproducible automatic Taxonomy Evaluation by Filling the Gap","summary":" Taxonomies are an essential knowledge representation, yet most studies on\nautomatic taxonomy construction (ATC) resort to manual evaluation to score\nproposed algorithms. We argue that automatic taxonomy evaluation (ATE) is just\nas important as taxonomy construction. We propose RaTE, an automatic label-free\ntaxonomy scoring procedure, which relies on a large pre-trained language model.\nWe apply our evaluation procedure to three state-of-the-art ATC algorithms with\nwhich we built seven taxonomies from the Yelp domain, and show that 1) RaTE\ncorrelates well with human judgments and 2) artificially degrading a taxonomy\nleads to decreasing RaTE score.\n","authors":["Tianjian Gao","Phillipe Langlais"],"pdf_url":"https://arxiv.org/pdf/2307.09706v1.pdf","comment":"15th International Conference on Computational Semantics (IWCS),\n Association for Computational Linguistics (ACL)"},{"id":"http://arxiv.org/abs/2307.03135v2","updated":"2023-07-19T01:28:30Z","published":"2023-07-06T17:05:26Z","title":"Distilling Large Vision-Language Model with Out-of-Distribution\n Generalizability","summary":" Large vision-language models have achieved outstanding performance, but their\nsize and computational requirements make their deployment on\nresource-constrained devices and time-sensitive tasks impractical. Model\ndistillation, the process of creating smaller, faster models that maintain the\nperformance of larger models, is a promising direction towards the solution.\nThis paper investigates the distillation of visual representations in large\nteacher vision-language models into lightweight student models using a small-\nor mid-scale dataset. Notably, this study focuses on open-vocabulary\nout-of-distribution (OOD) generalization, a challenging problem that has been\noverlooked in previous model distillation literature. We propose two principles\nfrom vision and language modality perspectives to enhance student's OOD\ngeneralization: (1) by better imitating teacher's visual representation space,\nand carefully promoting better coherence in vision-language alignment with the\nteacher; (2) by enriching the teacher's language representations with\ninformative and finegrained semantic attributes to effectively distinguish\nbetween different labels. We propose several metrics and conduct extensive\nexperiments to investigate their techniques. The results demonstrate\nsignificant improvements in zero-shot and few-shot student performance on\nopen-vocabulary out-of-distribution classification, highlighting the\neffectiveness of our proposed approaches. 
Code released at\nhttps://github.com/xuanlinli17/large_vlm_distillation_ood\n","authors":["Xuanlin Li","Yunhao Fang","Minghua Liu","Zhan Ling","Zhuowen Tu","Hao Su"],"pdf_url":"https://arxiv.org/pdf/2307.03135v2.pdf","comment":"Published at International Conference on Computer Vision (ICCV) 2023"},{"id":"http://arxiv.org/abs/2307.09702v1","updated":"2023-07-19T01:14:49Z","published":"2023-07-19T01:14:49Z","title":"Efficient Guided Generation for LLMs","summary":" In this article we describe an efficient approach to guiding language model\ntext generation with regular expressions and context-free grammars. Our\napproach adds little to no overhead to the token sequence generation process,\nand makes guided generation feasible in practice. An implementation is provided\nin the open source Python library Outlines.\n","authors":["Brandon T. Willard","Rémi Louf"],"pdf_url":"https://arxiv.org/pdf/2307.09702v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09692v1","updated":"2023-07-19T00:31:58Z","published":"2023-07-19T00:31:58Z","title":"STRAPPER: Preference-based Reinforcement Learning via Self-training\n Augmentation and Peer Regularization","summary":" Preference-based reinforcement learning (PbRL) promises to learn a complex\nreward function with binary human preference. However, such human-in-the-loop\nformulation requires considerable human effort to assign preference labels to\nsegment pairs, hindering its large-scale applications. Recent approaches have\ntried to reuse unlabeled segments, which implicitly elucidates the distribution\nof segments and thereby alleviates the human effort. Consistency\nregularization is further considered to improve the performance of\nsemi-supervised learning. However, we notice that, unlike general\nclassification tasks, in PbRL there exists a unique phenomenon that we define\nas the similarity trap in this paper. Intuitively, humans can have diametrically\nopposite preferences for similar segment pairs, and such similarity may cause\nconsistency regularization to fail in PbRL. Due to the existence of the similarity\ntrap, such consistency regularization improperly increases the consistency\nof the model's predictions between segment pairs, and thus reduces\nthe confidence in reward learning, since the augmented distribution does not\nmatch the original one in PbRL. To overcome this issue, we present a\nself-training method along with our proposed peer regularization, which\npenalizes the reward model for memorizing uninformative labels and encourages\nconfident predictions. Empirically, we demonstrate that our approach is capable\nof learning a variety of locomotion and robotic manipulation behaviors well,\nusing different semi-supervised alternatives and peer regularization.\n","authors":["Yachen Kang","Li He","Jinxin Liu","Zifeng Zhuang","Donglin Wang"],"pdf_url":"https://arxiv.org/pdf/2307.09692v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09691v1","updated":"2023-07-19T00:27:49Z","published":"2023-07-19T00:27:49Z","title":"Joint Service Caching, Communication and Computing Resource Allocation\n in Collaborative MEC Systems: A DRL-based Two-timescale Approach","summary":" Meeting the strict Quality of Service (QoS) requirements of terminals has\nimposed a significant challenge on Multiaccess Edge Computing (MEC) systems,\ndue to the limited multidimensional resources. 
To address this challenge, we\npropose a collaborative MEC framework that facilitates resource sharing between\nthe edge servers, with the aim of maximizing the long-term QoS and reducing the\ncache switching cost through joint optimization of service caching,\ncollaborative offloading, and computation and communication resource\nallocation. The dual timescale feature and temporal recurrence relationship\nbetween service caching and other resource allocation make solving the problem\neven more challenging. To solve it, we propose a deep reinforcement learning\n(DRL)-based dual timescale scheme, called DGL-DDPG, which is composed of a\nshort-term genetic algorithm (GA) and a long short-term memory network-based\ndeep deterministic policy gradient (LSTM-DDPG). In doing so, we reformulate the\noptimization problem as a Markov decision process (MDP) where the\nsmall-timescale resource allocation decisions generated by an improved GA are\ntaken as the states and input into a centralized LSTM-DDPG agent to generate\nthe service caching decision for the large timescale. Simulation results\ndemonstrate that our proposed algorithm outperforms the baseline algorithms in\nterms of the average QoS and cache switching cost.\n","authors":["Qianqian Liu","Haixia Zhang","Xin Zhang","Dongfeng Yuan"],"pdf_url":"https://arxiv.org/pdf/2307.09691v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09688v1","updated":"2023-07-19T00:08:49Z","published":"2023-07-19T00:08:49Z","title":"Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for\n Recommendation and Text Generation","summary":" Modeling customer shopping intentions is a crucial task for e-commerce, as it\ndirectly impacts user experience and engagement. Thus, accurately understanding\ncustomer preferences is essential for providing personalized recommendations.\nSession-based recommendation, which utilizes customer session data to predict\ntheir next interaction, has become increasingly popular. However, existing\nsession datasets have limitations in terms of item attributes, user diversity,\nand dataset scale. As a result, they cannot comprehensively capture the\nspectrum of user behaviors and preferences. To bridge this gap, we present the\nAmazon Multilingual Multi-locale Shopping Session Dataset, namely Amazon-M2. It\nis the first multilingual dataset consisting of millions of user sessions from\nsix different locales, where the major languages of products are English,\nGerman, Japanese, French, Italian, and Spanish. Remarkably, the dataset can\nhelp us enhance personalization and understanding of user preferences, which\ncan benefit various existing tasks as well as enable new tasks. To test the\npotential of the dataset, we introduce three tasks in this work: (1)\nnext-product recommendation, (2) next-product recommendation with domain\nshifts, and (3) next-product title generation. With the above tasks, we\nbenchmark a range of algorithms on our proposed dataset, drawing new insights\nfor further research and practice. In addition, based on the proposed dataset\nand tasks, we hosted a competition in the KDD CUP 2023 and have attracted\nthousands of users and submissions. 
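As an illustration of the next-product recommendation task described above, the following is a minimal first-order Markov baseline over made-up session data (the product IDs are hypothetical); the benchmark algorithms evaluated on Amazon-M2 are of course far more capable.

from collections import Counter, defaultdict

# Hypothetical toy sessions: each is an ordered list of product IDs.
sessions = [
    ["p1", "p2", "p3"],
    ["p1", "p2", "p4"],
    ["p2", "p3"],
]

# Count item-to-next-item transitions (a first-order Markov baseline).
transitions = defaultdict(Counter)
for s in sessions:
    for cur, nxt in zip(s, s[1:]):
        transitions[cur][nxt] += 1

def recommend(last_item, k=2):
    # Rank candidate next products by observed transition frequency.
    return [item for item, _ in transitions[last_item].most_common(k)]

print(recommend("p2"))  # e.g. ['p3', 'p4']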
The winning solutions and the associated\nworkshop can be accessed at our website https://kddcup23.github.io/.\n","authors":["Wei Jin","Haitao Mao","Zheng Li","Haoming Jiang","Chen Luo","Hongzhi Wen","Haoyu Han","Hanqing Lu","Zhengyang Wang","Ruirui Li","Zhen Li","Monica Xiao Cheng","Rahul Goutam","Haiyang Zhang","Karthik Subbian","Suhang Wang","Yizhou Sun","Jiliang Tang","Bing Yin","Xianfeng Tang"],"pdf_url":"https://arxiv.org/pdf/2307.09688v1.pdf","comment":"Dataset for KDD Cup 2023, https://kddcup23.github.io/"},{"id":"http://arxiv.org/abs/2210.01834v2","updated":"2023-07-19T23:50:47Z","published":"2022-10-04T18:06:29Z","title":"Invariant Aggregator for Defending against Federated Backdoor Attacks","summary":" Federated learning is gaining popularity as it enables training high-utility\nmodels across several clients without directly sharing their private data. As a\ndownside, the federated setting makes the model vulnerable to various\nadversarial attacks in the presence of malicious clients. Despite the\ntheoretical and empirical success in defending against attacks that aim to\ndegrade models' utility, defense against backdoor attacks that increase model\naccuracy on backdoor samples exclusively without hurting the utility on other\nsamples remains challenging. To this end, we first analyze the vulnerability of\nfederated learning to backdoor attacks over a flat loss landscape which is\ncommon for well-designed neural networks such as Resnet [He et al., 2015] but\nis often overlooked by previous works. Over a flat loss landscape, misleading\nfederated learning models to exclusively benefit malicious clients with\nbackdoor samples do not require a significant difference between malicious and\nbenign client-wise updates, making existing defenses insufficient. In contrast,\nwe propose an invariant aggregator that redirects the aggregated update to\ninvariant directions that are generally useful via selectively masking out the\ngradient elements that favor few and possibly malicious clients regardless of\nthe difference magnitude. Theoretical results suggest that our approach\nprovably mitigates backdoor attacks over both flat and sharp loss landscapes.\nEmpirical results on three datasets with different modalities and varying\nnumbers of clients further demonstrate that our approach mitigates a broad\nclass of backdoor attacks with a negligible cost on the model utility.\n","authors":["Xiaoyang Wang","Dimitrios Dimitriadis","Sanmi Koyejo","Shruti Tople"],"pdf_url":"https://arxiv.org/pdf/2210.01834v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.12877v2","updated":"2023-07-19T23:38:55Z","published":"2022-07-26T13:12:22Z","title":"Representing Random Utility Choice Models with Neural Networks","summary":" Motivated by the successes of deep learning, we propose a class of neural\nnetwork-based discrete choice models, called RUMnets, inspired by the random\nutility maximization (RUM) framework. This model formulates the agents' random\nutility function using a sample average approximation. We show that RUMnets\nsharply approximate the class of RUM discrete choice models: any model derived\nfrom random utility maximization has choice probabilities that can be\napproximated arbitrarily closely by a RUMnet. Reciprocally, any RUMnet is\nconsistent with the RUM principle. 
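The sample average approximation behind RUMnets can be illustrated with a minimal sketch: fix deterministic scores (which a RUMnet would instead compute with neural networks from product and customer features), add i.i.d. random utility shocks, and estimate choice probabilities as win frequencies across samples. The Gaussian shock and the scores below are assumptions for illustration, not the paper's exact specification.

import random

random.seed(0)

def choice_probabilities(scores, n_samples=5000):
    # Sample-average approximation of a random utility model:
    # utility_i = deterministic score_i + random shock; the choice probability
    # of alternative i is the fraction of samples in which it has maximal utility.
    wins = [0] * len(scores)
    for _ in range(n_samples):
        utilities = [s + random.gauss(0.0, 1.0) for s in scores]
        wins[utilities.index(max(utilities))] += 1
    return [w / n_samples for w in wins]

# Hypothetical deterministic scores for three alternatives.
print(choice_probabilities([1.0, 0.5, 0.0]))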
We derive an upper bound on the\ngeneralization error of RUMnets fitted on choice data, and gain theoretical\ninsights on their ability to predict choices on new, unseen data depending on\ncritical parameters of the dataset and architecture. By leveraging open-source\nlibraries for neural networks, we find that RUMnets are competitive against\nseveral choice modeling and machine learning methods in terms of predictive\naccuracy on two real-world datasets.\n","authors":["Ali Aouad","Antoine Désir"],"pdf_url":"https://arxiv.org/pdf/2207.12877v2.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2207.00419v3","updated":"2023-07-19T16:00:08Z","published":"2022-06-18T00:26:52Z","title":"Self-Supervised Learning for Videos: A Survey","summary":" The remarkable success of deep learning in various domains relies on the\navailability of large-scale annotated datasets. However, obtaining annotations\nis expensive and requires great effort, which is especially challenging for\nvideos. Moreover, the use of human-generated annotations leads to models with\nbiased learning and poor domain generalization and robustness. As an\nalternative, self-supervised learning provides a way for representation\nlearning which does not require annotations and has shown promise in both image\nand video domains. Different from the image domain, learning video\nrepresentations are more challenging due to the temporal dimension, bringing in\nmotion and other environmental dynamics. This also provides opportunities for\nvideo-exclusive ideas that advance self-supervised learning in the video and\nmultimodal domain. In this survey, we provide a review of existing approaches\non self-supervised learning focusing on the video domain. We summarize these\nmethods into four different categories based on their learning objectives: 1)\npretext tasks, 2) generative learning, 3) contrastive learning, and 4)\ncross-modal agreement. We further introduce the commonly used datasets,\ndownstream evaluation tasks, insights into the limitations of existing works,\nand the potential future directions in this area.\n","authors":["Madeline C. Schiappa","Yogesh S. Rawat","Mubarak Shah"],"pdf_url":"https://arxiv.org/pdf/2207.00419v3.pdf","comment":"ACM CSUR (December 2022). Project Link: https://bit.ly/3Oimc7Q"},{"id":"http://arxiv.org/abs/2307.10003v1","updated":"2023-07-19T14:23:26Z","published":"2023-07-19T14:23:26Z","title":"TbExplain: A Text-based Explanation Method for Scene Classification\n Models with the Statistical Prediction Correction","summary":" The field of Explainable Artificial Intelligence (XAI) aims to improve the\ninterpretability of black-box machine learning models. Building a heatmap based\non the importance value of input features is a popular method for explaining\nthe underlying functions of such models in producing their predictions.\nHeatmaps are almost understandable to humans, yet they are not without flaws.\nNon-expert users, for example, may not fully understand the logic of heatmaps\n(the logic in which relevant pixels to the model's prediction are highlighted\nwith different intensities or colors). Additionally, objects and regions of the\ninput image that are relevant to the model prediction are frequently not\nentirely differentiated by heatmaps. In this paper, we propose a framework\ncalled TbExplain that employs XAI techniques and a pre-trained object detector\nto present text-based explanations of scene classification models. 
Moreover,\nTbExplain incorporates a novel method to correct predictions and textually\nexplain them based on the statistics of objects in the input image when the\ninitial prediction is unreliable. To assess the trustworthiness and validity of\nthe text-based explanations, we conducted a qualitative experiment, and the\nfindings indicated that these explanations are sufficiently reliable.\nFurthermore, our quantitative and qualitative experiments on TbExplain with\nscene classification datasets reveal an improvement in classification accuracy\nover ResNet variants.\n","authors":["Amirhossein Aminimehr","Pouya Khani","Amirali Molaei","Amirmohammad Kazemeini","Erik Cambria"],"pdf_url":"https://arxiv.org/pdf/2307.10003v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09936v1","updated":"2023-07-19T12:21:39Z","published":"2023-07-19T12:21:39Z","title":"AGAR: Attention Graph-RNN for Adaptative Motion Prediction of Point\n Clouds of Deformable Objects","summary":" This paper focuses on motion prediction for point cloud sequences in the\nchallenging case of deformable 3D objects, such as human body motion. First, we\ninvestigate the challenges caused by deformable shapes and complex motions\npresent in this type of representation, with the ultimate goal of understanding\nthe technical limitations of state-of-the-art models. From this understanding,\nwe propose an improved architecture for point cloud prediction of deformable 3D\nobjects. Specifically, to handle deformable shapes, we propose a graph-based\napproach that learns and exploits the spatial structure of point clouds to\nextract more representative features. Then we propose a module able to combine\nthe learned features in an adaptative manner according to the point cloud\nmovements. The proposed adaptative module controls the composition of local and\nglobal motions for each point, enabling the network to model complex motions in\ndeformable 3D objects more effectively. We tested the proposed method on the\nfollowing datasets: MNIST moving digits, the Mixamo human bodies motions, JPEG\nand CWIPC-SXR real-world dynamic bodies. Simulation results demonstrate that\nour method outperforms the current baseline methods given its improved ability\nto model complex movements as well as preserve point cloud shape. Furthermore,\nwe demonstrate the generalizability of the proposed framework for dynamic\nfeature learning, by testing the framework for action recognition on the\nMSRAction3D dataset and achieving results on-par with state-of-the-art methods\n","authors":["Pedro Gomes","Silvia Rossi","Laura Toni"],"pdf_url":"https://arxiv.org/pdf/2307.09936v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09915v1","updated":"2023-07-19T11:35:21Z","published":"2023-07-19T11:35:21Z","title":"Embedded Heterogeneous Attention Transformer for Cross-lingual Image\n Captioning","summary":" Cross-lingual image captioning is confronted with both cross-lingual and\ncross-modal challenges for multimedia analysis. The crucial issue in this task\nis to model the global and local matching between the image and different\nlanguages. Existing cross-modal embedding methods based on Transformer\narchitecture oversight the local matching between the image region and\nmonolingual words, not to mention in the face of a variety of differentiated\nlanguages. 
Due to the heterogeneous property of the cross-modal and\ncross-lingual task, we utilize a heterogeneous network to establish\ncross-domain relationships and the local correspondences between the image and\ndifferent languages. In this paper, we propose an Embedded Heterogeneous\nAttention Transformer (EHAT) to build cross-domain reasoning paths for\ncross-lingual image captioning and integrate them into the Transformer. The proposed\nEHAT consists of a Masked Heterogeneous Cross-attention (MHCA), a Heterogeneous\nAttention Reasoning Network (HARN) and a Heterogeneous Co-attention (HCA). HARN,\nas the core network, models and infers cross-domain relationships anchored by\nvisual bounding box representation features to connect the word features of the two\nlanguages and learn the heterogeneous maps. MHCA and HCA implement cross-domain\nintegration in the encoder through the special heterogeneous attention and\nenable a single model to generate captions in both languages. We test on the MSCOCO\ndataset, generating English and Chinese captions; these two widely used languages show\nan obvious difference between their language families. Our experiments show that\nour method even achieves better results than advanced monolingual methods.\n","authors":["Zijie Song","Zhenzhen Hu","Richang Hong"],"pdf_url":"https://arxiv.org/pdf/2307.09915v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09821v1","updated":"2023-07-19T08:16:34Z","published":"2023-07-19T08:16:34Z","title":"Hierarchical Semantic Perceptual Listener Head Video Generation: A\n High-performance Pipeline","summary":" In dyadic speaker-listener interactions, the listener's head reactions, along\nwith the speaker's head movements, constitute an important non-verbal semantic\nexpression. The listener head generation task aims to synthesize\nresponsive listener head videos based on the audio of the speaker and reference\nimages of the listener. Compared to talking-head generation, it is more\nchallenging to capture the correlation clues from the speaker's audio and\nvisual information. Following the ViCo baseline scheme, we propose a\nhigh-performance solution by enhancing the hierarchical semantic extraction\ncapability of the audio encoder module and improving the decoder part, renderer\nand post-processing modules. Our solution achieved first place on the official\nleaderboard for the listening head generation track. This paper is a\ntechnical report for the ViCo@2023 Conversational Head Generation Challenge at the ACM\nMultimedia 2023 conference.\n","authors":["Zhigang Chang","Weitai Hu","Qing Yang","Shibao Zheng"],"pdf_url":"https://arxiv.org/pdf/2307.09821v1.pdf","comment":"ACM MM 2023"},{"id":"http://arxiv.org/abs/2306.07848v5","updated":"2023-07-19T04:56:33Z","published":"2023-06-13T15:28:10Z","title":"GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio\n Pretraining for Speech Emotion Recognition","summary":" Contrastive learning-based cross-modality pretraining methods have recently\nexhibited impressive success in diverse fields. In this paper, we propose\nGEmo-CLAP, a kind of gender-attribute-enhanced contrastive language-audio\npretraining (CLAP) method for speech emotion recognition. Specifically, a novel\nemotion CLAP model (Emo-CLAP) is first built, utilizing various self-supervised\npre-trained models. 
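For orientation, a minimal sketch of the symmetric contrastive (InfoNCE-style) objective used by CLAP-style audio-text pretraining follows, with made-up 2-dimensional embeddings; the gender-attribute extensions of GEmo-CLAP (soft labels or a multi-task head) are not shown, so this is only the generic contrastive term.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clap_style_loss(audio_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE: matched audio/text pairs (same index) are positives,
    # all other pairings in the batch serve as negatives.
    n = len(audio_emb)
    sims = [[cosine(a, t) / temperature for t in text_emb] for a in audio_emb]
    loss = 0.0
    for i in range(n):
        row = sims[i]                          # audio i scored against all texts
        col = [sims[j][i] for j in range(n)]   # text i scored against all audios
        loss += -math.log(math.exp(row[i]) / sum(math.exp(x) for x in row))
        loss += -math.log(math.exp(col[i]) / sum(math.exp(x) for x in col))
    return loss / (2 * n)

# Hypothetical embeddings for a batch of two utterances and two emotion prompts.
audio = [[0.9, 0.1], [0.2, 0.8]]
text = [[1.0, 0.0], [0.0, 1.0]]
print(round(clap_style_loss(audio, text), 4))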
Second, considering the importance of gender attribute in\nspeech emotion modeling, the soft label based GEmo-CLAP (SL-GEmo-CLAP) and\nmulti-task learning based GEmo-CLAP (ML-GEmo-CLAP) are further proposed to\nintegrate the emotion and gender information of speech signals, forming more\nreasonable objectives. Extensive experiments on IEMOCAP show that our proposed\ntwo GEmo-CLAP models consistently outperform the baseline Emo-CLAP with\ndifferent pre-trained models, while also achieving the best recognition\nperformance compared with recent state-of-the-art methods. Noticeably, the\nproposed WavLM-based ML-GEmo-CLAP obtains the best UAR of 80.16\\% and WAR of\n82.06\\%.\n","authors":["Yu Pan","Lei Ma"],"pdf_url":"https://arxiv.org/pdf/2306.07848v5.pdf","comment":"5 pages"},{"id":"http://arxiv.org/abs/2307.09729v1","updated":"2023-07-19T02:33:42Z","published":"2023-07-19T02:33:42Z","title":"NTIRE 2023 Quality Assessment of Video Enhancement Challenge","summary":" This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement\nChallenge, which will be held in conjunction with the New Trends in Image\nRestoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to\naddress a major challenge in the field of video processing, namely, video\nquality assessment (VQA) for enhanced videos. The challenge uses the VQA\nDataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211\nenhanced videos, including 600 videos with color, brightness, and contrast\nenhancements, 310 videos with deblurring, and 301 deshaked videos. The\nchallenge has a total of 167 registered participants. 61 participating teams\nsubmitted their prediction results during the development phase, with a total\nof 3168 submissions. A total of 176 submissions were submitted by 37\nparticipating teams during the final testing phase. Finally, 19 participating\nteams submitted their models and fact sheets, and detailed the methods they\nused. 
Some methods have achieved better results than baseline methods, and the\nwinning methods have demonstrated superior prediction performance.\n","authors":["Xiaohong Liu","Xiongkuo Min","Wei Sun","Yulun Zhang","Kai Zhang","Radu Timofte","Guangtao Zhai","Yixuan Gao","Yuqin Cao","Tengchuan Kou","Yunlong Dong","Ziheng Jia","Yilin Li","Wei Wu","Shuming Hu","Sibin Deng","Pengxiang Xiao","Ying Chen","Kai Li","Kai Zhao","Kun Yuan","Ming Sun","Heng Cong","Hao Wang","Lingzhi Fu","Yusheng Zhang","Rongyu Zhang","Hang Shi","Qihang Xu","Longan Xiao","Zhiliang Ma","Mirko Agarla","Luigi Celona","Claudio Rota","Raimondo Schettini","Zhiwei Huang","Yanan Li","Xiaotao Wang","Lei Lei","Hongye Liu","Wei Hong","Ironhead Chuang","Allen Lin","Drake Guan","Iris Chen","Kae Lou","Willy Huang","Yachun Tasi","Yvonne Kao","Haotian Fan","Fangyuan Kong","Shiqi Zhou","Hao Liu","Yu Lai","Shanshan Chen","Wenqi Wang","Haoning Wu","Chaofeng Chen","Chunzheng Zhu","Zekun Guo","Shiling Zhao","Haibing Yin","Hongkui Wang","Hanene Brachemi Meftah","Sid Ahmed Fezza","Wassim Hamidouche","Olivier Déforges","Tengfei Shi","Azadeh Mansouri","Hossein Motamednia","Amir Hossein Bakhtiari","Ahmad Mahmoudi Aznaveh"],"pdf_url":"https://arxiv.org/pdf/2307.09729v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10346v1","updated":"2023-07-19T17:10:56Z","published":"2023-07-19T17:10:56Z","title":"Estudio de la Experiencia de Usuario mediante un Sistema de Dashboards\n de Análisis de Aprendizaje Multimodal","summary":" In the article, we present a Web-based System called M2LADS, which supports\nthe integration and visualization of multimodal data recorded in user\nexperiences (UX) in a Learning Analytics (LA) system in the form of Web-based\nDashboards. Based on the edBB platform, the multimodal data gathered contains\nbiometric and behavioral signals including electroencephalogram data to measure\nlearners' cognitive attention, heart rate for affective measures and visual\nattention from the video recordings. Additionally, learners' static background\ndata and their learning performance measures are tracked using LOGGE tool.\nM2LADS provides opportunities to capture learners' holistic experience during\ntheir interactions with the learning analytic system in order to improve the\nsystem and the user experience of the learners.\n --\n En este art\\'iculo, presentamos M2LADS, un sistema que permite la\nintegraci\\'on y visualizaci\\'on de datos multimodales en forma de Dashboards\nWeb. Estos datos provienen de sesiones de experiencia de usuario en un sistema\nde Learning Analytics (LA) llevadas a cabo por estudiantes de MOOCs. Los datos\nmultimodales incluyen se\\~nales biom\\'etricas y de comportamiento monitorizados\npor la plataforma edBB, como electroencefalogramas (EEG) de 5 canales,\nfrecuencia card\\'iaca, atenci\\'on visual, videos en el espectro visible y NIR,\nentre otros. Adem\\'as, se incluyen datos de interacci\\'on de los estudiantes\ncon el sistema de LA a trav\\'es de la herramienta LOGGE. Toda esta\ninformaci\\'on proporciona una comprensi\\'on completa de la experiencia del\nusuario al utilizar el sistema de LA, lo que ha permitido tanto mejorar el\nsistema LA como la experiencia de aprendizaje de los estudiantes de MOOCs.\n","authors":["Álvaro Becerra","Roberto Daza","Ruth Cobos","Aythami Morales","Julian Fierrez"],"pdf_url":"https://arxiv.org/pdf/2307.10346v1.pdf","comment":"Accepted in \"XXIII CONGRESO INTERNACIONAL DE INTERACCI\\'ON\n PERSONA-ORDENADOR 2023\". Article in Spanish language. 
The abstract in English\n and Spanish. There is an extended abstract of 2 pages in English"}]},"2023-07-20T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2307.11088v1","updated":"2023-07-20T17:59:41Z","published":"2023-07-20T17:59:41Z","title":"L-Eval: Instituting Standardized Evaluation for Long Context Language\n Models","summary":" Recently, there has been growing interest in extending the context length of\ninstruction-following models in order to effectively process single-turn long\ninput (e.g. summarizing a paper) and conversations with more extensive\nhistories. While proprietary models such as GPT-4 and Claude have demonstrated\nconsiderable advancements in handling tens of thousands of tokens of context,\nopen-sourced models are still in the early stages of experimentation. It also\nremains unclear whether developing these long context models can offer\nsubstantial gains on practical downstream tasks over retrieval-based methods or\nmodels simply trained on chunked contexts. To address this challenge, we\npropose to institute standardized evaluation for long context language models.\nConcretely, we develop L-Eval which contains 411 long documents and over 2,000\nquery-response pairs manually annotated and checked by the authors encompassing\nareas such as law, finance, school lectures, lengthy conversations, news,\nlong-form novels, and meetings. L-Eval also adopts diverse evaluation methods\nand instruction styles, enabling a more reliable assessment of Long Context\nLanguage Models (LCLMs). Our findings indicate that while open-source models\ntypically lag behind their commercial counterparts, they still exhibit\nimpressive performance. LLaMA2 achieves the best results (win 45\\% vs\nturbo-16k) on open-ended tasks with only 4k context length and ChatGLM2\nachieves the best results on closed-ended tasks with 8k input tokens. We\nrelease our new evaluation suite, code, and all generation results including\npredictions from all open-sourced LCLMs, GPT4-32k, Cluade-100k at\n{\\url{https://github.com/OpenLMLab/LEval}}.\n","authors":["Chenxin An","Shansan Gong","Ming Zhong","Mukai Li","Jun Zhang","Lingpeng Kong","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2307.11088v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10172v2","updated":"2023-07-20T17:59:35Z","published":"2023-07-19T17:57:53Z","title":"DialogStudio: Towards Richest and Most Diverse Unified Dataset\n Collection for Conversational AI","summary":" Despite advancements in conversational AI, language models encounter\nchallenges to handle diverse conversational tasks, and existing dialogue\ndataset collections often lack diversity and comprehensiveness. To tackle these\nissues, we introduce DialogStudio: the largest and most diverse collection of\ndialogue datasets, unified under a consistent format while preserving their\noriginal information. Our collection encompasses data from open-domain\ndialogues, task-oriented dialogues, natural language understanding,\nconversational recommendation, dialogue summarization, and knowledge-grounded\ndialogues, making it an incredibly rich and diverse resource for dialogue\nresearch and model training. To further enhance the utility of DialogStudio, we\nidentify the licenses for each dataset and design domain-aware prompts for\nselected dialogues to facilitate instruction-aware fine-tuning. 
Furthermore, we\ndevelop conversational AI models using the dataset collection, and our\nexperiments in both zero-shot and few-shot learning scenarios demonstrate the\nsuperiority of DialogStudio. To improve transparency and support dataset and\ntask-based research, as well as language model pre-training, all datasets,\nlicenses, codes, and models associated with DialogStudio are made publicly\naccessible at https://github.com/salesforce/DialogStudio\n","authors":["Jianguo Zhang","Kun Qian","Zhiwei Liu","Shelby Heinecke","Rui Meng","Ye Liu","Zhou Yu","Huan Wang","Silvio Savarese","Caiming Xiong"],"pdf_url":"https://arxiv.org/pdf/2307.10172v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.13867v2","updated":"2023-07-20T17:59:14Z","published":"2023-01-31T18:59:03Z","title":"Mathematical Capabilities of ChatGPT","summary":" We investigate the mathematical capabilities of two iterations of ChatGPT\n(released 9-January-2023 and 30-January-2023) and of GPT-4 by testing them on\npublicly available datasets, as well as hand-crafted ones, using a novel\nmethodology. In contrast to formal mathematics, where large databases of formal\nproofs are available (e.g., the Lean Mathematical Library), current datasets of\nnatural-language mathematics, used to benchmark language models, either cover\nonly elementary mathematics or are very small. We address this by publicly\nreleasing two new datasets: GHOSTS and miniGHOSTS. These are the first\nnatural-language datasets curated by working researchers in mathematics that\n(1) aim to cover graduate-level mathematics, (2) provide a holistic overview of\nthe mathematical capabilities of language models, and (3) distinguish multiple\ndimensions of mathematical reasoning. These datasets also test whether ChatGPT\nand GPT-4 can be helpful assistants to professional mathematicians by emulating\nuse cases that arise in the daily professional activities of mathematicians. We\nbenchmark the models on a range of fine-grained performance metrics. For\nadvanced mathematics, this is the most detailed evaluation effort to date. We\nfind that ChatGPT can be used most successfully as a mathematical assistant for\nquerying facts, acting as a mathematical search engine and knowledge base\ninterface. GPT-4 can additionally be used for undergraduate-level mathematics\nbut fails on graduate-level difficulty. Contrary to many positive reports in\nthe media about GPT-4 and ChatGPT's exam-solving abilities (a potential case of\nselection bias), their overall mathematical performance is well below the level\nof a graduate student. Hence, if your goal is to use ChatGPT to pass a\ngraduate-level math exam, you would be better off copying from your average\npeer!\n","authors":["Simon Frieder","Luca Pinchetti","Alexis Chevalier","Ryan-Rhys Griffiths","Tommaso Salvatori","Thomas Lukasiewicz","Philipp Christian Petersen","Julius Berner"],"pdf_url":"https://arxiv.org/pdf/2301.13867v2.pdf","comment":"Added further evaluations on another ChatGPT version and on GPT-4.\n The GHOSTS and miniGHOSTS datasets are available at\n https://github.com/xyfrieder/science-GHOSTS"},{"id":"http://arxiv.org/abs/2304.07880v3","updated":"2023-07-20T17:34:39Z","published":"2023-04-16T20:11:19Z","title":"Sabiá: Portuguese Large Language Models","summary":" As the capabilities of language models continue to advance, it is conceivable\nthat \"one-size-fits-all\" model will remain as the main paradigm. 
For instance,\ngiven the vast number of languages worldwide, many of which are low-resource,\nthe prevalent practice is to pretrain a single model on multiple languages. In\nthis paper, we add to the growing body of evidence that challenges this\npractice, demonstrating that monolingual pretraining on the target language\nsignificantly improves models already extensively trained on diverse corpora.\nMore specifically, we further pretrain GPT-J and LLaMA models on Portuguese\ntexts using 3% or less of their original pretraining budget. Few-shot\nevaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models\noutperform English-centric and multilingual counterparts by a significant\nmargin. Our best model, Sabi\\'a-65B, performs on par with GPT-3.5-turbo. By\nevaluating on datasets originally conceived in the target language as well as\ntranslated ones, we study the contributions of language-specific pretraining in\nterms of 1) capturing linguistic nuances and structures inherent to the target\nlanguage, and 2) enriching the model's knowledge about a domain or culture. Our\nresults indicate that the majority of the benefits stem from the\ndomain-specific knowledge acquired through monolingual pretraining.\n","authors":["Ramon Pires","Hugo Abonizio","Thales Sales Almeida","Rodrigo Nogueira"],"pdf_url":"https://arxiv.org/pdf/2304.07880v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11031v1","updated":"2023-07-20T17:07:28Z","published":"2023-07-20T17:07:28Z","title":"Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot\n Classification","summary":" Recent work has shown that language models' (LMs) prompt-based learning\ncapabilities make them well suited for automating data labeling in domains\nwhere manual annotation is expensive. The challenge is that while writing an\ninitial prompt is cheap, improving a prompt is costly -- practitioners often\nrequire significant labeled data in order to evaluate the impact of prompt\nmodifications. Our work asks whether it is possible to improve prompt-based\nlearning without additional labeled data. We approach this problem by\nattempting to modify the predictions of a prompt, rather than the prompt\nitself. Our intuition is that accurate predictions should also be consistent:\nsamples which are similar under some feature representation should receive the\nsame prompt prediction. We propose Embroid, a method which computes multiple\nrepresentations of a dataset under different embedding functions, and uses the\nconsistency between the LM predictions for neighboring samples to identify\nmispredictions. Embroid then uses these neighborhoods to create additional\npredictions for each sample, and combines these predictions with a simple\nlatent variable graphical model in order to generate a final corrected\nprediction. In addition to providing a theoretical analysis of Embroid, we\nconduct a rigorous empirical evaluation across six different LMs and up to 95\ndifferent tasks. We find that (1) Embroid substantially improves performance\nover original prompts (e.g., by an average of 7.3 points on GPT-JT), (2) also\nrealizes improvements for more sophisticated prompting strategies (e.g.,\nchain-of-thought), and (3) can be specialized to domains like law through the\nembedding functions.\n","authors":["Neel Guha","Mayee F. 
Chen","Kush Bhatia","Azalia Mirhoseini","Frederic Sala","Christopher Ré"],"pdf_url":"https://arxiv.org/pdf/2307.11031v1.pdf","comment":"38 pages, 22 figures, 8 tables"},{"id":"http://arxiv.org/abs/2307.10811v1","updated":"2023-07-20T16:55:25Z","published":"2023-07-20T16:55:25Z","title":"\"It Felt Like Having a Second Mind\": Investigating Human-AI\n Co-creativity in Prewriting with Large Language Models","summary":" Prewriting is the process of discovering and developing ideas before a first\ndraft, which requires divergent thinking and often implies unstructured\nstrategies such as diagramming, outlining, free-writing, etc. Although large\nlanguage models (LLMs) have been demonstrated to be useful for a variety of\ntasks including creative writing, little is known about how users would\ncollaborate with LLMs to support prewriting. The preferred collaborative role\nand initiative of LLMs during such a creativity process is also unclear. To\ninvestigate human-LLM collaboration patterns and dynamics during prewriting, we\nconducted a three-session qualitative study with 15 participants in two\ncreative tasks: story writing and slogan writing. The findings indicated that\nduring collaborative prewriting, there appears to be a three-stage iterative\nHuman-AI Co-creativity process that includes Ideation, Illumination, and\nImplementation stages. This collaborative process champions the human in a\ndominant role, in addition to mixed and shifting levels of initiative that\nexist between humans and LLMs. This research also reports on collaboration\nbreakdowns that occur during this process, user perceptions of using existing\nLLMs during Human-AI Co-creativity, and discusses design implications to\nsupport this co-creativity process.\n","authors":["Qian Wan","Siying Hu","Yu Zhang","Piaohong Wang","Bo Wen","Zhicong Lu"],"pdf_url":"https://arxiv.org/pdf/2307.10811v1.pdf","comment":"Under review at CSCW after a Major Revision"},{"id":"http://arxiv.org/abs/2307.11019v1","updated":"2023-07-20T16:46:10Z","published":"2023-07-20T16:46:10Z","title":"Investigating the Factual Knowledge Boundary of Large Language Models\n with Retrieval Augmentation","summary":" Knowledge-intensive tasks (e.g., open-domain question answering (QA)) require\na substantial amount of factual knowledge and often rely on external\ninformation for assistance. Recently, large language models (LLMs) (e.g.,\nChatGPT), have demonstrated impressive prowess in solving a wide range of tasks\nwith world knowledge, including knowledge-intensive tasks. However, it remains\nunclear how well LLMs are able to perceive their factual knowledge boundaries,\nparticularly how they behave when incorporating retrieval augmentation. In this\nstudy, we present an initial analysis of the factual knowledge boundaries of\nLLMs and how retrieval augmentation affects LLMs on open-domain QA. Specially,\nwe focus on three primary research questions and analyze them by examining QA\nperformance, priori judgement and posteriori judgement of LLMs. We show\nevidence that LLMs possess unwavering confidence in their capabilities to\nrespond to questions and the accuracy of their responses. Furthermore,\nretrieval augmentation proves to be an effective approach in enhancing LLMs'\nawareness of knowledge boundaries, thereby improving their judgemental\nabilities. Additionally, we also find that LLMs have a propensity to rely on\nthe provided retrieval results when formulating answers, while the quality of\nthese results significantly impacts their reliance. 
The code to reproduce this\nwork is available at https://github.com/RUCAIBox/LLM-Knowledge-Boundary.\n","authors":["Ruiyang Ren","Yuhao Wang","Yingqi Qu","Wayne Xin Zhao","Jing Liu","Hao Tian","Hua Wu","Ji-Rong Wen","Haifeng Wang"],"pdf_url":"https://arxiv.org/pdf/2307.11019v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11005v1","updated":"2023-07-20T16:34:40Z","published":"2023-07-20T16:34:40Z","title":"Integrating Pretrained ASR and LM to Perform Sequence Generation for\n Spoken Language Understanding","summary":" There has been an increased interest in the integration of pretrained speech\nrecognition (ASR) and language models (LM) into the SLU framework. However,\nprior methods often struggle with a vocabulary mismatch between pretrained\nmodels, and LM cannot be directly utilized as they diverge from its NLU\nformulation. In this study, we propose a three-pass end-to-end (E2E) SLU system\nthat effectively integrates ASR and LM subnetworks into the SLU formulation for\nsequence generation tasks. In the first pass, our architecture predicts ASR\ntranscripts using the ASR subnetwork. This is followed by the LM subnetwork,\nwhich makes an initial SLU prediction. Finally, in the third pass, the\ndeliberation subnetwork conditions on representations from the ASR and LM\nsubnetworks to make the final prediction. Our proposed three-pass SLU system\nshows improved performance over cascaded and E2E SLU models on two benchmark\nSLU datasets, SLURP and SLUE, especially on acoustically challenging\nutterances.\n","authors":["Siddhant Arora","Hayato Futami","Yosuke Kashiwagi","Emiru Tsunoo","Brian Yan","Shinji Watanabe"],"pdf_url":"https://arxiv.org/pdf/2307.11005v1.pdf","comment":"Accepted at INTERSPEECH 2023"},{"id":"http://arxiv.org/abs/2210.05335v3","updated":"2023-07-20T16:24:14Z","published":"2022-10-11T10:54:54Z","title":"MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model","summary":" Multimodal semantic understanding often has to deal with uncertainty, which\nmeans the obtained messages tend to refer to multiple targets. Such uncertainty\nis problematic for our interpretation, including inter- and intra-modal\nuncertainty. Little effort has studied the modeling of this uncertainty,\nparticularly in pre-training on unlabeled datasets and fine-tuning in\ntask-specific downstream datasets. In this paper, we project the\nrepresentations of all modalities as probabilistic distributions via a\nProbability Distribution Encoder (PDE) by utilizing sequence-level\ninteractions. Compared to the existing deterministic methods, such uncertainty\nmodeling can convey richer multimodal semantic information and more complex\nrelationships. Furthermore, we integrate uncertainty modeling with popular\npre-training frameworks and propose suitable pre-training tasks:\nDistribution-based Vision-Language Contrastive learning (D-VLC),\nDistribution-based Masked Language Modeling (D-MLM), and Distribution-based\nImage-Text Matching (D-ITM). 
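As a toy illustration of working with distribution-valued embeddings like those produced by a probability distribution encoder, the sketch below scores two diagonal-Gaussian representations with the closed-form 2-Wasserstein distance; the choice of distance, the dimensionality, and all numbers are assumptions for illustration and not necessarily what MAP uses.

import math

def w2_diag_gauss(mu1, var1, mu2, var2):
    # Squared 2-Wasserstein distance between two diagonal Gaussians:
    # ||mu1 - mu2||^2 + sum_d (sqrt(var1_d) - sqrt(var2_d))^2
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return mean_term + cov_term

# Hypothetical image and text representations, each a Gaussian with a mean and
# a per-dimension variance (a larger variance marks a more uncertain region/token).
image_mu, image_var = [0.2, 0.7], [0.10, 0.05]
text_mu, text_var = [0.1, 0.9], [0.20, 0.05]
print(round(w2_diag_gauss(image_mu, image_var, text_mu, text_var), 4))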
The fine-tuned models are applied to challenging\ndownstream tasks, including image-text retrieval, visual question answering,\nvisual reasoning, and visual entailment, and achieve state-of-the-art results.\n","authors":["Yatai Ji","Junjie Wang","Yuan Gong","Lin Zhang","Yanru Zhu","Hongfa Wang","Jiaxing Zhang","Tetsuya Sakai","Yujiu Yang"],"pdf_url":"https://arxiv.org/pdf/2210.05335v3.pdf","comment":"CVPR 2023 Main Track Long Paper"},{"id":"http://arxiv.org/abs/2307.10982v1","updated":"2023-07-20T16:09:57Z","published":"2023-07-20T16:09:57Z","title":"MASR: Metadata Aware Speech Representation","summary":" In the recent years, speech representation learning is constructed primarily\nas a self-supervised learning (SSL) task, using the raw audio signal alone,\nwhile ignoring the side-information that is often available for a given speech\nrecording. In this paper, we propose MASR, a Metadata Aware Speech\nRepresentation learning framework, which addresses the aforementioned\nlimitations. MASR enables the inclusion of multiple external knowledge sources\nto enhance the utilization of meta-data information. The external knowledge\nsources are incorporated in the form of sample-level pair-wise similarity\nmatrices that are useful in a hard-mining loss. A key advantage of the MASR\nframework is that it can be combined with any choice of SSL method. Using MASR\nrepresentations, we perform evaluations on several downstream tasks such as\nlanguage identification, speech recognition and other non-semantic tasks such\nas speaker and emotion recognition. In these experiments, we illustrate\nsignificant performance improvements for the MASR over other established\nbenchmarks. We perform a detailed analysis on the language identification task\nto provide insights on how the proposed loss function enables the\nrepresentations to separate closely related languages.\n","authors":["Anjali Raj","Shikhar Bharadwaj","Sriram Ganapathy","Min Ma","Shikhar Vashishth"],"pdf_url":"https://arxiv.org/pdf/2307.10982v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.12619v2","updated":"2023-07-20T16:04:19Z","published":"2023-06-22T01:14:47Z","title":"Class-Incremental Learning based on Label Generation","summary":" Despite the great success of pre-trained language models, it is still a\nchallenge to use these models for continual learning, especially for the\nclass-incremental learning (CIL) setting due to catastrophic forgetting (CF).\nThis paper reports our finding that if we formulate CIL as a continual label\ngeneration problem, CF is drastically reduced and the generalizable\nrepresentations of pre-trained models can be better retained. We thus propose a\nnew CIL method (VAG) that also leverages the sparsity of vocabulary to focus\nthe generation and creates pseudo-replay samples by using label semantics.\nExperimental results show that VAG outperforms baselines by a large margin.\n","authors":["Yijia Shao","Yiduo Guo","Dongyan Zhao","Bing Liu"],"pdf_url":"https://arxiv.org/pdf/2306.12619v2.pdf","comment":"12 pages, ACL 2023 Main Conference"},{"id":"http://arxiv.org/abs/2306.14192v2","updated":"2023-07-20T15:20:51Z","published":"2023-06-25T10:16:49Z","title":"$α$-$β$-Factorization and the Binary Case of Simon's Congruence","summary":" In 1991 H\\'ebrard introduced a factorization of words that turned out to be a\npowerful tool for the investigation of a word's scattered factors (also known\nas (scattered) subwords or subsequences). 
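The notions used in this line of work are easy to state in code. Below is a minimal sketch of a scattered-factor (subsequence) test and of Hébrard's arch factorization, whose number of complete arches gives the universality index of a word; the α-β-factorization studied in the paper builds on these ideas but is not reproduced here.

def is_scattered_factor(u, w):
    # u is a scattered factor (subsequence) of w if its letters occur in w in order.
    it = iter(w)
    return all(ch in it for ch in u)

def arch_factorization(w, alphabet):
    # Hebrard's arch factorization: repeatedly cut off the shortest prefix that
    # contains every letter of the alphabet; the number of complete arches is
    # the universality index of w, and the leftover suffix is the rest.
    arches, seen, start = [], set(), 0
    for i, ch in enumerate(w):
        seen.add(ch)
        if seen == set(alphabet):
            arches.append(w[start:i + 1])
            seen, start = set(), i + 1
    rest = w[start:]
    return arches, rest

print(is_scattered_factor("ac", "abc"))        # True
print(arch_factorization("abcacb", "abc"))     # (['abc', 'acb'], '') -> 2-universal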
Based on this, first Karandikar and\nSchnoebelen introduced the notion of $k$-richness and later on Barker et al.\nthe notion of $k$-universality. In 2022 Fleischmann et al. presented a\ngeneralization of the arch factorization by intersecting the arch factorization\nof a word and its reverse. While the authors merely used this factorization for\nthe investigation of shortest absent scattered factors, in this work we\ninvestigate this new $\\alpha$-$\\beta$-factorization as such. We characterize\nthe famous Simon congruence of $k$-universal words in terms of $1$-universal\nwords. Moreover, we apply these results to binary words. In this special case,\nwe obtain a full characterization of the classes and calculate the index of the\ncongruence. Lastly, we start investigating the ternary case, present a full\nlist of possibilities for $\\alpha\\beta\\alpha$-factors, and characterize their\ncongruence.\n","authors":["Pamela Fleischmann","Jonas Höfer","Annika Huch","Dirk Nowotka"],"pdf_url":"https://arxiv.org/pdf/2306.14192v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10932v1","updated":"2023-07-20T15:02:42Z","published":"2023-07-20T15:02:42Z","title":"Identical and Fraternal Twins: Fine-Grained Semantic Contrastive\n Learning of Sentence Representations","summary":" The enhancement of unsupervised learning of sentence representations has been\nsignificantly achieved by the utility of contrastive learning. This approach\nclusters the augmented positive instance with the anchor instance to create a\ndesired embedding space. However, relying solely on the contrastive objective\ncan result in sub-optimal outcomes due to its inability to differentiate subtle\nsemantic variations between positive pairs. Specifically, common data\naugmentation techniques frequently introduce semantic distortion, leading to a\nsemantic margin between the positive pair. While the InfoNCE loss function\noverlooks the semantic margin and prioritizes similarity maximization between\npositive pairs during training, leading to the insensitive semantic\ncomprehension ability of the trained model. In this paper, we introduce a novel\nIdentical and Fraternal Twins of Contrastive Learning (named IFTCL) framework,\ncapable of simultaneously adapting to various positive pairs generated by\ndifferent augmentation techniques. We propose a \\textit{Twins Loss} to preserve\nthe innate margin during training and promote the potential of data enhancement\nin order to overcome the sub-optimal issue. We also present proof-of-concept\nexperiments combined with the contrastive objective to prove the validity of\nthe proposed Twins Loss. Furthermore, we propose a hippocampus queue mechanism\nto restore and reuse the negative instances without additional calculation,\nwhich further enhances the efficiency and performance of the IFCL. We verify\nthe IFCL framework on nine semantic textual similarity tasks with both English\nand Chinese datasets, and the experimental results show that IFCL outperforms\nstate-of-the-art methods.\n","authors":["Qingfa Xiao","Shuangyin Li","Lei Chen"],"pdf_url":"https://arxiv.org/pdf/2307.10932v1.pdf","comment":"This article has been accepted for publication in European Conference\n on Artificial Intelligence (ECAI2023). 
9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2307.10930v1","updated":"2023-07-20T14:59:02Z","published":"2023-07-20T14:59:02Z","title":"MediaGPT : A Large Language Model Target Chinese Media","summary":" The development of large language models (LLMs) has seen rapid progress in\nrecent years. One of the most widely used LLMs is the Generative Pre-trained\nTransformer (GPT) series, which has been applied in various fields, including\nthe media domain. However, in practical applications, the differences between\nthe media's use cases and the general-purpose applications of LLMs have become\nincreasingly apparent, especially Chinese. As a result, there is a growing need\nto develop LLM that are specifically tailored to the unique requirements of the\nmedia domain. In this paper, we present MediaGPT, a large language model\ntraining on variety of media data and addressing the practical needs of Chinese\nmedia. We have designed a diverse set of task instruction types to cater to the\nspecific requirements of the domain. To further validate the effectiveness of\nour proposed LLM, we have constructed unique datasets that are tailored to the\nmedia domain and have also developed verification methods that are specifically\ndesigned for generative-type tasks. By doing so, we aim to bridge the gap\nbetween the general-purpose LLM and the requirements of the media domain, and\nto pave the way for more effective and efficient use of LLM in this field. This\npaper aims to explore the challenges and opportunities of developing LLM for\nmedia applications and to propose potential solutions for addressing these\nchallenges.\n","authors":["Zhonghao Wang"],"pdf_url":"https://arxiv.org/pdf/2307.10930v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10928v1","updated":"2023-07-20T14:56:35Z","published":"2023-07-20T14:56:35Z","title":"FLASK: Fine-grained Language Model Evaluation based on Alignment Skill\n Sets","summary":" Evaluation of Large Language Models (LLMs) is challenging because aligning to\nhuman values requires the composition of multiple skills and the required set\nof skills varies depending on the instruction. Recent studies have evaluated\nthe performance of LLMs in two ways, (1) automatic evaluation on several\nindependent benchmarks and (2) human or machined-based evaluation giving an\noverall score to the response. However, both settings are coarse-grained\nevaluations, not considering the nature of user instructions that require\ninstance-wise skill composition, which limits the interpretation of the true\ncapabilities of LLMs. In this paper, we introduce FLASK (Fine-grained Language\nModel Evaluation based on Alignment SKill Sets), a fine-grained evaluation\nprotocol that can be used for both model-based and human-based evaluation which\ndecomposes coarse-level scoring to an instance-wise skill set-level.\nSpecifically, we define 12 fine-grained skills needed for LLMs to follow\nopen-ended user instructions and construct an evaluation set by allocating a\nset of skills for each instance. Additionally, by annotating the target domains\nand difficulty level for each instance, FLASK provides a holistic view with a\ncomprehensive analysis of a model's performance depending on skill, domain, and\ndifficulty. Through using FLASK, we compare multiple open-sourced and\nproprietary LLMs and observe highly-correlated findings between model-based and\nhuman-based evaluations. 
FLASK enables developers to more accurately measure\nthe model performance and how it can be improved by analyzing factors that make\nLLMs proficient in particular skills. For practitioners, FLASK can be used to\nrecommend suitable models for particular situations through comprehensive\ncomparison among various LLMs. We release the evaluation data and code\nimplementation at https://github.com/kaistAI/FLASK.\n","authors":["Seonghyeon Ye","Doyoung Kim","Sungdong Kim","Hyeonbin Hwang","Seungone Kim","Yongrae Jo","James Thorne","Juho Kim","Minjoon Seo"],"pdf_url":"https://arxiv.org/pdf/2307.10928v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.14030v2","updated":"2023-07-20T13:54:05Z","published":"2023-06-24T18:17:38Z","title":"My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models\n and Evaluation Benchmarks","summary":" The research on code-mixed data is limited due to the unavailability of\ndedicated code-mixed datasets and pre-trained language models. In this work, we\nfocus on the low-resource Indian language Marathi which lacks any prior work in\ncode-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English\n(Mr-En) corpus with 10 million social media sentences for pretraining. We also\nrelease L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models\npre-trained on MeCorpus. Furthermore, for benchmarking, we present three\nsupervised datasets MeHate, MeSent, and MeLID for downstream tasks like\ncode-mixed Mr-En hate speech detection, sentiment analysis, and language\nidentification respectively. These evaluation datasets individually consist of\nmanually annotated \\url{~}12,000 Marathi-English code-mixed tweets. Ablations\nshow that the models trained on this novel corpus significantly outperform the\nexisting state-of-the-art BERT models. This is the first work that presents\nartifacts for code-mixed Marathi research. All datasets and models are publicly\nreleased at https://github.com/l3cube-pune/MarathiNLP .\n","authors":["Tanmay Chavan","Omkar Gokhale","Aditya Kane","Shantanu Patankar","Raviraj Joshi"],"pdf_url":"https://arxiv.org/pdf/2306.14030v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10867v1","updated":"2023-07-20T13:40:22Z","published":"2023-07-20T13:40:22Z","title":"FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with\n Human Feedback","summary":" Captions are crucial for understanding scientific visualizations and\ndocuments. Existing captioning methods for scientific figures rely on\nfigure-caption pairs extracted from documents for training, many of which fall\nshort with respect to metrics like helpfulness, explainability, and\nvisual-descriptiveness [15] leading to generated captions being misaligned with\nreader preferences. To enable the generation of high-quality figure captions,\nwe introduce FigCaps-HF a new framework for figure-caption generation that can\nincorporate domain expert feedback in generating captions optimized for reader\npreferences. Our framework comprises of 1) an automatic method for evaluating\nquality of figure-caption pairs, 2) a novel reinforcement learning with human\nfeedback (RLHF) method to optimize a generative figure-to-caption model for\nreader preferences. We demonstrate the effectiveness of our simple learning\nframework by improving performance over standard fine-tuning across different\ntypes of models. 
In particular, when using BLIP as the base model, our RLHF\nframework achieves a mean gain of 35.7%, 16.9%, and 9% in ROUGE, BLEU, and\nMeteor, respectively. Finally, we release a large-scale benchmark dataset with\nhuman feedback on figure-caption pairs to enable further evaluation and\ndevelopment of RLHF techniques for this problem.\n","authors":["Ashish Singh","Prateek Agarwal","Zixuan Huang","Arpita Singh","Tong Yu","Sungchul Kim","Victor Bursztyn","Nikos Vlassis","Ryan A. Rossi"],"pdf_url":"https://arxiv.org/pdf/2307.10867v1.pdf","comment":"19 pages, 4 figures. Benchmark Documentation:\n https://figcapshf.github.io/"},{"id":"http://arxiv.org/abs/2307.10864v1","updated":"2023-07-20T13:33:28Z","published":"2023-07-20T13:33:28Z","title":"Divide & Bind Your Attention for Improved Generative Semantic Nursing","summary":" Emerging large-scale text-to-image generative models, e.g., Stable Diffusion\n(SD), have exhibited overwhelming results with high fidelity. Despite the\nmagnificent progress, current state-of-the-art models still struggle to\ngenerate images fully adhering to the input prompt. Prior work, Attend &\nExcite, has introduced the concept of Generative Semantic Nursing (GSN), aiming\nto optimize cross-attention during inference time to better incorporate the\nsemantics. It demonstrates promising results in generating simple prompts,\ne.g., ``a cat and a dog''. However, its efficacy declines when dealing with\nmore complex prompts, and it does not explicitly address the problem of\nimproper attribute binding. To address the challenges posed by complex prompts\nor scenarios involving multiple entities and to achieve improved attribute\nbinding, we propose Divide & Bind. We introduce two novel loss objectives for\nGSN: a novel attendance loss and a binding loss. Our approach stands out in its\nability to faithfully synthesize desired objects with improved attribute\nalignment from complex prompts and exhibits superior performance across\nmultiple evaluation benchmarks. More videos and updates can be found on the\nproject page \\url{https://sites.google.com/view/divide-and-bind}.\n","authors":["Yumeng Li","Margret Keuper","Dan Zhang","Anna Khoreva"],"pdf_url":"https://arxiv.org/pdf/2307.10864v1.pdf","comment":"Project page: \\url{https://sites.google.com/view/divide-and-bind}"},{"id":"http://arxiv.org/abs/2305.01146v3","updated":"2023-07-20T13:10:07Z","published":"2023-05-02T01:33:02Z","title":"RadAdapt: Radiology Report Summarization via Lightweight Domain\n Adaptation of Large Language Models","summary":" We systematically investigate lightweight strategies to adapt large language\nmodels (LLMs) for the task of radiology report summarization (RRS).\nSpecifically, we focus on domain adaptation via pretraining (on natural\nlanguage, biomedical text, or clinical text) and via discrete prompting or\nparameter-efficient fine-tuning. Our results consistently achieve best\nperformance by maximally adapting to the task via pretraining on clinical text\nand fine-tuning on RRS examples. Importantly, this method fine-tunes a mere\n0.32% of parameters throughout the model, in contrast to end-to-end fine-tuning\n(100% of parameters). Additionally, we study the effect of in-context examples\nand out-of-distribution (OOD) training before concluding with a radiologist\nreader study and qualitative analysis. 
Our findings highlight the importance of\ndomain adaptation in RRS and provide valuable insights toward developing\neffective natural language processing solutions for clinical tasks.\n","authors":["Dave Van Veen","Cara Van Uden","Maayane Attias","Anuj Pareek","Christian Bluethgen","Malgorzata Polacin","Wah Chiu","Jean-Benoit Delbrouck","Juan Manuel Zambrano Chaves","Curtis P. Langlotz","Akshay S. Chaudhari","John Pauly"],"pdf_url":"https://arxiv.org/pdf/2305.01146v3.pdf","comment":"12 pages, 10 figures. Published in ACL BioNLP. Compared to v1, v2\n includes minor edits and one additional figure in the appendix. Compared to\n v2, v3 includes a link to the project's GitHub repository"},{"id":"http://arxiv.org/abs/2307.10826v1","updated":"2023-07-20T12:41:35Z","published":"2023-07-20T12:41:35Z","title":"Yelp Reviews and Food Types: A Comparative Analysis of Ratings,\n Sentiments, and Topics","summary":" This study examines the relationship between Yelp reviews and food types,\ninvestigating how ratings, sentiments, and topics vary across different types\nof food. Specifically, we analyze how ratings and sentiments of reviews vary\nacross food types, cluster food types based on ratings and sentiments, infer\nreview topics using machine learning models, and compare topic distributions\namong different food types. Our analyses reveal that some food types have\nsimilar ratings, sentiments, and topics distributions, while others have\ndistinct patterns. We identify four clusters of food types based on ratings and\nsentiments and find that reviewers tend to focus on different topics when\nreviewing certain food types. These findings have important implications for\nunderstanding user behavior and cultural influence on digital media platforms\nand promoting cross-cultural understanding and appreciation.\n","authors":["Wenyu Liao","Yiqing Shi","Yujia Hu","Wei Quan"],"pdf_url":"https://arxiv.org/pdf/2307.10826v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10814v1","updated":"2023-07-20T12:24:23Z","published":"2023-07-20T12:24:23Z","title":"Cross-Corpus Multilingual Speech Emotion Recognition: Amharic vs. Other\n Languages","summary":" In a conventional Speech emotion recognition (SER) task, a classifier for a\ngiven language is trained on a pre-existing dataset for that same language.\nHowever, where training data for a language does not exist, data from other\nlanguages can be used instead. We experiment with cross-lingual and\nmultilingual SER, working with Amharic, English, German and URDU. For Amharic,\nwe use our own publicly-available Amharic Speech Emotion Dataset (ASED). For\nEnglish, German and Urdu we use the existing RAVDESS, EMO-DB and URDU datasets.\nWe followed previous research in mapping labels for all datasets to just two\nclasses, positive and negative. Thus we can compare performance on different\nlanguages directly, and combine languages for training and testing. In\nExperiment 1, monolingual SER trials were carried out using three classifiers,\nAlexNet, VGGE (a proposed variant of VGG), and ResNet50. Results averaged for\nthe three models were very similar for ASED and RAVDESS, suggesting that\nAmharic and English SER are equally difficult. Similarly, German SER is more\ndifficult, and Urdu SER is easier. In Experiment 2, we trained on one language\nand tested on another, in both directions for each pair: Amharic<->German,\nAmharic<->English, and Amharic<->Urdu. Results with Amharic as target suggested\nthat using English or German as source will give the best result. 
In Experiment\n3, we trained on several non-Amharic languages and then tested on Amharic. The\nbest accuracy obtained was several percent greater than the best accuracy in\nExperiment 2, suggesting that a better result can be obtained when using two or\nthree non-Amharic languages for training than when using just one non-Amharic\nlanguage. Overall, the results suggest that cross-lingual and multilingual\ntraining can be an effective strategy for training a SER classifier when\nresources for a language are scarce.\n","authors":["Ephrem Afele Retta","Richard Sutcliffe","Jabar Mahmood","Michael Abebe Berwo","Eiad Almekhlafi","Sajjad Ahmed Khan","Shehzad Ashraf Chaudhry","Mustafa Mhamed","Jun Feng"],"pdf_url":"https://arxiv.org/pdf/2307.10814v1.pdf","comment":"16 pages, 9 tables, 5 figures"},{"id":"http://arxiv.org/abs/2307.10802v1","updated":"2023-07-20T12:10:29Z","published":"2023-07-20T12:10:29Z","title":"Meta-Transformer: A Unified Framework for Multimodal Learning","summary":" Multimodal learning aims to build models that can process and relate\ninformation from multiple modalities. Despite years of development in this\nfield, it still remains challenging to design a unified network for processing\nvarious modalities ($\\textit{e.g.}$ natural language, 2D images, 3D point\nclouds, audio, video, time series, tabular data) due to the inherent gaps among\nthem. In this work, we propose a framework, named Meta-Transformer, that\nleverages a $\\textbf{frozen}$ encoder to perform multimodal perception without\nany paired multimodal training data. In Meta-Transformer, the raw input data\nfrom various modalities are mapped into a shared token space, allowing a\nsubsequent encoder with frozen parameters to extract high-level semantic\nfeatures of the input data. Composed of three main components: a unified data\ntokenizer, a modality-shared encoder, and task-specific heads for downstream\ntasks, Meta-Transformer is the first framework to perform unified learning\nacross 12 modalities with unpaired data. Experiments on different benchmarks\nreveal that Meta-Transformer can handle a wide range of tasks including\nfundamental perception (text, image, point cloud, audio, video), practical\napplication (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph,\ntabular, and time-series). Meta-Transformer indicates a promising future for\ndeveloping unified multimodal intelligence with transformers. Code will be\navailable at https://github.com/invictus717/MetaTransformer\n","authors":["Yiyuan Zhang","Kaixiong Gong","Kaipeng Zhang","Hongsheng Li","Yu Qiao","Wanli Ouyang","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2307.10802v1.pdf","comment":"Project website: https://kxgong.github.io/meta_transformer/"},{"id":"http://arxiv.org/abs/2307.10799v1","updated":"2023-07-20T12:01:40Z","published":"2023-07-20T12:01:40Z","title":"Layer-wise Representation Fusion for Compositional Generalization","summary":" Despite successes across a broad range of applications, sequence-to-sequence\nmodels' construct of solutions are argued to be less compositional than\nhuman-like generalization. There is mounting evidence that one of the reasons\nhindering compositional generalization is representations of the encoder and\ndecoder uppermost layer are entangled. In other words, the syntactic and\nsemantic representations of sequences are twisted inappropriately. 
However,\nmost previous studies mainly concentrate on enhancing token-level semantic\ninformation to alleviate the representations entanglement problem, rather than\ncomposing and using the syntactic and semantic representations of sequences\nappropriately as humans do. In addition, we explain why the entanglement\nproblem exists from the perspective of recent studies about training deeper\nTransformer, mainly owing to the ``shallow'' residual connections and its\nsimple, one-step operations, which fails to fuse previous layers' information\neffectively. Starting from this finding and inspired by humans' strategies, we\npropose \\textsc{FuSion} (\\textbf{Fu}sing \\textbf{S}yntactic and\nSemant\\textbf{i}c Representati\\textbf{on}s), an extension to\nsequence-to-sequence models to learn to fuse previous layers' information back\ninto the encoding and decoding process appropriately through introducing a\n\\emph{fuse-attention module} at each encoder and decoder layer. \\textsc{FuSion}\nachieves competitive and even \\textbf{state-of-the-art} results on two\nrealistic benchmarks, which empirically demonstrates the effectiveness of our\nproposal.\n","authors":["Yafang Zheng","Lei Lin","Zhaohong Lai","Binling Wang","Shan Liu","Biao Fu","Wenhao Rao","Peigen Ye","Yidong Chen","Xiaodong Shi"],"pdf_url":"https://arxiv.org/pdf/2307.10799v1.pdf","comment":"work in progress. arXiv admin note: substantial text overlap with\n arXiv:2305.12169"},{"id":"http://arxiv.org/abs/2210.11835v2","updated":"2023-07-20T11:56:40Z","published":"2022-10-21T09:28:54Z","title":"A Textless Metric for Speech-to-Speech Comparison","summary":" In this paper, we introduce a new and simple method for comparing speech\nutterances without relying on text transcripts. Our speech-to-speech comparison\nmetric utilizes state-of-the-art speech2unit encoders like HuBERT to convert\nspeech utterances into discrete acoustic units. We then propose a simple and\neasily replicable neural architecture that learns a speech-based metric that\nclosely corresponds to its text-based counterpart. This textless metric has\nnumerous potential applications, including evaluating speech-to-speech\ntranslation for oral languages, languages without dependable ASR systems, or to\navoid the need for ASR transcription altogether. This paper also shows that for\nspeech-to-speech translation evaluation, ASR-BLEU (which consists in\nautomatically transcribing both speech hypothesis and reference and compute\nsentence-level BLEU between transcripts) is a poor proxy to real text-BLEU even\nwhen ASR system is strong.\n","authors":["Laurent Besacier","Swen Ribeiro","Olivier Galibert","Ioan Calapodescu"],"pdf_url":"https://arxiv.org/pdf/2210.11835v2.pdf","comment":"link to supplementary material:\n https://github.com/besacier/textless-metric"},{"id":"http://arxiv.org/abs/2307.10778v1","updated":"2023-07-20T11:29:15Z","published":"2023-07-20T11:29:15Z","title":"Extreme Multi-Label Skill Extraction Training using Large Language\n Models","summary":" Online job ads serve as a valuable source of information for skill\nrequirements, playing a crucial role in labor market analysis and e-recruitment\nprocesses. Since such ads are typically formatted in free text, natural\nlanguage processing (NLP) technologies are required to automatically process\nthem. 
We specifically focus on the task of detecting skills (mentioned\nliterally, or implicitly described) and linking them to a large skill ontology,\nmaking it a challenging case of extreme multi-label classification (XMLC).\nGiven that no sizable labeled (training) dataset is available for\nthis specific XMLC task, we propose techniques to leverage general Large\nLanguage Models (LLMs). We describe a cost-effective approach to generate an\naccurate, fully synthetic labeled dataset for skill extraction, and present a\ncontrastive learning strategy that proves effective in the task. Our results\nacross three skill extraction benchmarks show a consistent increase of between\n15 and 25 percentage points in \\textit{R-Precision@5} compared to previously\npublished results that relied solely on distant supervision through literal\nmatches.\n","authors":["Jens-Joris Decorte","Severine Verlinden","Jeroen Van Hautte","Johannes Deleu","Chris Develder","Thomas Demeester"],"pdf_url":"https://arxiv.org/pdf/2307.10778v1.pdf","comment":"Accepted to the International workshop on AI for Human Resources and\n Public Employment Services (AI4HR&PES) as part of ECML-PKDD 2023"},{"id":"http://arxiv.org/abs/2305.15299v2","updated":"2023-07-20T10:43:57Z","published":"2023-05-24T16:23:46Z","title":"Science in the Era of ChatGPT, Large Language Models and Generative AI:\n Challenges for Research Ethics and How to Respond","summary":" Large language models of artificial intelligence (AI), such as ChatGPT, find\nremarkable but controversial applicability in science and research. This paper\nreviews epistemological challenges, ethical and integrity risks in science\nconduct in the advent of generative AI. This is with the aim to lay new timely\nfoundations for a high-quality research ethics review. The role of AI language\nmodels as a research instrument and subject is scrutinized along with ethical\nimplications for scientists, participants and reviewers. New emerging practices\nfor research ethics review are discussed, concluding with ten recommendations\nthat shape a response for a more responsible research conduct in the era of AI.\n","authors":["Evangelos Pournaras"],"pdf_url":"https://arxiv.org/pdf/2305.15299v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10757v1","updated":"2023-07-20T10:42:16Z","published":"2023-07-20T10:42:16Z","title":"Vesper: A Compact and Effective Pretrained Model for Speech Emotion\n Recognition","summary":" This paper presents a paradigm that adapts general large-scale pretrained\nmodels (PTMs) to the speech emotion recognition task. Although PTMs shed new light\non artificial general intelligence, they are constructed with general tasks in\nmind, and thus, their efficacy for specific tasks can be further improved.\nAdditionally, employing PTMs in practical applications can be challenging due\nto their considerable size. The above limitations spawn another research direction,\nnamely, optimizing large-scale PTMs for specific tasks to generate\ntask-specific PTMs that are both compact and effective. In this paper, we focus\non the speech emotion recognition task and propose an improved emotion-specific\npretrained encoder called Vesper. Vesper is pretrained on a speech dataset\nbased on WavLM and takes into account emotional characteristics. To enhance\nsensitivity to emotional information, Vesper employs an emotion-guided masking\nstrategy to identify the regions that need masking. 
Subsequently, Vesper\nemploys hierarchical and cross-layer self-supervision to improve its ability to\ncapture acoustic and semantic representations, both of which are crucial for\nemotion recognition. Experimental results on the IEMOCAP, MELD, and CREMA-D\ndatasets demonstrate that Vesper with 4 layers outperforms WavLM Base with 12\nlayers, and the performance of Vesper with 12 layers surpasses that of WavLM\nLarge with 24 layers.\n","authors":["Weidong Chen","Xiaofen Xing","Peihao Chen","Xiangmin Xu"],"pdf_url":"https://arxiv.org/pdf/2307.10757v1.pdf","comment":"13 pages, 5 figures, 8 tables"},{"id":"http://arxiv.org/abs/2307.10751v1","updated":"2023-07-20T10:26:57Z","published":"2023-07-20T10:26:57Z","title":"Exploring Perspectives on the Impact of Artificial Intelligence on the\n Creativity of Knowledge Work: Beyond Mechanised Plagiarism and Stochastic\n Parrots","summary":" Artificial Intelligence (AI), and in particular generative models, are\ntransformative tools for knowledge work. They problematise notions of\ncreativity, originality, plagiarism, the attribution of credit, and copyright\nownership. Critics of generative models emphasise the reliance on large amounts\nof training data, and view the output of these models as no more than\nrandomised plagiarism, remix, or collage of the source data. On these grounds,\nmany have argued for stronger regulations on the deployment, use, and\nattribution of the output of these models. However, these issues are not new or\nunique to artificial intelligence. In this position paper, using examples from\nliterary criticism, the history of art, and copyright law, I show how\ncreativity and originality resist definition as a notatable or\ninformation-theoretic property of an object, and instead can be seen as the\nproperty of a process, an author, or a viewer. Further alternative views hold\nthat all creative work is essentially reuse (mostly without attribution), or\nthat randomness itself can be creative. I suggest that creativity is ultimately\ndefined by communities of creators and receivers, and the deemed sources of\ncreativity in a workflow often depend on which parts of the workflow can be\nautomated. Using examples from recent studies of AI in creative knowledge work,\nI suggest that AI shifts knowledge work from material production to critical\nintegration. This position paper aims to begin a conversation around a more\nnuanced approach to the problems of creativity and credit assignment for\ngenerative models, one which more fully recognises the importance of the\ncreative and curatorial voice of the users of these models and moves away from\nsimpler notational or information-theoretic views.\n","authors":["Advait Sarkar"],"pdf_url":"https://arxiv.org/pdf/2307.10751v1.pdf","comment":"Advait Sarkar. 2023. Exploring Perspectives on the Impact of\n Artificial Intelligence on the Creativity of Knowledge Work Beyond Mechanised\n Plagiarism and Stochastic Parrots. In Annual Symposium on Human-Computer\n Interaction for Work 2023 (CHIWORK 2023), June 13-16, 2023, Oldenburg,\n Germany. ACM, New York, NY, USA, 17 pages"},{"id":"http://arxiv.org/abs/2301.11596v4","updated":"2023-07-20T08:58:12Z","published":"2023-01-27T08:45:53Z","title":"ThoughtSource: A central hub for large language model reasoning data","summary":" Large language models (LLMs) such as GPT-4 have recently demonstrated\nimpressive results across a wide range of tasks. 
LLMs are still limited,\nhowever, in that they frequently fail at complex reasoning, their reasoning\nprocesses are opaque, they are prone to 'hallucinate' facts, and there are\nconcerns about their underlying biases. Letting models verbalize reasoning\nsteps as natural language, a technique known as chain-of-thought prompting, has\nrecently been proposed as a way to address some of these issues. Here we\npresent ThoughtSource, a meta-dataset and software library for chain-of-thought\n(CoT) reasoning. The goal of ThoughtSource is to improve future artificial\nintelligence systems by facilitating qualitative understanding of CoTs,\nenabling empirical evaluations, and providing training data. This first release\nof ThoughtSource integrates six scientific/medical, three general-domain and\nfive math word question answering datasets.\n","authors":["Simon Ott","Konstantin Hebenstreit","Valentin Liévin","Christoffer Egeberg Hother","Milad Moradi","Maximilian Mayrhauser","Robert Praas","Ole Winther","Matthias Samwald"],"pdf_url":"https://arxiv.org/pdf/2301.11596v4.pdf","comment":"Revision: added datasets, formatting"},{"id":"http://arxiv.org/abs/2011.00696v2","updated":"2023-07-20T08:56:26Z","published":"2020-11-02T03:07:38Z","title":"ABNIRML: Analyzing the Behavior of Neural IR Models","summary":" Pretrained contextualized language models such as BERT and T5 have\nestablished a new state-of-the-art for ad-hoc search. However, it is not yet\nwell-understood why these methods are so effective, what makes some variants\nmore effective than others, and what pitfalls they may have. We present a new\ncomprehensive framework for Analyzing the Behavior of Neural IR ModeLs\n(ABNIRML), which includes new types of diagnostic probes that allow us to test\nseveral characteristics -- such as writing styles, factuality, sensitivity to\nparaphrasing and word order -- that are not addressed by previous techniques.\nTo demonstrate the value of the framework, we conduct an extensive empirical\nstudy that yields insights into the factors that contribute to the neural\nmodel's gains, and identify potential unintended biases the models exhibit.\nSome of our results confirm conventional wisdom, like that recent neural\nranking models rely less on exact term overlap with the query, and instead\nleverage richer linguistic information, evidenced by their higher sensitivity\nto word and sentence order. Other results are more surprising, such as that\nsome models (e.g., T5 and ColBERT) are biased towards factually correct (rather\nthan simply relevant) texts. Further, some characteristics vary even for the\nsame base language model, and other characteristics can appear due to random\nvariations during model training.\n","authors":["Sean MacAvaney","Sergey Feldman","Nazli Goharian","Doug Downey","Arman Cohan"],"pdf_url":"https://arxiv.org/pdf/2011.00696v2.pdf","comment":"TACL version"},{"id":"http://arxiv.org/abs/2306.06427v2","updated":"2023-07-20T08:47:14Z","published":"2023-06-10T12:42:36Z","title":"Boosting Language Models Reasoning with Chain-of-Knowledge Prompting","summary":" Recently, Chain-of-Thought (CoT) prompting has delivered success on complex\nreasoning tasks, which aims at designing a simple prompt like ``Let's think\nstep by step'' or multiple in-context exemplars with well-designed rationales\nto elicit Large Language Models (LLMs) to generate intermediate reasoning\nsteps. However, the generated rationales often come with mistakes, making\nunfactual and unfaithful reasoning chains. 
To mitigate this brittleness, we\npropose a novel Chain-of-Knowledge (CoK) prompting, where we aim at eliciting\nLLMs to generate explicit pieces of knowledge evidence in the form of structure\ntriple. This is inspired by our human behaviors, i.e., we can draw a mind map\nor knowledge map as the reasoning evidence in the brain before answering a\ncomplex question. Benefiting from CoK, we additionally introduce a\nF^2-Verification method to estimate the reliability of the reasoning chains in\nterms of factuality and faithfulness. For the unreliable response, the wrong\nevidence can be indicated to prompt the LLM to rethink. Extensive experiments\ndemonstrate that our method can further improve the performance of commonsense,\nfactual, symbolic, and arithmetic reasoning tasks.\n","authors":["Jianing Wang","Qiushi Sun","Nuo Chen","Xiang Li","Ming Gao"],"pdf_url":"https://arxiv.org/pdf/2306.06427v2.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2307.10700v1","updated":"2023-07-20T08:45:00Z","published":"2023-07-20T08:45:00Z","title":"Large language models shape and are shaped by society: A survey of arXiv\n publication patterns","summary":" There has been a steep recent increase in the number of large language model\n(LLM) papers, producing a dramatic shift in the scientific landscape which\nremains largely undocumented through bibliometric analysis. Here, we analyze\n388K papers posted on the CS and Stat arXivs, focusing on changes in\npublication patterns in 2023 vs. 2018-2022. We analyze how the proportion of\nLLM papers is increasing; the LLM-related topics receiving the most attention;\nthe authors writing LLM papers; how authors' research topics correlate with\ntheir backgrounds; the factors distinguishing highly cited LLM papers; and the\npatterns of international collaboration. We show that LLM research increasingly\nfocuses on societal impacts: there has been an 18x increase in the proportion\nof LLM-related papers on the Computers and Society sub-arXiv, and authors newly\npublishing on LLMs are more likely to focus on applications and societal\nimpacts than more experienced authors. LLM research is also shaped by social\ndynamics: we document gender and academic/industry disparities in the topics\nLLM authors focus on, and a US/China schism in the collaboration network.\nOverall, our analysis documents the profound ways in which LLM research both\nshapes and is shaped by society, attesting to the necessity of sociotechnical\nlenses.\n","authors":["Rajiv Movva","Sidhika Balachandar","Kenny Peng","Gabriel Agostini","Nikhil Garg","Emma Pierson"],"pdf_url":"https://arxiv.org/pdf/2307.10700v1.pdf","comment":"Working paper"},{"id":"http://arxiv.org/abs/2303.12112v3","updated":"2023-07-20T08:16:09Z","published":"2023-03-21T18:03:14Z","title":"Positive-Augmented Contrastive Learning for Image and Video Captioning\n Evaluation","summary":" The CLIP model has been recently proven to be very effective for a variety of\ncross-modal tasks, including the evaluation of captions generated from\nvision-and-language architectures. In this paper, we propose a new recipe for a\ncontrastive-based evaluation metric for image captioning, namely\nPositive-Augmented Contrastive learning Score (PAC-S), that in a novel way\nunifies the learning of a contrastive visual-semantic space with the addition\nof generated images and text on curated data. 
Experiments spanning several\ndatasets demonstrate that our new metric achieves the highest correlation with\nhuman judgments on both images and videos, outperforming existing\nreference-based metrics like CIDEr and SPICE and reference-free metrics like\nCLIP-Score. Finally, we test the system-level correlation of the proposed\nmetric when considering popular image captioning approaches, and assess the\nimpact of employing different cross-modal features. Our source code and trained\nmodels are publicly available at: https://github.com/aimagelab/pacscore.\n","authors":["Sara Sarto","Manuele Barraco","Marcella Cornia","Lorenzo Baraldi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2303.12112v3.pdf","comment":"CVPR 2023 (highlight paper)"},{"id":"http://arxiv.org/abs/2307.10666v1","updated":"2023-07-20T07:47:08Z","published":"2023-07-20T07:47:08Z","title":"A Dataset and Strong Baselines for Classification of Czech News Texts","summary":" Pre-trained models for Czech Natural Language Processing are often evaluated\non purely linguistic tasks (POS tagging, parsing, NER) and relatively simple\nclassification tasks such as sentiment classification or article classification\nfrom a single news source. As an alternative, we present\nCZEch~NEws~Classification~dataset (CZE-NEC), one of the largest Czech\nclassification datasets, composed of news articles from various sources\nspanning over twenty years, which allows a more rigorous evaluation of such\nmodels. We define four classification tasks: news source, news category,\ninferred author's gender, and day of the week. To verify the task difficulty,\nwe conducted a human evaluation, which revealed that human performance lags\nbehind strong machine-learning baselines built upon pre-trained transformer\nmodels. Furthermore, we show that language-specific pre-trained encoder\nanalysis outperforms selected commercially available large-scale generative\nlanguage models.\n","authors":["Hynek Kydlíček","Jindřich Libovický"],"pdf_url":"https://arxiv.org/pdf/2307.10666v1.pdf","comment":"12 pages, Accepted to Text, Speech and Dialogue (TSD) 2023"},{"id":"http://arxiv.org/abs/2307.10652v1","updated":"2023-07-20T07:33:30Z","published":"2023-07-20T07:33:30Z","title":"Exploring the Landscape of Natural Language Processing Research","summary":" As an efficient approach to understand, generate, and process natural\nlanguage texts, research in natural language processing (NLP) has exhibited a\nrapid spread and wide adoption in recent years. Given the increasing amount of\nresearch work in this area, several NLP-related approaches have been surveyed\nin the research community. However, a comprehensive study that categorizes\nestablished topics, identifies trends, and outlines areas for future research\nremains absent to this day. Contributing to closing this gap, we have\nsystematically classified and analyzed research papers included in the ACL\nAnthology. 
As a result, we present a structured overview of the research\nlandscape, provide a taxonomy of fields-of-study in NLP, analyze recent\ndevelopments in NLP, summarize our findings, and highlight directions for\nfuture work.\n","authors":["Tim Schopf","Karim Arabi","Florian Matthes"],"pdf_url":"https://arxiv.org/pdf/2307.10652v1.pdf","comment":"Accepted to the 14th International Conference on Recent Advances in\n Natural Language Processing (RANLP 2023)"},{"id":"http://arxiv.org/abs/2307.10635v1","updated":"2023-07-20T07:01:57Z","published":"2023-07-20T07:01:57Z","title":"SciBench: Evaluating College-Level Scientific Problem-Solving Abilities\n of Large Language Models","summary":" Recent advances in large language models (LLMs) have demonstrated notable\nprogress on many mathematical benchmarks. However, most of these benchmarks\nonly feature problems grounded in junior and senior high school subjects,\ncontain only multiple-choice questions, and are confined to a limited scope of\nelementary arithmetic operations. To address these issues, this paper\nintroduces an expansive benchmark suite SciBench that aims to systematically\nexamine the reasoning capabilities required for complex scientific problem\nsolving. SciBench contains two carefully curated datasets: an open set\nfeaturing a range of collegiate-level scientific problems drawn from\nmathematics, chemistry, and physics textbooks, and a closed set comprising\nproblems from undergraduate-level exams in computer science and mathematics.\nBased on the two datasets, we conduct an in-depth benchmark study of two\nrepresentative LLMs with various prompting strategies. The results reveal that\ncurrent LLMs fall short of delivering satisfactory performance, with an overall\nscore of merely 35.80%. Furthermore, through a detailed user study, we\ncategorize the errors made by LLMs into ten problem-solving abilities. Our\nanalysis indicates that no single prompting strategy significantly outperforms\nothers and some strategies that demonstrate improvements in certain\nproblem-solving skills result in declines in other skills. We envision that\nSciBench will catalyze further developments in the reasoning abilities of LLMs,\nthereby ultimately contributing to scientific research and discovery.\n","authors":["Xiaoxuan Wang","Ziniu Hu","Pan Lu","Yanqiao Zhu","Jieyu Zhang","Satyen Subramaniam","Arjun R. Loomba","Shichang Zhang","Yizhou Sun","Wei Wang"],"pdf_url":"https://arxiv.org/pdf/2307.10635v1.pdf","comment":"Work in progress, 18 pages"},{"id":"http://arxiv.org/abs/2307.10634v1","updated":"2023-07-20T06:59:02Z","published":"2023-07-20T06:59:02Z","title":"Generative Language Models on Nucleotide Sequences of Human Genes","summary":" Language models, primarily transformer-based ones, obtained colossal success\nin NLP. To be more precise, studies like BERT in NLU and works such as GPT-3\nfor NLG are very crucial. DNA sequences are very close to natural language in\nterms of structure, so if the DNA-related bioinformatics domain is concerned,\ndiscriminative models, like DNABert, exist. Yet, the generative side of the\ncoin is mainly unexplored to the best of our knowledge. Consequently, we\nfocused on developing an autoregressive generative language model like GPT-3\nfor DNA sequences. 
Because working with whole DNA sequences is challenging\nwithout substantial computational resources, we decided to carry out our study\non a smaller scale, focusing on nucleotide sequences of human genes, unique\nparts in DNA with specific functionalities, instead of the whole DNA. This\ndecision did not change the problem structure a lot due to the fact that both\nDNA and genes can be seen as 1D sequences consisting of four different\nnucleotides without losing much information and making too much simplification.\nFirst of all, we systematically examined an almost entirely unexplored problem\nand observed that RNNs performed the best while simple techniques like N-grams\nwere also promising. Another beneficial point was learning how to work with\ngenerative models on languages we do not understand, unlike natural language.\nHow essential using real-life tasks beyond the classical metrics such as\nperplexity is observed. Furthermore, checking whether the data-hungry nature of\nthese models can be changed through selecting a language with minimal\nvocabulary size, four owing to four different types of nucleotides, is\nexamined. The reason for reviewing this was that choosing such a language might\nmake the problem easier. However, what we observed in this study was it did not\nprovide that much of a change in the amount of data needed.\n","authors":["Musa Nuri Ihtiyar","Arzucan Ozgur"],"pdf_url":"https://arxiv.org/pdf/2307.10634v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10633v1","updated":"2023-07-20T06:58:55Z","published":"2023-07-20T06:58:55Z","title":"Multi-Method Self-Training: Improving Code Generation With Text, And\n Vice Versa","summary":" Large Language Models have many methods for solving the same problem. This\nintroduces novel strengths (different methods may work well for different\nproblems) and weaknesses (it may be difficult for users to know which method to\nuse). In this paper, we introduce Multi-Method Self-Training (MMST), where one\nmethod is trained on the filtered outputs of another, allowing us to augment\nthe strengths and ameliorate the weaknesses of each method. Using a 176B\nparameter model trained on both language and code, we show that MMST can 1)\nimprove the less performant method (up to 30%) making the model easier to use,\n2) improve the more performant method (up to 32.2%) making the model more\nperformant, and 3) improve the performance of related but distinct tasks (up to\n10.3%) by improving the ability of the model to generate rationales. We then\nconduct ablation analyses to explore why MMST works. We show that MMST\ngenerates more data than traditional self-training, but the improvement in\nperformance is driven by the use of multiple methods. We also analyze\nprompt-engineering and anti-correlated performance between methods as means of\nmaking MMST more effective. We hope the evidence from our paper motivates\nmachine learning researchers to explore ways in which advances in language\nmodels allow for new forms of training.\n","authors":["Shriyash K. Upadhyay","Etan J. 
Ginsberg"],"pdf_url":"https://arxiv.org/pdf/2307.10633v1.pdf","comment":"23 pages, 3 figures"},{"id":"http://arxiv.org/abs/2307.10587v1","updated":"2023-07-20T05:03:00Z","published":"2023-07-20T05:03:00Z","title":"A Deep Dive into the Disparity of Word Error Rates Across Thousands of\n NPTEL MOOC Videos","summary":" Automatic speech recognition (ASR) systems are designed to transcribe spoken\nlanguage into written text and find utility in a variety of applications\nincluding voice assistants and transcription services. However, it has been\nobserved that state-of-the-art ASR systems which deliver impressive benchmark\nresults, struggle with speakers of certain regions or demographics due to\nvariation in their speech properties. In this work, we describe the curation of\na massive speech dataset of 8740 hours consisting of $\\sim9.8$K technical\nlectures in the English language along with their transcripts delivered by\ninstructors representing various parts of Indian demography. The dataset is\nsourced from the very popular NPTEL MOOC platform. We use the curated dataset\nto measure the existing disparity in YouTube Automatic Captions and OpenAI\nWhisper model performance across the diverse demographic traits of speakers in\nIndia. While there exists disparity due to gender, native region, age and\nspeech rate of speakers, disparity based on caste is non-existent. We also\nobserve statistically significant disparity across the disciplines of the\nlectures. These results indicate the need of more inclusive and robust ASR\nsystems and more representational datasets for disparity evaluation in them.\n","authors":["Anand Kumar Rai","Siddharth D Jaiswal","Animesh Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2307.10587v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10558v1","updated":"2023-07-20T03:54:24Z","published":"2023-07-20T03:54:24Z","title":"Instruction-following Evaluation through Verbalizer Manipulation","summary":" While instruction-tuned models have shown remarkable success in various\nnatural language processing tasks, accurately evaluating their ability to\nfollow instructions remains challenging. Existing benchmarks primarily focus on\ncommon instructions that align well with what the model learned during\ntraining. However, proficiency in responding to these instructions does not\nnecessarily imply strong ability in instruction following. In this paper, we\npropose a novel instruction-following evaluation protocol called verbalizer\nmanipulation. It instructs the model to verbalize the task label with words\naligning with model priors to different extents, adopting verbalizers from\nhighly aligned (e.g., outputting ``postive'' for positive sentiment), to\nminimally aligned (e.g., outputting ``negative'' for positive sentiment).\nVerbalizer manipulation can be seamlessly integrated with any classification\nbenchmark to examine the model's reliance on priors and its ability to override\nthem to accurately follow the instructions. We conduct a comprehensive\nevaluation of four major model families across nine datasets, employing twelve\nsets of verbalizers for each of them. We observe that the instruction-following\nabilities of models, across different families and scales, are significantly\ndistinguished by their performance on less natural verbalizers. 
Even the\nstrongest GPT-4 model struggles to perform better than random guessing on the\nmost challenging verbalizer, emphasizing the need for continued advancements to\nimprove their instruction-following abilities.\n","authors":["Shiyang Li","Jun Yan","Hai Wang","Zheng Tang","Xiang Ren","Vijay Srinivasan","Hongxia Jin"],"pdf_url":"https://arxiv.org/pdf/2307.10558v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.14795v2","updated":"2023-07-20T03:39:19Z","published":"2023-06-26T15:53:02Z","title":"MotionGPT: Human Motion as a Foreign Language","summary":" Though the advancement of pre-trained large language models unfolds, the\nexploration of building a unified model for language and other multi-modal\ndata, such as motion, remains challenging and untouched so far. Fortunately,\nhuman motion displays a semantic coupling akin to human language, often\nperceived as a form of body language. By fusing language data with large-scale\nmotion models, motion-language pre-training that can enhance the performance of\nmotion-related tasks becomes feasible. Driven by this insight, we propose\nMotionGPT, a unified, versatile, and user-friendly motion-language model to\nhandle multiple motion-relevant tasks. Specifically, we employ the discrete\nvector quantization for human motion and transfer 3D motion into motion tokens,\nsimilar to the generation process of word tokens. Building upon this \"motion\nvocabulary\", we perform language modeling on both motion and text in a unified\nmanner, treating human motion as a specific language. Moreover, inspired by\nprompt learning, we pre-train MotionGPT with a mixture of motion-language data\nand fine-tune it on prompt-based question-and-answer tasks. Extensive\nexperiments demonstrate that MotionGPT achieves state-of-the-art performances\non multiple motion tasks including text-driven motion generation, motion\ncaptioning, motion prediction, and motion in-between.\n","authors":["Biao Jiang","Xin Chen","Wen Liu","Jingyi Yu","Gang Yu","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2306.14795v2.pdf","comment":"Project Page: https://github.com/OpenMotionLab/MotionGPT"},{"id":"http://arxiv.org/abs/2307.10549v1","updated":"2023-07-20T03:26:57Z","published":"2023-07-20T03:26:57Z","title":"Dynamic Large Language Models on Blockchains","summary":" Training and deploying the large language models requires a large amount of\ncomputational resources because the language models contain billions of\nparameters and the text has thousands of tokens. Another problem is that the\nlarge language models are static. They are fixed after the training process. To\ntackle these issues, in this paper, we propose to train and deploy the dynamic\nlarge language model on blockchains, which have high computation performance\nand are distributed across a network of computers. A blockchain is a secure,\ndecentralized, and transparent system that allows for the creation of a\ntamper-proof ledger for transactions without the need for intermediaries. The\ndynamic large language models can continuously learn from the user input after\nthe training process. 
Our method provides a new way to develop the large\nlanguage models and also sheds a light on the next generation artificial\nintelligence systems.\n","authors":["Yuanhao Gong"],"pdf_url":"https://arxiv.org/pdf/2307.10549v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.00470v4","updated":"2023-07-20T03:03:25Z","published":"2023-07-02T04:32:41Z","title":"PatternGPT :A Pattern-Driven Framework for Large Language Model Text\n Generation","summary":" Large language models(LLMS)have shown excellent text generation capabilities,\ncapable of generating fluent human-like responses for many downstream tasks.\nHowever, applying large language models to real-world critical tasks remains\nchallenging due to their susceptibility to hallucinations and inability to\ndirectly use external knowledge. To cope with the above challenges, this paper\nproposes PatternGPT, a pattern-driven text generation framework for Large\nLanguage Models. Firstly, the framework utilizes the extraction capability of\nLarge Language Models to generate rich and diversified structured and\nformalized patterns, which facilitates the introduction of external knowledge\nto do the computation, and then draws on the idea of federated learning to use\nmultiple agents to achieve the sharing in order to obtain more diversified\npatterns, and finally uses judgment criteria and optimization algorithm to\nsearch for high-quality patterns to guide the generation of models. Finally,\nexternal knowledge such as judgment criteria and optimization algorithms are\nused to search for high-quality patterns, and the searched patterns are used to\nguide model generation. This framework has the advantages of generating\ndiversified patterns, protecting data privacy, combining external knowledge,\nand improving the quality of generation, which provides an effective method to\noptimize the text generation capability of large language models, and make it\nbetter applied to the field of intelligent dialogue and content generation.\n","authors":["Le Xiao","Xin Shan"],"pdf_url":"https://arxiv.org/pdf/2307.00470v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10168v2","updated":"2023-07-20T02:29:25Z","published":"2023-07-19T17:54:43Z","title":"LLMs as Workers in Human-Computational Algorithms? Replicating\n Crowdsourcing Pipelines with LLMs","summary":" LLMs have shown promise in replicating human-like behavior in crowdsourcing\ntasks that were previously thought to be exclusive to human abilities. However,\ncurrent efforts focus mainly on simple atomic tasks. We explore whether LLMs\ncan replicate more complex crowdsourcing pipelines. We find that modern LLMs\ncan simulate some of crowdworkers' abilities in these \"human computation\nalgorithms,\" but the level of success is variable and influenced by requesters'\nunderstanding of LLM capabilities, the specific skills required for sub-tasks,\nand the optimal interaction modality for performing these sub-tasks. We reflect\non human and LLMs' different sensitivities to instructions, stress the\nimportance of enabling human-facing safeguards for LLMs, and discuss the\npotential of training humans and LLMs with complementary skill sets. 
Crucially,\nwe show that replicating crowdsourcing pipelines offers a valuable platform to\ninvestigate (1) the relative strengths of LLMs on different tasks (by\ncross-comparing their performances on sub-tasks) and (2) LLMs' potential in\ncomplex tasks, where they can complete part of the tasks while leaving others\nto humans.\n","authors":["Tongshuang Wu","Haiyi Zhu","Maya Albayrak","Alexis Axon","Amanda Bertsch","Wenxing Deng","Ziqi Ding","Bill Guo","Sireesh Gururaja","Tzu-Sheng Kuo","Jenny T. Liang","Ryan Liu","Ihita Mandal","Jeremiah Milbauer","Xiaolin Ni","Namrata Padmanabhan","Subhashini Ramkumar","Alexis Sudjianto","Jordan Taylor","Ying-Jui Tseng","Patricia Vaidos","Zhijin Wu","Wei Wu","Chenyang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.10168v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.11296v2","updated":"2023-07-20T02:20:35Z","published":"2023-06-20T05:20:29Z","title":"ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF\n Synthesis","summary":" We use prompt engineering to guide ChatGPT in the automation of text mining\nof metal-organic frameworks (MOFs) synthesis conditions from diverse formats\nand styles of the scientific literature. This effectively mitigates ChatGPT's\ntendency to hallucinate information -- an issue that previously made the use of\nLarge Language Models (LLMs) in scientific fields challenging. Our approach\ninvolves the development of a workflow implementing three different processes\nfor text mining, programmed by ChatGPT itself. All of them enable parsing,\nsearching, filtering, classification, summarization, and data unification with\ndifferent tradeoffs between labor, speed, and accuracy. We deploy this system\nto extract 26,257 distinct synthesis parameters pertaining to approximately 800\nMOFs sourced from peer-reviewed research articles. This process incorporates\nour ChemPrompt Engineering strategy to instruct ChatGPT in text mining,\nresulting in impressive precision, recall, and F1 scores of 90-99%.\nFurthermore, with the dataset built by text mining, we constructed a\nmachine-learning model with over 86% accuracy in predicting MOF experimental\ncrystallization outcomes and preliminarily identifying important factors in MOF\ncrystallization. We also developed a reliable data-grounded MOF chatbot to\nanswer questions on chemical reactions and synthesis procedures. Given that the\nprocess of using ChatGPT reliably mines and tabulates diverse MOF synthesis\ninformation in a unified format, while using only narrative language requiring\nno coding expertise, we anticipate that our ChatGPT Chemistry Assistant will be\nvery useful across various other chemistry sub-disciplines.\n","authors":["Zhiling Zheng","Oufan Zhang","Christian Borgs","Jennifer T. Chayes","Omar M. Yaghi"],"pdf_url":"https://arxiv.org/pdf/2306.11296v2.pdf","comment":"Published on Journal of the American Chemical Society (2023); 102\n pages (18-page manuscript, 84 pages of supporting information)"},{"id":"http://arxiv.org/abs/2307.07946v2","updated":"2023-07-20T02:01:34Z","published":"2023-07-16T04:50:52Z","title":"Unifying Token and Span Level Supervisions for Few-Shot Sequence\n Labeling","summary":" Few-shot sequence labeling aims to identify novel classes based on only a few\nlabeled samples. 
Existing methods solve the data scarcity problem mainly by\ndesigning token-level or span-level labeling models based on metric learning.\nHowever, these methods are only trained at a single granularity (i.e., either\ntoken level or span level) and have some weaknesses of the corresponding\ngranularity. In this paper, we first unify token and span level supervisions\nand propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot\nsequence labeling. CDAP contains the token-level and span-level networks,\njointly trained at different granularities. To align the outputs of two\nnetworks, we further propose a consistent loss to enable them to learn from\neach other. During the inference phase, we propose a consistent greedy\ninference algorithm that first adjusts the predicted probability and then\ngreedily selects non-overlapping spans with maximum probability. Extensive\nexperiments show that our model achieves new state-of-the-art results on three\nbenchmark datasets.\n","authors":["Zifeng Cheng","Qingyu Zhou","Zhiwei Jiang","Xuemin Zhao","Yunbo Cao","Qing Gu"],"pdf_url":"https://arxiv.org/pdf/2307.07946v2.pdf","comment":"Accepted by ACM Transactions on Information Systems"},{"id":"http://arxiv.org/abs/2307.10522v1","updated":"2023-07-20T01:48:51Z","published":"2023-07-20T01:48:51Z","title":"Gender-tuning: Empowering Fine-tuning for Debiasing Pre-trained Language\n Models","summary":" Recent studies have revealed that the widely-used Pre-trained Language Models\n(PLMs) propagate societal biases from the large unmoderated pre-training\ncorpora. Existing solutions require debiasing training processes and datasets\nfor debiasing, which are resource-intensive and costly. Furthermore, these\nmethods hurt the PLMs' performance on downstream tasks. In this study, we\npropose Gender-tuning, which debiases the PLMs through fine-tuning on\ndownstream tasks' datasets. For this aim, Gender-tuning integrates Masked\nLanguage Modeling (MLM) training objectives into fine-tuning's training\nprocess. Comprehensive experiments show that Gender-tuning outperforms the\nstate-of-the-art baselines in terms of average gender bias scores in PLMs while\nimproving PLMs' performance on downstream tasks solely using the downstream\ntasks' dataset. Also, Gender-tuning is a deployable debiasing tool for any PLM\nthat works with original fine-tuning.\n","authors":["Somayeh Ghanbarzadeh","Yan Huang","Hamid Palangi","Radames Cruz Moreno","Hamed Khanpour"],"pdf_url":"https://arxiv.org/pdf/2307.10522v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10514v1","updated":"2023-07-20T01:26:34Z","published":"2023-07-20T01:26:34Z","title":"Building Socio-culturally Inclusive Stereotype Resources with Community\n Engagement","summary":" With rapid development and deployment of generative language models in global\nsettings, there is an urgent need to also scale our measurements of harm, not\njust in the number and types of harms covered, but also how well they account\nfor local cultural contexts, including marginalized identities and the social\nbiases experienced by them. Current evaluation paradigms are limited in their\nabilities to address this, as they are not representative of diverse, locally\nsituated but global, socio-cultural perspectives. It is imperative that our\nevaluation resources are enhanced and calibrated by including people and\nexperiences from different cultures and societies worldwide, in order to\nprevent gross underestimations or skews in measurements of harm. 
In this work,\nwe demonstrate a socio-culturally aware expansion of evaluation resources in\nthe Indian societal context, specifically for the harm of stereotyping. We\ndevise a community engaged effort to build a resource which contains\nstereotypes for axes of disparity that are uniquely present in India. The\nresultant resource increases the number of stereotypes known for and in the\nIndian context by over 1000 stereotypes across many unique identities. We also\ndemonstrate the utility and effectiveness of such expanded resources for\nevaluations of language models. CONTENT WARNING: This paper contains examples\nof stereotypes that may be offensive.\n","authors":["Sunipa Dev","Jaya Goyal","Dinesh Tewari","Shachi Dave","Vinodkumar Prabhakaran"],"pdf_url":"https://arxiv.org/pdf/2307.10514v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.02288v3","updated":"2023-07-20T01:13:27Z","published":"2023-07-05T13:40:57Z","title":"Performance Comparison of Large Language Models on VNHSGE English\n Dataset: OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard","summary":" This paper presents a performance comparison of three large language models\n(LLMs), namely OpenAI ChatGPT, Microsoft Bing Chat (BingChat), and Google Bard,\non the VNHSGE English dataset. The performance of BingChat, Bard, and ChatGPT\n(GPT-3.5) is 92.4\\%, 86\\%, and 79.2\\%, respectively. The results show that\nBingChat is better than ChatGPT and Bard. Therefore, BingChat and Bard can\nreplace ChatGPT while ChatGPT is not yet officially available in Vietnam. The\nresults also indicate that BingChat, Bard and ChatGPT outperform Vietnamese\nstudents in English language proficiency. The findings of this study contribute\nto the understanding of the potential of LLMs in English language education.\nThe remarkable performance of ChatGPT, BingChat, and Bard demonstrates their\npotential as effective tools for teaching and learning English at the high\nschool level.\n","authors":["Xuan-Quy Dao"],"pdf_url":"https://arxiv.org/pdf/2307.02288v3.pdf","comment":"11 pages, 8 figures"},{"id":"http://arxiv.org/abs/2307.10512v1","updated":"2023-07-20T01:11:14Z","published":"2023-07-20T01:11:14Z","title":"IvyGPT: InteractiVe Chinese pathwaY language model in medical domain","summary":" General large language models (LLMs) such as ChatGPT have shown remarkable\nsuccess. However, such LLMs have not been widely adopted for medical purposes,\ndue to poor accuracy and inability to provide medical advice. We propose\nIvyGPT, an LLM based on LLaMA that is trained and fine-tuned with high-quality\nmedical question-answer (QA) instances and Reinforcement Learning from Human\nFeedback (RLHF). After supervised fine-tuning, IvyGPT has good multi-turn\nconversation capabilities, but it cannot perform like a doctor in other\naspects, such as comprehensive diagnosis. Through RLHF, IvyGPT can output\nricher diagnosis and treatment answers that are closer to human. In the\ntraining, we used QLoRA to train 33 billion parameters on a small number of\nNVIDIA A100 (80GB) GPUs. 
Experimental results show that IvyGPT has outperformed\nother medical GPT models.\n","authors":["Rongsheng Wang","Yaofei Duan","ChanTong Lam","Jiexi Chen","Jiangsheng Xu","Haoming Chen","Xiaohong Liu","Patrick Cheong-Iao Pang","Tao Tan"],"pdf_url":"https://arxiv.org/pdf/2307.10512v1.pdf","comment":"5 pages, 3 figures"},{"id":"http://arxiv.org/abs/2305.11408v2","updated":"2023-07-20T00:58:30Z","published":"2023-05-19T03:31:42Z","title":"AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide\n for Simultaneous Speech Translation","summary":" Attention is the core mechanism of today's most used architectures for\nnatural language processing and has been analyzed from many perspectives,\nincluding its effectiveness for machine translation-related tasks. Among these\nstudies, attention resulted to be a useful source of information to get\ninsights about word alignment also when the input text is substituted with\naudio segments, as in the case of the speech translation (ST) task. In this\npaper, we propose AlignAtt, a novel policy for simultaneous ST (SimulST) that\nexploits the attention information to generate source-target alignments that\nguide the model during inference. Through experiments on the 8 language pairs\nof MuST-C v1.0, we show that AlignAtt outperforms previous state-of-the-art\nSimulST policies applied to offline-trained models with gains in terms of BLEU\nof 2 points and latency reductions ranging from 0.5s to 0.8s across the 8\nlanguages.\n","authors":["Sara Papi","Marco Turchi","Matteo Negri"],"pdf_url":"https://arxiv.org/pdf/2305.11408v2.pdf","comment":"Accepted at Interspeech 2023"},{"id":"http://arxiv.org/abs/2307.09702v2","updated":"2023-07-20T00:40:41Z","published":"2023-07-19T01:14:49Z","title":"Efficient Guided Generation for Large Language Models","summary":" In this article we describe an efficient approach to guiding language model\ntext generation with regular expressions and context-free grammars. Our\napproach adds little to no overhead to the token sequence generation process,\nand makes guided generation feasible in practice. An implementation is provided\nin the open source Python library Outlines.\n","authors":["Brandon T. Willard","Rémi Louf"],"pdf_url":"https://arxiv.org/pdf/2307.09702v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10511v1","updated":"2023-07-20T00:36:41Z","published":"2023-07-20T00:36:41Z","title":"General Debiasing for Multimodal Sentiment Analysis","summary":" Existing work on Multimodal Sentiment Analysis (MSA) utilizes multimodal\ninformation for prediction yet unavoidably suffers from fitting the spurious\ncorrelations between multimodal features and sentiment labels. For example, if\nmost videos with a blue background have positive labels in a dataset, the model\nwill rely on such correlations for prediction, while ``blue background'' is not\na sentiment-related feature. To address this problem, we define a general\ndebiasing MSA task, which aims to enhance the Out-Of-Distribution (OOD)\ngeneralization ability of MSA models by reducing their reliance on spurious\ncorrelations. To this end, we propose a general debiasing framework based on\nInverse Probability Weighting (IPW), which adaptively assigns small weights to\nthe samples with larger bias i.e., the severer spurious correlations). 
The key\nto this debiasing framework is to estimate the bias of each sample, which is\nachieved by two steps: 1) disentangling the robust features and biased features\nin each modality, and 2) utilizing the biased features to estimate the bias.\nFinally, we employ IPW to reduce the effects of large-biased samples,\nfacilitating robust feature learning for sentiment prediction. To examine the\nmodel's generalization ability, we keep the original testing sets on two\nbenchmarks and additionally construct multiple unimodal and multimodal OOD\ntesting sets. The empirical results demonstrate the superior generalization\nability of our proposed framework. We have released the code and data to\nfacilitate the reproduction.\n","authors":["Teng Sun","Juntong Ni","Wenjie Wang","Liqiang Jing","Yinwei Wei","Liqiang Nie"],"pdf_url":"https://arxiv.org/pdf/2307.10511v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11254v1","updated":"2023-07-20T22:10:04Z","published":"2023-07-20T22:10:04Z","title":"A Systematic Evaluation of Federated Learning on Biomedical Natural\n Language Processing","summary":" Language models (LMs) like BERT and GPT have revolutionized natural language\nprocessing (NLP). However, privacy-sensitive domains, particularly the medical\nfield, face challenges to train LMs due to limited data access and privacy\nconstraints imposed by regulations like the Health Insurance Portability and\nAccountability Act (HIPAA) and the General Data Protection Regulation (GDPR).\nFederated learning (FL) offers a decentralized solution that enables\ncollaborative learning while ensuring the preservation of data privacy. In this\nstudy, we systematically evaluate FL in medicine across $2$ biomedical NLP\ntasks using $6$ LMs encompassing $8$ corpora. Our results showed that: 1) FL\nmodels consistently outperform LMs trained on individual client's data and\nsometimes match the model trained with pooled data; 2) With a fixed amount of\ntotal data, LMs trained using FL with more clients exhibit inferior\nperformance, but pre-trained transformer-based models exhibited greater\nresilience. 3) LMs trained using FL perform nearly on par with the model\ntrained with pooled data when clients' data are IID distributed while\nexhibiting visible gaps with non-IID data. Our code is available at:\nhttps://github.com/PL97/FedNLP\n","authors":["Le Peng","sicheng zhou","jiandong chen","Rui Zhang","Ziyue Xu","Ju Sun"],"pdf_url":"https://arxiv.org/pdf/2307.11254v1.pdf","comment":"Accepted by KDD 2023 Workshop FL4Data-Mining"},{"id":"http://arxiv.org/abs/2307.11224v1","updated":"2023-07-20T20:37:24Z","published":"2023-07-20T20:37:24Z","title":"Jina Embeddings: A Novel Set of High-Performance Sentence Embedding\n Models","summary":" Jina Embeddings constitutes a set of high-performance sentence embedding\nmodels adept at translating various textual inputs into numerical\nrepresentations, thereby capturing the semantic essence of the text. While\nthese models are not exclusively designed for text generation, they excel in\napplications such as dense retrieval and semantic textual similarity. This\npaper details the development of Jina Embeddings, starting with the creation of\na high-quality pairwise and triplet dataset. 
It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-depth insights into the model\ntraining process, and concludes with a comprehensive performance evaluation\nusing the Massive Textual Embedding Benchmark (MTEB).\n","authors":["Michael Günther","Louis Milliken","Jonathan Geuter","Georgios Mastrapas","Bo Wang","Han Xiao"],"pdf_url":"https://arxiv.org/pdf/2307.11224v1.pdf","comment":"9 pages, 2 page appendix, EMNLP 2023 Industrial Track"},{"id":"http://arxiv.org/abs/2307.09782v2","updated":"2023-07-20T18:47:20Z","published":"2023-07-19T06:58:03Z","title":"ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization\n Using Floating-Point Formats","summary":" In the complex domain of large language models (LLMs), striking a balance\nbetween computational efficiency and maintaining model quality is a formidable\nchallenge. Navigating the inherent limitations of uniform quantization,\nparticularly when dealing with outliers, and motivated by the launch of\nNVIDIA's H100 hardware, this study delves into the viability of floating-point\n(FP) quantization, particularly focusing on FP8 and FP4, as a potential\nsolution. Our comprehensive investigation reveals that for LLMs, FP8 activation\nconsistently outshines its integer (INT8) equivalent, with the performance edge\nbecoming more noticeable in models possessing parameters beyond one billion.\nFor weight quantization, our findings indicate that FP4 exhibits comparable, if\nnot superior, performance to INT4, simplifying deployment on FP-supported\nhardware like H100. To mitigate the overhead from precision alignment caused by\nthe disparity between weights and activations, we propose two scaling\nconstraints for weight quantization that negligibly impact the performance\ncompared to the standard W4A8 model. We additionally enhance our quantization\nmethods by integrating the Low Rank Compensation (LoRC) strategy, yielding\nimprovements especially in smaller models. The results of our investigation\nemphasize the immense potential of FP quantization for LLMs, paving the way for\nhigh-efficiency deployment in resource-limited settings.\n","authors":["Xiaoxia Wu","Zhewei Yao","Yuxiong He"],"pdf_url":"https://arxiv.org/pdf/2307.09782v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11170v1","updated":"2023-07-20T18:08:34Z","published":"2023-07-20T18:08:34Z","title":"UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for\n Biomedical Entity Recognition","summary":" Pre-trained transformer language models (LMs) have in recent years become the\ndominant paradigm in applied NLP. These models have achieved state-of-the-art\nperformance on tasks such as information extraction, question answering,\nsentiment analysis, document classification and many others. In the biomedical\ndomain, significant progress has been made in adapting this paradigm to NLP\ntasks that require the integration of domain-specific knowledge as well as\nstatistical modelling of language. In particular, research in this area has\nfocused on the question of how best to construct LMs that take into account not\nonly the patterns of token distribution in medical text, but also the wealth of\nstructured information contained in terminology resources such as the UMLS.\nThis work contributes a data-centric paradigm for enriching the language\nrepresentations of biomedical transformer-encoder LMs by extracting text\nsequences from the UMLS. 
This allows for graph-based learning objectives to be\ncombined with masked-language pre-training. Preliminary results from\nexperiments in the extension of pre-trained LMs as well as training from\nscratch show that this framework improves downstream performance on multiple\nbiomedical and clinical Named Entity Recognition (NER) tasks.\n","authors":["Aidan Mannion","Thierry Chevalier","Didier Schwab","Lorraine Geouriot"],"pdf_url":"https://arxiv.org/pdf/2307.11170v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.11585v2","updated":"2023-07-20T14:31:10Z","published":"2023-06-20T15:02:25Z","title":"FAIR: A Causal Framework for Accurately Inferring Judgments Reversals","summary":" Artificial intelligence researchers have made significant advances in legal\nintelligence in recent years. However, the existing studies have not focused on\nthe important value embedded in judgments reversals, which limits the\nimprovement of the efficiency of legal intelligence. In this paper, we propose\na causal Framework for Accurately Inferring case Reversals (FAIR), which models\nthe problem of judgments reversals based on real Chinese judgments. We mine the\ncauses of judgments reversals by causal inference methods and inject the\nobtained causal relationships into the neural network as a priori knowledge.\nAnd then, our framework is validated on a challenging dataset as a legal\njudgment prediction task. The experimental results show that our framework can\ntap the most critical factors in judgments reversal, and the obtained causal\nrelationships can effectively improve the neural network's performance. In\naddition, we discuss the generalization ability of large language models for\nlegal intelligence tasks using ChatGPT as an example. Our experiment has found\nthat the generalization ability of large language models still has defects, and\nmining causal relationships can effectively improve the accuracy and explain\nability of model predictions.\n","authors":["Minghua He","Nanfei Gu","Yuntao Shi","Qionghui Zhang","Yaying Chen"],"pdf_url":"https://arxiv.org/pdf/2306.11585v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11788v1","updated":"2023-07-20T18:30:35Z","published":"2023-07-20T18:30:35Z","title":"Applying QNLP to sentiment analysis in finance","summary":" As an application domain where the slightest qualitative improvements can\nyield immense value, finance is a promising candidate for early quantum\nadvantage. Focusing on the rapidly advancing field of Quantum Natural Language\nProcessing (QNLP), we explore the practical applicability of the two central\napproaches DisCoCat and Quantum-Enhanced Long Short-Term Memory (QLSTM) to the\nproblem of sentiment analysis in finance. Utilizing a novel ChatGPT-based data\ngeneration approach, we conduct a case study with more than 1000 realistic\nsentences and find that QLSTMs can be trained substantially faster than\nDisCoCat while also achieving close to classical results for their available\nsoftware implementations.\n","authors":["Jonas Stein","Ivo Christ","Nicolas Kraus","Maximilian Balthasar Mansky","Robert Müller","Claudia Linnhof-Popien"],"pdf_url":"https://arxiv.org/pdf/2307.11788v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11787v1","updated":"2023-07-20T16:22:36Z","published":"2023-07-20T16:22:36Z","title":"LLM Cognitive Judgements Differ From Human","summary":" Large Language Models (LLMs) have lately been on the spotlight of\nresearchers, businesses, and consumers alike. 
While the linguistic capabilities\nof such models have been studied extensively, there is growing interest in\ninvestigating them as cognitive subjects. In the present work, I examine GPT-3\nand ChatGPT capabilities on a limited-data inductive reasoning task from the\ncognitive science literature. The results suggest that these models' cognitive\njudgements are not human-like.\n","authors":["Sotiris Lamprinidis"],"pdf_url":"https://arxiv.org/pdf/2307.11787v1.pdf","comment":"7 pages, 1 figure"},{"id":"http://arxiv.org/abs/2307.11785v1","updated":"2023-07-20T12:44:47Z","published":"2023-07-20T12:44:47Z","title":"Adversarial Conversational Shaping for Intelligent Agents","summary":" The recent emergence of deep learning methods has enabled the research\ncommunity to achieve state-of-the-art results in several domains including\nnatural language processing. However, the current robocall system remains\nunstable and inaccurate: text generators and chat-bots can be tedious and\nmisunderstand human-like dialogue. In this work, we study the performance of\ntwo models able to enhance an intelligent conversational agent through\nadversarial conversational shaping: a generative adversarial network with\npolicy gradient (GANPG) and a generative adversarial network with reward for\nevery generation step (REGS) based on the REGS model presented in Li et al.\n[18]. This model is able to assign rewards to both partially and fully\ngenerated text sequences. We discuss performance with different training\ndetails: seq2seq [36] and transformers [37] in a reinforcement learning\nframework.\n","authors":["Piotr Tarasiewicz","Sultan Kenjeyev","Ilana Sebag","Shehab Alshehabi"],"pdf_url":"https://arxiv.org/pdf/2307.11785v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2307.11086v1","updated":"2023-07-20T17:59:33Z","published":"2023-07-20T17:59:33Z","title":"PAPR: Proximity Attention Point Rendering","summary":" Learning accurate and parsimonious point cloud representations of scene\nsurfaces from scratch remains a challenge in 3D representation learning.\nExisting point-based methods often suffer from the vanishing gradient problem\nor require a large number of points to accurately model scene geometry and\ntexture. To address these limitations, we propose Proximity Attention Point\nRendering (PAPR), a novel method that consists of a point-based scene\nrepresentation and a differentiable renderer. Our scene representation uses a\npoint cloud where each point is characterized by its spatial position,\nforeground score, and view-independent feature vector. The renderer selects the\nrelevant points for each ray and produces accurate colours using their\nassociated features. PAPR effectively learns point cloud positions to represent\nthe correct scene geometry, even when the initialization drastically differs\nfrom the target geometry. Notably, our method captures fine texture details\nwhile using only a parsimonious set of points. We also demonstrate four\npractical applications of our method: geometry editing, object manipulation,\ntexture transfer, and exposure control. 
More results and code are available on\nour project website at https://zvict.github.io/papr/.\n","authors":["Yanshu Zhang","Shichong Peng","Alireza Moazeni","Ke Li"],"pdf_url":"https://arxiv.org/pdf/2307.11086v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.07269v2","updated":"2023-07-20T17:59:25Z","published":"2023-07-14T10:50:43Z","title":"Frequency Domain Adversarial Training for Robust Volumetric Medical\n Segmentation","summary":" It is imperative to ensure the robustness of deep learning models in critical\napplications such as healthcare. While recent advances in deep learning have\nimproved the performance of volumetric medical image segmentation models, these\nmodels cannot be deployed for real-world applications immediately due to their\nvulnerability to adversarial attacks. We present a 3D frequency domain\nadversarial attack for volumetric medical image segmentation models and\ndemonstrate its advantages over conventional input or voxel domain attacks.\nUsing our proposed attack, we introduce a novel frequency domain adversarial\ntraining approach for optimizing a robust model against voxel and frequency\ndomain attacks. Moreover, we propose a frequency consistency loss to regulate our\nfrequency domain adversarial training that achieves a better tradeoff between the\nmodel's performance on clean and adversarial samples. Code is publicly\navailable at https://github.com/asif-hanif/vafa.\n","authors":["Asif Hanif","Muzammal Naseer","Salman Khan","Mubarak Shah","Fahad Shahbaz Khan"],"pdf_url":"https://arxiv.org/pdf/2307.07269v2.pdf","comment":"This paper has been accepted in MICCAI 2023 conference"},{"id":"http://arxiv.org/abs/2307.11085v1","updated":"2023-07-20T17:59:11Z","published":"2023-07-20T17:59:11Z","title":"Representation Learning in Anomaly Detection: Successes, Limits and a\n Grand Challenge","summary":" In this perspective paper, we argue that the dominant paradigm in anomaly\ndetection cannot scale indefinitely and will eventually hit fundamental limits.\nThis is due to a no free lunch principle for anomaly detection. These\nlimitations can be overcome when there are strong task priors, as is the case\nfor many industrial tasks. When such priors do not exist, the task is much\nharder for anomaly detection. We pose two such tasks as grand challenges for\nanomaly detection: i) scientific discovery by anomaly detection, and ii) a\n\"mini-grand\" challenge of detecting the most anomalous image in the ImageNet\ndataset. We believe new anomaly detection tools and ideas would need to be\ndeveloped to overcome these challenges.\n","authors":["Yedid Hoshen"],"pdf_url":"https://arxiv.org/pdf/2307.11085v1.pdf","comment":"Keynote talk at the Visual Anomaly and Novelty Detection Workshop,\n CVPR'23"},{"id":"http://arxiv.org/abs/2307.11081v1","updated":"2023-07-20T17:57:04Z","published":"2023-07-20T17:57:04Z","title":"GLSFormer : Gated - Long, Short Sequence Transformer for Step\n Recognition in Surgical Videos","summary":" Automated surgical step recognition is an important task that can\nsignificantly improve patient safety and decision-making during surgeries.\nExisting state-of-the-art methods for surgical step recognition either rely on\nseparate, multi-stage modeling of spatial and temporal information or operate\non short-range temporal resolution when learned jointly. However, the benefits\nof joint modeling of spatio-temporal features and long-range information are\nnot taken into account. 
In this paper, we propose a vision transformer-based\napproach to jointly learn spatio-temporal features directly from sequence of\nframe-level patches. Our method incorporates a gated-temporal attention\nmechanism that intelligently combines short-term and long-term spatio-temporal\nfeature representations. We extensively evaluate our approach on two cataract\nsurgery video datasets, namely Cataract-101 and D99, and demonstrate superior\nperformance compared to various state-of-the-art methods. These results\nvalidate the suitability of our proposed approach for automated surgical step\nrecognition. Our code is released at:\nhttps://github.com/nisargshah1999/GLSFormer\n","authors":["Nisarg A. Shah","Shameema Sikder","S. Swaroop Vedula","Vishal M. Patel"],"pdf_url":"https://arxiv.org/pdf/2307.11081v1.pdf","comment":"Accepted to MICCAI 2023 (Early Accept)"},{"id":"http://arxiv.org/abs/2307.11077v1","updated":"2023-07-20T17:55:14Z","published":"2023-07-20T17:55:14Z","title":"AlignDet: Aligning Pre-training and Fine-tuning in Object Detection","summary":" The paradigm of large-scale pre-training followed by downstream fine-tuning\nhas been widely employed in various object detection algorithms. In this paper,\nwe reveal discrepancies in data, model, and task between the pre-training and\nfine-tuning procedure in existing practices, which implicitly limit the\ndetector's performance, generalization ability, and convergence speed. To this\nend, we propose AlignDet, a unified pre-training framework that can be adapted\nto various existing detectors to alleviate the discrepancies. AlignDet\ndecouples the pre-training process into two stages, i.e., image-domain and\nbox-domain pre-training. The image-domain pre-training optimizes the detection\nbackbone to capture holistic visual abstraction, and box-domain pre-training\nlearns instance-level semantics and task-aware concepts to initialize the parts\nout of the backbone. By incorporating the self-supervised pre-trained\nbackbones, we can pre-train all modules for various detectors in an\nunsupervised paradigm. As depicted in Figure 1, extensive experiments\ndemonstrate that AlignDet can achieve significant improvements across diverse\nprotocols, such as detection algorithm, model backbone, data setting, and\ntraining schedule. For example, AlignDet improves FCOS by 5.3 mAP, RetinaNet by\n2.1 mAP, Faster R-CNN by 3.3 mAP, and DETR by 2.3 mAP under fewer epochs.\n","authors":["Ming Li","Jie Wu","Xionghui Wang","Chen Chen","Jie Qin","Xuefeng Xiao","Rui Wang","Min Zheng","Xin Pan"],"pdf_url":"https://arxiv.org/pdf/2307.11077v1.pdf","comment":"Accepted by ICCV 2023. Code and Models are publicly available.\n Project Page: https://liming-ai.github.io/AlignDet"},{"id":"http://arxiv.org/abs/2307.11074v1","updated":"2023-07-20T17:53:57Z","published":"2023-07-20T17:53:57Z","title":"Learning Dense UV Completion for Human Mesh Recovery","summary":" Human mesh reconstruction from a single image is challenging in the presence\nof occlusion, which can be caused by self, objects, or other humans. Existing\nmethods either fail to separate human features accurately or lack proper\nsupervision for feature completion. In this paper, we propose Dense Inpainting\nHuman Mesh Recovery (DIMR), a two-stage method that leverages dense\ncorrespondence maps to handle occlusion. Our method utilizes a dense\ncorrespondence map to separate visible human features and completes human\nfeatures on a structured UV map dense human with an attention-based feature\ncompletion module. 
We also design a feature inpainting training procedure that\nguides the network to learn from unoccluded features. We evaluate our method on\nseveral datasets and demonstrate its superior performance under heavily\noccluded scenarios compared to other methods. Extensive experiments show that\nour method clearly outperforms prior SOTA methods on heavily occluded images\nand achieves comparable results on the standard benchmarks (3DPW).\n","authors":["Yanjun Wang","Qingping Sun","Wenjia Wang","Jun Ling","Zhongang Cai","Rong Xie","Li Song"],"pdf_url":"https://arxiv.org/pdf/2307.11074v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11073v1","updated":"2023-07-20T17:53:46Z","published":"2023-07-20T17:53:46Z","title":"OBJECT 3DIT: Language-guided 3D-aware Image Editing","summary":" Existing image editing tools, while powerful, typically disregard the\nunderlying 3D geometry from which the image is projected. As a result, edits\nmade using these tools may become detached from the geometry and lighting\nconditions that are at the foundation of the image formation process. In this\nwork, we formulate the new task of language-guided 3D-aware editing, where\nobjects in an image should be edited according to a language instruction in the\ncontext of the underlying 3D scene. To promote progress towards this goal, we\nrelease OBJECT: a dataset consisting of 400K editing examples created from\nprocedurally generated 3D scenes. Each example consists of an input image,\nediting instruction in language, and the edited image. We also introduce 3DIT:\nsingle and multi-task models for four editing tasks. Our models show impressive\nabilities to understand the 3D composition of entire scenes, factoring in\nsurrounding objects, surfaces, lighting conditions, shadows, and\nphysically-plausible object configurations. Surprisingly, despite training only on\nsynthetic scenes from OBJECT, the editing capabilities of 3DIT generalize to\nreal-world images.\n","authors":["Oscar Michel","Anand Bhattad","Eli VanderBilt","Ranjay Krishna","Aniruddha Kembhavi","Tanmay Gupta"],"pdf_url":"https://arxiv.org/pdf/2307.11073v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.01738v2","updated":"2023-07-20T17:53:41Z","published":"2023-07-04T14:14:12Z","title":"Mitigating Calibration Bias Without Fixed Attribute Grouping for\n Improved Fairness in Medical Imaging Analysis","summary":" Trustworthy deployment of deep learning medical imaging models into\nreal-world clinical practice requires that they be calibrated. However, models\nthat are well calibrated overall can still be poorly calibrated for a\nsub-population, potentially resulting in a clinician unwittingly making poor\ndecisions for this group based on the recommendations of the model. Although\nmethods have been shown to successfully mitigate biases across subgroups in\nterms of model accuracy, this work focuses on the open problem of mitigating\ncalibration biases in the context of medical image analysis. Our method does\nnot require subgroup attributes during training, permitting the flexibility to\nmitigate biases for different choices of sensitive attributes without\nre-training. To this end, we propose a novel two-stage method: Cluster-Focal to\nfirst identify poorly calibrated samples, cluster them into groups, and then\nintroduce group-wise focal loss to improve calibration bias. We evaluate our\nmethod on skin lesion classification with the public HAM10000 dataset, and on\npredicting future lesional activity for multiple sclerosis (MS) patients. 
In\naddition to considering traditional sensitive attributes (e.g. age, sex) with\ndemographic subgroups, we also consider biases among groups with different\nimage-derived attributes, such as lesion load, which are required in medical\nimage analysis. Our results demonstrate that our method effectively controls\ncalibration error in the worst-performing subgroups while preserving prediction\nperformance, and outperforming recent baselines.\n","authors":["Changjian Shui","Justin Szeto","Raghav Mehta","Douglas L. Arnold","Tal Arbel"],"pdf_url":"https://arxiv.org/pdf/2307.01738v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11067v1","updated":"2023-07-20T17:46:21Z","published":"2023-07-20T17:46:21Z","title":"CNOS: A Strong Baseline for CAD-based Novel Object Segmentation","summary":" We propose a simple three-stage approach to segment unseen objects in RGB\nimages using their CAD models. Leveraging recent powerful foundation models,\nDINOv2 and Segment Anything, we create descriptors and generate proposals,\nincluding binary masks for a given input RGB image. By matching proposals with\nreference descriptors created from CAD models, we achieve precise object ID\nassignment along with modal masks. We experimentally demonstrate that our\nmethod achieves state-of-the-art results in CAD-based novel object\nsegmentation, surpassing existing approaches on the seven core datasets of the\nBOP challenge by 19.8\\% AP using the same BOP evaluation protocol. Our source\ncode is available at https://github.com/nv-nguyen/cnos.\n","authors":["Van Nguyen Nguyen","Tomas Hodan","Georgy Ponimatkin","Thibault Groueix","Vincent Lepetit"],"pdf_url":"https://arxiv.org/pdf/2307.11067v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11058v1","updated":"2023-07-20T17:38:55Z","published":"2023-07-20T17:38:55Z","title":"Driving Policy Prediction based on Deep Learning Models","summary":" In this project, we implemented an end-to-end system that takes in combined\nvisual features of video frames from a normal camera and depth information from\na point cloud scanner, and predicts driving policies (vehicle speed and\nsteering angle). We verified the safety of our system by comparing the\npredicted results with standard behaviors by real-world experienced drivers.\nOur test results show that the predictions can be considered accurate in at\nleast half of the testing cases (50%-80%, depending on the model), and using\ncombined features improved the performance in most cases compared to using video\nframes only.\n","authors":["Fuxiao Liu"],"pdf_url":"https://arxiv.org/pdf/2307.11058v1.pdf","comment":"5 pages, 9 figures"},{"id":"http://arxiv.org/abs/2307.11052v1","updated":"2023-07-20T17:33:57Z","published":"2023-07-20T17:33:57Z","title":"HRFNet: High-Resolution Forgery Network for Localizing Satellite Image\n Manipulation","summary":" Existing high-resolution satellite image forgery localization methods rely on\npatch-based or downsampling-based training. Both of these training methods have\nmajor drawbacks, such as inaccurate boundaries between pristine and forged\nregions, the generation of unwanted artifacts, etc. To tackle the\naforementioned challenges, inspired by the high-resolution image segmentation\nliterature, we propose a novel model called HRFNet to enable satellite image\nforgery localization effectively. Specifically, equipped with shallow and deep\nbranches, our model can successfully integrate RGB and resampling features in\nboth global and local manners to localize forgery more accurately. 
We perform\nvarious experiments to demonstrate that our method achieves the best\nperformance, while the memory requirement and processing speed are not\ncompromised compared to existing methods.\n","authors":["Fahim Faisal Niloy","Kishor Kumar Bhaumik","Simon S. Woo"],"pdf_url":"https://arxiv.org/pdf/2307.11052v1.pdf","comment":"ICIP 2023"},{"id":"http://arxiv.org/abs/2307.09023v3","updated":"2023-07-20T17:23:55Z","published":"2023-07-18T07:25:38Z","title":"LA-Net: Landmark-Aware Learning for Reliable Facial Expression\n Recognition under Label Noise","summary":" Facial expression recognition (FER) remains a challenging task due to the\nambiguity of expressions. The derived noisy labels significantly harm the\nperformance in real-world scenarios. To address this issue, we present a new\nFER model named Landmark-Aware Net~(LA-Net), which leverages facial landmarks\nto mitigate the impact of label noise from two perspectives. Firstly, LA-Net\nuses landmark information to suppress the uncertainty in expression space and\nconstructs the label distribution of each sample by neighborhood aggregation,\nwhich in turn improves the quality of training supervision. Secondly, the model\nincorporates landmark information into expression representations using the\ndevised expression-landmark contrastive loss. The enhanced expression feature\nextractor can be less susceptible to label noise. Our method can be integrated\nwith any deep neural network for better training supervision without\nintroducing extra inference costs. We conduct extensive experiments on both\nin-the-wild datasets and synthetic noisy datasets and demonstrate that LA-Net\nachieves state-of-the-art performance.\n","authors":["Zhiyu Wu","Jinshi Cui"],"pdf_url":"https://arxiv.org/pdf/2307.09023v3.pdf","comment":"accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.11035v1","updated":"2023-07-20T17:11:20Z","published":"2023-07-20T17:11:20Z","title":"Cascade-DETR: Delving into High-Quality Universal Object Detection","summary":" Object localization in general environments is a fundamental part of vision\nsystems. While dominating on the COCO benchmark, recent Transformer-based\ndetection methods are not competitive in diverse domains. Moreover, these\nmethods still struggle to very accurately estimate the object bounding boxes in\ncomplex environments.\n We introduce Cascade-DETR for high-quality universal object detection. We\njointly tackle the generalization to diverse domains and localization accuracy\nby proposing the Cascade Attention layer, which explicitly integrates\nobject-centric information into the detection decoder by limiting the attention\nto the previous box prediction. To further enhance accuracy, we also revisit\nthe scoring of queries. Instead of relying on classification scores, we predict\nthe expected IoU of the query, leading to substantially more well-calibrated\nconfidences. Lastly, we introduce a universal object detection benchmark,\nUDB10, that contains 10 datasets from diverse domains. While also advancing the\nstate-of-the-art on COCO, Cascade-DETR substantially improves DETR-based\ndetectors on all datasets in UDB10, even by over 10 mAP in some cases. The\nimprovements under stringent quality requirements are even more pronounced. Our\ncode and models will be released at https://github.com/SysCV/cascade-detr.\n","authors":["Mingqiao Ye","Lei Ke","Siyuan Li","Yu-Wing Tai","Chi-Keung Tang","Martin Danelljan","Fisher Yu"],"pdf_url":"https://arxiv.org/pdf/2307.11035v1.pdf","comment":"Accepted in ICCV 2023. 
Our code and models will be released at\n https://github.com/SysCV/cascade-detr"},{"id":"http://arxiv.org/abs/2305.05610v2","updated":"2023-07-20T16:46:36Z","published":"2023-05-09T17:01:17Z","title":"Can point cloud networks learn statistical shape models of anatomies?","summary":" Statistical Shape Modeling (SSM) is a valuable tool for investigating and\nquantifying anatomical variations within populations of anatomies. However,\ntraditional correspondence-based SSM generation methods have a prohibitive\ninference process and require complete geometric proxies (e.g., high-resolution\nbinary volumes or surface meshes) as input shapes to construct the SSM.\nUnordered 3D point cloud representations of shapes are more easily acquired\nfrom various medical imaging practices (e.g., thresholded images and surface\nscanning). Point cloud deep networks have recently achieved remarkable success\nin learning permutation-invariant features for different point cloud tasks\n(e.g., completion, semantic segmentation, classification). However, their\napplication to learning SSM from point clouds is to-date unexplored. In this\nwork, we demonstrate that existing point cloud encoder-decoder-based completion\nnetworks can provide an untapped potential for SSM, capturing population-level\nstatistical representations of shapes while reducing the inference burden and\nrelaxing the input requirement. We discuss the limitations of these techniques\nto the SSM application and suggest future improvements. Our work paves the way\nfor further exploration of point cloud deep learning for SSM, a promising\navenue for advancing shape analysis literature and broadening SSM to diverse\nuse cases.\n","authors":["Jadie Adams","Shireen Elhabian"],"pdf_url":"https://arxiv.org/pdf/2305.05610v2.pdf","comment":"Accepted to MICCAI 2023. 13 pages, 5 figures, appendix"},{"id":"http://arxiv.org/abs/2307.11017v1","updated":"2023-07-20T16:45:16Z","published":"2023-07-20T16:45:16Z","title":"Multi-objective point cloud autoencoders for explainable myocardial\n infarction prediction","summary":" Myocardial infarction (MI) is one of the most common causes of death in the\nworld. Image-based biomarkers commonly used in the clinic, such as ejection\nfraction, fail to capture more complex patterns in the heart's 3D anatomy and\nthus limit diagnostic accuracy. In this work, we present the multi-objective\npoint cloud autoencoder as a novel geometric deep learning approach for\nexplainable infarction prediction, based on multi-class 3D point cloud\nrepresentations of cardiac anatomy and function. Its architecture consists of\nmultiple task-specific branches connected by a low-dimensional latent space to\nallow for effective multi-objective learning of both reconstruction and MI\nprediction, while capturing pathology-specific 3D shape information in an\ninterpretable latent space. Furthermore, its hierarchical branch design with\npoint cloud-based deep learning operations enables efficient multi-scale\nfeature learning directly on high-resolution anatomy point clouds. In our\nexperiments on a large UK Biobank dataset, the multi-objective point cloud\nautoencoder is able to accurately reconstruct multi-temporal 3D shapes with\nChamfer distances between predicted and input anatomies below the underlying\nimages' pixel resolution. Our method outperforms multiple machine learning and\ndeep learning benchmarks for the task of incident MI prediction by 19% in terms\nof Area Under the Receiver Operating Characteristic curve. 
In addition, its\ntask-specific compact latent space exhibits easily separable control and MI\nclusters with clinically plausible associations between subject encodings and\ncorresponding 3D shapes, thus demonstrating the explainability of the\nprediction.\n","authors":["Marcel Beetz","Abhirup Banerjee","Vicente Grau"],"pdf_url":"https://arxiv.org/pdf/2307.11017v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.05797v2","updated":"2023-07-20T16:36:32Z","published":"2023-05-09T23:01:05Z","title":"Fully Bayesian VIB-DeepSSM","summary":" Statistical shape modeling (SSM) enables population-based quantitative\nanalysis of anatomical shapes, informing clinical diagnosis. Deep learning\napproaches predict correspondence-based SSM directly from unsegmented 3D images\nbut require calibrated uncertainty quantification, motivating Bayesian\nformulations. Variational information bottleneck DeepSSM (VIB-DeepSSM) is an\neffective, principled framework for predicting probabilistic shapes of anatomy\nfrom images with aleatoric uncertainty quantification. However, VIB is only\nhalf-Bayesian and lacks epistemic uncertainty inference. We derive a fully\nBayesian VIB formulation and demonstrate the efficacy of two scalable\nimplementation approaches: concrete dropout and batch ensemble. Additionally,\nwe introduce a novel combination of the two that further enhances uncertainty\ncalibration via multimodal marginalization. Experiments on synthetic shapes and\nleft atrium data demonstrate that the fully Bayesian VIB network predicts SSM\nfrom images with improved uncertainty reasoning without sacrificing accuracy.\n","authors":["Jadie Adams","Shireen Elhabian"],"pdf_url":"https://arxiv.org/pdf/2305.05797v2.pdf","comment":"Accepted to MICCAI 2023. 13 pages, 4 figures, appendix"},{"id":"http://arxiv.org/abs/2210.05335v3","updated":"2023-07-20T16:24:14Z","published":"2022-10-11T10:54:54Z","title":"MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model","summary":" Multimodal semantic understanding often has to deal with uncertainty, which\nmeans the obtained messages tend to refer to multiple targets. Such uncertainty\nis problematic for our interpretation, including inter- and intra-modal\nuncertainty. Little effort has studied the modeling of this uncertainty,\nparticularly in pre-training on unlabeled datasets and fine-tuning in\ntask-specific downstream datasets. In this paper, we project the\nrepresentations of all modalities as probabilistic distributions via a\nProbability Distribution Encoder (PDE) by utilizing sequence-level\ninteractions. Compared to the existing deterministic methods, such uncertainty\nmodeling can convey richer multimodal semantic information and more complex\nrelationships. Furthermore, we integrate uncertainty modeling with popular\npre-training frameworks and propose suitable pre-training tasks:\nDistribution-based Vision-Language Contrastive learning (D-VLC),\nDistribution-based Masked Language Modeling (D-MLM), and Distribution-based\nImage-Text Matching (D-ITM). 
The fine-tuned models are applied to challenging\ndownstream tasks, including image-text retrieval, visual question answering,\nvisual reasoning, and visual entailment, and achieve state-of-the-art results.\n","authors":["Yatai Ji","Junjie Wang","Yuan Gong","Lin Zhang","Yanru Zhu","Hongfa Wang","Jiaxing Zhang","Tetsuya Sakai","Yujiu Yang"],"pdf_url":"https://arxiv.org/pdf/2210.05335v3.pdf","comment":"CVPR 2023 Main Track Long Paper"},{"id":"http://arxiv.org/abs/2307.10984v1","updated":"2023-07-20T16:14:23Z","published":"2023-07-20T16:14:23Z","title":"Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image","summary":" Reconstructing accurate 3D scenes from images is a long-standing vision task.\nDue to the ill-posedness of the single-image reconstruction problem, most\nwell-established methods are built upon multi-view geometry. State-of-the-art\n(SOTA) monocular metric depth estimation methods can only handle a single\ncamera model and are unable to perform mixed-data training due to the metric\nambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets\nachieve zero-shot generalization by learning affine-invariant depths, which\ncannot recover real-world metrics. In this work, we show that the key to a\nzero-shot single-view metric depth model lies in the combination of large-scale\ndata training and resolving the metric ambiguity from various camera models. We\npropose a canonical camera space transformation module, which explicitly\naddresses the ambiguity problems and can be effortlessly plugged into existing\nmonocular models. Equipped with our module, monocular models can be stably\ntrained with over 8 million images with thousands of camera models, resulting\nin zero-shot generalization to in-the-wild images with unseen camera settings.\nExperiments demonstrate SOTA performance of our method on 7 zero-shot\nbenchmarks. Notably, our method won the championship in the 2nd Monocular Depth\nEstimation Challenge. Our method enables the accurate recovery of metric 3D\nstructures on randomly collected internet images, paving the way for plausible\nsingle-image metrology. The potential benefits extend to downstream tasks,\nwhich can be significantly improved by simply plugging in our model. For\nexample, our model relieves the scale drift issues of monocular-SLAM (Fig. 1),\nleading to high-quality metric scale dense mapping. The code is available at\nhttps://github.com/YvanYin/Metric3D.\n","authors":["Wei Yin","Chi Zhang","Hao Chen","Zhipeng Cai","Gang Yu","Kaixuan Wang","Xiaozhi Chen","Chunhua Shen"],"pdf_url":"https://arxiv.org/pdf/2307.10984v1.pdf","comment":"Accepted to ICCV 2023. Won the championship in the 2nd Monocular\n Depth Estimation Challenge. The code is available at\n https://github.com/YvanYin/Metric3D"},{"id":"http://arxiv.org/abs/2307.09676v2","updated":"2023-07-20T16:04:11Z","published":"2023-07-18T23:06:47Z","title":"Domain Adaptation based Enhanced Detection for Autonomous Driving in\n Foggy and Rainy Weather","summary":" Typically, object detection methods for autonomous driving that rely on\nsupervised learning make the assumption of a consistent feature distribution\nbetween the training and testing data, however such assumption may fail in\ndifferent weather conditions. Due to the domain gap, a detection model trained\nunder clear weather may not perform well in foggy and rainy conditions.\nOvercoming detection bottlenecks in foggy and rainy weather is a real challenge\nfor autonomous vehicles deployed in the wild. 
To bridge the domain gap and\nimprove the performance of object detection in foggy and rainy weather, this\npaper presents a novel framework for domain-adaptive object detection. The\nadaptations at both the image-level and object-level are intended to minimize\nthe differences in image style and object appearance between domains.\nFurthermore, in order to improve the model's performance on challenging\nexamples, we introduce a novel adversarial gradient reversal layer that\nconducts adversarial mining on difficult instances in addition to domain\nadaptation. Additionally, we suggest generating an auxiliary domain through\ndata augmentation to enforce a new domain-level metric regularization.\nExperimental findings on a public V2V benchmark exhibit a substantial enhancement\nin object detection specifically for foggy and rainy driving scenarios.\n","authors":["Jinlong Li","Runsheng Xu","Jin Ma","Qin Zou","Jiaqi Ma","Hongkai Yu"],"pdf_url":"https://arxiv.org/pdf/2307.09676v2.pdf","comment":"only change the title of this paper"},{"id":"http://arxiv.org/abs/2307.10974v1","updated":"2023-07-20T16:00:19Z","published":"2023-07-20T16:00:19Z","title":"Deep Spiking-UNet for Image Processing","summary":" U-Net, known for its simple yet efficient architecture, is widely utilized\nfor image processing tasks and is particularly suitable for deployment on\nneuromorphic chips. This paper introduces the novel concept of Spiking-UNet for\nimage processing, which combines the power of Spiking Neural Networks (SNNs)\nwith the U-Net architecture. To achieve an efficient Spiking-UNet, we face two\nprimary challenges: ensuring high-fidelity information propagation through the\nnetwork via spikes and formulating an effective training strategy. To address\nthe issue of information loss, we introduce multi-threshold spiking neurons,\nwhich improve the efficiency of information transmission within the\nSpiking-UNet. For the training strategy, we adopt a conversion and fine-tuning\npipeline that leverages pre-trained U-Net models. During the conversion process,\nsignificant variability in data distribution across different parts is observed\nwhen utilizing skip connections. Therefore, we propose a connection-wise\nnormalization method to prevent inaccurate firing rates. Furthermore, we adopt\na flow-based training method to fine-tune the converted models, reducing time\nsteps while preserving performance. Experimental results show that, on image\nsegmentation and denoising, our Spiking-UNet achieves comparable performance to\nits non-spiking counterpart, surpassing existing SNN methods. Compared with the\nconverted Spiking-UNet without fine-tuning, our Spiking-UNet reduces inference\ntime by approximately 90\\%. This research broadens the application scope of\nSNNs in image processing and is expected to inspire further exploration in the\nfield of neuromorphic engineering. 
The code for our Spiking-UNet implementation\nis available at https://github.com/SNNresearch/Spiking-UNet.\n","authors":["Hebei Li","Yueyi Zhang","Zhiwei Xiong","Zheng-jun Zha","Xiaoyan Sun"],"pdf_url":"https://arxiv.org/pdf/2307.10974v1.pdf","comment":"22 pages, 5 figures"},{"id":"http://arxiv.org/abs/2307.10955v1","updated":"2023-07-20T15:26:57Z","published":"2023-07-20T15:26:57Z","title":"Spinal nerve segmentation method and dataset construction in endoscopic\n surgical scenarios","summary":" Endoscopic surgery is currently an important treatment method in the field of\nspinal surgery, and avoiding damage to the spinal nerves through video guidance\nis a key challenge. This paper presents the first real-time segmentation method\nfor spinal nerves in endoscopic surgery, which provides crucial navigational\ninformation for surgeons. A finely annotated segmentation dataset of\napproximately 10,000 consecutive frames recorded during surgery is constructed\nfor the first time for this field, addressing the problem of semantic\nsegmentation. Based on this dataset, we propose FUnet (Frame-Unet), which\nachieves state-of-the-art performance by utilizing inter-frame information and\nself-attention mechanisms. We also conduct extended experiments on a similar\npolyp endoscopy video dataset and show that the model has good generalization\nability with advantageous performance. The dataset and code of this work are\npresented at: https://github.com/zzzzzzpc/FUnet.\n","authors":["Shaowu Peng","Pengcheng Zhao","Yongyu Ye","Junying Chen","Yunbing Chang","Xiaoqing Zheng"],"pdf_url":"https://arxiv.org/pdf/2307.10955v1.pdf","comment":"Accepted by MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.10954v1","updated":"2023-07-20T15:26:01Z","published":"2023-07-20T15:26:01Z","title":"Soft-tissue Driven Craniomaxillofacial Surgical Planning","summary":" In CMF surgery, the planning of bony movement to achieve a desired facial\noutcome is a challenging task. Current bone driven approaches focus on\nnormalizing the bone with the expectation that the facial appearance will be\ncorrected accordingly. However, due to the complex non-linear relationship\nbetween bony structure and facial soft-tissue, such bone-driven methods are\ninsufficient to correct facial deformities. Despite efforts to simulate facial\nchanges resulting from bony movement, surgical planning still relies on\niterative revisions and educated guesses. To address these issues, we propose a\nsoft-tissue driven framework that can automatically create and verify surgical\nplans. Our framework consists of a bony planner network that estimates the bony\nmovements required to achieve the desired facial outcome and a facial simulator\nnetwork that can simulate the possible facial changes resulting from the\nestimated bony movement plans. By combining these two models, we can verify and\ndetermine the final bony movement required for planning. The proposed framework\nwas evaluated using a clinical dataset, and our experimental results\ndemonstrate that the soft-tissue driven approach greatly improves the accuracy\nand efficacy of surgical planning when compared to the conventional bone-driven\napproach.\n","authors":["Xi Fang","Daeseung Kim","Xuanang Xu","Tianshu Kuang","Nathan Lampen","Jungwook Lee","Hannah H. Deng","Jaime Gateno","Michael A. K. Liebschner","James J. 
Xia","Pingkun Yan"],"pdf_url":"https://arxiv.org/pdf/2307.10954v1.pdf","comment":"Early accepted by MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.10953v1","updated":"2023-07-20T15:25:55Z","published":"2023-07-20T15:25:55Z","title":"PE-YOLO: Pyramid Enhancement Network for Dark Object Detection","summary":" Current object detection models have achieved good results on many benchmark\ndatasets, detecting objects in dark conditions remains a large challenge. To\naddress this issue, we propose a pyramid enhanced network (PENet) and joint it\nwith YOLOv3 to build a dark object detection framework named PE-YOLO. Firstly,\nPENet decomposes the image into four components of different resolutions using\nthe Laplacian pyramid. Specifically we propose a detail processing module (DPM)\nto enhance the detail of images, which consists of context branch and edge\nbranch. In addition, we propose a low-frequency enhancement filter (LEF) to\ncapture low-frequency semantics and prevent high-frequency noise. PE-YOLO\nadopts an end-to-end joint training approach and only uses normal detection\nloss to simplify the training process. We conduct experiments on the low-light\nobject detection dataset ExDark to demonstrate the effectiveness of ours. The\nresults indicate that compared with other dark detectors and low-light\nenhancement models, PE-YOLO achieves the advanced results, achieving 78.0% in\nmAP and 53.6 in FPS, respectively, which can adapt to object detection under\ndifferent low-light conditions. The code is available at\nhttps://github.com/XiangchenYin/PE-YOLO.\n","authors":["Xiangchen Yin","Zhenda Yu","Zetao Fei","Wenjun Lv","Xin Gao"],"pdf_url":"https://arxiv.org/pdf/2307.10953v1.pdf","comment":"Accepted at ICANN 2023"},{"id":"http://arxiv.org/abs/2307.10947v1","updated":"2023-07-20T15:21:28Z","published":"2023-07-20T15:21:28Z","title":"Improving Online Lane Graph Extraction by Object-Lane Clustering","summary":" Autonomous driving requires accurate local scene understanding information.\nTo this end, autonomous agents deploy object detection and online BEV lane\ngraph extraction methods as a part of their perception stack. In this work, we\npropose an architecture and loss formulation to improve the accuracy of local\nlane graph estimates by using 3D object detection outputs. The proposed method\nlearns to assign the objects to centerlines by considering the centerlines as\ncluster centers and the objects as data points to be assigned a probability\ndistribution over the cluster centers. This training scheme ensures direct\nsupervision on the relationship between lanes and objects, thus leading to\nbetter performance. The proposed method improves lane graph estimation\nsubstantially over state-of-the-art methods. The extensive ablations show that\nour method can achieve significant performance improvements by using the\noutputs of existing 3D object detection methods. 
Since our method uses the\ndetection outputs rather than detection method intermediate representations, a\nsingle model of our method can use any detection method at test time.\n","authors":["Yigit Baran Can","Alexander Liniger","Danda Pani Paudel","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2307.10947v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.10943v1","updated":"2023-07-20T15:13:29Z","published":"2023-07-20T15:13:29Z","title":"Proxy Anchor-based Unsupervised Learning for Continuous Generalized\n Category Discovery","summary":" Recent advances in deep learning have significantly improved the performance\nof various computer vision applications. However, discovering novel categories\nin an incremental learning scenario remains a challenging problem due to the\nlack of prior knowledge about the number and nature of new categories. Existing\nmethods for novel category discovery are limited by their reliance on labeled\ndatasets and prior knowledge about the number of novel categories and the\nproportion of novel samples in the batch. To address the limitations and more\naccurately reflect real-world scenarios, in this paper, we propose a novel\nunsupervised class incremental learning approach for discovering novel\ncategories on unlabeled sets without prior knowledge. The proposed method\nfine-tunes the feature extractor and proxy anchors on labeled sets, then splits\nsamples into old and novel categories and clusters on the unlabeled dataset.\nFurthermore, the proxy anchors-based exemplar generates representative category\nvectors to mitigate catastrophic forgetting. Experimental results demonstrate\nthat our proposed approach outperforms the state-of-the-art methods on\nfine-grained datasets under real-world scenarios.\n","authors":["Hyungmin Kim","Sungho Suh","Daehwan Kim","Daun Jeong","Hansang Cho","Junmo Kim"],"pdf_url":"https://arxiv.org/pdf/2307.10943v1.pdf","comment":"Accepted to ICCV 2023"},{"id":"http://arxiv.org/abs/2301.06262v2","updated":"2023-07-20T15:09:50Z","published":"2023-01-16T05:08:50Z","title":"Collaborative Perception in Autonomous Driving: Methods, Datasets and\n Challenges","summary":" Collaborative perception is essential to address occlusion and sensor failure\nissues in autonomous driving. In recent years, theoretical and experimental\ninvestigations of novel works for collaborative perception have increased\ntremendously. So far, however, few reviews have focused on systematical\ncollaboration modules and large-scale collaborative perception datasets. This\nwork reviews recent achievements in this field to bridge this gap and motivate\nfuture research. We start with a brief overview of collaboration schemes. After\nthat, we systematically summarize the collaborative perception methods for\nideal scenarios and real-world issues. The former focus on collaboration\nmodules and efficiency, and the latter is devoted to addressing the problems in\nactual application. Furthermore, we present large-scale public datasets and\nsummarize quantitative results on these benchmarks. Finally, we highlight gaps\nand overlooked challenges between current academic research and real-world\napplications.\n","authors":["Yushan Han","Hui Zhang","Huifang Li","Yi Jin","Congyan Lang","Yidong Li"],"pdf_url":"https://arxiv.org/pdf/2301.06262v2.pdf","comment":"18 pages, 6 figures. Accepted by IEEE Intelligent Transportation\n Systems Magazine. 
URL:\n https://github.com/CatOneTwo/Collaborative-Perception-in-Autonomous-Driving"},{"id":"http://arxiv.org/abs/2307.10934v1","updated":"2023-07-20T15:06:44Z","published":"2023-07-20T15:06:44Z","title":"OCTraN: 3D Occupancy Convolutional Transformer Network in Unstructured\n Traffic Scenarios","summary":" Modern approaches for vision-centric environment perception for autonomous\nnavigation make extensive use of self-supervised monocular depth estimation\nalgorithms that output disparity maps. However, when this disparity map is\nprojected onto 3D space, the errors in disparity are magnified, resulting in a\ndepth estimation error that increases quadratically as the distance from the\ncamera increases. Though Light Detection and Ranging (LiDAR) can solve this\nissue, it is expensive and not feasible for many applications. To address the\nchallenge of accurate ranging with low-cost sensors, we propose, OCTraN, a\ntransformer architecture that uses iterative-attention to convert 2D image\nfeatures into 3D occupancy features and makes use of convolution and transpose\nconvolution to efficiently operate on spatial information. We also develop a\nself-supervised training pipeline to generalize the model to any scene by\neliminating the need for LiDAR ground truth by substituting it with\npseudo-ground truth labels obtained from boosted monocular depth estimation.\n","authors":["Aditya Nalgunda Ganesh","Dhruval Pobbathi Badrinath","Harshith Mohan Kumar","Priya SS","Surabhi Narayan"],"pdf_url":"https://arxiv.org/pdf/2307.10934v1.pdf","comment":"This work was accepted as a spotlight presentation at the\n Transformers for Vision Workshop @CVPR 2023"},{"id":"http://arxiv.org/abs/2307.10927v1","updated":"2023-07-20T14:56:29Z","published":"2023-07-20T14:56:29Z","title":"Modeling 3D cardiac contraction and relaxation with point cloud\n deformation networks","summary":" Global single-valued biomarkers of cardiac function typically used in\nclinical practice, such as ejection fraction, provide limited insight on the\ntrue 3D cardiac deformation process and hence, limit the understanding of both\nhealthy and pathological cardiac mechanics. In this work, we propose the Point\nCloud Deformation Network (PCD-Net) as a novel geometric deep learning approach\nto model 3D cardiac contraction and relaxation between the extreme ends of the\ncardiac cycle. It employs the recent advances in point cloud-based deep\nlearning into an encoder-decoder structure, in order to enable efficient\nmulti-scale feature learning directly on multi-class 3D point cloud\nrepresentations of the cardiac anatomy. We evaluate our approach on a large\ndataset of over 10,000 cases from the UK Biobank study and find average Chamfer\ndistances between the predicted and ground truth anatomies below the pixel\nresolution of the underlying image acquisition. Furthermore, we observe similar\nclinical metrics between predicted and ground truth populations and show that\nthe PCD-Net can successfully capture subpopulation-specific differences between\nnormal subjects and myocardial infarction (MI) patients. 
We then demonstrate\nthat the learned 3D deformation patterns outperform multiple clinical\nbenchmarks by 13% and 7% in terms of area under the receiver operating\ncharacteristic curve for the tasks of prevalent MI detection and incident MI\nprediction and by 7% in terms of Harrell's concordance index for MI survival\nanalysis.\n","authors":["Marcel Beetz","Abhirup Banerjee","Vicente Grau"],"pdf_url":"https://arxiv.org/pdf/2307.10927v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10926v1","updated":"2023-07-20T14:52:45Z","published":"2023-07-20T14:52:45Z","title":"Confidence intervals for performance estimates in 3D medical image\n segmentation","summary":" Medical segmentation models are evaluated empirically. As such an evaluation\nis based on a limited set of example images, it is unavoidably noisy. Beyond a\nmean performance measure, reporting confidence intervals is thus crucial.\nHowever, this is rarely done in medical image segmentation. The width of the\nconfidence interval depends on the test set size and on the spread of the\nperformance measure (its standard-deviation across of the test set). For\nclassification, many test images are needed to avoid wide confidence intervals.\nSegmentation, however, has not been studied, and it differs by the amount of\ninformation brought by a given test image. In this paper, we study the typical\nconfidence intervals in medical image segmentation. We carry experiments on 3D\nimage segmentation using the standard nnU-net framework, two datasets from the\nMedical Decathlon challenge and two performance measures: the Dice accuracy and\nthe Hausdorff distance. We show that the parametric confidence intervals are\nreasonable approximations of the bootstrap estimates for varying test set sizes\nand spread of the performance metric. Importantly, we show that the test size\nneeded to achieve a given precision is often much lower than for classification\ntasks. Typically, a 1% wide confidence interval requires about 100-200 test\nsamples when the spread is low (standard-deviation around 3%). More difficult\nsegmentation tasks may lead to higher spreads and require over 1000 samples.\n","authors":["R. El Jurdi","G. Varoquax","O. Colliot"],"pdf_url":"https://arxiv.org/pdf/2307.10926v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2307.10924v1","updated":"2023-07-20T14:51:28Z","published":"2023-07-20T14:51:28Z","title":"Intrinsic Appearance Decomposition Using Point Cloud Representation","summary":" Intrinsic decomposition is to infer the albedo and shading from the image.\nSince it is a heavily ill-posed problem, previous methods rely on prior\nassumptions from 2D images, however, the exploration of the data representation\nitself is limited. The point cloud is known as a rich format of scene\nrepresentation, which naturally aligns the geometric information and the color\ninformation of an image. Our proposed method, Point Intrinsic Net, in short,\nPoInt-Net, jointly predicts the albedo, light source direction, and shading,\nusing point cloud representation. 
Experiments reveal the benefits of PoInt-Net,\nin terms of accuracy, it outperforms 2D representation approaches on multiple\nmetrics across datasets; in terms of efficiency, it trains on small-scale point\nclouds and performs stably on any-scale point clouds; in terms of robustness,\nit only trains on single object level dataset, and demonstrates reasonable\ngeneralization ability for unseen objects and scenes.\n","authors":["Xiaoyan Xing","Konrad Groh","Sezer Karaoglu","Theo Gevers"],"pdf_url":"https://arxiv.org/pdf/2307.10924v1.pdf","comment":"14 pages, 14 figures"},{"id":"http://arxiv.org/abs/2307.10922v1","updated":"2023-07-20T14:47:50Z","published":"2023-07-20T14:47:50Z","title":"Language-based Action Concept Spaces Improve Video Self-Supervised\n Learning","summary":" Recent contrastive language image pre-training has led to learning highly\ntransferable and robust image representations. However, adapting these models\nto video domains with minimal supervision remains an open problem. We explore a\nsimple step in that direction, using language tied self-supervised learning to\nadapt an image CLIP model to the video domain. A backbone modified for temporal\nmodeling is trained under self-distillation settings with train objectives\noperating in an action concept space. Feature vectors of various action\nconcepts extracted from a language encoder using relevant textual prompts\nconstruct this space. We introduce two train objectives, concept distillation\nand concept alignment, that retain generality of original representations while\nenforcing relations between actions and their attributes. Our approach improves\nzero-shot and linear probing performance on three action recognition\nbenchmarks.\n","authors":["Kanchana Ranasinghe","Michael Ryoo"],"pdf_url":"https://arxiv.org/pdf/2307.10922v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10915v1","updated":"2023-07-20T14:39:46Z","published":"2023-07-20T14:39:46Z","title":"Revisiting Fine-Tuning Strategies for Self-supervised Medical Imaging\n Analysis","summary":" Despite the rapid progress in self-supervised learning (SSL), end-to-end\nfine-tuning still remains the dominant fine-tuning strategy for medical imaging\nanalysis. However, it remains unclear whether this approach is truly optimal\nfor effectively utilizing the pre-trained knowledge, especially considering the\ndiverse categories of SSL that capture different types of features. In this\npaper, we first establish strong contrastive and restorative SSL baselines that\noutperform SOTA methods across four diverse downstream tasks. Building upon\nthese strong baselines, we conduct an extensive fine-tuning analysis across\nmultiple pre-training and fine-tuning datasets, as well as various fine-tuning\ndataset sizes. Contrary to the conventional wisdom of fine-tuning only the last\nfew layers of a pre-trained network, we show that fine-tuning intermediate\nlayers is more effective, with fine-tuning the second quarter (25-50%) of the\nnetwork being optimal for contrastive SSL whereas fine-tuning the third quarter\n(50-75%) of the network being optimal for restorative SSL. 
Compared to the\nde-facto standard of end-to-end fine-tuning, our best fine-tuning strategy,\nwhich fine-tunes a shallower network consisting of the first three quarters\n(0-75%) of the pre-trained network, yields improvements of as much as 5.48%.\nAdditionally, using these insights, we propose a simple yet effective method to\nleverage the complementary strengths of multiple SSL models, resulting in\nenhancements of up to 3.57% compared to using the best model alone. Hence, our\nfine-tuning strategies not only enhance the performance of individual SSL\nmodels, but also enable effective utilization of the complementary strengths\noffered by multiple SSL models, leading to significant improvements in\nself-supervised medical imaging analysis.\n","authors":["Muhammad Osama Khan","Yi Fang"],"pdf_url":"https://arxiv.org/pdf/2307.10915v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10912v1","updated":"2023-07-20T14:34:08Z","published":"2023-07-20T14:34:08Z","title":"WeakPolyp: You Only Look Bounding Box for Polyp Segmentation","summary":" Limited by expensive pixel-level labels, polyp segmentation models are\nplagued by data shortage and suffer from impaired generalization. In contrast,\npolyp bounding box annotations are much cheaper and more accessible. Thus, to\nreduce labeling cost, we propose to learn a weakly supervised polyp\nsegmentation model (i.e., WeakPolyp) completely based on bounding box\nannotations. However, coarse bounding boxes contain too much noise. To avoid\ninterference, we introduce the mask-to-box (M2B) transformation. By supervising\nthe outer box mask of the prediction instead of the prediction itself, M2B\ngreatly mitigates the mismatch between the coarse label and the precise\nprediction. But, M2B only provides sparse supervision, leading to non-unique\npredictions. Therefore, we further propose a scale consistency (SC) loss for\ndense supervision. By explicitly aligning predictions across the same image at\ndifferent scales, the SC loss largely reduces the variation of predictions.\nNote that our WeakPolyp is a plug-and-play model, which can be easily ported to\nother appealing backbones. Besides, the proposed modules are only used during\ntraining, bringing no computation cost to inference. Extensive experiments\ndemonstrate the effectiveness of our proposed WeakPolyp, which surprisingly\nachieves a comparable performance with a fully supervised model, requiring no\nmask annotations at all.\n","authors":["Jun Wei","Yiwen Hu","Shuguang Cui","S. Kevin Zhou","Zhen Li"],"pdf_url":"https://arxiv.org/pdf/2307.10912v1.pdf","comment":"accepted by MICCAI 2023, codes are available at\n https://github.com/weijun88/WeakPolyp"},{"id":"http://arxiv.org/abs/2306.14687v2","updated":"2023-07-20T14:29:39Z","published":"2023-06-26T13:32:09Z","title":"GSMorph: Gradient Surgery for cine-MRI Cardiac Deformable Registration","summary":" Deep learning-based deformable registration methods have been widely\ninvestigated in diverse medical applications. Learning-based deformable\nregistration relies on weighted objective functions trading off registration\naccuracy and smoothness of the deformation field. Therefore, they inevitably\nrequire tuning the hyperparameter for optimal registration performance. Tuning\nthe hyperparameters is highly computationally expensive and introduces\nundesired dependencies on domain knowledge. 
In this study, we construct a\nregistration model based on the gradient surgery mechanism, named GSMorph, to\nachieve a hyperparameter-free balance on multiple losses. In GSMorph, we\nreformulate the optimization procedure by projecting the gradient of similarity\nloss orthogonally to the plane associated with the smoothness constraint,\nrather than additionally introducing a hyperparameter to balance these two\ncompeting terms. Furthermore, our method is model-agnostic and can be merged\ninto any deep registration network without introducing extra parameters or\nslowing down inference. In this study, We compared our method with\nstate-of-the-art (SOTA) deformable registration approaches over two publicly\navailable cardiac MRI datasets. GSMorph proves superior to five SOTA\nlearning-based registration models and two conventional registration\ntechniques, SyN and Demons, on both registration accuracy and smoothness.\n","authors":["Haoran Dou","Ning Bi","Luyi Han","Yuhao Huang","Ritse Mann","Xin Yang","Dong Ni","Nishant Ravikumar","Alejandro F. Frangi","Yunzhi Huang"],"pdf_url":"https://arxiv.org/pdf/2306.14687v2.pdf","comment":"Accepted at MICCAI 2023"},{"id":"http://arxiv.org/abs/2110.05216v2","updated":"2023-07-20T14:29:07Z","published":"2021-10-11T12:32:56Z","title":"High-order Tensor Pooling with Attention for Action Recognition","summary":" We aim at capturing high-order statistics of feature vectors formed by a\nneural network, and propose end-to-end second- and higher-order pooling to form\na tensor descriptor. Tensor descriptors require a robust similarity measure due\nto low numbers of aggregated vectors and the burstiness phenomenon, when a\ngiven feature appears more/less frequently than statistically expected. The\nHeat Diffusion Process (HDP) on a graph Laplacian is closely related to the\nEigenvalue Power Normalization (EPN) of the covariance/auto-correlation matrix,\nwhose inverse forms a loopy graph Laplacian. We show that the HDP and the EPN\nplay the same role, i.e., to boost or dampen the magnitude of the eigenspectrum\nthus preventing the burstiness. We equip higher-order tensors with EPN which\nacts as a spectral detector of higher-order occurrences to prevent burstiness.\nWe also prove that for a tensor of order r built from d dimensional feature\ndescriptors, such a detector gives the likelihood if at least one higher-order\noccurrence is 'projected' into one of binom(d,r) subspaces represented by the\ntensor; thus forming a tensor power normalization metric endowed with\nbinom(d,r) such 'detectors'. For experimental contributions, we apply several\nsecond- and higher-order pooling variants to action recognition, provide\npreviously not presented comparisons of such pooling variants, and show\nstate-of-the-art results on HMDB-51, YUP++ and MPII Cooking Activities.\n","authors":["Piotr Koniusz","Lei Wang","Ke Sun"],"pdf_url":"https://arxiv.org/pdf/2110.05216v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10895v1","updated":"2023-07-20T14:18:44Z","published":"2023-07-20T14:18:44Z","title":"Variational Point Encoding Deformation for Dental Modeling","summary":" Digital dentistry has made significant advancements in recent years, yet\nnumerous challenges remain to be addressed. In this study, we release a new\nextensive dataset of tooth meshes to encourage further research. Additionally,\nwe propose Variational FoldingNet (VF-Net), which extends FoldingNet to enable\nprobabilistic learning of point cloud representations. 
A key challenge in\nexisting latent variable models for point clouds is the lack of a 1-to-1\nmapping between input points and output points. Instead, they must rely on\noptimizing Chamfer distances, a metric that does not have a normalized\ndistributional counterpart, preventing its usage in probabilistic models. We\ndemonstrate that explicit minimization of Chamfer distances can be replaced by\na suitable encoder, which allows us to increase computational efficiency while\nsimplifying the probabilistic extension. Our experimental findings present\nempirical evidence demonstrating the superior performance of VF-Net over\nexisting models in terms of dental scan reconstruction and extrapolation.\nAdditionally, our investigation highlights the robustness of VF-Net's latent\nrepresentations. These results underscore the promising prospects of VF-Net as\nan effective and reliable method for point cloud reconstruction and analysis.\n","authors":["Johan Ziruo Ye","Thomas Ørkild","Peter Lempel Søndergaard","Søren Hauberg"],"pdf_url":"https://arxiv.org/pdf/2307.10895v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10894v1","updated":"2023-07-20T14:15:20Z","published":"2023-07-20T14:15:20Z","title":"Human Motion Generation: A Survey","summary":" Human motion generation aims to generate natural human pose sequences and\nshows immense potential for real-world applications. Substantial progress has\nbeen made recently in motion data collection technologies and generation\nmethods, laying the foundation for increasing interest in human motion\ngeneration. Most research within this field focuses on generating human motions\nbased on conditional signals, such as text, audio, and scene contexts. While\nsignificant advancements have been made in recent years, the task continues to\npose challenges due to the intricate nature of human motion and its implicit\nrelationship with conditional signals. In this survey, we present a\ncomprehensive literature review of human motion generation, which, to the best\nof our knowledge, is the first of its kind in this field. We begin by\nintroducing the background of human motion and generative models, followed by\nan examination of representative methods for three mainstream sub-tasks:\ntext-conditioned, audio-conditioned, and scene-conditioned human motion\ngeneration. Additionally, we provide an overview of common datasets and\nevaluation metrics. Lastly, we discuss open problems and outline potential\nfuture research directions. We hope that this survey could provide the\ncommunity with a comprehensive glimpse of this rapidly evolving field and\ninspire novel ideas that address the outstanding challenges.\n","authors":["Wentao Zhu","Xiaoxuan Ma","Dongwoo Ro","Hai Ci","Jinlu Zhang","Jiaxin Shi","Feng Gao","Qi Tian","Yizhou Wang"],"pdf_url":"https://arxiv.org/pdf/2307.10894v1.pdf","comment":"20 pages, 5 figures"},{"id":"http://arxiv.org/abs/2307.10875v1","updated":"2023-07-20T13:47:30Z","published":"2023-07-20T13:47:30Z","title":"Risk-optimized Outlier Removal for Robust Point Cloud Classification","summary":" The popularity of point cloud deep models for safety-critical purposes has\nincreased, but the reliability and security of these models can be compromised\nby intentional or naturally occurring point cloud noise. To combat this issue,\nwe present a novel point cloud outlier removal method called PointCVaR, which\nempowers standard-trained models to eliminate additional outliers and restore\nthe data. 
Our approach begins by conducting attribution analysis to determine\nthe influence of each point on the model output, which we refer to as point\nrisk. We then optimize the process of filtering high-risk points using\nConditional Value at Risk (CVaR) as the objective. The rationale for this\napproach is based on the observation that noise points in point clouds tend to\ncluster in the tail of the risk distribution, with a low frequency but a high\nlevel of risk, resulting in significant interference with classification\nresults. Despite requiring no additional training effort, our method produces\nexceptional results in various removal-and-classification experiments for noisy\npoint clouds, which are corrupted by random noise, adversarial noise, and\nbackdoor trigger noise. Impressively, it achieves 87% accuracy in defense\nagainst the backdoor attack by removing triggers. Overall, the proposed\nPointCVaR effectively eliminates noise points and enhances point cloud\nclassification, making it a promising plug-in module for various models in\ndifferent scenarios.\n","authors":["Xinke Li","Junchi Lu"],"pdf_url":"https://arxiv.org/pdf/2307.10875v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10873v1","updated":"2023-07-20T13:43:48Z","published":"2023-07-20T13:43:48Z","title":"Conservative Estimation of Perception Relevance of Dynamic Objects for\n Safe Trajectories in Automotive Scenarios","summary":" Having efficient testing strategies is a core challenge that needs to be\novercome for the release of automated driving. This necessitates clear\nrequirements as well as suitable methods for testing. In this work, the\nrequirements for perception modules are considered with respect to relevance.\nThe concept of relevance currently remains insufficiently defined and\nspecified. In this paper, we propose a novel methodology to overcome this\nchallenge by exemplary application to collision safety in the highway domain.\nUsing this general system and use case specification, a corresponding concept\nfor relevance is derived. Irrelevant objects are thus defined as objects which\ndo not limit the set of safe actions available to the ego vehicle under\nconsideration of all uncertainties. As an initial step, the use case is\ndecomposed into functional scenarios with respect to collision relevance. For\neach functional scenario, possible actions of both the ego vehicle and any\nother dynamic object are formalized as equations. This set of possible actions\nis constrained by traffic rules, yielding relevance criteria. As a result, we\npresent a conservative estimation which dynamic objects are relevant for\nperception and need to be considered for a complete evaluation. The estimation\nprovides requirements which are applicable for offline testing and validation\nof perception components. A visualization is presented for examples from the\nhighD dataset, showing the plausibility of the results. Finally, a possibility\nfor a future validation of the presented relevance concept is outlined.\n","authors":["Ken Mori","Kai Storms","Steven Peters"],"pdf_url":"https://arxiv.org/pdf/2307.10873v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10867v1","updated":"2023-07-20T13:40:22Z","published":"2023-07-20T13:40:22Z","title":"FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with\n Human Feedback","summary":" Captions are crucial for understanding scientific visualizations and\ndocuments. 
Existing captioning methods for scientific figures rely on\nfigure-caption pairs extracted from documents for training, many of which fall\nshort with respect to metrics like helpfulness, explainability, and\nvisual-descriptiveness [15], leading to generated captions being misaligned with\nreader preferences. To enable the generation of high-quality figure captions,\nwe introduce FigCaps-HF, a new framework for figure-caption generation that can\nincorporate domain expert feedback in generating captions optimized for reader\npreferences. Our framework comprises 1) an automatic method for evaluating the\nquality of figure-caption pairs, and 2) a novel reinforcement learning with human\nfeedback (RLHF) method to optimize a generative figure-to-caption model for\nreader preferences. We demonstrate the effectiveness of our simple learning\nframework by improving performance over standard fine-tuning across different\ntypes of models. In particular, when using BLIP as the base model, our RLHF\nframework achieves a mean gain of 35.7%, 16.9%, and 9% in ROUGE, BLEU, and\nMeteor, respectively. Finally, we release a large-scale benchmark dataset with\nhuman feedback on figure-caption pairs to enable further evaluation and\ndevelopment of RLHF techniques for this problem.\n","authors":["Ashish Singh","Prateek Agarwal","Zixuan Huang","Arpita Singh","Tong Yu","Sungchul Kim","Victor Bursztyn","Nikos Vlassis","Ryan A. Rossi"],"pdf_url":"https://arxiv.org/pdf/2307.10867v1.pdf","comment":"19 pages, 4 figures. Benchmark Documentation:\n https://figcapshf.github.io/"},{"id":"http://arxiv.org/abs/2307.10864v1","updated":"2023-07-20T13:33:28Z","published":"2023-07-20T13:33:28Z","title":"Divide & Bind Your Attention for Improved Generative Semantic Nursing","summary":" Emerging large-scale text-to-image generative models, e.g., Stable Diffusion\n(SD), have exhibited overwhelming results with high fidelity. Despite the\nmagnificent progress, current state-of-the-art models still struggle to\ngenerate images fully adhering to the input prompt. Prior work, Attend &\nExcite, has introduced the concept of Generative Semantic Nursing (GSN), aiming\nto optimize cross-attention during inference time to better incorporate the\nsemantics. It demonstrates promising results in generating simple prompts,\ne.g., ``a cat and a dog''. However, its efficacy declines when dealing with\nmore complex prompts, and it does not explicitly address the problem of\nimproper attribute binding. To address the challenges posed by complex prompts\nor scenarios involving multiple entities and to achieve improved attribute\nbinding, we propose Divide & Bind. We introduce two novel loss objectives for\nGSN: an attendance loss and a binding loss. Our approach stands out in its\nability to faithfully synthesize desired objects with improved attribute\nalignment from complex prompts and exhibits superior performance across\nmultiple evaluation benchmarks. 
More videos and updates can be found on the\nproject page \\url{https://sites.google.com/view/divide-and-bind}.\n","authors":["Yumeng Li","Margret Keuper","Dan Zhang","Anna Khoreva"],"pdf_url":"https://arxiv.org/pdf/2307.10864v1.pdf","comment":"Project page: \\url{https://sites.google.com/view/divide-and-bind}"},{"id":"http://arxiv.org/abs/2307.10854v1","updated":"2023-07-20T13:17:30Z","published":"2023-07-20T13:17:30Z","title":"BlendFace: Re-designing Identity Encoders for Face-Swapping","summary":" The great advancements of generative adversarial networks and face\nrecognition models in computer vision have made it possible to swap identities\non images from single sources. Although many studies seem to have proposed\nalmost satisfactory solutions, we notice that previous methods still suffer from\nan identity-attribute entanglement that causes undesired attribute swapping,\nbecause widely used identity encoders, e.g., ArcFace, have some crucial attribute\nbiases owing to their pretraining on face recognition tasks. To address this\nissue, we design BlendFace, a novel identity encoder for face-swapping. The key\nidea behind BlendFace is that training face recognition models on blended images,\nwhose attributes are replaced with those of another person, mitigates\ninter-personal biases such as hairstyles. BlendFace feeds disentangled identity\nfeatures into generators and guides generators properly as an identity loss function.\nExtensive experiments demonstrate that BlendFace improves the\nidentity-attribute disentanglement in face-swapping models, maintaining a\ncomparable quantitative performance to previous methods.\n","authors":["Kaede Shiohara","Xingchao Yang","Takafumi Taketomi"],"pdf_url":"https://arxiv.org/pdf/2307.10854v1.pdf","comment":"ICCV2023. Code: https://github.com/mapooon/BlendFace, Webpage:\n https://mapooon.github.io/BlendFacePage/"},{"id":"http://arxiv.org/abs/2307.10853v1","updated":"2023-07-20T13:16:10Z","published":"2023-07-20T13:16:10Z","title":"Exploring Effective Priors and Efficient Models for Weakly-Supervised\n Change Detection","summary":" Weakly-supervised change detection (WSCD) aims to detect pixel-level changes\nwith only image-level annotations. Owing to its label efficiency, WSCD is\ndrawing increasing attention recently. However, current WSCD methods often\nencounter the challenge of change missing and fabricating, i.e., the\ninconsistency between image-level annotations and pixel-level predictions.\nSpecifically, change missing refers to the situation in which the WSCD model fails\nto predict any changed pixels, even though the image-level label indicates\nchanged, and vice versa for change fabricating. To address this challenge, in\nthis work, we leverage global-scale and local-scale priors in WSCD and propose\ntwo components: a Dilated Prior (DP) decoder and a Label Gated (LG) constraint.\nThe DP decoder decodes samples with the changed image-level label, skips\nsamples with the unchanged label, and replaces them with an all-unchanged\npixel-level label. The LG constraint is derived from the correspondence between\nchanged representations and image-level labels, penalizing the model when it\nmispredicts the change status. Additionally, we develop TransWCD, a simple yet\npowerful transformer-based model, showcasing the potential of weakly-supervised\nlearning in change detection. By integrating the DP decoder and LG constraint\ninto TransWCD, we form TransWCD-DL. 
Our proposed TransWCD and TransWCD-DL\nachieve significant +6.33% and +9.55% F1 score improvements over the\nstate-of-the-art methods on the WHU-CD dataset, respectively. Some performance\nmetrics even exceed several fully-supervised change detection (FSCD)\ncompetitors. Code will be available at\nhttps://github.com/zhenghuizhao/TransWCD.\n","authors":["Zhenghui Zhao","Lixiang Ru","Chen Wu"],"pdf_url":"https://arxiv.org/pdf/2307.10853v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10845v1","updated":"2023-07-20T13:07:41Z","published":"2023-07-20T13:07:41Z","title":"Self-paced Weight Consolidation for Continual Learning","summary":" Continual learning algorithms, which keep the parameters of new tasks close to\nthose of previous tasks, are popular for preventing catastrophic forgetting in\nsequential task learning settings. However, 1) the performance of the new\ncontinual learner will be degraded without distinguishing the contributions of\npreviously learned tasks; 2) the computational cost will be greatly increased\nwith the number of tasks, since most existing algorithms need to regularize all\nprevious tasks when learning new tasks. To address the above challenges, we\npropose a self-paced Weight Consolidation (spWC) framework to attain robust\ncontinual learning via evaluating the discriminative contributions of previous\ntasks. To be specific, we develop a self-paced regularization to reflect the\npriorities of past tasks by measuring difficulty based on a key performance\nindicator (i.e., accuracy). When encountering a new task, all previous tasks\nare sorted from \"difficult\" to \"easy\" based on the priorities. Then the\nparameters of the new continual learner will be learned via selectively\nmaintaining the knowledge amongst more difficult past tasks, which could well\novercome catastrophic forgetting with less computational cost. We adopt an\nalternative convex search to iteratively update the model parameters and\npriority weights in the bi-convex formulation. The proposed spWC framework is\nplug-and-play and is applicable to most continual learning algorithms (e.g.,\nEWC, MAS and RCIL) in different directions (e.g., classification and\nsegmentation). Experimental results on several public benchmark datasets\ndemonstrate that our proposed framework can effectively improve performance\nwhen compared with other popular continual learning algorithms.\n","authors":["Wei Cong","Yang Cong","Gan Sun","Yuyang Liu","Jiahua Dong"],"pdf_url":"https://arxiv.org/pdf/2307.10845v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10843v1","updated":"2023-07-20T13:04:26Z","published":"2023-07-20T13:04:26Z","title":"Global Precipitation Nowcasting of Integrated Multi-satellitE Retrievals\n for GPM: A U-Net Convolutional LSTM Architecture","summary":" This paper presents a deep learning architecture for nowcasting of\nprecipitation almost globally every 30 min with a 4-hour lead time. The\narchitecture fuses a U-Net and a convolutional long short-term memory (LSTM)\nneural network and is trained using data from the Integrated MultisatellitE\nRetrievals for GPM (IMERG) and a few key precipitation drivers from the Global\nForecast System (GFS). The impacts of different training loss functions,\nincluding the mean-squared error (regression) and the focal-loss\n(classification), on the quality of precipitation nowcasts are studied. 
The\nresults indicate that the regression network performs well in capturing light\nprecipitation (below 1.6 mm/hr), but the classification network can outperform\nthe regression network for nowcasting of precipitation extremes (>8 mm/hr), in\nterms of the critical success index (CSI). Using the Wasserstein distance, it\nis shown that the precipitation predicted by the classification network has a\nclass probability distribution closer to the IMERG than the regression network.\nIt is uncovered that the inclusion of the physical variables can improve\nprecipitation nowcasting, especially at longer lead times in both networks.\nTaking IMERG as a relative reference, a multi-scale analysis in terms of\nfractions skill score (FSS) shows that the nowcasting machine remains skillful\n(FSS > 0.5) at the resolution of 10 km compared to 50 km for GFS. For\nprecipitation rates greater than 4 mm/hr, only the classification network\nremains FSS-skillful on scales greater than 50 km within a 2-hour lead time.\n","authors":["Reyhaneh Rahimi","Ardeshir Ebtehaj","Ali Behrangi","Jackson Tan"],"pdf_url":"https://arxiv.org/pdf/2307.10843v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10842v1","updated":"2023-07-20T13:02:45Z","published":"2023-07-20T13:02:45Z","title":"Label Calibration for Semantic Segmentation Under Domain Shift","summary":" Performance of a pre-trained semantic segmentation model is likely to\nsubstantially decrease on data from a new domain. We show that a pre-trained model\ncan be adapted to unlabelled target domain data by calculating soft-label\nprototypes under the domain shift and making predictions according to the\nprototype closest to the vector with predicted class probabilities. The\nproposed adaptation procedure is fast, comes almost for free in terms of\ncomputational resources and leads to considerable performance improvements. We\ndemonstrate the benefits of such label calibration on the highly-practical\nsynthetic-to-real semantic segmentation problem.\n","authors":["Ondrej Bohdal","Da Li","Timothy Hospedales"],"pdf_url":"https://arxiv.org/pdf/2307.10842v1.pdf","comment":"ICLR 2023 Workshop on Pitfalls of Limited Data and Computation for\n Trustworthy ML"},{"id":"http://arxiv.org/abs/2307.08930v2","updated":"2023-07-20T12:41:19Z","published":"2023-07-18T02:35:01Z","title":"Unsupervised Deep Graph Matching Based on Cycle Consistency","summary":" We contribute to the sparsely populated area of unsupervised deep graph\nmatching with application to keypoint matching in images. Contrary to the\nstandard \\emph{supervised} approach, our method does not require ground truth\ncorrespondences between keypoint pairs. Instead, it is self-supervised by\nenforcing consistency of matchings between images of the same object category.\nAs the matching and the consistency loss are discrete, their derivatives cannot\nbe straightforwardly used for learning. We address this issue in a principled\nway by building our method upon the recent results on black-box differentiation\nof combinatorial solvers. 
This makes our method exceptionally flexible, as it\nis compatible with arbitrary network architectures and combinatorial solvers.\nOur experimental evaluation suggests that our technique sets a new\nstate-of-the-art for unsupervised graph matching.\n","authors":["Siddharth Tourani","Carsten Rother","Muhammad Haris Khan","Bogdan Savchynskyy"],"pdf_url":"https://arxiv.org/pdf/2307.08930v2.pdf","comment":"12 pages, 5 figures, 3 papers"},{"id":"http://arxiv.org/abs/2307.10824v1","updated":"2023-07-20T12:38:17Z","published":"2023-07-20T12:38:17Z","title":"Parse and Recall: Towards Accurate Lung Nodule Malignancy Prediction\n like Radiologists","summary":" Lung cancer is a leading cause of death worldwide and early screening is\ncritical for improving survival outcomes. In clinical practice, the contextual\nstructure of nodules and the accumulated experience of radiologists are the two\ncore elements related to the accuracy of identification of benign and malignant\nnodules. Contextual information provides comprehensive information about\nnodules such as location, shape, and peripheral vessels, and experienced\nradiologists can search for clues from previous cases as a reference to enrich\nthe basis of decision-making. In this paper, we propose a radiologist-inspired\nmethod to simulate the diagnostic process of radiologists, which is composed of\ncontext parsing and prototype recalling modules. The context parsing module\nfirst segments the context structure of nodules and then aggregates contextual\ninformation for a more comprehensive understanding of the nodule. The prototype\nrecalling module utilizes prototype-based learning to condense previously\nlearned cases as prototypes for comparative analysis, which is updated online\nin a momentum way during training. Building on the two modules, our method\nleverages both the intrinsic characteristics of the nodules and the external\nknowledge accumulated from other nodules to achieve a sound diagnosis. To meet\nthe needs of both low-dose and noncontrast screening, we collect a large-scale\ndataset of 12,852 and 4,029 nodules from low-dose and noncontrast CTs\nrespectively, each with pathology- or follow-up-confirmed labels. Experiments\non several datasets demonstrate that our method achieves advanced screening\nperformance on both low-dose and noncontrast scenarios.\n","authors":["Jianpeng Zhang","Xianghua Ye","Jianfeng Zhang","Yuxing Tang","Minfeng Xu","Jianfei Guo","Xin Chen","Zaiyi Liu","Jingren Zhou","Le Lu","Ling Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.10824v1.pdf","comment":"MICCAI 2023"},{"id":"http://arxiv.org/abs/2212.13792v2","updated":"2023-07-20T12:37:06Z","published":"2022-12-28T12:08:27Z","title":"Periocular Biometrics: A Modality for Unconstrained Scenarios","summary":" Periocular refers to the externally visible region of the face that surrounds\nthe eye socket. This feature-rich area can provide accurate identification in\nunconstrained or uncooperative scenarios, where the iris or face modalities may\nnot offer sufficient biometric cues due to factors such as partial occlusion or\nhigh subject-to-camera distance. The COVID-19 pandemic has further highlighted\nits importance, as the ocular region remained the only visible facial area even\nin controlled settings due to the widespread use of masks. 
This paper discusses\nthe state of the art in periocular biometrics, presenting an overall framework\nencompassing its most significant research aspects, which include: (a) ocular\ndefinition, acquisition, and detection; (b) identity recognition, including\ncombination with other modalities and use of various spectra; and (c) ocular\nsoft-biometric analysis. Finally, we conclude by addressing current challenges\nand proposing future directions.\n","authors":["Fernando Alonso-Fernandez","Josef Bigun","Julian Fierrez","Naser Damer","Hugo Proença","Arun Ross"],"pdf_url":"https://arxiv.org/pdf/2212.13792v2.pdf","comment":"Published at IEEE Computer journal"},{"id":"http://arxiv.org/abs/2307.10822v1","updated":"2023-07-20T12:32:25Z","published":"2023-07-20T12:32:25Z","title":"Gradient-Semantic Compensation for Incremental Semantic Segmentation","summary":" Incremental semantic segmentation aims to continually learn the segmentation\nof new coming classes without accessing the training data of previously learned\nclasses. However, most current methods fail to address catastrophic forgetting\nand background shift since they 1) treat all previous classes equally without\nconsidering different forgetting paces caused by imbalanced gradient\nback-propagation; 2) lack strong semantic guidance between classes. To tackle\nthe above challenges, in this paper, we propose a Gradient-Semantic\nCompensation (GSC) model, which surmounts incremental semantic segmentation\nfrom both gradient and semantic perspectives. Specifically, to address\ncatastrophic forgetting from the gradient aspect, we develop a step-aware\ngradient compensation that can balance forgetting paces of previously seen\nclasses via re-weighting gradient backpropagation. Meanwhile, we propose a\nsoft-sharp semantic relation distillation to distill consistent inter-class\nsemantic relations via soft labels for alleviating catastrophic forgetting from\nthe semantic aspect. In addition, we develop a prototypical pseudo re-labeling\nthat provides strong semantic guidance to mitigate background shift. It\nproduces high-quality pseudo labels for old classes in the background by\nmeasuring distances between pixels and class-wise prototypes. Extensive\nexperiments on three public datasets, i.e., Pascal VOC 2012, ADE20K, and\nCityscapes, demonstrate the effectiveness of our proposed GSC model.\n","authors":["Wei Cong","Yang Cong","Jiahua Dong","Gan Sun","Henghui Ding"],"pdf_url":"https://arxiv.org/pdf/2307.10822v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10816v1","updated":"2023-07-20T12:25:06Z","published":"2023-07-20T12:25:06Z","title":"BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained\n Diffusion","summary":" Recent text-to-image diffusion models have demonstrated an astonishing\ncapacity to generate high-quality images. However, researchers mainly studied\nthe way of synthesizing images with only text prompts. While some works have\nexplored using other modalities as conditions, considerable paired data, e.g.,\nbox/mask-image pairs, and fine-tuning time are required for nurturing models.\nAs such paired data is time-consuming and labor-intensive to acquire and\nrestricted to a closed set, this potentially becomes the bottleneck for\napplications in an open world. This paper focuses on the simplest form of\nuser-provided conditions, e.g., box or scribble. 
To mitigate the aforementioned\nproblem, we propose a training-free method to control objects and contexts in\nthe synthesized images adhering to the given spatial conditions. Specifically,\nthree spatial constraints, i.e., Inner-Box, Outer-Box, and Corner Constraints,\nare designed and seamlessly integrated into the denoising step of diffusion\nmodels, requiring no additional training and massive annotated layout data.\nExtensive results show that the proposed constraints can control what and where\nto present in the images while retaining the ability of the Stable Diffusion\nmodel to synthesize with high fidelity and diverse concept coverage. The code\nis publicly available at https://github.com/Sierkinhane/BoxDiff.\n","authors":["Jinheng Xie","Yuexiang Li","Yawen Huang","Haozhe Liu","Wentian Zhang","Yefeng Zheng","Mike Zheng Shou"],"pdf_url":"https://arxiv.org/pdf/2307.10816v1.pdf","comment":"Accepted by ICCV 2023. The paper is still being revised for better\n organization and comparison"},{"id":"http://arxiv.org/abs/2306.09683v2","updated":"2023-07-20T12:23:12Z","published":"2023-06-16T08:27:46Z","title":"Scaling Open-Vocabulary Object Detection","summary":" Open-vocabulary object detection has benefited greatly from pretrained\nvision-language models, but is still limited by the amount of available\ndetection training data. While detection training data can be expanded by using\nWeb image-text pairs as weak supervision, this has not been done at scales\ncomparable to image-level pretraining. Here, we scale up detection data with\nself-training, which uses an existing detector to generate pseudo-box\nannotations on image-text pairs. Major challenges in scaling self-training are\nthe choice of label space, pseudo-annotation filtering, and training\nefficiency. We present the OWLv2 model and OWL-ST self-training recipe, which\naddress these challenges. OWLv2 surpasses the performance of previous\nstate-of-the-art open-vocabulary detectors already at comparable training\nscales (~10M examples). However, with OWL-ST, we can scale to over 1B examples,\nyielding further large improvement: With an L/14 architecture, OWL-ST improves\nAP on LVIS rare classes, for which the model has seen no human box annotations,\nfrom 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale\ntraining for open-world localization, similar to what has been seen for image\nclassification and language modelling.\n","authors":["Matthias Minderer","Alexey Gritsenko","Neil Houlsby"],"pdf_url":"https://arxiv.org/pdf/2306.09683v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10813v1","updated":"2023-07-20T12:21:26Z","published":"2023-07-20T12:21:26Z","title":"Perceptual Quality Assessment of Omnidirectional Audio-visual Signals","summary":" Omnidirectional videos (ODVs) play an increasingly important role in the\napplication fields of medical, education, advertising, tourism, etc. Assessing\nthe quality of ODVs is significant for service-providers to improve the user's\nQuality of Experience (QoE). However, most existing quality assessment studies\nfor ODVs only focus on the visual distortions of videos, while ignoring that\nthe overall QoE also depends on the accompanying audio signals. 
In this paper,\nwe first establish a large-scale audio-visual quality assessment dataset for\nomnidirectional videos, which includes 375 distorted omnidirectional\naudio-visual (A/V) sequences generated from 15 high-quality pristine\nomnidirectional A/V contents, and the corresponding perceptual audio-visual\nquality scores. Then, we design three baseline methods for full-reference\nomnidirectional audio-visual quality assessment (OAVQA), which combine existing\nstate-of-the-art single-mode audio and video QA models via multimodal fusion\nstrategies. We validate the effectiveness of the A/V multimodal fusion method\nfor OAVQA on our dataset, which provides a new benchmark for omnidirectional\nQoE evaluation. Our dataset is available at https://github.com/iamazxl/OAVQA.\n","authors":["Xilei Zhu","Huiyu Duan","Yuqin Cao","Yuxin Zhu","Yucheng Zhu","Jing Liu","Li Chen","Xiongkuo Min","Guangtao Zhai"],"pdf_url":"https://arxiv.org/pdf/2307.10813v1.pdf","comment":"12 pages, 5 figures, to be published in CICAI2023"},{"id":"http://arxiv.org/abs/2009.03259v2","updated":"2023-07-20T12:11:56Z","published":"2020-09-07T17:27:27Z","title":"Implicit Multidimensional Projection of Local Subspaces","summary":" We propose a visualization method to understand the effect of\nmultidimensional projection on local subspaces, using implicit function\ndifferentiation. Here, we understand the local subspace as the multidimensional\nlocal neighborhood of data points. Existing methods focus on the projection of\nmultidimensional data points, and the neighborhood information is ignored. Our\nmethod is able to analyze the shape and directional information of the local\nsubspace to gain more insights into the global structure of the data through\nthe perception of local structures. Local subspaces are fitted by\nmultidimensional ellipses that are spanned by basis vectors. An accurate and\nefficient vector transformation method is proposed based on analytical\ndifferentiation of multidimensional projections formulated as implicit\nfunctions. The results are visualized as glyphs and analyzed using a full set\nof specifically-designed interactions supported in our efficient web-based\nvisualization tool. The usefulness of our method is demonstrated using various\nmulti- and high-dimensional benchmark datasets. Our implicit differentiation\nvector transformation is evaluated through numerical comparisons; the overall\nmethod is evaluated through exploration examples and use cases.\n","authors":["Rongzheng Bian","Yumeng Xue","Liang Zhou","Jian Zhang","Baoquan Chen","Daniel Weiskopf","Yunhai Wang"],"pdf_url":"https://arxiv.org/pdf/2009.03259v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10802v1","updated":"2023-07-20T12:10:29Z","published":"2023-07-20T12:10:29Z","title":"Meta-Transformer: A Unified Framework for Multimodal Learning","summary":" Multimodal learning aims to build models that can process and relate\ninformation from multiple modalities. Despite years of development in this\nfield, it still remains challenging to design a unified network for processing\nvarious modalities ($\\textit{e.g.}$ natural language, 2D images, 3D point\nclouds, audio, video, time series, tabular data) due to the inherent gaps among\nthem. In this work, we propose a framework, named Meta-Transformer, that\nleverages a $\\textbf{frozen}$ encoder to perform multimodal perception without\nany paired multimodal training data. 
In Meta-Transformer, the raw input data\nfrom various modalities are mapped into a shared token space, allowing a\nsubsequent encoder with frozen parameters to extract high-level semantic\nfeatures of the input data. Composed of three main components: a unified data\ntokenizer, a modality-shared encoder, and task-specific heads for downstream\ntasks, Meta-Transformer is the first framework to perform unified learning\nacross 12 modalities with unpaired data. Experiments on different benchmarks\nreveal that Meta-Transformer can handle a wide range of tasks including\nfundamental perception (text, image, point cloud, audio, video), practical\napplication (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph,\ntabular, and time-series). Meta-Transformer indicates a promising future for\ndeveloping unified multimodal intelligence with transformers. Code will be\navailable at https://github.com/invictus717/MetaTransformer\n","authors":["Yiyuan Zhang","Kaixiong Gong","Kaipeng Zhang","Hongsheng Li","Yu Qiao","Wanli Ouyang","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2307.10802v1.pdf","comment":"Project website: https://kxgong.github.io/meta_transformer/"},{"id":"http://arxiv.org/abs/2307.09906v2","updated":"2023-07-20T12:00:23Z","published":"2023-07-19T11:10:26Z","title":"Implicit Identity Representation Conditioned Memory Compensation Network\n for Talking Head video Generation","summary":" Talking head video generation aims to animate a human face in a still image\nwith dynamic poses and expressions using motion information derived from a\ntarget-driving video, while maintaining the person's identity in the source\nimage. However, dramatic and complex motions in the driving video cause\nambiguous generation, because the still source image cannot provide sufficient\nappearance information for occluded regions or delicate expression variations,\nwhich produces severe artifacts and significantly degrades the generation\nquality. To tackle this problem, we propose to learn a global facial\nrepresentation space, and design a novel implicit identity representation\nconditioned memory compensation network, coined as MCNet, for high-fidelity\ntalking head generation.~Specifically, we devise a network module to learn a\nunified spatial facial meta-memory bank from all training samples, which can\nprovide rich facial structure and appearance priors to compensate warped source\nfacial features for the generation. Furthermore, we propose an effective query\nmechanism based on implicit identity representations learned from the discrete\nkeypoints of the source image. It can greatly facilitate the retrieval of more\ncorrelated information from the memory bank for the compensation. Extensive\nexperiments demonstrate that MCNet can learn representative and complementary\nfacial memory, and can clearly outperform previous state-of-the-art talking\nhead generation methods on VoxCeleb1 and CelebV datasets. 
Please check our\n\\href{https://github.com/harlanhong/ICCV2023-MCNET}{Project}.\n","authors":["Fa-Ting Hong","Dan Xu"],"pdf_url":"https://arxiv.org/pdf/2307.09906v2.pdf","comment":"Accepted by ICCV2023, update the reference and figures"},{"id":"http://arxiv.org/abs/2307.10797v1","updated":"2023-07-20T11:59:42Z","published":"2023-07-20T11:59:42Z","title":"HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and\n Retarget Faces","summary":" In this paper, we present our method for neural face reenactment, called\nHyperReenact, that aims to generate realistic talking head images of a source\nidentity, driven by a target facial pose. Existing state-of-the-art face\nreenactment methods train controllable generative models that learn to\nsynthesize realistic facial images, yet producing reenacted faces that are\nprone to significant visual artifacts, especially under the challenging\ncondition of extreme head pose changes, or requiring expensive few-shot\nfine-tuning to better preserve the source identity characteristics. We propose\nto address these limitations by leveraging the photorealistic generation\nability and the disentangled properties of a pretrained StyleGAN2 generator, by\nfirst inverting the real images into its latent space and then using a\nhypernetwork to perform: (i) refinement of the source identity characteristics\nand (ii) facial pose re-targeting, eliminating this way the dependence on\nexternal editing methods that typically produce artifacts. Our method operates\nunder the one-shot setting (i.e., using a single source frame) and allows for\ncross-subject reenactment, without requiring any subject-specific fine-tuning.\nWe compare our method both quantitatively and qualitatively against several\nstate-of-the-art techniques on the standard benchmarks of VoxCeleb1 and\nVoxCeleb2, demonstrating the superiority of our approach in producing\nartifact-free images, exhibiting remarkable robustness even under extreme head\npose changes. We make the code and the pretrained models publicly available at:\nhttps://github.com/StelaBou/HyperReenact .\n","authors":["Stella Bounareli","Christos Tzelepis","Vasileios Argyriou","Ioannis Patras","Georgios Tzimiropoulos"],"pdf_url":"https://arxiv.org/pdf/2307.10797v1.pdf","comment":"Accepted for publication in ICCV 2023. Project page:\n https://stelabou.github.io/hyperreenact.github.io/ Code:\n https://github.com/StelaBou/HyperReenact"},{"id":"http://arxiv.org/abs/2307.10792v1","updated":"2023-07-20T11:45:38Z","published":"2023-07-20T11:45:38Z","title":"Optimizing PatchCore for Few/many-shot Anomaly Detection","summary":" Few-shot anomaly detection (AD) is an emerging sub-field of general AD, and\ntries to distinguish between normal and anomalous data using only few selected\nsamples. While newly proposed few-shot AD methods do compare against\npre-existing algorithms developed for the full-shot domain as baselines, they\ndo not dedicatedly optimize them for the few-shot setting. It thus remains\nunclear if the performance of such pre-existing algorithms can be further\nimproved. We address said question in this work. Specifically, we present a\nstudy on the AD/anomaly segmentation (AS) performance of PatchCore, the current\nstate-of-the-art full-shot AD/AS algorithm, in both the few-shot and the\nmany-shot settings. 
We hypothesize that further performance improvements can be\nrealized by (I) optimizing its various hyperparameters, and by (II)\ntransferring techniques known to improve few-shot supervised learning to the AD\ndomain. Exhaustive experiments on the public VisA and MVTec AD datasets reveal\nthat (I) significant performance improvements can be realized by optimizing\nhyperparameters such as the underlying feature extractor, and that (II)\nimage-level augmentations can, but are not guaranteed, to improve performance.\nBased on these findings, we achieve a new state of the art in few-shot AD on\nVisA, further demonstrating the merit of adapting pre-existing AD/AS methods to\nthe few-shot setting. Last, we identify the investigation of feature extractors\nwith a strong inductive bias as a potential future research direction for\n(few-shot) AD/AS.\n","authors":["João Santos","Triet Tran","Oliver Rippel"],"pdf_url":"https://arxiv.org/pdf/2307.10792v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10790v1","updated":"2023-07-20T11:42:24Z","published":"2023-07-20T11:42:24Z","title":"Behavioral Analysis of Vision-and-Language Navigation Agents","summary":" To be successful, Vision-and-Language Navigation (VLN) agents must be able to\nground instructions to actions based on their surroundings. In this work, we\ndevelop a methodology to study agent behavior on a skill-specific basis --\nexamining how well existing agents ground instructions about stopping, turning,\nand moving towards specified objects or rooms. Our approach is based on\ngenerating skill-specific interventions and measuring changes in agent\npredictions. We present a detailed case study analyzing the behavior of a\nrecent agent and then compare multiple agents in terms of skill-specific\ncompetency scores. This analysis suggests that biases from training have\nlasting effects on agent behavior and that existing models are able to ground\nsimple referring expressions. Our comparisons between models show that\nskill-specific scores correlate with improvements in overall VLN task\nperformance.\n","authors":["Zijiao Yang","Arjun Majumdar","Stefan Lee"],"pdf_url":"https://arxiv.org/pdf/2307.10790v1.pdf","comment":"accepted to CVPR2023"},{"id":"http://arxiv.org/abs/2307.10787v1","updated":"2023-07-20T11:36:45Z","published":"2023-07-20T11:36:45Z","title":"Feed-Forward Source-Free Domain Adaptation via Class Prototypes","summary":" Source-free domain adaptation has become popular because of its practical\nusefulness and no need to access source data. However, the adaptation process\nstill takes a considerable amount of time and is predominantly based on\noptimization that relies on back-propagation. In this work we present a simple\nfeed-forward approach that challenges the need for back-propagation based\nadaptation. Our approach is based on computing prototypes of classes under the\ndomain shift using a pre-trained model. 
It achieves strong improvements in\naccuracy compared to the pre-trained model and requires only a small fraction\nof time of existing domain adaptation methods.\n","authors":["Ondrej Bohdal","Da Li","Timothy Hospedales"],"pdf_url":"https://arxiv.org/pdf/2307.10787v1.pdf","comment":"ECCV 2022 Workshop on Out of Distribution Generalization in Computer\n Vision (OOD-CV)"},{"id":"http://arxiv.org/abs/2307.10784v1","updated":"2023-07-20T11:33:46Z","published":"2023-07-20T11:33:46Z","title":"SMURF: Spatial Multi-Representation Fusion for 3D Object Detection with\n 4D Imaging Radar","summary":" The 4D Millimeter wave (mmWave) radar is a promising technology for vehicle\nsensing due to its cost-effectiveness and operability in adverse weather\nconditions. However, the adoption of this technology has been hindered by\nsparsity and noise issues in radar point cloud data. This paper introduces\nspatial multi-representation fusion (SMURF), a novel approach to 3D object\ndetection using a single 4D imaging radar. SMURF leverages multiple\nrepresentations of radar detection points, including pillarization and density\nfeatures of a multi-dimensional Gaussian mixture distribution through kernel\ndensity estimation (KDE). KDE effectively mitigates measurement inaccuracy\ncaused by limited angular resolution and multi-path propagation of radar\nsignals. Additionally, KDE helps alleviate point cloud sparsity by capturing\ndensity features. Experimental evaluations on View-of-Delft (VoD) and\nTJ4DRadSet datasets demonstrate the effectiveness and generalization ability of\nSMURF, outperforming recently proposed 4D imaging radar-based\nsingle-representation models. Moreover, while using 4D imaging radar only,\nSMURF still achieves comparable performance to the state-of-the-art 4D imaging\nradar and camera fusion-based method, with an increase of 1.22% in the mean\naverage precision on bird's-eye view of TJ4DRadSet dataset and 1.32% in the 3D\nmean average precision on the entire annotated area of VoD dataset. Our\nproposed method demonstrates impressive inference time and addresses the\nchallenges of real-time detection, with the inference time no more than 0.05\nseconds for most scans on both datasets. This research highlights the benefits\nof 4D mmWave radar and is a strong benchmark for subsequent works regarding 3D\nobject detection with 4D imaging radar.\n","authors":["Jianan Liu","Qiuchi Zhao","Weiyi Xiong","Tao Huang","Qing-Long Han","Bing Zhu"],"pdf_url":"https://arxiv.org/pdf/2307.10784v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10782v1","updated":"2023-07-20T11:32:51Z","published":"2023-07-20T11:32:51Z","title":"See More and Know More: Zero-shot Point Cloud Segmentation via\n Multi-modal Visual Data","summary":" Zero-shot point cloud segmentation aims to make deep models capable of\nrecognizing novel objects in point cloud that are unseen in the training phase.\nRecent trends favor the pipeline which transfers knowledge from seen classes\nwith labels to unseen classes without labels. They typically align visual\nfeatures with semantic features obtained from word embedding by the supervision\nof seen classes' annotations. However, point cloud contains limited information\nto fully match with semantic features. In fact, the rich appearance information\nof images is a natural complement to the textureless point cloud, which is not\nwell explored in previous literature. 
Motivated by this, we propose a novel\nmulti-modal zero-shot learning method to better utilize the complementary\ninformation of point clouds and images for more accurate visual-semantic\nalignment. Extensive experiments are performed in two popular benchmarks, i.e.,\nSemanticKITTI and nuScenes, and our method outperforms current SOTA methods\nwith 52% and 49% improvement on average for unseen class mIoU, respectively.\n","authors":["Yuhang Lu","Qi Jiang","Runnan Chen","Yuenan Hou","Xinge Zhu","Yuexin Ma"],"pdf_url":"https://arxiv.org/pdf/2307.10782v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.10780v1","updated":"2023-07-20T11:30:12Z","published":"2023-07-20T11:30:12Z","title":"Learned Thresholds Token Merging and Pruning for Vision Transformers","summary":" Vision transformers have demonstrated remarkable success in a wide range of\ncomputer vision tasks over the last years. However, their high computational\ncosts remain a significant barrier to their practical deployment. In\nparticular, the complexity of transformer models is quadratic with respect to\nthe number of input tokens. Therefore techniques that reduce the number of\ninput tokens that need to be processed have been proposed. This paper\nintroduces Learned Thresholds token Merging and Pruning (LTMP), a novel\napproach that leverages the strengths of both token merging and token pruning.\nLTMP uses learned threshold masking modules that dynamically determine which\ntokens to merge and which to prune. We demonstrate our approach with extensive\nexperiments on vision transformers on the ImageNet classification task. Our\nresults demonstrate that LTMP achieves state-of-the-art accuracy across\nreduction rates while requiring only a single fine-tuning epoch, which is an\norder of magnitude faster than previous methods. Code is available at\nhttps://github.com/Mxbonn/ltmp .\n","authors":["Maxim Bonnaerens","Joni Dambre"],"pdf_url":"https://arxiv.org/pdf/2307.10780v1.pdf","comment":"Paper to be presented at Efficient Systems for Foundation Models\n Workshop at the International Conference on Machine Learning (ICML) 2023"},{"id":"http://arxiv.org/abs/2307.10776v1","updated":"2023-07-20T11:24:55Z","published":"2023-07-20T11:24:55Z","title":"Urban Radiance Field Representation with Deformable Neural Mesh\n Primitives","summary":" Neural Radiance Fields (NeRFs) have achieved great success in the past few\nyears. However, most current methods still require intensive resources due to\nray marching-based rendering. To construct urban-level radiance fields\nefficiently, we design Deformable Neural Mesh Primitive~(DNMP), and propose to\nparameterize the entire scene with such primitives. The DNMP is a flexible and\ncompact neural variant of classic mesh representation, which enjoys both the\nefficiency of rasterization-based rendering and the powerful neural\nrepresentation capability for photo-realistic image synthesis. Specifically, a\nDNMP consists of a set of connected deformable mesh vertices with paired vertex\nfeatures to parameterize the geometry and radiance information of a local area.\nTo constrain the degree of freedom for optimization and lower the storage\nbudgets, we enforce the shape of each primitive to be decoded from a relatively\nlow-dimensional latent space. The rendering colors are decoded from the vertex\nfeatures (interpolated with rasterization) by a view-dependent MLP. 
The DNMP\nprovides a new paradigm for urban-level scene representation with appealing\nproperties: $(1)$ High-quality rendering. Our method achieves leading\nperformance for novel view synthesis in urban scenarios. $(2)$ Low\ncomputational costs. Our representation enables fast rendering (2.07ms/1k\npixels) and low peak memory usage (110MB/1k pixels). We also present a\nlightweight version that can run 33$\\times$ faster than vanilla NeRFs, and\ncomparable to the highly-optimized Instant-NGP (0.61 vs 0.71ms/1k pixels).\nProject page: \\href{https://dnmp.github.io/}{https://dnmp.github.io/}.\n","authors":["Fan Lu","Yan Xu","Guang Chen","Hongsheng Li","Kwan-Yee Lin","Changjun Jiang"],"pdf_url":"https://arxiv.org/pdf/2307.10776v1.pdf","comment":"Accepted to ICCV2023"},{"id":"http://arxiv.org/abs/2307.10768v1","updated":"2023-07-20T10:57:02Z","published":"2023-07-20T10:57:02Z","title":"Decoding the Enigma: Benchmarking Humans and AIs on the Many Facets of\n Working Memory","summary":" Working memory (WM), a fundamental cognitive process facilitating the\ntemporary storage, integration, manipulation, and retrieval of information,\nplays a vital role in reasoning and decision-making tasks. Robust benchmark\ndatasets that capture the multifaceted nature of WM are crucial for the\neffective development and evaluation of AI WM models. Here, we introduce a\ncomprehensive Working Memory (WorM) benchmark dataset for this purpose. WorM\ncomprises 10 tasks and a total of 1 million trials, assessing 4\nfunctionalities, 3 domains, and 11 behavioral and neural characteristics of WM.\nWe jointly trained and tested state-of-the-art recurrent neural networks and\ntransformers on all these tasks. We also include human behavioral benchmarks as\nan upper bound for comparison. Our results suggest that AI models replicate\nsome characteristics of WM in the brain, most notably primacy and recency\neffects, and neural clusters and correlates specialized for different domains\nand functionalities of WM. In the experiments, we also reveal some limitations\nin existing models to approximate human behavior. This dataset serves as a\nvaluable resource for communities in cognitive psychology, neuroscience, and\nAI, offering a standardized framework to compare and enhance WM models,\ninvestigate WM's neural underpinnings, and develop WM models with human-like\ncapabilities. Our source code and data are available at\nhttps://github.com/ZhangLab-DeepNeuroCogLab/WorM.\n","authors":["Ankur Sikarwar","Mengmi Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.10768v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10763v1","updated":"2023-07-20T10:53:12Z","published":"2023-07-20T10:53:12Z","title":"MSQNet: Actor-agnostic Action Recognition with Multi-modal Query","summary":" Existing action recognition methods are typically actor-specific due to the\nintrinsic topological and apparent differences among the actors. This requires\nactor-specific pose estimation (e.g., humans vs. animals), leading to\ncumbersome model design complexity and high maintenance costs. Moreover, they\noften focus on learning the visual modality alone and single-label\nclassification whilst neglecting other available information sources (e.g.,\nclass name text) and the concurrent occurrence of multiple actions. To overcome\nthese limitations, we propose a new approach called 'actor-agnostic multi-modal\nmulti-label action recognition,' which offers a unified solution for various\ntypes of actors, including humans and animals. 
We further formulate a novel\nMulti-modal Semantic Query Network (MSQNet) model in a transformer-based object\ndetection framework (e.g., DETR), characterized by leveraging visual and\ntextual modalities to represent the action classes better. The elimination of\nactor-specific model designs is a key advantage, as it removes the need for\nactor pose estimation altogether. Extensive experiments on five publicly\navailable benchmarks show that our MSQNet consistently outperforms the prior\narts of actor-specific alternatives on human and animal single- and multi-label\naction recognition tasks by up to 50%. Code will be released at\nhttps://github.com/mondalanindya/MSQNet.\n","authors":["Anindya Mondal","Sauradip Nag","Joaquin M Prada","Xiatian Zhu","Anjan Dutta"],"pdf_url":"https://arxiv.org/pdf/2307.10763v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10753v1","updated":"2023-07-20T10:29:48Z","published":"2023-07-20T10:29:48Z","title":"LBL: Logarithmic Barrier Loss Function for One-class Classification","summary":" One-class classification (OCC) aims to train a classifier only with the\ntarget class data and attracts great attention for its strong applicability in\nreal-world applications. Although a lot of advances have been made in OCC, it\nstill lacks effective OCC loss functions for deep learning. In this paper,\na novel logarithmic barrier function based OCC loss (LBL), which assigns large\ngradients to the margin samples and thus derives a more compact hypersphere, is\nfirst proposed by smoothly approximating the OCC objective. However, the\noptimization of LBL may be unstable, especially when samples lie on the\nboundary, leading to an infinite loss. To address this issue, a\nunilateral relaxation Sigmoid function is introduced into LBL and a novel OCC\nloss named LBLSig is proposed. The LBLSig can be seen as a fusion of the mean\nsquare error (MSE) and the cross entropy (CE), and the optimization of LBLSig is\nsmoother owing to the unilateral relaxation Sigmoid function. The effectiveness\nof the proposed LBL and LBLSig is experimentally demonstrated in comparisons\nwith several state-of-the-art OCC algorithms on different network structures.\nThe source code can be found at https://github.com/ML-HDU/LBL_LBLSig.\n","authors":["Tianlei Wang","Dekang Liu","Wandong Zhang","Jiuwen Cao"],"pdf_url":"https://arxiv.org/pdf/2307.10753v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13960v2","updated":"2023-07-20T10:26:56Z","published":"2023-06-24T13:29:54Z","title":"Regular SE(3) Group Convolutions for Volumetric Medical Image Analysis","summary":" Regular group convolutional neural networks (G-CNNs) have been shown to\nincrease model performance and improve equivariance to different geometrical\nsymmetries. This work addresses the problem of SE(3), i.e., roto-translation\nequivariance, on volumetric data. Volumetric image data is prevalent in many\nmedical settings. Motivated by the recent work on separable group convolutions,\nwe devise an SE(3) group convolution kernel separated into a continuous SO(3)\n(rotation) kernel and a spatial kernel. We approximate equivariance to the\ncontinuous setting by sampling uniform SO(3) grids. Our continuous SO(3) kernel\nis parameterized via RBF interpolation on similarly uniform grids. We\ndemonstrate the advantages of our approach in volumetric medical image\nanalysis. 
Our SE(3) equivariant models consistently outperform CNNs and regular\ndiscrete G-CNNs on challenging medical classification tasks and show\nsignificantly improved generalization capabilities. Our approach achieves up to\na 16.5% gain in accuracy over regular CNNs.\n","authors":["Thijs P. Kuipers","Erik J. Bekkers"],"pdf_url":"https://arxiv.org/pdf/2306.13960v2.pdf","comment":"10 pages, 1 figure, 2 tables, accepted at MICCAI 2023. Updated\n version to camera ready version 1"},{"id":"http://arxiv.org/abs/2307.10745v1","updated":"2023-07-20T10:16:03Z","published":"2023-07-20T10:16:03Z","title":"EdgeAL: An Edge Estimation Based Active Learning Approach for OCT\n Segmentation","summary":" Active learning algorithms have become increasingly popular for training\nmodels with limited data. However, selecting data for annotation remains a\nchallenging problem due to the limited information available on unseen data. To\naddress this issue, we propose EdgeAL, which utilizes the edge information of\nunseen images as {\\it a priori} information for measuring uncertainty. The\nuncertainty is quantified by analyzing the divergence and entropy in model\npredictions across edges. This measure is then used to select superpixels for\nannotation. We demonstrate the effectiveness of EdgeAL on multi-class Optical\nCoherence Tomography (OCT) segmentation tasks, where we achieved a 99% dice\nscore while reducing the annotation label cost to 12%, 2.3%, and 3%,\nrespectively, on three publicly available datasets (Duke, AROI, and UMN). The\nsource code is available at \\url{https://github.com/Mak-Ta-Reque/EdgeAL}\n","authors":["Md Abdul Kadir","Hasan Md Tusfiqur Alam","Daniel Sonntag"],"pdf_url":"https://arxiv.org/pdf/2307.10745v1.pdf","comment":"This version of the contribution has been accepted for publication,\n after peer review (when applicable) but is not the Version of Record and does\n not reflect post-acceptance improvements, or any corrections. Use of this\n Accepted Version is subject to the publisher's Accepted Manuscript terms of\n use\n https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms"},{"id":"http://arxiv.org/abs/2307.02347v3","updated":"2023-07-20T09:54:41Z","published":"2023-07-05T15:03:10Z","title":"Detecting Images Generated by Deep Diffusion Models using their Local\n Intrinsic Dimensionality","summary":" Diffusion models recently have been successfully applied for the visual\nsynthesis of strikingly realistic appearing images. This raises strong concerns\nabout their potential for malicious purposes. In this paper, we propose using\nthe lightweight multi Local Intrinsic Dimensionality (multiLID), which has been\noriginally developed in context of the detection of adversarial examples, for\nthe automatic detection of synthetic images and the identification of the\naccording generator networks. In contrast to many existing detection\napproaches, which often only work for GAN-generated images, the proposed method\nprovides close to perfect detection results in many realistic use cases.\nExtensive experiments on known and newly created datasets demonstrate that the\nproposed multiLID approach exhibits superiority in diffusion detection and\nmodel identification. 
Since the empirical evaluations of recent publications on\nthe detection of generated images are often mainly focused on the\n\"LSUN-Bedroom\" dataset, we further establish a comprehensive benchmark for the\ndetection of diffusion-generated images, including samples from several\ndiffusion models with different image sizes.\n","authors":["Peter Lorenz","Ricard Durall","Janis Keuper"],"pdf_url":"https://arxiv.org/pdf/2307.02347v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.01091v2","updated":"2023-07-20T09:40:13Z","published":"2023-07-03T15:09:32Z","title":"UW-ProCCaps: UnderWater Progressive Colourisation with Capsules","summary":" Underwater images are fundamental for studying and understanding the status\nof marine life. We focus on reducing the memory space required for image\nstorage while the memory space consumption in the collecting phase limits the\ntime lasting of this phase leading to the need for more image collection\ncampaigns. We present a novel machine-learning model that reconstructs the\ncolours of underwater images from their luminescence channel, thus saving 2/3\nof the available storage space. Our model specialises in underwater colour\nreconstruction and consists of an encoder-decoder architecture. The encoder is\ncomposed of a convolutional encoder and a parallel specialised classifier\ntrained with webly-supervised data. The encoder and the decoder use layers of\ncapsules to capture the features of the entities in the image. The colour\nreconstruction process recalls the progressive and the generative adversarial\ntraining procedures. The progressive training gives the ground for a generative\nadversarial routine focused on the refining of colours giving the image bright\nand saturated colours which bring the image back to life. We validate the model\nboth qualitatively and quantitatively on four benchmark datasets. This is the\nfirst attempt at colour reconstruction in greyscale underwater images.\nExtensive results on four benchmark datasets demonstrate that our solution\noutperforms state-of-the-art (SOTA) solutions. We also demonstrate that the\ngenerated colourisation enhances the quality of images compared to enhancement\nmodels at the SOTA.\n","authors":["Rita Pucci","Niki Martinel"],"pdf_url":"https://arxiv.org/pdf/2307.01091v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10713v1","updated":"2023-07-20T09:13:32Z","published":"2023-07-20T09:13:32Z","title":"Kick Back & Relax: Learning to Reconstruct the World by Watching SlowTV","summary":" Self-supervised monocular depth estimation (SS-MDE) has the potential to\nscale to vast quantities of data. Unfortunately, existing approaches limit\nthemselves to the automotive domain, resulting in models incapable of\ngeneralizing to complex environments such as natural or indoor settings.\n To address this, we propose a large-scale SlowTV dataset curated from\nYouTube, containing an order of magnitude more data than existing automotive\ndatasets. SlowTV contains 1.7M images from a rich diversity of environments,\nsuch as worldwide seasonal hiking, scenic driving and scuba diving. Using this\ndataset, we train an SS-MDE model that provides zero-shot generalization to a\nlarge collection of indoor/outdoor datasets. The resulting model outperforms\nall existing SSL approaches and closes the gap on supervised SoTA, despite\nusing a more efficient architecture.\n We additionally introduce a collection of best-practices to further maximize\nperformance and zero-shot generalization. 
This includes 1) aspect ratio\naugmentation, 2) camera intrinsic estimation, 3) support frame randomization\nand 4) flexible motion estimation. Code is available at\nhttps://github.com/jspenmar/slowtv_monodepth.\n","authors":["Jaime Spencer","Chris Russell","Simon Hadfield","Richard Bowden"],"pdf_url":"https://arxiv.org/pdf/2307.10713v1.pdf","comment":"Accepted to ICCV2023"},{"id":"http://arxiv.org/abs/2307.10711v1","updated":"2023-07-20T09:06:21Z","published":"2023-07-20T09:06:21Z","title":"AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of\n Diffusion Probabilistic Models","summary":" Existing customization methods require access to multiple reference examples\nto align pre-trained diffusion probabilistic models (DPMs) with user-provided\nconcepts. This paper aims to address the challenge of DPM customization when\nthe only available supervision is a differentiable metric defined on the\ngenerated contents. Since the sampling procedure of DPMs involves recursive\ncalls to the denoising UNet, na\\\"ive gradient backpropagation requires storing\nthe intermediate states of all iterations, resulting in extremely high memory\nconsumption. To overcome this issue, we propose a novel method AdjointDPM,\nwhich first generates new samples from diffusion models by solving the\ncorresponding probability-flow ODEs. It then uses the adjoint sensitivity\nmethod to backpropagate the gradients of the loss to the models' parameters\n(including conditioning signals, network weights, and initial noises) by\nsolving another augmented ODE. To reduce numerical errors in both the forward\ngeneration and gradient backpropagation processes, we further reparameterize\nthe probability-flow ODE and augmented ODE as simple non-stiff ODEs using\nexponential integration. Finally, we demonstrate the effectiveness of\nAdjointDPM on three interesting tasks: converting visual effects into\nidentification text embeddings, finetuning DPMs for specific types of\nstylization, and optimizing initial noise to generate adversarial samples for\nsecurity auditing.\n","authors":["Jiachun Pan","Hanshu Yan","Jun Hao Liew","Vincent Y. F. Tan","Jiashi Feng"],"pdf_url":"https://arxiv.org/pdf/2307.10711v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.10552v2","updated":"2023-07-20T08:57:20Z","published":"2022-06-21T17:33:53Z","title":"Vicinity Vision Transformer","summary":" Vision transformers have shown great success on numerous computer vision\ntasks. However, its central component, softmax attention, prohibits vision\ntransformers from scaling up to high-resolution images, due to both the\ncomputational complexity and memory footprint being quadratic. Although linear\nattention was introduced in natural language processing (NLP) tasks to mitigate\na similar issue, directly applying existing linear attention to vision\ntransformers may not lead to satisfactory results. We investigate this problem\nand find that computer vision tasks focus more on local information compared\nwith NLP tasks. Based on this observation, we present a Vicinity Attention that\nintroduces a locality bias to vision transformers with linear complexity.\nSpecifically, for each image patch, we adjust its attention weight based on its\n2D Manhattan distance measured by its neighbouring patches. 
In this case, the\nneighbouring patches will receive stronger attention than far-away patches.\nMoreover, since our Vicinity Attention requires the token length to be much\nlarger than the feature dimension to show its efficiency advantages, we further\npropose a new Vicinity Vision Transformer (VVT) structure to reduce the feature\ndimension without degenerating the accuracy. We perform extensive experiments\non the CIFAR100, ImageNet1K, and ADE20K datasets to validate the effectiveness\nof our method. Our method has a slower growth rate of GFlops than previous\ntransformer-based and convolution-based networks when the input resolution\nincreases. In particular, our approach achieves state-of-the-art image\nclassification accuracy with 50% fewer parameters than previous methods.\n","authors":["Weixuan Sun","Zhen Qin","Hui Deng","Jianyuan Wang","Yi Zhang","Kaihao Zhang","Nick Barnes","Stan Birchfield","Lingpeng Kong","Yiran Zhong"],"pdf_url":"https://arxiv.org/pdf/2206.10552v2.pdf","comment":"code: https://github.com/OpenNLPLab/Vicinity-Vision-Transformer"},{"id":"http://arxiv.org/abs/2307.10705v1","updated":"2023-07-20T08:53:47Z","published":"2023-07-20T08:53:47Z","title":"TwinLiteNet: An Efficient and Lightweight Model for Driveable Area and\n Lane Segmentation in Self-Driving Cars","summary":" Semantic segmentation is a common task in autonomous driving to understand\nthe surrounding environment. Driveable Area Segmentation and Lane Detection are\nparticularly important for safe and efficient navigation on the road. However,\noriginal semantic segmentation models are computationally expensive and require\nhigh-end hardware, which is not feasible for embedded systems in autonomous\nvehicles. This paper proposes a lightweight model for the driveable area and\nlane line segmentation. TwinLiteNet is designed cheaply but achieves accurate\nand efficient segmentation results. We evaluate TwinLiteNet on the BDD100K\ndataset and compare it with modern models. Experimental results show that our\nTwinLiteNet performs similarly to existing approaches, requiring significantly\nfewer computational resources. Specifically, TwinLiteNet achieves a mIoU score\nof 91.3% for the Drivable Area task and 31.08% IoU for the Lane Detection task\nwith only 0.4 million parameters and achieves 415 FPS on GPU RTX A5000.\nFurthermore, TwinLiteNet can run in real-time on embedded devices with limited\ncomputing power, especially since it achieves 60FPS on Jetson Xavier NX, making\nit an ideal solution for self-driving vehicles. Code is available:\nurl{https://github.com/chequanghuy/TwinLiteNet}.\n","authors":["Quang Huy Che","Dinh Phuc Nguyen","Minh Quan Pham","Duc Khai Lam"],"pdf_url":"https://arxiv.org/pdf/2307.10705v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10165v2","updated":"2023-07-20T08:53:13Z","published":"2023-07-19T17:46:55Z","title":"Drone navigation and license place detection for vehicle location in\n indoor spaces","summary":" Millions of vehicles are transported every year, tightly parked in vessels or\nboats. To reduce the risks of associated safety issues like fires, knowing the\nlocation of vehicles is essential, since different vehicles may need different\nmitigation measures, e.g. electric cars. This work is aimed at creating a\nsolution based on a nano-drone that navigates across rows of parked vehicles\nand detects their license plates. We do so via a wall-following algorithm, and\na CNN trained to detect license plates. 
All computations are done in real-time\non the drone, which just sends position and detected images that allow the\ncreation of a 2D map with the position of the plates. Our solution is capable\nof reading all plates across eight test cases (with several rows of plates,\ndifferent drone speeds, or low light) by aggregation of measurements across\nseveral drone journeys.\n","authors":["Moa Arvidsson","Sithichot Sawirot","Cristofer Englund","Fernando Alonso-Fernandez","Martin Torstensson","Boris Duran"],"pdf_url":"https://arxiv.org/pdf/2307.10165v2.pdf","comment":"Published at VIII International Workshop on Artificial Intelligence\n and Pattern Recognition, IWAIPR 2023"},{"id":"http://arxiv.org/abs/2205.09753v2","updated":"2023-07-20T08:41:46Z","published":"2022-04-30T07:08:30Z","title":"HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory\n Prediction via Scene Encoding","summary":" Encoding a driving scene into vector representations has been an essential\ntask for autonomous driving that can benefit downstream tasks, e.g., trajectory\nprediction. The driving scene often involves heterogeneous elements, such as\ndifferent types of objects (agents, lanes, traffic signs), and the semantic\nrelations between objects are rich and diverse. Meanwhile, spatial relations\nacross elements are relative, which means that they need to be encoded in an\nego-centric manner instead of in a global coordinate system. Based on these\nobservations, we propose the Heterogeneous\nDriving Graph Transformer (HDGT), a backbone modelling the driving scene as a\nheterogeneous graph with different types of nodes and edges. For heterogeneous\ngraph construction, we connect different types of nodes according to diverse\nsemantic relations. For spatial relation encoding, the coordinates of each node\nas well as its in-edges are expressed in the local node-centric coordinate\nsystem. For the aggregation module in the graph neural network (GNN), we adopt\nthe transformer structure in a hierarchical way to fit the heterogeneous nature\nof the inputs. Experimental results show that HDGT achieves state-of-the-art\nperformance for the task of trajectory prediction on the INTERACTION Prediction\nChallenge and the Waymo Open Motion Challenge.\n","authors":["Xiaosong Jia","Penghao Wu","Li Chen","Yu Liu","Hongyang Li","Junchi Yan"],"pdf_url":"https://arxiv.org/pdf/2205.09753v2.pdf","comment":"Accepted at IEEE TPAMI in 2023. Code url:\n https://github.com/OpenDriveLab/HDGT"},{"id":"http://arxiv.org/abs/2307.10698v1","updated":"2023-07-20T08:39:20Z","published":"2023-07-20T08:39:20Z","title":"Reverse Knowledge Distillation: Training a Large Model using a Small One\n for Retinal Image Matching on Limited Data","summary":" Retinal image matching plays a crucial role in monitoring disease progression\nand treatment response. However, datasets with matched keypoints between\ntemporally separated pairs of images are not available in abundance to train\ntransformer-based models. We propose a novel approach based on reverse\nknowledge distillation to train large models with limited data while preventing\noverfitting. Firstly, we propose architectural modifications to a CNN-based\nsemi-supervised method called SuperRetina that help us improve its results on a\npublicly available dataset. 
Then, we train a computationally heavier model\nbased on a vision transformer encoder using the lighter CNN-based model, which\nis counter-intuitive in the field of knowledge-distillation research, where\ntraining lighter models based on heavier ones is the norm. Surprisingly, such\nreverse knowledge distillation improves generalization even further. Our\nexperiments suggest that high-dimensional fitting in representation space may\nprevent overfitting, unlike training directly to match the final output. We\nalso provide a public dataset with annotations for retinal image keypoint\ndetection and matching to help the research community develop algorithms for\nretinal image applications.\n","authors":["Sahar Almahfouz Nasser","Nihar Gupte","Amit Sethi"],"pdf_url":"https://arxiv.org/pdf/2307.10698v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10697v1","updated":"2023-07-20T08:38:50Z","published":"2023-07-20T08:38:50Z","title":"SqueezerFaceNet: Reducing a Small Face Recognition CNN Even More Via\n Filter Pruning","summary":" The widespread use of mobile devices for various digital services has created\na need for reliable and real-time person authentication. In this context,\nfacial recognition technologies have emerged as a dependable method for\nverifying users due to the prevalence of cameras in mobile devices and their\nintegration into everyday applications. The rapid advancement of deep\nConvolutional Neural Networks (CNNs) has led to numerous face verification\narchitectures. However, these models are often large and impractical for mobile\napplications, reaching sizes of hundreds of megabytes with millions of\nparameters. We address this issue by developing SqueezerFaceNet, a light face\nrecognition network with less than 1M parameters. This is achieved by applying\na network pruning method based on Taylor scores, where filters with small\nimportance scores are removed iteratively. Starting from an already small\nnetwork (of 1.24M parameters) based on SqueezeNet, we show that it can be\nfurther reduced (by up to 40%) without an appreciable loss in performance. To\nthe best of our knowledge, we are the first to evaluate network pruning methods\nfor the task of face recognition.\n","authors":["Fernando Alonso-Fernandez","Kevin Hernandez-Diaz","Jose Maria Buades Rubio","Josef Bigun"],"pdf_url":"https://arxiv.org/pdf/2307.10697v1.pdf","comment":"Published at VIII International Workshop on Artificial Intelligence\n and Pattern Recognition, IWAIPR 2023"},{"id":"http://arxiv.org/abs/2307.10696v1","updated":"2023-07-20T08:38:15Z","published":"2023-07-20T08:38:15Z","title":"SLPD: Slide-level Prototypical Distillation for WSIs","summary":" Improving the feature representation ability is the foundation of many whole\nslide pathological image (WSI) tasks. Recent works have achieved great success\nin pathology-specific self-supervised learning (SSL). However, most of them\nonly focus on learning patch-level representations, and thus there is still a\ngap between pretext and slide-level downstream tasks, e.g., subtyping, grading\nand staging. Aiming towards slide-level representations, we propose Slide-Level\nPrototypical Distillation (SLPD) to explore intra- and inter-slide semantic\nstructures for context modeling on WSIs. Specifically, we iteratively perform\nintra-slide clustering for the regions (4096x4096 patches) within each WSI to\nyield the prototypes and encourage the region representations to be closer to\nthe assigned prototypes. 
By representing each slide with its prototypes, we\nfurther select similar slides by the set distance of prototypes and assign the\nregions by cross-slide prototypes for distillation. SLPD achieves\nstate-of-the-art results on multiple slide-level benchmarks and demonstrates\nthat representation learning of semantic structures of slides can make a\nsuitable proxy task for WSI analysis. Code will be available at\nhttps://github.com/Carboxy/SLPD.\n","authors":["Zhimiao Yu","Tiancheng Lin","Yi Xu"],"pdf_url":"https://arxiv.org/pdf/2307.10696v1.pdf","comment":"International Conference on Medical Image Computing and Computer\n Assisted Intervention (MICCAI)"},{"id":"http://arxiv.org/abs/2307.10695v1","updated":"2023-07-20T08:38:01Z","published":"2023-07-20T08:38:01Z","title":"Self2Self+: Single-Image Denoising with Self-Supervised Learning and\n Image Quality Assessment Loss","summary":" Recently, denoising methods based on supervised learning have exhibited\npromising performance. However, their reliance on external datasets containing\nnoisy-clean image pairs restricts their applicability. To address this\nlimitation, researchers have focused on training denoising networks using\nsolely a set of noisy inputs. To improve the feasibility of denoising\nprocedures, in this study, we proposed a single-image self-supervised learning\nmethod in which only the noisy input image is used for network training. Gated\nconvolution was used for feature extraction and no-reference image quality\nassessment was used for guiding the training process. Moreover, the proposed\nmethod sampled instances from the input image dataset using Bernoulli sampling\nwith a certain dropout rate for training. The corresponding result was produced\nby averaging the generated predictions from various instances of the trained\nnetwork with dropouts. The experimental results indicated that the proposed\nmethod achieved state-of-the-art denoising performance on both synthetic and\nreal-world datasets. This highlights the effectiveness and practicality of our\nmethod as a potential solution for various noise removal tasks.\n","authors":["Jaekyun Ko","Sanghwan Lee"],"pdf_url":"https://arxiv.org/pdf/2307.10695v1.pdf","comment":"Technical report and supplemantry materials are combined into one\n paper. - Technical report: Page 1~7 - Supplemantry materials : Page 8~18"},{"id":"http://arxiv.org/abs/2302.08292v3","updated":"2023-07-20T08:35:26Z","published":"2023-02-16T13:41:19Z","title":"Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation\n for autonomous vehicles","summary":" Autonomous driving (AD) perception today relies heavily on deep learning\nbased architectures requiring large scale annotated datasets with their\nassociated costs for curation and annotation. The 3D semantic data are useful\nfor core perception tasks such as obstacle detection and ego-vehicle\nlocalization. We propose a new dataset, Navya 3D Segmentation (Navya3DSeg),\nwith a diverse label space corresponding to a large scale production grade\noperational domain, including rural, urban, industrial sites and universities\nfrom 13 countries. It contains 23 labeled sequences and 25 supplementary\nsequences without labels, designed to explore self-supervised and\nsemi-supervised semantic segmentation benchmarks on point clouds. 
We also\npropose a novel method for sequential dataset split generation based on\niterative multi-label stratification, and demonstrated to achieve a +1.2% mIoU\nimprovement over the original split proposed by SemanticKITTI dataset. A\ncomplete benchmark for semantic segmentation task was performed, with state of\nthe art methods. Finally, we demonstrate an Active Learning (AL) based dataset\ndistillation framework. We introduce a novel heuristic-free sampling method\ncalled ego-pose distance based sampling in the context of AL. A detailed\npresentation on the dataset is available here\nhttps://www.youtube.com/watch?v=5m6ALIs-s20.\n","authors":["Alexandre Almin","Léo Lemarié","Anh Duong","B Ravi Kiran"],"pdf_url":"https://arxiv.org/pdf/2302.08292v3.pdf","comment":"Accepted version to IEEE RA-L. Version with supplementary materials"},{"id":"http://arxiv.org/abs/2307.10685v1","updated":"2023-07-20T08:25:38Z","published":"2023-07-20T08:25:38Z","title":"Pre-train, Adapt and Detect: Multi-Task Adapter Tuning for Camouflaged\n Object Detection","summary":" Camouflaged object detection (COD), aiming to segment camouflaged objects\nwhich exhibit similar patterns with the background, is a challenging task. Most\nexisting works are dedicated to establishing specialized modules to identify\ncamouflaged objects with complete and fine details, while the boundary can not\nbe well located for the lack of object-related semantics. In this paper, we\npropose a novel ``pre-train, adapt and detect\" paradigm to detect camouflaged\nobjects. By introducing a large pre-trained model, abundant knowledge learned\nfrom massive multi-modal data can be directly transferred to COD. A lightweight\nparallel adapter is inserted to adjust the features suitable for the downstream\nCOD task. Extensive experiments on four challenging benchmark datasets\ndemonstrate that our method outperforms existing state-of-the-art COD models by\nlarge margins. Moreover, we design a multi-task learning scheme for tuning the\nadapter to exploit the shareable knowledge across different semantic classes.\nComprehensive experimental results showed that the generalization ability of\nour model can be substantially improved with multi-task adapter initialization\non source tasks and multi-task adaptation on target tasks.\n","authors":["Yinghui Xing","Dexuan Kong","Shizhou Zhang","Geng Chen","Lingyan Ran","Peng Wang","Yanning Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.10685v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.12112v3","updated":"2023-07-20T08:16:09Z","published":"2023-03-21T18:03:14Z","title":"Positive-Augmented Contrastive Learning for Image and Video Captioning\n Evaluation","summary":" The CLIP model has been recently proven to be very effective for a variety of\ncross-modal tasks, including the evaluation of captions generated from\nvision-and-language architectures. In this paper, we propose a new recipe for a\ncontrastive-based evaluation metric for image captioning, namely\nPositive-Augmented Contrastive learning Score (PAC-S), that in a novel way\nunifies the learning of a contrastive visual-semantic space with the addition\nof generated images and text on curated data. Experiments spanning several\ndatasets demonstrate that our new metric achieves the highest correlation with\nhuman judgments on both images and videos, outperforming existing\nreference-based metrics like CIDEr and SPICE and reference-free metrics like\nCLIP-Score. 
Finally, we test the system-level correlation of the proposed\nmetric when considering popular image captioning approaches, and assess the\nimpact of employing different cross-modal features. Our source code and trained\nmodels are publicly available at: https://github.com/aimagelab/pacscore.\n","authors":["Sara Sarto","Manuele Barraco","Marcella Cornia","Lorenzo Baraldi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2303.12112v3.pdf","comment":"CVPR 2023 (highlight paper)"},{"id":"http://arxiv.org/abs/2307.05921v3","updated":"2023-07-20T08:14:17Z","published":"2023-07-12T05:36:47Z","title":"Reading Radiology Imaging Like The Radiologist","summary":" Automated radiology report generation aims to generate radiology reports that\ncontain rich, fine-grained descriptions of radiology imaging. Compared with\nimage captioning in the natural image domain, medical images are very similar\nto each other, with only minor differences in the occurrence of diseases. Given\nthe importance of these minor differences in the radiology report, it is\ncrucial to encourage the model to focus more on the subtle regions of disease\noccurrence. Secondly, the problem of visual and textual data biases is serious.\nNot only do normal cases make up the majority of the dataset, but sentences\ndescribing areas with pathological changes also constitute only a small part of\nthe paragraph. Lastly, generating medical image reports involves the challenge\nof long text generation, which requires more expertise and empirical training\nin medical knowledge. As a result, the difficulty of generating such reports is\nincreased. To address these challenges, we propose a disease-oriented retrieval\nframework that utilizes similar reports as prior knowledge references. We\ndesign a factual consistency captioning generator to generate more accurate and\nfactually consistent disease descriptions. Our framework can find most similar\nreports for a given disease from the CXR database by retrieving a\ndisease-oriented mask consisting of the position and morphological\ncharacteristics. By referencing the disease-oriented similar report and the\nvisual features, the factual consistency model can generate a more accurate\nradiology report.\n","authors":["Yuhao Wang"],"pdf_url":"https://arxiv.org/pdf/2307.05921v3.pdf","comment":"There are data writing errors in the paper"},{"id":"http://arxiv.org/abs/2307.10677v1","updated":"2023-07-20T07:57:14Z","published":"2023-07-20T07:57:14Z","title":"Deep learning for classification of noisy QR codes","summary":" We wish to define the limits of a classical classification model based on\ndeep learning when applied to abstract images, which do not represent visually\nidentifiable objects.QR codes (Quick Response codes) fall into this category of\nabstract images: one bit corresponding to one encoded character, QR codes were\nnot designed to be decoded manually. To understand the limitations of a deep\nlearning-based model for abstract image classification, we train an image\nclassification model on QR codes generated from information obtained when\nreading a health pass. We compare a classification model with a classical\n(deterministic) decoding method in the presence of noise. This study allows us\nto conclude that a model based on deep learning can be relevant for the\nunderstanding of abstract images.\n","authors":["Rebecca Leygonie","Sylvain Lobry"," )","Laurent Wendling (LIPADE)"],"pdf_url":"https://arxiv.org/pdf/2307.10677v1.pdf","comment":"in French language. 
RFIAP 2022 - Reconnaissance des Formes, Image,\n Apprentissage et Perception, Jul 2022, Vannes (Bretagne), France"},{"id":"http://arxiv.org/abs/2307.10667v1","updated":"2023-07-20T07:47:48Z","published":"2023-07-20T07:47:48Z","title":"Efficient Unified Demosaicing for Bayer and Non-Bayer Patterned Image\n Sensors","summary":" As the physical size of recent CMOS image sensors (CIS) gets smaller, the\nlatest mobile cameras are adopting unique non-Bayer color filter array (CFA)\npatterns (e.g., Quad, Nona, QxQ), which consist of homogeneous color units with\nadjacent pixels. These non-Bayer sensors are superior to conventional Bayer CFA\nthanks to their changeable pixel-bin sizes for different light conditions but\nmay introduce visual artifacts during demosaicing due to their inherent pixel\npattern structures and sensor hardware characteristics. Previous demosaicing\nmethods have primarily focused on Bayer CFA, necessitating distinct\nreconstruction methods for non-Bayer patterned CIS with various CFA modes under\ndifferent lighting conditions. In this work, we propose an efficient unified\ndemosaicing method that can be applied to both conventional Bayer RAW and\nvarious non-Bayer CFAs' RAW data in different operation modes. Our Knowledge\nLearning-based demosaicing model for Adaptive Patterns, namely KLAP, utilizes\nCFA-adaptive filters for only 1% key filters in the network for each CFA, but\nstill manages to effectively demosaic all the CFAs, yielding comparable\nperformance to the large-scale models. Furthermore, by employing meta-learning\nduring inference (KLAP-M), our model is able to eliminate unknown\nsensor-generic artifacts in real RAW data, effectively bridging the gap between\nsynthetic images and real sensor RAW. Our KLAP and KLAP-M methods achieved\nstate-of-the-art demosaicing performance in both synthetic and real RAW data of\nBayer and non-Bayer CFAs.\n","authors":["Haechang Lee","Dongwon Park","Wongi Jeong","Kijeong Kim","Hyunwoo Je","Dongil Ryu","Se Young Chun"],"pdf_url":"https://arxiv.org/pdf/2307.10667v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10664v1","updated":"2023-07-20T07:46:34Z","published":"2023-07-20T07:46:34Z","title":"Lighting up NeRF via Unsupervised Decomposition and Enhancement","summary":" Neural Radiance Field (NeRF) is a promising approach for synthesizing novel\nviews, given a set of images and the corresponding camera poses of a scene.\nHowever, images photographed from a low-light scene can hardly be used to train\na NeRF model to produce high-quality results, due to their low pixel\nintensities, heavy noise, and color distortion. Combining existing low-light\nimage enhancement methods with NeRF methods also does not work well due to the\nview inconsistency caused by the individual 2D enhancement process. In this\npaper, we propose a novel approach, called Low-Light NeRF (or LLNeRF), to\nenhance the scene representation and synthesize normal-light novel views\ndirectly from sRGB low-light images in an unsupervised manner. The core of our\napproach is a decomposition of radiance field learning, which allows us to\nenhance the illumination, reduce noise and correct the distorted colors jointly\nwith the NeRF optimization process. Our method is able to produce novel view\nimages with proper lighting and vivid colors and details, given a collection of\ncamera-finished low dynamic range (8-bits/channel) images from a low-light\nscene. 
Experiments demonstrate that our method outperforms existing low-light\nenhancement methods and NeRF methods.\n","authors":["Haoyuan Wang","Xiaogang Xu","Ke Xu","Rynson WH. Lau"],"pdf_url":"https://arxiv.org/pdf/2307.10664v1.pdf","comment":"ICCV 2023. Project website: https://whyy.site/paper/llnerf"},{"id":"http://arxiv.org/abs/2306.16997v2","updated":"2023-07-20T07:29:03Z","published":"2023-06-29T14:54:10Z","title":"Unsupervised 3D registration through optimization-guided cyclical\n self-training","summary":" State-of-the-art deep learning-based registration methods employ three\ndifferent learning strategies: supervised learning, which requires costly\nmanual annotations, unsupervised learning, which heavily relies on hand-crafted\nsimilarity metrics designed by domain experts, or learning from synthetic data,\nwhich introduces a domain shift. To overcome the limitations of these\nstrategies, we propose a novel self-supervised learning paradigm for\nunsupervised registration, relying on self-training. Our idea is based on two\nkey insights. Feature-based differentiable optimizers 1) perform reasonable\nregistration even from random features and 2) stabilize the training of the\npreceding feature extraction network on noisy labels. Consequently, we propose\ncyclical self-training, where pseudo labels are initialized as the displacement\nfields inferred from random features and cyclically updated based on more and\nmore expressive features from the learning feature extractor, yielding a\nself-reinforcement effect. We evaluate the method for abdomen and lung\nregistration, consistently surpassing metric-based supervision and\noutperforming diverse state-of-the-art competitors. Source code is available at\nhttps://github.com/multimodallearning/reg-cyclical-self-train.\n","authors":["Alexander Bigalke","Lasse Hansen","Tony C. W. Mok","Mattias P. Heinrich"],"pdf_url":"https://arxiv.org/pdf/2306.16997v2.pdf","comment":"accepted at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.10642v1","updated":"2023-07-20T07:12:56Z","published":"2023-07-20T07:12:56Z","title":"RetouchingFFHQ: A Large-scale Dataset for Fine-grained Face Retouching\n Detection","summary":" The widespread use of face retouching filters on short-video platforms has\nraised concerns about the authenticity of digital appearances and the impact of\ndeceptive advertising. To address these issues, there is a pressing need to\ndevelop advanced face retouching techniques. However, the lack of large-scale\nand fine-grained face retouching datasets has been a major obstacle to progress\nin this field. In this paper, we introduce RetouchingFFHQ, a large-scale and\nfine-grained face retouching dataset that contains over half a million\nconditionally-retouched images. RetouchingFFHQ stands out from previous\ndatasets due to its large scale, high quality, fine-grainedness, and\ncustomization. By including four typical types of face retouching operations\nand different retouching levels, we extend the binary face retouching detection\ninto a fine-grained, multi-retouching type, and multi-retouching level\nestimation problem. Additionally, we propose a Multi-granularity Attention\nModule (MAM) as a plugin for CNN backbones for enhanced cross-scale\nrepresentation learning. Extensive experiments using different baselines as\nwell as our proposed method on RetouchingFFHQ show decent performance on face\nretouching detection. 
With the proposed new dataset, we believe there is great\npotential for future work to tackle the challenging problem of real-world\nfine-grained face retouching detection.\n","authors":["Qichao Ying","Jiaxin Liu","Sheng Li","Haisheng Xu","Zhenxing Qian","Xinpeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.10642v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2307.10638v1","updated":"2023-07-20T07:08:24Z","published":"2023-07-20T07:08:24Z","title":"Quantized Feature Distillation for Network Quantization","summary":" Neural network quantization aims to accelerate and trim full-precision neural\nnetwork models by using low bit approximations. Methods adopting the\nquantization aware training (QAT) paradigm have recently seen a rapid growth,\nbut are often conceptually complicated. This paper proposes a novel and highly\neffective QAT method, quantized feature distillation (QFD). QFD first trains a\nquantized (or binarized) representation as the teacher, then quantize the\nnetwork using knowledge distillation (KD). Quantitative results show that QFD\nis more flexible and effective (i.e., quantization friendly) than previous\nquantization methods. QFD surpasses existing methods by a noticeable margin on\nnot only image classification but also object detection, albeit being much\nsimpler. Furthermore, QFD quantizes ViT and Swin-Transformer on MS-COCO\ndetection and segmentation, which verifies its potential in real world\ndeployment. To the best of our knowledge, this is the first time that vision\ntransformers have been quantized in object detection and image segmentation\ntasks.\n","authors":["Ke Zhu","Yin-Yin He","Jianxin Wu"],"pdf_url":"https://arxiv.org/pdf/2307.10638v1.pdf","comment":"AAAI2023"},{"id":"http://arxiv.org/abs/2305.08396v3","updated":"2023-07-20T07:06:03Z","published":"2023-05-15T07:23:54Z","title":"MaxViT-UNet: Multi-Axis Attention for Medical Image Segmentation","summary":" Convolutional Neural Networks (CNNs) have made significant strides in medical\nimage analysis in recent years. However, the local nature of the convolution\noperator may pose a limitation for capturing global and long-range interactions\nin CNNs. Recently, Transformers have gained popularity in the computer vision\ncommunity and also medical image segmentation due to their ability to process\nglobal features effectively. The scalability issues of self-attention mechanism\nand lack of the CNN-like inductive bias may have limited their adoption.\nTherefore, hybrid Vision transformers (CNN-Transformer), exploiting advantages\nof both Convolution and Self-attention Mechanisms, have gained importance. In\nthis work, we present MaxViT-UNet, an Encoder-Decoder based hybrid vision\ntransformer (CNN-Transformer) for medical image segmentation. The proposed\nHybrid Decoder, based on MaxViT-block, is designed to harness the power of both\nthe convolution and self-attention mechanisms at each decoding stage with\nnominal computational burden. The inclusion of multi-axis self-attention,\nwithin each decoder stage, significantly enhances the discriminating capacity\nbetween the object and background regions, and thereby helps in improving the\nsegmentation efficiency. In the Hybrid Decoder block, the fusion process\ncommences by integrating the upsampled lower level decoder features, obtained\nthrough transpose convolution, with the skip-connection features derived from\nthe hybrid encoder. Subsequently, the fused features undergo refinement through\nthe utilization of a multi-axis attention mechanism. 
The proposed decoder block\nis repeated multiple times to progressively segment the nuclei regions.\nExperimental results on MoNuSeg18 and MoNuSAC20 dataset demonstrates the\neffectiveness of the proposed technique.\n","authors":["Abdul Rehman Khan","Asifullah Khan"],"pdf_url":"https://arxiv.org/pdf/2305.08396v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10636v1","updated":"2023-07-20T07:04:16Z","published":"2023-07-20T07:04:16Z","title":"Learning and Evaluating Human Preferences for Conversational Head\n Generation","summary":" A reliable and comprehensive evaluation metric that aligns with manual\npreference assessments is crucial for conversational head video synthesis\nmethod development. Existing quantitative evaluations often fail to capture the\nfull complexity of human preference, as they only consider limited evaluation\ndimensions. Qualitative evaluations and user studies offer a solution but are\ntime-consuming and labor-intensive. This limitation hinders the advancement of\nconversational head generation algorithms and systems. In this paper, we\npropose a novel learning-based evaluation metric named Preference Score (PS)\nfor fitting human preference according to the quantitative evaluations across\ndifferent dimensions. PS can serve as a quantitative evaluation without the\nneed for human annotation. Experimental results validate the superiority of\nPreference Score in aligning with human perception, and also demonstrates\nrobustness and generalizability to unseen data, making it a valuable tool for\nadvancing conversation head generation. We expect this metric could facilitate\nnew advances in conversational head generation.\n","authors":["Mohan Zhou","Yalong Bai","Wei Zhang","Ting Yao","Tiejun Zhao","Tao Mei"],"pdf_url":"https://arxiv.org/pdf/2307.10636v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.12384v2","updated":"2023-07-20T07:04:04Z","published":"2023-03-22T08:47:37Z","title":"RegFormer: An Efficient Projection-Aware Transformer Network for\n Large-Scale Point Cloud Registration","summary":" Although point cloud registration has achieved remarkable advances in\nobject-level and indoor scenes, large-scale registration methods are rarely\nexplored. Challenges mainly arise from the huge point number, complex\ndistribution, and outliers of outdoor LiDAR scans. In addition, most existing\nregistration works generally adopt a two-stage paradigm: They first find\ncorrespondences by extracting discriminative local features, and then leverage\nestimators (eg. RANSAC) to filter outliers, which are highly dependent on\nwell-designed descriptors and post-processing choices. To address these\nproblems, we propose an end-to-end transformer network (RegFormer) for\nlarge-scale point cloud alignment without any further post-processing.\nSpecifically, a projection-aware hierarchical transformer is proposed to\ncapture long-range dependencies and filter outliers by extracting point\nfeatures globally. Our transformer has linear complexity, which guarantees high\nefficiency even for large-scale scenes. Furthermore, to effectively reduce\nmismatches, a bijective association transformer is designed for regressing the\ninitial transformation. 
Extensive experiments on the KITTI and NuScenes datasets\ndemonstrate that our RegFormer achieves competitive performance in terms of\nboth accuracy and efficiency.\n","authors":["Jiuming Liu","Guangming Wang","Zhe Liu","Chaokang Jiang","Marc Pollefeys","Hesheng Wang"],"pdf_url":"https://arxiv.org/pdf/2303.12384v2.pdf","comment":"Accepted by ICCV2023. Codes will be released at\n https://github.com/IRMVLab/RegFormer"},{"id":"http://arxiv.org/abs/2307.10632v1","updated":"2023-07-20T06:58:11Z","published":"2023-07-20T06:58:11Z","title":"Parallelization of a new embedded application for automatic meteor\n detection","summary":" This article presents the methods used to parallelize a new computer vision\napplication. The system is able to automatically detect meteors from\nnon-stabilized cameras and noisy video sequences. The application is designed\nto be embedded in weather balloons or for airborne observation campaigns. Thus,\nthe final target is a low-power system-on-chip (< 10 Watts) while the software\nneeds to compute a stream of frames in real-time (> 25 frames per second). For\nthis, the application is first split into a task graph, and then different\nparallelization techniques are applied. Experimental results demonstrate the\nefficiency of the parallelization methods. For instance, on the Raspberry Pi 4\nand on an HD video sequence, the processing chain reaches 42 frames per second\nwhile consuming only 6 Watts.\n","authors":["Mathuran Kandeepan","Clara Ciocan","Adrien Cassagne","Lionel Lacassagne"],"pdf_url":"https://arxiv.org/pdf/2307.10632v1.pdf","comment":"in French language, COMPAS 2023 - Conf{\\'e}rence francophone\n d'informatique en Parall{\\'e}lisme, Architecture et Syst{\\`e}me, Jul 2023,\n Annecy (France), France"},{"id":"http://arxiv.org/abs/2307.10625v1","updated":"2023-07-20T06:47:46Z","published":"2023-07-20T06:47:46Z","title":"Learning Discriminative Visual-Text Representation for Polyp\n Re-Identification","summary":" Colonoscopic Polyp Re-Identification aims to match a specific polyp in a\nlarge gallery with different cameras and views, which plays a key role in the\nprevention and treatment of colorectal cancer in computer-aided diagnosis.\nHowever, traditional methods mainly focus on visual representation learning,\nwhile neglecting to explore the potential of semantic features during training,\nwhich may easily lead to poor generalization capability when the pretrained\nmodel is adapted to new scenarios. To relieve this dilemma, we\npropose a simple but effective training method named VT-ReID, which can\nremarkably enrich the representation of polyp videos with the interchange of\nhigh-level semantic information. Moreover, we elaborately design a novel\nclustering mechanism to introduce prior knowledge from textual data, which\nleverages contrastive learning to promote better separation from abundant\nunlabeled text data. To the best of our knowledge, this is the first attempt to\nemploy visual-text features with a clustering mechanism for colonoscopic\npolyp re-identification. 
Empirical results show that our method significantly\noutperforms current state-of-the art methods with a clear margin.\n","authors":["Suncheng Xiang","Cang Liu","Sijia Du","Dahong Qian"],"pdf_url":"https://arxiv.org/pdf/2307.10625v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10624v1","updated":"2023-07-20T06:44:42Z","published":"2023-07-20T06:44:42Z","title":"Joint Skeletal and Semantic Embedding Loss for Micro-gesture\n Classification","summary":" In this paper, we briefly introduce the solution of our team HFUT-VUT for the\nMicros-gesture Classification in the MiGA challenge at IJCAI 2023. The\nmicro-gesture classification task aims at recognizing the action category of a\ngiven video based on the skeleton data. For this task, we propose a\n3D-CNNs-based micro-gesture recognition network, which incorporates a skeletal\nand semantic embedding loss to improve action classification performance.\nFinally, we rank 1st in the Micro-gesture Classification Challenge, surpassing\nthe second-place team in terms of Top-1 accuracy by 1.10%.\n","authors":["Kun Li","Dan Guo","Guoliang Chen","Xinge Peng","Meng Wang"],"pdf_url":"https://arxiv.org/pdf/2307.10624v1.pdf","comment":"1st Place in Micro-gesture Classification sub-challenge in MiGA at\n IJCAI-2023"},{"id":"http://arxiv.org/abs/2211.14085v3","updated":"2023-07-20T06:42:56Z","published":"2022-11-25T13:14:33Z","title":"Positive unlabeled learning with tensor networks","summary":" Positive unlabeled learning is a binary classification problem with positive\nand unlabeled data. It is common in domains where negative labels are costly or\nimpossible to obtain, e.g., medicine and personalized advertising. Most\napproaches to positive unlabeled learning apply to specific data types (e.g.,\nimages, categorical data) and can not generate new positive and negative\nsamples. This work introduces a feature-space distance-based tensor network\napproach to the positive unlabeled learning problem. The presented method is\nnot domain specific and significantly improves the state-of-the-art results on\nthe MNIST image and 15 categorical/mixed datasets. The trained tensor network\nmodel is also a generative model and enables the generation of new positive and\nnegative instances.\n","authors":["Bojan Žunkovič"],"pdf_url":"https://arxiv.org/pdf/2211.14085v3.pdf","comment":"12 pages, 6 figures, 4 tables"},{"id":"http://arxiv.org/abs/2307.10620v1","updated":"2023-07-20T06:37:47Z","published":"2023-07-20T06:37:47Z","title":"Quaternion tensor ring decomposition and application for color image\n inpainting","summary":" In recent years, tensor networks have emerged as powerful tools for solving\nlarge-scale optimization problems. One of the most promising tensor networks is\nthe tensor ring (TR) decomposition, which achieves circular dimensional\npermutation invariance in the model through the utilization of the trace\noperation and equitable treatment of the latent cores. On the other hand, more\nrecently, quaternions have gained significant attention and have been widely\nutilized in color image processing tasks due to their effectiveness in encoding\ncolor pixels. Therefore, in this paper, we propose the quaternion tensor ring\n(QTR) decomposition, which inherits the powerful and generalized representation\nabilities of the TR decomposition while leveraging the advantages of\nquaternions for color pixel representation. 
In addition to providing the\ndefinition of QTR decomposition and an algorithm for learning the QTR format,\nthis paper also proposes a low-rank quaternion tensor completion (LRQTC) model\nand its algorithm for color image inpainting based on the QTR decomposition.\nFinally, extensive experiments on color image inpainting demonstrate that the\nproposed QTLRC method is highly competitive.\n","authors":["Jifei Miao","Kit Ian Kou"],"pdf_url":"https://arxiv.org/pdf/2307.10620v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10616v1","updated":"2023-07-20T06:32:14Z","published":"2023-07-20T06:32:14Z","title":"Heterogeneous Federated Learning: State-of-the-art and Research\n Challenges","summary":" Federated learning (FL) has drawn increasing attention owing to its potential\nuse in large-scale industrial applications. Existing federated learning works\nmainly focus on model homogeneous settings. However, practical federated\nlearning typically faces the heterogeneity of data distributions, model\narchitectures, network environments, and hardware devices among participant\nclients. Heterogeneous Federated Learning (HFL) is much more challenging, and\ncorresponding solutions are diverse and complex. Therefore, a systematic survey\non this topic about the research challenges and state-of-the-art is essential.\nIn this survey, we firstly summarize the various research challenges in HFL\nfrom five aspects: statistical heterogeneity, model heterogeneity,\ncommunication heterogeneity, device heterogeneity, and additional challenges.\nIn addition, recent advances in HFL are reviewed and a new taxonomy of existing\nHFL methods is proposed with an in-depth analysis of their pros and cons. We\nclassify existing methods from three different levels according to the HFL\nprocedure: data-level, model-level, and server-level. Finally, several critical\nand promising future research directions in HFL are discussed, which may\nfacilitate further developments in this field. A periodically updated\ncollection on HFL is available at https://github.com/marswhu/HFL_Survey.\n","authors":["Mang Ye","Xiuwen Fang","Bo Du","Pong C. Yuen","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2307.10616v1.pdf","comment":"42 pages, 11 figures, and 4 tables"},{"id":"http://arxiv.org/abs/2307.10609v1","updated":"2023-07-20T06:07:09Z","published":"2023-07-20T06:07:09Z","title":"Hybrid Feature Embedding For Automatic Building Outline Extraction","summary":" Building outline extracted from high-resolution aerial images can be used in\nvarious application fields such as change detection and disaster assessment.\nHowever, traditional CNN model cannot recognize contours very precisely from\noriginal images. In this paper, we proposed a CNN and Transformer based model\ntogether with active contour model to deal with this problem. We also designed\na triple-branch decoder structure to handle different features generated by\nencoder. 
Experiment results show that our model outperforms other baseline\nmodel on two datasets, achieving 91.1% mIoU on Vaihingen and 83.8% on Bing\nhuts.\n","authors":["Weihang Ran","Wei Yuan","Xiaodan Shi","Zipei Fan","Ryosuke Shibasaki"],"pdf_url":"https://arxiv.org/pdf/2307.10609v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10603v1","updated":"2023-07-20T05:49:21Z","published":"2023-07-20T05:49:21Z","title":"Physics-Driven Turbulence Image Restoration with Stochastic Refinement","summary":" Image distortion by atmospheric turbulence is a stochastic degradation, which\nis a critical problem in long-range optical imaging systems. A number of\nresearch has been conducted during the past decades, including model-based and\nemerging deep-learning solutions with the help of synthetic data. Although fast\nand physics-grounded simulation tools have been introduced to help the\ndeep-learning models adapt to real-world turbulence conditions recently, the\ntraining of such models only relies on the synthetic data and ground truth\npairs. This paper proposes the Physics-integrated Restoration Network (PiRN) to\nbring the physics-based simulator directly into the training process to help\nthe network to disentangle the stochasticity from the degradation and the\nunderlying image. Furthermore, to overcome the ``average effect\" introduced by\ndeterministic models and the domain gap between the synthetic and real-world\ndegradation, we further introduce PiRN with Stochastic Refinement (PiRN-SR) to\nboost its perceptual quality. Overall, our PiRN and PiRN-SR improve the\ngeneralization to real-world unknown turbulence conditions and provide a\nstate-of-the-art restoration in both pixel-wise accuracy and perceptual\nquality. Our codes are available at \\url{https://github.com/VITA-Group/PiRN}.\n","authors":["Ajay Jaiswal","Xingguang Zhang","Stanley H. Chan","Zhangyang Wang"],"pdf_url":"https://arxiv.org/pdf/2307.10603v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.10601v1","updated":"2023-07-20T05:46:32Z","published":"2023-07-20T05:46:32Z","title":"SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and\n Multi-View for 3D Object Retrieval","summary":" To address 3D object retrieval, substantial efforts have been made to\ngenerate highly discriminative descriptors of 3D objects represented by a\nsingle modality, e.g., voxels, point clouds or multi-view images. It is\npromising to leverage the complementary information from multi-modality\nrepresentations of 3D objects to further improve retrieval performance.\nHowever, multi-modality 3D object retrieval is rarely developed and analyzed on\nlarge-scale datasets. In this paper, we propose self-and-cross attention based\naggregation of point cloud and multi-view images (SCA-PVNet) for 3D object\nretrieval. With deep features extracted from point clouds and multi-view\nimages, we design two types of feature aggregation modules, namely the\nIn-Modality Aggregation Module (IMAM) and the Cross-Modality Aggregation Module\n(CMAM), for effective feature fusion. IMAM leverages a self-attention mechanism\nto aggregate multi-view features while CMAM exploits a cross-attention\nmechanism to interact point cloud features with multi-view features. The final\ndescriptor of a 3D object for object retrieval can be obtained via\nconcatenating the aggregated features from both modules. 
Extensive experiments\nand analysis are conducted on three datasets, ranging from small to large\nscale, to show the superiority of the proposed SCA-PVNet over the\nstate-of-the-art methods.\n","authors":["Dongyun Lin","Yi Cheng","Aiyuan Guo","Shangbo Mao","Yiqun Li"],"pdf_url":"https://arxiv.org/pdf/2307.10601v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.01928v3","updated":"2023-07-20T05:21:04Z","published":"2023-01-05T06:32:50Z","title":"Event Camera Data Pre-training","summary":" This paper proposes a pre-trained neural network for handling event camera\ndata. Our model is a self-supervised learning framework, and uses paired event\ncamera data and natural RGB images for training.\n Our method contains three modules connected in a sequence: i) a family of\nevent data augmentations, generating meaningful event images for\nself-supervised training; ii) a conditional masking strategy to sample\ninformative event patches from event images, encouraging our model to capture\nthe spatial layout of a scene and accelerating training; iii) a contrastive\nlearning approach, enforcing the similarity of embeddings between matching\nevent images, and between paired event and RGB images. An embedding projection\nloss is proposed to avoid the model collapse when enforcing the event image\nembedding similarities. A probability distribution alignment loss is proposed\nto encourage the event image to be consistent with its paired RGB image in the\nfeature space.\n Transfer learning performance on downstream tasks shows the superiority of\nour method over state-of-the-art methods. For example, we achieve top-1\naccuracy at 64.83% on the N-ImageNet dataset.\n","authors":["Yan Yang","Liyuan Pan","Liu Liu"],"pdf_url":"https://arxiv.org/pdf/2301.01928v3.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.10593v1","updated":"2023-07-20T05:15:03Z","published":"2023-07-20T05:15:03Z","title":"Event Blob Tracking: An Asynchronous Real-Time Algorithm","summary":" Event-based cameras have become increasingly popular for tracking fast-moving\nobjects due to their high temporal resolution, low latency, and high dynamic\nrange. In this paper, we propose a novel algorithm for tracking event blobs\nusing raw events asynchronously in real time. We introduce the concept of an\nevent blob as a spatio-temporal likelihood of event occurrence where the\nconditional spatial likelihood is blob-like. Many real-world objects generate\nevent blob data, for example, flickering LEDs such as car headlights or any\nsmall foreground object moving against a static or slowly varying background.\nThe proposed algorithm uses a nearest neighbour classifier with a dynamic\nthreshold criteria for data association coupled with a Kalman filter to track\nthe event blob state. Our algorithm achieves highly accurate tracking and event\nblob shape estimation even under challenging lighting conditions and high-speed\nmotions. 
The microsecond time resolution achieved means that the filter output\ncan be used to derive secondary information such as time-to-contact or range\nestimation, that will enable applications to real-world problems such as\ncollision avoidance in autonomous driving.\n","authors":["Ziwei Wang","Timothy Molloy","Pieter van Goor","Robert Mahony"],"pdf_url":"https://arxiv.org/pdf/2307.10593v1.pdf","comment":"17 pages, 8 figures, preprint version"},{"id":"http://arxiv.org/abs/2210.06551v4","updated":"2023-07-20T04:59:45Z","published":"2022-10-12T19:46:25Z","title":"MotionBERT: A Unified Perspective on Learning Human Motion\n Representations","summary":" We present a unified perspective on tackling various human-centric video\ntasks by learning human motion representations from large-scale and\nheterogeneous data resources. Specifically, we propose a pretraining stage in\nwhich a motion encoder is trained to recover the underlying 3D motion from\nnoisy partial 2D observations. The motion representations acquired in this way\nincorporate geometric, kinematic, and physical knowledge about human motion,\nwhich can be easily transferred to multiple downstream tasks. We implement the\nmotion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer)\nneural network. It could capture long-range spatio-temporal relationships among\nthe skeletal joints comprehensively and adaptively, exemplified by the lowest\n3D pose estimation error so far when trained from scratch. Furthermore, our\nproposed framework achieves state-of-the-art performance on all three\ndownstream tasks by simply finetuning the pretrained motion encoder with a\nsimple regression head (1-2 layers), which demonstrates the versatility of the\nlearned motion representations. Code and models are available at\nhttps://motionbert.github.io/\n","authors":["Wentao Zhu","Xiaoxuan Ma","Zhaoyang Liu","Libin Liu","Wayne Wu","Yizhou Wang"],"pdf_url":"https://arxiv.org/pdf/2210.06551v4.pdf","comment":"ICCV 2023 version"},{"id":"http://arxiv.org/abs/2307.10584v1","updated":"2023-07-20T04:51:10Z","published":"2023-07-20T04:51:10Z","title":"Reference-based Painterly Inpainting via Diffusion: Crossing the Wild\n Reference Domain Gap","summary":" Have you ever imagined how it would look if we placed new objects into\npaintings? For example, what would it look like if we placed a basketball into\nClaude Monet's ``Water Lilies, Evening Effect''? We propose Reference-based\nPainterly Inpainting, a novel task that crosses the wild reference domain gap\nand implants novel objects into artworks. Although previous works have examined\nreference-based inpainting, they are not designed for large domain\ndiscrepancies between the target and the reference, such as inpainting an\nartistic image using a photorealistic reference. This paper proposes a novel\ndiffusion framework, dubbed RefPaint, to ``inpaint more wildly'' by taking such\nreferences with large domain gaps. Built with an image-conditioned diffusion\nmodel, we introduce a ladder-side branch and a masked fusion mechanism to work\nwith the inpainting mask. By decomposing the CLIP image embeddings at inference\ntime, one can manipulate the strength of semantic and style information with\nease. Experiments demonstrate that our proposed RefPaint framework produces\nsignificantly better results than existing methods. Our method enables creative\npainterly image inpainting with reference objects that would otherwise be\ndifficult to achieve. 
Project page: https://vita-group.github.io/RefPaint/\n","authors":["Dejia Xu","Xingqian Xu","Wenyan Cong","Humphrey Shi","Zhangyang Wang"],"pdf_url":"https://arxiv.org/pdf/2307.10584v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10577v1","updated":"2023-07-20T04:41:39Z","published":"2023-07-20T04:41:39Z","title":"Ethosight: A Joint-Embedding Based System for Nuanced Perception Using\n Contextual Label Affinity Metric and Reasoning Based Iterative Learning","summary":" Traditional computer vision models often require extensive manual effort for\ndata acquisition and validation, particularly when detecting subtle behavioral\nnuances or events. The difficulty in distinguishing routine behaviors from\npotential risks in real-world applications, like differentiating routine\nshopping from potential shoplifting, further complicates the process.\n We present Ethosight, a novel zero-shot computer vision algorithm. Ethosight\neradicates the need for pre-existing symbolic knowledge, initiating from a\nclean slate based on user requirements and semantic knowledge of interest.\nUsing localized label affinity calculations and a reasoning-guided iterative\nlearning loop, Ethosight infers scene details and iteratively refines the label\nset. Reasoning mechanisms can be derived from large language models like GPT4,\nsymbolic reasoners like OpenNARS, or hybrid systems.\n Ethosight further capitalizes on the capabilities of a pre-trained\nmulti-modal model, ImageBind, generating accurate semantic knowledge of images\nwithin a few cycles. It successfully captures both explicit and nuanced\nelements efficiently. We also introduce the implementation of Korzybski's\n\"time-binding\" concept in machines, which allows for generational learning and\nknowledge sharing across deployments.\n Our evaluations demonstrate Ethosight's efficacy across 40 complex use cases.\nIt has exhibited an exceptional ability to discern new areas of interest,\nconsistently generating high-affinity scores within the top five labels from a\nset of a thousand. Tests conducted across diverse environments attest to\nEthosight's robust performance. Detailed results and case studies within the\nmain body of this paper and an appendix underscore a promising trajectory\ntowards enhancing the adaptability and resilience of computer vision models in\ndetecting and extracting subtle and nuanced behaviors.\n","authors":["Hugo Latapie","Kristinn R. Thorisson","Shan Yu","Vahagn Petrosyan","Patrick Hammer","Pei Wang","Brandon Kynoch","Hanning Chen","Tangrui Li"],"pdf_url":"https://arxiv.org/pdf/2307.10577v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10575v1","updated":"2023-07-20T04:35:50Z","published":"2023-07-20T04:35:50Z","title":"Boosting Federated Learning Convergence with Prototype Regularization","summary":" As a distributed machine learning technique, federated learning (FL) requires\nclients to collaboratively train a shared model with an edge server without\nleaking their local data. However, the heterogeneous data distribution among\nclients often leads to a decrease in model performance. To tackle this issue,\nthis paper introduces a prototype-based regularization strategy to address the\nheterogeneity in the data distribution. Specifically, the regularization\nprocess involves the server aggregating local prototypes from distributed\nclients to generate a global prototype, which is then sent back to the\nindividual clients to guide their local training. 
The experimental results on\nMNIST and Fashion-MNIST show that our proposal achieves improvements of 3.3%\nand 8.9% in average test accuracy, respectively, compared to the most popular\nbaseline FedAvg. Furthermore, our approach has a fast convergence rate in\nheterogeneous settings.\n","authors":["Yu Qiao","Huy Q. Le","Choong Seon Hong"],"pdf_url":"https://arxiv.org/pdf/2307.10575v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.04247v2","updated":"2023-07-20T04:28:36Z","published":"2023-05-07T11:18:39Z","title":"Estimation of control area in badminton doubles with pose information\n from top and back view drone videos","summary":" The application of visual tracking to the performance analysis of sports\nplayers in dynamic competitions is vital for effective coaching. In doubles\nmatches, coordinated positioning is crucial for maintaining control of the\ncourt and minimizing opponents' scoring opportunities. The analysis of such\nteamwork plays a vital role in understanding the dynamics of the game. However,\nprevious studies have primarily focused on analyzing and assessing singles\nplayers without considering occlusion in broadcast videos. These studies have\nrelied on discrete representations, which involve the analysis and\nrepresentation of specific actions (e.g., strokes) or events that occur during\nthe game while overlooking the meaningful spatial distribution. In this work,\nwe present the first annotated drone dataset from top and back views in\nbadminton doubles and propose a framework to estimate the control area\nprobability map, which can be used to evaluate teamwork performance. We present\nan efficient framework of deep neural networks that enables the calculation of\nfull probability surfaces. This framework utilizes the embedding of a Gaussian\nmixture map of players' positions and employs graph convolution on their poses.\nIn the experiment, we verify our approach by comparing various baselines and\ndiscovering the correlations between the score and control area. Additionally,\nwe propose a practical application for assessing optimal positioning to provide\ninstructions during a game. Our approach offers both visual and quantitative\nevaluations of players' movements, thereby providing valuable insights into\ndoubles teamwork. The dataset and related project code is available at\nhttps://github.com/Ning-D/Drone_BD_ControlArea\n","authors":["Ning Ding","Kazuya Takeda","Wenhui Jin","Yingjiu Bei","Keisuke Fujii"],"pdf_url":"https://arxiv.org/pdf/2305.04247v2.pdf","comment":"15 pages, 10 figures, to appear in Multimedia Tools and Applications"},{"id":"http://arxiv.org/abs/2307.10036v2","updated":"2023-07-20T04:26:46Z","published":"2023-07-19T15:19:02Z","title":"Class Attention to Regions of Lesion for Imbalanced Medical Image\n Recognition","summary":" Automated medical image classification is the key component in intelligent\ndiagnosis systems. However, most medical image datasets contain plenty of\nsamples of common diseases and just a handful of rare ones, leading to major\nclass imbalances. Currently, it is an open problem in intelligent diagnosis to\neffectively learn from imbalanced training data. In this paper, we propose a\nsimple yet effective framework, named \\textbf{C}lass \\textbf{A}ttention to\n\\textbf{RE}gions of the lesion (CARE), to handle data imbalance issues by\nembedding attention into the training process of \\textbf{C}onvolutional\n\\textbf{N}eural \\textbf{N}etworks (CNNs). 
The proposed attention module helps\nCNNs attend to lesion regions of rare diseases, therefore helping CNNs to learn\ntheir characteristics more effectively. In addition, this attention module\nworks only during the training phase and does not change the architecture of\nthe original network, so it can be directly combined with any existing CNN\narchitecture. The CARE framework needs bounding boxes to represent the lesion\nregions of rare diseases. To alleviate the need for manual annotation, we\nfurther developed variants of CARE by leveraging the traditional saliency\nmethods or a pretrained segmentation model for bounding box generation. Results\nshow that the CARE variants with automated bounding box generation are\ncomparable to the original CARE framework with \\textit{manual} bounding box\nannotations. A series of experiments on an imbalanced skin image dataset and a\npneumonia dataset indicates that our method can effectively help the network\nfocus on the lesion regions of rare diseases and remarkably improves the\nclassification performance of rare diseases.\n","authors":["Jia-Xin Zhuang","Jiabin Cai","Jianguo Zhang","Wei-shi Zheng","Ruixuan Wang"],"pdf_url":"https://arxiv.org/pdf/2307.10036v2.pdf","comment":"Accepted by Neurocomputing on July 2023. 37 pages"},{"id":"http://arxiv.org/abs/2307.09724v2","updated":"2023-07-20T04:14:01Z","published":"2023-07-19T02:26:20Z","title":"AesPA-Net: Aesthetic Pattern-Aware Style Transfer Networks","summary":" To deliver the artistic expression of the target style, recent studies\nexploit the attention mechanism owing to its ability to map the local patches\nof the style image to the corresponding patches of the content image. However,\nbecause of the low semantic correspondence between arbitrary content and\nartworks, the attention module repeatedly abuses specific local patches from\nthe style image, resulting in disharmonious and evident repetitive artifacts.\nTo overcome this limitation and accomplish impeccable artistic style transfer,\nwe focus on enhancing the attention mechanism and capturing the rhythm of\npatterns that organize the style. In this paper, we introduce a novel metric,\nnamely pattern repeatability, that quantifies the repetition of patterns in the\nstyle image. Based on the pattern repeatability, we propose Aesthetic\nPattern-Aware style transfer Networks (AesPA-Net) that discover the sweet spot\nof local and global style expressions. In addition, we propose a novel\nself-supervisory task to encourage the attention mechanism to learn precise and\nmeaningful semantic correspondence. Lastly, we introduce the patch-wise style\nloss to transfer the elaborate rhythm of local patterns. Through qualitative\nand quantitative evaluations, we verify the reliability of the proposed pattern\nrepeatability that aligns with human perception, and demonstrate the\nsuperiority of the proposed framework.\n","authors":["Kibeom Hong","Seogkyu Jeon","Junsoo Lee","Namhyuk Ahn","Kunhee Kim","Pilhyeon Lee","Daesik Kim","Youngjung Uh","Hyeran Byun"],"pdf_url":"https://arxiv.org/pdf/2307.09724v2.pdf","comment":"Accepted by ICCV 2023. Code is available at this\n https://github.com/Kibeom-Hong/AesPA-Net"},{"id":"http://arxiv.org/abs/2307.10567v1","updated":"2023-07-20T04:12:10Z","published":"2023-07-20T04:12:10Z","title":"No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention\n and Zoom-in Boundary Detection","summary":" Temporal video grounding (TVG) aims to retrieve the time interval of a\nlanguage query from an untrimmed video. 
A significant challenge in TVG is the\nlow \"Semantic Noise Ratio (SNR)\", which results in worse performance with lower\nSNR. Prior works have addressed this challenge using sophisticated techniques.\nIn this paper, we propose a no-frills TVG model that consists of two core\nmodules, namely multi-scale neighboring attention and zoom-in boundary\ndetection. The multi-scale neighboring attention restricts each video token to\nonly aggregate visual contexts from its neighbor, enabling the extraction of\nthe most distinguishing information with multi-scale feature hierarchies from\nhigh-ratio noises. The zoom-in boundary detection then focuses on local-wise\ndiscrimination of the selected top candidates for fine-grained grounding\nadjustment. With an end-to-end training strategy, our model achieves\ncompetitive performance on different TVG benchmarks, while also having the\nadvantage of faster inference speed and lighter model parameters, thanks to its\nlightweight architecture.\n","authors":["Qi Zhang","Sipeng Zheng","Qin Jin"],"pdf_url":"https://arxiv.org/pdf/2307.10567v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.14795v2","updated":"2023-07-20T03:39:19Z","published":"2023-06-26T15:53:02Z","title":"MotionGPT: Human Motion as a Foreign Language","summary":" Though the advancement of pre-trained large language models unfolds, the\nexploration of building a unified model for language and other multi-modal\ndata, such as motion, remains challenging and untouched so far. Fortunately,\nhuman motion displays a semantic coupling akin to human language, often\nperceived as a form of body language. By fusing language data with large-scale\nmotion models, motion-language pre-training that can enhance the performance of\nmotion-related tasks becomes feasible. Driven by this insight, we propose\nMotionGPT, a unified, versatile, and user-friendly motion-language model to\nhandle multiple motion-relevant tasks. Specifically, we employ the discrete\nvector quantization for human motion and transfer 3D motion into motion tokens,\nsimilar to the generation process of word tokens. Building upon this \"motion\nvocabulary\", we perform language modeling on both motion and text in a unified\nmanner, treating human motion as a specific language. Moreover, inspired by\nprompt learning, we pre-train MotionGPT with a mixture of motion-language data\nand fine-tune it on prompt-based question-and-answer tasks. Extensive\nexperiments demonstrate that MotionGPT achieves state-of-the-art performances\non multiple motion tasks including text-driven motion generation, motion\ncaptioning, motion prediction, and motion in-between.\n","authors":["Biao Jiang","Xin Chen","Wen Liu","Jingyi Yu","Gang Yu","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2306.14795v2.pdf","comment":"Project Page: https://github.com/OpenMotionLab/MotionGPT"},{"id":"http://arxiv.org/abs/2307.10554v1","updated":"2023-07-20T03:36:13Z","published":"2023-07-20T03:36:13Z","title":"EMQ: Evolving Training-free Proxies for Automated Mixed Precision\n Quantization","summary":" Mixed-Precision Quantization~(MQ) can achieve a competitive\naccuracy-complexity trade-off for models. Conventional training-based search\nmethods require time-consuming candidate training to search optimized per-layer\nbit-width configurations in MQ. Recently, some training-free approaches have\npresented various MQ proxies and significantly improve search efficiency.\nHowever, the correlation between these proxies and quantization accuracy is\npoorly understood. 
To address the gap, we first build the MQ-Bench-101, which\ninvolves different bit configurations and quantization results. Then, we\nobserve that the existing training-free proxies exhibit weak correlations on\nthe MQ-Bench-101. To efficiently seek superior proxies, we develop an automatic\nsearch of proxies framework for MQ via evolving algorithms. In particular, we\ndevise an elaborate search space involving the existing proxies and perform an\nevolution search to discover the best correlated MQ proxy. We propose a\ndiversity-prompting selection strategy and compatibility screening protocol to\navoid premature convergence and improve search efficiency. In this way, our\nEvolving proxies for Mixed-precision Quantization~(EMQ) framework allows the\nauto-generation of proxies without heavy tuning and expert knowledge. Extensive\nexperiments on ImageNet with various ResNet and MobileNet families demonstrate\nthat our EMQ obtains superior performance compared to state-of-the-art mixed-precision\nmethods at a significantly reduced cost. The code will be released.\n","authors":["Peijie Dong","Lujun Li","Zimian Wei","Xin Niu","Zhiliang Tian","Hengyue Pan"],"pdf_url":"https://arxiv.org/pdf/2307.10554v1.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2307.10549v1","updated":"2023-07-20T03:26:57Z","published":"2023-07-20T03:26:57Z","title":"Dynamic Large Language Models on Blockchains","summary":" Training and deploying large language models requires a large amount of\ncomputational resources because the language models contain billions of\nparameters and the text has thousands of tokens. Another problem is that the\nlarge language models are static. They are fixed after the training process. To\ntackle these issues, in this paper, we propose to train and deploy the dynamic\nlarge language model on blockchains, which have high computation performance\nand are distributed across a network of computers. A blockchain is a secure,\ndecentralized, and transparent system that allows for the creation of a\ntamper-proof ledger for transactions without the need for intermediaries. The\ndynamic large language models can continuously learn from the user input after\nthe training process. Our method provides a new way to develop the large\nlanguage models and also sheds light on next-generation artificial\nintelligence systems.\n","authors":["Yuanhao Gong"],"pdf_url":"https://arxiv.org/pdf/2307.10549v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.04550v5","updated":"2023-07-20T02:00:22Z","published":"2021-06-08T17:39:14Z","title":"DETReg: Unsupervised Pretraining with Region Priors for Object Detection","summary":" Recent self-supervised pretraining methods for object detection largely focus\non pretraining the backbone of the object detector, neglecting key parts of\ndetection architecture. Instead, we introduce DETReg, a new self-supervised\nmethod that pretrains the entire object detection network, including the object\nlocalization and embedding components. During pretraining, DETReg predicts\nobject localizations to match the localizations from an unsupervised region\nproposal generator and simultaneously aligns the corresponding feature\nembeddings with embeddings from a self-supervised image encoder. We implement\nDETReg using the DETR family of detectors and show that it improves over\ncompetitive baselines when finetuned on COCO, PASCAL VOC, and Airbus Ship\nbenchmarks. 
In low-data regimes DETReg achieves improved performance, e.g.,\nwhen training with only 1% of the labels and in the few-shot learning settings.\n","authors":["Amir Bar","Xin Wang","Vadim Kantorov","Colorado J Reed","Roei Herzig","Gal Chechik","Anna Rohrbach","Trevor Darrell","Amir Globerson"],"pdf_url":"https://arxiv.org/pdf/2106.04550v5.pdf","comment":"Project page: https://www.amirbar.net/detreg/"},{"id":"http://arxiv.org/abs/2307.10518v1","updated":"2023-07-20T01:37:32Z","published":"2023-07-20T01:37:32Z","title":"Interactive Segmentation for Diverse Gesture Types Without Context","summary":" Interactive segmentation entails a human marking an image to guide how a\nmodel either creates or edits a segmentation. Our work addresses limitations of\nexisting methods: they either only support one gesture type for marking an\nimage (e.g., either clicks or scribbles) or require knowledge of the gesture\ntype being employed, and require specifying whether marked regions should be\nincluded versus excluded in the final segmentation. We instead propose a\nsimplified interactive segmentation task where a user only must mark an image,\nwhere the input can be of any gesture type without specifying the gesture type.\nWe support this new task by introducing the first interactive segmentation\ndataset with multiple gesture types as well as a new evaluation metric capable\nof holistically evaluating interactive segmentation algorithms. We then analyze\nnumerous interactive segmentation algorithms, including ones adapted for our\nnovel task. While we observe promising performance overall, we also highlight\nareas for future improvement. To facilitate further extensions of this work, we\npublicly share our new dataset at https://github.com/joshmyersdean/dig.\n","authors":["Josh Myers-Dean","Yifei Fan","Brian Price","Wilson Chan","Danna Gurari"],"pdf_url":"https://arxiv.org/pdf/2307.10518v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08015v3","updated":"2023-07-20T01:11:21Z","published":"2023-07-16T11:52:27Z","title":"Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via\n Geometry-Guided Cross-View Transformer","summary":" Image retrieval-based cross-view localization methods often lead to very\ncoarse camera pose estimation, due to the limited sampling density of the\ndatabase satellite images. In this paper, we propose a method to increase the\naccuracy of a ground camera's location and orientation by estimating the\nrelative rotation and translation between the ground-level image and its\nmatched/retrieved satellite image. Our approach designs a geometry-guided\ncross-view transformer that combines the benefits of conventional geometry and\nlearnable cross-view transformers to map the ground-view observations to an\noverhead view. Given the synthesized overhead view and observed satellite\nfeature maps, we construct a neural pose optimizer with strong global\ninformation embedding ability to estimate the relative rotation between them.\nAfter aligning their rotations, we develop an uncertainty-guided spatial\ncorrelation to generate a probability map of the vehicle locations, from which\nthe relative translation can be determined. Experimental results demonstrate\nthat our method significantly outperforms the state-of-the-art. 
Notably, the\nlikelihood of restricting the vehicle lateral pose to be within 1m of its\nGround Truth (GT) value on the cross-view KITTI dataset has been improved from\n$35.54\\%$ to $76.44\\%$, and the likelihood of restricting the vehicle\norientation to be within $1^{\\circ}$ of its GT value has been improved from\n$19.64\\%$ to $99.10\\%$.\n","authors":["Yujiao Shi","Fei Wu","Akhil Perincherry","Ankit Vora","Hongdong Li"],"pdf_url":"https://arxiv.org/pdf/2307.08015v3.pdf","comment":"Accepted to ICCV 2023"},{"id":"http://arxiv.org/abs/2106.04066v6","updated":"2023-07-20T00:24:58Z","published":"2021-06-08T02:51:33Z","title":"Semantically Adversarial Scenario Generation with Explicit Knowledge\n Guidance","summary":" Generating adversarial scenarios, which have the potential to fail autonomous\ndriving systems, provides an effective way to improve robustness. Extending\npurely data-driven generative models, recent specialized models satisfy\nadditional controllable requirements such as embedding a traffic sign in a\ndriving scene by manipulating patterns implicitly in the neuron level. In this\npaper, we introduce a method to incorporate domain knowledge explicitly in the\ngeneration process to achieve the Semantically Adversarial Generation (SAG). To\nbe consistent with the composition of driving scenes, we first categorize the\nknowledge into two types, the property of objects and the relationship among\nobjects. We then propose a tree-structured variational auto-encoder (T-VAE) to\nlearn hierarchical scene representation. By imposing semantic rules on the\nproperties of nodes and edges in the tree structure, explicit knowledge\nintegration enables controllable generation. We construct a synthetic example\nto illustrate the controllability and explainability of our method in a\nsuccinct setting. We further extend to realistic environments for autonomous\nvehicles: our method efficiently identifies adversarial driving scenes against\ndifferent state-of-the-art 3D point cloud segmentation models and satisfies the\ntraffic rules specified as the explicit knowledge.\n","authors":["Wenhao Ding","Haohong Lin","Bo Li","Ding Zhao"],"pdf_url":"https://arxiv.org/pdf/2106.04066v6.pdf","comment":"20 pages, 13 figures"},{"id":"http://arxiv.org/abs/2307.10507v1","updated":"2023-07-20T00:07:29Z","published":"2023-07-20T00:07:29Z","title":"FedSoup: Improving Generalization and Personalization in Federated\n Learning via Selective Model Interpolation","summary":" Cross-silo federated learning (FL) enables the development of machine\nlearning models on datasets distributed across data centers such as hospitals\nand clinical research laboratories. However, recent research has found that\ncurrent FL algorithms face a trade-off between local and global performance\nwhen confronted with distribution shifts. Specifically, personalized FL methods\nhave a tendency to overfit to local data, leading to a sharp valley in the\nlocal model and inhibiting its ability to generalize to out-of-distribution\ndata. In this paper, we propose a novel federated model soup method (i.e.,\nselective interpolation of model parameters) to optimize the trade-off between\nlocal and global performance. Specifically, during the federated training\nphase, each client maintains its own global model pool by monitoring the\nperformance of the interpolated model between the local and global models. This\nallows us to alleviate overfitting and seek flat minima, which can\nsignificantly improve the model's generalization performance. 
We evaluate our\nmethod on retinal and pathological image classification tasks, and our proposed\nmethod achieves significant improvements for out-of-distribution\ngeneralization. Our code is available at https://github.com/ubc-tea/FedSoup.\n","authors":["Minghui Chen","Meirui Jiang","Qi Dou","Zehua Wang","Xiaoxiao Li"],"pdf_url":"https://arxiv.org/pdf/2307.10507v1.pdf","comment":"Accepted by MICCAI2023"},{"id":"http://arxiv.org/abs/2307.10506v1","updated":"2023-07-20T00:06:46Z","published":"2023-07-20T00:06:46Z","title":"Is Grad-CAM Explainable in Medical Images?","summary":" Explainable Deep Learning has gained significant attention in the field of\nartificial intelligence (AI), particularly in domains such as medical imaging,\nwhere accurate and interpretable machine learning models are crucial for\neffective diagnosis and treatment planning. Grad-CAM is a baseline that\nhighlights the most critical regions of an image used in a deep learning\nmodel's decision-making process, increasing interpretability and trust in the\nresults. It is applied in many computer vision (CV) tasks such as\nclassification and explanation. This study explores the principles of\nExplainable Deep Learning and its relevance to medical imaging, discusses\nvarious explainability techniques and their limitations, and examines medical\nimaging applications of Grad-CAM. The findings highlight the potential of\nExplainable Deep Learning and Grad-CAM in improving the accuracy and\ninterpretability of deep learning models in medical imaging. The code is\navailable in (will be available).\n","authors":["Subhashis Suara","Aayush Jha","Pratik Sinha","Arif Ahmed Sekh"],"pdf_url":"https://arxiv.org/pdf/2307.10506v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10504v1","updated":"2023-07-20T00:02:24Z","published":"2023-07-20T00:02:24Z","title":"Identifying Interpretable Subspaces in Image Representations","summary":" We propose Automatic Feature Explanation using Contrasting Concepts (FALCON),\nan interpretability framework to explain features of image representations. For\na target feature, FALCON captions its highly activating cropped images using a\nlarge captioning dataset (like LAION-400m) and a pre-trained vision-language\nmodel like CLIP. Each word among the captions is scored and ranked leading to a\nsmall number of shared, human-understandable concepts that closely describe the\ntarget feature. FALCON also applies contrastive interpretation using lowly\nactivating (counterfactual) images, to eliminate spurious concepts. Although\nmany existing approaches interpret features independently, we observe in\nstate-of-the-art self-supervised and supervised models, that less than 20% of\nthe representation space can be explained by individual features. We show that\nfeatures in larger spaces become more interpretable when studied in groups and\ncan be explained with high-order scoring concepts through FALCON. We discuss\nhow extracted concepts can be used to explain and debug failures in downstream\ntasks. 
Finally, we present a technique to transfer concepts from one\n(explainable) representation space to another unseen representation space by\nlearning a simple linear transformation.\n","authors":["Neha Kalibhat","Shweta Bhardwaj","Bayan Bruss","Hamed Firooz","Maziar Sanjabi","Soheil Feizi"],"pdf_url":"https://arxiv.org/pdf/2307.10504v1.pdf","comment":"Published at ICML 2023"},{"id":"http://arxiv.org/abs/2307.11081v1","updated":"2023-07-20T17:57:04Z","published":"2023-07-20T17:57:04Z","title":"GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition\n in Surgical Videos","summary":" Automated surgical step recognition is an important task that can\nsignificantly improve patient safety and decision-making during surgeries.\nExisting state-of-the-art methods for surgical step recognition either rely on\nseparate, multi-stage modeling of spatial and temporal information or operate\non short-range temporal resolution when learned jointly. However, the benefits\nof joint modeling of spatio-temporal features and long-range information are\nnot taken in account. In this paper, we propose a vision transformer-based\napproach to jointly learn spatio-temporal features directly from sequence of\nframe-level patches. Our method incorporates a gated-temporal attention\nmechanism that intelligently combines short-term and long-term spatio-temporal\nfeature representations. We extensively evaluate our approach on two cataract\nsurgery video datasets, namely Cataract-101 and D99, and demonstrate superior\nperformance compared to various state-of-the-art methods. These results\nvalidate the suitability of our proposed approach for automated surgical step\nrecognition. Our code is released at:\nhttps://github.com/nisargshah1999/GLSFormer\n","authors":["Nisarg A. Shah","Shameema Sikder","S. Swaroop Vedula","Vishal M. Patel"],"pdf_url":"https://arxiv.org/pdf/2307.11081v1.pdf","comment":"Accepted to MICCAI 2023 (Early Accept)"},{"id":"http://arxiv.org/abs/2307.11261v1","updated":"2023-07-20T22:41:23Z","published":"2023-07-20T22:41:23Z","title":"SimCol3D -- 3D Reconstruction during Colonoscopy Challenge","summary":" Colorectal cancer is one of the most common cancers in the world. While\ncolonoscopy is an effective screening technique, navigating an endoscope\nthrough the colon to detect polyps is challenging. A 3D map of the observed\nsurfaces could enhance the identification of unscreened colon tissue and serve\nas a training platform. However, reconstructing the colon from video footage\nremains unsolved due to numerous factors such as self-occlusion, reflective\nsurfaces, lack of texture, and tissue deformation that limit feature-based\nmethods. Learning-based approaches hold promise as robust alternatives, but\nnecessitate extensive datasets. By establishing a benchmark, the 2022 EndoVis\nsub-challenge SimCol3D aimed to facilitate data-driven depth and pose\nprediction during colonoscopy. The challenge was hosted as part of MICCAI 2022\nin Singapore. Six teams from around the world and representatives from academia\nand industry participated in the three sub-challenges: synthetic depth\nprediction, synthetic pose prediction, and real pose prediction. This paper\ndescribes the challenge, the submitted methods, and their results. We show that\ndepth prediction in virtual colonoscopy is robustly solvable, while pose\nestimation remains an open research question.\n","authors":["Anita Rau","Sophia Bano","Yueming Jin","Pablo Azagra","Javier Morlana","Edward Sanderson","Bogdan J. 
Matuszewski","Jae Young Lee","Dong-Jae Lee","Erez Posner","Netanel Frank","Varshini Elangovan","Sista Raviteja","Zhengwen Li","Jiquan Liu","Seenivasan Lalithkumar","Mobarakol Islam","Hongliang Ren","José M. M. Montiel","Danail Stoyanov"],"pdf_url":"https://arxiv.org/pdf/2307.11261v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11259v1","updated":"2023-07-20T22:35:27Z","published":"2023-07-20T22:35:27Z","title":"Towards Non-Parametric Models for Confidence Aware Image Prediction from\n Low Data using Gaussian Processes","summary":" The ability to envision future states is crucial to informed decision making\nwhile interacting with dynamic environments. With cameras providing a prevalent\nand information rich sensing modality, the problem of predicting future states\nfrom image sequences has garnered a lot of attention. Current state of the art\nmethods typically train large parametric models for their predictions. Though\noften able to predict with accuracy, these models rely on the availability of\nlarge training datasets to converge to useful solutions. In this paper we focus\non the problem of predicting future images of an image sequence from very\nlittle training data. To approach this problem, we use non-parametric models to\ntake a probabilistic approach to image prediction. We generate probability\ndistributions over sequentially predicted images and propagate uncertainty\nthrough time to generate a confidence metric for our predictions. Gaussian\nProcesses are used for their data efficiency and ability to readily incorporate\nnew training data online. We showcase our method by successfully predicting\nfuture frames of a smooth fluid simulation environment.\n","authors":["Nikhil U. Shinde","Florian Richter","Michael C. Yip"],"pdf_url":"https://arxiv.org/pdf/2307.11259v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11253v1","updated":"2023-07-20T22:09:04Z","published":"2023-07-20T22:09:04Z","title":"Joint one-sided synthetic unpaired image translation and segmentation\n for colorectal cancer prevention","summary":" Deep learning has shown excellent performance in analysing medical images.\nHowever, datasets are difficult to obtain due privacy issues, standardization\nproblems, and lack of annotations. We address these problems by producing\nrealistic synthetic images using a combination of 3D technologies and\ngenerative adversarial networks. We propose CUT-seg, a joint training where a\nsegmentation model and a generative model are jointly trained to produce\nrealistic images while learning to segment polyps. We take advantage of recent\none-sided translation models because they use significantly less memory,\nallowing us to add a segmentation model in the training loop. CUT-seg performs\nbetter, is computationally less expensive, and requires less real images than\nother memory-intensive image translation approaches that require two stage\ntraining. Promising results are achieved on five real polyp segmentation\ndatasets using only one real image and zero real annotations. As a part of this\nstudy we release Synth-Colon, an entirely synthetic dataset that includes 20000\nrealistic colon images and additional details about depth and 3D geometry:\nhttps://enric1994.github.io/synth-colon\n","authors":["Enric Moreu","Eric Arazo","Kevin McGuinness","Noel E. 
O'Connor"],"pdf_url":"https://arxiv.org/pdf/2307.11253v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2202.08680"},{"id":"http://arxiv.org/abs/2307.11227v1","updated":"2023-07-20T20:45:13Z","published":"2023-07-20T20:45:13Z","title":"UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with\n Vision-Language Models","summary":" In this study, we investigate the task of data pre-selection, which aims to\nselect instances for labeling from an unlabeled dataset through a single pass,\nthereby optimizing performance for undefined downstream tasks with a limited\nannotation budget. Previous approaches to data pre-selection relied solely on\nvisual features extracted from foundation models, such as CLIP and BLIP-2, but\nlargely ignored the powerfulness of text features. In this work, we argue that,\nwith proper design, the joint feature space of both vision and text can yield a\nbetter representation for data pre-selection. To this end, we introduce UP-DP,\na simple yet effective unsupervised prompt learning approach that adapts\nvision-language models, like BLIP-2, for data pre-selection. Specifically, with\nthe BLIP-2 parameters frozen, we train text prompts to extract the joint\nfeatures with improved representation, ensuring a diverse cluster structure\nthat covers the entire dataset. We extensively compare our method with the\nstate-of-the-art using seven benchmark datasets in different settings,\nachieving up to a performance gain of 20%. Interestingly, the prompts learned\nfrom one dataset demonstrate significant generalizability and can be applied\ndirectly to enhance the feature extraction of BLIP-2 from other datasets. To\nthe best of our knowledge, UP-DP is the first work to incorporate unsupervised\nprompt learning in a vision-language model for data pre-selection.\n","authors":["Xin Li","Sima Behpour","Thang Doan","Wenbin He","Liang Gou","Liu Ren"],"pdf_url":"https://arxiv.org/pdf/2307.11227v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.03297v2","updated":"2023-07-20T19:28:22Z","published":"2022-10-07T03:10:34Z","title":"Preprocessors Matter! Realistic Decision-Based Attacks on Machine\n Learning Systems","summary":" Decision-based attacks construct adversarial examples against a machine\nlearning (ML) model by making only hard-label queries. These attacks have\nmainly been applied directly to standalone neural networks. However, in\npractice, ML models are just one component of a larger learning system. We find\nthat by adding a single preprocessor in front of a classifier, state-of-the-art\nquery-based attacks are up to 7$\\times$ less effective at attacking a\nprediction pipeline than at attacking the model alone. We explain this\ndiscrepancy by the fact that most preprocessors introduce some notion of\ninvariance to the input space. Hence, attacks that are unaware of this\ninvariance inevitably waste a large number of queries to re-discover or\novercome it. We, therefore, develop techniques to (i) reverse-engineer the\npreprocessor and then (ii) use this extracted information to attack the\nend-to-end system. Our preprocessors extraction method requires only a few\nhundred queries, and our preprocessor-aware attacks recover the same efficacy\nas when attacking the model alone. The code can be found at\nhttps://github.com/google-research/preprocessor-aware-black-box-attack.\n","authors":["Chawin Sitawarin","Florian Tramèr","Nicholas Carlini"],"pdf_url":"https://arxiv.org/pdf/2210.03297v2.pdf","comment":"ICML 2023. 
Code can be found at\n https://github.com/google-research/preprocessor-aware-black-box-attack"},{"id":"http://arxiv.org/abs/2302.11827v2","updated":"2023-07-20T19:21:51Z","published":"2023-02-23T07:26:50Z","title":"Open Challenges for Monocular Single-shot 6D Object Pose Estimation","summary":" Object pose estimation is a non-trivial task that enables robotic\nmanipulation, bin picking, augmented reality, and scene understanding, to name\na few use cases. Monocular object pose estimation gained considerable momentum\nwith the rise of high-performing deep learning-based solutions and is\nparticularly interesting for the community since sensors are inexpensive and\ninference is fast. Prior works establish the comprehensive state of the art for\ndiverse pose estimation problems. Their broad scopes make it difficult to\nidentify promising future directions. We narrow down the scope to the problem\nof single-shot monocular 6D object pose estimation, which is commonly used in\nrobotics, and thus are able to identify such trends. By reviewing recent\npublications in robotics and computer vision, the state of the art is\nestablished at the union of both fields. Following that, we identify promising\nresearch directions in order to help researchers to formulate relevant research\nideas and effectively advance the state of the art. Findings include that\nmethods are sophisticated enough to overcome the domain shift and that\nocclusion handling is a fundamental challenge. We also highlight problems such\nas novel object pose estimation and challenging materials handling as central\nchallenges to advance robotics.\n","authors":["Stefan Thalhammer","Peter Hönig","Jean-Baptiste Weibel","Markus Vincze"],"pdf_url":"https://arxiv.org/pdf/2302.11827v2.pdf","comment":"Revised version in the making"},{"id":"http://arxiv.org/abs/2307.11197v1","updated":"2023-07-20T19:20:35Z","published":"2023-07-20T19:20:35Z","title":"Heuristic Hyperparameter Choice for Image Anomaly Detection","summary":" Anomaly detection (AD) in images is a fundamental computer vision problem by\ndeep learning neural network to identify images deviating significantly from\nnormality. The deep features extracted from pretrained models have been proved\nto be essential for AD based on multivariate Gaussian distribution analysis.\nHowever, since models are usually pretrained on a large dataset for\nclassification tasks such as ImageNet, they might produce lots of redundant\nfeatures for AD, which increases computational cost and degrades the\nperformance. We aim to do the dimension reduction of Negated Principal\nComponent Analysis (NPCA) for these features. So we proposed some heuristic to\nchoose hyperparameter of NPCA algorithm for getting as fewer components of\nfeatures as possible while ensuring a good performance.\n","authors":["Zeyu Jiang","João P. C. Bertoldo","Etienne Decencière"],"pdf_url":"https://arxiv.org/pdf/2307.11197v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.03512v2","updated":"2023-07-20T18:27:42Z","published":"2023-07-07T11:00:44Z","title":"Tranfer Learning of Semantic Segmentation Methods for Identifying Buried\n Archaeological Structures on LiDAR Data","summary":" When applying deep learning to remote sensing data in archaeological\nresearch, a notable obstacle is the limited availability of suitable datasets\nfor training models. The application of transfer learning is frequently\nemployed to mitigate this drawback. 
However, there is still a need to explore\nits effectiveness when applied across different archaeological datasets. This\npaper compares the performance of various transfer learning configurations\nusing two semantic segmentation deep neural networks on two LiDAR datasets. The\nexperimental results indicate that transfer learning-based approaches in\narchaeology can lead to performance improvements, although a systematic\nenhancement has not yet been observed. We provide specific insights about the\nvalidity of such techniques that can serve as a baseline for future works.\n","authors":["Paolo Soleni","Wouter B. Verschoof-van der Vaart","Žiga Kokalj","Arianna Traviglia","Marco Fiorucci"],"pdf_url":"https://arxiv.org/pdf/2307.03512v2.pdf","comment":"Accepted to IEEE International Geoscience and Remote Sensing\n Symposium 2023 (IGARSS 2023) @IEEE copyright"},{"id":"http://arxiv.org/abs/2307.11141v1","updated":"2023-07-20T17:53:04Z","published":"2023-07-20T17:53:04Z","title":"Towards General Game Representations: Decomposing Games Pixels into\n Content and Style","summary":" On-screen game footage contains rich contextual information that players\nprocess when playing and experiencing a game. Learning pixel representations of\ngames can benefit artificial intelligence across several downstream tasks\nincluding game-playing agents, procedural content generation, and player\nmodelling. The generalizability of these methods, however, remains a challenge,\nas learned representations should ideally be shared across games with similar\ngame mechanics. This could allow, for instance, game-playing agents trained on\none game to perform well in similar games with no re-training. This paper\nexplores how generalizable pre-trained computer vision encoders can be for such\ntasks, by decomposing the latent space into content embeddings and style\nembeddings. The goal is to minimize the domain gap between games of the same\ngenre when it comes to game content critical for downstream tasks, and ignore\ndifferences in graphical style. We employ a pre-trained Vision Transformer\nencoder and a decomposition technique based on game genres to obtain separate\ncontent and style embeddings. Our findings show that the decomposed embeddings\nachieve style invariance across multiple games while still maintaining strong\ncontent extraction capabilities. We argue that the proposed decomposition of\ncontent and style offers better generalization capacities across game\nenvironments independently of the downstream task.\n","authors":["Chintan Trivedi","Konstantinos Makantasis","Antonios Liapis","Georgios N. Yannakakis"],"pdf_url":"https://arxiv.org/pdf/2307.11141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.10495v2","updated":"2023-07-20T16:33:52Z","published":"2022-10-19T12:04:47Z","title":"ADPS: Asymmetric Distillation Post-Segmentation Method for Image Anomaly\n Detection","summary":" Knowledge Distillation-based Anomaly Detection (KDAD) methods rely on the\nteacher-student paradigm to detect and segment anomalous regions by contrasting\nthe unique features extracted by both networks. However, existing KDAD methods\nsuffer from two main limitations: 1) the student network can effortlessly\nreplicate the teacher network's representations, and 2) the features of the\nteacher network serve solely as a ``reference standard\" and are not fully\nleveraged. Toward this end, we depart from the established paradigm and instead\npropose an innovative approach called Asymmetric Distillation Post-Segmentation\n(ADPS). 
Our ADPS employs an asymmetric distillation paradigm that takes\ndistinct forms of the same image as the input of the teacher-student networks,\ndriving the student network to learn discriminating representations for\nanomalous regions.\n Meanwhile, a customized Weight Mask Block (WMB) is proposed to generate a\ncoarse anomaly localization mask that transfers the distilled knowledge\nacquired from the asymmetric paradigm to the teacher network. Equipped with\nWMB, the proposed Post-Segmentation Module (PSM) is able to effectively detect\nand segment abnormal regions with fine structures and clear boundaries.\nExperimental results demonstrate that the proposed ADPS outperforms the\nstate-of-the-art methods in detecting and segmenting anomalies. Surprisingly,\nADPS significantly improves Average Precision (AP) metric by 9% and 20% on the\nMVTec AD and KolektorSDD2 datasets, respectively.\n","authors":["Peng Xing","Hao Tang","Jinhui Tang","Zechao Li"],"pdf_url":"https://arxiv.org/pdf/2210.10495v2.pdf","comment":"11pages,9 figures"},{"id":"http://arxiv.org/abs/2307.11130v1","updated":"2023-07-20T16:07:02Z","published":"2023-07-20T16:07:02Z","title":"Frequency-aware optical coherence tomography image super-resolution via\n conditional generative adversarial neural network","summary":" Optical coherence tomography (OCT) has stimulated a wide range of medical\nimage-based diagnosis and treatment in fields such as cardiology and\nophthalmology. Such applications can be further facilitated by deep\nlearning-based super-resolution technology, which improves the capability of\nresolving morphological structures. However, existing deep learning-based\nmethod only focuses on spatial distribution and disregard frequency fidelity in\nimage reconstruction, leading to a frequency bias. To overcome this limitation,\nwe propose a frequency-aware super-resolution framework that integrates three\ncritical frequency-based modules (i.e., frequency transformation, frequency\nskip connection, and frequency alignment) and frequency-based loss function\ninto a conditional generative adversarial network (cGAN). We conducted a\nlarge-scale quantitative study from an existing coronary OCT dataset to\ndemonstrate the superiority of our proposed framework over existing deep\nlearning frameworks. In addition, we confirmed the generalizability of our\nframework by applying it to fish corneal images and rat retinal images,\ndemonstrating its capability to super-resolve morphological details in eye\nimaging.\n","authors":["Xueshen Li","Zhenxing Dong","Hongshan Liu","Jennifer J. Kang-Mieler","Yuye Ling","Yu Gan"],"pdf_url":"https://arxiv.org/pdf/2307.11130v1.pdf","comment":"13 pages, 7 figures, submitted to Biomedical Optics Express special\n issue"},{"id":"http://arxiv.org/abs/2307.11118v1","updated":"2023-07-20T14:37:30Z","published":"2023-07-20T14:37:30Z","title":"Diffusion Sampling with Momentum for Mitigating Divergence Artifacts","summary":" Despite the remarkable success of diffusion models in image generation, slow\nsampling remains a persistent issue. To accelerate the sampling process, prior\nstudies have reformulated diffusion sampling as an ODE/SDE and introduced\nhigher-order numerical methods. However, these methods often produce divergence\nartifacts, especially with a low number of sampling steps, which limits the\nachievable acceleration. In this paper, we investigate the potential causes of\nthese artifacts and suggest that the small stability regions of these methods\ncould be the principal cause. 
To address this issue, we propose two novel\ntechniques. The first technique involves the incorporation of Heavy Ball (HB)\nmomentum, a well-known technique for improving optimization, into existing\ndiffusion numerical methods to expand their stability regions. We also prove\nthat the resulting methods have first-order convergence. The second technique,\ncalled Generalized Heavy Ball (GHVB), constructs a new high-order method that\noffers a variable trade-off between accuracy and artifact suppression.\nExperimental results show that our techniques are highly effective in reducing\nartifacts and improving image quality, surpassing state-of-the-art diffusion\nsolvers on both pixel-based and latent-based diffusion models for low-step\nsampling. Our research provides novel insights into the design of numerical\nmethods for future diffusion work.\n","authors":["Suttisak Wizadwongsa","Worameth Chinchuthakun","Pramook Khungurn","Amit Raj","Supasorn Suwajanakorn"],"pdf_url":"https://arxiv.org/pdf/2307.11118v1.pdf","comment":"Project page: https://github.com/sWizad/momentum-diffusion"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2307.11019v1","updated":"2023-07-20T16:46:10Z","published":"2023-07-20T16:46:10Z","title":"Investigating the Factual Knowledge Boundary of Large Language Models\n with Retrieval Augmentation","summary":" Knowledge-intensive tasks (e.g., open-domain question answering (QA)) require\na substantial amount of factual knowledge and often rely on external\ninformation for assistance. Recently, large language models (LLMs) (e.g.,\nChatGPT), have demonstrated impressive prowess in solving a wide range of tasks\nwith world knowledge, including knowledge-intensive tasks. However, it remains\nunclear how well LLMs are able to perceive their factual knowledge boundaries,\nparticularly how they behave when incorporating retrieval augmentation. In this\nstudy, we present an initial analysis of the factual knowledge boundaries of\nLLMs and how retrieval augmentation affects LLMs on open-domain QA. Specially,\nwe focus on three primary research questions and analyze them by examining QA\nperformance, priori judgement and posteriori judgement of LLMs. We show\nevidence that LLMs possess unwavering confidence in their capabilities to\nrespond to questions and the accuracy of their responses. Furthermore,\nretrieval augmentation proves to be an effective approach in enhancing LLMs'\nawareness of knowledge boundaries, thereby improving their judgemental\nabilities. Additionally, we also find that LLMs have a propensity to rely on\nthe provided retrieval results when formulating answers, while the quality of\nthese results significantly impacts their reliance. The code to reproduce this\nwork is available at https://github.com/RUCAIBox/LLM-Knowledge-Boundary.\n","authors":["Ruiyang Ren","Yuhao Wang","Yingqi Qu","Wayne Xin Zhao","Jing Liu","Hao Tian","Hua Wu","Ji-Rong Wen","Haifeng Wang"],"pdf_url":"https://arxiv.org/pdf/2307.11019v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2105.11876v3","updated":"2023-07-20T10:42:36Z","published":"2021-05-25T12:23:24Z","title":"Criterion-based Heterogeneous Collaborative Filtering for Multi-behavior\n Implicit Recommendation","summary":" Recent years have witnessed the explosive growth of interaction behaviors in\nmultimedia information systems, where multi-behavior recommender systems have\nreceived increasing attention by leveraging data from various auxiliary\nbehaviors such as tip and collect. 
Among various multi-behavior recommendation\nmethods, non-sampling methods have shown superiority over negative sampling\nmethods. However, two observations are usually ignored in existing\nstate-of-the-art non-sampling methods based on binary regression: (1) users\nhave different preference strengths for different items, so they cannot be\nmeasured simply by binary implicit data; (2) the dependency across multiple\nbehaviors varies for different users and items. To tackle the above issue, we\npropose a novel non-sampling learning framework named Criterion-guided\nHeterogeneous Collaborative Filtering (CHCF). CHCF introduces both upper and\nlower thresholds to indicate selection criteria, which will guide user\npreference learning. Besides, CHCF integrates criterion learning and user\npreference learning into a unified framework, which can be trained jointly for\nthe interaction prediction of the target behavior. We further theoretically\ndemonstrate that the optimization of Collaborative Metric Learning can be\napproximately achieved by the CHCF learning framework in a non-sampling form\neffectively. Extensive experiments on three real-world datasets show the\neffectiveness of CHCF in heterogeneous scenarios.\n","authors":["Xiao Luo","Daqing Wu","Yiyang Gu","Chong Chen","Luchen Liu","Jinwen Ma","Ming Zhang","Minghua Deng","Jianqiang Huang","Xian-Sheng Hua"],"pdf_url":"https://arxiv.org/pdf/2105.11876v3.pdf","comment":"Accepted by ACM Transactions on Knowledge Discovery from Data (TKDD)"},{"id":"http://arxiv.org/abs/2307.10747v1","updated":"2023-07-20T10:19:47Z","published":"2023-07-20T10:19:47Z","title":"Enhancing Job Recommendation through LLM-based Generative Adversarial\n Networks","summary":" Recommending suitable jobs to users is a critical task in online recruitment\nplatforms, as it can enhance users' satisfaction and the platforms'\nprofitability. While existing job recommendation methods encounter challenges\nsuch as the low quality of users' resumes, which hampers their accuracy and\npractical effectiveness. With the rapid development of large language models\n(LLMs), utilizing the rich external knowledge encapsulated within them, as well\nas their powerful capabilities of text processing and reasoning, is a promising\nway to complete users' resumes for more accurate recommendations. However,\ndirectly leveraging LLMs to enhance recommendation results is not a\none-size-fits-all solution, as LLMs may suffer from fabricated generation and\nfew-shot problems, which degrade the quality of resume completion. In this\npaper, we propose a novel LLM-based approach for job recommendation. To\nalleviate the limitation of fabricated generation for LLMs, we extract accurate\nand valuable information beyond users' self-description, which helps the LLMs\nbetter profile users for resume completion. Specifically, we not only extract\nusers' explicit properties (e.g., skills, interests) from their\nself-description but also infer users' implicit characteristics from their\nbehaviors for more accurate and meaningful resume completion. Nevertheless,\nsome users still suffer from few-shot problems, which arise due to scarce\ninteraction records, leading to limited guidance for the models in generating\nhigh-quality resumes. To address this issue, we propose aligning unpaired\nlow-quality with high-quality generated resumes by Generative Adversarial\nNetworks (GANs), which can refine the resume representations for better\nrecommendation results. 
Extensive experiments on three large real-world\nrecruitment datasets demonstrate the effectiveness of our proposed method.\n","authors":["Yingpeng Du","Di Luo","Rui Yan","Hongzhi Liu","Yang Song","Hengshu Zhu","Jie Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.10747v1.pdf","comment":"13 pages, 6 figures, 3 tables"},{"id":"http://arxiv.org/abs/2011.00696v2","updated":"2023-07-20T08:56:26Z","published":"2020-11-02T03:07:38Z","title":"ABNIRML: Analyzing the Behavior of Neural IR Models","summary":" Pretrained contextualized language models such as BERT and T5 have\nestablished a new state-of-the-art for ad-hoc search. However, it is not yet\nwell-understood why these methods are so effective, what makes some variants\nmore effective than others, and what pitfalls they may have. We present a new\ncomprehensive framework for Analyzing the Behavior of Neural IR ModeLs\n(ABNIRML), which includes new types of diagnostic probes that allow us to test\nseveral characteristics -- such as writing styles, factuality, sensitivity to\nparaphrasing and word order -- that are not addressed by previous techniques.\nTo demonstrate the value of the framework, we conduct an extensive empirical\nstudy that yields insights into the factors that contribute to the neural\nmodel's gains, and identify potential unintended biases the models exhibit.\nSome of our results confirm conventional wisdom, like that recent neural\nranking models rely less on exact term overlap with the query, and instead\nleverage richer linguistic information, evidenced by their higher sensitivity\nto word and sentence order. Other results are more surprising, such as that\nsome models (e.g., T5 and ColBERT) are biased towards factually correct (rather\nthan simply relevant) texts. Further, some characteristics vary even for the\nsame base language model, and other characteristics can appear due to random\nvariations during model training.\n","authors":["Sean MacAvaney","Sergey Feldman","Nazli Goharian","Doug Downey","Arman Cohan"],"pdf_url":"https://arxiv.org/pdf/2011.00696v2.pdf","comment":"TACL version"},{"id":"http://arxiv.org/abs/2307.10702v1","updated":"2023-07-20T08:47:54Z","published":"2023-07-20T08:47:54Z","title":"A Constraint-based Recommender System via RDF Knowledge Graphs","summary":" Knowledge graphs, represented in RDF, are able to model entities and their\nrelations by means of ontologies. The use of knowledge graphs for information\nmodeling has attracted interest in recent years. In recommender systems, items\nand users can be mapped and integrated into the knowledge graph, which can\nrepresent more links and relationships between users and items.\nConstraint-based recommender systems are based on the idea of explicitly\nexploiting deep recommendation knowledge through constraints to identify\nrelevant recommendations. When combined with knowledge graphs, a\nconstraint-based recommender system gains several benefits in terms of\nconstraint sets. In this paper, we investigate and propose the construction of\na constraint-based recommender system via RDF knowledge graphs applied to the\nvehicle purchase/sale domain. 
The results of our experiments show that the\nproposed approach is able to efficiently identify recommendations in accordance\nwith user preferences.\n","authors":["Ngoc Luyen Le","Marie-Hélène Abel","Philippe Gouspillou"],"pdf_url":"https://arxiv.org/pdf/2307.10702v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10680v1","updated":"2023-07-20T08:14:06Z","published":"2023-07-20T08:14:06Z","title":"A Personalized Recommender System Based-on Knowledge Graph Embeddings","summary":" Knowledge graphs have proven to be effective for modeling entities and their\nrelationships through the use of ontologies. The recent emergence in interest\nfor using knowledge graphs as a form of information modeling has led to their\nincreased adoption in recommender systems. By incorporating users and items\ninto the knowledge graph, these systems can better capture the implicit\nconnections between them and provide more accurate recommendations. In this\npaper, we investigate and propose the construction of a personalized\nrecommender system via knowledge graphs embedding applied to the vehicle\npurchase/sale domain. The results of our experimentation demonstrate the\nefficacy of the proposed method in providing relevant recommendations that are\nconsistent with individual users.\n","authors":["Ngoc Luyen Le","Marie-Hélène Abel","Philippe Gouspillou"],"pdf_url":"https://arxiv.org/pdf/2307.10680v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10650v1","updated":"2023-07-20T07:30:27Z","published":"2023-07-20T07:30:27Z","title":"Language-Enhanced Session-Based Recommendation with Decoupled\n Contrastive Learning","summary":" Session-based recommendation techniques aim to capture dynamic user behavior\nby analyzing past interactions. However, existing methods heavily rely on\nhistorical item ID sequences to extract user preferences, leading to challenges\nsuch as popular bias and cold-start problems. In this paper, we propose a\nhybrid multimodal approach for session-based recommendation to address these\nchallenges. Our approach combines different modalities, including textual\ncontent and item IDs, leveraging the complementary nature of these modalities\nusing CatBoost. To learn universal item representations, we design a language\nrepresentation-based item retrieval architecture that extracts features from\nthe textual content utilizing pre-trained language models. Furthermore, we\nintroduce a novel Decoupled Contrastive Learning method to enhance the\neffectiveness of the language representation. This technique decouples the\nsequence representation and item representation space, facilitating\nbidirectional alignment through dual-queue contrastive learning.\nSimultaneously, the momentum queue provides a large number of negative samples,\neffectively enhancing the effectiveness of contrastive learning. Our approach\nyielded competitive results, securing a 5th place ranking in KDD CUP 2023 Task\n1. 
We have released the source code and pre-trained models associated with this\nwork.\n","authors":["Zhipeng Zhang","Piao Tong","Yingwei Ma","Qiao Liu","Xujiang Liu","Xu Luo"],"pdf_url":"https://arxiv.org/pdf/2307.10650v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10639v1","updated":"2023-07-20T07:08:25Z","published":"2023-07-20T07:08:25Z","title":"Improving Semantic Similarity Measure Within a Recommender System\n Based-on RDF Graphs","summary":" In today's era of information explosion, more users are becoming more reliant\nupon recommender systems to have better advice, suggestions, or inspire them.\nThe measure of the semantic relatedness or likeness between terms, words, or\ntext data plays an important role in different applications dealing with\ntextual data, as in a recommender system. Over the past few years, many\nontologies have been developed and used as a form of structured representation\nof knowledge bases for information systems. The measure of semantic similarity\nfrom ontology has developed by several methods. In this paper, we propose and\ncarry on an approach for the improvement of semantic similarity calculations\nwithin a recommender system based-on RDF graphs.\n","authors":["Ngoc Luyen Le","Marie-Hélène Abel","Philippe Gouspillou"],"pdf_url":"https://arxiv.org/pdf/2307.10639v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10617v1","updated":"2023-07-20T06:35:43Z","published":"2023-07-20T06:35:43Z","title":"Detecting deceptive reviews using text classification","summary":" In recent years, online reviews play a vital role for promoting any kind of\nproduct or services. Businesses may embed fake reviews in order to attract\ncustomers to purchase their products. They may even highlight the benefits of\ntheir own product or criticize the competition's product. Marketers,\nadvertisers, and other online business users have incentive to create fake\npositive reviews for products which they want to promote or give fake negative\nreviews for products which they really don't like. So now-a-days writing a\ndeceptive review is inevitable thing for promoting their own business or\ndegrading competitor's reputation. Thus, identifying deceptive reviews is an\nintense and on-going research area. This research paper proposes machine\nlearning model approach to identify deceptive reviews. The paper investigates\nthe performance of the several experiments done on a Deceptive Opinion Spam\nCorpus dataset of restaurants reviews. We developed a n-gram model and max\nfeatures to identify deceptive contents with a particular focus on fake\nreviews. Further, we conduct a benchmark study to investigate the performance\nof two different features extraction techniques and apply five machine learning\nclassification techniques. The experimental results show that passive\naggressive classifier outperforms other algorithms, and it reaches the highest\naccuracy not only in text classification but also to fake reviews. 
We also\nstudy the data augmentation and implement different deep learning techniques.\n","authors":["Anusuya Baby"],"pdf_url":"https://arxiv.org/pdf/2307.10617v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2306.11296v2","updated":"2023-07-20T02:20:35Z","published":"2023-06-20T05:20:29Z","title":"ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF\n Synthesis","summary":" We use prompt engineering to guide ChatGPT in the automation of text mining\nof metal-organic frameworks (MOFs) synthesis conditions from diverse formats\nand styles of the scientific literature. This effectively mitigates ChatGPT's\ntendency to hallucinate information -- an issue that previously made the use of\nLarge Language Models (LLMs) in scientific fields challenging. Our approach\ninvolves the development of a workflow implementing three different processes\nfor text mining, programmed by ChatGPT itself. All of them enable parsing,\nsearching, filtering, classification, summarization, and data unification with\ndifferent tradeoffs between labor, speed, and accuracy. We deploy this system\nto extract 26,257 distinct synthesis parameters pertaining to approximately 800\nMOFs sourced from peer-reviewed research articles. This process incorporates\nour ChemPrompt Engineering strategy to instruct ChatGPT in text mining,\nresulting in impressive precision, recall, and F1 scores of 90-99%.\nFurthermore, with the dataset built by text mining, we constructed a\nmachine-learning model with over 86% accuracy in predicting MOF experimental\ncrystallization outcomes and preliminarily identifying important factors in MOF\ncrystallization. We also developed a reliable data-grounded MOF chatbot to\nanswer questions on chemical reactions and synthesis procedures. Given that the\nprocess of using ChatGPT reliably mines and tabulates diverse MOF synthesis\ninformation in a unified format, while using only narrative language requiring\nno coding expertise, we anticipate that our ChatGPT Chemistry Assistant will be\nvery useful across various other chemistry sub-disciplines.\n","authors":["Zhiling Zheng","Oufan Zhang","Christian Borgs","Jennifer T. Chayes","Omar M. Yaghi"],"pdf_url":"https://arxiv.org/pdf/2306.11296v2.pdf","comment":"Published on Journal of the American Chemical Society (2023); 102\n pages (18-page manuscript, 84 pages of supporting information)"},{"id":"http://arxiv.org/abs/2307.11224v1","updated":"2023-07-20T20:37:24Z","published":"2023-07-20T20:37:24Z","title":"Jina Embeddings: A Novel Set of High-Performance Sentence Embedding\n Models","summary":" Jina Embeddings constitutes a set of high-performance sentence embedding\nmodels adept at translating various textual inputs into numerical\nrepresentations, thereby capturing the semantic essence of the text. While\nthese models are not exclusively designed for text generation, they excel in\napplications such as dense retrieval and semantic textual similarity. This\npaper details the development of Jina Embeddings, starting with the creation of\na high-quality pairwise and triplet dataset. 
It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-depth insights into the model\ntraining process, and concludes with a comprehensive performance evaluation\nusing the Massive Textual Embedding Benchmark (MTEB).\n","authors":["Michael Günther","Louis Milliken","Jonathan Geuter","Georgios Mastrapas","Bo Wang","Han Xiao"],"pdf_url":"https://arxiv.org/pdf/2307.11224v1.pdf","comment":"9 pages, 2 page appendix, EMNLP 2023 Industrial Track"},{"id":"http://arxiv.org/abs/2307.11140v1","updated":"2023-07-20T17:52:47Z","published":"2023-07-20T17:52:47Z","title":"RCVaR: an Economic Approach to Estimate Cyberattacks Costs using Data\n from Industry Reports","summary":" Digitization increases business opportunities and the risk of companies being\nvictims of devastating cyberattacks. Therefore, managing risk exposure and\ncybersecurity strategies is essential for digitized companies that want to\nsurvive in competitive markets. However, understanding company-specific risks\nand quantifying their associated costs is not trivial. Current approaches fail\nto provide individualized and quantitative monetary estimations of\ncybersecurity impacts. Due to limited resources and technical expertise, SMEs\nand even large companies are affected and struggle to quantify their\ncyberattack exposure. Therefore, novel approaches must be placed to support the\nunderstanding of the financial loss due to cyberattacks. This article\nintroduces the Real Cyber Value at Risk (RCVaR), an economical approach for\nestimating cybersecurity costs using real-world information from public\ncybersecurity reports. RCVaR identifies the most significant cyber risk factors\nfrom various sources and combines their quantitative results to estimate\nspecific cyberattacks costs for companies. Furthermore, RCVaR extends current\nmethods to achieve cost and risk estimations based on historical real-world\ndata instead of only probability-based simulations. The evaluation of the\napproach on unseen data shows the accuracy and efficiency of the RCVaR in\npredicting and managing cyber risks. Thus, it shows that the RCVaR is a\nvaluable addition to cybersecurity planning and risk management processes.\n","authors":["Muriel Figueredo Franco","Fabian Künzler","Jan von der Assen","Chao Feng","Burkhard Stiller"],"pdf_url":"https://arxiv.org/pdf/2307.11140v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2307.11091v1","updated":"2023-07-20T17:59:59Z","published":"2023-07-20T17:59:59Z","title":"Data-driven criteria for quantum correlations","summary":" We build a machine learning model to detect correlations in a three-qubit\nsystem using a neural network trained in an unsupervised manner on randomly\ngenerated states. The network is forced to recognize separable states, and\ncorrelated states are detected as anomalies. Quite surprisingly, we find that\nthe proposed detector performs much better at distinguishing a weaker form of\nquantum correlations, namely, the quantum discord, than entanglement. In fact,\nit has a tendency to grossly overestimate the set of entangled states even at\nthe optimal threshold for entanglement detection, while it underestimates the\nset of discordant states to a much lesser extent. In order to illustrate the\nnature of states classified as quantum-correlated, we construct a diagram\ncontaining various types of states -- entangled, as well as separable, both\ndiscordant and non-discordant. 
We find that the near-zero value of the\nrecognition loss reproduces the shape of the non-discordant separable states\nwith high accuracy, especially considering the non-trivial shape of this set on\nthe diagram. The network architecture is designed carefully: it preserves\nseparability, and its output is equivariant with respect to qubit permutations.\nWe show that the choice of architecture is important to get the highest\ndetection accuracy, much better than for a baseline model that just utilizes a\npartial trace operation.\n","authors":["Mateusz Krawczyk","Jarosław Pawłowski","Maciej M. Maśka","Katarzyna Roszak"],"pdf_url":"https://arxiv.org/pdf/2307.11091v1.pdf","comment":"7 pages, 3 figures, 3 tables, and extra 5 pages of supplementary\n materials"},{"id":"http://arxiv.org/abs/2307.11086v1","updated":"2023-07-20T17:59:33Z","published":"2023-07-20T17:59:33Z","title":"PAPR: Proximity Attention Point Rendering","summary":" Learning accurate and parsimonious point cloud representations of scene\nsurfaces from scratch remains a challenge in 3D representation learning.\nExisting point-based methods often suffer from the vanishing gradient problem\nor require a large number of points to accurately model scene geometry and\ntexture. To address these limitations, we propose Proximity Attention Point\nRendering (PAPR), a novel method that consists of a point-based scene\nrepresentation and a differentiable renderer. Our scene representation uses a\npoint cloud where each point is characterized by its spatial position,\nforeground score, and view-independent feature vector. The renderer selects the\nrelevant points for each ray and produces accurate colours using their\nassociated features. PAPR effectively learns point cloud positions to represent\nthe correct scene geometry, even when the initialization drastically differs\nfrom the target geometry. Notably, our method captures fine texture details\nwhile using only a parsimonious set of points. We also demonstrate four\npractical applications of our method: geometry editing, object manipulation,\ntexture transfer, and exposure control. More results and code are available on\nour project website at https://zvict.github.io/papr/.\n","authors":["Yanshu Zhang","Shichong Peng","Alireza Moazeni","Ke Li"],"pdf_url":"https://arxiv.org/pdf/2307.11086v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.07269v2","updated":"2023-07-20T17:59:25Z","published":"2023-07-14T10:50:43Z","title":"Frequency Domain Adversarial Training for Robust Volumetric Medical\n Segmentation","summary":" It is imperative to ensure the robustness of deep learning models in critical\napplications such as, healthcare. While recent advances in deep learning have\nimproved the performance of volumetric medical image segmentation models, these\nmodels cannot be deployed for real-world applications immediately due to their\nvulnerability to adversarial attacks. We present a 3D frequency domain\nadversarial attack for volumetric medical image segmentation models and\ndemonstrate its advantages over conventional input or voxel domain attacks.\nUsing our proposed attack, we introduce a novel frequency domain adversarial\ntraining approach for optimizing a robust model against voxel and frequency\ndomain attacks. Moreover, we propose frequency consistency loss to regulate our\nfrequency domain adversarial training that achieves a better tradeoff between\nmodel's performance on clean and adversarial samples. 
Code is publicly\navailable at https://github.com/asif-hanif/vafa.\n","authors":["Asif Hanif","Muzammal Naseer","Salman Khan","Mubarak Shah","Fahad Shahbaz Khan"],"pdf_url":"https://arxiv.org/pdf/2307.07269v2.pdf","comment":"This paper has been accepted in MICCAI 2023 conference"},{"id":"http://arxiv.org/abs/2301.13867v2","updated":"2023-07-20T17:59:14Z","published":"2023-01-31T18:59:03Z","title":"Mathematical Capabilities of ChatGPT","summary":" We investigate the mathematical capabilities of two iterations of ChatGPT\n(released 9-January-2023 and 30-January-2023) and of GPT-4 by testing them on\npublicly available datasets, as well as hand-crafted ones, using a novel\nmethodology. In contrast to formal mathematics, where large databases of formal\nproofs are available (e.g., the Lean Mathematical Library), current datasets of\nnatural-language mathematics, used to benchmark language models, either cover\nonly elementary mathematics or are very small. We address this by publicly\nreleasing two new datasets: GHOSTS and miniGHOSTS. These are the first\nnatural-language datasets curated by working researchers in mathematics that\n(1) aim to cover graduate-level mathematics, (2) provide a holistic overview of\nthe mathematical capabilities of language models, and (3) distinguish multiple\ndimensions of mathematical reasoning. These datasets also test whether ChatGPT\nand GPT-4 can be helpful assistants to professional mathematicians by emulating\nuse cases that arise in the daily professional activities of mathematicians. We\nbenchmark the models on a range of fine-grained performance metrics. For\nadvanced mathematics, this is the most detailed evaluation effort to date. We\nfind that ChatGPT can be used most successfully as a mathematical assistant for\nquerying facts, acting as a mathematical search engine and knowledge base\ninterface. GPT-4 can additionally be used for undergraduate-level mathematics\nbut fails on graduate-level difficulty. Contrary to many positive reports in\nthe media about GPT-4 and ChatGPT's exam-solving abilities (a potential case of\nselection bias), their overall mathematical performance is well below the level\nof a graduate student. Hence, if your goal is to use ChatGPT to pass a\ngraduate-level math exam, you would be better off copying from your average\npeer!\n","authors":["Simon Frieder","Luca Pinchetti","Alexis Chevalier","Ryan-Rhys Griffiths","Tommaso Salvatori","Thomas Lukasiewicz","Philipp Christian Petersen","Julius Berner"],"pdf_url":"https://arxiv.org/pdf/2301.13867v2.pdf","comment":"Added further evaluations on another ChatGPT version and on GPT-4.\n The GHOSTS and miniGHOSTS datasets are available at\n https://github.com/xyfrieder/science-GHOSTS"},{"id":"http://arxiv.org/abs/2307.11085v1","updated":"2023-07-20T17:59:11Z","published":"2023-07-20T17:59:11Z","title":"Representation Learning in Anomaly Detection: Successes, Limits and a\n Grand Challenge","summary":" In this perspective paper, we argue that the dominant paradigm in anomaly\ndetection cannot scale indefinitely and will eventually hit fundamental limits.\nThis is due to a no free lunch principle for anomaly detection. These\nlimitations can be overcome when there are strong task priors, as is the case\nfor many industrial tasks. When such priors do not exist, the task is much\nharder for anomaly detection. 
We pose two such tasks as grand challenges for\nanomaly detection: i) scientific discovery by anomaly detection ii) a\n\"mini-grand\" challenge of detecting the most anomalous image in the ImageNet\ndataset. We believe new anomaly detection tools and ideas would need to be\ndeveloped to overcome these challenges.\n","authors":["Yedid Hoshen"],"pdf_url":"https://arxiv.org/pdf/2307.11085v1.pdf","comment":"Keynote talk at the Visual Anomaly and Novelty Detection Workshop,\n CVPR'23"},{"id":"http://arxiv.org/abs/2205.09208v2","updated":"2023-07-20T17:57:36Z","published":"2022-05-18T20:34:25Z","title":"Torchhd: An Open Source Python Library to Support Research on\n Hyperdimensional Computing and Vector Symbolic Architectures","summary":" Hyperdimensional computing (HD), also known as vector symbolic architectures\n(VSA), is a framework for computing with distributed representations by\nexploiting properties of random high-dimensional vector spaces. The commitment\nof the scientific community to aggregate and disseminate research in this\nparticularly multidisciplinary area has been fundamental for its advancement.\nJoining these efforts, we present Torchhd, a high-performance open source\nPython library for HD/VSA. Torchhd seeks to make HD/VSA more accessible and\nserves as an efficient foundation for further research and application\ndevelopment. The easy-to-use library builds on top of PyTorch and features\nstate-of-the-art HD/VSA functionality, clear documentation, and implementation\nexamples from well-known publications. Comparing publicly available code with\ntheir corresponding Torchhd implementation shows that experiments can run up to\n100x faster. Torchhd is available at:\nhttps://github.com/hyperdimensional-computing/torchhd.\n","authors":["Mike Heddes","Igor Nunes","Pere Vergés","Denis Kleyko","Danny Abraham","Tony Givargis","Alexandru Nicolau","Alexander Veidenbaum"],"pdf_url":"https://arxiv.org/pdf/2205.09208v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11081v1","updated":"2023-07-20T17:57:04Z","published":"2023-07-20T17:57:04Z","title":"GLSFormer : Gated - Long, Short Sequence Transformer for Step\n Recognition in Surgical Videos","summary":" Automated surgical step recognition is an important task that can\nsignificantly improve patient safety and decision-making during surgeries.\nExisting state-of-the-art methods for surgical step recognition either rely on\nseparate, multi-stage modeling of spatial and temporal information or operate\non short-range temporal resolution when learned jointly. However, the benefits\nof joint modeling of spatio-temporal features and long-range information are\nnot taken in account. In this paper, we propose a vision transformer-based\napproach to jointly learn spatio-temporal features directly from sequence of\nframe-level patches. Our method incorporates a gated-temporal attention\nmechanism that intelligently combines short-term and long-term spatio-temporal\nfeature representations. We extensively evaluate our approach on two cataract\nsurgery video datasets, namely Cataract-101 and D99, and demonstrate superior\nperformance compared to various state-of-the-art methods. These results\nvalidate the suitability of our proposed approach for automated surgical step\nrecognition. Our code is released at:\nhttps://github.com/nisargshah1999/GLSFormer\n","authors":["Nisarg A. Shah","Shameema Sikder","S. Swaroop Vedula","Vishal M. 
Patel"],"pdf_url":"https://arxiv.org/pdf/2307.11081v1.pdf","comment":"Accepted to MICCAI 2023 (Early Accept)"},{"id":"http://arxiv.org/abs/2307.11078v1","updated":"2023-07-20T17:55:17Z","published":"2023-07-20T17:55:17Z","title":"Brain2Music: Reconstructing Music from Human Brain Activity","summary":" The process of reconstructing experiences from human brain activity offers a\nunique lens into how the brain interprets and represents the world. In this\npaper, we introduce a method for reconstructing music from brain activity,\ncaptured using functional magnetic resonance imaging (fMRI). Our approach uses\neither music retrieval or the MusicLM music generation model conditioned on\nembeddings derived from fMRI data. The generated music resembles the musical\nstimuli that human subjects experienced, with respect to semantic properties\nlike genre, instrumentation, and mood. We investigate the relationship between\ndifferent components of MusicLM and brain activity through a voxel-wise\nencoding modeling analysis. Furthermore, we discuss which brain regions\nrepresent information derived from purely textual descriptions of music\nstimuli. We provide supplementary material including examples of the\nreconstructed music at https://google-research.github.io/seanet/brain2music\n","authors":["Timo I. Denk","Yu Takagi","Takuya Matsuyama","Andrea Agostinelli","Tomoya Nakai","Christian Frank","Shinji Nishimoto"],"pdf_url":"https://arxiv.org/pdf/2307.11078v1.pdf","comment":"Preprint; 21 pages; supplementary material:\n https://google-research.github.io/seanet/brain2music"},{"id":"http://arxiv.org/abs/2307.11077v1","updated":"2023-07-20T17:55:14Z","published":"2023-07-20T17:55:14Z","title":"AlignDet: Aligning Pre-training and Fine-tuning in Object Detection","summary":" The paradigm of large-scale pre-training followed by downstream fine-tuning\nhas been widely employed in various object detection algorithms. In this paper,\nwe reveal discrepancies in data, model, and task between the pre-training and\nfine-tuning procedure in existing practices, which implicitly limit the\ndetector's performance, generalization ability, and convergence speed. To this\nend, we propose AlignDet, a unified pre-training framework that can be adapted\nto various existing detectors to alleviate the discrepancies. AlignDet\ndecouples the pre-training process into two stages, i.e., image-domain and\nbox-domain pre-training. The image-domain pre-training optimizes the detection\nbackbone to capture holistic visual abstraction, and box-domain pre-training\nlearns instance-level semantics and task-aware concepts to initialize the parts\nout of the backbone. By incorporating the self-supervised pre-trained\nbackbones, we can pre-train all modules for various detectors in an\nunsupervised paradigm. As depicted in Figure 1, extensive experiments\ndemonstrate that AlignDet can achieve significant improvements across diverse\nprotocols, such as detection algorithm, model backbone, data setting, and\ntraining schedule. For example, AlignDet improves FCOS by 5.3 mAP, RetinaNet by\n2.1 mAP, Faster R-CNN by 3.3 mAP, and DETR by 2.3 mAP under fewer epochs.\n","authors":["Ming Li","Jie Wu","Xionghui Wang","Chen Chen","Jie Qin","Xuefeng Xiao","Rui Wang","Min Zheng","Xin Pan"],"pdf_url":"https://arxiv.org/pdf/2307.11077v1.pdf","comment":"Accepted by ICCV 2023. 
Code and Models are publicly available.\n Project Page: https://liming-ai.github.io/AlignDet"},{"id":"http://arxiv.org/abs/2307.11069v1","updated":"2023-07-20T17:52:19Z","published":"2023-07-20T17:52:19Z","title":"Effectiveness and predictability of in-network storage cache for\n scientific workflows","summary":" Large scientific collaborations often have multiple scientists accessing the\nsame set of files while doing different analyses, which create repeated\naccesses to the large amounts of shared data located far away. These data\naccesses have long latency due to distance and occupy the limited bandwidth\navailable over the wide-area network. To reduce the wide-area network traffic\nand the data access latency, regional data storage caches have been installed\nas a new networking service. To study the effectiveness of such a cache system\nin scientific applications, we examine the Southern California Petabyte Scale\nCache for a high-energy physics experiment. By examining about 3TB of\noperational logs, we show that this cache removed 67.6% of file requests from\nthe wide-area network and reduced the traffic volume on wide-area network by\n12.3TB (or 35.4%) an average day. The reduction in the traffic volume (35.4%)\nis less than the reduction in file counts (67.6%) because the larger files are\nless likely to be reused. Due to this difference in data access patterns, the\ncache system has implemented a policy to avoid evicting smaller files when\nprocessing larger files. We also build a machine learning model to study the\npredictability of the cache behavior. Tests show that this model is able to\naccurately predict the cache accesses, cache misses, and network throughput,\nmaking the model useful for future studies on resource provisioning and\nplanning.\n","authors":["Caitlin Sim","Kesheng Wu","Alex Sim","Inder Monga","Chin Guok","Frank Wurthwein","Diego Davila","Harvey Newman","Justas Balcas"],"pdf_url":"https://arxiv.org/pdf/2307.11069v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11049v1","updated":"2023-07-20T17:30:37Z","published":"2023-07-20T17:30:37Z","title":"Breadcrumbs to the Goal: Goal-Conditioned Exploration from\n Human-in-the-Loop Feedback","summary":" Exploration and reward specification are fundamental and intertwined\nchallenges for reinforcement learning. Solving sequential decision-making tasks\nrequiring expansive exploration requires either careful design of reward\nfunctions or the use of novelty-seeking exploration bonuses. Human supervisors\ncan provide effective guidance in the loop to direct the exploration process,\nbut prior methods to leverage this guidance require constant synchronous\nhigh-quality human feedback, which is expensive and impractical to obtain. In\nthis work, we present a technique called Human Guided Exploration (HuGE), which\nuses low-quality feedback from non-expert users that may be sporadic,\nasynchronous, and noisy. HuGE guides exploration for reinforcement learning not\nonly in simulation but also in the real world, all without meticulous reward\nspecification. The key concept involves bifurcating human feedback and policy\nlearning: human feedback steers exploration, while self-supervised learning\nfrom the exploration data yields unbiased policies. This procedure can leverage\nnoisy, asynchronous human feedback to learn policies with no hand-crafted\nreward design or exploration bonuses. 
HuGE is able to learn a variety of\nchallenging multi-stage robotic navigation and manipulation tasks in simulation\nusing crowdsourced feedback from non-expert users. Moreover, this paradigm can\nbe scaled to learning directly on real-world robots, using occasional,\nasynchronous feedback from human supervisors.\n","authors":["Marcel Torne","Max Balsells","Zihan Wang","Samedh Desai","Tao Chen","Pulkit Agrawal","Abhishek Gupta"],"pdf_url":"https://arxiv.org/pdf/2307.11049v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11046v1","updated":"2023-07-20T17:28:01Z","published":"2023-07-20T17:28:01Z","title":"A Definition of Continual Reinforcement Learning","summary":" In this paper we develop a foundation for continual reinforcement learning.\n","authors":["David Abel","André Barreto","Benjamin Van Roy","Doina Precup","Hado van Hasselt","Satinder Singh"],"pdf_url":"https://arxiv.org/pdf/2307.11046v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11044v1","updated":"2023-07-20T17:27:29Z","published":"2023-07-20T17:27:29Z","title":"On the Convergence of Bounded Agents","summary":" When has an agent converged? Standard models of the reinforcement learning\nproblem give rise to a straightforward definition of convergence: An agent\nconverges when its behavior or performance in each environment state stops\nchanging. However, as we shift the focus of our learning problem from the\nenvironment's state to the agent's state, the concept of an agent's convergence\nbecomes significantly less clear. In this paper, we propose two complementary\naccounts of agent convergence in a framing of the reinforcement learning\nproblem that centers around bounded agents. The first view says that a bounded\nagent has converged when the minimal number of states needed to describe the\nagent's future behavior cannot decrease. The second view says that a bounded\nagent has converged just when the agent's performance only changes if the\nagent's internal state changes. We establish basic properties of these two\ndefinitions, show that they accommodate typical views of convergence in\nstandard settings, and prove several facts about their nature and relationship.\nWe take these perspectives, definitions, and analysis to bring clarity to a\ncentral idea of the field.\n","authors":["David Abel","André Barreto","Hado van Hasselt","Benjamin Van Roy","Doina Precup","Satinder Singh"],"pdf_url":"https://arxiv.org/pdf/2307.11044v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.02719v3","updated":"2023-07-20T17:17:54Z","published":"2023-07-06T01:57:37Z","title":"Understanding Uncertainty Sampling","summary":" Uncertainty sampling is a prevalent active learning algorithm that queries\nsequentially the annotations of data samples which the current prediction model\nis uncertain about. However, the usage of uncertainty sampling has been largely\nheuristic: (i) There is no consensus on the proper definition of \"uncertainty\"\nfor a specific task under a specific loss; (ii) There is no theoretical\nguarantee that prescribes a standard protocol to implement the algorithm, for\nexample, how to handle the sequentially arrived annotated data under the\nframework of optimization algorithms such as stochastic gradient descent. In\nthis work, we systematically examine uncertainty sampling algorithms under both\nstream-based and pool-based active learning. 
We propose a notion of equivalent\nloss which depends on the used uncertainty measure and the original loss\nfunction and establish that an uncertainty sampling algorithm essentially\noptimizes against such an equivalent loss. The perspective verifies the\nproperness of existing uncertainty measures from two aspects: surrogate\nproperty and loss convexity. Furthermore, we propose a new notion for designing\nuncertainty measures called \\textit{loss as uncertainty}. The idea is to use\nthe conditional expected loss given the features as the uncertainty measure.\nSuch an uncertainty measure has nice analytical properties and generality to\ncover both classification and regression problems, which enable us to provide\nthe first generalization bound for uncertainty sampling algorithms under both\nstream-based and pool-based settings, in the full generality of the underlying\nmodel and problem. Lastly, we establish connections between certain variants of\nthe uncertainty sampling algorithms with risk-sensitive objectives and\ndistributional robustness, which can partly explain the advantage of\nuncertainty sampling algorithms when the sample size is small.\n","authors":["Shang Liu","Xiaocheng Li"],"pdf_url":"https://arxiv.org/pdf/2307.02719v3.pdf","comment":"Update: add numerical illustrations and experiments; correct some\n typos and modify the numbering"},{"id":"http://arxiv.org/abs/2307.11031v1","updated":"2023-07-20T17:07:28Z","published":"2023-07-20T17:07:28Z","title":"Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot\n Classification","summary":" Recent work has shown that language models' (LMs) prompt-based learning\ncapabilities make them well suited for automating data labeling in domains\nwhere manual annotation is expensive. The challenge is that while writing an\ninitial prompt is cheap, improving a prompt is costly -- practitioners often\nrequire significant labeled data in order to evaluate the impact of prompt\nmodifications. Our work asks whether it is possible to improve prompt-based\nlearning without additional labeled data. We approach this problem by\nattempting to modify the predictions of a prompt, rather than the prompt\nitself. Our intuition is that accurate predictions should also be consistent:\nsamples which are similar under some feature representation should receive the\nsame prompt prediction. We propose Embroid, a method which computes multiple\nrepresentations of a dataset under different embedding functions, and uses the\nconsistency between the LM predictions for neighboring samples to identify\nmispredictions. Embroid then uses these neighborhoods to create additional\npredictions for each sample, and combines these predictions with a simple\nlatent variable graphical model in order to generate a final corrected\nprediction. In addition to providing a theoretical analysis of Embroid, we\nconduct a rigorous empirical evaluation across six different LMs and up to 95\ndifferent tasks. We find that (1) Embroid substantially improves performance\nover original prompts (e.g., by an average of 7.3 points on GPT-JT), (2) also\nrealizes improvements for more sophisticated prompting strategies (e.g.,\nchain-of-thought), and (3) can be specialized to domains like law through the\nembedding functions.\n","authors":["Neel Guha","Mayee F. 
Chen","Kush Bhatia","Azalia Mirhoseini","Frederic Sala","Christopher Ré"],"pdf_url":"https://arxiv.org/pdf/2307.11031v1.pdf","comment":"38 pages, 22 figures, 8 tables"},{"id":"http://arxiv.org/abs/2307.11030v1","updated":"2023-07-20T17:05:51Z","published":"2023-07-20T17:05:51Z","title":"Cluster-aware Semi-supervised Learning: Relational Knowledge\n Distillation Provably Learns Clustering","summary":" Despite the empirical success and practical significance of (relational)\nknowledge distillation that matches (the relations of) features between teacher\nand student models, the corresponding theoretical interpretations remain\nlimited for various knowledge distillation paradigms. In this work, we take an\ninitial step toward a theoretical understanding of relational knowledge\ndistillation (RKD), with a focus on semi-supervised classification problems. We\nstart by casting RKD as spectral clustering on a population-induced graph\nunveiled by a teacher model. Via a notion of clustering error that quantifies\nthe discrepancy between the predicted and ground truth clusterings, we\nillustrate that RKD over the population provably leads to low clustering error.\nMoreover, we provide a sample complexity bound for RKD with limited unlabeled\nsamples. For semi-supervised learning, we further demonstrate the label\nefficiency of RKD through a general framework of cluster-aware semi-supervised\nlearning that assumes low clustering errors. Finally, by unifying data\naugmentation consistency regularization into this cluster-aware framework, we\nshow that despite the common effect of learning accurate clusterings, RKD\nfacilitates a \"global\" perspective through spectral clustering, whereas\nconsistency regularization focuses on a \"local\" perspective via expansion.\n","authors":["Yijun Dong","Kevin Miller","Qi Lei","Rachel Ward"],"pdf_url":"https://arxiv.org/pdf/2307.11030v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.05610v2","updated":"2023-07-20T16:46:36Z","published":"2023-05-09T17:01:17Z","title":"Can point cloud networks learn statistical shape models of anatomies?","summary":" Statistical Shape Modeling (SSM) is a valuable tool for investigating and\nquantifying anatomical variations within populations of anatomies. However,\ntraditional correspondence-based SSM generation methods have a prohibitive\ninference process and require complete geometric proxies (e.g., high-resolution\nbinary volumes or surface meshes) as input shapes to construct the SSM.\nUnordered 3D point cloud representations of shapes are more easily acquired\nfrom various medical imaging practices (e.g., thresholded images and surface\nscanning). Point cloud deep networks have recently achieved remarkable success\nin learning permutation-invariant features for different point cloud tasks\n(e.g., completion, semantic segmentation, classification). However, their\napplication to learning SSM from point clouds is to-date unexplored. In this\nwork, we demonstrate that existing point cloud encoder-decoder-based completion\nnetworks can provide an untapped potential for SSM, capturing population-level\nstatistical representations of shapes while reducing the inference burden and\nrelaxing the input requirement. We discuss the limitations of these techniques\nto the SSM application and suggest future improvements. 
Our work paves the way\nfor further exploration of point cloud deep learning for SSM, a promising\navenue for advancing shape analysis literature and broadening SSM to diverse\nuse cases.\n","authors":["Jadie Adams","Shireen Elhabian"],"pdf_url":"https://arxiv.org/pdf/2305.05610v2.pdf","comment":"Accepted to MICCAI 2023. 13 pages, 5 figures, appendix"},{"id":"http://arxiv.org/abs/2307.11018v1","updated":"2023-07-20T16:45:22Z","published":"2023-07-20T16:45:22Z","title":"Amortized Variational Inference: When and Why?","summary":" Amortized variational inference (A-VI) is a method for approximating the\nintractable posterior distributions that arise in probabilistic models. The\ndefining feature of A-VI is that it learns a global inference function that\nmaps each observation to its local latent variable's approximate posterior.\nThis stands in contrast to the more classical factorized (or mean-field)\nvariational inference (F-VI), which directly learns the parameters of the\napproximating distribution for each latent variable. In deep generative models,\nA-VI is used as a computational trick to speed up inference for local latent\nvariables. In this paper, we study A-VI as a general alternative to F-VI for\napproximate posterior inference. A-VI cannot produce an approximation with a\nlower Kullback-Leibler divergence than F-VI's optimal solution, because the\namortized family is a subset of the factorized family. Thus a central\ntheoretical problem is to characterize when A-VI still attains F-VI's optimal\nsolution. We derive conditions on both the model and the inference function\nunder which A-VI can theoretically achieve F-VI's optimum. We show that for a\nbroad class of hierarchical models, including deep generative models, it is\npossible to close the gap between A-VI and F-VI. Further, for an even broader\nclass of models, we establish when and how to expand the domain of the\ninference function to make amortization a feasible strategy. Finally, we prove\nthat for certain models -- including hidden Markov models and Gaussian\nprocesses -- A-VI cannot match F-VI's solution, no matter how expressive the\ninference function is. We also study A-VI empirically [...]\n","authors":["Charles C. Margossian","David M. Blei"],"pdf_url":"https://arxiv.org/pdf/2307.11018v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11017v1","updated":"2023-07-20T16:45:16Z","published":"2023-07-20T16:45:16Z","title":"Multi-objective point cloud autoencoders for explainable myocardial\n infarction prediction","summary":" Myocardial infarction (MI) is one of the most common causes of death in the\nworld. Image-based biomarkers commonly used in the clinic, such as ejection\nfraction, fail to capture more complex patterns in the heart's 3D anatomy and\nthus limit diagnostic accuracy. In this work, we present the multi-objective\npoint cloud autoencoder as a novel geometric deep learning approach for\nexplainable infarction prediction, based on multi-class 3D point cloud\nrepresentations of cardiac anatomy and function. Its architecture consists of\nmultiple task-specific branches connected by a low-dimensional latent space to\nallow for effective multi-objective learning of both reconstruction and MI\nprediction, while capturing pathology-specific 3D shape information in an\ninterpretable latent space. Furthermore, its hierarchical branch design with\npoint cloud-based deep learning operations enables efficient multi-scale\nfeature learning directly on high-resolution anatomy point clouds. 
In our\nexperiments on a large UK Biobank dataset, the multi-objective point cloud\nautoencoder is able to accurately reconstruct multi-temporal 3D shapes with\nChamfer distances between predicted and input anatomies below the underlying\nimages' pixel resolution. Our method outperforms multiple machine learning and\ndeep learning benchmarks for the task of incident MI prediction by 19% in terms\nof Area Under the Receiver Operating Characteristic curve. In addition, its\ntask-specific compact latent space exhibits easily separable control and MI\nclusters with clinically plausible associations between subject encodings and\ncorresponding 3D shapes, thus demonstrating the explainability of the\nprediction.\n","authors":["Marcel Beetz","Abhirup Banerjee","Vicente Grau"],"pdf_url":"https://arxiv.org/pdf/2307.11017v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.01110v3","updated":"2023-07-20T16:44:47Z","published":"2022-07-03T20:07:00Z","title":"Data-Driven Modeling of Noise Time Series with Convolutional Generative\n Adversarial Networks","summary":" Random noise arising from physical processes is an inherent characteristic of\nmeasurements and a limiting factor for most signal processing and data analysis\ntasks. Given the recent interest in generative adversarial networks (GANs) for\ndata-driven modeling, it is important to determine to what extent GANs can\nfaithfully reproduce noise in target data sets. In this paper, we present an\nempirical investigation that aims to shed light on this issue for time series.\nNamely, we assess two general-purpose GANs for time series that are based on\nthe popular deep convolutional GAN (DCGAN) architecture, a direct time-series\nmodel and an image-based model that uses a short-time Fourier transform (STFT)\ndata representation. The GAN models are trained and quantitatively evaluated\nusing distributions of simulated noise time series with known ground-truth\nparameters. Target time series distributions include a broad range of noise\ntypes commonly encountered in physical measurements, electronics, and\ncommunication systems: band-limited thermal noise, power law noise, shot noise,\nand impulsive noise. We find that GANs are capable of learning many noise\ntypes, although they predictably struggle when the GAN architecture is not well\nsuited to some aspects of the noise, e.g., impulsive time-series with extreme\noutliers. Our findings provide insights into the capabilities and potential\nlimitations of current approaches to time-series GANs and highlight areas for\nfurther research. In addition, our battery of tests provides a useful benchmark\nto aid the development of deep generative models for time series.\n","authors":["Adam Wunderlich","Jack Sklar"],"pdf_url":"https://arxiv.org/pdf/2207.01110v3.pdf","comment":"27 pages, 20 figures"},{"id":"http://arxiv.org/abs/2302.06223v3","updated":"2023-07-20T16:40:14Z","published":"2023-02-13T09:54:50Z","title":"Variational Mixture of HyperGenerators for Learning Distributions Over\n Functions","summary":" Recent approaches build on implicit neural representations (INRs) to propose\ngenerative models over function spaces. However, they are computationally\ncostly when dealing with inference tasks, such as missing data imputation, or\ndirectly cannot tackle them. In this work, we propose a novel deep generative\nmodel, named VAMoH. VAMoH combines the capabilities of modeling continuous\nfunctions using INRs and the inference capabilities of Variational Autoencoders\n(VAEs). 
In addition, VAMoH relies on a normalizing flow to define the prior,\nand a mixture of hypernetworks to parametrize the data log-likelihood. This\ngives VAMoH a high expressive capability and interpretability. Through\nexperiments on a diverse range of data types, such as images, voxels, and\nclimate data, we show that VAMoH can effectively learn rich distributions over\ncontinuous functions. Furthermore, it can perform inference-related tasks, such\nas conditional super-resolution generation and in-painting, as well or better\nthan previous approaches, while being less computationally demanding.\n","authors":["Batuhan Koyuncu","Pablo Sanchez-Martin","Ignacio Peis","Pablo M. Olmos","Isabel Valera"],"pdf_url":"https://arxiv.org/pdf/2302.06223v3.pdf","comment":"Accepted at ICML 2023. Camera ready version"},{"id":"http://arxiv.org/abs/2012.07881v2","updated":"2023-07-20T16:38:57Z","published":"2020-12-14T19:02:26Z","title":"Perceptron Theory Can Predict the Accuracy of Neural Networks","summary":" Multilayer neural networks set the current state of the art for many\ntechnical classification problems. But, these networks are still, essentially,\nblack boxes in terms of analyzing them and predicting their performance. Here,\nwe develop a statistical theory for the one-layer perceptron and show that it\ncan predict performances of a surprisingly large variety of neural networks\nwith different architectures. A general theory of classification with\nperceptrons is developed by generalizing an existing theory for analyzing\nreservoir computing models and connectionist models for symbolic reasoning\nknown as vector symbolic architectures. Our statistical theory offers three\nformulas leveraging the signal statistics with increasing detail. The formulas\nare analytically intractable, but can be evaluated numerically. The description\nlevel that captures maximum details requires stochastic sampling methods.\nDepending on the network model, the simpler formulas already yield high\nprediction accuracy. The quality of the theory predictions is assessed in three\nexperimental settings, a memorization task for echo state networks (ESNs) from\nreservoir computing literature, a collection of classification datasets for\nshallow randomly connected networks, and the ImageNet dataset for deep\nconvolutional neural networks. We find that the second description level of the\nperceptron theory can predict the performance of types of ESNs, which could not\nbe described previously. The theory can predict deep multilayer neural networks\nby being applied to their output layer. While other methods for prediction of\nneural networks performance commonly require to train an estimator model, the\nproposed theory requires only the first two moments of the distribution of the\npostsynaptic sums in the output neurons. The perceptron theory compares\nfavorably to other methods that do not rely on training an estimator model.\n","authors":["Denis Kleyko","Antonello Rosato","E. Paxon Frady","Massimo Panella","Friedrich T. Sommer"],"pdf_url":"https://arxiv.org/pdf/2012.07881v2.pdf","comment":"16 pages, 14 figures"},{"id":"http://arxiv.org/abs/2307.11013v1","updated":"2023-07-20T16:38:18Z","published":"2023-07-20T16:38:18Z","title":"Flow Map Learning for Unknown Dynamical Systems: Overview,\n Implementation, and Benchmarks","summary":" Flow map learning (FML), in conjunction with deep neural networks (DNNs), has\nshown promises for data driven modeling of unknown dynamical systems. 
A\nremarkable feature of FML is that it is capable of producing accurate\npredictive models for partially observed systems, even when their exact\nmathematical models do not exist. In this paper, we present an overview of the\nFML framework, along with the important computational details for its\nsuccessful implementation. We also present a set of well defined benchmark\nproblems for learning unknown dynamical systems. All the numerical details of\nthese problems are presented, along with their FML results, to ensure that the\nproblems are accessible for cross-examination and the results are reproducible.\n","authors":["Victor Churchill","Dongbin Xiu"],"pdf_url":"https://arxiv.org/pdf/2307.11013v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11011v1","updated":"2023-07-20T16:36:04Z","published":"2023-07-20T16:36:04Z","title":"Neuron Sensitivity Guided Test Case Selection for Deep Learning Testing","summary":" Deep Neural Networks~(DNNs) have been widely deployed in software to address\nvarious tasks~(e.g., autonomous driving, medical diagnosis). However, they\ncould also produce incorrect behaviors that result in financial losses and even\nthreaten human safety. To reveal the incorrect behaviors in DNN and repair\nthem, DNN developers often collect rich unlabeled datasets from the natural\nworld and label them to test the DNN models. However, properly labeling a large\nnumber of unlabeled datasets is a highly expensive and time-consuming task.\n To address the above-mentioned problem, we propose NSS, Neuron Sensitivity\nguided test case Selection, which can reduce the labeling time by selecting\nvaluable test cases from unlabeled datasets. NSS leverages the internal\nneuron's information induced by test cases to select valuable test cases, which\nhave high confidence in causing the model to behave incorrectly. We evaluate\nNSS with four widely used datasets and four well-designed DNN models compared\nto SOTA baseline methods. The results show that NSS performs well in assessing\nthe test cases' probability of fault triggering and model improvement\ncapabilities. Specifically, compared with baseline approaches, NSS obtains a\nhigher fault detection rate~(e.g., when selecting 5\\% test case from the\nunlabeled dataset in MNIST \\& LeNet1 experiment, NSS can obtain 81.8\\% fault\ndetection rate, 20\\% higher than baselines).\n","authors":["Dong Huang","Qingwen Bu","Yichao Fu","Yuhao Qing","Bocheng Xiao","Heming Cui"],"pdf_url":"https://arxiv.org/pdf/2307.11011v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11007v1","updated":"2023-07-20T16:34:58Z","published":"2023-07-20T16:34:58Z","title":"Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To\n Achieve Better Generalization","summary":" Despite extensive studies, the underlying reason as to why overparameterized\nneural networks can generalize remains elusive. Existing theory shows that\ncommon stochastic optimizers prefer flatter minimizers of the training loss,\nand thus a natural potential explanation is that flatness implies\ngeneralization. This work critically examines this explanation. Through\ntheoretical and empirical investigation, we identify the following three\nscenarios for two-layer ReLU networks: (1) flatness provably implies\ngeneralization; (2) there exist non-generalizing flattest models and sharpness\nminimization algorithms fail to generalize, and (3) perhaps most surprisingly,\nthere exist non-generalizing flattest models, but sharpness minimization\nalgorithms still generalize. 
Our results suggest that the relationship between\nsharpness and generalization subtly depends on the data distributions and the\nmodel architectures and sharpness minimization algorithms do not only minimize\nsharpness to achieve better generalization. This calls for the search for other\nexplanations for the generalization of over-parameterized neural networks.\n","authors":["Kaiyue Wen","Tengyu Ma","Zhiyuan Li"],"pdf_url":"https://arxiv.org/pdf/2307.11007v1.pdf","comment":"34 pages,11 figures"},{"id":"http://arxiv.org/abs/2307.10999v1","updated":"2023-07-20T16:27:51Z","published":"2023-07-20T16:27:51Z","title":"Private Federated Learning with Autotuned Compression","summary":" We propose new techniques for reducing communication in private federated\nlearning without the need for setting or tuning compression rates. Our\non-the-fly methods automatically adjust the compression rate based on the error\ninduced during training, while maintaining provable privacy guarantees through\nthe use of secure aggregation and differential privacy. Our techniques are\nprovably instance-optimal for mean estimation, meaning that they can adapt to\nthe ``hardness of the problem\" with minimal interactivity. We demonstrate the\neffectiveness of our approach on real-world datasets by achieving favorable\ncompression rates without the need for tuning.\n","authors":["Enayat Ullah","Christopher A. Choquette-Choo","Peter Kairouz","Sewoong Oh"],"pdf_url":"https://arxiv.org/pdf/2307.10999v1.pdf","comment":"Accepted to ICML 2023"},{"id":"http://arxiv.org/abs/2307.10997v1","updated":"2023-07-20T16:25:58Z","published":"2023-07-20T16:25:58Z","title":"DREAM: Domain-free Reverse Engineering Attributes of Black-box Model","summary":" Deep learning models are usually black boxes when deployed on machine\nlearning platforms. Prior works have shown that the attributes ($e.g.$, the\nnumber of convolutional layers) of a target black-box neural network can be\nexposed through a sequence of queries. There is a crucial limitation: these\nworks assume the dataset used for training the target model to be known\nbeforehand and leverage this dataset for model attribute attack. However, it is\ndifficult to access the training dataset of the target black-box model in\nreality. Therefore, whether the attributes of a target black-box model could be\nstill revealed in this case is doubtful. In this paper, we investigate a new\nproblem of Domain-agnostic Reverse Engineering the Attributes of a black-box\ntarget Model, called DREAM, without requiring the availability of the target\nmodel's training dataset, and put forward a general and principled framework by\ncasting this problem as an out of distribution (OOD) generalization problem. In\nthis way, we can learn a domain-agnostic model to inversely infer the\nattributes of a target black-box model with unknown training data. 
This makes\nour method one that can be applied gracefully to an arbitrary domain\nfor model attribute reverse engineering with strong generalization ability.\nExtensive experimental studies are conducted and the results validate the\nsuperiority of our proposed method over the baselines.\n","authors":["Rongqing Li","Jiaqi Yu","Changsheng Li","Wenhan Luo","Ye Yuan","Guoren Wang"],"pdf_url":"https://arxiv.org/pdf/2307.10997v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10994v1","updated":"2023-07-20T16:25:00Z","published":"2023-07-20T16:25:00Z","title":"Progressive distillation diffusion for raw music generation","summary":" This paper aims to apply a new deep learning approach to the task of\ngenerating raw audio files. It is based on diffusion models, a recent type of\ndeep generative model. This new type of method has recently shown outstanding\nresults with image generation. A lot of focus has been given to those models by\nthe computer vision community. On the other hand, far less attention has been\ngiven to other types of applications, such as music generation in the waveform\ndomain.\n In this paper, a model for unconditional generation applied to music is\nimplemented: progressive distillation diffusion with a 1D U-Net. Then, a\ncomparison of different diffusion parameters and their effect on the final\nresult is presented. One big advantage of the methods implemented in this work\nis that the model is able to handle progressive audio processing and\ngeneration, using a transformation from 1-channel 128 x 384 to 3-channel 128 x\n128 mel-spectrograms and looped generation. The empirical comparisons are\ncarried out across different self-collected datasets.\n","authors":["Svetlana Pavlova"],"pdf_url":"https://arxiv.org/pdf/2307.10994v1.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2207.12395v3","updated":"2023-07-20T16:21:58Z","published":"2022-07-25T17:58:09Z","title":"Tuning Stochastic Gradient Algorithms for Statistical Inference via\n Large-Sample Asymptotics","summary":" The tuning of stochastic gradient algorithms (SGAs) for optimization and\nsampling is often based on heuristics and trial-and-error rather than\ngeneralizable theory. We address this theory--practice gap by characterizing\nthe large-sample statistical asymptotics of SGAs via a joint\nstep-size--sample-size scaling limit. We show that iterate averaging with a\nlarge fixed step size is robust to the choice of tuning parameters and\nasymptotically has covariance proportional to that of the MLE sampling\ndistribution. We also prove a Bernstein--von Mises-like theorem to guide\ntuning, including for generalized posteriors that are robust to model\nmisspecification. Numerical experiments validate our results and\nrecommendations in realistic finite-sample regimes. Our work lays the\nfoundation for a systematic analysis of other stochastic gradient Markov chain\nMonte Carlo algorithms for a wide range of models.\n","authors":["Jeffrey Negrea","Jun Yang","Haoyue Feng","Daniel M. Roy","Jonathan H. Huggins"],"pdf_url":"https://arxiv.org/pdf/2207.12395v3.pdf","comment":"42 pgs"},{"id":"http://arxiv.org/abs/2307.10988v1","updated":"2023-07-20T16:18:33Z","published":"2023-07-20T16:18:33Z","title":"Investigating minimizing the training set fill distance in machine\n learning regression","summary":" Many machine learning regression methods leverage large datasets for training\npredictive models. However, using large datasets may not be feasible due to\ncomputational limitations or high labelling costs. 
Therefore, sampling small\ntraining sets from large pools of unlabelled data points is essential to\nmaximize model performance while maintaining computational efficiency. In this\nwork, we study a sampling approach aimed to minimize the fill distance of the\nselected set. We derive an upper bound for the maximum expected prediction\nerror that linearly depends on the training set fill distance, conditional to\nthe knowledge of data features. For empirical validation, we perform\nexperiments using two regression models on two datasets. We empirically show\nthat selecting a training set by aiming to minimize the fill distance, thereby\nminimizing the bound, significantly reduces the maximum prediction error of\nvarious regression models, outperforming existing sampling approaches by a\nlarge margin.\n","authors":["Paolo Climaco","Jochen Garcke"],"pdf_url":"https://arxiv.org/pdf/2307.10988v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09943v2","updated":"2023-07-20T16:11:39Z","published":"2023-07-19T12:35:16Z","title":"Impatient Bandits: Optimizing Recommendations for the Long-Term Without\n Delay","summary":" Recommender systems are a ubiquitous feature of online platforms.\nIncreasingly, they are explicitly tasked with increasing users' long-term\nsatisfaction. In this context, we study a content exploration task, which we\nformalize as a multi-armed bandit problem with delayed rewards. We observe that\nthere is an apparent trade-off in choosing the learning signal: Waiting for the\nfull reward to become available might take several weeks, hurting the rate at\nwhich learning happens, whereas measuring short-term proxy rewards reflects the\nactual long-term goal only imperfectly. We address this challenge in two steps.\nFirst, we develop a predictive model of delayed rewards that incorporates all\ninformation obtained to date. Full observations as well as partial (short or\nmedium-term) outcomes are combined through a Bayesian filter to obtain a\nprobabilistic belief. Second, we devise a bandit algorithm that takes advantage\nof this new predictive model. The algorithm quickly learns to identify content\naligned with long-term success by carefully balancing exploration and\nexploitation. We apply our approach to a podcast recommendation problem, where\nwe seek to identify shows that users engage with repeatedly over two months. We\nempirically validate that our approach results in substantially better\nperformance compared to approaches that either optimize for short-term proxies,\nor wait for the long-term outcome to be fully realized.\n","authors":["Thomas M. McDonald","Lucas Maystre","Mounia Lalmas","Daniel Russo","Kamil Ciosek"],"pdf_url":"https://arxiv.org/pdf/2307.09943v2.pdf","comment":"Presented at the 29th ACM SIGKDD Conference on Knowledge Discovery\n and Data Mining (KDD '23)"},{"id":"http://arxiv.org/abs/2307.10982v1","updated":"2023-07-20T16:09:57Z","published":"2023-07-20T16:09:57Z","title":"MASR: Metadata Aware Speech Representation","summary":" In the recent years, speech representation learning is constructed primarily\nas a self-supervised learning (SSL) task, using the raw audio signal alone,\nwhile ignoring the side-information that is often available for a given speech\nrecording. In this paper, we propose MASR, a Metadata Aware Speech\nRepresentation learning framework, which addresses the aforementioned\nlimitations. MASR enables the inclusion of multiple external knowledge sources\nto enhance the utilization of meta-data information. 
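An aside on the fill-distance entry above (Climaco and Garcke): greedily adding the pool point that is farthest from the already-selected set is the standard way to shrink the fill distance of a selected subset. The sketch below only illustrates that idea on synthetic data; the helper name and the use of numpy are assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' implementation): greedy farthest-point
# selection of a small training set, which greedily reduces the fill distance,
# i.e. the largest distance from any pool point to its nearest selected point.
import numpy as np

def select_by_fill_distance(pool: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Return indices of k points chosen greedily from pool of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n = pool.shape[0]
    selected = [int(rng.integers(n))]                  # arbitrary starting point
    # distance from every pool point to its nearest selected point
    dists = np.linalg.norm(pool - pool[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))                    # point realizing the current fill distance
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(pool - pool[nxt], axis=1))
    return np.array(selected)

# Example: pick 100 of 10,000 random 5-D pool points for labelling/training.
X = np.random.default_rng(1).normal(size=(10_000, 5))
idx = select_by_fill_distance(X, k=100)
print(idx.shape)  # (100,)
```

Each greedy step picks the point that currently attains the fill distance, so the error bound described in the abstract, which grows with the fill distance, tightens as points are added.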
The external knowledge\nsources are incorporated in the form of sample-level pair-wise similarity\nmatrices that are useful in a hard-mining loss. A key advantage of the MASR\nframework is that it can be combined with any choice of SSL method. Using MASR\nrepresentations, we perform evaluations on several downstream tasks such as\nlanguage identification, speech recognition and other non-semantic tasks such\nas speaker and emotion recognition. In these experiments, we illustrate\nsignificant performance improvements for the MASR over other established\nbenchmarks. We perform a detailed analysis on the language identification task\nto provide insights on how the proposed loss function enables the\nrepresentations to separate closely related languages.\n","authors":["Anjali Raj","Shikhar Bharadwaj","Sriram Ganapathy","Min Ma","Shikhar Vashishth"],"pdf_url":"https://arxiv.org/pdf/2307.10982v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10981v1","updated":"2023-07-20T16:09:07Z","published":"2023-07-20T16:09:07Z","title":"PATROL: Privacy-Oriented Pruning for Collaborative Inference Against\n Model Inversion Attacks","summary":" Collaborative inference has been a promising solution to enable\nresource-constrained edge devices to perform inference using state-of-the-art\ndeep neural networks (DNNs). In collaborative inference, the edge device first\nfeeds the input to a partial DNN locally and then uploads the intermediate\nresult to the cloud to complete the inference. However, recent research\nindicates model inversion attacks (MIAs) can reconstruct input data from\nintermediate results, posing serious privacy concerns for collaborative\ninference. Existing perturbation and cryptography techniques are inefficient\nand unreliable in defending against MIAs while performing accurate inference.\nThis paper provides a viable solution, named PATROL, which develops\nprivacy-oriented pruning to balance privacy, efficiency, and utility of\ncollaborative inference. PATROL takes advantage of the fact that later layers\nin a DNN can extract more task-specific features. Given limited local resources\nfor collaborative inference, PATROL intends to deploy more layers at the edge\nbased on pruning techniques to enforce task-specific features for inference and\nreduce task-irrelevant but sensitive features for privacy preservation. To\nachieve privacy-oriented pruning, PATROL introduces two key components:\nLipschitz regularization and adversarial reconstruction training, which\nincrease the reconstruction errors by reducing the stability of MIAs and\nenhance the target inference model by adversarial training, respectively.\n","authors":["Shiwei Ding","Lan Zhang","Miao Pan","Xiaoyong Yuan"],"pdf_url":"https://arxiv.org/pdf/2307.10981v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.03017v3","updated":"2023-07-20T16:05:39Z","published":"2023-05-04T17:43:19Z","title":"Improving Code Example Recommendations on Informal Documentation Using\n BERT and Query-Aware LSH: A Comparative Study","summary":" Our research investigates the recommendation of code examples to aid software\ndevelopers, a practice that saves developers significant time by providing\nready-to-use code snippets. The focus of our study is Stack Overflow, a\ncommonly used resource for coding discussions and solutions, particularly in\nthe context of the Java programming language. 
We applied BERT, a powerful Large\nLanguage Model (LLM) that enables us to transform code examples into numerical\nvectors by extracting their semantic information. Once these numerical\nrepresentations are prepared, we identify Approximate Nearest Neighbors (ANN)\nusing Locality-Sensitive Hashing (LSH). Our research employed two variants of\nLSH: Random Hyperplane-based LSH and Query-Aware LSH. We rigorously compared\nthese two approaches across four parameters: HitRate, Mean Reciprocal Rank\n(MRR), Average Execution Time, and Relevance. Our study revealed that the\nQuery-Aware (QA) approach showed superior performance over the Random\nHyperplane-based (RH) method. Specifically, it exhibited a notable improvement\nof 20% to 35% in HitRate for query pairs compared to the RH approach.\nFurthermore, the QA approach proved significantly more time-efficient, with its\nspeed in creating hashing tables and assigning data samples to buckets being at\nleast four times faster. It can return code examples within milliseconds,\nwhereas the RH approach typically requires several seconds to recommend code\nexamples. Due to the superior performance of the QA approach, we tested it\nagainst PostFinder and FaCoY, the state-of-the-art baselines. Our QA method\nshowed comparable efficiency proving its potential for effective code\nrecommendation.\n","authors":["Sajjad Rahmani","AmirHossein Naghshzan","Latifa Guerrouj"],"pdf_url":"https://arxiv.org/pdf/2305.03017v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.12619v2","updated":"2023-07-20T16:04:19Z","published":"2023-06-22T01:14:47Z","title":"Class-Incremental Learning based on Label Generation","summary":" Despite the great success of pre-trained language models, it is still a\nchallenge to use these models for continual learning, especially for the\nclass-incremental learning (CIL) setting due to catastrophic forgetting (CF).\nThis paper reports our finding that if we formulate CIL as a continual label\ngeneration problem, CF is drastically reduced and the generalizable\nrepresentations of pre-trained models can be better retained. We thus propose a\nnew CIL method (VAG) that also leverages the sparsity of vocabulary to focus\nthe generation and creates pseudo-replay samples by using label semantics.\nExperimental results show that VAG outperforms baselines by a large margin.\n","authors":["Yijia Shao","Yiduo Guo","Dongyan Zhao","Bing Liu"],"pdf_url":"https://arxiv.org/pdf/2306.12619v2.pdf","comment":"12 pages, ACL 2023 Main Conference"},{"id":"http://arxiv.org/abs/2307.10975v1","updated":"2023-07-20T16:04:07Z","published":"2023-07-20T16:04:07Z","title":"Globally Normalising the Transducer for Streaming Speech Recognition","summary":" The Transducer (e.g. RNN-Transducer or Conformer-Transducer) generates an\noutput label sequence as it traverses the input sequence. It is straightforward\nto use in streaming mode, where it generates partial hypotheses before the\ncomplete input has been seen. This makes it popular in speech recognition.\nHowever, in streaming mode the Transducer has a mathematical flaw which, simply\nput, restricts the model's ability to change its mind. The fix is to replace\nlocal normalisation (e.g. a softmax) with global normalisation, but then the\nloss function becomes impossible to evaluate exactly. 
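Returning to the BERT-plus-LSH code-recommendation entry above (Rahmani et al.): the random hyperplane variant hashes each embedding vector by the sign pattern of a few random projections, so nearby vectors tend to share a bucket and become approximate-nearest-neighbour candidates. The snippet below is a hedged sketch with synthetic embeddings and invented names; the Query-Aware variant from that entry is not reproduced.

```python
# Illustrative sketch only: random hyperplane LSH over embedding vectors
# (stand-ins for BERT embeddings of code snippets). Vectors whose projections
# have the same sign pattern land in the same bucket.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 768, 8                        # 8 hyperplanes -> 256 buckets
planes = rng.normal(size=(n_planes, dim))     # one random hyperplane per row

def bucket_key(vec: np.ndarray) -> int:
    """Hash a vector to a bucket via the sign pattern of its projections."""
    bits = (planes @ vec) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

snippets = rng.normal(size=(5_000, dim))      # synthetic snippet embeddings
index = defaultdict(list)
for i, v in enumerate(snippets):
    index[bucket_key(v)].append(i)

def recommend(query_vec: np.ndarray, top_k: int = 5) -> list:
    """Rank the query's bucket mates by cosine similarity."""
    candidates = index.get(bucket_key(query_vec), [])
    if not candidates:
        return []
    cand = snippets[candidates]
    sims = cand @ query_vec / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query_vec))
    order = np.argsort(-sims)[:top_k]
    return [candidates[i] for i in order]

print(recommend(snippets[0]))                 # the query snippet itself ranks first
```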
A recent paper proposes\nto solve this by approximating the model, severely degrading performance.\nInstead, this paper proposes to approximate the loss function, allowing global\nnormalisation to apply to a state-of-the-art streaming model. Global\nnormalisation reduces its word error rate by 9-11% relative, closing almost\nhalf the gap between streaming and lookahead mode.\n","authors":["Rogier van Dalen"],"pdf_url":"https://arxiv.org/pdf/2307.10975v1.pdf","comment":"9 pages plus references and appendices"},{"id":"http://arxiv.org/abs/2210.06089v2","updated":"2023-07-20T16:01:03Z","published":"2022-10-12T11:04:22Z","title":"When are Local Queries Useful for Robust Learning?","summary":" Distributional assumptions have been shown to be necessary for the robust\nlearnability of concept classes when considering the exact-in-the-ball robust\nrisk and access to random examples by Gourdeau et al. (2019). In this paper, we\nstudy learning models where the learner is given more power through the use of\nlocal queries, and give the first distribution-free algorithms that perform\nrobust empirical risk minimization (ERM) for this notion of robustness. The\nfirst learning model we consider uses local membership queries (LMQ), where the\nlearner can query the label of points near the training sample. We show that,\nunder the uniform distribution, LMQs do not increase the robustness threshold\nof conjunctions and any superclass, e.g., decision lists and halfspaces. Faced\nwith this negative result, we introduce the local equivalence query\n($\\mathsf{LEQ}$) oracle, which returns whether the hypothesis and target\nconcept agree in the perturbation region around a point in the training sample,\nas well as a counterexample if it exists. We show a separation result: on the\none hand, if the query radius $\\lambda$ is strictly smaller than the\nadversary's perturbation budget $\\rho$, then distribution-free robust learning\nis impossible for a wide variety of concept classes; on the other hand, the\nsetting $\\lambda=\\rho$ allows us to develop robust ERM algorithms. We then\nbound the query complexity of these algorithms based on online learning\nguarantees and further improve these bounds for the special case of\nconjunctions. We finish by giving robust learning algorithms for halfspaces on\n$\\{0,1\\}^n$ and then obtaining robustness guarantees for halfspaces in\n$\\mathbb{R}^n$ against precision-bounded adversaries.\n","authors":["Pascale Gourdeau","Varun Kanade","Marta Kwiatkowska","James Worrell"],"pdf_url":"https://arxiv.org/pdf/2210.06089v2.pdf","comment":"Accepted to NeurIPS 2022; V2 contains new results (Section 3.6) and\n an erratum from the previous version (Appendix C)"},{"id":"http://arxiv.org/abs/2204.06362v2","updated":"2023-07-20T15:48:35Z","published":"2022-04-13T13:16:21Z","title":"A Review of Machine Learning Methods Applied to Structural Dynamics and\n Vibroacoustic","summary":" The use of Machine Learning (ML) has rapidly spread across several fields,\nhaving encountered many applications in Structural Dynamics and Vibroacoustic\n(SD\\&V). The increasing capabilities of ML to unveil insights from data, driven\nby unprecedented data availability, algorithms advances and computational\npower, enhance decision making, uncertainty handling, patterns recognition and\nreal-time assessments. Three main applications in SD\\&V have taken advantage of\nthese benefits. In Structural Health Monitoring, ML detection and prognosis\nlead to safe operation and optimized maintenance schedules. 
System\nidentification and control design are leveraged by ML techniques in Active\nNoise Control and Active Vibration Control. Finally, the so-called ML-based\nsurrogate models provide fast alternatives to costly simulations, enabling\nrobust and optimized product design. Despite the many works in the area, they\nhave not been reviewed and analyzed. Therefore, to keep track and understand\nthis ongoing integration of fields, this paper presents a survey of ML\napplications in SD\\&V analyses, shedding light on the current state of\nimplementation and emerging opportunities. The main methodologies, advantages,\nlimitations, and recommendations based on scientific knowledge were identified\nfor each of the three applications. Moreover, the paper considers the role of\nDigital Twins and Physics Guided ML to overcome current challenges and power\nfuture research progress. As a result, the survey provides a broad overview of\nthe present landscape of ML applied in SD\\&V and guides the reader to an\nadvanced understanding of progress and prospects in the field.\n","authors":["Barbara Cunha","Christophe Droz","Abdelmalek Zine","Stéphane Foulard","Mohamed Ichchou"],"pdf_url":"https://arxiv.org/pdf/2204.06362v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10936v1","updated":"2023-07-20T15:09:06Z","published":"2023-07-20T15:09:06Z","title":"PASTA: Pretrained Action-State Transformer Agents","summary":" Self-supervised learning has brought about a revolutionary paradigm shift in\nvarious computing domains, including NLP, vision, and biology. Recent\napproaches involve pre-training transformer models on vast amounts of unlabeled\ndata, serving as a starting point for efficiently solving downstream tasks. In\nthe realm of reinforcement learning, researchers have recently adapted these\napproaches by developing models pre-trained on expert trajectories, enabling\nthem to address a wide range of tasks, from robotics to recommendation systems.\nHowever, existing methods mostly rely on intricate pre-training objectives\ntailored to specific downstream applications. This paper presents a\ncomprehensive investigation of models we refer to as Pretrained Action-State\nTransformer Agents (PASTA). Our study uses a unified methodology and covers an\nextensive set of general downstream tasks including behavioral cloning, offline\nRL, sensor failure robustness, and dynamics change adaptation. Our goal is to\nsystematically compare various design choices and provide valuable insights to\npractitioners for building robust models. Key highlights of our study include\ntokenization at the action and state component level, using fundamental\npre-training objectives like next token prediction, training models across\ndiverse domains simultaneously, and using parameter efficient fine-tuning\n(PEFT). The developed models in our study contain fewer than 10 million\nparameters and the application of PEFT enables fine-tuning of fewer than 10,000\nparameters during downstream adaptation, allowing a broad community to use\nthese models and reproduce our experiments. 
We hope that this study will\nencourage further research into the use of transformers with first-principles\ndesign choices to represent RL trajectories and contribute to robust policy\nlearning.\n","authors":["Raphael Boige","Yannis Flet-Berliac","Arthur Flajolet","Guillaume Richard","Thomas Pierrot"],"pdf_url":"https://arxiv.org/pdf/2307.10936v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10935v1","updated":"2023-07-20T15:07:49Z","published":"2023-07-20T15:07:49Z","title":"Inorganic synthesis-structure maps in zeolites with machine learning and\n crystallographic distances","summary":" Zeolites are inorganic materials known for their diversity of applications,\nsynthesis conditions, and resulting polymorphs. Although their synthesis is\ncontrolled both by inorganic and organic synthesis conditions, computational\nstudies of zeolite synthesis have focused mostly on organic template design. In\nthis work, we use a strong distance metric between crystal structures and\nmachine learning (ML) to create inorganic synthesis maps in zeolites. Starting\nwith 253 known zeolites, we show how the continuous distances between\nframeworks reproduce inorganic synthesis conditions from the literature without\nusing labels such as building units. An unsupervised learning analysis shows\nthat neighboring zeolites according to our metric often share similar inorganic\nsynthesis conditions, even in template-based routes. In combination with ML\nclassifiers, we find synthesis-structure relationships for 14 common inorganic\nconditions in zeolites, namely Al, B, Be, Ca, Co, F, Ga, Ge, K, Mg, Na, P, Si,\nand Zn. By explaining the model predictions, we demonstrate how\n(dis)similarities towards known structures can be used as features for the\nsynthesis space. Finally, we show how these methods can be used to predict\ninorganic synthesis conditions for unrealized frameworks in hypothetical\ndatabases and interpret the outcomes by extracting local structural patterns\nfrom zeolites. In combination with template design, this work can accelerate\nthe exploration of the space of synthesis conditions for zeolites.\n","authors":["Daniel Schwalbe-Koda","Daniel E. Widdowson","Tuan Anh Pham","Vitaliy A. Kurlin"],"pdf_url":"https://arxiv.org/pdf/2307.10935v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10927v1","updated":"2023-07-20T14:56:29Z","published":"2023-07-20T14:56:29Z","title":"Modeling 3D cardiac contraction and relaxation with point cloud\n deformation networks","summary":" Global single-valued biomarkers of cardiac function typically used in\nclinical practice, such as ejection fraction, provide limited insight on the\ntrue 3D cardiac deformation process and hence, limit the understanding of both\nhealthy and pathological cardiac mechanics. In this work, we propose the Point\nCloud Deformation Network (PCD-Net) as a novel geometric deep learning approach\nto model 3D cardiac contraction and relaxation between the extreme ends of the\ncardiac cycle. It employs the recent advances in point cloud-based deep\nlearning into an encoder-decoder structure, in order to enable efficient\nmulti-scale feature learning directly on multi-class 3D point cloud\nrepresentations of the cardiac anatomy. We evaluate our approach on a large\ndataset of over 10,000 cases from the UK Biobank study and find average Chamfer\ndistances between the predicted and ground truth anatomies below the pixel\nresolution of the underlying image acquisition. 
Furthermore, we observe similar\nclinical metrics between predicted and ground truth populations and show that\nthe PCD-Net can successfully capture subpopulation-specific differences between\nnormal subjects and myocardial infarction (MI) patients. We then demonstrate\nthat the learned 3D deformation patterns outperform multiple clinical\nbenchmarks by 13% and 7% in terms of area under the receiver operating\ncharacteristic curve for the tasks of prevalent MI detection and incident MI\nprediction, and by 7% in terms of Harrell's concordance index for MI survival\nanalysis.\n","authors":["Marcel Beetz","Abhirup Banerjee","Vicente Grau"],"pdf_url":"https://arxiv.org/pdf/2307.10927v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10926v1","updated":"2023-07-20T14:52:45Z","published":"2023-07-20T14:52:45Z","title":"Confidence intervals for performance estimates in 3D medical image\n segmentation","summary":" Medical segmentation models are evaluated empirically. As such an evaluation\nis based on a limited set of example images, it is unavoidably noisy. Beyond a\nmean performance measure, reporting confidence intervals is thus crucial.\nHowever, this is rarely done in medical image segmentation. The width of the\nconfidence interval depends on the test set size and on the spread of the\nperformance measure (its standard deviation across the test set). For\nclassification, many test images are needed to avoid wide confidence intervals.\nSegmentation, however, has not been studied, and it differs by the amount of\ninformation brought by a given test image. In this paper, we study the typical\nconfidence intervals in medical image segmentation. We carry out experiments on\n3D image segmentation using the standard nnU-net framework, two datasets from\nthe Medical Decathlon challenge and two performance measures: the Dice accuracy\nand the Hausdorff distance. We show that the parametric confidence intervals are\nreasonable approximations of the bootstrap estimates for varying test set sizes\nand spread of the performance metric. Importantly, we show that the test size\nneeded to achieve a given precision is often much lower than for classification\ntasks. Typically, a 1% wide confidence interval requires about 100-200 test\nsamples when the spread is low (standard deviation around 3%). More difficult\nsegmentation tasks may lead to higher spreads and require over 1000 samples.\n","authors":["R. El Jurdi","G. Varoquax","O. Colliot"],"pdf_url":"https://arxiv.org/pdf/2307.10926v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2307.10923v1","updated":"2023-07-20T14:49:58Z","published":"2023-07-20T14:49:58Z","title":"Sequential Multi-Dimensional Self-Supervised Learning for Clinical Time\n Series","summary":" Self-supervised learning (SSL) for clinical time series data has received\nsignificant attention in recent literature, since these data are highly rich\nand provide important information about a patient's physiological state.\nHowever, most existing SSL methods for clinical time series are limited in that\nthey are designed for unimodal time series, such as a sequence of structured\nfeatures (e.g., lab values and vital signs) or an individual high-dimensional\nphysiological signal (e.g., an electrocardiogram). These existing methods\ncannot be readily extended to model time series that exhibit multimodality,\nwith structured features and high-dimensional data being recorded at each\ntimestep in the sequence. 
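A worked companion to the confidence-interval entry above (El Jurdi et al.): with a per-case spread of about 3% and roughly 150 test cases, the parametric interval for a mean Dice score is already about 1% wide and closely matches a bootstrap percentile interval. The sketch below uses simulated scores and assumed sample sizes purely for illustration.

```python
# Illustrative sketch: parametric vs. bootstrap 95% confidence interval for a
# mean Dice score over a test set (values are simulated, not data from the paper).
import numpy as np

rng = np.random.default_rng(0)
dice = np.clip(rng.normal(loc=0.85, scale=0.03, size=150), 0.0, 1.0)  # 150 test cases

n = dice.size
mean, sd = dice.mean(), dice.std(ddof=1)

# Parametric (normal approximation) interval for the mean.
half = 1.96 * sd / np.sqrt(n)
param_ci = (mean - half, mean + half)

# Bootstrap percentile interval for the mean.
boot_means = np.array([rng.choice(dice, size=n, replace=True).mean() for _ in range(10_000)])
boot_lo, boot_hi = np.percentile(boot_means, [2.5, 97.5])

print(f"mean Dice = {mean:.4f}")
print(f"parametric 95% CI = ({param_ci[0]:.4f}, {param_ci[1]:.4f}), width = {2 * half:.4f}")
print(f"bootstrap  95% CI = ({boot_lo:.4f}, {boot_hi:.4f}), width = {boot_hi - boot_lo:.4f}")
```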
In this work, we address this gap and propose a new\nSSL method -- Sequential Multi-Dimensional SSL -- where a SSL loss is applied\nboth at the level of the entire sequence and at the level of the individual\nhigh-dimensional data points in the sequence in order to better capture\ninformation at both scales. Our strategy is agnostic to the specific form of\nloss function used at each level -- it can be contrastive, as in SimCLR, or\nnon-contrastive, as in VICReg. We evaluate our method on two real-world\nclinical datasets, where the time series contains sequences of (1)\nhigh-frequency electrocardiograms and (2) structured data from lab values and\nvitals signs. Our experimental results indicate that pre-training with our\nmethod and then fine-tuning on downstream tasks improves performance over\nbaselines on both datasets, and in several settings, can lead to improvements\nacross different self-supervised loss functions.\n","authors":["Aniruddh Raghu","Payal Chandak","Ridwan Alam","John Guttag","Collin M. Stultz"],"pdf_url":"https://arxiv.org/pdf/2307.10923v1.pdf","comment":"ICML 2023"},{"id":"http://arxiv.org/abs/2307.10922v1","updated":"2023-07-20T14:47:50Z","published":"2023-07-20T14:47:50Z","title":"Language-based Action Concept Spaces Improve Video Self-Supervised\n Learning","summary":" Recent contrastive language image pre-training has led to learning highly\ntransferable and robust image representations. However, adapting these models\nto video domains with minimal supervision remains an open problem. We explore a\nsimple step in that direction, using language tied self-supervised learning to\nadapt an image CLIP model to the video domain. A backbone modified for temporal\nmodeling is trained under self-distillation settings with train objectives\noperating in an action concept space. Feature vectors of various action\nconcepts extracted from a language encoder using relevant textual prompts\nconstruct this space. We introduce two train objectives, concept distillation\nand concept alignment, that retain generality of original representations while\nenforcing relations between actions and their attributes. Our approach improves\nzero-shot and linear probing performance on three action recognition\nbenchmarks.\n","authors":["Kanchana Ranasinghe","Michael Ryoo"],"pdf_url":"https://arxiv.org/pdf/2307.10922v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.14319v3","updated":"2023-07-20T14:37:12Z","published":"2022-12-29T14:28:32Z","title":"Gaussian Process Priors for Systems of Linear Partial Differential\n Equations with Constant Coefficients","summary":" Partial differential equations (PDEs) are important tools to model physical\nsystems and including them into machine learning models is an important way of\nincorporating physical knowledge. Given any system of linear PDEs with constant\ncoefficients, we propose a family of Gaussian process (GP) priors, which we\ncall EPGP, such that all realizations are exact solutions of this system. We\napply the Ehrenpreis-Palamodov fundamental principle, which works as a\nnon-linear Fourier transform, to construct GP kernels mirroring standard\nspectral methods for GPs. Our approach can infer probable solutions of linear\nPDE systems from any data such as noisy measurements, or pointwise defined\ninitial and boundary conditions. Constructing EPGP-priors is algorithmic,\ngenerally applicable, and comes with a sparse version (S-EPGP) that learns the\nrelevant spectral frequencies and works better for big data sets. 
We\ndemonstrate our approach on three families of systems of PDEs, the heat\nequation, wave equation, and Maxwell's equations, where we improve upon the\nstate of the art in computation time and precision, in some experiments by\nseveral orders of magnitude.\n","authors":["Marc Härkönen","Markus Lange-Hegermann","Bogdan Raiţă"],"pdf_url":"https://arxiv.org/pdf/2212.14319v3.pdf","comment":"26 pages, 8 figures; ICML 2023 (oral); updated with expanded\n appendices and ancillary files. Code available at\n https://github.com/haerski/EPGP. For animations, see\n https://mathrepo.mis.mpg.de/EPGP/index.html"},{"id":"http://arxiv.org/abs/2307.00405v2","updated":"2023-07-20T14:36:11Z","published":"2023-07-01T18:35:21Z","title":"Provably Efficient UCB-type Algorithms For Learning Predictive State\n Representations","summary":" The general sequential decision-making problem, which includes Markov\ndecision processes (MDPs) and partially observable MDPs (POMDPs) as special\ncases, aims at maximizing a cumulative reward by making a sequence of decisions\nbased on a history of observations and actions over time. Recent studies have\nshown that the sequential decision-making problem is statistically learnable if\nit admits a low-rank structure modeled by predictive state representations\n(PSRs). Despite these advancements, existing approaches typically involve\noracles or steps that are not computationally efficient. On the other hand, the\nupper confidence bound (UCB) based approaches, which have served successfully\nas computationally efficient methods in bandits and MDPs, have not been\ninvestigated for more general PSRs, due to the difficulty of optimistic bonus\ndesign in these more challenging settings. This paper proposes the first known\nUCB-type approach for PSRs, featuring a novel bonus term that upper bounds the\ntotal variation distance between the estimated and true models. We further\ncharacterize the sample complexity bounds for our designed UCB-type algorithms\nfor both online and offline PSRs. In contrast to existing approaches for PSRs,\nour UCB-type algorithms enjoy computational efficiency, last-iterate guaranteed\nnear-optimal policy, and guaranteed model accuracy.\n","authors":["Ruiquan Huang","Yingbin Liang","Jing Yang"],"pdf_url":"https://arxiv.org/pdf/2307.00405v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10907v1","updated":"2023-07-20T14:29:51Z","published":"2023-07-20T14:29:51Z","title":"The Role of Entropy and Reconstruction in Multi-View Self-Supervised\n Learning","summary":" The mechanisms behind the success of multi-view self-supervised learning\n(MVSSL) are not yet fully understood. Contrastive MVSSL methods have been\nstudied through the lens of InfoNCE, a lower bound of the Mutual Information\n(MI). However, the relation between other MVSSL methods and MI remains unclear.\nWe consider a different lower bound on the MI consisting of an entropy and a\nreconstruction term (ER), and analyze the main MVSSL families through its lens.\nThrough this ER bound, we show that clustering-based methods such as\nDeepCluster and SwAV maximize the MI. We also re-interpret the mechanisms of\ndistillation-based approaches such as BYOL and DINO, showing that they\nexplicitly maximize the reconstruction term and implicitly encourage a stable\nentropy, and we confirm this empirically. 
We show that replacing the objectives\nof common MVSSL methods with this ER bound achieves competitive performance,\nwhile making them stable when training with smaller batch sizes or smaller\nexponential moving average (EMA) coefficients.\n Github repo: https://github.com/apple/ml-entropy-reconstruction.\n","authors":["Borja Rodríguez-Gálvez","Arno Blaas","Pau Rodríguez","Adam Goliński","Xavier Suau","Jason Ramapuram","Dan Busbridge","Luca Zappella"],"pdf_url":"https://arxiv.org/pdf/2307.10907v1.pdf","comment":"18 pages: 9 of main text, 2 of references, and 7 of supplementary\n material. Appears in the proceedings of ICML 2023"},{"id":"http://arxiv.org/abs/2110.05216v2","updated":"2023-07-20T14:29:07Z","published":"2021-10-11T12:32:56Z","title":"High-order Tensor Pooling with Attention for Action Recognition","summary":" We aim at capturing high-order statistics of feature vectors formed by a\nneural network, and propose end-to-end second- and higher-order pooling to form\na tensor descriptor. Tensor descriptors require a robust similarity measure due\nto low numbers of aggregated vectors and the burstiness phenomenon, when a\ngiven feature appears more/less frequently than statistically expected. The\nHeat Diffusion Process (HDP) on a graph Laplacian is closely related to the\nEigenvalue Power Normalization (EPN) of the covariance/auto-correlation matrix,\nwhose inverse forms a loopy graph Laplacian. We show that the HDP and the EPN\nplay the same role, i.e., to boost or dampen the magnitude of the eigenspectrum\nthus preventing the burstiness. We equip higher-order tensors with EPN which\nacts as a spectral detector of higher-order occurrences to prevent burstiness.\nWe also prove that for a tensor of order r built from d dimensional feature\ndescriptors, such a detector gives the likelihood if at least one higher-order\noccurrence is 'projected' into one of binom(d,r) subspaces represented by the\ntensor; thus forming a tensor power normalization metric endowed with\nbinom(d,r) such 'detectors'. For experimental contributions, we apply several\nsecond- and higher-order pooling variants to action recognition, provide\npreviously not presented comparisons of such pooling variants, and show\nstate-of-the-art results on HMDB-51, YUP++ and MPII Cooking Activities.\n","authors":["Piotr Koniusz","Lei Wang","Ke Sun"],"pdf_url":"https://arxiv.org/pdf/2110.05216v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10895v1","updated":"2023-07-20T14:18:44Z","published":"2023-07-20T14:18:44Z","title":"Variational Point Encoding Deformation for Dental Modeling","summary":" Digital dentistry has made significant advancements in recent years, yet\nnumerous challenges remain to be addressed. In this study, we release a new\nextensive dataset of tooth meshes to encourage further research. Additionally,\nwe propose Variational FoldingNet (VF-Net), which extends FoldingNet to enable\nprobabilistic learning of point cloud representations. A key challenge in\nexisting latent variable models for point clouds is the lack of a 1-to-1\nmapping between input points and output points. Instead, they must rely on\noptimizing Chamfer distances, a metric that does not have a normalized\ndistributional counterpart, preventing its usage in probabilistic models. We\ndemonstrate that explicit minimization of Chamfer distances can be replaced by\na suitable encoder, which allows us to increase computational efficiency while\nsimplifying the probabilistic extension. 
Our experimental findings present\nempirical evidence demonstrating the superior performance of VF-Net over\nexisting models in terms of dental scan reconstruction and extrapolation.\nAdditionally, our investigation highlights the robustness of VF-Net's latent\nrepresentations. These results underscore the promising prospects of VF-Net as\nan effective and reliable method for point cloud reconstruction and analysis.\n","authors":["Johan Ziruo Ye","Thomas Ørkild","Peter Lempel Søndergaard","Søren Hauberg"],"pdf_url":"https://arxiv.org/pdf/2307.10895v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10892v1","updated":"2023-07-20T14:11:29Z","published":"2023-07-20T14:11:29Z","title":"Learning and Generalizing Polynomials in Simulation Metamodeling","summary":" The ability to learn polynomials and generalize out-of-distribution is\nessential for simulation metamodels in many disciplines of engineering, where\nthe time step updates are described by polynomials. While feed forward neural\nnetworks can fit any function, they cannot generalize out-of-distribution for\nhigher-order polynomials. Therefore, this paper collects and proposes\nmultiplicative neural network (MNN) architectures that are used as recursive\nbuilding blocks for approximating higher-order polynomials. Our experiments\nshow that MNNs are better than baseline models at generalizing, and their\nperformance in validation is true to their performance in out-of-distribution\ntests. In addition to MNN architectures, a simulation metamodeling approach is\nproposed for simulations with polynomial time step updates. For these\nsimulations, simulating a time interval can be performed in fewer steps by\nincreasing the step size, which entails approximating higher-order polynomials.\nWhile our approach is compatible with any simulation with polynomial time step\nupdates, a demonstration is shown for an epidemiology simulation model, which\nalso shows the inductive bias in MNNs for learning and generalizing\nhigher-order polynomials.\n","authors":["Jesper Hauch","Christoffer Riis","Francisco C. Pereira"],"pdf_url":"https://arxiv.org/pdf/2307.10892v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10891v1","updated":"2023-07-20T14:10:40Z","published":"2023-07-20T14:10:40Z","title":"Syntactic vs Semantic Linear Abstraction and Refinement of Neural\n Networks","summary":" Abstraction is a key verification technique to improve scalability. However,\nits use for neural networks is so far extremely limited. Previous approaches\nfor abstracting classification networks replace several neurons with one of\nthem that is similar enough. We can classify the similarity as defined either\nsyntactically (using quantities on the connections between neurons) or\nsemantically (on the activation values of neurons for various inputs).\nUnfortunately, the previous approaches only achieve moderate reductions, when\nimplemented at all. In this work, we provide a more flexible framework where a\nneuron can be replaced with a linear combination of other neurons, improving\nthe reduction. We apply this approach both on syntactic and semantic\nabstractions, and implement and evaluate them experimentally. 
Further, we\nintroduce a refinement method for our abstractions, allowing for finding a\nbetter balance between reduction and precision.\n","authors":["Calvin Chau","Jan Křetínský","Stefanie Mohr"],"pdf_url":"https://arxiv.org/pdf/2307.10891v1.pdf","comment":"Accepted at ATVA 2023"},{"id":"http://arxiv.org/abs/2307.10890v1","updated":"2023-07-20T14:10:33Z","published":"2023-07-20T14:10:33Z","title":"Player-optimal Stable Regret for Bandit Learning in Matching Markets","summary":" The problem of matching markets has been studied for a long time in the\nliterature due to its wide range of applications. Finding a stable matching is\na common equilibrium objective in this problem. Since market participants are\nusually uncertain of their preferences, a rich line of recent works studies the\nonline setting where participants on one side (players) learn their unknown\npreferences from iterative interactions with the other side (arms). Most\nprevious works in this line are only able to derive theoretical guarantees for\nplayer-pessimal stable regret, which is defined relative to the players'\nleast-preferred stable matching. However, under the pessimal stable matching,\nplayers only obtain the least reward among all stable matchings. To maximize\nplayers' profits, the player-optimal stable matching would be the most\ndesirable. Though \\citet{basu21beyond} provide an upper bound for\nplayer-optimal stable regret, their result can be exponentially large if the\nplayers' preference gap is small. Whether a polynomial guarantee for this\nregret exists is a significant but still open problem. In this work, we provide\na new algorithm named explore-then-Gale-Shapley (ETGS) and show that the\noptimal stable regret of each player can be upper bounded by $O(K\\log\nT/\\Delta^2)$ where $K$ is the number of arms, $T$ is the horizon and $\\Delta$\nis the players' minimum preference gap among the first $N+1$-ranked arms. This\nresult significantly improves previous works which either have a weaker\nplayer-pessimal stable matching objective or apply only to markets with special\nassumptions. When the preferences of participants satisfy some special\nconditions, our regret upper bound also matches the previously derived lower\nbound.\n","authors":["Fang Kong","Shuai Li"],"pdf_url":"https://arxiv.org/pdf/2307.10890v1.pdf","comment":"SODA 2023"},{"id":"http://arxiv.org/abs/2307.02405v2","updated":"2023-07-20T14:10:24Z","published":"2023-07-05T16:27:33Z","title":"$ν^2$-Flows: Fast and improved neutrino reconstruction in\n multi-neutrino final states with conditional normalizing flows","summary":" In this work we introduce $\\nu^2$-Flows, an extension of the $\\nu$-Flows\nmethod to final states containing multiple neutrinos. The architecture can\nnatively scale for all combinations of object types and multiplicities in the\nfinal state for any desired neutrino multiplicities. In $t\\bar{t}$ dilepton\nevents, the momenta of both neutrinos and correlations between them are\nreconstructed more accurately than when using the most popular standard\nanalytical techniques, and solutions are found for all events. Inference time\nis significantly faster than competing methods, and can be reduced further by\nevaluating in parallel on graphics processing units. We apply $\\nu^2$-Flows to\n$t\\bar{t}$ dilepton events and show that the per-bin uncertainties in unfolded\ndistributions are much closer to the limit of performance set by perfect\nneutrino reconstruction than standard techniques. 
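For the explore-then-Gale-Shapley entry above (Kong and Li): once preferences have been estimated, the matching step is the classical player-proposing deferred-acceptance routine, sketched below on toy preference lists. The bandit exploration phase is omitted, and all identifiers are illustrative assumptions rather than the paper's code.

```python
# Illustrative sketch: player-proposing Gale-Shapley (deferred acceptance) on
# already-estimated preference lists; the exploration phase of ETGS that would
# produce these estimates is not shown.
def gale_shapley(player_prefs, arm_prefs):
    """player_prefs[p] and arm_prefs[a] are lists of indices, best first."""
    n_players = len(player_prefs)
    arm_rank = [{p: r for r, p in enumerate(prefs)} for prefs in arm_prefs]
    next_choice = [0] * n_players          # next arm each player will propose to
    match_of_arm = {}                      # arm -> player currently held
    free = list(range(n_players))
    while free:
        p = free.pop()
        a = player_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if a not in match_of_arm:
            match_of_arm[a] = p            # arm was unmatched: tentatively accept
        elif arm_rank[a][p] < arm_rank[a][match_of_arm[a]]:
            free.append(match_of_arm[a])   # arm prefers the new proposer
            match_of_arm[a] = p
        else:
            free.append(p)                 # rejected; will propose to the next arm
    return {p: a for a, p in match_of_arm.items()}

players = [[0, 1, 2], [1, 0, 2], [0, 2, 1]]   # toy preference lists, best first
arms    = [[1, 0, 2], [0, 1, 2], [2, 1, 0]]
print(gale_shapley(players, arms))            # player-optimal stable matching
```

With proposals made by the player side, the resulting stable matching is player-optimal, which is the objective the regret bound in that entry is measured against.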
For the chosen double\ndifferential observables, $\\nu^2$-Flows results in improved statistical\nprecision for each bin by a factor of 1.5 to 2 in comparison to the Neutrino\nWeighting method and up to a factor of four in comparison to the Ellipse\napproach.\n","authors":["John Andrew Raine","Matthew Leigh","Knut Zoch","Tobias Golling"],"pdf_url":"https://arxiv.org/pdf/2307.02405v2.pdf","comment":"20 pages, 16 figures, 5 tables"},{"id":"http://arxiv.org/abs/2303.16716v2","updated":"2023-07-20T13:54:48Z","published":"2023-03-29T14:15:38Z","title":"Topological Point Cloud Clustering","summary":" We present Topological Point Cloud Clustering (TPCC), a new method to cluster\npoints in an arbitrary point cloud based on their contribution to global\ntopological features. TPCC synthesizes desirable features from spectral\nclustering and topological data analysis and is based on considering the\nspectral properties of a simplicial complex associated to the considered point\ncloud. As it is based on considering sparse eigenvector computations, TPCC is\nas easy to interpret and implement as spectral clustering. However, by\nfocusing not just on a single matrix associated to a graph created from the\npoint cloud data, but on a whole set of Hodge-Laplacians associated to an\nappropriately constructed simplicial complex, we can leverage a far richer set\nof topological features to characterize the data points within the point cloud\nand benefit from the relative robustness of topological techniques against\nnoise. We test the performance of TPCC on both synthetic and real-world data\nand compare it with classical spectral clustering.\n","authors":["Vincent P. Grande","Michael T. Schaub"],"pdf_url":"https://arxiv.org/pdf/2303.16716v2.pdf","comment":"Accepted at the 40th International Conference on Machine Learning\n (ICML), 2023. Code available at\n https://git.rwth-aachen.de/netsci/publication-2023-topological-point-cloud-clustering"},{"id":"http://arxiv.org/abs/2306.14030v2","updated":"2023-07-20T13:54:05Z","published":"2023-06-24T18:17:38Z","title":"My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models\n and Evaluation Benchmarks","summary":" The research on code-mixed data is limited due to the unavailability of\ndedicated code-mixed datasets and pre-trained language models. In this work, we\nfocus on the low-resource Indian language Marathi which lacks any prior work in\ncode-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English\n(Mr-En) corpus with 10 million social media sentences for pretraining. We also\nrelease L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models\npre-trained on MeCorpus. Furthermore, for benchmarking, we present three\nsupervised datasets MeHate, MeSent, and MeLID for downstream tasks like\ncode-mixed Mr-En hate speech detection, sentiment analysis, and language\nidentification, respectively. These evaluation datasets individually consist of\n~12,000 manually annotated Marathi-English code-mixed tweets. Ablations\nshow that the models trained on this novel corpus significantly outperform the\nexisting state-of-the-art BERT models. This is the first work that presents\nartifacts for code-mixed Marathi research. 
All datasets and models are publicly\nreleased at https://github.com/l3cube-pune/MarathiNLP .\n","authors":["Tanmay Chavan","Omkar Gokhale","Aditya Kane","Shantanu Patankar","Raviraj Joshi"],"pdf_url":"https://arxiv.org/pdf/2306.14030v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10875v1","updated":"2023-07-20T13:47:30Z","published":"2023-07-20T13:47:30Z","title":"Risk-optimized Outlier Removal for Robust Point Cloud Classification","summary":" The popularity of point cloud deep models for safety-critical purposes has\nincreased, but the reliability and security of these models can be compromised\nby intentional or naturally occurring point cloud noise. To combat this issue,\nwe present a novel point cloud outlier removal method called PointCVaR, which\nempowers standard-trained models to eliminate additional outliers and restore\nthe data. Our approach begins by conducting attribution analysis to determine\nthe influence of each point on the model output, which we refer to as point\nrisk. We then optimize the process of filtering high-risk points using\nConditional Value at Risk (CVaR) as the objective. The rationale for this\napproach is based on the observation that noise points in point clouds tend to\ncluster in the tail of the risk distribution, with a low frequency but a high\nlevel of risk, resulting in significant interference with classification\nresults. Despite requiring no additional training effort, our method produces\nexceptional results in various removal-and-classification experiments for noisy\npoint clouds, which are corrupted by random noise, adversarial noise, and\nbackdoor trigger noise. Impressively, it achieves 87% accuracy in defense\nagainst the backdoor attack by removing triggers. Overall, the proposed\nPointCVaR effectively eliminates noise points and enhances point cloud\nclassification, making it a promising plug-in module for various models in\ndifferent scenarios.\n","authors":["Xinke Li","Junchi Lu"],"pdf_url":"https://arxiv.org/pdf/2307.10875v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10870v1","updated":"2023-07-20T13:42:13Z","published":"2023-07-20T13:42:13Z","title":"Nonlinear Meta-Learning Can Guarantee Faster Rates","summary":" Many recent theoretical works on \\emph{meta-learning} aim to achieve\nguarantees in leveraging similar representational structures from related tasks\ntowards simplifying a target task. Importantly, the main aim in theory works on\nthe subject is to understand the extent to which convergence rates -- in\nlearning a common representation -- \\emph{may scale with the number $N$ of\ntasks} (as well as the number of samples per task). First steps in this setting\ndemonstrate this property when both the shared representation amongst tasks,\nand task-specific regression functions, are linear. This linear setting readily\nreveals the benefits of aggregating tasks, e.g., via averaging arguments. In\npractice, however, the representation is often highly nonlinear, introducing\nnontrivial biases in each task that cannot easily be averaged out as in the\nlinear case. In the present work, we derive theoretical guarantees for\nmeta-learning with nonlinear representations. 
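On the PointCVaR entry above (Li and Lu): its filtering idea can be caricatured as dropping the points whose risk falls in the upper alpha-tail of the risk distribution, the region over which CVaR averages. The sketch below uses synthetic risk scores and a simple quantile threshold as a stand-in for the paper's attribution analysis and CVaR-driven optimization; every name in it is an assumption.

```python
# Illustrative sketch only: remove points whose risk lies in the upper alpha-tail
# (the region CVaR averages over). Synthetic risk scores replace the
# attribution-based point risk used by PointCVaR.
import numpy as np

def filter_high_risk_points(points: np.ndarray, risk: np.ndarray, alpha: float = 0.05):
    """Keep points whose risk is below the (1 - alpha) quantile (the VaR level)."""
    var_threshold = np.quantile(risk, 1.0 - alpha)       # Value at Risk
    tail = risk >= var_threshold
    cvar = risk[tail].mean()                              # Conditional Value at Risk
    return points[~tail], cvar

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))                        # toy point cloud
risk = rng.gamma(shape=1.0, scale=0.1, size=1024)         # synthetic per-point risk
risk[:20] += 2.0                                          # pretend 20 points are noise/triggers
clean, cvar = filter_high_risk_points(cloud, risk, alpha=0.05)
print(clean.shape, round(float(cvar), 3))                 # roughly 5% of points removed
```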
In particular, assuming the\nshared nonlinearity maps to an infinite-dimensional RKHS, we show that\nadditional biases can be mitigated with careful regularization that leverages\nthe smoothness of task-specific regression functions,\n","authors":["Dimitri Meunier","Zhu Li","Arthur Gretton","Samory Kpotufe"],"pdf_url":"https://arxiv.org/pdf/2307.10870v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10869v1","updated":"2023-07-20T13:41:26Z","published":"2023-07-20T13:41:26Z","title":"Performance Issue Identification in Cloud Systems with\n Relational-Temporal Anomaly Detection","summary":" Performance issues permeate large-scale cloud service systems, which can lead\nto huge revenue losses. To ensure reliable performance, it's essential to\naccurately identify and localize these issues using service monitoring metrics.\nGiven the complexity and scale of modern cloud systems, this task can be\nchallenging and may require extensive expertise and resources beyond the\ncapacity of individual humans. Some existing methods tackle this problem by\nanalyzing each metric independently to detect anomalies. However, this could\nincur overwhelming alert storms that are difficult for engineers to diagnose\nmanually. To pursue better performance, not only the temporal patterns of\nmetrics but also the correlation between metrics (i.e., relational patterns)\nshould be considered, which can be formulated as a multivariate metrics anomaly\ndetection problem. However, most of the studies fall short of extracting these\ntwo types of features explicitly. Moreover, there exist some unlabeled\nanomalies mixed in the training data, which may hinder the detection\nperformance. To address these limitations, we propose the Relational- Temporal\nAnomaly Detection Model (RTAnomaly) that combines the relational and temporal\ninformation of metrics. RTAnomaly employs a graph attention layer to learn the\ndependencies among metrics, which will further help pinpoint the anomalous\nmetrics that may cause the anomaly effectively. In addition, we exploit the\nconcept of positive unlabeled learning to address the issue of potential\nanomalies in the training data. To evaluate our method, we conduct experiments\non a public dataset and two industrial datasets. RTAnomaly outperforms all the\nbaseline models by achieving an average F1 score of 0.929 and Hit@3 of 0.920,\ndemonstrating its superiority.\n","authors":["Wenwei Gu","Jinyang Liu","Zhuangbin Chen","Jianping Zhang","Yuxin Su","Jiazhen Gu","Cong Feng","Zengyin Yang","Michael Lyu"],"pdf_url":"https://arxiv.org/pdf/2307.10869v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10867v1","updated":"2023-07-20T13:40:22Z","published":"2023-07-20T13:40:22Z","title":"FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with\n Human Feedback","summary":" Captions are crucial for understanding scientific visualizations and\ndocuments. Existing captioning methods for scientific figures rely on\nfigure-caption pairs extracted from documents for training, many of which fall\nshort with respect to metrics like helpfulness, explainability, and\nvisual-descriptiveness [15] leading to generated captions being misaligned with\nreader preferences. To enable the generation of high-quality figure captions,\nwe introduce FigCaps-HF a new framework for figure-caption generation that can\nincorporate domain expert feedback in generating captions optimized for reader\npreferences. 
Our framework comprises 1) an automatic method for evaluating the\nquality of figure-caption pairs, 2) a novel reinforcement learning with human\nfeedback (RLHF) method to optimize a generative figure-to-caption model for\nreader preferences. We demonstrate the effectiveness of our simple learning\nframework by improving performance over standard fine-tuning across different\ntypes of models. In particular, when using BLIP as the base model, our RLHF\nframework achieves a mean gain of 35.7%, 16.9%, and 9% in ROUGE, BLEU, and\nMeteor, respectively. Finally, we release a large-scale benchmark dataset with\nhuman feedback on figure-caption pairs to enable further evaluation and\ndevelopment of RLHF techniques for this problem.\n","authors":["Ashish Singh","Prateek Agarwal","Zixuan Huang","Arpita Singh","Tong Yu","Sungchul Kim","Victor Bursztyn","Nikos Vlassis","Ryan A. Rossi"],"pdf_url":"https://arxiv.org/pdf/2307.10867v1.pdf","comment":"19 pages, 4 figures. Benchmark Documentation:\n https://figcapshf.github.io/"},{"id":"http://arxiv.org/abs/2307.10865v1","updated":"2023-07-20T13:34:11Z","published":"2023-07-20T13:34:11Z","title":"Addressing caveats of neural persistence with deep graph persistence","summary":" Neural Persistence is a prominent measure for quantifying neural network\ncomplexity, proposed in the emerging field of topological data analysis in deep\nlearning. In this work, however, we find both theoretically and empirically\nthat the variance of network weights and spatial concentration of large weights\nare the main factors that impact neural persistence. Whilst this captures\nuseful information for linear classifiers, we find that no relevant spatial\nstructure is present in later layers of deep neural networks, making neural\npersistence roughly equivalent to the variance of weights. Additionally, the\nproposed averaging procedure across layers for deep neural networks does not\nconsider interaction between layers. Based on our analysis, we propose an\nextension of the filtration underlying neural persistence to the whole neural\nnetwork instead of single layers, which is equivalent to calculating neural\npersistence on one particular matrix. This yields our deep graph persistence\nmeasure, which implicitly incorporates persistent paths through the network and\nalleviates variance-related issues through standardisation. Code is available\nat https://github.com/ExplainableML/Deep-Graph-Persistence .\n","authors":["Leander Girrbach","Anders Christensen","Ole Winther","Zeynep Akata","A. Sophia Koepke"],"pdf_url":"https://arxiv.org/pdf/2307.10865v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10864v1","updated":"2023-07-20T13:33:28Z","published":"2023-07-20T13:33:28Z","title":"Divide & Bind Your Attention for Improved Generative Semantic Nursing","summary":" Emerging large-scale text-to-image generative models, e.g., Stable Diffusion\n(SD), have exhibited overwhelming results with high fidelity. Despite the\nmagnificent progress, current state-of-the-art models still struggle to\ngenerate images fully adhering to the input prompt. Prior work, Attend &\nExcite, has introduced the concept of Generative Semantic Nursing (GSN), aiming\nto optimize cross-attention during inference time to better incorporate the\nsemantics. It demonstrates promising results in generating simple prompts,\ne.g., ``a cat and a dog''. However, its efficacy declines when dealing with\nmore complex prompts, and it does not explicitly address the problem of\nimproper attribute binding. 
To address the challenges posed by complex prompts\nor scenarios involving multiple entities and to achieve improved attribute\nbinding, we propose Divide & Bind. We introduce two novel loss objectives for\nGSN: a novel attendance loss and a binding loss. Our approach stands out in its\nability to faithfully synthesize desired objects with improved attribute\nalignment from complex prompts and exhibits superior performance across\nmultiple evaluation benchmarks. More videos and updates can be found on the\nproject page \\url{https://sites.google.com/view/divide-and-bind}.\n","authors":["Yumeng Li","Margret Keuper","Dan Zhang","Anna Khoreva"],"pdf_url":"https://arxiv.org/pdf/2307.10864v1.pdf","comment":"Project page: \\url{https://sites.google.com/view/divide-and-bind}"},{"id":"http://arxiv.org/abs/2307.09206v2","updated":"2023-07-20T13:29:27Z","published":"2023-07-18T12:42:59Z","title":"Context-Conditional Navigation with a Learning-Based Terrain- and\n Robot-Aware Dynamics Model","summary":" In autonomous navigation settings, several quantities can be subject to\nvariations. Terrain properties such as friction coefficients may vary over time\ndepending on the location of the robot. Also, the dynamics of the robot may\nchange due to, e.g., different payloads, changing the system's mass, or wear\nand tear, changing actuator gains or joint friction. An autonomous agent should\nthus be able to adapt to such variations. In this paper, we develop a novel\nprobabilistic, terrain- and robot-aware forward dynamics model, termed TRADYN,\nwhich is able to adapt to the above-mentioned variations. It builds on recent\nadvances in meta-learning forward dynamics models based on Neural Processes. We\nevaluate our method in a simulated 2D navigation setting with a unicycle-like\nrobot and different terrain layouts with spatially varying friction\ncoefficients. In our experiments, the proposed model exhibits lower prediction\nerror for the task of long-horizon trajectory prediction, compared to\nnon-adaptive ablation models. We also evaluate our model on the downstream task\nof navigation planning, which demonstrates improved performance in planning\ncontrol-efficient paths by taking robot and terrain properties into account.\n","authors":["Suresh Guttikonda","Jan Achterhold","Haolong Li","Joschka Boedecker","Joerg Stueckler"],"pdf_url":"https://arxiv.org/pdf/2307.09206v2.pdf","comment":"\\copyright 2023 IEEE. Accepted for publication in European Conference\n on Mobile Robots (ECMR), 2023. Updated copyright statement"},{"id":"http://arxiv.org/abs/2211.04974v2","updated":"2023-07-20T13:11:13Z","published":"2022-11-09T15:39:32Z","title":"Leveraging Offline Data in Online Reinforcement Learning","summary":" Two central paradigms have emerged in the reinforcement learning (RL)\ncommunity: online RL and offline RL. In the online RL setting, the agent has no\nprior knowledge of the environment, and must interact with it in order to find\nan $\\epsilon$-optimal policy. In the offline RL setting, the learner instead\nhas access to a fixed dataset to learn from, but is unable to otherwise\ninteract with the environment, and must obtain the best policy it can from this\noffline data. 
Practical scenarios often motivate an intermediate setting: if we\nhave some set of offline data and, in addition, may also interact with the\nenvironment, how can we best use the offline data to minimize the number of\nonline interactions necessary to learn an $\\epsilon$-optimal policy?\n In this work, we consider this setting, which we call the \\textsf{FineTuneRL}\nsetting, for MDPs with linear structure. We characterize the necessary number\nof online samples needed in this setting given access to some offline dataset,\nand develop an algorithm, \\textsc{FTPedel}, which is provably optimal, up to\n$H$ factors. We show through an explicit example that combining offline data\nwith online interactions can lead to a provable improvement over either purely\noffline or purely online RL. Finally, our results illustrate the distinction\nbetween \\emph{verifiable} learning, the typical setting considered in online\nRL, and \\emph{unverifiable} learning, the setting often considered in offline\nRL, and show that there is a formal separation between these regimes.\n","authors":["Andrew Wagenmaker","Aldo Pacchiano"],"pdf_url":"https://arxiv.org/pdf/2211.04974v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10845v1","updated":"2023-07-20T13:07:41Z","published":"2023-07-20T13:07:41Z","title":"Self-paced Weight Consolidation for Continual Learning","summary":" Continual learning algorithms which keep the parameters of new tasks close to\nthat of previous tasks, are popular in preventing catastrophic forgetting in\nsequential task learning settings. However, 1) the performance for the new\ncontinual learner will be degraded without distinguishing the contributions of\npreviously learned tasks; 2) the computational cost will be greatly increased\nwith the number of tasks, since most existing algorithms need to regularize all\nprevious tasks when learning new tasks. To address the above challenges, we\npropose a self-paced Weight Consolidation (spWC) framework to attain robust\ncontinual learning via evaluating the discriminative contributions of previous\ntasks. To be specific, we develop a self-paced regularization to reflect the\npriorities of past tasks via measuring difficulty based on key performance\nindicator (i.e., accuracy). When encountering a new task, all previous tasks\nare sorted from \"difficult\" to \"easy\" based on the priorities. Then the\nparameters of the new continual learner will be learned via selectively\nmaintaining the knowledge amongst more difficult past tasks, which could well\novercome catastrophic forgetting with less computational cost. We adopt an\nalternative convex search to iteratively update the model parameters and\npriority weights in the bi-convex formulation. The proposed spWC framework is\nplug-and-play, which is applicable to most continual learning algorithms (e.g.,\nEWC, MAS and RCIL) in different directions (e.g., classification and\nsegmentation). 
Experimental results on several public benchmark datasets\ndemonstrate that our proposed framework can effectively improve performance\nwhen compared with other popular continual learning algorithms.\n","authors":["Wei Cong","Yang Cong","Gan Sun","Yuyang Liu","Jiahua Dong"],"pdf_url":"https://arxiv.org/pdf/2307.10845v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10843v1","updated":"2023-07-20T13:04:26Z","published":"2023-07-20T13:04:26Z","title":"Global Precipitation Nowcasting of Integrated Multi-satellitE Retrievals\n for GPM: A U-Net Convolutional LSTM Architecture","summary":" This paper presents a deep learning architecture for nowcasting of\nprecipitation almost globally every 30 min with a 4-hour lead time. The\narchitecture fuses a U-Net and a convolutional long short-term memory (LSTM)\nneural network and is trained using data from the Integrated MultisatellitE\nRetrievals for GPM (IMERG) and a few key precipitation drivers from the Global\nForecast System (GFS). The impacts of different training loss functions,\nincluding the mean-squared error (regression) and the focal-loss\n(classification), on the quality of precipitation nowcasts are studied. The\nresults indicate that the regression network performs well in capturing light\nprecipitation (below 1.6 mm/hr), but the classification network can outperform\nthe regression network for nowcasting of precipitation extremes (>8 mm/hr), in\nterms of the critical success index (CSI). Using the Wasserstein distance, it\nis shown that the predicted precipitation by the classification network has a\ncloser class probability distribution to the IMERG than the regression network.\nIt is uncovered that the inclusion of the physical variables can improve\nprecipitation nowcasting, especially at longer lead times in both networks.\nTaking IMERG as a relative reference, a multi-scale analysis in terms of\nfractions skill score (FSS) shows that the nowcasting machine remains skillful\n(FSS > 0.5) at the resolution of 10 km compared to 50 km for GFS. For\nprecipitation rates greater than 4~mm/hr, only the classification network\nremains FSS-skillful on scales greater than 50 km within a 2-hour lead time.\n","authors":["Reyhaneh Rahimi","Ardeshir Ebtehaj","Ali Behrangi","Jackson Tan"],"pdf_url":"https://arxiv.org/pdf/2307.10843v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10842v1","updated":"2023-07-20T13:02:45Z","published":"2023-07-20T13:02:45Z","title":"Label Calibration for Semantic Segmentation Under Domain Shift","summary":" Performance of a pre-trained semantic segmentation model is likely to\nsubstantially decrease on data from a new domain. We show a pre-trained model\ncan be adapted to unlabelled target domain data by calculating soft-label\nprototypes under the domain shift and making predictions according to the\nprototype closest to the vector with predicted class probabilities. The\nproposed adaptation procedure is fast, comes almost for free in terms of\ncomputational resources and leads to considerable performance improvements. 
We\ndemonstrate the benefits of such label calibration on the highly-practical\nsynthetic-to-real semantic segmentation problem.\n","authors":["Ondrej Bohdal","Da Li","Timothy Hospedales"],"pdf_url":"https://arxiv.org/pdf/2307.10842v1.pdf","comment":"ICLR 2023 Workshop on Pitfalls of Limited Data and Computation for\n Trustworthy ML"},{"id":"http://arxiv.org/abs/2207.02575v2","updated":"2023-07-20T12:59:44Z","published":"2022-07-06T10:42:57Z","title":"Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via\n Online Experiment Design","summary":" While much progress has been made in understanding the minimax sample\ncomplexity of reinforcement learning (RL) -- the complexity of learning on the\n\"worst-case\" instance -- such measures of complexity often do not capture the\ntrue difficulty of learning. In practice, on an \"easy\" instance, we might hope\nto achieve a complexity far better than that achievable on the worst-case\ninstance. In this work we seek to understand the \"instance-dependent\"\ncomplexity of learning near-optimal policies (PAC RL) in the setting of RL with\nlinear function approximation. We propose an algorithm, \\textsc{Pedel}, which\nachieves a fine-grained instance-dependent measure of complexity, the first of\nits kind in the RL with function approximation setting, thereby capturing the\ndifficulty of learning on each particular problem instance. Through an explicit\nexample, we show that \\textsc{Pedel} yields provable gains over low-regret,\nminimax-optimal algorithms and that such algorithms are unable to hit the\ninstance-optimal rate. Our approach relies on a novel online experiment\ndesign-based procedure which focuses the exploration budget on the \"directions\"\nmost relevant to learning a near-optimal policy, and may be of independent\ninterest.\n","authors":["Andrew Wagenmaker","Kevin Jamieson"],"pdf_url":"https://arxiv.org/pdf/2207.02575v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.06092v2","updated":"2023-07-20T12:54:32Z","published":"2023-07-12T11:35:37Z","title":"Quantitative CLTs in Deep Neural Networks","summary":" We study the distribution of a fully connected neural network with random\nGaussian weights and biases in which the hidden layer widths are proportional\nto a large constant $n$. Under mild assumptions on the non-linearity, we obtain\nquantitative bounds on normal approximations valid at large but finite $n$ and\nany fixed network depth. Our theorems show both for the finite-dimensional\ndistributions and the entire process, that the distance between a random fully\nconnected network (and its derivatives) to the corresponding infinite width\nGaussian process scales like $n^{-\\gamma}$ for $\\gamma>0$, with the exponent\ndepending on the metric used to measure discrepancy. Our bounds are strictly\nstronger in terms of their dependence on network width than any previously\navailable in the literature; in the one-dimensional case, we also prove that\nthey are optimal, i.e., we establish matching lower bounds.\n","authors":["Stefano Favaro","Boris Hanin","Domenico Marinucci","Ivan Nourdin","Giovanni Peccati"],"pdf_url":"https://arxiv.org/pdf/2307.06092v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10810v1","updated":"2023-07-20T12:20:18Z","published":"2023-07-20T12:20:18Z","title":"On Combining Expert Demonstrations in Imitation Learning via Optimal\n Transport","summary":" Imitation learning (IL) seeks to teach agents specific tasks through expert\ndemonstrations. 
One of the key approaches to IL is to define a distance between\nagent and expert and to find an agent policy that minimizes that distance.\nOptimal transport methods have been widely used in imitation learning as they\nprovide ways to measure meaningful distances between agent and expert\ntrajectories. However, the problem of how to optimally combine multiple expert\ndemonstrations has not been widely studied. The standard method is to simply\nconcatenate state (-action) trajectories, which is problematic when\ntrajectories are multi-modal. We propose an alternative method that uses a\nmulti-marginal optimal transport distance and enables the combination of\nmultiple and diverse state-trajectories in the OT sense, providing a more\nsensible geometric average of the demonstrations. Our approach enables an agent\nto learn from several experts, and its efficiency is analyzed on OpenAI Gym\ncontrol environments and demonstrates that the standard method is not always\noptimal.\n","authors":["Ilana Sebag","Samuel Cohen","Marc Peter Deisenroth"],"pdf_url":"https://arxiv.org/pdf/2307.10810v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.15382v2","updated":"2023-07-20T12:18:49Z","published":"2022-11-24T13:21:36Z","title":"Neural Network Complexity of Chaos and Turbulence","summary":" Chaos and turbulence are complex physical phenomena, yet a precise definition\nof the complexity measure that quantifies them is still lacking. In this work\nwe consider the relative complexity of chaos and turbulence from the\nperspective of deep neural networks. We analyze a set of classification\nproblems, where the network has to distinguish images of fluid profiles in the\nturbulent regime from other classes of images such as fluid profiles in the\nchaotic regime, various constructions of noise and real world images. We\nanalyze incompressible as well as weakly compressible fluid flows. We quantify\nthe complexity of the computation performed by the network via the intrinsic\ndimensionality of the internal feature representations, and calculate the\neffective number of independent features which the network uses in order to\ndistinguish between classes. In addition to providing a numerical estimate of\nthe complexity of the computation, the measure also characterizes the neural\nnetwork processing at intermediate and final stages. We construct adversarial\nexamples and use them to identify the two point correlation spectra for the\nchaotic and turbulent vorticity as the feature used by the network for\nclassification.\n","authors":["Tim Whittaker","Romuald A. Janik","Yaron Oz"],"pdf_url":"https://arxiv.org/pdf/2211.15382v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.01001v3","updated":"2023-07-20T12:17:50Z","published":"2021-06-02T07:53:54Z","title":"Warming up recurrent neural networks to maximise reachable\n multistability greatly improves learning","summary":" Training recurrent neural networks is known to be difficult when time\ndependencies become long. In this work, we show that most standard cells only\nhave one stable equilibrium at initialisation, and that learning on tasks with\nlong time dependencies generally occurs once the number of network stable\nequilibria increases; a property known as multistability. Multistability is\noften not easily attained by initially monostable networks, making learning of\nlong time dependencies between inputs and outputs difficult. 
This insight leads\nto the design of a novel way to initialise any recurrent cell connectivity\nthrough a procedure called \"warmup\" to improve its capability to learn\narbitrarily long time dependencies. This initialisation procedure is designed\nto maximise network reachable multistability, i.e., the number of equilibria\nwithin the network that can be reached through relevant input trajectories, in\nfew gradient steps. We show on several information restitution, sequence\nclassification, and reinforcement learning benchmarks that warming up greatly\nimproves learning speed and performance, for multiple recurrent cells, but\nsometimes impedes precision. We therefore introduce a double-layer architecture\ninitialised with a partial warmup that is shown to greatly improve learning of\nlong time dependencies while maintaining high levels of precision. This\napproach provides a general framework for improving learning abilities of any\nrecurrent cell when long time dependencies are present. We also show\nempirically that other initialisation and pretraining procedures from the\nliterature implicitly foster reachable multistability of recurrent cells.\n","authors":["Gaspard Lambrechts","Florent De Geeter","Nicolas Vecoven","Damien Ernst","Guillaume Drion"],"pdf_url":"https://arxiv.org/pdf/2106.01001v3.pdf","comment":"20 pages, 35 pages total, 38 figures"},{"id":"http://arxiv.org/abs/2307.10805v1","updated":"2023-07-20T12:16:26Z","published":"2023-07-20T12:16:26Z","title":"Communication-Efficient Split Learning via Adaptive Feature-Wise\n Compression","summary":" This paper proposes a novel communication-efficient split learning (SL)\nframework, named SplitFC, which reduces the communication overhead required for\ntransmitting intermediate feature and gradient vectors during the SL training\nprocess. The key idea of SplitFC is to leverage different dispersion degrees\nexhibited in the columns of the matrices. SplitFC incorporates two compression\nstrategies: (i) adaptive feature-wise dropout and (ii) adaptive feature-wise\nquantization. In the first strategy, the intermediate feature vectors are\ndropped with adaptive dropout probabilities determined based on the standard\ndeviation of these vectors. Then, by the chain rule, the intermediate gradient\nvectors associated with the dropped feature vectors are also dropped. In the\nsecond strategy, the non-dropped intermediate feature and gradient vectors are\nquantized using adaptive quantization levels determined based on the ranges of\nthe vectors. To minimize the quantization error, the optimal quantization\nlevels of this strategy are derived in a closed-form expression. Simulation\nresults on the MNIST, CIFAR-10, and CelebA datasets demonstrate that SplitFC\nprovides more than a 5.6% increase in classification accuracy compared to\nstate-of-the-art SL frameworks, while they require 320 times less communication\noverhead compared to the vanilla SL framework without compression.\n","authors":["Yongjeong Oh","Jaeho Lee","Christopher G. 
Brinton","Yo-Seb Jeon"],"pdf_url":"https://arxiv.org/pdf/2307.10805v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10803v1","updated":"2023-07-20T12:12:05Z","published":"2023-07-20T12:12:05Z","title":"Spatial-Temporal Data Mining for Ocean Science: Data, Methodologies, and\n Opportunities","summary":" With the increasing amount of spatial-temporal~(ST) ocean data, numerous\nspatial-temporal data mining (STDM) studies have been conducted to address\nvarious oceanic issues, e.g., climate forecasting and disaster warning.\nCompared with typical ST data (e.g., traffic data), ST ocean data is more\ncomplicated with some unique characteristics, e.g., diverse regionality and\nhigh sparsity. These characteristics make it difficult to design and train STDM\nmodels. Unfortunately, an overview of these studies is still missing, hindering\ncomputer scientists to identify the research issues in ocean while discouraging\nresearchers in ocean science from applying advanced STDM techniques. To remedy\nthis situation, we provide a comprehensive survey to summarize existing STDM\nstudies in ocean. Concretely, we first summarize the widely-used ST ocean\ndatasets and identify their unique characteristics. Then, typical ST ocean data\nquality enhancement techniques are discussed. Next, we classify existing STDM\nstudies for ocean into four types of tasks, i.e., prediction, event detection,\npattern mining, and anomaly detection, and elaborate the techniques for these\ntasks. Finally, promising research opportunities are highlighted. This survey\nwill help scientists from the fields of both computer science and ocean science\nhave a better understanding of the fundamental concepts, key techniques, and\nopen challenges of STDM in ocean.\n","authors":["Hanchen Yang","Wengen Li","Shuyu Wang","Hui Li","Jihong Guan","Shuigeng Zhou","Jiannong Cao"],"pdf_url":"https://arxiv.org/pdf/2307.10803v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2009.03259v2","updated":"2023-07-20T12:11:56Z","published":"2020-09-07T17:27:27Z","title":"Implicit Multidimensional Projection of Local Subspaces","summary":" We propose a visualization method to understand the effect of\nmultidimensional projection on local subspaces, using implicit function\ndifferentiation. Here, we understand the local subspace as the multidimensional\nlocal neighborhood of data points. Existing methods focus on the projection of\nmultidimensional data points, and the neighborhood information is ignored. Our\nmethod is able to analyze the shape and directional information of the local\nsubspace to gain more insights into the global structure of the data through\nthe perception of local structures. Local subspaces are fitted by\nmultidimensional ellipses that are spanned by basis vectors. An accurate and\nefficient vector transformation method is proposed based on analytical\ndifferentiation of multidimensional projections formulated as implicit\nfunctions. The results are visualized as glyphs and analyzed using a full set\nof specifically-designed interactions supported in our efficient web-based\nvisualization tool. The usefulness of our method is demonstrated using various\nmulti- and high-dimensional benchmark datasets. 
Our implicit differentiation\nvector transformation is evaluated through numerical comparisons; the overall\nmethod is evaluated through exploration examples and use cases.\n","authors":["Rongzheng Bian","Yumeng Xue","Liang Zhou","Jian Zhang","Baoquan Chen","Daniel Weiskopf","Yunhai Wang"],"pdf_url":"https://arxiv.org/pdf/2009.03259v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10802v1","updated":"2023-07-20T12:10:29Z","published":"2023-07-20T12:10:29Z","title":"Meta-Transformer: A Unified Framework for Multimodal Learning","summary":" Multimodal learning aims to build models that can process and relate\ninformation from multiple modalities. Despite years of development in this\nfield, it still remains challenging to design a unified network for processing\nvarious modalities ($\\textit{e.g.}$ natural language, 2D images, 3D point\nclouds, audio, video, time series, tabular data) due to the inherent gaps among\nthem. In this work, we propose a framework, named Meta-Transformer, that\nleverages a $\\textbf{frozen}$ encoder to perform multimodal perception without\nany paired multimodal training data. In Meta-Transformer, the raw input data\nfrom various modalities are mapped into a shared token space, allowing a\nsubsequent encoder with frozen parameters to extract high-level semantic\nfeatures of the input data. Composed of three main components: a unified data\ntokenizer, a modality-shared encoder, and task-specific heads for downstream\ntasks, Meta-Transformer is the first framework to perform unified learning\nacross 12 modalities with unpaired data. Experiments on different benchmarks\nreveal that Meta-Transformer can handle a wide range of tasks including\nfundamental perception (text, image, point cloud, audio, video), practical\napplication (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph,\ntabular, and time-series). Meta-Transformer indicates a promising future for\ndeveloping unified multimodal intelligence with transformers. Code will be\navailable at https://github.com/invictus717/MetaTransformer\n","authors":["Yiyuan Zhang","Kaixiong Gong","Kaipeng Zhang","Hongsheng Li","Yu Qiao","Wanli Ouyang","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2307.10802v1.pdf","comment":"Project website: https://kxgong.github.io/meta_transformer/"},{"id":"http://arxiv.org/abs/2205.12900v4","updated":"2023-07-20T12:10:09Z","published":"2022-05-25T16:46:01Z","title":"Pre-trained Perceptual Features Improve Differentially Private Image\n Generation","summary":" Training even moderately-sized generative models with differentially-private\nstochastic gradient descent (DP-SGD) is difficult: the required level of noise\nfor reasonable levels of privacy is simply too large. We advocate instead\nbuilding off a good, relevant representation on an informative public dataset,\nthen learning to model the private data with that representation. In\nparticular, we minimize the maximum mean discrepancy (MMD) between private\ntarget data and a generator's distribution, using a kernel based on perceptual\nfeatures learned from a public dataset. With the MMD, we can simply privatize\nthe data-dependent term once and for all, rather than introducing noise at each\nstep of optimization as in DP-SGD. Our algorithm allows us to generate\nCIFAR10-level images with $\\epsilon \\approx 2$ which capture distinctive\nfeatures in the distribution, far surpassing the current state of the art,\nwhich mostly focuses on datasets such as MNIST and FashionMNIST at a large\n$\\epsilon \\approx 10$. 
Our work introduces simple yet powerful foundations for\nreducing the gap between private and non-private deep generative models. Our\ncode is available at \url{https://github.com/ParkLabML/DP-MEPF}.\n","authors":["Fredrik Harder","Milad Jalali Asadabadi","Danica J. Sutherland","Mijung Park"],"pdf_url":"https://arxiv.org/pdf/2205.12900v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2107.03455v2","updated":"2023-07-20T11:54:22Z","published":"2021-07-07T19:35:31Z","title":"Model Selection for Generic Contextual Bandits","summary":" We consider the problem of model selection for the general stochastic\ncontextual bandits under the realizability assumption. We propose a successive\nrefinement based algorithm called Adaptive Contextual Bandit ({\ttfamily ACB}),\nwhich works in phases and successively eliminates model classes that are too\nsimple to fit the given instance. We prove that this algorithm is adaptive,\ni.e., the regret rate order-wise matches that of any provable contextual bandit\nalgorithm (ex. \cite{falcon}), which needs the knowledge of the true model\nclass. The price of not knowing the correct model class turns out to be only an\nadditive term contributing to the second order term in the regret bound. This\ncost possesses the intuitive property that it becomes smaller as the model class\nbecomes easier to identify, and vice-versa. We also show that a much simpler\nexplore-then-commit (ETC) style algorithm also obtains a similar regret bound,\ndespite not knowing the true model class. However, the cost of model selection\nis higher in ETC as opposed to in {\ttfamily ACB}, as expected. Furthermore,\nfor the special case of linear contextual bandits, we propose specialized\nalgorithms that obtain sharper guarantees compared to the generic setup.\n","authors":["Avishek Ghosh","Abishek Sankararaman","Kannan Ramchandran"],"pdf_url":"https://arxiv.org/pdf/2107.03455v2.pdf","comment":"Accepted at IEEE Transactions on Information Theory. arXiv admin\n note: text overlap with arXiv:2006.02612"},{"id":"http://arxiv.org/abs/2307.10792v1","updated":"2023-07-20T11:45:38Z","published":"2023-07-20T11:45:38Z","title":"Optimizing PatchCore for Few/many-shot Anomaly Detection","summary":" Few-shot anomaly detection (AD) is an emerging sub-field of general AD, and\ntries to distinguish between normal and anomalous data using only few selected\nsamples. While newly proposed few-shot AD methods do compare against\npre-existing algorithms developed for the full-shot domain as baselines, they\ndo not dedicatedly optimize them for the few-shot setting. It thus remains\nunclear if the performance of such pre-existing algorithms can be further\nimproved. We address said question in this work. Specifically, we present a\nstudy on the AD/anomaly segmentation (AS) performance of PatchCore, the current\nstate-of-the-art full-shot AD/AS algorithm, in both the few-shot and the\nmany-shot settings. We hypothesize that further performance improvements can be\nrealized by (I) optimizing its various hyperparameters, and by (II)\ntransferring techniques known to improve few-shot supervised learning to the AD\ndomain. 
Exhaustive experiments on the public VisA and MVTec AD datasets reveal\nthat (I) significant performance improvements can be realized by optimizing\nhyperparameters such as the underlying feature extractor, and that (II)\nimage-level augmentations can, but are not guaranteed to, improve performance.\nBased on these findings, we achieve a new state of the art in few-shot AD on\nVisA, further demonstrating the merit of adapting pre-existing AD/AS methods to\nthe few-shot setting. Last, we identify the investigation of feature extractors\nwith a strong inductive bias as a potential future research direction for\n(few-shot) AD/AS.\n","authors":["João Santos","Triet Tran","Oliver Rippel"],"pdf_url":"https://arxiv.org/pdf/2307.10792v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10788v1","updated":"2023-07-20T11:38:55Z","published":"2023-07-20T11:38:55Z","title":"Adversarial attacks for mixtures of classifiers","summary":" Mixtures of classifiers (a.k.a. randomized ensembles) have been proposed as a\nway to improve robustness against adversarial attacks. However, it has been\nshown that existing attacks are not well suited for this kind of classifier.\nIn this paper, we discuss the problem of attacking a mixture in a principled\nway and introduce two desirable properties of attacks based on a geometrical\nanalysis of the problem (effectiveness and maximality). We then show that\nexisting attacks do not meet both of these properties. Finally, we introduce a\nnew attack called lattice climber attack with theoretical guarantees on the\nbinary linear setting, and we demonstrate its performance by conducting\nexperiments on synthetic and real datasets.\n","authors":["Lucas Gnecco Heredia","Benjamin Negrevergne","Yann Chevaleyre"],"pdf_url":"https://arxiv.org/pdf/2307.10788v1.pdf","comment":"7 pages + 4 pages of appendix. 5 figures in main text"},{"id":"http://arxiv.org/abs/2307.09614v2","updated":"2023-07-20T11:36:52Z","published":"2023-07-13T19:03:06Z","title":"Multi-view self-supervised learning for multivariate variable-channel\n time series","summary":" Labeling of multivariate biomedical time series data is a laborious and\nexpensive process. Self-supervised contrastive learning alleviates the need for\nlarge, labeled datasets through pretraining on unlabeled data. However, for\nmultivariate time series data, the set of input channels often varies between\napplications, and most existing work does not allow for transfer between\ndatasets with different sets of input channels. We propose learning one encoder\nto operate on all input channels individually. We then use a message passing\nneural network to extract a single representation across channels. We\ndemonstrate the potential of this method by pretraining our model on a dataset\nwith six EEG channels and then fine-tuning it on a dataset with two different\nEEG channels. We compare models with and without the message passing neural\nnetwork across different contrastive loss functions. We show that our method,\ncombined with the TS2Vec loss, outperforms all other methods in most settings.\n","authors":["Thea Brüsch","Mikkel N. Schmidt","Tommy S. 
Alstrøm"],"pdf_url":"https://arxiv.org/pdf/2307.09614v2.pdf","comment":"To appear in proceedings of 2023 IEEE International workshop on\n Machine Learning for Signal Processing"},{"id":"http://arxiv.org/abs/2307.10787v1","updated":"2023-07-20T11:36:45Z","published":"2023-07-20T11:36:45Z","title":"Feed-Forward Source-Free Domain Adaptation via Class Prototypes","summary":" Source-free domain adaptation has become popular because of its practical\nusefulness and no need to access source data. However, the adaptation process\nstill takes a considerable amount of time and is predominantly based on\noptimization that relies on back-propagation. In this work we present a simple\nfeed-forward approach that challenges the need for back-propagation based\nadaptation. Our approach is based on computing prototypes of classes under the\ndomain shift using a pre-trained model. It achieves strong improvements in\naccuracy compared to the pre-trained model and requires only a small fraction\nof time of existing domain adaptation methods.\n","authors":["Ondrej Bohdal","Da Li","Timothy Hospedales"],"pdf_url":"https://arxiv.org/pdf/2307.10787v1.pdf","comment":"ECCV 2022 Workshop on Out of Distribution Generalization in Computer\n Vision (OOD-CV)"},{"id":"http://arxiv.org/abs/2307.10779v1","updated":"2023-07-20T11:29:17Z","published":"2023-07-20T11:29:17Z","title":"Efficient Beam Tree Recursion","summary":" Beam Tree Recursive Neural Network (BT-RvNN) was recently proposed as a\nsimple extension of Gumbel Tree RvNN and it was shown to achieve\nstate-of-the-art length generalization performance in ListOps while maintaining\ncomparable performance on other tasks. However, although not the worst in its\nkind, BT-RvNN can be still exorbitantly expensive in memory usage. In this\npaper, we identify the main bottleneck in BT-RvNN's memory usage to be the\nentanglement of the scorer function and the recursive cell function. We propose\nstrategies to remove this bottleneck and further simplify its memory usage.\nOverall, our strategies not only reduce the memory usage of BT-RvNN by\n$10$-$16$ times but also create a new state-of-the-art in ListOps while\nmaintaining similar performance in other tasks. In addition, we also propose a\nstrategy to utilize the induced latent-tree node representations produced by\nBT-RvNN to turn BT-RvNN from a sentence encoder of the form $f:\\mathbb{R}^{n\n\\times d} \\rightarrow \\mathbb{R}^{d}$ into a sequence contextualizer of the\nform $f:\\mathbb{R}^{n \\times d} \\rightarrow \\mathbb{R}^{n \\times d}$. Thus, our\nproposals not only open up a path for further scalability of RvNNs but also\nstandardize a way to use BT-RvNNs as another building block in the deep\nlearning toolkit that can be easily stacked or interfaced with other popular\nmodels such as Transformers and Structured State Space models.\n","authors":["Jishnu Ray Chowdhury","Cornelia Caragea"],"pdf_url":"https://arxiv.org/pdf/2307.10779v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10774v1","updated":"2023-07-20T11:14:24Z","published":"2023-07-20T11:14:24Z","title":"Assessing the Use of AutoML for Data-Driven Software Engineering","summary":" Background. 
Due to the widespread adoption of Artificial Intelligence (AI)\nand Machine Learning (ML) for building software applications, companies are\nstruggling to recruit employees with a deep understanding of such technologies.\nIn this scenario, AutoML is soaring as a promising solution to fill the AI/ML\nskills gap since it promises to automate the building of end-to-end AI/ML\npipelines that would normally be engineered by specialized team members. Aims.\nDespite the growing interest and high expectations, there is a dearth of\ninformation about the extent to which AutoML is currently adopted by teams\ndeveloping AI/ML-enabled systems and how it is perceived by practitioners and\nresearchers. Method. To fill these gaps, in this paper, we present a\nmixed-method study comprising a benchmark of 12 end-to-end AutoML tools on two\nSE datasets and a user survey with follow-up interviews to further our\nunderstanding of AutoML adoption and perception. Results. We found that AutoML\nsolutions can generate models that outperform those trained and optimized by\nresearchers to perform classification tasks in the SE domain. Also, our\nfindings show that the currently available AutoML solutions do not live up to\ntheir names as they do not equally support automation across the stages of the\nML development workflow and for all the team members. Conclusions. We derive\ninsights to inform the SE research community on how AutoML can facilitate their\nactivities and tool builders on how to design the next generation of AutoML\ntechnologies.\n","authors":["Fabio Calefato","Luigi Quaranta","Filippo Lanubile","Marcos Kalinowski"],"pdf_url":"https://arxiv.org/pdf/2307.10774v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10773v1","updated":"2023-07-20T11:10:06Z","published":"2023-07-20T11:10:06Z","title":"Music Genre Classification with ResNet and Bi-GRU Using Visual\n Spectrograms","summary":" Music recommendation systems have emerged as a vital component to enhance\nuser experience and satisfaction for the music streaming services, which\ndominate music consumption. The key challenge in improving these recommender\nsystems lies in comprehending the complexity of music data, specifically for\nthe underpinning music genre classification. The limitations of manual genre\nclassification have highlighted the need for a more advanced system, namely the\nAutomatic Music Genre Classification (AMGC) system. While traditional machine\nlearning techniques have shown potential in genre classification, they heavily\nrely on manually engineered features and feature selection, failing to capture\nthe full complexity of music data. On the other hand, deep learning\nclassification architectures like the traditional Convolutional Neural Networks\n(CNN) are effective in capturing the spatial hierarchies but struggle to\ncapture the temporal dynamics inherent in music data. To address these\nchallenges, this study proposes a novel approach using visual spectrograms as\ninput, and proposes a hybrid model that combines the strength of the Residual\nNeural Network (ResNet) and the Gated Recurrent Unit (GRU). 
This model is\ndesigned to provide a more comprehensive analysis of music data, offering the\npotential to improve the music recommender systems through achieving a more\ncomprehensive analysis of music data and hence potentially more accurate genre\nclassification.\n","authors":["Junfei Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.10773v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10768v1","updated":"2023-07-20T10:57:02Z","published":"2023-07-20T10:57:02Z","title":"Decoding the Enigma: Benchmarking Humans and AIs on the Many Facets of\n Working Memory","summary":" Working memory (WM), a fundamental cognitive process facilitating the\ntemporary storage, integration, manipulation, and retrieval of information,\nplays a vital role in reasoning and decision-making tasks. Robust benchmark\ndatasets that capture the multifaceted nature of WM are crucial for the\neffective development and evaluation of AI WM models. Here, we introduce a\ncomprehensive Working Memory (WorM) benchmark dataset for this purpose. WorM\ncomprises 10 tasks and a total of 1 million trials, assessing 4\nfunctionalities, 3 domains, and 11 behavioral and neural characteristics of WM.\nWe jointly trained and tested state-of-the-art recurrent neural networks and\ntransformers on all these tasks. We also include human behavioral benchmarks as\nan upper bound for comparison. Our results suggest that AI models replicate\nsome characteristics of WM in the brain, most notably primacy and recency\neffects, and neural clusters and correlates specialized for different domains\nand functionalities of WM. In the experiments, we also reveal some limitations\nin existing models to approximate human behavior. This dataset serves as a\nvaluable resource for communities in cognitive psychology, neuroscience, and\nAI, offering a standardized framework to compare and enhance WM models,\ninvestigate WM's neural underpinnings, and develop WM models with human-like\ncapabilities. Our source code and data are available at\nhttps://github.com/ZhangLab-DeepNeuroCogLab/WorM.\n","authors":["Ankur Sikarwar","Mengmi Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.10768v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10763v1","updated":"2023-07-20T10:53:12Z","published":"2023-07-20T10:53:12Z","title":"MSQNet: Actor-agnostic Action Recognition with Multi-modal Query","summary":" Existing action recognition methods are typically actor-specific due to the\nintrinsic topological and apparent differences among the actors. This requires\nactor-specific pose estimation (e.g., humans vs. animals), leading to\ncumbersome model design complexity and high maintenance costs. Moreover, they\noften focus on learning the visual modality alone and single-label\nclassification whilst neglecting other available information sources (e.g.,\nclass name text) and the concurrent occurrence of multiple actions. To overcome\nthese limitations, we propose a new approach called 'actor-agnostic multi-modal\nmulti-label action recognition,' which offers a unified solution for various\ntypes of actors, including humans and animals. We further formulate a novel\nMulti-modal Semantic Query Network (MSQNet) model in a transformer-based object\ndetection framework (e.g., DETR), characterized by leveraging visual and\ntextual modalities to represent the action classes better. The elimination of\nactor-specific model designs is a key advantage, as it removes the need for\nactor pose estimation altogether. 
Extensive experiments on five publicly\navailable benchmarks show that our MSQNet consistently outperforms the prior\narts of actor-specific alternatives on human and animal single- and multi-label\naction recognition tasks by up to 50%. Code will be released at\nhttps://github.com/mondalanindya/MSQNet.\n","authors":["Anindya Mondal","Sauradip Nag","Joaquin M Prada","Xiatian Zhu","Anjan Dutta"],"pdf_url":"https://arxiv.org/pdf/2307.10763v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13960v2","updated":"2023-07-20T10:26:56Z","published":"2023-06-24T13:29:54Z","title":"Regular SE(3) Group Convolutions for Volumetric Medical Image Analysis","summary":" Regular group convolutional neural networks (G-CNNs) have been shown to\nincrease model performance and improve equivariance to different geometrical\nsymmetries. This work addresses the problem of SE(3), i.e., roto-translation\nequivariance, on volumetric data. Volumetric image data is prevalent in many\nmedical settings. Motivated by the recent work on separable group convolutions,\nwe devise a SE(3) group convolution kernel separated into a continuous SO(3)\n(rotation) kernel and a spatial kernel. We approximate equivariance to the\ncontinuous setting by sampling uniform SO(3) grids. Our continuous SO(3) kernel\nis parameterized via RBF interpolation on similarly uniform grids. We\ndemonstrate the advantages of our approach in volumetric medical image\nanalysis. Our SE(3) equivariant models consistently outperform CNNs and regular\ndiscrete G-CNNs on challenging medical classification tasks and show\nsignificantly improved generalization capabilities. Our approach achieves up to\na 16.5% gain in accuracy over regular CNNs.\n","authors":["Thijs P. Kuipers","Erik J. Bekkers"],"pdf_url":"https://arxiv.org/pdf/2306.13960v2.pdf","comment":"10 pages, 1 figure, 2 tables, accepted at MICCAI 2023. Updated\n version to camera ready version 1"},{"id":"http://arxiv.org/abs/2307.10749v1","updated":"2023-07-20T10:24:18Z","published":"2023-07-20T10:24:18Z","title":"Mitigating Voter Attribute Bias for Fair Opinion Aggregation","summary":" The aggregation of multiple opinions plays a crucial role in decision-making,\nsuch as in hiring and loan review, and in labeling data for supervised\nlearning. Although majority voting and existing opinion aggregation models are\neffective for simple tasks, they are inappropriate for tasks without\nobjectively true labels in which disagreements may occur. In particular, when\nvoter attributes such as gender or race introduce bias into opinions, the\naggregation results may vary depending on the composition of voter attributes.\nA balanced group of voters is desirable for fair aggregation results but may be\ndifficult to prepare. In this study, we consider methods to achieve fair\nopinion aggregation based on voter attributes and evaluate the fairness of the\naggregated results. To this end, we consider an approach that combines opinion\naggregation models such as majority voting and the Dawid and Skene model (D&S\nmodel) with fairness options such as sample weighting. To evaluate the fairness\nof opinion aggregation, probabilistic soft labels are preferred over discrete\nclass labels. First, we address the problem of soft label estimation without\nconsidering voter attributes and identify some issues with the D&S model. To\naddress these limitations, we propose a new Soft D&S model with improved\naccuracy in estimating soft labels. 
Moreover, we evaluated the fairness of an\nopinion aggregation model, including Soft D&S, in combination with different\nfairness options using synthetic and semi-synthetic data. The experimental\nresults suggest that the combination of Soft D&S and data splitting as a\nfairness option is effective for dense data, whereas weighted majority voting\nis effective for sparse data. These findings should prove particularly valuable\nin supporting decision-making by human and machine-learning models with\nbalanced opinion aggregation.\n","authors":["Ryosuke Ueda","Koh Takeuchi","Hisashi Kashima"],"pdf_url":"https://arxiv.org/pdf/2307.10749v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10738v1","updated":"2023-07-20T10:04:55Z","published":"2023-07-20T10:04:55Z","title":"Fairness-Aware Client Selection for Federated Learning","summary":" Federated learning (FL) has enabled multiple data owners (a.k.a. FL clients)\nto train machine learning models collaboratively without revealing private\ndata. Since the FL server can only engage a limited number of clients in each\ntraining round, FL client selection has become an important research problem.\nExisting approaches generally focus on either enhancing FL model performance or\nenhancing the fair treatment of FL clients. The problem of balancing\nperformance and fairness considerations when selecting FL clients remains open.\nTo address this problem, we propose the Fairness-aware Federated Client\nSelection (FairFedCS) approach. Based on Lyapunov optimization, it dynamically\nadjusts FL clients' selection probabilities by jointly considering their\nreputations, times of participation in FL tasks and contributions to the\nresulting model performance. By not using threshold-based reputation filtering,\nit provides FL clients with opportunities to redeem their reputations after a\nperceived poor performance, thereby further enhancing fair client treatment.\nExtensive experiments based on real-world multimedia datasets show that\nFairFedCS achieves 19.6% higher fairness and 0.73% higher test accuracy on\naverage than the best-performing state-of-the-art approach.\n","authors":["Yuxin Shi","Zelei Liu","Zhuan Shi","Han Yu"],"pdf_url":"https://arxiv.org/pdf/2307.10738v1.pdf","comment":"Accepted by ICME 2023"},{"id":"http://arxiv.org/abs/2307.10736v1","updated":"2023-07-20T10:03:50Z","published":"2023-07-20T10:03:50Z","title":"Long-Tail Theory under Gaussian Mixtures","summary":" We suggest a simple Gaussian mixture model for data generation that complies\nwith Feldman's long tail theory (2020). We demonstrate that a linear classifier\ncannot decrease the generalization error below a certain level in the proposed\nmodel, whereas a nonlinear classifier with a memorization capacity can. This\nconfirms that for long-tailed distributions, rare training examples must be\nconsidered for optimal generalization to new data. 
Finally, we show that the\nperformance gap between linear and nonlinear models can be lessened as the tail\nbecomes shorter in the subpopulation frequency distribution, as confirmed by\nexperiments on synthetic and real data.\n","authors":["Arman Bolatov","Maxat Tezekbayev","Igor Melnykov","Artur Pak","Vassilina Nikoulina","Zhenisbek Assylbekov"],"pdf_url":"https://arxiv.org/pdf/2307.10736v1.pdf","comment":"accepted to ECAI 2023"},{"id":"http://arxiv.org/abs/2307.10718v1","updated":"2023-07-20T09:24:23Z","published":"2023-07-20T09:24:23Z","title":"Differences Between Hard and Noisy-labeled Samples: An Empirical Study","summary":" Extracting noisy or incorrectly labeled samples from a labeled dataset with\nhard/difficult samples is an important yet under-explored topic. Two general\nand often independent lines of work exist, one focuses on addressing noisy\nlabels, and another deals with hard samples. However, when both types of data\nare present, most existing methods treat them equally, which results in a\ndecline in the overall performance of the model. In this paper, we first design\nvarious synthetic datasets with custom hardness and noisiness levels for\ndifferent samples. Our proposed systematic empirical study enables us to better\nunderstand the similarities and more importantly the differences between\nhard-to-learn samples and incorrectly-labeled samples. These controlled\nexperiments pave the way for the development of methods that distinguish\nbetween hard and noisy samples. Through our study, we introduce a simple yet\neffective metric that filters out noisy-labeled samples while keeping the hard\nsamples. We study various data partitioning methods in the presence of label\nnoise and observe that filtering out noisy samples from hard samples with this\nproposed metric results in the best datasets as evidenced by the high test\naccuracy achieved after models are trained on the filtered datasets. We\ndemonstrate this for both our created synthetic datasets and for datasets with\nreal-world label noise. Furthermore, our proposed data partitioning method\nsignificantly outperforms other methods when employed within a semi-supervised\nlearning framework.\n","authors":["Mahsa Forouzesh","Patrick Thiran"],"pdf_url":"https://arxiv.org/pdf/2307.10718v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10710v1","updated":"2023-07-20T09:05:46Z","published":"2023-07-20T09:05:46Z","title":"Reparameterized Policy Learning for Multimodal Trajectory Optimization","summary":" We investigate the challenge of parametrizing policies for reinforcement\nlearning (RL) in high-dimensional continuous action spaces. Our objective is to\ndevelop a multimodal policy that overcomes limitations inherent in the\ncommonly-used Gaussian parameterization. To achieve this, we propose a\nprincipled framework that models the continuous RL policy as a generative model\nof optimal trajectories. By conditioning the policy on a latent variable, we\nderive a novel variational bound as the optimization objective, which promotes\nexploration of the environment. We then present a practical model-based RL\nmethod, called Reparameterized Policy Gradient (RPG), which leverages the\nmultimodal policy parameterization and learned world model to achieve strong\nexploration capabilities and high data efficiency. Empirical results\ndemonstrate that our method can help agents evade local optima in tasks with\ndense rewards and solve challenging sparse-reward environments by incorporating\nan object-centric intrinsic reward. 
Our method consistently outperforms\nprevious approaches across a range of tasks. Code and supplementary materials\nare available on the project page https://haosulab.github.io/RPG/\n","authors":["Zhiao Huang","Litian Liang","Zhan Ling","Xuanlin Li","Chuang Gan","Hao Su"],"pdf_url":"https://arxiv.org/pdf/2307.10710v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10705v1","updated":"2023-07-20T08:53:47Z","published":"2023-07-20T08:53:47Z","title":"TwinLiteNet: An Efficient and Lightweight Model for Driveable Area and\n Lane Segmentation in Self-Driving Cars","summary":" Semantic segmentation is a common task in autonomous driving to understand\nthe surrounding environment. Driveable Area Segmentation and Lane Detection are\nparticularly important for safe and efficient navigation on the road. However,\noriginal semantic segmentation models are computationally expensive and require\nhigh-end hardware, which is not feasible for embedded systems in autonomous\nvehicles. This paper proposes a lightweight model for the driveable area and\nlane line segmentation. TwinLiteNet is designed cheaply but achieves accurate\nand efficient segmentation results. We evaluate TwinLiteNet on the BDD100K\ndataset and compare it with modern models. Experimental results show that our\nTwinLiteNet performs similarly to existing approaches, requiring significantly\nfewer computational resources. Specifically, TwinLiteNet achieves a mIoU score\nof 91.3% for the Drivable Area task and 31.08% IoU for the Lane Detection task\nwith only 0.4 million parameters and achieves 415 FPS on GPU RTX A5000.\nFurthermore, TwinLiteNet can run in real-time on embedded devices with limited\ncomputing power, especially since it achieves 60FPS on Jetson Xavier NX, making\nit an ideal solution for self-driving vehicles. Code is available:\nurl{https://github.com/chequanghuy/TwinLiteNet}.\n","authors":["Quang Huy Che","Dinh Phuc Nguyen","Minh Quan Pham","Duc Khai Lam"],"pdf_url":"https://arxiv.org/pdf/2307.10705v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10704v1","updated":"2023-07-20T08:53:16Z","published":"2023-07-20T08:53:16Z","title":"Decentralized Smart Charging of Large-Scale EVs using Adaptive\n Multi-Agent Multi-Armed Bandits","summary":" The drastic growth of electric vehicles and photovoltaics can introduce new\nchallenges, such as electrical current congestion and voltage limit violations\ndue to peak load demands. These issues can be mitigated by controlling the\noperation of electric vehicles i.e., smart charging. Centralized smart charging\nsolutions have already been proposed in the literature. But such solutions may\nlack scalability and suffer from inherent drawbacks of centralization, such as\na single point of failure, and data privacy concerns. Decentralization can help\ntackle these challenges. In this paper, a fully decentralized smart charging\nsystem is proposed using the philosophy of adaptive multi-agent systems. The\nproposed system utilizes multi-armed bandit learning to handle uncertainties in\nthe system. The presented system is decentralized, scalable, real-time,\nmodel-free, and takes fairness among different players into account. 
A detailed\ncase study is also presented for performance evaluation.\n","authors":["Sharyal Zafar","Raphaël Feraud","Anne Blavette","Guy Camilleri","Hamid Ben"],"pdf_url":"https://arxiv.org/pdf/2307.10704v1.pdf","comment":"CIRED 2023 International Conference & Exhibition on Electricity\n Distribution, Jun 2023, Rome, Italy"},{"id":"http://arxiv.org/abs/2307.10703v1","updated":"2023-07-20T08:50:16Z","published":"2023-07-20T08:50:16Z","title":"Graphs in State-Space Models for Granger Causality in Climate Science","summary":" Granger causality (GC) is often considered not an actual form of causality.\nStill, it is arguably the most widely used method to assess the predictability\nof a time series from another one. Granger causality has been widely used in\nmany applied disciplines, from neuroscience and econometrics to Earth sciences.\nWe revisit GC under a graphical perspective of state-space models. For that, we\nuse GraphEM, a recently presented expectation-maximisation algorithm for\nestimating the linear matrix operator in the state equation of a\nlinear-Gaussian state-space model. Lasso regularisation is included in the\nM-step, which is solved using a proximal splitting Douglas-Rachford algorithm.\nExperiments in toy examples and challenging climate problems illustrate the\nbenefits of the proposed model and inference technique over standard Granger\ncausality methods.\n","authors":["Víctor Elvira","Émilie Chouzenoux","Jordi Cerdà","Gustau Camps-Valls"],"pdf_url":"https://arxiv.org/pdf/2307.10703v1.pdf","comment":"4 pages, 2 figures, 3 tables, CausalStats23: When Causal Inference\n meets Statistical Analysis, April 17-21, 2023, Paris, France"},{"id":"http://arxiv.org/abs/2205.09753v2","updated":"2023-07-20T08:41:46Z","published":"2022-04-30T07:08:30Z","title":"HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory\n Prediction via Scene Encoding","summary":" Encoding a driving scene into vector representations has been an essential\ntask for autonomous driving that can benefit downstream tasks e.g. trajectory\nprediction. The driving scene often involves heterogeneous elements such as the\ndifferent types of objects (agents, lanes, traffic signs) and the semantic\nrelations between objects are rich and diverse. Meanwhile, there also exist\nrelativity across elements, which means that the spatial relation is a relative\nconcept and need be encoded in a ego-centric manner instead of in a global\ncoordinate system. Based on these observations, we propose Heterogeneous\nDriving Graph Transformer (HDGT), a backbone modelling the driving scene as a\nheterogeneous graph with different types of nodes and edges. For heterogeneous\ngraph construction, we connect different types of nodes according to diverse\nsemantic relations. For spatial relation encoding, the coordinates of the node\nas well as its in-edges are in the local node-centric coordinate system. For\nthe aggregation module in the graph neural network (GNN), we adopt the\ntransformer structure in a hierarchical way to fit the heterogeneous nature of\ninputs. Experimental results show that HDGT achieves state-of-the-art\nperformance for the task of trajectory prediction, on INTERACTION Prediction\nChallenge and Waymo Open Motion Challenge.\n","authors":["Xiaosong Jia","Penghao Wu","Li Chen","Yu Liu","Hongyang Li","Junchi Yan"],"pdf_url":"https://arxiv.org/pdf/2205.09753v2.pdf","comment":"Accepted at IEEE TPAMI in 2023. 
Code url:\n https://github.com/OpenDriveLab/HDGT"},{"id":"http://arxiv.org/abs/2307.10695v1","updated":"2023-07-20T08:38:01Z","published":"2023-07-20T08:38:01Z","title":"Self2Self+: Single-Image Denoising with Self-Supervised Learning and\n Image Quality Assessment Loss","summary":" Recently, denoising methods based on supervised learning have exhibited\npromising performance. However, their reliance on external datasets containing\nnoisy-clean image pairs restricts their applicability. To address this\nlimitation, researchers have focused on training denoising networks using\nsolely a set of noisy inputs. To improve the feasibility of denoising\nprocedures, in this study, we proposed a single-image self-supervised learning\nmethod in which only the noisy input image is used for network training. Gated\nconvolution was used for feature extraction and no-reference image quality\nassessment was used for guiding the training process. Moreover, the proposed\nmethod sampled instances from the input image dataset using Bernoulli sampling\nwith a certain dropout rate for training. The corresponding result was produced\nby averaging the generated predictions from various instances of the trained\nnetwork with dropouts. The experimental results indicated that the proposed\nmethod achieved state-of-the-art denoising performance on both synthetic and\nreal-world datasets. This highlights the effectiveness and practicality of our\nmethod as a potential solution for various noise removal tasks.\n","authors":["Jaekyun Ko","Sanghwan Lee"],"pdf_url":"https://arxiv.org/pdf/2307.10695v1.pdf","comment":"Technical report and supplemantry materials are combined into one\n paper. - Technical report: Page 1~7 - Supplemantry materials : Page 8~18"},{"id":"http://arxiv.org/abs/2302.08292v3","updated":"2023-07-20T08:35:26Z","published":"2023-02-16T13:41:19Z","title":"Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation\n for autonomous vehicles","summary":" Autonomous driving (AD) perception today relies heavily on deep learning\nbased architectures requiring large scale annotated datasets with their\nassociated costs for curation and annotation. The 3D semantic data are useful\nfor core perception tasks such as obstacle detection and ego-vehicle\nlocalization. We propose a new dataset, Navya 3D Segmentation (Navya3DSeg),\nwith a diverse label space corresponding to a large scale production grade\noperational domain, including rural, urban, industrial sites and universities\nfrom 13 countries. It contains 23 labeled sequences and 25 supplementary\nsequences without labels, designed to explore self-supervised and\nsemi-supervised semantic segmentation benchmarks on point clouds. We also\npropose a novel method for sequential dataset split generation based on\niterative multi-label stratification, and demonstrated to achieve a +1.2% mIoU\nimprovement over the original split proposed by SemanticKITTI dataset. A\ncomplete benchmark for semantic segmentation task was performed, with state of\nthe art methods. Finally, we demonstrate an Active Learning (AL) based dataset\ndistillation framework. We introduce a novel heuristic-free sampling method\ncalled ego-pose distance based sampling in the context of AL. A detailed\npresentation on the dataset is available here\nhttps://www.youtube.com/watch?v=5m6ALIs-s20.\n","authors":["Alexandre Almin","Léo Lemarié","Anh Duong","B Ravi Kiran"],"pdf_url":"https://arxiv.org/pdf/2302.08292v3.pdf","comment":"Accepted version to IEEE RA-L. 
Version with supplementary materials"},{"id":"http://arxiv.org/abs/2307.10683v1","updated":"2023-07-20T08:20:12Z","published":"2023-07-20T08:20:12Z","title":"Fractional Denoising for 3D Molecular Pre-training","summary":" Coordinate denoising is a promising 3D molecular pre-training method, which\nhas achieved remarkable performance in various downstream drug discovery tasks.\nTheoretically, the objective is equivalent to learning the force field, which\nis revealed helpful for downstream tasks. Nevertheless, there are two\nchallenges for coordinate denoising to learn an effective force field, i.e. low\ncoverage samples and isotropic force field. The underlying reason is that\nmolecular distributions assumed by existing denoising methods fail to capture\nthe anisotropic characteristic of molecules. To tackle these challenges, we\npropose a novel hybrid noise strategy, including noises on both dihedral angel\nand coordinate. However, denoising such hybrid noise in a traditional way is no\nmore equivalent to learning the force field. Through theoretical deductions, we\nfind that the problem is caused by the dependency of the input conformation for\ncovariance. To this end, we propose to decouple the two types of noise and\ndesign a novel fractional denoising method (Frad), which only denoises the\nlatter coordinate part. In this way, Frad enjoys both the merits of sampling\nmore low-energy structures and the force field equivalence. Extensive\nexperiments show the effectiveness of Frad in molecular representation, with a\nnew state-of-the-art on 9 out of 12 tasks of QM9 and on 7 out of 8 targets of\nMD17.\n","authors":["Shikun Feng","Yuyan Ni","Yanyan Lan","Zhi-Ming Ma","Wei-Ying Ma"],"pdf_url":"https://arxiv.org/pdf/2307.10683v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10677v1","updated":"2023-07-20T07:57:14Z","published":"2023-07-20T07:57:14Z","title":"Deep learning for classification of noisy QR codes","summary":" We wish to define the limits of a classical classification model based on\ndeep learning when applied to abstract images, which do not represent visually\nidentifiable objects.QR codes (Quick Response codes) fall into this category of\nabstract images: one bit corresponding to one encoded character, QR codes were\nnot designed to be decoded manually. To understand the limitations of a deep\nlearning-based model for abstract image classification, we train an image\nclassification model on QR codes generated from information obtained when\nreading a health pass. We compare a classification model with a classical\n(deterministic) decoding method in the presence of noise. This study allows us\nto conclude that a model based on deep learning can be relevant for the\nunderstanding of abstract images.\n","authors":["Rebecca Leygonie","Sylvain Lobry"," )","Laurent Wendling (LIPADE)"],"pdf_url":"https://arxiv.org/pdf/2307.10677v1.pdf","comment":"in French language. RFIAP 2022 - Reconnaissance des Formes, Image,\n Apprentissage et Perception, Jul 2022, Vannes (Bretagne), France"},{"id":"http://arxiv.org/abs/2307.07666v2","updated":"2023-07-20T07:55:04Z","published":"2023-07-15T00:26:51Z","title":"Efficient Action Robust Reinforcement Learning with Probabilistic Policy\n Execution Uncertainty","summary":" Robust reinforcement learning (RL) aims to find a policy that optimizes the\nworst-case performance in the face of uncertainties. 
In this paper, we focus on\naction robust RL with the probabilistic policy execution uncertainty, in which,\ninstead of always carrying out the action specified by the policy, the agent\nwill take the action specified by the policy with probability $1-\\rho$ and an\nalternative adversarial action with probability $\\rho$. We establish the\nexistence of an optimal policy on the action robust MDPs with probabilistic\npolicy execution uncertainty and provide the action robust Bellman optimality\nequation for its solution. Furthermore, we develop Action Robust Reinforcement\nLearning with Certificates (ARRLC) algorithm that achieves minimax optimal\nregret and sample complexity. Furthermore, we conduct numerical experiments to\nvalidate our approach's robustness, demonstrating that ARRLC outperforms\nnon-robust RL algorithms and converges faster than the robust TD algorithm in\nthe presence of action perturbations.\n","authors":["Guanlin Liu","Zhihan Zhou","Han Liu","Lifeng Lai"],"pdf_url":"https://arxiv.org/pdf/2307.07666v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10655v1","updated":"2023-07-20T07:35:42Z","published":"2023-07-20T07:35:42Z","title":"A Survey of What to Share in Federated Learning: Perspectives on Model\n Utility, Privacy Leakage, and Communication Efficiency","summary":" Federated learning (FL) has emerged as a highly effective paradigm for\nprivacy-preserving collaborative training among different parties. Unlike\ntraditional centralized learning, which requires collecting data from each\nparty, FL allows clients to share privacy-preserving information without\nexposing private datasets. This approach not only guarantees enhanced privacy\nprotection but also facilitates more efficient and secure collaboration among\nmultiple participants. Therefore, FL has gained considerable attention from\nresearchers, promoting numerous surveys to summarize the related works.\nHowever, the majority of these surveys concentrate on methods sharing model\nparameters during the training process, while overlooking the potential of\nsharing other forms of local information. In this paper, we present a\nsystematic survey from a new perspective, i.e., what to share in FL, with an\nemphasis on the model utility, privacy leakage, and communication efficiency.\nThis survey differs from previous ones due to four distinct contributions.\nFirst, we present a new taxonomy of FL methods in terms of the sharing methods,\nwhich includes three categories of shared information: model sharing, synthetic\ndata sharing, and knowledge sharing. Second, we analyze the vulnerability of\ndifferent sharing methods to privacy attacks and review the defense mechanisms\nthat provide certain privacy guarantees. Third, we conduct extensive\nexperiments to compare the performance and communication overhead of various\nsharing methods in FL. Besides, we assess the potential privacy leakage through\nmodel inversion and membership inference attacks, while comparing the\neffectiveness of various defense approaches. 
Finally, we discuss potential\ndeficiencies in current methods and outline future directions for improvement.\n","authors":["Jiawei Shao","Zijian Li","Wenqiang Sun","Tailin Zhou","Yuchang Sun","Lumin Liu","Zehong Lin","Jun Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.10655v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10654v1","updated":"2023-07-20T07:35:15Z","published":"2023-07-20T07:35:15Z","title":"Conditional expectation network for SHAP","summary":" A very popular model-agnostic technique for explaining predictive models is\nthe SHapley Additive exPlanation (SHAP). The two most popular versions of SHAP\nare a conditional expectation version and an unconditional expectation version\n(the latter is also known as interventional SHAP). Except for tree-based\nmethods, usually the unconditional version is used (for computational reasons).\nWe provide a (surrogate) neural network approach which allows us to efficiently\ncalculate the conditional version for both neural networks and other regression\nmodels, and which properly considers the dependence structure in the feature\ncomponents. This proposal is also useful to provide drop1 and anova analyses in\ncomplex regression models which are similar to their generalized linear model\n(GLM) counterparts, and we provide a partial dependence plot (PDP) counterpart\nthat considers the right dependence structure in the feature components.\n","authors":["Ronald Richman","Mario V. Wüthrich"],"pdf_url":"https://arxiv.org/pdf/2307.10654v1.pdf","comment":"24 pages, 9 figures"},{"id":"http://arxiv.org/abs/2307.10653v1","updated":"2023-07-20T07:33:36Z","published":"2023-07-20T07:33:36Z","title":"Refining the Optimization Target for Automatic Univariate Time Series\n Anomaly Detection in Monitoring Services","summary":" Time series anomaly detection is crucial for industrial monitoring services\nthat handle a large volume of data, aiming to ensure reliability and optimize\nsystem performance. Existing methods often require extensive labeled resources\nand manual parameter selection, highlighting the need for automation. This\npaper proposes a comprehensive framework for automatic parameter optimization\nin time series anomaly detection models. The framework introduces three\noptimization targets: prediction score, shape score, and sensitivity score,\nwhich can be easily adapted to different model backbones without prior\nknowledge or manual labeling efforts. The proposed framework has been\nsuccessfully applied online for over six months, serving more than 50,000 time\nseries every minute. It simplifies the user's experience by requiring only an\nexpected sensitive value, offering a user-friendly interface, and achieving\ndesired detection results. Extensive evaluations conducted on public datasets\nand comparison with other methods further confirm the effectiveness of the\nproposed framework.\n","authors":["Manqing Dong","Zhanxiang Zhao","Yitong Geng","Wentao Li","Wei Wang","Huai Jiang"],"pdf_url":"https://arxiv.org/pdf/2307.10653v1.pdf","comment":"Accepted by 2023 IJCAI Workshop"},{"id":"http://arxiv.org/abs/2307.10648v1","updated":"2023-07-20T07:23:15Z","published":"2023-07-20T07:23:15Z","title":"Data-Driven Latency Probability Prediction for Wireless Networks:\n Focusing on Tail Probabilities","summary":" With the emergence of new application areas, such as cyber-physical systems\nand human-in-the-loop applications, there is a need to guarantee a certain\nlevel of end-to-end network latency with extremely high reliability, e.g.,\n99.999%. 
While mechanisms specified under IEEE 802.1as time-sensitive\nnetworking (TSN) can be used to achieve these requirements for switched\nEthernet networks, implementing TSN mechanisms in wireless networks is\nchallenging due to their stochastic nature. To conform the wireless link to a\nreliability level of 99.999%, the behavior of extremely rare outliers in the\nlatency probability distribution, or the tail of the distribution, must be\nanalyzed and controlled. This work proposes predicting the tail of the latency\ndistribution using state-of-the-art data-driven approaches, such as mixture\ndensity networks (MDN) and extreme value mixture models, to estimate the\nlikelihood of rare latencies conditioned on the network parameters, which can\nbe used to make more informed decisions in wireless transmission. Actual\nlatency measurements of IEEE 802.11g (WiFi), commercial private and a\nsoftware-defined 5G network are used to benchmark the proposed approaches and\nevaluate their sensitivities concerning the tail probabilities.\n","authors":["Samie Mostafavi","Gourav Prateek Sharma","James Gross"],"pdf_url":"https://arxiv.org/pdf/2307.10648v1.pdf","comment":"Submitted to IEEE Global Communications (GLOBECOM) 2023 conference"},{"id":"http://arxiv.org/abs/2305.15776v2","updated":"2023-07-20T07:20:20Z","published":"2023-05-25T06:43:42Z","title":"AUC Optimization from Multiple Unlabeled Datasets","summary":" Weakly supervised learning aims to empower machine learning when the perfect\nsupervision is unavailable, which has drawn great attention from researchers.\nAmong various types of weak supervision, one of the most challenging cases is\nto learn from multiple unlabeled (U) datasets with only a little knowledge of\nthe class priors, or U$^m$ learning for short. In this paper, we study the\nproblem of building an AUC (area under ROC curve) optimization model from\nmultiple unlabeled datasets, which maximizes the pairwise ranking ability of\nthe classifier. We propose U$^m$-AUC, an AUC optimization approach that\nconverts the U$^m$ data into a multi-label AUC optimization problem, and can be\ntrained efficiently. We show that the proposed U$^m$-AUC is effective\ntheoretically and empirically.\n","authors":["Yu Liu","Zheng Xie","Ming Li"],"pdf_url":"https://arxiv.org/pdf/2305.15776v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10644v1","updated":"2023-07-20T07:14:58Z","published":"2023-07-20T07:14:58Z","title":"Fisher-Rao distance and pullback SPD cone distances between multivariate\n normal distributions","summary":" Data sets of multivariate normal distributions abound in many scientific\nareas like diffusion tensor imaging, structure tensor computer vision, radar\nsignal processing, machine learning, just to name a few. In order to process\nthose normal data sets for downstream tasks like filtering, classification or\nclustering, one needs to define proper notions of dissimilarities between\nnormals and paths joining them. The Fisher-Rao distance defined as the\nRiemannian geodesic distance induced by the Fisher information metric is such a\nprincipled metric distance which however is not known in closed-form excepts\nfor a few particular cases. In this work, we first report a fast and robust\nmethod to approximate arbitrarily finely the Fisher-Rao distance between\nmultivariate normal distributions. 
Second, we introduce a class of distances\nbased on diffeomorphic embeddings of the normal manifold into a submanifold of\nthe higher-dimensional symmetric positive-definite cone corresponding to the\nmanifold of centered normal distributions. We show that the projective Hilbert\ndistance on the cone yields a metric on the embedded normal submanifold and we\npullback that cone distance with its associated straight line Hilbert cone\ngeodesics to obtain a distance and smooth paths between normal distributions.\nCompared to the Fisher-Rao distance approximation, the pullback Hilbert cone\ndistance is computationally light since it requires to compute only the extreme\nminimal and maximal eigenvalues of matrices. Finally, we show how to use those\ndistances in clustering tasks.\n","authors":["Frank Nielsen"],"pdf_url":"https://arxiv.org/pdf/2307.10644v1.pdf","comment":"25 pages"},{"id":"http://arxiv.org/abs/2208.06620v2","updated":"2023-07-20T07:09:45Z","published":"2022-08-13T10:36:04Z","title":"Opinion Market Model: Stemming Far-Right Opinion Spread using Positive\n Interventions","summary":" Online extremism has severe societal consequences, including normalizing hate\nspeech, user radicalization, and increased social divisions. Various mitigation\nstrategies have been explored to address these consequences. One such strategy\nuses positive interventions: controlled signals that add attention to the\nopinion ecosystem to boost certain opinions. To evaluate the effectiveness of\npositive interventions, we introduce the Opinion Market Model (OMM), a two-tier\nonline opinion ecosystem model that considers both inter-opinion interactions\nand the role of positive interventions. The size of the opinion attention\nmarket is modeled in the first tier using the multivariate discrete-time Hawkes\nprocess; in the second tier, opinions cooperate and compete for market share,\ngiven limited attention using the market share attraction model. We demonstrate\nthe convergence of our proposed estimation scheme on a synthetic dataset. Next,\nwe test OMM on two learning tasks, applying to two real-world datasets to\npredict attention market shares and uncover latent relationships between online\nitems. The first dataset comprises Facebook and Twitter discussions containing\nmoderate and far-right opinions about bushfires and climate change. The second\ndataset captures popular VEVO artists' YouTube and Twitter attention volumes.\nOMM outperforms the state-of-the-art predictive models on both datasets and\ncaptures latent cooperation-competition relations. We uncover (1) self- and\ncross-reinforcement between far-right and moderate opinions on the bushfires\nand (2) pairwise artist relations that correlate with real-world interactions\nsuch as collaborations and long-lasting feuds. Lastly, we use OMM as a testbed\nfor positive interventions and show how media coverage modulates the spread of\nfar-right opinions.\n","authors":["Pio Calderon","Rohit Ram","Marian-Andrei Rizoiu"],"pdf_url":"https://arxiv.org/pdf/2208.06620v2.pdf","comment":"accepted in the 18th AAAI International Conference on Web and Social\n Media (ICWSM'24)"},{"id":"http://arxiv.org/abs/2305.08396v3","updated":"2023-07-20T07:06:03Z","published":"2023-05-15T07:23:54Z","title":"MaxViT-UNet: Multi-Axis Attention for Medical Image Segmentation","summary":" Convolutional Neural Networks (CNNs) have made significant strides in medical\nimage analysis in recent years. 
However, the local nature of the convolution\noperator may pose a limitation for capturing global and long-range interactions\nin CNNs. Recently, Transformers have gained popularity in the computer vision\ncommunity and also medical image segmentation due to their ability to process\nglobal features effectively. The scalability issues of self-attention mechanism\nand lack of the CNN-like inductive bias may have limited their adoption.\nTherefore, hybrid Vision transformers (CNN-Transformer), exploiting advantages\nof both Convolution and Self-attention Mechanisms, have gained importance. In\nthis work, we present MaxViT-UNet, an Encoder-Decoder based hybrid vision\ntransformer (CNN-Transformer) for medical image segmentation. The proposed\nHybrid Decoder, based on MaxViT-block, is designed to harness the power of both\nthe convolution and self-attention mechanisms at each decoding stage with\nnominal computational burden. The inclusion of multi-axis self-attention,\nwithin each decoder stage, significantly enhances the discriminating capacity\nbetween the object and background regions, and thereby helps in improving the\nsegmentation efficiency. In the Hybrid Decoder block, the fusion process\ncommences by integrating the upsampled lower level decoder features, obtained\nthrough transpose convolution, with the skip-connection features derived from\nthe hybrid encoder. Subsequently, the fused features undergo refinement through\nthe utilization of a multi-axis attention mechanism. The proposed decoder block\nis repeated multiple times to progressively segment the nuclei regions.\nExperimental results on MoNuSeg18 and MoNuSAC20 dataset demonstrates the\neffectiveness of the proposed technique.\n","authors":["Abdul Rehman Khan","Asifullah Khan"],"pdf_url":"https://arxiv.org/pdf/2305.08396v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10635v1","updated":"2023-07-20T07:01:57Z","published":"2023-07-20T07:01:57Z","title":"SciBench: Evaluating College-Level Scientific Problem-Solving Abilities\n of Large Language Models","summary":" Recent advances in large language models (LLMs) have demonstrated notable\nprogress on many mathematical benchmarks. However, most of these benchmarks\nonly feature problems grounded in junior and senior high school subjects,\ncontain only multiple-choice questions, and are confined to a limited scope of\nelementary arithmetic operations. To address these issues, this paper\nintroduces an expansive benchmark suite SciBench that aims to systematically\nexamine the reasoning capabilities required for complex scientific problem\nsolving. SciBench contains two carefully curated datasets: an open set\nfeaturing a range of collegiate-level scientific problems drawn from\nmathematics, chemistry, and physics textbooks, and a closed set comprising\nproblems from undergraduate-level exams in computer science and mathematics.\nBased on the two datasets, we conduct an in-depth benchmark study of two\nrepresentative LLMs with various prompting strategies. The results reveal that\ncurrent LLMs fall short of delivering satisfactory performance, with an overall\nscore of merely 35.80%. Furthermore, through a detailed user study, we\ncategorize the errors made by LLMs into ten problem-solving abilities. Our\nanalysis indicates that no single prompting strategy significantly outperforms\nothers and some strategies that demonstrate improvements in certain\nproblem-solving skills result in declines in other skills. 
We envision that\nSciBench will catalyze further developments in the reasoning abilities of LLMs,\nthereby ultimately contributing to scientific research and discovery.\n","authors":["Xiaoxuan Wang","Ziniu Hu","Pan Lu","Yanqiao Zhu","Jieyu Zhang","Satyen Subramaniam","Arjun R. Loomba","Shichang Zhang","Yizhou Sun","Wei Wang"],"pdf_url":"https://arxiv.org/pdf/2307.10635v1.pdf","comment":"Work in progress, 18 pages"},{"id":"http://arxiv.org/abs/2307.10634v1","updated":"2023-07-20T06:59:02Z","published":"2023-07-20T06:59:02Z","title":"Generative Language Models on Nucleotide Sequences of Human Genes","summary":" Language models, primarily transformer-based ones, obtained colossal success\nin NLP. To be more precise, studies like BERT in NLU and works such as GPT-3\nfor NLG are very crucial. DNA sequences are very close to natural language in\nterms of structure, so if the DNA-related bioinformatics domain is concerned,\ndiscriminative models, like DNABert, exist. Yet, the generative side of the\ncoin is mainly unexplored to the best of our knowledge. Consequently, we\nfocused on developing an autoregressive generative language model like GPT-3\nfor DNA sequences. Because working with whole DNA sequences is challenging\nwithout substantial computational resources, we decided to carry out our study\non a smaller scale, focusing on nucleotide sequences of human genes, unique\nparts in DNA with specific functionalities, instead of the whole DNA. This\ndecision did not change the problem structure a lot due to the fact that both\nDNA and genes can be seen as 1D sequences consisting of four different\nnucleotides without losing much information and making too much simplification.\nFirst of all, we systematically examined an almost entirely unexplored problem\nand observed that RNNs performed the best while simple techniques like N-grams\nwere also promising. Another beneficial point was learning how to work with\ngenerative models on languages we do not understand, unlike natural language.\nHow essential using real-life tasks beyond the classical metrics such as\nperplexity is observed. Furthermore, checking whether the data-hungry nature of\nthese models can be changed through selecting a language with minimal\nvocabulary size, four owing to four different types of nucleotides, is\nexamined. The reason for reviewing this was that choosing such a language might\nmake the problem easier. However, what we observed in this study was it did not\nprovide that much of a change in the amount of data needed.\n","authors":["Musa Nuri Ihtiyar","Arzucan Ozgur"],"pdf_url":"https://arxiv.org/pdf/2307.10634v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10633v1","updated":"2023-07-20T06:58:55Z","published":"2023-07-20T06:58:55Z","title":"Multi-Method Self-Training: Improving Code Generation With Text, And\n Vice Versa","summary":" Large Language Models have many methods for solving the same problem. This\nintroduces novel strengths (different methods may work well for different\nproblems) and weaknesses (it may be difficult for users to know which method to\nuse). In this paper, we introduce Multi-Method Self-Training (MMST), where one\nmethod is trained on the filtered outputs of another, allowing us to augment\nthe strengths and ameliorate the weaknesses of each method. 
Using a 176B\nparameter model trained on both language and code, we show that MMST can 1)\nimprove the less performant method (up to 30%) making the model easier to use,\n2) improve the more performant method (up to 32.2%) making the model more\nperformant, and 3) improve the performance of related but distinct tasks (up to\n10.3%) by improving the ability of the model to generate rationales. We then\nconduct ablation analyses to explore why MMST works. We show that MMST\ngenerates more data than traditional self-training, but the improvement in\nperformance is driven by the use of multiple methods. We also analyze\nprompt-engineering and anti-correlated performance between methods as means of\nmaking MMST more effective. We hope the evidence from our paper motivates\nmachine learning researchers to explore ways in which advances in language\nmodels allow for new forms of training.\n","authors":["Shriyash K. Upadhyay","Etan J. Ginsberg"],"pdf_url":"https://arxiv.org/pdf/2307.10633v1.pdf","comment":"23 pages, 3 figures"},{"id":"http://arxiv.org/abs/2211.14085v3","updated":"2023-07-20T06:42:56Z","published":"2022-11-25T13:14:33Z","title":"Positive unlabeled learning with tensor networks","summary":" Positive unlabeled learning is a binary classification problem with positive\nand unlabeled data. It is common in domains where negative labels are costly or\nimpossible to obtain, e.g., medicine and personalized advertising. Most\napproaches to positive unlabeled learning apply to specific data types (e.g.,\nimages, categorical data) and can not generate new positive and negative\nsamples. This work introduces a feature-space distance-based tensor network\napproach to the positive unlabeled learning problem. The presented method is\nnot domain specific and significantly improves the state-of-the-art results on\nthe MNIST image and 15 categorical/mixed datasets. The trained tensor network\nmodel is also a generative model and enables the generation of new positive and\nnegative instances.\n","authors":["Bojan Žunkovič"],"pdf_url":"https://arxiv.org/pdf/2211.14085v3.pdf","comment":"12 pages, 6 figures, 4 tables"},{"id":"http://arxiv.org/abs/2307.10617v1","updated":"2023-07-20T06:35:43Z","published":"2023-07-20T06:35:43Z","title":"Detecting deceptive reviews using text classification","summary":" In recent years, online reviews play a vital role for promoting any kind of\nproduct or services. Businesses may embed fake reviews in order to attract\ncustomers to purchase their products. They may even highlight the benefits of\ntheir own product or criticize the competition's product. Marketers,\nadvertisers, and other online business users have incentive to create fake\npositive reviews for products which they want to promote or give fake negative\nreviews for products which they really don't like. So now-a-days writing a\ndeceptive review is inevitable thing for promoting their own business or\ndegrading competitor's reputation. Thus, identifying deceptive reviews is an\nintense and on-going research area. This research paper proposes machine\nlearning model approach to identify deceptive reviews. The paper investigates\nthe performance of the several experiments done on a Deceptive Opinion Spam\nCorpus dataset of restaurants reviews. We developed a n-gram model and max\nfeatures to identify deceptive contents with a particular focus on fake\nreviews. 
Further, we conduct a benchmark study to investigate the performance\nof two different features extraction techniques and apply five machine learning\nclassification techniques. The experimental results show that passive\naggressive classifier outperforms other algorithms, and it reaches the highest\naccuracy not only in text classification but also to fake reviews. We also\nstudy the data augmentation and implement different deep learning techniques.\n","authors":["Anusuya Baby"],"pdf_url":"https://arxiv.org/pdf/2307.10617v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2307.09018v2","updated":"2023-07-20T06:35:34Z","published":"2023-07-18T07:12:46Z","title":"Multimodal LLMs for health grounded in individual-specific data","summary":" Foundation large language models (LLMs) have shown an impressive ability to\nsolve tasks across a wide range of fields including health. To effectively\nsolve personalized health tasks, LLMs need the ability to ingest a diversity of\ndata modalities that are relevant to an individual's health status. In this\npaper, we take a step towards creating multimodal LLMs for health that are\ngrounded in individual-specific data by developing a framework (HeLM: Health\nLarge Language Model for Multimodal Understanding) that enables LLMs to use\nhigh-dimensional clinical modalities to estimate underlying disease risk. HeLM\nencodes complex data modalities by learning an encoder that maps them into the\nLLM's token embedding space and for simple modalities like tabular data by\nserializing the data into text. Using data from the UK Biobank, we show that\nHeLM can effectively use demographic and clinical features in addition to\nhigh-dimensional time-series data to estimate disease risk. For example, HeLM\nachieves an AUROC of 0.75 for asthma prediction when combining tabular and\nspirogram data modalities compared with 0.49 when only using tabular data.\nOverall, we find that HeLM outperforms or performs at parity with classical\nmachine learning approaches across a selection of eight binary traits.\nFurthermore, we investigate the downstream uses of this model such as its\ngeneralizability to out-of-distribution traits and its ability to power\nconversations around individual health and wellness.\n","authors":["Anastasiya Belyaeva","Justin Cosentino","Farhad Hormozdiari","Krish Eswaran","Shravya Shetty","Greg Corrado","Andrew Carroll","Cory Y. McLean","Nicholas A. Furlotte"],"pdf_url":"https://arxiv.org/pdf/2307.09018v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10616v1","updated":"2023-07-20T06:32:14Z","published":"2023-07-20T06:32:14Z","title":"Heterogeneous Federated Learning: State-of-the-art and Research\n Challenges","summary":" Federated learning (FL) has drawn increasing attention owing to its potential\nuse in large-scale industrial applications. Existing federated learning works\nmainly focus on model homogeneous settings. However, practical federated\nlearning typically faces the heterogeneity of data distributions, model\narchitectures, network environments, and hardware devices among participant\nclients. Heterogeneous Federated Learning (HFL) is much more challenging, and\ncorresponding solutions are diverse and complex. 
Therefore, a systematic survey\non this topic about the research challenges and state-of-the-art is essential.\nIn this survey, we firstly summarize the various research challenges in HFL\nfrom five aspects: statistical heterogeneity, model heterogeneity,\ncommunication heterogeneity, device heterogeneity, and additional challenges.\nIn addition, recent advances in HFL are reviewed and a new taxonomy of existing\nHFL methods is proposed with an in-depth analysis of their pros and cons. We\nclassify existing methods from three different levels according to the HFL\nprocedure: data-level, model-level, and server-level. Finally, several critical\nand promising future research directions in HFL are discussed, which may\nfacilitate further developments in this field. A periodically updated\ncollection on HFL is available at https://github.com/marswhu/HFL_Survey.\n","authors":["Mang Ye","Xiuwen Fang","Bo Du","Pong C. Yuen","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2307.10616v1.pdf","comment":"42 pages, 11 figures, and 4 tables"},{"id":"http://arxiv.org/abs/2305.18088v4","updated":"2023-07-20T06:29:28Z","published":"2023-05-25T05:34:39Z","title":"Drug Repurposing Targeting COVID-19 3CL Protease using Molecular Docking\n and Machine Learning Regression Approach","summary":" The COVID-19 pandemic has created a global health crisis, driving the need\nfor the rapid identification of potential therapeutics. To meet this challenge,\ndrug repurposing is the only solution with saving cost, time, and labor. In\nthis study, we used the Zinc database to screen the world-approved including\nFDA-approved 5903 drugs for repurposing as potential COVID-19 treatments\ntargeting the main protease 3CL of SARS-CoV-2. We performed molecular docking\nand checked the efficacy of drug molecules. To enhance the efficiency of drug\nrepurposing approach, we modeled the binding affinities using several machine\nlearning regression approaches for QSAR modeling such as decision tree, extra\ntrees, MLP, KNN, XGBoost, and gradient boosting. The computational results\ndemonstrated that Decision Tree Regression (DTR) model has improved statistical\nmeasures of R2 and RMSE. These simulated results helped to identify drugs with\nhigh binding affinity. From the docking and other statistical analysis, we\nshortlisted six promising drugs with their respective Zinc IDs (ZINC3873365,\nZINC85432544, ZINC203757351, ZINC85536956, ZINC8214470 and ZINC261494640)\nwithin the range of -15 kcal/mol to -13 kcal/mol. In the study, the repurposed\ndrugs are novel except ZINC203757351 antiviral compound that has already\nidentified against COVID-19 in other studies. Further, we analyzed the\nphysiochemical and pharmacokinetic properties of these top-ranked selected\ndrugs with respect to their best binding interaction for specific target\nprotease 3CLpro. Our study has provided an efficient framework for drug\nrepurposing against COVID-19. This highlights the potential of combining\nmolecular docking with machine learning regression approaches to accelerate the\nidentification of potential therapeutic candidates.\n","authors":["Imra Aqeel","Abdul Majid"],"pdf_url":"https://arxiv.org/pdf/2305.18088v4.pdf","comment":"27 Pages"},{"id":"http://arxiv.org/abs/2102.03403v2","updated":"2023-07-20T05:58:30Z","published":"2021-02-05T19:59:05Z","title":"Robust Principal Component Analysis: A Median of Means Approach","summary":" Principal Component Analysis (PCA) is a fundamental tool for data\nvisualization, denoising, and dimensionality reduction. 
It is widely popular in\nStatistics, Machine Learning, Computer Vision, and related fields. However, PCA\nis well-known to fall prey to outliers and often fails to detect the true\nunderlying low-dimensional structure within the dataset. Following the Median\nof Means (MoM) philosophy, recent supervised learning methods have shown great\nsuccess in dealing with outlying observations without much compromise to their\nlarge sample theoretical properties. This paper proposes a PCA procedure based\non the MoM principle. Called the \\textbf{M}edian of \\textbf{M}eans\n\\textbf{P}rincipal \\textbf{C}omponent \\textbf{A}nalysis (MoMPCA), the proposed\nmethod is not only computationally appealing but also achieves optimal\nconvergence rates under minimal assumptions. In particular, we explore the\nnon-asymptotic error bounds of the obtained solution via the aid of the\nRademacher complexities while granting absolutely no assumption on the outlying\nobservations. The derived concentration results are not dependent on the\ndimension because the analysis is conducted in a separable Hilbert space, and\nthe results only depend on the fourth moment of the underlying distribution in\nthe corresponding norm. The proposal's efficacy is also thoroughly showcased\nthrough simulations and real data applications.\n","authors":["Debolina Paul","Saptarshi Chakraborty","Swagatam Das"],"pdf_url":"https://arxiv.org/pdf/2102.03403v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.10224v4","updated":"2023-07-20T05:42:46Z","published":"2022-08-14T02:41:05Z","title":"Friendly Noise against Adversarial Noise: A Powerful Defense against\n Data Poisoning Attacks","summary":" A powerful category of (invisible) data poisoning attacks modify a subset of\ntraining examples by small adversarial perturbations to change the prediction\nof certain test-time data. Existing defense mechanisms are not desirable to\ndeploy in practice, as they often either drastically harm the generalization\nperformance, or are attack-specific, and prohibitively slow to apply. Here, we\npropose a simple but highly effective approach that unlike existing methods\nbreaks various types of invisible poisoning attacks with the slightest drop in\nthe generalization performance. We make the key observation that attacks\nintroduce local sharp regions of high training loss, which when minimized,\nresults in learning the adversarial perturbations and makes the attack\nsuccessful. To break poisoning attacks, our key idea is to alleviate the sharp\nloss regions introduced by poisons. To do so, our approach comprises two\ncomponents: an optimized friendly noise that is generated to maximally perturb\nexamples without degrading the performance, and a randomly varying noise\ncomponent. The combination of both components builds a very light-weight but\nextremely effective defense against the most powerful triggerless targeted and\nhidden-trigger backdoor poisoning attacks, including Gradient Matching,\nBulls-eye Polytope, and Sleeper Agent. We show that our friendly noise is\ntransferable to other architectures, and adaptive attacks cannot break our\ndefense due to its random noise component. 
Our code is available at:\nhttps://github.com/tianyu139/friendly-noise\n","authors":["Tian Yu Liu","Yu Yang","Baharan Mirzasoleiman"],"pdf_url":"https://arxiv.org/pdf/2208.10224v4.pdf","comment":"Code available at: https://github.com/tianyu139/friendly-noise"},{"id":"http://arxiv.org/abs/2210.08363v3","updated":"2023-07-20T05:41:18Z","published":"2022-10-15T19:32:20Z","title":"Data-Efficient Augmentation for Training Neural Networks","summary":" Data augmentation is essential to achieve state-of-the-art performance in\nmany deep learning applications. However, the most effective augmentation\ntechniques become computationally prohibitive for even medium-sized datasets.\nTo address this, we propose a rigorous technique to select subsets of data\npoints that when augmented, closely capture the training dynamics of full data\naugmentation. We first show that data augmentation, modeled as additive\nperturbations, improves learning and generalization by relatively enlarging and\nperturbing the smaller singular values of the network Jacobian, while\npreserving its prominent directions. This prevents overfitting and enhances\nlearning the harder to learn information. Then, we propose a framework to\niteratively extract small subsets of training data that when augmented, closely\ncapture the alignment of the fully augmented Jacobian with labels/residuals. We\nprove that stochastic gradient descent applied to the augmented subsets found\nby our approach has similar training dynamics to that of fully augmented data.\nOur experiments demonstrate that our method achieves 6.3x speedup on CIFAR10\nand 2.2x speedup on SVHN, and outperforms the baselines by up to 10% across\nvarious subset sizes. Similarly, on TinyImageNet and ImageNet, our method beats\nthe baselines by up to 8%, while achieving up to 3.3x speedup across various\nsubset sizes. Finally, training on and augmenting 50% subsets using our method\non a version of CIFAR10 corrupted with label noise even outperforms using the\nfull dataset. Our code is available at:\nhttps://github.com/tianyu139/data-efficient-augmentation\n","authors":["Tian Yu Liu","Baharan Mirzasoleiman"],"pdf_url":"https://arxiv.org/pdf/2210.08363v3.pdf","comment":"Code available at:\n https://github.com/tianyu139/data-efficient-augmentation"},{"id":"http://arxiv.org/abs/2206.08309v2","updated":"2023-07-20T05:32:00Z","published":"2022-06-16T17:11:41Z","title":"Pythae: Unifying Generative Autoencoders in Python -- A Benchmarking Use\n Case","summary":" In recent years, deep generative models have attracted increasing interest\ndue to their capacity to model complex distributions. Among those models,\nvariational autoencoders have gained popularity as they have proven both to be\ncomputationally efficient and yield impressive results in multiple fields.\nFollowing this breakthrough, extensive research has been done in order to\nimprove the original publication, resulting in a variety of different VAE\nmodels in response to different tasks. In this paper we present Pythae, a\nversatile open-source Python library providing both a unified implementation\nand a dedicated framework allowing straightforward, reproducible and reliable\nuse of generative autoencoder models. We then propose to use this library to\nperform a case study benchmark where we present and compare 19 generative\nautoencoder models representative of some of the main improvements on\ndownstream tasks such as image reconstruction, generation, classification,\nclustering and interpolation. 
The open-source library can be found at\nhttps://github.com/clementchadebec/benchmark_VAE.\n","authors":["Clément Chadebec","Louis J. Vincent","Stéphanie Allassonnière"],"pdf_url":"https://arxiv.org/pdf/2206.08309v2.pdf","comment":"Accepted to NeurIPS 2022"},{"id":"http://arxiv.org/abs/2210.16299v3","updated":"2023-07-20T05:27:03Z","published":"2022-10-28T17:52:18Z","title":"Nonuniqueness and Convergence to Equivalent Solutions in Observer-based\n Inverse Reinforcement Learning","summary":" A key challenge in solving the deterministic inverse reinforcement learning\n(IRL) problem online and in real-time is the existence of multiple solutions.\nNonuniqueness necessitates the study of the notion of equivalent solutions,\ni.e., solutions that result in a different cost functional but same feedback\nmatrix, and convergence to such solutions. While offline algorithms that result\nin convergence to equivalent solutions have been developed in the literature,\nonline, real-time techniques that address nonuniqueness are not available. In\nthis paper, a regularized history stack observer that converges to\napproximately equivalent solutions of the IRL problem is developed. Novel\ndata-richness conditions are developed to facilitate the analysis and\nsimulation results are provided to demonstrate the effectiveness of the\ndeveloped technique.\n","authors":["Jared Town","Zachary Morrison","Rushikesh Kamalapurkar"],"pdf_url":"https://arxiv.org/pdf/2210.16299v3.pdf","comment":"16 pages, 7 figures, submitted to American Controls Conference 2023"},{"id":"http://arxiv.org/abs/2307.10596v1","updated":"2023-07-20T05:23:49Z","published":"2023-07-20T05:23:49Z","title":"Ensemble Learning based Anomaly Detection for IoT Cybersecurity via\n Bayesian Hyperparameters Sensitivity Analysis","summary":" The Internet of Things (IoT) integrates more than billions of intelligent\ndevices over the globe with the capability of communicating with other\nconnected devices with little to no human intervention. IoT enables data\naggregation and analysis on a large scale to improve life quality in many\ndomains. In particular, data collected by IoT contain a tremendous amount of\ninformation for anomaly detection. The heterogeneous nature of IoT is both a\nchallenge and an opportunity for cybersecurity. Traditional approaches in\ncybersecurity monitoring often require different kinds of data pre-processing\nand handling for various data types, which might be problematic for datasets\nthat contain heterogeneous features. However, heterogeneous types of network\ndevices can often capture a more diverse set of signals than a single type of\ndevice readings, which is particularly useful for anomaly detection. In this\npaper, we present a comprehensive study on using ensemble machine learning\nmethods for enhancing IoT cybersecurity via anomaly detection. Rather than\nusing one single machine learning model, ensemble learning combines the\npredictive power from multiple models, enhancing their predictive accuracy in\nheterogeneous datasets rather than using one single machine learning model. We\npropose a unified framework with ensemble learning that utilises Bayesian\nhyperparameter optimisation to adapt to a network environment that contains\nmultiple IoT sensor readings. 
Experimentally, we illustrate their high\npredictive power when compared to traditional methods.\n","authors":["Tin Lai","Farnaz Farid","Abubakar Bello","Fariza Sabrina"],"pdf_url":"https://arxiv.org/pdf/2307.10596v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10588v1","updated":"2023-07-20T05:03:25Z","published":"2023-07-20T05:03:25Z","title":"Forecasting Battery Electric Vehicle Charging Behavior: A Deep Learning\n Approach Equipped with Micro-Clustering and SMOTE Techniques","summary":" Energy systems, climate change, and public health are among the primary\nreasons for moving toward electrification in transportation. Transportation\nelectrification is being promoted worldwide to reduce emissions. As a result,\nmany automakers will soon start making only battery electric vehicles (BEVs).\nBEV adoption rates are rising in California, mainly due to climate change and\nair pollution concerns. While great for climate and pollution goals, improperly\nmanaged BEV charging can lead to insufficient charging infrastructure and power\noutages. This study develops a novel Micro Clustering Deep Neural Network\n(MCDNN), an artificial neural network algorithm that is highly effective at\nlearning BEVs trip and charging data to forecast BEV charging events,\ninformation that is essential for electricity load aggregators and utility\nmanagers to provide charging stations and electricity capacity effectively. The\nMCDNN is configured using a robust dataset of trips and charges that occurred\nin California between 2015 and 2020 from 132 BEVs, spanning 5 BEV models for a\ntotal of 1570167 vehicle miles traveled. The numerical findings revealed that\nthe proposed MCDNN is more effective than benchmark approaches in this field,\nsuch as support vector machine, k nearest neighbors, decision tree, and other\nneural network-based models in predicting the charging events.\n","authors":["Hanif Tayarani","Trisha V. Ramadoss","Vaishnavi Karanam","Gil Tal","Christopher Nitta"],"pdf_url":"https://arxiv.org/pdf/2307.10588v1.pdf","comment":"18 pages,8 figures, 4 tables"},{"id":"http://arxiv.org/abs/2307.10586v1","updated":"2023-07-20T05:00:13Z","published":"2023-07-20T05:00:13Z","title":"A Holistic Assessment of the Reliability of Machine Learning Systems","summary":" As machine learning (ML) systems increasingly permeate high-stakes settings\nsuch as healthcare, transportation, military, and national security, concerns\nregarding their reliability have emerged. Despite notable progress, the\nperformance of these systems can significantly diminish due to adversarial\nattacks or environmental changes, leading to overconfident predictions,\nfailures to detect input faults, and an inability to generalize in unexpected\nscenarios. This paper proposes a holistic assessment methodology for the\nreliability of ML systems. Our framework evaluates five key properties:\nin-distribution accuracy, distribution-shift robustness, adversarial\nrobustness, calibration, and out-of-distribution detection. A reliability score\nis also introduced and used to assess the overall system reliability. To\nprovide insights into the performance of different algorithmic approaches, we\nidentify and categorize state-of-the-art techniques, then evaluate a selection\non real-world tasks using our proposed reliability metrics and reliability\nscore. 
Our analysis of over 500 models reveals that designing for one metric\ndoes not necessarily constrain others but certain algorithmic techniques can\nimprove reliability across multiple metrics simultaneously. This study\ncontributes to a more comprehensive understanding of ML reliability and\nprovides a roadmap for future research and development.\n","authors":["Anthony Corso","David Karamadian","Romeo Valentin","Mary Cooper","Mykel J. Kochenderfer"],"pdf_url":"https://arxiv.org/pdf/2307.10586v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10580v1","updated":"2023-07-20T04:46:34Z","published":"2023-07-20T04:46:34Z","title":"Intelligent model for offshore China sea fog forecasting","summary":" Accurate and timely prediction of sea fog is very important for effectively\nmanaging maritime and coastal economic activities. Given the intricate nature\nand inherent variability of sea fog, traditional numerical and statistical\nforecasting methods are often proven inadequate. This study aims to develop an\nadvanced sea fog forecasting method embedded in a numerical weather prediction\nmodel using the Yangtze River Estuary (YRE) coastal area as a case study. Prior\nto training our machine learning model, we employ a time-lagged correlation\nanalysis technique to identify key predictors and decipher the underlying\nmechanisms driving sea fog occurrence. In addition, we implement ensemble\nlearning and a focal loss function to address the issue of imbalanced data,\nthereby enhancing the predictive ability of our model. To verify the accuracy\nof our method, we evaluate its performance using a comprehensive dataset\nspanning one year, which encompasses both weather station observations and\nhistorical forecasts. Remarkably, our machine learning-based approach surpasses\nthe predictive performance of two conventional methods, the weather research\nand forecasting nonhydrostatic mesoscale model (WRF-NMM) and the algorithm\ndeveloped by the National Oceanic and Atmospheric Administration (NOAA)\nForecast Systems Laboratory (FSL). Specifically, in regard to predicting sea\nfog with a visibility of less than or equal to 1 km with a lead time of 60\nhours, our methodology achieves superior results by increasing the probability\nof detection (POD) while simultaneously reducing the false alarm ratio (FAR).\n","authors":["Yanfei Xiang","Qinghong Zhang","Mingqing Wang","Ruixue Xia","Yang Kong","Xiaomeng Huang"],"pdf_url":"https://arxiv.org/pdf/2307.10580v1.pdf","comment":"19 pages, 9 figures"},{"id":"http://arxiv.org/abs/2307.10579v1","updated":"2023-07-20T04:45:59Z","published":"2023-07-20T04:45:59Z","title":"SecureBoost Hyperparameter Tuning via Multi-Objective Federated Learning","summary":" SecureBoost is a tree-boosting algorithm leveraging homomorphic encryption to\nprotect data privacy in vertical federated learning setting. It is widely used\nin fields such as finance and healthcare due to its interpretability,\neffectiveness, and privacy-preserving capability. However, SecureBoost suffers\nfrom high computational complexity and risk of label leakage. To harness the\nfull potential of SecureBoost, hyperparameters of SecureBoost should be\ncarefully chosen to strike an optimal balance between utility, efficiency, and\nprivacy. Existing methods either set hyperparameters empirically or\nheuristically, which are far from optimal. 
To fill this gap, we propose a\nConstrained Multi-Objective SecureBoost (CMOSB) algorithm to find Pareto\noptimal solutions that each solution is a set of hyperparameters achieving\noptimal tradeoff between utility loss, training cost, and privacy leakage. We\ndesign measurements of the three objectives. In particular, the privacy leakage\nis measured using our proposed instance clustering attack. Experimental results\ndemonstrate that the CMOSB yields not only hyperparameters superior to the\nbaseline but also optimal sets of hyperparameters that can support the flexible\nrequirements of FL participants.\n","authors":["Ziyao Ren","Yan Kang","Lixin Fan","Linghua Yang","Tao Fan","Yongxin Tong","Qiang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.10579v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10575v1","updated":"2023-07-20T04:35:50Z","published":"2023-07-20T04:35:50Z","title":"Boosting Federated Learning Convergence with Prototype Regularization","summary":" As a distributed machine learning technique, federated learning (FL) requires\nclients to collaboratively train a shared model with an edge server without\nleaking their local data. However, the heterogeneous data distribution among\nclients often leads to a decrease in model performance. To tackle this issue,\nthis paper introduces a prototype-based regularization strategy to address the\nheterogeneity in the data distribution. Specifically, the regularization\nprocess involves the server aggregating local prototypes from distributed\nclients to generate a global prototype, which is then sent back to the\nindividual clients to guide their local training. The experimental results on\nMNIST and Fashion-MNIST show that our proposal achieves improvements of 3.3%\nand 8.9% in average test accuracy, respectively, compared to the most popular\nbaseline FedAvg. Furthermore, our approach has a fast convergence rate in\nheterogeneous settings.\n","authors":["Yu Qiao","Huy Q. Le","Choong Seon Hong"],"pdf_url":"https://arxiv.org/pdf/2307.10575v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10569v1","updated":"2023-07-20T04:14:09Z","published":"2023-07-20T04:14:09Z","title":"Deceptive Alignment Monitoring","summary":" As the capabilities of large machine learning models continue to grow, and as\nthe autonomy afforded to such models continues to expand, the spectre of a new\nadversary looms: the models themselves. The threat that a model might behave in\na seemingly reasonable manner, while secretly and subtly modifying its behavior\nfor ulterior reasons is often referred to as deceptive alignment in the AI\nSafety & Alignment communities. Consequently, we call this new direction\nDeceptive Alignment Monitoring. In this work, we identify emerging directions\nin diverse machine learning subfields that we believe will become increasingly\nimportant and intertwined in the near future for deceptive alignment\nmonitoring, and we argue that advances in these fields present both long-term\nchallenges and new research opportunities. 
We conclude by advocating for\ngreater involvement by the adversarial machine learning community in these\nemerging directions.\n","authors":["Andres Carranza","Dhruv Pai","Rylan Schaeffer","Arnuv Tandon","Sanmi Koyejo"],"pdf_url":"https://arxiv.org/pdf/2307.10569v1.pdf","comment":"Accepted as BlueSky Oral to 2023 ICML AdvML Workshop"},{"id":"http://arxiv.org/abs/2307.10563v1","updated":"2023-07-20T04:00:37Z","published":"2023-07-20T04:00:37Z","title":"FACADE: A Framework for Adversarial Circuit Anomaly Detection and\n Evaluation","summary":" We present FACADE, a novel probabilistic and geometric framework designed for\nunsupervised mechanistic anomaly detection in deep neural networks. Its primary\ngoal is advancing the understanding and mitigation of adversarial attacks.\nFACADE aims to generate probabilistic distributions over circuits, which\nprovide critical insights to their contribution to changes in the manifold\nproperties of pseudo-classes, or high-dimensional modes in activation space,\nyielding a powerful tool for uncovering and combating adversarial attacks. Our\napproach seeks to improve model robustness, enhance scalable model oversight,\nand demonstrates promising applications in real-world deployment settings.\n","authors":["Dhruv Pai","Andres Carranza","Rylan Schaeffer","Arnuv Tandon","Sanmi Koyejo"],"pdf_url":"https://arxiv.org/pdf/2307.10563v1.pdf","comment":"Accepted as BlueSky Poster at 2023 ICML AdvML Workshop"},{"id":"http://arxiv.org/abs/2307.10562v1","updated":"2023-07-20T03:56:04Z","published":"2023-07-20T03:56:04Z","title":"Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared\n Adversarial Examples","summary":" Backdoor attacks are serious security threats to machine learning models\nwhere an adversary can inject poisoned samples into the training set, causing a\nbackdoored model which predicts poisoned samples with particular triggers to\nparticular target classes, while behaving normally on benign samples. In this\npaper, we explore the task of purifying a backdoored model using a small clean\ndataset. By establishing the connection between backdoor risk and adversarial\nrisk, we derive a novel upper bound for backdoor risk, which mainly captures\nthe risk on the shared adversarial examples (SAEs) between the backdoored model\nand the purified model. This upper bound further suggests a novel bi-level\noptimization problem for mitigating backdoor using adversarial training\ntechniques. To solve it, we propose Shared Adversarial Unlearning (SAU).\nSpecifically, SAU first generates SAEs, and then, unlearns the generated SAEs\nsuch that they are either correctly classified by the purified model and/or\ndifferently classified by the two models, such that the backdoor effect in the\nbackdoored model will be mitigated in the purified model. Experiments on\nvarious benchmark datasets and network architectures show that our proposed\nmethod achieves state-of-the-art performance for backdoor defense.\n","authors":["Shaokui Wei","Mingda Zhang","Hongyuan Zha","Baoyuan Wu"],"pdf_url":"https://arxiv.org/pdf/2307.10562v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10560v1","updated":"2023-07-20T03:55:53Z","published":"2023-07-20T03:55:53Z","title":"Post-variational quantum neural networks","summary":" Quantum computing has the potential to provide substantial computational\nadvantages over current state-of-the-art classical supercomputers. However,\ncurrent hardware is not advanced enough to execute fault-tolerant quantum\nalgorithms. 
An alternative of using hybrid quantum-classical computing with\nvariational algorithms can exhibit barren plateau issues, causing slow\nconvergence of gradient-based optimization techniques. In this paper, we\ndiscuss \"post-variational strategies\", which shift tunable parameters from the\nquantum computer to the classical computer, opting for ensemble strategies when\noptimizing quantum models. We discuss various strategies and design principles\nfor constructing individual quantum circuits, where the resulting ensembles can\nbe optimized with convex programming. Further, we discuss architectural designs\nof post-variational quantum neural networks and analyze the propagation of\nestimation errors throughout such neural networks. Lastly, we show that our\nalgorithm can be applied to real-world applications such as image\nclassification on handwritten digits, producing a 96% classification accuracy.\n","authors":["Po-Wei Huang","Patrick Rebentrost"],"pdf_url":"https://arxiv.org/pdf/2307.10560v1.pdf","comment":"17 pages, 9 figures"},{"id":"http://arxiv.org/abs/2307.10559v1","updated":"2023-07-20T03:54:47Z","published":"2023-07-20T03:54:47Z","title":"Air Traffic Controller Workload Level Prediction using Conformalized\n Dynamical Graph Learning","summary":" Air traffic control (ATC) is a safety-critical service system that demands\nconstant attention from ground air traffic controllers (ATCos) to maintain\ndaily aviation operations. The workload of the ATCos can have negative effects\non operational safety and airspace usage. To avoid overloading and ensure an\nacceptable workload level for the ATCos, it is important to predict the ATCos'\nworkload accurately for mitigation actions. In this paper, we first perform a\nreview of research on ATCo workload, mostly from the air traffic perspective.\nThen, we briefly introduce the setup of the human-in-the-loop (HITL)\nsimulations with retired ATCos, where the air traffic data and workload labels\nare obtained. The simulations are conducted under three Phoenix approach\nscenarios while the human ATCos are requested to self-evaluate their workload\nratings (i.e., low-1 to high-7). Preliminary data analysis is conducted. Next,\nwe propose a graph-based deep-learning framework with conformal prediction to\nidentify the ATCo workload levels. The number of aircraft under the\ncontroller's control varies both spatially and temporally, resulting in\ndynamically evolving graphs. The experiment results suggest that (a) besides\nthe traffic density feature, the traffic conflict feature contributes to the\nworkload prediction capabilities (i.e., minimum horizontal/vertical separation\ndistance); (b) directly learning from the spatiotemporal graph layout of\nairspace with graph neural network can achieve higher prediction accuracy,\ncompare to hand-crafted traffic complexity features; (c) conformal prediction\nis a valuable tool to further boost model prediction accuracy, resulting a\nrange of predicted workload labels. The code used is available at\n\\href{https://github.com/ymlasu/para-atm-collection/blob/master/air-traffic-prediction/ATC-Workload-Prediction/}{$\\mathsf{Link}$}.\n","authors":["Yutian Pang","Jueming Hu","Christopher S. Lieber","Nancy J. 
Cooke","Yongming Liu"],"pdf_url":"https://arxiv.org/pdf/2307.10559v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10550v1","updated":"2023-07-20T03:28:06Z","published":"2023-07-20T03:28:06Z","title":"SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer","summary":" Expressive speech synthesis models are trained by adding corpora with diverse\nspeakers, various emotions, and different speaking styles to the dataset, in\norder to control various characteristics of speech and generate the desired\nvoice. In this paper, we propose a style control (SC) VALL-E model based on the\nneural codec language model (called VALL-E), which follows the structure of the\ngenerative pretrained transformer 3 (GPT-3). The proposed SC VALL-E takes input\nfrom text sentences and prompt audio and is designed to generate controllable\nspeech by not simply mimicking the characteristics of the prompt audio but by\ncontrolling the attributes to produce diverse voices. We identify tokens in the\nstyle embedding matrix of the newly designed style network that represent\nattributes such as emotion, speaking rate, pitch, and voice intensity, and\ndesign a model that can control these attributes. To evaluate the performance\nof SC VALL-E, we conduct comparative experiments with three representative\nexpressive speech synthesis models: global style token (GST) Tacotron2,\nvariational autoencoder (VAE) Tacotron2, and original VALL-E. We measure word\nerror rate (WER), F0 voiced error (FVE), and F0 gross pitch error (F0GPE) as\nevaluation metrics to assess the accuracy of generated sentences. For comparing\nthe quality of synthesized speech, we measure comparative mean option score\n(CMOS) and similarity mean option score (SMOS). To evaluate the style control\nability of the generated speech, we observe the changes in F0 and\nmel-spectrogram by modifying the trained tokens. When using prompt audio that\nis not present in the training data, SC VALL-E generates a variety of\nexpressive sounds and demonstrates competitive performance compared to the\nexisting models. Our implementation, pretrained models, and audio samples are\nlocated on GitHub.\n","authors":["Daegyeom Kim","Seongho Hong","Yong-Hoon Choi"],"pdf_url":"https://arxiv.org/pdf/2307.10550v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08122v2","updated":"2023-07-20T03:07:28Z","published":"2023-07-16T18:31:25Z","title":"Tangent Transformers for Composition, Privacy and Removal","summary":" We introduce Tangent Attention Fine-Tuning (TAFT), a method for fine-tuning\nlinearized transformers obtained by computing a First-order Taylor Expansion\naround a pre-trained initialization. We show that the Jacobian-Vector Product\nresulting from linearization can be computed efficiently in a single forward\npass, reducing training and inference cost to the same order of magnitude as\nits original non-linear counterpart, while using the same number of parameters.\nFurthermore, we show that, when applied to various downstream visual\nclassification tasks, the resulting Tangent Transformer fine-tuned with TAFT\ncan perform comparably with fine-tuning the original non-linear network. 
Since\nTangent Transformers are linear with respect to the new set of weights, and the\nresulting fine-tuning loss is convex, we show that TAFT enjoys several\nadvantages compared to non-linear fine-tuning when it comes to model\ncomposition, parallel training, machine unlearning, and differential privacy.\n","authors":["Tian Yu Liu","Aditya Golatkar","Stefano Soatto"],"pdf_url":"https://arxiv.org/pdf/2307.08122v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.03718v4","updated":"2023-07-20T03:06:50Z","published":"2023-06-06T14:28:57Z","title":"Emotion-Conditioned Melody Harmonization with Hierarchical Variational\n Autoencoder","summary":" Existing melody harmonization models have made great progress in improving\nthe quality of generated harmonies, but most of them ignored the emotions\nbeneath the music. Meanwhile, the variability of harmonies generated by\nprevious methods is insufficient. To solve these problems, we propose a novel\nLSTM-based Hierarchical Variational Auto-Encoder (LHVAE) to investigate the\ninfluence of emotional conditions on melody harmonization, while improving the\nquality of generated harmonies and capturing the abundant variability of chord\nprogressions. Specifically, LHVAE incorporates latent variables and emotional\nconditions at different levels (piece- and bar-level) to model the global and\nlocal music properties. Additionally, we introduce an attention-based melody\ncontext vector at each step to better learn the correspondence between melodies\nand harmonies. Objective experimental results show that our proposed model\noutperforms other LSTM-based models. Through subjective evaluation, we conclude\nthat only altering the types of chords hardly changes the overall emotion of\nthe music. The qualitative analysis demonstrates the ability of our model to\ngenerate variable harmonies.\n","authors":["Shulei Ji","Xinyu Yang"],"pdf_url":"https://arxiv.org/pdf/2306.03718v4.pdf","comment":"Accepted by IEEE SMC 2023"},{"id":"http://arxiv.org/abs/2212.12658v2","updated":"2023-07-20T03:00:05Z","published":"2022-12-24T05:25:09Z","title":"Improving Uncertainty Quantification of Variance Networks by\n Tree-Structured Learning","summary":" To improve the uncertainty quantification of variance networks, we propose a\nnovel tree-structured local neural network model that partitions the feature\nspace into multiple regions based on uncertainty heterogeneity. A tree is built\nupon giving the training data, whose leaf nodes represent different regions\nwhere region-specific neural networks are trained to predict both the mean and\nthe variance for quantifying uncertainty. The proposed Uncertainty-Splitting\nNeural Regression Tree (USNRT) employs novel splitting criteria. At each node,\na neural network is trained on the full data first, and a statistical test for\nthe residuals is conducted to find the best split, corresponding to the two\nsub-regions with the most significant uncertainty heterogeneity between them.\nUSNRT is computationally friendly because very few leaf nodes are sufficient\nand pruning is unnecessary. Furthermore, an ensemble version can be easily\nconstructed to estimate the total uncertainty including the aleatory and\nepistemic. On extensive UCI datasets, USNRT or its ensemble shows superior\nperformance compared to some recent popular methods for quantifying uncertainty\nwith variances. 
Through comprehensive visualization and analysis, we uncover\nhow USNRT works and show its merits, revealing that uncertainty heterogeneity\ndoes exist in many datasets and can be learned by USNRT.\n","authors":["Wenxuan Ma","Xing Yan","Kun Zhang"],"pdf_url":"https://arxiv.org/pdf/2212.12658v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09767v2","updated":"2023-07-20T02:51:15Z","published":"2023-03-17T04:18:03Z","title":"It Is All About Data: A Survey on the Effects of Data on Adversarial\n Robustness","summary":" Adversarial examples are inputs to machine learning models that an attacker\nhas intentionally designed to confuse the model into making a mistake. Such\nexamples pose a serious threat to the applicability of machine-learning-based\nsystems, especially in life- and safety-critical domains. To address this\nproblem, the area of adversarial robustness investigates mechanisms behind\nadversarial attacks and defenses against these attacks. This survey reviews a\nparticular subset of this literature that focuses on investigating properties\nof training data in the context of model robustness under evasion attacks. It\nfirst summarizes the main properties of data leading to adversarial\nvulnerability. It then discusses guidelines and techniques for improving\nadversarial robustness by enhancing the data representation and learning\nprocedures, as well as techniques for estimating robustness guarantees given\nparticular data. Finally, it discusses gaps of knowledge and promising future\nresearch directions in this area.\n","authors":["Peiyu Xiong","Michael Tegegn","Jaskeerat Singh Sarin","Shubhraneel Pal","Julia Rubin"],"pdf_url":"https://arxiv.org/pdf/2303.09767v2.pdf","comment":"51 pages, 25 figures, under review"},{"id":"http://arxiv.org/abs/2304.10159v2","updated":"2023-07-20T02:49:49Z","published":"2023-04-20T08:32:58Z","title":"Deep-Q Learning with Hybrid Quantum Neural Network on Solving Maze\n Problems","summary":" Quantum computing holds great potential for advancing the limitations of\nmachine learning algorithms to handle higher data dimensions and reduce overall\ntraining parameters in deep neural network (DNN) models. This study uses a\nparameterized quantum circuit (PQC) on a gate-based quantum computer to\ninvestigate the potential for quantum advantage in a model-free reinforcement\nlearning problem. Through a comprehensive investigation and evaluation of the\ncurrent model and capabilities of quantum computers, we designed and trained a\nnovel hybrid Quantum neural network based on the latest Qiskit and PyTorch\nframework. We compared its performance with a full-classical DNN with and\nwithout an integrated PQC. Our research provides insights into the potential of\ndeep quantum learning to solve a maze problem and, potentially, other\nreinforcement learning problems. We conclude that various reinforcement\nlearning problems can be effective with reasonable training epochs. 
Moreover, a\ncomparative discussion of the various quantum reinforcement learning model on\nmaze problems is discussed to evaluate our research's overall potential and\nadvantages.\n","authors":["Hao-Yuan Chen","Yen-Jui Chang","Ching-Ray Chang"],"pdf_url":"https://arxiv.org/pdf/2304.10159v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10541v1","updated":"2023-07-20T02:42:23Z","published":"2023-07-20T02:42:23Z","title":"Differentially Flat Learning-based Model Predictive Control Using a\n Stability, State, and Input Constraining Safety Filter","summary":" Learning-based optimal control algorithms control unknown systems using past\ntrajectory data and a learned model of the system dynamics. These controllers\nuse either a linear approximation of the learned dynamics, trading performance\nfor faster computation, or nonlinear optimization methods, which typically\nperform better but can limit real-time applicability. In this work, we present\na novel nonlinear controller that exploits differential flatness to achieve\nsimilar performance to state-of-the-art learning-based controllers but with\nsignificantly less computational effort. Differential flatness is a property of\ndynamical systems whereby nonlinear systems can be exactly linearized through a\nnonlinear input mapping. Here, the nonlinear transformation is learned as a\nGaussian process and is used in a safety filter that guarantees, with high\nprobability, stability as well as input and flat state constraint satisfaction.\nThis safety filter is then used to refine inputs from a flat model predictive\ncontroller to perform constrained nonlinear learning-based optimal control\nthrough two successive convex optimizations. We compare our method to\nstate-of-the-art learning-based control strategies and achieve similar\nperformance, but with significantly better computational efficiency, while also\nrespecting flat state and input constraints, and guaranteeing stability.\n","authors":["Adam W. Hall","Melissa Greeff","Angela P. Schoellig"],"pdf_url":"https://arxiv.org/pdf/2307.10541v1.pdf","comment":"6 pages, 5 figures, Published in IEEE Control Systems Letters"},{"id":"http://arxiv.org/abs/2307.10529v1","updated":"2023-07-20T02:07:20Z","published":"2023-07-20T02:07:20Z","title":"Fast Unsupervised Deep Outlier Model Selection with Hypernetworks","summary":" Outlier detection (OD) finds many applications with a rich literature of\nnumerous techniques. Deep neural network based OD (DOD) has seen a recent surge\nof attention thanks to the many advances in deep learning. In this paper, we\nconsider a critical-yet-understudied challenge with unsupervised DOD, that is,\neffective hyperparameter (HP) tuning/model selection. While several prior work\nreport the sensitivity of OD models to HPs, it becomes ever so critical for the\nmodern DOD models that exhibit a long list of HPs. We introduce HYPER for\ntuning DOD models, tackling two fundamental challenges: (1) validation without\nsupervision (due to lack of labeled anomalies), and (2) efficient search of the\nHP/model space (due to exponential growth in the number of HPs). A key idea is\nto design and train a novel hypernetwork (HN) that maps HPs onto optimal\nweights of the main DOD model. In turn, HYPER capitalizes on a single HN that\ncan dynamically generate weights for many DOD models (corresponding to varying\nHPs), which offers significant speed-up. 
In addition, it employs meta-learning\non historical OD tasks with labels to train a proxy validation function,\nlikewise trained with our proposed HN efficiently. Extensive experiments on 35\nOD tasks show that HYPER achieves high performance against 8 baselines with\nsignificant efficiency gains.\n","authors":["Xueying Ding","Yue Zhao","Leman Akoglu"],"pdf_url":"https://arxiv.org/pdf/2307.10529v1.pdf","comment":"10 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.10524v1","updated":"2023-07-20T01:56:10Z","published":"2023-07-20T01:56:10Z","title":"Beyond Black-Box Advice: Learning-Augmented Algorithms for MDPs with\n Q-Value Predictions","summary":" We study the tradeoff between consistency and robustness in the context of a\nsingle-trajectory time-varying Markov Decision Process (MDP) with untrusted\nmachine-learned advice. Our work departs from the typical approach of treating\nadvice as coming from black-box sources by instead considering a setting where\nadditional information about how the advice is generated is available. We prove\na first-of-its-kind consistency and robustness tradeoff given Q-value advice\nunder a general MDP model that includes both continuous and discrete\nstate/action spaces. Our results highlight that utilizing Q-value advice\nenables dynamic pursuit of the better of machine-learned advice and a robust\nbaseline, thus result in near-optimal performance guarantees, which provably\nimproves what can be obtained solely with black-box advice.\n","authors":["Tongxin Li","Yiheng Lin","Shaolei Ren","Adam Wierman"],"pdf_url":"https://arxiv.org/pdf/2307.10524v1.pdf","comment":"27 pages"},{"id":"http://arxiv.org/abs/2302.10980v3","updated":"2023-07-20T01:34:16Z","published":"2023-02-21T20:26:39Z","title":"MultiRobustBench: Benchmarking Robustness Against Multiple Attacks","summary":" The bulk of existing research in defending against adversarial examples\nfocuses on defending against a single (typically bounded Lp-norm) attack, but\nfor a practical setting, machine learning (ML) models should be robust to a\nwide variety of attacks. In this paper, we present the first unified framework\nfor considering multiple attacks against ML models. Our framework is able to\nmodel different levels of learner's knowledge about the test-time adversary,\nallowing us to model robustness against unforeseen attacks and robustness\nagainst unions of attacks. Using our framework, we present the first\nleaderboard, MultiRobustBench, for benchmarking multiattack evaluation which\ncaptures performance across attack types and attack strengths. We evaluate the\nperformance of 16 defended models for robustness against a set of 9 different\nattack types, including Lp-based threat models, spatial transformations, and\ncolor changes, at 20 different attack strengths (180 attacks total).\nAdditionally, we analyze the state of current defenses against multiple\nattacks. 
Our analysis shows that while existing defenses have made progress in\nterms of average robustness across the set of attacks used, robustness against\nthe worst-case attack is still a big open problem as all existing models\nperform worse than random guessing.\n","authors":["Sihui Dai","Saeed Mahloujifar","Chong Xiang","Vikash Sehwag","Pin-Yu Chen","Prateek Mittal"],"pdf_url":"https://arxiv.org/pdf/2302.10980v3.pdf","comment":"ICML 2023"},{"id":"http://arxiv.org/abs/2305.11408v2","updated":"2023-07-20T00:58:30Z","published":"2023-05-19T03:31:42Z","title":"AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide\n for Simultaneous Speech Translation","summary":" Attention is the core mechanism of today's most used architectures for\nnatural language processing and has been analyzed from many perspectives,\nincluding its effectiveness for machine translation-related tasks. Among these\nstudies, attention resulted to be a useful source of information to get\ninsights about word alignment also when the input text is substituted with\naudio segments, as in the case of the speech translation (ST) task. In this\npaper, we propose AlignAtt, a novel policy for simultaneous ST (SimulST) that\nexploits the attention information to generate source-target alignments that\nguide the model during inference. Through experiments on the 8 language pairs\nof MuST-C v1.0, we show that AlignAtt outperforms previous state-of-the-art\nSimulST policies applied to offline-trained models with gains in terms of BLEU\nof 2 points and latency reductions ranging from 0.5s to 0.8s across the 8\nlanguages.\n","authors":["Sara Papi","Marco Turchi","Matteo Negri"],"pdf_url":"https://arxiv.org/pdf/2305.11408v2.pdf","comment":"Accepted at Interspeech 2023"},{"id":"http://arxiv.org/abs/2307.04603v4","updated":"2023-07-20T00:49:13Z","published":"2023-07-07T09:01:42Z","title":"Solvent: A Framework for Protein Folding","summary":" Consistency and reliability are crucial for conducting AI research. Many\nfamous research fields, such as object detection, have been compared and\nvalidated with solid benchmark frameworks. After AlphaFold2, the protein\nfolding task has entered a new phase, and many methods are proposed based on\nthe component of AlphaFold2. The importance of a unified research framework in\nprotein folding contains implementations and benchmarks to consistently and\nfairly compare various approaches. To achieve this, we present Solvent, an\nprotein folding framework that supports significant components of\nstate-of-the-art models in the manner of off-the-shelf interface Solvent\ncontains different models implemented in a unified codebase and supports\ntraining and evaluation for defined models on the same dataset. We benchmark\nwell-known algorithms and their components and provide experiments that give\nhelpful insights into the protein structure modeling field. We hope that\nSolvent will increase the reliability and consistency of proposed models and\ngives efficiency in both speed and costs, resulting in acceleration on protein\nfolding modeling research. 
The code is available at\nhttps://github.com/kakaobrain/solvent, and the project will continue to be\ndeveloped.\n","authors":["Jaemyung Lee","Kyeongtak Han","Jaehoon Kim","Hasun Yu","Youhan Lee"],"pdf_url":"https://arxiv.org/pdf/2307.04603v4.pdf","comment":"preprint, 8pages"},{"id":"http://arxiv.org/abs/2307.09702v2","updated":"2023-07-20T00:40:41Z","published":"2023-07-19T01:14:49Z","title":"Efficient Guided Generation for Large Language Models","summary":" In this article we describe an efficient approach to guiding language model\ntext generation with regular expressions and context-free grammars. Our\napproach adds little to no overhead to the token sequence generation process,\nand makes guided generation feasible in practice. An implementation is provided\nin the open source Python library Outlines.\n","authors":["Brandon T. Willard","Rémi Louf"],"pdf_url":"https://arxiv.org/pdf/2307.09702v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10507v1","updated":"2023-07-20T00:07:29Z","published":"2023-07-20T00:07:29Z","title":"FedSoup: Improving Generalization and Personalization in Federated\n Learning via Selective Model Interpolation","summary":" Cross-silo federated learning (FL) enables the development of machine\nlearning models on datasets distributed across data centers such as hospitals\nand clinical research laboratories. However, recent research has found that\ncurrent FL algorithms face a trade-off between local and global performance\nwhen confronted with distribution shifts. Specifically, personalized FL methods\nhave a tendency to overfit to local data, leading to a sharp valley in the\nlocal model and inhibiting its ability to generalize to out-of-distribution\ndata. In this paper, we propose a novel federated model soup method (i.e.,\nselective interpolation of model parameters) to optimize the trade-off between\nlocal and global performance. Specifically, during the federated training\nphase, each client maintains its own global model pool by monitoring the\nperformance of the interpolated model between the local and global models. This\nallows us to alleviate overfitting and seek flat minima, which can\nsignificantly improve the model's generalization performance. We evaluate our\nmethod on retinal and pathological image classification tasks, and our proposed\nmethod achieves significant improvements for out-of-distribution\ngeneralization. Our code is available at https://github.com/ubc-tea/FedSoup.\n","authors":["Minghui Chen","Meirui Jiang","Qi Dou","Zehua Wang","Xiaoxiao Li"],"pdf_url":"https://arxiv.org/pdf/2307.10507v1.pdf","comment":"Accepted by MICCAI2023"},{"id":"http://arxiv.org/abs/2307.10504v1","updated":"2023-07-20T00:02:24Z","published":"2023-07-20T00:02:24Z","title":"Identifying Interpretable Subspaces in Image Representations","summary":" We propose Automatic Feature Explanation using Contrasting Concepts (FALCON),\nan interpretability framework to explain features of image representations. For\na target feature, FALCON captions its highly activating cropped images using a\nlarge captioning dataset (like LAION-400m) and a pre-trained vision-language\nmodel like CLIP. Each word among the captions is scored and ranked leading to a\nsmall number of shared, human-understandable concepts that closely describe the\ntarget feature. FALCON also applies contrastive interpretation using lowly\nactivating (counterfactual) images, to eliminate spurious concepts. 
Although\nmany existing approaches interpret features independently, we observe in\nstate-of-the-art self-supervised and supervised models, that less than 20% of\nthe representation space can be explained by individual features. We show that\nfeatures in larger spaces become more interpretable when studied in groups and\ncan be explained with high-order scoring concepts through FALCON. We discuss\nhow extracted concepts can be used to explain and debug failures in downstream\ntasks. Finally, we present a technique to transfer concepts from one\n(explainable) representation space to another unseen representation space by\nlearning a simple linear transformation.\n","authors":["Neha Kalibhat","Shweta Bhardwaj","Bayan Bruss","Hamed Firooz","Maziar Sanjabi","Soheil Feizi"],"pdf_url":"https://arxiv.org/pdf/2307.10504v1.pdf","comment":"Published at ICML 2023"},{"id":"http://arxiv.org/abs/2307.11081v1","updated":"2023-07-20T17:57:04Z","published":"2023-07-20T17:57:04Z","title":"GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition\n in Surgical Videos","summary":" Automated surgical step recognition is an important task that can\nsignificantly improve patient safety and decision-making during surgeries.\nExisting state-of-the-art methods for surgical step recognition either rely on\nseparate, multi-stage modeling of spatial and temporal information or operate\non short-range temporal resolution when learned jointly. However, the benefits\nof joint modeling of spatio-temporal features and long-range information are\nnot taken in account. In this paper, we propose a vision transformer-based\napproach to jointly learn spatio-temporal features directly from sequence of\nframe-level patches. Our method incorporates a gated-temporal attention\nmechanism that intelligently combines short-term and long-term spatio-temporal\nfeature representations. We extensively evaluate our approach on two cataract\nsurgery video datasets, namely Cataract-101 and D99, and demonstrate superior\nperformance compared to various state-of-the-art methods. These results\nvalidate the suitability of our proposed approach for automated surgical step\nrecognition. Our code is released at:\nhttps://github.com/nisargshah1999/GLSFormer\n","authors":["Nisarg A. Shah","Shameema Sikder","S. Swaroop Vedula","Vishal M. Patel"],"pdf_url":"https://arxiv.org/pdf/2307.11081v1.pdf","comment":"Accepted to MICCAI 2023 (Early Accept)"},{"id":"http://arxiv.org/abs/2307.11018v1","updated":"2023-07-20T16:45:22Z","published":"2023-07-20T16:45:22Z","title":"Amortized Variational Inference: When and Why?","summary":" Amortized variational inference (A-VI) is a method for approximating the\nintractable posterior distributions that arise in probabilistic models. The\ndefining feature of A-VI is that it learns a global inference function that\nmaps each observation to its local latent variable's approximate posterior.\nThis stands in contrast to the more classical factorized (or mean-field)\nvariational inference (F-VI), which directly learns the parameters of the\napproximating distribution for each latent variable. In deep generative models,\nA-VI is used as a computational trick to speed up inference for local latent\nvariables. In this paper, we study A-VI as a general alternative to F-VI for\napproximate posterior inference. A-VI cannot produce an approximation with a\nlower Kullback-Leibler divergence than F-VI's optimal solution, because the\namortized family is a subset of the factorized family. 
Thus a central\ntheoretical problem is to characterize when A-VI still attains F-VI's optimal\nsolution. We derive conditions on both the model and the inference function\nunder which A-VI can theoretically achieve F-VI's optimum. We show that for a\nbroad class of hierarchical models, including deep generative models, it is\npossible to close the gap between A-VI and F-VI. Further, for an even broader\nclass of models, we establish when and how to expand the domain of the\ninference function to make amortization a feasible strategy. Finally, we prove\nthat for certain models -- including hidden Markov models and Gaussian\nprocesses -- A-VI cannot match F-VI's solution, no matter how expressive the\ninference function is. We also study A-VI empirically. On several examples, we\ncorroborate our theoretical results and investigate the performance of A-VI\nwhen varying the complexity of the inference function. When the gap between\nA-VI and F-VI can be closed, we find that the required complexity of the\nfunction need not scale with the number of observations, and that A-VI often\nconverges faster than F-VI.\n","authors":["Charles C. Margossian","David M. Blei"],"pdf_url":"https://arxiv.org/pdf/2307.11018v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.18451v3","updated":"2023-07-20T23:59:38Z","published":"2023-05-29T04:02:10Z","title":"Shift-Robust Molecular Relational Learning with Causal Substructure","summary":" Recently, molecular relational learning, whose goal is to predict the\ninteraction behavior between molecular pairs, got a surge of interest in\nmolecular sciences due to its wide range of applications. In this work, we\npropose CMRL that is robust to the distributional shift in molecular relational\nlearning by detecting the core substructure that is causally related to\nchemical reactions. To do so, we first assume a causal relationship based on\nthe domain knowledge of molecular sciences and construct a structural causal\nmodel (SCM) that reveals the relationship between variables. Based on the SCM,\nwe introduce a novel conditional intervention framework whose intervention is\nconditioned on the paired molecule. With the conditional intervention\nframework, our model successfully learns from the causal substructure and\nalleviates the confounding effect of shortcut substructures that are spuriously\ncorrelated to chemical reactions. Extensive experiments on various tasks with\nreal-world and synthetic datasets demonstrate the superiority of CMRL over\nstate-of-the-art baseline models. Our code is available at\nhttps://github.com/Namkyeong/CMRL.\n","authors":["Namkyeong Lee","Kanghoon Yoon","Gyoung S. Na","Sein Kim","Chanyoung Park"],"pdf_url":"https://arxiv.org/pdf/2305.18451v3.pdf","comment":"KDD 2023"},{"id":"http://arxiv.org/abs/2307.08167v2","updated":"2023-07-20T23:08:11Z","published":"2023-07-16T22:35:52Z","title":"Computing the gradients with respect to all parameters of a quantum\n neural network using a single circuit","summary":" When computing the gradients of a quantum neural network using the\nparameter-shift rule, the cost function needs to be calculated twice for the\ngradient with respect to a single adjustable parameter of the network. When the\ntotal number of parameters is high, the quantum circuit for the computation has\nto be adjusted and run for many times. Here we propose an approach to compute\nall the gradients using a single circuit only, with a much reduced circuit\ndepth and less classical registers. 
We also demonstrate experimentally, on both\nreal quantum hardware and simulator, that our approach has the advantages that\nthe circuit takes a significantly shorter time to compile than the conventional\napproach, resulting in a speedup on the total runtime.\n","authors":["Guang Ping He"],"pdf_url":"https://arxiv.org/pdf/2307.08167v2.pdf","comment":"Added a suggestion on improving real quantum computers"},{"id":"http://arxiv.org/abs/2307.11249v1","updated":"2023-07-20T21:49:38Z","published":"2023-07-20T21:49:38Z","title":"On the Fisher-Rao Gradient of the Evidence Lower Bound","summary":" This article studies the Fisher-Rao gradient, also referred to as the natural\ngradient, of the evidence lower bound, the ELBO, which plays a crucial role\nwithin the theory of the Variational Autonecoder, the Helmholtz Machine and the\nFree Energy Principle. The natural gradient of the ELBO is related to the\nnatural gradient of the Kullback-Leibler divergence from a target distribution,\nthe prime objective function of learning. Based on invariance properties of\ngradients within information geometry, conditions on the underlying model are\nprovided that ensure the equivalence of minimising the prime objective function\nand the maximisation of the ELBO.\n","authors":["Nihat Ay","Jesse van Oostrum"],"pdf_url":"https://arxiv.org/pdf/2307.11249v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11242v1","updated":"2023-07-20T21:25:25Z","published":"2023-07-20T21:25:25Z","title":"On-Sensor Data Filtering using Neuromorphic Computing for High Energy\n Physics Experiments","summary":" This work describes the investigation of neuromorphic computing-based spiking\nneural network (SNN) models used to filter data from sensor electronics in high\nenergy physics experiments conducted at the High Luminosity Large Hadron\nCollider. We present our approach for developing a compact neuromorphic model\nthat filters out the sensor data based on the particle's transverse momentum\nwith the goal of reducing the amount of data being sent to the downstream\nelectronics. The incoming charge waveforms are converted to streams of\nbinary-valued events, which are then processed by the SNN. We present our\ninsights on the various system design choices - from data encoding to optimal\nhyperparameters of the training algorithm - for an accurate and compact SNN\noptimized for hardware deployment. Our results show that an SNN trained with an\nevolutionary algorithm and an optimized set of hyperparameters obtains a signal\nefficiency of about 91% with nearly half as many parameters as a deep neural\nnetwork.\n","authors":["Shruti R. Kulkarni","Aaron Young","Prasanna Date","Narasinga Rao Miniskar","Jeffrey S. Vetter","Farah Fahim","Benjamin Parpillon","Jennet Dickinson","Nhan Tran","Jieun Yoo","Corrinne Mills","Morris Swartz","Petar Maksimovic","Catherine D. Schuman","Alice Bean"],"pdf_url":"https://arxiv.org/pdf/2307.11242v1.pdf","comment":"Manuscript accepted at ICONS'23"},{"id":"http://arxiv.org/abs/2307.11239v1","updated":"2023-07-20T21:22:02Z","published":"2023-07-20T21:22:02Z","title":"Edgewise outliers of network indexed signals","summary":" We consider models for network indexed multivariate data involving a\ndependence between variables as well as across graph nodes.\n In the framework of these models, we focus on outliers detection and\nintroduce the concept of edgewise outliers. 
For this purpose, we first derive\nthe distribution of some sums of squares, in particular squared Mahalanobis\ndistances that can be used to fix detection rules and thresholds for outlier\ndetection. We then propose a robust version of the deterministic MCD algorithm\nthat we call edgewise MCD. An application on simulated data shows the interest\nof taking the dependence structure into account. We also illustrate the utility\nof the proposed method with a real data set.\n","authors":["Christopher Rieser","Anne Ruiz-Gazen","Christine Thomas-Agnan"],"pdf_url":"https://arxiv.org/pdf/2307.11239v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11234v1","updated":"2023-07-20T21:10:54Z","published":"2023-07-20T21:10:54Z","title":"QDC: Quantum Diffusion Convolution Kernels on Graphs","summary":" Graph convolutional neural networks (GCNs) operate by aggregating messages\nover local neighborhoods given the prediction task under interest. Many GCNs\ncan be understood as a form of generalized diffusion of input features on the\ngraph, and significant work has been dedicated to improving predictive accuracy\nby altering the ways of message passing. In this work, we propose a new\nconvolution kernel that effectively rewires the graph according to the\noccupation correlations of the vertices by trading on the generalized diffusion\nparadigm for the propagation of a quantum particle over the graph. We term this\nnew convolution kernel the Quantum Diffusion Convolution (QDC) operator. In\naddition, we introduce a multiscale variant that combines messages from the QDC\noperator and the traditional combinatorial Laplacian. To understand our method,\nwe explore the spectral dependence of homophily and the importance of quantum\ndynamics in the construction of a bandpass filter. Through these studies, as\nwell as experiments on a range of datasets, we observe that QDC improves\npredictive performance on the widely used benchmark datasets when compared to\nsimilar methods.\n","authors":["Thomas Markovich"],"pdf_url":"https://arxiv.org/pdf/2307.11234v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.13807v2","updated":"2023-07-20T20:57:08Z","published":"2023-01-31T17:50:52Z","title":"Identifying the Hazard Boundary of ML-enabled Autonomous Systems Using\n Cooperative Co-Evolutionary Search","summary":" In Machine Learning (ML)-enabled autonomous systems (MLASs), it is essential\nto identify the hazard boundary of ML Components (MLCs) in the MLAS under\nanalysis. Given that such boundary captures the conditions in terms of MLC\nbehavior and system context that can lead to hazards, it can then be used to,\nfor example, build a safety monitor that can take any predefined fallback\nmechanisms at runtime when reaching the hazard boundary. However, determining\nsuch hazard boundary for an ML component is challenging. This is due to the\nproblem space combining system contexts (i.e., scenarios) and MLC behaviors\n(i.e., inputs and outputs) being far too large for exhaustive exploration and\neven to handle using conventional metaheuristics, such as genetic algorithms.\nAdditionally, the high computational cost of simulations required to determine\nany MLAS safety violations makes the problem even more challenging.\nFurthermore, it is unrealistic to consider a region in the problem space\ndeterministically safe or unsafe due to the uncontrollable parameters in\nsimulations and the non-linear behaviors of ML models (e.g., deep neural\nnetworks) in the MLAS under analysis. 
To address the challenges, we propose\nMLCSHE (ML Component Safety Hazard Envelope), a novel method based on a\nCooperative Co-Evolutionary Algorithm (CCEA), which aims to tackle a\nhigh-dimensional problem by decomposing it into two lower-dimensional search\nsubproblems. Moreover, we take a probabilistic view of safe and unsafe regions\nand define a novel fitness function to measure the distance from the\nprobabilistic hazard boundary and thus drive the search effectively. We\nevaluate the effectiveness and efficiency of MLCSHE on a complex Autonomous\nVehicle (AV) case study. Our evaluation results show that MLCSHE is\nsignificantly more effective and efficient compared to a standard genetic\nalgorithm and random search.\n","authors":["Sepehr Sharifi","Donghwan Shin","Lionel C. Briand","Nathan Aschbacher"],"pdf_url":"https://arxiv.org/pdf/2301.13807v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11228v1","updated":"2023-07-20T20:46:39Z","published":"2023-07-20T20:46:39Z","title":"From Adaptive Query Release to Machine Unlearning","summary":" We formalize the problem of machine unlearning as design of efficient\nunlearning algorithms corresponding to learning algorithms which perform a\nselection of adaptive queries from structured query classes. We give efficient\nunlearning algorithms for linear and prefix-sum query classes. As applications,\nwe show that unlearning in many problems, in particular, stochastic convex\noptimization (SCO), can be reduced to the above, yielding improved guarantees\nfor the problem. In particular, for smooth Lipschitz losses and any $\\rho>0$,\nour results yield an unlearning algorithm with excess population risk of\n$\\tilde O\\big(\\frac{1}{\\sqrt{n}}+\\frac{\\sqrt{d}}{n\\rho}\\big)$ with unlearning\nquery (gradient) complexity $\\tilde O(\\rho \\cdot \\text{Retraining\nComplexity})$, where $d$ is the model dimensionality and $n$ is the initial\nnumber of samples. For non-smooth Lipschitz losses, we give an unlearning\nalgorithm with excess population risk $\\tilde\nO\\big(\\frac{1}{\\sqrt{n}}+\\big(\\frac{\\sqrt{d}}{n\\rho}\\big)^{1/2}\\big)$ with the\nsame unlearning query (gradient) complexity. Furthermore, in the special case\nof Generalized Linear Models (GLMs), such as those in linear and logistic\nregression, we get dimension-independent rates of $\\tilde\nO\\big(\\frac{1}{\\sqrt{n}} +\\frac{1}{(n\\rho)^{2/3}}\\big)$ and $\\tilde\nO\\big(\\frac{1}{\\sqrt{n}} +\\frac{1}{(n\\rho)^{1/3}}\\big)$ for smooth Lipschitz\nand non-smooth Lipschitz losses respectively. Finally, we give generalizations\nof the above from one unlearning request to \\textit{dynamic} streams consisting\nof insertions and deletions.\n","authors":["Enayat Ullah","Raman Arora"],"pdf_url":"https://arxiv.org/pdf/2307.11228v1.pdf","comment":"Accepted to ICML 2023"},{"id":"http://arxiv.org/abs/2307.11224v1","updated":"2023-07-20T20:37:24Z","published":"2023-07-20T20:37:24Z","title":"Jina Embeddings: A Novel Set of High-Performance Sentence Embedding\n Models","summary":" Jina Embeddings constitutes a set of high-performance sentence embedding\nmodels adept at translating various textual inputs into numerical\nrepresentations, thereby capturing the semantic essence of the text. While\nthese models are not exclusively designed for text generation, they excel in\napplications such as dense retrieval and semantic textual similarity. This\npaper details the development of Jina Embeddings, starting with the creation of\na high-quality pairwise and triplet dataset. 
It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-depth insights into the model\ntraining process, and concludes with a comprehensive performance evaluation\nusing the Massive Textual Embedding Benchmark (MTEB).\n","authors":["Michael Günther","Louis Milliken","Jonathan Geuter","Georgios Mastrapas","Bo Wang","Han Xiao"],"pdf_url":"https://arxiv.org/pdf/2307.11224v1.pdf","comment":"9 pages, 2 page appendix, EMNLP 2023 Industrial Track"},{"id":"http://arxiv.org/abs/2307.11214v1","updated":"2023-07-20T19:56:30Z","published":"2023-07-20T19:56:30Z","title":"FairMobi-Net: A Fairness-aware Deep Learning Model for Urban Mobility\n Flow Generation","summary":" Generating realistic human flows across regions is essential for our\nunderstanding of urban structures and population activity patterns, enabling\nimportant applications in the fields of urban planning and management. However,\na notable shortcoming of most existing mobility generation methodologies is\nneglect of prediction fairness, which can result in underestimation of mobility\nflows across regions with vulnerable population groups, potentially resulting\nin inequitable resource distribution and infrastructure development. To\novercome this limitation, our study presents a novel, fairness-aware deep\nlearning model, FairMobi-Net, for inter-region human flow prediction. The\nFairMobi-Net model uniquely incorporates fairness loss into the loss function\nand employs a hybrid approach, merging binary classification and numerical\nregression techniques for human flow prediction. We validate the FairMobi-Net\nmodel using comprehensive human mobility datasets from four U.S. cities,\npredicting human flow at the census-tract level. Our findings reveal that the\nFairMobi-Net model outperforms state-of-the-art models (such as the DeepGravity\nmodel) in producing more accurate and equitable human flow predictions across a\nvariety of region pairs, regardless of regional income differences. The model\nmaintains a high degree of accuracy consistently across diverse regions,\naddressing the previous fairness concern. Further analysis of feature\nimportance elucidates the impact of physical distances and road network\nstructures on human flows across regions. With fairness as its touchstone, the\nmodel and results provide researchers and practitioners across the fields of\nurban sciences, transportation engineering, and computing with an effective\ntool for accurate generation of human mobility flows across regions.\n","authors":["Zhewei Liu","Lipai Huang","Chao Fan","Ali Mostafavi"],"pdf_url":"https://arxiv.org/pdf/2307.11214v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11211v1","updated":"2023-07-20T19:53:09Z","published":"2023-07-20T19:53:09Z","title":"The Effect of Epidemiological Cohort Creation on the Machine Learning\n Prediction of Homelessness and Police Interaction Outcomes Using\n Administrative Health Care Data","summary":" Background: Mental illness can lead to adverse outcomes such as homelessness\nand police interaction and understanding of the events leading up to these\nadverse outcomes is important. Predictive models may help identify individuals\nat risk of such adverse outcomes. Using a fixed observation window cohort with\nlogistic regression (LR) or machine learning (ML) models can result in lower\nperformance when compared with adaptive and parcellated windows. 
Method: An\nadministrative healthcare dataset was used, comprising of 240,219 individuals\nin Calgary, Alberta, Canada who were diagnosed with addiction or mental health\n(AMH) between April 1, 2013, and March 31, 2018. The cohort was followed for 2\nyears to identify factors associated with homelessness and police interactions.\nTo understand the benefit of flexible windows to predictive models, an\nalternative cohort was created. Then LR and ML models, including random forests\n(RF), and extreme gradient boosting (XGBoost) were compared in the two cohorts.\nResults: Among 237,602 individuals, 0.8% (1,800) experienced first\nhomelessness, while 0.32% (759) reported initial police interaction among\n237,141 individuals. Male sex (AORs: H=1.51, P=2.52), substance disorder (AORs:\nH=3.70, P=2.83), psychiatrist visits (AORs: H=1.44, P=1.49), and drug abuse\n(AORs: H=2.67, P=1.83) were associated with initial homelessness (H) and police\ninteraction (P). XGBoost showed superior performance using the flexible method\n(sensitivity =91%, AUC =90% for initial homelessness, and sensitivity =90%,\nAUC=89% for initial police interaction)\n Conclusion: This study identified key features associated with initial\nhomelessness and police interaction and demonstrated that flexible windows can\nimprove predictive modeling.\n","authors":["Faezehsadat Shahidi","M. Ethan MacDonald","Dallas Seitz","Geoffrey Messier"],"pdf_url":"https://arxiv.org/pdf/2307.11211v1.pdf","comment":"to be published in Frontiers in Digital Health, Health Informatics"},{"id":"http://arxiv.org/abs/2307.11209v1","updated":"2023-07-20T19:52:14Z","published":"2023-07-20T19:52:14Z","title":"Clinical Trial Active Learning","summary":" This paper presents a novel approach to active learning that takes into\naccount the non-independent and identically distributed (non-i.i.d.) structure\nof a clinical trial setting. There exists two types of clinical trials:\nretrospective and prospective. Retrospective clinical trials analyze data after\ntreatment has been performed; prospective clinical trials collect data as\ntreatment is ongoing. Typically, active learning approaches assume the dataset\nis i.i.d. when selecting training samples; however, in the case of clinical\ntrials, treatment results in a dependency between the data collected at the\ncurrent and past visits. Thus, we propose prospective active learning to\novercome the limitations present in traditional active learning methods and\napply it to disease detection in optical coherence tomography (OCT) images,\nwhere we condition on the time an image was collected to enforce the i.i.d.\nassumption. We compare our proposed method to the traditional active learning\nparadigm, which we refer to as retrospective in nature. We demonstrate that\nprospective active learning outperforms retrospective active learning in two\ndifferent types of test settings.\n","authors":["Zoe Fowler","Kiran Kokilepersaud","Mohit Prabhushankar","Ghassan AlRegib"],"pdf_url":"https://arxiv.org/pdf/2307.11209v1.pdf","comment":"Accepted at 14th ACM International Conference on Bioinformatics,\n Computational Biology and Health Informatics (ACM-BCB)"},{"id":"http://arxiv.org/abs/2307.06324v4","updated":"2023-07-20T19:51:06Z","published":"2023-07-12T17:41:07Z","title":"Provably Faster Gradient Descent via Long Steps","summary":" This work establishes provably faster convergence rates for gradient descent\nin smooth convex optimization via a computer-assisted analysis technique. 
Our\ntheory allows nonconstant stepsize policies with frequent long steps\npotentially violating descent by analyzing the overall effect of many\niterations at once rather than the typical one-iteration inductions used in\nmost first-order method analyses. We show that long steps, which may increase\nthe objective value in the short term, lead to provably faster convergence in\nthe long term. A conjecture towards proving a faster $O(1/T\\log T)$ rate for\ngradient descent is also motivated along with simple numerical validation.\n","authors":["Benjamin Grimmer"],"pdf_url":"https://arxiv.org/pdf/2307.06324v4.pdf","comment":"Apologies for the several updates done shortly after first posting\n this work: In these, I have added more references to excellent relevant works\n I missed in my initial literature review, esp the Master's thesis of Jason\n Altschuler"},{"id":"http://arxiv.org/abs/2210.03297v2","updated":"2023-07-20T19:28:22Z","published":"2022-10-07T03:10:34Z","title":"Preprocessors Matter! Realistic Decision-Based Attacks on Machine\n Learning Systems","summary":" Decision-based attacks construct adversarial examples against a machine\nlearning (ML) model by making only hard-label queries. These attacks have\nmainly been applied directly to standalone neural networks. However, in\npractice, ML models are just one component of a larger learning system. We find\nthat by adding a single preprocessor in front of a classifier, state-of-the-art\nquery-based attacks are up to 7$\\times$ less effective at attacking a\nprediction pipeline than at attacking the model alone. We explain this\ndiscrepancy by the fact that most preprocessors introduce some notion of\ninvariance to the input space. Hence, attacks that are unaware of this\ninvariance inevitably waste a large number of queries to re-discover or\novercome it. We, therefore, develop techniques to (i) reverse-engineer the\npreprocessor and then (ii) use this extracted information to attack the\nend-to-end system. Our preprocessors extraction method requires only a few\nhundred queries, and our preprocessor-aware attacks recover the same efficacy\nas when attacking the model alone. The code can be found at\nhttps://github.com/google-research/preprocessor-aware-black-box-attack.\n","authors":["Chawin Sitawarin","Florian Tramèr","Nicholas Carlini"],"pdf_url":"https://arxiv.org/pdf/2210.03297v2.pdf","comment":"ICML 2023. Code can be found at\n https://github.com/google-research/preprocessor-aware-black-box-attack"},{"id":"http://arxiv.org/abs/2307.11197v1","updated":"2023-07-20T19:20:35Z","published":"2023-07-20T19:20:35Z","title":"Heuristic Hyperparameter Choice for Image Anomaly Detection","summary":" Anomaly detection (AD) in images is a fundamental computer vision problem by\ndeep learning neural network to identify images deviating significantly from\nnormality. The deep features extracted from pretrained models have been proved\nto be essential for AD based on multivariate Gaussian distribution analysis.\nHowever, since models are usually pretrained on a large dataset for\nclassification tasks such as ImageNet, they might produce lots of redundant\nfeatures for AD, which increases computational cost and degrades the\nperformance. We aim to do the dimension reduction of Negated Principal\nComponent Analysis (NPCA) for these features. So we proposed some heuristic to\nchoose hyperparameter of NPCA algorithm for getting as fewer components of\nfeatures as possible while ensuring a good performance.\n","authors":["Zeyu Jiang","João P. C. 
Bertoldo","Etienne Decencière"],"pdf_url":"https://arxiv.org/pdf/2307.11197v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.03933v2","updated":"2023-07-20T19:12:45Z","published":"2023-06-06T18:01:03Z","title":"High-dimensional and Permutation Invariant Anomaly Detection","summary":" Methods for anomaly detection of new physics processes are often limited to\nlow-dimensional spaces due to the difficulty of learning high-dimensional\nprobability densities. Particularly at the constituent level, incorporating\ndesirable properties such as permutation invariance and variable-length inputs\nbecomes difficult within popular density estimation methods. In this work, we\nintroduce a permutation-invariant density estimator for particle physics data\nbased on diffusion models, specifically designed to handle variable-length\ninputs. We demonstrate the efficacy of our methodology by utilizing the learned\ndensity as a permutation-invariant anomaly detection score, effectively\nidentifying jets with low likelihood under the background-only hypothesis. To\nvalidate our density estimation method, we investigate the ratio of learned\ndensities and compare to those obtained by a supervised classification\nalgorithm.\n","authors":["Vinicius Mikuni","Benjamin Nachman"],"pdf_url":"https://arxiv.org/pdf/2306.03933v2.pdf","comment":"7 pages, 5 figures"},{"id":"http://arxiv.org/abs/2212.12606v2","updated":"2023-07-20T18:58:11Z","published":"2022-12-23T22:44:25Z","title":"A Convergence Rate for Manifold Neural Networks","summary":" High-dimensional data arises in numerous applications, and the rapidly\ndeveloping field of geometric deep learning seeks to develop neural network\narchitectures to analyze such data in non-Euclidean domains, such as graphs and\nmanifolds. Recent work by Z. Wang, L. Ruiz, and A. Ribeiro has introduced a\nmethod for constructing manifold neural networks using the spectral\ndecomposition of the Laplace Beltrami operator. Moreover, in this work, the\nauthors provide a numerical scheme for implementing such neural networks when\nthe manifold is unknown and one only has access to finitely many sample points.\nThe authors show that this scheme, which relies upon building a data-driven\ngraph, converges to the continuum limit as the number of sample points tends to\ninfinity. Here, we build upon this result by establishing a rate of convergence\nthat depends on the intrinsic dimension of the manifold but is independent of\nthe ambient dimension. We also discuss how the rate of convergence depends on\nthe depth of the network and the number of filters used in each layer.\n","authors":["Joyce Chew","Deanna Needell","Michael Perlmutter"],"pdf_url":"https://arxiv.org/pdf/2212.12606v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.11589v2","updated":"2023-07-20T18:48:37Z","published":"2022-10-20T21:01:14Z","title":"Monotonic Risk Relationships under Distribution Shifts for Regularized\n Risk Minimization","summary":" Machine learning systems are often applied to data that is drawn from a\ndifferent distribution than the training distribution. Recent work has shown\nthat for a variety of classification and signal reconstruction problems, the\nout-of-distribution performance is strongly linearly correlated with the\nin-distribution performance. If this relationship or more generally a monotonic\none holds, it has important consequences. For example, it allows to optimize\nperformance on one distribution as a proxy for performance on the other. 
In\nthis paper, we study conditions under which a monotonic relationship between\nthe performances of a model on two distributions is expected. We prove an exact\nasymptotic linear relation for squared error and a monotonic relation for\nmisclassification error for ridge-regularized general linear models under\ncovariate shift, as well as an approximate linear relation for linear inverse\nproblems.\n","authors":["Daniel LeJeune","Jiayu Liu","Reinhard Heckel"],"pdf_url":"https://arxiv.org/pdf/2210.11589v2.pdf","comment":"34 pages, 7 figures"},{"id":"http://arxiv.org/abs/2307.09782v2","updated":"2023-07-20T18:47:20Z","published":"2023-07-19T06:58:03Z","title":"ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization\n Using Floating-Point Formats","summary":" In the complex domain of large language models (LLMs), striking a balance\nbetween computational efficiency and maintaining model quality is a formidable\nchallenge. Navigating the inherent limitations of uniform quantization,\nparticularly when dealing with outliers, and motivated by the launch of\nNVIDIA's H100 hardware, this study delves into the viability of floating-point\n(FP) quantization, particularly focusing on FP8 and FP4, as a potential\nsolution. Our comprehensive investigation reveals that for LLMs, FP8 activation\nconsistently outshines its integer (INT8) equivalent, with the performance edge\nbecoming more noticeable in models possessing parameters beyond one billion.\nFor weight quantization, our findings indicate that FP4 exhibits comparable, if\nnot superior, performance to INT4, simplifying deployment on FP-supported\nhardware like H100. To mitigate the overhead from precision alignment caused by\nthe disparity between weights and activations, we propose two scaling\nconstraints for weight quantization that negligibly impact the performance\ncompared to the standard W4A8 model. We additionally enhance our quantization\nmethods by integrating the Low Rank Compensation (LoRC) strategy, yielding\nimprovements especially in smaller models. The results of our investigation\nemphasize the immense potential of FP quantization for LLMs, paving the way for\nhigh-efficiency deployment in resource-limited settings.\n","authors":["Xiaoxia Wu","Zhewei Yao","Yuxiong He"],"pdf_url":"https://arxiv.org/pdf/2307.09782v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11166v1","updated":"2023-07-20T18:01:48Z","published":"2023-07-20T18:01:48Z","title":"Exploring reinforcement learning techniques for discrete and continuous\n control tasks in the MuJoCo environment","summary":" We leverage the fast physics simulator, MuJoCo to run tasks in a continuous\ncontrol environment and reveal details like the observation space, action\nspace, rewards, etc. for each task. We benchmark value-based methods for\ncontinuous control by comparing Q-learning and SARSA through a discretization\napproach, and using them as baselines, progressively moving into one of the\nstate-of-the-art deep policy gradient method DDPG. Over a large number of\nepisodes, Qlearning outscored SARSA, but DDPG outperformed both in a small\nnumber of episodes. Lastly, we also fine-tuned the model hyper-parameters\nexpecting to squeeze more performance but using lesser time and resources. We\nanticipated that the new design for DDPG would vastly improve performance, yet\nafter only a few episodes, we were able to achieve decent average rewards. 
We\nexpect to improve the performance provided adequate time and computational\nresources.\n","authors":["Vaddadi Sai Rahul","Debajyoti Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2307.11166v1.pdf","comment":"Released @ Dec 2021. For associated project files, see\n https://github.com/chakrabortyde/mujoco-control-tasks"}],"Multimedia":[{"id":"http://arxiv.org/abs/2307.11025v1","updated":"2023-07-20T16:53:41Z","published":"2023-07-20T16:53:41Z","title":"Investigating VTubing as a Reconstruction of Streamer Self-Presentation:\n Identity, Performance, and Gender","summary":" VTubers, or Virtual YouTubers, are live streamers who create streaming\ncontent using animated 2D or 3D virtual avatars. In recent years, there has\nbeen a significant increase in the number of VTuber creators and viewers across\nthe globe. This practise has drawn research attention into topics such as\nviewers' engagement behaviors and perceptions, however, as animated avatars\noffer more identity and performance flexibility than traditional live streaming\nwhere one uses their own body, little research has focused on how this\nflexibility influences how creators present themselves. This research thus\nseeks to fill this gap by presenting results from a qualitative study of 16\nChinese-speaking VTubers' streaming practices. The data revealed that the\nvirtual avatars that were used while live streaming afforded creators\nopportunities to present themselves using inflated presentations and resulted\nin inclusive interactions with viewers. The results also unveiled the inflated,\nand often sexualized, gender expressions of VTubers while they were situated in\nmisogynistic environments. The socio-technical facets of VTubing were found to\npotentially reduce sexual harassment and sexism, whilst also raising\nself-objectification concerns.\n","authors":["Qian Wan","Zhicong Lu"],"pdf_url":"https://arxiv.org/pdf/2307.11025v1.pdf","comment":"Under review at ACM CSCW after a Major Revision"},{"id":"http://arxiv.org/abs/2210.05335v3","updated":"2023-07-20T16:24:14Z","published":"2022-10-11T10:54:54Z","title":"MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model","summary":" Multimodal semantic understanding often has to deal with uncertainty, which\nmeans the obtained messages tend to refer to multiple targets. Such uncertainty\nis problematic for our interpretation, including inter- and intra-modal\nuncertainty. Little effort has studied the modeling of this uncertainty,\nparticularly in pre-training on unlabeled datasets and fine-tuning in\ntask-specific downstream datasets. In this paper, we project the\nrepresentations of all modalities as probabilistic distributions via a\nProbability Distribution Encoder (PDE) by utilizing sequence-level\ninteractions. Compared to the existing deterministic methods, such uncertainty\nmodeling can convey richer multimodal semantic information and more complex\nrelationships. Furthermore, we integrate uncertainty modeling with popular\npre-training frameworks and propose suitable pre-training tasks:\nDistribution-based Vision-Language Contrastive learning (D-VLC),\nDistribution-based Masked Language Modeling (D-MLM), and Distribution-based\nImage-Text Matching (D-ITM). 
The fine-tuned models are applied to challenging\ndownstream tasks, including image-text retrieval, visual question answering,\nvisual reasoning, and visual entailment, and achieve state-of-the-art results.\n","authors":["Yatai Ji","Junjie Wang","Yuan Gong","Lin Zhang","Yanru Zhu","Hongfa Wang","Jiaxing Zhang","Tetsuya Sakai","Yujiu Yang"],"pdf_url":"https://arxiv.org/pdf/2210.05335v3.pdf","comment":"CVPR 2023 Main Track Long Paper"},{"id":"http://arxiv.org/abs/2307.10802v1","updated":"2023-07-20T12:10:29Z","published":"2023-07-20T12:10:29Z","title":"Meta-Transformer: A Unified Framework for Multimodal Learning","summary":" Multimodal learning aims to build models that can process and relate\ninformation from multiple modalities. Despite years of development in this\nfield, it still remains challenging to design a unified network for processing\nvarious modalities ($\\textit{e.g.}$ natural language, 2D images, 3D point\nclouds, audio, video, time series, tabular data) due to the inherent gaps among\nthem. In this work, we propose a framework, named Meta-Transformer, that\nleverages a $\\textbf{frozen}$ encoder to perform multimodal perception without\nany paired multimodal training data. In Meta-Transformer, the raw input data\nfrom various modalities are mapped into a shared token space, allowing a\nsubsequent encoder with frozen parameters to extract high-level semantic\nfeatures of the input data. Composed of three main components: a unified data\ntokenizer, a modality-shared encoder, and task-specific heads for downstream\ntasks, Meta-Transformer is the first framework to perform unified learning\nacross 12 modalities with unpaired data. Experiments on different benchmarks\nreveal that Meta-Transformer can handle a wide range of tasks including\nfundamental perception (text, image, point cloud, audio, video), practical\napplication (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph,\ntabular, and time-series). Meta-Transformer indicates a promising future for\ndeveloping unified multimodal intelligence with transformers. Code will be\navailable at https://github.com/invictus717/MetaTransformer\n","authors":["Yiyuan Zhang","Kaixiong Gong","Kaipeng Zhang","Hongsheng Li","Yu Qiao","Wanli Ouyang","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2307.10802v1.pdf","comment":"Project website: https://kxgong.github.io/meta_transformer/"},{"id":"http://arxiv.org/abs/2303.12112v3","updated":"2023-07-20T08:16:09Z","published":"2023-03-21T18:03:14Z","title":"Positive-Augmented Contrastive Learning for Image and Video Captioning\n Evaluation","summary":" The CLIP model has been recently proven to be very effective for a variety of\ncross-modal tasks, including the evaluation of captions generated from\nvision-and-language architectures. In this paper, we propose a new recipe for a\ncontrastive-based evaluation metric for image captioning, namely\nPositive-Augmented Contrastive learning Score (PAC-S), that in a novel way\nunifies the learning of a contrastive visual-semantic space with the addition\nof generated images and text on curated data. Experiments spanning several\ndatasets demonstrate that our new metric achieves the highest correlation with\nhuman judgments on both images and videos, outperforming existing\nreference-based metrics like CIDEr and SPICE and reference-free metrics like\nCLIP-Score. 
Finally, we test the system-level correlation of the proposed\nmetric when considering popular image captioning approaches, and assess the\nimpact of employing different cross-modal features. Our source code and trained\nmodels are publicly available at: https://github.com/aimagelab/pacscore.\n","authors":["Sara Sarto","Manuele Barraco","Marcella Cornia","Lorenzo Baraldi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2303.12112v3.pdf","comment":"CVPR 2023 (highlight paper)"},{"id":"http://arxiv.org/abs/2307.10642v1","updated":"2023-07-20T07:12:56Z","published":"2023-07-20T07:12:56Z","title":"RetouchingFFHQ: A Large-scale Dataset for Fine-grained Face Retouching\n Detection","summary":" The widespread use of face retouching filters on short-video platforms has\nraised concerns about the authenticity of digital appearances and the impact of\ndeceptive advertising. To address these issues, there is a pressing need to\ndevelop advanced face retouching techniques. However, the lack of large-scale\nand fine-grained face retouching datasets has been a major obstacle to progress\nin this field. In this paper, we introduce RetouchingFFHQ, a large-scale and\nfine-grained face retouching dataset that contains over half a million\nconditionally-retouched images. RetouchingFFHQ stands out from previous\ndatasets due to its large scale, high quality, fine-grainedness, and\ncustomization. By including four typical types of face retouching operations\nand different retouching levels, we extend the binary face retouching detection\ninto a fine-grained, multi-retouching type, and multi-retouching level\nestimation problem. Additionally, we propose a Multi-granularity Attention\nModule (MAM) as a plugin for CNN backbones for enhanced cross-scale\nrepresentation learning. Extensive experiments using different baselines as\nwell as our proposed method on RetouchingFFHQ show decent performance on face\nretouching detection. With the proposed new dataset, we believe there is great\npotential for future work to tackle the challenging problem of real-world\nfine-grained face retouching detection.\n","authors":["Qichao Ying","Jiaxin Liu","Sheng Li","Haisheng Xu","Zhenxing Qian","Xinpeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.10642v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2306.03718v4","updated":"2023-07-20T03:06:50Z","published":"2023-06-06T14:28:57Z","title":"Emotion-Conditioned Melody Harmonization with Hierarchical Variational\n Autoencoder","summary":" Existing melody harmonization models have made great progress in improving\nthe quality of generated harmonies, but most of them ignored the emotions\nbeneath the music. Meanwhile, the variability of harmonies generated by\nprevious methods is insufficient. To solve these problems, we propose a novel\nLSTM-based Hierarchical Variational Auto-Encoder (LHVAE) to investigate the\ninfluence of emotional conditions on melody harmonization, while improving the\nquality of generated harmonies and capturing the abundant variability of chord\nprogressions. Specifically, LHVAE incorporates latent variables and emotional\nconditions at different levels (piece- and bar-level) to model the global and\nlocal music properties. Additionally, we introduce an attention-based melody\ncontext vector at each step to better learn the correspondence between melodies\nand harmonies. Objective experimental results show that our proposed model\noutperforms other LSTM-based models. 
Through subjective evaluation, we conclude\nthat only altering the types of chords hardly changes the overall emotion of\nthe music. The qualitative analysis demonstrates the ability of our model to\ngenerate variable harmonies.\n","authors":["Shulei Ji","Xinyu Yang"],"pdf_url":"https://arxiv.org/pdf/2306.03718v4.pdf","comment":"Accepted by IEEE SMC 2023"}]},"2023-07-21T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2307.11729v1","updated":"2023-07-21T17:40:47Z","published":"2023-07-21T17:40:47Z","title":"OUTFOX: LLM-generated Essay Detection through In-context Learning with\n Adversarially Generated Examples","summary":" Large Language Models (LLMs) have achieved human-level fluency in text\ngeneration, making it difficult to distinguish between human-written and\nLLM-generated texts. This poses a growing risk of misuse of LLMs and demands\nthe development of detectors to identify LLM-generated texts. However, existing\ndetectors degrade detection accuracy by simply paraphrasing LLM-generated\ntexts. Furthermore, the effectiveness of these detectors in real-life\nsituations, such as when students use LLMs for writing homework assignments\n(e.g., essays) and quickly learn how to evade these detectors, has not been\nexplored. In this paper, we propose OUTFOX, a novel framework that improves the\nrobustness of LLM-generated-text detectors by allowing both the detector and\nthe attacker to consider each other's output and apply this to the domain of\nstudent essays. In our framework, the attacker uses the detector's prediction\nlabels as examples for in-context learning and adversarially generates essays\nthat are harder to detect. While the detector uses the adversarially generated\nessays as examples for in-context learning to learn to detect essays from a\nstrong attacker. Our experiments show that our proposed detector learned\nin-context from the attacker improves the detection performance on the attacked\ndataset by up to +41.3 point F1-score. While our proposed attacker can\ndrastically degrade the performance of the detector by up to -57.0 point\nF1-score compared to the paraphrasing method.\n","authors":["Ryuto Koike","Masahiro Kaneko","Naoaki Okazaki"],"pdf_url":"https://arxiv.org/pdf/2307.11729v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10490v2","updated":"2023-07-21T16:51:15Z","published":"2023-07-19T23:03:20Z","title":"(Ab)using Images and Sounds for Indirect Instruction Injection in\n Multi-Modal LLMs","summary":" We demonstrate how images and sounds can be used for indirect prompt and\ninstruction injection in multi-modal LLMs. An attacker generates an adversarial\nperturbation corresponding to the prompt and blends it into an image or audio\nrecording. When the user asks the (unmodified, benign) model about the\nperturbed image or audio, the perturbation steers the model to output the\nattacker-chosen text and/or make the subsequent dialog follow the attacker's\ninstruction. 
We illustrate this attack with several proof-of-concept examples\ntargeting LLaVa and PandaGPT.\n","authors":["Eugene Bagdasaryan","Tsung-Yin Hsieh","Ben Nassi","Vitaly Shmatikov"],"pdf_url":"https://arxiv.org/pdf/2307.10490v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.06576v3","updated":"2023-07-21T16:06:32Z","published":"2023-07-13T06:25:22Z","title":"Going Beyond Local: Global Graph-Enhanced Personalized News\n Recommendations","summary":" Precisely recommending candidate news articles to users has always been a\ncore challenge for personalized news recommendation systems. Most recent works\nprimarily focus on using advanced natural language processing techniques to\nextract semantic information from rich textual data, employing content-based\nmethods derived from local historical news. However, this approach lacks a\nglobal perspective, failing to account for users' hidden motivations and\nbehaviors beyond semantic information. To address this challenge, we propose a\nnovel model called GLORY (Global-LOcal news Recommendation sYstem), which\ncombines global representations learned from other users with local\nrepresentations to enhance personalized recommendation systems. We accomplish\nthis by constructing a Global-aware Historical News Encoder, which includes a\nglobal news graph and employs gated graph neural networks to enrich news\nrepresentations, thereby fusing historical news representations by a historical\nnews aggregator. Similarly, we extend this approach to a Global Candidate News\nEncoder, utilizing a global entity graph and a candidate news aggregator to\nenhance candidate news representation. Evaluation results on two public news\ndatasets demonstrate that our method outperforms existing approaches.\nFurthermore, our model offers more diverse recommendations.\n","authors":["Boming Yang","Dairui Liu","Toyotaro Suzumura","Ruihai Dong","Irene Li"],"pdf_url":"https://arxiv.org/pdf/2307.06576v3.pdf","comment":"10 pages, Recsys 2023"},{"id":"http://arxiv.org/abs/2307.11661v1","updated":"2023-07-21T15:49:59Z","published":"2023-07-21T15:49:59Z","title":"Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts","summary":" Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have\nrevolutionized visual representation learning by providing good performance on\ndownstream datasets. VLMs are 0-shot adapted to a downstream dataset by\ndesigning prompts that are relevant to the dataset. Such prompt engineering\nmakes use of domain expertise and a validation dataset. Meanwhile, recent\ndevelopments in generative pretrained models like GPT-4 mean they can be used\nas advanced internet search tools. They can also be manipulated to provide\nvisual information in any structure. In this work, we show that GPT-4 can be\nused to generate text that is visually descriptive and how this can be used to\nadapt CLIP to downstream tasks. We show considerable improvements in 0-shot\ntransfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD\n(~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt.\nWe also design a simple few-shot adapter that learns to choose the best\npossible sentences to construct generalizable classifiers that outperform the\nrecently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized\nfine-grained datasets. 
We will release the code, prompts, and auxiliary text\ndataset upon acceptance.\n","authors":["Mayug Maniparambil","Chris Vorster","Derek Molloy","Noel Murphy","Kevin McGuinness","Noel E. O'Connor"],"pdf_url":"https://arxiv.org/pdf/2307.11661v1.pdf","comment":"10 pages, Pre-print"},{"id":"http://arxiv.org/abs/2307.11636v1","updated":"2023-07-21T14:58:44Z","published":"2023-07-21T14:58:44Z","title":"OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?","summary":" This paper presents OxfordTVG-HIC (Humorous Image Captions), a large-scale\ndataset for humour generation and understanding. Humour is an abstract,\nsubjective, and context-dependent cognitive construct involving several\ncognitive factors, making it a challenging task to generate and interpret.\nHence, humour generation and understanding can serve as a new task for\nevaluating the ability of deep-learning methods to process abstract and\nsubjective information. Due to the scarcity of data, humour-related generation\ntasks such as captioning remain under-explored. To address this gap,\nOxfordTVG-HIC offers approximately 2.9M image-text pairs with humour scores to\ntrain a generalizable humour captioning model. Contrary to existing captioning\ndatasets, OxfordTVG-HIC features a wide range of emotional and semantic\ndiversity resulting in out-of-context examples that are particularly conducive\nto generating humour. Moreover, OxfordTVG-HIC is curated devoid of offensive\ncontent. We also show how OxfordTVG-HIC can be leveraged for evaluating the\nhumour of a generated text. Through explainability analysis of the trained\nmodels, we identify the visual and linguistic cues influential for evoking\nhumour prediction (and generation). We observe qualitatively that these cues\nare aligned with the benign violation theory of humour in cognitive psychology.\n","authors":["Runjia Li","Shuyang Sun","Mohamed Elhoseiny","Philip Torr"],"pdf_url":"https://arxiv.org/pdf/2307.11636v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2212.09648v4","updated":"2023-07-21T14:44:45Z","published":"2022-12-19T17:28:22Z","title":"NusaCrowd: Open Source Initiative for Indonesian NLP Resources","summary":" We present NusaCrowd, a collaborative initiative to collect and unify\nexisting resources for Indonesian languages, including opening access to\npreviously non-public resources. Through this initiative, we have brought\ntogether 137 datasets and 118 standardized data loaders. The quality of the\ndatasets has been assessed manually and automatically, and their value is\ndemonstrated through multiple experiments. NusaCrowd's data collection enables\nthe creation of the first zero-shot benchmarks for natural language\nunderstanding and generation in Indonesian and the local languages of\nIndonesia. Furthermore, NusaCrowd brings the creation of the first multilingual\nautomatic speech recognition benchmark in Indonesian and the local languages of\nIndonesia. Our work strives to advance natural language processing (NLP)\nresearch for languages that are under-represented despite being widely spoken.\n","authors":["Samuel Cahyawijaya","Holy Lovenia","Alham Fikri Aji","Genta Indra Winata","Bryan Wilie","Rahmad Mahendra","Christian Wibisono","Ade Romadhony","Karissa Vincentio","Fajri Koto","Jennifer Santoso","David Moeljadi","Cahya Wirawan","Frederikus Hudi","Ivan Halim Parmonangan","Ika Alfina","Muhammad Satrio Wicaksono","Ilham Firdausi Putra","Samsul Rahmadani","Yulianti Oenang","Ali Akbar Septiandri","James Jaya","Kaustubh D. 
Dhole","Arie Ardiyanti Suryani","Rifki Afina Putri","Dan Su","Keith Stevens","Made Nindyatama Nityasya","Muhammad Farid Adilazuarda","Ryan Ignatius","Ryandito Diandaru","Tiezheng Yu","Vito Ghifari","Wenliang Dai","Yan Xu","Dyah Damapuspita","Cuk Tho","Ichwanul Muslim Karo Karo","Tirana Noor Fatyanosa","Ziwei Ji","Pascale Fung","Graham Neubig","Timothy Baldwin","Sebastian Ruder","Herry Sujaini","Sakriani Sakti","Ayu Purwarianti"],"pdf_url":"https://arxiv.org/pdf/2212.09648v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11610v1","updated":"2023-07-21T14:25:39Z","published":"2023-07-21T14:25:39Z","title":"CausE: Towards Causal Knowledge Graph Embedding","summary":" Knowledge graph embedding (KGE) focuses on representing the entities and\nrelations of a knowledge graph (KG) into the continuous vector spaces, which\ncan be employed to predict the missing triples to achieve knowledge graph\ncompletion (KGC). However, KGE models often only briefly learn structural\ncorrelations of triple data and embeddings would be misled by the trivial\npatterns and noisy links in real-world KGs. To address this issue, we build the\nnew paradigm of KGE in the context of causality and embedding disentanglement.\nWe further propose a Causality-enhanced knowledge graph Embedding (CausE)\nframework. CausE employs causal intervention to estimate the causal effect of\nthe confounder embeddings and design new training objectives to make stable\npredictions. Experimental results demonstrate that CausE could outperform the\nbaseline models and achieve state-of-the-art KGC performance. We release our\ncode in https://github.com/zjukg/CausE.\n","authors":["Yichi Zhang","Wen Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.11610v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2107.00841v3","updated":"2023-07-21T14:03:40Z","published":"2021-07-02T05:29:39Z","title":"ClueReader: Heterogeneous Graph Attention Network for Multi-hop Machine\n Reading Comprehension","summary":" Multi-hop machine reading comprehension is a challenging task in natural\nlanguage processing as it requires more reasoning ability across multiple\ndocuments. Spectral models based on graph convolutional networks have shown\ngood inferring abilities and lead to competitive results. However, the analysis\nand reasoning of some are inconsistent with those of humans. Inspired by the\nconcept of grandmother cells in cognitive neuroscience, we propose a\nheterogeneous graph attention network model named ClueReader to imitate the\ngrandmother cell concept. The model is designed to assemble the semantic\nfeatures in multi-level representations and automatically concentrate or\nalleviate information for reasoning through the attention mechanism. The name\nClueReader is a metaphor for the pattern of the model: it regards the subjects\nof queries as the starting points of clues, takes the reasoning entities as\nbridge points, considers the latent candidate entities as grandmother cells,\nand the clues end up in candidate entities. The proposed model enables the\nvisualization of the reasoning graph, making it possible to analyze the\nimportance of edges connecting entities and the selectivity in the mention and\ncandidate nodes, which is easier to comprehend empirically. 
Evaluations on the\nopen-domain multi-hop reading dataset WikiHop and drug-drug interaction dataset\nMedHop proved the validity of ClueReader and showed the feasibility of its\napplication of the model in the molecular biology domain.\n","authors":["Peng Gao","Feng Gao","Peng Wang","Jian-Cheng Ni","Fei Wang","Hamido Fujita"],"pdf_url":"https://arxiv.org/pdf/2107.00841v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11584v1","updated":"2023-07-21T13:48:11Z","published":"2023-07-21T13:48:11Z","title":"A Change of Heart: Improving Speech Emotion Recognition through\n Speech-to-Text Modality Conversion","summary":" Speech Emotion Recognition (SER) is a challenging task. In this paper, we\nintroduce a modality conversion concept aimed at enhancing emotion recognition\nperformance on the MELD dataset. We assess our approach through two\nexperiments: first, a method named Modality-Conversion that employs automatic\nspeech recognition (ASR) systems, followed by a text classifier; second, we\nassume perfect ASR output and investigate the impact of modality conversion on\nSER, this method is called Modality-Conversion++. Our findings indicate that\nthe first method yields substantial results, while the second method\noutperforms state-of-the-art (SOTA) speech-based approaches in terms of SER\nweighted-F1 (WF1) score on the MELD dataset. This research highlights the\npotential of modality conversion for tasks that can be conducted in alternative\nmodalities.\n","authors":["Zeinab Sadat Taghavi","Ali Satvaty","Hossein Sameti"],"pdf_url":"https://arxiv.org/pdf/2307.11584v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11558v1","updated":"2023-07-21T13:06:02Z","published":"2023-07-21T13:06:02Z","title":"Advancing Visual Grounding with Scene Knowledge: Benchmark and Method","summary":" Visual grounding (VG) aims to establish fine-grained alignment between vision\nand language. Ideally, it can be a testbed for vision-and-language models to\nevaluate their understanding of the images and texts and their reasoning\nabilities over their joint space. However, most existing VG datasets are\nconstructed using simple description texts, which do not require sufficient\nreasoning over the images and texts. This has been demonstrated in a recent\nstudy~\\cite{luo2022goes}, where a simple LSTM-based text encoder without\npretraining can achieve state-of-the-art performance on mainstream VG datasets.\nTherefore, in this paper, we propose a novel benchmark of \\underline{S}cene\n\\underline{K}nowledge-guided \\underline{V}isual \\underline{G}rounding (SK-VG),\nwhere the image content and referring expressions are not sufficient to ground\nthe target objects, forcing the models to have a reasoning ability on the\nlong-form scene knowledge. To perform this task, we propose two approaches to\naccept the triple-type input, where the former embeds knowledge into the image\nfeatures before the image-query interaction; the latter leverages linguistic\nstructure to assist in computing the image-text matching. We conduct extensive\nexperiments to analyze the above methods and show that the proposed approaches\nachieve promising results but still leave room for improvement, including\nperformance and interpretability. The dataset and code are available at\n\\url{https://github.com/zhjohnchan/SK-VG}.\n","authors":["Zhihong Chen","Ruifei Zhang","Yibing Song","Xiang Wan","Guanbin Li"],"pdf_url":"https://arxiv.org/pdf/2307.11558v1.pdf","comment":"Computer Vision and Natural Language Processing. 
21 pages, 14\n figures. CVPR-2023"},{"id":"http://arxiv.org/abs/2307.11545v1","updated":"2023-07-21T12:46:15Z","published":"2023-07-21T12:46:15Z","title":"Bridging Vision and Language Encoders: Parameter-Efficient Tuning for\n Referring Image Segmentation","summary":" Parameter Efficient Tuning (PET) has gained attention for reducing the number\nof parameters while maintaining performance and providing better hardware\nresource savings, but few studies investigate dense prediction tasks and\ninteraction between modalities. In this paper, we do an investigation of\nefficient tuning problems on referring image segmentation. We propose a novel\nadapter called Bridger to facilitate cross-modal information exchange and\ninject task-specific information into the pre-trained model. We also design a\nlightweight decoder for image segmentation. Our approach achieves comparable or\nsuperior performance with only 1.61\\% to 3.38\\% backbone parameter updates,\nevaluated on challenging benchmarks. The code is available at\n\\url{https://github.com/kkakkkka/ETRIS}.\n","authors":["Zunnan Xu","Zhihong Chen","Yong Zhang","Yibing Song","Xiang Wan","Guanbin Li"],"pdf_url":"https://arxiv.org/pdf/2307.11545v1.pdf","comment":"Computer Vision and Natural Language Processing. 14 pages, 8 figures.\n ICCV-2023"},{"id":"http://arxiv.org/abs/2307.11516v1","updated":"2023-07-21T11:54:53Z","published":"2023-07-21T11:54:53Z","title":"IndigoVX: Where Human Intelligence Meets AI for Optimal Decision Making","summary":" This paper defines a new approach for augmenting human intelligence with AI\nfor optimal goal solving. Our proposed AI, Indigo, is an acronym for Informed\nNumerical Decision-making through Iterative Goal-Oriented optimization. When\ncombined with a human collaborator, we term the joint system IndigoVX, for\nVirtual eXpert. The system is conceptually simple. We envisage this method\nbeing applied to games or business strategies, with the human providing\nstrategic context and the AI offering optimal, data-driven moves. Indigo\noperates through an iterative feedback loop, harnessing the human expert's\ncontextual knowledge and the AI's data-driven insights to craft and refine\nstrategies towards a well-defined goal. Using a quantified three-score schema,\nthis hybridization allows the combined team to evaluate strategies and refine\ntheir plan, while adapting to challenges and changes in real-time.\n","authors":["Kais Dukes"],"pdf_url":"https://arxiv.org/pdf/2307.11516v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2105.04900v2","updated":"2023-07-21T11:04:29Z","published":"2021-05-11T09:41:25Z","title":"Forecasting consumer confidence through semantic network analysis of\n online news","summary":" This research studies the impact of online news on social and economic\nconsumer perceptions through semantic network analysis. Using over 1.8 million\nonline articles on Italian media covering four years, we calculate the semantic\nimportance of specific economic-related keywords to see if words appearing in\nthe articles could anticipate consumers' judgments about the economic situation\nand the Consumer Confidence Index. We use an innovative approach to analyze big\ntextual data, combining methods and tools of text mining and social network\nanalysis. Results show a strong predictive power for the judgments about the\ncurrent households and national situation. 
Our indicator offers a complementary\napproach to estimating consumer confidence, lessening the limitations of\ntraditional survey-based methods.\n","authors":["A. Fronzetti Colladon","F. Grippa","B. Guardabascio","G. Costante","F. Ravazzolo"],"pdf_url":"https://arxiv.org/pdf/2105.04900v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.12851v2","updated":"2023-07-21T10:22:53Z","published":"2023-05-22T09:20:58Z","title":"Enhancing Coherence of Extractive Summarization with Multitask Learning","summary":" This study proposes a multitask learning architecture for extractive\nsummarization with coherence boosting. The architecture contains an extractive\nsummarizer and coherent discriminator module. The coherent discriminator is\ntrained online on the sentence vectors of the augmented textual input, thus\nimproving its general ability of judging whether the input sentences are\ncoherent. Meanwhile, we maximize the coherent scores from the coherent\ndiscriminator by updating the parameters of the summarizer. To make the\nextractive sentences trainable in a differentiable manner, we introduce two\nstrategies, including pre-trained converting model (model-based) and converting\nmatrix (MAT-based) that merge sentence representations. Experiments show that\nour proposed method significantly improves the proportion of consecutive\nsentences in the extracted summaries based on their positions in the original\narticle (i.e., automatic sentence-level coherence metric), while the goodness\nin terms of other automatic metrics (i.e., Rouge scores and BertScores) are\npreserved. Human evaluation also evidences the improvement of coherence and\nconsistency of the extracted summaries given by our method.\n","authors":["Renlong Jie","Xiaojun Meng","Lifeng Shang","Xin Jiang","Qun Liu"],"pdf_url":"https://arxiv.org/pdf/2305.12851v2.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2307.11457v1","updated":"2023-07-21T09:39:50Z","published":"2023-07-21T09:39:50Z","title":"Incorporating Human Translator Style into English-Turkish Literary\n Machine Translation","summary":" Although machine translation systems are mostly designed to serve in the\ngeneral domain, there is a growing tendency to adapt these systems to other\ndomains like literary translation. In this paper, we focus on English-Turkish\nliterary translation and develop machine translation models that take into\naccount the stylistic features of translators. We fine-tune a pre-trained\nmachine translation model by the manually-aligned works of a particular\ntranslator. We make a detailed analysis of the effects of manual and automatic\nalignments, data augmentation methods, and corpus size on the translations. We\npropose an approach based on stylistic features to evaluate the style of a\ntranslator in the output translations. 
We show that the human translator style\ncan be highly recreated in the target machine translations by adapting the\nmodels to the style of the translator.\n","authors":["Zeynep Yirmibeşoğlu","Olgun Dursun","Harun Dallı","Mehmet Şahin","Ena Hodzik","Sabri Gürses","Tunga Güngör"],"pdf_url":"https://arxiv.org/pdf/2307.11457v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11450v1","updated":"2023-07-21T09:30:46Z","published":"2023-07-21T09:30:46Z","title":"Topic Identification For Spontaneous Speech: Enriching Audio Features\n With Embedded Linguistic Information","summary":" Traditional topic identification solutions from audio rely on an automatic\nspeech recognition system (ASR) to produce transcripts used as input to a\ntext-based model. These approaches work well in high-resource scenarios, where\nthere are sufficient data to train both components of the pipeline. However, in\nlow-resource situations, the ASR system, even if available, produces\nlow-quality transcripts, leading to a bad text-based classifier. Moreover,\nspontaneous speech containing hesitations can further degrade the performance\nof the ASR model. In this paper, we investigate alternatives to the standard\ntext-only solutions by comparing audio-only and hybrid techniques of jointly\nutilising text and audio features. The models evaluated on spontaneous Finnish\nspeech demonstrate that purely audio-based solutions are a viable option when\nASR components are not available, while the hybrid multi-modal solutions\nachieve the best results.\n","authors":["Dejan Porjazovski","Tamás Grósz","Mikko Kurimo"],"pdf_url":"https://arxiv.org/pdf/2307.11450v1.pdf","comment":"Accepted to EUSIPCO 2023"},{"id":"http://arxiv.org/abs/2306.14096v3","updated":"2023-07-21T08:57:38Z","published":"2023-06-25T02:24:30Z","title":"Chinese Fine-Grained Financial Sentiment Analysis with Large Language\n Models","summary":" Entity-level fine-grained sentiment analysis in the financial domain is a\ncrucial subtask of sentiment analysis and currently faces numerous challenges.\nThe primary challenge stems from the lack of high-quality and large-scale\nannotated corpora specifically designed for financial text sentiment analysis,\nwhich in turn limits the availability of data necessary for developing\neffective text processing techniques. Recent advancements in large language\nmodels (LLMs) have yielded remarkable performance in natural language\nprocessing tasks, primarily centered around language pattern matching. In this\npaper, we propose a novel and extensive Chinese fine-grained financial\nsentiment analysis dataset, FinChina SA, for enterprise early warning. We\nthoroughly evaluate and experiment with well-known existing open-source LLMs\nusing our dataset. We firmly believe that our dataset will serve as a valuable\nresource to advance the exploration of real-world financial sentiment analysis\ntasks, which should be the focus of future research. 
Our dataset and all code\nto replicate the experimental results will be released.\n","authors":["Yinyu Lan","Yanru Wu","Wang Xu","Weiqiang Feng","Youhao Zhang"],"pdf_url":"https://arxiv.org/pdf/2306.14096v3.pdf","comment":"FinLLM Symposium at IJCAI 2023"},{"id":"http://arxiv.org/abs/2306.02250v2","updated":"2023-07-21T07:46:03Z","published":"2023-06-04T03:46:45Z","title":"Large Language Model Augmented Narrative Driven Recommendations","summary":" Narrative-driven recommendation (NDR) presents an information access problem\nwhere users solicit recommendations with verbose descriptions of their\npreferences and context, for example, travelers soliciting recommendations for\npoints of interest while describing their likes/dislikes and travel\ncircumstances. These requests are increasingly important with the rise of\nnatural language-based conversational interfaces for search and recommendation\nsystems. However, NDR lacks abundant training data for models, and current\nplatforms commonly do not support these requests. Fortunately, classical\nuser-item interaction datasets contain rich textual data, e.g., reviews, which\noften describe user preferences and context - this may be used to bootstrap\ntraining for NDR models. In this work, we explore using large language models\n(LLMs) for data augmentation to train NDR models. We use LLMs for authoring\nsynthetic narrative queries from user-item interactions with few-shot prompting\nand train retrieval models for NDR on synthetic queries and user-item\ninteraction data. Our experiments demonstrate that this is an effective\nstrategy for training small-parameter retrieval models that outperform other\nretrieval and LLM baselines for narrative-driven recommendation.\n","authors":["Sheshera Mysore","Andrew McCallum","Hamed Zamani"],"pdf_url":"https://arxiv.org/pdf/2306.02250v2.pdf","comment":"RecSys 2023 Camera-ready"},{"id":"http://arxiv.org/abs/2304.04250v2","updated":"2023-07-21T07:39:58Z","published":"2023-04-09T14:52:18Z","title":"Editable User Profiles for Controllable Text Recommendation","summary":" Methods for making high-quality recommendations often rely on learning latent\nrepresentations from interaction data. These methods, while performant, do not\nprovide ready mechanisms for users to control the recommendation they receive.\nOur work tackles this problem by proposing LACE, a novel concept value\nbottleneck model for controllable text recommendations. LACE represents each\nuser with a succinct set of human-readable concepts through retrieval given\nuser-interacted documents and learns personalized representations of the\nconcepts based on user documents. This concept based user profile is then\nleveraged to make recommendations. The design of our model affords control over\nthe recommendations through a number of intuitive interactions with a\ntransparent user profile. We first establish the quality of recommendations\nobtained from LACE in an offline evaluation on three recommendation tasks\nspanning six datasets in warm-start, cold-start, and zero-shot setups. 
Next, we\nvalidate the controllability of LACE under simulated user interactions.\nFinally, we implement LACE in an interactive controllable recommender system\nand conduct a user study to demonstrate that users are able to improve the\nquality of recommendations they receive through interactions with an editable\nuser profile.\n","authors":["Sheshera Mysore","Mahmood Jasim","Andrew McCallum","Hamed Zamani"],"pdf_url":"https://arxiv.org/pdf/2304.04250v2.pdf","comment":"SIGIR-2023 Camera Ready"},{"id":"http://arxiv.org/abs/2307.11394v1","updated":"2023-07-21T07:22:18Z","published":"2023-07-21T07:22:18Z","title":"MeetEval: A Toolkit for Computation of Word Error Rates for Meeting\n Transcription Systems","summary":" MeetEval is an open-source toolkit to evaluate all kinds of meeting\ntranscription systems. It provides a unified interface for the computation of\ncommonly used Word Error Rates (WERs), specifically cpWER, ORC WER and MIMO WER\nalong other WER definitions. We extend the cpWER computation by a temporal\nconstraint to ensure that only words are identified as correct when the\ntemporal alignment is plausible. This leads to a better quality of the matching\nof the hypothesis string to the reference string that more closely resembles\nthe actual transcription quality, and a system is penalized if it provides poor\ntime annotations. Since word-level timing information is often not available,\nwe present a way to approximate exact word-level timings from segment-level\ntimings (e.g., a sentence) and show that the approximation leads to a similar\nWER as a matching with exact word-level annotations. At the same time, the time\nconstraint leads to a speedup of the matching algorithm, which outweighs the\nadditional overhead caused by processing the time stamps.\n","authors":["Thilo von Neumann","Christoph Boeddeker","Marc Delcroix","Reinhold Haeb-Umbach"],"pdf_url":"https://arxiv.org/pdf/2307.11394v1.pdf","comment":"Accepted for presentation at the Chime7 workshop 2023"},{"id":"http://arxiv.org/abs/2306.17519v2","updated":"2023-07-21T06:57:49Z","published":"2023-06-30T10:12:30Z","title":"GPT-FinRE: In-context Learning for Financial Relation Extraction using\n Large Language Models","summary":" Relation extraction (RE) is a crucial task in natural language processing\n(NLP) that aims to identify and classify relationships between entities\nmentioned in text. In the financial domain, relation extraction plays a vital\nrole in extracting valuable information from financial documents, such as news\narticles, earnings reports, and company filings. This paper describes our\nsolution to relation extraction on one such dataset REFinD. The dataset was\nreleased along with shared task as a part of the Fourth Workshop on Knowledge\nDiscovery from Unstructured Data in Financial Services, co-located with SIGIR\n2023. In this paper, we employed OpenAI models under the framework of\nin-context learning (ICL). We utilized two retrieval strategies to find top K\nrelevant in-context learning demonstrations / examples from training data for a\ngiven test example. The first retrieval mechanism, we employed, is a\nlearning-free dense retriever and the other system is a learning-based\nretriever. We were able to achieve 3rd rank overall. 
Our best F1-score is\n0.718.\n","authors":["Pawan Kumar Rajpoot","Ankur Parikh"],"pdf_url":"https://arxiv.org/pdf/2306.17519v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2305.02105 by other authors"},{"id":"http://arxiv.org/abs/2307.11380v1","updated":"2023-07-21T06:38:37Z","published":"2023-07-21T06:38:37Z","title":"Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect\n ChatGPT-Generated Text","summary":" The remarkable capabilities of large-scale language models, such as ChatGPT,\nin text generation have incited awe and spurred researchers to devise detectors\nto mitigate potential risks, including misinformation, phishing, and academic\ndishonesty. Despite this, most previous studies, including HC3, have been\npredominantly geared towards creating detectors that differentiate between\npurely ChatGPT-generated texts and human-authored texts. This approach,\nhowever, fails to work on discerning texts generated through human-machine\ncollaboration, such as ChatGPT-polished texts. Addressing this gap, we\nintroduce a novel dataset termed HPPT (ChatGPT-polished academic abstracts),\nfacilitating the construction of more robust detectors. It diverges from extant\ncorpora by comprising pairs of human-written and ChatGPT-polished abstracts\ninstead of purely ChatGPT-generated texts. Additionally, we propose the \"Polish\nRatio\" method, an innovative measure of ChatGPT's involvement in text\ngeneration based on editing distance. It provides a mechanism to measure the\ndegree of human originality in the resulting text. Our experimental results\nshow our proposed model has better robustness on the HPPT dataset and two\nexisting datasets (HC3 and CDB). Furthermore, the \"Polish Ratio\" we proposed\noffers a more comprehensive explanation by quantifying the degree of ChatGPT\ninvolvement, which indicates that a Polish Ratio value greater than 0.2\nsignifies ChatGPT involvement and a value exceeding 0.6 implies that ChatGPT\ngenerates most of the text.\n","authors":["Lingyi Yang","Feng Jiang","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2307.11380v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11346v1","updated":"2023-07-21T04:43:00Z","published":"2023-07-21T04:43:00Z","title":"CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study","summary":" Participant recruitment based on unstructured medical texts such as clinical\nnotes and radiology reports has been a challenging yet important task for the\ncohort establishment in clinical research. Recently, Large Language Models\n(LLMs) such as ChatGPT have achieved tremendous success in various downstream\ntasks thanks to their promising performance in language understanding,\ninference, and generation. It is then natural to test their feasibility in\nsolving the cohort recruitment task, which involves the classification of a\ngiven paragraph of medical text into disease label(s). However, when applied to\nknowledge-intensive problem settings such as medical text classification, where\nthe LLMs are expected to understand the decision made by human experts and\naccurately identify the implied disease labels, the LLMs show a mediocre\nperformance. A possible explanation is that, by only using the medical text,\nthe LLMs neglect to use the rich context of additional information that\nlanguages afford. To this end, we propose to use a knowledge graph as auxiliary\ninformation to guide the LLMs in making predictions. 
Moreover, to further boost\nthe LLMs adapt to the problem setting, we apply a chain-of-thought (CoT) sample\nselection strategy enhanced by reinforcement learning, which selects a set of\nCoT samples given each individual medical report. Experimental results and\nvarious ablation studies show that our few-shot learning method achieves\nsatisfactory performance compared with fine-tuning strategies and gains superb\nadvantages when the available data is limited. The code and sample dataset of\nthe proposed CohortGPT model is available at:\nhttps://anonymous.4open.science/r/CohortGPT-4872/\n","authors":["Zihan Guan","Zihao Wu","Zhengliang Liu","Dufan Wu","Hui Ren","Quanzheng Li","Xiang Li","Ninghao Liu"],"pdf_url":"https://arxiv.org/pdf/2307.11346v1.pdf","comment":"16 pages, 10 figures"},{"id":"http://arxiv.org/abs/2307.11344v1","updated":"2023-07-21T04:22:43Z","published":"2023-07-21T04:22:43Z","title":"DEFTri: A Few-Shot Label Fused Contextual Representation Learning For\n Product Defect Triage in e-Commerce","summary":" Defect Triage is a time-sensitive and critical process in a large-scale agile\nsoftware development lifecycle for e-commerce. Inefficiencies arising from\nhuman and process dependencies in this domain have motivated research in\nautomated approaches using machine learning to accurately assign defects to\nqualified teams. This work proposes a novel framework for automated defect\ntriage (DEFTri) using fine-tuned state-of-the-art pre-trained BERT on labels\nfused text embeddings to improve contextual representations from\nhuman-generated product defects. For our multi-label text classification defect\ntriage task, we also introduce a Walmart proprietary dataset of product defects\nusing weak supervision and adversarial learning, in a few-shot setting.\n","authors":["Ipsita Mohanty"],"pdf_url":"https://arxiv.org/pdf/2307.11344v1.pdf","comment":"In Proceedings of the Fifth Workshop on e-Commerce and NLP ECNLP 5\n 2022 Pages 1-7"},{"id":"http://arxiv.org/abs/2307.11316v1","updated":"2023-07-21T02:51:41Z","published":"2023-07-21T02:51:41Z","title":"Making Pre-trained Language Models both Task-solvers and\n Self-calibrators","summary":" Pre-trained language models (PLMs) serve as backbones for various real-world\nsystems. For high-stake applications, it's equally essential to have reasonable\nconfidence estimations in predictions. While the vanilla confidence scores of\nPLMs can already be effectively utilized, PLMs consistently become\noverconfident in their wrong predictions, which is not desirable in practice.\nPrevious work shows that introducing an extra calibration task can mitigate\nthis issue. The basic idea involves acquiring additional data to train models\nin predicting the confidence of their initial predictions. However, it only\ndemonstrates the feasibility of this kind of method, assuming that there are\nabundant extra available samples for the introduced calibration task. In this\nwork, we consider the practical scenario that we need to effectively utilize\ntraining samples to make PLMs both task-solvers and self-calibrators. Three\nchallenges are presented, including limited training samples, data imbalance,\nand distribution shifts. We first conduct pilot experiments to quantify various\ndecisive factors in the calibration task. 
Based on the empirical analysis\nresults, we propose a training algorithm LM-TOAST to tackle the challenges.\nExperimental results show that LM-TOAST can effectively utilize the training\ndata to make PLMs have reasonable confidence estimations while maintaining the\noriginal task performance. Further, we consider three downstream applications,\nnamely selective classification, adversarial defense, and model cascading, to\nshow the practical usefulness of LM-TOAST. The code will be made public at\n\\url{https://github.com/Yangyi-Chen/LM-TOAST}.\n","authors":["Yangyi Chen","Xingyao Wang","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2307.11316v1.pdf","comment":"Accepted to Findings of ACL 2023"},{"id":"http://arxiv.org/abs/2307.11315v1","updated":"2023-07-21T02:47:18Z","published":"2023-07-21T02:47:18Z","title":"Generating Image-Specific Text Improves Fine-grained Image\n Classification","summary":" Recent vision-language models outperform vision-only models on many image\nclassification tasks. However, because of the absence of paired text/image\ndescriptions, it remains difficult to fine-tune these models for fine-grained\nimage classification. In this work, we propose a method, GIST, for generating\nimage-specific fine-grained text descriptions from image-only datasets, and\nshow that these text descriptions can be used to improve classification. Key\nparts of our method include 1. prompting a pretrained large language model with\ndomain-specific prompts to generate diverse fine-grained text descriptions for\neach class and 2. using a pretrained vision-language model to match each image\nto label-preserving text descriptions that capture relevant visual features in\nthe image. We demonstrate the utility of GIST by fine-tuning vision-language\nmodels on the image-and-generated-text pairs to learn an aligned\nvision-language representation space for improved classification. We evaluate\nour learned representation space in full-shot and few-shot scenarios across\nfour diverse fine-grained classification datasets, each from a different\ndomain. Our method achieves an average improvement of $4.1\\%$ in accuracy over\nCLIP linear probes and an average of $1.1\\%$ improvement in accuracy over the\nprevious state-of-the-art image-text classification method on the full-shot\ndatasets. Our method achieves similar improvements across few-shot regimes.\nCode is available at https://github.com/emu1729/GIST.\n","authors":["Emily Mu","Kathleen M. Lewis","Adrian V. Dalca","John Guttag"],"pdf_url":"https://arxiv.org/pdf/2307.11315v1.pdf","comment":"The first two authors contributed equally to this work"},{"id":"http://arxiv.org/abs/2307.10291v2","updated":"2023-07-21T02:34:58Z","published":"2023-07-18T14:30:36Z","title":"Mutual Reinforcement Effects in Japanese Sentence Classification and\n Named Entity Recognition Tasks","summary":" Information extraction(IE) is a crucial subfield within natural language\nprocessing. However, for the traditionally segmented approach to sentence\nclassification and Named Entity Recognition, the intricate interactions between\nthese individual subtasks remain largely uninvestigated. In this study, we\npropose an integrative analysis, converging sentence classification with Named\nEntity Recognition, with the objective to unveil and comprehend the mutual\nreinforcement effect within these two information extraction subtasks. 
To\nachieve this, we introduce a Sentence Classification and Named Entity\nRecognition Multi-task (SCNM) approach that combines Sentence Classification\n(SC) and Named Entity Recognition (NER). We develop a Sentence-to-Label\nGeneration (SLG) framework for SCNM and construct a Wikipedia dataset\ncontaining both SC and NER. Using a format converter, we unify input formats\nand employ a generative model to generate SC-labels, NER-labels, and associated\ntext segments. We propose a Constraint Mechanism (CM) to improve generated\nformat accuracy. Our results show SC accuracy increased by 1.13 points and NER\nby 1.06 points in SCNM compared to standalone tasks, with CM raising format\naccuracy from 63.61 to 100. The findings indicate mutual reinforcement effects\nbetween SC and NER, and integration enhances both tasks' performance. We\nadditionally implemented the SLG framework on single SC task. It yielded\nsuperior accuracies compared to the baseline on two distinct Japanese SC\ndatasets. Notably, in the experiment of few-shot learning, SLG framework shows\nmuch better performance than fine-tune method. These empirical findings\ncontribute additional evidence to affirm the efficacy of the SLG framework.\n","authors":["Chengguang Gan","Qinghao Zhang","Tatsunori Mori"],"pdf_url":"https://arxiv.org/pdf/2307.10291v2.pdf","comment":"25 pages, 12 figures, 19 tables. arXiv admin note: substantial text\n overlap with arXiv:2306.15978"},{"id":"http://arxiv.org/abs/2307.10432v2","updated":"2023-07-21T02:22:14Z","published":"2023-07-19T19:40:34Z","title":"PharmacyGPT: The AI Pharmacist","summary":" In this study, we introduce PharmacyGPT, a novel framework to assess the\ncapabilities of large language models (LLMs) such as ChatGPT and GPT-4 in\nemulating the role of clinical pharmacists. Our methodology encompasses the\nutilization of LLMs to generate comprehensible patient clusters, formulate\nmedication plans, and forecast patient outcomes. We conduct our investigation\nusing real data acquired from the intensive care unit (ICU) at the University\nof North Carolina Chapel Hill (UNC) Hospital. Our analysis offers valuable\ninsights into the potential applications and limitations of LLMs in the field\nof clinical pharmacy, with implications for both patient care and the\ndevelopment of future AI-driven healthcare solutions. By evaluating the\nperformance of PharmacyGPT, we aim to contribute to the ongoing discourse\nsurrounding the integration of artificial intelligence in healthcare settings,\nultimately promoting the responsible and efficacious use of such technologies.\n","authors":["Zhengliang Liu","Zihao Wu","Mengxuan Hu","Bokai Zhao","Lin Zhao","Tianyi Zhang","Haixing Dai","Xianyan Chen","Ye Shen","Sheng Li","Brian Murray","Tianming Liu","Andrea Sikora"],"pdf_url":"https://arxiv.org/pdf/2307.10432v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13971v2","updated":"2023-07-21T01:58:13Z","published":"2023-06-24T13:57:32Z","title":"Towards Robust Aspect-based Sentiment Analysis through\n Non-counterfactual Augmentations","summary":" While state-of-the-art NLP models have demonstrated excellent performance for\naspect based sentiment analysis (ABSA), substantial evidence has been presented\non their lack of robustness. This is especially manifested as significant\ndegradation in performance when faced with out-of-distribution data. 
Recent\nsolutions that rely on counterfactually augmented datasets show promising\nresults, but they are inherently limited because of the lack of access to\nexplicit causal structure. In this paper, we present an alternative approach\nthat relies on non-counterfactual data augmentation. Our proposal instead\nrelies on using noisy, cost-efficient data augmentations that preserve\nsemantics associated with the target aspect. Our approach then relies on\nmodelling invariances between different versions of the data to improve\nrobustness. A comprehensive suite of experiments shows that our proposal\nsignificantly improves upon strong pre-trained baselines on both standard and\nrobustness-specific datasets. Our approach further establishes a new\nstate-of-the-art on the ABSA robustness benchmark and transfers well across\ndomains.\n","authors":["Xinyu Liu","Yan Ding","Kaikai An","Chunyang Xiao","Pranava Madhyastha","Tong Xiao","Jingbo Zhu"],"pdf_url":"https://arxiv.org/pdf/2306.13971v2.pdf","comment":"10pages,1 figure,10 tables"},{"id":"http://arxiv.org/abs/2307.11278v1","updated":"2023-07-21T00:34:38Z","published":"2023-07-21T00:34:38Z","title":"Generator-Retriever-Generator: A Novel Approach to Open-domain Question\n Answering","summary":" Open-domain question answering (QA) tasks usually require the retrieval of\nrelevant information from a large corpus to generate accurate answers. We\npropose a novel approach called Generator-Retriever-Generator (GRG) that\ncombines document retrieval techniques with a large language model (LLM), by\nfirst prompting the model to generate contextual documents based on a given\nquestion. In parallel, a dual-encoder network retrieves documents that are\nrelevant to the question from an external corpus. The generated and retrieved\ndocuments are then passed to the second LLM, which generates the final answer.\nBy combining document retrieval and LLM generation, our approach addresses the\nchallenges of open-domain QA, such as generating informative and contextually\nrelevant answers. GRG outperforms the state-of-the-art generate-then-read and\nretrieve-then-read pipelines (GENREAD and RFiD) improving their performance at\nleast by +5.2, +4.2, and +1.6 on TriviaQA, NQ, and WebQ datasets, respectively.\nWe provide code, datasets, and checkpoints\n\\footnote{\\url{https://github.com/abdoelsayed2016/GRG}}\n","authors":["Abdelrahman Abdallah","Adam Jatowt"],"pdf_url":"https://arxiv.org/pdf/2307.11278v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.08584v3","updated":"2023-07-21T22:08:45Z","published":"2022-11-15T23:57:34Z","title":"Toward expanding the scope of radiology report summarization to multiple\n anatomies and modalities","summary":" Radiology report summarization (RRS) is a growing area of research. Given the\nFindings section of a radiology report, the goal is to generate a summary\n(called an Impression section) that highlights the key observations and\nconclusions of the radiology study. However, RRS currently faces essential\nlimitations.First, many prior studies conduct experiments on private datasets,\npreventing reproduction of results and fair comparisons across different\nsystems and solutions. Second, most prior approaches are evaluated solely on\nchest X-rays. To address these limitations, we propose a dataset (MIMIC-RRS)\ninvolving three new modalities and seven new anatomies based on the MIMIC-III\nand MIMIC-CXR datasets. 
We then conduct extensive experiments to evaluate the\nperformance of models both within and across modality-anatomy pairs in\nMIMIC-RRS. In addition, we evaluate their clinical efficacy via RadGraph, a\nfactual correctness metric.\n","authors":["Zhihong Chen","Maya Varma","Xiang Wan","Curtis Langlotz","Jean-Benoit Delbrouck"],"pdf_url":"https://arxiv.org/pdf/2211.08584v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11922v1","updated":"2023-07-21T22:02:50Z","published":"2023-07-21T22:02:50Z","title":"Selective Perception: Optimizing State Descriptions with Reinforcement\n Learning for Language Model Actors","summary":" Large language models (LLMs) are being applied as actors for sequential\ndecision making tasks in domains such as robotics and games, utilizing their\ngeneral world knowledge and planning abilities. However, previous work does\nlittle to explore what environment state information is provided to LLM actors\nvia language. Exhaustively describing high-dimensional states can impair\nperformance and raise inference costs for LLM actors. Previous LLM actors avoid\nthe issue by relying on hand-engineered, task-specific protocols to determine\nwhich features to communicate about a state and which to leave out. In this\nwork, we propose Brief Language INputs for DEcision-making Responses (BLINDER),\na method for automatically selecting concise state descriptions by learning a\nvalue function for task-conditioned state descriptions. We evaluate BLINDER on\nthe challenging video game NetHack and a robotic manipulation task. Our method\nimproves task success rate, reduces input size and compute costs, and\ngeneralizes between LLM actors.\n","authors":["Kolby Nottingham","Yasaman Razeghi","Kyungmin Kim","JB Lanier","Pierre Baldi","Roy Fox","Sameer Singh"],"pdf_url":"https://arxiv.org/pdf/2307.11922v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.03341v3","updated":"2023-07-21T19:11:58Z","published":"2023-06-06T01:26:53Z","title":"Inference-Time Intervention: Eliciting Truthful Answers from a Language\n Model","summary":" We introduce Inference-Time Intervention (ITI), a technique designed to\nenhance the truthfulness of large language models (LLMs). ITI operates by\nshifting model activations during inference, following a set of directions\nacross a limited number of attention heads. This intervention significantly\nimproves the performance of LLaMA models on the TruthfulQA benchmark. On an\ninstruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from\n32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and\ndemonstrate how to balance it by tuning the intervention strength. ITI is\nminimally invasive and computationally inexpensive. Moreover, the technique is\ndata efficient: while approaches like RLHF require extensive annotations, ITI\nlocates truthful directions using only few hundred examples. 
Our findings\nsuggest that LLMs may have an internal representation of the likelihood of\nsomething being true, even as they produce falsehoods on the surface.\n","authors":["Kenneth Li","Oam Patel","Fernanda Viégas","Hanspeter Pfister","Martin Wattenberg"],"pdf_url":"https://arxiv.org/pdf/2306.03341v3.pdf","comment":"code: https://github.com/likenneth/honest_llama"},{"id":"http://arxiv.org/abs/2307.11865v1","updated":"2023-07-21T19:09:37Z","published":"2023-07-21T19:09:37Z","title":"CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction\n Execution for Robots","summary":" This work explores the capacity of large language models (LLMs) to address\nproblems at the intersection of spatial planning and natural language\ninterfaces for navigation.Our focus is on following relatively complex\ninstructions that are more akin to natural conversation than traditional\nexplicit procedural directives seen in robotics. Unlike most prior work, where\nnavigation directives are provided as imperative commands (e.g., go to the\nfridge), we examine implicit directives within conversational interactions. We\nleverage the 3D simulator AI2Thor to create complex and repeatable scenarios at\nscale, and augment it by adding complex language queries for 40 object types.\nWe demonstrate that a robot can better parse descriptive language queries than\nexisting methods by using an LLM to interpret the user interaction in the\ncontext of a list of the objects in the scene.\n","authors":["Nikhil Kakodkar","Dmitriy Rivkin","Bobak H. Baghi","Francois Hogan","Gregory Dudek"],"pdf_url":"https://arxiv.org/pdf/2307.11865v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11864v1","updated":"2023-07-21T19:09:24Z","published":"2023-07-21T19:09:24Z","title":"The Looming Threat of Fake and LLM-generated LinkedIn Profiles:\n Challenges and Opportunities for Detection and Prevention","summary":" In this paper, we present a novel method for detecting fake and Large\nLanguage Model (LLM)-generated profiles in the LinkedIn Online Social Network\nimmediately upon registration and before establishing connections. Early fake\nprofile identification is crucial to maintaining the platform's integrity since\nit prevents imposters from acquiring the private and sensitive information of\nlegitimate users and from gaining an opportunity to increase their credibility\nfor future phishing and scamming activities. This work uses textual information\nprovided in LinkedIn profiles and introduces the Section and Subsection Tag\nEmbedding (SSTE) method to enhance the discriminative characteristics of these\ndata for distinguishing between legitimate profiles and those created by\nimposters manually or by using an LLM. Additionally, the dearth of a large\npublicly available LinkedIn dataset motivated us to collect 3600 LinkedIn\nprofiles for our research. We will release our dataset publicly for research\npurposes. This is, to the best of our knowledge, the first large publicly\navailable LinkedIn dataset for fake LinkedIn account detection. Within our\nparadigm, we assess static and contextualized word embeddings, including GloVe,\nFlair, BERT, and RoBERTa. We show that the suggested method can distinguish\nbetween legitimate and fake profiles with an accuracy of about 95% across all\nword embeddings. 
In addition, we show that SSTE has a promising accuracy for\nidentifying LLM-generated profiles, despite the fact that no LLM-generated\nprofiles were employed during the training phase, and can achieve an accuracy\nof approximately 90% when only 20 LLM-generated profiles are added to the\ntraining set. It is a significant finding since the proliferation of several\nLLMs in the near future makes it extremely challenging to design a single\nsystem that can identify profiles created with various LLMs.\n","authors":["Navid Ayoobi","Sadat Shahriar","Arjun Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2307.11864v1.pdf","comment":"33rd ACM Conference on Hypertext and Social Media (HT '23)"},{"id":"http://arxiv.org/abs/2307.11848v1","updated":"2023-07-21T18:35:24Z","published":"2023-07-21T18:35:24Z","title":"MythQA: Query-Based Large-Scale Check-Worthy Claim Detection through\n Multi-Answer Open-Domain Question Answering","summary":" Check-worthy claim detection aims at providing plausible misinformation to\ndownstream fact-checking systems or human experts to check. This is a crucial\nstep toward accelerating the fact-checking process. Many efforts have been put\ninto how to identify check-worthy claims from a small scale of pre-collected\nclaims, but how to efficiently detect check-worthy claims directly from a\nlarge-scale information source, such as Twitter, remains underexplored. To fill\nthis gap, we introduce MythQA, a new multi-answer open-domain question\nanswering(QA) task that involves contradictory stance mining for query-based\nlarge-scale check-worthy claim detection. The idea behind this is that\ncontradictory claims are a strong indicator of misinformation that merits\nscrutiny by the appropriate authorities. To study this task, we construct\nTweetMythQA, an evaluation dataset containing 522 factoid multi-answer\nquestions based on controversial topics. Each question is annotated with\nmultiple answers. Moreover, we collect relevant tweets for each distinct\nanswer, then classify them into three categories: \"Supporting\", \"Refuting\", and\n\"Neutral\". In total, we annotated 5.3K tweets. Contradictory evidence is\ncollected for all answers in the dataset. Finally, we present a baseline system\nfor MythQA and evaluate existing NLP models for each system component using the\nTweetMythQA dataset. We provide initial benchmarks and identify key challenges\nfor future models to improve upon. Code and data are available at:\nhttps://github.com/TonyBY/Myth-QA\n","authors":["Yang Bai","Anthony Colas","Daisy Zhe Wang"],"pdf_url":"https://arxiv.org/pdf/2307.11848v1.pdf","comment":"Accepted by SIGIR 2023"},{"id":"http://arxiv.org/abs/2307.11845v1","updated":"2023-07-21T18:29:04Z","published":"2023-07-21T18:29:04Z","title":"Multimodal Document Analytics for Banking Process Automation","summary":" In response to growing FinTech competition and the need for improved\noperational efficiency, this research focuses on understanding the potential of\nadvanced document analytics, particularly using multimodal models, in banking\nprocesses. We perform a comprehensive analysis of the diverse banking document\nlandscape, highlighting the opportunities for efficiency gains through\nautomation and advanced analytics techniques in the customer business. Building\non the rapidly evolving field of natural language processing (NLP), we\nillustrate the potential of models such as LayoutXLM, a cross-lingual,\nmultimodal, pre-trained model, for analyzing diverse documents in the banking\nsector. 
This model performs a text token classification on German company\nregister extracts with an overall F1 score performance of around 80\\%. Our\nempirical evidence confirms the critical role of layout information in\nimproving model performance and further underscores the benefits of integrating\nimage information. Interestingly, our study shows that over 75% F1 score can be\nachieved with only 30% of the training data, demonstrating the efficiency of\nLayoutXLM. Through addressing state-of-the-art document analysis frameworks,\nour study aims to enhance process efficiency and demonstrate the real-world\napplicability and benefits of multimodal models within banking.\n","authors":["Christopher Gerling","Stefan Lessmann"],"pdf_url":"https://arxiv.org/pdf/2307.11845v1.pdf","comment":"A Preprint"},{"id":"http://arxiv.org/abs/2307.11795v1","updated":"2023-07-21T08:39:15Z","published":"2023-07-21T08:39:15Z","title":"Prompting Large Language Models with Speech Recognition Abilities","summary":" Large language models have proven themselves highly flexible, able to solve a\nwide range of generative tasks, such as abstractive summarization and\nopen-ended question answering. In this paper we extend the capabilities of LLMs\nby directly attaching a small audio encoder allowing it to perform speech\nrecognition. By directly prepending a sequence of audial embeddings to the text\ntoken embeddings, the LLM can be converted to an automatic speech recognition\n(ASR) system, and be used in the exact same manner as its textual counterpart.\nExperiments on Multilingual LibriSpeech (MLS) show that incorporating a\nconformer encoder into the open sourced LLaMA-7B allows it to outperform\nmonolingual baselines by 18% and perform multilingual speech recognition\ndespite LLaMA being trained overwhelmingly on English text. Furthermore, we\nperform ablation studies to investigate whether the LLM can be completely\nfrozen during training to maintain its original capabilities, scaling up the\naudio encoder, and increasing the audio encoder striding to generate fewer\nembeddings. The results from these studies show that multilingual ASR is\npossible even when the LLM is frozen or when strides of almost 1 second are\nused in the audio encoder opening up the possibility for LLMs to operate on\nlong-form audio.\n","authors":["Yassir Fathullah","Chunyang Wu","Egor Lakomkin","Junteng Jia","Yuan Shangguan","Ke Li","Jinxi Guo","Wenhan Xiong","Jay Mahadeokar","Ozlem Kalinli","Christian Fuegen","Mike Seltzer"],"pdf_url":"https://arxiv.org/pdf/2307.11795v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2307.11748v1","updated":"2023-07-21T17:58:47Z","published":"2023-07-21T17:58:47Z","title":"BandRe: Rethinking Band-Pass Filters for Scale-Wise Object Detection\n Evaluation","summary":" Scale-wise evaluation of object detectors is important for real-world\napplications. However, existing metrics are either coarse or not sufficiently\nreliable. In this paper, we propose novel scale-wise metrics that strike a\nbalance between fineness and reliability, using a filter bank consisting of\ntriangular and trapezoidal band-pass filters. We conduct experiments with two\nmethods on two datasets and show that the proposed metrics can highlight the\ndifferences between the methods and between the datasets. 
Code is available at\nhttps://github.com/shinya7y/UniverseNet .\n","authors":["Yosuke Shinya"],"pdf_url":"https://arxiv.org/pdf/2307.11748v1.pdf","comment":"Honorable Mention Solution Award in Small Object Detection Challenge\n for Spotting Birds, International Conference on Machine Vision Applications\n (MVA) 2023"},{"id":"http://arxiv.org/abs/2108.02226v2","updated":"2023-07-21T17:27:10Z","published":"2021-08-04T18:08:28Z","title":"Terabyte-scale supervised 3D training and benchmarking dataset of the\n mouse kidney","summary":" The performance of machine learning algorithms, when used for segmenting 3D\nbiomedical images, does not reach the level expected based on results achieved\nwith 2D photos. This may be explained by the comparative lack of high-volume,\nhigh-quality training datasets, which require state-of-the-art imaging\nfacilities, domain experts for annotation and large computational and personal\nresources. The HR-Kidney dataset presented in this work bridges this gap by\nproviding 1.7 TB of artefact-corrected synchrotron radiation-based X-ray\nphase-contrast microtomography images of whole mouse kidneys and validated\nsegmentations of 33 729 glomeruli, which corresponds to a one to two orders of\nmagnitude increase over currently available biomedical datasets. The image sets\nalso contain the underlying raw data, threshold- and morphology-based\nsemi-automatic segmentations of renal vasculature and uriniferous tubules, as\nwell as true 3D manual annotations. We therewith provide a broad basis for the\nscientific community to build upon and expand in the fields of image\nprocessing, data augmentation and machine learning, in particular unsupervised\nand semi-supervised learning investigations, as well as transfer learning and\ngenerative adversarial networks.\n","authors":["Willy Kuo","Diego Rossinelli","Georg Schulz","Roland H. Wenger","Simone Hieber","Bert Müller","Vartan Kurtcuoglu"],"pdf_url":"https://arxiv.org/pdf/2108.02226v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.10656v2","updated":"2023-07-21T17:15:24Z","published":"2023-03-19T13:41:59Z","title":"More From Less: Self-Supervised Knowledge Distillation for Routine\n Histopathology Data","summary":" Medical imaging technologies are generating increasingly large amounts of\nhigh-quality, information-dense data. Despite the progress, practical use of\nadvanced imaging technologies for research and diagnosis remains limited by\ncost and availability, so information-sparse data such as H&E stains are relied\non in practice. The study of diseased tissue requires methods which can\nleverage these information-dense data to extract more value from routine,\ninformation-sparse data. Using self-supervised deep learning, we demonstrate\nthat it is possible to distil knowledge during training from information-dense\ndata into models which only require information-sparse data for inference. This\nimproves downstream classification accuracy on information-sparse data, making\nit comparable with the fully-supervised baseline. We find substantial effects\non the learned representations, and this training process identifies subtle\nfeatures which otherwise go undetected. 
This approach enables the design of\nmodels which require only routine images, but contain insights from\nstate-of-the-art data, allowing better use of the available resources.\n","authors":["Lucas Farndale","Robert Insall","Ke Yuan"],"pdf_url":"https://arxiv.org/pdf/2303.10656v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11706v1","updated":"2023-07-21T17:02:55Z","published":"2023-07-21T17:02:55Z","title":"3D Skeletonization of Complex Grapevines for Robotic Pruning","summary":" Robotic pruning of dormant grapevines is an area of active research in order\nto promote vine balance and grape quality, but so far robotic efforts have\nlargely focused on planar, simplified vines not representative of commercial\nvineyards. This paper aims to advance the robotic perception capabilities\nnecessary for pruning in denser and more complex vine structures by extending\nplant skeletonization techniques. The proposed pipeline generates skeletal\ngrapevine models that have lower reprojection error and higher connectivity\nthan baseline algorithms. We also show how 3D and skeletal information enables\nprediction accuracy of pruning weight for dense vines surpassing prior work,\nwhere pruning weight is an important vine metric influencing pruning site\nselection.\n","authors":["Eric Schneider","Sushanth Jayanth","Abhisesh Silwal","George Kantor"],"pdf_url":"https://arxiv.org/pdf/2307.11706v1.pdf","comment":"6 pages, IROS 2023 Computer Vision for Automation"},{"id":"http://arxiv.org/abs/2307.11702v1","updated":"2023-07-21T16:56:36Z","published":"2023-07-21T16:56:36Z","title":"SACReg: Scene-Agnostic Coordinate Regression for Visual Localization","summary":" Scene coordinates regression (SCR), i.e., predicting 3D coordinates for every\npixel of a given image, has recently shown promising potential. However,\nexisting methods remain mostly scene-specific or limited to small scenes and\nthus hardly scale to realistic datasets. In this paper, we propose a new\nparadigm where a single generic SCR model is trained once to be then deployed\nto new test scenes, regardless of their scale and without further finetuning.\nFor a given query image, it collects inputs from off-the-shelf image retrieval\ntechniques and Structure-from-Motion databases: a list of relevant database\nimages with sparse pointwise 2D-3D annotations. The model is based on the\ntransformer architecture and can take a variable number of images and sparse\n2D-3D annotations as input. It is trained on a few diverse datasets and\nsignificantly outperforms other scene regression approaches on several\nbenchmarks, including scene-specific models, for visual localization. In\nparticular, we set a new state of the art on the Cambridge localization\nbenchmark, even outperforming feature-matching-based approaches.\n","authors":["Jerome Revaud","Yohann Cabon","Romain Brégier","JongMin Lee","Philippe Weinzaepfel"],"pdf_url":"https://arxiv.org/pdf/2307.11702v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.06828v3","updated":"2023-07-21T16:54:18Z","published":"2022-11-13T06:03:28Z","title":"Enhancing Few-shot Image Classification with Cosine Transformer","summary":" This paper addresses the few-shot image classification problem, where the\nclassification task is performed on unlabeled query samples given a small\namount of labeled support samples only. One major challenge of the few-shot\nlearning problem is the large variety of object visual appearances that\nprevents the support samples to represent that object comprehensively. 
This\nmight result in a significant difference between support and query samples,\ntherefore undermining the performance of few-shot algorithms. In this paper, we\ntackle the problem by proposing Few-shot Cosine Transformer (FS-CT), where the\nrelational map between supports and queries is effectively obtained for the\nfew-shot tasks. The FS-CT consists of two parts, a learnable prototypical\nembedding network to obtain categorical representations from support samples\nwith hard cases, and a transformer encoder to effectively achieve the\nrelational map from two different support and query samples. We introduce\nCosine Attention, a more robust and stable attention module that enhances the\ntransformer module significantly and therefore improves FS-CT performance from\n5% to over 20% in accuracy compared to the default scaled dot-product\nmechanism. Our method performs competitive results in mini-ImageNet, CUB-200,\nand CIFAR-FS on 1-shot learning and 5-shot learning tasks across backbones and\nfew-shot configurations. We also developed a custom few-shot dataset for Yoga\npose recognition to demonstrate the potential of our algorithm for practical\napplication. Our FS-CT with cosine attention is a lightweight, simple few-shot\nalgorithm that can be applied for a wide range of applications, such as\nhealthcare, medical, and security surveillance. The official implementation\ncode of our Few-shot Cosine Transformer is available at\nhttps://github.com/vinuni-vishc/Few-Shot-Cosine-Transformer\n","authors":["Quang-Huy Nguyen","Cuong Q. Nguyen","Dung D. Le","Hieu H. Pham"],"pdf_url":"https://arxiv.org/pdf/2211.06828v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11661v1","updated":"2023-07-21T15:49:59Z","published":"2023-07-21T15:49:59Z","title":"Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts","summary":" Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have\nrevolutionized visual representation learning by providing good performance on\ndownstream datasets. VLMs are 0-shot adapted to a downstream dataset by\ndesigning prompts that are relevant to the dataset. Such prompt engineering\nmakes use of domain expertise and a validation dataset. Meanwhile, recent\ndevelopments in generative pretrained models like GPT-4 mean they can be used\nas advanced internet search tools. They can also be manipulated to provide\nvisual information in any structure. In this work, we show that GPT-4 can be\nused to generate text that is visually descriptive and how this can be used to\nadapt CLIP to downstream tasks. We show considerable improvements in 0-shot\ntransfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD\n(~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt.\nWe also design a simple few-shot adapter that learns to choose the best\npossible sentences to construct generalizable classifiers that outperform the\nrecently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized\nfine-grained datasets. We will release the code, prompts, and auxiliary text\ndataset upon acceptance.\n","authors":["Mayug Maniparambil","Chris Vorster","Derek Molloy","Noel Murphy","Kevin McGuinness","Noel E. 
O'Connor"],"pdf_url":"https://arxiv.org/pdf/2307.11661v1.pdf","comment":"10 pages, Pre-print"},{"id":"http://arxiv.org/abs/2307.11654v1","updated":"2023-07-21T15:42:01Z","published":"2023-07-21T15:42:01Z","title":"FEDD -- Fair, Efficient, and Diverse Diffusion-based Lesion Segmentation\n and Malignancy Classification","summary":" Skin diseases affect millions of people worldwide, across all ethnicities.\nIncreasing diagnosis accessibility requires fair and accurate segmentation and\nclassification of dermatology images. However, the scarcity of annotated\nmedical images, especially for rare diseases and underrepresented skin tones,\nposes a challenge to the development of fair and accurate models. In this\nstudy, we introduce a Fair, Efficient, and Diverse Diffusion-based framework\nfor skin lesion segmentation and malignancy classification. FEDD leverages\nsemantically meaningful feature embeddings learned through a denoising\ndiffusion probabilistic backbone and processes them via linear probes to\nachieve state-of-the-art performance on Diverse Dermatology Images (DDI). We\nachieve an improvement in intersection over union of 0.18, 0.13, 0.06, and 0.07\nwhile using only 5%, 10%, 15%, and 20% labeled samples, respectively.\nAdditionally, FEDD trained on 10% of DDI demonstrates malignancy classification\naccuracy of 81%, 14% higher compared to the state-of-the-art. We showcase high\nefficiency in data-constrained scenarios while providing fair performance for\ndiverse skin tones and rare malignancy conditions. Our newly annotated DDI\nsegmentation masks and training code can be found on\nhttps://github.com/hectorcarrion/fedd.\n","authors":["Héctor Carrión","Narges Norouzi"],"pdf_url":"https://arxiv.org/pdf/2307.11654v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11643v1","updated":"2023-07-21T15:22:32Z","published":"2023-07-21T15:22:32Z","title":"Morphological Image Analysis and Feature Extraction for Reasoning with\n AI-based Defect Detection and Classification Models","summary":" As the use of artificial intelligent (AI) models becomes more prevalent in\nindustries such as engineering and manufacturing, it is essential that these\nmodels provide transparent reasoning behind their predictions. This paper\nproposes the AI-Reasoner, which extracts the morphological characteristics of\ndefects (DefChars) from images and utilises decision trees to reason with the\nDefChar values. Thereafter, the AI-Reasoner exports visualisations (i.e.\ncharts) and textual explanations to provide insights into outputs made by\nmasked-based defect detection and classification models. It also provides\neffective mitigation strategies to enhance data pre-processing and overall\nmodel performance. The AI-Reasoner was tested on explaining the outputs of an\nIE Mask R-CNN model using a set of 366 images containing defects. The results\ndemonstrated its effectiveness in explaining the IE Mask R-CNN model's\npredictions. 
Overall, the proposed AI-Reasoner provides a solution for\nimproving the performance of AI models in industrial applications that require\ndefect analysis.\n","authors":["Jiajun Zhang","Georgina Cosma","Sarah Bugby","Axel Finke","Jason Watkins"],"pdf_url":"https://arxiv.org/pdf/2307.11643v1.pdf","comment":"8 pages, 3 figures, 5 tables; submitted to 2023 IEEE symposium series\n on computational intelligence (SSCI)"},{"id":"http://arxiv.org/abs/2307.11638v1","updated":"2023-07-21T15:04:21Z","published":"2023-07-21T15:04:21Z","title":"Deep Reinforcement Learning Based System for Intraoperative\n Hyperspectral Video Autofocusing","summary":" Hyperspectral imaging (HSI) captures a greater level of spectral detail than\ntraditional optical imaging, making it a potentially valuable intraoperative\ntool when precise tissue differentiation is essential. Hardware limitations of\ncurrent optical systems used for handheld real-time video HSI result in a\nlimited focal depth, thereby posing usability issues for integration of the\ntechnology into the operating room. This work integrates a focus-tunable liquid\nlens into a video HSI exoscope, and proposes novel video autofocusing methods\nbased on deep reinforcement learning. A first-of-its-kind robotic focal-time\nscan was performed to create a realistic and reproducible testing dataset. We\nbenchmarked our proposed autofocus algorithm against traditional policies, and\nfound our novel approach to perform significantly ($p<0.05$) better than\ntraditional techniques ($0.070\\pm.098$ mean absolute focal error compared to\n$0.146\\pm.148$). In addition, we performed a blinded usability trial by having\ntwo neurosurgeons compare the system with different autofocus policies, and\nfound our novel approach to be the most favourable, making our system a\ndesirable addition for intraoperative HSI.\n","authors":["Charlie Budd","Jianrong Qiu","Oscar MacCormac","Martin Huber","Christopher Mower","Mirek Janatka","Théo Trotouin","Jonathan Shapey","Mads S. Bergholt","Tom Vercauteren"],"pdf_url":"https://arxiv.org/pdf/2307.11638v1.pdf","comment":"To be presented at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.11636v1","updated":"2023-07-21T14:58:44Z","published":"2023-07-21T14:58:44Z","title":"OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?","summary":" This paper presents OxfordTVG-HIC (Humorous Image Captions), a large-scale\ndataset for humour generation and understanding. Humour is an abstract,\nsubjective, and context-dependent cognitive construct involving several\ncognitive factors, making it a challenging task to generate and interpret.\nHence, humour generation and understanding can serve as a new task for\nevaluating the ability of deep-learning methods to process abstract and\nsubjective information. Due to the scarcity of data, humour-related generation\ntasks such as captioning remain under-explored. To address this gap,\nOxfordTVG-HIC offers approximately 2.9M image-text pairs with humour scores to\ntrain a generalizable humour captioning model. Contrary to existing captioning\ndatasets, OxfordTVG-HIC features a wide range of emotional and semantic\ndiversity resulting in out-of-context examples that are particularly conducive\nto generating humour. Moreover, OxfordTVG-HIC is curated devoid of offensive\ncontent. We also show how OxfordTVG-HIC can be leveraged for evaluating the\nhumour of a generated text. 
Through explainability analysis of the trained\nmodels, we identify the visual and linguistic cues influential for evoking\nhumour prediction (and generation). We observe qualitatively that these cues\nare aligned with the benign violation theory of humour in cognitive psychology.\n","authors":["Runjia Li","Shuyang Sun","Mohamed Elhoseiny","Philip Torr"],"pdf_url":"https://arxiv.org/pdf/2307.11636v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2303.15823v2","updated":"2023-07-21T14:55:21Z","published":"2023-03-28T08:51:15Z","title":"Automated wildlife image classification: An active learning tool for\n ecological applications","summary":" Wildlife camera trap images are being used extensively to investigate animal\nabundance, habitat associations, and behavior, which is complicated by the fact\nthat experts must first classify the images manually. Artificial intelligence\nsystems can take over this task but usually need a large number of\nalready-labeled training images to achieve sufficient performance. This\nrequirement necessitates human expert labor and poses a particular challenge\nfor projects with few cameras or short durations. We propose a label-efficient\nlearning strategy that enables researchers with small or medium-sized image\ndatabases to leverage the potential of modern machine learning, thus freeing\ncrucial resources for subsequent analyses.\n Our methodological proposal is two-fold: (1) We improve current strategies of\ncombining object detection and image classification by tuning the\nhyperparameters of both models. (2) We provide an active learning (AL) system\nthat allows training deep learning models very efficiently in terms of required\nhuman-labeled training images. We supply a software package that enables\nresearchers to use these methods directly and thereby ensure the broad\napplicability of the proposed framework in ecological practice.\n We show that our tuning strategy improves predictive performance. We\ndemonstrate how the AL pipeline reduces the amount of pre-labeled data needed\nto achieve a specific predictive performance and that it is especially valuable\nfor improving out-of-sample predictive performance.\n We conclude that the combination of tuning and AL increases predictive\nperformance substantially. Furthermore, we argue that our work can broadly\nimpact the community through the ready-to-use software package provided.\nFinally, the publication of our models tailored to European wildlife data\nenriches existing model bases mostly trained on data from Africa and North\nAmerica.\n","authors":["Ludwig Bothmann","Lisa Wimmer","Omid Charrakh","Tobias Weber","Hendrik Edelhoff","Wibke Peters","Hien Nguyen","Caryl Benjamin","Annette Menzel"],"pdf_url":"https://arxiv.org/pdf/2303.15823v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.03056v3","updated":"2023-07-21T14:45:20Z","published":"2023-03-06T11:59:13Z","title":"MOISST: Multimodal Optimization of Implicit Scene for SpatioTemporal\n calibration","summary":" With the recent advances in autonomous driving and the decreasing cost of\nLiDARs, the use of multimodal sensor systems is on the rise. However, in order\nto make use of the information provided by a variety of complimentary sensors,\nit is necessary to accurately calibrate them. We take advantage of recent\nadvances in computer graphics and implicit volumetric scene representation to\ntackle the problem of multi-sensor spatial and temporal calibration. 
Thanks to\na new formulation of the Neural Radiance Field (NeRF) optimization, we are able\nto jointly optimize calibration parameters along with scene representation\nbased on radiometric and geometric measurements. Our method enables accurate\nand robust calibration from data captured in uncontrolled and unstructured\nurban environments, making our solution more scalable than existing calibration\nsolutions. We demonstrate the accuracy and robustness of our method in urban\nscenes typically encountered in autonomous driving scenarios.\n","authors":["Quentin Herau","Nathan Piasco","Moussab Bennehar","Luis Roldão","Dzmitry Tsishkou","Cyrille Migniot","Pascal Vasseur","Cédric Demonceaux"],"pdf_url":"https://arxiv.org/pdf/2303.03056v3.pdf","comment":"Accepted at IROS2023 Project site: https://qherau.github.io/MOISST/"},{"id":"http://arxiv.org/abs/2307.11618v1","updated":"2023-07-21T14:37:17Z","published":"2023-07-21T14:37:17Z","title":"Divide and Adapt: Active Domain Adaptation via Customized Learning","summary":" Active domain adaptation (ADA) aims to improve the model adaptation\nperformance by incorporating active learning (AL) techniques to label a\nmaximally-informative subset of target samples. Conventional AL methods do not\nconsider the existence of domain shift, and hence, fail to identify the truly\nvaluable samples in the context of domain adaptation. To accommodate active\nlearning and domain adaption, the two naturally different tasks, in a\ncollaborative framework, we advocate that a customized learning strategy for\nthe target data is the key to the success of ADA solutions. We present\nDivide-and-Adapt (DiaNA), a new ADA framework that partitions the target\ninstances into four categories with stratified transferable properties. With a\nnovel data subdivision protocol based on uncertainty and domainness, DiaNA can\naccurately recognize the most gainful samples. While sending the informative\ninstances for annotation, DiaNA employs tailored learning strategies for the\nremaining categories. Furthermore, we propose an informativeness score that\nunifies the data partitioning criteria. This enables the use of a Gaussian\nmixture model (GMM) to automatically sample unlabeled data into the proposed\nfour categories. Thanks to the \"divideand-adapt\" spirit, DiaNA can handle data\nwith large variations of domain gap. In addition, we show that DiaNA can\ngeneralize to different domain adaptation settings, such as unsupervised domain\nadaptation (UDA), semi-supervised domain adaptation (SSDA), source-free domain\nadaptation (SFDA), etc.\n","authors":["Duojun Huang","Jichang Li","Weikai Chen","Junshi Huang","Zhenhua Chai","Guanbin Li"],"pdf_url":"https://arxiv.org/pdf/2307.11618v1.pdf","comment":"CVPR2023, Highlight paper"},{"id":"http://arxiv.org/abs/2307.11604v1","updated":"2023-07-21T14:14:29Z","published":"2023-07-21T14:14:29Z","title":"Consistency-guided Meta-Learning for Bootstrapping Semi-Supervised\n Medical Image Segmentation","summary":" Medical imaging has witnessed remarkable progress but usually requires a\nlarge amount of high-quality annotated data which is time-consuming and costly\nto obtain. To alleviate this burden, semi-supervised learning has garnered\nattention as a potential solution. In this paper, we present Meta-Learning for\nBootstrapping Medical Image Segmentation (MLB-Seg), a novel method for tackling\nthe challenge of semi-supervised medical image segmentation. 
Specifically, our\napproach first involves training a segmentation model on a small set of clean\nlabeled images to generate initial labels for unlabeled data. To further\noptimize this bootstrapping process, we introduce a per-pixel weight mapping\nsystem that dynamically assigns weights to both the initialized labels and the\nmodel's own predictions. These weights are determined using a meta-process that\nprioritizes pixels with loss gradient directions closer to those of clean data,\nwhich is based on a small set of precisely annotated images. To facilitate the\nmeta-learning process, we additionally introduce a consistency-based Pseudo\nLabel Enhancement (PLE) scheme that improves the quality of the model's own\npredictions by ensembling predictions from various augmented versions of the\nsame input. In order to improve the quality of the weight maps obtained through\nmultiple augmentations of a single input, we introduce a mean teacher into the\nPLE scheme. This method helps to reduce noise in the weight maps and stabilize\nits generation process. Our extensive experimental results on public atrial and\nprostate segmentation datasets demonstrate that our proposed method achieves\nstate-of-the-art results under semi-supervision. Our code is available at\nhttps://github.com/aijinrjinr/MLB-Seg.\n","authors":["Qingyue Wei","Lequan Yu","Xianhang Li","Wei Shao","Cihang Xie","Lei Xing","Yuyin Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.11604v1.pdf","comment":"Accepted to MICCAI 2023. Code is publicly available at\n https://github.com/aijinrjinr/MLB-Seg"},{"id":"http://arxiv.org/abs/2307.11603v1","updated":"2023-07-21T14:12:28Z","published":"2023-07-21T14:12:28Z","title":"Cascaded multitask U-Net using topological loss for vessel segmentation\n and centerline extraction","summary":" Vessel segmentation and centerline extraction are two crucial preliminary\ntasks for many computer-aided diagnosis tools dealing with vascular diseases.\nRecently, deep-learning based methods have been widely applied to these tasks.\nHowever, classic deep-learning approaches struggle to capture the complex\ngeometry and specific topology of vascular networks, which is of the utmost\nimportance in most applications. To overcome these limitations, the clDice\nloss, a topological loss that focuses on the vessel centerlines, has been\nrecently proposed. This loss requires computing, with a proposed soft-skeleton\nalgorithm, the skeletons of both the ground truth and the predicted\nsegmentation. However, the soft-skeleton algorithm provides suboptimal results\non 3D images, which makes the clDice hardly suitable on 3D images. In this\npaper, we propose to replace the soft-skeleton algorithm by a U-Net which\ncomputes the vascular skeleton directly from the segmentation. We show that our\nmethod provides more accurate skeletons than the soft-skeleton algorithm. We\nthen build upon this network a cascaded U-Net trained with the clDice loss to\nembed topological constraints during the segmentation. 
The resulting model is\nable to predict both the vessel segmentation and centerlines with a more\naccurate topology.\n","authors":["Pierre Rougé","Nicolas Passat","Odyssée Merveille"],"pdf_url":"https://arxiv.org/pdf/2307.11603v1.pdf","comment":"13 pages, 4 figures"},{"id":"http://arxiv.org/abs/2305.18453v2","updated":"2023-07-21T13:26:21Z","published":"2023-05-29T04:14:38Z","title":"Conditional Diffusion Models for Semantic 3D Medical Image Synthesis","summary":" The demand for artificial intelligence (AI) in healthcare is rapidly\nincreasing. However, significant challenges arise from data scarcity and\nprivacy concerns, particularly in medical imaging. While existing generative\nmodels have achieved success in image synthesis and image-to-image translation\ntasks, there remains a gap in the generation of 3D semantic medical images. To\naddress this gap, we introduce Med-DDPM, a diffusion model specifically\ndesigned for semantic 3D medical image synthesis, effectively tackling data\nscarcity and privacy issues. The novelty of Med-DDPM lies in its incorporation\nof semantic conditioning, enabling precise control during the image generation\nprocess. Our model outperforms Generative Adversarial Networks (GANs) in terms\nof stability and performance, generating diverse and anatomically coherent\nimages with high visual fidelity. Comparative analysis against state-of-the-art\naugmentation techniques demonstrates that Med-DDPM produces comparable results,\nhighlighting its potential as a data augmentation tool for enhancing model\naccuracy. In conclusion, Med-DDPM pioneers 3D semantic medical image synthesis\nby delivering high-quality and anatomically coherent images. Furthermore, the\nintegration of semantic conditioning with Med-DDPM holds promise for image\nanonymization in the field of biomedical imaging, showcasing the capabilities\nof the model in addressing challenges related to data scarcity and privacy\nconcerns.\n","authors":["Zolnamar Dorjsembe","Hsing-Kuo Pao","Sodtavilan Odonchimed","Furen Xiao"],"pdf_url":"https://arxiv.org/pdf/2305.18453v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11567v1","updated":"2023-07-21T13:18:43Z","published":"2023-07-21T13:18:43Z","title":"CortexMorph: fast cortical thickness estimation via diffeomorphic\n registration using VoxelMorph","summary":" The thickness of the cortical band is linked to various neurological and\npsychiatric conditions, and is often estimated through surface-based methods\nsuch as Freesurfer in MRI studies. The DiReCT method, which calculates cortical\nthickness using a diffeomorphic deformation of the gray-white matter interface\ntowards the pial surface, offers an alternative to surface-based methods.\nRecent studies using a synthetic cortical thickness phantom have demonstrated\nthat the combination of DiReCT and deep-learning-based segmentation is more\nsensitive to subvoxel cortical thinning than Freesurfer.\n While anatomical segmentation of a T1-weighted image now takes seconds,\nexisting implementations of DiReCT rely on iterative image registration methods\nwhich can take up to an hour per volume. On the other hand, learning-based\ndeformable image registration methods like VoxelMorph have been shown to be\nfaster than classical methods while improving registration accuracy. This paper\nproposes CortexMorph, a new method that employs unsupervised deep learning to\ndirectly regress the deformation field needed for DiReCT. 
By combining\nCortexMorph with a deep-learning-based segmentation model, it is possible to\nestimate region-wise thickness in seconds from a T1-weighted image, while\nmaintaining the ability to detect cortical atrophy. We validate this claim on\nthe OASIS-3 dataset and the synthetic cortical thickness phantom of Rusak et\nal.\n","authors":["Richard McKinley","Christian Rummel"],"pdf_url":"https://arxiv.org/pdf/2307.11567v1.pdf","comment":"Accepted (early acceptance) at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.11558v1","updated":"2023-07-21T13:06:02Z","published":"2023-07-21T13:06:02Z","title":"Advancing Visual Grounding with Scene Knowledge: Benchmark and Method","summary":" Visual grounding (VG) aims to establish fine-grained alignment between vision\nand language. Ideally, it can be a testbed for vision-and-language models to\nevaluate their understanding of the images and texts and their reasoning\nabilities over their joint space. However, most existing VG datasets are\nconstructed using simple description texts, which do not require sufficient\nreasoning over the images and texts. This has been demonstrated in a recent\nstudy~\\cite{luo2022goes}, where a simple LSTM-based text encoder without\npretraining can achieve state-of-the-art performance on mainstream VG datasets.\nTherefore, in this paper, we propose a novel benchmark of \\underline{S}cene\n\\underline{K}nowledge-guided \\underline{V}isual \\underline{G}rounding (SK-VG),\nwhere the image content and referring expressions are not sufficient to ground\nthe target objects, forcing the models to have a reasoning ability on the\nlong-form scene knowledge. To perform this task, we propose two approaches to\naccept the triple-type input, where the former embeds knowledge into the image\nfeatures before the image-query interaction; the latter leverages linguistic\nstructure to assist in computing the image-text matching. We conduct extensive\nexperiments to analyze the above methods and show that the proposed approaches\nachieve promising results but still leave room for improvement, including\nperformance and interpretability. The dataset and code are available at\n\\url{https://github.com/zhjohnchan/SK-VG}.\n","authors":["Zhihong Chen","Ruifei Zhang","Yibing Song","Xiang Wan","Guanbin Li"],"pdf_url":"https://arxiv.org/pdf/2307.11558v1.pdf","comment":"Computer Vision and Natural Language Processing. 21 pages, 14\n figures. CVPR-2023"},{"id":"http://arxiv.org/abs/2307.11550v1","updated":"2023-07-21T12:53:54Z","published":"2023-07-21T12:53:54Z","title":"YOLOPose V2: Understanding and Improving Transformer-based 6D Pose\n Estimation","summary":" 6D object pose estimation is a crucial prerequisite for autonomous robot\nmanipulation applications. The state-of-the-art models for pose estimation are\nconvolutional neural network (CNN)-based. Lately, Transformers, an architecture\noriginally proposed for natural language processing, is achieving\nstate-of-the-art results in many computer vision tasks as well. Equipped with\nthe multi-head self-attention mechanism, Transformers enable simple\nsingle-stage end-to-end architectures for learning object detection and 6D\nobject pose estimation jointly. In this work, we propose YOLOPose (short form\nfor You Only Look Once Pose estimation), a Transformer-based multi-object 6D\npose estimation method based on keypoint regression and an improved variant of\nthe YOLOPose model. In contrast to the standard heatmaps for predicting\nkeypoints in an image, we directly regress the keypoints. 
Additionally, we\nemploy a learnable orientation estimation module to predict the orientation\nfrom the keypoints. Along with a separate translation estimation module, our\nmodel is end-to-end differentiable. Our method is suitable for real-time\napplications and achieves results comparable to state-of-the-art methods. We\nanalyze the role of object queries in our architecture and reveal that the\nobject queries specialize in detecting objects in specific image regions.\nFurthermore, we quantify the accuracy trade-off of using datasets of smaller\nsizes to train our model.\n","authors":["Arul Selvam Periyasamy","Arash Amini","Vladimir Tsaturyan","Sven Behnke"],"pdf_url":"https://arxiv.org/pdf/2307.11550v1.pdf","comment":"Robotics and Autonomous Systems Journal, Elsevier, to appear 2023.\n arXiv admin note: substantial text overlap with arXiv:2205.02536"},{"id":"http://arxiv.org/abs/2307.11545v1","updated":"2023-07-21T12:46:15Z","published":"2023-07-21T12:46:15Z","title":"Bridging Vision and Language Encoders: Parameter-Efficient Tuning for\n Referring Image Segmentation","summary":" Parameter Efficient Tuning (PET) has gained attention for reducing the number\nof parameters while maintaining performance and providing better hardware\nresource savings, but few studies investigate dense prediction tasks and\ninteraction between modalities. In this paper, we do an investigation of\nefficient tuning problems on referring image segmentation. We propose a novel\nadapter called Bridger to facilitate cross-modal information exchange and\ninject task-specific information into the pre-trained model. We also design a\nlightweight decoder for image segmentation. Our approach achieves comparable or\nsuperior performance with only 1.61\\% to 3.38\\% backbone parameter updates,\nevaluated on challenging benchmarks. The code is available at\n\\url{https://github.com/kkakkkka/ETRIS}.\n","authors":["Zunnan Xu","Zhihong Chen","Yong Zhang","Yibing Song","Xiang Wan","Guanbin Li"],"pdf_url":"https://arxiv.org/pdf/2307.11545v1.pdf","comment":"Computer Vision and Natural Language Processing. 14 pages, 8 figures.\n ICCV-2023"},{"id":"http://arxiv.org/abs/2303.11057v3","updated":"2023-07-21T12:43:23Z","published":"2023-03-20T12:14:13Z","title":"Learning Foresightful Dense Visual Affordance for Deformable Object\n Manipulation","summary":" Understanding and manipulating deformable objects (e.g., ropes and fabrics)\nis an essential yet challenging task with broad applications. Difficulties come\nfrom complex states and dynamics, diverse configurations and high-dimensional\naction space of deformable objects. Besides, the manipulation tasks usually\nrequire multiple steps to accomplish, and greedy policies may easily lead to\nlocal optimal states. Existing studies usually tackle this problem using\nreinforcement learning or imitating expert demonstrations, with limitations in\nmodeling complex states or requiring hand-crafted expert policies. In this\npaper, we study deformable object manipulation using dense visual affordance,\nwith generalization towards diverse states, and propose a novel kind of\nforesightful dense affordance, which avoids local optima by estimating states'\nvalues for long-term manipulation. We propose a framework for learning this\nrepresentation, with novel designs such as multi-stage stable learning and\nefficient self-supervised data collection without experts. 
Experiments\ndemonstrate the superiority of our proposed foresightful dense affordance.\nProject page: https://hyperplane-lab.github.io/DeformableAffordance\n","authors":["Ruihai Wu","Chuanruo Ning","Hao Dong"],"pdf_url":"https://arxiv.org/pdf/2303.11057v3.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.11543v1","updated":"2023-07-21T12:43:07Z","published":"2023-07-21T12:43:07Z","title":"KVN: Keypoints Voting Network with Differentiable RANSAC for Stereo Pose\n Estimation","summary":" Object pose estimation is a fundamental computer vision task exploited in\nseveral robotics and augmented reality applications. Many established\napproaches rely on predicting 2D-3D keypoint correspondences using RANSAC\n(Random sample consensus) and estimating the object pose using the PnP\n(Perspective-n-Point) algorithm. Being RANSAC non-differentiable,\ncorrespondences cannot be directly learned in an end-to-end fashion. In this\npaper, we address the stereo image-based object pose estimation problem by (i)\nintroducing a differentiable RANSAC layer into a well-known monocular pose\nestimation network; (ii) exploiting an uncertainty-driven multi-view PnP solver\nwhich can fuse information from multiple views. We evaluate our approach on a\nchallenging public stereo object pose estimation dataset, yielding\nstate-of-the-art results against other recent approaches. Furthermore, in our\nablation study, we show that the differentiable RANSAC layer plays a\nsignificant role in the accuracy of the proposed method. We release with this\npaper the open-source implementation of our method.\n","authors":["Ivano Donadi","Alberto Pretto"],"pdf_url":"https://arxiv.org/pdf/2307.11543v1.pdf","comment":"Submitted to IEEE Robotics and Automation Letters"},{"id":"http://arxiv.org/abs/2307.11530v1","updated":"2023-07-21T12:23:39Z","published":"2023-07-21T12:23:39Z","title":"UWAT-GAN: Fundus Fluorescein Angiography Synthesis via Ultra-wide-angle\n Transformation Multi-scale GAN","summary":" Fundus photography is an essential examination for clinical and differential\ndiagnosis of fundus diseases. Recently, Ultra-Wide-angle Fundus (UWF)\ntechniques, UWF Fluorescein Angiography (UWF-FA) and UWF Scanning Laser\nOphthalmoscopy (UWF-SLO) have been gradually put into use. However, Fluorescein\nAngiography (FA) and UWF-FA require injecting sodium fluorescein which may have\ndetrimental influences. To avoid negative impacts, cross-modality medical image\ngeneration algorithms have been proposed. Nevertheless, current methods in\nfundus imaging could not produce high-resolution images and are unable to\ncapture tiny vascular lesion areas. This paper proposes a novel conditional\ngenerative adversarial network (UWAT-GAN) to synthesize UWF-FA from UWF-SLO.\nUsing multi-scale generators and a fusion module patch to better extract global\nand local information, our model can generate high-resolution images. Moreover,\nan attention transmit module is proposed to help the decoder learn effectively.\nBesides, a supervised approach is used to train the network using multiple new\nweighted losses on different scales of data. Experiments on an in-house UWF\nimage dataset demonstrate the superiority of the UWAT-GAN over the\nstate-of-the-art methods. 
The source code is available at:\nhttps://github.com/Tinysqua/UWAT-GAN.\n","authors":["Zhaojie Fang","Zhanghao Chen","Pengxue Wei","Wangting Li","Shaochong Zhang","Ahmed Elazab","Gangyong Jia","Ruiquan Ge","Changmiao Wang"],"pdf_url":"https://arxiv.org/pdf/2307.11530v1.pdf","comment":"26th International Conference on Medical Image Computing and Computer\n Assisted Intervention"},{"id":"http://arxiv.org/abs/2307.11528v1","updated":"2023-07-21T12:18:35Z","published":"2023-07-21T12:18:35Z","title":"Improving Viewpoint Robustness for Visual Recognition via Adversarial\n Training","summary":" Viewpoint invariance remains challenging for visual recognition in the 3D\nworld, as altering the viewing directions can significantly impact predictions\nfor the same object. While substantial efforts have been dedicated to making\nneural networks invariant to 2D image translations and rotations, viewpoint\ninvariance is rarely investigated. Motivated by the success of adversarial\ntraining in enhancing model robustness, we propose Viewpoint-Invariant\nAdversarial Training (VIAT) to improve the viewpoint robustness of image\nclassifiers. Regarding viewpoint transformation as an attack, we formulate VIAT\nas a minimax optimization problem, where the inner maximization characterizes\ndiverse adversarial viewpoints by learning a Gaussian mixture distribution\nbased on the proposed attack method GMVFool. The outer minimization obtains a\nviewpoint-invariant classifier by minimizing the expected loss over the\nworst-case viewpoint distributions that can share the same one for different\nobjects within the same category. Based on GMVFool, we contribute a large-scale\ndataset called ImageNet-V+ to benchmark viewpoint robustness. Experimental\nresults show that VIAT significantly improves the viewpoint robustness of\nvarious image classifiers based on the diversity of adversarial viewpoints\ngenerated by GMVFool. Furthermore, we propose ViewRS, a certified viewpoint\nrobustness method that provides a certified radius and accuracy to demonstrate\nthe effectiveness of VIAT from the theoretical perspective.\n","authors":["Shouwei Ruan","Yinpeng Dong","Hang Su","Jianteng Peng","Ning Chen","Xingxing Wei"],"pdf_url":"https://arxiv.org/pdf/2307.11528v1.pdf","comment":"14 pages, 12 figures. arXiv admin note: substantial text overlap with\n arXiv:2307.10235"},{"id":"http://arxiv.org/abs/2303.11630v2","updated":"2023-07-21T12:15:41Z","published":"2023-03-21T06:54:18Z","title":"BoxSnake: Polygonal Instance Segmentation with Box Supervision","summary":" Box-supervised instance segmentation has gained much attention as it requires\nonly simple box annotations instead of costly mask or polygon annotations.\nHowever, existing box-supervised instance segmentation models mainly focus on\nmask-based frameworks. We propose a new end-to-end training technique, termed\nBoxSnake, to achieve effective polygonal instance segmentation using only box\nannotations for the first time. Our method consists of two loss functions: (1)\na point-based unary loss that constrains the bounding box of predicted polygons\nto achieve coarse-grained segmentation; and (2) a distance-aware pairwise loss\nthat encourages the predicted polygons to fit the object boundaries. 
Compared\nwith the mask-based weakly-supervised methods, BoxSnake further reduces the\nperformance gap between the predicted segmentation and the bounding box, and\nshows significant superiority on the Cityscapes dataset.\n","authors":["Rui Yang","Lin Song","Yixiao Ge","Xiu Li"],"pdf_url":"https://arxiv.org/pdf/2303.11630v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.06666v2","updated":"2023-07-21T12:15:16Z","published":"2023-07-13T10:19:04Z","title":"Transformer-based end-to-end classification of variable-length\n volumetric data","summary":" The automatic classification of 3D medical data is memory-intensive. Also,\nvariations in the number of slices between samples is common. Na\\\"ive solutions\nsuch as subsampling can solve these problems, but at the cost of potentially\neliminating relevant diagnosis information. Transformers have shown promising\nperformance for sequential data analysis. However, their application for long\nsequences is data, computationally, and memory demanding. In this paper, we\npropose an end-to-end Transformer-based framework that allows to classify\nvolumetric data of variable length in an efficient fashion. Particularly, by\nrandomizing the input volume-wise resolution(#slices) during training, we\nenhance the capacity of the learnable positional embedding assigned to each\nvolume slice. Consequently, the accumulated positional information in each\npositional embedding can be generalized to the neighbouring slices, even for\nhigh-resolution volumes at the test time. By doing so, the model will be more\nrobust to variable volume length and amenable to different computational\nbudgets. We evaluated the proposed approach in retinal OCT volume\nclassification and achieved 21.96% average improvement in balanced accuracy on\na 9-class diagnostic task, compared to state-of-the-art video transformers. Our\nfindings show that varying the volume-wise resolution of the input during\ntraining results in more informative volume representation as compared to\ntraining with fixed number of slices per volume.\n","authors":["Marzieh Oghbaie","Teresa Araujo","Taha Emre","Ursula Schmidt-Erfurth","Hrvoje Bogunovic"],"pdf_url":"https://arxiv.org/pdf/2307.06666v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11526v1","updated":"2023-07-21T12:14:33Z","published":"2023-07-21T12:14:33Z","title":"CopyRNeRF: Protecting the CopyRight of Neural Radiance Fields","summary":" Neural Radiance Fields (NeRF) have the potential to be a major representation\nof media. Since training a NeRF has never been an easy task, the protection of\nits model copyright should be a priority. In this paper, by analyzing the pros\nand cons of possible copyright protection solutions, we propose to protect the\ncopyright of NeRF models by replacing the original color representation in NeRF\nwith a watermarked color representation. Then, a distortion-resistant rendering\nscheme is designed to guarantee robust message extraction in 2D renderings of\nNeRF. 
Our proposed method can directly protect the copyright of NeRF models\nwhile maintaining high rendering quality and bit accuracy when compared among\noptional solutions.\n","authors":["Ziyuan Luo","Qing Guo","Ka Chun Cheung","Simon See","Renjie Wan"],"pdf_url":"https://arxiv.org/pdf/2307.11526v1.pdf","comment":"11 pages, 6 figures, accepted by iccv 2023 non-camera-ready version"},{"id":"http://arxiv.org/abs/2304.14133v2","updated":"2023-07-21T12:06:17Z","published":"2023-04-27T12:28:29Z","title":"VERITE: A Robust Benchmark for Multimodal Misinformation Detection\n Accounting for Unimodal Bias","summary":" Multimedia content has become ubiquitous on social media platforms, leading\nto the rise of multimodal misinformation (MM) and the urgent need for effective\nstrategies to detect and prevent its spread. In recent years, the challenge of\nmultimodal misinformation detection (MMD) has garnered significant attention by\nresearchers and has mainly involved the creation of annotated, weakly\nannotated, or synthetically generated training datasets, along with the\ndevelopment of various deep learning MMD models. However, the problem of\nunimodal bias in MMD benchmarks -- where biased or unimodal methods outperform\ntheir multimodal counterparts on an inherently multimodal task -- has been\noverlooked. In this study, we systematically investigate and identify the\npresence of unimodal bias in widely-used MMD benchmarks (VMU-Twitter, COSMOS),\nraising concerns about their suitability for reliable evaluation. To address\nthis issue, we introduce the \"VERification of Image-TExtpairs\" (VERITE)\nbenchmark for MMD which incorporates real-world data, excludes \"asymmetric\nmultimodal misinformation\" and utilizes \"modality balancing\". We conduct an\nextensive comparative study with a Transformer-based architecture that shows\nthe ability of VERITE to effectively address unimodal bias, rendering it a\nrobust evaluation framework for MMD. Furthermore, we introduce a new method --\ntermed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating\nrealistic synthetic training data that preserve crossmodal relations between\nlegitimate images and false human-written captions. By leveraging CHASMA in the\ntraining process, we observe consistent and notable improvements in predictive\nperformance on VERITE; with a 9.2% increase in accuracy. We release our code\nat: https://github.com/stevejpapad/image-text-verification\n","authors":["Stefanos-Iordanis Papadopoulos","Christos Koutlis","Symeon Papadopoulos","Panagiotis C. Petrantonakis"],"pdf_url":"https://arxiv.org/pdf/2304.14133v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11518v1","updated":"2023-07-21T12:03:39Z","published":"2023-07-21T12:03:39Z","title":"BatMobility: Towards Flying Without Seeing for Autonomous Drones","summary":" Unmanned aerial vehicles (UAVs) rely on optical sensors such as cameras and\nlidar for autonomous operation. However, such optical sensors are error-prone\nin bad lighting, inclement weather conditions including fog and smoke, and\naround textureless or transparent surfaces. In this paper, we ask: is it\npossible to fly UAVs without relying on optical sensors, i.e., can UAVs fly\nwithout seeing? 
We present BatMobility, a lightweight mmWave radar-only\nperception system for UAVs that eliminates the need for optical sensors.\nBatMobility enables two core functionalities for UAVs -- radio flow estimation\n(a novel FMCW radar-based alternative for optical flow based on\nsurface-parallel doppler shift) and radar-based collision avoidance. We build\nBatMobility using commodity sensors and deploy it as a real-time system on a\nsmall off-the-shelf quadcopter running an unmodified flight controller. Our\nevaluation shows that BatMobility achieves comparable or better performance\nthan commercial-grade optical sensors across a wide range of scenarios.\n","authors":["Emerson Sie","Zikun Liu","Deepak Vasisht"],"pdf_url":"https://arxiv.org/pdf/2307.11518v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.07308v3","updated":"2023-07-21T11:52:28Z","published":"2023-06-12T13:48:37Z","title":"Self-Supervised Hyperspectral Inpainting with the Optimisation inspired\n Deep Neural Network Prior","summary":" Hyperspectral Image (HSI)s cover hundreds or thousands of narrow spectral\nbands, conveying a wealth of spatial and spectral information. However, due to\nthe instrumental errors and the atmospheric changes, the HSI obtained in\npractice are often contaminated by noise and dead pixels(lines), resulting in\nmissing information that may severely compromise the subsequent applications.\nWe introduce here a novel HSI missing pixel prediction algorithm, called Low\nRank and Sparsity Constraint Plug-and-Play (LRS-PnP). It is shown that LRS-PnP\nis able to predict missing pixels and bands even when all spectral bands of the\nimage are missing. The proposed LRS-PnP algorithm is further extended to a\nself-supervised model by combining the LRS-PnP with the Deep Image Prior (DIP),\ncalled LRS-PnP-DIP. In a series of experiments with real data, It is shown that\nthe LRS-PnP-DIP either achieves state-of-the-art inpainting performance\ncompared to other learning-based methods, or outperforms them.\n","authors":["Shuo Li","Mehrdad Yaghoobi"],"pdf_url":"https://arxiv.org/pdf/2306.07308v3.pdf","comment":"Presented in ISCS23"},{"id":"http://arxiv.org/abs/2208.05788v2","updated":"2023-07-21T11:50:11Z","published":"2022-08-10T12:29:01Z","title":"Semantic Self-adaptation: Enhancing Generalization with a Single Sample","summary":" The lack of out-of-domain generalization is a critical weakness of deep\nnetworks for semantic segmentation. Previous studies relied on the assumption\nof a static model, i. e., once the training process is complete, model\nparameters remain fixed at test time. In this work, we challenge this premise\nwith a self-adaptive approach for semantic segmentation that adjusts the\ninference process to each input sample. Self-adaptation operates on two levels.\nFirst, it fine-tunes the parameters of convolutional layers to the input image\nusing consistency regularization. Second, in Batch Normalization layers,\nself-adaptation interpolates between the training and the reference\ndistribution derived from a single test sample. Despite both techniques being\nwell known in the literature, their combination sets new state-of-the-art\naccuracy on synthetic-to-real generalization benchmarks. Our empirical study\nsuggests that self-adaptation may complement the established practice of model\nregularization at training time for improving deep network generalization to\nout-of-domain data. 
Our code and pre-trained models are available at\nhttps://github.com/visinf/self-adaptive.\n","authors":["Sherwin Bahmani","Oliver Hahn","Eduard Zamfir","Nikita Araslanov","Daniel Cremers","Stefan Roth"],"pdf_url":"https://arxiv.org/pdf/2208.05788v2.pdf","comment":"Published in TMLR (July 2023); OpenReview:\n https://openreview.net/forum?id=ILNqQhGbLx; Code:\n https://github.com/visinf/self-adaptive; Video: https://youtu.be/s4DG65ic0EA"},{"id":"http://arxiv.org/abs/2307.11514v1","updated":"2023-07-21T11:50:05Z","published":"2023-07-21T11:50:05Z","title":"CORE: Cooperative Reconstruction for Multi-Agent Perception","summary":" This paper presents CORE, a conceptually simple, effective and\ncommunication-efficient model for multi-agent cooperative perception. It\naddresses the task from a novel perspective of cooperative reconstruction,\nbased on two key insights: 1) cooperating agents together provide a more\nholistic observation of the environment, and 2) the holistic observation can\nserve as valuable supervision to explicitly guide the model learning how to\nreconstruct the ideal observation based on collaboration. CORE instantiates the\nidea with three major components: a compressor for each agent to create more\ncompact feature representation for efficient broadcasting, a lightweight\nattentive collaboration component for cross-agent message aggregation, and a\nreconstruction module to reconstruct the observation based on aggregated\nfeature representations. This learning-to-reconstruct idea is task-agnostic,\nand offers clear and reasonable supervision to inspire more effective\ncollaboration, eventually promoting perception tasks. We validate CORE on\nOPV2V, a large-scale multi-agent percetion dataset, in two tasks, i.e., 3D\nobject detection and semantic segmentation. Results demonstrate that the model\nachieves state-of-the-art performance on both tasks, and is more\ncommunication-efficient.\n","authors":["Binglu Wang","Lei Zhang","Zhaozhong Wang","Yongqiang Zhao","Tianfei Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.11514v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11513v1","updated":"2023-07-21T11:49:30Z","published":"2023-07-21T11:49:30Z","title":"Bone mineral density estimation from a plain X-ray image by learning\n decomposition into projections of bone-segmented computed tomography","summary":" Osteoporosis is a prevalent bone disease that causes fractures in fragile\nbones, leading to a decline in daily living activities. Dual-energy X-ray\nabsorptiometry (DXA) and quantitative computed tomography (QCT) are highly\naccurate for diagnosing osteoporosis; however, these modalities require special\nequipment and scan protocols. To frequently monitor bone health, low-cost,\nlow-dose, and ubiquitously available diagnostic methods are highly anticipated.\nIn this study, we aim to perform bone mineral density (BMD) estimation from a\nplain X-ray image for opportunistic screening, which is potentially useful for\nearly diagnosis. Existing methods have used multi-stage approaches consisting\nof extraction of the region of interest and simple regression to estimate BMD,\nwhich require a large amount of training data. Therefore, we propose an\nefficient method that learns decomposition into projections of bone-segmented\nQCT for BMD estimation under limited datasets. 
The proposed method achieved\nhigh accuracy in BMD estimation, where Pearson correlation coefficients of\n0.880 and 0.920 were observed for DXA-measured BMD and QCT-measured BMD\nestimation tasks, respectively, and the root mean square of the coefficient of\nvariation values were 3.27 to 3.79% for four measurements with different poses.\nFurthermore, we conducted extensive validation experiments, including\nmulti-pose, uncalibrated-CT, and compression experiments toward actual\napplication in routine clinical practice.\n","authors":["Yi Gu","Yoshito Otake","Keisuke Uemura","Mazen Soufi","Masaki Takao","Hugues Talbot","Seiji Okada","Nobuhiko Sugano","Yoshinobu Sato"],"pdf_url":"https://arxiv.org/pdf/2307.11513v1.pdf","comment":"20 pages and 22 figures"},{"id":"http://arxiv.org/abs/2305.19920v2","updated":"2023-07-21T11:27:30Z","published":"2023-05-31T14:56:18Z","title":"MSKdeX: Musculoskeletal (MSK) decomposition from an X-ray image for\n fine-grained estimation of lean muscle mass and muscle volume","summary":" Musculoskeletal diseases such as sarcopenia and osteoporosis are major\nobstacles to health during aging. Although dual-energy X-ray absorptiometry\n(DXA) and computed tomography (CT) can be used to evaluate musculoskeletal\nconditions, frequent monitoring is difficult due to the cost and accessibility\n(as well as high radiation exposure in the case of CT). We propose a method\n(named MSKdeX) to estimate fine-grained muscle properties from a plain X-ray\nimage, a low-cost, low-radiation, and highly accessible imaging modality,\nthrough musculoskeletal decomposition leveraging fine-grained segmentation in\nCT. We train a multi-channel quantitative image translation model to decompose\nan X-ray image into projections of CT of individual muscles to infer the lean\nmuscle mass and muscle volume. We propose the object-wise intensity-sum loss, a\nsimple yet surprisingly effective metric invariant to muscle deformation and\nprojection direction, utilizing information in CT and X-ray images collected\nfrom the same patient. While our method is basically an unpaired image-to-image\ntranslation, we also exploit the nature of the bone's rigidity, which provides\nthe paired data through 2D-3D rigid registration, adding strong pixel-wise\nsupervision in unpaired training. Through the evaluation using a 539-patient\ndataset, we showed that the proposed method significantly outperformed\nconventional methods. The average Pearson correlation coefficient between the\npredicted and CT-derived ground truth metrics was increased from 0.460 to\n0.863. We believe our method opened up a new musculoskeletal diagnosis method\nand has the potential to be extended to broader applications in multi-channel\nquantitative image translation tasks. Our source code will be released soon.\n","authors":["Yi Gu","Yoshito Otake","Keisuke Uemura","Masaki Takao","Mazen Soufi","Yuta Hiasa","Hugues Talbot","Seiji Okata","Nobuhiko Sugano","Yoshinobu Sato"],"pdf_url":"https://arxiv.org/pdf/2305.19920v2.pdf","comment":"MICCAI 2023 early acceptance (12 pages and 6 figures)"},{"id":"http://arxiv.org/abs/2306.00988v2","updated":"2023-07-21T11:27:10Z","published":"2023-06-01T17:59:57Z","title":"Continual Learning for Abdominal Multi-Organ and Tumor Segmentation","summary":" The ability to dynamically extend a model to new data and classes is critical\nfor multiple organ and tumor segmentation. However, due to privacy regulations,\naccessing previous data and annotations can be problematic in the medical\ndomain. 
This poses a significant barrier to preserving the high segmentation\naccuracy of the old classes when learning from new classes because of the\ncatastrophic forgetting problem. In this paper, we first empirically\ndemonstrate that simply using high-quality pseudo labels can fairly mitigate\nthis problem in the setting of organ segmentation. Furthermore, we put forward\nan innovative architecture designed specifically for continuous organ and tumor\nsegmentation, which incurs minimal computational overhead. Our proposed design\ninvolves replacing the conventional output layer with a suite of lightweight,\nclass-specific heads, thereby offering the flexibility to accommodate newly\nemerging classes. These heads enable independent predictions for newly\nintroduced and previously learned classes, effectively minimizing the impact of\nnew classes on old ones during the course of continual learning. We further\npropose incorporating Contrastive Language-Image Pretraining (CLIP) embeddings\ninto the organ-specific heads. These embeddings encapsulate the semantic\ninformation of each class, informed by extensive image-text co-training. The\nproposed method is evaluated on both in-house and public abdominal CT datasets\nunder organ and tumor segmentation tasks. Empirical results suggest that the\nproposed design improves the segmentation performance of a baseline neural\nnetwork on newly-introduced and previously-learned classes along the learning\ntrajectory.\n","authors":["Yixiao Zhang","Xinyi Li","Huimiao Chen","Alan Yuille","Yaoyao Liu","Zongwei Zhou"],"pdf_url":"https://arxiv.org/pdf/2306.00988v2.pdf","comment":"MICCAI-2023"},{"id":"http://arxiv.org/abs/2303.05966v2","updated":"2023-07-21T11:21:30Z","published":"2023-03-10T14:55:35Z","title":"Score-Based Generative Models for Medical Image Segmentation using\n Signed Distance Functions","summary":" Medical image segmentation is a crucial task that relies on the ability to\naccurately identify and isolate regions of interest in medical images. Thereby,\ngenerative approaches allow to capture the statistical properties of\nsegmentation masks that are dependent on the respective structures. In this\nwork we propose a conditional score-based generative modeling framework to\nrepresent the signed distance function (SDF) leading to an implicit\ndistribution of segmentation masks. The advantage of leveraging the SDF is a\nmore natural distortion when compared to that of binary masks. By learning the\nscore function of the conditional distribution of SDFs we can accurately sample\nfrom the distribution of segmentation masks, allowing for the evaluation of\nstatistical quantities. Thus, this probabilistic representation allows for the\ngeneration of uncertainty maps represented by the variance, which can aid in\nfurther analysis and enhance the predictive robustness. 
We qualitatively and\nquantitatively illustrate competitive performance of the proposed method on a\npublic nuclei and gland segmentation data set, highlighting its potential\nutility in medical image segmentation applications.\n","authors":["Lea Bogensperger","Dominik Narnhofer","Filip Ilic","Thomas Pock"],"pdf_url":"https://arxiv.org/pdf/2303.05966v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11482v1","updated":"2023-07-21T10:36:05Z","published":"2023-07-21T10:36:05Z","title":"Redemption from Range-view for Accurate 3D Object Detection","summary":" Most recent approaches for 3D object detection predominantly rely on\npoint-view or bird's-eye view representations, with limited exploration of\nrange-view-based methods. The range-view representation suffers from scale\nvariation and surface texture deficiency, both of which pose significant\nlimitations for developing corresponding methods. Notably, the surface texture\nloss problem has been largely ignored by all existing methods, despite its\nsignificant impact on the accuracy of range-view-based 3D object detection. In\nthis study, we propose Redemption from Range-view R-CNN (R2 R-CNN), a novel and\naccurate approach that comprehensively explores the range-view representation.\nOur proposed method addresses scale variation through the HD Meta Kernel, which\ncaptures range-view geometry information in multiple scales. Additionally, we\nintroduce Feature Points Redemption (FPR) to recover the lost 3D surface\ntexture information from the range view, and Synchronous-Grid RoI Pooling\n(S-Grid RoI Pooling), a multi-scaled approach with multiple receptive fields\nfor accurate box refinement. Our R2 R-CNN outperforms existing range-view-based\nmethods, achieving state-of-the-art performance on both the KITTI benchmark and\nthe Waymo Open Dataset. Our study highlights the critical importance of\naddressing the surface texture loss problem for accurate 3D object detection in\nrange-view-based methods. Codes will be made publicly available.\n","authors":["Yihan Wang","Qiao Yan"],"pdf_url":"https://arxiv.org/pdf/2307.11482v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11477v1","updated":"2023-07-21T10:28:19Z","published":"2023-07-21T10:28:19Z","title":"SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view\n 3D Object Detection","summary":" Recently, the pure camera-based Bird's-Eye-View (BEV) perception provides a\nfeasible solution for economical autonomous driving. However, the existing\nBEV-based multi-view 3D detectors generally transform all image features into\nBEV features, without considering the problem that the large proportion of\nbackground information may submerge the object information. In this paper, we\npropose Semantic-Aware BEV Pooling (SA-BEVPool), which can filter out\nbackground information according to the semantic segmentation of image features\nand transform image features into semantic-aware BEV features. Accordingly, we\npropose BEV-Paste, an effective data augmentation strategy that closely matches\nwith semantic-aware BEV feature. In addition, we design a Multi-Scale\nCross-Task (MSCT) head, which combines task-specific and cross-task information\nto predict depth distribution and semantic segmentation more accurately,\nfurther improving the quality of semantic-aware BEV feature. Finally, we\nintegrate the above modules into a novel multi-view 3D object detection\nframework, namely SA-BEV. Experiments on nuScenes show that SA-BEV achieves\nstate-of-the-art performance. 
Code has been available at\nhttps://github.com/mengtan00/SA-BEV.git.\n","authors":["Jinqing Zhang","Yanan Zhang","Qingjie Liu","Yunhong Wang"],"pdf_url":"https://arxiv.org/pdf/2307.11477v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11471v1","updated":"2023-07-21T10:12:09Z","published":"2023-07-21T10:12:09Z","title":"Robust Visual Question Answering: Datasets, Methods, and Future\n Challenges","summary":" Visual question answering requires a system to provide an accurate natural\nlanguage answer given an image and a natural language question. However, it is\nwidely recognized that previous generic VQA methods often exhibit a tendency to\nmemorize biases present in the training data rather than learning proper\nbehaviors, such as grounding images before predicting answers. Therefore, these\nmethods usually achieve high in-distribution but poor out-of-distribution\nperformance. In recent years, various datasets and debiasing methods have been\nproposed to evaluate and enhance the VQA robustness, respectively. This paper\nprovides the first comprehensive survey focused on this emerging fashion.\nSpecifically, we first provide an overview of the development process of\ndatasets from in-distribution and out-of-distribution perspectives. Then, we\nexamine the evaluation metrics employed by these datasets. Thirdly, we propose\na typology that presents the development process, similarities and differences,\nrobustness comparison, and technical features of existing debiasing methods.\nFurthermore, we analyze and discuss the robustness of representative\nvision-and-language pre-training models on VQA. Finally, through a thorough\nreview of the available literature and experimental analysis, we discuss the\nkey areas for future research from various viewpoints.\n","authors":["Jie Ma","Pinghui Wang","Dechen Kong","Zewei Wang","Jun Liu","Hongbin Pei","Junzhou Zhao"],"pdf_url":"https://arxiv.org/pdf/2307.11471v1.pdf","comment":"IEEE TPAMI (Under Review)"},{"id":"http://arxiv.org/abs/2307.11470v1","updated":"2023-07-21T10:10:18Z","published":"2023-07-21T10:10:18Z","title":"Physics-Aware Semi-Supervised Underwater Image Enhancement","summary":" Underwater images normally suffer from degradation due to the transmission\nmedium of water bodies. Both traditional prior-based approaches and deep\nlearning-based methods have been used to address this problem. However, the\ninflexible assumption of the former often impairs their effectiveness in\nhandling diverse underwater scenes, while the generalization of the latter to\nunseen images is usually weakened by insufficient data. In this study, we\nleverage both the physics-based underwater Image Formation Model (IFM) and deep\nlearning techniques for Underwater Image Enhancement (UIE). To this end, we\npropose a novel Physics-Aware Dual-Stream Underwater Image Enhancement Network,\ni.e., PA-UIENet, which comprises a Transmission Estimation Steam (T-Stream) and\nan Ambient Light Estimation Stream (A-Stream). This network fulfills the UIE\ntask by explicitly estimating the degradation parameters of the IFM. We also\nadopt an IFM-inspired semi-supervised learning framework, which exploits both\nthe labeled and unlabeled images, to address the issue of insufficient data.\nOur method performs better than, or at least comparably to, eight baselines\nacross five testing sets in the degradation estimation and UIE tasks. 
This\nshould be due to the fact that it not only can model the degradation but also\ncan learn the characteristics of diverse underwater scenes.\n","authors":["Hao Qi","Xinghui Dong"],"pdf_url":"https://arxiv.org/pdf/2307.11470v1.pdf","comment":"12 pages, 5 figures"},{"id":"http://arxiv.org/abs/2307.11469v1","updated":"2023-07-21T10:08:58Z","published":"2023-07-21T10:08:58Z","title":"Distribution Shift Matters for Knowledge Distillation with Webly\n Collected Images","summary":" Knowledge distillation aims to learn a lightweight student network from a\npre-trained teacher network. In practice, existing knowledge distillation\nmethods are usually infeasible when the original training data is unavailable\ndue to some privacy issues and data management considerations. Therefore,\ndata-free knowledge distillation approaches proposed to collect training\ninstances from the Internet. However, most of them have ignored the common\ndistribution shift between the instances from original training data and webly\ncollected data, affecting the reliability of the trained student network. To\nsolve this problem, we propose a novel method dubbed ``Knowledge Distillation\nbetween Different Distributions\" (KD$^{3}$), which consists of three\ncomponents. Specifically, we first dynamically select useful training instances\nfrom the webly collected data according to the combined predictions of teacher\nnetwork and student network. Subsequently, we align both the weighted features\nand classifier parameters of the two networks for knowledge memorization.\nMeanwhile, we also build a new contrastive learning block called\nMixDistribution to generate perturbed data with a new distribution for instance\nalignment, so that the student network can further learn a\ndistribution-invariant representation. Intensive experiments on various\nbenchmark datasets demonstrate that our proposed KD$^{3}$ can outperform the\nstate-of-the-art data-free knowledge distillation approaches.\n","authors":["Jialiang Tang","Shuo Chen","Gang Niu","Masashi Sugiyama","Chen Gong"],"pdf_url":"https://arxiv.org/pdf/2307.11469v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11466v1","updated":"2023-07-21T10:02:02Z","published":"2023-07-21T10:02:02Z","title":"MatSpectNet: Material Segmentation Network with Domain-Aware and\n Physically-Constrained Hyperspectral Reconstruction","summary":" Achieving accurate material segmentation for 3-channel RGB images is\nchallenging due to the considerable variation in a material's appearance.\nHyperspectral images, which are sets of spectral measurements sampled at\nmultiple wavelengths, theoretically offer distinct information for material\nidentification, as variations in intensity of electromagnetic radiation\nreflected by a surface depend on the material composition of a scene. However,\nexisting hyperspectral datasets are impoverished regarding the number of images\nand material categories for the dense material segmentation task, and\ncollecting and annotating hyperspectral images with a spectral camera is\nprohibitively expensive. To address this, we propose a new model, the\nMatSpectNet to segment materials with recovered hyperspectral images from RGB\nimages. 
The network leverages the principles of colour perception in modern\ncameras to constrain the reconstructed hyperspectral images and employs the\ndomain adaptation method to generalise the hyperspectral reconstruction\ncapability from a spectral recovery dataset to material segmentation datasets.\nThe reconstructed hyperspectral images are further filtered using learned\nresponse curves and enhanced with human perception. The performance of\nMatSpectNet is evaluated on the LMD dataset as well as the OpenSurfaces\ndataset. Our experiments demonstrate that MatSpectNet attains a 1.60% increase\nin average pixel accuracy and a 3.42% improvement in mean class accuracy\ncompared with the most recent publication. The project code is attached to the\nsupplementary material and will be published on GitHub.\n","authors":["Yuwen Heng","Yihong Wu","Jiawen Chen","Srinandan Dasmahapatra","Hansung Kim"],"pdf_url":"https://arxiv.org/pdf/2307.11466v1.pdf","comment":"7 pages main content"},{"id":"http://arxiv.org/abs/2210.09563v2","updated":"2023-07-21T10:01:25Z","published":"2022-10-18T03:32:18Z","title":"FedForgery: Generalized Face Forgery Detection with Residual Federated\n Learning","summary":" With the continuous development of deep learning in the field of image\ngeneration models, a large number of vivid forged faces have been generated and\nspread on the Internet. These high-authenticity artifacts could grow into a\nthreat to society security. Existing face forgery detection methods directly\nutilize the obtained public shared or centralized data for training but ignore\nthe personal privacy and security issues when personal data couldn't be\ncentralizedly shared in real-world scenarios. Additionally, different\ndistributions caused by diverse artifact types would further bring adverse\ninfluences on the forgery detection task. To solve the mentioned problems, the\npaper proposes a novel generalized residual Federated learning for face Forgery\ndetection (FedForgery). The designed variational autoencoder aims to learn\nrobust discriminative residual feature maps to detect forgery faces (with\ndiverse or even unknown artifact types). Furthermore, the general federated\nlearning strategy is introduced to construct distributed detection model\ntrained collaboratively with multiple local decentralized devices, which could\nfurther boost the representation generalization. Experiments conducted on\npublicly available face forgery detection datasets prove the superior\nperformance of the proposed FedForgery. The designed novel generalized face\nforgery detection protocols and source code would be publicly available.\n","authors":["Decheng Liu","Zhan Dang","Chunlei Peng","Yu Zheng","Shuang Li","Nannan Wang","Xinbo Gao"],"pdf_url":"https://arxiv.org/pdf/2210.09563v2.pdf","comment":"The code is available at https://github.com/GANG370/FedForgery. The\n paper has been accepted in the IEEE Transactions on Information Forensics &\n Security"},{"id":"http://arxiv.org/abs/2307.10926v2","updated":"2023-07-21T09:47:01Z","published":"2023-07-20T14:52:45Z","title":"Confidence intervals for performance estimates in 3D medical image\n segmentation","summary":" Medical segmentation models are evaluated empirically. As such an evaluation\nis based on a limited set of example images, it is unavoidably noisy. Beyond a\nmean performance measure, reporting confidence intervals is thus crucial.\nHowever, this is rarely done in medical image segmentation. 
The width of the\nconfidence interval depends on the test set size and on the spread of the\nperformance measure (its standard-deviation across of the test set). For\nclassification, many test images are needed to avoid wide confidence intervals.\nSegmentation, however, has not been studied, and it differs by the amount of\ninformation brought by a given test image. In this paper, we study the typical\nconfidence intervals in medical image segmentation. We carry experiments on 3D\nimage segmentation using the standard nnU-net framework, two datasets from the\nMedical Decathlon challenge and two performance measures: the Dice accuracy and\nthe Hausdorff distance. We show that the parametric confidence intervals are\nreasonable approximations of the bootstrap estimates for varying test set sizes\nand spread of the performance metric. Importantly, we show that the test size\nneeded to achieve a given precision is often much lower than for classification\ntasks. Typically, a 1% wide confidence interval requires about 100-200 test\nsamples when the spread is low (standard-deviation around 3%). More difficult\nsegmentation tasks may lead to higher spreads and require over 1000 samples.\n","authors":["R. El Jurdi","G. Varoquaux","O. Colliot"],"pdf_url":"https://arxiv.org/pdf/2307.10926v2.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2307.11458v1","updated":"2023-07-21T09:40:42Z","published":"2023-07-21T09:40:42Z","title":"Strip-MLP: Efficient Token Interaction for Vision MLP","summary":" Token interaction operation is one of the core modules in MLP-based models to\nexchange and aggregate information between different spatial locations.\nHowever, the power of token interaction on the spatial dimension is highly\ndependent on the spatial resolution of the feature maps, which limits the\nmodel's expressive ability, especially in deep layers where the feature are\ndown-sampled to a small spatial size. To address this issue, we present a novel\nmethod called \\textbf{Strip-MLP} to enrich the token interaction power in three\nways. Firstly, we introduce a new MLP paradigm called Strip MLP layer that\nallows the token to interact with other tokens in a cross-strip manner,\nenabling the tokens in a row (or column) to contribute to the information\naggregations in adjacent but different strips of rows (or columns). Secondly, a\n\\textbf{C}ascade \\textbf{G}roup \\textbf{S}trip \\textbf{M}ixing \\textbf{M}odule\n(CGSMM) is proposed to overcome the performance degradation caused by small\nspatial feature size. The module allows tokens to interact more effectively in\nthe manners of within-patch and cross-patch, which is independent to the\nfeature spatial size. Finally, based on the Strip MLP layer, we propose a novel\n\\textbf{L}ocal \\textbf{S}trip \\textbf{M}ixing \\textbf{M}odule (LSMM) to boost\nthe token interaction power in the local region. Extensive experiments\ndemonstrate that Strip-MLP significantly improves the performance of MLP-based\nmodels on small datasets and obtains comparable or even better results on\nImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy\nthan existing MLP-based models by +2.44\\% on Caltech-101 and +2.16\\% on\nCIFAR-100. 
The source codes will be available\nat~\\href{https://github.com/Med-Process/Strip_MLP{https://github.com/Med-Process/Strip\\_MLP}.\n","authors":["Guiping Cao","Shengda Luo","Wenjian Huang","Xiangyuan Lan","Dongmei Jiang","Yaowei Wang","Jianguo Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.11458v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08092v2","updated":"2023-07-21T09:31:57Z","published":"2023-07-16T16:29:26Z","title":"Gait Data Augmentation using Physics-Based Biomechanical Simulation","summary":" This paper focuses on addressing the problem of data scarcity for gait\nanalysis. Standard augmentation methods may produce gait sequences that are not\nconsistent with the biomechanical constraints of human walking. To address this\nissue, we propose a novel framework for gait data augmentation by using\nOpenSIM, a physics-based simulator, to synthesize biomechanically plausible\nwalking sequences. The proposed approach is validated by augmenting the WBDS\nand CASIA-B datasets and then training gait-based classifiers for 3D gender\ngait classification and 2D gait person identification respectively.\nExperimental results indicate that our augmentation approach can improve the\nperformance of model-based gait classifiers and deliver state-of-the-art\nresults for gait-based person identification with an accuracy of up to 96.11%\non the CASIA-B dataset.\n","authors":["Mritula Chandrasekaran","Jarek Francik","Dimitrios Makris"],"pdf_url":"https://arxiv.org/pdf/2307.08092v2.pdf","comment":"30 pages including references, 5 Figures submitted to ESWA"},{"id":"http://arxiv.org/abs/2307.02953v2","updated":"2023-07-21T09:26:06Z","published":"2023-07-06T12:39:06Z","title":"SegNetr: Rethinking the local-global interactions and skip connections\n in U-shaped networks","summary":" Recently, U-shaped networks have dominated the field of medical image\nsegmentation due to their simple and easily tuned structure. However, existing\nU-shaped segmentation networks: 1) mostly focus on designing complex\nself-attention modules to compensate for the lack of long-term dependence based\non convolution operation, which increases the overall number of parameters and\ncomputational complexity of the network; 2) simply fuse the features of encoder\nand decoder, ignoring the connection between their spatial locations. In this\npaper, we rethink the above problem and build a lightweight medical image\nsegmentation network, called SegNetr. Specifically, we introduce a novel\nSegNetr block that can perform local-global interactions dynamically at any\nstage and with only linear complexity. At the same time, we design a general\ninformation retention skip connection (IRSC) to preserve the spatial location\ninformation of encoder features and achieve accurate fusion with the decoder\nfeatures. We validate the effectiveness of SegNetr on four mainstream medical\nimage segmentation datasets, with 59\\% and 76\\% fewer parameters and GFLOPs\nthan vanilla U-Net, while achieving segmentation performance comparable to\nstate-of-the-art methods. 
Notably, the components proposed in this paper can be\napplied to other U-shaped networks to improve their segmentation performance.\n","authors":["Junlong Cheng","Chengrui Gao","Fengjie Wang","Min Zhu"],"pdf_url":"https://arxiv.org/pdf/2307.02953v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.04246v2","updated":"2023-07-21T09:15:42Z","published":"2023-02-08T18:26:10Z","title":"Shortcut Detection with Variational Autoencoders","summary":" For real-world applications of machine learning (ML), it is essential that\nmodels make predictions based on well-generalizing features rather than\nspurious correlations in the data. The identification of such spurious\ncorrelations, also known as shortcuts, is a challenging problem and has so far\nbeen scarcely addressed. In this work, we present a novel approach to detect\nshortcuts in image and audio datasets by leveraging variational autoencoders\n(VAEs). The disentanglement of features in the latent space of VAEs allows us\nto discover feature-target correlations in datasets and semi-automatically\nevaluate them for ML shortcuts. We demonstrate the applicability of our method\non several real-world datasets and identify shortcuts that have not been\ndiscovered before.\n","authors":["Nicolas M. Müller","Simon Roschmann","Shahbaz Khan","Philip Sperl","Konstantin Böttinger"],"pdf_url":"https://arxiv.org/pdf/2302.04246v2.pdf","comment":"Accepted at the ICML 2023 Workshop on Spurious Correlations,\n Invariance and Stability"},{"id":"http://arxiv.org/abs/2307.04378v3","updated":"2023-07-21T09:13:55Z","published":"2023-07-10T07:24:44Z","title":"Towards Generalizable Diabetic Retinopathy Grading in Unseen Domains","summary":" Diabetic Retinopathy (DR) is a common complication of diabetes and a leading\ncause of blindness worldwide. Early and accurate grading of its severity is\ncrucial for disease management. Although deep learning has shown great\npotential for automated DR grading, its real-world deployment is still\nchallenging due to distribution shifts among source and target domains, known\nas the domain generalization problem. Existing works have mainly attributed the\nperformance degradation to limited domain shifts caused by simple visual\ndiscrepancies, which cannot handle complex real-world scenarios. Instead, we\npresent preliminary evidence suggesting the existence of three-fold\ngeneralization issues: visual and degradation style shifts, diagnostic pattern\ndiversity, and data imbalance. To tackle these issues, we propose a novel\nunified framework named Generalizable Diabetic Retinopathy Grading Network\n(GDRNet). GDRNet consists of three vital components: fundus visual-artifact\naugmentation (FundusAug), dynamic hybrid-supervised loss (DahLoss), and\ndomain-class-aware re-balancing (DCR). FundusAug generates realistic augmented\nimages via visual transformation and image degradation, while DahLoss jointly\nleverages pixel-level consistency and image-level semantics to capture the\ndiverse diagnostic patterns and build generalizable feature representations.\nMoreover, DCR mitigates the data imbalance from a domain-class view and avoids\nundesired over-emphasis on rare domain-class pairs. Finally, we design a\npublicly available benchmark for fair evaluations. 
Extensive comparison\nexperiments against advanced methods and exhaustive ablation studies\ndemonstrate the effectiveness and generalization ability of GDRNet.\n","authors":["Haoxuan Che","Yuhan Cheng","Haibo Jin","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2307.04378v3.pdf","comment":"Early Accepted by MICCAI 2023, the 26th International Conference on\n Medical Image Computing and Computer Assisted Intervention"},{"id":"http://arxiv.org/abs/2305.18310v2","updated":"2023-07-21T09:12:17Z","published":"2023-05-17T14:14:31Z","title":"Motion-Scenario Decoupling for Rat-Aware Video Position Prediction:\n Strategy and Benchmark","summary":" Recently significant progress has been made in human action recognition and\nbehavior prediction using deep learning techniques, leading to improved\nvision-based semantic understanding. However, there is still a lack of\nhigh-quality motion datasets for small bio-robotics, which presents more\nchallenging scenarios for long-term movement prediction and behavior control\nbased on third-person observation. In this study, we introduce RatPose, a\nbio-robot motion prediction dataset constructed by considering the influence\nfactors of individuals and environments based on predefined annotation rules.\nTo enhance the robustness of motion prediction against these factors, we\npropose a Dual-stream Motion-Scenario Decoupling (\\textit{DMSD}) framework that\neffectively separates scenario-oriented and motion-oriented features and\ndesigns a scenario contrast loss and motion clustering loss for overall\ntraining. With such distinctive architecture, the dual-branch feature flow\ninformation is interacted and compensated in a decomposition-then-fusion\nmanner. Moreover, we demonstrate significant performance improvements of the\nproposed \\textit{DMSD} framework on different difficulty-level tasks. We also\nimplement long-term discretized trajectory prediction tasks to verify the\ngeneralization ability of the proposed dataset.\n","authors":["Xiaofeng Liu","Jiaxin Gao","Yaohua Liu","Risheng Liu","Nenggan Zheng"],"pdf_url":"https://arxiv.org/pdf/2305.18310v2.pdf","comment":"Rat, Video Position Prediction"},{"id":"http://arxiv.org/abs/2303.09975v4","updated":"2023-07-21T09:05:53Z","published":"2023-03-17T13:48:17Z","title":"MedNeXt: Transformer-driven Scaling of ConvNets for Medical Image\n Segmentation","summary":" There has been exploding interest in embracing Transformer-based\narchitectures for medical image segmentation. However, the lack of large-scale\nannotated medical datasets make achieving performances equivalent to those in\nnatural images challenging. Convolutional networks, in contrast, have higher\ninductive biases and consequently, are easily trainable to high performance.\nRecently, the ConvNeXt architecture attempted to modernize the standard ConvNet\nby mirroring Transformer blocks. In this work, we improve upon this to design a\nmodernized and scalable convolutional architecture customized to challenges of\ndata-scarce medical settings. 
We introduce MedNeXt, a Transformer-inspired\nlarge kernel segmentation network which introduces - 1) A fully ConvNeXt 3D\nEncoder-Decoder Network for medical image segmentation, 2) Residual ConvNeXt up\nand downsampling blocks to preserve semantic richness across scales, 3) A novel\ntechnique to iteratively increase kernel sizes by upsampling small kernel\nnetworks, to prevent performance saturation on limited medical data, 4)\nCompound scaling at multiple levels (depth, width, kernel size) of MedNeXt.\nThis leads to state-of-the-art performance on 4 tasks on CT and MRI modalities\nand varying dataset sizes, representing a modernized deep architecture for\nmedical image segmentation. Our code is made publicly available at:\nhttps://github.com/MIC-DKFZ/MedNeXt.\n","authors":["Saikat Roy","Gregor Koehler","Constantin Ulrich","Michael Baumgartner","Jens Petersen","Fabian Isensee","Paul F. Jaeger","Klaus Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2303.09975v4.pdf","comment":"Accepted at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.11438v1","updated":"2023-07-21T08:58:49Z","published":"2023-07-21T08:58:49Z","title":"Attention Consistency Refined Masked Frequency Forgery Representation\n for Generalizing Face Forgery Detection","summary":" Due to the successful development of deep image generation technology, visual\ndata forgery detection would play a more important role in social and economic\nsecurity. Existing forgery detection methods suffer from unsatisfactory\ngeneralization ability to determine the authenticity in the unseen domain. In\nthis paper, we propose a novel Attention Consistency Refined masked frequency\nforgery representation model toward generalizing face forgery detection\nalgorithm (ACMF). Most forgery technologies always bring in high-frequency\naware cues, which make it easy to distinguish source authenticity but difficult\nto generalize to unseen artifact types. The masked frequency forgery\nrepresentation module is designed to explore robust forgery cues by randomly\ndiscarding high-frequency information. In addition, we find that the forgery\nattention map inconsistency through the detection network could affect the\ngeneralizability. Thus, the forgery attention consistency is introduced to\nforce detectors to focus on similar attention regions for better generalization\nability. Experiment results on several public face forgery datasets\n(FaceForensic++, DFD, Celeb-DF, and WDF datasets) demonstrate the superior\nperformance of the proposed method compared with the state-of-the-art methods.\n","authors":["Decheng Liu","Tao Chen","Chunlei Peng","Nannan Wang","Ruimin Hu","Xinbo Gao"],"pdf_url":"https://arxiv.org/pdf/2307.11438v1.pdf","comment":"The source code and models are publicly available at\n https://github.com/chenboluo/ACMF"},{"id":"http://arxiv.org/abs/2307.11434v1","updated":"2023-07-21T08:55:23Z","published":"2023-07-21T08:55:23Z","title":"Batching for Green AI -- An Exploratory Study on Inference","summary":" The batch size is an essential parameter to tune during the development of\nnew neural networks. Amongst other quality indicators, it has a large degree of\ninfluence on the model's accuracy, generalisability, training times and\nparallelisability. This fact is generally known and commonly studied. However,\nduring the application phase of a deep learning model, when the model is\nutilised by an end-user for inference, we find that there is a disregard for\nthe potential benefits of introducing a batch size. 
In this study, we examine\nthe effect of input batching on the energy consumption and response times of\nfive fully-trained neural networks for computer vision that were considered\nstate-of-the-art at the time of their publication. The results suggest that\nbatching has a significant effect on both of these metrics. Furthermore, we\npresent a timeline of the energy efficiency and accuracy of neural networks\nover the past decade. We find that in general, energy consumption rises at a\nmuch steeper pace than accuracy and question the necessity of this evolution.\nAdditionally, we highlight one particular network, ShuffleNetV2(2018), that\nachieved a competitive performance for its time while maintaining a much lower\nenergy consumption. Nevertheless, we highlight that the results are model\ndependent.\n","authors":["Tim Yarally","Luís Cruz","Daniel Feitosa","June Sallou","Arie van Deursen"],"pdf_url":"https://arxiv.org/pdf/2307.11434v1.pdf","comment":"8 pages, 4 figures, 1 table. Accepted at Euromicro Conference Series\n on Software Engineering and Advanced Applications (SEAA) 2023"},{"id":"http://arxiv.org/abs/2307.09004v2","updated":"2023-07-21T08:41:23Z","published":"2023-07-18T06:44:20Z","title":"Ord2Seq: Regarding Ordinal Regression as Label Sequence Prediction","summary":" Ordinal regression refers to classifying object instances into ordinal\ncategories. It has been widely studied in many scenarios, such as medical\ndisease grading, movie rating, etc. Known methods focused only on learning\ninter-class ordinal relationships, but still incur limitations in\ndistinguishing adjacent categories thus far. In this paper, we propose a simple\nsequence prediction framework for ordinal regression called Ord2Seq, which, for\nthe first time, transforms each ordinal category label into a special label\nsequence and thus regards an ordinal regression task as a sequence prediction\nprocess. In this way, we decompose an ordinal regression task into a series of\nrecursive binary classification steps, so as to subtly distinguish adjacent\ncategories. Comprehensive experiments show the effectiveness of distinguishing\nadjacent categories for performance improvement and our new approach exceeds\nstate-of-the-art performances in four different scenarios. Codes are available\nat https://github.com/wjh892521292/Ord2Seq.\n","authors":["Jinhong Wang","Yi Cheng","Jintai Chen","Tingting Chen","Danny Chen","Jian Wu"],"pdf_url":"https://arxiv.org/pdf/2307.09004v2.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2208.00657v2","updated":"2023-07-21T08:39:22Z","published":"2022-08-01T07:35:45Z","title":"SiamixFormer: a fully-transformer Siamese network with temporal Fusion\n for accurate building detection and change detection in bi-temporal remote\n sensing images","summary":" Building detection and change detection using remote sensing images can help\nurban and rescue planning. Moreover, they can be used for building damage\nassessment after natural disasters. Currently, most of the existing models for\nbuilding detection use only one image (pre-disaster image) to detect buildings.\nThis is based on the idea that post-disaster images reduce the model's\nperformance because of presence of destroyed buildings. In this paper, we\npropose a siamese model, called SiamixFormer, which uses pre- and post-disaster\nimages as input. Our model has two encoders and has a hierarchical transformer\narchitecture. 
The output of each stage in both encoders is given to a temporal\ntransformer for feature fusion in a way that query is generated from\npre-disaster images and (key, value) is generated from post-disaster images. To\nthis end, temporal features are also considered in feature fusion. Another\nadvantage of using temporal transformers in feature fusion is that they can\nbetter maintain large receptive fields generated by transformer encoders\ncompared with CNNs. Finally, the output of the temporal transformer is given to\na simple MLP decoder at each stage. The SiamixFormer model is evaluated on xBD,\nand WHU datasets, for building detection and on LEVIR-CD and CDD datasets for\nchange detection and could outperform the state-of-the-art.\n","authors":["Amir Mohammadian","Foad Ghaderi"],"pdf_url":"https://arxiv.org/pdf/2208.00657v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11418v1","updated":"2023-07-21T08:22:14Z","published":"2023-07-21T08:22:14Z","title":"FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural\n Radiance Fields","summary":" As recent advances in Neural Radiance Fields (NeRF) have enabled\nhigh-fidelity 3D face reconstruction and novel view synthesis, its manipulation\nalso became an essential task in 3D vision. However, existing manipulation\nmethods require extensive human labor, such as a user-provided semantic mask\nand manual attribute search unsuitable for non-expert users. Instead, our\napproach is designed to require a single text to manipulate a face\nreconstructed with NeRF. To do so, we first train a scene manipulator, a latent\ncode-conditional deformable NeRF, over a dynamic scene to control a face\ndeformation using the latent code. However, representing a scene deformation\nwith a single latent code is unfavorable for compositing local deformations\nobserved in different instances. As so, our proposed Position-conditional\nAnchor Compositor (PAC) learns to represent a manipulated scene with spatially\nvarying latent codes. Their renderings with the scene manipulator are then\noptimized to yield high cosine similarity to a target text in CLIP embedding\nspace for text-driven manipulation. To the best of our knowledge, our approach\nis the first to address the text-driven manipulation of a face reconstructed\nwith NeRF. Extensive results, comparisons, and ablation studies demonstrate the\neffectiveness of our approach.\n","authors":["Sungwon Hwang","Junha Hyung","Daejin Kim","Min-Jung Kim","Jaegul Choo"],"pdf_url":"https://arxiv.org/pdf/2307.11418v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.11413v1","updated":"2023-07-21T08:15:39Z","published":"2023-07-21T08:15:39Z","title":"A Video-based Detector for Suspicious Activity in Examination with\n OpenPose","summary":" Examinations are a crucial part of the learning process, and academic\ninstitutions invest significant resources into maintaining their integrity by\npreventing cheating from students or facilitators. However, cheating has become\nrampant in examination setups, compromising their integrity. The traditional\nmethod of relying on invigilators to monitor every student is impractical and\nineffective. To address this issue, there is a need to continuously record exam\nsessions to monitor students for suspicious activities. However, these\nrecordings are often too lengthy for invigilators to analyze effectively, and\nfatigue may cause them to miss significant details. To widen the coverage,\ninvigilators could use fixed overhead or wearable cameras. 
This paper\nintroduces a framework that uses automation to analyze videos and detect\nsuspicious activities during examinations efficiently and effectively. We\nutilized the OpenPose framework and Convolutional Neural Network (CNN) to\nidentify students exchanging objects during exams. This detection system is\nvital in preventing cheating and promoting academic integrity, fairness, and\nquality education for institutions.\n","authors":["Reuben Moyo","Stanley Ndebvu","Michael Zimba","Jimmy Mbelwa"],"pdf_url":"https://arxiv.org/pdf/2307.11413v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11411v1","updated":"2023-07-21T08:10:26Z","published":"2023-07-21T08:10:26Z","title":"Deep Directly-Trained Spiking Neural Networks for Object Detection","summary":" Spiking neural networks (SNNs) are brain-inspired energy-efficient models\nthat encode information in spatiotemporal dynamics. Recently, deep SNNs trained\ndirectly have shown great success in achieving high performance on\nclassification tasks with very few time steps. However, how to design a\ndirectly-trained SNN for the regression task of object detection still remains\na challenging problem. To address this problem, we propose EMS-YOLO, a novel\ndirectly-trained SNN framework for object detection, which is the first trial\nto train a deep SNN with surrogate gradients for object detection rather than\nANN-SNN conversion strategies. Specifically, we design a full-spike residual\nblock, EMS-ResNet, which can effectively extend the depth of the\ndirectly-trained SNN with low power consumption. Furthermore, we theoretically\nanalyze and prove the EMS-ResNet could avoid gradient vanishing or exploding.\nThe results demonstrate that our approach outperforms the state-of-the-art\nANN-SNN conversion methods (at least 500 time steps) in extremely fewer time\nsteps (only 4 time steps). It is shown that our model could achieve comparable\nperformance to the ANN with the same architecture while consuming 5.83 times\nless energy on the frame-based COCO Dataset and the event-based Gen1 Dataset.\n","authors":["Qiaoyi Su","Yuhong Chou","Yifan Hu","Jianing Li","Shijie Mei","Ziyang Zhang","Guoqi Li"],"pdf_url":"https://arxiv.org/pdf/2307.11411v1.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2307.11410v1","updated":"2023-07-21T08:09:47Z","published":"2023-07-21T08:09:47Z","title":"Subject-Diffusion:Open Domain Personalized Text-to-Image Generation\n without Test-time Fine-tuning","summary":" Recent progress in personalized image generation using diffusion models has\nbeen significant. However, development in the area of open-domain and\nnon-fine-tuning personalized image generation is proceeding rather slowly. In\nthis paper, we propose Subject-Diffusion, a novel open-domain personalized\nimage generation model that, in addition to not requiring test-time\nfine-tuning, also only requires a single reference image to support\npersonalized generation of single- or multi-subject in any domain. Firstly, we\nconstruct an automatic data labeling tool and use the LAION-Aesthetics dataset\nto construct a large-scale dataset consisting of 76M images and their\ncorresponding subject detection bounding boxes, segmentation masks and text\ndescriptions. Secondly, we design a new unified framework that combines text\nand image semantics by incorporating coarse location and fine-grained reference\nimage control to maximize subject fidelity and generalization. 
Furthermore, we\nalso adopt an attention control mechanism to support multi-subject generation.\nExtensive qualitative and quantitative results demonstrate that our method\noutperforms other SOTA frameworks in single, multiple, and human customized\nimage generation. Please refer to our\n\\href{https://oppo-mente-lab.github.io/subject_diffusion/}{project page}\n","authors":["Jian Ma","Junhao Liang","Chen Chen","Haonan Lu"],"pdf_url":"https://arxiv.org/pdf/2307.11410v1.pdf","comment":"14 pages, 10 figures"},{"id":"http://arxiv.org/abs/2307.11404v1","updated":"2023-07-21T07:56:32Z","published":"2023-07-21T07:56:32Z","title":"Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for\n Occluded Facial Expression Recognition","summary":" Most research on facial expression recognition (FER) is conducted in highly\ncontrolled environments, but its performance is often unacceptable when applied\nto real-world situations. This is because when unexpected objects occlude the\nface, the FER network faces difficulties extracting facial features and\naccurately predicting facial expressions. Therefore, occluded FER (OFER) is a\nchallenging problem. Previous studies on occlusion-aware FER have typically\nrequired fully annotated facial images for training. However, collecting facial\nimages with various occlusions and expression annotations is time-consuming and\nexpensive. Latent-OFER, the proposed method, can detect occlusions, restore\noccluded parts of the face as if they were unoccluded, and recognize them,\nimproving FER accuracy. This approach involves three steps: First, the vision\ntransformer (ViT)-based occlusion patch detector masks the occluded position by\ntraining only latent vectors from the unoccluded patches using the support\nvector data description algorithm. Second, the hybrid reconstruction network\ngenerates the masking position as a complete image using the ViT and\nconvolutional neural network (CNN). Last, the expression-relevant latent vector\nextractor retrieves and uses expression-related information from all latent\nvectors by applying a CNN-based class activation map. This mechanism has a\nsignificant advantage in preventing performance degradation from occlusion by\nunseen objects. The experimental results on several databases demonstrate the\nsuperiority of the proposed method over state-of-the-art methods.\n","authors":["Isack Lee","Eungi Lee","Seok Bong Yoo"],"pdf_url":"https://arxiv.org/pdf/2307.11404v1.pdf","comment":"11 pages, 8 figures"},{"id":"http://arxiv.org/abs/2307.11397v1","updated":"2023-07-21T07:29:38Z","published":"2023-07-21T07:29:38Z","title":"Probabilistic Modeling of Inter- and Intra-observer Variability in\n Medical Image Segmentation","summary":" Medical image segmentation is a challenging task, particularly due to inter-\nand intra-observer variability, even between medical experts. In this paper, we\npropose a novel model, called Probabilistic Inter-Observer and iNtra-Observer\nvariation NetwOrk (Pionono). It captures the labeling behavior of each rater\nwith a multidimensional probability distribution and integrates this\ninformation with the feature maps of the image to produce probabilistic\nsegmentation predictions. The model is optimized by variational inference and\ncan be trained end-to-end. 
It outperforms state-of-the-art models such as\nSTAPLE, Probabilistic U-Net, and models based on confusion matrices.\nAdditionally, Pionono predicts multiple coherent segmentation maps that mimic\nthe rater's expert opinion, which provides additional valuable information for\nthe diagnostic process. Experiments on real-world cancer segmentation datasets\ndemonstrate the high accuracy and efficiency of Pionono, making it a powerful\ntool for medical image analysis.\n","authors":["Arne Schmidt","Pablo Morales-Álvarez","Rafael Molina"],"pdf_url":"https://arxiv.org/pdf/2307.11397v1.pdf","comment":"13 pages, 5 figures"},{"id":"http://arxiv.org/abs/2307.09815v2","updated":"2023-07-21T07:10:28Z","published":"2023-07-19T08:03:53Z","title":"LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network","summary":" Recovering sharp images from dual-pixel (DP) pairs with disparity-dependent\nblur is a challenging task.~Existing blur map-based deblurring methods have\ndemonstrated promising results. In this paper, we propose, to the best of our\nknowledge, the first framework to introduce the contrastive language-image\npre-training framework (CLIP) to achieve accurate blur map estimation from DP\npairs unsupervisedly. To this end, we first carefully design text prompts to\nenable CLIP to understand blur-related geometric prior knowledge from the DP\npair. Then, we propose a format to input stereo DP pair to the CLIP without any\nfine-tuning, where the CLIP is pre-trained on monocular images. Given the\nestimated blur map, we introduce a blur-prior attention block, a blur-weighting\nloss and a blur-aware loss to recover the all-in-focus image. Our method\nachieves state-of-the-art performance in extensive experiments.\n","authors":["Hao Yang","Liyuan Pan","Yan Yang","Miaomiao Liu"],"pdf_url":"https://arxiv.org/pdf/2307.09815v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10577v2","updated":"2023-07-21T06:59:21Z","published":"2023-07-20T04:41:39Z","title":"Ethosight: A Reasoning-Guided Iterative Learning System for Nuanced\n Perception based on Joint-Embedding & Contextual Label Affinity","summary":" Traditional computer vision models often require extensive manual effort for\ndata acquisition, annotation and validation, particularly when detecting subtle\nbehavioral nuances or events. The difficulty in distinguishing routine\nbehaviors from potential risks in real-world applications, such as\ndifferentiating routine shopping from potential shoplifting, further\ncomplicates the process. Moreover, these models may demonstrate high false\npositive rates and imprecise event detection when exposed to real-world\nscenarios that differ significantly from the conditions of the training data.\n To overcome these hurdles, we present Ethosight, a novel zero-shot computer\nvision system. Ethosight initiates with a clean slate based on user\nrequirements and semantic knowledge of interest. Using localized label affinity\ncalculations and a reasoning-guided iterative learning loop, Ethosight infers\nscene details and iteratively refines the label set. Reasoning mechanisms can\nbe derived from large language models like GPT4, symbolic reasoners like\nOpenNARS\\cite{wang2013}\\cite{wang2006}, or hybrid systems.\n Our evaluations demonstrate Ethosight's efficacy across 40 complex use cases,\nspanning domains such as health, safety, and security. 
Detailed results and\ncase studies within the main body of this paper and an appendix underscore a\npromising trajectory towards enhancing the adaptability and resilience of\ncomputer vision models in detecting and extracting subtle and nuanced\nbehaviors.\n","authors":["Hugo Latapie","Kristinn R. Thorisson","Shan Yu","Vahagn Petrosyan","Patrick Hammer","Pei Wang","Brandon Kynoch","Hanning Chen","Tangrui Li"],"pdf_url":"https://arxiv.org/pdf/2307.10577v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11386v1","updated":"2023-07-21T06:56:21Z","published":"2023-07-21T06:56:21Z","title":"CLR: Channel-wise Lightweight Reprogramming for Continual Learning","summary":" Continual learning aims to emulate the human ability to continually\naccumulate knowledge over sequential tasks. The main challenge is to maintain\nperformance on previously learned tasks after learning new tasks, i.e., to\navoid catastrophic forgetting. We propose a Channel-wise Lightweight\nReprogramming (CLR) approach that helps convolutional neural networks (CNNs)\novercome catastrophic forgetting during continual learning. We show that a CNN\nmodel trained on an old task (or self-supervised proxy task) could be\n``reprogrammed\" to solve a new task by using our proposed lightweight (very\ncheap) reprogramming parameter. With the help of CLR, we have a better\nstability-plasticity trade-off to solve continual learning problems: To\nmaintain stability and retain previous task ability, we use a common\ntask-agnostic immutable part as the shared ``anchor\" parameter set. We then add\ntask-specific lightweight reprogramming parameters to reinterpret the outputs\nof the immutable parts, to enable plasticity and integrate new knowledge. To\nlearn sequential tasks, we only train the lightweight reprogramming parameters\nto learn each new task. Reprogramming parameters are task-specific and\nexclusive to each task, which makes our method immune to catastrophic\nforgetting. To minimize the parameter requirement of reprogramming to learn new\ntasks, we make reprogramming lightweight by only adjusting essential kernels\nand learning channel-wise linear mappings from anchor parameters to\ntask-specific domain knowledge. We show that, for general CNNs, the CLR\nparameter increase is less than 0.6\\% for any new task. Our method outperforms\n13 state-of-the-art continual learning baselines on a new challenging sequence\nof 53 image classification datasets. Code and data are available at\nhttps://github.com/gyhandy/Channel-wise-Lightweight-Reprogramming\n","authors":["Yunhao Ge","Yuecheng Li","Shuo Ni","Jiaping Zhao","Ming-Hsuan Yang","Laurent Itti"],"pdf_url":"https://arxiv.org/pdf/2307.11386v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2303.06146v2","updated":"2023-07-21T06:34:54Z","published":"2023-03-10T18:59:33Z","title":"StyleGANEX: StyleGAN-Based Manipulation Beyond Cropped Aligned Faces","summary":" Recent advances in face manipulation using StyleGAN have produced impressive\nresults. However, StyleGAN is inherently limited to cropped aligned faces at a\nfixed image resolution it is pre-trained on. In this paper, we propose a simple\nand effective solution to this limitation by using dilated convolutions to\nrescale the receptive fields of shallow layers in StyleGAN, without altering\nany model parameters. This allows fixed-size small features at shallow layers\nto be extended into larger ones that can accommodate variable resolutions,\nmaking them more robust in characterizing unaligned faces. 
To enable real face\ninversion and manipulation, we introduce a corresponding encoder that provides\nthe first-layer feature of the extended StyleGAN in addition to the latent\nstyle code. We validate the effectiveness of our method using unaligned face\ninputs of various resolutions in a diverse set of face manipulation tasks,\nincluding facial attribute editing, super-resolution, sketch/mask-to-face\ntranslation, and face toonification.\n","authors":["Shuai Yang","Liming Jiang","Ziwei Liu","Chen Change Loy"],"pdf_url":"https://arxiv.org/pdf/2303.06146v2.pdf","comment":"ICCV 2023. Code: https://github.com/williamyang1991/StyleGANEX\n Project page: https://www.mmlab-ntu.com/project/styleganex/"},{"id":"http://arxiv.org/abs/2307.11375v1","updated":"2023-07-21T06:17:09Z","published":"2023-07-21T06:17:09Z","title":"LatentAugment: Data Augmentation via Guided Manipulation of GAN's Latent\n Space","summary":" Data Augmentation (DA) is a technique to increase the quantity and diversity\nof the training data, and by that alleviate overfitting and improve\ngeneralisation. However, standard DA produces synthetic data for augmentation\nwith limited diversity. Generative Adversarial Networks (GANs) may unlock\nadditional information in a dataset by generating synthetic samples having the\nappearance of real images. However, these models struggle to simultaneously\naddress three key requirements: fidelity and high-quality samples; diversity\nand mode coverage; and fast sampling. Indeed, GANs generate high-quality\nsamples rapidly, but have poor mode coverage, limiting their adoption in DA\napplications. We propose LatentAugment, a DA strategy that overcomes the low\ndiversity of GANs, opening up for use in DA applications. Without external\nsupervision, LatentAugment modifies latent vectors and moves them into latent\nspace regions to maximise the synthetic images' diversity and fidelity. It is\nalso agnostic to the dataset and the downstream task. A wide set of experiments\nshows that LatentAugment improves the generalisation of a deep model\ntranslating from MRI-to-CT beating both standard DA as well GAN-based sampling.\nMoreover, still in comparison with GAN-based sampling, LatentAugment synthetic\nsamples show superior mode coverage and diversity. Code is available at:\nhttps://github.com/ltronchin/LatentAugment.\n","authors":["Lorenzo Tronchin","Minh H. Vu","Paolo Soda","Tommy Löfstedt"],"pdf_url":"https://arxiv.org/pdf/2307.11375v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.16649v4","updated":"2023-07-21T05:46:30Z","published":"2023-05-26T05:41:20Z","title":"FSD: Fully-Specialized Detector via Neural Architecture Search","summary":" Most generic object detectors are mainly built for standard object detection\ntasks such as COCO and PASCAL VOC. They might not work well and/or efficiently\non tasks of other domains consisting of images that are visually different from\nstandard datasets. To this end, many advances have been focused on adapting a\ngeneral-purposed object detector with limited domain-specific designs. However,\ndesigning a successful task-specific detector requires extraneous manual\nexperiments and parameter tuning through trial and error. In this paper, we\nfirst propose and examine a fully-automatic pipeline to design a\nfully-specialized detector (FSD) which mainly incorporates a\nneural-architectural-searched model by exploring ideal network structures over\nthe backbone and task-specific head. 
On the DeepLesion dataset, extensive\nresults show that FSD can achieve a 3.1 mAP gain while using approximately 40%\nfewer parameters on the binary lesion detection task and improves the mAP by around\n10% on the multi-type lesion detection task via our region-aware graph modeling\ncompared with existing general-purpose medical lesion detection networks.\n","authors":["Zhe Huang","Yudian Li"],"pdf_url":"https://arxiv.org/pdf/2305.16649v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11364v1","updated":"2023-07-21T05:33:57Z","published":"2023-07-21T05:33:57Z","title":"Photo2Relief: Let Human in the Photograph Stand Out","summary":" In this paper, we propose a technique for making humans in photographs\nprotrude like reliefs. Unlike previous methods, which mostly focus on the face\nand head, our method aims to generate artworks that describe the whole body\nactivity of the character. One challenge is that there is no ground-truth for\nsupervised deep learning. We introduce a sigmoid variant function to manipulate\ngradients tactfully and train our neural networks by equipping them with a loss\nfunction defined in the gradient domain. The second challenge is that actual\nphotographs are often taken under different lighting conditions. We use an image-based\nrendering technique to address this challenge and acquire rendered images and\ndepth data under different lighting conditions. To make a clear division of\nlabor in network modules, a two-scale architecture is proposed to create\nhigh-quality relief from a single photograph. Extensive experimental results on\na variety of scenes show that our method is a highly effective solution for\ngenerating digital 2.5D artwork from photographs.\n","authors":["Zhongping Ji","Feifei Che","Hanshuo Liu","Ziyi Zhao","Yu-Wei Zhang","Wenping Wang"],"pdf_url":"https://arxiv.org/pdf/2307.11364v1.pdf","comment":"10 pages, 11 figures"},{"id":"http://arxiv.org/abs/2307.11360v1","updated":"2023-07-21T05:26:32Z","published":"2023-07-21T05:26:32Z","title":"ParGANDA: Making Synthetic Pedestrians A Reality For Object Detection","summary":" Object detection is a key technique for a number of Computer Vision\napplications, but it often requires large amounts of annotated data to achieve\ndecent results. Moreover, for pedestrian detection specifically, the collected\ndata might contain some personally identifiable information (PII), which is\nhighly restricted in many countries. This label-intensive and privacy-concerning task has recently led to an increasing interest in training the\ndetection models using synthetically generated pedestrian datasets collected\nwith a photo-realistic video game engine. The engine is able to generate\nunlimited amounts of data with precise and consistent annotations, which gives\npotential for significant gains in real-world applications. However, the\nuse of synthetic data for training introduces a synthetic-to-real domain shift\nthat degrades the final performance. To close the gap between the real and\nsynthetic data, we propose to use a Generative Adversarial Network (GAN), which\nperforms parameterized unpaired image-to-image translation to generate more\nrealistic images. The key benefit of using the GAN is its intrinsic preference\nfor low-level changes over geometric ones, which means annotations of a given\nsynthetic image remain accurate even after domain translation is performed, thus\neliminating the need for labeling real data. 
We extensively experimented with\nthe proposed method using MOTSynth dataset to train and MOT17 and MOT20\ndetection datasets to test, with experimental results demonstrating the\neffectiveness of this method. Our approach not only produces visually plausible\nsamples but also does not require any labels of the real domain thus making it\napplicable to the variety of downstream tasks.\n","authors":["Daria Reshetova","Guanhang Wu","Marcel Puyat","Chunhui Gu","Huizhong Chen"],"pdf_url":"https://arxiv.org/pdf/2307.11360v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.04541v2","updated":"2023-07-21T05:08:44Z","published":"2023-07-10T13:09:42Z","title":"Learning Large Margin Sparse Embeddings for Open Set Medical Diagnosis","summary":" Fueled by deep learning, computer-aided diagnosis achieves huge advances.\nHowever, out of controlled lab environments, algorithms could face multiple\nchallenges. Open set recognition (OSR), as an important one, states that\ncategories unseen in training could appear in testing. In medical fields, it\ncould derive from incompletely collected training datasets and the constantly\nemerging new or rare diseases. OSR requires an algorithm to not only correctly\nclassify known classes, but also recognize unknown classes and forward them to\nexperts for further diagnosis. To tackle OSR, we assume that known classes\ncould densely occupy small parts of the embedding space and the remaining\nsparse regions could be recognized as unknowns. Following it, we propose Open\nMargin Cosine Loss (OMCL) unifying two mechanisms. The former, called Margin\nLoss with Adaptive Scale (MLAS), introduces angular margin for reinforcing\nintra-class compactness and inter-class separability, together with an adaptive\nscaling factor to strengthen the generalization capacity. The latter, called\nOpen-Space Suppression (OSS), opens the classifier by recognizing sparse\nembedding space as unknowns using proposed feature space descriptors. Besides,\nsince medical OSR is still a nascent field, two publicly available benchmark\ndatasets are proposed for comparison. Extensive ablation studies and feature\nvisualization demonstrate the effectiveness of each design. Compared with\nstate-of-the-art methods, MLAS achieves superior performances, measured by ACC,\nAUROC, and OSCR.\n","authors":["Mingyuan Liu","Lu Xu","Jicong Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.04541v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10698v2","updated":"2023-07-21T05:05:52Z","published":"2023-07-20T08:39:20Z","title":"Reverse Knowledge Distillation: Training a Large Model using a Small One\n for Retinal Image Matching on Limited Data","summary":" Retinal image matching plays a crucial role in monitoring disease progression\nand treatment response. However, datasets with matched keypoints between\ntemporally separated pairs of images are not available in abundance to train\ntransformer-based model. We propose a novel approach based on reverse knowledge\ndistillation to train large models with limited data while preventing\noverfitting. Firstly, we propose architectural modifications to a CNN-based\nsemi-supervised method called SuperRetina that help us improve its results on a\npublicly available dataset. Then, we train a computationally heavier model\nbased on a vision transformer encoder using the lighter CNN-based model, which\nis counter-intuitive in the field knowledge-distillation research where\ntraining lighter models based on heavier ones is the norm. 
Surprisingly, such\nreverse knowledge distillation improves generalization even further. Our\nexperiments suggest that high-dimensional fitting in representation space may\nprevent overfitting, unlike training directly to match the final output. We also\nprovide a public dataset with annotations for retinal image keypoint detection\nand matching to help the research community develop algorithms for retinal\nimage applications.\n","authors":["Sahar Almahfouz Nasser","Nihar Gupte","Amit Sethi"],"pdf_url":"https://arxiv.org/pdf/2307.10698v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10829v2","updated":"2023-07-21T04:46:07Z","published":"2023-07-10T12:18:18Z","title":"Exact Diffusion Inversion via Bi-directional Integration Approximation","summary":" Recently, different methods have been proposed to address the inconsistency\nissue of DDIM inversion to enable image editing, such as EDICT\n\\cite{Wallace23EDICT} and Null-text inversion \\cite{Mokady23NullTestInv}.\nHowever, the above methods introduce considerable computational overhead. In\nthis paper, we propose a new technique, named \\emph{bi-directional integration\napproximation} (BDIA), to perform exact diffusion inversion with negligible\ncomputational overhead. Suppose we would like to estimate the next diffusion\nstate $\\boldsymbol{z}_{i-1}$ at timestep $t_i$ with the historical information\n$(i,\\boldsymbol{z}_i)$ and $(i+1,\\boldsymbol{z}_{i+1})$. We first obtain the\nestimated Gaussian noise $\\hat{\\boldsymbol{\\epsilon}}(\\boldsymbol{z}_i,i)$, and\nthen apply the DDIM update procedure twice for approximating the ODE\nintegration over the next time-slot $[t_i, t_{i-1}]$ in the forward manner and\nthe previous time-slot $[t_i, t_{i+1}]$ in the backward manner. The DDIM step\nfor the previous time-slot is used to refine the integration approximation made\nearlier when computing $\\boldsymbol{z}_i$. One nice property of BDIA-DDIM is\nthat the update expression for $\\boldsymbol{z}_{i-1}$ is a linear combination\nof $(\\boldsymbol{z}_{i+1}, \\boldsymbol{z}_i,\n\\hat{\\boldsymbol{\\epsilon}}(\\boldsymbol{z}_i,i))$. This allows for exact\nbackward computation of $\\boldsymbol{z}_{i+1}$ given $(\\boldsymbol{z}_i,\n\\boldsymbol{z}_{i-1})$, thus leading to exact diffusion inversion. Experiments\non both image reconstruction and image editing were conducted, confirming our\nstatement. BDIA can also be applied to improve the performance of other ODE\nsolvers in addition to DDIM. In our work, it is found that applying BDIA to the\nEDM sampling procedure produces a slightly better FID score on CIFAR10.\n","authors":["Guoqiang Zhang","J. P. Lewis","W. Bastiaan Kleijn"],"pdf_url":"https://arxiv.org/pdf/2307.10829v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2304.11328"},{"id":"http://arxiv.org/abs/2307.11342v1","updated":"2023-07-21T04:15:02Z","published":"2023-07-21T04:15:02Z","title":"Tuning Pre-trained Model via Moment Probing","summary":" Recently, efficient fine-tuning of large-scale pre-trained models has\nattracted increasing research interest, where linear probing (LP) as a\nfundamental module is involved in exploiting the final representations for\ntask-dependent classification. However, most of the existing methods focus on\nhow to effectively introduce a few learnable parameters, and little work\npays attention to the commonly used LP module. 
In this paper, we propose a\nnovel Moment Probing (MP) method to further explore the potential of LP.\nDistinguished from LP, which builds a linear classification head based on the\nmean of final features (e.g., word tokens for ViT) or classification tokens,\nour MP performs a linear classifier on feature distribution, which provides\nstronger representation ability by exploiting richer statistical information\ninherent in features. Specifically, we represent feature distribution by its\ncharacteristic function, which is efficiently approximated by using first- and\nsecond-order moments of features. Furthermore, we propose a multi-head\nconvolutional cross-covariance (MHC$^3$) to compute second-order moments in an\nefficient and effective manner. By considering that MP could affect feature\nlearning, we introduce a partially shared module to learn two recalibrating\nparameters (PSRP) for backbones based on MP, namely MP$_{+}$. Extensive\nexperiments on ten benchmarks using various models show that our MP\nsignificantly outperforms LP and is competitive with counterparts at lower\ntraining cost, while our MP$_{+}$ achieves state-of-the-art performance.\n","authors":["Mingze Gao","Qilong Wang","Zhenyi Lin","Pengfei Zhu","Qinghua Hu","Jingbo Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.11342v1.pdf","comment":"Accepted to ICCV 2023; Project Page:\n https://github.com/mingzeG/Moment-Probing"},{"id":"http://arxiv.org/abs/2307.11336v1","updated":"2023-07-21T03:50:23Z","published":"2023-07-21T03:50:23Z","title":"Character Time-series Matching For Robust License Plate Recognition","summary":" Automatic License Plate Recognition (ALPR) is becoming a popular study area\nand is applied in many fields such as transportation and smart cities. However,\nthere are still several limitations when applying many current methods to\npractical problems due to the variation in real-world situations such as light\nchanges, unclear License Plate (LP) characters, and image quality. Most\nrecent ALPR algorithms process a single frame, which reduces accuracy when\nimage quality is poor. This paper presents methods to improve license\nplate recognition accuracy by tracking the license plate in multiple frames.\nFirst, the Adaptive License Plate Rotation algorithm is applied to correctly\nalign the detected license plate. Second, we propose a method called Character\nTime-series Matching to recognize license plate characters from many\nconsecutive frames. The proposed method achieves high performance on the\nUFPR-ALPR dataset, reaching \\boldmath$96.7\\%$ accuracy in real-time on an RTX A5000\nGPU card. We also deploy the algorithm for the Vietnamese ALPR system. The\naccuracies for license plate detection and character recognition are 0.881 and\n0.979 $mAP^{test}$@.5, respectively. 
The source code is available at\nhttps://github.com/chequanghuy/Character-Time-series-Matching.git\n","authors":["Quang Huy Che","Tung Do Thanh","Cuong Truong Van"],"pdf_url":"https://arxiv.org/pdf/2307.11336v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11335v1","updated":"2023-07-21T03:47:28Z","published":"2023-07-21T03:47:28Z","title":"Tri-MipRF: Tri-Mip Representation for Efficient Anti-Aliasing Neural\n Radiance Fields","summary":" Despite the tremendous progress in neural radiance fields (NeRF), we still\nface a dilemma of the trade-off between quality and efficiency, e.g., MipNeRF\npresents fine-detailed and anti-aliased renderings but takes days for training,\nwhile Instant-ngp can accomplish the reconstruction in a few minutes but\nsuffers from blurring or aliasing when rendering at various distances or\nresolutions due to ignoring the sampling area. To this end, we propose a novel\nTri-Mip encoding that enables both instant reconstruction and anti-aliased\nhigh-fidelity rendering for neural radiance fields. The key is to factorize the\npre-filtered 3D feature spaces in three orthogonal mipmaps. In this way, we can\nefficiently perform 3D area sampling by taking advantage of 2D pre-filtered\nfeature maps, which significantly elevates the rendering quality without\nsacrificing efficiency. To cope with the novel Tri-Mip representation, we\npropose a cone-casting rendering technique to efficiently sample anti-aliased\n3D features with the Tri-Mip encoding considering both pixel imaging and\nobserving distance. Extensive experiments on both synthetic and real-world\ndatasets demonstrate our method achieves state-of-the-art rendering quality and\nreconstruction speed while maintaining a compact representation that reduces\n25% model size compared against Instant-ngp.\n","authors":["Wenbo Hu","Yuling Wang","Lin Ma","Bangbang Yang","Lin Gao","Xiao Liu","Yuewen Ma"],"pdf_url":"https://arxiv.org/pdf/2307.11335v1.pdf","comment":"Accepted to ICCV 2023 Project page:\n https://wbhu.github.io/projects/Tri-MipRF"},{"id":"http://arxiv.org/abs/2307.11334v1","updated":"2023-07-21T03:43:07Z","published":"2023-07-21T03:43:07Z","title":"Improving Transferability of Adversarial Examples via Bayesian Attacks","summary":" This paper presents a substantial extension of our work published at ICLR.\nOur ICLR work advocated for enhancing transferability in adversarial examples\nby incorporating a Bayesian formulation into model parameters, which\neffectively emulates the ensemble of infinitely many deep neural networks,\nwhile, in this paper, we introduce a novel extension by incorporating the\nBayesian formulation into the model input as well, enabling the joint\ndiversification of both the model input and model parameters. Our empirical\nfindings demonstrate that: 1) the combination of Bayesian formulations for both\nthe model input and model parameters yields significant improvements in\ntransferability; 2) by introducing advanced approximations of the posterior\ndistribution over the model input, adversarial transferability achieves further\nenhancement, surpassing all state-of-the-arts when attacking without model\nfine-tuning. Moreover, we propose a principled approach to fine-tune model\nparameters in such an extended Bayesian formulation. The derived optimization\nobjective inherently encourages flat minima in the parameter space and input\nspace. 
Extensive experiments demonstrate that our method achieves a new\nstate-of-the-art on transfer-based attacks, improving the average success rate\non ImageNet and CIFAR-10 by 19.14% and 2.08%, respectively, when comparing with\nour ICLR basic Bayesian method. We will make our code publicly available.\n","authors":["Qizhang Li","Yiwen Guo","Xiaochen Yang","Wangmeng Zuo","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2307.11334v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11323v1","updated":"2023-07-21T03:08:28Z","published":"2023-07-21T03:08:28Z","title":"HVDetFusion: A Simple and Robust Camera-Radar Fusion Framework","summary":" In the field of autonomous driving, 3D object detection is a very important\nperception module. Although the current SOTA algorithm combines Camera and\nLidar sensors, limited by the high price of Lidar, the current mainstream\nlanding schemes are pure Camera sensors or Camera+Radar sensors. In this study,\nwe propose a new detection algorithm called HVDetFusion, which is a multi-modal\ndetection algorithm that not only supports pure camera data as input for\ndetection, but also can perform fusion input of radar data and camera data. The\ncamera stream does not depend on the input of Radar data, thus addressing the\ndownside of previous methods. In the pure camera stream, we modify the\nframework of Bevdet4D for better perception and more efficient inference, and\nthis stream has the whole 3D detection output. Further, to incorporate the\nbenefits of Radar signals, we use the prior information of different object\npositions to filter the false positive information of the original radar data,\naccording to the positioning information and radial velocity information\nrecorded by the radar sensors to supplement and fuse the BEV features generated\nby the original camera data, and the effect is further improved in the process\nof fusion training. Finally, HVDetFusion achieves the new state-of-the-art\n67.4\\% NDS on the challenging nuScenes test set among all camera-radar 3D\nobject detectors. The code is available at\nhttps://github.com/HVXLab/HVDetFusion\n","authors":["Kai Lei","Zhan Chen","Shuman Jia","Xiaoteng Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.11323v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11317v1","updated":"2023-07-21T02:57:40Z","published":"2023-07-21T02:57:40Z","title":"XLDA: Linear Discriminant Analysis for Scaling Continual Learning to\n Extreme Classification at the Edge","summary":" Streaming Linear Discriminant Analysis (LDA) while proven in\nClass-incremental Learning deployments at the edge with limited classes (upto\n1000), has not been proven for deployment in extreme classification scenarios.\nIn this paper, we present: (a) XLDA, a framework for Class-IL in edge\ndeployment where LDA classifier is proven to be equivalent to FC layer\nincluding in extreme classification scenarios, and (b) optimizations to enable\nXLDA-based training and inference for edge deployment where there is a\nconstraint on available compute resources. 
We show up to 42x speed up using a\nbatched training approach and up to 5x inference speedup with nearest neighbor\nsearch on extreme datasets like AliProducts (50k classes) and Google Landmarks\nV2 (81k classes)\n","authors":["Karan Shah","Vishruth Veerendranath","Anushka Hebbar","Raghavendra Bhat"],"pdf_url":"https://arxiv.org/pdf/2307.11317v1.pdf","comment":"Submitted at ICML 2023: PAC-Bayes Interactive Learning Workshop"},{"id":"http://arxiv.org/abs/2307.11315v1","updated":"2023-07-21T02:47:18Z","published":"2023-07-21T02:47:18Z","title":"Generating Image-Specific Text Improves Fine-grained Image\n Classification","summary":" Recent vision-language models outperform vision-only models on many image\nclassification tasks. However, because of the absence of paired text/image\ndescriptions, it remains difficult to fine-tune these models for fine-grained\nimage classification. In this work, we propose a method, GIST, for generating\nimage-specific fine-grained text descriptions from image-only datasets, and\nshow that these text descriptions can be used to improve classification. Key\nparts of our method include 1. prompting a pretrained large language model with\ndomain-specific prompts to generate diverse fine-grained text descriptions for\neach class and 2. using a pretrained vision-language model to match each image\nto label-preserving text descriptions that capture relevant visual features in\nthe image. We demonstrate the utility of GIST by fine-tuning vision-language\nmodels on the image-and-generated-text pairs to learn an aligned\nvision-language representation space for improved classification. We evaluate\nour learned representation space in full-shot and few-shot scenarios across\nfour diverse fine-grained classification datasets, each from a different\ndomain. Our method achieves an average improvement of $4.1\\%$ in accuracy over\nCLIP linear probes and an average of $1.1\\%$ improvement in accuracy over the\nprevious state-of-the-art image-text classification method on the full-shot\ndatasets. Our method achieves similar improvements across few-shot regimes.\nCode is available at https://github.com/emu1729/GIST.\n","authors":["Emily Mu","Kathleen M. Lewis","Adrian V. Dalca","John Guttag"],"pdf_url":"https://arxiv.org/pdf/2307.11315v1.pdf","comment":"The first two authors contributed equally to this work"},{"id":"http://arxiv.org/abs/2212.03434v5","updated":"2023-07-21T02:34:02Z","published":"2022-12-07T03:39:18Z","title":"Name Your Colour For the Task: Artificially Discover Colour Naming via\n Colour Quantisation Transformer","summary":" The long-standing theory that a colour-naming system evolves under dual\npressure of efficient communication and perceptual mechanism is supported by\nmore and more linguistic studies, including analysing four decades of\ndiachronic data from the Nafaanra language. This inspires us to explore whether\nmachine learning could evolve and discover a similar colour-naming system via\noptimising the communication efficiency represented by high-level recognition\nperformance. Here, we propose a novel colour quantisation transformer,\nCQFormer, that quantises colour space while maintaining the accuracy of machine\nrecognition on the quantised images. Given an RGB image, Annotation Branch maps\nit into an index map before generating the quantised image with a colour\npalette; meanwhile the Palette Branch utilises a key-point detection way to\nfind proper colours in the palette among the whole colour space. 
By interacting\nwith colour annotation, CQFormer is able to balance both the machine vision\naccuracy and colour perceptual structure such as distinct and stable colour\ndistribution for discovered colour system. Very interestingly, we even observe\nthe consistent evolution pattern between our artificial colour system and basic\ncolour terms across human languages. Besides, our colour quantisation method\nalso offers an efficient quantisation method that effectively compresses the\nimage storage while maintaining high performance in high-level recognition\ntasks such as classification and detection. Extensive experiments demonstrate\nthe superior performance of our method with extremely low bit-rate colours,\nshowing potential to integrate into quantisation network to quantities from\nimage to network activation. The source code is available at\nhttps://github.com/ryeocthiv/CQFormer\n","authors":["Shenghan Su","Lin Gu","Yue Yang","Zenghui Zhang","Tatsuya Harada"],"pdf_url":"https://arxiv.org/pdf/2212.03434v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.10769v3","updated":"2023-07-21T02:32:23Z","published":"2023-04-21T06:35:54Z","title":"Deep Multiview Clustering by Contrasting Cluster Assignments","summary":" Multiview clustering (MVC) aims to reveal the underlying structure of\nmultiview data by categorizing data samples into clusters. Deep learning-based\nmethods exhibit strong feature learning capabilities on large-scale datasets.\nFor most existing deep MVC methods, exploring the invariant representations of\nmultiple views is still an intractable problem. In this paper, we propose a\ncross-view contrastive learning (CVCL) method that learns view-invariant\nrepresentations and produces clustering results by contrasting the cluster\nassignments among multiple views. Specifically, we first employ deep\nautoencoders to extract view-dependent features in the pretraining stage. Then,\na cluster-level CVCL strategy is presented to explore consistent semantic label\ninformation among the multiple views in the fine-tuning stage. Thus, the\nproposed CVCL method is able to produce more discriminative cluster assignments\nby virtue of this learning strategy. Moreover, we provide a theoretical\nanalysis of soft cluster assignment alignment. Extensive experimental results\nobtained on several datasets demonstrate that the proposed CVCL method\noutperforms several state-of-the-art approaches.\n","authors":["Jie Chen","Hua Mao","Wai Lok Woo","Xi Peng"],"pdf_url":"https://arxiv.org/pdf/2304.10769v3.pdf","comment":"10 pages, 7 figures"},{"id":"http://arxiv.org/abs/2307.11308v1","updated":"2023-07-21T02:28:54Z","published":"2023-07-21T02:28:54Z","title":"DPM-OT: A New Diffusion Probabilistic Model Based on Optimal Transport","summary":" Sampling from diffusion probabilistic models (DPMs) can be viewed as a\npiecewise distribution transformation, which generally requires hundreds or\nthousands of steps of the inverse diffusion trajectory to get a high-quality\nimage. Recent progress in designing fast samplers for DPMs achieves a trade-off\nbetween sampling speed and sample quality by knowledge distillation or\nadjusting the variance schedule or the denoising equation. However, it can't be\noptimal in both aspects and often suffer from mode mixture in short steps. 
To\ntackle this problem, we innovatively regard inverse diffusion as an optimal\ntransport (OT) problem between latents at different stages and propose the\nDPM-OT, a unified learning framework for fast DPMs with a direct expressway\nrepresented by OT map, which can generate high-quality samples within around 10\nfunction evaluations. By calculating the semi-discrete optimal transport map\nbetween the data latents and the white noise, we obtain an expressway from the\nprior distribution to the data distribution, while significantly alleviating\nthe problem of mode mixture. In addition, we give the error bound of the\nproposed method, which theoretically guarantees the stability of the algorithm.\nExtensive experiments validate the effectiveness and advantages of DPM-OT in\nterms of speed and quality (FID and mode mixture), thus representing an\nefficient solution for generative modeling. Source codes are available at\nhttps://github.com/cognaclee/DPM-OT\n","authors":["Zezeng Li","ShengHao Li","Zhanpeng Wang","Na Lei","Zhongxuan Luo","Xianfeng Gu"],"pdf_url":"https://arxiv.org/pdf/2307.11308v1.pdf","comment":"iccv2023 accepted"},{"id":"http://arxiv.org/abs/2301.06262v3","updated":"2023-07-21T02:28:28Z","published":"2023-01-16T05:08:50Z","title":"Collaborative Perception in Autonomous Driving: Methods, Datasets and\n Challenges","summary":" Collaborative perception is essential to address occlusion and sensor failure\nissues in autonomous driving. In recent years, theoretical and experimental\ninvestigations of novel works for collaborative perception have increased\ntremendously. So far, however, few reviews have focused on systematical\ncollaboration modules and large-scale collaborative perception datasets. This\nwork reviews recent achievements in this field to bridge this gap and motivate\nfuture research. We start with a brief overview of collaboration schemes. After\nthat, we systematically summarize the collaborative perception methods for\nideal scenarios and real-world issues. The former focuses on collaboration\nmodules and efficiency, and the latter is devoted to addressing the problems in\nactual application. Furthermore, we present large-scale public datasets and\nsummarize quantitative results on these benchmarks. Finally, we highlight gaps\nand overlook challenges between current academic research and real-world\napplications. The project page is\nhttps://github.com/CatOneTwo/Collaborative-Perception-in-Autonomous-Driving\n","authors":["Yushan Han","Hui Zhang","Huifang Li","Yi Jin","Congyan Lang","Yidong Li"],"pdf_url":"https://arxiv.org/pdf/2301.06262v3.pdf","comment":"18 pages, 6 figures. Accepted by IEEE Intelligent Transportation\n Systems Magazine. URL:\n https://github.com/CatOneTwo/Collaborative-Perception-in-Autonomous-Driving"},{"id":"http://arxiv.org/abs/2307.11307v1","updated":"2023-07-21T02:28:20Z","published":"2023-07-21T02:28:20Z","title":"EndoSurf: Neural Surface Reconstruction of Deformable Tissues with\n Stereo Endoscope Videos","summary":" Reconstructing soft tissues from stereo endoscope videos is an essential\nprerequisite for many medical applications. Previous methods struggle to\nproduce high-quality geometry and appearance due to their inadequate\nrepresentations of 3D scenes. To address this issue, we propose a novel\nneural-field-based method, called EndoSurf, which effectively learns to\nrepresent a deforming surface from an RGBD sequence. In EndoSurf, we model\nsurface dynamics, shape, and texture with three neural fields. 
First, 3D points\nare transformed from the observed space to the canonical space using the\ndeformation field. The signed distance function (SDF) field and radiance field\nthen predict their SDFs and colors, respectively, with which RGBD images can be\nsynthesized via differentiable volume rendering. We constrain the learned shape\nby tailoring multiple regularization strategies and disentangling geometry and\nappearance. Experiments on public endoscope datasets demonstrate that EndoSurf\nsignificantly outperforms existing solutions, particularly in reconstructing\nhigh-fidelity shapes. Code is available at\nhttps://github.com/Ruyi-Zha/endosurf.git.\n","authors":["Ruyi Zha","Xuelian Cheng","Hongdong Li","Mehrtash Harandi","Zongyuan Ge"],"pdf_url":"https://arxiv.org/pdf/2307.11307v1.pdf","comment":"MICCAI 2023 (Early Accept); Ruyi Zha and Xuelian Cheng made equal\n contributions. Corresponding author: Ruyi Zha (ruyi.zha@gmail.com)"},{"id":"http://arxiv.org/abs/2307.10711v2","updated":"2023-07-21T02:06:41Z","published":"2023-07-20T09:06:21Z","title":"AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of\n Diffusion Probabilistic Models","summary":" Existing customization methods require access to multiple reference examples\nto align pre-trained diffusion probabilistic models (DPMs) with user-provided\nconcepts. This paper aims to address the challenge of DPM customization when\nthe only available supervision is a differentiable metric defined on the\ngenerated contents. Since the sampling procedure of DPMs involves recursive\ncalls to the denoising UNet, na\\\"ive gradient backpropagation requires storing\nthe intermediate states of all iterations, resulting in extremely high memory\nconsumption. To overcome this issue, we propose a novel method AdjointDPM,\nwhich first generates new samples from diffusion models by solving the\ncorresponding probability-flow ODEs. It then uses the adjoint sensitivity\nmethod to backpropagate the gradients of the loss to the models' parameters\n(including conditioning signals, network weights, and initial noises) by\nsolving another augmented ODE. To reduce numerical errors in both the forward\ngeneration and gradient backpropagation processes, we further reparameterize\nthe probability-flow ODE and augmented ODE as simple non-stiff ODEs using\nexponential integration. Finally, we demonstrate the effectiveness of\nAdjointDPM on three interesting tasks: converting visual effects into\nidentification text embeddings, finetuning DPMs for specific types of\nstylization, and optimizing initial noise to generate adversarial samples for\nsecurity auditing.\n","authors":["Jiachun Pan","Jun Hao Liew","Vincent Y. F. Tan","Jiashi Feng","Hanshu Yan"],"pdf_url":"https://arxiv.org/pdf/2307.10711v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.04973v2","updated":"2023-07-21T01:40:31Z","published":"2023-02-09T23:25:28Z","title":"Invariant Slot Attention: Object Discovery with Slot-Centric Reference\n Frames","summary":" Automatically discovering composable abstractions from raw perceptual data is\na long-standing challenge in machine learning. Recent slot-based neural\nnetworks that learn about objects in a self-supervised manner have made\nexciting progress in this direction. However, they typically fall short at\nadequately capturing spatial symmetries present in the visual world, which\nleads to sample inefficiency, such as when entangling object appearance and\npose. 
In this paper, we present a simple yet highly effective method for\nincorporating spatial symmetries via slot-centric reference frames. We\nincorporate equivariance to per-object pose transformations into the attention\nand generation mechanism of Slot Attention by translating, scaling, and\nrotating position encodings. These changes result in little computational\noverhead, are easy to implement, and can result in large gains in terms of data\nefficiency and overall improvements to object discovery. We evaluate our method\non a wide range of synthetic object discovery benchmarks namely CLEVR,\nTetrominoes, CLEVRTex, Objects Room and MultiShapeNet, and show promising\nimprovements on the challenging real-world Waymo Open dataset.\n","authors":["Ondrej Biza","Sjoerd van Steenkiste","Mehdi S. M. Sajjadi","Gamaleldin F. Elsayed","Aravindh Mahendran","Thomas Kipf"],"pdf_url":"https://arxiv.org/pdf/2302.04973v2.pdf","comment":"Accepted at ICML 2023. Project page: https://invariantsa.github.io/"},{"id":"http://arxiv.org/abs/2307.11285v1","updated":"2023-07-21T01:04:52Z","published":"2023-07-21T01:04:52Z","title":"MAS: Towards Resource-Efficient Federated Multiple-Task Learning","summary":" Federated learning (FL) is an emerging distributed machine learning method\nthat empowers in-situ model training on decentralized edge devices. However,\nmultiple simultaneous FL tasks could overload resource-constrained devices. In\nthis work, we propose the first FL system to effectively coordinate and train\nmultiple simultaneous FL tasks. We first formalize the problem of training\nsimultaneous FL tasks. Then, we present our new approach, MAS (Merge and\nSplit), to optimize the performance of training multiple simultaneous FL tasks.\nMAS starts by merging FL tasks into an all-in-one FL task with a multi-task\narchitecture. After training for a few rounds, MAS splits the all-in-one FL\ntask into two or more FL tasks by using the affinities among tasks measured\nduring the all-in-one training. It then continues training each split of FL\ntasks based on model parameters from the all-in-one training. Extensive\nexperiments demonstrate that MAS outperforms other methods while reducing\ntraining time by 2x and reducing energy consumption by 40%. We hope this work\nwill inspire the community to further study and optimize training simultaneous\nFL tasks.\n","authors":["Weiming Zhuang","Yonggang Wen","Lingjuan Lyu","Shuai Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.11285v1.pdf","comment":"ICCV'23. arXiv admin note: substantial text overlap with\n arXiv:2207.04202"},{"id":"http://arxiv.org/abs/2307.11274v1","updated":"2023-07-21T00:15:56Z","published":"2023-07-21T00:15:56Z","title":"Screening Mammography Breast Cancer Detection","summary":" Breast cancer is a leading cause of cancer-related deaths, but current\nprograms are expensive and prone to false positives, leading to unnecessary\nfollow-up and patient anxiety. This paper proposes a solution to automated\nbreast cancer detection, to improve the efficiency and accuracy of screening\nprograms. Different methodologies were tested against the RSNA dataset of\nradiographic breast images of roughly 20,000 female patients and yielded an\naverage validation case pF1 score of 0.56 across methods.\n","authors":["Debajyoti Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2307.11274v1.pdf","comment":"Released @ Apr 2023. 
For associated project files, see\n https://github.com/chakrabortyde/rsna-breast-cancer"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2307.06576v3","updated":"2023-07-21T16:06:32Z","published":"2023-07-13T06:25:22Z","title":"Going Beyond Local: Global Graph-Enhanced Personalized News\n Recommendations","summary":" Precisely recommending candidate news articles to users has always been a\ncore challenge for personalized news recommendation systems. Most recent works\nprimarily focus on using advanced natural language processing techniques to\nextract semantic information from rich textual data, employing content-based\nmethods derived from local historical news. However, this approach lacks a\nglobal perspective, failing to account for users' hidden motivations and\nbehaviors beyond semantic information. To address this challenge, we propose a\nnovel model called GLORY (Global-LOcal news Recommendation sYstem), which\ncombines global representations learned from other users with local\nrepresentations to enhance personalized recommendation systems. We accomplish\nthis by constructing a Global-aware Historical News Encoder, which includes a\nglobal news graph and employs gated graph neural networks to enrich news\nrepresentations, thereby fusing historical news representations by a historical\nnews aggregator. Similarly, we extend this approach to a Global Candidate News\nEncoder, utilizing a global entity graph and a candidate news aggregator to\nenhance candidate news representation. Evaluation results on two public news\ndatasets demonstrate that our method outperforms existing approaches.\nFurthermore, our model offers more diverse recommendations.\n","authors":["Boming Yang","Dairui Liu","Toyotaro Suzumura","Ruihai Dong","Irene Li"],"pdf_url":"https://arxiv.org/pdf/2307.06576v3.pdf","comment":"10 pages, Recsys 2023"},{"id":"http://arxiv.org/abs/2307.11650v1","updated":"2023-07-21T15:28:47Z","published":"2023-07-21T15:28:47Z","title":"Alleviating the Long-Tail Problem in Conversational Recommender Systems","summary":" Conversational recommender systems (CRS) aim to provide the recommendation\nservice via natural language conversations. To develop an effective CRS,\nhigh-quality CRS datasets are very crucial. However, existing CRS datasets\nsuffer from the long-tail issue, \\ie a large proportion of items are rarely (or\neven never) mentioned in the conversations, which are called long-tail items.\nAs a result, the CRSs trained on these datasets tend to recommend frequent\nitems, and the diversity of the recommended items would be largely reduced,\nmaking users easier to get bored.\n To address this issue, this paper presents \\textbf{LOT-CRS}, a novel\nframework that focuses on simulating and utilizing a balanced CRS dataset (\\ie\ncovering all the items evenly) for improving \\textbf{LO}ng-\\textbf{T}ail\nrecommendation performance of CRSs. In our approach, we design two pre-training\ntasks to enhance the understanding of simulated conversation for long-tail\nitems, and adopt retrieval-augmented fine-tuning with label smoothness strategy\nto further improve the recommendation of long-tail items. 
Extensive experiments\non two public CRS datasets have demonstrated the effectiveness and\nextensibility of our approach, especially on long-tail recommendation.\n","authors":["Zhipeng Zhao","Kun Zhou","Xiaolei Wang","Wayne Xin Zhao","Fan Pan","Zhao Cao","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2307.11650v1.pdf","comment":"work in progress"},{"id":"http://arxiv.org/abs/2307.11496v1","updated":"2023-07-21T11:04:20Z","published":"2023-07-21T11:04:20Z","title":"Identifying document similarity using a fast estimation of the\n Levenshtein Distance based on compression and signatures","summary":" Identifying document similarity has many applications, e.g., source code\nanalysis or plagiarism detection. However, identifying similarities is not\ntrivial and can be time complex. For instance, the Levenshtein Distance is a\ncommon metric to define the similarity between two documents but has quadratic\nruntime which makes it impractical for large documents where large starts with\na few hundred kilobytes. In this paper, we present a novel concept that allows\nestimating the Levenshtein Distance: the algorithm first compresses documents\nto signatures (similar to hash values) using a user-defined compression ratio.\nSignatures can then be compared against each other (some constrains apply)\nwhere the outcome is the estimated Levenshtein Distance. Our evaluation shows\npromising results in terms of runtime efficiency and accuracy. In addition, we\nintroduce a significance score allowing examiners to set a threshold and\nidentify related documents.\n","authors":["Peter Coates","Frank Breitinger"],"pdf_url":"https://arxiv.org/pdf/2307.11496v1.pdf","comment":"In: Proceedings of the Digital Forensics Research Conference Europe\n (DFRWS EU). 2022"},{"id":"http://arxiv.org/abs/2307.10617v2","updated":"2023-07-21T09:49:15Z","published":"2023-07-20T06:35:43Z","title":"Unmasking Falsehoods in Reviews: An Exploration of NLP Techniques","summary":" In the contemporary digital landscape, online reviews have become an\nindispensable tool for promoting products and services across various\nbusinesses. Marketers, advertisers, and online businesses have found incentives\nto create deceptive positive reviews for their products and negative reviews\nfor their competitors' offerings. As a result, the writing of deceptive reviews\nhas become an unavoidable practice for businesses seeking to promote themselves\nor undermine their rivals. Detecting such deceptive reviews has become an\nintense and ongoing area of research. This research paper proposes a machine\nlearning model to identify deceptive reviews, with a particular focus on\nrestaurants. This study delves into the performance of numerous experiments\nconducted on a dataset of restaurant reviews known as the Deceptive Opinion\nSpam Corpus. To accomplish this, an n-gram model and max features are developed\nto effectively identify deceptive content, particularly focusing on fake\nreviews. A benchmark study is undertaken to explore the performance of two\ndifferent feature extraction techniques, which are then coupled with five\ndistinct machine learning classification algorithms. The experimental results\nreveal that the passive aggressive classifier stands out among the various\nalgorithms, showcasing the highest accuracy not only in text classification but\nalso in identifying fake reviews. Moreover, the research delves into data\naugmentation and implements various deep learning techniques to further enhance\nthe process of detecting deceptive reviews. 
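[Editor's note: for readers who want a concrete starting point, a minimal sketch of the n-gram plus passive-aggressive setup described in the review-spam abstract above could look as follows. The toy reviews, labels, and hyperparameters are placeholders, not the Deceptive Opinion Spam Corpus or the paper's exact configuration.]

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Toy data standing in for the Deceptive Opinion Spam Corpus (1 = deceptive).
reviews = [
    "the staff was friendly and the room was spotless",
    "best hotel ever everything was perfect perfect perfect",
    "the location was convenient but the bathroom was dated",
    "absolutely flawless stay i will tell everyone i know",
]
labels = [0, 1, 0, 1]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), max_features=5000),  # unigram + bigram features
    PassiveAggressiveClassifier(max_iter=1000, random_state=0),
)
model.fit(reviews, labels)
print(model.predict(["the room was perfect perfect and flawless"]))
```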
The findings shed light on the\nefficacy of the proposed machine learning approach and offer valuable insights\ninto dealing with deceptive reviews in the realm of online businesses.\n","authors":["Anusuya Baby Hari Krishnan"],"pdf_url":"https://arxiv.org/pdf/2307.10617v2.pdf","comment":"6 pages, 3 figures"},{"id":"http://arxiv.org/abs/2306.02250v2","updated":"2023-07-21T07:46:03Z","published":"2023-06-04T03:46:45Z","title":"Large Language Model Augmented Narrative Driven Recommendations","summary":" Narrative-driven recommendation (NDR) presents an information access problem\nwhere users solicit recommendations with verbose descriptions of their\npreferences and context, for example, travelers soliciting recommendations for\npoints of interest while describing their likes/dislikes and travel\ncircumstances. These requests are increasingly important with the rise of\nnatural language-based conversational interfaces for search and recommendation\nsystems. However, NDR lacks abundant training data for models, and current\nplatforms commonly do not support these requests. Fortunately, classical\nuser-item interaction datasets contain rich textual data, e.g., reviews, which\noften describe user preferences and context - this may be used to bootstrap\ntraining for NDR models. In this work, we explore using large language models\n(LLMs) for data augmentation to train NDR models. We use LLMs for authoring\nsynthetic narrative queries from user-item interactions with few-shot prompting\nand train retrieval models for NDR on synthetic queries and user-item\ninteraction data. Our experiments demonstrate that this is an effective\nstrategy for training small-parameter retrieval models that outperform other\nretrieval and LLM baselines for narrative-driven recommendation.\n","authors":["Sheshera Mysore","Andrew McCallum","Hamed Zamani"],"pdf_url":"https://arxiv.org/pdf/2306.02250v2.pdf","comment":"RecSys 2023 Camera-ready"},{"id":"http://arxiv.org/abs/2304.04250v2","updated":"2023-07-21T07:39:58Z","published":"2023-04-09T14:52:18Z","title":"Editable User Profiles for Controllable Text Recommendation","summary":" Methods for making high-quality recommendations often rely on learning latent\nrepresentations from interaction data. These methods, while performant, do not\nprovide ready mechanisms for users to control the recommendation they receive.\nOur work tackles this problem by proposing LACE, a novel concept value\nbottleneck model for controllable text recommendations. LACE represents each\nuser with a succinct set of human-readable concepts through retrieval given\nuser-interacted documents and learns personalized representations of the\nconcepts based on user documents. This concept based user profile is then\nleveraged to make recommendations. The design of our model affords control over\nthe recommendations through a number of intuitive interactions with a\ntransparent user profile. We first establish the quality of recommendations\nobtained from LACE in an offline evaluation on three recommendation tasks\nspanning six datasets in warm-start, cold-start, and zero-shot setups. 
Next, we\nvalidate the controllability of LACE under simulated user interactions.\nFinally, we implement LACE in an interactive controllable recommender system\nand conduct a user study to demonstrate that users are able to improve the\nquality of recommendations they receive through interactions with an editable\nuser profile.\n","authors":["Sheshera Mysore","Mahmood Jasim","Andrew McCallum","Hamed Zamani"],"pdf_url":"https://arxiv.org/pdf/2304.04250v2.pdf","comment":"SIGIR-2023 Camera Ready"},{"id":"http://arxiv.org/abs/2307.11325v1","updated":"2023-07-21T03:23:17Z","published":"2023-07-21T03:23:17Z","title":"Analysis of Elephant Movement in Sub-Saharan Africa: Ecological,\n Climatic, and Conservation Perspectives","summary":" The interaction between elephants and their environment has profound\nimplications for both ecology and conservation strategies. This study presents\nan analytical approach to decipher the intricate patterns of elephant movement\nin Sub-Saharan Africa, concentrating on key ecological drivers such as seasonal\nvariations and rainfall patterns. Despite the complexities surrounding these\ninfluential factors, our analysis provides a holistic view of elephant\nmigratory behavior in the context of the dynamic African landscape. Our\ncomprehensive approach enables us to predict the potential impact of these\necological determinants on elephant migration, a critical step in establishing\ninformed conservation strategies. This projection is particularly crucial given\nthe impacts of global climate change on seasonal and rainfall patterns, which\ncould substantially influence elephant movements in the future. The findings of\nour work aim to not only advance the understanding of movement ecology but also\nfoster a sustainable coexistence of humans and elephants in Sub-Saharan Africa.\nBy predicting potential elephant routes, our work can inform strategies to\nminimize human-elephant conflict, effectively manage land use, and enhance\nanti-poaching efforts. This research underscores the importance of integrating\nmovement ecology and climatic variables for effective wildlife management and\nconservation planning.\n","authors":["Matthew Hines","Gregory Glatzer","Shreya Ghosh","Prasenjit Mitra"],"pdf_url":"https://arxiv.org/pdf/2307.11325v1.pdf","comment":"11 pages, 17 figures, Accepted in ACM SIGCAS SIGCHI Conference on\n Computing and Sustainable Societies (COMPASS 2023)"},{"id":"http://arxiv.org/abs/2307.10479v2","updated":"2023-07-21T19:24:10Z","published":"2023-07-19T22:20:06Z","title":"Fast Approximate Nearest Neighbor Search with a Dynamic Exploration\n Graph using Continuous Refinement","summary":" For approximate nearest neighbor search, graph-based algorithms have shown to\noffer the best trade-off between accuracy and search time. We propose the\nDynamic Exploration Graph (DEG) which significantly outperforms existing\nalgorithms in terms of search and exploration efficiency by combining two new\nideas: First, a single undirected even regular graph is incrementally built by\npartially replacing existing edges to integrate new vertices and to update old\nneighborhoods at the same time. Secondly, an edge optimization algorithm is\nused to continuously improve the quality of the graph. 
Combining this ongoing\nrefinement with the graph construction process leads to a well-organized graph\nstructure at all times, resulting in: (1) increased search efficiency, (2)\npredictable index size, (3) guaranteed connectivity and therefore reachability\nof all vertices, and (4) a dynamic graph structure. In addition we investigate\nhow well existing graph-based search systems can handle indexed queries where\nthe seed vertex of a search is the query itself. Such exploration tasks,\ndespite their good starting point, are not necessarily easy. High efficiency in\napproximate nearest neighbor search (ANNS) does not automatically imply good\nperformance in exploratory search. Extensive experiments show that our new\nDynamic Exploration Graph outperforms existing algorithms significantly for\nindexed and unindexed queries.\n","authors":["Nico Hezel","Kai Uwe Barthel","Konstantin Schall","Klaus Jung"],"pdf_url":"https://arxiv.org/pdf/2307.10479v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11848v1","updated":"2023-07-21T18:35:24Z","published":"2023-07-21T18:35:24Z","title":"MythQA: Query-Based Large-Scale Check-Worthy Claim Detection through\n Multi-Answer Open-Domain Question Answering","summary":" Check-worthy claim detection aims at providing plausible misinformation to\ndownstream fact-checking systems or human experts to check. This is a crucial\nstep toward accelerating the fact-checking process. Many efforts have been put\ninto how to identify check-worthy claims from a small scale of pre-collected\nclaims, but how to efficiently detect check-worthy claims directly from a\nlarge-scale information source, such as Twitter, remains underexplored. To fill\nthis gap, we introduce MythQA, a new multi-answer open-domain question\nanswering(QA) task that involves contradictory stance mining for query-based\nlarge-scale check-worthy claim detection. The idea behind this is that\ncontradictory claims are a strong indicator of misinformation that merits\nscrutiny by the appropriate authorities. To study this task, we construct\nTweetMythQA, an evaluation dataset containing 522 factoid multi-answer\nquestions based on controversial topics. Each question is annotated with\nmultiple answers. Moreover, we collect relevant tweets for each distinct\nanswer, then classify them into three categories: \"Supporting\", \"Refuting\", and\n\"Neutral\". In total, we annotated 5.3K tweets. Contradictory evidence is\ncollected for all answers in the dataset. Finally, we present a baseline system\nfor MythQA and evaluate existing NLP models for each system component using the\nTweetMythQA dataset. We provide initial benchmarks and identify key challenges\nfor future models to improve upon. Code and data are available at:\nhttps://github.com/TonyBY/Myth-QA\n","authors":["Yang Bai","Anthony Colas","Daisy Zhe Wang"],"pdf_url":"https://arxiv.org/pdf/2307.11848v1.pdf","comment":"Accepted by SIGIR 2023"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2307.11749v1","updated":"2023-07-21T17:59:15Z","published":"2023-07-21T17:59:15Z","title":"Differentially Private Heavy Hitter Detection using Federated Analytics","summary":" In this work, we study practical heuristics to improve the performance of\nprefix-tree based algorithms for differentially private heavy hitter detection.\nOur model assumes each user has multiple data points and the goal is to learn\nas many of the most frequent data points as possible across all users' data\nwith aggregate and local differential privacy. 
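[Editor's background note: prefix-tree heavy-hitter algorithms of this kind are typically built on a locally differentially private frequency oracle. The sketch below shows one standard such oracle, k-ary (generalized) randomized response with its unbiased frequency estimator; it is a generic illustration with made-up parameters, not the mechanism used in the paper.]

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps, n = 8, 1.0, 50_000                     # domain size, privacy budget, users
true_items = rng.choice(k, size=n, p=np.r_[0.4, 0.2, np.full(6, 0.4 / 6)])

p = np.exp(eps) / (np.exp(eps) + k - 1)        # probability of reporting the true item
q = 1.0 / (np.exp(eps) + k - 1)                # probability of each other item

keep = rng.random(n) < p
noise = rng.integers(0, k - 1, size=n)
noise = noise + (noise >= true_items)          # uniform over the k-1 other items
reports = np.where(keep, true_items, noise)

counts = np.bincount(reports, minlength=k)
est_freq = (counts / n - q) / (p - q)          # unbiased per-item frequency estimates
print(np.round(est_freq, 3))
```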
We propose an adaptive\nhyperparameter tuning algorithm that improves the performance of the algorithm\nwhile satisfying computational, communication and privacy constraints. We\nexplore the impact of different data-selection schemes as well as the impact of\nintroducing deny lists during multiple runs of the algorithm. We test these\nimprovements using extensive experimentation on the Reddit\ndataset~\\cite{caldas2018leaf} on the task of learning the most frequent words.\n","authors":["Karan Chadha","Junye Chen","John Duchi","Vitaly Feldman","Hanieh Hashemi","Omid Javidbakht","Audra McMillan","Kunal Talwar"],"pdf_url":"https://arxiv.org/pdf/2307.11749v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.03327v2","updated":"2023-07-21T17:54:14Z","published":"2023-03-06T17:54:33Z","title":"Tight Bounds for $γ$-Regret via the Decision-Estimation Coefficient","summary":" In this work, we give a statistical characterization of the $\\gamma$-regret\nfor arbitrary structured bandit problems, the regret which arises when\ncomparing against a benchmark that is $\\gamma$ times the optimal solution. The\n$\\gamma$-regret emerges in structured bandit problems over a function class\n$\\mathcal{F}$ where finding an exact optimum of $f \\in \\mathcal{F}$ is\nintractable. Our characterization is given in terms of the $\\gamma$-DEC, a\nstatistical complexity parameter for the class $\\mathcal{F}$, which is a\nmodification of the constrained Decision-Estimation Coefficient (DEC) of Foster\net al., 2023 (and closely related to the original offset DEC of Foster et al.,\n2021). Our lower bound shows that the $\\gamma$-DEC is a fundamental limit for\nany model class $\\mathcal{F}$: for any algorithm, there exists some $f \\in\n\\mathcal{F}$ for which the $\\gamma$-regret of that algorithm scales (nearly)\nwith the $\\gamma$-DEC of $\\mathcal{F}$. We provide an upper bound showing that\nthere exists an algorithm attaining a nearly matching $\\gamma$-regret. Due to\nsignificant challenges in applying the prior results on the DEC to the\n$\\gamma$-regret case, both our lower and upper bounds require novel techniques\nand a new algorithm.\n","authors":["Margalit Glasgow","Alexander Rakhlin"],"pdf_url":"https://arxiv.org/pdf/2303.03327v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11732v1","updated":"2023-07-21T17:45:28Z","published":"2023-07-21T17:45:28Z","title":"Advancing Ad Auction Realism: Practical Insights & Modeling Implications","summary":" This paper proposes a learning model of online ad auctions that allows for\nthe following four key realistic characteristics of contemporary online\nauctions: (1) ad slots can have different values and click-through rates\ndepending on users' search queries, (2) the number and identity of competing\nadvertisers are unobserved and change with each auction, (3) advertisers only\nreceive partial, aggregated feedback, and (4) payment rules are only partially\nspecified. We model advertisers as agents governed by an adversarial bandit\nalgorithm, independent of auction mechanism intricacies. Our objective is to\nsimulate the behavior of advertisers for counterfactual analysis, prediction,\nand inference purposes. Our findings reveal that, in such richer environments,\n\"soft floors\" can enhance key performance metrics even when bidders are drawn\nfrom the same population. 
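[Editor's note, making the modeling choice above concrete: an advertiser "governed by an adversarial bandit algorithm" can be simulated with a textbook EXP3 learner that treats a discrete bid grid as its arms. The sketch below is a generic EXP3 loop with a made-up reward function and auction, not the authors' simulator.]

```python
import numpy as np

rng = np.random.default_rng(0)
bids = np.linspace(0.1, 1.0, 10)               # discrete bid levels = bandit arms
K, T, gamma = len(bids), 3000, 0.05
weights = np.ones(K)

def auction_reward(bid, value=1.0):
    """Toy auction outcome: win (and pay the bid) if it beats a random rival."""
    rival = rng.uniform(0.2, 0.9)
    return max(value - bid, 0.0) if bid >= rival else 0.0

for _ in range(T):
    probs = (1 - gamma) * weights / weights.sum() + gamma / K
    arm = rng.choice(K, p=probs)
    reward = auction_reward(bids[arm])          # bandit feedback: only the chosen arm
    weights[arm] *= np.exp(gamma * (reward / probs[arm]) / K)
    weights /= weights.max()                    # rescaling all weights keeps probs unchanged

print("preferred bid level:", bids[np.argmax(weights)])
```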
We further demonstrate how to infer advertiser value\ndistributions from observed bids, thereby affirming the practical efficacy of\nour approach even in a more realistic auction setting.\n","authors":["Ming Chen","Sareh Nabi","Marciano Siniscalchi"],"pdf_url":"https://arxiv.org/pdf/2307.11732v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11730v1","updated":"2023-07-21T17:43:50Z","published":"2023-07-21T17:43:50Z","title":"Mitigating Communications Threats in Decentralized Federated Learning\n through Moving Target Defense","summary":" The rise of Decentralized Federated Learning (DFL) has enabled the training\nof machine learning models across federated participants, fostering\ndecentralized model aggregation and reducing dependence on a server. However,\nthis approach introduces unique communication security challenges that have yet\nto be thoroughly addressed in the literature. These challenges primarily\noriginate from the decentralized nature of the aggregation process, the varied\nroles and responsibilities of the participants, and the absence of a central\nauthority to oversee and mitigate threats. Addressing these challenges, this\npaper first delineates a comprehensive threat model, highlighting the potential\nrisks of DFL communications. In response to these identified risks, this work\nintroduces a security module designed for DFL platforms to counter\ncommunication-based attacks. The module combines security techniques such as\nsymmetric and asymmetric encryption with Moving Target Defense (MTD)\ntechniques, including random neighbor selection and IP/port switching. The\nsecurity module is implemented in a DFL platform called Fedstellar, allowing\nthe deployment and monitoring of the federation. A DFL scenario has been\ndeployed, involving eight physical devices implementing three security\nconfigurations: (i) a baseline with no security, (ii) an encrypted\nconfiguration, and (iii) a configuration integrating both encryption and MTD\ntechniques. The effectiveness of the security module is validated through\nexperiments with the MNIST dataset and eclipse attacks. The results indicated\nan average F1 score of 95%, with moderate increases in CPU usage (up to 63.2%\n+-3.5%) and network traffic (230 MB +-15 MB) under the most secure\nconfiguration, mitigating the risks posed by eavesdropping or eclipse attacks.\n","authors":["Enrique Tomás Martínez Beltrán","Pedro Miguel Sánchez Sánchez","Sergio López Bernal","Gérôme Bovet","Manuel Gil Pérez","Gregorio Martínez Pérez","Alberto Huertas Celdrán"],"pdf_url":"https://arxiv.org/pdf/2307.11730v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.10870v2","updated":"2023-07-21T17:37:26Z","published":"2023-02-21T18:34:51Z","title":"On Provable Copyright Protection for Generative Models","summary":" There is a growing concern that learned conditional generative models may\noutput samples that are substantially similar to some copyrighted data $C$ that\nwas in their training set. We give a formal definition of $\\textit{near\naccess-freeness (NAF)}$ and prove bounds on the probability that a model\nsatisfying this definition outputs a sample similar to $C$, even if $C$ is\nincluded in its training set. Roughly speaking, a generative model $p$ is\n$\\textit{$k$-NAF}$ if for every potentially copyrighted data $C$, the output of\n$p$ diverges by at most $k$-bits from the output of a model $q$ that\n$\\textit{did not access $C$ at all}$. 
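[Editor's note: for readers who want the quantifier structure of "diverges by at most k bits" spelled out, one natural reading is via the max-divergence in base 2. The notation below is a hedged formalization consistent with the abstract, not necessarily the paper's exact definitions.]

```latex
% A model $p$ is $k$-NAF (under this hedged reading) if, for every copyrighted
% datum $C$ and every prompt $z$,
\[
  D_{\infty}\!\left(p(\cdot \mid z)\,\middle\|\,q_{C}(\cdot \mid z)\right)
  \;=\; \max_{y}\,\log_{2}\frac{p(y \mid z)}{q_{C}(y \mid z)} \;\le\; k ,
\]
% where $q_{C}$ is a model trained without access to $C$.  Any event $E$
% (e.g., "the sample is substantially similar to $C$") then satisfies
\[
  \Pr_{y \sim p(\cdot\mid z)}\!\left[y \in E\right] \;\le\; 2^{k}\,
  \Pr_{y \sim q_{C}(\cdot\mid z)}\!\left[y \in E\right].
\]
```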
We also give generative model learning\nalgorithms, which efficiently modify the original generative model learning\nalgorithm in a black box manner, that output generative models with strong\nbounds on the probability of sampling protected content. Furthermore, we\nprovide promising experiments for both language (transformers) and image\n(diffusion) generative models, showing minimal degradation in output quality\nwhile ensuring strong protections against sampling protected content.\n","authors":["Nikhil Vyas","Sham Kakade","Boaz Barak"],"pdf_url":"https://arxiv.org/pdf/2302.10870v2.pdf","comment":"Accepted at ICML 2023"},{"id":"http://arxiv.org/abs/2307.10496v2","updated":"2023-07-21T17:34:51Z","published":"2023-07-19T23:29:40Z","title":"A Competitive Learning Approach for Specialized Models: A Solution for\n Complex Physical Systems with Distinct Functional Regimes","summary":" Complex systems in science and engineering sometimes exhibit behavior that\nchanges across different regimes. Traditional global models struggle to capture\nthe full range of this complex behavior, limiting their ability to accurately\nrepresent the system. In response to this challenge, we propose a novel\ncompetitive learning approach for obtaining data-driven models of physical\nsystems. The primary idea behind the proposed approach is to employ dynamic\nloss functions for a set of models that are trained concurrently on the data.\nEach model competes for each observation during training, allowing for the\nidentification of distinct functional regimes within the dataset. To\ndemonstrate the effectiveness of the learning approach, we coupled it with\nvarious regression methods that employ gradient-based optimizers for training.\nThe proposed approach was tested on various problems involving model discovery\nand function approximation, demonstrating its ability to successfully identify\nfunctional regimes, discover true governing equations, and reduce test errors.\n","authors":["Okezzi F. Ukorigho","Opeoluwa Owoyele"],"pdf_url":"https://arxiv.org/pdf/2307.10496v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.08736v2","updated":"2023-07-21T17:21:57Z","published":"2022-12-16T22:18:48Z","title":"A Neural Network Warm-Start Approach for the Inverse Acoustic Obstacle\n Scattering Problem","summary":" We consider the inverse acoustic obstacle problem for sound-soft star-shaped\nobstacles in two dimensions wherein the boundary of the obstacle is determined\nfrom measurements of the scattered field at a collection of receivers outside\nthe object. One of the standard approaches for solving this problem is to\nreformulate it as an optimization problem: finding the boundary of the domain\nthat minimizes the $L^2$ distance between computed values of the scattered\nfield and the given measurement data. The optimization problem is\ncomputationally challenging since the local set of convexity shrinks with\nincreasing frequency and results in an increasing number of local minima in the\nvicinity of the true solution. In many practical experimental settings, low\nfrequency measurements are unavailable due to limitations of the experimental\nsetup or the sensors used for measurement. Thus, obtaining a good initial guess\nfor the optimization problem plays a vital role in this environment.\n We present a neural network warm-start approach for solving the inverse\nscattering problem, where an initial guess for the optimization problem is\nobtained using a trained neural network. 
We demonstrate the effectiveness of\nour method with several numerical examples. For high frequency problems, this\napproach outperforms traditional iterative methods such as Gauss-Newton\ninitialized without any prior (i.e., initialized using a unit circle), or\ninitialized using the solution of a direct method such as the linear sampling\nmethod. The algorithm remains robust to noise in the scattered field\nmeasurements and also converges to the true solution for limited aperture data.\nHowever, the number of training samples required to train the neural network\nscales exponentially in frequency and the complexity of the obstacles\nconsidered. We conclude with a discussion of this phenomenon and potential\ndirections for future research.\n","authors":["Mo Zhou","Jiequn Han","Manas Rachh","Carlos Borges"],"pdf_url":"https://arxiv.org/pdf/2212.08736v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.15471v3","updated":"2023-07-21T17:20:16Z","published":"2023-03-25T10:21:13Z","title":"Embedding Contextual Information through Reward Shaping in Multi-Agent\n Learning: A Case Study from Google Football","summary":" Artificial Intelligence has been used to help human complete difficult tasks\nin complicated environments by providing optimized strategies for\ndecision-making or replacing the manual labour. In environments including\nmultiple agents, such as football, the most common methods to train agents are\nImitation Learning and Multi-Agent Reinforcement Learning (MARL). However, the\nagents trained by Imitation Learning cannot outperform the expert demonstrator,\nwhich makes humans hardly get new insights from the learnt policy. Besides,\nMARL is prone to the credit assignment problem. In environments with sparse\nreward signal, this method can be inefficient. The objective of our research is\nto create a novel reward shaping method by embedding contextual information in\nreward function to solve the aforementioned challenges. We demonstrate this in\nthe Google Research Football (GRF) environment. We quantify the contextual\ninformation extracted from game state observation and use this quantification\ntogether with original sparse reward to create the shaped reward. The\nexperiment results in the GRF environment prove that our reward shaping method\nis a useful addition to state-of-the-art MARL algorithms for training agents in\nenvironments with sparse reward signal.\n","authors":["Chaoyi Gu","Varuna De Silva","Corentin Artaud","Rafael Pina"],"pdf_url":"https://arxiv.org/pdf/2303.15471v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11714v1","updated":"2023-07-21T17:19:01Z","published":"2023-07-21T17:19:01Z","title":"Convergence of SGD for Training Neural Networks with Sliced Wasserstein\n Losses","summary":" Optimal Transport has sparked vivid interest in recent years, in particular\nthanks to the Wasserstein distance, which provides a geometrically sensible and\nintuitive way of comparing probability measures. For computational reasons, the\nSliced Wasserstein (SW) distance was introduced as an alternative to the\nWasserstein distance, and has seen uses for training generative Neural Networks\n(NNs). While convergence of Stochastic Gradient Descent (SGD) has been observed\npractically in such a setting, there is to our knowledge no theoretical\nguarantee for this observation. Leveraging recent works on convergence of SGD\non non-smooth and non-convex functions by Bianchi et al. 
(2022), we aim to\nbridge that knowledge gap, and provide a realistic context under which\nfixed-step SGD trajectories for the SW loss on NN parameters converge. More\nprecisely, we show that the trajectories approach the set of (sub)-gradient\nflow equations as the step decreases. Under stricter assumptions, we show a\nmuch stronger convergence result for noised and projected SGD schemes, namely\nthat the long-run limits of the trajectories approach a set of generalised\ncritical points of the loss function.\n","authors":["Eloi Tanguy"],"pdf_url":"https://arxiv.org/pdf/2307.11714v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11704v1","updated":"2023-07-21T17:00:06Z","published":"2023-07-21T17:00:06Z","title":"JoinGym: An Efficient Query Optimization Environment for Reinforcement\n Learning","summary":" In this paper, we present \\textsc{JoinGym}, an efficient and lightweight\nquery optimization environment for reinforcement learning (RL). Join order\nselection (JOS) is a classic NP-hard combinatorial optimization problem from\ndatabase query optimization and can serve as a practical testbed for the\ngeneralization capabilities of RL algorithms. We describe how to formulate each\nof the left-deep and bushy variants of the JOS problem as a Markov Decision\nProcess (MDP), and we provide an implementation adhering to the standard\nGymnasium API. We highlight that our implementation \\textsc{JoinGym} is\ncompletely based on offline traces of all possible joins, which enables RL\npractitioners to easily and quickly test their methods on a realistic data\nmanagement problem without needing to setup any systems. Moreover, we also\nprovide all possible join traces on $3300$ novel SQL queries generated from the\nIMDB dataset. Upon benchmarking popular RL algorithms, we find that at least\none method can obtain near-optimal performance on train-set queries but their\nperformance degrades by several orders of magnitude on test-set queries. This\ngap motivates further research for RL algorithms that generalize well in\nmulti-task combinatorial optimization problems.\n","authors":["Kaiwen Wang","Junxiong Wang","Yueying Li","Nathan Kallus","Immanuel Trummer","Wen Sun"],"pdf_url":"https://arxiv.org/pdf/2307.11704v1.pdf","comment":"We will make all the queries available soon"},{"id":"http://arxiv.org/abs/2307.10490v2","updated":"2023-07-21T16:51:15Z","published":"2023-07-19T23:03:20Z","title":"(Ab)using Images and Sounds for Indirect Instruction Injection in\n Multi-Modal LLMs","summary":" We demonstrate how images and sounds can be used for indirect prompt and\ninstruction injection in multi-modal LLMs. An attacker generates an adversarial\nperturbation corresponding to the prompt and blends it into an image or audio\nrecording. When the user asks the (unmodified, benign) model about the\nperturbed image or audio, the perturbation steers the model to output the\nattacker-chosen text and/or make the subsequent dialog follow the attacker's\ninstruction. 
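[Editor's note on the attack described above: mechanically, such an injection is targeted adversarial-example optimization on the image pixels. The sketch below shows the generic projected-gradient pattern against a small stand-in model; the dummy model, perturbation budget, and target are illustrative assumptions, not the attack code or the models (LLaVa, PandaGPT) from the paper.]

```python
import torch

# Hypothetical stand-in for a multi-modal model's text head; every name here is
# illustrative, not from the paper or the attacked systems.
class DummyCaptioner(torch.nn.Module):
    def __init__(self, vocab_size=100):
        super().__init__()
        self.head = torch.nn.Linear(3 * 32 * 32, vocab_size)

    def forward(self, img):                      # img: (1, 3, 32, 32)
        return self.head(img.flatten(1))         # logits over a toy vocabulary

model = DummyCaptioner()
image = torch.rand(1, 3, 32, 32)                 # the benign image
target = torch.tensor([7])                       # attacker-chosen output "token"
delta = torch.zeros_like(image, requires_grad=True)
eps, alpha = 8 / 255, 1 / 255

for _ in range(40):                              # projected gradient descent
    loss = torch.nn.functional.cross_entropy(model(image + delta), target)
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()       # step toward the target output
        delta.clamp_(-eps, eps)                  # stay within the L-inf budget
        delta.grad.zero_()

adversarial_image = (image + delta).clamp(0, 1).detach()
```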
We illustrate this attack with several proof-of-concept examples\ntargeting LLaVa and PandaGPT.\n","authors":["Eugene Bagdasaryan","Tsung-Yin Hsieh","Ben Nassi","Vitaly Shmatikov"],"pdf_url":"https://arxiv.org/pdf/2307.10490v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11695v1","updated":"2023-07-21T16:50:10Z","published":"2023-07-21T16:50:10Z","title":"Using simulation to calibrate real data acquisition in veterinary\n medicine","summary":" This paper explores the innovative use of simulation environments to enhance\ndata acquisition and diagnostics in veterinary medicine, focusing specifically\non gait analysis in dogs. The study harnesses the power of Blender and the\nBlenderproc library to generate synthetic datasets that reflect diverse\nanatomical, environmental, and behavioral conditions. The generated data,\nrepresented in graph form and standardized for optimal analysis, is utilized to\ntrain machine learning algorithms for identifying normal and abnormal gaits.\nTwo distinct datasets with varying degrees of camera angle granularity are\ncreated to further investigate the influence of camera perspective on model\naccuracy. Preliminary results suggest that this simulation-based approach holds\npromise for advancing veterinary diagnostics by enabling more precise data\nacquisition and more effective machine learning models. By integrating\nsynthetic and real-world patient data, the study lays a robust foundation for\nimproving overall effectiveness and efficiency in veterinary medicine.\n","authors":["Krystian Strzałka","Szymon Mazurek","Maciej Wielgosz","Paweł Russek","Jakub Caputa","Daria Łukasik","Jan Krupiński","Jakub Grzeszczyk","Michał Karwatowski","Rafał Frączek","Ernest Jamro","Marcin Pietroń","Sebastian Koryciak","Agnieszka Dąbrowska-Boruch","Kazimierz Wiatr"],"pdf_url":"https://arxiv.org/pdf/2307.11695v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11672v1","updated":"2023-07-21T16:18:58Z","published":"2023-07-21T16:18:58Z","title":"Fast Adaptive Test-Time Defense with Robust Features","summary":" Adaptive test-time defenses are used to improve the robustness of deep neural\nnetworks to adversarial examples. However, existing methods significantly\nincrease the inference time due to additional optimization on the model\nparameters or the input at test time. In this work, we propose a novel adaptive\ntest-time defense strategy that is easy to integrate with any existing (robust)\ntraining procedure without additional test-time computation. Based on the\nnotion of robustness of features that we present, the key idea is to project\nthe trained models to the most robust feature space, thereby reducing the\nvulnerability to adversarial attacks in non-robust directions. We theoretically\nshow that the top eigenspace of the feature matrix are more robust for a\ngeneralized additive model and support our argument for a large width neural\nnetwork with the Neural Tangent Kernel (NTK) equivalence. 
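[Editor's note: a loose, generic illustration of "projecting onto the top eigenspace of the feature matrix" is given below using the empirical feature covariance. The feature dimensionality, the number of retained directions, and the random features are placeholders; the authors' exact procedure may differ.]

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 64))       # penultimate-layer features (toy)

mean = features.mean(axis=0)
centered = features - mean
cov = centered.T @ centered / len(features)
eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
top = eigvecs[:, -16:]                           # top-16 eigenspace, assumed "robust"

robust_features = centered @ top @ top.T + mean  # project, then restore the mean
print(robust_features.shape)
```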
We conduct extensive\nexperiments on CIFAR-10 and CIFAR-100 datasets for several robustness\nbenchmarks, including the state-of-the-art methods in RobustBench, and observe\nthat the proposed method outperforms existing adaptive test-time defenses at\nmuch lower computation costs.\n","authors":["Anurag Singh","Mahalakshmi Sabanayagam","Krikamol Muandet","Debarghya Ghoshdastidar"],"pdf_url":"https://arxiv.org/pdf/2307.11672v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17282v2","updated":"2023-07-21T16:15:21Z","published":"2023-05-26T22:01:47Z","title":"Universal consistency of the $k$-NN rule in metric spaces and Nagata\n dimension. II","summary":" We continue to investigate the $k$ nearest neighbour learning rule in\nseparable metric spaces. Thanks to the results of C\\'erou and Guyader (2006)\nand Preiss (1983), this rule is known to be universally consistent in every\nmetric space $X$ that is sigma-finite dimensional in the sense of Nagata. Here\nwe show that the rule is strongly universally consistent in such spaces in the\nabsence of ties. Under the tie-breaking strategy applied by Devroye,\nGy\\\"{o}rfi, Krzy\\.{z}ak, and Lugosi (1994) in the Euclidean setting, we manage\nto show the strong universal consistency in non-Archimedian metric spaces (that\nis, those of Nagata dimension zero). Combining the theorem of C\\'erou and\nGuyader with results of Assouad and Quentin de Gromard (2006), one deduces that\nthe $k$-NN rule is universally consistent in metric spaces having finite\ndimension in the sense of de Groot. In particular, the $k$-NN rule is\nuniversally consistent in the Heisenberg group which is not sigma-finite\ndimensional in the sense of Nagata as follows from an example independently\nconstructed by Kor\\'anyi and Reimann (1995) and Sawyer and Wheeden (1992).\n","authors":["Sushma Kumari","Vladimir G. Pestov"],"pdf_url":"https://arxiv.org/pdf/2305.17282v2.pdf","comment":"Latex 2e, 17 pages. The Heisenberg group is now presented in more\n detail, with some proofs and more references added, and a discussion of open\n problems added at the end"},{"id":"http://arxiv.org/abs/2307.11668v1","updated":"2023-07-21T16:12:46Z","published":"2023-07-21T16:12:46Z","title":"An Efficient Interior-Point Method for Online Convex Optimization","summary":" A new algorithm for regret minimization in online convex optimization is\ndescribed. The regret of the algorithm after $T$ time periods is $O(\\sqrt{T\n\\log T})$ - which is the minimum possible up to a logarithmic term. In\naddition, the new algorithm is adaptive, in the sense that the regret bounds\nhold not only for the time periods $1,\\ldots,T$ but also for every sub-interval\n$s,s+1,\\ldots,t$. 
The running time of the algorithm matches that of newly\nintroduced interior point algorithms for regret minimization: in\n$n$-dimensional space, during each iteration the new algorithm essentially\nsolves a system of linear equations of order $n$, rather than solving some\nconstrained convex optimization problem in $n$ dimensions and possibly many\nconstraints.\n","authors":["Elad Hazan","Nimrod Megiddo"],"pdf_url":"https://arxiv.org/pdf/2307.11668v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2109.14778v2","updated":"2023-07-21T16:05:34Z","published":"2021-09-30T00:58:27Z","title":"CALDA: Improving Multi-Source Time Series Domain Adaptation with\n Contrastive Adversarial Learning","summary":" Unsupervised domain adaptation (UDA) provides a strategy for improving\nmachine learning performance in data-rich (target) domains where ground truth\nlabels are inaccessible but can be found in related (source) domains. In cases\nwhere meta-domain information such as label distributions is available, weak\nsupervision can further boost performance. We propose a novel framework, CALDA,\nto tackle these two problems. CALDA synergistically combines the principles of\ncontrastive learning and adversarial learning to robustly support multi-source\nUDA (MS-UDA) for time series data. Similar to prior methods, CALDA utilizes\nadversarial learning to align source and target feature representations. Unlike\nprior approaches, CALDA additionally leverages cross-source label information\nacross domains. CALDA pulls examples with the same label close to each other,\nwhile pushing apart examples with different labels, reshaping the space through\ncontrastive learning. Unlike prior contrastive adaptation methods, CALDA\nrequires neither data augmentation nor pseudo labeling, which may be more\nchallenging for time series. We empirically validate our proposed approach.\nBased on results from human activity recognition, electromyography, and\nsynthetic datasets, we find utilizing cross-source information improves\nperformance over prior time series and contrastive methods. Weak supervision\nfurther improves performance, even in the presence of noise, allowing CALDA to\noffer generalizable strategies for MS-UDA. Code is available at:\nhttps://github.com/floft/calda\n","authors":["Garrett Wilson","Janardhan Rao Doppa","Diane J. Cook"],"pdf_url":"https://arxiv.org/pdf/2109.14778v2.pdf","comment":"Accepted at IEEE Transactions on Pattern Analysis and Machine\n Intelligence"},{"id":"http://arxiv.org/abs/2307.11661v1","updated":"2023-07-21T15:49:59Z","published":"2023-07-21T15:49:59Z","title":"Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts","summary":" Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have\nrevolutionized visual representation learning by providing good performance on\ndownstream datasets. VLMs are 0-shot adapted to a downstream dataset by\ndesigning prompts that are relevant to the dataset. Such prompt engineering\nmakes use of domain expertise and a validation dataset. Meanwhile, recent\ndevelopments in generative pretrained models like GPT-4 mean they can be used\nas advanced internet search tools. They can also be manipulated to provide\nvisual information in any structure. In this work, we show that GPT-4 can be\nused to generate text that is visually descriptive and how this can be used to\nadapt CLIP to downstream tasks. 
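[Editor's note: as a concrete, hedged illustration of swapping CLIP's default prompt for visually descriptive text, the snippet below runs zero-shot classification with Hugging Face's CLIP wrappers. The example prompts are hand-written stand-ins for LLM-generated descriptions, and `example.jpg` is a placeholder image path.]

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Visually descriptive prompts (stand-ins for GPT-4-generated descriptions),
# instead of the default "a photo of a {class}" template.
prompts = [
    "a satellite photo of a forest: dense, dark-green tree crowns seen from above",
    "a satellite photo of a highway: long straight asphalt lanes with small vehicles",
]
image = Image.open("example.jpg")                 # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(probs)
```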
We show considerable improvements in 0-shot\ntransfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD\n(~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt.\nWe also design a simple few-shot adapter that learns to choose the best\npossible sentences to construct generalizable classifiers that outperform the\nrecently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized\nfine-grained datasets. We will release the code, prompts, and auxiliary text\ndataset upon acceptance.\n","authors":["Mayug Maniparambil","Chris Vorster","Derek Molloy","Noel Murphy","Kevin McGuinness","Noel E. O'Connor"],"pdf_url":"https://arxiv.org/pdf/2307.11661v1.pdf","comment":"10 pages, Pre-print"},{"id":"http://arxiv.org/abs/2307.11655v1","updated":"2023-07-21T15:43:32Z","published":"2023-07-21T15:43:32Z","title":"Bandits with Deterministically Evolving States","summary":" We propose a model for learning with bandit feedback while accounting for\ndeterministically evolving and unobservable states that we call Bandits with\nDeterministically Evolving States. The workhorse applications of our model are\nlearning for recommendation systems and learning for online ads. In both cases,\nthe reward that the algorithm obtains at each round is a function of the\nshort-term reward of the action chosen and how ``healthy'' the system is (i.e.,\nas measured by its state). For example, in recommendation systems, the reward\nthat the platform obtains from a user's engagement with a particular type of\ncontent depends not only on the inherent features of the specific content, but\nalso on how the user's preferences have evolved as a result of interacting with\nother types of content on the platform. Our general model accounts for the\ndifferent rate $\\lambda \\in [0,1]$ at which the state evolves (e.g., how fast a\nuser's preferences shift as a result of previous content consumption) and\nencompasses standard multi-armed bandits as a special case. The goal of the\nalgorithm is to minimize a notion of regret against the best-fixed sequence of\narms pulled. We analyze online learning algorithms for any possible\nparametrization of the evolution rate $\\lambda$. Specifically, the regret rates\nobtained are: for $\\lambda \\in [0, 1/T^2]$: $\\widetilde O(\\sqrt{KT})$; for\n$\\lambda = T^{-a/b}$ with $b < a < 2b$: $\\widetilde O (T^{b/a})$; for $\\lambda\n\\in (1/T, 1 - 1/\\sqrt{T}): \\widetilde O (K^{1/3}T^{2/3})$; and for $\\lambda \\in\n[1 - 1/\\sqrt{T}, 1]: \\widetilde O (K\\sqrt{T})$.\n","authors":["Khashayar Khosravi","Renato Paes Leme","Chara Podimata","Apostolis Tsorvantzis"],"pdf_url":"https://arxiv.org/pdf/2307.11655v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.09208v3","updated":"2023-07-21T15:27:34Z","published":"2022-05-18T20:34:25Z","title":"Torchhd: An Open Source Python Library to Support Research on\n Hyperdimensional Computing and Vector Symbolic Architectures","summary":" Hyperdimensional computing (HD), also known as vector symbolic architectures\n(VSA), is a framework for computing with distributed representations by\nexploiting properties of random high-dimensional vector spaces. The commitment\nof the scientific community to aggregate and disseminate research in this\nparticularly multidisciplinary area has been fundamental for its advancement.\nJoining these efforts, we present Torchhd, a high-performance open source\nPython library for HD/VSA. 
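[Editor's note for readers new to HD/VSA: the core operations (random high-dimensional codes, binding by elementwise multiplication, bundling by addition) can be sketched in a few lines of plain NumPy. This is a generic illustration of the computing model, not Torchhd's API.]

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                     # hypervector dimensionality

def random_hv():
    return rng.choice([-1, 1], size=D)         # random bipolar hypervector

country, capital = random_hv(), random_hv()    # role vectors
france, paris = random_hv(), random_hv()       # filler vectors

# Bind roles to fillers (elementwise *), then bundle the pairs (+, sign).
record = np.sign(country * france + capital * paris)

# Unbinding with the `capital` role recovers something close to `paris`.
unbound = record * capital
print("similarity to paris :", unbound @ paris / D)
print("similarity to france:", unbound @ france / D)
```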
Torchhd seeks to make HD/VSA more accessible and\nserves as an efficient foundation for further research and application\ndevelopment. The easy-to-use library builds on top of PyTorch and features\nstate-of-the-art HD/VSA functionality, clear documentation, and implementation\nexamples from well-known publications. Comparing publicly available code with\ntheir corresponding Torchhd implementation shows that experiments can run up to\n100x faster. Torchhd is available at:\nhttps://github.com/hyperdimensional-computing/torchhd.\n","authors":["Mike Heddes","Igor Nunes","Pere Vergés","Denis Kleyko","Danny Abraham","Tony Givargis","Alexandru Nicolau","Alexander Veidenbaum"],"pdf_url":"https://arxiv.org/pdf/2205.09208v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.08647v4","updated":"2023-07-21T14:59:16Z","published":"2023-02-17T01:32:44Z","title":"Multiresolution Graph Transformers and Wavelet Positional Encoding for\n Learning Hierarchical Structures","summary":" Contemporary graph learning algorithms are not well-defined for large\nmolecules since they do not consider the hierarchical interactions among the\natoms, which are essential to determine the molecular properties of\nmacromolecules. In this work, we propose Multiresolution Graph Transformers\n(MGT), the first graph transformer architecture that can learn to represent\nlarge molecules at multiple scales. MGT can learn to produce representations\nfor the atoms and group them into meaningful functional groups or repeating\nunits. We also introduce Wavelet Positional Encoding (WavePE), a new positional\nencoding method that can guarantee localization in both spectral and spatial\ndomains. Our proposed model achieves competitive results on two macromolecule\ndatasets consisting of polymers and peptides, and one drug-like molecule\ndataset. Importantly, our model outperforms other state-of-the-art methods and\nachieves chemical accuracy in estimating molecular properties (e.g., GAP, HOMO\nand LUMO) calculated by Density Functional Theory (DFT) in the polymers\ndataset. Furthermore, the visualizations, including clustering results on\nmacromolecules and low-dimensional spaces of their representations, demonstrate\nthe capability of our methodology in learning to represent long-range and\nhierarchical structures. Our PyTorch implementation is publicly available at\nhttps://github.com/HySonLab/Multires-Graph-Transformer\n","authors":["Nhat Khang Ngo","Truong Son Hy","Risi Kondor"],"pdf_url":"https://arxiv.org/pdf/2302.08647v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11629v1","updated":"2023-07-21T14:53:12Z","published":"2023-07-21T14:53:12Z","title":"Scalable Multi-agent Skill Discovery based on Kronecker Graphs","summary":" Covering skill (a.k.a., option) discovery has been developed to improve the\nexploration of RL in single-agent scenarios with sparse reward signals, through\nconnecting the most distant states in the embedding space provided by the\nFiedler vector of the state transition graph. Given that joint state space\ngrows exponentially with the number of agents in multi-agent systems, existing\nresearches still relying on single-agent option discovery either become\nprohibitive or fail to directly discover joint options that improve the\nconnectivity of the joint state space. In this paper, we show how to directly\ncompute multi-agent options with collaborative exploratory behaviors while\nstill enjoying the ease of decomposition. 
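[Editor's note grounding the phrase "connecting the most distant states in the embedding space provided by the Fiedler vector": below is a tiny single-agent example on a toy chain-shaped transition graph. The multi-agent Kronecker-based estimation of the paper is not reproduced here.]

```python
import numpy as np

# Toy state-transition graph: 6 states on a chain, undirected adjacency.
A = np.zeros((6, 6))
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1.0

L = np.diag(A.sum(axis=1)) - A               # graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]                      # eigenvector of the 2nd-smallest eigenvalue

# A covering option connects the two "most distant" states in this embedding.
start, goal = int(np.argmin(fiedler)), int(np.argmax(fiedler))
print(start, goal)                           # the two chain endpoints, 0 and 5
```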
Our key idea is to approximate the\njoint state space as a Kronecker graph, based on which we can directly estimate\nits Fiedler vector using the Laplacian spectrum of individual agents'\ntransition graphs. Further, considering that directly computing the Laplacian\nspectrum is intractable for tasks with infinite-scale state spaces, we further\npropose a deep learning extension of our method by estimating eigenfunctions\nthrough NN-based representation learning techniques. The evaluation on\nmulti-agent tasks built with simulators like Mujoco, shows that the proposed\nalgorithm can successfully identify multi-agent options, and significantly\noutperforms the state-of-the-art. Codes are available at:\nhttps://github.itap.purdue.edu/Clan-labs/Scalable_MAOD_via_KP.\n","authors":["Jiayu Chen","Jingdi Chen","Tian Lan","Vaneet Aggarwal"],"pdf_url":"https://arxiv.org/pdf/2307.11629v1.pdf","comment":"Accepted to NeurIPS 2022. arXiv admin note: substantial text overlap\n with arXiv:2201.08227"},{"id":"http://arxiv.org/abs/2307.11620v1","updated":"2023-07-21T14:37:54Z","published":"2023-07-21T14:37:54Z","title":"Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local\n Value Regularization","summary":" Offline reinforcement learning (RL) has received considerable attention in\nrecent years due to its attractive capability of learning policies from offline\ndatasets without environmental interactions. Despite some success in the\nsingle-agent setting, offline multi-agent RL (MARL) remains to be a challenge.\nThe large joint state-action space and the coupled multi-agent behaviors pose\nextra complexities for offline policy optimization. Most existing offline MARL\nstudies simply apply offline data-related regularizations on individual agents,\nwithout fully considering the multi-agent system at the global level. In this\nwork, we present OMIGA, a new offline m ulti-agent RL algorithm with implicit\nglobal-to-local v alue regularization. OMIGA provides a principled framework to\nconvert global-level value regularization into equivalent implicit local value\nregularizations and simultaneously enables in-sample learning, thus elegantly\nbridging multi-agent value decomposition and policy learning with offline\nregularizations. Based on comprehensive experiments on the offline multi-agent\nMuJoCo and StarCraft II micro-management tasks, we show that OMIGA achieves\nsuperior performance over the state-of-the-art offline MARL methods in almost\nall tasks.\n","authors":["Xiangsen Wang","Haoran Xu","Yinan Zheng","Xianyuan Zhan"],"pdf_url":"https://arxiv.org/pdf/2307.11620v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11617v1","updated":"2023-07-21T14:36:40Z","published":"2023-07-21T14:36:40Z","title":"Robust Fully-Asynchronous Methods for Distributed Training over General\n Architecture","summary":" Perfect synchronization in distributed machine learning problems is\ninefficient and even impossible due to the existence of latency, package losses\nand stragglers. We propose a Robust Fully-Asynchronous Stochastic Gradient\nTracking method (R-FAST), where each device performs local computation and\ncommunication at its own pace without any form of synchronization. 
Different\nfrom existing asynchronous distributed algorithms, R-FAST can eliminate the\nimpact of data heterogeneity across devices and allow for packet losses by\nemploying a robust gradient tracking strategy that relies on properly designed\nauxiliary variables for tracking and buffering the overall gradient vector.\nMore importantly, the proposed method utilizes two spanning-tree graphs for\ncommunication so long as both share at least one common root, enabling flexible\ndesigns in communication architectures. We show that R-FAST converges in\nexpectation to a neighborhood of the optimum with a geometric rate for smooth\nand strongly convex objectives; and to a stationary point with a sublinear rate\nfor general non-convex settings. Extensive experiments demonstrate that R-FAST\nruns 1.5-2 times faster than synchronous benchmark algorithms, such as\nRing-AllReduce and D-PSGD, while still achieving comparable accuracy, and\noutperforms existing asynchronous SOTA algorithms, such as AD-PSGD and OSGP,\nespecially in the presence of stragglers.\n","authors":["Zehan Zhu","Ye Tian","Yan Huang","Jinming Xu","Shibo He"],"pdf_url":"https://arxiv.org/pdf/2307.11617v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11609v1","updated":"2023-07-21T14:25:22Z","published":"2023-07-21T14:25:22Z","title":"Persistent Ballistic Entanglement Spreading with Optimal Control in\n Quantum Spin Chains","summary":" Entanglement propagation provides a key routine to understand quantum\nmany-body dynamics in and out of equilibrium. In this work, we uncover that the\n``variational entanglement-enhancing'' field (VEEF) robustly induces a\npersistent ballistic spreading of entanglement in quantum spin chains. The VEEF\nis time dependent, and is optimally controlled to maximize the bipartite\nentanglement entropy (EE) of the final state. Such a linear growth persists\ntill the EE reaches the genuine saturation $\\tilde{S} = - \\log_{2}\n2^{-\\frac{N}{2}}=\\frac{N}{2}$ with $N$ the total number of spins. The EE\nsatisfies $S(t) = v t$ for the time $t \\leq \\frac{N}{2v}$, with $v$ the\nvelocity. These results are in sharp contrast with the behaviors without VEEF,\nwhere the EE generally approaches a sub-saturation known as the Page value\n$\\tilde{S}_{P} =\\tilde{S} - \\frac{1}{2\\ln{2}}$ in the long-time limit, and the\nentanglement growth deviates from being linear before the Page value is\nreached. The dependence between the velocity and interactions is explored, with\n$v \\simeq 2.76$, $4.98$, and $5.75$ for the spin chains with Ising, XY, and\nHeisenberg interactions, respectively. We further show that the nonlinear\ngrowth of EE emerges with the presence of long-range interactions.\n","authors":["Ying Lu","Pei Shi","Xiao-Han Wang","Jie Hu","Shi-Ju Ran"],"pdf_url":"https://arxiv.org/pdf/2307.11609v1.pdf","comment":"5 pages, 4 figures"},{"id":"http://arxiv.org/abs/2307.11608v1","updated":"2023-07-21T14:25:06Z","published":"2023-07-21T14:25:06Z","title":"Learning minimal representations of stochastic processes with\n variational autoencoders","summary":" Stochastic processes have found numerous applications in science, as they are\nbroadly used to model a variety of natural phenomena. Due to their intrinsic\nrandomness and uncertainty, they are however difficult to characterize. Here,\nwe introduce an unsupervised machine learning approach to determine the minimal\nset of parameters required to effectively describe the dynamics of a stochastic\nprocess. 
Our method builds upon an extended $\\beta$-variational autoencoder\narchitecture. By means of simulated datasets corresponding to paradigmatic\ndiffusion models, we showcase its effectiveness in extracting the minimal\nrelevant parameters that accurately describe these dynamics. Furthermore, the\nmethod enables the generation of new trajectories that faithfully replicate the\nexpected stochastic behavior. Overall, our approach enables for the autonomous\ndiscovery of unknown parameters describing stochastic processes, hence\nenhancing our comprehension of complex phenomena across various fields.\n","authors":["Gabriel Fernández-Fernández","Carlo Manzo","Maciej Lewenstein","Alexandre Dauphin","Gorka Muñoz-Gil"],"pdf_url":"https://arxiv.org/pdf/2307.11608v1.pdf","comment":"9 pages, 5 figures, 1 table. Code available at\n https://github.com/GabrielFernandezFernandez/SPIVAE"},{"id":"http://arxiv.org/abs/2307.11607v1","updated":"2023-07-21T14:23:41Z","published":"2023-07-21T14:23:41Z","title":"Finding Optimal Diverse Feature Sets with Alternative Feature Selection","summary":" Feature selection is popular for obtaining small, interpretable, yet highly\naccurate prediction models. Conventional feature-selection methods typically\nyield one feature set only, which might not suffice in some scenarios. For\nexample, users might be interested in finding alternative feature sets with\nsimilar prediction quality, offering different explanations of the data. In\nthis article, we introduce alternative feature selection and formalize it as an\noptimization problem. In particular, we define alternatives via constraints and\nenable users to control the number and dissimilarity of alternatives. Next, we\nanalyze the complexity of this optimization problem and show NP-hardness.\nFurther, we discuss how to integrate conventional feature-selection methods as\nobjectives. Finally, we evaluate alternative feature selection with 30\nclassification datasets. We observe that alternative feature sets may indeed\nhave high prediction quality, and we analyze several factors influencing this\noutcome.\n","authors":["Jakob Bach"],"pdf_url":"https://arxiv.org/pdf/2307.11607v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.00211v2","updated":"2023-07-21T13:57:09Z","published":"2022-12-01T01:40:03Z","title":"A Unified Algorithm Framework for Unsupervised Discovery of Skills based\n on Determinantal Point Process","summary":" Learning rich skills through temporal abstractions without supervision of\nexternal rewards is at the frontier of Reinforcement Learning research.\nExisting works mainly fall into two distinctive categories: variational and\nLaplacian-based skill (a.k.a., option) discovery. The former maximizes the\ndiversity of the discovered options through a mutual information loss but\noverlooks coverage of the state space, while the latter focuses on improving\nthe coverage of options by increasing connectivity during exploration, but does\nnot consider diversity. In this paper, we propose a unified framework that\nquantifies diversity and coverage through a novel use of the Determinantal\nPoint Process (DPP) and enables unsupervised option discovery explicitly\noptimizing both objectives. Specifically, we define the DPP kernel matrix with\nthe Laplacian spectrum of the state transition graph and use the expected mode\nnumber in the trajectories as the objective to capture and enhance both\ndiversity and coverage of the learned options. 
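[Editor's rough numerical aside: for an L-ensemble DPP, the expected number of selected items has the closed form E|Y| = sum_i lambda_i / (1 + lambda_i) over the kernel's eigenvalues, which is one simple way a Laplacian-spectrum kernel can yield a diversity/coverage score. Whether this matches the paper's "expected mode number" objective exactly is not claimed here; the toy graph is an assumption.]

```python
import numpy as np

# Toy state-transition graph (a 5-cycle) and its Laplacian spectrum.
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1.0
lam = np.linalg.eigvalsh(np.diag(A.sum(axis=1)) - A)

# L-ensemble DPP with the Laplacian as kernel: expected set size in closed form.
expected_size = float(np.sum(lam / (1.0 + lam)))
print(round(expected_size, 3))
```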
The proposed option discovery\nalgorithm is extensively evaluated using challenging tasks built with Mujoco\nand Atari, demonstrating that our proposed algorithm substantially outperforms\nSOTA baselines from both diversity- and coverage-driven categories. The codes\nare available at https://github.com/LucasCJYSDL/ODPP.\n","authors":["Jiayu Chen","Vaneet Aggarwal","Tian Lan"],"pdf_url":"https://arxiv.org/pdf/2212.00211v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11588v1","updated":"2023-07-21T13:51:45Z","published":"2023-07-21T13:51:45Z","title":"Transferability of Convolutional Neural Networks in Stationary Learning\n Tasks","summary":" Recent advances in hardware and big data acquisition have accelerated the\ndevelopment of deep learning techniques. For an extended period of time,\nincreasing the model complexity has led to performance improvements for various\ntasks. However, this trend is becoming unsustainable and there is a need for\nalternative, computationally lighter methods. In this paper, we introduce a\nnovel framework for efficient training of convolutional neural networks (CNNs)\nfor large-scale spatial problems. To accomplish this we investigate the\nproperties of CNNs for tasks where the underlying signals are stationary. We\nshow that a CNN trained on small windows of such signals achieves a nearly\nperformance on much larger windows without retraining. This claim is supported\nby our theoretical analysis, which provides a bound on the performance\ndegradation. Additionally, we conduct thorough experimental analysis on two\ntasks: multi-target tracking and mobile infrastructure on demand. Our results\nshow that the CNN is able to tackle problems with many hundreds of agents after\nbeing trained with fewer than ten. Thus, CNN architectures provide solutions to\nthese problems at previously computationally intractable scales.\n","authors":["Damian Owerko","Charilaos I. Kanatsoulis","Jennifer Bondarchuk","Donald J. Bucci Jr","Alejandro Ribeiro"],"pdf_url":"https://arxiv.org/pdf/2307.11588v1.pdf","comment":"14 pages, 7 figures, for associated code see\n https://github.com/damowerko/mtt"},{"id":"http://arxiv.org/abs/2307.11584v1","updated":"2023-07-21T13:48:11Z","published":"2023-07-21T13:48:11Z","title":"A Change of Heart: Improving Speech Emotion Recognition through\n Speech-to-Text Modality Conversion","summary":" Speech Emotion Recognition (SER) is a challenging task. In this paper, we\nintroduce a modality conversion concept aimed at enhancing emotion recognition\nperformance on the MELD dataset. We assess our approach through two\nexperiments: first, a method named Modality-Conversion that employs automatic\nspeech recognition (ASR) systems, followed by a text classifier; second, we\nassume perfect ASR output and investigate the impact of modality conversion on\nSER, this method is called Modality-Conversion++. Our findings indicate that\nthe first method yields substantial results, while the second method\noutperforms state-of-the-art (SOTA) speech-based approaches in terms of SER\nweighted-F1 (WF1) score on the MELD dataset. 
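[Editor's note: a minimal sketch of the "convert speech to text, then classify" stage with a weighted-F1 evaluation is shown below. The toy transcripts and labels are placeholders for ASR output and MELD emotion labels, and the classifier is not the one used in the paper.]

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Placeholder transcripts (as if produced by an ASR system) and emotion labels.
train_texts = ["i cannot believe you did this", "what wonderful news",
               "leave me alone", "okay fine whatever"]
train_labels = ["anger", "joy", "anger", "neutral"]
test_texts = ["this news is wonderful", "just leave me alone please"]
test_labels = ["joy", "anger"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
pred = clf.predict(test_texts)
print("weighted F1:", f1_score(test_labels, pred, average="weighted"))
```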
This research highlights the\npotential of modality conversion for tasks that can be conducted in alternative\nmodalities.\n","authors":["Zeinab Sadat Taghavi","Ali Satvaty","Hossein Sameti"],"pdf_url":"https://arxiv.org/pdf/2307.11584v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2201.08227v3","updated":"2023-07-21T13:42:59Z","published":"2022-01-20T15:33:08Z","title":"Learning Multi-agent Skills for Tabular Reinforcement Learning using\n Factor Graphs","summary":" Covering skill (a.k.a., option) discovery has been developed to improve the\nexploration of reinforcement learning in single-agent scenarios with sparse\nreward signals, through connecting the most distant states in the embedding\nspace provided by the Fiedler vector of the state transition graph. However,\nthese option discovery methods cannot be directly extended to multi-agent\nscenarios, since the joint state space grows exponentially with the number of\nagents in the system. Thus, existing research on adopting options in\nmulti-agent scenarios still relies on single-agent option discovery and fails to\ndirectly discover the joint options that can improve the connectivity of the\njoint state space of agents. In this paper, we show that it is indeed possible\nto directly compute multi-agent options with collaborative exploratory\nbehaviors among the agents, while still enjoying the ease of decomposition. Our\nkey idea is to approximate the joint state space as a Kronecker graph -- the\nKronecker product of individual agents' state transition graphs, based on which\nwe can directly estimate the Fiedler vector of the joint state space using the\nLaplacian spectrum of individual agents' transition graphs. This decomposition\nenables us to efficiently construct multi-agent joint options by encouraging\nagents to connect the sub-goal joint states which correspond to the\nminimum or maximum values of the estimated joint Fiedler vector. The evaluation\nbased on multi-agent collaborative tasks shows that the proposed algorithm can\nsuccessfully identify multi-agent options, and significantly outperforms prior\nworks using single-agent options or no options, in terms of both faster\nexploration and higher cumulative rewards.\n","authors":["Jiayu Chen","Jingdi Chen","Tian Lan","Vaneet Aggarwal"],"pdf_url":"https://arxiv.org/pdf/2201.08227v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.03269v2","updated":"2023-07-21T13:27:13Z","published":"2022-10-07T00:40:59Z","title":"Multi-agent Deep Covering Skill Discovery","summary":" The use of skills (a.k.a., options) can greatly accelerate exploration in\nreinforcement learning, especially when only sparse reward signals are\navailable. While option discovery methods have been proposed for individual\nagents, in multi-agent reinforcement learning settings, discovering\ncollaborative options that can coordinate the behavior of multiple agents and\nencourage them to visit the under-explored regions of their joint state space\nhas not been considered. In this case, we propose Multi-agent Deep Covering\nOption Discovery, which constructs the multi-agent options through minimizing\nthe expected cover time of the multiple agents' joint state space. Also, we\npropose a novel framework to adopt the multi-agent options in the MARL process.\nIn practice, a multi-agent task can usually be divided into some sub-tasks,\neach of which can be completed by a sub-group of the agents. 
Therefore, our\nalgorithm framework first leverages an attention mechanism to find\ncollaborative agent sub-groups that would benefit most from coordinated\nactions. Then, a hierarchical algorithm, namely HA-MSAC, is developed to learn\nthe multi-agent options for each sub-group to complete their sub-tasks first,\nand then to integrate them through a high-level policy as the solution of the\nwhole task. This hierarchical option construction allows our framework to\nstrike a balance between scalability and effective collaboration among the\nagents. The evaluation based on multi-agent collaborative tasks shows that the\nproposed algorithm can effectively capture the agent interactions with the\nattention mechanism, successfully identify multi-agent options, and\nsignificantly outperforms prior works using single-agent options or no options,\nin terms of both faster exploration and higher task rewards.\n","authors":["Jiayu Chen","Marina Haliem","Tian Lan","Vaneet Aggarwal"],"pdf_url":"https://arxiv.org/pdf/2210.03269v2.pdf","comment":"This paper was presented in part at the ICML Reinforcement Learning\n for Real Life Workshop, July 2021"},{"id":"http://arxiv.org/abs/2305.18453v2","updated":"2023-07-21T13:26:21Z","published":"2023-05-29T04:14:38Z","title":"Conditional Diffusion Models for Semantic 3D Medical Image Synthesis","summary":" The demand for artificial intelligence (AI) in healthcare is rapidly\nincreasing. However, significant challenges arise from data scarcity and\nprivacy concerns, particularly in medical imaging. While existing generative\nmodels have achieved success in image synthesis and image-to-image translation\ntasks, there remains a gap in the generation of 3D semantic medical images. To\naddress this gap, we introduce Med-DDPM, a diffusion model specifically\ndesigned for semantic 3D medical image synthesis, effectively tackling data\nscarcity and privacy issues. The novelty of Med-DDPM lies in its incorporation\nof semantic conditioning, enabling precise control during the image generation\nprocess. Our model outperforms Generative Adversarial Networks (GANs) in terms\nof stability and performance, generating diverse and anatomically coherent\nimages with high visual fidelity. Comparative analysis against state-of-the-art\naugmentation techniques demonstrates that Med-DDPM produces comparable results,\nhighlighting its potential as a data augmentation tool for enhancing model\naccuracy. In conclusion, Med-DDPM pioneers 3D semantic medical image synthesis\nby delivering high-quality and anatomically coherent images. Furthermore, the\nintegration of semantic conditioning with Med-DDPM holds promise for image\nanonymization in the field of biomedical imaging, showcasing the capabilities\nof the model in addressing challenges related to data scarcity and privacy\nconcerns.\n","authors":["Zolnamar Dorjsembe","Hsing-Kuo Pao","Sodtavilan Odonchimed","Furen Xiao"],"pdf_url":"https://arxiv.org/pdf/2305.18453v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11565v1","updated":"2023-07-21T13:17:22Z","published":"2023-07-21T13:17:22Z","title":"FMT: Removing Backdoor Feature Maps via Feature Map Testing in Deep\n Neural Networks","summary":" Deep neural networks have been widely used in many critical applications,\nsuch as autonomous vehicles and medical diagnosis. However, their security is\nthreatened by backdoor attack, which is achieved by adding artificial patterns\nto specific training data. 
Existing defense strategies primarily focus on using\nreverse engineering to reproduce the backdoor trigger generated by attackers\nand subsequently repair the DNN model by adding the trigger into inputs and\nfine-tuning the model with ground-truth labels. However, once the trigger\ngenerated by the attackers is complex and invisible, the defender can not\nsuccessfully reproduce the trigger. Consequently, the DNN model will not be\nrepaired since the trigger is not effectively removed.\n In this work, we propose Feature Map Testing~(FMT). Different from existing\ndefense strategies, which focus on reproducing backdoor triggers, FMT tries to\ndetect the backdoor feature maps, which are trained to extract backdoor\ninformation from the inputs. After detecting these backdoor feature maps, FMT\nwill erase them and then fine-tune the model with a secure subset of training\ndata. Our experiments demonstrate that, compared to existing defense\nstrategies, FMT can effectively reduce the Attack Success Rate (ASR) even\nagainst the most complex and invisible attack triggers. Second, unlike\nconventional defense methods that tend to exhibit low Robust Accuracy (i.e.,\nthe model's accuracy on the poisoned data), FMT achieves higher RA, indicating\nits superiority in maintaining model performance while mitigating the effects\nof backdoor attacks~(e.g., FMT obtains 87.40\\% RA in CIFAR10). Third, compared\nto existing feature map pruning techniques, FMT can cover more backdoor feature\nmaps~(e.g., FMT removes 83.33\\% of backdoor feature maps from the model in the\nCIFAR10 \\& BadNet scenario).\n","authors":["Dong Huang","Qingwen Bu","Yahao Qing","Yichao Fu","Heming Cui"],"pdf_url":"https://arxiv.org/pdf/2307.11565v1.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2301.09559v2","updated":"2023-07-21T13:13:01Z","published":"2023-01-23T17:20:25Z","title":"SpArX: Sparse Argumentative Explanations for Neural Networks","summary":" Neural networks (NNs) have various applications in AI, but explaining their\ndecisions remains challenging. Existing approaches often focus on explaining\nhow changing individual inputs affects NNs' outputs. However, an explanation\nthat is consistent with the input-output behaviour of an NN is not necessarily\nfaithful to the actual mechanics thereof. In this paper, we exploit\nrelationships between multi-layer perceptrons (MLPs) and quantitative\nargumentation frameworks (QAFs) to create argumentative explanations for the\nmechanics of MLPs. Our SpArX method first sparsifies the MLP while maintaining\nas much of the original structure as possible. It then translates the sparse\nMLP into an equivalent QAF to shed light on the underlying decision process of\nthe MLP, producing global and/or local explanations. 
We demonstrate\nexperimentally that SpArX can give more faithful explanations than existing\napproaches, while simultaneously providing deeper insights into the actual\nreasoning process of MLPs.\n","authors":["Hamed Ayoobi","Nico Potyka","Francesca Toni"],"pdf_url":"https://arxiv.org/pdf/2301.09559v2.pdf","comment":"Accepted at the European Conference on Artificial Intelligence (ECAI)\n 2023 Conference"},{"id":"http://arxiv.org/abs/2307.11552v1","updated":"2023-07-21T12:58:03Z","published":"2023-07-21T12:58:03Z","title":"A multi-modal representation of El Niño Southern Oscillation Diversity","summary":" The El Ni\\~no-Southern Oscillation (ENSO) is characterized by alternating\nperiods of warm (El Ni\\~no) and cold (La Ni\\~na) sea surface temperature\nanomalies (SSTA) in the equatorial Pacific. Although El Ni\\~no and La Ni\\~na\nare well-defined climate patterns, no two events are alike. To date, ENSO\ndiversity has been described primarily in terms of the longitudinal location of\npeak SSTA, used to define a bimodal classification of events in Eastern Pacific\n(EP) and Central Pacific (CP) types. Here, we use low-dimensional\nrepresentations of Pacific SSTAs to argue that binary categorical memberships\nare unsuitable to describe ENSO events. Using fuzzy unsupervised clustering, we\nrecover the four known ENSO categories, along with a fifth category: an Extreme\nEl Ni\\~no. We show that Extreme El Ni\\~nos differ both in their intensity and\ntemporal evolution from canonical EP El Ni\\~nos. We also find that CP La\nNi\\~nas, EP El Ni\\~nos, and Extreme El Ni\\~nos contribute the most to\ninterdecadal ENSO variability.\n","authors":["Jakob Schlör","Felix Strnad","Antonietta Capotondi","Bedartha Goswami"],"pdf_url":"https://arxiv.org/pdf/2307.11552v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11546v1","updated":"2023-07-21T12:47:28Z","published":"2023-07-21T12:47:28Z","title":"Towards practical reinforcement learning for tokamak magnetic control","summary":" Reinforcement learning (RL) has shown promising results for real-time control\nsystems, including the domain of plasma magnetic control. However, there are\nstill significant drawbacks compared to traditional feedback control approaches\nfor magnetic confinement. In this work, we address key drawbacks of the RL\nmethod; achieving higher control accuracy for desired plasma properties,\nreducing the steady-state error, and decreasing the required time to learn new\ntasks. We build on top of \\cite{degrave2022magnetic}, and present algorithmic\nimprovements to the agent architecture and training procedure. We present\nsimulation results that show up to 65\\% improvement in shape accuracy, achieve\nsubstantial reduction in the long-term bias of the plasma current, and\nadditionally reduce the training time required to learn new tasks by a factor\nof 3 or more. We present new experiments using the upgraded RL-based\ncontrollers on the TCV tokamak, which validate the simulation results achieved,\nand point the way towards routinely achieving accurate discharges using the RL\napproach.\n","authors":["Brendan D. Tracey","Andrea Michi","Yuri Chervonyi","Ian Davies","Cosmin Paduraru","Nevena Lazic","Federico Felici","Timo Ewalds","Craig Donner","Cristian Galperti","Jonas Buchli","Michael Neunert","Andrea Huber","Jonathan Evens","Paula Kurylowicz","Daniel J. 
Mankowitz","Martin Riedmiller","The TCV Team"],"pdf_url":"https://arxiv.org/pdf/2307.11546v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11532v1","updated":"2023-07-21T12:26:42Z","published":"2023-07-21T12:26:42Z","title":"Training Latency Minimization for Model-Splitting Allowed Federated Edge\n Learning","summary":" To alleviate the shortage of computing power faced by clients in training\ndeep neural networks (DNNs) using federated learning (FL), we leverage the edge\ncomputing and split learning to propose a model-splitting allowed FL (SFL)\nframework, with the aim to minimize the training latency without loss of test\naccuracy. Under the synchronized global update setting, the latency to complete\na round of global training is determined by the maximum latency for the clients\nto complete a local training session. Therefore, the training latency\nminimization problem (TLMP) is modelled as a minimizing-maximum problem. To\nsolve this mixed integer nonlinear programming problem, we first propose a\nregression method to fit the quantitative-relationship between the cut-layer\nand other parameters of an AI-model, and thus, transform the TLMP into a\ncontinuous problem. Considering that the two subproblems involved in the TLMP,\nnamely, the cut-layer selection problem for the clients and the computing\nresource allocation problem for the parameter-server are relative independence,\nan alternate-optimization-based algorithm with polynomial time complexity is\ndeveloped to obtain a high-quality solution to the TLMP. Extensive experiments\nare performed on a popular DNN-model EfficientNetV2 using dataset MNIST, and\nthe results verify the validity and improved performance of the proposed SFL\nframework.\n","authors":["Yao Wen","Guopeng Zhang","Kezhi Wang","Kun Yang"],"pdf_url":"https://arxiv.org/pdf/2307.11532v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.01639v2","updated":"2023-07-21T12:16:41Z","published":"2023-06-02T15:59:47Z","title":"Reduction of finite sampling noise in quantum neural networks","summary":" Quantum neural networks (QNNs) use parameterized quantum circuits with\ndata-dependent inputs and generate outputs through the evaluation of\nexpectation values. Calculating these expectation values necessitates repeated\ncircuit evaluations, thus introducing fundamental finite-sampling noise even on\nerror-free quantum computers. We reduce this noise by introducing the variance\nregularization, a technique for reducing the variance of the expectation value\nduring the quantum model training. This technique requires no additional\ncircuit evaluations if the QNN is properly constructed. Our empirical findings\ndemonstrate the reduced variance speeds up the training and lowers the output\nnoise as well as decreases the number of necessary evaluations of gradient\ncircuits. This regularization method is benchmarked on the regression of\nmultiple functions. We show that in our examples, it lowers the variance by an\norder of magnitude on average and leads to a significantly reduced noise level\nof the QNN. We finally demonstrate QNN training on a real quantum device and\nevaluate the impact of error mitigation. Here, the optimization is feasible\nonly due to the reduced number of necessary shots in the gradient evaluation\nresulting from the reduced variance.\n","authors":["David A. 
Kreplin","Marco Roth"],"pdf_url":"https://arxiv.org/pdf/2306.01639v2.pdf","comment":"11 pages, 10 figures; refined section 5"},{"id":"http://arxiv.org/abs/2306.07308v3","updated":"2023-07-21T11:52:28Z","published":"2023-06-12T13:48:37Z","title":"Self-Supervised Hyperspectral Inpainting with the Optimisation inspired\n Deep Neural Network Prior","summary":" Hyperspectral Image (HSI)s cover hundreds or thousands of narrow spectral\nbands, conveying a wealth of spatial and spectral information. However, due to\nthe instrumental errors and the atmospheric changes, the HSI obtained in\npractice are often contaminated by noise and dead pixels(lines), resulting in\nmissing information that may severely compromise the subsequent applications.\nWe introduce here a novel HSI missing pixel prediction algorithm, called Low\nRank and Sparsity Constraint Plug-and-Play (LRS-PnP). It is shown that LRS-PnP\nis able to predict missing pixels and bands even when all spectral bands of the\nimage are missing. The proposed LRS-PnP algorithm is further extended to a\nself-supervised model by combining the LRS-PnP with the Deep Image Prior (DIP),\ncalled LRS-PnP-DIP. In a series of experiments with real data, It is shown that\nthe LRS-PnP-DIP either achieves state-of-the-art inpainting performance\ncompared to other learning-based methods, or outperforms them.\n","authors":["Shuo Li","Mehrdad Yaghoobi"],"pdf_url":"https://arxiv.org/pdf/2306.07308v3.pdf","comment":"Presented in ISCS23"},{"id":"http://arxiv.org/abs/2303.06067v2","updated":"2023-07-21T11:40:45Z","published":"2023-03-10T16:48:54Z","title":"Modeling Events and Interactions through Temporal Processes -- A Survey","summary":" In real-world scenario, many phenomena produce a collection of events that\noccur in continuous time. Point Processes provide a natural mathematical\nframework for modeling these sequences of events. In this survey, we\ninvestigate probabilistic models for modeling event sequences through temporal\nprocesses. We revise the notion of event modeling and provide the mathematical\nfoundations that characterize the literature on the topic. We define an\nontology to categorize the existing approaches in terms of three families:\nsimple, marked, and spatio-temporal point processes. For each family, we\nsystematically review the existing approaches based based on deep learning.\nFinally, we analyze the scenarios where the proposed techniques can be used for\naddressing prediction and modeling aspects.\n","authors":["Angelica Liguori","Luciano Caroprese","Marco Minici","Bruno Veloso","Francesco Spinnato","Mirco Nanni","Giuseppe Manco","Joao Gama"],"pdf_url":"https://arxiv.org/pdf/2303.06067v2.pdf","comment":"Image replacements"},{"id":"http://arxiv.org/abs/2304.14118v2","updated":"2023-07-21T11:36:40Z","published":"2023-04-27T12:05:34Z","title":"Learning Neural PDE Solvers with Parameter-Guided Channel Attention","summary":" Scientific Machine Learning (SciML) is concerned with the development of\nlearned emulators of physical systems governed by partial differential\nequations (PDE). In application domains such as weather forecasting, molecular\ndynamics, and inverse design, ML-based surrogate models are increasingly used\nto augment or replace inefficient and often non-differentiable numerical\nsimulation algorithms. 
While a number of ML-based methods for approximating the\nsolutions of PDEs have been proposed in recent years, they typically do not\nadapt to the parameters of the PDEs, making it difficult to generalize to PDE\nparameters not seen during training. We propose a Channel Attention mechanism\nguided by PDE Parameter Embeddings (CAPE) component for neural surrogate models\nand a simple yet effective curriculum learning strategy. The CAPE module can be\ncombined with neural PDE solvers allowing them to adapt to unseen PDE\nparameters. The curriculum learning strategy provides a seamless transition\nbetween teacher-forcing and fully auto-regressive training. We compare CAPE in\nconjunction with the curriculum learning strategy using a popular PDE benchmark\nand obtain consistent and significant improvements over the baseline models.\nThe experiments also show several advantages of CAPE, such as its increased\nability to generalize to unseen PDE parameters without large increases in\ninference time and parameter count.\n","authors":["Makoto Takamoto","Francesco Alesiani","Mathias Niepert"],"pdf_url":"https://arxiv.org/pdf/2304.14118v2.pdf","comment":"accepted for publication in ICML2023"},{"id":"http://arxiv.org/abs/2306.00988v2","updated":"2023-07-21T11:27:10Z","published":"2023-06-01T17:59:57Z","title":"Continual Learning for Abdominal Multi-Organ and Tumor Segmentation","summary":" The ability to dynamically extend a model to new data and classes is critical\nfor multiple organ and tumor segmentation. However, due to privacy regulations,\naccessing previous data and annotations can be problematic in the medical\ndomain. This poses a significant barrier to preserving the high segmentation\naccuracy of the old classes when learning from new classes because of the\ncatastrophic forgetting problem. In this paper, we first empirically\ndemonstrate that simply using high-quality pseudo labels can fairly mitigate\nthis problem in the setting of organ segmentation. Furthermore, we put forward\nan innovative architecture designed specifically for continual organ and tumor\nsegmentation, which incurs minimal computational overhead. Our proposed design\ninvolves replacing the conventional output layer with a suite of lightweight,\nclass-specific heads, thereby offering the flexibility to accommodate newly\nemerging classes. These heads enable independent predictions for newly\nintroduced and previously learned classes, effectively minimizing the impact of\nnew classes on old ones during the course of continual learning. We further\npropose incorporating Contrastive Language-Image Pretraining (CLIP) embeddings\ninto the organ-specific heads. These embeddings encapsulate the semantic\ninformation of each class, informed by extensive image-text co-training. The\nproposed method is evaluated on both in-house and public abdominal CT datasets\nunder organ and tumor segmentation tasks. 
Empirical results suggest that the\nproposed design improves the segmentation performance of a baseline neural\nnetwork on newly-introduced and previously-learned classes along the learning\ntrajectory.\n","authors":["Yixiao Zhang","Xinyi Li","Huimiao Chen","Alan Yuille","Yaoyao Liu","Zongwei Zhou"],"pdf_url":"https://arxiv.org/pdf/2306.00988v2.pdf","comment":"MICCAI-2023"},{"id":"http://arxiv.org/abs/2307.11503v1","updated":"2023-07-21T11:19:00Z","published":"2023-07-21T11:19:00Z","title":"General regularization in covariate shift adaptation","summary":" Sample reweighting is one of the most widely used methods for correcting the\nerror of least squares learning algorithms in reproducing kernel Hilbert spaces\n(RKHS), that is caused by future data distributions that are different from the\ntraining data distribution. In practical situations, the sample weights are\ndetermined by values of the estimated Radon-Nikod\\'ym derivative, of the future\ndata distribution w.r.t.~the training data distribution. In this work, we\nreview known error bounds for reweighted kernel regression in RKHS and obtain,\nby combination, novel results. We show under weak smoothness conditions, that\nthe amount of samples, needed to achieve the same order of accuracy as in the\nstandard supervised learning without differences in data distributions, is\nsmaller than proven by state-of-the-art analyses.\n","authors":["Duc Hoan Nguyen","Sergei V. Pereverzyev","Werner Zellinger"],"pdf_url":"https://arxiv.org/pdf/2307.11503v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11494v1","updated":"2023-07-21T10:56:36Z","published":"2023-07-21T10:56:36Z","title":"Predict, Refine, Synthesize: Self-Guiding Diffusion Models for\n Probabilistic Time Series Forecasting","summary":" Diffusion models have achieved state-of-the-art performance in generative\nmodeling tasks across various domains. Prior works on time series diffusion\nmodels have primarily focused on developing conditional models tailored to\nspecific forecasting or imputation tasks. In this work, we explore the\npotential of task-agnostic, unconditional diffusion models for several time\nseries applications. We propose TSDiff, an unconditionally trained diffusion\nmodel for time series. Our proposed self-guidance mechanism enables\nconditioning TSDiff for downstream tasks during inference, without requiring\nauxiliary networks or altering the training procedure. We demonstrate the\neffectiveness of our method on three different time series tasks: forecasting,\nrefinement, and synthetic data generation. First, we show that TSDiff is\ncompetitive with several task-specific conditional forecasting methods\n(predict). Second, we leverage the learned implicit probability density of\nTSDiff to iteratively refine the predictions of base forecasters with reduced\ncomputational overhead over reverse diffusion (refine). 
Notably, the generative\nperformance of the model remains intact -- downstream forecasters trained on\nsynthetic samples from TSDiff outperform forecasters that are trained on\nsamples from other state-of-the-art generative time series models, occasionally\neven outperforming models trained on real data (synthesize).\n","authors":["Marcel Kollovieh","Abdul Fatir Ansari","Michael Bohlke-Schneider","Jasper Zschiegner","Hao Wang","Yuyang Wang"],"pdf_url":"https://arxiv.org/pdf/2307.11494v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11487v1","updated":"2023-07-21T10:45:08Z","published":"2023-07-21T10:45:08Z","title":"A New Deep State-Space Analysis Framework for Patient Latent State\n Estimation and Classification from EHR Time Series Data","summary":" Many diseases, including cancer and chronic conditions, require extended\ntreatment periods and long-term strategies. Machine learning and AI research\nfocusing on electronic health records (EHRs) have emerged to address this need.\nEffective treatment strategies involve more than capturing sequential changes\nin patient test values. It requires an explainable and clinically interpretable\nmodel by capturing the patient's internal state over time.\n In this study, we propose the \"deep state-space analysis framework,\" using\ntime-series unsupervised learning of EHRs with a deep state-space model. This\nframework enables learning, visualizing, and clustering of temporal changes in\npatient latent states related to disease progression.\n We evaluated our framework using time-series laboratory data from 12,695\ncancer patients. By estimating latent states, we successfully discover latent\nstates related to prognosis. By visualization and cluster analysis, the\ntemporal transition of patient status and test items during state transitions\ncharacteristic of each anticancer drug were identified. Our framework surpasses\nexisting methods in capturing interpretable latent space. It can be expected to\nenhance our comprehension of disease progression from EHRs, aiding treatment\nadjustments and prognostic determinations.\n","authors":["Aya Nakamura","Ryosuke Kojima","Yuji Okamoto","Eiichiro Uchino","Yohei Mineharu","Yohei Harada","Mayumi Kamada","Manabu Muto","Motoko Yanagita","Yasushi Okuno"],"pdf_url":"https://arxiv.org/pdf/2307.11487v1.pdf","comment":"21 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.06092v3","updated":"2023-07-21T10:04:23Z","published":"2023-07-12T11:35:37Z","title":"Quantitative CLTs in Deep Neural Networks","summary":" We study the distribution of a fully connected neural network with random\nGaussian weights and biases in which the hidden layer widths are proportional\nto a large constant $n$. Under mild assumptions on the non-linearity, we obtain\nquantitative bounds on normal approximations valid at large but finite $n$ and\nany fixed network depth. Our theorems show both for the finite-dimensional\ndistributions and the entire process, that the distance between a random fully\nconnected network (and its derivatives) to the corresponding infinite width\nGaussian process scales like $n^{-\\gamma}$ for $\\gamma>0$, with the exponent\ndepending on the metric used to measure discrepancy. 
Our bounds are strictly\nstronger in terms of their dependence on network width than any previously\navailable in the literature; in the one-dimensional case, we also prove that\nthey are optimal, i.e., we establish matching lower bounds.\n","authors":["Stefano Favaro","Boris Hanin","Domenico Marinucci","Ivan Nourdin","Giovanni Peccati"],"pdf_url":"https://arxiv.org/pdf/2307.06092v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11465v1","updated":"2023-07-21T10:01:55Z","published":"2023-07-21T10:01:55Z","title":"A Deep Learning Approach for Overall Survival Analysis with Missing\n Values","summary":" One of the most challenging fields where Artificial Intelligence (AI) can be\napplied is lung cancer research, specifically non-small cell lung cancer\n(NSCLC). In particular, overall survival (OS) is a vital indicator of patient\nstatus, helping to identify subgroups with diverse survival probabilities,\nenabling tailored treatment and improved OS rates. In this analysis, there are\ntwo challenges to take into account. First, few studies effectively exploit the\ninformation available from each patient, leveraging both uncensored (i.e.,\ndead) and censored (i.e., survivors) patients, considering also the death\ntimes. Second, the handling of incomplete data is a common issue in the medical\nfield. This problem is typically tackled through the use of imputation methods.\nOur objective is to present an AI model able to overcome these limits,\neffectively learning from both censored and uncensored patients and their\navailable features, for the prediction of OS for NSCLC patients. We present a\nnovel approach to survival analysis in the context of NSCLC, which exploits the\nstrengths of the transformer architecture accounting for only available\nfeatures without requiring any imputation strategy. By making use of ad-hoc\nlosses for OS, it accounts for both censored and uncensored patients,\nconsidering risks over time. We evaluated the results over a period of 6 years\nusing different time granularities obtaining a Ct-index, a time-dependent\nvariant of the C-index, of 71.97, 77.58 and 80.72 for time units of 1 month, 1\nyear and 2 years, respectively, outperforming all state-of-the-art methods\nregardless of the imputation method used.\n","authors":["Camillo Maria Caruso","Valerio Guarrasi","Sara Ramella","Paolo Soda"],"pdf_url":"https://arxiv.org/pdf/2307.11465v1.pdf","comment":"19 pages, 2 figures"},{"id":"http://arxiv.org/abs/2307.11462v1","updated":"2023-07-21T09:55:44Z","published":"2023-07-21T09:55:44Z","title":"Improve Long-term Memory Learning Through Rescaling the Error Temporally","summary":" This paper studies the error metric selection for long-term memory learning\nin sequence modelling. We examine the bias towards short-term memory in\ncommonly used errors, including mean absolute/squared error. Our findings show\nthat all temporally positive-weighted errors are biased towards short-term\nmemory in learning linear functionals. To reduce this bias and improve\nlong-term memory learning, we propose the use of a temporally rescaled error.\nIn addition to reducing the bias towards short-term memory, this approach can\nalso alleviate the vanishing gradient issue. We conduct numerical experiments\non different long-memory tasks and sequence models to validate our claims.\nNumerical results confirm the importance of appropriate temporally rescaled\nerror for effective long-term memory learning. 
To the best of our knowledge,\nthis is the first work that quantitatively analyzes different errors' memory\nbias towards short-term memory in sequence modelling.\n","authors":["Shida Wang","Zhanglu Yan"],"pdf_url":"https://arxiv.org/pdf/2307.11462v1.pdf","comment":"12 pages, 7 figures"},{"id":"http://arxiv.org/abs/2307.10617v2","updated":"2023-07-21T09:49:15Z","published":"2023-07-20T06:35:43Z","title":"Unmasking Falsehoods in Reviews: An Exploration of NLP Techniques","summary":" In the contemporary digital landscape, online reviews have become an\nindispensable tool for promoting products and services across various\nbusinesses. Marketers, advertisers, and online businesses have found incentives\nto create deceptive positive reviews for their products and negative reviews\nfor their competitors' offerings. As a result, the writing of deceptive reviews\nhas become an unavoidable practice for businesses seeking to promote themselves\nor undermine their rivals. Detecting such deceptive reviews has become an\nintense and ongoing area of research. This research paper proposes a machine\nlearning model to identify deceptive reviews, with a particular focus on\nrestaurants. This study delves into the performance of numerous experiments\nconducted on a dataset of restaurant reviews known as the Deceptive Opinion\nSpam Corpus. To accomplish this, an n-gram model and max features are developed\nto effectively identify deceptive content, particularly focusing on fake\nreviews. A benchmark study is undertaken to explore the performance of two\ndifferent feature extraction techniques, which are then coupled with five\ndistinct machine learning classification algorithms. The experimental results\nreveal that the passive aggressive classifier stands out among the various\nalgorithms, showcasing the highest accuracy not only in text classification but\nalso in identifying fake reviews. Moreover, the research delves into data\naugmentation and implements various deep learning techniques to further enhance\nthe process of detecting deceptive reviews. The findings shed light on the\nefficacy of the proposed machine learning approach and offer valuable insights\ninto dealing with deceptive reviews in the realm of online businesses.\n","authors":["Anusuya Baby Hari Krishnan"],"pdf_url":"https://arxiv.org/pdf/2307.10617v2.pdf","comment":"6 pages, 3 figures"},{"id":"http://arxiv.org/abs/2307.05825v2","updated":"2023-07-21T09:47:20Z","published":"2023-07-11T22:16:13Z","title":"Bayesian taut splines for estimating the number of modes","summary":" The number of modes in a probability density function is representative of\nthe model's complexity and can also be viewed as the number of existing\nsubpopulations. Despite its relevance, little research has been devoted to its\nestimation. Focusing on the univariate setting, we propose a novel approach\ntargeting prediction accuracy inspired by some overlooked aspects of the\nproblem. We argue for the need for structure in the solutions, the subjective\nand uncertain nature of modes, and the convenience of a holistic view blending\nglobal and local density properties. Our method builds upon a combination of\nflexible kernel estimators and parsimonious compositional splines. Feature\nexploration, model selection and mode testing are implemented in the Bayesian\ninference paradigm, providing soft solutions and allowing to incorporate expert\njudgement in the process. 
The usefulness of our proposal is illustrated through\na case study in sports analytics, showcasing multiple companion visualisation\ntools. A thorough simulation study demonstrates that traditional\nmodality-driven approaches paradoxically struggle to provide accurate results.\nIn this context, our method emerges as a top-tier alternative offering\ninnovative solutions for analysts.\n","authors":["José E. Chacón","Javier Fernández Serrano"],"pdf_url":"https://arxiv.org/pdf/2307.05825v2.pdf","comment":"20 pages, 8 figures (manuscript) + 19 pages, 16 figures\n (supplementary material)"},{"id":"http://arxiv.org/abs/2307.10926v2","updated":"2023-07-21T09:47:01Z","published":"2023-07-20T14:52:45Z","title":"Confidence intervals for performance estimates in 3D medical image\n segmentation","summary":" Medical segmentation models are evaluated empirically. As such an evaluation\nis based on a limited set of example images, it is unavoidably noisy. Beyond a\nmean performance measure, reporting confidence intervals is thus crucial.\nHowever, this is rarely done in medical image segmentation. The width of the\nconfidence interval depends on the test set size and on the spread of the\nperformance measure (its standard-deviation across of the test set). For\nclassification, many test images are needed to avoid wide confidence intervals.\nSegmentation, however, has not been studied, and it differs by the amount of\ninformation brought by a given test image. In this paper, we study the typical\nconfidence intervals in medical image segmentation. We carry experiments on 3D\nimage segmentation using the standard nnU-net framework, two datasets from the\nMedical Decathlon challenge and two performance measures: the Dice accuracy and\nthe Hausdorff distance. We show that the parametric confidence intervals are\nreasonable approximations of the bootstrap estimates for varying test set sizes\nand spread of the performance metric. Importantly, we show that the test size\nneeded to achieve a given precision is often much lower than for classification\ntasks. Typically, a 1% wide confidence interval requires about 100-200 test\nsamples when the spread is low (standard-deviation around 3%). More difficult\nsegmentation tasks may lead to higher spreads and require over 1000 samples.\n","authors":["R. El Jurdi","G. Varoquaux","O. Colliot"],"pdf_url":"https://arxiv.org/pdf/2307.10926v2.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2307.02953v2","updated":"2023-07-21T09:26:06Z","published":"2023-07-06T12:39:06Z","title":"SegNetr: Rethinking the local-global interactions and skip connections\n in U-shaped networks","summary":" Recently, U-shaped networks have dominated the field of medical image\nsegmentation due to their simple and easily tuned structure. However, existing\nU-shaped segmentation networks: 1) mostly focus on designing complex\nself-attention modules to compensate for the lack of long-term dependence based\non convolution operation, which increases the overall number of parameters and\ncomputational complexity of the network; 2) simply fuse the features of encoder\nand decoder, ignoring the connection between their spatial locations. In this\npaper, we rethink the above problem and build a lightweight medical image\nsegmentation network, called SegNetr. Specifically, we introduce a novel\nSegNetr block that can perform local-global interactions dynamically at any\nstage and with only linear complexity. 
At the same time, we design a general\ninformation retention skip connection (IRSC) to preserve the spatial location\ninformation of encoder features and achieve accurate fusion with the decoder\nfeatures. We validate the effectiveness of SegNetr on four mainstream medical\nimage segmentation datasets, with 59\% and 76\% fewer parameters and GFLOPs\nthan vanilla U-Net, while achieving segmentation performance comparable to\nstate-of-the-art methods. Notably, the components proposed in this paper can be\napplied to other U-shaped networks to improve their segmentation performance.\n","authors":["Junlong Cheng","Chengrui Gao","Fengjie Wang","Min Zhu"],"pdf_url":"https://arxiv.org/pdf/2307.02953v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.04246v2","updated":"2023-07-21T09:15:42Z","published":"2023-02-08T18:26:10Z","title":"Shortcut Detection with Variational Autoencoders","summary":" For real-world applications of machine learning (ML), it is essential that\nmodels make predictions based on well-generalizing features rather than\nspurious correlations in the data. The identification of such spurious\ncorrelations, also known as shortcuts, is a challenging problem and has so far\nbeen scarcely addressed. In this work, we present a novel approach to detect\nshortcuts in image and audio datasets by leveraging variational autoencoders\n(VAEs). The disentanglement of features in the latent space of VAEs allows us\nto discover feature-target correlations in datasets and semi-automatically\nevaluate them for ML shortcuts. We demonstrate the applicability of our method\non several real-world datasets and identify shortcuts that have not been\ndiscovered before.\n","authors":["Nicolas M. Müller","Simon Roschmann","Shahbaz Khan","Philip Sperl","Konstantin Böttinger"],"pdf_url":"https://arxiv.org/pdf/2302.04246v2.pdf","comment":"Accepted at the ICML 2023 Workshop on Spurious Correlations,\n Invariance and Stability"},{"id":"http://arxiv.org/abs/2303.09975v4","updated":"2023-07-21T09:05:53Z","published":"2023-03-17T13:48:17Z","title":"MedNeXt: Transformer-driven Scaling of ConvNets for Medical Image\n Segmentation","summary":" There has been exploding interest in embracing Transformer-based\narchitectures for medical image segmentation. However, the lack of large-scale\nannotated medical datasets makes achieving performances equivalent to those in\nnatural images challenging. Convolutional networks, in contrast, have higher\ninductive biases and, consequently, are easily trainable to high performance.\nRecently, the ConvNeXt architecture attempted to modernize the standard ConvNet\nby mirroring Transformer blocks. In this work, we improve upon this to design a\nmodernized and scalable convolutional architecture customized to the challenges of\ndata-scarce medical settings. We introduce MedNeXt, a Transformer-inspired\nlarge kernel segmentation network which introduces: 1) A fully ConvNeXt 3D\nEncoder-Decoder Network for medical image segmentation, 2) Residual ConvNeXt up\nand downsampling blocks to preserve semantic richness across scales, 3) A novel\ntechnique to iteratively increase kernel sizes by upsampling small kernel\nnetworks, to prevent performance saturation on limited medical data, 4)\nCompound scaling at multiple levels (depth, width, kernel size) of MedNeXt.\nThis leads to state-of-the-art performance on 4 tasks on CT and MRI modalities\nand varying dataset sizes, representing a modernized deep architecture for\nmedical image segmentation. 
Our code is made publicly available at:\nhttps://github.com/MIC-DKFZ/MedNeXt.\n","authors":["Saikat Roy","Gregor Koehler","Constantin Ulrich","Michael Baumgartner","Jens Petersen","Fabian Isensee","Paul F. Jaeger","Klaus Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2303.09975v4.pdf","comment":"Accepted at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.11436v1","updated":"2023-07-21T08:57:16Z","published":"2023-07-21T08:57:16Z","title":"Neural Operators for Delay-Compensating Control of Hyperbolic PIDEs","summary":" The recently introduced DeepONet operator-learning framework for PDE control\nis extended from the results for basic hyperbolic and parabolic PDEs to an\nadvanced hyperbolic class that involves delays on both the state and the system\noutput or input. The PDE backstepping design produces gain functions that are\noutputs of a nonlinear operator, mapping functions on a spatial domain into\nfunctions on a spatial domain, and where this gain-generating operator's inputs\nare the PDE's coefficients. The operator is approximated with a DeepONet neural\nnetwork to a degree of accuracy that is provably arbitrarily tight. Once we\nproduce this approximation-theoretic result in infinite dimension, with it we\nestablish stability in closed loop under feedback that employs approximate\ngains. In addition to supplying such results under full-state feedback, we also\ndevelop DeepONet-approximated observers and output-feedback laws and prove\ntheir own stabilizing properties under neural operator approximations. With\nnumerical simulations we illustrate the theoretical results and quantify the\nnumerical effort savings, which are of two orders of magnitude, thanks to\nreplacing the numerical PDE solving with the DeepONet.\n","authors":["Jie Qi","Jing Zhang","Miroslav Krstic"],"pdf_url":"https://arxiv.org/pdf/2307.11436v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11434v1","updated":"2023-07-21T08:55:23Z","published":"2023-07-21T08:55:23Z","title":"Batching for Green AI -- An Exploratory Study on Inference","summary":" The batch size is an essential parameter to tune during the development of\nnew neural networks. Amongst other quality indicators, it has a large degree of\ninfluence on the model's accuracy, generalisability, training times and\nparallelisability. This fact is generally known and commonly studied. However,\nduring the application phase of a deep learning model, when the model is\nutilised by an end-user for inference, we find that there is a disregard for\nthe potential benefits of introducing a batch size. In this study, we examine\nthe effect of input batching on the energy consumption and response times of\nfive fully-trained neural networks for computer vision that were considered\nstate-of-the-art at the time of their publication. The results suggest that\nbatching has a significant effect on both of these metrics. Furthermore, we\npresent a timeline of the energy efficiency and accuracy of neural networks\nover the past decade. We find that in general, energy consumption rises at a\nmuch steeper pace than accuracy and question the necessity of this evolution.\nAdditionally, we highlight one particular network, ShuffleNetV2(2018), that\nachieved a competitive performance for its time while maintaining a much lower\nenergy consumption. 
Nevertheless, we highlight that the results are model\ndependent.\n","authors":["Tim Yarally","Luís Cruz","Daniel Feitosa","June Sallou","Arie van Deursen"],"pdf_url":"https://arxiv.org/pdf/2307.11434v1.pdf","comment":"8 pages, 4 figures, 1 table. Accepted at Euromicro Conference Series\n on Software Engineering and Advanced Applications (SEAA) 2023"},{"id":"http://arxiv.org/abs/2307.11432v1","updated":"2023-07-21T08:52:08Z","published":"2023-07-21T08:52:08Z","title":"An Analysis of Multi-Agent Reinforcement Learning for Decentralized\n Inventory Control Systems","summary":" Most solutions to the inventory management problem assume a centralization of\ninformation that is incompatible with organisational constraints in real supply\nchain networks. The inventory management problem is a well-known planning\nproblem in operations research, concerned with finding the optimal re-order\npolicy for nodes in a supply chain. While many centralized solutions to the\nproblem exist, they are not applicable to real-world supply chains made up of\nindependent entities. The problem can however be naturally decomposed into\nsub-problems, each associated with an independent entity, turning it into a\nmulti-agent system. Therefore, a decentralized data-driven solution to\ninventory management problems using multi-agent reinforcement learning is\nproposed where each entity is controlled by an agent. Three multi-agent\nvariations of the proximal policy optimization algorithm are investigated\nthrough simulations of different supply chain networks and levels of\nuncertainty. The centralized training decentralized execution framework is\ndeployed, which relies on offline centralization during simulation-based policy\nidentification, but enables decentralization when the policies are deployed\nonline to the real system. Results show that using multi-agent proximal policy\noptimization with a centralized critic leads to performance very close to that\nof a centralized data-driven solution and outperforms a distributed model-based\nsolution in most cases while respecting the information constraints of the\nsystem.\n","authors":["Marwan Mousa","Damien van de Berg","Niki Kotecha","Ehecatl Antonio del Rio-Chanona","Max Mowbray"],"pdf_url":"https://arxiv.org/pdf/2307.11432v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15342v2","updated":"2023-07-21T08:51:09Z","published":"2023-05-24T16:55:49Z","title":"Is Your Model \"MADD\"? A Novel Metric to Evaluate Algorithmic Fairness\n for Predictive Student Models","summary":" Predictive student models are increasingly used in learning environments due\nto their ability to enhance educational outcomes and support stakeholders in\nmaking informed decisions. However, predictive models can be biased and produce\nunfair outcomes, leading to potential discrimination against some students and\npossible harmful long-term implications. This has prompted research on fairness\nmetrics meant to capture and quantify such biases. Nonetheless, so far,\nexisting fairness metrics used in education are predictive\nperformance-oriented, focusing on assessing biased outcomes across groups of\nstudents, without considering the behaviors of the models nor the severity of\nthe biases in the outcomes. Therefore, we propose a novel metric, the Model\nAbsolute Density Distance (MADD), to analyze models' discriminatory behaviors\nindependently from their predictive performance. 
We also provide a\ncomplementary visualization-based analysis to enable fine-grained human\nassessment of how the models discriminate between groups of students. We\nevaluate our approach on the common task of predicting student success in\nonline courses, using several common predictive classification models on an\nopen educational dataset. We also compare our metric to the only predictive\nperformance-oriented fairness metric developed in education, ABROCA. Results on\nthis dataset show that: (1) fair predictive performance does not guarantee fair\nmodels' behaviors and thus fair outcomes, (2) there is no direct relationship\nbetween data bias and predictive performance bias nor discriminatory behaviors\nbias, and (3) trained on the same data, models exhibit different discriminatory\nbehaviors, according to different sensitive features too. We thus recommend\nusing the MADD on models that show satisfying predictive performance, to gain a\nfiner-grained understanding on how they behave and to refine models selection\nand their usage.\n","authors":["Mélina Verger","Sébastien Lallé","François Bouchet","Vanda Luengo"],"pdf_url":"https://arxiv.org/pdf/2305.15342v2.pdf","comment":"12 pages, conference"},{"id":"http://arxiv.org/abs/2307.11423v1","updated":"2023-07-21T08:33:55Z","published":"2023-07-21T08:33:55Z","title":"Attention to Entropic Communication","summary":" The concept of attention, numerical weights that emphasize the importance of\nparticular data, has proven to be very relevant in artificial intelligence.\nRelative entropy (RE, aka Kullback-Leibler divergence) plays a central role in\ncommunication theory. Here we combine these concepts, attention and RE. RE\nguides optimal encoding of messages in bandwidth-limited communication as well\nas optimal message decoding via the maximum entropy principle (MEP). In the\ncoding scenario, RE can be derived from four requirements, namely being\nanalytical, local, proper, and calibrated. Weighted RE, used for attention\nsteering in communications, turns out to be improper. To see how proper\nattention communication can emerge, we analyze a scenario of a message sender\nwho wants to ensure that the receiver of the message can perform well-informed\nactions. If the receiver decodes the message using the MEP, the sender only\nneeds to know the receiver's utility function to inform optimally, but not the\nreceiver's initial knowledge state. In case only the curvature of the utility\nfunction maxima are known, it becomes desirable to accurately communicate an\nattention function, in this case a by this curvature weighted and re-normalized\nprobability function. Entropic attention communication is here proposed as the\ndesired generalization of entropic communication that permits weighting while\nbeing proper, thereby aiding the design of optimal communication protocols in\ntechnical applications and helping to understand human communication. 
For\nexample, our analysis shows how to derive the level of cooperation expected\nunder misaligned interests of otherwise honest communication partners.\n","authors":["Torsten Enßlin","Carolin Weidinger","Philipp Frank"],"pdf_url":"https://arxiv.org/pdf/2307.11423v1.pdf","comment":"23 pages, 4 figures, submitted"},{"id":"http://arxiv.org/abs/2306.09087v2","updated":"2023-07-21T08:32:35Z","published":"2023-06-15T12:33:39Z","title":"Deep learning based Meta-modeling for Multi-objective Technology\n Optimization of Electrical Machines","summary":" Optimization of rotating electrical machines is both time- and\ncomputationally expensive. Because of the different parametrization, design\noptimization is commonly executed separately for each machine technology. In\nthis paper, we present the application of a variational auto-encoder (VAE) to\noptimize two different machine technologies simultaneously, namely an\nasynchronous machine and a permanent magnet synchronous machine. After\ntraining, we employ a deep neural network and a decoder as meta-models to\npredict global key performance indicators (KPIs) and generate associated new\ndesigns, respectively, through unified latent space in the optimization loop.\nNumerical results demonstrate concurrent parametric multi-objective technology\noptimization in the high-dimensional design space. The VAE-based approach is\nquantitatively compared to a classical deep learning-based direct approach for\nKPIs prediction.\n","authors":["Vivek Parekh","Dominik Flore","Sebastian Schöps"],"pdf_url":"https://arxiv.org/pdf/2306.09087v2.pdf","comment":"12 pages, 15 figures"},{"id":"http://arxiv.org/abs/2306.09260v2","updated":"2023-07-21T08:18:51Z","published":"2023-06-07T14:22:41Z","title":"IsoEx: an explainable unsupervised approach to process event logs cyber\n investigation","summary":" 39 seconds. That is the timelapse between two consecutive cyber attacks as of\n2023. Meaning that by the time you are done reading this abstract, about 1 or 2\nadditional cyber attacks would have occurred somewhere in the world. In this\ncontext of highly increased frequency of cyber threats, Security Operation\nCenters (SOC) and Computer Emergency Response Teams (CERT) can be overwhelmed.\nIn order to relieve the cybersecurity teams in their investigative effort and\nhelp them focus on more added-value tasks, machine learning approaches and\nmethods started to emerge. This paper introduces a novel method, IsoEx, for\ndetecting anomalous and potentially problematic command lines during the\ninvestigation of contaminated devices. IsoEx is built around a set of features\nthat leverages the log structure of the command line, as well as its\nparent/child relationship, to achieve a greater accuracy than traditional\nmethods. To detect anomalies, IsoEx resorts to an unsupervised anomaly\ndetection technique that is both highly sensitive and lightweight. A key\ncontribution of the paper is its emphasis on interpretability, achieved through\nthe features themselves and the application of eXplainable Artificial\nIntelligence (XAI) techniques and visualizations. This is critical to ensure\nthe adoption of the method by SOC and CERT teams, as the paper argues that the\ncurrent literature on machine learning for log investigation has not adequately\naddressed the issue of explainability. 
This method was proven efficient in a\nreal-life environment as it was built to support a company\\'s SOC and CERT\n","authors":["Pierre Lavieille","Ismail Alaoui Hassani Atlas"],"pdf_url":"https://arxiv.org/pdf/2306.09260v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11408v1","updated":"2023-07-21T08:07:16Z","published":"2023-07-21T08:07:16Z","title":"Direct and inverse modeling of soft robots by learning a condensed FEM\n model","summary":" The Finite Element Method (FEM) is a powerful modeling tool for predicting\nthe behavior of soft robots. However, its use for control can be difficult for\nnon-specialists of numerical computation: it requires an optimization of the\ncomputation to make it real-time. In this paper, we propose a learning-based\napproach to obtain a compact but sufficiently rich mechanical representation.\nOur choice is based on nonlinear compliance data in the actuator/effector space\nprovided by a condensation of the FEM model. We demonstrate that this compact\nmodel can be learned with a reasonable amount of data and, at the same time, be\nvery efficient in terms of modeling, since we can deduce the direct and inverse\nkinematics of the robot. We also show how to couple some models learned\nindividually in particular on an example of a gripper composed of two soft\nfingers. Other results are shown by comparing the inverse model derived from\nthe full FEM model and the one from the compact learned version. This work\nopens new perspectives, namely for the embedded control of soft robots, but\nalso for their design. These perspectives are also discussed in the paper.\n","authors":["Etienne Ménager","Tanguy Navez","Olivier Goury","Christian Duriez"],"pdf_url":"https://arxiv.org/pdf/2307.11408v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09382v3","updated":"2023-07-21T07:59:06Z","published":"2023-06-15T12:59:04Z","title":"Sound Demixing Challenge 2023 Music Demixing Track Technical Report:\n TFC-TDF-UNet v3","summary":" In this report, we present our award-winning solutions for the Music Demixing\nTrack of Sound Demixing Challenge 2023. First, we propose TFC-TDF-UNet v3, a\ntime-efficient music source separation model that achieves state-of-the-art\nresults on the MUSDB benchmark. We then give full details regarding our\nsolutions for each Leaderboard, including a loss masking approach for\nnoise-robust training. Code for reproducing model training and final\nsubmissions is available at github.com/kuielab/sdx23.\n","authors":["Minseok Kim","Jun Hyung Lee","Soonyoung Jung"],"pdf_url":"https://arxiv.org/pdf/2306.09382v3.pdf","comment":"5 pages, 4 tables"},{"id":"http://arxiv.org/abs/2304.04250v2","updated":"2023-07-21T07:39:58Z","published":"2023-04-09T14:52:18Z","title":"Editable User Profiles for Controllable Text Recommendation","summary":" Methods for making high-quality recommendations often rely on learning latent\nrepresentations from interaction data. These methods, while performant, do not\nprovide ready mechanisms for users to control the recommendation they receive.\nOur work tackles this problem by proposing LACE, a novel concept value\nbottleneck model for controllable text recommendations. LACE represents each\nuser with a succinct set of human-readable concepts through retrieval given\nuser-interacted documents and learns personalized representations of the\nconcepts based on user documents. This concept based user profile is then\nleveraged to make recommendations. 
The design of our model affords control over\nthe recommendations through a number of intuitive interactions with a\ntransparent user profile. We first establish the quality of recommendations\nobtained from LACE in an offline evaluation on three recommendation tasks\nspanning six datasets in warm-start, cold-start, and zero-shot setups. Next, we\nvalidate the controllability of LACE under simulated user interactions.\nFinally, we implement LACE in an interactive controllable recommender system\nand conduct a user study to demonstrate that users are able to improve the\nquality of recommendations they receive through interactions with an editable\nuser profile.\n","authors":["Sheshera Mysore","Mahmood Jasim","Andrew McCallum","Hamed Zamani"],"pdf_url":"https://arxiv.org/pdf/2304.04250v2.pdf","comment":"SIGIR-2023 Camera Ready"},{"id":"http://arxiv.org/abs/2307.11397v1","updated":"2023-07-21T07:29:38Z","published":"2023-07-21T07:29:38Z","title":"Probabilistic Modeling of Inter- and Intra-observer Variability in\n Medical Image Segmentation","summary":" Medical image segmentation is a challenging task, particularly due to inter-\nand intra-observer variability, even between medical experts. In this paper, we\npropose a novel model, called Probabilistic Inter-Observer and iNtra-Observer\nvariation NetwOrk (Pionono). It captures the labeling behavior of each rater\nwith a multidimensional probability distribution and integrates this\ninformation with the feature maps of the image to produce probabilistic\nsegmentation predictions. The model is optimized by variational inference and\ncan be trained end-to-end. It outperforms state-of-the-art models such as\nSTAPLE, Probabilistic U-Net, and models based on confusion matrices.\nAdditionally, Pionono predicts multiple coherent segmentation maps that mimic\nthe rater's expert opinion, which provides additional valuable information for\nthe diagnostic process. Experiments on real-world cancer segmentation datasets\ndemonstrate the high accuracy and efficiency of Pionono, making it a powerful\ntool for medical image analysis.\n","authors":["Arne Schmidt","Pablo Morales-Álvarez","Rafael Molina"],"pdf_url":"https://arxiv.org/pdf/2307.11397v1.pdf","comment":"13 pages, 5 figures"},{"id":"http://arxiv.org/abs/2303.06146v2","updated":"2023-07-21T06:34:54Z","published":"2023-03-10T18:59:33Z","title":"StyleGANEX: StyleGAN-Based Manipulation Beyond Cropped Aligned Faces","summary":" Recent advances in face manipulation using StyleGAN have produced impressive\nresults. However, StyleGAN is inherently limited to cropped aligned faces at a\nfixed image resolution it is pre-trained on. In this paper, we propose a simple\nand effective solution to this limitation by using dilated convolutions to\nrescale the receptive fields of shallow layers in StyleGAN, without altering\nany model parameters. This allows fixed-size small features at shallow layers\nto be extended into larger ones that can accommodate variable resolutions,\nmaking them more robust in characterizing unaligned faces. To enable real face\ninversion and manipulation, we introduce a corresponding encoder that provides\nthe first-layer feature of the extended StyleGAN in addition to the latent\nstyle code. 
We validate the effectiveness of our method using unaligned face\ninputs of various resolutions in a diverse set of face manipulation tasks,\nincluding facial attribute editing, super-resolution, sketch/mask-to-face\ntranslation, and face toonification.\n","authors":["Shuai Yang","Liming Jiang","Ziwei Liu","Chen Change Loy"],"pdf_url":"https://arxiv.org/pdf/2303.06146v2.pdf","comment":"ICCV 2023. Code: https://github.com/williamyang1991/StyleGANEX\n Project page: https://www.mmlab-ntu.com/project/styleganex/"},{"id":"http://arxiv.org/abs/2307.11379v1","updated":"2023-07-21T06:34:41Z","published":"2023-07-21T06:34:41Z","title":"Towards Better Fairness-Utility Trade-off: A Comprehensive\n Measurement-Based Reinforcement Learning Framework","summary":" Machine learning is widely used to make decisions with societal impact such\nas bank loan approving, criminal sentencing, and resume filtering. How to\nensure its fairness while maintaining utility is a challenging but crucial\nissue. Fairness is a complex and context-dependent concept with over 70\ndifferent measurement metrics. Since existing regulations are often vague in\nterms of which metric to use and different organizations may prefer different\nfairness metrics, it is important to have means of improving fairness\ncomprehensively. Existing mitigation techniques often target at one specific\nfairness metric and have limitations in improving multiple notions of fairness\nsimultaneously. In this work, we propose CFU (Comprehensive Fairness-Utility),\na reinforcement learning-based framework, to efficiently improve the\nfairness-utility trade-off in machine learning classifiers. A comprehensive\nmeasurement that can simultaneously consider multiple fairness notions as well\nas utility is established, and new metrics are proposed based on an in-depth\nanalysis of the relationship between different fairness metrics. The reward\nfunction of CFU is constructed with comprehensive measurement and new metrics.\nWe conduct extensive experiments to evaluate CFU on 6 tasks, 3 machine learning\nmodels, and 15 fairness-utility measurements. The results demonstrate that CFU\ncan improve the classifier on multiple fairness metrics without sacrificing its\nutility. It outperforms all state-of-the-art techniques and has witnessed a\n37.5% improvement on average.\n","authors":["Simiao Zhang","Jitao Bai","Menghong Guan","Yihao Huang","Yueling Zhang","Jun Sun","Geguang Pu"],"pdf_url":"https://arxiv.org/pdf/2307.11379v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.07493v3","updated":"2023-07-21T06:28:40Z","published":"2022-05-16T07:53:42Z","title":"Multi-scale Attention Flow for Probabilistic Time Series Forecasting","summary":" The probability prediction of multivariate time series is a notoriously\nchallenging but practical task. On the one hand, the challenge is how to\neffectively capture the cross-series correlations between interacting time\nseries, to achieve accurate distribution modeling. On the other hand, we should\nconsider how to capture the contextual information within time series more\naccurately to model multivariate temporal dynamics of time series. In this\nwork, we proposed a novel non-autoregressive deep learning model, called\nMulti-scale Attention Normalizing Flow(MANF), where we integrate multi-scale\nattention and relative position information and the multivariate data\ndistribution is represented by the conditioned normalizing flow. 
Additionally,\ncompared with autoregressive modeling methods, our model avoids the influence\nof cumulative error and does not increase the time complexity. Extensive\nexperiments demonstrate that our model achieves state-of-the-art performance on\nmany popular multivariate datasets.\n","authors":["Shibo Feng","Chunyan Miao","Ke Xu","Jiaxiang Wu","Pengcheng Wu","Yang Zhang","Peilin Zhao"],"pdf_url":"https://arxiv.org/pdf/2205.07493v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11375v1","updated":"2023-07-21T06:17:09Z","published":"2023-07-21T06:17:09Z","title":"LatentAugment: Data Augmentation via Guided Manipulation of GAN's Latent\n Space","summary":" Data Augmentation (DA) is a technique to increase the quantity and diversity\nof the training data, and by that alleviate overfitting and improve\ngeneralisation. However, standard DA produces synthetic data for augmentation\nwith limited diversity. Generative Adversarial Networks (GANs) may unlock\nadditional information in a dataset by generating synthetic samples having the\nappearance of real images. However, these models struggle to simultaneously\naddress three key requirements: fidelity and high-quality samples; diversity\nand mode coverage; and fast sampling. Indeed, GANs generate high-quality\nsamples rapidly, but have poor mode coverage, limiting their adoption in DA\napplications. We propose LatentAugment, a DA strategy that overcomes the low\ndiversity of GANs, opening up for use in DA applications. Without external\nsupervision, LatentAugment modifies latent vectors and moves them into latent\nspace regions to maximise the synthetic images' diversity and fidelity. It is\nalso agnostic to the dataset and the downstream task. A wide set of experiments\nshows that LatentAugment improves the generalisation of a deep model\ntranslating from MRI-to-CT beating both standard DA as well GAN-based sampling.\nMoreover, still in comparison with GAN-based sampling, LatentAugment synthetic\nsamples show superior mode coverage and diversity. Code is available at:\nhttps://github.com/ltronchin/LatentAugment.\n","authors":["Lorenzo Tronchin","Minh H. Vu","Paolo Soda","Tommy Löfstedt"],"pdf_url":"https://arxiv.org/pdf/2307.11375v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11373v1","updated":"2023-07-21T06:12:39Z","published":"2023-07-21T06:12:39Z","title":"Diverse Offline Imitation via Fenchel Duality","summary":" There has been significant recent progress in the area of unsupervised skill\ndiscovery, with various works proposing mutual information based objectives, as\na source of intrinsic motivation. Prior works predominantly focused on\ndesigning algorithms that require online access to the environment. In\ncontrast, we develop an \\textit{offline} skill discovery algorithm. Our problem\nformulation considers the maximization of a mutual information objective\nconstrained by a KL-divergence. More precisely, the constraints ensure that the\nstate occupancy of each skill remains close to the state occupancy of an\nexpert, within the support of an offline dataset with good state-action\ncoverage. 
Our main contribution is to connect Fenchel duality, reinforcement\nlearning and unsupervised skill discovery, and to give a simple offline\nalgorithm for learning diverse skills that are aligned with an expert.\n","authors":["Marin Vlastelica","Pavel Kolev","Jin Cheng","Georg Martius"],"pdf_url":"https://arxiv.org/pdf/2307.11373v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11371v1","updated":"2023-07-21T06:03:43Z","published":"2023-07-21T06:03:43Z","title":"Random Separating Hyperplane Theorem and Learning Polytopes","summary":" The Separating Hyperplane theorem is a fundamental result in Convex Geometry\nwith myriad applications. Our first result, Random Separating Hyperplane\nTheorem (RSH), is a strengthening of this for polytopes. $\\rsh$ asserts that if\nthe distance between $a$ and a polytope $K$ with $k$ vertices and unit diameter\nin $\\Re^d$ is at least $\\delta$, where $\\delta$ is a fixed constant in $(0,1)$,\nthen a randomly chosen hyperplane separates $a$ and $K$ with probability at\nleast $1/poly(k)$ and margin at least $\\Omega \\left(\\delta/\\sqrt{d} \\right)$.\nAn immediate consequence of our result is the first near optimal bound on the\nerror increase in the reduction from a Separation oracle to an Optimization\noracle over a polytope.\n RSH has algorithmic applications in learning polytopes. We consider a\nfundamental problem, denoted the ``Hausdorff problem'', of learning a unit\ndiameter polytope $K$ within Hausdorff distance $\\delta$, given an optimization\noracle for $K$. Using RSH, we show that with polynomially many random queries\nto the optimization oracle, $K$ can be approximated within error $O(\\delta)$.\nTo our knowledge this is the first provable algorithm for the Hausdorff\nProblem. Building on this result, we show that if the vertices of $K$ are\nwell-separated, then an optimization oracle can be used to generate a list of\npoints, each within Hausdorff distance $O(\\delta)$ of $K$, with the property\nthat the list contains a point close to each vertex of $K$. Further, we show\nhow to prune this list to generate a (unique) approximation to each vertex of\nthe polytope. We prove that in many latent variable settings, e.g., topic\nmodeling, LDA, optimization oracles do exist provided we project to a suitable\nSVD subspace. Thus, our work yields the first efficient algorithm for finding\napproximations to the vertices of the latent polytope under the\nwell-separatedness assumption.\n","authors":["Chiranjib Bhattacharyya","Ravindran Kannan","Amit Kumar"],"pdf_url":"https://arxiv.org/pdf/2307.11371v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11357v1","updated":"2023-07-21T05:17:21Z","published":"2023-07-21T05:17:21Z","title":"Bridging the Reality Gap of Reinforcement Learning based Traffic Signal\n Control using Domain Randomization and Meta Learning","summary":" Reinforcement Learning (RL) has been widely explored in Traffic Signal\nControl (TSC) applications, however, still no such system has been deployed in\npractice. A key barrier to progress in this area is the reality gap, the\ndiscrepancy that results from differences between simulation models and their\nreal-world equivalents. In this paper, we address this challenge by first\npresenting a comprehensive analysis of potential simulation parameters that\ncontribute to this reality gap. We then also examine two promising strategies\nthat can bridge this gap: Domain Randomization (DR) and Model-Agnostic\nMeta-Learning (MAML). 
Both strategies were trained with a traffic simulation\nmodel of an intersection. In addition, the model was embedded in LemgoRL, a\nframework that integrates realistic, safety-critical requirements into the\ncontrol system. Subsequently, we evaluated the performance of the two methods\non a separate model of the same intersection that was developed with a\ndifferent traffic simulator. In this way, we mimic the reality gap. Our\nexperimental results show that both DR and MAML outperform a state-of-the-art\nRL algorithm, therefore highlighting their potential to mitigate the reality\ngap in RLbased TSC systems.\n","authors":["Arthur Müller","Matthia Sabatelli"],"pdf_url":"https://arxiv.org/pdf/2307.11357v1.pdf","comment":"Paper was accepted by the ITSC 2023 (26th IEEE International\n Conference on Intelligent Transportation Systems)"},{"id":"http://arxiv.org/abs/2307.09484v2","updated":"2023-07-21T05:13:55Z","published":"2023-06-06T12:45:15Z","title":"MolFM: A Multimodal Molecular Foundation Model","summary":" Molecular knowledge resides within three different modalities of information\nsources: molecular structures, biomedical documents, and knowledge bases.\nEffective incorporation of molecular knowledge from these modalities holds\nparamount significance in facilitating biomedical research. However, existing\nmultimodal molecular foundation models exhibit limitations in capturing\nintricate connections between molecular structures and texts, and more\nimportantly, none of them attempt to leverage a wealth of molecular expertise\nderived from knowledge graphs. In this study, we introduce MolFM, a multimodal\nmolecular foundation model designed to facilitate joint representation learning\nfrom molecular structures, biomedical texts, and knowledge graphs. We propose\ncross-modal attention between atoms of molecular structures, neighbors of\nmolecule entities and semantically related texts to facilitate cross-modal\ncomprehension. We provide theoretical analysis that our cross-modal\npre-training captures local and global molecular knowledge by minimizing the\ndistance in the feature space between different modalities of the same\nmolecule, as well as molecules sharing similar structures or functions. MolFM\nachieves state-of-the-art performance on various downstream tasks. On\ncross-modal retrieval, MolFM outperforms existing models with 12.13% and 5.04%\nabsolute gains under the zero-shot and fine-tuning settings, respectively.\nFurthermore, qualitative analysis showcases MolFM's implicit ability to provide\ngrounding from molecular substructures and knowledge graphs. Code and models\nare available on https://github.com/BioFM/OpenBioMed.\n","authors":["Yizhen Luo","Kai Yang","Massimo Hong","Xing Yi Liu","Zaiqing Nie"],"pdf_url":"https://arxiv.org/pdf/2307.09484v2.pdf","comment":"31 pages, 15 figures, and 15 tables"},{"id":"http://arxiv.org/abs/2307.11353v1","updated":"2023-07-21T05:05:55Z","published":"2023-07-21T05:05:55Z","title":"What can a Single Attention Layer Learn? A Study Through the Random\n Features Lens","summary":" Attention layers -- which map a sequence of inputs to a sequence of outputs\n-- are core building blocks of the Transformer architecture which has achieved\nsignificant breakthroughs in modern artificial intelligence. This paper\npresents a rigorous theoretical study on the learning and generalization of a\nsingle multi-head attention layer, with a sequence of key vectors and a\nseparate query vector as input. 
We consider the random feature setting where\nthe attention layer has a large number of heads, with randomly sampled frozen\nquery and key matrices, and trainable value matrices. We show that such a\nrandom-feature attention layer can express a broad class of target functions\nthat are permutation invariant to the key vectors. We further provide\nquantitative excess risk bounds for learning these target functions from finite\nsamples, using random feature attention with finitely many heads.\n Our results feature several implications unique to the attention structure\ncompared with existing random features theory for neural networks, such as (1)\nAdvantages in the sample complexity over standard two-layer random-feature\nnetworks; (2) Concrete and natural classes of functions that can be learned\nefficiently by a random-feature attention layer; and (3) The effect of the\nsampling distribution of the query-key weight matrix (the product of the query\nand key matrix), where Gaussian random weights with a non-zero mean result in\nbetter sample complexities over the zero-mean counterpart for learning certain\nnatural target functions. Experiments on simulated data corroborate our\ntheoretical findings and further illustrate the interplay between the sample\nsize and the complexity of the target function.\n","authors":["Hengyu Fu","Tianyu Guo","Yu Bai","Song Mei"],"pdf_url":"https://arxiv.org/pdf/2307.11353v1.pdf","comment":"41pages, 5 figures"},{"id":"http://arxiv.org/abs/2106.06134v4","updated":"2023-07-21T05:02:21Z","published":"2021-06-11T02:44:00Z","title":"Is Homophily a Necessity for Graph Neural Networks?","summary":" Graph neural networks (GNNs) have shown great prowess in learning\nrepresentations suitable for numerous graph-based machine learning tasks. When\napplied to semi-supervised node classification, GNNs are widely believed to\nwork well due to the homophily assumption (\"like attracts like\"), and fail to\ngeneralize to heterophilous graphs where dissimilar nodes connect. Recent works\ndesign new architectures to overcome such heterophily-related limitations,\nciting poor baseline performance and new architecture improvements on a few\nheterophilous graph benchmark datasets as evidence for this notion. In our\nexperiments, we empirically find that standard graph convolutional networks\n(GCNs) can actually achieve better performance than such carefully designed\nmethods on some commonly used heterophilous graphs. This motivates us to\nreconsider whether homophily is truly necessary for good GNN performance. We\nfind that this claim is not quite true, and in fact, GCNs can achieve strong\nperformance on heterophilous graphs under certain conditions. Our work\ncarefully characterizes these conditions, and provides supporting theoretical\nunderstanding and empirical observations. Finally, we examine existing\nheterophilous graphs benchmarks and reconcile how the GCN (under)performs on\nthem based on this understanding.\n","authors":["Yao Ma","Xiaorui Liu","Neil Shah","Jiliang Tang"],"pdf_url":"https://arxiv.org/pdf/2106.06134v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11352v1","updated":"2023-07-21T04:59:23Z","published":"2023-07-21T04:59:23Z","title":"Model-based Offline Reinforcement Learning with Count-based Conservatism","summary":" In this paper, we propose a model-based offline reinforcement learning method\nthat integrates count-based conservatism, named $\\texttt{Count-MORL}$. 
Our\nmethod utilizes the count estimates of state-action pairs to quantify model\nestimation error, marking the first algorithm of demonstrating the efficacy of\ncount-based conservatism in model-based offline deep RL to the best of our\nknowledge. For our proposed method, we first show that the estimation error is\ninversely proportional to the frequency of state-action pairs. Secondly, we\ndemonstrate that the learned policy under the count-based conservative model\noffers near-optimality performance guarantees. Through extensive numerical\nexperiments, we validate that $\\texttt{Count-MORL}$ with hash code\nimplementation significantly outperforms existing offline RL algorithms on the\nD4RL benchmark datasets. The code is accessible at\n$\\href{https://github.com/oh-lab/Count-MORL}{https://github.com/oh-lab/Count-MORL}$.\n","authors":["Byeongchan Kim","Min-hwan Oh"],"pdf_url":"https://arxiv.org/pdf/2307.11352v1.pdf","comment":"Accepted in ICML 2023"},{"id":"http://arxiv.org/abs/2307.11351v1","updated":"2023-07-21T04:55:03Z","published":"2023-07-21T04:55:03Z","title":"Bounded P-values in Parametric Programming-based Selective Inference","summary":" Selective inference (SI) has been actively studied as a promising framework\nfor statistical hypothesis testing for data-driven hypotheses. The basic idea\nof SI is to make inferences conditional on an event that a hypothesis is\nselected. In order to perform SI, this event must be characterized in a\ntraceable form. When selection event is too difficult to characterize,\nadditional conditions are introduced for tractability. This additional\nconditions often causes the loss of power, and this issue is referred to as\nover-conditioning. Parametric programming-based SI (PP-based SI) has been\nproposed as one way to address the over-conditioning issue. The main problem of\nPP-based SI is its high computational cost due to the need to exhaustively\nexplore the data space. In this study, we introduce a procedure to reduce the\ncomputational cost while guaranteeing the desired precision, by proposing a\nmethod to compute the upper and lower bounds of p-values. We also proposed\nthree types of search strategies that efficiently improve these bounds. We\ndemonstrate the effectiveness of the proposed method in hypothesis testing\nproblems for feature selection in linear models and attention region\nidentification in deep neural networks.\n","authors":["Tomohiro Shiraishi","Daiki Miwa","Vo Nguyen Le Duy","Ichiro Takeuchi"],"pdf_url":"https://arxiv.org/pdf/2307.11351v1.pdf","comment":"47pages, 14figures"},{"id":"http://arxiv.org/abs/2302.09738v5","updated":"2023-07-21T04:19:43Z","published":"2023-02-20T03:31:11Z","title":"Simplifying Momentum-based Positive-definite Submanifold Optimization\n with Applications to Deep Learning","summary":" Riemannian submanifold optimization with momentum is computationally\nchallenging because, to ensure that the iterates remain on the submanifold, we\noften need to solve difficult differential equations. Here, we simplify such\ndifficulties for a class of structured symmetric positive-definite matrices\nwith the affine-invariant metric. We do so by proposing a generalized version\nof the Riemannian normal coordinates that dynamically orthonormalizes the\nmetric and locally converts the problem into an unconstrained problem in the\nEuclidean space. 
We use our approach to simplify existing approaches for\nstructured covariances and develop matrix-inverse-free $2^\\text{nd}$-order\noptimizers for deep learning with low precision by using only matrix\nmultiplications. Code: https://github.com/yorkerlin/StructuredNGD-DL\n","authors":["Wu Lin","Valentin Duruisseaux","Melvin Leok","Frank Nielsen","Mohammad Emtiyaz Khan","Mark Schmidt"],"pdf_url":"https://arxiv.org/pdf/2302.09738v5.pdf","comment":"An updated version of the ICML 2023 paper. Updated the main text and\n added more numerical results for DNNs including a new baseline method and\n improving existing baseline methods"},{"id":"http://arxiv.org/abs/2307.11334v1","updated":"2023-07-21T03:43:07Z","published":"2023-07-21T03:43:07Z","title":"Improving Transferability of Adversarial Examples via Bayesian Attacks","summary":" This paper presents a substantial extension of our work published at ICLR.\nOur ICLR work advocated for enhancing transferability in adversarial examples\nby incorporating a Bayesian formulation into model parameters, which\neffectively emulates the ensemble of infinitely many deep neural networks,\nwhile, in this paper, we introduce a novel extension by incorporating the\nBayesian formulation into the model input as well, enabling the joint\ndiversification of both the model input and model parameters. Our empirical\nfindings demonstrate that: 1) the combination of Bayesian formulations for both\nthe model input and model parameters yields significant improvements in\ntransferability; 2) by introducing advanced approximations of the posterior\ndistribution over the model input, adversarial transferability achieves further\nenhancement, surpassing all state-of-the-arts when attacking without model\nfine-tuning. Moreover, we propose a principled approach to fine-tune model\nparameters in such an extended Bayesian formulation. The derived optimization\nobjective inherently encourages flat minima in the parameter space and input\nspace. Extensive experiments demonstrate that our method achieves a new\nstate-of-the-art on transfer-based attacks, improving the average success rate\non ImageNet and CIFAR-10 by 19.14% and 2.08%, respectively, when comparing with\nour ICLR basic Bayesian method. We will make our code publicly available.\n","authors":["Qizhang Li","Yiwen Guo","Xiaochen Yang","Wangmeng Zuo","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2307.11334v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11333v1","updated":"2023-07-21T03:41:55Z","published":"2023-07-21T03:41:55Z","title":"Demystifying Local and Global Fairness Trade-offs in Federated Learning\n Using Partial Information Decomposition","summary":" In this paper, we present an information-theoretic perspective to group\nfairness trade-offs in federated learning (FL) with respect to sensitive\nattributes, such as gender, race, etc. Existing works mostly focus on either\n\\emph{global fairness} (overall disparity of the model across all clients) or\n\\emph{local fairness} (disparity of the model at each individual client),\nwithout always considering their trade-offs. There is a lack of understanding\nof the interplay between global and local fairness in FL, and if and when one\nimplies the other. To address this gap, we leverage a body of work in\ninformation theory called partial information decomposition (PID) which first\nidentifies three sources of unfairness in FL, namely, \\emph{Unique Disparity},\n\\emph{Redundant Disparity}, and \\emph{Masked Disparity}. 
Using canonical\nexamples, we demonstrate how these three disparities contribute to global and\nlocal fairness. This decomposition helps us derive fundamental limits and\ntrade-offs between global or local fairness, particularly under data\nheterogeneity, as well as, derive conditions under which one implies the other.\nWe also present experimental results on benchmark datasets to support our\ntheoretical findings. This work offers a more nuanced understanding of the\nsources of disparity in FL that can inform the use of local disparity\nmitigation techniques, and their convergence and effectiveness when deployed in\npractice.\n","authors":["Faisal Hamman","Sanghamitra Dutta"],"pdf_url":"https://arxiv.org/pdf/2307.11333v1.pdf","comment":"Accepted at ICML Workshop on Federated Learning and Analytics in\n Practice"},{"id":"http://arxiv.org/abs/2307.11332v1","updated":"2023-07-21T03:40:53Z","published":"2023-07-21T03:40:53Z","title":"Beyond Convergence: Identifiability of Machine Learning and Deep\n Learning Models","summary":" Machine learning (ML) and deep learning models are extensively used for\nparameter optimization and regression problems. However, not all inverse\nproblems in ML are ``identifiable,'' indicating that model parameters may not\nbe uniquely determined from the available data and the data model's\ninput-output relationship. In this study, we investigate the notion of model\nparameter identifiability through a case study focused on parameter estimation\nfrom motion sensor data. Utilizing a bipedal-spring mass human walk dynamics\nmodel, we generate synthetic data representing diverse gait patterns and\nconditions. Employing a deep neural network, we attempt to estimate\nsubject-wise parameters, including mass, stiffness, and equilibrium leg length.\nThe results show that while certain parameters can be identified from the\nobservation data, others remain unidentifiable, highlighting that\nunidentifiability is an intrinsic limitation of the experimental setup,\nnecessitating a change in data collection and experimental scenarios. Beyond\nthis specific case study, the concept of identifiability has broader\nimplications in ML and deep learning. Addressing unidentifiability requires\nproven identifiable models (with theoretical support), multimodal data fusion\ntechniques, and advancements in model-based machine learning. Understanding and\nresolving unidentifiability challenges will lead to more reliable and accurate\napplications across diverse domains, transcending mere model convergence and\nenhancing the reliability of machine learning models.\n","authors":["Reza Sameni"],"pdf_url":"https://arxiv.org/pdf/2307.11332v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.10736v3","updated":"2023-07-21T03:39:05Z","published":"2022-03-21T05:00:54Z","title":"The activity-weight duality in feed forward neural networks: The\n geometric determinants of generalization","summary":" One of the fundamental problems in machine learning is generalization. In\nneural network models with a large number of weights (parameters), many\nsolutions can be found to fit the training data equally well. The key question\nis which solution can describe testing data not in the training set. Here, we\nreport the discovery of an exact duality (equivalence) between changes in\nactivities in a given layer of neurons and changes in weights that connect to\nthe next layer of neurons in a densely connected layer in any feed forward\nneural network. 
The activity-weight (A-W) duality allows us to map variations\nin inputs (data) to variations of the corresponding dual weights. By using this\nmapping, we show that the generalization loss can be decomposed into a sum of\ncontributions from different eigen-directions of the Hessian matrix of the loss\nfunction at the solution in weight space. The contribution from a given\neigen-direction is the product of two geometric factors (determinants): the\nsharpness of the loss landscape and the standard deviation of the dual weights,\nwhich is found to scale with the weight norm of the solution. Our results\nprovide an unified framework, which we used to reveal how different\nregularization schemes (weight decay, stochastic gradient descent with\ndifferent batch sizes and learning rates, dropout), training data size, and\nlabeling noise affect generalization performance by controlling either one or\nboth of these two geometric determinants for generalization. These insights can\nbe used to guide development of algorithms for finding more generalizable\nsolutions in overparametrized neural networks.\n","authors":["Yu Feng","Yuhai Tu"],"pdf_url":"https://arxiv.org/pdf/2203.10736v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11327v1","updated":"2023-07-21T03:24:55Z","published":"2023-07-21T03:24:55Z","title":"Systematic Adaptation of Communication-focused Machine Learning Models\n from Real to Virtual Environments for Human-Robot Collaboration","summary":" Virtual reality has proved to be useful in applications in several fields\nranging from gaming, medicine, and training to development of interfaces that\nenable human-robot collaboration. It empowers designers to explore applications\noutside of the constraints posed by the real world environment and develop\ninnovative solutions and experiences. Hand gestures recognition which has been\na topic of much research and subsequent commercialization in the real world has\nbeen possible because of the creation of large, labelled datasets. In order to\nutilize the power of natural and intuitive hand gestures in the virtual domain\nfor enabling embodied teleoperation of collaborative robots, similarly large\ndatasets must be created so as to keep the working interface easy to learn and\nflexible enough to add more gestures. Depending on the application, this may be\ncomputationally or economically prohibitive. Thus, the adaptation of trained\ndeep learning models that perform well in the real environment to the virtual\nmay be a solution to this challenge. This paper presents a systematic framework\nfor the real to virtual adaptation using limited size of virtual dataset along\nwith guidelines for creating a curated dataset. Finally, while hand gestures\nhave been considered as the communication mode, the guidelines and\nrecommendations presented are generic. These are applicable to other modes such\nas body poses and facial expressions which have large datasets available in the\nreal domain which must be adapted to the virtual one.\n","authors":["Debasmita Mukherjee","Ritwik Singhai","Homayoun Najjaran"],"pdf_url":"https://arxiv.org/pdf/2307.11327v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11325v1","updated":"2023-07-21T03:23:17Z","published":"2023-07-21T03:23:17Z","title":"Analysis of Elephant Movement in Sub-Saharan Africa: Ecological,\n Climatic, and Conservation Perspectives","summary":" The interaction between elephants and their environment has profound\nimplications for both ecology and conservation strategies. 
This study presents\nan analytical approach to decipher the intricate patterns of elephant movement\nin Sub-Saharan Africa, concentrating on key ecological drivers such as seasonal\nvariations and rainfall patterns. Despite the complexities surrounding these\ninfluential factors, our analysis provides a holistic view of elephant\nmigratory behavior in the context of the dynamic African landscape. Our\ncomprehensive approach enables us to predict the potential impact of these\necological determinants on elephant migration, a critical step in establishing\ninformed conservation strategies. This projection is particularly crucial given\nthe impacts of global climate change on seasonal and rainfall patterns, which\ncould substantially influence elephant movements in the future. The findings of\nour work aim to not only advance the understanding of movement ecology but also\nfoster a sustainable coexistence of humans and elephants in Sub-Saharan Africa.\nBy predicting potential elephant routes, our work can inform strategies to\nminimize human-elephant conflict, effectively manage land use, and enhance\nanti-poaching efforts. This research underscores the importance of integrating\nmovement ecology and climatic variables for effective wildlife management and\nconservation planning.\n","authors":["Matthew Hines","Gregory Glatzer","Shreya Ghosh","Prasenjit Mitra"],"pdf_url":"https://arxiv.org/pdf/2307.11325v1.pdf","comment":"11 pages, 17 figures, Accepted in ACM SIGCAS SIGCHI Conference on\n Computing and Sustainable Societies (COMPASS 2023)"},{"id":"http://arxiv.org/abs/2307.11317v1","updated":"2023-07-21T02:57:40Z","published":"2023-07-21T02:57:40Z","title":"XLDA: Linear Discriminant Analysis for Scaling Continual Learning to\n Extreme Classification at the Edge","summary":" Streaming Linear Discriminant Analysis (LDA) while proven in\nClass-incremental Learning deployments at the edge with limited classes (upto\n1000), has not been proven for deployment in extreme classification scenarios.\nIn this paper, we present: (a) XLDA, a framework for Class-IL in edge\ndeployment where LDA classifier is proven to be equivalent to FC layer\nincluding in extreme classification scenarios, and (b) optimizations to enable\nXLDA-based training and inference for edge deployment where there is a\nconstraint on available compute resources. We show up to 42x speed up using a\nbatched training approach and up to 5x inference speedup with nearest neighbor\nsearch on extreme datasets like AliProducts (50k classes) and Google Landmarks\nV2 (81k classes)\n","authors":["Karan Shah","Vishruth Veerendranath","Anushka Hebbar","Raghavendra Bhat"],"pdf_url":"https://arxiv.org/pdf/2307.11317v1.pdf","comment":"Submitted at ICML 2023: PAC-Bayes Interactive Learning Workshop"},{"id":"http://arxiv.org/abs/2307.10579v2","updated":"2023-07-21T02:54:25Z","published":"2023-07-20T04:45:59Z","title":"SecureBoost Hyperparameter Tuning via Multi-Objective Federated Learning","summary":" SecureBoost is a tree-boosting algorithm leveraging homomorphic encryption to\nprotect data privacy in vertical federated learning setting. It is widely used\nin fields such as finance and healthcare due to its interpretability,\neffectiveness, and privacy-preserving capability. However, SecureBoost suffers\nfrom high computational complexity and risk of label leakage. To harness the\nfull potential of SecureBoost, hyperparameters of SecureBoost should be\ncarefully chosen to strike an optimal balance between utility, efficiency, and\nprivacy. 
Existing methods either set hyperparameters empirically or\nheuristically, which are far from optimal. To fill this gap, we propose a\nConstrained Multi-Objective SecureBoost (CMOSB) algorithm to find Pareto\noptimal solutions that each solution is a set of hyperparameters achieving\noptimal tradeoff between utility loss, training cost, and privacy leakage. We\ndesign measurements of the three objectives. In particular, the privacy leakage\nis measured using our proposed instance clustering attack. Experimental results\ndemonstrate that the CMOSB yields not only hyperparameters superior to the\nbaseline but also optimal sets of hyperparameters that can support the flexible\nrequirements of FL participants.\n","authors":["Ziyao Ren","Yan Kang","Lixin Fan","Linghua Yang","Tao Fan","Yongxin Tong","Qiang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.10579v2.pdf","comment":"FL-ICAI'23"},{"id":"http://arxiv.org/abs/2307.11316v1","updated":"2023-07-21T02:51:41Z","published":"2023-07-21T02:51:41Z","title":"Making Pre-trained Language Models both Task-solvers and\n Self-calibrators","summary":" Pre-trained language models (PLMs) serve as backbones for various real-world\nsystems. For high-stake applications, it's equally essential to have reasonable\nconfidence estimations in predictions. While the vanilla confidence scores of\nPLMs can already be effectively utilized, PLMs consistently become\noverconfident in their wrong predictions, which is not desirable in practice.\nPrevious work shows that introducing an extra calibration task can mitigate\nthis issue. The basic idea involves acquiring additional data to train models\nin predicting the confidence of their initial predictions. However, it only\ndemonstrates the feasibility of this kind of method, assuming that there are\nabundant extra available samples for the introduced calibration task. In this\nwork, we consider the practical scenario that we need to effectively utilize\ntraining samples to make PLMs both task-solvers and self-calibrators. Three\nchallenges are presented, including limited training samples, data imbalance,\nand distribution shifts. We first conduct pilot experiments to quantify various\ndecisive factors in the calibration task. Based on the empirical analysis\nresults, we propose a training algorithm LM-TOAST to tackle the challenges.\nExperimental results show that LM-TOAST can effectively utilize the training\ndata to make PLMs have reasonable confidence estimations while maintaining the\noriginal task performance. Further, we consider three downstream applications,\nnamely selective classification, adversarial defense, and model cascading, to\nshow the practical usefulness of LM-TOAST. The code will be made public at\n\\url{https://github.com/Yangyi-Chen/LM-TOAST}.\n","authors":["Yangyi Chen","Xingyao Wang","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2307.11316v1.pdf","comment":"Accepted to Findings of ACL 2023"},{"id":"http://arxiv.org/abs/2307.11314v1","updated":"2023-07-21T02:47:03Z","published":"2023-07-21T02:47:03Z","title":"Neuromorphic Online Learning for Spatiotemporal Patterns with a\n Forward-only Timeline","summary":" Spiking neural networks (SNNs) are bio-plausible computing models with high\nenergy efficiency. The temporal dynamics of neurons and synapses enable them to\ndetect temporal patterns and generate sequences. 
While Backpropagation Through\nTime (BPTT) is traditionally used to train SNNs, it is not suitable for online\nlearning of embedded applications due to its high computation and memory cost\nas well as extended latency. Previous works have proposed online learning\nalgorithms, but they often utilize highly simplified spiking neuron models\nwithout synaptic dynamics and reset feedback, resulting in subpar performance.\nIn this work, we present Spatiotemporal Online Learning for Synaptic Adaptation\n(SOLSA), specifically designed for online learning of SNNs composed of Leaky\nIntegrate and Fire (LIF) neurons with exponentially decayed synapses and soft\nreset. The algorithm not only learns the synaptic weight but also adapts the\ntemporal filters associated to the synapses. Compared to the BPTT algorithm,\nSOLSA has much lower memory requirement and achieves a more balanced temporal\nworkload distribution. Moreover, SOLSA incorporates enhancement techniques such\nas scheduled weight update, early stop training and adaptive synapse filter,\nwhich speed up the convergence and enhance the learning performance. When\ncompared to other non-BPTT based SNN learning, SOLSA demonstrates an average\nlearning accuracy improvement of 14.2%. Furthermore, compared to BPTT, SOLSA\nachieves a 5% higher average learning accuracy with a 72% reduction in memory\ncost.\n","authors":["Zhenhang Zhang","Jingang Jin","Haowen Fang","Qinru Qiu"],"pdf_url":"https://arxiv.org/pdf/2307.11314v1.pdf","comment":"9 pages,8 figures"},{"id":"http://arxiv.org/abs/2303.17555v2","updated":"2023-07-21T02:20:39Z","published":"2023-03-16T21:02:09Z","title":"Factoring the Matrix of Domination: A Critical Review and Reimagination\n of Intersectionality in AI Fairness","summary":" Intersectionality is a critical framework that, through inquiry and praxis,\nallows us to examine how social inequalities persist through domains of\nstructure and discipline. Given AI fairness' raison d'etre of \"fairness\", we\nargue that adopting intersectionality as an analytical framework is pivotal to\neffectively operationalizing fairness. Through a critical review of how\nintersectionality is discussed in 30 papers from the AI fairness literature, we\ndeductively and inductively: 1) map how intersectionality tenets operate within\nthe AI fairness paradigm and 2) uncover gaps between the conceptualization and\noperationalization of intersectionality. We find that researchers\noverwhelmingly reduce intersectionality to optimizing for fairness metrics over\ndemographic subgroups. They also fail to discuss their social context and when\nmentioning power, they mostly situate it only within the AI pipeline. We: 3)\noutline and assess the implications of these gaps for critical inquiry and\npraxis, and 4) provide actionable recommendations for AI fairness researchers\nto engage with intersectionality in their work by grounding it in AI\nepistemology.\n","authors":["Anaelia Ovalle","Arjun Subramonian","Vagrant Gautam","Gilbert Gee","Kai-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2303.17555v2.pdf","comment":"To appear at AIES 2023"},{"id":"http://arxiv.org/abs/2302.04973v2","updated":"2023-07-21T01:40:31Z","published":"2023-02-09T23:25:28Z","title":"Invariant Slot Attention: Object Discovery with Slot-Centric Reference\n Frames","summary":" Automatically discovering composable abstractions from raw perceptual data is\na long-standing challenge in machine learning. 
Recent slot-based neural\nnetworks that learn about objects in a self-supervised manner have made\nexciting progress in this direction. However, they typically fall short at\nadequately capturing spatial symmetries present in the visual world, which\nleads to sample inefficiency, such as when entangling object appearance and\npose. In this paper, we present a simple yet highly effective method for\nincorporating spatial symmetries via slot-centric reference frames. We\nincorporate equivariance to per-object pose transformations into the attention\nand generation mechanism of Slot Attention by translating, scaling, and\nrotating position encodings. These changes result in little computational\noverhead, are easy to implement, and can result in large gains in terms of data\nefficiency and overall improvements to object discovery. We evaluate our method\non a wide range of synthetic object discovery benchmarks namely CLEVR,\nTetrominoes, CLEVRTex, Objects Room and MultiShapeNet, and show promising\nimprovements on the challenging real-world Waymo Open dataset.\n","authors":["Ondrej Biza","Sjoerd van Steenkiste","Mehdi S. M. Sajjadi","Gamaleldin F. Elsayed","Aravindh Mahendran","Thomas Kipf"],"pdf_url":"https://arxiv.org/pdf/2302.04973v2.pdf","comment":"Accepted at ICML 2023. Project page: https://invariantsa.github.io/"},{"id":"http://arxiv.org/abs/2307.11289v1","updated":"2023-07-21T01:18:02Z","published":"2023-07-21T01:18:02Z","title":"PI-VEGAN: Physics Informed Variational Embedding Generative Adversarial\n Networks for Stochastic Differential Equations","summary":" We present a new category of physics-informed neural networks called physics\ninformed variational embedding generative adversarial network (PI-VEGAN), that\neffectively tackles the forward, inverse, and mixed problems of stochastic\ndifferential equations. In these scenarios, the governing equations are known,\nbut only a limited number of sensor measurements of the system parameters are\navailable. We integrate the governing physical laws into PI-VEGAN with\nautomatic differentiation, while introducing a variational encoder for\napproximating the latent variables of the actual distribution of the\nmeasurements. These latent variables are integrated into the generator to\nfacilitate accurate learning of the characteristics of the stochastic partial\nequations. Our model consists of three components, namely the encoder,\ngenerator, and discriminator, each of which is updated alternatively employing\nthe stochastic gradient descent algorithm. We evaluate the effectiveness of\nPI-VEGAN in addressing forward, inverse, and mixed problems that require the\nconcurrent calculation of system parameters and solutions. Numerical results\ndemonstrate that the proposed method achieves satisfactory stability and\naccuracy in comparison with the previous physics-informed generative\nadversarial network (PI-WGAN).\n","authors":["Ruisong Gao","Yufeng Wang","Min Yang","Chuanjun Chen"],"pdf_url":"https://arxiv.org/pdf/2307.11289v1.pdf","comment":"23 pages"},{"id":"http://arxiv.org/abs/2307.11288v1","updated":"2023-07-21T01:17:31Z","published":"2023-07-21T01:17:31Z","title":"Kernelized Offline Contextual Dueling Bandits","summary":" Preference-based feedback is important for many applications where direct\nevaluation of a reward function is not feasible. 
A notable recent example\narises in reinforcement learning from human feedback on large language models.\nFor many of these applications, the cost of acquiring the human feedback can be\nsubstantial or even prohibitive. In this work, we take advantage of the fact\nthat often the agent can choose contexts at which to obtain human feedback in\norder to most efficiently identify a good policy, and introduce the offline\ncontextual dueling bandit setting. We give an upper-confidence-bound style\nalgorithm for this setting and prove a regret bound. We also give empirical\nconfirmation that this method outperforms a similar strategy that uses\nuniformly sampled contexts.\n","authors":["Viraj Mehta","Ojash Neopane","Vikramjeet Das","Sen Lin","Jeff Schneider","Willie Neiswanger"],"pdf_url":"https://arxiv.org/pdf/2307.11288v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.09702v5","updated":"2023-07-21T01:07:19Z","published":"2022-05-19T17:11:45Z","title":"Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency\n Analysis","summary":" Graph neural networks (GNNs) are among the most powerful tools in deep\nlearning. They routinely solve complex problems on unstructured networks, such\nas node classification, graph classification, or link prediction, with high\naccuracy. However, both inference and training of GNNs are complex, and they\nuniquely combine the features of irregular graph processing with dense and\nregular computations. This complexity makes it very challenging to execute GNNs\nefficiently on modern massively parallel architectures. To alleviate this, we\nfirst design a taxonomy of parallelism in GNNs, considering data and model\nparallelism, and different forms of pipelining. Then, we use this taxonomy to\ninvestigate the amount of parallelism in numerous GNN models, GNN-driven\nmachine learning tasks, software frameworks, or hardware accelerators. We use\nthe work-depth model, and we also assess communication volume and\nsynchronization. We specifically focus on the sparsity/density of the\nassociated tensors, in order to understand how to effectively apply techniques\nsuch as vectorization. We also formally analyze GNN pipelining, and we\ngeneralize the established Message-Passing class of GNN models to cover\narbitrary pipeline depths, facilitating future optimizations. Finally, we\ninvestigate different forms of asynchronicity, navigating the path for future\nasynchronous parallel GNN pipelines. The outcomes of our analysis are\nsynthesized in a set of insights that help to maximize GNN performance, and a\ncomprehensive list of challenges and opportunities for further research into\nefficient GNN computations. Our work will help to advance the design of future\nGNNs.\n","authors":["Maciej Besta","Torsten Hoefler"],"pdf_url":"https://arxiv.org/pdf/2205.09702v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11285v1","updated":"2023-07-21T01:04:52Z","published":"2023-07-21T01:04:52Z","title":"MAS: Towards Resource-Efficient Federated Multiple-Task Learning","summary":" Federated learning (FL) is an emerging distributed machine learning method\nthat empowers in-situ model training on decentralized edge devices. However,\nmultiple simultaneous FL tasks could overload resource-constrained devices. In\nthis work, we propose the first FL system to effectively coordinate and train\nmultiple simultaneous FL tasks. We first formalize the problem of training\nsimultaneous FL tasks. 
Then, we present our new approach, MAS (Merge and\nSplit), to optimize the performance of training multiple simultaneous FL tasks.\nMAS starts by merging FL tasks into an all-in-one FL task with a multi-task\narchitecture. After training for a few rounds, MAS splits the all-in-one FL\ntask into two or more FL tasks by using the affinities among tasks measured\nduring the all-in-one training. It then continues training each split of FL\ntasks based on model parameters from the all-in-one training. Extensive\nexperiments demonstrate that MAS outperforms other methods while reducing\ntraining time by 2x and reducing energy consumption by 40%. We hope this work\nwill inspire the community to further study and optimize training simultaneous\nFL tasks.\n","authors":["Weiming Zhuang","Yonggang Wen","Lingjuan Lyu","Shuai Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.11285v1.pdf","comment":"ICCV'23. arXiv admin note: substantial text overlap with\n arXiv:2207.04202"},{"id":"http://arxiv.org/abs/2307.11280v1","updated":"2023-07-21T00:49:07Z","published":"2023-07-21T00:49:07Z","title":"Epsilon*: Privacy Metric for Machine Learning Models","summary":" We introduce Epsilon*, a new privacy metric for measuring the privacy risk of\na single model instance prior to, during, or after deployment of privacy\nmitigation strategies. The metric does not require access to the training data\nsampling or model training algorithm. Epsilon* is a function of true positive\nand false positive rates in a hypothesis test used by an adversary in a\nmembership inference attack. We distinguish between quantifying the privacy\nloss of a trained model instance and quantifying the privacy loss of the\ntraining mechanism which produces this model instance. Existing approaches in\nthe privacy auditing literature provide lower bounds for the latter, while our\nmetric provides a lower bound for the former by relying on an\n(${\\epsilon}$,${\\delta}$)-type of quantification of the privacy of the trained\nmodel instance. We establish a relationship between these lower bounds and show\nhow to implement Epsilon* to avoid numerical and noise amplification\ninstability. We further show in experiments on benchmark public data sets that\nEpsilon* is sensitive to privacy risk mitigation by training with differential\nprivacy (DP), where the value of Epsilon* is reduced by up to 800% compared to\nthe Epsilon* values of non-DP trained baseline models. This metric allows\nprivacy auditors to be independent of model owners, and enables all\ndecision-makers to visualize the privacy-utility landscape to make informed\ndecisions regarding the trade-offs between model privacy and utility.\n","authors":["Diana M. Negoescu","Humberto Gonzalez","Saad Eddin Al Orjany","Jilei Yang","Yuliia Lut","Rahul Tandra","Xiaowen Zhang","Xinyi Zheng","Zach Douglas","Vidita Nolkha","Parvez Ahammad","Gennady Samorodnitsky"],"pdf_url":"https://arxiv.org/pdf/2307.11280v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11274v1","updated":"2023-07-21T00:15:56Z","published":"2023-07-21T00:15:56Z","title":"Screening Mammography Breast Cancer Detection","summary":" Breast cancer is a leading cause of cancer-related deaths, but current\nprograms are expensive and prone to false positives, leading to unnecessary\nfollow-up and patient anxiety. This paper proposes a solution to automated\nbreast cancer detection, to improve the efficiency and accuracy of screening\nprograms. 
Different methodologies were tested against the RSNA dataset of\nradiographic breast images of roughly 20,000 female patients and yielded an\naverage validation case pF1 score of 0.56 across methods.\n","authors":["Debajyoti Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2307.11274v1.pdf","comment":"Released @ Apr 2023. For associated project files, see\n https://github.com/chakrabortyde/rsna-breast-cancer"},{"id":"http://arxiv.org/abs/2305.13503v2","updated":"2023-07-21T00:15:28Z","published":"2023-05-22T21:39:38Z","title":"Asynchronous Multi-Model Dynamic Federated Learning over Wireless\n Networks: Theory, Modeling, and Optimization","summary":" Federated learning (FL) has emerged as a key technique for distributed\nmachine learning (ML). Most literature on FL has focused on ML model training\nfor (i) a single task/model, with (ii) a synchronous scheme for uplink/downlink\ntransfer of model parameters, and (iii) a static data distribution setting\nacross devices. These assumptions are often not well representative of\nconditions encountered in practical FL environments. To address this, we\ndevelop DMA-FL, which considers dynamic FL with multiple downstream tasks to be\ntrained over an asynchronous model transmission architecture. We first\ncharacterize the convergence of ML model training under DMA-FL via introducing\na family of scheduling tensors and rectangular functions to capture the\nscheduling of devices. Our convergence analysis sheds light on the impact of\nresource allocation, device scheduling, and individual model states on the\nperformance of ML models. We then formulate a non-convex mixed integer\noptimization problem for jointly configuring the resource allocation and device\nscheduling to strike an efficient trade-off between energy consumption and ML\nperformance. We develop a solution methodology employing successive convex\napproximations with convergence guarantee to a stationary point. Through\nnumerical simulations, we reveal the advantages of DMA-FL in terms of model\nperformance and network resource savings.\n","authors":["Zhan-Lun Chang","Seyyedali Hosseinalipour","Mung Chiang","Christopher G. Brinton"],"pdf_url":"https://arxiv.org/pdf/2305.13503v2.pdf","comment":"Submission to IEEE Transactions on Cognitive Communications and\n Networking"}],"Multimedia":[{"id":"http://arxiv.org/abs/2304.14133v2","updated":"2023-07-21T12:06:17Z","published":"2023-04-27T12:28:29Z","title":"VERITE: A Robust Benchmark for Multimodal Misinformation Detection\n Accounting for Unimodal Bias","summary":" Multimedia content has become ubiquitous on social media platforms, leading\nto the rise of multimodal misinformation (MM) and the urgent need for effective\nstrategies to detect and prevent its spread. In recent years, the challenge of\nmultimodal misinformation detection (MMD) has garnered significant attention by\nresearchers and has mainly involved the creation of annotated, weakly\nannotated, or synthetically generated training datasets, along with the\ndevelopment of various deep learning MMD models. However, the problem of\nunimodal bias in MMD benchmarks -- where biased or unimodal methods outperform\ntheir multimodal counterparts on an inherently multimodal task -- has been\noverlooked. In this study, we systematically investigate and identify the\npresence of unimodal bias in widely-used MMD benchmarks (VMU-Twitter, COSMOS),\nraising concerns about their suitability for reliable evaluation. 
To address\nthis issue, we introduce the \"VERification of Image-TExtpairs\" (VERITE)\nbenchmark for MMD which incorporates real-world data, excludes \"asymmetric\nmultimodal misinformation\" and utilizes \"modality balancing\". We conduct an\nextensive comparative study with a Transformer-based architecture that shows\nthe ability of VERITE to effectively address unimodal bias, rendering it a\nrobust evaluation framework for MMD. Furthermore, we introduce a new method --\ntermed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating\nrealistic synthetic training data that preserve crossmodal relations between\nlegitimate images and false human-written captions. By leveraging CHASMA in the\ntraining process, we observe consistent and notable improvements in predictive\nperformance on VERITE; with a 9.2% increase in accuracy. We release our code\nat: https://github.com/stevejpapad/image-text-verification\n","authors":["Stefanos-Iordanis Papadopoulos","Christos Koutlis","Symeon Papadopoulos","Panagiotis C. Petrantonakis"],"pdf_url":"https://arxiv.org/pdf/2304.14133v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09382v3","updated":"2023-07-21T07:59:06Z","published":"2023-06-15T12:59:04Z","title":"Sound Demixing Challenge 2023 Music Demixing Track Technical Report:\n TFC-TDF-UNet v3","summary":" In this report, we present our award-winning solutions for the Music Demixing\nTrack of Sound Demixing Challenge 2023. First, we propose TFC-TDF-UNet v3, a\ntime-efficient music source separation model that achieves state-of-the-art\nresults on the MUSDB benchmark. We then give full details regarding our\nsolutions for each Leaderboard, including a loss masking approach for\nnoise-robust training. Code for reproducing model training and final\nsubmissions is available at github.com/kuielab/sdx23.\n","authors":["Minseok Kim","Jun Hyung Lee","Soonyoung Jung"],"pdf_url":"https://arxiv.org/pdf/2306.09382v3.pdf","comment":"5 pages, 4 tables"},{"id":"http://arxiv.org/abs/2301.12688v3","updated":"2023-07-21T18:13:10Z","published":"2023-01-30T06:37:35Z","title":"Dynamic Storyboard Generation in an Engine-based Virtual Environment for\n Video Production","summary":" Amateurs working on mini-films and short-form videos usually spend lots of\ntime and effort on the multi-round complicated process of setting and adjusting\nscenes, plots, and cameras to deliver satisfying video shots. We present\nVirtual Dynamic Storyboard (VDS) to allow users storyboarding shots in virtual\nenvironments, where the filming staff can easily test the settings of shots\nbefore the actual filming. VDS runs on a \"propose-simulate-discriminate\" mode:\nGiven a formatted story script and a camera script as input, it generates\nseveral character animation and camera movement proposals following predefined\nstory and cinematic rules to allow an off-the-shelf simulation engine to render\nvideos. To pick up the top-quality dynamic storyboard from the candidates, we\nequip it with a shot ranking discriminator based on shot quality criteria\nlearned from professional manual-created data. 
VDS is comprehensively validated\nvia extensive experiments and user studies, demonstrating its efficiency,\neffectiveness, and great potential in assisting amateur video production.\n","authors":["Anyi Rao","Xuekun Jiang","Yuwei Guo","Linning Xu","Lei Yang","Libiao Jin","Dahua Lin","Bo Dai"],"pdf_url":"https://arxiv.org/pdf/2301.12688v3.pdf","comment":"Project page: https://virtualfilmstudio.github.io/"}]},"2023-07-24T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2307.12981v1","updated":"2023-07-24T17:59:02Z","published":"2023-07-24T17:59:02Z","title":"3D-LLM: Injecting the 3D World into Large Language Models","summary":" Large language models (LLMs) and Vision-Language Models (VLMs) have been\nproven to excel at multiple tasks, such as commonsense reasoning. Powerful as\nthese models can be, they are not grounded in the 3D physical world, which\ninvolves richer concepts such as spatial relationships, affordances, physics,\nlayout, and so on. In this work, we propose to inject the 3D world into large\nlanguage models and introduce a whole new family of 3D-LLMs. Specifically,\n3D-LLMs can take 3D point clouds and their features as input and perform a\ndiverse set of 3D-related tasks, including captioning, dense captioning, 3D\nquestion answering, task decomposition, 3D grounding, 3D-assisted dialog,\nnavigation, and so on. Using three types of prompting mechanisms that we\ndesign, we are able to collect over 300k 3D-language data covering these tasks.\nTo efficiently train 3D-LLMs, we first utilize a 3D feature extractor that\nobtains 3D features from rendered multi- view images. Then, we use 2D VLMs as\nour backbones to train our 3D-LLMs. By introducing a 3D localization mechanism,\n3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show\nthat our model outperforms state-of-the-art baselines by a large margin (e.g.,\nthe BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore,\nexperiments on our held-in datasets for 3D captioning, task composition, and\n3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative\nexamples also show that our model could perform more tasks beyond the scope of\nexisting LLMs and VLMs. Project Page: : https://vis-www.cs.umass.edu/3dllm/.\n","authors":["Yining Hong","Haoyu Zhen","Peihao Chen","Shuhong Zheng","Yilun Du","Zhenfang Chen","Chuang Gan"],"pdf_url":"https://arxiv.org/pdf/2307.12981v1.pdf","comment":"Project Page: : https://vis-www.cs.umass.edu/3dllm/"},{"id":"http://arxiv.org/abs/2307.12976v1","updated":"2023-07-24T17:52:46Z","published":"2023-07-24T17:52:46Z","title":"Evaluating the Ripple Effects of Knowledge Editing in Language Models","summary":" Modern language models capture a large body of factual knowledge. However,\nsome facts can be incorrectly induced or become obsolete over time, resulting\nin factually incorrect generations. This has led to the development of various\nediting methods that allow updating facts encoded by the model. Evaluation of\nthese methods has primarily focused on testing whether an individual fact has\nbeen successfully injected, and if similar predictions for other subjects have\nnot changed. Here we argue that such evaluation is limited, since injecting one\nfact (e.g. ``Jack Depp is the son of Johnny Depp'') introduces a ``ripple\neffect'' in the form of additional facts that the model needs to update\n(e.g.``Jack Depp is the sibling of Lily-Rose Depp''). 
To address this issue, we\npropose a novel set of evaluation criteria that consider the implications of an\nedit on related facts. Using these criteria, we then construct \\ripple{}, a\ndiagnostic benchmark of 5K factual edits, capturing a variety of types of\nripple effects. We evaluate prominent editing methods on \\ripple{}, showing\nthat current methods fail to introduce consistent changes in the model's\nknowledge. In addition, we find that a simple in-context editing baseline\nobtains the best scores on our benchmark, suggesting a promising research\ndirection for model editing.\n","authors":["Roi Cohen","Eden Biran","Ori Yoran","Amir Globerson","Mor Geva"],"pdf_url":"https://arxiv.org/pdf/2307.12976v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12973v1","updated":"2023-07-24T17:49:31Z","published":"2023-07-24T17:49:31Z","title":"Leveraging Label Variation in Large Language Models for Zero-Shot Text\n Classification","summary":" The zero-shot learning capabilities of large language models (LLMs) make them\nideal for text classification without annotation or supervised training. Many\nstudies have shown impressive results across multiple tasks. While tasks, data,\nand results differ widely, their similarities to human annotation can aid us in\ntackling new tasks with minimal expenses. We evaluate using 5 state-of-the-art\nLLMs as \"annotators\" on 5 different tasks (age, gender, topic, sentiment\nprediction, and hate speech detection), across 4 languages: English, French,\nGerman, and Spanish. No single model excels at all tasks, across languages, or\nacross all labels within a task. However, aggregation techniques designed for\nhuman annotators perform substantially better than any one individual model.\nOverall, though, LLMs do not rival even simple supervised models, so they do\nnot (yet) replace the need for human annotation. We also discuss the tradeoffs\nbetween speed, accuracy, cost, and bias when it comes to aggregated model\nlabeling versus human annotation.\n","authors":["Flor Miriam Plaza-del-Arco","Debora Nozza","Dirk Hovy"],"pdf_url":"https://arxiv.org/pdf/2307.12973v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12966v1","updated":"2023-07-24T17:44:58Z","published":"2023-07-24T17:44:58Z","title":"Aligning Large Language Models with Human: A Survey","summary":" Large Language Models (LLMs) trained on extensive textual corpora have\nemerged as leading solutions for a broad array of Natural Language Processing\n(NLP) tasks. Despite their notable performance, these models are prone to\ncertain limitations such as misunderstanding human instructions, generating\npotentially biased content, or factually incorrect (hallucinated) information.\nHence, aligning LLMs with human expectations has become an active area of\ninterest within the research community. This survey presents a comprehensive\noverview of these alignment technologies, including the following aspects. (1)\nData collection: the methods for effectively collecting high-quality\ninstructions for LLM alignment, including the use of NLP benchmarks, human\nannotations, and leveraging strong LLMs. (2) Training methodologies: a detailed\nreview of the prevailing training methods employed for LLM alignment. Our\nexploration encompasses Supervised Fine-tuning, both Online and Offline human\npreference training, along with parameter-efficient training mechanisms. 
(3)\nModel Evaluation: the methods for evaluating the effectiveness of these\nhuman-aligned LLMs, presenting a multifaceted approach towards their\nassessment. In conclusion, we collate and distill our findings, shedding light\non several promising future research avenues in the field. This survey,\ntherefore, serves as a valuable resource for anyone invested in understanding\nand advancing the alignment of LLMs to better suit human-oriented tasks and\nexpectations. An associated GitHub link collecting the latest papers is\navailable at https://github.com/GaryYufei/AlignLLMHumanSurvey.\n","authors":["Yufei Wang","Wanjun Zhong","Liangyou Li","Fei Mi","Xingshan Zeng","Wenyong Huang","Lifeng Shang","Xin Jiang","Qun Liu"],"pdf_url":"https://arxiv.org/pdf/2307.12966v1.pdf","comment":"work in progress"},{"id":"http://arxiv.org/abs/2303.04245v2","updated":"2023-07-24T17:29:04Z","published":"2023-03-07T21:42:17Z","title":"How Do Transformers Learn Topic Structure: Towards a Mechanistic\n Understanding","summary":" While the successes of transformers across many domains are indisputable,\naccurate understanding of the learning mechanics is still largely lacking.\nTheir capabilities have been probed on benchmarks which include a variety of\nstructured and reasoning tasks -- but mathematical understanding is lagging\nsubstantially behind. Recent lines of work have begun studying representational\naspects of this question: that is, the size/depth/complexity of attention-based\nnetworks to perform certain tasks. However, there is no guarantee the learning\ndynamics will converge to the constructions proposed. In our paper, we provide\nfine-grained mechanistic understanding of how transformers learn \"semantic\nstructure\", understood as capturing co-occurrence structure of words.\nPrecisely, we show, through a combination of mathematical analysis and\nexperiments on Wikipedia data and synthetic data modeled by Latent Dirichlet\nAllocation (LDA), that the embedding layer and the self-attention layer encode\nthe topical structure. In the former case, this manifests as higher average\ninner product of embeddings between same-topic words. In the latter, it\nmanifests as higher average pairwise attention between same-topic words. The\nmathematical results involve several assumptions to make the analysis\ntractable, which we verify on data, and might be of independent interest as\nwell.\n","authors":["Yuchen Li","Yuanzhi Li","Andrej Risteski"],"pdf_url":"https://arxiv.org/pdf/2303.04245v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12950v1","updated":"2023-07-24T17:23:22Z","published":"2023-07-24T17:23:22Z","title":"RLCD: Reinforcement Learning from Contrast Distillation for Language\n Model Alignment","summary":" We propose Reinforcement Learning from Contrast Distillation (RLCD), a method\nfor aligning language models to follow natural language principles without\nusing human feedback. RLCD trains a preference model using simulated preference\npairs that contain both a high-quality and low-quality example, generated using\ncontrasting positive and negative prompts. 
The preference model is then used to\nimprove a base unaligned language model via reinforcement learning.\nEmpirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context\ndistillation (Huang et al., 2022) baselines across three diverse alignment\ntasks--harmlessness, helpfulness, and story outline generation--and on both 7B\nand 30B model scales for preference data simulation.\n","authors":["Kevin Yang","Dan Klein","Asli Celikyilmaz","Nanyun Peng","Yuandong Tian"],"pdf_url":"https://arxiv.org/pdf/2307.12950v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12949v1","updated":"2023-07-24T17:22:04Z","published":"2023-07-24T17:22:04Z","title":"Boosting Punctuation Restoration with Data Generation and Reinforcement\n Learning","summary":" Punctuation restoration is an important task in automatic speech recognition\n(ASR) which aim to restore the syntactic structure of generated ASR texts to\nimprove readability. While punctuated texts are abundant from written\ndocuments, the discrepancy between written punctuated texts and ASR texts\nlimits the usability of written texts in training punctuation restoration\nsystems for ASR texts. This paper proposes a reinforcement learning method to\nexploit in-topic written texts and recent advances in large pre-trained\ngenerative language models to bridge this gap. The experiments show that our\nmethod achieves state-of-the-art performance on the ASR test set on two\nbenchmark datasets for punctuation restoration.\n","authors":["Viet Dac Lai","Abel Salinas","Hao Tan","Trung Bui","Quan Tran","Seunghyun Yoon","Hanieh Deilamsalehy","Franck Dernoncourt","Thien Huu Nguyen"],"pdf_url":"https://arxiv.org/pdf/2307.12949v1.pdf","comment":"Accepted at INTERSPEECH 2023, 6 pages"},{"id":"http://arxiv.org/abs/2307.12935v1","updated":"2023-07-24T16:55:37Z","published":"2023-07-24T16:55:37Z","title":"Rule By Example: Harnessing Logical Rules for Explainable Hate Speech\n Detection","summary":" Classic approaches to content moderation typically apply a rule-based\nheuristic approach to flag content. While rules are easily customizable and\nintuitive for humans to interpret, they are inherently fragile and lack the\nflexibility or robustness needed to moderate the vast amount of undesirable\ncontent found online today. Recent advances in deep learning have demonstrated\nthe promise of using highly effective deep neural models to overcome these\nchallenges. However, despite the improved performance, these data-driven models\nlack transparency and explainability, often leading to mistrust from everyday\nusers and a lack of adoption by many platforms. In this paper, we present Rule\nBy Example (RBE): a novel exemplar-based contrastive learning approach for\nlearning from logical rules for the task of textual content moderation. RBE is\ncapable of providing rule-grounded predictions, allowing for more explainable\nand customizable predictions compared to typical deep learning-based\napproaches. We demonstrate that our approach is capable of learning rich rule\nembedding representations using only a few data examples. 
Experimental results\non 3 popular hate speech classification datasets show that RBE is able to\noutperform state-of-the-art deep learning classifiers as well as the use of\nrules in both supervised and unsupervised settings while providing explainable\nmodel predictions via rule-grounding.\n","authors":["Christopher Clarke","Matthew Hall","Gaurav Mittal","Ye Yu","Sandra Sajeev","Jason Mars","Mei Chen"],"pdf_url":"https://arxiv.org/pdf/2307.12935v1.pdf","comment":"ACL 2023 Main Conference"},{"id":"http://arxiv.org/abs/2307.12896v1","updated":"2023-07-24T15:44:23Z","published":"2023-07-24T15:44:23Z","title":"Corrections of Zipf's and Heaps' Laws Derived from Hapax Rate Models","summary":" The article introduces corrections to Zipf's and Heaps' laws based on\nsystematic models of the hapax rate. The derivation rests on two assumptions:\nThe first one is the standard urn model which predicts that marginal frequency\ndistributions for shorter texts look as if word tokens were sampled blindly\nfrom a given longer text. The second assumption posits that the rate of hapaxes\nis a simple function of the text size. Four such functions are discussed: the\nconstant model, the Davis model, the linear model, and the logistic model. It\nis shown that the logistic model yields the best fit.\n","authors":["Łukasz Dębowski"],"pdf_url":"https://arxiv.org/pdf/2307.12896v1.pdf","comment":"41 pages, 7 figures, 3 tables"},{"id":"http://arxiv.org/abs/2304.08649v3","updated":"2023-07-24T15:33:25Z","published":"2023-04-17T22:53:54Z","title":"Classification of US Supreme Court Cases using BERT-Based Techniques","summary":" Models based on bidirectional encoder representations from transformers\n(BERT) produce state of the art (SOTA) results on many natural language\nprocessing (NLP) tasks such as named entity recognition (NER), part-of-speech\n(POS) tagging etc. An interesting phenomenon occurs when classifying long\ndocuments such as those from the US supreme court where BERT-based models can\nbe considered difficult to use on a first-pass or out-of-the-box basis. In this\npaper, we experiment with several BERT-based classification techniques for US\nsupreme court decisions or supreme court database (SCDB) and compare them with\nthe previous SOTA results. We then compare our results specifically with SOTA\nmodels for long documents. We compare our results for two classification tasks:\n(1) a broad classification task with 15 categories and (2) a fine-grained\nclassification task with 279 categories. Our best result produces an accuracy\nof 80\\% on the 15 broad categories and 60\\% on the fine-grained 279 categories\nwhich marks an improvement of 8\\% and 28\\% respectively from previously\nreported SOTA results.\n","authors":["Shubham Vatsal","Adam Meyers","John E. Ortega"],"pdf_url":"https://arxiv.org/pdf/2304.08649v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10490v3","updated":"2023-07-24T15:24:17Z","published":"2023-07-19T23:03:20Z","title":"(Ab)using Images and Sounds for Indirect Instruction Injection in\n Multi-Modal LLMs","summary":" We demonstrate how images and sounds can be used for indirect prompt and\ninstruction injection in multi-modal LLMs. An attacker generates an adversarial\nperturbation corresponding to the prompt and blends it into an image or audio\nrecording. 
When the user asks the (unmodified, benign) model about the\nperturbed image or audio, the perturbation steers the model to output the\nattacker-chosen text and/or make the subsequent dialog follow the attacker's\ninstruction. We illustrate this attack with several proof-of-concept examples\ntargeting LLaVa and PandaGPT.\n","authors":["Eugene Bagdasaryan","Tsung-Yin Hsieh","Ben Nassi","Vitaly Shmatikov"],"pdf_url":"https://arxiv.org/pdf/2307.10490v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12856v1","updated":"2023-07-24T14:56:30Z","published":"2023-07-24T14:56:30Z","title":"A Real-World WebAgent with Planning, Long Context Understanding, and\n Program Synthesis","summary":" Pre-trained large language models (LLMs) have recently achieved better\ngeneralization and sample efficiency in autonomous web navigation. However, the\nperformance on real-world websites has still suffered from (1) open domainness,\n(2) limited context length, and (3) lack of inductive bias on HTML. We\nintroduce WebAgent, an LLM-driven agent that can complete the tasks on real\nwebsites following natural language instructions. WebAgent plans ahead by\ndecomposing instructions into canonical sub-instructions, summarizes long HTML\ndocuments into task-relevant snippets, and acts on websites via generated\nPython programs from those. We design WebAgent with Flan-U-PaLM, for grounded\ncode generation, and HTML-T5, new pre-trained LLMs for long HTML documents\nusing local and global attention mechanisms and a mixture of long-span\ndenoising objectives, for planning and summarization. We empirically\ndemonstrate that our recipe improves the success on a real website by over 50%,\nand that HTML-T5 is the best model to solve HTML-based tasks; achieving 14.9%\nhigher success rate than prior SoTA on the MiniWoB web navigation benchmark and\nbetter accuracy on offline task planning evaluation.\n","authors":["Izzeddin Gur","Hiroki Furuta","Austin Huang","Mustafa Safdari","Yutaka Matsuo","Douglas Eck","Aleksandra Faust"],"pdf_url":"https://arxiv.org/pdf/2307.12856v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12835v1","updated":"2023-07-24T14:33:49Z","published":"2023-07-24T14:33:49Z","title":"Joint Dropout: Improving Generalizability in Low-Resource Neural Machine\n Translation through Phrase Pair Variables","summary":" Despite the tremendous success of Neural Machine Translation (NMT), its\nperformance on low-resource language pairs still remains subpar, partly due to\nthe limited ability to handle previously unseen inputs, i.e., generalization.\nIn this paper, we propose a method called Joint Dropout, that addresses the\nchallenge of low-resource neural machine translation by substituting phrases\nwith variables, resulting in significant enhancement of compositionality, which\nis a key aspect of generalization. We observe a substantial improvement in\ntranslation quality for language pairs with minimal resources, as seen in BLEU\nand Direct Assessment scores. 
Furthermore, we conduct an error analysis, and\nfind Joint Dropout to also enhance generalizability of low-resource NMT in\nterms of robustness and adaptability across different domains\n","authors":["Ali Araabi","Vlad Niculae","Christof Monz"],"pdf_url":"https://arxiv.org/pdf/2307.12835v1.pdf","comment":"Accepted at MT Summit 2023"},{"id":"http://arxiv.org/abs/2307.12803v1","updated":"2023-07-24T13:54:37Z","published":"2023-07-24T13:54:37Z","title":"Guidance in Radiology Report Summarization: An Empirical Evaluation and\n Error Analysis","summary":" Automatically summarizing radiology reports into a concise impression can\nreduce the manual burden of clinicians and improve the consistency of\nreporting. Previous work aimed to enhance content selection and factuality\nthrough guided abstractive summarization. However, two key issues persist.\nFirst, current methods heavily rely on domain-specific resources to extract the\nguidance signal, limiting their transferability to domains and languages where\nthose resources are unavailable. Second, while automatic metrics like ROUGE\nshow progress, we lack a good understanding of the errors and failure modes in\nthis task. To bridge these gaps, we first propose a domain-agnostic guidance\nsignal in form of variable-length extractive summaries. Our empirical results\non two English benchmarks demonstrate that this guidance signal improves upon\nunguided summarization while being competitive with domain-specific methods.\nAdditionally, we run an expert evaluation of four systems according to a\ntaxonomy of 11 fine-grained errors. We find that the most pressing differences\nbetween automatic summaries and those of radiologists relate to content\nselection including omissions (up to 52%) and additions (up to 57%). We\nhypothesize that latent reporting factors and corpus-level inconsistencies may\nlimit models to reliably learn content selection from the available data,\npresenting promising directions for future work.\n","authors":["Jan Trienes","Paul Youssef","Jörg Schlötterer","Christin Seifert"],"pdf_url":"https://arxiv.org/pdf/2307.12803v1.pdf","comment":"Accepted at INLG2023"},{"id":"http://arxiv.org/abs/2307.12798v1","updated":"2023-07-24T13:51:19Z","published":"2023-07-24T13:51:19Z","title":"RRAML: Reinforced Retrieval Augmented Machine Learning","summary":" The emergence of large language models (LLMs) has revolutionized machine\nlearning and related fields, showcasing remarkable abilities in comprehending,\ngenerating, and manipulating human language. However, their conventional usage\nthrough API-based text prompt submissions imposes certain limitations in terms\nof context constraints and external source availability. To address these\nchallenges, we propose a novel framework called Reinforced Retrieval Augmented\nMachine Learning (RRAML). RRAML integrates the reasoning capabilities of LLMs\nwith supporting information retrieved by a purpose-built retriever from a vast\nuser-provided database. By leveraging recent advancements in reinforcement\nlearning, our method effectively addresses several critical challenges.\nFirstly, it circumvents the need for accessing LLM gradients. Secondly, our\nmethod alleviates the burden of retraining LLMs for specific tasks, as it is\noften impractical or impossible due to restricted access to the model and the\ncomputational intensity involved. 
Additionally we seamlessly link the\nretriever's task with the reasoner, mitigating hallucinations and reducing\nirrelevant, and potentially damaging retrieved documents. We believe that the\nresearch agenda outlined in this paper has the potential to profoundly impact\nthe field of AI, democratizing access to and utilization of LLMs for a wide\nrange of entities.\n","authors":["Andrea Bacciu","Florin Cocunasu","Federico Siciliano","Fabrizio Silvestri","Nicola Tonellotto","Giovanni Trappolini"],"pdf_url":"https://arxiv.org/pdf/2307.12798v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2011.12662v4","updated":"2023-07-24T13:22:58Z","published":"2020-11-25T11:44:12Z","title":"XTQA: Span-Level Explanations of the Textbook Question Answering","summary":" Textbook Question Answering (TQA) is a task that one should answer a\ndiagram/non-diagram question given a large multi-modal context consisting of\nabundant essays and diagrams. We argue that the explainability of this task\nshould place students as a key aspect to be considered. To address this issue,\nwe devise a novel architecture towards span-level eXplanations of the TQA\n(XTQA) based on our proposed coarse-to-fine grained algorithm, which can\nprovide not only the answers but also the span-level evidences to choose them\nfor students. This algorithm first coarsely chooses top $M$ paragraphs relevant\nto questions using the TF-IDF method, and then chooses top $K$ evidence spans\nfinely from all candidate spans within these paragraphs by computing the\ninformation gain of each span to questions. Experimental results shows that\nXTQA significantly improves the state-of-the-art performance compared with\nbaselines. The source code is available at\nhttps://github.com/keep-smile-001/opentqa\n","authors":["Jie Ma","Qi Chai","Jun Liu","Qingyu Yin","Pinghui Wang","Qinghua Zheng"],"pdf_url":"https://arxiv.org/pdf/2011.12662v4.pdf","comment":"Accepted by IEEE TNNLS"},{"id":"http://arxiv.org/abs/2307.12759v1","updated":"2023-07-24T13:04:21Z","published":"2023-07-24T13:04:21Z","title":"Code-Switched Urdu ASR for Noisy Telephonic Environment using Data\n Centric Approach with Hybrid HMM and CNN-TDNN","summary":" Call Centers have huge amount of audio data which can be used for achieving\nvaluable business insights and transcription of phone calls is manually tedious\ntask. An effective Automated Speech Recognition system can accurately\ntranscribe these calls for easy search through call history for specific\ncontext and content allowing automatic call monitoring, improving QoS through\nkeyword search and sentiment analysis. ASR for Call Center requires more\nrobustness as telephonic environment are generally noisy. Moreover, there are\nmany low-resourced languages that are on verge of extinction which can be\npreserved with help of Automatic Speech Recognition Technology. Urdu is the\n$10^{th}$ most widely spoken language in the world, with 231,295,440 worldwide\nstill remains a resource constrained language in ASR. Regional call-center\nconversations operate in local language, with a mix of English numbers and\ntechnical terms generally causing a \"code-switching\" problem. Hence, this paper\ndescribes an implementation framework of a resource efficient Automatic Speech\nRecognition/ Speech to Text System in a noisy call-center environment using\nChain Hybrid HMM and CNN-TDNN for Code-Switched Urdu Language. Using Hybrid\nHMM-DNN approach allowed us to utilize the advantages of Neural Network with\nless labelled data. 
Adding CNN with TDNN has shown to work better in noisy\nenvironment due to CNN's additional frequency dimension which captures extra\ninformation from noisy speech, thus improving accuracy. We collected data from\nvarious open sources and labelled some of the unlabelled data after analysing\nits general context and content from Urdu language as well as from commonly\nused words from other languages, primarily English and were able to achieve WER\nof 5.2% with noisy as well as clean environment in isolated words or numbers as\nwell as in continuous spontaneous speech.\n","authors":["Muhammad Danyal Khan","Raheem Ali","Arshad Aziz"],"pdf_url":"https://arxiv.org/pdf/2307.12759v1.pdf","comment":"32 pages, 19 figures, 2 tables, preprint"},{"id":"http://arxiv.org/abs/2305.16731v3","updated":"2023-07-24T11:20:10Z","published":"2023-05-26T08:33:28Z","title":"Automatic Emotion Experiencer Recognition","summary":" The most prominent subtask in emotion analysis is emotion classification; to\nassign a category to a textual unit, for instance a social media post. Many\nresearch questions from the social sciences do, however, not only require the\ndetection of the emotion of an author of a post but to understand who is\nascribed an emotion in text. This task is tackled by emotion role labeling\nwhich aims at extracting who is described in text to experience an emotion,\nwhy, and towards whom. This could, however, be considered overly sophisticated\nif the main question to answer is who feels which emotion. A targeted approach\nfor such setup is to classify emotion experiencer mentions (aka \"emoters\")\nregarding the emotion they presumably perceive. This task is similar to named\nentity recognition of person names with the difference that not every mentioned\nentity name is an emoter. While, very recently, data with emoter annotations\nhas been made available, no experiments have yet been performed to detect such\nmentions. With this paper, we provide baseline experiments to understand how\nchallenging the task is. We further evaluate the impact on experiencer-specific\nemotion categorization and appraisal detection in a pipeline, when gold\nmentions are not available. We show that experiencer detection in text is a\nchallenging task, with a precision of .82 and a recall of .56 (F1 =.66). These\nresults motivate future work of jointly modeling emoter spans and\nemotion/appraisal predictions.\n","authors":["Maximilian Wegge","Roman Klinger"],"pdf_url":"https://arxiv.org/pdf/2305.16731v3.pdf","comment":"accepted to the CPSS workshop at KONVENS"},{"id":"http://arxiv.org/abs/2307.12659v1","updated":"2023-07-24T10:03:28Z","published":"2023-07-24T10:03:28Z","title":"A Model for Every User and Budget: Label-Free and Personalized\n Mixed-Precision Quantization","summary":" Recent advancement in Automatic Speech Recognition (ASR) has produced large\nAI models, which become impractical for deployment in mobile devices. Model\nquantization is effective to produce compressed general-purpose models, however\nsuch models may only be deployed to a restricted sub-domain of interest. We\nshow that ASR models can be personalized during quantization while relying on\njust a small set of unlabelled samples from the target domain. To this end, we\npropose myQASR, a mixed-precision quantization method that generates tailored\nquantization schemes for diverse users under any memory requirement with no\nfine-tuning. 
myQASR automatically evaluates the quantization sensitivity of\nnetwork layers by analysing the full-precision activation values. We are then\nable to generate a personalised mixed-precision quantization scheme for any\npre-determined memory budget. Results for large-scale ASR models show how\nmyQASR improves performance for specific genders, languages, and speakers.\n","authors":["Edward Fish","Umberto Michieli","Mete Ozay"],"pdf_url":"https://arxiv.org/pdf/2307.12659v1.pdf","comment":"INTERSPEECH 2023"},{"id":"http://arxiv.org/abs/2301.09790v3","updated":"2023-07-24T10:03:01Z","published":"2023-01-24T02:44:02Z","title":"The Next Chapter: A Study of Large Language Models in Storytelling","summary":" To enhance the quality of generated stories, recent story generation models\nhave been investigating the utilization of higher-level attributes like plots\nor commonsense knowledge. The application of prompt-based learning with large\nlanguage models (LLMs), exemplified by GPT-3, has exhibited remarkable\nperformance in diverse natural language processing (NLP) tasks. This paper\nconducts a comprehensive investigation, utilizing both automatic and human\nevaluation, to compare the story generation capacity of LLMs with recent models\nacross three datasets with variations in style, register, and length of\nstories. The results demonstrate that LLMs generate stories of significantly\nhigher quality compared to other story generation models. Moreover, they\nexhibit a level of performance that competes with human authors, albeit with\nthe preliminary observation that they tend to replicate real stories in\nsituations involving world knowledge, resembling a form of plagiarism.\n","authors":["Zhuohan Xie","Trevor Cohn","Jey Han Lau"],"pdf_url":"https://arxiv.org/pdf/2301.09790v3.pdf","comment":"Accepted to INLG2023"},{"id":"http://arxiv.org/abs/2304.14721v4","updated":"2023-07-24T09:49:55Z","published":"2023-04-28T09:42:18Z","title":"Towards autonomous system: flexible modular production system enhanced\n with large language model agents","summary":" In this paper, we present a novel framework that combines large language\nmodels (LLMs), digital twins and industrial automation system to enable\nintelligent planning and control of production processes. We retrofit the\nautomation system for a modular production facility and create executable\ncontrol interfaces of fine-granular functionalities and coarse-granular skills.\nLow-level functionalities are executed by automation components, and high-level\nskills are performed by automation modules. Subsequently, a digital twin system\nis developed, registering these interfaces and containing additional\ndescriptive information about the production system. Based on the retrofitted\nautomation system and the created digital twins, LLM-agents are designed to\ninterpret descriptive information in the digital twins and control the physical\nsystem through service interfaces. These LLM-agents serve as intelligent agents\non different levels within an automation system, enabling autonomous planning\nand control of flexible production. Given a task instruction as input, the\nLLM-agents orchestrate a sequence of atomic functionalities and skills to\naccomplish the task. 
We demonstrate how our implemented prototype can handle\nun-predefined tasks, plan a production process, and execute the operations.\nThis research highlights the potential of integrating LLMs into industrial\nautomation systems in the context of smart factory for more agile, flexible,\nand adaptive production processes, while it also underscores the critical\ninsights and limitations for future work. Demos at:\nhttps://github.com/YuchenXia/GPT4IndustrialAutomation\n","authors":["Yuchen Xia","Manthan Shenoy","Nasser Jazdi","Michael Weyrich"],"pdf_url":"https://arxiv.org/pdf/2304.14721v4.pdf","comment":"This is the pre-print draft manuscript. The peer-reviewed version\n will be published exclusively by IEEE after the conference, which is set to\n take place from September 12th to 15th, 2023. We've made several improvements\n to the final version of the paper based on valuable feedback and suggestions\n from other researchers"},{"id":"http://arxiv.org/abs/2307.12639v1","updated":"2023-07-24T09:30:30Z","published":"2023-07-24T09:30:30Z","title":"Fake News Detection Through Graph-based Neural Networks: A Survey","summary":" The popularity of online social networks has enabled rapid dissemination of\ninformation. People now can share and consume information much more rapidly\nthan ever before. However, low-quality and/or accidentally/deliberately fake\ninformation can also spread rapidly. This can lead to considerable and negative\nimpacts on society. Identifying, labelling and debunking online misinformation\nas early as possible has become an increasingly urgent problem. Many methods\nhave been proposed to detect fake news including many deep learning and\ngraph-based approaches. In recent years, graph-based methods have yielded\nstrong results, as they can closely model the social context and propagation\nprocess of online news. In this paper, we present a systematic review of fake\nnews detection studies based on graph-based and deep learning-based techniques.\nWe classify existing graph-based methods into knowledge-driven methods,\npropagation-based methods, and heterogeneous social context-based methods,\ndepending on how a graph structure is constructed to model news related\ninformation flows. We further discuss the challenges and open problems in\ngraph-based fake news detection and identify future research directions.\n","authors":["Shuzhi Gong","Richard O. Sinnott","Jianzhong Qi","Cecile Paris"],"pdf_url":"https://arxiv.org/pdf/2307.12639v1.pdf","comment":"18 pages, 3 tables, 7 figures"},{"id":"http://arxiv.org/abs/2210.04676v2","updated":"2023-07-24T09:00:03Z","published":"2022-10-10T13:26:45Z","title":"Learning \"O\" Helps for Learning More: Handling the Concealed Entity\n Problem for Class-incremental NER","summary":" As the categories of named entities rapidly increase, the deployed NER models\nare required to keep updating toward recognizing more entity types, creating a\ndemand for class-incremental learning for NER. Considering the privacy concerns\nand storage constraints, the standard paradigm for class-incremental NER\nupdates the models with training data only annotated with the new classes, yet\nthe entities from other entity classes are unlabeled, regarded as \"Non-entity\"\n(or \"O\"). In this work, we conduct an empirical study on the \"Unlabeled Entity\nProblem\" and find that it leads to severe confusion between \"O\" and entities,\ndecreasing class discrimination of old classes and declining the model's\nability to learn new classes. 
To solve the Unlabeled Entity Problem, we propose\na novel representation learning method to learn discriminative representations\nfor the entity classes and \"O\". Specifically, we propose an entity-aware\ncontrastive learning method that adaptively detects entity clusters in \"O\".\nFurthermore, we propose two effective distance-based relabeling strategies for\nbetter learning the old classes. We introduce a more realistic and challenging\nbenchmark for class-incremental NER, and the proposed method achieves up to\n10.62\\% improvement over the baseline methods.\n","authors":["Ruotian Ma","Xuanting Chen","Lin Zhang","Xin Zhou","Junzhe Wang","Tao Gui","Qi Zhang","Xiang Gao","Yunwen Chen"],"pdf_url":"https://arxiv.org/pdf/2210.04676v2.pdf","comment":"Accepted by ACL 2023"},{"id":"http://arxiv.org/abs/2306.16108v2","updated":"2023-07-24T08:14:44Z","published":"2023-06-28T11:24:48Z","title":"Is ChatGPT a Biomedical Expert? -- Exploring the Zero-Shot Performance\n of Current GPT Models in Biomedical Tasks","summary":" We assessed the performance of commercial Large Language Models (LLMs)\nGPT-3.5-Turbo and GPT-4 on tasks from the 2023 BioASQ challenge. In Task 11b\nPhase B, which is focused on answer generation, both models demonstrated\ncompetitive abilities with leading systems. Remarkably, they achieved this with\nsimple zero-shot learning, grounded with relevant snippets. Even without\nrelevant snippets, their performance was decent, though not on par with the\nbest systems. Interestingly, the older and cheaper GPT-3.5-Turbo system was\nable to compete with GPT-4 in the grounded Q&A setting on factoid and list\nanswers. In Task 11b Phase A, focusing on retrieval, query expansion through\nzero-shot learning improved performance, but the models fell short compared to\nother systems. The code needed to rerun these experiments is available through\nGitHub.\n","authors":["Samy Ateia","Udo Kruschwitz"],"pdf_url":"https://arxiv.org/pdf/2306.16108v2.pdf","comment":"Preprint accepted at the 11th BioASQ Workshop at the 14th Conference\n and Labs of the Evaluation Forum (CLEF) 2023; Changes: 1. Added related work\n and experimental setup sections. 2. Reworked discussion and future work\n section. 3. Fixed multiple typos and improved style. Changed license"},{"id":"http://arxiv.org/abs/2307.12573v1","updated":"2023-07-24T07:40:59Z","published":"2023-07-24T07:40:59Z","title":"Tachikuma: Understading Complex Interactions with Multi-Character and\n Novel Objects by Large Language Models","summary":" Recent advancements in natural language and Large Language Models (LLMs) have\nenabled AI agents to simulate human-like interactions within virtual worlds.\nHowever, these interactions still face limitations in complexity and\nflexibility, particularly in scenarios involving multiple characters and novel\nobjects. Pre-defining all interactable objects in the agent's world model\npresents challenges, and conveying implicit intentions to multiple characters\nthrough complex interactions remains difficult. To address these issues, we\npropose integrating virtual Game Masters (GMs) into the agent's world model,\ndrawing inspiration from Tabletop Role-Playing Games (TRPGs). GMs play a\ncrucial role in overseeing information, estimating players' intentions,\nproviding environment descriptions, and offering feedback, compensating for\ncurrent world model deficiencies. 
To facilitate future explorations for complex\ninteractions, we introduce a benchmark named Tachikuma, comprising a Multiple\ncharacter and novel Object based interaction Estimation (MOE) task and a\nsupporting dataset. MOE challenges models to understand characters' intentions\nand accurately determine their actions within intricate contexts involving\nmulti-character and novel object interactions. Besides, the dataset captures\nlog data from real-time communications during gameplay, providing diverse,\ngrounded, and complex interactions for further explorations. Finally, we\npresent a simple prompting baseline and evaluate its performance, demonstrating\nits effectiveness in enhancing interaction understanding. We hope that our\ndataset and task will inspire further research in complex interactions with\nnatural language, fostering the development of more advanced AI agents.\n","authors":["Yuanzhi Liang","Linchao Zhu","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12573v1.pdf","comment":"Preliminary version of an ongoing work"},{"id":"http://arxiv.org/abs/2307.12564v1","updated":"2023-07-24T07:17:33Z","published":"2023-07-24T07:17:33Z","title":"Towards Generalising Neural Topical Representations","summary":" Topic models have evolved from conventional Bayesian probabilistic models to\nNeural Topic Models (NTMs) over the last two decays. Although NTMs have\nachieved promising performance when trained and tested on a specific corpus,\ntheir generalisation ability across corpora is rarely studied. In practice, we\noften expect that an NTM trained on a source corpus can still produce quality\ntopical representation for documents in a different target corpus without\nretraining. In this work, we aim to improve NTMs further so that their benefits\ngeneralise reliably across corpora and tasks. To do so, we propose to model\nsimilar documents by minimising their semantical distance when training NTMs.\nSpecifically, similar documents are created by data augmentation during\ntraining; The semantical distance between documents is measured by the\nHierarchical Topic Transport Distance (HOTT), which computes the Optimal\nTransport (OT) distance between the topical representations. Our framework can\nbe readily applied to most NTMs as a plug-and-play module. Extensive\nexperiments show that our framework significantly improves the generalisation\nability regarding neural topical representation across corpora.\n","authors":["Xiaohao Yang","He Zhao","Dinh Phung","Lan Du"],"pdf_url":"https://arxiv.org/pdf/2307.12564v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2103.11578v2","updated":"2023-07-24T06:53:10Z","published":"2021-03-22T04:44:43Z","title":"SparseGAN: Sparse Generative Adversarial Network for Text Generation","summary":" It is still a challenging task to learn a neural text generation model under\nthe framework of generative adversarial networks (GANs) since the entire\ntraining process is not differentiable. The existing training strategies either\nsuffer from unreliable gradient estimations or imprecise sentence\nrepresentations. Inspired by the principle of sparse coding, we propose a\nSparseGAN that generates semantic-interpretable, but sparse sentence\nrepresentations as inputs to the discriminator. The key idea is that we treat\nan embedding matrix as an over-complete dictionary, and use a linear\ncombination of very few selected word embeddings to approximate the output\nfeature representation of the generator at each time step. 
With such\nsemantic-rich representations, we not only reduce unnecessary noises for\nefficient adversarial training, but also make the entire training process fully\ndifferentiable. Experiments on multiple text generation datasets yield\nperformance improvements, especially in sequence-level metrics, such as BLEU.\n","authors":["Liping Yuan","Jiehang Zeng","Xiaoqing Zheng"],"pdf_url":"https://arxiv.org/pdf/2103.11578v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.09710v3","updated":"2023-07-24T05:39:27Z","published":"2022-11-17T17:45:59Z","title":"Style Classification of Rabbinic Literature for Detection of Lost\n Midrash Tanhuma Material","summary":" Midrash collections are complex rabbinic works that consist of text in\nmultiple languages, which evolved through long processes of unstable oral and\nwritten transmission. Determining the origin of a given passage in such a\ncompilation is not always straightforward and is often a matter of dispute\namong scholars, yet it is essential for scholars' understanding of the passage\nand its relationship to other texts in the rabbinic corpus. To help solve this\nproblem, we propose a system for classification of rabbinic literature based on\nits style, leveraging recent advances in natural language processing for Hebrew\ntexts. Additionally, we demonstrate how this method can be applied to uncover\nlost material from a specific midrash genre, Tan\\d{h}uma-Yelammedenu, that has\nbeen preserved in later anthologies.\n","authors":["Shlomo Tannor","Nachum Dershowitz","Moshe Lavee"],"pdf_url":"https://arxiv.org/pdf/2211.09710v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12520v1","updated":"2023-07-24T04:29:43Z","published":"2023-07-24T04:29:43Z","title":"Lost In Translation: Generating Adversarial Examples Robust to\n Round-Trip Translation","summary":" Language Models today provide a high accuracy across a large number of\ndownstream tasks. However, they remain susceptible to adversarial attacks,\nparticularly against those where the adversarial examples maintain considerable\nsimilarity to the original text. Given the multilingual nature of text, the\neffectiveness of adversarial examples across translations and how machine\ntranslations can improve the robustness of adversarial examples remain largely\nunexplored. In this paper, we present a comprehensive study on the robustness\nof current text adversarial attacks to round-trip translation. We demonstrate\nthat 6 state-of-the-art text-based adversarial attacks do not maintain their\nefficacy after round-trip translation. Furthermore, we introduce an\nintervention-based solution to this problem, by integrating Machine Translation\ninto the process of adversarial example generation and demonstrating increased\nrobustness to round-trip translation. 
Our results indicate that finding\nadversarial examples robust to translation can help identify the insufficiency\nof language models that is common across languages, and motivate further\nresearch into multilingual adversarial attacks.\n","authors":["Neel Bhandari","Pin-Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2307.12520v1.pdf","comment":"Published at International Conference on Acoustics, Speech, and\n Signal Processing (ICASSP) 2023"},{"id":"http://arxiv.org/abs/2009.04639v2","updated":"2023-07-24T03:56:31Z","published":"2020-09-10T02:22:21Z","title":"Improving Coreference Resolution by Leveraging Entity-Centric Features\n with Graph Neural Networks and Second-order Inference","summary":" One of the major challenges in coreference resolution is how to make use of\nentity-level features defined over clusters of mentions rather than mention\npairs. However, coreferent mentions usually spread far apart in an entire text,\nwhich makes it extremely difficult to incorporate entity-level features. We\npropose a graph neural network-based coreference resolution method that can\ncapture the entity-centric information by encouraging the sharing of features\nacross all mentions that probably refer to the same real-world entity. Mentions\nare linked to each other via the edges modeling how likely two linked mentions\npoint to the same entity. Modeling by such graphs, the features between\nmentions can be shared by message passing operations in an entity-centric\nmanner. A global inference algorithm up to second-order features is also\npresented to optimally cluster mentions into consistent groups. Experimental\nresults show our graph neural network-based method combing with the\nsecond-order decoding algorithm (named GNNCR) achieved close to\nstate-of-the-art performance on the English CoNLL-2012 Shared Task dataset.\n","authors":["Lu Liu","Zhenqiao Song","Xiaoqing Zheng","Jun He"],"pdf_url":"https://arxiv.org/pdf/2009.04639v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12507v1","updated":"2023-07-24T03:44:17Z","published":"2023-07-24T03:44:17Z","title":"Investigating the Existence of \"Secret Language'' in Language Models","summary":" In this paper, we study the problem of secret language in NLP, where current\nlanguage models (LMs) seem to have a hidden vocabulary that allows them to\ninterpret absurd inputs as meaningful concepts. We investigate two research\nquestions: ``Does the secret language phenomenon exist in different language\nmodels?'' and ``Does secret language depend on specific context?'' To answer\nthese questions, we introduce a novel method named \\textit{SecretFinding}, a\ngradient-based approach that can automatically discover secret languages in\nLMs. We conduct experiments on five representative models (Electra, ALBERT,\nRoberta, DistillBERT, and CLIP) finetuned on four NLP benchmarks (SST-2, MRPC,\nSNLI, and SQuAD) and a language-grounding benchmark (MSCOCO). Our experimental\nresults show that even when we replace the most important words with others\nthat are semantically dissimilar to the original words in a sentence, LMs do\nnot consider the new sentence semantically dissimilar to the original, as the\noutput does not change with a high probability. This phenomenon holds true\nacross the five models and five tasks and gives a positive answer to the first\nresearch question. 
As for the second research question, we find that the secret\nlanguage discovered by \\textit{SecretFinding} is quite general and could even\nbe transferred to other models in the black-box settings, such as GPT-3 and\nChatGPT. Finally, we discuss the causes of secret language, how to eliminate\nit, the potential connection to memorization, and ethical implications.\nExamples of secret language found by SecretFinding are available on\nhttps://huggingface.co/spaces/anonymousauthors/ACL23_SecretLanguage.\n","authors":["Yimu Wang","Peng Shi","Hongyang Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12507v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13040v3","updated":"2023-07-24T03:31:42Z","published":"2023-05-22T13:47:51Z","title":"SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented\n Dialogue Agents","summary":" Task-oriented dialogue (TOD) models have made significant progress in recent\nyears. However, previous studies primarily focus on datasets written by\nannotators, which has resulted in a gap between academic research and\nreal-world spoken conversation scenarios. While several small-scale spoken TOD\ndatasets are proposed to address robustness issues such as ASR errors, they\nignore the unique challenges in spoken conversation. To tackle the limitations,\nwe introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD,\ncontaining 8 domains, 203k turns, 5.7k dialogues and 249 hours of audios from\nhuman-to-human spoken conversations. SpokenWOZ further incorporates common\nspoken characteristics such as word-by-word processing and reasoning in spoken\nlanguage. Based on these characteristics, we present cross-turn slot and\nreasoning slot detection as new challenges. We conduct experiments on various\nbaselines, including text-modal models, newly proposed dual-modal models, and\nLLMs, e.g., ChatGPT. The results show that the current models still have\nsubstantial room for improvement in spoken conversation, where the most\nadvanced dialogue state tracker only achieves 25.65% in joint goal accuracy and\nthe SOTA end-to-end model only correctly completes the user request in 52.1% of\ndialogues. The dataset, code, and leaderboard are available:\nhttps://spokenwoz.github.io/SpokenWOZ-github.io/.\n","authors":["Shuzheng Si","Wentao Ma","Haoyu Gao","Yuchuan Wu","Ting-En Lin","Yinpei Dai","Hangyu Li","Rui Yan","Fei Huang","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2305.13040v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2009.07481v2","updated":"2023-07-24T03:26:17Z","published":"2020-09-16T05:58:00Z","title":"Unsupervised Summarization by Jointly Extracting Sentences and Keywords","summary":" We present RepRank, an unsupervised graph-based ranking model for extractive\nmulti-document summarization in which the similarity between words, sentences,\nand word-to-sentence can be estimated by the distances between their vector\nrepresentations in a unified vector space. In order to obtain desirable\nrepresentations, we propose a self-attention based learning method that\nrepresent a sentence by the weighted sum of its word embeddings, and the\nweights are concentrated to those words hopefully better reflecting the content\nof a document. We show that salient sentences and keywords can be extracted in\na joint and mutual reinforcement process using our learned representations, and\nprove that this process always converges to a unique solution leading to\nimprovement in performance. 
A variant of absorbing random walk and the\ncorresponding sampling-based algorithm are also described to avoid redundancy\nand increase diversity in the summaries. Experiment results with multiple\nbenchmark datasets show that RepRank achieved the best or comparable\nperformance in ROUGE.\n","authors":["Zongyi Li","Xiaoqing Zheng","Jun He"],"pdf_url":"https://arxiv.org/pdf/2009.07481v2.pdf","comment":"10 pages(includes 2 pages references), 1 figure"},{"id":"http://arxiv.org/abs/2307.12498v1","updated":"2023-07-24T03:07:40Z","published":"2023-07-24T03:07:40Z","title":"Robust Automatic Speech Recognition via WavAugment Guided Phoneme\n Adversarial Training","summary":" Developing a practically-robust automatic speech recognition (ASR) is\nchallenging since the model should not only maintain the original performance\non clean samples, but also achieve consistent efficacy under small volume\nperturbations and large domain shifts. To address this problem, we propose a\nnovel WavAugment Guided Phoneme Adversarial Training (wapat). wapat use\nadversarial examples in phoneme space as augmentation to make the model\ninvariant to minor fluctuations in phoneme representation and preserve the\nperformance on clean samples. In addition, wapat utilizes the phoneme\nrepresentation of augmented samples to guide the generation of adversaries,\nwhich helps to find more stable and diverse gradient-directions, resulting in\nimproved generalization. Extensive experiments demonstrate the effectiveness of\nwapat on End-to-end Speech Challenge Benchmark (ESB). Notably, SpeechLM-wapat\noutperforms the original model by 6.28% WER reduction on ESB, achieving the new\nstate-of-the-art.\n","authors":["Gege Qi","Yuefeng Chen","Xiaofeng Mao","Xiaojun Jia","Ranjie Duan","Rong Zhang","Hui Xue"],"pdf_url":"https://arxiv.org/pdf/2307.12498v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11610v2","updated":"2023-07-24T01:35:47Z","published":"2023-07-21T14:25:39Z","title":"CausE: Towards Causal Knowledge Graph Embedding","summary":" Knowledge graph embedding (KGE) focuses on representing the entities and\nrelations of a knowledge graph (KG) into the continuous vector spaces, which\ncan be employed to predict the missing triples to achieve knowledge graph\ncompletion (KGC). However, KGE models often only briefly learn structural\ncorrelations of triple data and embeddings would be misled by the trivial\npatterns and noisy links in real-world KGs. To address this issue, we build the\nnew paradigm of KGE in the context of causality and embedding disentanglement.\nWe further propose a Causality-enhanced knowledge graph Embedding (CausE)\nframework. CausE employs causal intervention to estimate the causal effect of\nthe confounder embeddings and design new training objectives to make stable\npredictions. Experimental results demonstrate that CausE could outperform the\nbaseline models and achieve state-of-the-art KGC performance. 
We release our\ncode in https://github.com/zjukg/CausE.\n","authors":["Yichi Zhang","Wen Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.11610v2.pdf","comment":"Accepted by CCKS 2023 as a research paper"},{"id":"http://arxiv.org/abs/2306.14096v4","updated":"2023-07-24T00:58:11Z","published":"2023-06-25T02:24:30Z","title":"Chinese Fine-Grained Financial Sentiment Analysis with Large Language\n Models","summary":" Entity-level fine-grained sentiment analysis in the financial domain is a\ncrucial subtask of sentiment analysis and currently faces numerous challenges.\nThe primary challenge stems from the lack of high-quality and large-scale\nannotated corpora specifically designed for financial text sentiment analysis,\nwhich in turn limits the availability of data necessary for developing\neffective text processing techniques. Recent advancements in large language\nmodels (LLMs) have yielded remarkable performance in natural language\nprocessing tasks, primarily centered around language pattern matching. In this\npaper, we propose a novel and extensive Chinese fine-grained financial\nsentiment analysis dataset, FinChina SA, for enterprise early warning. We\nthoroughly evaluate and experiment with well-known existing open-source LLMs\nusing our dataset. We firmly believe that our dataset will serve as a valuable\nresource to advance the exploration of real-world financial sentiment analysis\ntasks, which should be the focus of future research. The FinChina SA dataset is\npublicly available at https://github.com/YerayL/FinChina-SA\n","authors":["Yinyu Lan","Yanru Wu","Wang Xu","Weiqiang Feng","Youhao Zhang"],"pdf_url":"https://arxiv.org/pdf/2306.14096v4.pdf","comment":"FinLLM Symposium at IJCAI 2023"},{"id":"http://arxiv.org/abs/2305.01788v3","updated":"2023-07-24T00:54:51Z","published":"2023-05-02T21:33:10Z","title":"Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation\n Incorporating Gloss Information","summary":" Visual Word Sense Disambiguation (VWSD) is a task to find the image that most\naccurately depicts the correct sense of the target word for the given context.\nPreviously, image-text matching models often suffered from recognizing\npolysemous words. This paper introduces an unsupervised VWSD approach that uses\ngloss information of an external lexical knowledge-base, especially the sense\ndefinitions. Specifically, we suggest employing Bayesian inference to\nincorporate the sense definitions when sense information of the answer is not\nprovided. In addition, to ameliorate the out-of-dictionary (OOD) issue, we\npropose a context-aware definition generation with GPT-3. Experimental results\nshow that the VWSD performance significantly increased with our Bayesian\ninference-based approach. In addition, our context-aware definition generation\nachieved prominent performance improvement in OOD examples exhibiting better\nperformance than the existing definition generation method.\n","authors":["Sunjae Kwon","Rishabh Garodia","Minhwa Lee","Zhichao Yang","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2305.01788v3.pdf","comment":"ACL 2023, https://aclanthology.org/2023.acl-long.88"},{"id":"http://arxiv.org/abs/2307.02591v2","updated":"2023-07-24T00:47:23Z","published":"2023-07-05T18:41:29Z","title":"ODD: A Benchmark Dataset for the NLP-based Opioid Related Aberrant\n Behavior Detection","summary":" Opioid related aberrant behaviors (ORAB) present novel risk factors for\nopioid overdose. 
Previously, ORAB have been mainly assessed by survey results\nand by monitoring drug administrations. Such methods, however, cannot scale up\nand do not cover the entire spectrum of aberrant behaviors. On the other hand,\nORAB are widely documented in electronic health record notes. This paper\nintroduces a novel biomedical natural language processing benchmark dataset\nnamed ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset\ncomprising more than 750 publicly available EHR notes. ODD has been designed\nto identify ORAB from patients' EHR notes and classify them into nine\ncategories: 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3)\nOpioids, 4) Indication, 5) Diagnosed opioid dependency, 6) Benzodiazepines, 7)\nMedication Changes, 8) Central Nervous System-related, and 9) Social\nDeterminants of Health. We explored two state-of-the-art natural language\nprocessing (NLP) models (finetuning pretrained language models and\nprompt-tuning approaches) to identify ORAB. Experimental results show that the\nprompt-tuning models outperformed the finetuning models in most categories and\nthe gains were especially higher among uncommon categories (Suggested aberrant\nbehavior, Diagnosed opioid dependency and Medication change). Although the best\nmodel achieved the highest area under the precision-recall curve of 83.92%,\nuncommon classes (Suggested Aberrant Behavior, Diagnosed Opioid Dependence, and\nMedication Change) still have large room for performance improvement.\n","authors":["Sunjae Kwon","Xun Wang","Weisong Liu","Emily Druhl","Minhee L. Sung","Joel I. Reisman","Wenjun Li","Robert D. Kerns","William Becker","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2307.02591v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2307.13176v1","updated":"2023-07-24T23:53:13Z","published":"2023-07-24T23:53:13Z","title":"Schema-Driven Actionable Insight Generation and Smart Recommendation","summary":" In natural language generation (NLG), insight mining is seen as a\ndata-to-text task, where data is mined for interesting patterns and verbalised\ninto 'insight' statements. An 'over-generate and rank' paradigm is intuitively\nused to generate such insights. The multidimensionality and subjectivity of\nthis process make it challenging. This paper introduces a schema-driven method\nto generate actionable insights from data to drive growth and change. It also\nintroduces a technique to rank the insights to align with user interests based\non their feedback. We show preliminary qualitative results of the insights\ngenerated using our technique and demonstrate its ability to adapt to feedback.\n","authors":["Allmin Susaiyah","Aki Härmä","Milan Petković"],"pdf_url":"https://arxiv.org/pdf/2307.13176v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13173v1","updated":"2023-07-24T23:42:32Z","published":"2023-07-24T23:42:32Z","title":"Opinion Mining Using Population-tuned Generative Language Models","summary":" We present a novel method for mining opinions from text collections using\ngenerative language models trained on data collected from different\npopulations. We describe the basic definitions, methodology and a generic\nalgorithm for opinion insight mining. We demonstrate the performance of our\nmethod in an experiment where a pre-trained generative model is fine-tuned\nusing specifically tailored content with unnatural and fully annotated\nopinions. 
We show that our approach can learn and transfer the opinions to the\nsemantic classes while maintaining the proportion of polarisation. Finally, we\ndemonstrate the usage of an insight mining system to scale up the discovery of\nopinion insights from a real text corpus.\n","authors":["Allmin Susaiyah","Abhinay Pandya","Aki Härmä"],"pdf_url":"https://arxiv.org/pdf/2307.13173v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13128v1","updated":"2023-07-24T21:05:47Z","published":"2023-07-24T21:05:47Z","title":"Explaining Math Word Problem Solvers","summary":" Automated math word problem solvers based on neural networks have\nsuccessfully managed to obtain 70-80\\% accuracy in solving arithmetic word\nproblems. However, it has been shown that these solvers may rely on superficial\npatterns to obtain their equations. In order to determine what information math\nword problem solvers use to generate solutions, we remove parts of the input\nand measure the model's performance on the perturbed dataset. Our results show\nthat the model is not sensitive to the removal of many words from the input and\ncan still manage to find a correct answer when given a nonsense question. This\nindicates that automatic solvers do not follow the semantic logic of math word\nproblems, and may be overfitting to the presence of specific words.\n","authors":["Abby Newcomb","Jugal Kalita"],"pdf_url":"https://arxiv.org/pdf/2307.13128v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.15498v2","updated":"2023-07-24T20:08:20Z","published":"2021-06-29T15:25:33Z","title":"Classification of Consumer Belief Statements From Social Media","summary":" Social media offer plenty of information to perform market research in order\nto meet the requirements of customers. One way how this research is conducted\nis that a domain expert gathers and categorizes user-generated content into a\ncomplex and fine-grained class structure. In many of such cases, little data\nmeets complex annotations. It is not yet fully understood how this can be\nleveraged successfully for classification. We examine the classification\naccuracy of expert labels when used with a) many fine-grained classes and b)\nfew abstract classes. For scenario b) we compare abstract class labels given by\nthe domain expert as baseline and by automatic hierarchical clustering. We\ncompare this to another baseline where the entire class structure is given by a\ncompletely unsupervised clustering approach. By doing so, this work can serve\nas an example of how complex expert annotations are potentially beneficial and\ncan be utilized in the most optimal way for opinion mining in highly specific\ndomains. By exploring across a range of techniques and experiments, we find\nthat automated class abstraction approaches in particular the unsupervised\napproach performs remarkably well against domain expert baseline on text\nclassification tasks. 
This has the potential to inspire opinion mining\napplications in order to support market researchers in practice and to inspire\nfine-grained automated content analysis on a large scale.\n","authors":["Gerhard Johann Hagerer","Wenbin Le","Hannah Danner","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2106.15498v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2110.10575v2","updated":"2023-07-24T20:07:07Z","published":"2021-10-20T14:04:13Z","title":"SocialVisTUM: An Interactive Visualization Toolkit for Correlated Neural\n Topic Models on Social Media Opinion Mining","summary":" Recent research in opinion mining proposed word embedding-based topic\nmodeling methods that provide superior coherence compared to traditional topic\nmodeling. In this paper, we demonstrate how these methods can be used to\ndisplay correlated topic models on social media texts using SocialVisTUM, our\nproposed interactive visualization toolkit. It displays a graph with topics as\nnodes and their correlations as edges. Further details are displayed\ninteractively to support the exploration of large text collections, e.g.,\nrepresentative words and sentences of topics, topic and sentiment\ndistributions, hierarchical topic clustering, and customizable, predefined\ntopic labels. The toolkit optimizes automatically on custom data for optimal\ncoherence. We show a working instance of the toolkit on data crawled from\nEnglish social media discussions about organic food consumption. The\nvisualization confirms findings of a qualitative consumer research study.\nSocialVisTUM and its training procedures are accessible online.\n","authors":["Gerhard Johann Hagerer","Martin Kirchhoff","Hannah Danner","Robert Pesch","Mainak Ghosh","Archishman Roy","Jiaxi Zhao","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2110.10575v2.pdf","comment":"Demo paper accepted for publication on RANLP 2021; 8 pages, 5\n figures, 1 table"},{"id":"http://arxiv.org/abs/2110.15134v2","updated":"2023-07-24T20:05:38Z","published":"2021-10-28T14:09:44Z","title":"An Analysis of Programming Course Evaluations Before and After the\n Introduction of an Autograder","summary":" Commonly, introductory programming courses in higher education institutions\nhave hundreds of participating students eager to learn to program. The manual\neffort for reviewing the submitted source code and for providing feedback can\nno longer be managed. Manually reviewing the submitted homework can be\nsubjective and unfair, particularly if many tutors are responsible for grading.\nDifferent autograders can help in this situation; however, there is a lack of\nknowledge about how autograders can impact students' overall perception of\nprogramming classes and teaching. This is relevant for course organizers and\ninstitutions to keep their programming courses attractive while coping with\nincreasing students.\n This paper studies the answers to the standardized university evaluation\nquestionnaires of multiple large-scale foundational computer science courses\nwhich recently introduced autograding. The differences before and after this\nintervention are analyzed. By incorporating additional observations, we\nhypothesize how the autograder might have contributed to the significant\nchanges in the data, such as, improved interactions between tutors and\nstudents, improved overall course quality, improved learning success, increased\ntime spent, and reduced difficulty. 
This qualitative study aims to provide\nhypotheses for future research to define and conduct quantitative surveys and\ndata analysis. The autograder technology can be validated as a teaching method\nto improve student satisfaction with programming courses.\n","authors":["Gerhard Johann Hagerer","Laura Lahesoo","Miriam Anschütz","Stephan Krusche","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2110.15134v2.pdf","comment":"Accepted full paper article on IEEE ITHET 2021"},{"id":"http://arxiv.org/abs/2111.02259v3","updated":"2023-07-24T20:03:14Z","published":"2021-11-03T14:49:50Z","title":"A Case Study and Qualitative Analysis of Simple Cross-Lingual Opinion\n Mining","summary":" User-generated content from social media is produced in many languages,\nmaking it technically challenging to compare the discussed themes from one\ndomain across different cultures and regions. It is relevant for domains in a\nglobalized world, such as market research, where people from two nations and\nmarkets might have different requirements for a product. We propose a simple,\nmodern, and effective method for building a single topic model with sentiment\nanalysis capable of covering multiple languages simultaneously, based on a\npre-trained state-of-the-art deep neural network for natural language\nunderstanding. To demonstrate its feasibility, we apply the model to newspaper\narticles and user comments of a specific domain, i.e., organic food products\nand related consumption behavior. The themes match across languages.\nAdditionally, we obtain a high proportion of stable and domain-relevant\ntopics, a meaningful relation between topics and their respective textual\ncontents, and an interpretable representation for social media documents.\nMarketing can potentially benefit from our method, since it provides an\neasy-to-use means of addressing specific customer interests from different\nmarket regions around the globe. For reproducibility, we provide the code,\ndata, and results of our study.\n","authors":["Gerhard Johann Hagerer","Wing Sheung Leung","Qiaoxi Liu","Hannah Danner","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2111.02259v3.pdf","comment":"10 pages, 2 tables, 5 figures, full paper, peer-reviewed, published\n at KDIR/IC3k 2021 conference"},{"id":"http://arxiv.org/abs/2307.13106v1","updated":"2023-07-24T19:54:15Z","published":"2023-07-24T19:54:15Z","title":"How to use LLMs for Text Analysis","summary":" This guide introduces Large Language Models (LLM) as a highly versatile text\nanalysis method within the social sciences. As LLMs are easy-to-use, cheap,\nfast, and applicable on a broad range of text analysis tasks, ranging from text\nannotation and classification to sentiment analysis and critical discourse\nanalysis, many scholars believe that LLMs will transform how we do text\nanalysis. This how-to guide is aimed at students and researchers with limited\nprogramming experience, and offers a simple introduction to how LLMs can be\nused for text analysis in your own research project, as well as advice on best\npractices. We will go through each of the steps of analyzing textual data with\nLLMs using Python: installing the software, setting up the API, loading the\ndata, developing an analysis prompt, analyzing the text, and validating the\nresults. 
As an illustrative example, we will use the challenging task of\nidentifying populism in political texts, and show how LLMs move beyond the\nexisting state-of-the-art.\n","authors":["Petter Törnberg"],"pdf_url":"https://arxiv.org/pdf/2307.13106v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2111.02326v2","updated":"2023-07-24T19:44:53Z","published":"2021-11-03T16:20:16Z","title":"End-to-End Annotator Bias Approximation on Crowdsourced Single-Label\n Sentiment Analysis","summary":" Sentiment analysis is often a crowdsourcing task prone to subjective labels\ngiven by many annotators. It is not yet fully understood how the annotation\nbias of each annotator can be modeled correctly with state-of-the-art methods.\nHowever, resolving annotator bias precisely and reliably is the key to\nunderstand annotators' labeling behavior and to successfully resolve\ncorresponding individual misconceptions and wrongdoings regarding the\nannotation task. Our contribution is an explanation and improvement for precise\nneural end-to-end bias modeling and ground truth estimation, which reduces an\nundesired mismatch in that regard of the existing state-of-the-art.\nClassification experiments show that it has potential to improve accuracy in\ncases where each sample is annotated only by one single annotator. We provide\nthe whole source code publicly and release an own domain-specific sentiment\ndataset containing 10,000 sentences discussing organic food products. These are\ncrawled from social media and are singly labeled by 10 non-expert annotators.\n","authors":["Gerhard Johann Hagerer","David Szabo","Andreas Koch","Maria Luisa Ripoll Dominguez","Christian Widmer","Maximilian Wich","Hannah Danner","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2111.02326v2.pdf","comment":"10 pages, 2 figures, 2 tables, full conference paper, peer-reviewed"},{"id":"http://arxiv.org/abs/2305.17008v2","updated":"2023-07-24T19:18:25Z","published":"2023-05-26T15:09:11Z","title":"NormBank: A Knowledge Bank of Situational Social Norms","summary":" We present NormBank, a knowledge bank of 155k situational norms. This\nresource is designed to ground flexible normative reasoning for interactive,\nassistive, and collaborative AI systems. Unlike prior commonsense resources,\nNormBank grounds each inference within a multivalent sociocultural frame, which\nincludes the setting (e.g., restaurant), the agents' contingent roles (waiter,\ncustomer), their attributes (age, gender), and other physical, social, and\ncultural constraints (e.g., the temperature or the country of operation). In\ntotal, NormBank contains 63k unique constraints from a taxonomy that we\nintroduce and iteratively refine here. Constraints then apply in different\ncombinations to frame social norms. Under these manipulations, norms are\nnon-monotonic - one can cancel an inference by updating its frame even\nslightly. Still, we find evidence that neural models can help reliably extend\nthe scope and coverage of NormBank. 
We further demonstrate the utility of this\nresource with a series of transfer experiments.\n","authors":["Caleb Ziems","Jane Dwivedi-Yu","Yi-Chia Wang","Alon Halevy","Diyi Yang"],"pdf_url":"https://arxiv.org/pdf/2305.17008v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13085v1","updated":"2023-07-24T19:14:38Z","published":"2023-07-24T19:14:38Z","title":"Making Metadata More FAIR Using Large Language Models","summary":" With the global increase in experimental data artifacts, harnessing them in a\nunified fashion leads to a major stumbling block - bad metadata. To bridge this\ngap, this work presents a Natural Language Processing (NLP) informed\napplication, called FAIRMetaText, that compares metadata. Specifically,\nFAIRMetaText analyzes the natural language descriptions of metadata and\nprovides a mathematical similarity measure between two terms. This measure can\nthen be utilized for analyzing varied metadata, by suggesting terms for\ncompliance or grouping similar terms for identification of replaceable terms.\nThe efficacy of the algorithm is presented qualitatively and quantitatively on\npublicly available research artifacts and demonstrates large gains across\nmetadata related tasks through an in-depth study of a wide variety of Large\nLanguage Models (LLMs). This software can drastically reduce the human effort\nin sifting through various natural language metadata while employing several\nexperimental datasets on the same topic.\n","authors":["Sowmya S. Sundaram","Mark A. Musen"],"pdf_url":"https://arxiv.org/pdf/2307.13085v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.00017v2","updated":"2023-07-24T18:46:22Z","published":"2023-05-30T15:15:40Z","title":"Towards Explainable and Language-Agnostic LLMs: Symbolic Reverse\n Engineering of Language at Scale","summary":" Large language models (LLMs) have achieved a milestone that undeniably\nchanged many held beliefs in artificial intelligence (AI). However, there\nremain many limitations of these LLMs when it comes to true language\nunderstanding, limitations that are a byproduct of the underlying architecture\nof deep neural networks. Moreover, and due to their subsymbolic nature,\nwhatever knowledge these models acquire about how language works will always be\nburied in billions of microfeatures (weights), none of which is meaningful on\nits own, making such models hopelessly unexplainable. To address these\nlimitations, we suggest combining the strength of symbolic representations\nwith what we believe to be the key to the success of LLMs, namely a successful\nbottom-up reverse engineering of language at scale. As such we argue for a\nbottom-up reverse engineering of language in a symbolic setting. Hints on what\nthis project amounts to have been suggested by several authors, and we discuss\nin some detail here how this project could be accomplished.\n","authors":["Walid S. Saba"],"pdf_url":"https://arxiv.org/pdf/2306.00017v2.pdf","comment":"Draft, preprint"},{"id":"http://arxiv.org/abs/2307.13018v1","updated":"2023-07-24T17:17:13Z","published":"2023-07-24T17:17:13Z","title":"The potential of LLMs for coding with low-resource and domain-specific\n programming languages","summary":" This paper presents a study on the feasibility of using large language models\n(LLM) for coding with low-resource and domain-specific programming languages\nthat typically lack the amount of data required for effective LLM processing\ntechniques. 
This study focuses on the econometric scripting language named\nhansl of the open-source software gretl and employs a proprietary LLM based on\nGPT-3.5. Our findings suggest that LLMs can be a useful tool for writing,\nunderstanding, improving, and documenting gretl code, which includes generating\ndescriptive docstrings for functions and providing precise explanations for\nabstract and poorly documented econometric code. While the LLM showcased\npromising docstring-to-code translation capability, we also identify some\nlimitations, such as its inability to improve certain sections of code and to\nwrite accurate unit tests. This study is a step towards leveraging the power of\nLLMs to facilitate software development in low-resource programming languages\nand ultimately to lower barriers to entry for their adoption.\n","authors":["Artur Tarassow"],"pdf_url":"https://arxiv.org/pdf/2307.13018v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2307.12981v1","updated":"2023-07-24T17:59:02Z","published":"2023-07-24T17:59:02Z","title":"3D-LLM: Injecting the 3D World into Large Language Models","summary":" Large language models (LLMs) and Vision-Language Models (VLMs) have been\nproven to excel at multiple tasks, such as commonsense reasoning. Powerful as\nthese models can be, they are not grounded in the 3D physical world, which\ninvolves richer concepts such as spatial relationships, affordances, physics,\nlayout, and so on. In this work, we propose to inject the 3D world into large\nlanguage models and introduce a whole new family of 3D-LLMs. Specifically,\n3D-LLMs can take 3D point clouds and their features as input and perform a\ndiverse set of 3D-related tasks, including captioning, dense captioning, 3D\nquestion answering, task decomposition, 3D grounding, 3D-assisted dialog,\nnavigation, and so on. Using three types of prompting mechanisms that we\ndesign, we are able to collect over 300k 3D-language data covering these tasks.\nTo efficiently train 3D-LLMs, we first utilize a 3D feature extractor that\nobtains 3D features from rendered multi-view images. Then, we use 2D VLMs as\nour backbones to train our 3D-LLMs. By introducing a 3D localization mechanism,\n3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show\nthat our model outperforms state-of-the-art baselines by a large margin (e.g.,\nthe BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore,\nexperiments on our held-in datasets for 3D captioning, task composition, and\n3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative\nexamples also show that our model could perform more tasks beyond the scope of\nexisting LLMs and VLMs. Project Page: https://vis-www.cs.umass.edu/3dllm/.\n","authors":["Yining Hong","Haoyu Zhen","Peihao Chen","Shuhong Zheng","Yilun Du","Zhenfang Chen","Chuang Gan"],"pdf_url":"https://arxiv.org/pdf/2307.12981v1.pdf","comment":"Project Page: https://vis-www.cs.umass.edu/3dllm/"},{"id":"http://arxiv.org/abs/2209.05407v3","updated":"2023-07-24T17:58:31Z","published":"2022-09-12T16:59:36Z","title":"Segmenting Known Objects and Unseen Unknowns without Prior Knowledge","summary":" Panoptic segmentation methods assign a known class to each pixel given in\ninput. Even for state-of-the-art approaches, this inevitably enforces decisions\nthat systematically lead to wrong predictions for objects outside the training\ncategories. 
However, robustness against out-of-distribution samples and corner\ncases is crucial in safety-critical settings to avoid dangerous consequences.\nSince real-world datasets cannot contain enough data points to adequately\nsample the long tail of the underlying distribution, models must be able to\ndeal with unseen and unknown scenarios as well. Previous methods targeted this\nby re-identifying already-seen unlabeled objects. In this work, we propose the\nnecessary step to extend segmentation with a new setting which we term holistic\nsegmentation. Holistic segmentation aims to identify and separate objects of\nunseen unknown categories into instances, without any prior knowledge about\nthem, while performing panoptic segmentation of known classes. We tackle this\nnew problem with U3HS, which finds unknowns as highly uncertain regions and\nclusters their corresponding instance-aware embeddings into individual objects.\nBy doing so, for the first time in panoptic segmentation with unknown objects,\nour U3HS is trained without unknown categories, reducing assumptions and\nleaving the settings as unconstrained as in real-life scenarios. Extensive\nexperiments on public data from MS COCO, Cityscapes, and Lost&Found demonstrate\nthe effectiveness of U3HS for this new, challenging, and assumptions-free\nsetting called holistic segmentation.\n","authors":["Stefano Gasperini","Alvaro Marcos-Ramiro","Michael Schmidt","Nassir Navab","Benjamin Busam","Federico Tombari"],"pdf_url":"https://arxiv.org/pdf/2209.05407v3.pdf","comment":"Accepted at ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12980v1","updated":"2023-07-24T17:58:06Z","published":"2023-07-24T17:58:06Z","title":"A Systematic Survey of Prompt Engineering on Vision-Language Foundation\n Models","summary":" Prompt engineering is a technique that involves augmenting a large\npre-trained model with task-specific hints, known as prompts, to adapt the\nmodel to new tasks. Prompts can be created manually as natural language\ninstructions or generated automatically as either natural language instructions\nor vector representations. Prompt engineering enables the ability to perform\npredictions based solely on prompts without updating model parameters, and the\neasier application of large pre-trained models in real-world tasks. In past\nyears, Prompt engineering has been well-studied in natural language processing.\nRecently, it has also been intensively studied in vision-language modeling.\nHowever, there is currently a lack of a systematic overview of prompt\nengineering on pre-trained vision-language models. This paper aims to provide a\ncomprehensive survey of cutting-edge research in prompt engineering on three\ntypes of vision-language models: multimodal-to-text generation models (e.g.\nFlamingo), image-text matching models (e.g. CLIP), and text-to-image generation\nmodels (e.g. Stable Diffusion). For each type of model, a brief model summary,\nprompting methods, prompting-based applications, and the corresponding\nresponsibility and integrity issues are summarized and discussed. Furthermore,\nthe commonalities and differences between prompting on vision-language models,\nlanguage models, and vision models are also discussed. 
The challenges, future\ndirections, and research opportunities are summarized to foster future research\non this topic.\n","authors":["Jindong Gu","Zhen Han","Shuo Chen","Ahmad Beirami","Bailan He","Gengyuan Zhang","Ruotong Liao","Yao Qin","Volker Tresp","Philip Torr"],"pdf_url":"https://arxiv.org/pdf/2307.12980v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12972v1","updated":"2023-07-24T17:49:11Z","published":"2023-07-24T17:49:11Z","title":"DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting","summary":" In this paper, we propose a new operator, called 3D DeFormable Attention\n(DFA3D), for 2D-to-3D feature lifting, which transforms multi-view 2D image\nfeatures into a unified 3D space for 3D object detection. Existing feature\nlifting approaches, such as Lift-Splat-based and 2D attention-based, either use\nestimated depth to get pseudo LiDAR features and then splat them to a 3D space,\nwhich is a one-pass operation without feature refinement, or ignore depth and\nlift features by 2D attention mechanisms, which achieve finer semantics while\nsuffering from a depth ambiguity problem. In contrast, our DFA3D-based method\nfirst leverages the estimated depth to expand each view's 2D feature map to 3D\nand then utilizes DFA3D to aggregate features from the expanded 3D feature\nmaps. With the help of DFA3D, the depth ambiguity problem can be effectively\nalleviated from the root, and the lifted features can be progressively refined\nlayer by layer, thanks to the Transformer-like architecture. In addition, we\npropose a mathematically equivalent implementation of DFA3D which can\nsignificantly improve its memory efficiency and computational speed. We\nintegrate DFA3D into several methods that use 2D attention-based feature\nlifting with only a few modifications in code and evaluate on the nuScenes\ndataset. The experiment results show a consistent improvement of +1.41\\% mAP on\naverage, and up to +15.1\\% mAP improvement when high-quality depth information\nis available, demonstrating the superiority, applicability, and huge potential\nof DFA3D. The code is available at\nhttps://github.com/IDEA-Research/3D-deformable-attention.git.\n","authors":["Hongyang Li","Hao Zhang","Zhaoyang Zeng","Shilong Liu","Feng Li","Tianhe Ren","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12972v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12970v1","updated":"2023-07-24T17:49:04Z","published":"2023-07-24T17:49:04Z","title":"Volcanic ash delimitation using Artificial Intelligence based on Pix2Pix","summary":" Volcanic eruptions emit ash that can be harmful to human health and cause\ndamage to infrastructure, economic activities and the environment. The\ndelimitation of ash clouds allows to know their behavior and dispersion, which\nhelps in the prevention and mitigation of this phenomenon. Traditional methods\ntake advantage of specialized software programs to process the bands or\nchannels that compose the satellite images. However, their use is limited to\nexperts and demands a lot of time and significant computational resources. In\nrecent years, Artificial Intelligence has been a milestone in the computational\ntreatment of complex problems in different areas. In particular, Deep Learning\ntechniques allow automatic, fast and accurate processing of digital images. The\npresent work proposes the use of the Pix2Pix model, a type of generative\nadversarial network that, once trained, learns the mapping of input images to\noutput images. 
The architecture of such a network consisting of a generator and\na discriminator provides the versatility needed to produce black and white ash\ncloud images from multispectral satellite images. The evaluation of the model,\nbased on loss and accuracy plots, a confusion matrix, and visual inspection,\nindicates a satisfactory solution for accurate ash cloud delineation,\napplicable in any area of the world and becomes a useful tool in risk\nmanagement.\n","authors":["Christian Carrillo","Gissela Torres","Christian Mejia-Escobar"],"pdf_url":"https://arxiv.org/pdf/2307.12970v1.pdf","comment":"18 pages, in Spanish language, 15 figures"},{"id":"http://arxiv.org/abs/2307.12967v1","updated":"2023-07-24T17:45:40Z","published":"2023-07-24T17:45:40Z","title":"Learning Dense Correspondences between Photos and Sketches","summary":" Humans effortlessly grasp the connection between sketches and real-world\nobjects, even when these sketches are far from realistic. Moreover, human\nsketch understanding goes beyond categorization -- critically, it also entails\nunderstanding how individual elements within a sketch correspond to parts of\nthe physical world it represents. What are the computational ingredients needed\nto support this ability? Towards answering this question, we make two\ncontributions: first, we introduce a new sketch-photo correspondence benchmark,\n$\\textit{PSC6k}$, containing 150K annotations of 6250 sketch-photo pairs across\n125 object categories, augmenting the existing Sketchy dataset with\nfine-grained correspondence metadata. Second, we propose a self-supervised\nmethod for learning dense correspondences between sketch-photo pairs, building\nupon recent advances in correspondence learning for pairs of photos. Our model\nuses a spatial transformer network to estimate the warp flow between latent\nrepresentations of a sketch and photo extracted by a contrastive learning-based\nConvNet backbone. We found that this approach outperformed several strong\nbaselines and produced predictions that were quantitatively consistent with\nother warp-based methods. However, our benchmark also revealed systematic\ndifferences between predictions of the suite of models we tested and those of\nhumans. Taken together, our work suggests a promising path towards developing\nartificial systems that achieve more human-like understanding of visual images\nat different levels of abstraction. Project page:\nhttps://photo-sketch-correspondence.github.io\n","authors":["Xuanchen Lu","Xiaolong Wang","Judith E Fan"],"pdf_url":"https://arxiv.org/pdf/2307.12967v1.pdf","comment":"Accepted to ICML 2023. Project page:\n https://photo-sketch-correspondence.github.io"},{"id":"http://arxiv.org/abs/2307.12964v1","updated":"2023-07-24T17:43:13Z","published":"2023-07-24T17:43:13Z","title":"Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature\n Alignment","summary":" Text-to-video retrieval systems have recently made significant progress by\nutilizing pre-trained models trained on large-scale image-text pairs. However,\nmost of the latest methods primarily focus on the video modality while\ndisregarding the audio signal for this task. Nevertheless, a recent advancement\nby ECLIPSE has improved long-range text-to-video retrieval by developing an\naudiovisual video representation. Nonetheless, the objective of the\ntext-to-video retrieval task is to capture the complementary audio and video\ninformation that is pertinent to the text query rather than simply achieving\nbetter audio and video alignment. 
To address this issue, we introduce TEFAL, a\nTExt-conditioned Feature ALignment method that produces both audio and video\nrepresentations conditioned on the text query. Instead of using only an\naudiovisual attention block, which could suppress the audio information\nrelevant to the text query, our approach employs two independent cross-modal\nattention blocks that enable the text to attend to the audio and video\nrepresentations separately. Our proposed method's efficacy is demonstrated on\nfour benchmark datasets that include audio: MSR-VTT, LSMDC, VATEX, and\nCharades, and achieves better than state-of-the-art performance consistently\nacross the four datasets. This is attributed to the additional\ntext-query-conditioned audio representation and the complementary information\nit adds to the text-query-conditioned video representation.\n","authors":["Sarah Ibrahimi","Xiaohang Sun","Pichao Wang","Amanmeet Garg","Ashutosh Sanan","Mohamed Omar"],"pdf_url":"https://arxiv.org/pdf/2307.12964v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12941v1","updated":"2023-07-24T17:11:39Z","published":"2023-07-24T17:11:39Z","title":"On Privileged and Convergent Bases in Neural Network Representations","summary":" In this study, we investigate whether the representations learned by neural\nnetworks possess a privileged and convergent basis. Specifically, we examine\nthe significance of feature directions represented by individual neurons.\nFirst, we establish that arbitrary rotations of neural representations cannot\nbe inverted (unlike linear networks), indicating that they do not exhibit\ncomplete rotational invariance. Subsequently, we explore the possibility of\nmultiple bases achieving identical performance. To do this, we compare the\nbases of networks trained with the same parameters but with varying random\ninitializations. Our study reveals two findings: (1) Even in wide networks such\nas WideResNets, neural networks do not converge to a unique basis; (2) Basis\ncorrelation increases significantly when a few early layers of the network are\nfrozen identically.\n Furthermore, we analyze Linear Mode Connectivity, which has been studied as a\nmeasure of basis correlation. Our findings give evidence that while Linear Mode\nConnectivity improves with increased network width, this improvement is not due\nto an increase in basis correlation.\n","authors":["Davis Brown","Nikhil Vyas","Yamini Bansal"],"pdf_url":"https://arxiv.org/pdf/2307.12941v1.pdf","comment":"In the Workshop on High-dimensional Learning Dynamics at ICML 2023"},{"id":"http://arxiv.org/abs/2307.12917v1","updated":"2023-07-24T16:18:22Z","published":"2023-07-24T16:18:22Z","title":"Hierarchical Skeleton Meta-Prototype Contrastive Learning with Hard\n Skeleton Mining for Unsupervised Person Re-Identification","summary":" With rapid advancements in depth sensors and deep learning, skeleton-based\nperson re-identification (re-ID) models have recently achieved remarkable\nprogress with many advantages. Most existing solutions learn single-level\nskeleton features from body joints with the assumption of equal skeleton\nimportance, while they typically lack the ability to exploit more informative\nskeleton features from various levels such as limb level with more global body\npatterns. The label dependency of these methods also limits their flexibility\nin learning more general skeleton representations. 
This paper proposes a\ngeneric unsupervised Hierarchical skeleton Meta-Prototype Contrastive learning\n(Hi-MPC) approach with Hard Skeleton Mining (HSM) for person re-ID with\nunlabeled 3D skeletons. Firstly, we construct hierarchical representations of\nskeletons to model coarse-to-fine body and motion features from the levels of\nbody joints, components, and limbs. Then a hierarchical meta-prototype\ncontrastive learning model is proposed to cluster and contrast the most typical\nskeleton features (\"prototypes\") from different-level skeletons. By converting\noriginal prototypes into meta-prototypes with multiple homogeneous\ntransformations, we induce the model to learn the inherent consistency of\nprototypes to capture more effective skeleton features for person re-ID.\nFurthermore, we devise a hard skeleton mining mechanism to adaptively infer the\ninformative importance of each skeleton, so as to focus on harder skeletons to\nlearn more discriminative skeleton representations. Extensive evaluations on\nfive datasets demonstrate that our approach outperforms a wide variety of\nstate-of-the-art skeleton-based methods. We further show the general\napplicability of our method to cross-view person re-ID and RGB-based scenarios\nwith estimated skeletons.\n","authors":["Haocong Rao","Cyril Leung","Chunyan Miao"],"pdf_url":"https://arxiv.org/pdf/2307.12917v1.pdf","comment":"Accepted by International Journal of Computer Vision (IJCV). Codes\n are available at https://github.com/Kali-Hac/Hi-MPC. Supplemental materials\n will be included in the published version"},{"id":"http://arxiv.org/abs/2307.12914v1","updated":"2023-07-24T16:13:43Z","published":"2023-07-24T16:13:43Z","title":"Towards a Visual-Language Foundation Model for Computational Pathology","summary":" The accelerated adoption of digital pathology and advances in deep learning\nhave enabled the development of powerful models for various pathology tasks\nacross a diverse array of diseases and patient cohorts. However, model training\nis often difficult due to label scarcity in the medical domain and the model's\nusage is limited by the specific task and disease for which it is trained.\nAdditionally, most models in histopathology leverage only image data, a stark\ncontrast to how humans teach each other and reason about histopathologic\nentities. We introduce CONtrastive learning from Captions for Histopathology\n(CONCH), a visual-language foundation model developed using diverse sources of\nhistopathology images, biomedical text, and notably over 1.17 million\nimage-caption pairs via task-agnostic pretraining. Evaluated on a suite of 13\ndiverse benchmarks, CONCH can be transferred to a wide range of downstream\ntasks involving either or both histopathology images and text, achieving\nstate-of-the-art performance on histology image classification, segmentation,\ncaptioning, text-to-image and image-to-text retrieval. CONCH represents a\nsubstantial leap over concurrent visual-language pretrained systems for\nhistopathology, with the potential to directly facilitate a wide array of\nmachine learning-based workflows requiring minimal or no further supervised\nfine-tuning.\n","authors":["Ming Y. Lu","Bowen Chen","Drew F. K. Williamson","Richard J. 
Chen","Ivy Liang","Tong Ding","Guillaume Jaume","Igor Odintsov","Andrew Zhang","Long Phi Le","Georg Gerber","Anil V Parwani","Faisal Mahmood"],"pdf_url":"https://arxiv.org/pdf/2307.12914v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12909v1","updated":"2023-07-24T16:08:32Z","published":"2023-07-24T16:08:32Z","title":"Dyn-E: Local Appearance Editing of Dynamic Neural Radiance Fields","summary":" Recently, the editing of neural radiance fields (NeRFs) has gained\nconsiderable attention, but most prior works focus on static scenes while\nresearch on the appearance editing of dynamic scenes is relatively lacking. In\nthis paper, we propose a novel framework to edit the local appearance of\ndynamic NeRFs by manipulating pixels in a single frame of training video.\nSpecifically, to locally edit the appearance of dynamic NeRFs while preserving\nunedited regions, we introduce a local surface representation of the edited\nregion, which can be inserted into and rendered along with the original NeRF\nand warped to arbitrary other frames through a learned invertible motion\nrepresentation network. By employing our method, users without professional\nexpertise can easily add desired content to the appearance of a dynamic scene.\nWe extensively evaluate our approach on various scenes and show that our\napproach achieves spatially and temporally consistent editing results. Notably,\nour approach is versatile and applicable to different variants of dynamic NeRF\nrepresentations.\n","authors":["Shangzhan Zhang","Sida Peng","Yinji ShenTu","Qing Shuai","Tianrun Chen","Kaicheng Yu","Hujun Bao","Xiaowei Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.12909v1.pdf","comment":"project page: https://dyn-e.github.io/"},{"id":"http://arxiv.org/abs/2307.12907v1","updated":"2023-07-24T16:02:42Z","published":"2023-07-24T16:02:42Z","title":"GridMM: Grid Memory Map for Vision-and-Language Navigation","summary":" Vision-and-language navigation (VLN) enables the agent to navigate to a\nremote location following the natural language instruction in 3D environments.\nTo represent the previously visited environment, most approaches for VLN\nimplement memory using recurrent states, topological maps, or top-down semantic\nmaps. In contrast to these approaches, we build the top-down egocentric and\ndynamically growing Grid Memory Map (i.e., GridMM) to structure the visited\nenvironment. From a global perspective, historical observations are projected\ninto a unified grid map in a top-down view, which can better represent the\nspatial relations of the environment. From a local perspective, we further\npropose an instruction relevance aggregation method to capture fine-grained\nvisual clues in each grid region. Extensive experiments are conducted on both\nthe REVERIE, R2R, SOON datasets in the discrete environments, and the R2R-CE\ndataset in the continuous environments, showing the superiority of our proposed\nmethod.\n","authors":["Zihan Wang","Xiangyang Li","Jiahao Yang","Yeqi Liu","Shuqiang Jiang"],"pdf_url":"https://arxiv.org/pdf/2307.12907v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12900v1","updated":"2023-07-24T15:47:21Z","published":"2023-07-24T15:47:21Z","title":"Automotive Object Detection via Learning Sparse Events by Temporal\n Dynamics of Spiking Neurons","summary":" Event-based sensors, with their high temporal resolution (1us) and dynamical\nrange (120dB), have the potential to be deployed in high-speed platforms such\nas vehicles and drones. 
However, the highly sparse and fluctuating nature of\nevents poses challenges for conventional object detection techniques based on\nArtificial Neural Networks (ANNs). In contrast, Spiking Neural Networks (SNNs)\nare well-suited for representing event-based data due to their inherent\ntemporal dynamics. In particular, we demonstrate that the membrane potential\ndynamics can modulate network activity upon fluctuating events and strengthen\nfeatures of sparse input. In addition, the spike-triggered adaptive threshold\ncan stabilize training which further improves network performance. Based on\nthis, we develop an efficient spiking feature pyramid network for event-based\nobject detection. Our proposed SNN outperforms previous SNNs and sophisticated\nANNs with attention mechanisms, achieving a mean average precision (map50) of\n47.7% on the Gen1 benchmark dataset. This result significantly surpasses the\nprevious best SNN by 9.7% and demonstrates the potential of SNNs for\nevent-based vision. Our model has a concise architecture while maintaining high\naccuracy and much lower computation cost as a result of sparse computation. Our\ncode will be publicly available.\n","authors":["Hu Zhang","Luziwei Leng","Kaiwei Che","Qian Liu","Jie Cheng","Qinghai Guo","Jiangxing Liao","Ran Cheng"],"pdf_url":"https://arxiv.org/pdf/2307.12900v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2201.12803v3","updated":"2023-07-24T15:27:16Z","published":"2022-01-30T12:53:51Z","title":"Generalizing similarity in noisy setups: the DIBS phenomenon","summary":" This work uncovers an interplay among data density, noise, and the\ngeneralization ability in similarity learning. We consider Siamese Neural\nNetworks (SNNs), which are the basic form of contrastive learning, and explore\ntwo types of noise that can impact SNNs, Pair Label Noise (PLN) and Single\nLabel Noise (SLN). Our investigation reveals that SNNs exhibit double descent\nbehaviour regardless of the training setup and that it is further exacerbated\nby noise. We demonstrate that the density of data pairs is crucial for\ngeneralization. When SNNs are trained on sparse datasets with the same amount\nof PLN or SLN, they exhibit comparable generalization properties. However, when\nusing dense datasets, PLN cases generalize worse than SLN ones in the\noverparametrized region, leading to a phenomenon we call Density-Induced Break\nof Similarity (DIBS). In this regime, PLN similarity violation becomes\nmacroscopical, corrupting the dataset to the point where complete interpolation\ncannot be achieved, regardless of the number of model parameters. Our analysis\nalso delves into the correspondence between online optimization and offline\ngeneralization in similarity learning. The results show that this equivalence\nfails in the presence of label noise in all the scenarios considered.\n","authors":["Nayara Fonseca","Veronica Guidetti"],"pdf_url":"https://arxiv.org/pdf/2201.12803v3.pdf","comment":"v3: version accepted at ECAI 2023 + Supplementary Material"},{"id":"http://arxiv.org/abs/2307.12872v1","updated":"2023-07-24T15:10:22Z","published":"2023-07-24T15:10:22Z","title":"Data-free Black-box Attack based on Diffusion Model","summary":" Since the training data for the target model in a data-free black-box attack\nis not available, most recent schemes utilize GANs to generate data for\ntraining substitute model. 
However, these GANs-based schemes suffer from low\ntraining efficiency as the generator needs to be retrained for each target\nmodel during the substitute training process, as well as low generation\nquality. To overcome these limitations, we consider utilizing the diffusion\nmodel to generate data, and propose a data-free black-box attack scheme based\non the diffusion model to improve the efficiency and accuracy of substitute\ntraining. Although the data generated by the diffusion model exhibits high\nquality, it presents diverse domain distributions and contains many samples\nthat do not meet the discriminative criteria of the target model. To further\nfacilitate the diffusion model to generate data suitable for the target model,\nwe propose a Latent Code Augmentation (LCA) method to guide the diffusion model\nin generating data. With the guidance of LCA, the data generated by the\ndiffusion model not only meets the discriminative criteria of the target model\nbut also exhibits high diversity. By utilizing this data, it is possible to\ntrain a substitute model that closely resembles the target model more efficiently.\nExtensive experiments demonstrate that our LCA achieves higher attack success\nrates and requires fewer query budgets compared to GANs-based schemes for\ndifferent target models.\n","authors":["Mingwen Shao","Lingzhuang Meng","Yuanjian Qiao","Lixu Zhang","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2307.12872v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12868v1","updated":"2023-07-24T15:06:42Z","published":"2023-07-24T15:06:42Z","title":"Understanding the Latent Space of Diffusion Models through the Lens of\n Riemannian Geometry","summary":" Despite the success of diffusion models (DMs), we still lack a thorough\nunderstanding of their latent space. To understand the latent space\n$\\mathbf{x}_t \\in \\mathcal{X}$, we analyze them from a geometrical perspective.\nSpecifically, we utilize the pullback metric to find the local latent basis in\n$\\mathcal{X}$ and their corresponding local tangent basis in $\\mathcal{H}$, the\nintermediate feature maps of DMs. The discovered latent basis enables\nunsupervised image editing capability through latent space traversal. We\ninvestigate the discovered structure from two perspectives. First, we examine\nhow geometric structure evolves over diffusion timesteps. Through analysis, we\nshow that 1) the model focuses on low-frequency components early in the\ngenerative process and attunes to high-frequency details later; 2) At early\ntimesteps, different samples share similar tangent spaces; and 3) The simpler\nthe datasets that DMs are trained on, the more consistent the tangent space for each\ntimestep. Second, we investigate how the geometric structure changes based on\ntext conditioning in Stable Diffusion. The results show that 1) similar prompts\nyield comparable tangent spaces; and 2) the model depends less on text\nconditions in later timesteps. 
To the best of our knowledge, this paper is the\nfirst to present image editing through $\\mathbf{x}$-space traversal and provide\nthorough analyses of the latent structure of DMs.\n","authors":["Yong-Hyun Park","Mingi Kwon","Jaewoong Choi","Junghyo Jo","Youngjung Uh"],"pdf_url":"https://arxiv.org/pdf/2307.12868v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09224v2","updated":"2023-07-24T15:05:55Z","published":"2023-06-15T16:03:01Z","title":"Encyclopedic VQA: Visual questions about detailed properties of\n fine-grained categories","summary":" We propose Encyclopedic-VQA, a large scale visual question answering (VQA)\ndataset featuring visual questions about detailed properties of fine-grained\ncategories and instances. It contains 221k unique question+answer pairs each\nmatched with (up to) 5 images, resulting in a total of 1M VQA samples.\nMoreover, our dataset comes with a controlled knowledge base derived from\nWikipedia, marking the evidence to support each answer. Empirically, we show\nthat our dataset poses a hard challenge for large vision+language models as\nthey perform poorly on our dataset: PaLI [14] is state-of-the-art on OK-VQA\n[37], yet it only achieves 13.0% accuracy on our dataset. Moreover, we\nexperimentally show that progress on answering our encyclopedic questions can\nbe achieved by augmenting large models with a mechanism that retrieves relevant\ninformation from the knowledge base. An oracle experiment with perfect\nretrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and\nan automatic retrieval-augmented prototype yields 48.8%. We believe that our\ndataset enables future research on retrieval-augmented vision+language models.\nIt is available at\nhttps://github.com/google-research/google-research/tree/master/encyclopedic_vqa .\n","authors":["Thomas Mensink","Jasper Uijlings","Lluis Castrejon","Arushi Goel","Felipe Cadar","Howard Zhou","Fei Sha","André Araujo","Vittorio Ferrari"],"pdf_url":"https://arxiv.org/pdf/2306.09224v2.pdf","comment":"ICCV'23"},{"id":"http://arxiv.org/abs/2307.12858v1","updated":"2023-07-24T14:57:40Z","published":"2023-07-24T14:57:40Z","title":"Treatment Outcome Prediction for Intracerebral Hemorrhage via Generative\n Prognostic Model with Imaging and Tabular Data","summary":" Intracerebral hemorrhage (ICH) is the second most common and deadliest form\nof stroke. Despite medical advances, predicting treat ment outcomes for ICH\nremains a challenge. This paper proposes a novel prognostic model that utilizes\nboth imaging and tabular data to predict treatment outcome for ICH. Our model\nis trained on observational data collected from non-randomized controlled\ntrials, providing reliable predictions of treatment success. Specifically, we\npropose to employ a variational autoencoder model to generate a low-dimensional\nprognostic score, which can effectively address the selection bias resulting\nfrom the non-randomized controlled trials. Importantly, we develop a\nvariational distributions combination module that combines the information from\nimaging data, non-imaging clinical data, and treatment assignment to accurately\ngenerate the prognostic score. We conducted extensive experiments on a\nreal-world clinical dataset of intracerebral hemorrhage. Our proposed method\ndemonstrates a substantial improvement in treatment outcome prediction compared\nto existing state-of-the-art approaches. 
Code is available at\nhttps://github.com/med-air/TOP-GPM\n","authors":["Wenao Ma","Cheng Chen","Jill Abrigo","Calvin Hoi-Kwan Mak","Yuqi Gong","Nga Yan Chan","Chu Han","Zaiyi Liu","Qi Dou"],"pdf_url":"https://arxiv.org/pdf/2307.12858v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12854v1","updated":"2023-07-24T14:55:15Z","published":"2023-07-24T14:55:15Z","title":"Multiscale Video Pretraining for Long-Term Activity Forecasting","summary":" Long-term activity forecasting is an especially challenging research problem\nbecause it requires understanding the temporal relationships between observed\nactions, as well as the variability and complexity of human activities. Despite\nrelying on strong supervision via expensive human annotations, state-of-the-art\nforecasting approaches often generalize poorly to unseen data. To alleviate\nthis issue, we propose Multiscale Video Pretraining (MVP), a novel\nself-supervised pretraining approach that learns robust representations for\nforecasting by learning to predict contextualized representations of future\nvideo clips over multiple timescales. MVP is based on our observation that\nactions in videos have a multiscale nature, where atomic actions typically\noccur at a short timescale and more complex actions may span longer timescales.\nWe compare MVP to state-of-the-art self-supervised video learning approaches on\ndownstream long-term forecasting tasks including long-term action anticipation\nand video summary prediction. Our comprehensive experiments across the Ego4D\nand Epic-Kitchens-55/100 datasets demonstrate that MVP out-performs\nstate-of-the-art methods by significant margins. Notably, MVP obtains a\nrelative performance gain of over 20% accuracy in video summary forecasting\nover existing methods.\n","authors":["Reuben Tan","Matthias De Lange","Michael Iuzzolino","Bryan A. Plummer","Kate Saenko","Karl Ridgeway","Lorenzo Torresani"],"pdf_url":"https://arxiv.org/pdf/2307.12854v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.11630v3","updated":"2023-07-24T14:53:51Z","published":"2023-03-21T06:54:18Z","title":"BoxSnake: Polygonal Instance Segmentation with Box Supervision","summary":" Box-supervised instance segmentation has gained much attention as it requires\nonly simple box annotations instead of costly mask or polygon annotations.\nHowever, existing box-supervised instance segmentation models mainly focus on\nmask-based frameworks. We propose a new end-to-end training technique, termed\nBoxSnake, to achieve effective polygonal instance segmentation using only box\nannotations for the first time. Our method consists of two loss functions: (1)\na point-based unary loss that constrains the bounding box of predicted polygons\nto achieve coarse-grained segmentation; and (2) a distance-aware pairwise loss\nthat encourages the predicted polygons to fit the object boundaries. Compared\nwith the mask-based weakly-supervised methods, BoxSnake further reduces the\nperformance gap between the predicted segmentation and the bounding box, and\nshows significant superiority on the Cityscapes dataset. 
The code has been\navailable publicly.\n","authors":["Rui Yang","Lin Song","Yixiao Ge","Xiu Li"],"pdf_url":"https://arxiv.org/pdf/2303.11630v3.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12853v1","updated":"2023-07-24T14:53:23Z","published":"2023-07-24T14:53:23Z","title":"Spatiotemporal Modeling Encounters 3D Medical Image Analysis:\n Slice-Shift UNet with Multi-View Fusion","summary":" As a fundamental part of computational healthcare, Computer Tomography (CT)\nand Magnetic Resonance Imaging (MRI) provide volumetric data, making the\ndevelopment of algorithms for 3D image analysis a necessity. Despite being\ncomputationally cheap, 2D Convolutional Neural Networks can only extract\nspatial information. In contrast, 3D CNNs can extract three-dimensional\nfeatures, but they have higher computational costs and latency, which is a\nlimitation for clinical practice that requires fast and efficient models.\nInspired by the field of video action recognition we propose a new 2D-based\nmodel dubbed Slice SHift UNet (SSH-UNet) which encodes three-dimensional\nfeatures at 2D CNN's complexity. More precisely multi-view features are\ncollaboratively learned by performing 2D convolutions along the three\northogonal planes of a volume and imposing a weights-sharing mechanism. The\nthird dimension, which is neglected by the 2D convolution, is reincorporated by\nshifting a portion of the feature maps along the slices' axis. The\neffectiveness of our approach is validated in Multi-Modality Abdominal\nMulti-Organ Segmentation (AMOS) and Multi-Atlas Labeling Beyond the Cranial\nVault (BTCV) datasets, showing that SSH-UNet is more efficient while on par in\nperformance with state-of-the-art architectures.\n","authors":["C. I. Ugwu","S. Casarin","O. Lanz"],"pdf_url":"https://arxiv.org/pdf/2307.12853v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12845v1","updated":"2023-07-24T14:43:07Z","published":"2023-07-24T14:43:07Z","title":"Multi-View Vertebra Localization and Identification from CT Images","summary":" Accurately localizing and identifying vertebrae from CT images is crucial for\nvarious clinical applications. However, most existing efforts are performed on\n3D with cropping patch operation, suffering from the large computation costs\nand limited global information. In this paper, we propose a multi-view vertebra\nlocalization and identification from CT images, converting the 3D problem into\na 2D localization and identification task on different views. Without the\nlimitation of the 3D cropped patch, our method can learn the multi-view global\ninformation naturally. Moreover, to better capture the anatomical structure\ninformation from different view perspectives, a multi-view contrastive learning\nstrategy is developed to pre-train the backbone. Additionally, we further\npropose a Sequence Loss to maintain the sequential structure embedded along the\nvertebrae. Evaluation results demonstrate that, with only two 2D networks, our\nmethod can localize and identify vertebrae in CT images accurately, and\noutperforms the state-of-the-art methods consistently. 
Our code is available at\nhttps://github.com/ShanghaiTech-IMPACT/Multi-View-Vertebra-Localization-and-Identification-from-CT-Images.\n","authors":["Han Wu","Jiadong Zhang","Yu Fang","Zhentao Liu","Nizhuan Wang","Zhiming Cui","Dinggang Shen"],"pdf_url":"https://arxiv.org/pdf/2307.12845v1.pdf","comment":"MICCAI 2023"},{"id":"http://arxiv.org/abs/2306.15599v2","updated":"2023-07-24T14:41:40Z","published":"2023-06-27T16:37:37Z","title":"Coupling a Recurrent Neural Network to SPAD TCSPC Systems for Real-time\n Fluorescence Lifetime Imaging","summary":" Fluorescence lifetime imaging (FLI) has been receiving increased attention in\nrecent years as a powerful diagnostic technique in biological and medical\nresearch. However, existing FLI systems often suffer from a tradeoff between\nprocessing speed, accuracy, and robustness. In this paper, we propose a robust\napproach that enables fast FLI with no degradation of accuracy. The approach is\nbased on a SPAD TCSPC system coupled to a recurrent neural network (RNN) that\naccurately estimates the fluorescence lifetime directly from raw timestamps\nwithout building histograms, thereby drastically reducing transfer data volumes\nand hardware resource utilization, thus enabling FLI acquisition at video rate.\nWe train two variants of the RNN on a synthetic dataset and compare the results\nto those obtained using center-of-mass method (CMM) and least squares fitting\n(LS fitting). Results demonstrate that two RNN variants, gated recurrent unit\n(GRU) and long short-term memory (LSTM), are comparable to CMM and LS fitting\nin terms of accuracy, while outperforming them in background noise by a large\nmargin. To explore the ultimate limits of the approach, we derived the\nCramer-Rao lower bound of the measurement, showing that RNN yields lifetime\nestimations with near-optimal precision. Moreover, our FLI model, which is\npurely trained on synthetic datasets, works well with never-seen-before,\nreal-world data. To demonstrate real-time operation, we have built a FLI\nmicroscope based on Piccolo, a 32x32 SPAD sensor developed in our lab. Four\nquantized GRU cores, capable of processing up to 4 million photons per second,\nare deployed on a Xilinx Kintex-7 FPGA. Powered by the GRU, the FLI setup can\nretrieve real-time fluorescence lifetime images at up to 10 frames per second.\nThe proposed FLI system is promising and ideally suited for biomedical\napplications.\n","authors":["Yang Lin","Paul Mos","Andrei Ardelean","Claudio Bruschini","Edoardo Charbon"],"pdf_url":"https://arxiv.org/pdf/2306.15599v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09696v2","updated":"2023-07-24T14:36:24Z","published":"2023-07-19T00:41:39Z","title":"Towards Saner Deep Image Registration","summary":" With recent advances in computing hardware and surges of deep-learning\narchitectures, learning-based deep image registration methods have surpassed\ntheir traditional counterparts, in terms of metric performance and inference\ntime. However, these methods focus on improving performance measurements such\nas Dice, resulting in less attention given to model behaviors that are equally\ndesirable for registrations, especially for medical imaging. This paper\ninvestigates these behaviors for popular learning-based deep registrations\nunder a sanity-checking microscope. We find that most existing registrations\nsuffer from low inverse consistency and nondiscrimination of identical pairs\ndue to overly optimized image similarities. 
To rectify these behaviors, we\npropose a novel regularization-based sanity-enforcer method that imposes two\nsanity checks on the deep model to reduce its inverse consistency errors and\nincrease its discriminative power simultaneously. Moreover, we derive a set of\ntheoretical guarantees for our sanity-checked image registration method, with\nexperimental results supporting our theoretical findings and their\neffectiveness in increasing the sanity of models without sacrificing any\nperformance. Our code and models are available at\nhttps://github.com/tuffr5/Saner-deep-registration.\n","authors":["Bin Duan","Ming Zhong","Yan Yan"],"pdf_url":"https://arxiv.org/pdf/2307.09696v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12837v1","updated":"2023-07-24T14:35:46Z","published":"2023-07-24T14:35:46Z","title":"EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge: Mixed\n Sequences Prediction","summary":" This report presents the technical details of our approach for the\nEPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge in Action\nRecognition. Our approach is based on the idea that the order in which actions\nare performed is similar between the source and target domains. Based on this,\nwe generate a modified sequence by randomly combining actions from the source\nand target domains. As only unlabelled target data are available under the UDA\nsetting, we use a standard pseudo-labeling strategy for extracting action\nlabels for the target. We then ask the network to predict the resulting action\nsequence. This allows to integrate information from both domains during\ntraining and to achieve better transfer results on target. Additionally, to\nbetter incorporate sequence information, we use a language model to filter\nunlikely sequences. Lastly, we employed a co-occurrence matrix to eliminate\nunseen combinations of verbs and nouns. Our submission, labeled as 'sshayan',\ncan be found on the leaderboard, where it currently holds the 2nd position for\n'verb' and the 4th position for both 'noun' and 'action'.\n","authors":["Amirshayan Nasirimajd","Simone Alberto Peirone","Chiara Plizzari","Barbara Caputo"],"pdf_url":"https://arxiv.org/pdf/2307.12837v1.pdf","comment":"2nd place in the 2023 EPIC-KITCHENS-100 Unsupervised Domain\n Adaptation Challenge for Action Recognition"},{"id":"http://arxiv.org/abs/2307.12822v1","updated":"2023-07-24T14:19:36Z","published":"2023-07-24T14:19:36Z","title":"Learning Provably Robust Estimators for Inverse Problems via Jittering","summary":" Deep neural networks provide excellent performance for inverse problems such\nas denoising. However, neural networks can be sensitive to adversarial or\nworst-case perturbations. This raises the question of whether such networks can\nbe trained efficiently to be worst-case robust. In this paper, we investigate\nwhether jittering, a simple regularization technique that adds isotropic\nGaussian noise during training, is effective for learning worst-case robust\nestimators for inverse problems. While well studied for prediction in\nclassification tasks, the effectiveness of jittering for inverse problems has\nnot been systematically investigated. 
In this paper, we present a novel\nanalytical characterization of the optimal $\\ell_2$-worst-case robust estimator\nfor linear denoising and show that jittering yields optimal robust denoisers.\nFurthermore, we examine jittering empirically via training deep neural networks\n(U-nets) for natural image denoising, deconvolution, and accelerated magnetic\nresonance imaging (MRI). The results show that jittering significantly enhances\nthe worst-case robustness, but can be suboptimal for inverse problems beyond\ndenoising. Moreover, our results imply that training on real data which often\ncontains slight noise is somewhat robustness enhancing.\n","authors":["Anselm Krainovic","Mahdi Soltanolkotabi","Reinhard Heckel"],"pdf_url":"https://arxiv.org/pdf/2307.12822v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12813v1","updated":"2023-07-24T14:06:54Z","published":"2023-07-24T14:06:54Z","title":"Exposing the Troublemakers in Described Object Detection","summary":" Detecting objects based on language descriptions is a popular task that\nincludes Open-Vocabulary object Detection (OVD) and Referring Expression\nComprehension (REC). In this paper, we advance them to a more practical setting\ncalled Described Object Detection (DOD) by expanding category names to flexible\nlanguage expressions for OVD and overcoming the limitation of REC to only\ngrounding the pre-existing object. We establish the research foundation for DOD\ntasks by constructing a Description Detection Dataset ($D^3$), featuring\nflexible language expressions and annotating all described objects without\nomission. By evaluating previous SOTA methods on $D^3$, we find some\ntroublemakers that fail current REC, OVD, and bi-functional methods. REC\nmethods struggle with confidence scores, rejecting negative instances, and\nmulti-target scenarios, while OVD methods face constraints with long and\ncomplex descriptions. Recent bi-functional methods also do not work well on DOD\ndue to their separated training procedures and inference strategies for REC and\nOVD tasks. Building upon the aforementioned findings, we propose a baseline\nthat largely improves REC methods by reconstructing the training data and\nintroducing a binary classification sub-task, outperforming existing methods.\nData and code is available at https://github.com/shikras/d-cube.\n","authors":["Chi Xie","Zhao Zhang","Yixuan Wu","Feng Zhu","Rui Zhao","Shuang Liang"],"pdf_url":"https://arxiv.org/pdf/2307.12813v1.pdf","comment":"Preprint. Under review"},{"id":"http://arxiv.org/abs/2307.02148v2","updated":"2023-07-24T13:59:50Z","published":"2023-07-05T09:44:02Z","title":"Compound Attention and Neighbor Matching Network for Multi-contrast MRI\n Super-resolution","summary":" Multi-contrast magnetic resonance imaging (MRI) reflects information about\nhuman tissue from different perspectives and has many clinical applications. By\nutilizing the complementary information among different modalities,\nmulti-contrast super-resolution (SR) of MRI can achieve better results than\nsingle-image super-resolution. However, existing methods of multi-contrast MRI\nSR have the following shortcomings that may limit their performance: First,\nexisting methods either simply concatenate the reference and degraded features\nor exploit global feature-matching between them, which are unsuitable for\nmulti-contrast MRI SR. 
Second, although many recent methods employ transformers\nto capture long-range dependencies in the spatial dimension, they neglect that\nself-attention in the channel dimension is also important for low-level vision\ntasks. To address these shortcomings, we proposed a novel network architecture\nwith compound-attention and neighbor matching (CANM-Net) for multi-contrast MRI\nSR: The compound self-attention mechanism effectively captures the dependencies\nin both spatial and channel dimension; the neighborhood-based feature-matching\nmodules are exploited to match degraded features and adjacent reference\nfeatures and then fuse them to obtain the high-quality images. We conduct\nexperiments of SR tasks on the IXI, fastMRI, and real-world scanning datasets.\nThe CANM-Net outperforms state-of-the-art approaches in both retrospective and\nprospective experiments. Moreover, the robustness study in our work shows that\nthe CANM-Net still achieves good performance when the reference and degraded\nimages are imperfectly registered, proving good potential in clinical\napplications.\n","authors":["Wenxuan Chen","Sirui Wu","Shuai Wang","Zhongsen Li","Jia Yang","Huifeng Yao","Xiaomeng Li","Xiaolei Song"],"pdf_url":"https://arxiv.org/pdf/2307.02148v2.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2211.16761v3","updated":"2023-07-24T13:53:26Z","published":"2022-11-30T05:59:23Z","title":"Improving Cross-Modal Retrieval with Set of Diverse Embeddings","summary":" Cross-modal retrieval across image and text modalities is a challenging task\ndue to its inherent ambiguity: An image often exhibits various situations, and\na caption can be coupled with diverse images. Set-based embedding has been\nstudied as a solution to this problem. It seeks to encode a sample into a set\nof different embedding vectors that capture different semantics of the sample.\nIn this paper, we present a novel set-based embedding method, which is distinct\nfrom previous work in two aspects. First, we present a new similarity function\ncalled smooth-Chamfer similarity, which is designed to alleviate the side\neffects of existing similarity functions for set-based embedding. Second, we\npropose a novel set prediction module to produce a set of embedding vectors\nthat effectively captures diverse semantics of input by the slot attention\nmechanism. Our method is evaluated on the COCO and Flickr30K datasets across\ndifferent visual backbones, where it outperforms existing methods including\nones that demand substantially larger computation at inference.\n","authors":["Dongwon Kim","Namyup Kim","Suha Kwak"],"pdf_url":"https://arxiv.org/pdf/2211.16761v3.pdf","comment":"Accepted to CVPR 2023 (Highlight)"},{"id":"http://arxiv.org/abs/2307.12790v1","updated":"2023-07-24T13:39:21Z","published":"2023-07-24T13:39:21Z","title":"Compact & Capable: Harnessing Graph Neural Networks and Edge Convolution\n for Medical Image Classification","summary":" Graph-based neural network models are gaining traction in the field of\nrepresentation learning due to their ability to uncover latent topological\nrelationships between entities that are otherwise challenging to identify.\nThese models have been employed across a diverse range of domains, encompassing\ndrug discovery, protein interactions, semantic segmentation, and fluid dynamics\nresearch. 
In this study, we investigate the potential of Graph Neural Networks\n(GNNs) for medical image classification. We introduce a novel model that\ncombines GNNs and edge convolution, leveraging the interconnectedness of RGB\nchannel feature values to strongly represent connections between crucial graph\nnodes. Our proposed model not only performs on par with state-of-the-art Deep\nNeural Networks (DNNs) but does so with 1000 times fewer parameters, resulting\nin reduced training time and data requirements. We compare our Graph\nConvolutional Neural Network (GCNN) to pre-trained DNNs for classifying\nMedMNIST dataset classes, revealing promising prospects for GNNs in medical\nimage analysis. Our results also encourage further exploration of advanced\ngraph-based models such as Graph Attention Networks (GAT) and Graph\nAuto-Encoders in the medical imaging domain. The proposed model yields more\nreliable, interpretable, and accurate outcomes for tasks like semantic\nsegmentation and image classification compared to simpler GCNNs\n","authors":["Aryan Singh","Pepijn Van de Ven","Ciarán Eising","Patrick Denny"],"pdf_url":"https://arxiv.org/pdf/2307.12790v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.13170v4","updated":"2023-07-24T13:35:28Z","published":"2022-04-27T20:04:24Z","title":"AdaBest: Minimizing Client Drift in Federated Learning via Adaptive Bias\n Estimation","summary":" In Federated Learning (FL), a number of clients or devices collaborate to\ntrain a model without sharing their data. Models are optimized locally at each\nclient and further communicated to a central hub for aggregation. While FL is\nan appealing decentralized training paradigm, heterogeneity among data from\ndifferent clients can cause the local optimization to drift away from the\nglobal objective. In order to estimate and therefore remove this drift,\nvariance reduction techniques have been incorporated into FL optimization\nrecently. However, these approaches inaccurately estimate the clients' drift\nand ultimately fail to remove it properly. In this work, we propose an adaptive\nalgorithm that accurately estimates drift across clients. In comparison to\nprevious works, our approach necessitates less storage and communication\nbandwidth, as well as lower compute costs. Additionally, our proposed\nmethodology induces stability by constraining the norm of estimates for client\ndrift, making it more practical for large scale FL. Experimental findings\ndemonstrate that the proposed algorithm converges significantly faster and\nachieves higher accuracy than the baselines across various FL benchmarks.\n","authors":["Farshid Varno","Marzie Saghayi","Laya Rafiee Sevyeri","Sharut Gupta","Stan Matwin","Mohammad Havaei"],"pdf_url":"https://arxiv.org/pdf/2204.13170v4.pdf","comment":"Published as a conference paper at ECCV 2022; Corrected some typos in\n the text and a baseline algorithm"},{"id":"http://arxiv.org/abs/2303.12540v2","updated":"2023-07-24T13:35:16Z","published":"2023-03-22T13:16:37Z","title":"Deployment of Image Analysis Algorithms under Prevalence Shifts","summary":" Domain gaps are among the most relevant roadblocks in the clinical\ntranslation of machine learning (ML)-based solutions for medical image\nanalysis. While current research focuses on new training paradigms and network\narchitectures, little attention is given to the specific effect of prevalence\nshifts on an algorithm deployed in practice. 
Such discrepancies between class\nfrequencies in the data used for a method's development/validation and that in\nits deployment environment(s) are of great importance, for example in the\ncontext of artificial intelligence (AI) democratization, as disease prevalences\nmay vary widely across time and location. Our contribution is twofold. First,\nwe empirically demonstrate the potentially severe consequences of missing\nprevalence handling by analyzing (i) the extent of miscalibration, (ii) the\ndeviation of the decision threshold from the optimum, and (iii) the ability of\nvalidation metrics to reflect neural network performance on the deployment\npopulation as a function of the discrepancy between development and deployment\nprevalence. Second, we propose a workflow for prevalence-aware image\nclassification that uses estimated deployment prevalences to adjust a trained\nclassifier to a new environment, without requiring additional annotated\ndeployment data. Comprehensive experiments based on a diverse set of 30 medical\nclassification tasks showcase the benefit of the proposed workflow in\ngenerating better classifier decisions and more reliable performance estimates\ncompared to current practice.\n","authors":["Patrick Godau","Piotr Kalinowski","Evangelia Christodoulou","Annika Reinke","Minu Tizabi","Luciana Ferrer","Paul Jäger","Lena Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2303.12540v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12775v1","updated":"2023-07-24T13:24:56Z","published":"2023-07-24T13:24:56Z","title":"Is attention all you need in medical image analysis? A review","summary":" Medical imaging is a key component in clinical diagnosis, treatment planning\nand clinical trial design, accounting for almost 90% of all healthcare data.\nCNNs achieved performance gains in medical image analysis (MIA) over the last\nyears. CNNs can efficiently model local pixel interactions and be trained on\nsmall-scale MI data. The main disadvantage of typical CNN models is that they\nignore global pixel relationships within images, which limits their\ngeneralisation ability to understand out-of-distribution data with different\n'global' information. The recent progress of Artificial Intelligence gave rise\nto Transformers, which can learn global relationships from data. However, full\nTransformer models need to be trained on large-scale data and involve\ntremendous computational complexity. Attention and Transformer compartments\n(Transf/Attention) which can well maintain properties for modelling global\nrelationships, have been proposed as lighter alternatives of full Transformers.\nRecently, there is an increasing trend to co-pollinate complementary\nlocal-global properties from CNN and Transf/Attention architectures, which led\nto a new era of hybrid models. The past years have witnessed substantial growth\nin hybrid CNN-Transf/Attention models across diverse MIA problems. In this\nsystematic review, we survey existing hybrid CNN-Transf/Attention models,\nreview and unravel key architectural designs, analyse breakthroughs, and\nevaluate current and future opportunities as well as challenges. 
We also\nintroduced a comprehensive analysis framework on generalisation opportunities\nof scientific and clinical impact, based on which new data-driven domain\ngeneralisation and adaptation methods can be stimulated.\n","authors":["Giorgos Papanastasiou","Nikolaos Dikaios","Jiahao Huang","Chengjia Wang","Guang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12775v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12774v1","updated":"2023-07-24T13:24:19Z","published":"2023-07-24T13:24:19Z","title":"Fast Full-frame Video Stabilization with Iterative Optimization","summary":" Video stabilization refers to the problem of transforming a shaky video into\na visually pleasing one. The question of how to strike a good trade-off between\nvisual quality and computational speed has remained one of the open challenges\nin video stabilization. Inspired by the analogy between wobbly frames and\njigsaw puzzles, we propose an iterative optimization-based learning approach\nusing synthetic datasets for video stabilization, which consists of two\ninteracting submodules: motion trajectory smoothing and full-frame outpainting.\nFirst, we develop a two-level (coarse-to-fine) stabilizing algorithm based on\nthe probabilistic flow field. The confidence map associated with the estimated\noptical flow is exploited to guide the search for shared regions through\nbackpropagation. Second, we take a divide-and-conquer approach and propose a\nnovel multiframe fusion strategy to render full-frame stabilized views. An\nimportant new insight brought about by our iterative optimization approach is\nthat the target video can be interpreted as the fixed point of nonlinear\nmapping for video stabilization. We formulate video stabilization as a problem\nof minimizing the amount of jerkiness in motion trajectories, which guarantees\nconvergence with the help of fixed-point theory. Extensive experimental results\nare reported to demonstrate the superiority of the proposed approach in terms\nof computational speed and visual quality. The code will be available on\nGitHub.\n","authors":["Weiyue Zhao","Xin Li","Zhan Peng","Xianrui Luo","Xinyi Ye","Hao Lu","Zhiguo Cao"],"pdf_url":"https://arxiv.org/pdf/2307.12774v1.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2307.12761v1","updated":"2023-07-24T13:05:36Z","published":"2023-07-24T13:05:36Z","title":"LiDAR Meta Depth Completion","summary":" Depth estimation is one of the essential tasks to be addressed when creating\nmobile autonomous systems. While monocular depth estimation methods have\nimproved in recent times, depth completion provides more accurate and reliable\ndepth maps by additionally using sparse depth information from other sensors\nsuch as LiDAR. However, current methods are specifically trained for a single\nLiDAR sensor. As the scanning pattern differs between sensors, every new sensor\nwould require re-training a specialized depth completion model, which is\ncomputationally inefficient and not flexible. Therefore, we propose to\ndynamically adapt the depth completion model to the used sensor type enabling\nLiDAR adaptive depth completion. Specifically, we propose a meta depth\ncompletion network that uses data patterns derived from the data to learn a\ntask network to alter weights of the main depth completion network to solve a\ngiven depth completion task effectively. The method demonstrates a strong\ncapability to work on multiple LiDAR scanning patterns and can also generalize\nto scanning patterns that are unseen during training. 
While using a single\nmodel, our method yields significantly better results than a non-adaptive\nbaseline trained on different LiDAR patterns. It outperforms LiDAR-specific\nexpert models for very sparse cases. These advantages allow flexible deployment\nof a single depth completion model on different sensors, which could also prove\nvaluable to process the input of nascent LiDAR technology with adaptive instead\nof fixed scanning patterns.\n","authors":["Wolfgang Boettcher","Lukas Hoyer","Ozan Unal","Dengxin Dai"],"pdf_url":"https://arxiv.org/pdf/2307.12761v1.pdf","comment":"Accepted at IROS 2023"},{"id":"http://arxiv.org/abs/2209.11531v2","updated":"2023-07-24T13:04:48Z","published":"2022-09-23T11:36:32Z","title":"Deep Learning-based Anonymization of Chest Radiographs: A\n Utility-preserving Measure for Patient Privacy","summary":" Robust and reliable anonymization of chest radiographs constitutes an\nessential step before publishing large datasets of such for research purposes.\nThe conventional anonymization process is carried out by obscuring personal\ninformation in the images with black boxes and removing or replacing\nmeta-information. However, such simple measures retain biometric information in\nthe chest radiographs, allowing patients to be re-identified by a linkage\nattack. Therefore, there is an urgent need to obfuscate the biometric\ninformation appearing in the images. We propose the first deep learning-based\napproach (PriCheXy-Net) to targetedly anonymize chest radiographs while\nmaintaining data utility for diagnostic and machine learning purposes. Our\nmodel architecture is a composition of three independent neural networks that,\nwhen collectively used, allow for learning a deformation field that is able to\nimpede patient re-identification. Quantitative results on the ChestX-ray14\ndataset show a reduction of patient re-identification from 81.8% to 57.7% (AUC)\nafter re-training with little impact on the abnormality classification\nperformance. This indicates the ability to preserve underlying abnormality\npatterns while increasing patient privacy. Lastly, we compare our proposed\nanonymization approach with two other obfuscation-based methods (Privacy-Net,\nDP-Pix) and demonstrate the superiority of our method towards resolving the\nprivacy-utility trade-off for chest radiographs.\n","authors":["Kai Packhäuser","Sebastian Gündel","Florian Thamm","Felix Denzinger","Andreas Maier"],"pdf_url":"https://arxiv.org/pdf/2209.11531v2.pdf","comment":"Accepted at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.07620v2","updated":"2023-07-24T13:03:17Z","published":"2023-07-14T20:39:07Z","title":"Generalizable Embeddings with Cross-batch Metric Learning","summary":" Global average pooling (GAP) is a popular component in deep metric learning\n(DML) for aggregating features. Its effectiveness is often attributed to\ntreating each feature vector as a distinct semantic entity and GAP as a\ncombination of them. Albeit substantiated, such an explanation's algorithmic\nimplications to learn generalizable entities to represent unseen classes, a\ncrucial DML goal, remain unclear. To address this, we formulate GAP as a convex\ncombination of learnable prototypes. We then show that the prototype learning\ncan be expressed as a recursive process fitting a linear predictor to a batch\nof samples. 
Building on that perspective, we consider two batches of disjoint\nclasses at each iteration and regularize the learning by expressing the samples\nof a batch with the prototypes that are fitted to the other batch. We validate\nour approach on 4 popular DML benchmarks.\n","authors":["Yeti Z. Gurbuz","A. Aydin Alatan"],"pdf_url":"https://arxiv.org/pdf/2307.07620v2.pdf","comment":"\\c{opyright} 2023 IEEE. Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2307.12751v1","updated":"2023-07-24T12:42:45Z","published":"2023-07-24T12:42:45Z","title":"ICF-SRSR: Invertible scale-Conditional Function for Self-Supervised\n Real-world Single Image Super-Resolution","summary":" Single image super-resolution (SISR) is a challenging ill-posed problem that\naims to up-sample a given low-resolution (LR) image to a high-resolution (HR)\ncounterpart. Due to the difficulty in obtaining real LR-HR training pairs,\nrecent approaches are trained on simulated LR images degraded by simplified\ndown-sampling operators, e.g., bicubic. Such an approach can be problematic in\npractice because of the large gap between the synthesized and real-world LR\nimages. To alleviate the issue, we propose a novel Invertible scale-Conditional\nFunction (ICF), which can scale an input image and then restore the original\ninput with different scale conditions. By leveraging the proposed ICF, we\nconstruct a novel self-supervised SISR framework (ICF-SRSR) to handle the\nreal-world SR task without using any paired/unpaired training data.\nFurthermore, our ICF-SRSR can generate realistic and feasible LR-HR pairs,\nwhich can make existing supervised SISR networks more robust. Extensive\nexperiments demonstrate the effectiveness of the proposed method in handling\nSISR in a fully self-supervised manner. Our ICF-SRSR demonstrates superior\nperformance compared to the existing methods trained on synthetic paired images\nin real-world scenarios and exhibits comparable performance compared to\nstate-of-the-art supervised/unsupervised methods on public benchmark datasets.\n","authors":["Reyhaneh Neshatavar","Mohsen Yavartanoo","Sanghyun Son","Kyoung Mu Lee"],"pdf_url":"https://arxiv.org/pdf/2307.12751v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.09629v2","updated":"2023-07-24T12:33:09Z","published":"2023-02-19T17:15:56Z","title":"BiofilmScanner: A Computational Intelligence Approach to Obtain\n Bacterial Cell Morphological Attributes from Biofilm Image","summary":" Desulfovibrio alaskensis G20 (DA-G20) is utilized as a model for\nsulfate-reducing bacteria (SRB) that are associated with corrosion issues\ncaused by microorganisms. SRB-based biofilms are thought to be responsible for\nthe billion-dollar-per-year bio-corrosion of metal infrastructure.\nUnderstanding the extraction of the bacterial cells' shape and size properties\nin the SRB-biofilm at different growth stages will assist with the design of\nanti-corrosion techniques. However, numerous issues affect current approaches,\nincluding time-consuming geometric property extraction, low efficiency, and\nhigh error rates. 
This paper proposes BiofilmScanner, a Yolact-based deep\nlearning method integrated with invariant moments to address these problems.\nOur approach efficiently detects and segments bacterial cells in an SRB image\nwhile invariant moments simultaneously measure the geometric characteristics of\nthe segmented cells with low errors. The numerical experiments of the proposed\nmethod demonstrate that the BiofilmScanner is 2.1x and 6.8x faster than our\nearlier Mask-RCNN and DLv3+ methods for detecting, segmenting, and measuring\nthe geometric properties of the cell. Furthermore, the BiofilmScanner achieved\nan F1-score of 85.28% while Mask-RCNN and DLv3+ obtained F1-scores of 77.67%\nand 75.18%, respectively.\n","authors":["Md Hafizur Rahman","Md Ali Azam","Md Abir Hossen","Shankarachary Ragi","Venkataramana Gadhamshetty"],"pdf_url":"https://arxiv.org/pdf/2302.09629v2.pdf","comment":"Submitted to Pattern Recognition"},{"id":"http://arxiv.org/abs/2307.12732v1","updated":"2023-07-24T12:24:07Z","published":"2023-07-24T12:24:07Z","title":"CLIP-KD: An Empirical Study of Distilling CLIP Models","summary":" CLIP has become a promising language-supervised visual pre-training framework\nand achieves excellent performance over a wide range of tasks. This paper aims\nto distill small CLIP models supervised by a large teacher CLIP model. We\npropose several distillation strategies, including relation, feature, gradient\nand contrastive paradigm, to examine the impact on CLIP distillation. We show\nthat the simplest feature mimicry with MSE loss performs best. Moreover,\ninteractive contrastive learning and relation-based distillation are also\ncritical in performance improvement. We apply the unified method to distill\nseveral student networks trained on 15 million (image, text) pairs.\nDistillation improves the student CLIP models consistently over zero-shot\nImageNet classification and cross-modal retrieval benchmarks. We hope our\nempirical study will become an important baseline for future CLIP distillation\nresearch. The code is available at \\url{https://github.com/winycg/CLIP-KD}.\n","authors":["Chuanguang Yang","Zhulin An","Libo Huang","Junyu Bi","Xinqiang Yu","Han Yang","Yongjun Xu"],"pdf_url":"https://arxiv.org/pdf/2307.12732v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12730v1","updated":"2023-07-24T12:22:19Z","published":"2023-07-24T12:22:19Z","title":"COCO-O: A Benchmark for Object Detectors under Natural Distribution\n Shifts","summary":" Practical object detection applications can lose their effectiveness on image\ninputs with natural distribution shifts. This problem leads the research\ncommunity to pay more attention to the robustness of detectors under\nOut-Of-Distribution (OOD) inputs. Existing works construct datasets to\nbenchmark the detector's OOD robustness for a specific application scenario,\ne.g., Autonomous Driving. However, these datasets lack universality and make it\nhard to benchmark general detectors built on common tasks such as COCO. To give\na more comprehensive robustness assessment, we introduce\nCOCO-O(ut-of-distribution), a test dataset based on COCO with 6 types of\nnatural distribution shifts. COCO-O has a large distribution gap with training\ndata and results in a significant 55.7% relative performance drop on a Faster\nR-CNN detector. We leverage COCO-O to conduct experiments on more than 100\nmodern object detectors to investigate if their improvements are credible or\njust over-fitting to the COCO test set. 
Unfortunately, most classic detectors\nin early years do not exhibit strong OOD generalization. We further study the\nrobustness effect on recent breakthroughs of detector's architecture design,\naugmentation and pre-training techniques. Some empirical findings are revealed:\n1) Compared with detection head or neck, backbone is the most important part\nfor robustness; 2) An end-to-end detection transformer design brings no\nenhancement, and may even reduce robustness; 3) Large-scale foundation models\nhave made a great leap on robust object detection. We hope our COCO-O could\nprovide a rich testbed for robustness study of object detection. The dataset\nwill be available at\n\\url{https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o}.\n","authors":["Xiaofeng Mao","Yuefeng Chen","Yao Zhu","Da Chen","Hang Su","Rong Zhang","Hui Xue"],"pdf_url":"https://arxiv.org/pdf/2307.12730v1.pdf","comment":"To appear in ICCV2023,\n https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o"},{"id":"http://arxiv.org/abs/2307.12729v1","updated":"2023-07-24T12:21:33Z","published":"2023-07-24T12:21:33Z","title":"Persistent-Transient Duality: A Multi-mechanism Approach for Modeling\n Human-Object Interaction","summary":" Humans are highly adaptable, swiftly switching between different modes to\nprogressively handle different tasks, situations and contexts. In Human-object\ninteraction (HOI) activities, these modes can be attributed to two mechanisms:\n(1) the large-scale consistent plan for the whole activity and (2) the\nsmall-scale children interactive actions that start and end along the timeline.\nWhile neuroscience and cognitive science have confirmed this multi-mechanism\nnature of human behavior, machine modeling approaches for human motion are\ntrailing behind. While attempted to use gradually morphing structures (e.g.,\ngraph attention networks) to model the dynamic HOI patterns, they miss the\nexpeditious and discrete mode-switching nature of the human motion. To bridge\nthat gap, this work proposes to model two concurrent mechanisms that jointly\ncontrol human motion: the Persistent process that runs continually on the\nglobal scale, and the Transient sub-processes that operate intermittently on\nthe local context of the human while interacting with objects. These two\nmechanisms form an interactive Persistent-Transient Duality that\nsynergistically governs the activity sequences. We model this conceptual\nduality by a parent-child neural network of Persistent and Transient channels\nwith a dedicated neural module for dynamic mechanism switching. The framework\nis trialed on HOI motion forecasting. On two rich datasets and a wide variety\nof settings, the model consistently delivers superior performances, proving its\nsuitability for the challenge.\n","authors":["Hung Tran","Vuong Le","Svetha Venkatesh","Truyen Tran"],"pdf_url":"https://arxiv.org/pdf/2307.12729v1.pdf","comment":"Accepted at ICCV 2023"},{"id":"http://arxiv.org/abs/2303.12865v3","updated":"2023-07-24T12:08:50Z","published":"2023-03-22T18:59:48Z","title":"NeRF-GAN Distillation for Efficient 3D-Aware Generation with\n Convolutions","summary":" Pose-conditioned convolutional generative models struggle with high-quality\n3D-consistent image generation from single-view datasets, due to their lack of\nsufficient 3D priors. Recently, the integration of Neural Radiance Fields\n(NeRFs) and generative models, such as Generative Adversarial Networks (GANs),\nhas transformed 3D-aware generation from single-view images. 
NeRF-GANs exploit\nthe strong inductive bias of neural 3D representations and volumetric rendering\nat the cost of higher computational complexity. This study aims at revisiting\npose-conditioned 2D GANs for efficient 3D-aware generation at inference time by\ndistilling 3D knowledge from pretrained NeRF-GANs. We propose a simple and\neffective method, based on re-using the well-disentangled latent space of a\npre-trained NeRF-GAN in a pose-conditioned convolutional network to directly\ngenerate 3D-consistent images corresponding to the underlying 3D\nrepresentations. Experiments on several datasets demonstrate that the proposed\nmethod obtains results comparable with volumetric rendering in terms of quality\nand 3D consistency while benefiting from the computational advantage of\nconvolutional networks. The code will be available at:\nhttps://github.com/mshahbazi72/NeRF-GAN-Distillation\n","authors":["Mohamad Shahbazi","Evangelos Ntavelis","Alessio Tonioni","Edo Collins","Danda Pani Paudel","Martin Danelljan","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2303.12865v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12721v1","updated":"2023-07-24T12:03:50Z","published":"2023-07-24T12:03:50Z","title":"AMAE: Adaptation of Pre-Trained Masked Autoencoder for Dual-Distribution\n Anomaly Detection in Chest X-Rays","summary":" Unsupervised anomaly detection in medical images such as chest radiographs is\nstepping into the spotlight as it mitigates the scarcity of the labor-intensive\nand costly expert annotation of anomaly data. However, nearly all existing\nmethods are formulated as a one-class classification trained only on\nrepresentations from the normal class and discard a potentially significant\nportion of the unlabeled data. This paper focuses on a more practical setting,\ndual distribution anomaly detection for chest X-rays, using the entire training\ndata, including both normal and unlabeled images. Inspired by a modern\nself-supervised vision transformer model trained using partial image inputs to\nreconstruct missing image regions -- we propose AMAE, a two-stage algorithm for\nadaptation of the pre-trained masked autoencoder (MAE). Starting from MAE\ninitialization, AMAE first creates synthetic anomalies from only normal\ntraining images and trains a lightweight classifier on frozen transformer\nfeatures. Subsequently, we propose an adaptation strategy to leverage unlabeled\nimages containing anomalies. The adaptation scheme is accomplished by assigning\npseudo-labels to unlabeled images and using two separate MAE based modules to\nmodel the normative and anomalous distributions of pseudo-labeled images. The\neffectiveness of the proposed adaptation strategy is evaluated with different\nanomaly ratios in an unlabeled training set. 
AMAE leads to consistent\nperformance gains over competing self-supervised and dual distribution anomaly\ndetection methods, setting the new state-of-the-art on three public chest X-ray\nbenchmarks: RSNA, NIH-CXR, and VinDr-CXR.\n","authors":["Behzad Bozorgtabar","Dwarikanath Mahapatra","Jean-Philippe Thiran"],"pdf_url":"https://arxiv.org/pdf/2307.12721v1.pdf","comment":"To be presented at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.12718v1","updated":"2023-07-24T11:59:07Z","published":"2023-07-24T11:59:07Z","title":"CarPatch: A Synthetic Benchmark for Radiance Field Evaluation on Vehicle\n Components","summary":" Neural Radiance Fields (NeRFs) have gained widespread recognition as a highly\neffective technique for representing 3D reconstructions of objects and scenes\nderived from sets of images. Despite their efficiency, NeRF models can pose\nchallenges in certain scenarios such as vehicle inspection, where the lack of\nsufficient data or the presence of challenging elements (e.g. reflections)\nstrongly impact the accuracy of the reconstruction. To this aim, we introduce\nCarPatch, a novel synthetic benchmark of vehicles. In addition to a set of\nimages annotated with their intrinsic and extrinsic camera parameters, the\ncorresponding depth maps and semantic segmentation masks have been generated\nfor each view. Global and part-based metrics have been defined and used to\nevaluate, compare, and better characterize some state-of-the-art techniques.\nThe dataset is publicly released at\nhttps://aimagelab.ing.unimore.it/go/carpatch and can be used as an evaluation\nguide and as a baseline for future work on this challenging topic.\n","authors":["Davide Di Nucci","Alessandro Simoni","Matteo Tomei","Luca Ciuffreda","Roberto Vezzani","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2307.12718v1.pdf","comment":"Accepted at ICIAP2023"},{"id":"http://arxiv.org/abs/2307.12717v1","updated":"2023-07-24T11:58:58Z","published":"2023-07-24T11:58:58Z","title":"Dense Transformer based Enhanced Coding Network for Unsupervised Metal\n Artifact Reduction","summary":" CT images corrupted by metal artifacts have serious negative effects on\nclinical diagnosis. Considering the difficulty of collecting paired data with\nground truth in clinical settings, unsupervised methods for metal artifact\nreduction are of high interest. However, it is difficult for previous\nunsupervised methods to retain structural information from CT images while\nhandling the non-local characteristics of metal artifacts. To address these\nchallenges, we proposed a novel Dense Transformer based Enhanced Coding Network\n(DTEC-Net) for unsupervised metal artifact reduction. Specifically, we\nintroduce a Hierarchical Disentangling Encoder, supported by the high-order\ndense process, and transformer to obtain densely encoded sequences with\nlong-range correspondence. Then, we present a second-order disentanglement\nmethod to improve the dense sequence's decoding process. Extensive experiments\nand model discussions illustrate DTEC-Net's effectiveness, which outperforms\nthe previous state-of-the-art methods on a benchmark dataset, and greatly\nreduces metal artifacts while restoring richer texture details.\n","authors":["Wangduo Xie","Matthew B. 
Blaschko"],"pdf_url":"https://arxiv.org/pdf/2307.12717v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09340v3","updated":"2023-07-24T11:34:21Z","published":"2023-03-16T14:21:45Z","title":"Improving Automated Hemorrhage Detection in Sparse-view Computed\n Tomography via Deep Convolutional Neural Network based Artifact Reduction","summary":" Purpose: Sparse-view computed tomography (CT) is an effective way to reduce\ndose by lowering the total number of views acquired, albeit at the expense of\nimage quality, which, in turn, can impact the ability to detect diseases. We\nexplore deep learning-based artifact reduction in sparse-view cranial CT scans\nand its impact on automated hemorrhage detection. Methods: We trained a U-Net\nfor artefact reduction on simulated sparse-view cranial CT scans from 3000\npatients obtained from a public dataset and reconstructed with varying levels\nof sub-sampling. Additionally, we trained a convolutional neural network on\nfully sampled CT data from 17,545 patients for automated hemorrhage detection.\nWe evaluated the classification performance using the area under the receiver\noperator characteristic curves (AUC-ROCs) with corresponding 95% confidence\nintervals (CIs) and the DeLong test, along with confusion matrices. The\nperformance of the U-Net was compared to an analytical approach based on total\nvariation (TV). Results: The U-Net performed superior compared to unprocessed\nand TV-processed images with respect to image quality and automated hemorrhage\ndiagnosis. With U-Net post-processing, the number of views can be reduced from\n4096 (AUC-ROC: 0.974; 95% CI: 0.972-0.976) views to 512 views (0.973;\n0.971-0.975) with minimal decrease in hemorrhage detection (P<.001) and to 256\nviews (0.967; 0.964-0.969) with a slight performance decrease (P<.001).\nConclusion: The results suggest that U-Net based artifact reduction\nsubstantially enhances automated hemorrhage detection in sparse-view cranial\nCTs. Our findings highlight that appropriate post-processing is crucial for\noptimal image quality and diagnostic accuracy while minimizing radiation dose.\n","authors":["Johannes Thalhammer","Manuel Schultheiss","Tina Dorosti","Tobias Lasser","Franz Pfeiffer","Daniela Pfeiffer","Florian Schaff"],"pdf_url":"https://arxiv.org/pdf/2303.09340v3.pdf","comment":"11 pages, 6 figures, 1 table"},{"id":"http://arxiv.org/abs/2011.09094v3","updated":"2023-07-24T11:28:46Z","published":"2020-11-18T05:16:11Z","title":"UP-DETR: Unsupervised Pre-training for Object Detection with\n Transformers","summary":" DEtection TRansformer (DETR) for object detection reaches competitive\nperformance compared with Faster R-CNN via a transformer encoder-decoder\narchitecture. However, trained with scratch transformers, DETR needs\nlarge-scale training data and an extreme long training schedule even on COCO\ndataset. Inspired by the great success of pre-training transformers in natural\nlanguage processing, we propose a novel pretext task named random query patch\ndetection in Unsupervised Pre-training DETR (UP-DETR). Specifically, we\nrandomly crop patches from the given image and then feed them as queries to the\ndecoder. The model is pre-trained to detect these query patches from the input\nimage. During the pre-training, we address two critical issues: multi-task\nlearning and multi-query localization. 
(1) To trade off classification and\nlocalization preferences in the pretext task, we find that freezing the CNN\nbackbone is the prerequisite for the success of pre-training transformers. (2)\nTo perform multi-query localization, we develop UP-DETR with multi-query patch\ndetection with attention mask. Besides, UP-DETR also provides a unified\nperspective for fine-tuning object detection and one-shot detection tasks. In\nour experiments, UP-DETR significantly boosts the performance of DETR with\nfaster convergence and higher average precision on object detection, one-shot\ndetection and panoptic segmentation. Code and pre-training models:\nhttps://github.com/dddzg/up-detr.\n","authors":["Zhigang Dai","Bolun Cai","Yugeng Lin","Junying Chen"],"pdf_url":"https://arxiv.org/pdf/2011.09094v3.pdf","comment":"Accepted by TPAMI 2022 and CVPR 2021"},{"id":"http://arxiv.org/abs/2307.12698v1","updated":"2023-07-24T11:27:14Z","published":"2023-07-24T11:27:14Z","title":"MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised\n Learning of Motion and Content Features","summary":" Self-supervised learning of visual representations has been focusing on\nlearning content features, which do not capture object motion or location, and\nfocus on identifying and differentiating objects in images and videos. On the\nother hand, optical flow estimation is a task that does not involve\nunderstanding the content of the images on which it is estimated. We unify the\ntwo approaches and introduce MC-JEPA, a joint-embedding predictive architecture\nand self-supervised learning approach to jointly learn optical flow and content\nfeatures within a shared encoder, demonstrating that the two associated\nobjectives; the optical flow estimation objective and the self-supervised\nlearning objective; benefit from each other and thus learn content features\nthat incorporate motion information. The proposed approach achieves performance\non-par with existing unsupervised optical flow benchmarks, as well as with\ncommon self-supervised learning approaches on downstream tasks such as semantic\nsegmentation of images and videos.\n","authors":["Adrien Bardes","Jean Ponce","Yann LeCun"],"pdf_url":"https://arxiv.org/pdf/2307.12698v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.10763v3","updated":"2023-07-24T11:15:47Z","published":"2023-02-12T12:19:57Z","title":"Contrastive Learning and the Emergence of Attributes Associations","summary":" In response to an object presentation, supervised learning schemes generally\nrespond with a parsimonious label. Upon a similar presentation we humans\nrespond again with a label, but are flooded, in addition, by a myriad of\nassociations. A significant portion of these consist of the presented object\nattributes. Contrastive learning is a semi-supervised learning scheme based on\nthe application of identity preserving transformations on the object input\nrepresentations. It is conjectured in this work that these same applied\ntransformations preserve, in addition to the identity of the presented object,\nalso the identity of its semantically meaningful attributes. The corollary of\nthis is that the output representations of such a contrastive learning scheme\ncontain valuable information not only for the classification of the presented\nobject, but also for the presence or absence decision of any attribute of\ninterest. Simulation results which demonstrate this idea and the feasibility of\nthis conjecture are presented.\n","authors":["Daniel N. 
Nissani"],"pdf_url":"https://arxiv.org/pdf/2302.10763v3.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2304.02941v2","updated":"2023-07-24T10:57:15Z","published":"2023-04-06T08:56:18Z","title":"Dr. KID: Direct Remeshing and K-set Isometric Decomposition for Scalable\n Physicalization of Organic Shapes","summary":" Dr. KID is an algorithm that uses isometric decomposition for the\nphysicalization of potato-shaped organic models in a puzzle fashion. The\nalgorithm begins with creating a simple, regular triangular surface mesh of\norganic shapes, followed by iterative k-means clustering and remeshing. For\nclustering, we need similarity between triangles (segments) which is defined as\na distance function. The distance function maps each triangle's shape to a\nsingle point in the virtual 3D space. Thus, the distance between the triangles\nindicates their degree of dissimilarity. K-means clustering uses this distance\nand sorts of segments into k classes. After this, remeshing is applied to\nminimize the distance between triangles within the same cluster by making their\nshapes identical. Clustering and remeshing are repeated until the distance\nbetween triangles in the same cluster reaches an acceptable threshold. We adopt\na curvature-aware strategy to determine the surface thickness and finalize\npuzzle pieces for 3D printing. Identical hinges and holes are created for\nassembling the puzzle components. For smoother outcomes, we use triangle\nsubdivision along with curvature-aware clustering, generating curved triangular\npatches for 3D printing. Our algorithm was evaluated using various models, and\nthe 3D-printed results were analyzed. Findings indicate that our algorithm\nperforms reliably on target organic shapes with minimal loss of input geometry.\n","authors":["Dawar Khan","Ciril Bohak","Ivan Viola"],"pdf_url":"https://arxiv.org/pdf/2304.02941v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12676v1","updated":"2023-07-24T10:30:54Z","published":"2023-07-24T10:30:54Z","title":"Damage Vision Mining Opportunity for Imbalanced Anomaly Detection","summary":" In past decade, previous balanced datasets have been used to advance\nalgorithms for classification, object detection, semantic segmentation, and\nanomaly detection in industrial applications. Specifically, for condition-based\nmaintenance, automating visual inspection is crucial to ensure high quality.\nDeterioration prognostic attempts to optimize the fine decision process for\npredictive maintenance and proactive repair. In civil infrastructure and living\nenvironment, damage data mining cannot avoid the imbalanced data issue because\nof rare unseen events and high quality status by improved operations. For\nvisual inspection, deteriorated class acquired from the surface of concrete and\nsteel components are occasionally imbalanced. From numerous related surveys, we\nsummarize that imbalanced data problems can be categorized into four types; 1)\nmissing range of target and label valuables, 2) majority-minority class\nimbalance, 3) foreground-background of spatial imbalance, 4) long-tailed class\nof pixel-wise imbalance. Since 2015, there has been many imbalanced studies\nusing deep learning approaches that includes regression, image classification,\nobject detection, semantic segmentation. However, anomaly detection for\nimbalanced data is not yet well known. 
In this study, we highlight a one-class\nanomaly detection application that decides whether a class is anomalous or not, and demonstrate\nclear examples on imbalanced vision datasets: wooden and concrete deterioration,\nand disaster damage. We provide key results on the advantage of damage vision mining,\nhypothesizing that the more effective the range of the positive ratio, the higher\nthe accuracy gain of the anomaly detection application. Finally, the applicability of\nthe damage learning methods, their limitations, and future work are discussed.\n","authors":["Takato Yasuno"],"pdf_url":"https://arxiv.org/pdf/2307.12676v1.pdf","comment":"12 pages, 14 figures, 8 tables"},{"id":"http://arxiv.org/abs/2307.12674v1","updated":"2023-07-24T10:24:13Z","published":"2023-07-24T10:24:13Z","title":"Industrial Segment Anything -- a Case Study in Aircraft Manufacturing,\n Intralogistics, Maintenance, Repair, and Overhaul","summary":" Deploying deep learning-based applications in specialized domains like the\naircraft production industry typically suffers from the training data\navailability problem. Only a few datasets represent non-everyday objects,\nsituations, and tasks. Recent advances in research around Vision Foundation\nModels (VFM) have opened a new area of tasks and models with high generalization\ncapabilities in non-semantic and semantic predictions. As recently demonstrated\nby the Segment Anything Project, exploiting VFM's zero-shot capabilities is a\npromising direction in tackling the boundaries spanned by data, context, and\nsensor variety. However, investigating its application within specific domains\nremains subject to ongoing research. This paper contributes here by surveying\napplications of the SAM in aircraft production-specific use cases. We include\nmanufacturing, intralogistics, as well as maintenance, repair, and overhaul\nprocesses, also representing a variety of other neighboring industrial domains.\nBesides presenting the various use cases, we further discuss the injection of\ndomain knowledge.\n","authors":["Keno Moenck","Arne Wendt","Philipp Prünte","Julian Koch","Arne Sahrhage","Johann Gierecker","Ole Schmedemann","Falko Kähler","Dirk Holst","Martin Gomse","Thorsten Schüppstuhl","Daniel Schoepflin"],"pdf_url":"https://arxiv.org/pdf/2307.12674v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12672v1","updated":"2023-07-24T10:20:14Z","published":"2023-07-24T10:20:14Z","title":"Global k-Space Interpolation for Dynamic MRI Reconstruction using Masked\n Image Modeling","summary":" In dynamic Magnetic Resonance Imaging (MRI), k-space is typically\nundersampled due to limited scan time, resulting in aliasing artifacts in the\nimage domain. Hence, dynamic MR reconstruction requires not only modeling\nspatial frequency components in the x and y directions of k-space but also\nconsidering temporal redundancy. Most previous works rely on image-domain\nregularizers (priors) to conduct MR reconstruction. In contrast, we focus on\ninterpolating the undersampled k-space before obtaining images with Fourier\ntransform. In this work, we connect masked image modeling with k-space\ninterpolation and propose a novel Transformer-based k-space Global\nInterpolation Network, termed k-GIN. Our k-GIN learns global dependencies among\nlow- and high-frequency components of 2D+t k-space and uses them to interpolate\nunsampled data. Further, we propose a novel k-space Iterative Refinement Module\n(k-IRM) to enhance the learning of high-frequency components. 
We evaluate our\napproach on 92 in-house 2D+t cardiac MR subjects and compare it to MR\nreconstruction methods with image-domain regularizers. Experiments show that\nour proposed k-space interpolation method quantitatively and qualitatively\noutperforms baseline methods. Importantly, the proposed approach achieves\nsubstantially higher robustness and generalizability in cases of\nhighly-undersampled MR data.\n","authors":["Jiazhen Pan","Suprosanna Shit","Özgün Turgut","Wenqi Huang","Hongwei Bran Li","Nil Stolt-Ansó","Thomas Küstner","Kerstin Hammernik","Daniel Rueckert"],"pdf_url":"https://arxiv.org/pdf/2307.12672v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.07250v2","updated":"2023-07-24T10:10:25Z","published":"2023-04-14T16:58:23Z","title":"Fusing Structure from Motion and Simulation-Augmented Pose Regression\n from Optical Flow for Challenging Indoor Environments","summary":" The localization of objects is a crucial task in various applications such as\nrobotics, virtual and augmented reality, and the transportation of goods in\nwarehouses. Recent advances in deep learning have enabled the localization\nusing monocular visual cameras. While structure from motion (SfM) predicts the\nabsolute pose from a point cloud, absolute pose regression (APR) methods learn\na semantic understanding of the environment through neural networks. However,\nboth fields face challenges caused by the environment such as motion blur,\nlighting changes, repetitive patterns, and feature-less structures. This study\naims to address these challenges by incorporating additional information and\nregularizing the absolute pose using relative pose regression (RPR) methods.\nRPR methods suffer under different challenges, i.e., motion blur. The optical\nflow between consecutive images is computed using the Lucas-Kanade algorithm,\nand the relative pose is predicted using an auxiliary small recurrent\nconvolutional network. The fusion of absolute and relative poses is a complex\ntask due to the mismatch between the global and local coordinate systems.\nState-of-the-art methods fusing absolute and relative poses use pose graph\noptimization (PGO) to regularize the absolute pose predictions using relative\nposes. In this work, we propose recurrent fusion networks to optimally align\nabsolute and relative pose predictions to improve the absolute pose prediction.\nWe evaluate eight different recurrent units and construct a simulation\nenvironment to pre-train the APR and RPR networks for better generalized\ntraining. Additionally, we record a large database of different scenarios in a\nchallenging large-scale indoor environment that mimics a warehouse with\ntransportation robots. 
We conduct hyperparameter searches and experiments to\nshow the effectiveness of our recurrent fusion method compared to PGO.\n","authors":["Felix Ott","Lucas Heublein","David Rügamer","Bernd Bischl","Christopher Mutschler"],"pdf_url":"https://arxiv.org/pdf/2304.07250v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12656v1","updated":"2023-07-24T09:54:49Z","published":"2023-07-24T09:54:49Z","title":"A Theoretically Guaranteed Quaternion Weighted Schatten p-norm\n Minimization Method for Color Image Restoration","summary":" Inspired by the fact that the matrix formulated by nonlocal similar patches\nin a natural image is of low rank, the rank approximation issue has been\nextensively investigated over the past decades, among which weighted nuclear\nnorm minimization (WNNM) and weighted Schatten $p$-norm minimization (WSNM) are\ntwo prevailing methods that have shown great superiority in various image\nrestoration (IR) problems. Due to the physical characteristics of color images,\ncolor image restoration (CIR) is often a much more difficult task than its\ngrayscale image counterpart. However, when applied to CIR, the traditional\nWNNM/WSNM method only processes the three color channels individually and fails to\nconsider their cross-channel correlations. Very recently, a quaternion-based\nWNNM approach (QWNNM) has been developed to mitigate this issue, which is\ncapable of representing the color image as a whole in the quaternion domain and\npreserving the inherent correlation among the three color channels. Despite its\nempirical success, unfortunately, the convergence behavior of QWNNM has not\nbeen strictly studied yet. In this paper, on the one hand, we extend WSNM\ninto the quaternion domain and correspondingly propose a novel quaternion-based\nWSNM model (QWSNM) for tackling CIR problems. Extensive experiments on two\nrepresentative CIR tasks, including color image denoising and deblurring,\ndemonstrate that the proposed QWSNM method performs favorably against many\nstate-of-the-art alternatives, in both quantitative and qualitative\nevaluations. On the other hand, more importantly, we preliminarily provide a\ntheoretical convergence analysis; that is, by modifying the quaternion\nalternating direction method of multipliers (QADMM) through a simple\ncontinuation strategy, we theoretically prove that both the solution sequences\ngenerated by QWNNM and QWSNM have fixed-point convergence guarantees.\n","authors":["Qing-Hua Zhang","Liang-Tian He","Yi-Lun Wang","Liang-Jian Deng","Jun Liu"],"pdf_url":"https://arxiv.org/pdf/2307.12656v1.pdf","comment":"46 pages, 10 figures; references added"},{"id":"http://arxiv.org/abs/2302.01162v5","updated":"2023-07-24T09:41:07Z","published":"2023-02-02T15:37:46Z","title":"Get3DHuman: Lifting StyleGAN-Human into a 3D Generative Model using\n Pixel-aligned Reconstruction Priors","summary":" Fast generation of high-quality 3D digital humans is important to a vast\nnumber of applications ranging from entertainment to professional concerns.\nRecent advances in differentiable rendering have enabled the training of 3D\ngenerative models without requiring 3D ground truths. However, the quality of\nthe generated 3D humans still has much room to improve in terms of both\nfidelity and diversity. In this paper, we present Get3DHuman, a novel 3D human\nframework that can significantly boost the realism and diversity of the\ngenerated outcomes by only using a limited budget of 3D ground-truth data. 
Our\nkey observation is that the 3D generator can profit from human-related priors\nlearned through 2D human generators and 3D reconstructors. Specifically, we\nbridge the latent space of Get3DHuman with that of StyleGAN-Human via a\nspecially-designed prior network, where the input latent code is mapped to the\nshape and texture feature volumes spanned by the pixel-aligned 3D\nreconstructor. The outcomes of the prior network are then leveraged as the\nsupervisory signals for the main generator network. To ensure effective\ntraining, we further propose three tailored losses applied to the generated\nfeature volumes and the intermediate feature maps. Extensive experiments\ndemonstrate that Get3DHuman greatly outperforms the other state-of-the-art\napproaches and can support a wide range of applications including shape\ninterpolation, shape re-texturing, and single-view reconstruction through\nlatent inversion.\n","authors":["Zhangyang Xiong","Di Kang","Derong Jin","Weikai Chen","Linchao Bao","Shuguang Cui","Xiaoguang Han"],"pdf_url":"https://arxiv.org/pdf/2302.01162v5.pdf","comment":"ICCV 2023, project page:\n https://x-zhangyang.github.io/2023_Get3DHuman/"},{"id":"http://arxiv.org/abs/2307.12644v1","updated":"2023-07-24T09:35:47Z","published":"2023-07-24T09:35:47Z","title":"Remote Bio-Sensing: Open Source Benchmark Framework for Fair Evaluation\n of rPPG","summary":" Remote Photoplethysmography (rPPG) is a technology that utilizes the light\nabsorption properties of hemoglobin, captured via camera, to analyze and\nmeasure blood volume pulse (BVP). By analyzing the measured BVP, various\nphysiological signals such as heart rate, stress levels, and blood pressure can\nbe derived, enabling applications such as the early prediction of\ncardiovascular diseases. rPPG is a rapidly evolving field as it allows the\nmeasurement of vital signals using camera-equipped devices without the need for\nadditional devices such as blood pressure monitors or pulse oximeters, and\nwithout the assistance of medical experts. Despite extensive efforts and\nadvances in this field, serious challenges remain, including issues related to\nskin color, camera characteristics, ambient lighting, and other sources of\nnoise, which degrade performance accuracy. We argue that fair and evaluable\nbenchmarking is urgently required to overcome these challenges and make any\nmeaningful progress from both academic and commercial perspectives. In most\nexisting work, models are trained, tested, and validated only on limited\ndatasets. Worse still, some studies lack available code or reproducibility,\nmaking it difficult to fairly evaluate and compare performance. Therefore, the\npurpose of this study is to provide a benchmarking framework to evaluate\nvarious rPPG techniques across a wide range of datasets for fair evaluation and\ncomparison, including both conventional non-deep neural network (non-DNN) and\ndeep neural network (DNN) methods. 
GitHub URL:\nhttps://github.com/remotebiosensing/rppg.\n","authors":["Dae Yeol Kim","Eunsu Goh","KwangKee Lee","JongEui Chae","JongHyeon Mun","Junyeong Na","Chae-bong Sohn","Do-Yup Kim"],"pdf_url":"https://arxiv.org/pdf/2307.12644v1.pdf","comment":"19 pages, 10 figures"},{"id":"http://arxiv.org/abs/2304.03981v2","updated":"2023-07-24T09:24:04Z","published":"2023-04-08T10:47:41Z","title":"Uncertainty-inspired Open Set Learning for Retinal Anomaly\n Identification","summary":" Failure to recognize samples from the classes unseen during training is a\nmajor limitation of artificial intelligence in the real-world implementation\nfor recognition and classification of retinal anomalies. We established an\nuncertainty-inspired open-set (UIOS) model, which was trained with fundus\nimages of 9 retinal conditions. Besides assessing the probability of each\ncategory, UIOS also calculated an uncertainty score to express its confidence.\nOur UIOS model with thresholding strategy achieved an F1 score of 99.55%,\n97.01% and 91.91% for the internal testing set, external target categories\n(TC)-JSIEC dataset and TC-unseen testing set, respectively, compared to the F1\nscore of 92.20%, 80.69% and 64.74% by the standard AI model. Furthermore, UIOS\ncorrectly predicted high uncertainty scores, which would prompt the need for a\nmanual check in the datasets of non-target categories retinal diseases,\nlow-quality fundus images, and non-fundus images. UIOS provides a robust method\nfor real-world screening of retinal anomalies.\n","authors":["Meng Wang","Tian Lin","Lianyu Wang","Aidi Lin","Ke Zou","Xinxing Xu","Yi Zhou","Yuanyuan Peng","Qingquan Meng","Yiming Qian","Guoyao Deng","Zhiqun Wu","Junhong Chen","Jianhong Lin","Mingzhi Zhang","Weifang Zhu","Changqing Zhang","Daoqiang Zhang","Rick Siow Mong Goh","Yong Liu","Chi Pui Pang","Xinjian Chen","Haoyu Chen","Huazhu Fu"],"pdf_url":"https://arxiv.org/pdf/2304.03981v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12637v1","updated":"2023-07-24T09:22:09Z","published":"2023-07-24T09:22:09Z","title":"PG-RCNN: Semantic Surface Point Generation for 3D Object Detection","summary":" One of the main challenges in LiDAR-based 3D object detection is that the\nsensors often fail to capture the complete spatial information about the\nobjects due to long distance and occlusion. Two-stage detectors with point\ncloud completion approaches tackle this problem by adding more points to the\nregions of interest (RoIs) with a pre-trained network. However, these methods\ngenerate dense point clouds of objects for all region proposals, assuming that\nobjects always exist in the RoIs. This leads to the indiscriminate point\ngeneration for incorrect proposals as well. Motivated by this, we propose Point\nGeneration R-CNN (PG-RCNN), a novel end-to-end detector that generates semantic\nsurface points of foreground objects for accurate detection. Our method uses a\njointly trained RoI point generation module to process the contextual\ninformation of RoIs and estimate the complete shape and displacement of\nforeground objects. For every generated point, PG-RCNN assigns a semantic\nfeature that indicates the estimated foreground probability. Extensive\nexperiments show that the point clouds generated by our method provide\ngeometrically and semantically rich information for refining false positive and\nmisaligned proposals. 
PG-RCNN achieves competitive performance on the KITTI\nbenchmark, with significantly fewer parameters than state-of-the-art models.\nThe code is available at https://github.com/quotation2520/PG-RCNN.\n","authors":["Inyong Koo","Inyoung Lee","Se-Ho Kim","Hee-Seon Kim","Woo-jin Jeon","Changick Kim"],"pdf_url":"https://arxiv.org/pdf/2307.12637v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.11643v2","updated":"2023-07-24T09:18:52Z","published":"2023-07-21T15:22:32Z","title":"Morphological Image Analysis and Feature Extraction for Reasoning with\n AI-based Defect Detection and Classification Models","summary":" As the use of artificial intelligent (AI) models becomes more prevalent in\nindustries such as engineering and manufacturing, it is essential that these\nmodels provide transparent reasoning behind their predictions. This paper\nproposes the AI-Reasoner, which extracts the morphological characteristics of\ndefects (DefChars) from images and utilises decision trees to reason with the\nDefChar values. Thereafter, the AI-Reasoner exports visualisations (i.e.\ncharts) and textual explanations to provide insights into outputs made by\nmasked-based defect detection and classification models. It also provides\neffective mitigation strategies to enhance data pre-processing and overall\nmodel performance. The AI-Reasoner was tested on explaining the outputs of an\nIE Mask R-CNN model using a set of 366 images containing defects. The results\ndemonstrated its effectiveness in explaining the IE Mask R-CNN model's\npredictions. Overall, the proposed AI-Reasoner provides a solution for\nimproving the performance of AI models in industrial applications that require\ndefect analysis.\n","authors":["Jiajun Zhang","Georgina Cosma","Sarah Bugby","Axel Finke","Jason Watkins"],"pdf_url":"https://arxiv.org/pdf/2307.11643v2.pdf","comment":"8 pages, 3 figures, 5 tables; submitted to 2023 IEEE symposium series\n on computational intelligence (SSCI)"},{"id":"http://arxiv.org/abs/2307.12634v1","updated":"2023-07-24T09:16:05Z","published":"2023-07-24T09:16:05Z","title":"Automatic lobe segmentation using attentive cross entropy and end-to-end\n fissure generation","summary":" The automatic lung lobe segmentation algorithm is of great significance for\nthe diagnosis and treatment of lung diseases, however, which has great\nchallenges due to the incompleteness of pulmonary fissures in lung CT images\nand the large variability of pathological features. Therefore, we propose a new\nautomatic lung lobe segmentation framework, in which we urge the model to pay\nattention to the area around the pulmonary fissure during the training process,\nwhich is realized by a task-specific loss function. In addition, we introduce\nan end-to-end pulmonary fissure generation method in the auxiliary pulmonary\nfissure segmentation task, without any additional network branch. Finally, we\npropose a registration-based loss function to alleviate the convergence\ndifficulty of the Dice loss supervised pulmonary fissure segmentation task. 
We\nachieve 97.83% and 94.75% dice scores on our private dataset STLB and public\nLUNA16 dataset respectively.\n","authors":["Qi Su","Na Wang","Jiawen Xie","Yinan Chen","Xiaofan Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12634v1.pdf","comment":"5 pages, 3 figures, published to 'IEEE International Symposium on\n Biomedical Imaging (ISBI) 2023'"},{"id":"http://arxiv.org/abs/2307.12630v1","updated":"2023-07-24T09:08:30Z","published":"2023-07-24T09:08:30Z","title":"Semi-Supervised Medical Image Segmentation with Co-Distribution\n Alignment","summary":" Medical image segmentation has made significant progress when a large amount\nof labeled data are available. However, annotating medical image segmentation\ndatasets is expensive due to the requirement of professional skills.\nAdditionally, classes are often unevenly distributed in medical images, which\nseverely affects the classification performance on minority classes. To address\nthese problems, this paper proposes Co-Distribution Alignment (Co-DA) for\nsemi-supervised medical image segmentation. Specifically, Co-DA aligns marginal\npredictions on unlabeled data to marginal predictions on labeled data in a\nclass-wise manner with two differently initialized models before using the\npseudo-labels generated by one model to supervise the other. Besides, we design\nan over-expectation cross-entropy loss for filtering the unlabeled pixels to\nreduce noise in their pseudo-labels. Quantitative and qualitative experiments\non three public datasets demonstrate that the proposed approach outperforms\nexisting state-of-the-art semi-supervised medical image segmentation methods on\nboth the 2D CaDIS dataset and the 3D LGE-MRI and ACDC datasets, achieving an\nmIoU of 0.8515 with only 24% labeled data on CaDIS, and a Dice score of 0.8824\nand 0.8773 with only 20% data on LGE-MRI and ACDC, respectively.\n","authors":["Tao Wang","Zhongzheng Huang","Jiawei Wu","Yuanzheng Cai","Zuoyong Li"],"pdf_url":"https://arxiv.org/pdf/2307.12630v1.pdf","comment":"Paper appears in Bioengineering 2023, 10(7), 869"},{"id":"http://arxiv.org/abs/2307.12622v1","updated":"2023-07-24T08:51:49Z","published":"2023-07-24T08:51:49Z","title":"Phase Match for Out-of-Distribution Generalization","summary":" The Fourier transform, serving as an explicit decomposition method for visual\nsignals, has been employed to explain the out-of-distribution generalization\nbehaviors of Convolutional Neural Networks (CNNs). Previous research and\nempirical studies have indicated that the amplitude spectrum plays a decisive\nrole in CNN recognition, but it is susceptible to disturbance caused by\ndistribution shifts. On the other hand, the phase spectrum preserves\nhighly-structured spatial information, which is crucial for visual\nrepresentation learning. In this paper, we aim to clarify the relationships\nbetween Domain Generalization (DG) and the frequency components by introducing\na Fourier-based structural causal model. Specifically, we interpret the phase\nspectrum as semi-causal factors and the amplitude spectrum as non-causal\nfactors. Building upon these observations, we propose Phase Match (PhaMa) to\naddress DG problems. 
Our method introduces perturbations on the amplitude\nspectrum and establishes spatial relationships to match the phase components.\nThrough experiments on multiple benchmarks, we demonstrate that our proposed\nmethod achieves state-of-the-art performance in domain generalization and\nout-of-distribution robustness tasks.\n","authors":["Chengming Hu","Rui Wang","Hao Chen","Zhouwang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12622v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12619v1","updated":"2023-07-24T08:49:20Z","published":"2023-07-24T08:49:20Z","title":"Sparse annotation strategies for segmentation of short axis cardiac MRI","summary":" Short axis cardiac MRI segmentation is a well-researched topic, with\nexcellent results achieved by state-of-the-art models in a supervised setting.\nHowever, annotating MRI volumes is time-consuming and expensive. Many different\napproaches (e.g. transfer learning, data augmentation, few-shot learning, etc.)\nhave emerged in an effort to use fewer annotated data and still achieve similar\nperformance as a fully supervised model. Nevertheless, to the best of our\nknowledge, none of these works focus on which slices of MRI volumes are most\nimportant to annotate for yielding the best segmentation results. In this\npaper, we investigate the effects of training with sparse volumes, i.e.\nreducing the number of cases annotated, and sparse annotations, i.e. reducing\nthe number of slices annotated per case. We evaluate the segmentation\nperformance using the state-of-the-art nnU-Net model on two public datasets to\nidentify which slices are the most important to annotate. We have shown that\ntraining on a significantly reduced dataset (48 annotated volumes) can give a\nDice score greater than 0.85 and results comparable to using the full dataset\n(160 and 240 volumes for each dataset respectively). In general, training on\nmore slice annotations provides more valuable information compared to training\non more volumes. Further, annotating slices from the middle of volumes yields\nthe most beneficial results in terms of segmentation performance, and the\napical region the worst. When evaluating the trade-off between annotating\nvolumes against slices, annotating as many slices as possible instead of\nannotating more volumes is a better strategy.\n","authors":["Josh Stein","Maxime Di Folco","Julia Schnabel"],"pdf_url":"https://arxiv.org/pdf/2307.12619v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12618v1","updated":"2023-07-24T08:47:45Z","published":"2023-07-24T08:47:45Z","title":"Attribute Regularized Soft Introspective VAE: Towards Cardiac Attribute\n Regularization Through MRI Domains","summary":" Deep generative models have emerged as influential instruments for data\ngeneration and manipulation. Enhancing the controllability of these models by\nselectively modifying data attributes has been a recent focus. Variational\nAutoencoders (VAEs) have shown promise in capturing hidden attributes but often\nproduce blurry reconstructions. Controlling these attributes through different\nimaging domains is difficult in medical imaging. Recently, Soft Introspective\nVAE leverage the benefits of both VAEs and Generative Adversarial Networks\n(GANs), which have demonstrated impressive image synthesis capabilities, by\nincorporating an adversarial loss into VAE training. In this work, we propose\nthe Attributed Soft Introspective VAE (Attri-SIVAE) by incorporating an\nattribute regularized loss, into the Soft-Intro VAE framework. 
We evaluate\nexperimentally the proposed method on cardiac MRI data from different domains,\nsuch as various scanner vendors and acquisition centers. The proposed method\nachieves similar performance in terms of reconstruction and regularization\ncompared to the state-of-the-art Attributed regularized VAE but additionally\nalso succeeds in keeping the same regularization level when tested on a\ndifferent dataset, unlike the compared method.\n","authors":["Maxime Di Folco","Cosmin Bercea","Julia A. Schnabel"],"pdf_url":"https://arxiv.org/pdf/2307.12618v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12616v1","updated":"2023-07-24T08:44:25Z","published":"2023-07-24T08:44:25Z","title":"CTVIS: Consistent Training for Online Video Instance Segmentation","summary":" The discrimination of instance embeddings plays a vital role in associating\ninstances across time for online video instance segmentation (VIS). Instance\nembedding learning is directly supervised by the contrastive loss computed upon\nthe contrastive items (CIs), which are sets of anchor/positive/negative\nembeddings. Recent online VIS methods leverage CIs sourced from one reference\nframe only, which we argue is insufficient for learning highly discriminative\nembeddings. Intuitively, a possible strategy to enhance CIs is replicating the\ninference phase during training. To this end, we propose a simple yet effective\ntraining strategy, called Consistent Training for Online VIS (CTVIS), which\ndevotes to aligning the training and inference pipelines in terms of building\nCIs. Specifically, CTVIS constructs CIs by referring inference the\nmomentum-averaged embedding and the memory bank storage mechanisms, and adding\nnoise to the relevant embeddings. Such an extension allows a reliable\ncomparison between embeddings of current instances and the stable\nrepresentations of historical instances, thereby conferring an advantage in\nmodeling VIS challenges such as occlusion, re-identification, and deformation.\nEmpirically, CTVIS outstrips the SOTA VIS models by up to +5.0 points on three\nVIS benchmarks, including YTVIS19 (55.1% AP), YTVIS21 (50.1% AP) and OVIS\n(35.5% AP). Furthermore, we find that pseudo-videos transformed from images can\ntrain robust models surpassing fully-supervised ones.\n","authors":["Kaining Ying","Qing Zhong","Weian Mao","Zhenhua Wang","Hao Chen","Lin Yuanbo Wu","Yifan Liu","Chengxiang Fan","Yunzhi Zhuge","Chunhua Shen"],"pdf_url":"https://arxiv.org/pdf/2307.12616v1.pdf","comment":"Accepted by ICCV 2023. The code is available at\n https://github.com/KainingYing/CTVIS"},{"id":"http://arxiv.org/abs/2307.12612v1","updated":"2023-07-24T08:39:11Z","published":"2023-07-24T08:39:11Z","title":"Less is More: Focus Attention for Efficient DETR","summary":" DETR-like models have significantly boosted the performance of detectors and\neven outperformed classical convolutional models. However, all tokens are\ntreated equally without discrimination brings a redundant computational burden\nin the traditional encoder structure. The recent sparsification strategies\nexploit a subset of informative tokens to reduce attention complexity\nmaintaining performance through the sparse encoder. But these methods tend to\nrely on unreliable model statistics. Moreover, simply reducing the token\npopulation hinders the detection performance to a large extent, limiting the\napplication of these sparse models. 
We propose Focus-DETR, which focuses\nattention on more informative tokens for a better trade-off between computation\nefficiency and model accuracy. Specifically, we reconstruct the encoder with\ndual attention, which includes a token scoring mechanism that considers both\nlocalization and category semantic information of the objects from multi-scale\nfeature maps. We efficiently abandon the background queries and enhance the\nsemantic interaction of the fine-grained object queries based on the scores.\nCompared with the state-of-the-art sparse DETR-like detectors under the same\nsetting, our Focus-DETR gets comparable complexity while achieving 50.4AP\n(+2.2) on COCO. The code is available at\nhttps://github.com/huawei-noah/noah-research/tree/master/Focus-DETR and\nhttps://gitee.com/mindspore/models/tree/master/research/cv/Focus-DETR.\n","authors":["Dehua Zheng","Wenhui Dong","Hailin Hu","Xinghao Chen","Yunhe Wang"],"pdf_url":"https://arxiv.org/pdf/2307.12612v1.pdf","comment":"8 pages, 6 figures, accepted to ICCV2023"},{"id":"http://arxiv.org/abs/2307.12607v1","updated":"2023-07-24T08:32:27Z","published":"2023-07-24T08:32:27Z","title":"ExWarp: Extrapolation and Warping-based Temporal Supersampling for\n High-frequency Displays","summary":" High-frequency displays are gaining immense popularity because of their\nincreasing use in video games and virtual reality applications. However, the\nissue is that the underlying GPUs cannot continuously generate frames at this\nhigh rate -- this results in a less smooth and responsive experience.\nFurthermore, if the frame rate is not synchronized with the refresh rate, the\nuser may experience screen tearing and stuttering. Previous works propose\nincreasing the frame rate to provide a smooth experience on modern displays by\npredicting new frames based on past or future frames. Interpolation and\nextrapolation are two widely used algorithms that predict new frames.\nInterpolation requires waiting for the future frame to make a prediction, which\nadds additional latency. On the other hand, extrapolation provides a better\nquality of experience because it relies solely on past frames -- it does not\nincur any additional latency. The simplest method to extrapolate a frame is to\nwarp the previous frame using motion vectors; however, the warped frame may\ncontain improperly rendered visual artifacts due to dynamic objects -- this\nmakes it very challenging to design such a scheme. Past work has used DNNs to\nget good accuracy, however, these approaches are slow. This paper proposes\nExwarp -- an approach based on reinforcement learning (RL) to intelligently\nchoose between the slower DNN-based extrapolation and faster warping-based\nmethods to increase the frame rate by 4x with an almost negligible reduction in\nthe perceived image quality.\n","authors":["Akanksha Dixit","Yashashwee Chakrabarty","Smruti R. Sarangi"],"pdf_url":"https://arxiv.org/pdf/2307.12607v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.07515v2","updated":"2023-07-24T08:10:52Z","published":"2023-04-15T09:39:52Z","title":"S3M: Scalable Statistical Shape Modeling through Unsupervised\n Correspondences","summary":" Statistical shape models (SSMs) are an established way to represent the\nanatomy of a population with various clinically relevant applications. However,\nthey typically require domain expertise, and labor-intensive landmark\nannotations to construct. 
We address these shortcomings by proposing an\nunsupervised method that leverages deep geometric features and functional\ncorrespondences to simultaneously learn local and global shape structures\nacross population anatomies. Our pipeline significantly improves unsupervised\ncorrespondence estimation for SSMs compared to baseline methods, even on highly\nirregular surface topologies. We demonstrate this for two different anatomical\nstructures: the thyroid and a multi-chamber heart dataset. Furthermore, our\nmethod is robust enough to learn from noisy neural network predictions,\npotentially enabling scaling SSMs to larger patient populations without manual\nsegmentation annotation.\n","authors":["Lennart Bastian","Alexander Baumann","Emily Hoppe","Vincent Bürgin","Ha Young Kim","Mahdi Saleh","Benjamin Busam","Nassir Navab"],"pdf_url":"https://arxiv.org/pdf/2304.07515v2.pdf","comment":"Accepted at MICCAI 2023. 13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.12591v1","updated":"2023-07-24T08:06:46Z","published":"2023-07-24T08:06:46Z","title":"SwinMM: Masked Multi-view with Swin Transformers for 3D Medical Image\n Segmentation","summary":" Recent advancements in large-scale Vision Transformers have made significant\nstrides in improving pre-trained models for medical image segmentation.\nHowever, these methods face a notable challenge in acquiring a substantial\namount of pre-training data, particularly within the medical field. To address\nthis limitation, we present Masked Multi-view with Swin Transformers (SwinMM),\na novel multi-view pipeline for enabling accurate and data-efficient\nself-supervised medical image analysis. Our strategy harnesses the potential of\nmulti-view information by incorporating two principal components. In the\npre-training phase, we deploy a masked multi-view encoder devised to\nconcurrently train masked multi-view observations through a range of diverse\nproxy tasks. These tasks span image reconstruction, rotation, contrastive\nlearning, and a novel task that employs a mutual learning paradigm. This new\ntask capitalizes on the consistency between predictions from various\nperspectives, enabling the extraction of hidden multi-view information from 3D\nmedical data. In the fine-tuning stage, a cross-view decoder is developed to\naggregate the multi-view information through a cross-attention block. Compared\nwith the previous state-of-the-art self-supervised learning method Swin UNETR,\nSwinMM demonstrates a notable advantage on several medical image segmentation\ntasks. It allows for a smooth integration of multi-view information,\nsignificantly boosting both the accuracy and data-efficiency of the model. Code\nand models are available at https://github.com/UCSC-VLAA/SwinMM/.\n","authors":["Yiqing Wang","Zihan Li","Jieru Mei","Zihao Wei","Li Liu","Chen Wang","Shengtian Sang","Alan Yuille","Cihang Xie","Yuyin Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.12591v1.pdf","comment":"MICCAI 2023; project page: https://github.com/UCSC-VLAA/SwinMM/"},{"id":"http://arxiv.org/abs/2307.12580v1","updated":"2023-07-24T07:51:40Z","published":"2023-07-24T07:51:40Z","title":"SL: Stable Learning in Source-Free Domain Adaption for Medical Image\n Segmentation","summary":" Deep learning techniques for medical image analysis usually suffer from the\ndomain shift between source and target data. Most existing works focus on\nunsupervised domain adaptation (UDA). However, in practical applications,\nprivacy issues are much more severe. 
For example, the data of different\nhospitals have domain shifts due to equipment problems, and data of the two\ndomains cannot be available simultaneously because of privacy. In this\nchallenge defined as Source-Free UDA, the previous UDA medical methods are\nlimited. Although a variety of medical source-free unsupervised domain adaption\n(MSFUDA) methods have been proposed, we found they fall into an over-fitting\ndilemma called \"longer training, worse performance.\" Therefore, we propose the\nStable Learning (SL) strategy to address the dilemma. SL is a scalable method\nand can be integrated with other research, which consists of Weight\nConsolidation and Entropy Increase. First, we apply Weight Consolidation to\nretain domain-invariant knowledge and then we design Entropy Increase to avoid\nover-learning. Comparative experiments prove the effectiveness of SL. We also\nhave done extensive ablation experiments. Besides, We will release codes\nincluding a variety of MSFUDA methods.\n","authors":["Yixin Chen","Yan Wang"],"pdf_url":"https://arxiv.org/pdf/2307.12580v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12577v1","updated":"2023-07-24T07:49:01Z","published":"2023-07-24T07:49:01Z","title":"PRIOR: Prototype Representation Joint Learning from Medical Images and\n Reports","summary":" Contrastive learning based vision-language joint pre-training has emerged as\na successful representation learning strategy. In this paper, we present a\nprototype representation learning framework incorporating both global and local\nalignment between medical images and reports. In contrast to standard global\nmulti-modality alignment methods, we employ a local alignment module for\nfine-grained representation. Furthermore, a cross-modality conditional\nreconstruction module is designed to interchange information across modalities\nin the training phase by reconstructing masked images and reports. For\nreconstructing long reports, a sentence-wise prototype memory bank is\nconstructed, enabling the network to focus on low-level localized visual and\nhigh-level clinical linguistic features. Additionally, a non-auto-regressive\ngeneration paradigm is proposed for reconstructing non-sequential reports.\nExperimental results on five downstream tasks, including supervised\nclassification, zero-shot classification, image-to-text retrieval, semantic\nsegmentation, and object detection, show the proposed method outperforms other\nstate-of-the-art methods across multiple datasets and under different dataset\nsize settings. 
The code is available at https://github.com/QtacierP/PRIOR.\n","authors":["Pujin Cheng","Li Lin","Junyan Lyu","Yijin Huang","Wenhan Luo","Xiaoying Tang"],"pdf_url":"https://arxiv.org/pdf/2307.12577v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12574v1","updated":"2023-07-24T07:46:06Z","published":"2023-07-24T07:46:06Z","title":"A Good Student is Cooperative and Reliable: CNN-Transformer\n Collaborative Learning for Semantic Segmentation","summary":" In this paper, we strive to answer the question \"how to collaboratively learn\nconvolutional neural network (CNN)-based and vision transformer (ViT)-based\nmodels by selecting and exchanging the reliable knowledge between them for\nsemantic segmentation?\" Accordingly, we propose an online knowledge\ndistillation (KD) framework that can simultaneously learn compact yet effective\nCNN-based and ViT-based models with two key technical breakthroughs to take\nfull advantage of CNNs and ViT while compensating their limitations. Firstly,\nwe propose heterogeneous feature distillation (HFD) to improve students'\nconsistency in low-layer feature space by mimicking heterogeneous features\nbetween CNNs and ViT. Secondly, to facilitate the two students to learn\nreliable knowledge from each other, we propose bidirectional selective\ndistillation (BSD) that can dynamically transfer selective knowledge. This is\nachieved by 1) region-wise BSD determining the directions of knowledge\ntransferred between the corresponding regions in the feature space and 2)\npixel-wise BSD discerning which of the prediction knowledge to be transferred\nin the logit space. Extensive experiments on three benchmark datasets\ndemonstrate that our proposed framework outperforms the state-of-the-art online\ndistillation methods by a large margin, and shows its efficacy in learning\ncollaboratively between ViT-based and CNN-based models.\n","authors":["Jinjing Zhu","Yunhao Luo","Xu Zheng","Hao Wang","Lin Wang"],"pdf_url":"https://arxiv.org/pdf/2307.12574v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2210.10495v3","updated":"2023-07-24T07:43:31Z","published":"2022-10-19T12:04:47Z","title":"ADPS: Asymmetric Distillation Post-Segmentation for Image Anomaly\n Detection","summary":" Knowledge Distillation-based Anomaly Detection (KDAD) methods rely on the\nteacher-student paradigm to detect and segment anomalous regions by contrasting\nthe unique features extracted by both networks. However, existing KDAD methods\nsuffer from two main limitations: 1) the student network can effortlessly\nreplicate the teacher network's representations, and 2) the features of the\nteacher network serve solely as a ``reference standard\" and are not fully\nleveraged. Toward this end, we depart from the established paradigm and instead\npropose an innovative approach called Asymmetric Distillation Post-Segmentation\n(ADPS). Our ADPS employs an asymmetric distillation paradigm that takes\ndistinct forms of the same image as the input of the teacher-student networks,\ndriving the student network to learn discriminating representations for\nanomalous regions.\n Meanwhile, a customized Weight Mask Block (WMB) is proposed to generate a\ncoarse anomaly localization mask that transfers the distilled knowledge\nacquired from the asymmetric paradigm to the teacher network. 
Equipped with\nWMB, the proposed Post-Segmentation Module (PSM) is able to effectively detect\nand segment abnormal regions with fine structures and clear boundaries.\nExperimental results demonstrate that the proposed ADPS outperforms the\nstate-of-the-art methods in detecting and segmenting anomalies. Surprisingly,\nADPS significantly improves Average Precision (AP) metric by 9% and 20% on the\nMVTec AD and KolektorSDD2 datasets, respectively.\n","authors":["Peng Xing","Hao Tang","Jinhui Tang","Zechao Li"],"pdf_url":"https://arxiv.org/pdf/2210.10495v3.pdf","comment":"11pages,9 figures"},{"id":"http://arxiv.org/abs/2307.12571v1","updated":"2023-07-24T07:39:22Z","published":"2023-07-24T07:39:22Z","title":"MataDoc: Margin and Text Aware Document Dewarping for Arbitrary Boundary","summary":" Document dewarping from a distorted camera-captured image is of great value\nfor OCR and document understanding. The document boundary plays an important\nrole which is more evident than the inner region in document dewarping. Current\nlearning-based methods mainly focus on complete boundary cases, leading to poor\ndocument correction performance of documents with incomplete boundaries. In\ncontrast to these methods, this paper proposes MataDoc, the first method\nfocusing on arbitrary boundary document dewarping with margin and text aware\nregularizations. Specifically, we design the margin regularization by\nexplicitly considering background consistency to enhance boundary perception.\nMoreover, we introduce word position consistency to keep text lines straight in\nrectified document images. To produce a comprehensive evaluation of MataDoc, we\npropose a novel benchmark ArbDoc, mainly consisting of document images with\narbitrary boundaries in four typical scenarios. Extensive experiments confirm\nthe superiority of MataDoc with consideration for the incomplete boundary on\nArbDoc and also demonstrate the effectiveness of the proposed method on\nDocUNet, DIR300, and WarpDoc datasets.\n","authors":["Beiya Dai","Xing li","Qunyi Xie","Yulin Li","Xiameng Qin","Chengquan Zhang","Kun Yao","Junyu Han"],"pdf_url":"https://arxiv.org/pdf/2307.12571v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2307.12560v1","updated":"2023-07-24T07:03:22Z","published":"2023-07-24T07:03:22Z","title":"Interpolating between Images with Diffusion Models","summary":" One little-explored frontier of image generation and editing is the task of\ninterpolating between two input images, a feature missing from all currently\ndeployed image generation pipelines. We argue that such a feature can expand\nthe creative applications of such models, and propose a method for zero-shot\ninterpolation using latent diffusion models. We apply interpolation in the\nlatent space at a sequence of decreasing noise levels, then perform denoising\nconditioned on interpolated text embeddings derived from textual inversion and\n(optionally) subject poses. For greater consistency, or to specify additional\ncriteria, we can generate several candidates and use CLIP to select the highest\nquality image. We obtain convincing interpolations across diverse subject\nposes, image styles, and image content, and show that standard quantitative\nmetrics such as FID are insufficient to measure the quality of an\ninterpolation. Code and data are available at\nhttps://clintonjwang.github.io/interpolation.\n","authors":["Clinton J. 
Wang","Polina Golland"],"pdf_url":"https://arxiv.org/pdf/2307.12560v1.pdf","comment":"Presented at ICML 2023 Workshop on Challenges of Deploying Generative\n AI"},{"id":"http://arxiv.org/abs/2203.01923v4","updated":"2023-07-24T06:59:56Z","published":"2022-03-03T18:56:08Z","title":"Recovering 3D Human Mesh from Monocular Images: A Survey","summary":" Estimating human pose and shape from monocular images is a long-standing\nproblem in computer vision. Since the release of statistical body models, 3D\nhuman mesh recovery has been drawing broader attention. With the same goal of\nobtaining well-aligned and physically plausible mesh results, two paradigms\nhave been developed to overcome challenges in the 2D-to-3D lifting process: i)\nan optimization-based paradigm, where different data terms and regularization\nterms are exploited as optimization objectives; and ii) a regression-based\nparadigm, where deep learning techniques are embraced to solve the problem in\nan end-to-end fashion. Meanwhile, continuous efforts are devoted to improving\nthe quality of 3D mesh labels for a wide range of datasets. Though remarkable\nprogress has been achieved in the past decade, the task is still challenging\ndue to flexible body motions, diverse appearances, complex environments, and\ninsufficient in-the-wild annotations. To the best of our knowledge, this is the\nfirst survey that focuses on the task of monocular 3D human mesh recovery. We\nstart with the introduction of body models and then elaborate recovery\nframeworks and training objectives by providing in-depth analyses of their\nstrengths and weaknesses. We also summarize datasets, evaluation metrics, and\nbenchmark results. Open issues and future directions are discussed in the end,\nhoping to motivate researchers and facilitate their research in this area. A\nregularly updated project page can be found at\nhttps://github.com/tinatiansjz/hmr-survey.\n","authors":["Yating Tian","Hongwen Zhang","Yebin Liu","Limin Wang"],"pdf_url":"https://arxiv.org/pdf/2203.01923v4.pdf","comment":"Accepted to IEEE TPAMI, Survey on monocular 3D human mesh recovery,\n Project page: https://github.com/tinatiansjz/hmr-survey"},{"id":"http://arxiv.org/abs/2307.12558v1","updated":"2023-07-24T06:51:07Z","published":"2023-07-24T06:51:07Z","title":"Revisiting Event-based Video Frame Interpolation","summary":" Dynamic vision sensors or event cameras provide rich complementary\ninformation for video frame interpolation. Existing state-of-the-art methods\nfollow the paradigm of combining both synthesis-based and warping networks.\nHowever, few of those methods fully respect the intrinsic characteristics of\nevents streams. Given that event cameras only encode intensity changes and\npolarity rather than color intensities, estimating optical flow from events is\narguably more difficult than from RGB information. We therefore propose to\nincorporate RGB information in an event-guided optical flow refinement\nstrategy. Moreover, in light of the quasi-continuous nature of the time signals\nprovided by event cameras, we propose a divide-and-conquer strategy in which\nevent-based intermediate frame synthesis happens incrementally in multiple\nsimplified stages rather than in a single, long stage. Extensive experiments on\nboth synthetic and real-world datasets show that these modifications lead to\nmore reliable and realistic intermediate frame results than previous video\nframe interpolation methods. 
Our findings underline that a careful\nconsideration of event characteristics such as high temporal density and\nelevated noise benefits interpolation accuracy.\n","authors":["Jiaben Chen","Yichen Zhu","Dongze Lian","Jiaqi Yang","Yifu Wang","Renrui Zhang","Xinhang Liu","Shenhan Qian","Laurent Kneip","Shenghua Gao"],"pdf_url":"https://arxiv.org/pdf/2307.12558v1.pdf","comment":"Accepted by IROS2023 Project Site:\n https://jiabenchen.github.io/revisit_event"},{"id":"http://arxiv.org/abs/2307.12548v1","updated":"2023-07-24T06:33:52Z","published":"2023-07-24T06:33:52Z","title":"MFMAN-YOLO: A Method for Detecting Pole-like Obstacles in Complex\n Environment","summary":" In real-world traffic, there are various uncertainties and complexities in\nroad and weather conditions. To solve the problem that the feature information\nof pole-like obstacles in complex environments is easily lost, resulting in low\ndetection accuracy and low real-time performance, a multi-scale hybrid\nattention mechanism detection algorithm is proposed in this paper. First, the\noptimal transport function Monge-Kantorovich (MK) is incorporated not only to\nsolve the problem of overlapping multiple prediction frames with optimal\nmatching but also the MK function can be regularized to prevent model\nover-fitting; then, the features at different scales are up-sampled separately\naccording to the optimized efficient multi-scale feature pyramid. Finally, the\nextraction of multi-scale feature space channel information is enhanced in\ncomplex environments based on the hybrid attention mechanism, which suppresses\nthe irrelevant complex environment background information and focuses the\nfeature information of pole-like obstacles. Meanwhile, this paper conducts real\nroad test experiments in a variety of complex environments. The experimental\nresults show that the detection precision, recall, and average precision of the\nmethod are 94.7%, 93.1%, and 97.4%, respectively, and the detection frame rate\nis 400 f/s. This research method can detect pole-like obstacles in a complex\nroad environment in real time and accurately, which further promotes innovation\nand progress in the field of automatic driving.\n","authors":["Lei Cai","Hao Wang","Congling Zhou","Yongqiang Wang","Boyu Liu"],"pdf_url":"https://arxiv.org/pdf/2307.12548v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2301.01482v5","updated":"2023-07-24T06:31:58Z","published":"2023-01-04T08:22:34Z","title":"Underwater Object Tracker: UOSTrack for Marine Organism Grasping of\n Underwater Vehicles","summary":" A visual single-object tracker is an indispensable component of underwater\nvehicles (UVs) in marine organism grasping tasks. Its accuracy and stability\nare imperative to guide the UVs to perform grasping behavior. Although\nsingle-object trackers show competitive performance in the challenge of\nunderwater image degradation, there are still issues with sample imbalance and\nexclusion of similar objects that need to be addressed for application in\nmarine organism grasping. This paper proposes Underwater OSTrack (UOSTrack),\nwhich consists of underwater image and open-air sequence hybrid training\n(UOHT), and motion-based post-processing (MBPP). The UOHT training paradigm is\ndesigned to train the sample-imbalanced underwater tracker so that the tracker\nis exposed to a great number of underwater domain training samples and learns\nthe feature expressions. The MBPP paradigm is proposed to exclude similar\nobjects. 
It uses the estimation box predicted with a Kalman filter and the\ncandidate boxes in the response map to relocate the lost tracked object in the\ncandidate area. UOSTrack achieves an average performance improvement of 4.41%\nand a maximum of 7.98% compared to state-of-the-art methods on various benchmarks.\nField experiments have verified the accuracy and stability of our\nproposed UOSTrack for UVs in marine organism grasping tasks. More details can\nbe found at https://github.com/LiYunfengLYF/UOSTrack.\n","authors":["Yunfeng Li","Bo Wang","Ye Li","Zhuoyan Liu","Wei Huo","Yueming Li","Jian Cao"],"pdf_url":"https://arxiv.org/pdf/2301.01482v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12545v1","updated":"2023-07-24T06:22:37Z","published":"2023-07-24T06:22:37Z","title":"Towards Video Anomaly Retrieval from Video Anomaly Detection: New\n Benchmarks and Model","summary":" Video anomaly detection (VAD) has received increasing attention due to its\npotential applications; its current dominant tasks focus on detecting\nanomalies online at the frame level, which can be roughly interpreted as binary\nor multi-class event classification. However, such a setup that builds\nrelationships between complicated anomalous events and single labels, e.g.,\n``vandalism'', is superficial, since single labels are insufficient to\ncharacterize anomalous events. In reality, users tend to search for a specific\nvideo rather than a series of approximate videos. Therefore, retrieving\nanomalous events using detailed descriptions is practical and valuable, but little\nresearch has focused on this. In this context, we propose a novel task called Video\nAnomaly Retrieval (VAR), which aims to pragmatically retrieve relevant\nanomalous videos by cross-modalities, e.g., language descriptions and\nsynchronous audio. Unlike current video retrieval, where videos are assumed\nto be temporally well-trimmed and of short duration, VAR is devised to retrieve\nlong untrimmed videos which may be partially relevant to the given query. To\nachieve this, we present two large-scale VAR benchmarks, UCFCrime-AR and\nXDViolence-AR, constructed on top of prevalent anomaly datasets. Meanwhile, we\ndesign a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we\npropose an anomaly-led sampling to focus on key segments in long untrimmed\nvideos. Then, we introduce an efficient pretext task to enhance semantic\nassociations between video-text fine-grained representations. Besides, we\nleverage two complementary alignments to further match cross-modal contents.\nExperimental results on two benchmarks reveal the challenges of the VAR task and\nalso demonstrate the advantages of our tailored method.\n","authors":["Peng Wu","Jing Liu","Xiangteng He","Yuxin Peng","Peng Wang","Yanning Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12545v1.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2307.12542v1","updated":"2023-07-24T06:12:37Z","published":"2023-07-24T06:12:37Z","title":"Client-Level Differential Privacy via Adaptive Intermediary in Federated\n Medical Imaging","summary":" Despite recent progress in enhancing the privacy of federated learning (FL)\nvia differential privacy (DP), the trade-off of DP between privacy protection\nand performance is still underexplored for real-world medical scenarios. 
In this\npaper, we propose to optimize the trade-off under the context of client-level\nDP, which focuses on privacy during communications. However, FL for medical\nimaging involves typically much fewer participants (hospitals) than other\ndomains (e.g., mobile devices), thus ensuring clients be differentially private\nis much more challenging. To tackle this problem, we propose an adaptive\nintermediary strategy to improve performance without harming privacy.\nSpecifically, we theoretically find splitting clients into sub-clients, which\nserve as intermediaries between hospitals and the server, can mitigate the\nnoises introduced by DP without harming privacy. Our proposed approach is\nempirically evaluated on both classification and segmentation tasks using two\npublic datasets, and its effectiveness is demonstrated with significant\nperformance improvements and comprehensive analytical studies. Code is\navailable at: https://github.com/med-air/Client-DP-FL.\n","authors":["Meirui Jiang","Yuan Zhong","Anjie Le","Xiaoxiao Li","Qi Dou"],"pdf_url":"https://arxiv.org/pdf/2307.12542v1.pdf","comment":"Accepted by 26th International Conference on Medical Image Computing\n and Computer Assisted Intervention (MICCAI'23)"},{"id":"http://arxiv.org/abs/2303.05021v3","updated":"2023-07-24T06:06:27Z","published":"2023-03-09T03:48:24Z","title":"DiffusionDepth: Diffusion Denoising Approach for Monocular Depth\n Estimation","summary":" Monocular depth estimation is a challenging task that predicts the pixel-wise\ndepth from a single 2D image. Current methods typically model this problem as a\nregression or classification task. We propose DiffusionDepth, a new approach\nthat reformulates monocular depth estimation as a denoising diffusion process.\nIt learns an iterative denoising process to `denoise' random depth distribution\ninto a depth map with the guidance of monocular visual conditions. The process\nis performed in the latent space encoded by a dedicated depth encoder and\ndecoder. Instead of diffusing ground truth (GT) depth, the model learns to\nreverse the process of diffusing the refined depth of itself into random depth\ndistribution. This self-diffusion formulation overcomes the difficulty of\napplying generative models to sparse GT depth scenarios. The proposed approach\nbenefits this task by refining depth estimation step by step, which is superior\nfor generating accurate and highly detailed depth maps. Experimental results on\nKITTI and NYU-Depth-V2 datasets suggest that a simple yet efficient diffusion\napproach could reach state-of-the-art performance in both indoor and outdoor\nscenarios with acceptable inference time.\n","authors":["Yiqun Duan","Xianda Guo","Zheng Zhu"],"pdf_url":"https://arxiv.org/pdf/2303.05021v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12540v1","updated":"2023-07-24T06:04:12Z","published":"2023-07-24T06:04:12Z","title":"SelFormaly: Towards Task-Agnostic Unified Anomaly Detection","summary":" The core idea of visual anomaly detection is to learn the normality from\nnormal images, but previous works have been developed specifically for certain\ntasks, leading to fragmentation among various tasks: defect detection, semantic\nanomaly detection, multi-class anomaly detection, and anomaly clustering. This\none-task-one-model approach is resource-intensive and incurs high maintenance\ncosts as the number of tasks increases. This paper presents SelFormaly, a\nuniversal and powerful anomaly detection framework. 
We emphasize the necessity\nof our off-the-shelf approach by pointing out a suboptimal issue with\nfluctuating performance in previous online encoder-based methods. In addition,\nwe question the effectiveness of using ConvNets as previously employed in the\nliterature and confirm that self-supervised ViTs are suitable for unified\nanomaly detection. We introduce back-patch masking and discover the new role of\ntop k-ratio feature matching to achieve unified and powerful anomaly detection.\nBack-patch masking eliminates irrelevant regions that possibly hinder\ntarget-centric detection with representations of the scene layout. The top\nk-ratio feature matching unifies various anomaly levels and tasks. Finally,\nSelFormaly achieves state-of-the-art results across various datasets for all\nthe aforementioned tasks.\n","authors":["Yujin Lee","Harin Lim","Hyunsoo Yoon"],"pdf_url":"https://arxiv.org/pdf/2307.12540v1.pdf","comment":"11 pages, 7 figures"},{"id":"http://arxiv.org/abs/2307.12534v1","updated":"2023-07-24T05:43:34Z","published":"2023-07-24T05:43:34Z","title":"Towards Generalizable Deepfake Detection by Primary Region\n Regularization","summary":" The existing deepfake detection methods have reached a bottleneck in\ngeneralizing to unseen forgeries and manipulation approaches. Based on the\nobservation that the deepfake detectors exhibit a preference for overfitting\nthe specific primary regions in input, this paper enhances the generalization\ncapability from a novel regularization perspective. This can be simply achieved\nby augmenting the images through primary region removal, thereby preventing the\ndetector from over-relying on data bias. Our method consists of two stages,\nnamely the static localization for primary region maps, as well as the dynamic\nexploitation of primary region masks. The proposed method can be seamlessly\nintegrated into different backbones without affecting their inference\nefficiency. We conduct extensive experiments over three widely used deepfake\ndatasets - DFDC, DF-1.0, and Celeb-DF with five backbones. Our method\ndemonstrates an average performance improvement of 6% across different\nbackbones and performs competitively with several state-of-the-art baselines.\n","authors":["Harry Cheng","Yangyang Guo","Tianyi Wang","Liqiang Nie","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2307.12534v1.pdf","comment":"12 pages. Code and Dataset: https://github.com/xaCheng1996/PRLE"},{"id":"http://arxiv.org/abs/2307.12532v1","updated":"2023-07-24T05:36:19Z","published":"2023-07-24T05:36:19Z","title":"On the Connection between Pre-training Data Diversity and Fine-tuning\n Robustness","summary":" Pre-training has been widely adopted in deep learning to improve model\nperformance, especially when the training data for a target task is limited. In\nour work, we seek to understand the implications of this training strategy on\nthe generalization properties of downstream models. More specifically, we ask\nthe following question: how do properties of the pre-training distribution\naffect the robustness of a fine-tuned model? The properties we explore include\nthe label space, label semantics, image diversity, data domains, and data\nquantity of the pre-training distribution. We find that the primary factor\ninfluencing downstream effective robustness (Taori et al., 2020) is data\nquantity, while other factors have limited significance. 
For example, reducing\nthe number of ImageNet pre-training classes by 4x while increasing the number\nof images per class by 4x (that is, keeping total data quantity fixed) does not\nimpact the robustness of fine-tuned models. We demonstrate our findings on\npre-training distributions drawn from various natural and synthetic data\nsources, primarily using the iWildCam-WILDS distribution shift as a test for\ndownstream robustness.\n","authors":["Vivek Ramanujan","Thao Nguyen","Sewoong Oh","Ludwig Schmidt","Ali Farhadi"],"pdf_url":"https://arxiv.org/pdf/2307.12532v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.18246v3","updated":"2023-07-24T05:35:30Z","published":"2023-03-31T17:59:09Z","title":"3D Human Pose Estimation via Intuitive Physics","summary":" Estimating 3D humans from images often produces implausible bodies that lean,\nfloat, or penetrate the floor. Such methods ignore the fact that bodies are\ntypically supported by the scene. A physics engine can be used to enforce\nphysical plausibility, but these are not differentiable, rely on unrealistic\nproxy bodies, and are difficult to integrate into existing optimization and\nlearning frameworks. In contrast, we exploit novel intuitive-physics (IP) terms\nthat can be inferred from a 3D SMPL body interacting with the scene. Inspired\nby biomechanics, we infer the pressure heatmap on the body, the Center of\nPressure (CoP) from the heatmap, and the SMPL body's Center of Mass (CoM). With\nthese, we develop IPMAN, to estimate a 3D body from a color image in a \"stable\"\nconfiguration by encouraging plausible floor contact and overlapping CoP and\nCoM. Our IP terms are intuitive, easy to implement, fast to compute,\ndifferentiable, and can be integrated into existing optimization and regression\nmethods. We evaluate IPMAN on standard datasets and MoYo, a new dataset with\nsynchronized multi-view images, ground-truth 3D bodies with complex poses,\nbody-floor contact, CoM and pressure. IPMAN produces more plausible results\nthan the state of the art, improving accuracy for static poses, while not\nhurting dynamic ones. Code and data are available for research at\nhttps://ipman.is.tue.mpg.de.\n","authors":["Shashank Tripathi","Lea Müller","Chun-Hao P. Huang","Omid Taheri","Michael J. Black","Dimitrios Tzionas"],"pdf_url":"https://arxiv.org/pdf/2303.18246v3.pdf","comment":"Accepted in CVPR'23. Project page: https://ipman.is.tue.mpg.de"},{"id":"http://arxiv.org/abs/2307.12526v1","updated":"2023-07-24T04:56:23Z","published":"2023-07-24T04:56:23Z","title":"Rethinking Medical Report Generation: Disease Revealing Enhancement with\n Knowledge Graph","summary":" Knowledge Graph (KG) plays a crucial role in Medical Report Generation (MRG)\nbecause it reveals the relations among diseases and thus can be utilized to\nguide the generation process. However, constructing a comprehensive KG is\nlabor-intensive and its applications on the MRG process are under-explored. In\nthis study, we establish a complete KG on chest X-ray imaging that includes 137\ntypes of diseases and abnormalities. Based on this KG, we find that the current\nMRG data sets exhibit a long-tailed problem in disease distribution. To\nmitigate this problem, we introduce a novel augmentation strategy that enhances\nthe representation of disease types in the tail-end of the distribution. We\nfurther design a two-stage MRG approach, where a classifier is first trained to\ndetect whether the input images exhibit any abnormalities. 
The classified\nimages are then independently fed into two transformer-based generators,\nnamely, ``disease-specific generator\" and ``disease-free generator\" to generate\nthe corresponding reports. To enhance the clinical evaluation of whether the\ngenerated reports correctly describe the diseases appearing in the input image,\nwe propose diverse sensitivity (DS), a new metric that checks whether generated\ndiseases match ground truth and measures the diversity of all generated\ndiseases. Results show that the proposed two-stage generation framework and\naugmentation strategies improve DS by a considerable margin, indicating a\nnotable reduction in the long-tailed problem associated with under-represented\ndiseases.\n","authors":["Yixin Wang","Zihao Lin","Haoyu Dong"],"pdf_url":"https://arxiv.org/pdf/2307.12526v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12517v1","updated":"2023-07-24T04:21:51Z","published":"2023-07-24T04:21:51Z","title":"Entropy Transformer Networks: A Learning Approach via Tangent Bundle\n Data Manifold","summary":" This paper focuses on an accurate and fast interpolation approach for image\ntransformation employed in the design of CNN architectures. Standard Spatial\nTransformer Networks (STNs) use bilinear or linear interpolation as their\ninterpolation, with unrealistic assumptions about the underlying data\ndistributions, which leads to poor performance under scale variations.\nMoreover, STNs do not preserve the norm of gradients in propagation due to\ntheir dependency on sparse neighboring pixels. To address this problem, a novel\nEntropy STN (ESTN) is proposed that interpolates on the data manifold\ndistributions. In particular, random samples are generated for each pixel in\nassociation with the tangent space of the data manifold and construct a linear\napproximation of their intensity values with an entropy regularizer to compute\nthe transformer parameters. A simple yet effective technique is also proposed\nto normalize the non-zero values of the convolution operation, to fine-tune the\nlayers for gradients' norm-regularization during training. Experiments on\nchallenging benchmarks show that the proposed ESTN can improve predictive\naccuracy over a range of computer vision tasks, including image reconstruction,\nand classification, while reducing the computational cost.\n","authors":["Pourya Shamsolmoali","Masoumeh Zareapoor"],"pdf_url":"https://arxiv.org/pdf/2307.12517v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.12539v2","updated":"2023-07-24T04:20:37Z","published":"2023-04-25T03:12:54Z","title":"Text-guided Eyeglasses Manipulation with Spatial Constraints","summary":" Virtual try-on of eyeglasses involves placing eyeglasses of different shapes\nand styles onto a face image without physically trying them on. While existing\nmethods have shown impressive results, the variety of eyeglasses styles is\nlimited and the interactions are not always intuitive or efficient. To address\nthese limitations, we propose a Text-guided Eyeglasses Manipulation method that\nallows for control of the eyeglasses shape and style based on a binary mask and\ntext, respectively. Specifically, we introduce a mask encoder to extract mask\nconditions and a modulation module that enables simultaneous injection of text\nand mask conditions. This design allows for fine-grained control of the\neyeglasses' appearance based on both textual descriptions and spatial\nconstraints. 
Our approach includes a disentangled mapper and a decoupling\nstrategy that preserves irrelevant areas, resulting in better local editing. We\nemploy a two-stage training scheme to handle the different convergence speeds\nof the various modality conditions, successfully controlling both the shape and\nstyle of eyeglasses. Extensive comparison experiments and ablation analyses\ndemonstrate the effectiveness of our approach in achieving diverse eyeglasses\nstyles while preserving irrelevant areas.\n","authors":["Jiacheng Wang","Ping Liu","Jingen Liu","Wei Xu"],"pdf_url":"https://arxiv.org/pdf/2304.12539v2.pdf","comment":"Revised version: add some experiments"},{"id":"http://arxiv.org/abs/2307.11466v2","updated":"2023-07-24T03:35:03Z","published":"2023-07-21T10:02:02Z","title":"MatSpectNet: Material Segmentation Network with Domain-Aware and\n Physically-Constrained Hyperspectral Reconstruction","summary":" Achieving accurate material segmentation for 3-channel RGB images is\nchallenging due to the considerable variation in a material's appearance.\nHyperspectral images, which are sets of spectral measurements sampled at\nmultiple wavelengths, theoretically offer distinct information for material\nidentification, as variations in intensity of electromagnetic radiation\nreflected by a surface depend on the material composition of a scene. However,\nexisting hyperspectral datasets are impoverished regarding the number of images\nand material categories for the dense material segmentation task, and\ncollecting and annotating hyperspectral images with a spectral camera is\nprohibitively expensive. To address this, we propose a new model, the\nMatSpectNet to segment materials with recovered hyperspectral images from RGB\nimages. The network leverages the principles of colour perception in modern\ncameras to constrain the reconstructed hyperspectral images and employs the\ndomain adaptation method to generalise the hyperspectral reconstruction\ncapability from a spectral recovery dataset to material segmentation datasets.\nThe reconstructed hyperspectral images are further filtered using learned\nresponse curves and enhanced with human perception. The performance of\nMatSpectNet is evaluated on the LMD dataset as well as the OpenSurfaces\ndataset. Our experiments demonstrate that MatSpectNet attains a 1.60% increase\nin average pixel accuracy and a 3.42% improvement in mean class accuracy\ncompared with the most recent publication. The project code is attached to the\nsupplementary material and will be published on GitHub.\n","authors":["Yuwen Heng","Yihong Wu","Jiawen Chen","Srinandan Dasmahapatra","Hansung Kim"],"pdf_url":"https://arxiv.org/pdf/2307.11466v2.pdf","comment":"7 pages main paper"},{"id":"http://arxiv.org/abs/2304.03483v2","updated":"2023-07-24T03:28:34Z","published":"2023-04-07T05:29:59Z","title":"RED-PSM: Regularization by Denoising of Partially Separable Models for\n Dynamic Imaging","summary":" Dynamic imaging addresses the recovery of a time-varying 2D or 3D object at\neach time instant using its undersampled measurements. In particular, in the\ncase of dynamic tomography, only a single projection at a single view angle may\nbe available at a time, making the problem severely ill-posed. In this work, we\npropose an approach, RED-PSM, which combines for the first time two powerful\ntechniques to address this challenging imaging problem. The first, are\npartially separable models, which have been used to efficiently introduce a\nlow-rank prior for the spatio-temporal object. 
The second is the recent\nRegularization by Denoising (RED), which provides a flexible framework to\nexploit the impressive performance of state-of-the-art image denoising\nalgorithms, for various inverse problems. We propose a partially separable\nobjective with RED and a computationally efficient and scalable optimization\nscheme with variable splitting and ADMM. Theoretical analysis proves the\nconvergence of our objective to a value corresponding to a stationary point\nsatisfying the first-order optimality conditions. Convergence is accelerated by\na particular projection-domain-based initialization. We demonstrate the\nperformance and computational improvements of our proposed RED-PSM with a\nlearned image denoiser by comparing it to a recent deep-prior-based method\nknown as TD-DIP. Although the main focus is on dynamic tomography, we also show\nthe performance advantages of RED-PSM in a cardiac dynamic MRI setting.\n","authors":["Berk Iskender","Marc L. Klasky","Yoram Bresler"],"pdf_url":"https://arxiv.org/pdf/2304.03483v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12502v1","updated":"2023-07-24T03:27:41Z","published":"2023-07-24T03:27:41Z","title":"Cross Contrastive Feature Perturbation for Domain Generalization","summary":" Domain generalization (DG) aims to learn a robust model from source domains\nthat generalize well on unseen target domains. Recent studies focus on\ngenerating novel domain samples or features to diversify distributions\ncomplementary to source domains. Yet, these approaches can hardly deal with the\nrestriction that the samples synthesized from various domains can cause\nsemantic distortion. In this paper, we propose an online one-stage Cross\nContrasting Feature Perturbation (CCFP) framework to simulate domain shift by\ngenerating perturbed features in the latent space while regularizing the model\nprediction against domain shift. Different from the previous fixed synthesizing\nstrategy, we design modules with learnable feature perturbations and semantic\nconsistency constraints. In contrast to prior work, our method does not use any\ngenerative-based models or domain labels. We conduct extensive experiments on a\nstandard DomainBed benchmark with a strict evaluation protocol for a fair\ncomparison. Comprehensive experiments show that our method outperforms the\nprevious state-of-the-art, and quantitative analyses illustrate that our\napproach can alleviate the domain shift problem in out-of-distribution (OOD)\nscenarios.\n","authors":["Chenming Li","Daoan Zhang","Wenjian Huang","Jianguo Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12502v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.09186v4","updated":"2023-07-24T03:20:19Z","published":"2022-04-20T02:14:20Z","title":"Reconstruction-Aware Prior Distillation for Semi-supervised Point Cloud\n Completion","summary":" Real-world sensors often produce incomplete, irregular, and noisy point\nclouds, making point cloud completion increasingly important. However, most\nexisting completion methods rely on large paired datasets for training, which\nis labor-intensive. This paper proposes RaPD, a novel semi-supervised point\ncloud completion method that reduces the need for paired datasets. RaPD\nutilizes a two-stage training scheme, where a deep semantic prior is learned in\nstage 1 from unpaired complete and incomplete point clouds, and a\nsemi-supervised prior distillation process is introduced in stage 2 to train a\ncompletion network using only a small number of paired samples. 
Additionally, a\nself-supervised completion module is introduced to improve performance using\nunpaired incomplete point clouds. Experiments on multiple datasets show that\nRaPD outperforms previous methods in both homologous and heterologous\nscenarios.\n","authors":["Zhaoxin Fan","Yulin He","Zhicheng Wang","Kejian Wu","Hongyan Liu","Jun He"],"pdf_url":"https://arxiv.org/pdf/2204.09186v4.pdf","comment":"Accepted to IJCAI 2023"},{"id":"http://arxiv.org/abs/2307.12499v1","updated":"2023-07-24T03:10:02Z","published":"2023-07-24T03:10:02Z","title":"AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion\n Models","summary":" Unrestricted adversarial attacks present a serious threat to deep learning\nmodels and adversarial defense techniques. They pose severe security problems\nfor deep learning applications because they can effectively bypass defense\nmechanisms. However, previous attack methods often utilize Generative\nAdversarial Networks (GANs), which are not theoretically provable and thus\ngenerate unrealistic examples by incorporating adversarial objectives,\nespecially for large-scale datasets like ImageNet. In this paper, we propose a\nnew method, called AdvDiff, to generate unrestricted adversarial examples with\ndiffusion models. We design two novel adversarial guidance techniques to\nconduct adversarial sampling in the reverse generation process of diffusion\nmodels. These two techniques are effective and stable for generating high-quality,\nrealistic adversarial examples by interpretably integrating gradients of the target\nclassifier. Experimental results on MNIST and ImageNet datasets\ndemonstrate that AdvDiff is effective at generating unrestricted adversarial\nexamples and outperforms GAN-based methods in terms of attack performance\nand generation quality.\n","authors":["Xuelong Dai","Kaisheng Liang","Bin Xiao"],"pdf_url":"https://arxiv.org/pdf/2307.12499v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.09417v2","updated":"2023-07-24T03:06:15Z","published":"2022-08-19T16:04:29Z","title":"Target-oriented Sentiment Classification with Sequential Cross-modal\n Semantic Graph","summary":" Multi-modal aspect-based sentiment classification (MABSC) is the task of\nclassifying the sentiment of a target entity mentioned in a sentence and an\nimage. However, previous methods failed to account for the fine-grained\nsemantic association between the image and the text, which resulted in limited\nidentification of fine-grained image aspects and opinions. To address these\nlimitations, in this paper we propose a new approach called SeqCSG, which\nenhances the encoder-decoder sentiment classification framework using\nsequential cross-modal semantic graphs. SeqCSG utilizes image captions and\nscene graphs to extract both global and local fine-grained image information\nand considers them as elements of the cross-modal semantic graph along with\ntokens from tweets. The sequential cross-modal semantic graph is represented as\na sequence with a multi-modal adjacency matrix indicating relationships between\nelements. Experimental results show that the approach outperforms existing\nmethods and achieves state-of-the-art performance on two standard datasets.\nFurther analysis has demonstrated that the model can implicitly learn the\ncorrelation between fine-grained information of the image and the text with the\ngiven target. Our code is available at https://github.com/zjukg/SeqCSG.\n","authors":["Yufeng Huang","Zhuo Chen","Jiaoyan Chen","Jeff Z. 
Pan","Zhen Yao","Wen Zhang"],"pdf_url":"https://arxiv.org/pdf/2208.09417v2.pdf","comment":"ICANN 2023, https://github.com/zjukg/SeqCSG"},{"id":"http://arxiv.org/abs/2307.11411v2","updated":"2023-07-24T02:57:01Z","published":"2023-07-21T08:10:26Z","title":"Deep Directly-Trained Spiking Neural Networks for Object Detection","summary":" Spiking neural networks (SNNs) are brain-inspired energy-efficient models\nthat encode information in spatiotemporal dynamics. Recently, deep SNNs trained\ndirectly have shown great success in achieving high performance on\nclassification tasks with very few time steps. However, how to design a\ndirectly-trained SNN for the regression task of object detection still remains\na challenging problem. To address this problem, we propose EMS-YOLO, a novel\ndirectly-trained SNN framework for object detection, which is the first trial\nto train a deep SNN with surrogate gradients for object detection rather than\nANN-SNN conversion strategies. Specifically, we design a full-spike residual\nblock, EMS-ResNet, which can effectively extend the depth of the\ndirectly-trained SNN with low power consumption. Furthermore, we theoretically\nanalyze and prove the EMS-ResNet could avoid gradient vanishing or exploding.\nThe results demonstrate that our approach outperforms the state-of-the-art\nANN-SNN conversion methods (at least 500 time steps) in extremely fewer time\nsteps (only 4 time steps). It is shown that our model could achieve comparable\nperformance to the ANN with the same architecture while consuming 5.83 times\nless energy on the frame-based COCO Dataset and the event-based Gen1 Dataset.\n","authors":["Qiaoyi Su","Yuhong Chou","Yifan Hu","Jianing Li","Shijie Mei","Ziyang Zhang","Guoqi Li"],"pdf_url":"https://arxiv.org/pdf/2307.11411v2.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2307.12493v1","updated":"2023-07-24T02:50:44Z","published":"2023-07-24T02:50:44Z","title":"TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition","summary":" Text-driven diffusion models have exhibited impressive generative\ncapabilities, enabling various image editing tasks. In this paper, we propose\nTF-ICON, a novel Training-Free Image COmpositioN framework that harnesses the\npower of text-driven diffusion models for cross-domain image-guided\ncomposition. This task aims to seamlessly integrate user-provided objects into\na specific visual context. Current diffusion-based methods often involve costly\ninstance-based optimization or finetuning of pretrained models on customized\ndatasets, which can potentially undermine their rich prior. In contrast,\nTF-ICON can leverage off-the-shelf diffusion models to perform cross-domain\nimage-guided composition without requiring additional training, finetuning, or\noptimization. Moreover, we introduce the exceptional prompt, which contains no\ninformation, to facilitate text-driven diffusion models in accurately inverting\nreal images into latent representations, forming the basis for compositing. Our\nexperiments show that equipping Stable Diffusion with the exceptional prompt\noutperforms state-of-the-art inversion methods on various datasets (CelebA-HQ,\nCOCO, and ImageNet), and that TF-ICON surpasses prior baselines in versatile\nvisual domains. 
Code is available at https://github.com/Shilin-LU/TF-ICON\n","authors":["Shilin Lu","Yanzhu Liu","Adams Wai-Kin Kong"],"pdf_url":"https://arxiv.org/pdf/2307.12493v1.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2307.00932v2","updated":"2023-07-24T01:57:52Z","published":"2023-07-03T11:13:28Z","title":"A large calcium-imaging dataset reveals a systematic V4 organization for\n natural scenes","summary":" The visual system evolved to process natural scenes, yet most of our\nunderstanding of the topology and function of visual cortex derives from\nstudies using artificial stimuli. To gain deeper insights into visual\nprocessing of natural scenes, we utilized widefield calcium-imaging of primate\nV4 in response to many natural images, generating a large dataset of\ncolumnar-scale responses. We used this dataset to build a digital twin of V4\nvia deep learning, generating a detailed topographical map of natural image\npreferences at each cortical position. The map revealed clustered functional\ndomains for specific classes of natural image features. These ranged from\nsurface-related attributes like color and texture to shape-related features\nsuch as edges, curvature, and facial features. We validated the model-predicted\ndomains with additional widefield calcium-imaging and single-cell resolution\ntwo-photon imaging. Our study illuminates the detailed topological organization\nand neural codes in V4 that represent natural scenes.\n","authors":["Tianye Wang","Haoxuan Yao","Tai Sing Lee","Jiayi Hong","Yang Li","Hongfei Jiang","Ian Max Andolina","Shiming Tang"],"pdf_url":"https://arxiv.org/pdf/2307.00932v2.pdf","comment":"39 pages, 14 figures"},{"id":"http://arxiv.org/abs/2305.01788v3","updated":"2023-07-24T00:54:51Z","published":"2023-05-02T21:33:10Z","title":"Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation\n Incorporating Gloss Information","summary":" Visual Word Sense Disambiguation (VWSD) is a task to find the image that most\naccurately depicts the correct sense of the target word for the given context.\nPreviously, image-text matching models often suffered from recognizing\npolysemous words. This paper introduces an unsupervised VWSD approach that uses\ngloss information of an external lexical knowledge-base, especially the sense\ndefinitions. Specifically, we suggest employing Bayesian inference to\nincorporate the sense definitions when sense information of the answer is not\nprovided. In addition, to ameliorate the out-of-dictionary (OOD) issue, we\npropose a context-aware definition generation with GPT-3. Experimental results\nshow that the VWSD performance significantly increased with our Bayesian\ninference-based approach. In addition, our context-aware definition generation\nachieved prominent performance improvement in OOD examples exhibiting better\nperformance than the existing definition generation method.\n","authors":["Sunjae Kwon","Rishabh Garodia","Minhwa Lee","Zhichao Yang","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2305.01788v3.pdf","comment":"ACL 2023, https://aclanthology.org/2023.acl-long.88"},{"id":"http://arxiv.org/abs/2307.12463v1","updated":"2023-07-24T00:53:46Z","published":"2023-07-24T00:53:46Z","title":"Rethinking Data Distillation: Do Not Overlook Calibration","summary":" Neural networks trained on distilled data often produce over-confident output\nand require correction by calibration methods. 
Existing calibration methods\nsuch as temperature scaling and mixup work well for networks trained on\noriginal large-scale data. However, we find that these methods fail to\ncalibrate networks trained on data distilled from large source datasets. In\nthis paper, we show that distilled data lead to networks that are not\ncalibratable due to (i) a more concentrated distribution of the maximum logits\nand (ii) the loss of information that is semantically meaningful but unrelated\nto classification tasks. To address this problem, we propose Masked Temperature\nScaling (MTS) and Masked Distillation Training (MDT), which mitigate the\nlimitations of distilled data and achieve better calibration results while\nmaintaining the efficiency of dataset distillation.\n","authors":["Dongyao Zhu","Bowen Lei","Jie Zhang","Yanbo Fang","Ruqi Zhang","Yiqun Xie","Dongkuan Xu"],"pdf_url":"https://arxiv.org/pdf/2307.12463v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2304.07916v2","updated":"2023-07-24T00:29:45Z","published":"2023-04-16T23:37:24Z","title":"GaitRef: Gait Recognition with Refined Sequential Skeletons","summary":" Identifying humans by their walking sequences, known as gait recognition,\nis a useful biometric understanding task as it can be observed from a long\ndistance and does not require cooperation from the subject. Two common\nmodalities used for representing the walking sequence of a person are\nsilhouettes and joint skeletons. Silhouette sequences, which record the\nboundary of the walking person in each frame, may suffer from appearance\nvariations caused by carried objects and the person's clothing. Framewise joint\ndetections are noisy and introduce jitters that are not consistent with\nsequential detections. In this paper, we combine the silhouettes and skeletons\nand refine the framewise joint predictions for gait recognition with temporal\ninformation from the silhouette sequences. We show that the refined skeletons\ncan improve gait recognition performance without extra annotations. We compare\nour method on four public datasets, CASIA-B, OUMVLP, Gait3D and GREW, and show\nstate-of-the-art performance.\n","authors":["Haidong Zhu","Wanrong Zheng","Zhaoheng Zheng","Ram Nevatia"],"pdf_url":"https://arxiv.org/pdf/2304.07916v2.pdf","comment":"IJCB 2023. Code is available at\n https://github.com/haidongz-usc/GaitRef"},{"id":"http://arxiv.org/abs/2307.12459v1","updated":"2023-07-24T00:03:09Z","published":"2023-07-24T00:03:09Z","title":"Robust face anti-spoofing framework with Convolutional Vision\n Transformer","summary":" Owing to the advances in image processing technology and large-scale\ndatasets, companies have implemented facial authentication processes, thereby\nstimulating increased focus on face anti-spoofing (FAS) against realistic\npresentation attacks. Recently, various attempts have been made to improve face\nrecognition performance using both global and local learning on face images;\nhowever, to the best of our knowledge, this is the first study to investigate\nwhether the robustness of FAS against domain shifts is improved by considering\nglobal information and local cues in face images captured using self-attention\nand convolutional layers. This study proposes a convolutional vision\ntransformer-based framework that achieves robust performance for various unseen\ndomain data. Our model resulted in 7.3%$p$ and 12.9%$p$ increases in FAS\nperformance compared to models using only a convolutional neural network or\nvision transformer, respectively. 
It also shows the highest average rank in\nsub-protocols of the cross-dataset setting over the other nine benchmark models for\ndomain generalization.\n","authors":["Yunseung Lee","Youngjun Kwak","Jinho Shin"],"pdf_url":"https://arxiv.org/pdf/2307.12459v1.pdf","comment":"ICIP 2023"},{"id":"http://arxiv.org/abs/2301.06363v2","updated":"2023-07-24T23:39:15Z","published":"2023-01-16T11:17:32Z","title":"A$^2$-UAV: Application-Aware Content and Network Optimization of\n Edge-Assisted UAV Systems","summary":" To perform advanced surveillance, Unmanned Aerial Vehicles (UAVs) require the\nexecution of edge-assisted computer vision (CV) tasks. In multi-hop UAV\nnetworks, the successful transmission of these tasks to the edge is severely\nchallenged by stringent bandwidth constraints. For this reason, we propose a\nnovel A$^2$-UAV framework to optimize the number of correctly executed tasks at\nthe edge. In stark contrast with existing art, we take an application-aware\napproach and formulate a novel Application-Aware Task Planning Problem\n(A$^2$-TPP) that takes into account (i) the relationship between deep neural\nnetwork (DNN) accuracy and image compression for the classes of interest based\non the available dataset, (ii) the target positions, (iii) the current\nenergy/position of the UAVs to optimize routing, data pre-processing and target\nassignment for each UAV. We demonstrate that A$^2$-TPP is NP-Hard and propose a\npolynomial-time algorithm to solve it efficiently. We extensively evaluate\nA$^2$-UAV through real-world experiments with a testbed composed of four DJI\nMavic Air 2 UAVs. We consider state-of-the-art image classification tasks with\nfour different DNN models (i.e., DenseNet, ResNet152, ResNet50 and\nMobileNet-V2) and object detection tasks using YoloV4 trained on the ImageNet\ndataset. Results show that A$^2$-UAV attains on average around 38% more\naccomplished tasks than the state-of-the-art, with 400% more accomplished tasks\nwhen the number of targets increases significantly. To allow full\nreproducibility, we pledge to share datasets and code with the research\ncommunity.\n","authors":["Andrea Coletta","Flavio Giorgi","Gaia Maselli","Matteo Prata","Domenicomichele Silvestri","Jonathan Ashdown","Francesco Restuccia"],"pdf_url":"https://arxiv.org/pdf/2301.06363v2.pdf","comment":"Accepted to INFOCOM 2023"},{"id":"http://arxiv.org/abs/2307.13136v1","updated":"2023-07-24T21:29:48Z","published":"2023-07-24T21:29:48Z","title":"Does Progress On Object Recognition Benchmarks Improve Real-World\n Generalization?","summary":" For more than a decade, researchers have measured progress in object\nrecognition on ImageNet-based generalization benchmarks such as ImageNet-A, -C,\nand -R. Recent advances in foundation models, trained on orders of magnitude\nmore data, have begun to saturate these standard benchmarks, but remain brittle\nin practice. This suggests standard benchmarks, which tend to focus on\npredefined or synthetic changes, may not be sufficient for measuring real world\ngeneralization. Consequently, we propose studying generalization across\ngeography as a more realistic measure of progress using two datasets of objects\nfrom households across the globe. We conduct an extensive empirical evaluation\nof progress across nearly 100 vision models up to the most recent foundation\nmodels. 
We first identify a progress gap between standard benchmarks and\nreal-world, geographical shifts: progress on ImageNet results in up to 2.5x\nmore progress on standard generalization benchmarks than real-world\ndistribution shifts. Second, we study model generalization across geographies\nby measuring the disparities in performance across regions, a more fine-grained\nmeasure of real world generalization. We observe all models have large\ngeographic disparities, even foundation CLIP models, with differences of 7-20%\nin accuracy between regions. Counter to modern intuition, we discover progress\non standard benchmarks fails to improve geographic disparities and often\nexacerbates them: geographic disparities between the least performant models\nand today's best models have more than tripled. Our results suggest scaling\nalone is insufficient for consistent robustness to real-world distribution\nshifts. Finally, we highlight in early experiments how simple last layer\nretraining on more representative, curated data can complement scaling as a\npromising direction of future work, reducing geographic disparity on both\nbenchmarks by over two-thirds.\n","authors":["Megan Richards","Polina Kirichenko","Diane Bouchacourt","Mark Ibrahim"],"pdf_url":"https://arxiv.org/pdf/2307.13136v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13133v1","updated":"2023-07-24T21:22:58Z","published":"2023-07-24T21:22:58Z","title":"simPLE: a visuotactile method learned in simulation to precisely pick,\n localize, regrasp, and place objects","summary":" Existing robotic systems have a clear tension between generality and\nprecision. Deployed solutions for robotic manipulation tend to fall into the\nparadigm of one robot solving a single task, lacking precise generalization,\ni.e., the ability to solve many tasks without compromising on precision. This\npaper explores solutions for precise and general pick-and-place. In precise\npick-and-place, i.e. kitting, the robot transforms an unstructured arrangement\nof objects into an organized arrangement, which can facilitate further\nmanipulation. We propose simPLE (simulation to Pick Localize and PLacE) as a\nsolution to precise pick-and-place. simPLE learns to pick, regrasp and place\nobjects precisely, given only the object CAD model and no prior experience. We\ndevelop three main components: task-aware grasping, visuotactile perception,\nand regrasp planning. Task-aware grasping computes affordances of grasps that\nare stable, observable, and favorable to placing. The visuotactile perception\nmodel relies on matching real observations against a set of simulated ones\nthrough supervised learning. Finally, we compute the desired robot motion by\nsolving a shortest path problem on a graph of hand-to-hand regrasps. On a\ndual-arm robot equipped with visuotactile sensing, we demonstrate\npick-and-place of 15 diverse objects with simPLE. The objects span a wide range\nof shapes and simPLE achieves successful placements into structured\narrangements with 1mm clearance over 90% of the time for 6 objects, and over\n80% of the time for 11 objects. 
Videos are available at\nhttp://mcube.mit.edu/research/simPLE.html .\n","authors":["Maria Bauza","Antonia Bronars","Yifan Hou","Ian Taylor","Nikhil Chavan-Dafle","Alberto Rodriguez"],"pdf_url":"https://arxiv.org/pdf/2307.13133v1.pdf","comment":"33 pages, 6 figures, 2 tables, submitted to Science Robotics"},{"id":"http://arxiv.org/abs/2205.04691v3","updated":"2023-07-24T20:56:50Z","published":"2022-05-10T06:24:09Z","title":"An Asynchronous Event-Based Algorithm for Periodic Signals","summary":" Let $0\\leq\\tau_{1}\\leq\\tau_{2}\\leq\\cdots\\leq\\tau_{m}\\leq1$, originated from a\nuniform distribution. Let also $\\epsilon,\\delta\\in\\mathbb{R}$, and\n$d\\in\\mathbb{N}$. What is the probability of having more than $d$ adjacent\n$\\tau_{i}$-s pairs that the distance between them is $\\delta$, up to an error\n$\\epsilon$ ? In this paper we are going to show how this untreated theoretical\nprobabilistic problem arises naturally from the motivation of analyzing a\nsimple asynchronous algorithm for detection of signals with a known frequency,\nusing the novel technology of an event camera.\n","authors":["David El-Chai Ben-Ezra","Ron Arad","Ayelet Padowicz","Israel Tugendhaft"],"pdf_url":"https://arxiv.org/pdf/2205.04691v3.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2307.13125v1","updated":"2023-07-24T20:53:59Z","published":"2023-07-24T20:53:59Z","title":"Deep Learning Approaches for Data Augmentation in Medical Imaging: A\n Review","summary":" Deep learning has become a popular tool for medical image analysis, but the\nlimited availability of training data remains a major challenge, particularly\nin the medical field where data acquisition can be costly and subject to\nprivacy regulations. Data augmentation techniques offer a solution by\nartificially increasing the number of training samples, but these techniques\noften produce limited and unconvincing results. To address this issue, a\ngrowing number of studies have proposed the use of deep generative models to\ngenerate more realistic and diverse data that conform to the true distribution\nof the data. In this review, we focus on three types of deep generative models\nfor medical image augmentation: variational autoencoders, generative\nadversarial networks, and diffusion models. We provide an overview of the\ncurrent state of the art in each of these models and discuss their potential\nfor use in different downstream tasks in medical imaging, including\nclassification, segmentation, and cross-modal translation. We also evaluate the\nstrengths and limitations of each model and suggest directions for future\nresearch in this field. Our goal is to provide a comprehensive review about the\nuse of deep generative models for medical image augmentation and to highlight\nthe potential of these models for improving the performance of deep learning\nalgorithms in medical image analysis.\n","authors":["Aghiles Kebaili","Jérôme Lapuyade-Lahorgue","Su Ruan"],"pdf_url":"https://arxiv.org/pdf/2307.13125v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13110v1","updated":"2023-07-24T19:59:15Z","published":"2023-07-24T19:59:15Z","title":"Automatic Infant Respiration Estimation from Video: A Deep Flow-based\n Algorithm and a Novel Public Benchmark","summary":" Respiration is a critical vital sign for infants, and continuous respiratory\nmonitoring is particularly important for newborns. However, neonates are\nsensitive and contact-based sensors present challenges in comfort, hygiene, and\nskin health, especially for preterm babies. 
As a step toward fully automatic,\ncontinuous, and contactless respiratory monitoring, we develop a deep-learning\nmethod for estimating respiratory rate and waveform from plain video footage in\nnatural settings. Our automated infant respiration flow-based network\n(AIRFlowNet) combines video-extracted optical flow input and spatiotemporal\nconvolutional processing tuned to the infant domain. We support our model with\nthe first public annotated infant respiration dataset with 125 videos\n(AIR-125), drawn from eight infant subjects, with varied pose, lighting, and\ncamera conditions. We include manual respiration annotations and optimize\nAIRFlowNet training on them using a novel spectral bandpass loss function. When\ntrained and tested on the AIR-125 infant data, our method significantly\noutperforms other state-of-the-art methods in respiratory rate estimation,\nachieving a mean absolute error of $\\sim$2.9 breaths per minute, compared to\n$\\sim$4.7--6.2 for other public models designed for adult subjects and more\nuniform environments.\n","authors":["Sai Kumar Reddy Manne","Shaotong Zhu","Sarah Ostadabbas","Michael Wan"],"pdf_url":"https://arxiv.org/pdf/2307.13110v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.05799v2","updated":"2023-07-24T19:13:20Z","published":"2023-07-11T20:46:19Z","title":"3D Medical Image Segmentation based on multi-scale MPU-Net","summary":" The high cure rate of cancer is inextricably linked to physicians' accuracy\nin diagnosis and treatment; therefore, a model that can accomplish\nhigh-precision tumor segmentation has become a necessity in many applications\nof the medical industry. It can effectively lower the rate of misdiagnosis\nwhile considerably lessening the burden on clinicians. However, fully automated\ntarget organ segmentation is problematic due to the irregular stereo structure\nof 3D volume organs. As a basic model for this class of real applications,\nU-Net excels. It can learn certain global and local features, but still lacks\nthe capacity to grasp spatial long-range relationships and contextual\ninformation at multiple scales. This paper proposes a tumor segmentation model,\nMPU-Net, for patient volume CT images, which is inspired by the Transformer with a\nglobal attention mechanism. By combining image serialization with the Position\nAttention Module, the model attempts to comprehend deeper contextual\ndependencies and accomplish precise positioning. Each layer of the decoder is\nalso equipped with a multi-scale module and a cross-attention mechanism. The\ncapability of feature extraction and integration at different levels has been\nenhanced, and the hybrid loss function developed in this study can better\nexploit high-resolution characteristic information. Moreover, the suggested\narchitecture is tested and evaluated on the Liver Tumor Segmentation Challenge\n2017 (LiTS 2017) dataset. Compared with the benchmark model U-Net, MPU-Net\nshows excellent segmentation results. The dice, accuracy, precision,\nspecificity, IOU, and MCC metrics for the best model segmentation results are\n92.17%, 99.08%, 91.91%, 99.52%, 85.91%, and 91.74%, respectively. Outstanding\nindicators in various aspects illustrate the exceptional performance of this\nframework in automatic medical image segmentation.\n","authors":["Zeqiu. Yu","Shuo. Han","Ziheng. 
Song"],"pdf_url":"https://arxiv.org/pdf/2307.05799v2.pdf","comment":"37 pages"},{"id":"http://arxiv.org/abs/2307.13078v1","updated":"2023-07-24T18:59:46Z","published":"2023-07-24T18:59:46Z","title":"Adaptive Certified Training: Towards Better Accuracy-Robustness\n Tradeoffs","summary":" As deep learning models continue to advance and are increasingly utilized in\nreal-world systems, the issue of robustness remains a major challenge. Existing\ncertified training methods produce models that achieve high provable robustness\nguarantees at certain perturbation levels. However, the main problem of such\nmodels is a dramatically low standard accuracy, i.e. accuracy on clean\nunperturbed data, that makes them impractical. In this work, we consider a more\nrealistic perspective of maximizing the robustness of a model at certain levels\nof (high) standard accuracy. To this end, we propose a novel certified training\nmethod based on a key insight that training with adaptive certified radii helps\nto improve both the accuracy and robustness of the model, advancing\nstate-of-the-art accuracy-robustness tradeoffs. We demonstrate the\neffectiveness of the proposed method on MNIST, CIFAR-10, and TinyImageNet\ndatasets. Particularly, on CIFAR-10 and TinyImageNet, our method yields models\nwith up to two times higher robustness, measured as an average certified radius\nof a test set, at the same levels of standard accuracy compared to baseline\napproaches.\n","authors":["Zhakshylyk Nurlanov","Frank R. Schmidt","Florian Bernard"],"pdf_url":"https://arxiv.org/pdf/2307.13078v1.pdf","comment":"Presented at ICML 2023 workshop \"New Frontiers in Adversarial Machine\n Learning\""},{"id":"http://arxiv.org/abs/2307.09588v2","updated":"2023-07-24T18:52:54Z","published":"2023-07-18T19:51:28Z","title":"Automating Wood Species Detection and Classification in Microscopic\n Images of Fibrous Materials with Deep Learning","summary":" We have developed a methodology for the systematic generation of a large\nimage dataset of macerated wood references, which we used to generate image\ndata for nine hardwood genera. This is the basis for a substantial approach to\nautomate, for the first time, the identification of hardwood species in\nmicroscopic images of fibrous materials by deep learning. Our methodology\nincludes a flexible pipeline for easy annotation of vessel elements. We compare\nthe performance of different neural network architectures and hyperparameters.\nOur proposed method performs similarly well to human experts. In the future,\nthis will improve controls on global wood fiber product flows to protect\nforests.\n","authors":["Lars Nieradzik","Jördis Sieburg-Rockel","Stephanie Helmling","Janis Keuper","Thomas Weibel","Andrea Olbrich","Henrike Stephani"],"pdf_url":"https://arxiv.org/pdf/2307.09588v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13069v1","updated":"2023-07-24T18:50:49Z","published":"2023-07-24T18:50:49Z","title":"General-Purpose Multi-Modal OOD Detection Framework","summary":" Out-of-distribution (OOD) detection identifies test samples that differ from\nthe training data, which is critical to ensuring the safety and reliability of\nmachine learning (ML) systems. While a plethora of methods have been developed\nto detect uni-modal OOD samples, only a few have focused on multi-modal OOD\ndetection. Current contrastive learning-based methods primarily study\nmulti-modal OOD detection in a scenario where both a given image and its\ncorresponding textual description come from a new domain. 
However, real-world\ndeployments of ML systems may face more anomaly scenarios caused by multiple\nfactors like sensor faults, bad weather, and environmental changes. Hence, the\ngoal of this work is to simultaneously detect from multiple different OOD\nscenarios in a fine-grained manner. To reach this goal, we propose a\ngeneral-purpose weakly-supervised OOD detection framework, called WOOD, that\ncombines a binary classifier and a contrastive learning component to reap the\nbenefits of both. In order to better distinguish the latent representations of\nin-distribution (ID) and OOD samples, we adopt the Hinge loss to constrain\ntheir similarity. Furthermore, we develop a new scoring metric to integrate the\nprediction results from both the binary classifier and contrastive learning for\nidentifying OOD samples. We evaluate the proposed WOOD model on multiple\nreal-world datasets, and the experimental results demonstrate that the WOOD\nmodel outperforms the state-of-the-art methods for multi-modal OOD detection.\nImportantly, our approach is able to achieve high accuracy in OOD detection in\nthree different OOD scenarios simultaneously. The source code will be made\npublicly available upon publication.\n","authors":["Viet Duong","Qiong Wu","Zhengyi Zhou","Eric Zavesky","Jiahe Chen","Xiangzhou Liu","Wen-Ling Hsu","Huajie Shao"],"pdf_url":"https://arxiv.org/pdf/2307.13069v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13060v1","updated":"2023-07-24T18:19:39Z","published":"2023-07-24T18:19:39Z","title":"On the characteristics of natural hydraulic dampers: An image-based\n approach to study the fluid flow behaviour inside the human meniscal tissue","summary":" The meniscal tissue is a layered material with varying properties influenced\nby collagen content and arrangement. Understanding the relationship between\nstructure and properties is crucial for disease management, treatment\ndevelopment, and biomaterial design. The internal layer of the meniscus is\nsofter and more deformable than the outer layers, thanks to interconnected\ncollagen channels that guide fluid flow. To investigate these relationships, we\npropose a novel approach that combines Computational Fluid Dynamics (CFD) with\nImage Analysis (CFD-IA). We analyze fluid flow in the internal architecture of\nthe human meniscus across a range of inlet velocities (0.1mm/s to 1.6m/s) using\nhigh-resolution 3D micro-computed tomography scans. Statistical correlations\nare observed between architectural parameters (tortuosity, connectivity,\nporosity, pore size) and fluid flow parameters (Re number distribution,\npermeability). Some channels exhibit Re values of 1400 at an inlet velocity of\n1.6m/s, and a transition from Darcy's regime to a non-Darcian regime occurs\naround an inlet velocity of 0.02m/s. Location-dependent permeability ranges\nfrom 20-32 Darcy. Regression modelling reveals a strong correlation between\nfluid velocity and tortuosity at high inlet velocities, as well as with channel\ndiameter at low inlet velocities. At higher inlet velocities, flow paths\ndeviate more from the preferential direction, resulting in a decrease in the\nconcentration parameter by an average of 0.4. This research provides valuable\ninsights into the fluid flow behaviour within the meniscus and its structural\ninfluences.\n","authors":["J. Waghorne","F. P. Bonomo","A. Rabbani","D. Bell","O. 
Barrera"],"pdf_url":"https://arxiv.org/pdf/2307.13060v1.pdf","comment":"20 Pages, 5 Figures"},{"id":"http://arxiv.org/abs/2307.02625v2","updated":"2023-07-24T18:16:38Z","published":"2023-07-05T19:56:50Z","title":"Retinex-based Image Denoising / Contrast Enhancement using Gradient\n Graph Laplacian Regularizer","summary":" Images captured in poorly lit conditions are often corrupted by acquisition\nnoise. Leveraging recent advances in graph-based regularization, we propose a\nfast Retinex-based restoration scheme that denoises and contrast-enhances an\nimage. Specifically, by Retinex theory we first assume that each image pixel is\na multiplication of its reflectance and illumination components. We next assume\nthat the reflectance and illumination components are piecewise constant (PWC)\nand continuous piecewise planar (PWP) signals, which can be recovered via graph\nLaplacian regularizer (GLR) and gradient graph Laplacian regularizer (GGLR)\nrespectively. We formulate quadratic objectives regularized by GLR and GGLR,\nwhich are minimized alternately until convergence by solving linear systems --\nwith improved condition numbers via proposed preconditioners -- via conjugate\ngradient (CG) efficiently. Experimental results show that our algorithm\nachieves competitive visual image quality while reducing computation complexity\nnoticeably.\n","authors":["Yeganeh Gharedaghi","Gene Cheung","Xianming Liu"],"pdf_url":"https://arxiv.org/pdf/2307.02625v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13011v1","updated":"2023-07-24T13:47:30Z","published":"2023-07-24T13:47:30Z","title":"Maximal Independent Sets for Pooling in Graph Neural Networks","summary":" Convolutional Neural Networks (CNNs) have enabled major advances in image\nclassification through convolution and pooling. In particular, image pooling\ntransforms a connected discrete lattice into a reduced lattice with the same\nconnectivity and allows reduction functions to consider all pixels in an image.\nHowever, there is no pooling that satisfies these properties for graphs. In\nfact, traditional graph pooling methods suffer from at least one of the\nfollowing drawbacks: Graph disconnection or overconnection, low decimation\nratio, and deletion of large parts of graphs. In this paper, we present three\npooling methods based on the notion of maximal independent sets that avoid\nthese pitfalls. Our experimental results confirm the relevance of maximal\nindependent set constraints for graph pooling.\n","authors":["Stevan Stanovic","Benoit Gaüzère","Luc Brun"],"pdf_url":"https://arxiv.org/pdf/2307.13011v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2307.09683v2","updated":"2023-07-24T15:41:03Z","published":"2023-07-18T23:35:53Z","title":"PubMed and Beyond: Recent Advances and Best Practices in Biomedical\n Literature Search","summary":" Biomedical research yields a wealth of information, much of which is only\naccessible through the literature. Consequently, literature search is an\nessential tool for building on prior knowledge in clinical and biomedical\nresearch. Although recent improvements in artificial intelligence have expanded\nfunctionality beyond keyword-based search, these advances may be unfamiliar to\nclinicians and researchers. In response, we present a survey of literature\nsearch tools tailored to both general and specific information needs in\nbiomedicine, with the objective of helping readers efficiently fulfill their\ninformation needs. 
We first examine the widely used PubMed search engine,\ndiscussing recent improvements and continued challenges. We then describe\nliterature search tools catering to five specific information needs: 1.\nIdentifying high-quality clinical research for evidence-based medicine. 2.\nRetrieving gene-related information for precision medicine and genomics. 3.\nSearching by meaning, including natural language questions. 4. Locating related\narticles with literature recommendation. 5. Mining literature to discover\nassociations between concepts such as diseases and genetic variants.\nAdditionally, we cover practical considerations and best practices for choosing\nand using these tools. Finally, we provide a perspective on the future of\nliterature search engines, considering recent breakthroughs in large language\nmodels such as ChatGPT. In summary, our survey provides a comprehensive view of\nbiomedical literature search functionalities with 36 publicly available tools.\n","authors":["Qiao Jin","Robert Leaman","Zhiyong Lu"],"pdf_url":"https://arxiv.org/pdf/2307.09683v2.pdf","comment":"27 pages, 6 figures, 36 tools"},{"id":"http://arxiv.org/abs/2307.12810v1","updated":"2023-07-24T14:00:07Z","published":"2023-07-24T14:00:07Z","title":"HeteFedRec: Federated Recommender Systems with Model Heterogeneity","summary":" Owing to the nature of privacy protection, federated recommender systems\n(FedRecs) have garnered increasing interest in the realm of on-device\nrecommender systems. However, most existing FedRecs only allow participating\nclients to collaboratively train a recommendation model of the same public\nparameter size. Training a model of the same size for all clients can lead to\nsuboptimal performance since clients possess varying resources. For example,\nclients with limited training data may prefer to train a smaller recommendation\nmodel to avoid excessive data consumption, while clients with sufficient data\nwould benefit from a larger model to achieve higher recommendation accuracy. To\naddress the above challenge, this paper introduces HeteFedRec, a novel FedRec\nframework that enables the assignment of personalized model sizes to\nparticipants. In HeteFedRec, we present a heterogeneous recommendation model\naggregation strategy, including a unified dual-task learning mechanism and a\ndimensional decorrelation regularization, to allow knowledge aggregation among\nrecommender models of different sizes. Additionally, a relation-based ensemble\nknowledge distillation method is proposed to effectively distil knowledge from\nheterogeneous item embeddings. Extensive experiments conducted on three\nreal-world recommendation datasets demonstrate the effectiveness and efficiency\nof HeteFedRec in training federated recommender systems under heterogeneous\nsettings.\n","authors":["Wei Yuan","Liang Qu","Lizhen Cui","Yongxin Tong","Xiaofang Zhou","Hongzhi Yin"],"pdf_url":"https://arxiv.org/pdf/2307.12810v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12798v1","updated":"2023-07-24T13:51:19Z","published":"2023-07-24T13:51:19Z","title":"RRAML: Reinforced Retrieval Augmented Machine Learning","summary":" The emergence of large language models (LLMs) has revolutionized machine\nlearning and related fields, showcasing remarkable abilities in comprehending,\ngenerating, and manipulating human language. However, their conventional usage\nthrough API-based text prompt submissions imposes certain limitations in terms\nof context constraints and external source availability. 
To address these\nchallenges, we propose a novel framework called Reinforced Retrieval Augmented\nMachine Learning (RRAML). RRAML integrates the reasoning capabilities of LLMs\nwith supporting information retrieved by a purpose-built retriever from a vast\nuser-provided database. By leveraging recent advancements in reinforcement\nlearning, our method effectively addresses several critical challenges.\nFirstly, it circumvents the need for accessing LLM gradients. Secondly, our\nmethod alleviates the burden of retraining LLMs for specific tasks, as it is\noften impractical or impossible due to restricted access to the model and the\ncomputational intensity involved. Additionally we seamlessly link the\nretriever's task with the reasoner, mitigating hallucinations and reducing\nirrelevant, and potentially damaging retrieved documents. We believe that the\nresearch agenda outlined in this paper has the potential to profoundly impact\nthe field of AI, democratizing access to and utilization of LLMs for a wide\nrange of entities.\n","authors":["Andrea Bacciu","Florin Cocunasu","Federico Siciliano","Fabrizio Silvestri","Nicola Tonellotto","Giovanni Trappolini"],"pdf_url":"https://arxiv.org/pdf/2307.12798v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12756v1","updated":"2023-07-24T12:58:47Z","published":"2023-07-24T12:58:47Z","title":"Unbiased Delayed Feedback Label Correction for Conversion Rate\n Prediction","summary":" Conversion rate prediction is critical to many online applications such as\ndigital display advertising. To capture dynamic data distribution, industrial\nsystems often require retraining models on recent data daily or weekly.\nHowever, the delay of conversion behavior usually leads to incorrect labeling,\nwhich is called delayed feedback problem. Existing work may fail to introduce\nthe correct information about false negative samples due to data sparsity and\ndynamic data distribution. To directly introduce the correct feedback label\ninformation, we propose an Unbiased delayed feedback Label Correction framework\n(ULC), which uses an auxiliary model to correct labels for observed negative\nfeedback samples. Firstly, we theoretically prove that the label-corrected loss\nis an unbiased estimate of the oracle loss using true labels. Then, as there\nare no ready training data for label correction, counterfactual labeling is\nused to construct artificial training data. Furthermore, since counterfactual\nlabeling utilizes only partial training data, we design an embedding-based\nalternative training method to enhance performance. Comparative experiments on\nboth public and private datasets and detailed analyses show that our proposed\napproach effectively alleviates the delayed feedback problem and consistently\noutperforms the previous state-of-the-art methods.\n","authors":["Yifan Wang","Peijie Sun","Min Zhang","Qinglin Jia","Jingjie Li","Shaoping Ma"],"pdf_url":"https://arxiv.org/pdf/2307.12756v1.pdf","comment":"accepted by KDD 2023"},{"id":"http://arxiv.org/abs/2307.12576v1","updated":"2023-07-24T07:47:21Z","published":"2023-07-24T07:47:21Z","title":"Self-refining of Pseudo Labels for Music Source Separation with Noisy\n Labeled Data","summary":" Music source separation (MSS) faces challenges due to the limited\navailability of correctly-labeled individual instrument tracks. With the push\nto acquire larger datasets to improve MSS performance, the inevitability of\nencountering mislabeled individual instrument tracks becomes a significant\nchallenge to address. 
This paper introduces an automated technique for refining\nthe labels in a partially mislabeled dataset. Our proposed self-refining\ntechnique, employed with a noisy-labeled dataset, results in only a 1% accuracy\ndegradation in multi-label instrument recognition compared to a classifier\ntrained on a clean-labeled dataset. The study demonstrates the importance of\nrefining noisy-labeled data in MSS model training and shows that utilizing the\nrefined dataset leads to comparable results derived from a clean-labeled\ndataset. Notably, upon only access to a noisy dataset, MSS models trained on a\nself-refined dataset even outperform those trained on a dataset refined with a\nclassifier trained on clean labels.\n","authors":["Junghyun Koo","Yunkee Chae","Chang-Bin Jeon","Kyogu Lee"],"pdf_url":"https://arxiv.org/pdf/2307.12576v1.pdf","comment":"24th International Society for Music Information Retrieval Conference\n (ISMIR 2023)"},{"id":"http://arxiv.org/abs/2307.10617v3","updated":"2023-07-24T07:03:01Z","published":"2023-07-20T06:35:43Z","title":"Unmasking Falsehoods in Reviews: An Exploration of NLP Techniques","summary":" In the contemporary digital landscape, online reviews have become an\nindispensable tool for promoting products and services across various\nbusinesses. Marketers, advertisers, and online businesses have found incentives\nto create deceptive positive reviews for their products and negative reviews\nfor their competitors' offerings. As a result, the writing of deceptive reviews\nhas become an unavoidable practice for businesses seeking to promote themselves\nor undermine their rivals. Detecting such deceptive reviews has become an\nintense and ongoing area of research. This research paper proposes a machine\nlearning model to identify deceptive reviews, with a particular focus on\nrestaurants. This study delves into the performance of numerous experiments\nconducted on a dataset of restaurant reviews known as the Deceptive Opinion\nSpam Corpus. To accomplish this, an n-gram model and max features are developed\nto effectively identify deceptive content, particularly focusing on fake\nreviews. A benchmark study is undertaken to explore the performance of two\ndifferent feature extraction techniques, which are then coupled with five\ndistinct machine learning classification algorithms. The experimental results\nreveal that the passive aggressive classifier stands out among the various\nalgorithms, showcasing the highest accuracy not only in text classification but\nalso in identifying fake reviews. Moreover, the research delves into data\naugmentation and implements various deep learning techniques to further enhance\nthe process of detecting deceptive reviews. 
The findings shed light on the\nefficacy of the proposed machine learning approach and offer valuable insights\ninto dealing with deceptive reviews in the realm of online businesses.\n","authors":["Anusuya Baby Hari Krishnan"],"pdf_url":"https://arxiv.org/pdf/2307.10617v3.pdf","comment":"6 pages, 3 figures"},{"id":"http://arxiv.org/abs/2307.12518v1","updated":"2023-07-24T04:23:08Z","published":"2023-07-24T04:23:08Z","title":"FaFCNN: A General Disease Classification Framework Based on Feature\n Fusion Neural Networks","summary":" There are two fundamental problems in applying deep learning/machine learning\nmethods to disease classification tasks, one is the insufficient number and\npoor quality of training samples; another one is how to effectively fuse\nmultiple source features and thus train robust classification models. To\naddress these problems, inspired by the process of human learning knowledge, we\npropose the Feature-aware Fusion Correlation Neural Network (FaFCNN), which\nintroduces a feature-aware interaction module and a feature alignment module\nbased on domain adversarial learning. This is a general framework for disease\nclassification, and FaFCNN improves the way existing methods obtain sample\ncorrelation features. The experimental results show that training using\naugmented features obtained by pre-training gradient boosting decision tree\nyields more performance gains than random-forest based methods. On the\nlow-quality dataset with a large amount of missing data in our setup, FaFCNN\nobtains a consistently optimal performance compared to competitive baselines.\nIn addition, extensive experiments demonstrate the robustness of the proposed\nmethod and the effectiveness of each component of the model\\footnote{Accepted\nin IEEE SMC2023}.\n","authors":["Menglin Kong","Shaojie Zhao","Juan Cheng","Xingquan Li","Ri Su","Muzhou Hou","Cong Cao"],"pdf_url":"https://arxiv.org/pdf/2307.12518v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13165v1","updated":"2023-07-24T23:26:46Z","published":"2023-07-24T23:26:46Z","title":"Investigating the Robustness of Sequential Recommender Systems Against\n Training Data Perturbations: an Empirical Study","summary":" Sequential Recommender Systems (SRSs) have been widely used to model user\nbehavior over time, but their robustness in the face of perturbations to\ntraining data is a critical issue. In this paper, we conduct an empirical study\nto investigate the effects of removing items at different positions within a\ntemporally ordered sequence. We evaluate two different SRS models on multiple\ndatasets, measuring their performance using Normalized Discounted Cumulative\nGain (NDCG) and Rank Sensitivity List metrics. Our results demonstrate that\nremoving items at the end of the sequence significantly impacts performance,\nwith NDCG decreasing up to 60\\%, while removing items from the beginning or\nmiddle has no significant effect. 
These findings highlight the importance of\nconsidering the position of the perturbed items in the training data and shall\ninform the design of more robust SRSs.\n","authors":["Filippo Betello","Federico Siciliano","Pushkar Mishra","Fabrizio Silvestri"],"pdf_url":"https://arxiv.org/pdf/2307.13165v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.15498v2","updated":"2023-07-24T20:08:20Z","published":"2021-06-29T15:25:33Z","title":"Classification of Consumer Belief Statements From Social Media","summary":" Social media offer plenty of information to perform market research in order\nto meet the requirements of customers. One way how this research is conducted\nis that a domain expert gathers and categorizes user-generated content into a\ncomplex and fine-grained class structure. In many of such cases, little data\nmeets complex annotations. It is not yet fully understood how this can be\nleveraged successfully for classification. We examine the classification\naccuracy of expert labels when used with a) many fine-grained classes and b)\nfew abstract classes. For scenario b) we compare abstract class labels given by\nthe domain expert as baseline and by automatic hierarchical clustering. We\ncompare this to another baseline where the entire class structure is given by a\ncompletely unsupervised clustering approach. By doing so, this work can serve\nas an example of how complex expert annotations are potentially beneficial and\ncan be utilized in the most optimal way for opinion mining in highly specific\ndomains. By exploring across a range of techniques and experiments, we find\nthat automated class abstraction approaches in particular the unsupervised\napproach performs remarkably well against domain expert baseline on text\nclassification tasks. This has the potential to inspire opinion mining\napplications in order to support market researchers in practice and to inspire\nfine-grained automated content analysis on a large scale.\n","authors":["Gerhard Johann Hagerer","Wenbin Le","Hannah Danner","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2106.15498v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2111.02259v3","updated":"2023-07-24T20:03:14Z","published":"2021-11-03T14:49:50Z","title":"A Case Study and Qualitative Analysis of Simple Cross-Lingual Opinion\n Mining","summary":" User-generated content from social media is produced in many languages,\nmaking it technically challenging to compare the discussed themes from one\ndomain across different cultures and regions. It is relevant for domains in a\nglobalized world, such as market research, where people from two nations and\nmarkets might have different requirements for a product. We propose a simple,\nmodern, and effective method for building a single topic model with sentiment\nanalysis capable of covering multiple languages simultanteously, based on a\npre-trained state-of-the-art deep neural network for natural language\nunderstanding. To demonstrate its feasibility, we apply the model to newspaper\narticles and user comments of a specific domain, i.e., organic food products\nand related consumption behavior. 
The themes match across languages.\nAdditionally, we obtain an high proportion of stable and domain-relevant\ntopics, a meaningful relation between topics and their respective textual\ncontents, and an interpretable representation for social media documents.\nMarketing can potentially benefit from our method, since it provides an\neasy-to-use means of addressing specific customer interests from different\nmarket regions around the globe. For reproducibility, we provide the code,\ndata, and results of our study.\n","authors":["Gerhard Johann Hagerer","Wing Sheung Leung","Qiaoxi Liu","Hannah Danner","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2111.02259v3.pdf","comment":"10 pages, 2 tables, 5 figures, full paper, peer-reviewed, published\n at KDIR/IC3k 2021 conference"},{"id":"http://arxiv.org/abs/2304.04759v2","updated":"2023-07-24T18:10:09Z","published":"2023-04-07T23:10:39Z","title":"Similarity search in the blink of an eye with compressed indices","summary":" Nowadays, data is represented by vectors. Retrieving those vectors, among\nmillions and billions, that are similar to a given query is a ubiquitous\nproblem, known as similarity search, of relevance for a wide range of\napplications. Graph-based indices are currently the best performing techniques\nfor billion-scale similarity search. However, their random-access memory\npattern presents challenges to realize their full potential. In this work, we\npresent new techniques and systems for creating faster and smaller graph-based\nindices. To this end, we introduce a novel vector compression method,\nLocally-adaptive Vector Quantization (LVQ), that uses per-vector scaling and\nscalar quantization to improve search performance with fast similarity\ncomputations and a reduced effective bandwidth, while decreasing memory\nfootprint and barely impacting accuracy. LVQ, when combined with a new\nhigh-performance computing system for graph-based similarity search,\nestablishes the new state of the art in terms of performance and memory\nfootprint. For billions of vectors, LVQ outcompetes the second-best\nalternatives: (1) in the low-memory regime, by up to 20.7x in throughput with\nup to a 3x memory footprint reduction, and (2) in the high-throughput regime by\n5.8x with 1.4x less memory.\n","authors":["Cecilia Aguerrebere","Ishwar Bhati","Mark Hildebrand","Mariano Tepper","Ted Willke"],"pdf_url":"https://arxiv.org/pdf/2304.04759v2.pdf","comment":"VLDB 2023"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2307.12983v1","updated":"2023-07-24T17:59:37Z","published":"2023-07-24T17:59:37Z","title":"Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under\n Massively Parallel Simulation","summary":" Reinforcement learning is time-consuming for complex tasks due to the need\nfor large amounts of training data. Recent advances in GPU-based simulation,\nsuch as Isaac Gym, have sped up data collection thousands of times on a\ncommodity GPU. Most prior works used on-policy methods like PPO due to their\nsimplicity and ease of scaling. Off-policy methods are more data efficient but\nchallenging to scale, resulting in a longer wall-clock training time. This\npaper presents a Parallel $Q$-Learning (PQL) scheme that outperforms PPO in\nwall-clock time while maintaining superior sample efficiency of off-policy\nlearning. PQL achieves this by parallelizing data collection, policy learning,\nand value learning. 
Different from prior works on distributed off-policy\nlearning, such as Apex, our scheme is designed specifically for massively\nparallel GPU-based simulation and optimized to work on a single workstation. In\nexperiments, we demonstrate that $Q$-learning can be scaled to \\textit{tens of\nthousands of parallel environments} and investigate important factors affecting\nlearning speed. The code is available at https://github.com/Improbable-AI/pql.\n","authors":["Zechu Li","Tao Chen","Zhang-Wei Hong","Anurag Ajay","Pulkit Agrawal"],"pdf_url":"https://arxiv.org/pdf/2307.12983v1.pdf","comment":"Accepted by ICML 2023"},{"id":"http://arxiv.org/abs/2307.12981v1","updated":"2023-07-24T17:59:02Z","published":"2023-07-24T17:59:02Z","title":"3D-LLM: Injecting the 3D World into Large Language Models","summary":" Large language models (LLMs) and Vision-Language Models (VLMs) have been\nproven to excel at multiple tasks, such as commonsense reasoning. Powerful as\nthese models can be, they are not grounded in the 3D physical world, which\ninvolves richer concepts such as spatial relationships, affordances, physics,\nlayout, and so on. In this work, we propose to inject the 3D world into large\nlanguage models and introduce a whole new family of 3D-LLMs. Specifically,\n3D-LLMs can take 3D point clouds and their features as input and perform a\ndiverse set of 3D-related tasks, including captioning, dense captioning, 3D\nquestion answering, task decomposition, 3D grounding, 3D-assisted dialog,\nnavigation, and so on. Using three types of prompting mechanisms that we\ndesign, we are able to collect over 300k 3D-language data covering these tasks.\nTo efficiently train 3D-LLMs, we first utilize a 3D feature extractor that\nobtains 3D features from rendered multi- view images. Then, we use 2D VLMs as\nour backbones to train our 3D-LLMs. By introducing a 3D localization mechanism,\n3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show\nthat our model outperforms state-of-the-art baselines by a large margin (e.g.,\nthe BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore,\nexperiments on our held-in datasets for 3D captioning, task composition, and\n3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative\nexamples also show that our model could perform more tasks beyond the scope of\nexisting LLMs and VLMs. Project Page: : https://vis-www.cs.umass.edu/3dllm/.\n","authors":["Yining Hong","Haoyu Zhen","Peihao Chen","Shuhong Zheng","Yilun Du","Zhenfang Chen","Chuang Gan"],"pdf_url":"https://arxiv.org/pdf/2307.12981v1.pdf","comment":"Project Page: : https://vis-www.cs.umass.edu/3dllm/"},{"id":"http://arxiv.org/abs/2303.06147v2","updated":"2023-07-24T17:58:45Z","published":"2023-03-10T18:59:57Z","title":"Exphormer: Sparse Transformers for Graphs","summary":" Graph transformers have emerged as a promising architecture for a variety of\ngraph learning and representation tasks. Despite their successes, though, it\nremains challenging to scale graph transformers to large graphs while\nmaintaining accuracy competitive with message-passing networks. In this paper,\nwe introduce Exphormer, a framework for building powerful and scalable graph\ntransformers. 
Exphormer consists of a sparse attention mechanism based on two\nmechanisms: virtual global nodes and expander graphs, whose mathematical\ncharacteristics, such as spectral expansion, pseduorandomness, and sparsity,\nyield graph transformers with complexity only linear in the size of the graph,\nwhile allowing us to prove desirable theoretical properties of the resulting\ntransformer models. We show that incorporating Exphormer into the\nrecently-proposed GraphGPS framework produces models with competitive empirical\nresults on a wide variety of graph datasets, including state-of-the-art results\non three datasets. We also show that Exphormer can scale to datasets on larger\ngraphs than shown in previous graph transformer architectures. Code can be\nfound at \\url{https://github.com/hamed1375/Exphormer}.\n","authors":["Hamed Shirzad","Ameya Velingker","Balaji Venkatachalam","Danica J. Sutherland","Ali Kemal Sinop"],"pdf_url":"https://arxiv.org/pdf/2303.06147v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.05407v3","updated":"2023-07-24T17:58:31Z","published":"2022-09-12T16:59:36Z","title":"Segmenting Known Objects and Unseen Unknowns without Prior Knowledge","summary":" Panoptic segmentation methods assign a known class to each pixel given in\ninput. Even for state-of-the-art approaches, this inevitably enforces decisions\nthat systematically lead to wrong predictions for objects outside the training\ncategories. However, robustness against out-of-distribution samples and corner\ncases is crucial in safety-critical settings to avoid dangerous consequences.\nSince real-world datasets cannot contain enough data points to adequately\nsample the long tail of the underlying distribution, models must be able to\ndeal with unseen and unknown scenarios as well. Previous methods targeted this\nby re-identifying already-seen unlabeled objects. In this work, we propose the\nnecessary step to extend segmentation with a new setting which we term holistic\nsegmentation. Holistic segmentation aims to identify and separate objects of\nunseen unknown categories into instances, without any prior knowledge about\nthem, while performing panoptic segmentation of known classes. We tackle this\nnew problem with U3HS, which finds unknowns as highly uncertain regions and\nclusters their corresponding instance-aware embeddings into individual objects.\nBy doing so, for the first time in panoptic segmentation with unknown objects,\nour U3HS is trained without unknown categories, reducing assumptions and\nleaving the settings as unconstrained as in real-life scenarios. Extensive\nexperiments on public data from MS COCO, Cityscapes, and Lost&Found demonstrate\nthe effectiveness of U3HS for this new, challenging, and assumptions-free\nsetting called holistic segmentation.\n","authors":["Stefano Gasperini","Alvaro Marcos-Ramiro","Michael Schmidt","Nassir Navab","Benjamin Busam","Federico Tombari"],"pdf_url":"https://arxiv.org/pdf/2209.05407v3.pdf","comment":"Accepted at ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12979v1","updated":"2023-07-24T17:56:58Z","published":"2023-07-24T17:56:58Z","title":"An Isometric Stochastic Optimizer","summary":" The Adam optimizer is the standard choice in deep learning applications. I\npropose a simple explanation of Adam's success: it makes each parameter's step\nsize independent of the norms of the other parameters. 
Based on this principle\nI derive Iso, a new optimizer which makes the norm of a parameter's update\ninvariant to the application of any linear transformation to its inputs and\noutputs. I develop a variant of Iso called IsoAdam that allows optimal\nhyperparameters to be transferred from Adam, and demonstrate that IsoAdam\nobtains a speedup over Adam when training a small Transformer.\n","authors":["Jacob Jackson"],"pdf_url":"https://arxiv.org/pdf/2307.12979v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12975v1","updated":"2023-07-24T17:50:24Z","published":"2023-07-24T17:50:24Z","title":"Provable Benefits of Policy Learning from Human Preferences in\n Contextual Bandit Problems","summary":" A crucial task in decision-making problems is reward engineering. It is\ncommon in practice that no obvious choice of reward function exists. Thus, a\npopular approach is to introduce human feedback during training and leverage\nsuch feedback to learn a reward function. Among all policy learning methods\nthat use human feedback, preference-based methods have demonstrated substantial\nsuccess in recent empirical applications such as InstructGPT. In this work, we\ndevelop a theory that provably shows the benefits of preference-based methods\nin offline contextual bandits. In particular, we improve the modeling and\nsuboptimality analysis for running policy learning methods on human-scored\nsamples directly. Then, we compare it with the suboptimality guarantees of\npreference-based methods and show that preference-based methods enjoy lower\nsuboptimality.\n","authors":["Xiang Ji","Huazheng Wang","Minshuo Chen","Tuo Zhao","Mengdi Wang"],"pdf_url":"https://arxiv.org/pdf/2307.12975v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12971v1","updated":"2023-07-24T17:49:05Z","published":"2023-07-24T17:49:05Z","title":"Big Data - Supply Chain Management Framework for Forecasting: Data\n Preprocessing and Machine Learning Techniques","summary":" This article intends to systematically identify and comparatively analyze\nstate-of-the-art supply chain (SC) forecasting strategies and technologies. A\nnovel framework has been proposed incorporating Big Data Analytics in SC\nManagement (problem identification, data sources, exploratory data analysis,\nmachine-learning model training, hyperparameter tuning, performance evaluation,\nand optimization), forecasting effects on human-workforce, inventory, and\noverall SC. Initially, the need to collect data according to SC strategy and\nhow to collect them has been discussed. The article discusses the need for\ndifferent types of forecasting according to the period or SC objective. The SC\nKPIs and the error-measurement systems have been recommended to optimize the\ntop-performing model. The adverse effects of phantom inventory on forecasting\nand the dependence of managerial decisions on the SC KPIs for determining model\nperformance parameters and improving operations management, transparency, and\nplanning efficiency have been illustrated. The cyclic connection within the\nframework introduces preprocessing optimization based on the post-process KPIs,\noptimizing the overall control process (inventory management, workforce\ndetermination, cost, production and capacity planning). 
The contribution of\nthis research lies in the standard SC process framework proposal, recommended\nforecasting data analysis, forecasting effects on SC performance, machine\nlearning algorithms optimization followed, and in shedding light on future\nresearch.\n","authors":["Md Abrar Jahin","Md Sakib Hossain Shovon","Jungpil Shin","Istiyaque Ahmed Ridoy","Yoichi Tomioka","M. F. Mridha"],"pdf_url":"https://arxiv.org/pdf/2307.12971v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12968v1","updated":"2023-07-24T17:46:32Z","published":"2023-07-24T17:46:32Z","title":"A Connection between One-Step Regularization and Critic Regularization\n in Reinforcement Learning","summary":" As with any machine learning problem with limited data, effective offline RL\nalgorithms require careful regularization to avoid overfitting. One-step\nmethods perform regularization by doing just a single step of policy\nimprovement, while critic regularization methods do many steps of policy\nimprovement with a regularized objective. These methods appear distinct.\nOne-step methods, such as advantage-weighted regression and conditional\nbehavioral cloning, truncate policy iteration after just one step. This ``early\nstopping'' makes one-step RL simple and stable, but can limit its asymptotic\nperformance. Critic regularization typically requires more compute but has\nappealing lower-bound guarantees. In this paper, we draw a close connection\nbetween these methods: applying a multi-step critic regularization method with\na regularization coefficient of 1 yields the same policy as one-step RL. While\npractical implementations violate our assumptions and critic regularization is\ntypically applied with smaller regularization coefficients, our experiments\nnevertheless show that our analysis makes accurate, testable predictions about\npractical offline RL methods (CQL and one-step RL) with commonly-used\nhyperparameters. Our results that every problem can be solved with a single\nstep of policy improvement, but rather that one-step RL might be competitive\nwith critic regularization on RL problems that demand strong regularization.\n","authors":["Benjamin Eysenbach","Matthieu Geist","Sergey Levine","Ruslan Salakhutdinov"],"pdf_url":"https://arxiv.org/pdf/2307.12968v1.pdf","comment":"Accepted to ICML 2023. Video\n (https://www.youtube.com/watch?v=1xlixIHZ0R4) and code\n (https://github.com/ben-eysenbach/ac-connection)"},{"id":"http://arxiv.org/abs/2307.12967v1","updated":"2023-07-24T17:45:40Z","published":"2023-07-24T17:45:40Z","title":"Learning Dense Correspondences between Photos and Sketches","summary":" Humans effortlessly grasp the connection between sketches and real-world\nobjects, even when these sketches are far from realistic. Moreover, human\nsketch understanding goes beyond categorization -- critically, it also entails\nunderstanding how individual elements within a sketch correspond to parts of\nthe physical world it represents. What are the computational ingredients needed\nto support this ability? Towards answering this question, we make two\ncontributions: first, we introduce a new sketch-photo correspondence benchmark,\n$\\textit{PSC6k}$, containing 150K annotations of 6250 sketch-photo pairs across\n125 object categories, augmenting the existing Sketchy dataset with\nfine-grained correspondence metadata. Second, we propose a self-supervised\nmethod for learning dense correspondences between sketch-photo pairs, building\nupon recent advances in correspondence learning for pairs of photos. 
Our model\nuses a spatial transformer network to estimate the warp flow between latent\nrepresentations of a sketch and photo extracted by a contrastive learning-based\nConvNet backbone. We found that this approach outperformed several strong\nbaselines and produced predictions that were quantitatively consistent with\nother warp-based methods. However, our benchmark also revealed systematic\ndifferences between predictions of the suite of models we tested and those of\nhumans. Taken together, our work suggests a promising path towards developing\nartificial systems that achieve more human-like understanding of visual images\nat different levels of abstraction. Project page:\nhttps://photo-sketch-correspondence.github.io\n","authors":["Xuanchen Lu","Xiaolong Wang","Judith E Fan"],"pdf_url":"https://arxiv.org/pdf/2307.12967v1.pdf","comment":"Accepted to ICML 2023. Project page:\n https://photo-sketch-correspondence.github.io"},{"id":"http://arxiv.org/abs/2303.04245v2","updated":"2023-07-24T17:29:04Z","published":"2023-03-07T21:42:17Z","title":"How Do Transformers Learn Topic Structure: Towards a Mechanistic\n Understanding","summary":" While the successes of transformers across many domains are indisputable,\naccurate understanding of the learning mechanics is still largely lacking.\nTheir capabilities have been probed on benchmarks which include a variety of\nstructured and reasoning tasks -- but mathematical understanding is lagging\nsubstantially behind. Recent lines of work have begun studying representational\naspects of this question: that is, the size/depth/complexity of attention-based\nnetworks to perform certain tasks. However, there is no guarantee the learning\ndynamics will converge to the constructions proposed. In our paper, we provide\nfine-grained mechanistic understanding of how transformers learn \"semantic\nstructure\", understood as capturing co-occurrence structure of words.\nPrecisely, we show, through a combination of mathematical analysis and\nexperiments on Wikipedia data and synthetic data modeled by Latent Dirichlet\nAllocation (LDA), that the embedding layer and the self-attention layer encode\nthe topical structure. In the former case, this manifests as higher average\ninner product of embeddings between same-topic words. In the latter, it\nmanifests as higher average pairwise attention between same-topic words. The\nmathematical results involve several assumptions to make the analysis\ntractable, which we verify on data, and might be of independent interest as\nwell.\n","authors":["Yuchen Li","Yuanzhi Li","Andrej Risteski"],"pdf_url":"https://arxiv.org/pdf/2303.04245v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12943v1","updated":"2023-07-24T17:15:38Z","published":"2023-07-24T17:15:38Z","title":"Efficiently Sampling the PSD Cone with the Metric Dikin Walk","summary":" Semi-definite programs represent a frontier of efficient computation. While\nthere has been much progress on semi-definite optimization, with moderate-sized\ninstances currently solvable in practice by the interior-point method, the\nbasic problem of sampling semi-definite solutions remains a formidable\nchallenge. The direct application of known polynomial-time algorithms for\nsampling general convex bodies to semi-definite sampling leads to a\nprohibitively high running time. In addition, known general methods require an\nexpensive rounding phase as pre-processing. 
Here we analyze the Dikin walk, by\nfirst adapting it to general metrics, then devising suitable metrics for the\nPSD cone with affine constraints. The resulting mixing time and per-step\ncomplexity are considerably smaller, and by an appropriate choice of the\nmetric, the dependence on the number of constraints can be made\npolylogarithmic. We introduce a refined notion of self-concordant matrix\nfunctions and give rules for combining different metrics. Along the way, we\nfurther develop the theory of interior-point methods for sampling.\n","authors":["Yunbum Kook","Santosh S. Vempala"],"pdf_url":"https://arxiv.org/pdf/2307.12943v1.pdf","comment":"54 pages"},{"id":"http://arxiv.org/abs/2307.12941v1","updated":"2023-07-24T17:11:39Z","published":"2023-07-24T17:11:39Z","title":"On Privileged and Convergent Bases in Neural Network Representations","summary":" In this study, we investigate whether the representations learned by neural\nnetworks possess a privileged and convergent basis. Specifically, we examine\nthe significance of feature directions represented by individual neurons.\nFirst, we establish that arbitrary rotations of neural representations cannot\nbe inverted (unlike linear networks), indicating that they do not exhibit\ncomplete rotational invariance. Subsequently, we explore the possibility of\nmultiple bases achieving identical performance. To do this, we compare the\nbases of networks trained with the same parameters but with varying random\ninitializations. Our study reveals two findings: (1) Even in wide networks such\nas WideResNets, neural networks do not converge to a unique basis; (2) Basis\ncorrelation increases significantly when a few early layers of the network are\nfrozen identically.\n Furthermore, we analyze Linear Mode Connectivity, which has been studied as a\nmeasure of basis correlation. Our findings give evidence that while Linear Mode\nConnectivity improves with increased network width, this improvement is not due\nto an increase in basis correlation.\n","authors":["Davis Brown","Nikhil Vyas","Yamini Bansal"],"pdf_url":"https://arxiv.org/pdf/2307.12941v1.pdf","comment":"In the Workshop on High-dimensional Learning Dynamics at ICML 2023"},{"id":"http://arxiv.org/abs/2307.08572v3","updated":"2023-07-24T17:01:50Z","published":"2023-07-17T15:38:11Z","title":"Revisiting the Robustness of the Minimum Error Entropy Criterion: A\n Transfer Learning Case Study","summary":" Coping with distributional shifts is an important part of transfer learning\nmethods in order to perform well in real-life tasks. However, most of the\nexisting approaches in this area either focus on an ideal scenario in which the\ndata does not contain noises or employ a complicated training paradigm or model\ndesign to deal with distributional shifts. In this paper, we revisit the\nrobustness of the minimum error entropy (MEE) criterion, a widely used\nobjective in statistical signal processing to deal with non-Gaussian noises,\nand investigate its feasibility and usefulness in real-life transfer learning\nregression tasks, where distributional shifts are common. Specifically, we put\nforward a new theoretical result showing the robustness of MEE against\ncovariate shift. We also show that by simply replacing the mean squared error\n(MSE) loss with the MEE on basic transfer learning algorithms such as\nfine-tuning and linear probing, we can achieve competitive performance with\nrespect to state-of-the-art transfer learning algorithms. 
We justify our\narguments on both synthetic data and 5 real-world time-series data.\n","authors":["Luis Pedro Silvestrin","Shujian Yu","Mark Hoogendoorn"],"pdf_url":"https://arxiv.org/pdf/2307.08572v3.pdf","comment":"Manuscript accepted at ECAI-23. Code available at\n https://github.com/lpsilvestrin/mee-finetune"},{"id":"http://arxiv.org/abs/2307.12926v1","updated":"2023-07-24T16:36:04Z","published":"2023-07-24T16:36:04Z","title":"Contextual Bandits and Imitation Learning via Preference-Based Active\n Queries","summary":" We consider the problem of contextual bandits and imitation learning, where\nthe learner lacks direct knowledge of the executed action's reward. Instead,\nthe learner can actively query an expert at each round to compare two actions\nand receive noisy preference feedback. The learner's objective is two-fold: to\nminimize the regret associated with the executed actions, while simultaneously,\nminimizing the number of comparison queries made to the expert. In this paper,\nwe assume that the learner has access to a function class that can represent\nthe expert's preference model under appropriate link functions, and provide an\nalgorithm that leverages an online regression oracle with respect to this\nfunction class for choosing its actions and deciding when to query. For the\ncontextual bandit setting, our algorithm achieves a regret bound that combines\nthe best of both worlds, scaling as $O(\\min\\{\\sqrt{T}, d/\\Delta\\})$, where $T$\nrepresents the number of interactions, $d$ represents the eluder dimension of\nthe function class, and $\\Delta$ represents the minimum preference of the\noptimal action over any suboptimal action under all contexts. Our algorithm\ndoes not require the knowledge of $\\Delta$, and the obtained regret bound is\ncomparable to what can be achieved in the standard contextual bandits setting\nwhere the learner observes reward signals at each round. Additionally, our\nalgorithm makes only $O(\\min\\{T, d^2/\\Delta^2\\})$ queries to the expert. We\nthen extend our algorithm to the imitation learning setting, where the learning\nagent engages with an unknown environment in episodes of length $H$ each, and\nprovide similar guarantees for regret and query complexity. Interestingly, our\nalgorithm for imitation learning can even learn to outperform the underlying\nexpert, when it is suboptimal, highlighting a practical benefit of\npreference-based feedback in imitation learning.\n","authors":["Ayush Sekhari","Karthik Sridharan","Wen Sun","Runzhe Wu"],"pdf_url":"https://arxiv.org/pdf/2307.12926v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.12231v2","updated":"2023-07-24T16:00:37Z","published":"2023-04-24T16:18:22Z","title":"An Approximation Theory for Metric Space-Valued Functions With A View\n Towards Deep Learning","summary":" Motivated by the developing mathematics of deep learning, we build universal\nfunctions approximators of continuous maps between arbitrary Polish metric\nspaces $\\mathcal{X}$ and $\\mathcal{Y}$ using elementary functions between\nEuclidean spaces as building blocks. Earlier results assume that the target\nspace $\\mathcal{Y}$ is a topological vector space. We overcome this limitation\nby ``randomization'': our approximators output discrete probability measures\nover $\\mathcal{Y}$. 
When $\\mathcal{X}$ and $\\mathcal{Y}$ are Polish without\nadditional structure, we prove very general qualitative guarantees; when they\nhave suitable combinatorial structure, we prove quantitative guarantees for\nH\\\"{o}lder-like maps, including maps between finite graphs, solution operators\nto rough differential equations between certain Carnot groups, and continuous\nnon-linear operators between Banach spaces arising in inverse problems. In\nparticular, we show that the required number of Dirac measures is determined by\nthe combinatorial structure of $\\mathcal{X}$ and $\\mathcal{Y}$. For barycentric\n$\\mathcal{Y}$, including Banach spaces, $\\mathbb{R}$-trees, Hadamard manifolds,\nor Wasserstein spaces on Polish metric spaces, our approximators reduce to\n$\\mathcal{Y}$-valued functions. When the Euclidean approximators are neural\nnetworks, our constructions generalize transformer networks, providing a new\nprobabilistic viewpoint of geometric deep learning.\n","authors":["Anastasis Kratsios","Chong Liu","Matti Lassas","Maarten V. de Hoop","Ivan Dokmanić"],"pdf_url":"https://arxiv.org/pdf/2304.12231v2.pdf","comment":"14 Figures, 3 Tables, 78 Pages (Main 40, Proofs 26, Acknowledgments\n and References 12)"},{"id":"http://arxiv.org/abs/2307.12906v1","updated":"2023-07-24T15:59:36Z","published":"2023-07-24T15:59:36Z","title":"QAmplifyNet: Pushing the Boundaries of Supply Chain Backorder Prediction\n Using Interpretable Hybrid Quantum - Classical Neural Network","summary":" Supply chain management relies on accurate backorder prediction for\noptimizing inventory control, reducing costs, and enhancing customer\nsatisfaction. However, traditional machine-learning models struggle with\nlarge-scale datasets and complex relationships, hindering real-world data\ncollection. This research introduces a novel methodological framework for\nsupply chain backorder prediction, addressing the challenge of handling large\ndatasets. Our proposed model, QAmplifyNet, employs quantum-inspired techniques\nwithin a quantum-classical neural network to predict backorders effectively on\nshort and imbalanced datasets. Experimental evaluations on a benchmark dataset\ndemonstrate QAmplifyNet's superiority over classical models, quantum ensembles,\nquantum neural networks, and deep reinforcement learning. Its proficiency in\nhandling short, imbalanced datasets makes it an ideal solution for supply chain\nmanagement. To enhance model interpretability, we use Explainable Artificial\nIntelligence techniques. Practical implications include improved inventory\ncontrol, reduced backorders, and enhanced operational efficiency. QAmplifyNet\nseamlessly integrates into real-world supply chain management systems, enabling\nproactive decision-making and efficient resource allocation. Future work\ninvolves exploring additional quantum-inspired techniques, expanding the\ndataset, and investigating other supply chain applications. This research\nunlocks the potential of quantum computing in supply chain optimization and\npaves the way for further exploration of quantum-inspired machine learning\nmodels in supply chain management. Our framework and QAmplifyNet model offer a\nbreakthrough approach to supply chain backorder prediction, providing superior\nperformance and opening new avenues for leveraging quantum-inspired techniques\nin supply chain management.\n","authors":["Md Abrar Jahin","Md Sakib Hossain Shovon","Md. Saiful Islam","Jungpil Shin","M. F. 
Mridha","Yuichi Okuyama"],"pdf_url":"https://arxiv.org/pdf/2307.12906v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12904v1","updated":"2023-07-24T15:52:33Z","published":"2023-07-24T15:52:33Z","title":"Universal Approximation Theorem and error bounds for quantum neural\n networks and quantum reservoirs","summary":" Universal approximation theorems are the foundations of classical neural\nnetworks, providing theoretical guarantees that the latter are able to\napproximate maps of interest. Recent results have shown that this can also be\nachieved in a quantum setting, whereby classical functions can be approximated\nby parameterised quantum circuits. We provide here precise error bounds for\nspecific classes of functions and extend these results to the interesting new\nsetup of randomised quantum circuits, mimicking classical reservoir neural\nnetworks. Our results show in particular that a quantum neural network with\n$\\mathcal{O}(\\varepsilon^{-2})$ weights and $\\mathcal{O} (\\lceil\n\\log_2(\\varepsilon^{-1}) \\rceil)$ qubits suffices to achieve accuracy\n$\\varepsilon>0$ when approximating functions with integrable Fourier transform.\n","authors":["Lukas Gonon","Antoine Jacquier"],"pdf_url":"https://arxiv.org/pdf/2307.12904v1.pdf","comment":"20 pages, 0 figure"},{"id":"http://arxiv.org/abs/2206.02909v2","updated":"2023-07-24T15:47:59Z","published":"2022-06-06T21:14:01Z","title":"Self-supervised Learning for Human Activity Recognition Using 700,000\n Person-days of Wearable Data","summary":" Advances in deep learning for human activity recognition have been relatively\nlimited due to the lack of large labelled datasets. In this study, we leverage\nself-supervised learning techniques on the UK-Biobank activity tracker\ndataset--the largest of its kind to date--containing more than 700,000\nperson-days of unlabelled wearable sensor data. Our resulting activity\nrecognition model consistently outperformed strong baselines across seven\nbenchmark datasets, with an F1 relative improvement of 2.5%-100% (median\n18.4%), the largest improvements occurring in the smaller datasets. In contrast\nto previous studies, our results generalise across external datasets, devices,\nand environments. Our open-source model will help researchers and developers to\nbuild customisable and generalisable activity classifiers with high\nperformance.\n","authors":["Hang Yuan","Shing Chan","Andrew P. Creagh","Catherine Tong","David A. Clifton","Aiden Doherty"],"pdf_url":"https://arxiv.org/pdf/2206.02909v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12897v1","updated":"2023-07-24T15:44:30Z","published":"2023-07-24T15:44:30Z","title":"Anytime Model Selection in Linear Bandits","summary":" Model selection in the context of bandit optimization is a challenging\nproblem, as it requires balancing exploration and exploitation not only for\naction selection, but also for model selection. One natural approach is to rely\non online learning algorithms that treat different models as experts. Existing\nmethods, however, scale poorly ($\\text{poly}M$) with the number of models $M$\nin terms of their regret. Our key insight is that, for model selection in\nlinear bandits, we can emulate full-information feedback to the online learner\nwith a favorable bias-variance trade-off. This allows us to develop ALEXP,\nwhich has an exponentially improved ($\\log M$) dependence on $M$ for its\nregret. 
ALEXP has anytime guarantees on its regret, and neither requires\nknowledge of the horizon $n$, nor relies on an initial purely exploratory\nstage. Our approach utilizes a novel time-uniform analysis of the Lasso,\nestablishing a new connection between online learning and high-dimensional\nstatistics.\n","authors":["Parnian Kassraie","Aldo Pacchiano","Nicolas Emmenegger","Andreas Krause"],"pdf_url":"https://arxiv.org/pdf/2307.12897v1.pdf","comment":"37 pages, 7 figures"},{"id":"http://arxiv.org/abs/2307.12892v1","updated":"2023-07-24T15:42:33Z","published":"2023-07-24T15:42:33Z","title":"A Statistical View of Column Subset Selection","summary":" We consider the problem of selecting a small subset of representative\nvariables from a large dataset. In the computer science literature, this\ndimensionality reduction problem is typically formalized as Column Subset\nSelection (CSS). Meanwhile, the typical statistical formalization is to find an\ninformation-maximizing set of Principal Variables. This paper shows that these\ntwo approaches are equivalent, and moreover, both can be viewed as maximum\nlikelihood estimation within a certain semi-parametric model. Using these\nconnections, we show how to efficiently (1) perform CSS using only summary\nstatistics from the original dataset; (2) perform CSS in the presence of\nmissing and/or censored data; and (3) select the subset size for CSS in a\nhypothesis testing framework.\n","authors":["Anav Sood","Trevor Hastie"],"pdf_url":"https://arxiv.org/pdf/2307.12892v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.08649v3","updated":"2023-07-24T15:33:25Z","published":"2023-04-17T22:53:54Z","title":"Classification of US Supreme Court Cases using BERT-Based Techniques","summary":" Models based on bidirectional encoder representations from transformers\n(BERT) produce state of the art (SOTA) results on many natural language\nprocessing (NLP) tasks such as named entity recognition (NER), part-of-speech\n(POS) tagging etc. An interesting phenomenon occurs when classifying long\ndocuments such as those from the US supreme court where BERT-based models can\nbe considered difficult to use on a first-pass or out-of-the-box basis. In this\npaper, we experiment with several BERT-based classification techniques for US\nsupreme court decisions or supreme court database (SCDB) and compare them with\nthe previous SOTA results. We then compare our results specifically with SOTA\nmodels for long documents. We compare our results for two classification tasks:\n(1) a broad classification task with 15 categories and (2) a fine-grained\nclassification task with 279 categories. Our best result produces an accuracy\nof 80\\% on the 15 broad categories and 60\\% on the fine-grained 279 categories\nwhich marks an improvement of 8\\% and 28\\% respectively from previously\nreported SOTA results.\n","authors":["Shubham Vatsal","Adam Meyers","John E. Ortega"],"pdf_url":"https://arxiv.org/pdf/2304.08649v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2108.13628v2","updated":"2023-07-24T15:31:05Z","published":"2021-08-31T05:38:36Z","title":"Learning Optimal Prescriptive Trees from Observational Data","summary":" We consider the problem of learning an optimal prescriptive tree (i.e., an\ninterpretable treatment assignment policy in the form of a binary tree) of\nmoderate depth, from observational data. 
This problem arises in numerous\nsocially important domains such as public health and personalized medicine,\nwhere interpretable and data-driven interventions are sought based on data\ngathered in deployment -- through passive collection of data -- rather than\nfrom randomized trials. We propose a method for learning optimal prescriptive\ntrees using mixed-integer optimization (MIO) technology. We show that under\nmild conditions our method is asymptotically exact in the sense that it\nconverges to an optimal out-of-sample treatment assignment policy as the number\nof historical data samples tends to infinity. Contrary to existing literature,\nour approach: 1) does not require data to be randomized, 2) does not impose\nstringent assumptions on the learned trees, and 3) has the ability to model\ndomain specific constraints. Through extensive computational experiments, we\ndemonstrate that our asymptotic guarantees translate to significant performance\nimprovements in finite samples, as well as showcase our uniquely flexible\nmodeling power by incorporating budget and fairness constraints.\n","authors":["Nathanael Jo","Sina Aghaei","Andrés Gómez","Phebe Vayanos"],"pdf_url":"https://arxiv.org/pdf/2108.13628v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.11389v3","updated":"2023-07-24T15:28:34Z","published":"2022-08-24T09:26:12Z","title":"Approximate blocked Gibbs sampling for Bayesian neural networks","summary":" In this work, minibatch MCMC sampling for feedforward neural networks is made\nmore feasible. To this end, it is proposed to sample subgroups of parameters\nvia a blocked Gibbs sampling scheme. By partitioning the parameter space,\nsampling is possible irrespective of layer width. It is also possible to\nalleviate vanishing acceptance rates for increasing depth by reducing the\nproposal variance in deeper layers. Increasing the length of a non-convergent\nchain increases the predictive accuracy in classification tasks, so avoiding\nvanishing acceptance rates and consequently enabling longer chain runs have\npractical benefits. Moreover, non-convergent chain realizations aid in the\nquantification of predictive uncertainty. An open problem is how to perform\nminibatch MCMC sampling for feedforward neural networks in the presence of\naugmented data.\n","authors":["Theodore Papamarkou"],"pdf_url":"https://arxiv.org/pdf/2208.11389v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2201.12803v3","updated":"2023-07-24T15:27:16Z","published":"2022-01-30T12:53:51Z","title":"Generalizing similarity in noisy setups: the DIBS phenomenon","summary":" This work uncovers an interplay among data density, noise, and the\ngeneralization ability in similarity learning. We consider Siamese Neural\nNetworks (SNNs), which are the basic form of contrastive learning, and explore\ntwo types of noise that can impact SNNs, Pair Label Noise (PLN) and Single\nLabel Noise (SLN). Our investigation reveals that SNNs exhibit double descent\nbehaviour regardless of the training setup and that it is further exacerbated\nby noise. We demonstrate that the density of data pairs is crucial for\ngeneralization. When SNNs are trained on sparse datasets with the same amount\nof PLN or SLN, they exhibit comparable generalization properties. However, when\nusing dense datasets, PLN cases generalize worse than SLN ones in the\noverparametrized region, leading to a phenomenon we call Density-Induced Break\nof Similarity (DIBS). 
In this regime, PLN similarity violation becomes\nmacroscopical, corrupting the dataset to the point where complete interpolation\ncannot be achieved, regardless of the number of model parameters. Our analysis\nalso delves into the correspondence between online optimization and offline\ngeneralization in similarity learning. The results show that this equivalence\nfails in the presence of label noise in all the scenarios considered.\n","authors":["Nayara Fonseca","Veronica Guidetti"],"pdf_url":"https://arxiv.org/pdf/2201.12803v3.pdf","comment":"v3: version accepted at ECAI 2023 + Supplementary Material"},{"id":"http://arxiv.org/abs/2307.10490v3","updated":"2023-07-24T15:24:17Z","published":"2023-07-19T23:03:20Z","title":"(Ab)using Images and Sounds for Indirect Instruction Injection in\n Multi-Modal LLMs","summary":" We demonstrate how images and sounds can be used for indirect prompt and\ninstruction injection in multi-modal LLMs. An attacker generates an adversarial\nperturbation corresponding to the prompt and blends it into an image or audio\nrecording. When the user asks the (unmodified, benign) model about the\nperturbed image or audio, the perturbation steers the model to output the\nattacker-chosen text and/or make the subsequent dialog follow the attacker's\ninstruction. We illustrate this attack with several proof-of-concept examples\ntargeting LLaVa and PandaGPT.\n","authors":["Eugene Bagdasaryan","Tsung-Yin Hsieh","Ben Nassi","Vitaly Shmatikov"],"pdf_url":"https://arxiv.org/pdf/2307.10490v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.10891v2","updated":"2023-07-24T15:16:46Z","published":"2023-06-19T12:36:54Z","title":"Transformer Training Strategies for Forecasting Multiple Load Time\n Series","summary":" In the smart grid of the future, accurate load forecasts on the level of\nindividual clients can help to balance supply and demand locally and to prevent\ngrid outages. While the number of monitored clients will increase with the\nongoing smart meter rollout, the amount of data per client will always be\nlimited. We evaluate whether a Transformer load forecasting model benefits from\na transfer learning strategy, where a global univariate model is trained on the\nload time series from multiple clients. In experiments with two datasets\ncontaining load time series from several hundred clients, we find that the\nglobal training strategy is superior to the multivariate and local training\nstrategies used in related work. On average, the global training strategy\nresults in 21.8% and 12.8% lower forecasting errors than the two other\nstrategies, measured across forecasting horizons from one day to one month into\nthe future. A comparison to linear models, multi-layer perceptrons and LSTMs\nshows that Transformers are effective for load forecasting when they are\ntrained with the global training strategy.\n","authors":["Matthias Hertel","Maximilian Beichter","Benedikt Heidrich","Oliver Neumann","Benjamin Schäfer","Ralf Mikut","Veit Hagenmeyer"],"pdf_url":"https://arxiv.org/pdf/2306.10891v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12872v1","updated":"2023-07-24T15:10:22Z","published":"2023-07-24T15:10:22Z","title":"Data-free Black-box Attack based on Diffusion Model","summary":" Since the training data for the target model in a data-free black-box attack\nis not available, most recent schemes utilize GANs to generate data for\ntraining substitute model. 
However, these GANs-based schemes suffer from low\ntraining efficiency as the generator needs to be retrained for each target\nmodel during the substitute training process, as well as low generation\nquality. To overcome these limitations, we consider utilizing the diffusion\nmodel to generate data, and propose a data-free black-box attack scheme based\non diffusion model to improve the efficiency and accuracy of substitute\ntraining. Despite the data generated by the diffusion model exhibits high\nquality, it presents diverse domain distributions and contains many samples\nthat do not meet the discriminative criteria of the target model. To further\nfacilitate the diffusion model to generate data suitable for the target model,\nwe propose a Latent Code Augmentation (LCA) method to guide the diffusion model\nin generating data. With the guidance of LCA, the data generated by the\ndiffusion model not only meets the discriminative criteria of the target model\nbut also exhibits high diversity. By utilizing this data, it is possible to\ntrain substitute model that closely resemble the target model more efficiently.\nExtensive experiments demonstrate that our LCA achieves higher attack success\nrates and requires fewer query budgets compared to GANs-based schemes for\ndifferent target models.\n","authors":["Mingwen Shao","Lingzhuang Meng","Yuanjian Qiao","Lixu Zhang","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2307.12872v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12862v1","updated":"2023-07-24T15:02:03Z","published":"2023-07-24T15:02:03Z","title":"Stochastic Step-wise Feature Selection for Exponential Random Graph\n Models (ERGMs)","summary":" Statistical analysis of social networks provides valuable insights into\ncomplex network interactions across various scientific disciplines. However,\naccurate modeling of networks remains challenging due to the heavy\ncomputational burden and the need to account for observed network dependencies.\nExponential Random Graph Models (ERGMs) have emerged as a promising technique\nused in social network modeling to capture network dependencies by\nincorporating endogenous variables. Nevertheless, using ERGMs poses multiple\nchallenges, including the occurrence of ERGM degeneracy, which generates\nunrealistic and meaningless network structures. To address these challenges and\nenhance the modeling of collaboration networks, we propose and test a novel\napproach that focuses on endogenous variable selection within ERGMs. Our method\naims to overcome the computational burden and improve the accommodation of\nobserved network dependencies, thereby facilitating more accurate and\nmeaningful interpretations of network phenomena in various scientific fields.\nWe conduct empirical testing and rigorous analysis to contribute to the\nadvancement of statistical techniques and offer practical insights for network\nanalysis.\n","authors":["Helal El-Zaatari","Fei Yu","Michael R Kosorok"],"pdf_url":"https://arxiv.org/pdf/2307.12862v1.pdf","comment":"23 pages, 6 tables and 18 figures"},{"id":"http://arxiv.org/abs/2307.12856v1","updated":"2023-07-24T14:56:30Z","published":"2023-07-24T14:56:30Z","title":"A Real-World WebAgent with Planning, Long Context Understanding, and\n Program Synthesis","summary":" Pre-trained large language models (LLMs) have recently achieved better\ngeneralization and sample efficiency in autonomous web navigation. 
However, the\nperformance on real-world websites has still suffered from (1) open domainness,\n(2) limited context length, and (3) lack of inductive bias on HTML. We\nintroduce WebAgent, an LLM-driven agent that can complete the tasks on real\nwebsites following natural language instructions. WebAgent plans ahead by\ndecomposing instructions into canonical sub-instructions, summarizes long HTML\ndocuments into task-relevant snippets, and acts on websites via generated\nPython programs from those. We design WebAgent with Flan-U-PaLM, for grounded\ncode generation, and HTML-T5, new pre-trained LLMs for long HTML documents\nusing local and global attention mechanisms and a mixture of long-span\ndenoising objectives, for planning and summarization. We empirically\ndemonstrate that our recipe improves the success on a real website by over 50%,\nand that HTML-T5 is the best model to solve HTML-based tasks; achieving 14.9%\nhigher success rate than prior SoTA on the MiniWoB web navigation benchmark and\nbetter accuracy on offline task planning evaluation.\n","authors":["Izzeddin Gur","Hiroki Furuta","Austin Huang","Mustafa Safdari","Yutaka Matsuo","Douglas Eck","Aleksandra Faust"],"pdf_url":"https://arxiv.org/pdf/2307.12856v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12851v1","updated":"2023-07-24T14:51:54Z","published":"2023-07-24T14:51:54Z","title":"Early Neuron Alignment in Two-layer ReLU Networks with Small\n Initialization","summary":" This paper studies the problem of training a two-layer ReLU network for\nbinary classification using gradient flow with small initialization. We\nconsider a training dataset with well-separated input vectors: Any pair of\ninput data with the same label are positively correlated, and any pair with\ndifferent labels are negatively correlated. Our analysis shows that, during the\nearly phase of training, neurons in the first layer try to align with either\nthe positive data or the negative data, depending on its corresponding weight\non the second layer. A careful analysis of the neurons' directional dynamics\nallows us to provide an $\\mathcal{O}(\\frac{\\log n}{\\sqrt{\\mu}})$ upper bound on\nthe time it takes for all neurons to achieve good alignment with the input\ndata, where $n$ is the number of data points and $\\mu$ measures how well the\ndata are separated. After the early alignment phase, the loss converges to zero\nat a $\\mathcal{O}(\\frac{1}{t})$ rate, and the weight matrix on the first layer\nis approximately low-rank. Numerical experiments on the MNIST dataset\nillustrate our theoretical findings.\n","authors":["Hancheng Min","René Vidal","Enrique Mallada"],"pdf_url":"https://arxiv.org/pdf/2307.12851v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12840v1","updated":"2023-07-24T14:37:22Z","published":"2023-07-24T14:37:22Z","title":"Efficiently Learning One-Hidden-Layer ReLU Networks via Schur\n Polynomials","summary":" We study the problem of PAC learning a linear combination of $k$ ReLU\nactivations under the standard Gaussian distribution on $\\mathbb{R}^d$ with\nrespect to the square loss. Our main result is an efficient algorithm for this\nlearning task with sample and computational complexity $(dk/\\epsilon)^{O(k)}$,\nwhere $\\epsilon>0$ is the target accuracy. Prior work had given an algorithm\nfor this problem with complexity $(dk/\\epsilon)^{h(k)}$, where the function\n$h(k)$ scales super-polynomially in $k$. 
Interestingly, the complexity of our\nalgorithm is near-optimal within the class of Correlational Statistical Query\nalgorithms. At a high-level, our algorithm uses tensor decomposition to\nidentify a subspace such that all the $O(k)$-order moments are small in the\northogonal directions. Its analysis makes essential use of the theory of Schur\npolynomials to show that the higher-moment error tensors are small given that\nthe lower-order ones are.\n","authors":["Ilias Diakonikolas","Daniel M. Kane"],"pdf_url":"https://arxiv.org/pdf/2307.12840v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.08272v3","updated":"2023-07-24T14:28:11Z","published":"2023-03-14T23:26:55Z","title":"Automated patent extraction powers generative modeling in focused\n chemical spaces","summary":" Deep generative models have emerged as an exciting avenue for inverse\nmolecular design, with progress coming from the interplay between training\nalgorithms and molecular representations. One of the key challenges in their\napplicability to materials science and chemistry has been the lack of access to\nsizeable training datasets with property labels. Published patents contain the\nfirst disclosure of new materials prior to their publication in journals, and\nare a vast source of scientific knowledge that has remained relatively untapped\nin the field of data-driven molecular design. Because patents are filed seeking\nto protect specific uses, molecules in patents can be considered to be weakly\nlabeled into application classes. Furthermore, patents published by the US\nPatent and Trademark Office (USPTO) are downloadable and have machine-readable\ntext and molecular structures. In this work, we train domain-specific\ngenerative models using patent data sources by developing an automated pipeline\nto go from USPTO patent digital files to the generation of novel candidates\nwith minimal human intervention. We test the approach on two in-class extracted\ndatasets, one in organic electronics and another in tyrosine kinase inhibitors.\nWe then evaluate the ability of generative models trained on these in-class\ndatasets on two categories of tasks (distribution learning and property\noptimization), identify strengths and limitations, and suggest possible\nexplanations and remedies that could be used to overcome these in practice.\n","authors":["Akshay Subramanian","Kevin P. Greenman","Alexis Gervaix","Tzuhsiung Yang","Rafael Gómez-Bombarelli"],"pdf_url":"https://arxiv.org/pdf/2303.08272v3.pdf","comment":"Digital Discovery (2023)"},{"id":"http://arxiv.org/abs/2307.02620v2","updated":"2023-07-24T14:21:09Z","published":"2023-07-05T19:48:03Z","title":"Learning when to observe: A frugal reinforcement learning framework for\n a high-cost world","summary":" Reinforcement learning (RL) has been shown to learn sophisticated control\npolicies for complex tasks including games, robotics, heating and cooling\nsystems and text generation. The action-perception cycle in RL, however,\ngenerally assumes that a measurement of the state of the environment is\navailable at each time step without a cost. In applications such as materials\ndesign, deep-sea and planetary robot exploration and medicine, however, there\ncan be a high cost associated with measuring, or even approximating, the state\nof the environment. In this paper, we survey the recently growing literature\nthat adopts the perspective that an RL agent might not need, or even want, a\ncostly measurement at each time step. 
Within this context, we propose the Deep\nDynamic Multi-Step Observationless Agent (DMSOA), contrast it with the\nliterature and empirically evaluate it on OpenAI gym and Atari Pong\nenvironments. Our results show that DMSOA learns a better policy with fewer\ndecision steps and measurements than the considered alternative from the\nliterature. The corresponding code is available at:\n\\url{https://github.com/cbellinger27/Learning-when-to-observe-in-RL}\n","authors":["Colin Bellinger","Mark Crowley","Isaac Tamblyn"],"pdf_url":"https://arxiv.org/pdf/2307.02620v2.pdf","comment":"Accepted for presentation at ECML-PKDD 2023 workshop track:\n Simplification, Compression, Efficiency and Frugality for Artificial\n Intelligence (SCEFA)"},{"id":"http://arxiv.org/abs/2307.12822v1","updated":"2023-07-24T14:19:36Z","published":"2023-07-24T14:19:36Z","title":"Learning Provably Robust Estimators for Inverse Problems via Jittering","summary":" Deep neural networks provide excellent performance for inverse problems such\nas denoising. However, neural networks can be sensitive to adversarial or\nworst-case perturbations. This raises the question of whether such networks can\nbe trained efficiently to be worst-case robust. In this paper, we investigate\nwhether jittering, a simple regularization technique that adds isotropic\nGaussian noise during training, is effective for learning worst-case robust\nestimators for inverse problems. While well studied for prediction in\nclassification tasks, the effectiveness of jittering for inverse problems has\nnot been systematically investigated. In this paper, we present a novel\nanalytical characterization of the optimal $\\ell_2$-worst-case robust estimator\nfor linear denoising and show that jittering yields optimal robust denoisers.\nFurthermore, we examine jittering empirically via training deep neural networks\n(U-nets) for natural image denoising, deconvolution, and accelerated magnetic\nresonance imaging (MRI). The results show that jittering significantly enhances\nthe worst-case robustness, but can be suboptimal for inverse problems beyond\ndenoising. Moreover, our results imply that training on real data, which often\ncontains slight noise, is somewhat robustness-enhancing.\n","authors":["Anselm Krainovic","Mahdi Soltanolkotabi","Reinhard Heckel"],"pdf_url":"https://arxiv.org/pdf/2307.12822v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.02813v2","updated":"2023-07-24T14:17:24Z","published":"2023-07-06T07:18:22Z","title":"CPDG: A Contrastive Pre-Training Method for Dynamic Graph Neural\n Networks","summary":" Dynamic graph data mining has gained popularity in recent years due to the\nrich information contained in dynamic graphs and their widespread use in the\nreal world. Despite the advances in dynamic graph neural networks (DGNNs), the\nrich information and diverse downstream tasks have posed significant\ndifficulties for the practical application of DGNNs in industrial scenarios. To\nthis end, in this paper, we propose to address them by pre-training and present\nthe Contrastive Pre-Training Method for Dynamic Graph Neural Networks (CPDG).\nCPDG tackles the challenges of pre-training for DGNNs, including generalization\ncapability and long-short term modeling capability, through a flexible\nstructural-temporal subgraph sampler along with structural-temporal contrastive\npre-training schemes. 
Extensive experiments conducted on both large-scale\nresearch and industrial dynamic graph datasets show that CPDG outperforms\nexisting methods in dynamic graph pre-training for various downstream tasks\nunder three transfer settings.\n","authors":["Yuanchen Bei","Hao Xu","Sheng Zhou","Huixuan Chi","Haishuai Wang","Mengdi Zhang","Zhao Li","Jiajun Bu"],"pdf_url":"https://arxiv.org/pdf/2307.02813v2.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.12797v1","updated":"2023-07-24T13:46:50Z","published":"2023-07-24T13:46:50Z","title":"Causal Fair Machine Learning via Rank-Preserving Interventional\n Distributions","summary":" A decision can be defined as fair if equal individuals are treated equally\nand unequals unequally. Adopting this definition, the task of designing machine\nlearning models that mitigate unfairness in automated decision-making systems\nmust include causal thinking when introducing protected attributes. Following a\nrecent proposal, we define individuals as being normatively equal if they are\nequal in a fictitious, normatively desired (FiND) world, where the protected\nattribute has no (direct or indirect) causal effect on the target. We propose\nrank-preserving interventional distributions to define an estimand of this FiND\nworld and a warping method for estimation. Evaluation criteria for both the\nmethod and resulting model are presented and validated through simulations and\nempirical data. With this, we show that our warping approach effectively\nidentifies the most discriminated individuals and mitigates unfairness.\n","authors":["Ludwig Bothmann","Susanne Dandl","Michael Schomaker"],"pdf_url":"https://arxiv.org/pdf/2307.12797v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.05018v3","updated":"2023-07-24T13:46:46Z","published":"2022-07-11T17:13:10Z","title":"Learning Temporally Extended Skills in Continuous Domains as Symbolic\n Actions for Planning","summary":" Problems which require both long-horizon planning and continuous control\ncapabilities pose significant challenges to existing reinforcement learning\nagents. In this paper we introduce a novel hierarchical reinforcement learning\nagent which links temporally extended skills for continuous control with a\nforward model in a symbolic discrete abstraction of the environment's state for\nplanning. We term our agent SEADS for Symbolic Effect-Aware Diverse Skills. We\nformulate an objective and corresponding algorithm which leads to unsupervised\nlearning of a diverse set of skills through intrinsic motivation given a known\nstate abstraction. The skills are jointly learned with the symbolic forward\nmodel which captures the effect of skill execution in the state abstraction.\nAfter training, we can leverage the skills as symbolic actions using the\nforward model for long-horizon planning and subsequently execute the plan using\nthe learned continuous-action control skills. The proposed algorithm learns\nskills and forward models that can be used to solve complex tasks which require\nboth continuous control and long-horizon planning capabilities with high\nsuccess rate. It compares favorably with other flat and hierarchical\nreinforcement learning baseline agents and is successfully demonstrated with a\nreal robot.\n","authors":["Jan Achterhold","Markus Krimmel","Joerg Stueckler"],"pdf_url":"https://arxiv.org/pdf/2207.05018v3.pdf","comment":"Project website (including video) is available at\n https://seads.is.tue.mpg.de/. 
(v2) Accepted for publication at the 6th\n Conference on Robot Learning (CoRL) 2022, Auckland, New Zealand. (v3) Added\n details on checkpointing (S.8.1), with references on p.7, p.8, p.21 to\n clarify number of env. steps of reported results"},{"id":"http://arxiv.org/abs/2307.12790v1","updated":"2023-07-24T13:39:21Z","published":"2023-07-24T13:39:21Z","title":"Compact & Capable: Harnessing Graph Neural Networks and Edge Convolution\n for Medical Image Classification","summary":" Graph-based neural network models are gaining traction in the field of\nrepresentation learning due to their ability to uncover latent topological\nrelationships between entities that are otherwise challenging to identify.\nThese models have been employed across a diverse range of domains, encompassing\ndrug discovery, protein interactions, semantic segmentation, and fluid dynamics\nresearch. In this study, we investigate the potential of Graph Neural Networks\n(GNNs) for medical image classification. We introduce a novel model that\ncombines GNNs and edge convolution, leveraging the interconnectedness of RGB\nchannel feature values to strongly represent connections between crucial graph\nnodes. Our proposed model not only performs on par with state-of-the-art Deep\nNeural Networks (DNNs) but does so with 1000 times fewer parameters, resulting\nin reduced training time and data requirements. We compare our Graph\nConvolutional Neural Network (GCNN) to pre-trained DNNs for classifying\nMedMNIST dataset classes, revealing promising prospects for GNNs in medical\nimage analysis. Our results also encourage further exploration of advanced\ngraph-based models such as Graph Attention Networks (GAT) and Graph\nAuto-Encoders in the medical imaging domain. The proposed model yields more\nreliable, interpretable, and accurate outcomes for tasks like semantic\nsegmentation and image classification compared to simpler GCNNs\n","authors":["Aryan Singh","Pepijn Van de Ven","Ciarán Eising","Patrick Denny"],"pdf_url":"https://arxiv.org/pdf/2307.12790v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.13170v4","updated":"2023-07-24T13:35:28Z","published":"2022-04-27T20:04:24Z","title":"AdaBest: Minimizing Client Drift in Federated Learning via Adaptive Bias\n Estimation","summary":" In Federated Learning (FL), a number of clients or devices collaborate to\ntrain a model without sharing their data. Models are optimized locally at each\nclient and further communicated to a central hub for aggregation. While FL is\nan appealing decentralized training paradigm, heterogeneity among data from\ndifferent clients can cause the local optimization to drift away from the\nglobal objective. In order to estimate and therefore remove this drift,\nvariance reduction techniques have been incorporated into FL optimization\nrecently. However, these approaches inaccurately estimate the clients' drift\nand ultimately fail to remove it properly. In this work, we propose an adaptive\nalgorithm that accurately estimates drift across clients. In comparison to\nprevious works, our approach necessitates less storage and communication\nbandwidth, as well as lower compute costs. Additionally, our proposed\nmethodology induces stability by constraining the norm of estimates for client\ndrift, making it more practical for large scale FL. 
Experimental findings\ndemonstrate that the proposed algorithm converges significantly faster and\nachieves higher accuracy than the baselines across various FL benchmarks.\n","authors":["Farshid Varno","Marzie Saghayi","Laya Rafiee Sevyeri","Sharut Gupta","Stan Matwin","Mohammad Havaei"],"pdf_url":"https://arxiv.org/pdf/2204.13170v4.pdf","comment":"Published as a conference paper at ECCV 2022; Corrected some typos in\n the text and a baseline algorithm"},{"id":"http://arxiv.org/abs/2307.12788v1","updated":"2023-07-24T13:35:18Z","published":"2023-07-24T13:35:18Z","title":"Analyzing the Strategy of Propaganda using Inverse Reinforcement\n Learning: Evidence from the 2022 Russian Invasion of Ukraine","summary":" The 2022 Russian invasion of Ukraine was accompanied by a large-scale,\npro-Russian propaganda campaign on social media. However, the strategy behind\nthe dissemination of propaganda has remained unclear, particularly how the\nonline discourse was strategically shaped by the propagandists' community.\nHere, we analyze the strategy of the Twitter community using an inverse\nreinforcement learning (IRL) approach. Specifically, IRL allows us to model\nonline behavior as a Markov decision process, where the goal is to infer the\nunderlying reward structure that guides propagandists when interacting with\nusers with a supporting or opposing stance toward the invasion. Thereby, we aim\nto understand empirically whether and how between-user interactions are\nstrategically used to promote the proliferation of Russian propaganda. For\nthis, we leverage a large-scale dataset with 349,455 posts with pro-Russian\npropaganda from 132,131 users. We show that bots and humans follow a different\nstrategy: bots respond predominantly to pro-invasion messages, suggesting that\nthey seek to drive virality; while messages indicating opposition primarily\nelicit responses from humans, suggesting that they tend to engage in critical\ndiscussions. To the best of our knowledge, this is the first study analyzing\nthe strategy behind propaganda from the 2022 Russian invasion of Ukraine\nthrough the lens of IRL.\n","authors":["Dominique Geissler","Stefan Feuerriegel"],"pdf_url":"https://arxiv.org/pdf/2307.12788v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.12540v2","updated":"2023-07-24T13:35:16Z","published":"2023-03-22T13:16:37Z","title":"Deployment of Image Analysis Algorithms under Prevalence Shifts","summary":" Domain gaps are among the most relevant roadblocks in the clinical\ntranslation of machine learning (ML)-based solutions for medical image\nanalysis. While current research focuses on new training paradigms and network\narchitectures, little attention is given to the specific effect of prevalence\nshifts on an algorithm deployed in practice. Such discrepancies between class\nfrequencies in the data used for a method's development/validation and that in\nits deployment environment(s) are of great importance, for example in the\ncontext of artificial intelligence (AI) democratization, as disease prevalences\nmay vary widely across time and location. Our contribution is twofold. First,\nwe empirically demonstrate the potentially severe consequences of missing\nprevalence handling by analyzing (i) the extent of miscalibration, (ii) the\ndeviation of the decision threshold from the optimum, and (iii) the ability of\nvalidation metrics to reflect neural network performance on the deployment\npopulation as a function of the discrepancy between development and deployment\nprevalence. 
Second, we propose a workflow for prevalence-aware image\nclassification that uses estimated deployment prevalences to adjust a trained\nclassifier to a new environment, without requiring additional annotated\ndeployment data. Comprehensive experiments based on a diverse set of 30 medical\nclassification tasks showcase the benefit of the proposed workflow in\ngenerating better classifier decisions and more reliable performance estimates\ncompared to current practice.\n","authors":["Patrick Godau","Piotr Kalinowski","Evangelia Christodoulou","Annika Reinke","Minu Tizabi","Luciana Ferrer","Paul Jäger","Lena Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2303.12540v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12775v1","updated":"2023-07-24T13:24:56Z","published":"2023-07-24T13:24:56Z","title":"Is attention all you need in medical image analysis? A review","summary":" Medical imaging is a key component in clinical diagnosis, treatment planning\nand clinical trial design, accounting for almost 90% of all healthcare data.\nCNNs achieved performance gains in medical image analysis (MIA) over the last\nyears. CNNs can efficiently model local pixel interactions and be trained on\nsmall-scale MI data. The main disadvantage of typical CNN models is that they\nignore global pixel relationships within images, which limits their\ngeneralisation ability to understand out-of-distribution data with different\n'global' information. The recent progress of Artificial Intelligence gave rise\nto Transformers, which can learn global relationships from data. However, full\nTransformer models need to be trained on large-scale data and involve\ntremendous computational complexity. Attention and Transformer compartments\n(Transf/Attention) which can well maintain properties for modelling global\nrelationships, have been proposed as lighter alternatives of full Transformers.\nRecently, there is an increasing trend to co-pollinate complementary\nlocal-global properties from CNN and Transf/Attention architectures, which led\nto a new era of hybrid models. The past years have witnessed substantial growth\nin hybrid CNN-Transf/Attention models across diverse MIA problems. In this\nsystematic review, we survey existing hybrid CNN-Transf/Attention models,\nreview and unravel key architectural designs, analyse breakthroughs, and\nevaluate current and future opportunities as well as challenges. We also\nintroduced a comprehensive analysis framework on generalisation opportunities\nof scientific and clinical impact, based on which new data-driven domain\ngeneralisation and adaptation methods can be stimulated.\n","authors":["Giorgos Papanastasiou","Nikolaos Dikaios","Jiahao Huang","Chengjia Wang","Guang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12775v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12771v1","updated":"2023-07-24T13:19:15Z","published":"2023-07-24T13:19:15Z","title":"Detecting disturbances in network-coupled dynamical systems with machine\n learning","summary":" Identifying disturbances in network-coupled dynamical systems without\nknowledge of the disturbances or underlying dynamics is a problem with a wide\nrange of applications. For example, one might want to know which nodes in the\nnetwork are being disturbed and identify the type of disturbance. Here we\npresent a model-free method based on machine learning to identify such unknown\ndisturbances based only on prior observations of the system when forced by a\nknown training function. 
We find that this method is able to identify the\nlocations and properties of many different types of unknown disturbances using\na variety of known forcing functions. We illustrate our results both with\nlinear and nonlinear disturbances using food web and neuronal activity models.\nFinally, we discuss how to scale our method to large networks.\n","authors":["Per Sebastian Skardal","Juan G. Restrepo"],"pdf_url":"https://arxiv.org/pdf/2307.12771v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.05732v6","updated":"2023-07-24T13:15:14Z","published":"2022-09-13T04:58:35Z","title":"Rényi Divergence Deep Mutual Learning","summary":" This paper revisits Deep Mutual Learning (DML), a simple yet effective\ncomputing paradigm. We propose using R\\'{e}nyi divergence instead of the KL\ndivergence, which is more flexible and tunable, to improve vanilla DML. This\nmodification is able to consistently improve performance over vanilla DML with\nlimited additional complexity. The convergence properties of the proposed\nparadigm are analyzed theoretically, and Stochastic Gradient Descent with a\nconstant learning rate is shown to converge with $\\mathcal{O}(1)$-bias in the\nworst case scenario for nonconvex optimization tasks. That is, learning will\nreach nearby local optima but continue searching within a bounded scope, which\nmay help mitigate overfitting. Finally, our extensive empirical results\ndemonstrate the advantage of combining DML and R\\'{e}nyi divergence, leading to\nfurther improvement in model generalization.\n","authors":["Weipeng Huang","Junjie Tao","Changbo Deng","Ming Fan","Wenqiang Wan","Qi Xiong","Guangyuan Piao"],"pdf_url":"https://arxiv.org/pdf/2209.05732v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.11531v2","updated":"2023-07-24T13:04:48Z","published":"2022-09-23T11:36:32Z","title":"Deep Learning-based Anonymization of Chest Radiographs: A\n Utility-preserving Measure for Patient Privacy","summary":" Robust and reliable anonymization of chest radiographs constitutes an\nessential step before publishing large datasets of such for research purposes.\nThe conventional anonymization process is carried out by obscuring personal\ninformation in the images with black boxes and removing or replacing\nmeta-information. However, such simple measures retain biometric information in\nthe chest radiographs, allowing patients to be re-identified by a linkage\nattack. Therefore, there is an urgent need to obfuscate the biometric\ninformation appearing in the images. We propose the first deep learning-based\napproach (PriCheXy-Net) to targetedly anonymize chest radiographs while\nmaintaining data utility for diagnostic and machine learning purposes. Our\nmodel architecture is a composition of three independent neural networks that,\nwhen collectively used, allow for learning a deformation field that is able to\nimpede patient re-identification. Quantitative results on the ChestX-ray14\ndataset show a reduction of patient re-identification from 81.8% to 57.7% (AUC)\nafter re-training with little impact on the abnormality classification\nperformance. This indicates the ability to preserve underlying abnormality\npatterns while increasing patient privacy. 
Lastly, we compare our proposed\nanonymization approach with two other obfuscation-based methods (Privacy-Net,\nDP-Pix) and demonstrate the superiority of our method towards resolving the\nprivacy-utility trade-off for chest radiographs.\n","authors":["Kai Packhäuser","Sebastian Gündel","Florian Thamm","Felix Denzinger","Andreas Maier"],"pdf_url":"https://arxiv.org/pdf/2209.11531v2.pdf","comment":"Accepted at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.07620v2","updated":"2023-07-24T13:03:17Z","published":"2023-07-14T20:39:07Z","title":"Generalizable Embeddings with Cross-batch Metric Learning","summary":" Global average pooling (GAP) is a popular component in deep metric learning\n(DML) for aggregating features. Its effectiveness is often attributed to\ntreating each feature vector as a distinct semantic entity and GAP as a\ncombination of them. Albeit substantiated, such an explanation's algorithmic\nimplications to learn generalizable entities to represent unseen classes, a\ncrucial DML goal, remain unclear. To address this, we formulate GAP as a convex\ncombination of learnable prototypes. We then show that the prototype learning\ncan be expressed as a recursive process fitting a linear predictor to a batch\nof samples. Building on that perspective, we consider two batches of disjoint\nclasses at each iteration and regularize the learning by expressing the samples\nof a batch with the prototypes that are fitted to the other batch. We validate\nour approach on 4 popular DML benchmarks.\n","authors":["Yeti Z. Gurbuz","A. Aydin Alatan"],"pdf_url":"https://arxiv.org/pdf/2307.07620v2.pdf","comment":"\\c{opyright} 2023 IEEE. Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2212.07368v3","updated":"2023-07-24T12:53:23Z","published":"2022-12-14T17:46:17Z","title":"Shuffled Multi-Channel Sparse Signal Recovery","summary":" Mismatches between samples and their respective channel or target commonly\narise in several real-world applications. For instance, whole-brain calcium\nimaging of freely moving organisms, multiple-target tracking or multi-person\ncontactless vital sign monitoring may be severely affected by mismatched\nsample-channel assignments. To systematically address this fundamental problem,\nwe pose it as a signal reconstruction problem where we have lost\ncorrespondences between the samples and their respective channels. Assuming\nthat we have a sensing matrix for the underlying signals, we show that the\nproblem is equivalent to a structured unlabeled sensing problem, and establish\nsufficient conditions for unique recovery. To the best of our knowledge, a\nsampling result for the reconstruction of shuffled multi-channel signals has\nnot been considered in the literature and existing methods for unlabeled\nsensing cannot be directly applied. We extend our results to the case where the\nsignals admit a sparse representation in an overcomplete dictionary (i.e., the\nsensing matrix is not precisely known), and derive sufficient conditions for\nthe reconstruction of shuffled sparse signals. We propose a robust\nreconstruction method that combines sparse signal recovery with robust linear\nregression for the two-channel case. 
The performance and robustness of the\nproposed approach is illustrated in an application related to whole-brain\ncalcium imaging. The proposed methodology can be generalized to sparse signal\nrepresentations other than the ones considered in this work to be applied in a\nvariety of real-world problems with imprecise measurement or channel\nassignment.\n","authors":["Taulant Koka","Manolis C. Tsakiris","Michael Muma","Benjamín Béjar Haro"],"pdf_url":"https://arxiv.org/pdf/2212.07368v3.pdf","comment":"Submitted to TSP"},{"id":"http://arxiv.org/abs/2307.12754v1","updated":"2023-07-24T12:52:55Z","published":"2023-07-24T12:52:55Z","title":"Nonparametric Linear Feature Learning in Regression Through\n Regularisation","summary":" Representation learning plays a crucial role in automated feature selection,\nparticularly in the context of high-dimensional data, where non-parametric\nmethods often struggle. In this study, we focus on supervised learning\nscenarios where the pertinent information resides within a lower-dimensional\nlinear subspace of the data, namely the multi-index model. If this subspace\nwere known, it would greatly enhance prediction, computation, and\ninterpretation. To address this challenge, we propose a novel method for linear\nfeature learning with non-parametric prediction, which simultaneously estimates\nthe prediction function and the linear subspace. Our approach employs empirical\nrisk minimisation, augmented with a penalty on function derivatives, ensuring\nversatility. Leveraging the orthogonality and rotation invariance properties of\nHermite polynomials, we introduce our estimator, named RegFeaL. By utilising\nalternative minimisation, we iteratively rotate the data to improve alignment\nwith leading directions and accurately estimate the relevant dimension in\npractical settings. We establish that our method yields a consistent estimator\nof the prediction function with explicit rates. Additionally, we provide\nempirical results demonstrating the performance of RegFeaL in various\nexperiments.\n","authors":["Bertille Follain","Umut Simsekli","Francis Bach"],"pdf_url":"https://arxiv.org/pdf/2307.12754v1.pdf","comment":"43 pages, 16 figures"},{"id":"http://arxiv.org/abs/2307.12745v1","updated":"2023-07-24T12:36:05Z","published":"2023-07-24T12:36:05Z","title":"Concept-based explainability for an EEG transformer model","summary":" Deep learning models are complex due to their size, structure, and inherent\nrandomness in training procedures. Additional complexity arises from the\nselection of datasets and inductive biases. Addressing these challenges for\nexplainability, Kim et al. (2018) introduced Concept Activation Vectors (CAVs),\nwhich aim to understand deep models' internal states in terms of human-aligned\nconcepts. These concepts correspond to directions in latent space, identified\nusing linear discriminants. Although this method was first applied to image\nclassification, it was later adapted to other domains, including natural\nlanguage processing. In this work, we attempt to apply the method to\nelectroencephalogram (EEG) data for explainability in Kostas et al.'s BENDR\n(2021), a large-scale transformer model. A crucial part of this endeavor\ninvolves defining the explanatory concepts and selecting relevant datasets to\nground concepts in the latent space. Our focus is on two mechanisms for EEG\nconcept formation: the use of externally labeled EEG datasets, and the\napplication of anatomically defined concepts. 
The former approach is a\nstraightforward generalization of methods used in image classification, while\nthe latter is novel and specific to EEG. We present evidence that both\napproaches to concept formation yield valuable insights into the\nrepresentations learned by deep EEG models.\n","authors":["Anders Gjølbye Madsen","William Theodor Lehn-Schiøler","Áshildur Jónsdóttir","Bergdís Arnardóttir","Lars Kai Hansen"],"pdf_url":"https://arxiv.org/pdf/2307.12745v1.pdf","comment":"To appear in proceedings of 2023 IEEE International workshop on\n Machine Learning for Signal Processing"},{"id":"http://arxiv.org/abs/2207.09657v3","updated":"2023-07-24T12:35:18Z","published":"2022-07-20T05:22:26Z","title":"Reducing Training Time in Cross-Silo Federated Learning using Multigraph\n Topology","summary":" Federated learning is an active research topic since it enables several\nparticipants to jointly train a model without sharing local data. Currently,\ncross-silo federated learning is a popular training setting that utilizes a few\nhundred reliable data silos with high-speed access links to training a model.\nWhile this approach has been widely applied in real-world scenarios, designing\na robust topology to reduce the training time remains an open problem. In this\npaper, we present a new multigraph topology for cross-silo federated learning.\nWe first construct the multigraph using the overlay graph. We then parse this\nmultigraph into different simple graphs with isolated nodes. The existence of\nisolated nodes allows us to perform model aggregation without waiting for other\nnodes, hence effectively reducing the training time. Intensive experiments on\nthree public datasets show that our proposed method significantly reduces the\ntraining time compared with recent state-of-the-art topologies while\nmaintaining the accuracy of the learned model. Our code can be found at\nhttps://github.com/aioz-ai/MultigraphFL\n","authors":["Tuong Do","Binh X. Nguyen","Vuong Pham","Toan Tran","Erman Tjiputra","Quang Tran","Anh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2207.09657v3.pdf","comment":"accepted in ICCV 2023"},{"id":"http://arxiv.org/abs/2302.09629v2","updated":"2023-07-24T12:33:09Z","published":"2023-02-19T17:15:56Z","title":"BiofilmScanner: A Computational Intelligence Approach to Obtain\n Bacterial Cell Morphological Attributes from Biofilm Image","summary":" Desulfovibrio alaskensis G20 (DA-G20) is utilized as a model for\nsulfate-reducing bacteria (SRB) that are associated with corrosion issues\ncaused by microorganisms. SRB-based biofilms are thought to be responsible for\nthe billion-dollar-per-year bio-corrosion of metal infrastructure.\nUnderstanding the extraction of the bacterial cells' shape and size properties\nin the SRB-biofilm at different growth stages will assist with the design of\nanti-corrosion techniques. However, numerous issues affect current approaches,\nincluding time-consuming geometric property extraction, low efficiency, and\nhigh error rates. This paper proposes BiofilScanner, a Yolact-based deep\nlearning method integrated with invariant moments to address these problems.\nOur approach efficiently detects and segments bacterial cells in an SRB image\nwhile simultaneously invariant moments measure the geometric characteristics of\nthe segmented cells with low errors. 
The numerical experiments of the proposed\nmethod demonstrate that the BiofilmScanner is 2.1x and 6.8x faster than our\nearlier Mask-RCNN and DLv3+ methods for detecting, segmenting, and measuring\nthe geometric properties of the cell. Furthermore, the BiofilmScanner achieved\nan F1-score of 85.28% while Mask-RCNN and DLv3+ obtained F1-scores of 77.67%\nand 75.18%, respectively.\n","authors":["Md Hafizur Rahman","Md Ali Azam","Md Abir Hossen","Shankarachary Ragi","Venkataramana Gadhamshetty"],"pdf_url":"https://arxiv.org/pdf/2302.09629v2.pdf","comment":"Submitted to Pattern Recognition"},{"id":"http://arxiv.org/abs/2306.16177v3","updated":"2023-07-24T12:32:58Z","published":"2023-06-28T12:58:42Z","title":"Defining data science: a new field of inquiry","summary":" Data science is not a science. It is a research paradigm. Its power, scope,\nand scale will surpass science, our most powerful research paradigm, to enable\nknowledge discovery and change our world. We have yet to understand and define\nit, vital to realizing its potential and managing its risks. Modern data\nscience is in its infancy. Emerging slowly since 1962 and rapidly since 2000,\nit is a fundamentally new field of inquiry, one of the most active, powerful,\nand rapidly evolving 21st century innovations. Due to its value, power, and\napplicability, it is emerging in over 40 disciplines, hundreds of research\nareas, and thousands of applications. Millions of data science publications\ncontain myriad definitions of data science and data science problem solving.\nDue to its infancy, many definitions are independent, application specific,\nmutually incomplete, redundant, or inconsistent, hence so is data science. This\nresearch addresses this data science multiple definitions challenge by\nproposing the development of coherent, unified definition based on a data\nscience reference framework using a data science journal for the data science\ncommunity to achieve such a definition. This paper provides candidate\ndefinitions for essential data science artifacts that are required to discuss\nsuch a definition. They are based on the classical research paradigm concept\nconsisting of a philosophy of data science, the data science problem solving\nparadigm, and the six component data science reference framework (axiology,\nontology, epistemology, methodology, methods, technology) that is a frequently\ncalled for unifying framework with which to define, unify, and evolve data\nscience. It presents challenges for defining data science, solution approaches,\ni.e., means for defining data science, and their requirements and benefits as\nthe basis of a comprehensive solution.\n","authors":["Michael L Brodie"],"pdf_url":"https://arxiv.org/pdf/2306.16177v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.12865v3","updated":"2023-07-24T12:08:50Z","published":"2023-03-22T18:59:48Z","title":"NeRF-GAN Distillation for Efficient 3D-Aware Generation with\n Convolutions","summary":" Pose-conditioned convolutional generative models struggle with high-quality\n3D-consistent image generation from single-view datasets, due to their lack of\nsufficient 3D priors. Recently, the integration of Neural Radiance Fields\n(NeRFs) and generative models, such as Generative Adversarial Networks (GANs),\nhas transformed 3D-aware generation from single-view images. NeRF-GANs exploit\nthe strong inductive bias of neural 3D representations and volumetric rendering\nat the cost of higher computational complexity. 
This study aims at revisiting\npose-conditioned 2D GANs for efficient 3D-aware generation at inference time by\ndistilling 3D knowledge from pretrained NeRF-GANs. We propose a simple and\neffective method, based on re-using the well-disentangled latent space of a\npre-trained NeRF-GAN in a pose-conditioned convolutional network to directly\ngenerate 3D-consistent images corresponding to the underlying 3D\nrepresentations. Experiments on several datasets demonstrate that the proposed\nmethod obtains results comparable with volumetric rendering in terms of quality\nand 3D consistency while benefiting from the computational advantage of\nconvolutional networks. The code will be available at:\nhttps://github.com/mshahbazi72/NeRF-GAN-Distillation\n","authors":["Mohamad Shahbazi","Evangelos Ntavelis","Alessio Tonioni","Edo Collins","Danda Pani Paudel","Martin Danelljan","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2303.12865v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12716v1","updated":"2023-07-24T11:55:32Z","published":"2023-07-24T11:55:32Z","title":"Safety Performance of Neural Networks in the Presence of Covariate Shift","summary":" Covariate shift may impact the operational safety performance of neural\nnetworks. A re-evaluation of the safety performance, however, requires\ncollecting new operational data and creating corresponding ground truth labels,\nwhich often is not possible during operation. We are therefore proposing to\nreshape the initial test set, as used for the safety performance evaluation\nprior to deployment, based on an approximation of the operational data. This\napproximation is obtained by observing and learning the distribution of\nactivation patterns of neurons in the network during operation. The reshaped\ntest set reflects the distribution of neuron activation values as observed\nduring operation, and may therefore be used for re-evaluating safety\nperformance in the presence of covariate shift. First, we derive conservative\nbounds on the values of neurons by applying finite binning and static dataflow\nanalysis. Second, we formulate a mixed integer linear programming (MILP)\nconstraint for constructing the minimum set of data points to be removed in the\ntest set, such that the difference between the discretized test and operational\ndistributions is bounded. We discuss potential benefits and limitations of this\nconstraint-based approach based on our initial experience with an implemented\nresearch prototype.\n","authors":["Chih-Hong Cheng","Harald Ruess","Konstantinos Theodorou"],"pdf_url":"https://arxiv.org/pdf/2307.12716v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.13871v2","updated":"2023-07-24T11:44:01Z","published":"2023-04-26T23:34:40Z","title":"Typical and atypical solutions in non-convex neural networks with\n discrete and continuous weights","summary":" We study the binary and continuous negative-margin perceptrons as simple\nnon-convex neural network models learning random rules and associations. We\nanalyze the geometry of the landscape of solutions in both models and find\nimportant similarities and differences. Both models exhibit subdominant\nminimizers which are extremely flat and wide. These minimizers coexist with a\nbackground of dominant solutions which are composed by an exponential number of\nalgorithmically inaccessible small clusters for the binary case (the frozen\n1-RSB phase) or a hierarchical structure of clusters of different sizes for the\nspherical case (the full RSB phase). 
In both cases, when a certain threshold in\nconstraint density is crossed, the local entropy of the wide flat minima\nbecomes non-monotonic, indicating a break-up of the space of robust solutions\ninto disconnected components. This has a strong impact on the behavior of\nalgorithms in binary models, which cannot access the remaining isolated\nclusters. For the spherical case the behaviour is different, since even beyond\nthe disappearance of the wide flat minima the remaining solutions are shown to\nalways be surrounded by a large number of other solutions at any distance, up\nto capacity. Indeed, we exhibit numerical evidence that algorithms seem to find\nsolutions up to the SAT/UNSAT transition, that we compute here using an 1RSB\napproximation. For both models, the generalization performance as a learning\ndevice is shown to be greatly improved by the existence of wide flat minimizers\neven when trained in the highly underconstrained regime of very negative\nmargins.\n","authors":["Carlo Baldassi","Enrico M. Malatesta","Gabriele Perugini","Riccardo Zecchina"],"pdf_url":"https://arxiv.org/pdf/2304.13871v2.pdf","comment":"34 pages, 13 figures"},{"id":"http://arxiv.org/abs/2210.17230v3","updated":"2023-07-24T11:43:26Z","published":"2022-10-31T11:15:48Z","title":"Lipschitz-regularized gradient flows and generative particle algorithms\n for high-dimensional scarce data","summary":" We build a new class of generative algorithms capable of efficiently learning\nan arbitrary target distribution from possibly scarce, high-dimensional data\nand subsequently generate new samples. These generative algorithms are\nparticle-based and are constructed as gradient flows of Lipschitz-regularized\nKullback-Leibler or other $f$-divergences, where data from a source\ndistribution can be stably transported as particles, towards the vicinity of\nthe target distribution. As a highlighted result in data integration, we\ndemonstrate that the proposed algorithms correctly transport gene expression\ndata points with dimension exceeding 54K, while the sample size is typically\nonly in the hundreds.\n","authors":["Hyemin Gu","Panagiota Birmpa","Yannis Pantazis","Luc Rey-Bellet","Markos A. Katsoulakis"],"pdf_url":"https://arxiv.org/pdf/2210.17230v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12703v1","updated":"2023-07-24T11:37:02Z","published":"2023-07-24T11:37:02Z","title":"Policy Gradient Optimal Correlation Search for Variance Reduction in\n Monte Carlo simulation and Maximum Optimal Transport","summary":" We propose a new algorithm for variance reduction when estimating $f(X_T)$\nwhere $X$ is the solution to some stochastic differential equation and $f$ is a\ntest function. The new estimator is $(f(X^1_T) + f(X^2_T))/2$, where $X^1$ and\n$X^2$ have same marginal law as $X$ but are pathwise correlated so that to\nreduce the variance. The optimal correlation function $\\rho$ is approximated by\na deep neural network and is calibrated along the trajectories of $(X^1, X^2)$\nby policy gradient and reinforcement learning techniques. 
Finding an optimal\ncoupling given marginal laws has links with maximum optimal transport.\n","authors":["Pierre Bras","Gilles Pagès"],"pdf_url":"https://arxiv.org/pdf/2307.12703v1.pdf","comment":"7 pages"},{"id":"http://arxiv.org/abs/2303.09340v3","updated":"2023-07-24T11:34:21Z","published":"2023-03-16T14:21:45Z","title":"Improving Automated Hemorrhage Detection in Sparse-view Computed\n Tomography via Deep Convolutional Neural Network based Artifact Reduction","summary":" Purpose: Sparse-view computed tomography (CT) is an effective way to reduce\ndose by lowering the total number of views acquired, albeit at the expense of\nimage quality, which, in turn, can impact the ability to detect diseases. We\nexplore deep learning-based artifact reduction in sparse-view cranial CT scans\nand its impact on automated hemorrhage detection. Methods: We trained a U-Net\nfor artefact reduction on simulated sparse-view cranial CT scans from 3000\npatients obtained from a public dataset and reconstructed with varying levels\nof sub-sampling. Additionally, we trained a convolutional neural network on\nfully sampled CT data from 17,545 patients for automated hemorrhage detection.\nWe evaluated the classification performance using the area under the receiver\noperator characteristic curves (AUC-ROCs) with corresponding 95% confidence\nintervals (CIs) and the DeLong test, along with confusion matrices. The\nperformance of the U-Net was compared to an analytical approach based on total\nvariation (TV). Results: The U-Net performed superior compared to unprocessed\nand TV-processed images with respect to image quality and automated hemorrhage\ndiagnosis. With U-Net post-processing, the number of views can be reduced from\n4096 (AUC-ROC: 0.974; 95% CI: 0.972-0.976) views to 512 views (0.973;\n0.971-0.975) with minimal decrease in hemorrhage detection (P<.001) and to 256\nviews (0.967; 0.964-0.969) with a slight performance decrease (P<.001).\nConclusion: The results suggest that U-Net based artifact reduction\nsubstantially enhances automated hemorrhage detection in sparse-view cranial\nCTs. Our findings highlight that appropriate post-processing is crucial for\noptimal image quality and diagnostic accuracy while minimizing radiation dose.\n","authors":["Johannes Thalhammer","Manuel Schultheiss","Tina Dorosti","Tobias Lasser","Franz Pfeiffer","Daniela Pfeiffer","Florian Schaff"],"pdf_url":"https://arxiv.org/pdf/2303.09340v3.pdf","comment":"11 pages, 6 figures, 1 table"},{"id":"http://arxiv.org/abs/2307.12698v1","updated":"2023-07-24T11:27:14Z","published":"2023-07-24T11:27:14Z","title":"MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised\n Learning of Motion and Content Features","summary":" Self-supervised learning of visual representations has been focusing on\nlearning content features, which do not capture object motion or location, and\nfocus on identifying and differentiating objects in images and videos. On the\nother hand, optical flow estimation is a task that does not involve\nunderstanding the content of the images on which it is estimated. We unify the\ntwo approaches and introduce MC-JEPA, a joint-embedding predictive architecture\nand self-supervised learning approach to jointly learn optical flow and content\nfeatures within a shared encoder, demonstrating that the two associated\nobjectives; the optical flow estimation objective and the self-supervised\nlearning objective; benefit from each other and thus learn content features\nthat incorporate motion information. 
The proposed approach achieves performance\non-par with existing unsupervised optical flow benchmarks, as well as with\ncommon self-supervised learning approaches on downstream tasks such as semantic\nsegmentation of images and videos.\n","authors":["Adrien Bardes","Jean Ponce","Yann LeCun"],"pdf_url":"https://arxiv.org/pdf/2307.12698v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.10763v3","updated":"2023-07-24T11:15:47Z","published":"2023-02-12T12:19:57Z","title":"Contrastive Learning and the Emergence of Attributes Associations","summary":" In response to an object presentation, supervised learning schemes generally\nrespond with a parsimonious label. Upon a similar presentation we humans\nrespond again with a label, but are flooded, in addition, by a myriad of\nassociations. A significant portion of these consist of the presented object\nattributes. Contrastive learning is a semi-supervised learning scheme based on\nthe application of identity preserving transformations on the object input\nrepresentations. It is conjectured in this work that these same applied\ntransformations preserve, in addition to the identity of the presented object,\nalso the identity of its semantically meaningful attributes. The corollary of\nthis is that the output representations of such a contrastive learning scheme\ncontain valuable information not only for the classification of the presented\nobject, but also for the presence or absence decision of any attribute of\ninterest. Simulation results which demonstrate this idea and the feasibility of\nthis conjecture are presented.\n","authors":["Daniel N. Nissani"],"pdf_url":"https://arxiv.org/pdf/2302.10763v3.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2210.12583v2","updated":"2023-07-24T11:13:21Z","published":"2022-10-23T00:45:05Z","title":"Active Learning of Discrete-Time Dynamics for Uncertainty-Aware Model\n Predictive Control","summary":" Model-based control requires an accurate model of the system dynamics for\nprecisely and safely controlling the robot in complex and dynamic environments.\nMoreover, in the presence of variations in the operating conditions, the model\nshould be continuously refined to compensate for dynamics changes. In this\npaper, we present a self-supervised learning approach that actively models the\ndynamics of nonlinear robotic systems. We combine offline learning from past\nexperience and online learning from current robot interaction with the unknown\nenvironment. These two ingredients enable a highly sample-efficient and\nadaptive learning process, capable of accurately inferring model dynamics in\nreal-time even in operating regimes that greatly differ from the training\ndistribution. Moreover, we design an uncertainty-aware model predictive\ncontroller that is heuristically conditioned to the aleatoric (data)\nuncertainty of the learned dynamics. This controller actively chooses the\noptimal control actions that (i) optimize the control performance and (ii)\nimprove the efficiency of online learning sample collection. We demonstrate the\neffectiveness of our method through a series of challenging real-world\nexperiments using a quadrotor system. 
Our approach showcases high resilience\nand generalization capabilities by consistently adapting to unseen flight\nconditions, while it significantly outperforms classical and adaptive control\nbaselines.\n","authors":["Alessandro Saviolo","Jonathan Frey","Abhishek Rathod","Moritz Diehl","Giuseppe Loianno"],"pdf_url":"https://arxiv.org/pdf/2210.12583v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12689v1","updated":"2023-07-24T11:04:22Z","published":"2023-07-24T11:04:22Z","title":"Addressing the Impact of Localized Training Data in Graph Neural\n Networks","summary":" Graph Neural Networks (GNNs) have achieved notable success in learning from\ngraph-structured data, owing to their ability to capture intricate dependencies\nand relationships between nodes. They excel in various applications, including\nsemi-supervised node classification, link prediction, and graph generation.\nHowever, it is important to acknowledge that the majority of state-of-the-art\nGNN models are built upon the assumption of an in-distribution setting, which\nhinders their performance on real-world graphs with dynamic structures. In this\narticle, we aim to assess the impact of training GNNs on localized subsets of\nthe graph. Such restricted training data may lead to a model that performs well\nin the specific region it was trained on but fails to generalize and make\naccurate predictions for the entire graph. In the context of graph-based\nsemi-supervised learning (SSL), resource constraints often lead to scenarios\nwhere the dataset is large, but only a portion of it can be labeled, affecting\nthe model's performance. This limitation affects tasks like anomaly detection\nor spam detection when labeling processes are biased or influenced by human\nsubjectivity. To tackle the challenges posed by localized training data, we\napproach the problem as an out-of-distribution (OOD) data issue by aligning\nthe distributions between the training data, which represents a small portion\nof labeled data, and the graph inference process that involves making\npredictions for the entire graph. We propose a regularization method to\nminimize distributional discrepancies between localized training data and graph\ninference, improving model performance on OOD data. Extensive tests on popular\nGNN models show significant performance improvement on three citation GNN\nbenchmark datasets. The regularization approach effectively enhances model\nadaptation and generalization, overcoming challenges posed by OOD data.\n","authors":["Singh Akansha"],"pdf_url":"https://arxiv.org/pdf/2307.12689v1.pdf","comment":"6 pages, 4 figures"},{"id":"http://arxiv.org/abs/2307.12679v1","updated":"2023-07-24T10:33:32Z","published":"2023-07-24T10:33:32Z","title":"An Estimator for the Sensitivity to Perturbations of Deep Neural\n Networks","summary":" For Deep Neural Networks (DNNs) to become useful in safety-critical\napplications, such as self-driving cars and disease diagnosis, they must be\nstable to perturbations in input and model parameters. Characterizing the\nsensitivity of a DNN to perturbations is necessary to determine minimal\nbit-width precision that may be used to safely represent the network. However,\nno general result exists that is capable of predicting the sensitivity of a\ngiven DNN to round-off error, noise, or other perturbations in input. This\npaper derives an estimator that can predict such quantities. 
The estimator is\nderived via inequalities and matrix norms, and the resulting quantity is\nroughly analogous to a condition number for the entire neural network. An\napproximation of the estimator is tested on two Convolutional Neural Networks,\nAlexNet and VGG-19, using the ImageNet dataset. For each of these networks, the\ntightness of the estimator is explored via random perturbations and adversarial\nattacks.\n","authors":["Naman Maheshwari","Nicholas Malaya","Scott Moe","Jaydeep P. Kulkarni","Sudhanva Gurumurthi"],"pdf_url":"https://arxiv.org/pdf/2307.12679v1.pdf","comment":"Actual work and paper concluded in January 2019"},{"id":"http://arxiv.org/abs/2307.12672v1","updated":"2023-07-24T10:20:14Z","published":"2023-07-24T10:20:14Z","title":"Global k-Space Interpolation for Dynamic MRI Reconstruction using Masked\n Image Modeling","summary":" In dynamic Magnetic Resonance Imaging (MRI), k-space is typically\nundersampled due to limited scan time, resulting in aliasing artifacts in the\nimage domain. Hence, dynamic MR reconstruction requires not only modeling\nspatial frequency components in the x and y directions of k-space but also\nconsidering temporal redundancy. Most previous works rely on image-domain\nregularizers (priors) to conduct MR reconstruction. In contrast, we focus on\ninterpolating the undersampled k-space before obtaining images with Fourier\ntransform. In this work, we connect masked image modeling with k-space\ninterpolation and propose a novel Transformer-based k-space Global\nInterpolation Network, termed k-GIN. Our k-GIN learns global dependencies among\nlow- and high-frequency components of 2D+t k-space and uses it to interpolate\nunsampled data. Further, we propose a novel k-space Iterative Refinement Module\n(k-IRM) to enhance the high-frequency components learning. We evaluate our\napproach on 92 in-house 2D+t cardiac MR subjects and compare it to MR\nreconstruction methods with image-domain regularizers. Experiments show that\nour proposed k-space interpolation method quantitatively and qualitatively\noutperforms baseline methods. Importantly, the proposed approach achieves\nsubstantially higher robustness and generalizability in cases of\nhighly-undersampled MR data.\n","authors":["Jiazhen Pan","Suprosanna Shit","Özgün Turgut","Wenqi Huang","Hongwei Bran Li","Nil Stolt-Ansó","Thomas Küstner","Kerstin Hammernik","Daniel Rueckert"],"pdf_url":"https://arxiv.org/pdf/2307.12672v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12667v1","updated":"2023-07-24T10:14:51Z","published":"2023-07-24T10:14:51Z","title":"TransFusion: Generating Long, High Fidelity Time Series using Diffusion\n Models with Transformers","summary":" The generation of high-quality, long-sequenced time-series data is essential\ndue to its wide range of applications. In the past, standalone Recurrent and\nConvolutional Neural Network-based Generative Adversarial Networks (GAN) were\nused to synthesize time-series data. However, they are inadequate for\ngenerating long sequences of time-series data due to limitations in the\narchitecture. Furthermore, GANs are well known for their training instability\nand mode collapse problem. To address this, we propose TransFusion, a\ndiffusion, and transformers-based generative model to generate high-quality\nlong-sequence time-series data. We have stretched the sequence length to 384,\nand generated high-quality synthetic data. To the best of our knowledge, this\nis the first study that has been done with this long-sequence length. 
Also, we\nintroduce two evaluation metrics to evaluate the quality of the synthetic data\nas well as its predictive characteristics. We evaluate TransFusion with a wide\nvariety of visual and empirical metrics, and TransFusion outperforms the\nprevious state-of-the-art by a significant margin.\n","authors":["Md Fahim Sikder","Resmi Ramachandranpillai","Fredrik Heintz"],"pdf_url":"https://arxiv.org/pdf/2307.12667v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12660v1","updated":"2023-07-24T10:04:27Z","published":"2023-07-24T10:04:27Z","title":"Online Continual Learning in Keyword Spotting for Low-Resource Devices\n via Pooling High-Order Temporal Statistics","summary":" Keyword Spotting (KWS) models on embedded devices should adapt fast to new\nuser-defined words without forgetting previous ones. Embedded devices have\nlimited storage and computational resources, thus, they cannot save samples or\nupdate large models. We consider the setup of embedded online continual\nlearning (EOCL), where KWS models with frozen backbone are trained to\nincrementally recognize new words from a non-repeated stream of samples, seen\none at a time. To this end, we propose Temporal Aware Pooling (TAP) which\nconstructs an enriched feature space computing high-order moments of speech\nfeatures extracted by a pre-trained backbone. Our method, TAP-SLDA, updates a\nGaussian model for each class on the enriched feature space to effectively use\naudio representations. In experimental analyses, TAP-SLDA outperforms\ncompetitors on several setups, backbones, and baselines, bringing a relative\naverage gain of 11.3% on the GSC dataset.\n","authors":["Umberto Michieli","Pablo Peso Parada","Mete Ozay"],"pdf_url":"https://arxiv.org/pdf/2307.12660v1.pdf","comment":"INTERSPEECH 2023"},{"id":"http://arxiv.org/abs/2306.12231v2","updated":"2023-07-24T09:36:05Z","published":"2023-06-21T12:44:52Z","title":"Predicting protein variants with equivariant graph neural networks","summary":" Pre-trained models have been successful in many protein engineering tasks.\nMost notably, sequence-based models have achieved state-of-the-art performance\non protein fitness prediction while structure-based models have been used\nexperimentally to develop proteins with enhanced functions. However, there is a\nresearch gap in comparing structure- and sequence-based methods for predicting\nprotein variants that are better than the wildtype protein. This paper aims to\naddress this gap by conducting a comparative study between the abilities of\nequivariant graph neural networks (EGNNs) and sequence-based approaches to\nidentify promising amino-acid mutations. The results show that our proposed\nstructural approach achieves a competitive performance to sequence-based\nmethods while being trained on significantly fewer molecules. 
Additionally, we\nfind that combining assay labelled data with structure pre-trained models\nyields similar trends as with sequence pre-trained models.\n Our code and trained models can be found at:\nhttps://github.com/semiluna/partIII-amino-acid-prediction.\n","authors":["Antonia Boca","Simon Mathis"],"pdf_url":"https://arxiv.org/pdf/2306.12231v2.pdf","comment":"4 pages, 2 figures, accepted to the 2023 ICML Workshop on\n Computational Biology"},{"id":"http://arxiv.org/abs/2307.12644v1","updated":"2023-07-24T09:35:47Z","published":"2023-07-24T09:35:47Z","title":"Remote Bio-Sensing: Open Source Benchmark Framework for Fair Evaluation\n of rPPG","summary":" Remote Photoplethysmography (rPPG) is a technology that utilizes the light\nabsorption properties of hemoglobin, captured via camera, to analyze and\nmeasure blood volume pulse (BVP). By analyzing the measured BVP, various\nphysiological signals such as heart rate, stress levels, and blood pressure can\nbe derived, enabling applications such as the early prediction of\ncardiovascular diseases. rPPG is a rapidly evolving field as it allows the\nmeasurement of vital signals using camera-equipped devices without the need for\nadditional devices such as blood pressure monitors or pulse oximeters, and\nwithout the assistance of medical experts. Despite extensive efforts and\nadvances in this field, serious challenges remain, including issues related to\nskin color, camera characteristics, ambient lighting, and other sources of\nnoise, which degrade performance accuracy. We argue that fair and evaluable\nbenchmarking is urgently required to overcome these challenges and make any\nmeaningful progress from both academic and commercial perspectives. In most\nexisting work, models are trained, tested, and validated only on limited\ndatasets. Worse still, some studies lack available code or reproducibility,\nmaking it difficult to fairly evaluate and compare performance. Therefore, the\npurpose of this study is to provide a benchmarking framework to evaluate\nvarious rPPG techniques across a wide range of datasets for fair evaluation and\ncomparison, including both conventional non-deep neural network (non-DNN) and\ndeep neural network (DNN) methods. GitHub URL:\nhttps://github.com/remotebiosensing/rppg.\n","authors":["Dae Yeol Kim","Eunsu Goh","KwangKee Lee","JongEui Chae","JongHyeon Mun","Junyeong Na","Chae-bong Sohn","Do-Yup Kim"],"pdf_url":"https://arxiv.org/pdf/2307.12644v1.pdf","comment":"19 pages, 10 figures"},{"id":"http://arxiv.org/abs/2307.12639v1","updated":"2023-07-24T09:30:30Z","published":"2023-07-24T09:30:30Z","title":"Fake News Detection Through Graph-based Neural Networks: A Survey","summary":" The popularity of online social networks has enabled rapid dissemination of\ninformation. People now can share and consume information much more rapidly\nthan ever before. However, low-quality and/or accidentally/deliberately fake\ninformation can also spread rapidly. This can lead to considerable and negative\nimpacts on society. Identifying, labelling and debunking online misinformation\nas early as possible has become an increasingly urgent problem. Many methods\nhave been proposed to detect fake news including many deep learning and\ngraph-based approaches. In recent years, graph-based methods have yielded\nstrong results, as they can closely model the social context and propagation\nprocess of online news. 
In this paper, we present a systematic review of fake\nnews detection studies based on graph-based and deep learning-based techniques.\nWe classify existing graph-based methods into knowledge-driven methods,\npropagation-based methods, and heterogeneous social context-based methods,\ndepending on how a graph structure is constructed to model news related\ninformation flows. We further discuss the challenges and open problems in\ngraph-based fake news detection and identify future research directions.\n","authors":["Shuzhi Gong","Richard O. Sinnott","Jianzhong Qi","Cecile Paris"],"pdf_url":"https://arxiv.org/pdf/2307.12639v1.pdf","comment":"18 pages, 3 tables, 7 figures"},{"id":"http://arxiv.org/abs/2304.03981v2","updated":"2023-07-24T09:24:04Z","published":"2023-04-08T10:47:41Z","title":"Uncertainty-inspired Open Set Learning for Retinal Anomaly\n Identification","summary":" Failure to recognize samples from the classes unseen during training is a\nmajor limitation of artificial intelligence in the real-world implementation\nfor recognition and classification of retinal anomalies. We established an\nuncertainty-inspired open-set (UIOS) model, which was trained with fundus\nimages of 9 retinal conditions. Besides assessing the probability of each\ncategory, UIOS also calculated an uncertainty score to express its confidence.\nOur UIOS model with thresholding strategy achieved an F1 score of 99.55%,\n97.01% and 91.91% for the internal testing set, external target categories\n(TC)-JSIEC dataset and TC-unseen testing set, respectively, compared to the F1\nscore of 92.20%, 80.69% and 64.74% by the standard AI model. Furthermore, UIOS\ncorrectly predicted high uncertainty scores, which would prompt the need for a\nmanual check in the datasets of non-target categories retinal diseases,\nlow-quality fundus images, and non-fundus images. UIOS provides a robust method\nfor real-world screening of retinal anomalies.\n","authors":["Meng Wang","Tian Lin","Lianyu Wang","Aidi Lin","Ke Zou","Xinxing Xu","Yi Zhou","Yuanyuan Peng","Qingquan Meng","Yiming Qian","Guoyao Deng","Zhiqun Wu","Junhong Chen","Jianhong Lin","Mingzhi Zhang","Weifang Zhu","Changqing Zhang","Daoqiang Zhang","Rick Siow Mong Goh","Yong Liu","Chi Pui Pang","Xinjian Chen","Haoyu Chen","Huazhu Fu"],"pdf_url":"https://arxiv.org/pdf/2304.03981v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12636v1","updated":"2023-07-24T09:19:38Z","published":"2023-07-24T09:19:38Z","title":"Identifying drivers and mitigators for congestion and redispatch in the\n German electric power system with explainable AI","summary":" The transition to a sustainable energy supply challenges the operation of\nelectric power systems in manifold ways. Transmission grid loads increase as\nwind and solar power are often installed far away from the consumers. In\nextreme cases, system operators must intervene via countertrading or redispatch\nto ensure grid stability. In this article, we provide a data-driven analysis of\ncongestion in the German transmission grid. We develop an explainable machine\nlearning model to predict the volume of redispatch and countertrade on an\nhourly basis. The model reveals factors that drive or mitigate grid congestion\nand quantifies their impact. We show that, as expected, wind power generation\nis the main driver, but hydropower and cross-border electricity trading also\nplay an essential role. Solar power, on the other hand, has no mitigating\neffect. 
Our results suggest that a change to the market design would alleviate\ncongestion.\n","authors":["Maurizio Titz","Sebastian Pütz","Dirk Witthaut"],"pdf_url":"https://arxiv.org/pdf/2307.12636v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.14430v3","updated":"2023-07-24T09:15:02Z","published":"2022-09-28T21:31:43Z","title":"Minimax Optimal Kernel Operator Learning via Multilevel Training","summary":" Learning mappings between infinite-dimensional function spaces has achieved\nempirical success in many disciplines of machine learning, including generative\nmodeling, functional data analysis, causal inference, and multi-agent\nreinforcement learning. In this paper, we study the statistical limit of\nlearning a Hilbert-Schmidt operator between two infinite-dimensional Sobolev\nreproducing kernel Hilbert spaces. We establish the information-theoretic lower\nbound in terms of the Sobolev Hilbert-Schmidt norm and show that a\nregularization that learns the spectral components below the bias contour and\nignores the ones that are above the variance contour can achieve the optimal\nlearning rate. At the same time, the spectral components between the bias and\nvariance contours give us flexibility in designing computationally feasible\nmachine learning algorithms. Based on this observation, we develop a multilevel\nkernel operator learning algorithm that is optimal when learning linear\noperators between infinite-dimensional function spaces.\n","authors":["Jikai Jin","Yiping Lu","Jose Blanchet","Lexing Ying"],"pdf_url":"https://arxiv.org/pdf/2209.14430v3.pdf","comment":"ICLR 2023 spotlight"},{"id":"http://arxiv.org/abs/2307.12625v1","updated":"2023-07-24T08:56:25Z","published":"2023-07-24T08:56:25Z","title":"De-confounding Representation Learning for Counterfactual Inference on\n Continuous Treatment via Generative Adversarial Network","summary":" Counterfactual inference for continuous rather than binary treatment\nvariables is more common in real-world causal inference tasks. While there are\nalready some sample reweighting methods based on Marginal Structural Model for\neliminating the confounding bias, they generally focus on removing the\ntreatment's linear dependence on confounders and rely on the accuracy of the\nassumed parametric models, which are usually unverifiable. In this paper, we\npropose a de-confounding representation learning (DRL) framework for\ncounterfactual outcome estimation of continuous treatment by generating the\nrepresentations of covariates disentangled with the treatment variables. The\nDRL is a non-parametric model that eliminates both linear and nonlinear\ndependence between treatment and covariates. Specifically, we train the\ncorrelations between the de-confounded representations and the treatment\nvariables against the correlations between the covariate representations and\nthe treatment variables to eliminate confounding bias. Further, a\ncounterfactual inference network is embedded into the framework to make the\nlearned representations serve both de-confounding and trusted inference.\nExtensive experiments on synthetic datasets show that the DRL model performs\nsuperiorly in learning de-confounding representations and outperforms\nstate-of-the-art counterfactual inference models for continuous treatment\nvariables. 
In addition, we apply the DRL model to a real-world medical dataset\nMIMIC and demonstrate a detailed causal relationship between red cell width\ndistribution and mortality.\n","authors":["Yonghe Zhao","Qiang Huang","Haolong Zeng","Yun Pen","Huiyan Sun"],"pdf_url":"https://arxiv.org/pdf/2307.12625v1.pdf","comment":"15 pages,4 figures"},{"id":"http://arxiv.org/abs/2307.12617v1","updated":"2023-07-24T08:46:12Z","published":"2023-07-24T08:46:12Z","title":"Predicting Ordinary Differential Equations with Transformers","summary":" We develop a transformer-based sequence-to-sequence model that recovers\nscalar ordinary differential equations (ODEs) in symbolic form from irregularly\nsampled and noisy observations of a single solution trajectory. We demonstrate\nin extensive empirical evaluations that our model performs better or on par\nwith existing methods in terms of accurate recovery across various settings.\nMoreover, our method is efficiently scalable: after one-time pretraining on a\nlarge set of ODEs, we can infer the governing law of a new observed solution in\na few forward passes of the model.\n","authors":["Sören Becker","Michal Klein","Alexander Neitz","Giambattista Parascandolo","Niki Kilbertus"],"pdf_url":"https://arxiv.org/pdf/2307.12617v1.pdf","comment":"Published at ICML 2023"},{"id":"http://arxiv.org/abs/2307.09458v3","updated":"2023-07-24T08:32:40Z","published":"2023-07-18T17:39:04Z","title":"Does Circuit Analysis Interpretability Scale? Evidence from Multiple\n Choice Capabilities in Chinchilla","summary":" \\emph{Circuit analysis} is a promising technique for understanding the\ninternal mechanisms of language models. However, existing analyses are done in\nsmall models far from the state of the art. To address this, we present a case\nstudy of circuit analysis in the 70B Chinchilla model, aiming to test the\nscalability of circuit analysis. In particular, we study multiple-choice\nquestion answering, and investigate Chinchilla's capability to identify the\ncorrect answer \\emph{label} given knowledge of the correct answer \\emph{text}.\nWe find that the existing techniques of logit attribution, attention pattern\nvisualization, and activation patching naturally scale to Chinchilla, allowing\nus to identify and categorize a small set of `output nodes' (attention heads\nand MLPs).\n We further study the `correct letter' category of attention heads aiming to\nunderstand the semantics of their features, with mixed results. For normal\nmultiple-choice question answers, we significantly compress the query, key and\nvalue subspaces of the head without loss of performance when operating on the\nanswer labels for multiple-choice questions, and we show that the query and key\nsubspaces represent an `Nth item in an enumeration' feature to at least some\nextent. 
However, when we attempt to use this explanation to understand the\nheads' behaviour on a more general distribution including randomized answer\nlabels, we find that it is only a partial explanation, suggesting there is more\nto learn about the operation of `correct letter' heads on multiple choice\nquestion answering.\n","authors":["Tom Lieberum","Matthew Rahtz","János Kramár","Neel Nanda","Geoffrey Irving","Rohin Shah","Vladimir Mikulik"],"pdf_url":"https://arxiv.org/pdf/2307.09458v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12607v1","updated":"2023-07-24T08:32:27Z","published":"2023-07-24T08:32:27Z","title":"ExWarp: Extrapolation and Warping-based Temporal Supersampling for\n High-frequency Displays","summary":" High-frequency displays are gaining immense popularity because of their\nincreasing use in video games and virtual reality applications. However, the\nissue is that the underlying GPUs cannot continuously generate frames at this\nhigh rate -- this results in a less smooth and responsive experience.\nFurthermore, if the frame rate is not synchronized with the refresh rate, the\nuser may experience screen tearing and stuttering. Previous works propose\nincreasing the frame rate to provide a smooth experience on modern displays by\npredicting new frames based on past or future frames. Interpolation and\nextrapolation are two widely used algorithms that predict new frames.\nInterpolation requires waiting for the future frame to make a prediction, which\nadds additional latency. On the other hand, extrapolation provides a better\nquality of experience because it relies solely on past frames -- it does not\nincur any additional latency. The simplest method to extrapolate a frame is to\nwarp the previous frame using motion vectors; however, the warped frame may\ncontain improperly rendered visual artifacts due to dynamic objects -- this\nmakes it very challenging to design such a scheme. Past work has used DNNs to\nget good accuracy, however, these approaches are slow. This paper proposes\nExwarp -- an approach based on reinforcement learning (RL) to intelligently\nchoose between the slower DNN-based extrapolation and faster warping-based\nmethods to increase the frame rate by 4x with an almost negligible reduction in\nthe perceived image quality.\n","authors":["Akanksha Dixit","Yashashwee Chakrabarty","Smruti R. Sarangi"],"pdf_url":"https://arxiv.org/pdf/2307.12607v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12601v1","updated":"2023-07-24T08:21:13Z","published":"2023-07-24T08:21:13Z","title":"Concept backpropagation: An Explainable AI approach for visualising\n learned concepts in neural network models","summary":" Neural network models are widely used in a variety of domains, often as\nblack-box solutions, since they are not directly interpretable for humans. The\nfield of explainable artificial intelligence aims at developing explanation\nmethods to address this challenge, and several approaches have been developed\nover the recent years, including methods for investigating what type of\nknowledge these models internalise during the training process. Among these,\nthe method of concept detection, investigates which \\emph{concepts} neural\nnetwork models learn to represent in order to complete their tasks. In this\nwork, we present an extension to the method of concept detection, named\n\\emph{concept backpropagation}, which provides a way of analysing how the\ninformation representing a given concept is internalised in a given neural\nnetwork model. 
In this approach, the model input is perturbed in a manner\nguided by a trained concept probe for the described model, such that the\nconcept of interest is maximised. This allows for the visualisation of the\ndetected concept directly in the input space of the model, which in turn makes\nit possible to see what information the model depends on for representing the\ndescribed concept. We present results for this method applied to a various set\nof input modalities, and discuss how our proposed method can be used to\nvisualise what information trained concept probes use, and the degree as to\nwhich the representation of the probed concept is entangled within the neural\nnetwork model itself.\n","authors":["Patrik Hammersborg","Inga Strümke"],"pdf_url":"https://arxiv.org/pdf/2307.12601v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12594v1","updated":"2023-07-24T08:11:59Z","published":"2023-07-24T08:11:59Z","title":"Optimized data collection and analysis process for studying\n solar-thermal desalination by machine learning","summary":" An effective interdisciplinary study between machine learning and\nsolar-thermal desalination requires a sufficiently large and well-analyzed\nexperimental datasets. This study develops a modified dataset collection and\nanalysis process for studying solar-thermal desalination by machine learning.\nBased on the optimized water condensation and collection process, the proposed\nexperimental method collects over one thousand datasets, which is ten times\nmore than the average number of datasets in previous works, by accelerating\ndata collection and reducing the time by 83.3%. On the other hand, the effects\nof dataset features are investigated by using three different algorithms,\nincluding artificial neural networks, multiple linear regressions, and random\nforests. The investigation focuses on the effects of dataset size and range on\nprediction accuracy, factor importance ranking, and the model's generalization\nability. The results demonstrate that a larger dataset can significantly\nimprove prediction accuracy when using artificial neural networks and random\nforests. Additionally, the study highlights the significant impact of dataset\nsize and range on ranking the importance of influence factors. Furthermore, the\nstudy reveals that the extrapolation data range significantly affects the\nextrapolation accuracy of artificial neural networks. Based on the results,\nmassive dataset collection and analysis of dataset feature effects are\nimportant steps in an effective and consistent machine learning process flow\nfor solar-thermal desalination, which can promote machine learning as a more\ngeneral tool in the field of solar-thermal desalination.\n","authors":["Guilong Peng","Senshan Sun","Yangjun Qin","Zhenwei Xu","Juxin Du","Swellam W. sharshir","A. W. Kandel","A. E. Kabeel","Nuo Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12594v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.07515v2","updated":"2023-07-24T08:10:52Z","published":"2023-04-15T09:39:52Z","title":"S3M: Scalable Statistical Shape Modeling through Unsupervised\n Correspondences","summary":" Statistical shape models (SSMs) are an established way to represent the\nanatomy of a population with various clinically relevant applications. However,\nthey typically require domain expertise, and labor-intensive landmark\nannotations to construct. 
We address these shortcomings by proposing an\nunsupervised method that leverages deep geometric features and functional\ncorrespondences to simultaneously learn local and global shape structures\nacross population anatomies. Our pipeline significantly improves unsupervised\ncorrespondence estimation for SSMs compared to baseline methods, even on highly\nirregular surface topologies. We demonstrate this for two different anatomical\nstructures: the thyroid and a multi-chamber heart dataset. Furthermore, our\nmethod is robust enough to learn from noisy neural network predictions,\npotentially enabling scaling SSMs to larger patient populations without manual\nsegmentation annotation.\n","authors":["Lennart Bastian","Alexander Baumann","Emily Hoppe","Vincent Bürgin","Ha Young Kim","Mahdi Saleh","Benjamin Busam","Nassir Navab"],"pdf_url":"https://arxiv.org/pdf/2304.07515v2.pdf","comment":"Accepted at MICCAI 2023. 13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.12586v1","updated":"2023-07-24T07:58:18Z","published":"2023-07-24T07:58:18Z","title":"InVAErt networks: a data-driven framework for emulation, inference and\n identifiability analysis","summary":" Use of generative models and deep learning for physics-based systems is\ncurrently dominated by the task of emulation. However, the remarkable\nflexibility offered by data-driven architectures would suggest to extend this\nrepresentation to other aspects of system synthesis including model inversion\nand identifiability. We introduce inVAErt (pronounced \\emph{invert}) networks,\na comprehensive framework for data-driven analysis and synthesis of parametric\nphysical systems which uses a deterministic encoder and decoder to represent\nthe forward and inverse solution maps, normalizing flow to capture the\nprobabilistic distribution of system outputs, and a variational encoder\ndesigned to learn a compact latent representation for the lack of bijectivity\nbetween inputs and outputs. We formally investigate the selection of penalty\ncoefficients in the loss function and strategies for latent space sampling,\nsince we find that these significantly affect both training and testing\nperformance. We validate our framework through extensive numerical examples,\nincluding simple linear, nonlinear, and periodic maps, dynamical systems, and\nspatio-temporal PDEs.\n","authors":["Guoxiang Grayson Tong","Carlos A. Sing Long","Daniele E. Schiavazzi"],"pdf_url":"https://arxiv.org/pdf/2307.12586v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09087v3","updated":"2023-07-24T07:55:19Z","published":"2023-06-15T12:33:39Z","title":"Deep learning based Meta-modeling for Multi-objective Technology\n Optimization of Electrical Machines","summary":" Optimization of rotating electrical machines is both time- and\ncomputationally expensive. Because of the different parametrization, design\noptimization is commonly executed separately for each machine technology. In\nthis paper, we present the application of a variational auto-encoder (VAE) to\noptimize two different machine technologies simultaneously, namely an\nasynchronous machine and a permanent magnet synchronous machine. After\ntraining, we employ a deep neural network and a decoder as meta-models to\npredict global key performance indicators (KPIs) and generate associated new\ndesigns, respectively, through unified latent space in the optimization loop.\nNumerical results demonstrate concurrent parametric multi-objective technology\noptimization in the high-dimensional design space. 
The VAE-based approach is\nquantitatively compared to a classical deep learning-based direct approach for\nKPIs prediction.\n","authors":["Vivek Parekh","Dominik Flore","Sebastian Schöps"],"pdf_url":"https://arxiv.org/pdf/2306.09087v3.pdf","comment":"12 pages, 15 figures"},{"id":"http://arxiv.org/abs/2307.12576v1","updated":"2023-07-24T07:47:21Z","published":"2023-07-24T07:47:21Z","title":"Self-refining of Pseudo Labels for Music Source Separation with Noisy\n Labeled Data","summary":" Music source separation (MSS) faces challenges due to the limited\navailability of correctly-labeled individual instrument tracks. With the push\nto acquire larger datasets to improve MSS performance, the inevitability of\nencountering mislabeled individual instrument tracks becomes a significant\nchallenge to address. This paper introduces an automated technique for refining\nthe labels in a partially mislabeled dataset. Our proposed self-refining\ntechnique, employed with a noisy-labeled dataset, results in only a 1% accuracy\ndegradation in multi-label instrument recognition compared to a classifier\ntrained on a clean-labeled dataset. The study demonstrates the importance of\nrefining noisy-labeled data in MSS model training and shows that utilizing the\nrefined dataset leads to comparable results derived from a clean-labeled\ndataset. Notably, upon only access to a noisy dataset, MSS models trained on a\nself-refined dataset even outperform those trained on a dataset refined with a\nclassifier trained on clean labels.\n","authors":["Junghyun Koo","Yunkee Chae","Chang-Bin Jeon","Kyogu Lee"],"pdf_url":"https://arxiv.org/pdf/2307.12576v1.pdf","comment":"24th International Society for Music Information Retrieval Conference\n (ISMIR 2023)"},{"id":"http://arxiv.org/abs/2306.16264v2","updated":"2023-07-24T07:30:53Z","published":"2023-06-28T14:46:55Z","title":"Deep Unfolded Simulated Bifurcation for Massive MIMO Signal Detection","summary":" Multiple-input multiple-output (MIMO) is a key ingredient of next-generation\nwireless communications. Recently, various MIMO signal detectors based on deep\nlearning techniques and quantum(-inspired) algorithms have been proposed to\nimprove the detection performance compared with conventional detectors. This\npaper focuses on the simulated bifurcation (SB) algorithm, a quantum-inspired\nalgorithm. This paper proposes two techniques to improve its detection\nperformance. The first is modifying the algorithm inspired by the\nLevenberg-Marquardt algorithm to eliminate local minima of maximum likelihood\ndetection. The second is the use of deep unfolding, a deep learning technique\nto train the internal parameters of an iterative algorithm. We propose a\ndeep-unfolded SB by making the update rule of SB differentiable. The numerical\nresults show that these proposed detectors significantly improve the signal\ndetection performance in massive MIMO systems.\n","authors":["Satoshi Takabe"],"pdf_url":"https://arxiv.org/pdf/2306.16264v2.pdf","comment":"5pages, 4 figures; codes are available at\n https://github.com/s-takabe/unfolded_simbif"},{"id":"http://arxiv.org/abs/2307.12564v1","updated":"2023-07-24T07:17:33Z","published":"2023-07-24T07:17:33Z","title":"Towards Generalising Neural Topical Representations","summary":" Topic models have evolved from conventional Bayesian probabilistic models to\nNeural Topic Models (NTMs) over the last two decades. 
Although NTMs have\nachieved promising performance when trained and tested on a specific corpus,\ntheir generalisation ability across corpora is rarely studied. In practice, we\noften expect that an NTM trained on a source corpus can still produce quality\ntopical representation for documents in a different target corpus without\nretraining. In this work, we aim to improve NTMs further so that their benefits\ngeneralise reliably across corpora and tasks. To do so, we propose to model\nsimilar documents by minimising their semantical distance when training NTMs.\nSpecifically, similar documents are created by data augmentation during\ntraining; The semantical distance between documents is measured by the\nHierarchical Topic Transport Distance (HOTT), which computes the Optimal\nTransport (OT) distance between the topical representations. Our framework can\nbe readily applied to most NTMs as a plug-and-play module. Extensive\nexperiments show that our framework significantly improves the generalisation\nability regarding neural topical representation across corpora.\n","authors":["Xiaohao Yang","He Zhao","Dinh Phung","Lan Du"],"pdf_url":"https://arxiv.org/pdf/2307.12564v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.09251v2","updated":"2023-07-24T07:08:59Z","published":"2022-11-16T22:50:40Z","title":"Learning-Augmented B-Trees","summary":" We study learning-augmented binary search trees (BSTs) and B-Trees via Treaps\nwith composite priorities. The result is a simple search tree where the depth\nof each item is determined by its predicted weight $w_x$. To achieve the\nresult, each item $x$ has its composite priority\n$-\\lfloor\\log\\log(1/w_x)\\rfloor + U(0, 1)$ where $U(0, 1)$ is the uniform\nrandom variable. This generalizes the recent learning-augmented BSTs\n[Lin-Luo-Woodruff ICML`22], which only work for Zipfian distributions, to\narbitrary inputs and predictions. It also gives the first B-Tree data structure\nthat can provably take advantage of localities in the access sequence via\nonline self-reorganization. The data structure is robust to prediction errors\nand handles insertions, deletions, as well as prediction updates.\n","authors":["Xinyuan Cao","Jingbang Chen","Li Chen","Chris Lambert","Richard Peng","Daniel Sleator"],"pdf_url":"https://arxiv.org/pdf/2211.09251v2.pdf","comment":"25 pages"},{"id":"http://arxiv.org/abs/2307.10617v3","updated":"2023-07-24T07:03:01Z","published":"2023-07-20T06:35:43Z","title":"Unmasking Falsehoods in Reviews: An Exploration of NLP Techniques","summary":" In the contemporary digital landscape, online reviews have become an\nindispensable tool for promoting products and services across various\nbusinesses. Marketers, advertisers, and online businesses have found incentives\nto create deceptive positive reviews for their products and negative reviews\nfor their competitors' offerings. As a result, the writing of deceptive reviews\nhas become an unavoidable practice for businesses seeking to promote themselves\nor undermine their rivals. Detecting such deceptive reviews has become an\nintense and ongoing area of research. This research paper proposes a machine\nlearning model to identify deceptive reviews, with a particular focus on\nrestaurants. This study delves into the performance of numerous experiments\nconducted on a dataset of restaurant reviews known as the Deceptive Opinion\nSpam Corpus. 
To accomplish this, an n-gram model and max features are developed\nto effectively identify deceptive content, particularly focusing on fake\nreviews. A benchmark study is undertaken to explore the performance of two\ndifferent feature extraction techniques, which are then coupled with five\ndistinct machine learning classification algorithms. The experimental results\nreveal that the passive aggressive classifier stands out among the various\nalgorithms, showcasing the highest accuracy not only in text classification but\nalso in identifying fake reviews. Moreover, the research delves into data\naugmentation and implements various deep learning techniques to further enhance\nthe process of detecting deceptive reviews. The findings shed light on the\nefficacy of the proposed machine learning approach and offer valuable insights\ninto dealing with deceptive reviews in the realm of online businesses.\n","authors":["Anusuya Baby Hari Krishnan"],"pdf_url":"https://arxiv.org/pdf/2307.10617v3.pdf","comment":"6 pages, 3 figures"},{"id":"http://arxiv.org/abs/2307.12555v1","updated":"2023-07-24T06:41:59Z","published":"2023-07-24T06:41:59Z","title":"Homophily-Driven Sanitation View for Robust Graph Contrastive Learning","summary":" We investigate adversarial robustness of unsupervised Graph Contrastive\nLearning (GCL) against structural attacks. First, we provide a comprehensive\nempirical and theoretical analysis of existing attacks, revealing how and why\nthey downgrade the performance of GCL. Inspired by our analytic results, we\npresent a robust GCL framework that integrates a homophily-driven sanitation\nview, which can be learned jointly with contrastive learning. A key challenge\nthis poses, however, is the non-differentiable nature of the sanitation\nobjective. To address this challenge, we propose a series of techniques to\nenable gradient-based end-to-end robust GCL. Moreover, we develop a fully\nunsupervised hyperparameter tuning method which, unlike prior approaches, does\nnot require knowledge of node labels. We conduct extensive experiments to\nevaluate the performance of our proposed model, GCHS (Graph Contrastive\nLearning with Homophily-driven Sanitation View), against two state of the art\nstructural attacks on GCL. Our results demonstrate that GCHS consistently\noutperforms all state of the art baselines in terms of the quality of generated\nnode embeddings as well as performance on two important downstream tasks.\n","authors":["Yulin Zhu","Xing Ai","Yevgeniy Vorobeychik","Kai Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.12555v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12551v1","updated":"2023-07-24T06:38:10Z","published":"2023-07-24T06:38:10Z","title":"Continuation Path Learning for Homotopy Optimization","summary":" Homotopy optimization is a traditional method to deal with a complicated\noptimization problem by solving a sequence of easy-to-hard surrogate\nsubproblems. However, this method can be very sensitive to the continuation\nschedule design and might lead to a suboptimal solution to the original\nproblem. In addition, the intermediate solutions, often ignored by classic\nhomotopy optimization, could be useful for many real-world applications. In\nthis work, we propose a novel model-based approach to learn the whole\ncontinuation path for homotopy optimization, which contains infinite\nintermediate solutions for any surrogate subproblems. 
Rather than the classic\nunidirectional easy-to-hard optimization, our method can simultaneously\noptimize the original problem and all surrogate subproblems in a collaborative\nmanner. The proposed model also supports real-time generation of any\nintermediate solution, which could be desirable for many applications.\nExperimental studies on different problems show that our proposed method can\nsignificantly improve the performance of homotopy optimization and provide\nextra helpful information to support better decision-making.\n","authors":["Xi Lin","Zhiyuan Yang","Xiaoyuan Zhang","Qingfu Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12551v1.pdf","comment":"Accepted by the 40th International Conference on Machine Learning\n (ICML 2023)"},{"id":"http://arxiv.org/abs/2304.12438v2","updated":"2023-07-24T06:19:17Z","published":"2023-04-24T20:24:07Z","title":"Stochastic MPC for energy hubs using data driven demand forecasting","summary":" Energy hubs convert and distribute energy resources by combining different\nenergy inputs through multiple conversion and storage components. The optimal\noperation of the energy hub exploits its flexibility to increase the energy\nefficiency and reduce the operational costs. However, uncertainties in the\ndemand present challenges to energy hub optimization. In this paper, we propose\na stochastic MPC controller to minimize energy costs using chance constraints\nfor the uncertain electricity and thermal demands. Historical data is used to\nbuild a demand prediction model based on Gaussian processes to generate a\nforecast of the future electricity and heat demands. The stochastic\noptimization problem is solved via the Scenario Approach by sampling multi-step\ndemand trajectories from the derived prediction model. The performance of the\nproposed predictor and of the stochastic controller is verified on a simulated\nenergy hub model and demand data from a real building.\n","authors":["Varsha Behrunani","Francesco Micheli","Jonas Mehr","Philipp Heer","John Lygeros"],"pdf_url":"https://arxiv.org/pdf/2304.12438v2.pdf","comment":"6 pages, 5 figures. Submitted to IFAC World Congress 2023"},{"id":"http://arxiv.org/abs/2211.09710v3","updated":"2023-07-24T05:39:27Z","published":"2022-11-17T17:45:59Z","title":"Style Classification of Rabbinic Literature for Detection of Lost\n Midrash Tanhuma Material","summary":" Midrash collections are complex rabbinic works that consist of text in\nmultiple languages, which evolved through long processes of unstable oral and\nwritten transmission. Determining the origin of a given passage in such a\ncompilation is not always straightforward and is often a matter of dispute\namong scholars, yet it is essential for scholars' understanding of the passage\nand its relationship to other texts in the rabbinic corpus. To help solve this\nproblem, we propose a system for classification of rabbinic literature based on\nits style, leveraging recent advances in natural language processing for Hebrew\ntexts. 
Additionally, we demonstrate how this method can be applied to uncover\nlost material from a specific midrash genre, Tan\\d{h}uma-Yelammedenu, that has\nbeen preserved in later anthologies.\n","authors":["Shlomo Tannor","Nachum Dershowitz","Moshe Lavee"],"pdf_url":"https://arxiv.org/pdf/2211.09710v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12532v1","updated":"2023-07-24T05:36:19Z","published":"2023-07-24T05:36:19Z","title":"On the Connection between Pre-training Data Diversity and Fine-tuning\n Robustness","summary":" Pre-training has been widely adopted in deep learning to improve model\nperformance, especially when the training data for a target task is limited. In\nour work, we seek to understand the implications of this training strategy on\nthe generalization properties of downstream models. More specifically, we ask\nthe following question: how do properties of the pre-training distribution\naffect the robustness of a fine-tuned model? The properties we explore include\nthe label space, label semantics, image diversity, data domains, and data\nquantity of the pre-training distribution. We find that the primary factor\ninfluencing downstream effective robustness (Taori et al., 2020) is data\nquantity, while other factors have limited significance. For example, reducing\nthe number of ImageNet pre-training classes by 4x while increasing the number\nof images per class by 4x (that is, keeping total data quantity fixed) does not\nimpact the robustness of fine-tuned models. We demonstrate our findings on\npre-training distributions drawn from various natural and synthetic data\nsources, primarily using the iWildCam-WILDS distribution shift as a test for\ndownstream robustness.\n","authors":["Vivek Ramanujan","Thao Nguyen","Sewoong Oh","Ludwig Schmidt","Ali Farhadi"],"pdf_url":"https://arxiv.org/pdf/2307.12532v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12526v1","updated":"2023-07-24T04:56:23Z","published":"2023-07-24T04:56:23Z","title":"Rethinking Medical Report Generation: Disease Revealing Enhancement with\n Knowledge Graph","summary":" Knowledge Graph (KG) plays a crucial role in Medical Report Generation (MRG)\nbecause it reveals the relations among diseases and thus can be utilized to\nguide the generation process. However, constructing a comprehensive KG is\nlabor-intensive and its applications on the MRG process are under-explored. In\nthis study, we establish a complete KG on chest X-ray imaging that includes 137\ntypes of diseases and abnormalities. Based on this KG, we find that the current\nMRG data sets exhibit a long-tailed problem in disease distribution. To\nmitigate this problem, we introduce a novel augmentation strategy that enhances\nthe representation of disease types in the tail-end of the distribution. We\nfurther design a two-stage MRG approach, where a classifier is first trained to\ndetect whether the input images exhibit any abnormalities. The classified\nimages are then independently fed into two transformer-based generators,\nnamely, ``disease-specific generator\" and ``disease-free generator\" to generate\nthe corresponding reports. To enhance the clinical evaluation of whether the\ngenerated reports correctly describe the diseases appearing in the input image,\nwe propose diverse sensitivity (DS), a new metric that checks whether generated\ndiseases match ground truth and measures the diversity of all generated\ndiseases. 
Results show that the proposed two-stage generation framework and\naugmentation strategies improve DS by a considerable margin, indicating a\nnotable reduction in the long-tailed problem associated with under-represented\ndiseases.\n","authors":["Yixin Wang","Zihao Lin","Haoyu Dong"],"pdf_url":"https://arxiv.org/pdf/2307.12526v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12524v1","updated":"2023-07-24T04:46:22Z","published":"2023-07-24T04:46:22Z","title":"Landslide Surface Displacement Prediction Based on VSXC-LSTM Algorithm","summary":" Landslide is a natural disaster that can easily threaten local ecology,\npeople's lives and property. In this paper, we conduct modelling research on\nreal unidirectional surface displacement data of recent landslides in the\nresearch area and propose a time series prediction framework named\nVMD-SegSigmoid-XGBoost-ClusterLSTM (VSXC-LSTM) based on variational mode\ndecomposition, which can predict the landslide surface displacement more\naccurately. The model performs well on the test set. Except for the random item\nsubsequence that is hard to fit, the root mean square error (RMSE) and the mean\nabsolute percentage error (MAPE) of the trend item subsequence and the periodic\nitem subsequence are both less than 0.1, and the RMSE is as low as 0.006 for\nthe periodic item prediction module based on XGBoost\\footnote{Accepted in\nICANN2023}.\n","authors":["Menglin Kong","Ruichen Li","Fan Liu","Xingquan Li","Juan Cheng","Muzhou Hou","Cong Cao"],"pdf_url":"https://arxiv.org/pdf/2307.12524v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12520v1","updated":"2023-07-24T04:29:43Z","published":"2023-07-24T04:29:43Z","title":"Lost In Translation: Generating Adversarial Examples Robust to\n Round-Trip Translation","summary":" Language Models today provide a high accuracy across a large number of\ndownstream tasks. However, they remain susceptible to adversarial attacks,\nparticularly against those where the adversarial examples maintain considerable\nsimilarity to the original text. Given the multilingual nature of text, the\neffectiveness of adversarial examples across translations and how machine\ntranslations can improve the robustness of adversarial examples remain largely\nunexplored. In this paper, we present a comprehensive study on the robustness\nof current text adversarial attacks to round-trip translation. We demonstrate\nthat 6 state-of-the-art text-based adversarial attacks do not maintain their\nefficacy after round-trip translation. Furthermore, we introduce an\nintervention-based solution to this problem, by integrating Machine Translation\ninto the process of adversarial example generation and demonstrating increased\nrobustness to round-trip translation. 
Our results indicate that finding\nadversarial examples robust to translation can help identify the insufficiency\nof language models that is common across languages, and motivate further\nresearch into multilingual adversarial attacks.\n","authors":["Neel Bhandari","Pin-Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2307.12520v1.pdf","comment":"Published at International Conference on Acoustics, Speech, and\n Signal Processing (ICASSP) 2023"},{"id":"http://arxiv.org/abs/2307.12519v1","updated":"2023-07-24T04:29:00Z","published":"2023-07-24T04:29:00Z","title":"DEPHN: Different Expression Parallel Heterogeneous Network using virtual\n gradient optimization for Multi-task Learning","summary":" Recommendation system algorithm based on multi-task learning (MTL) is the\nmajor method for Internet operators to understand users and predict their\nbehaviors in the multi-behavior scenario of platform. Task correlation is an\nimportant consideration of MTL goals, traditional models use shared-bottom\nmodels and gating experts to realize shared representation learning and\ninformation differentiation. However, The relationship between real-world tasks\nis often more complex than existing methods do not handle properly sharing\ninformation. In this paper, we propose an Different Expression Parallel\nHeterogeneous Network (DEPHN) to model multiple tasks simultaneously. DEPHN\nconstructs the experts at the bottom of the model by using different feature\ninteraction methods to improve the generalization ability of the shared\ninformation flow. In view of the model's differentiating ability for different\ntask information flows, DEPHN uses feature explicit mapping and virtual\ngradient coefficient for expert gating during the training process, and\nadaptively adjusts the learning intensity of the gated unit by considering the\ndifference of gating values and task correlation. Extensive experiments on\nartificial and real-world datasets demonstrate that our proposed method can\ncapture task correlation in complex situations and achieve better performance\nthan baseline models\\footnote{Accepted in IJCNN2023}.\n","authors":["Menglin Kong","Ri Su","Shaojie Zhao","Muzhou Hou"],"pdf_url":"https://arxiv.org/pdf/2307.12519v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12518v1","updated":"2023-07-24T04:23:08Z","published":"2023-07-24T04:23:08Z","title":"FaFCNN: A General Disease Classification Framework Based on Feature\n Fusion Neural Networks","summary":" There are two fundamental problems in applying deep learning/machine learning\nmethods to disease classification tasks, one is the insufficient number and\npoor quality of training samples; another one is how to effectively fuse\nmultiple source features and thus train robust classification models. To\naddress these problems, inspired by the process of human learning knowledge, we\npropose the Feature-aware Fusion Correlation Neural Network (FaFCNN), which\nintroduces a feature-aware interaction module and a feature alignment module\nbased on domain adversarial learning. This is a general framework for disease\nclassification, and FaFCNN improves the way existing methods obtain sample\ncorrelation features. The experimental results show that training using\naugmented features obtained by pre-training gradient boosting decision tree\nyields more performance gains than random-forest based methods. 
On the\nlow-quality dataset with a large amount of missing data in our setup, FaFCNN\nobtains a consistently optimal performance compared to competitive baselines.\nIn addition, extensive experiments demonstrate the robustness of the proposed\nmethod and the effectiveness of each component of the model\\footnote{Accepted\nin IEEE SMC2023}.\n","authors":["Menglin Kong","Shaojie Zhao","Juan Cheng","Xingquan Li","Ri Su","Muzhou Hou","Cong Cao"],"pdf_url":"https://arxiv.org/pdf/2307.12518v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12510v1","updated":"2023-07-24T03:52:11Z","published":"2023-07-24T03:52:11Z","title":"An Empirical Evaluation of Temporal Graph Benchmark","summary":" In this paper, we conduct an empirical evaluation of Temporal Graph Benchmark\n(TGB) by extending our Dynamic Graph Library (DyGLib) to TGB. Compared with\nTGB, we include eleven popular dynamic graph learning methods for more\nexhaustive comparisons. Through the experiments, we find that (1) some issues\nneed to be addressed in the current version of TGB, including mismatched data\nstatistics, inaccurate evaluation metric computation, and so on; (2) different\nmodels depict varying performance across various datasets, which is in line\nwith previous observations; (3) the performance of some baselines can be\nsignificantly improved over the reported results in TGB when using DyGLib. This\nwork aims to ease the researchers' efforts in evaluating various dynamic graph\nlearning methods on TGB and attempts to offer results that can be directly\nreferenced in the follow-up research. All the used resources in this project\nare publicly available at https://github.com/yule-BUAA/DyGLib_TGB. This work is\nin progress, and feedback from the community is welcomed for improvements.\n","authors":["Le Yu"],"pdf_url":"https://arxiv.org/pdf/2307.12510v1.pdf","comment":"preprint, in progress"},{"id":"http://arxiv.org/abs/2304.03483v2","updated":"2023-07-24T03:28:34Z","published":"2023-04-07T05:29:59Z","title":"RED-PSM: Regularization by Denoising of Partially Separable Models for\n Dynamic Imaging","summary":" Dynamic imaging addresses the recovery of a time-varying 2D or 3D object at\neach time instant using its undersampled measurements. In particular, in the\ncase of dynamic tomography, only a single projection at a single view angle may\nbe available at a time, making the problem severely ill-posed. In this work, we\npropose an approach, RED-PSM, which combines for the first time two powerful\ntechniques to address this challenging imaging problem. The first, are\npartially separable models, which have been used to efficiently introduce a\nlow-rank prior for the spatio-temporal object. The second is the recent\nRegularization by Denoising (RED), which provides a flexible framework to\nexploit the impressive performance of state-of-the-art image denoising\nalgorithms, for various inverse problems. We propose a partially separable\nobjective with RED and a computationally efficient and scalable optimization\nscheme with variable splitting and ADMM. Theoretical analysis proves the\nconvergence of our objective to a value corresponding to a stationary point\nsatisfying the first-order optimality conditions. Convergence is accelerated by\na particular projection-domain-based initialization. We demonstrate the\nperformance and computational improvements of our proposed RED-PSM with a\nlearned image denoiser by comparing it to a recent deep-prior-based method\nknown as TD-DIP. 
Although the main focus is on dynamic tomography, we also show\nthe performance advantages of RED-PSM in a cardiac dynamic MRI setting.\n","authors":["Berk Iskender","Marc L. Klasky","Yoram Bresler"],"pdf_url":"https://arxiv.org/pdf/2304.03483v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12499v1","updated":"2023-07-24T03:10:02Z","published":"2023-07-24T03:10:02Z","title":"AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion\n Models","summary":" Unrestricted adversarial attacks present a serious threat to deep learning\nmodels and adversarial defense techniques. They pose severe security problems\nfor deep learning applications because they can effectively bypass defense\nmechanisms. However, previous attack methods often utilize Generative\nAdversarial Networks (GANs), which are not theoretically provable and thus\ngenerate unrealistic examples by incorporating adversarial objectives,\nespecially for large-scale datasets like ImageNet. In this paper, we propose a\nnew method, called AdvDiff, to generate unrestricted adversarial examples with\ndiffusion models. We design two novel adversarial guidance techniques to\nconduct adversarial sampling in the reverse generation process of diffusion\nmodels. These two techniques are effective and stable to generate high-quality,\nrealistic adversarial examples by integrating gradients of the target\nclassifier interpretably. Experimental results on MNIST and ImageNet datasets\ndemonstrate that AdvDiff is effective to generate unrestricted adversarial\nexamples, which outperforms GAN-based methods in terms of attack performance\nand generation quality.\n","authors":["Xuelong Dai","Kaisheng Liang","Bin Xiao"],"pdf_url":"https://arxiv.org/pdf/2307.12499v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12496v1","updated":"2023-07-24T03:04:10Z","published":"2023-07-24T03:04:10Z","title":"A faster and simpler algorithm for learning shallow networks","summary":" We revisit the well-studied problem of learning a linear combination of $k$\nReLU activations given labeled examples drawn from the standard $d$-dimensional\nGaussian measure. Chen et al. [CDG+23] recently gave the first algorithm for\nthis problem to run in $\\text{poly}(d,1/\\varepsilon)$ time when $k = O(1)$,\nwhere $\\varepsilon$ is the target error. More precisely, their algorithm runs\nin time $(d/\\varepsilon)^{\\mathrm{quasipoly}(k)}$ and learns over multiple\nstages. Here we show that a much simpler one-stage version of their algorithm\nsuffices, and moreover its runtime is only $(d/\\varepsilon)^{O(k^2)}$.\n","authors":["Sitan Chen","Shyam Narayanan"],"pdf_url":"https://arxiv.org/pdf/2307.12496v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2307.12491v1","updated":"2023-07-24T02:50:19Z","published":"2023-07-24T02:50:19Z","title":"Learning Universal and Robust 3D Molecular Representations with Graph\n Convolutional Networks","summary":" To learn accurate representations of molecules, it is essential to consider\nboth chemical and geometric features. To encode geometric information, many\ndescriptors have been proposed in constrained circumstances for specific types\nof molecules and do not have the properties to be ``robust\": 1. Invariant to\nrotations and translations; 2. Injective when embedding molecular structures.\nIn this work, we propose a universal and robust Directional Node Pair (DNP)\ndescriptor based on the graph representations of 3D molecules. 
Our DNP\ndescriptor is more robust than previous ones and can be applied to multiple\nmolecular types. To combine the DNP descriptor and chemical features in\nmolecules, we construct the Robust Molecular Graph Convolutional Network\n(RoM-GCN), which is capable of taking both node and edge features into\nconsideration when generating molecule representations. We evaluate our model\non protein and small molecule datasets. Our results validate the superiority of\nthe DNP descriptor in incorporating 3D geometric information of molecules.\nRoM-GCN outperforms all compared baselines.\n","authors":["Shuo Zhang","Yang Liu","Li Xie","Lei Xie"],"pdf_url":"https://arxiv.org/pdf/2307.12491v1.pdf","comment":"Preprint. Work in progress"},{"id":"http://arxiv.org/abs/2307.01482v2","updated":"2023-07-24T02:40:29Z","published":"2023-07-04T05:19:19Z","title":"Nexus sine qua non: Essentially Connected Networks for Traffic\n Forecasting","summary":" Spatial-temporal graph neural networks (STGNNs) have become the de facto\nmodels for learning spatiotemporal representations of traffic flow. However,\nmodern STGNNs often contain superfluous or obscure components, along with\ncomplex techniques, posing significant challenges in terms of complexity and\nscalability. Such concerns prompt us to rethink the design of neural\narchitectures and to identify the key challenges in traffic forecasting as\nspatial-temporal contextualization. Here, we present an essentially connected\nmodel based on an efficient message-passing backbone, powered by learnable node\nembeddings, without any complex sequential techniques such as TCNs, RNNs, and\nTransformers. Intriguingly, empirical results demonstrate how a simple and\nelegant model with contextualization capability compares favorably w.r.t. the\nstate-of-the-art with elaborate structures, while being much more interpretable\nand computationally efficient for traffic forecasting. We anticipate that our\nfindings will open new horizons for further research to explore the possibility\nof creating simple but effective neural forecasting architectures.\n","authors":["Tong Nie","Guoyang Qin","Yunpeng Wang","Jian Sun"],"pdf_url":"https://arxiv.org/pdf/2307.01482v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.04893v2","updated":"2023-07-24T02:38:09Z","published":"2023-07-10T20:31:23Z","title":"Choosing Well Your Opponents: How to Guide the Synthesis of Programmatic\n Strategies","summary":" This paper introduces Local Learner (2L), an algorithm for providing a set of\nreference strategies to guide the search for programmatic strategies in\ntwo-player zero-sum games. Previous learning algorithms, such as Iterated Best\nResponse (IBR), Fictitious Play (FP), and Double-Oracle (DO), can be\ncomputationally expensive or miss important information for guiding search\nalgorithms. 2L actively selects a set of reference strategies to improve the\nsearch signal. We empirically demonstrate the advantages of our approach while\nguiding a local search algorithm for synthesizing strategies in three games,\nincluding MicroRTS, a challenging real-time strategy game. Results show that 2L\nlearns reference strategies that provide a stronger search signal than IBR, FP,\nand DO. We also simulate a tournament of MicroRTS, where a synthesizer using 2L\noutperformed the winners of the two latest MicroRTS competitions, which were\nprogrammatic strategies written by human programmers.\n","authors":["Rubens O. Moraes","David S. Aleixo","Lucas N. Ferreira","Levi H. S. 
Lelis"],"pdf_url":"https://arxiv.org/pdf/2307.04893v2.pdf","comment":"International Joint Conference on Artificial Intelligence (IJCAI)\n 2023"},{"id":"http://arxiv.org/abs/2307.12480v1","updated":"2023-07-24T02:28:50Z","published":"2023-07-24T02:28:50Z","title":"Learning Resource Allocation Policy: Vertex-GNN or Edge-GNN?","summary":" Graph neural networks (GNNs) update the hidden representations of vertices\n(called Vertex-GNNs) or hidden representations of edges (called Edge-GNNs) by\nprocessing and pooling the information of neighboring vertices and edges and\ncombining it to incorporate graph topology. When learning resource allocation\npolicies, GNNs cannot perform well if their expressive power is weak, i.e., if\nthey cannot differentiate all input features such as channel matrices. In this\npaper, we analyze the expressive power of the Vertex-GNNs and Edge-GNNs for\nlearning three representative wireless policies: link scheduling, power\ncontrol, and precoding policies. We find that the expressive power of the GNNs\ndepends on the linearity and output dimensions of the processing and combination\nfunctions. When linear processors are used, the Vertex-GNNs cannot\ndifferentiate all channel matrices due to the loss of channel information,\nwhile the Edge-GNNs can. When learning the precoding policy, even the\nVertex-GNNs with non-linear processors may not have strong expressive\nability due to the dimension compression. We proceed to provide necessary\nconditions for the GNNs to learn the precoding policy well. Simulation results\nvalidate the analyses and show that the Edge-GNNs can achieve the same\nperformance as the Vertex-GNNs with much lower training and inference time.\n","authors":["Yao Peng","Jia Guo","Chenyang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12480v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.16392v2","updated":"2023-07-24T02:05:50Z","published":"2022-10-28T20:13:00Z","title":"Physics-aware Graph Neural Network for Accurate RNA 3D Structure\n Prediction","summary":" Biological functions of RNAs are determined by their three-dimensional (3D)\nstructures. Thus, given the limited number of experimentally determined RNA\nstructures, the prediction of RNA structures will facilitate elucidating RNA\nfunctions and RNA-targeted drug discovery, but remains a challenging task. In\nthis work, we propose a Graph Neural Network (GNN)-based scoring function\ntrained only with the atomic types and coordinates of limited solved RNA 3D\nstructures for distinguishing accurate structural models. The proposed\nPhysics-aware Multiplex Graph Neural Network (PaxNet) separately models the\nlocal and non-local interactions inspired by molecular mechanics. Furthermore,\nPaxNet contains an attention-based fusion module that learns the individual\ncontribution of each interaction type for the final prediction. We rigorously\nevaluate the performance of PaxNet on two benchmarks and compare it with\nseveral state-of-the-art baselines. The results show that PaxNet significantly\noutperforms all the baselines overall, and demonstrate the potential of PaxNet\nfor improving the 3D structure modeling of RNA and other macromolecules. 
Our\ncode is available at https://github.com/zetayue/Physics-aware-Multiplex-GNN.\n","authors":["Shuo Zhang","Yang Liu","Lei Xie"],"pdf_url":"https://arxiv.org/pdf/2210.16392v2.pdf","comment":"Accepted by the Machine Learning for Structural Biology Workshop\n (MLSB) at the 36th Conference on Neural Information Processing Systems\n (NeurIPS 2022)"},{"id":"http://arxiv.org/abs/2307.12472v1","updated":"2023-07-24T01:58:48Z","published":"2023-07-24T01:58:48Z","title":"Model-free generalized fiducial inference","summary":" Motivated by the need for the development of safe and reliable methods for\nuncertainty quantification in machine learning, I propose and develop ideas for\na model-free statistical framework for imprecise probabilistic prediction\ninference. This framework facilitates uncertainty quantification in the form of\nprediction sets that offer finite sample control of type 1 errors, a property\nshared with conformal prediction sets, but this new approach also offers more\nversatile tools for imprecise probabilistic reasoning. Furthermore, I propose\nand consider the theoretical and empirical properties of a precise\nprobabilistic approximation to the model-free imprecise framework.\nApproximating a belief/plausibility measure pair by an [optimal in some sense]\nprobability measure in the credal set is a critical resolution needed for the\nbroader adoption of imprecise probabilistic approaches to inference in\nstatistical and machine learning communities. It is largely undetermined in the\nstatistical and machine learning literatures, more generally, how to properly\nquantify uncertainty in that there is no generally accepted standard of\naccountability of stated uncertainties. The research I present in this\nmanuscript is aimed at motivating a framework for statistical inference with\nreliability and accountability as the guiding principles.\n","authors":["Jonathan P Williams"],"pdf_url":"https://arxiv.org/pdf/2307.12472v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12463v1","updated":"2023-07-24T00:53:46Z","published":"2023-07-24T00:53:46Z","title":"Rethinking Data Distillation: Do Not Overlook Calibration","summary":" Neural networks trained on distilled data often produce over-confident output\nand require correction by calibration methods. Existing calibration methods\nsuch as temperature scaling and mixup work well for networks trained on\noriginal large-scale data. However, we find that these methods fail to\ncalibrate networks trained on data distilled from large source datasets. In\nthis paper, we show that distilled data lead to networks that are not\ncalibratable due to (i) a more concentrated distribution of the maximum logits\nand (ii) the loss of information that is semantically meaningful but unrelated\nto classification tasks. 
To address this problem, we propose Masked Temperature\nScaling (MTS) and Masked Distillation Training (MDT) which mitigate the\nlimitations of distilled data and achieve better calibration results while\nmaintaining the efficiency of dataset distillation.\n","authors":["Dongyao Zhu","Bowen Lei","Jie Zhang","Yanbo Fang","Ruqi Zhang","Yiqun Xie","Dongkuan Xu"],"pdf_url":"https://arxiv.org/pdf/2307.12463v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12461v1","updated":"2023-07-24T00:16:50Z","published":"2023-07-24T00:16:50Z","title":"Rates of Approximation by ReLU Shallow Neural Networks","summary":" Neural networks activated by the rectified linear unit (ReLU) play a central\nrole in the recent development of deep learning. The topic of approximating\nfunctions from H\\\"older spaces by these networks is crucial for understanding\nthe efficiency of the induced learning algorithms. Although the topic has been\nwell investigated in the setting of deep neural networks with many layers of\nhidden neurons, it is still open for shallow networks having only one hidden\nlayer. In this paper, we provide rates of uniform approximation by these\nnetworks. We show that ReLU shallow neural networks with $m$ hidden neurons can\nuniformly approximate functions from the H\\\"older space $W_\\infty^r([-1, 1]^d)$\nwith rates $O((\\log m)^{\\frac{1}{2} +d}m^{-\\frac{r}{d}\\frac{d+2}{d+4}})$ when\n$r