diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/cache.json b/cache.json new file mode 100644 index 00000000..e2f3c437 --- /dev/null +++ b/cache.json @@ -0,0 +1 @@ +{"2023-07-24T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2307.12981v1","updated":"2023-07-24T17:59:02Z","published":"2023-07-24T17:59:02Z","title":"3D-LLM: Injecting the 3D World into Large Language Models","summary":" Large language models (LLMs) and Vision-Language Models (VLMs) have been\nproven to excel at multiple tasks, such as commonsense reasoning. Powerful as\nthese models can be, they are not grounded in the 3D physical world, which\ninvolves richer concepts such as spatial relationships, affordances, physics,\nlayout, and so on. In this work, we propose to inject the 3D world into large\nlanguage models and introduce a whole new family of 3D-LLMs. Specifically,\n3D-LLMs can take 3D point clouds and their features as input and perform a\ndiverse set of 3D-related tasks, including captioning, dense captioning, 3D\nquestion answering, task decomposition, 3D grounding, 3D-assisted dialog,\nnavigation, and so on. Using three types of prompting mechanisms that we\ndesign, we are able to collect over 300k 3D-language data covering these tasks.\nTo efficiently train 3D-LLMs, we first utilize a 3D feature extractor that\nobtains 3D features from rendered multi- view images. Then, we use 2D VLMs as\nour backbones to train our 3D-LLMs. By introducing a 3D localization mechanism,\n3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show\nthat our model outperforms state-of-the-art baselines by a large margin (e.g.,\nthe BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore,\nexperiments on our held-in datasets for 3D captioning, task composition, and\n3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative\nexamples also show that our model could perform more tasks beyond the scope of\nexisting LLMs and VLMs. Project Page: : https://vis-www.cs.umass.edu/3dllm/.\n","authors":["Yining Hong","Haoyu Zhen","Peihao Chen","Shuhong Zheng","Yilun Du","Zhenfang Chen","Chuang Gan"],"pdf_url":"https://arxiv.org/pdf/2307.12981v1.pdf","comment":"Project Page: : https://vis-www.cs.umass.edu/3dllm/"},{"id":"http://arxiv.org/abs/2307.12976v1","updated":"2023-07-24T17:52:46Z","published":"2023-07-24T17:52:46Z","title":"Evaluating the Ripple Effects of Knowledge Editing in Language Models","summary":" Modern language models capture a large body of factual knowledge. However,\nsome facts can be incorrectly induced or become obsolete over time, resulting\nin factually incorrect generations. This has led to the development of various\nediting methods that allow updating facts encoded by the model. Evaluation of\nthese methods has primarily focused on testing whether an individual fact has\nbeen successfully injected, and if similar predictions for other subjects have\nnot changed. Here we argue that such evaluation is limited, since injecting one\nfact (e.g. ``Jack Depp is the son of Johnny Depp'') introduces a ``ripple\neffect'' in the form of additional facts that the model needs to update\n(e.g.``Jack Depp is the sibling of Lily-Rose Depp''). To address this issue, we\npropose a novel set of evaluation criteria that consider the implications of an\nedit on related facts. Using these criteria, we then construct \\ripple{}, a\ndiagnostic benchmark of 5K factual edits, capturing a variety of types of\nripple effects. 
We evaluate prominent editing methods on \\ripple{}, showing\nthat current methods fail to introduce consistent changes in the model's\nknowledge. In addition, we find that a simple in-context editing baseline\nobtains the best scores on our benchmark, suggesting a promising research\ndirection for model editing.\n","authors":["Roi Cohen","Eden Biran","Ori Yoran","Amir Globerson","Mor Geva"],"pdf_url":"https://arxiv.org/pdf/2307.12976v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12973v1","updated":"2023-07-24T17:49:31Z","published":"2023-07-24T17:49:31Z","title":"Leveraging Label Variation in Large Language Models for Zero-Shot Text\n Classification","summary":" The zero-shot learning capabilities of large language models (LLMs) make them\nideal for text classification without annotation or supervised training. Many\nstudies have shown impressive results across multiple tasks. While tasks, data,\nand results differ widely, their similarities to human annotation can aid us in\ntackling new tasks with minimal expenses. We evaluate using 5 state-of-the-art\nLLMs as \"annotators\" on 5 different tasks (age, gender, topic, sentiment\nprediction, and hate speech detection), across 4 languages: English, French,\nGerman, and Spanish. No single model excels at all tasks, across languages, or\nacross all labels within a task. However, aggregation techniques designed for\nhuman annotators perform substantially better than any one individual model.\nOverall, though, LLMs do not rival even simple supervised models, so they do\nnot (yet) replace the need for human annotation. We also discuss the tradeoffs\nbetween speed, accuracy, cost, and bias when it comes to aggregated model\nlabeling versus human annotation.\n","authors":["Flor Miriam Plaza-del-Arco","Debora Nozza","Dirk Hovy"],"pdf_url":"https://arxiv.org/pdf/2307.12973v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12966v1","updated":"2023-07-24T17:44:58Z","published":"2023-07-24T17:44:58Z","title":"Aligning Large Language Models with Human: A Survey","summary":" Large Language Models (LLMs) trained on extensive textual corpora have\nemerged as leading solutions for a broad array of Natural Language Processing\n(NLP) tasks. Despite their notable performance, these models are prone to\ncertain limitations such as misunderstanding human instructions, generating\npotentially biased content, or factually incorrect (hallucinated) information.\nHence, aligning LLMs with human expectations has become an active area of\ninterest within the research community. This survey presents a comprehensive\noverview of these alignment technologies, including the following aspects. (1)\nData collection: the methods for effectively collecting high-quality\ninstructions for LLM alignment, including the use of NLP benchmarks, human\nannotations, and leveraging strong LLMs. (2) Training methodologies: a detailed\nreview of the prevailing training methods employed for LLM alignment. Our\nexploration encompasses Supervised Fine-tuning, both Online and Offline human\npreference training, along with parameter-efficient training mechanisms. (3)\nModel Evaluation: the methods for evaluating the effectiveness of these\nhuman-aligned LLMs, presenting a multifaceted approach towards their\nassessment. In conclusion, we collate and distill our findings, shedding light\non several promising future research avenues in the field. 
This survey,\ntherefore, serves as a valuable resource for anyone invested in understanding\nand advancing the alignment of LLMs to better suit human-oriented tasks and\nexpectations. An associated GitHub link collecting the latest papers is\navailable at https://github.com/GaryYufei/AlignLLMHumanSurvey.\n","authors":["Yufei Wang","Wanjun Zhong","Liangyou Li","Fei Mi","Xingshan Zeng","Wenyong Huang","Lifeng Shang","Xin Jiang","Qun Liu"],"pdf_url":"https://arxiv.org/pdf/2307.12966v1.pdf","comment":"work in progress"},{"id":"http://arxiv.org/abs/2303.04245v2","updated":"2023-07-24T17:29:04Z","published":"2023-03-07T21:42:17Z","title":"How Do Transformers Learn Topic Structure: Towards a Mechanistic\n Understanding","summary":" While the successes of transformers across many domains are indisputable,\naccurate understanding of the learning mechanics is still largely lacking.\nTheir capabilities have been probed on benchmarks which include a variety of\nstructured and reasoning tasks -- but mathematical understanding is lagging\nsubstantially behind. Recent lines of work have begun studying representational\naspects of this question: that is, the size/depth/complexity of attention-based\nnetworks to perform certain tasks. However, there is no guarantee the learning\ndynamics will converge to the constructions proposed. In our paper, we provide\nfine-grained mechanistic understanding of how transformers learn \"semantic\nstructure\", understood as capturing co-occurrence structure of words.\nPrecisely, we show, through a combination of mathematical analysis and\nexperiments on Wikipedia data and synthetic data modeled by Latent Dirichlet\nAllocation (LDA), that the embedding layer and the self-attention layer encode\nthe topical structure. In the former case, this manifests as higher average\ninner product of embeddings between same-topic words. In the latter, it\nmanifests as higher average pairwise attention between same-topic words. The\nmathematical results involve several assumptions to make the analysis\ntractable, which we verify on data, and might be of independent interest as\nwell.\n","authors":["Yuchen Li","Yuanzhi Li","Andrej Risteski"],"pdf_url":"https://arxiv.org/pdf/2303.04245v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12950v1","updated":"2023-07-24T17:23:22Z","published":"2023-07-24T17:23:22Z","title":"RLCD: Reinforcement Learning from Contrast Distillation for Language\n Model Alignment","summary":" We propose Reinforcement Learning from Contrast Distillation (RLCD), a method\nfor aligning language models to follow natural language principles without\nusing human feedback. RLCD trains a preference model using simulated preference\npairs that contain both a high-quality and low-quality example, generated using\ncontrasting positive and negative prompts. 
The preference model is then used to\nimprove a base unaligned language model via reinforcement learning.\nEmpirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context\ndistillation (Huang et al., 2022) baselines across three diverse alignment\ntasks--harmlessness, helpfulness, and story outline generation--and on both 7B\nand 30B model scales for preference data simulation.\n","authors":["Kevin Yang","Dan Klein","Asli Celikyilmaz","Nanyun Peng","Yuandong Tian"],"pdf_url":"https://arxiv.org/pdf/2307.12950v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12949v1","updated":"2023-07-24T17:22:04Z","published":"2023-07-24T17:22:04Z","title":"Boosting Punctuation Restoration with Data Generation and Reinforcement\n Learning","summary":" Punctuation restoration is an important task in automatic speech recognition\n(ASR) which aim to restore the syntactic structure of generated ASR texts to\nimprove readability. While punctuated texts are abundant from written\ndocuments, the discrepancy between written punctuated texts and ASR texts\nlimits the usability of written texts in training punctuation restoration\nsystems for ASR texts. This paper proposes a reinforcement learning method to\nexploit in-topic written texts and recent advances in large pre-trained\ngenerative language models to bridge this gap. The experiments show that our\nmethod achieves state-of-the-art performance on the ASR test set on two\nbenchmark datasets for punctuation restoration.\n","authors":["Viet Dac Lai","Abel Salinas","Hao Tan","Trung Bui","Quan Tran","Seunghyun Yoon","Hanieh Deilamsalehy","Franck Dernoncourt","Thien Huu Nguyen"],"pdf_url":"https://arxiv.org/pdf/2307.12949v1.pdf","comment":"Accepted at INTERSPEECH 2023, 6 pages"},{"id":"http://arxiv.org/abs/2307.12935v1","updated":"2023-07-24T16:55:37Z","published":"2023-07-24T16:55:37Z","title":"Rule By Example: Harnessing Logical Rules for Explainable Hate Speech\n Detection","summary":" Classic approaches to content moderation typically apply a rule-based\nheuristic approach to flag content. While rules are easily customizable and\nintuitive for humans to interpret, they are inherently fragile and lack the\nflexibility or robustness needed to moderate the vast amount of undesirable\ncontent found online today. Recent advances in deep learning have demonstrated\nthe promise of using highly effective deep neural models to overcome these\nchallenges. However, despite the improved performance, these data-driven models\nlack transparency and explainability, often leading to mistrust from everyday\nusers and a lack of adoption by many platforms. In this paper, we present Rule\nBy Example (RBE): a novel exemplar-based contrastive learning approach for\nlearning from logical rules for the task of textual content moderation. RBE is\ncapable of providing rule-grounded predictions, allowing for more explainable\nand customizable predictions compared to typical deep learning-based\napproaches. We demonstrate that our approach is capable of learning rich rule\nembedding representations using only a few data examples. 
Experimental results\non 3 popular hate speech classification datasets show that RBE is able to\noutperform state-of-the-art deep learning classifiers as well as the use of\nrules in both supervised and unsupervised settings while providing explainable\nmodel predictions via rule-grounding.\n","authors":["Christopher Clarke","Matthew Hall","Gaurav Mittal","Ye Yu","Sandra Sajeev","Jason Mars","Mei Chen"],"pdf_url":"https://arxiv.org/pdf/2307.12935v1.pdf","comment":"ACL 2023 Main Conference"},{"id":"http://arxiv.org/abs/2307.12896v1","updated":"2023-07-24T15:44:23Z","published":"2023-07-24T15:44:23Z","title":"Corrections of Zipf's and Heaps' Laws Derived from Hapax Rate Models","summary":" The article introduces corrections to Zipf's and Heaps' laws based on\nsystematic models of the hapax rate. The derivation rests on two assumptions:\nThe first one is the standard urn model which predicts that marginal frequency\ndistributions for shorter texts look as if word tokens were sampled blindly\nfrom a given longer text. The second assumption posits that the rate of hapaxes\nis a simple function of the text size. Four such functions are discussed: the\nconstant model, the Davis model, the linear model, and the logistic model. It\nis shown that the logistic model yields the best fit.\n","authors":["Łukasz Dębowski"],"pdf_url":"https://arxiv.org/pdf/2307.12896v1.pdf","comment":"41 pages, 7 figures, 3 tables"},{"id":"http://arxiv.org/abs/2304.08649v3","updated":"2023-07-24T15:33:25Z","published":"2023-04-17T22:53:54Z","title":"Classification of US Supreme Court Cases using BERT-Based Techniques","summary":" Models based on bidirectional encoder representations from transformers\n(BERT) produce state of the art (SOTA) results on many natural language\nprocessing (NLP) tasks such as named entity recognition (NER), part-of-speech\n(POS) tagging etc. An interesting phenomenon occurs when classifying long\ndocuments such as those from the US supreme court where BERT-based models can\nbe considered difficult to use on a first-pass or out-of-the-box basis. In this\npaper, we experiment with several BERT-based classification techniques for US\nsupreme court decisions or supreme court database (SCDB) and compare them with\nthe previous SOTA results. We then compare our results specifically with SOTA\nmodels for long documents. We compare our results for two classification tasks:\n(1) a broad classification task with 15 categories and (2) a fine-grained\nclassification task with 279 categories. Our best result produces an accuracy\nof 80\\% on the 15 broad categories and 60\\% on the fine-grained 279 categories\nwhich marks an improvement of 8\\% and 28\\% respectively from previously\nreported SOTA results.\n","authors":["Shubham Vatsal","Adam Meyers","John E. Ortega"],"pdf_url":"https://arxiv.org/pdf/2304.08649v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10490v3","updated":"2023-07-24T15:24:17Z","published":"2023-07-19T23:03:20Z","title":"(Ab)using Images and Sounds for Indirect Instruction Injection in\n Multi-Modal LLMs","summary":" We demonstrate how images and sounds can be used for indirect prompt and\ninstruction injection in multi-modal LLMs. An attacker generates an adversarial\nperturbation corresponding to the prompt and blends it into an image or audio\nrecording. 
When the user asks the (unmodified, benign) model about the\nperturbed image or audio, the perturbation steers the model to output the\nattacker-chosen text and/or make the subsequent dialog follow the attacker's\ninstruction. We illustrate this attack with several proof-of-concept examples\ntargeting LLaVa and PandaGPT.\n","authors":["Eugene Bagdasaryan","Tsung-Yin Hsieh","Ben Nassi","Vitaly Shmatikov"],"pdf_url":"https://arxiv.org/pdf/2307.10490v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12856v1","updated":"2023-07-24T14:56:30Z","published":"2023-07-24T14:56:30Z","title":"A Real-World WebAgent with Planning, Long Context Understanding, and\n Program Synthesis","summary":" Pre-trained large language models (LLMs) have recently achieved better\ngeneralization and sample efficiency in autonomous web navigation. However, the\nperformance on real-world websites has still suffered from (1) open domainness,\n(2) limited context length, and (3) lack of inductive bias on HTML. We\nintroduce WebAgent, an LLM-driven agent that can complete the tasks on real\nwebsites following natural language instructions. WebAgent plans ahead by\ndecomposing instructions into canonical sub-instructions, summarizes long HTML\ndocuments into task-relevant snippets, and acts on websites via generated\nPython programs from those. We design WebAgent with Flan-U-PaLM, for grounded\ncode generation, and HTML-T5, new pre-trained LLMs for long HTML documents\nusing local and global attention mechanisms and a mixture of long-span\ndenoising objectives, for planning and summarization. We empirically\ndemonstrate that our recipe improves the success on a real website by over 50%,\nand that HTML-T5 is the best model to solve HTML-based tasks; achieving 14.9%\nhigher success rate than prior SoTA on the MiniWoB web navigation benchmark and\nbetter accuracy on offline task planning evaluation.\n","authors":["Izzeddin Gur","Hiroki Furuta","Austin Huang","Mustafa Safdari","Yutaka Matsuo","Douglas Eck","Aleksandra Faust"],"pdf_url":"https://arxiv.org/pdf/2307.12856v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12835v1","updated":"2023-07-24T14:33:49Z","published":"2023-07-24T14:33:49Z","title":"Joint Dropout: Improving Generalizability in Low-Resource Neural Machine\n Translation through Phrase Pair Variables","summary":" Despite the tremendous success of Neural Machine Translation (NMT), its\nperformance on low-resource language pairs still remains subpar, partly due to\nthe limited ability to handle previously unseen inputs, i.e., generalization.\nIn this paper, we propose a method called Joint Dropout, that addresses the\nchallenge of low-resource neural machine translation by substituting phrases\nwith variables, resulting in significant enhancement of compositionality, which\nis a key aspect of generalization. We observe a substantial improvement in\ntranslation quality for language pairs with minimal resources, as seen in BLEU\nand Direct Assessment scores. 
Furthermore, we conduct an error analysis, and\nfind Joint Dropout to also enhance generalizability of low-resource NMT in\nterms of robustness and adaptability across different domains\n","authors":["Ali Araabi","Vlad Niculae","Christof Monz"],"pdf_url":"https://arxiv.org/pdf/2307.12835v1.pdf","comment":"Accepted at MT Summit 2023"},{"id":"http://arxiv.org/abs/2307.12803v1","updated":"2023-07-24T13:54:37Z","published":"2023-07-24T13:54:37Z","title":"Guidance in Radiology Report Summarization: An Empirical Evaluation and\n Error Analysis","summary":" Automatically summarizing radiology reports into a concise impression can\nreduce the manual burden of clinicians and improve the consistency of\nreporting. Previous work aimed to enhance content selection and factuality\nthrough guided abstractive summarization. However, two key issues persist.\nFirst, current methods heavily rely on domain-specific resources to extract the\nguidance signal, limiting their transferability to domains and languages where\nthose resources are unavailable. Second, while automatic metrics like ROUGE\nshow progress, we lack a good understanding of the errors and failure modes in\nthis task. To bridge these gaps, we first propose a domain-agnostic guidance\nsignal in form of variable-length extractive summaries. Our empirical results\non two English benchmarks demonstrate that this guidance signal improves upon\nunguided summarization while being competitive with domain-specific methods.\nAdditionally, we run an expert evaluation of four systems according to a\ntaxonomy of 11 fine-grained errors. We find that the most pressing differences\nbetween automatic summaries and those of radiologists relate to content\nselection including omissions (up to 52%) and additions (up to 57%). We\nhypothesize that latent reporting factors and corpus-level inconsistencies may\nlimit models to reliably learn content selection from the available data,\npresenting promising directions for future work.\n","authors":["Jan Trienes","Paul Youssef","Jörg Schlötterer","Christin Seifert"],"pdf_url":"https://arxiv.org/pdf/2307.12803v1.pdf","comment":"Accepted at INLG2023"},{"id":"http://arxiv.org/abs/2307.12798v1","updated":"2023-07-24T13:51:19Z","published":"2023-07-24T13:51:19Z","title":"RRAML: Reinforced Retrieval Augmented Machine Learning","summary":" The emergence of large language models (LLMs) has revolutionized machine\nlearning and related fields, showcasing remarkable abilities in comprehending,\ngenerating, and manipulating human language. However, their conventional usage\nthrough API-based text prompt submissions imposes certain limitations in terms\nof context constraints and external source availability. To address these\nchallenges, we propose a novel framework called Reinforced Retrieval Augmented\nMachine Learning (RRAML). RRAML integrates the reasoning capabilities of LLMs\nwith supporting information retrieved by a purpose-built retriever from a vast\nuser-provided database. By leveraging recent advancements in reinforcement\nlearning, our method effectively addresses several critical challenges.\nFirstly, it circumvents the need for accessing LLM gradients. Secondly, our\nmethod alleviates the burden of retraining LLMs for specific tasks, as it is\noften impractical or impossible due to restricted access to the model and the\ncomputational intensity involved. 
Additionally we seamlessly link the\nretriever's task with the reasoner, mitigating hallucinations and reducing\nirrelevant, and potentially damaging retrieved documents. We believe that the\nresearch agenda outlined in this paper has the potential to profoundly impact\nthe field of AI, democratizing access to and utilization of LLMs for a wide\nrange of entities.\n","authors":["Andrea Bacciu","Florin Cocunasu","Federico Siciliano","Fabrizio Silvestri","Nicola Tonellotto","Giovanni Trappolini"],"pdf_url":"https://arxiv.org/pdf/2307.12798v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2011.12662v4","updated":"2023-07-24T13:22:58Z","published":"2020-11-25T11:44:12Z","title":"XTQA: Span-Level Explanations of the Textbook Question Answering","summary":" Textbook Question Answering (TQA) is a task that one should answer a\ndiagram/non-diagram question given a large multi-modal context consisting of\nabundant essays and diagrams. We argue that the explainability of this task\nshould place students as a key aspect to be considered. To address this issue,\nwe devise a novel architecture towards span-level eXplanations of the TQA\n(XTQA) based on our proposed coarse-to-fine grained algorithm, which can\nprovide not only the answers but also the span-level evidences to choose them\nfor students. This algorithm first coarsely chooses top $M$ paragraphs relevant\nto questions using the TF-IDF method, and then chooses top $K$ evidence spans\nfinely from all candidate spans within these paragraphs by computing the\ninformation gain of each span to questions. Experimental results shows that\nXTQA significantly improves the state-of-the-art performance compared with\nbaselines. The source code is available at\nhttps://github.com/keep-smile-001/opentqa\n","authors":["Jie Ma","Qi Chai","Jun Liu","Qingyu Yin","Pinghui Wang","Qinghua Zheng"],"pdf_url":"https://arxiv.org/pdf/2011.12662v4.pdf","comment":"Accepted by IEEE TNNLS"},{"id":"http://arxiv.org/abs/2307.12759v1","updated":"2023-07-24T13:04:21Z","published":"2023-07-24T13:04:21Z","title":"Code-Switched Urdu ASR for Noisy Telephonic Environment using Data\n Centric Approach with Hybrid HMM and CNN-TDNN","summary":" Call Centers have huge amount of audio data which can be used for achieving\nvaluable business insights and transcription of phone calls is manually tedious\ntask. An effective Automated Speech Recognition system can accurately\ntranscribe these calls for easy search through call history for specific\ncontext and content allowing automatic call monitoring, improving QoS through\nkeyword search and sentiment analysis. ASR for Call Center requires more\nrobustness as telephonic environment are generally noisy. Moreover, there are\nmany low-resourced languages that are on verge of extinction which can be\npreserved with help of Automatic Speech Recognition Technology. Urdu is the\n$10^{th}$ most widely spoken language in the world, with 231,295,440 worldwide\nstill remains a resource constrained language in ASR. Regional call-center\nconversations operate in local language, with a mix of English numbers and\ntechnical terms generally causing a \"code-switching\" problem. Hence, this paper\ndescribes an implementation framework of a resource efficient Automatic Speech\nRecognition/ Speech to Text System in a noisy call-center environment using\nChain Hybrid HMM and CNN-TDNN for Code-Switched Urdu Language. Using Hybrid\nHMM-DNN approach allowed us to utilize the advantages of Neural Network with\nless labelled data. 
Adding CNN with TDNN has shown to work better in noisy\nenvironment due to CNN's additional frequency dimension which captures extra\ninformation from noisy speech, thus improving accuracy. We collected data from\nvarious open sources and labelled some of the unlabelled data after analysing\nits general context and content from Urdu language as well as from commonly\nused words from other languages, primarily English and were able to achieve WER\nof 5.2% with noisy as well as clean environment in isolated words or numbers as\nwell as in continuous spontaneous speech.\n","authors":["Muhammad Danyal Khan","Raheem Ali","Arshad Aziz"],"pdf_url":"https://arxiv.org/pdf/2307.12759v1.pdf","comment":"32 pages, 19 figures, 2 tables, preprint"},{"id":"http://arxiv.org/abs/2305.16731v3","updated":"2023-07-24T11:20:10Z","published":"2023-05-26T08:33:28Z","title":"Automatic Emotion Experiencer Recognition","summary":" The most prominent subtask in emotion analysis is emotion classification; to\nassign a category to a textual unit, for instance a social media post. Many\nresearch questions from the social sciences do, however, not only require the\ndetection of the emotion of an author of a post but to understand who is\nascribed an emotion in text. This task is tackled by emotion role labeling\nwhich aims at extracting who is described in text to experience an emotion,\nwhy, and towards whom. This could, however, be considered overly sophisticated\nif the main question to answer is who feels which emotion. A targeted approach\nfor such setup is to classify emotion experiencer mentions (aka \"emoters\")\nregarding the emotion they presumably perceive. This task is similar to named\nentity recognition of person names with the difference that not every mentioned\nentity name is an emoter. While, very recently, data with emoter annotations\nhas been made available, no experiments have yet been performed to detect such\nmentions. With this paper, we provide baseline experiments to understand how\nchallenging the task is. We further evaluate the impact on experiencer-specific\nemotion categorization and appraisal detection in a pipeline, when gold\nmentions are not available. We show that experiencer detection in text is a\nchallenging task, with a precision of .82 and a recall of .56 (F1 =.66). These\nresults motivate future work of jointly modeling emoter spans and\nemotion/appraisal predictions.\n","authors":["Maximilian Wegge","Roman Klinger"],"pdf_url":"https://arxiv.org/pdf/2305.16731v3.pdf","comment":"accepted to the CPSS workshop at KONVENS"},{"id":"http://arxiv.org/abs/2307.12659v1","updated":"2023-07-24T10:03:28Z","published":"2023-07-24T10:03:28Z","title":"A Model for Every User and Budget: Label-Free and Personalized\n Mixed-Precision Quantization","summary":" Recent advancement in Automatic Speech Recognition (ASR) has produced large\nAI models, which become impractical for deployment in mobile devices. Model\nquantization is effective to produce compressed general-purpose models, however\nsuch models may only be deployed to a restricted sub-domain of interest. We\nshow that ASR models can be personalized during quantization while relying on\njust a small set of unlabelled samples from the target domain. To this end, we\npropose myQASR, a mixed-precision quantization method that generates tailored\nquantization schemes for diverse users under any memory requirement with no\nfine-tuning. 
myQASR automatically evaluates the quantization sensitivity of\nnetwork layers by analysing the full-precision activation values. We are then\nable to generate a personalised mixed-precision quantization scheme for any\npre-determined memory budget. Results for large-scale ASR models show how\nmyQASR improves performance for specific genders, languages, and speakers.\n","authors":["Edward Fish","Umberto Michieli","Mete Ozay"],"pdf_url":"https://arxiv.org/pdf/2307.12659v1.pdf","comment":"INTERSPEECH 2023"},{"id":"http://arxiv.org/abs/2301.09790v3","updated":"2023-07-24T10:03:01Z","published":"2023-01-24T02:44:02Z","title":"The Next Chapter: A Study of Large Language Models in Storytelling","summary":" To enhance the quality of generated stories, recent story generation models\nhave been investigating the utilization of higher-level attributes like plots\nor commonsense knowledge. The application of prompt-based learning with large\nlanguage models (LLMs), exemplified by GPT-3, has exhibited remarkable\nperformance in diverse natural language processing (NLP) tasks. This paper\nconducts a comprehensive investigation, utilizing both automatic and human\nevaluation, to compare the story generation capacity of LLMs with recent models\nacross three datasets with variations in style, register, and length of\nstories. The results demonstrate that LLMs generate stories of significantly\nhigher quality compared to other story generation models. Moreover, they\nexhibit a level of performance that competes with human authors, albeit with\nthe preliminary observation that they tend to replicate real stories in\nsituations involving world knowledge, resembling a form of plagiarism.\n","authors":["Zhuohan Xie","Trevor Cohn","Jey Han Lau"],"pdf_url":"https://arxiv.org/pdf/2301.09790v3.pdf","comment":"Accepted to INLG2023"},{"id":"http://arxiv.org/abs/2304.14721v4","updated":"2023-07-24T09:49:55Z","published":"2023-04-28T09:42:18Z","title":"Towards autonomous system: flexible modular production system enhanced\n with large language model agents","summary":" In this paper, we present a novel framework that combines large language\nmodels (LLMs), digital twins and industrial automation system to enable\nintelligent planning and control of production processes. We retrofit the\nautomation system for a modular production facility and create executable\ncontrol interfaces of fine-granular functionalities and coarse-granular skills.\nLow-level functionalities are executed by automation components, and high-level\nskills are performed by automation modules. Subsequently, a digital twin system\nis developed, registering these interfaces and containing additional\ndescriptive information about the production system. Based on the retrofitted\nautomation system and the created digital twins, LLM-agents are designed to\ninterpret descriptive information in the digital twins and control the physical\nsystem through service interfaces. These LLM-agents serve as intelligent agents\non different levels within an automation system, enabling autonomous planning\nand control of flexible production. Given a task instruction as input, the\nLLM-agents orchestrate a sequence of atomic functionalities and skills to\naccomplish the task. 
We demonstrate how our implemented prototype can handle\nun-predefined tasks, plan a production process, and execute the operations.\nThis research highlights the potential of integrating LLMs into industrial\nautomation systems in the context of smart factory for more agile, flexible,\nand adaptive production processes, while it also underscores the critical\ninsights and limitations for future work. Demos at:\nhttps://github.com/YuchenXia/GPT4IndustrialAutomation\n","authors":["Yuchen Xia","Manthan Shenoy","Nasser Jazdi","Michael Weyrich"],"pdf_url":"https://arxiv.org/pdf/2304.14721v4.pdf","comment":"This is the pre-print draft manuscript. The peer-reviewed version\n will be published exclusively by IEEE after the conference, which is set to\n take place from September 12th to 15th, 2023. We've made several improvements\n to the final version of the paper based on valuable feedback and suggestions\n from other researchers"},{"id":"http://arxiv.org/abs/2307.12639v1","updated":"2023-07-24T09:30:30Z","published":"2023-07-24T09:30:30Z","title":"Fake News Detection Through Graph-based Neural Networks: A Survey","summary":" The popularity of online social networks has enabled rapid dissemination of\ninformation. People now can share and consume information much more rapidly\nthan ever before. However, low-quality and/or accidentally/deliberately fake\ninformation can also spread rapidly. This can lead to considerable and negative\nimpacts on society. Identifying, labelling and debunking online misinformation\nas early as possible has become an increasingly urgent problem. Many methods\nhave been proposed to detect fake news including many deep learning and\ngraph-based approaches. In recent years, graph-based methods have yielded\nstrong results, as they can closely model the social context and propagation\nprocess of online news. In this paper, we present a systematic review of fake\nnews detection studies based on graph-based and deep learning-based techniques.\nWe classify existing graph-based methods into knowledge-driven methods,\npropagation-based methods, and heterogeneous social context-based methods,\ndepending on how a graph structure is constructed to model news related\ninformation flows. We further discuss the challenges and open problems in\ngraph-based fake news detection and identify future research directions.\n","authors":["Shuzhi Gong","Richard O. Sinnott","Jianzhong Qi","Cecile Paris"],"pdf_url":"https://arxiv.org/pdf/2307.12639v1.pdf","comment":"18 pages, 3 tables, 7 figures"},{"id":"http://arxiv.org/abs/2210.04676v2","updated":"2023-07-24T09:00:03Z","published":"2022-10-10T13:26:45Z","title":"Learning \"O\" Helps for Learning More: Handling the Concealed Entity\n Problem for Class-incremental NER","summary":" As the categories of named entities rapidly increase, the deployed NER models\nare required to keep updating toward recognizing more entity types, creating a\ndemand for class-incremental learning for NER. Considering the privacy concerns\nand storage constraints, the standard paradigm for class-incremental NER\nupdates the models with training data only annotated with the new classes, yet\nthe entities from other entity classes are unlabeled, regarded as \"Non-entity\"\n(or \"O\"). In this work, we conduct an empirical study on the \"Unlabeled Entity\nProblem\" and find that it leads to severe confusion between \"O\" and entities,\ndecreasing class discrimination of old classes and declining the model's\nability to learn new classes. 
To solve the Unlabeled Entity Problem, we propose\na novel representation learning method to learn discriminative representations\nfor the entity classes and \"O\". Specifically, we propose an entity-aware\ncontrastive learning method that adaptively detects entity clusters in \"O\".\nFurthermore, we propose two effective distance-based relabeling strategies for\nbetter learning the old classes. We introduce a more realistic and challenging\nbenchmark for class-incremental NER, and the proposed method achieves up to\n10.62\\% improvement over the baseline methods.\n","authors":["Ruotian Ma","Xuanting Chen","Lin Zhang","Xin Zhou","Junzhe Wang","Tao Gui","Qi Zhang","Xiang Gao","Yunwen Chen"],"pdf_url":"https://arxiv.org/pdf/2210.04676v2.pdf","comment":"Accepted by ACL 2023"},{"id":"http://arxiv.org/abs/2306.16108v2","updated":"2023-07-24T08:14:44Z","published":"2023-06-28T11:24:48Z","title":"Is ChatGPT a Biomedical Expert? -- Exploring the Zero-Shot Performance\n of Current GPT Models in Biomedical Tasks","summary":" We assessed the performance of commercial Large Language Models (LLMs)\nGPT-3.5-Turbo and GPT-4 on tasks from the 2023 BioASQ challenge. In Task 11b\nPhase B, which is focused on answer generation, both models demonstrated\ncompetitive abilities with leading systems. Remarkably, they achieved this with\nsimple zero-shot learning, grounded with relevant snippets. Even without\nrelevant snippets, their performance was decent, though not on par with the\nbest systems. Interestingly, the older and cheaper GPT-3.5-Turbo system was\nable to compete with GPT-4 in the grounded Q&A setting on factoid and list\nanswers. In Task 11b Phase A, focusing on retrieval, query expansion through\nzero-shot learning improved performance, but the models fell short compared to\nother systems. The code needed to rerun these experiments is available through\nGitHub.\n","authors":["Samy Ateia","Udo Kruschwitz"],"pdf_url":"https://arxiv.org/pdf/2306.16108v2.pdf","comment":"Preprint accepted at the 11th BioASQ Workshop at the 14th Conference\n and Labs of the Evaluation Forum (CLEF) 2023; Changes: 1. Added related work\n and experimental setup sections. 2. Reworked discussion and future work\n section. 3. Fixed multiple typos and improved style. Changed license"},{"id":"http://arxiv.org/abs/2307.12573v1","updated":"2023-07-24T07:40:59Z","published":"2023-07-24T07:40:59Z","title":"Tachikuma: Understading Complex Interactions with Multi-Character and\n Novel Objects by Large Language Models","summary":" Recent advancements in natural language and Large Language Models (LLMs) have\nenabled AI agents to simulate human-like interactions within virtual worlds.\nHowever, these interactions still face limitations in complexity and\nflexibility, particularly in scenarios involving multiple characters and novel\nobjects. Pre-defining all interactable objects in the agent's world model\npresents challenges, and conveying implicit intentions to multiple characters\nthrough complex interactions remains difficult. To address these issues, we\npropose integrating virtual Game Masters (GMs) into the agent's world model,\ndrawing inspiration from Tabletop Role-Playing Games (TRPGs). GMs play a\ncrucial role in overseeing information, estimating players' intentions,\nproviding environment descriptions, and offering feedback, compensating for\ncurrent world model deficiencies. 
To facilitate future explorations for complex\ninteractions, we introduce a benchmark named Tachikuma, comprising a Multiple\ncharacter and novel Object based interaction Estimation (MOE) task and a\nsupporting dataset. MOE challenges models to understand characters' intentions\nand accurately determine their actions within intricate contexts involving\nmulti-character and novel object interactions. Besides, the dataset captures\nlog data from real-time communications during gameplay, providing diverse,\ngrounded, and complex interactions for further explorations. Finally, we\npresent a simple prompting baseline and evaluate its performance, demonstrating\nits effectiveness in enhancing interaction understanding. We hope that our\ndataset and task will inspire further research in complex interactions with\nnatural language, fostering the development of more advanced AI agents.\n","authors":["Yuanzhi Liang","Linchao Zhu","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12573v1.pdf","comment":"Preliminary version of an ongoing work"},{"id":"http://arxiv.org/abs/2307.12564v1","updated":"2023-07-24T07:17:33Z","published":"2023-07-24T07:17:33Z","title":"Towards Generalising Neural Topical Representations","summary":" Topic models have evolved from conventional Bayesian probabilistic models to\nNeural Topic Models (NTMs) over the last two decays. Although NTMs have\nachieved promising performance when trained and tested on a specific corpus,\ntheir generalisation ability across corpora is rarely studied. In practice, we\noften expect that an NTM trained on a source corpus can still produce quality\ntopical representation for documents in a different target corpus without\nretraining. In this work, we aim to improve NTMs further so that their benefits\ngeneralise reliably across corpora and tasks. To do so, we propose to model\nsimilar documents by minimising their semantical distance when training NTMs.\nSpecifically, similar documents are created by data augmentation during\ntraining; The semantical distance between documents is measured by the\nHierarchical Topic Transport Distance (HOTT), which computes the Optimal\nTransport (OT) distance between the topical representations. Our framework can\nbe readily applied to most NTMs as a plug-and-play module. Extensive\nexperiments show that our framework significantly improves the generalisation\nability regarding neural topical representation across corpora.\n","authors":["Xiaohao Yang","He Zhao","Dinh Phung","Lan Du"],"pdf_url":"https://arxiv.org/pdf/2307.12564v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2103.11578v2","updated":"2023-07-24T06:53:10Z","published":"2021-03-22T04:44:43Z","title":"SparseGAN: Sparse Generative Adversarial Network for Text Generation","summary":" It is still a challenging task to learn a neural text generation model under\nthe framework of generative adversarial networks (GANs) since the entire\ntraining process is not differentiable. The existing training strategies either\nsuffer from unreliable gradient estimations or imprecise sentence\nrepresentations. Inspired by the principle of sparse coding, we propose a\nSparseGAN that generates semantic-interpretable, but sparse sentence\nrepresentations as inputs to the discriminator. The key idea is that we treat\nan embedding matrix as an over-complete dictionary, and use a linear\ncombination of very few selected word embeddings to approximate the output\nfeature representation of the generator at each time step. 
With such\nsemantic-rich representations, we not only reduce unnecessary noises for\nefficient adversarial training, but also make the entire training process fully\ndifferentiable. Experiments on multiple text generation datasets yield\nperformance improvements, especially in sequence-level metrics, such as BLEU.\n","authors":["Liping Yuan","Jiehang Zeng","Xiaoqing Zheng"],"pdf_url":"https://arxiv.org/pdf/2103.11578v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.09710v3","updated":"2023-07-24T05:39:27Z","published":"2022-11-17T17:45:59Z","title":"Style Classification of Rabbinic Literature for Detection of Lost\n Midrash Tanhuma Material","summary":" Midrash collections are complex rabbinic works that consist of text in\nmultiple languages, which evolved through long processes of unstable oral and\nwritten transmission. Determining the origin of a given passage in such a\ncompilation is not always straightforward and is often a matter of dispute\namong scholars, yet it is essential for scholars' understanding of the passage\nand its relationship to other texts in the rabbinic corpus. To help solve this\nproblem, we propose a system for classification of rabbinic literature based on\nits style, leveraging recent advances in natural language processing for Hebrew\ntexts. Additionally, we demonstrate how this method can be applied to uncover\nlost material from a specific midrash genre, Tan\\d{h}uma-Yelammedenu, that has\nbeen preserved in later anthologies.\n","authors":["Shlomo Tannor","Nachum Dershowitz","Moshe Lavee"],"pdf_url":"https://arxiv.org/pdf/2211.09710v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12520v1","updated":"2023-07-24T04:29:43Z","published":"2023-07-24T04:29:43Z","title":"Lost In Translation: Generating Adversarial Examples Robust to\n Round-Trip Translation","summary":" Language Models today provide a high accuracy across a large number of\ndownstream tasks. However, they remain susceptible to adversarial attacks,\nparticularly against those where the adversarial examples maintain considerable\nsimilarity to the original text. Given the multilingual nature of text, the\neffectiveness of adversarial examples across translations and how machine\ntranslations can improve the robustness of adversarial examples remain largely\nunexplored. In this paper, we present a comprehensive study on the robustness\nof current text adversarial attacks to round-trip translation. We demonstrate\nthat 6 state-of-the-art text-based adversarial attacks do not maintain their\nefficacy after round-trip translation. Furthermore, we introduce an\nintervention-based solution to this problem, by integrating Machine Translation\ninto the process of adversarial example generation and demonstrating increased\nrobustness to round-trip translation. 
Our results indicate that finding\nadversarial examples robust to translation can help identify the insufficiency\nof language models that is common across languages, and motivate further\nresearch into multilingual adversarial attacks.\n","authors":["Neel Bhandari","Pin-Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2307.12520v1.pdf","comment":"Published at International Conference on Acoustics, Speech, and\n Signal Processing (ICASSP) 2023"},{"id":"http://arxiv.org/abs/2009.04639v2","updated":"2023-07-24T03:56:31Z","published":"2020-09-10T02:22:21Z","title":"Improving Coreference Resolution by Leveraging Entity-Centric Features\n with Graph Neural Networks and Second-order Inference","summary":" One of the major challenges in coreference resolution is how to make use of\nentity-level features defined over clusters of mentions rather than mention\npairs. However, coreferent mentions usually spread far apart in an entire text,\nwhich makes it extremely difficult to incorporate entity-level features. We\npropose a graph neural network-based coreference resolution method that can\ncapture the entity-centric information by encouraging the sharing of features\nacross all mentions that probably refer to the same real-world entity. Mentions\nare linked to each other via the edges modeling how likely two linked mentions\npoint to the same entity. Modeling by such graphs, the features between\nmentions can be shared by message passing operations in an entity-centric\nmanner. A global inference algorithm up to second-order features is also\npresented to optimally cluster mentions into consistent groups. Experimental\nresults show our graph neural network-based method combing with the\nsecond-order decoding algorithm (named GNNCR) achieved close to\nstate-of-the-art performance on the English CoNLL-2012 Shared Task dataset.\n","authors":["Lu Liu","Zhenqiao Song","Xiaoqing Zheng","Jun He"],"pdf_url":"https://arxiv.org/pdf/2009.04639v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12507v1","updated":"2023-07-24T03:44:17Z","published":"2023-07-24T03:44:17Z","title":"Investigating the Existence of \"Secret Language'' in Language Models","summary":" In this paper, we study the problem of secret language in NLP, where current\nlanguage models (LMs) seem to have a hidden vocabulary that allows them to\ninterpret absurd inputs as meaningful concepts. We investigate two research\nquestions: ``Does the secret language phenomenon exist in different language\nmodels?'' and ``Does secret language depend on specific context?'' To answer\nthese questions, we introduce a novel method named \\textit{SecretFinding}, a\ngradient-based approach that can automatically discover secret languages in\nLMs. We conduct experiments on five representative models (Electra, ALBERT,\nRoberta, DistillBERT, and CLIP) finetuned on four NLP benchmarks (SST-2, MRPC,\nSNLI, and SQuAD) and a language-grounding benchmark (MSCOCO). Our experimental\nresults show that even when we replace the most important words with others\nthat are semantically dissimilar to the original words in a sentence, LMs do\nnot consider the new sentence semantically dissimilar to the original, as the\noutput does not change with a high probability. This phenomenon holds true\nacross the five models and five tasks and gives a positive answer to the first\nresearch question. 
As for the second research question, we find that the secret\nlanguage discovered by \\textit{SecretFinding} is quite general and could even\nbe transferred to other models in the black-box settings, such as GPT-3 and\nChatGPT. Finally, we discuss the causes of secret language, how to eliminate\nit, the potential connection to memorization, and ethical implications.\nExamples of secret language found by SecretFinding are available on\nhttps://huggingface.co/spaces/anonymousauthors/ACL23_SecretLanguage.\n","authors":["Yimu Wang","Peng Shi","Hongyang Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12507v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13040v3","updated":"2023-07-24T03:31:42Z","published":"2023-05-22T13:47:51Z","title":"SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented\n Dialogue Agents","summary":" Task-oriented dialogue (TOD) models have made significant progress in recent\nyears. However, previous studies primarily focus on datasets written by\nannotators, which has resulted in a gap between academic research and\nreal-world spoken conversation scenarios. While several small-scale spoken TOD\ndatasets are proposed to address robustness issues such as ASR errors, they\nignore the unique challenges in spoken conversation. To tackle the limitations,\nwe introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD,\ncontaining 8 domains, 203k turns, 5.7k dialogues and 249 hours of audios from\nhuman-to-human spoken conversations. SpokenWOZ further incorporates common\nspoken characteristics such as word-by-word processing and reasoning in spoken\nlanguage. Based on these characteristics, we present cross-turn slot and\nreasoning slot detection as new challenges. We conduct experiments on various\nbaselines, including text-modal models, newly proposed dual-modal models, and\nLLMs, e.g., ChatGPT. The results show that the current models still have\nsubstantial room for improvement in spoken conversation, where the most\nadvanced dialogue state tracker only achieves 25.65% in joint goal accuracy and\nthe SOTA end-to-end model only correctly completes the user request in 52.1% of\ndialogues. The dataset, code, and leaderboard are available:\nhttps://spokenwoz.github.io/SpokenWOZ-github.io/.\n","authors":["Shuzheng Si","Wentao Ma","Haoyu Gao","Yuchuan Wu","Ting-En Lin","Yinpei Dai","Hangyu Li","Rui Yan","Fei Huang","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2305.13040v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2009.07481v2","updated":"2023-07-24T03:26:17Z","published":"2020-09-16T05:58:00Z","title":"Unsupervised Summarization by Jointly Extracting Sentences and Keywords","summary":" We present RepRank, an unsupervised graph-based ranking model for extractive\nmulti-document summarization in which the similarity between words, sentences,\nand word-to-sentence can be estimated by the distances between their vector\nrepresentations in a unified vector space. In order to obtain desirable\nrepresentations, we propose a self-attention based learning method that\nrepresent a sentence by the weighted sum of its word embeddings, and the\nweights are concentrated to those words hopefully better reflecting the content\nof a document. We show that salient sentences and keywords can be extracted in\na joint and mutual reinforcement process using our learned representations, and\nprove that this process always converges to a unique solution leading to\nimprovement in performance. 
A variant of absorbing random walk and the\ncorresponding sampling-based algorithm are also described to avoid redundancy\nand increase diversity in the summaries. Experiment results with multiple\nbenchmark datasets show that RepRank achieved the best or comparable\nperformance in ROUGE.\n","authors":["Zongyi Li","Xiaoqing Zheng","Jun He"],"pdf_url":"https://arxiv.org/pdf/2009.07481v2.pdf","comment":"10 pages(includes 2 pages references), 1 figure"},{"id":"http://arxiv.org/abs/2307.12498v1","updated":"2023-07-24T03:07:40Z","published":"2023-07-24T03:07:40Z","title":"Robust Automatic Speech Recognition via WavAugment Guided Phoneme\n Adversarial Training","summary":" Developing a practically-robust automatic speech recognition (ASR) is\nchallenging since the model should not only maintain the original performance\non clean samples, but also achieve consistent efficacy under small volume\nperturbations and large domain shifts. To address this problem, we propose a\nnovel WavAugment Guided Phoneme Adversarial Training (wapat). wapat use\nadversarial examples in phoneme space as augmentation to make the model\ninvariant to minor fluctuations in phoneme representation and preserve the\nperformance on clean samples. In addition, wapat utilizes the phoneme\nrepresentation of augmented samples to guide the generation of adversaries,\nwhich helps to find more stable and diverse gradient-directions, resulting in\nimproved generalization. Extensive experiments demonstrate the effectiveness of\nwapat on End-to-end Speech Challenge Benchmark (ESB). Notably, SpeechLM-wapat\noutperforms the original model by 6.28% WER reduction on ESB, achieving the new\nstate-of-the-art.\n","authors":["Gege Qi","Yuefeng Chen","Xiaofeng Mao","Xiaojun Jia","Ranjie Duan","Rong Zhang","Hui Xue"],"pdf_url":"https://arxiv.org/pdf/2307.12498v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11610v2","updated":"2023-07-24T01:35:47Z","published":"2023-07-21T14:25:39Z","title":"CausE: Towards Causal Knowledge Graph Embedding","summary":" Knowledge graph embedding (KGE) focuses on representing the entities and\nrelations of a knowledge graph (KG) into the continuous vector spaces, which\ncan be employed to predict the missing triples to achieve knowledge graph\ncompletion (KGC). However, KGE models often only briefly learn structural\ncorrelations of triple data and embeddings would be misled by the trivial\npatterns and noisy links in real-world KGs. To address this issue, we build the\nnew paradigm of KGE in the context of causality and embedding disentanglement.\nWe further propose a Causality-enhanced knowledge graph Embedding (CausE)\nframework. CausE employs causal intervention to estimate the causal effect of\nthe confounder embeddings and design new training objectives to make stable\npredictions. Experimental results demonstrate that CausE could outperform the\nbaseline models and achieve state-of-the-art KGC performance. 
We release our\ncode in https://github.com/zjukg/CausE.\n","authors":["Yichi Zhang","Wen Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.11610v2.pdf","comment":"Accepted by CCKS 2023 as a research paper"},{"id":"http://arxiv.org/abs/2306.14096v4","updated":"2023-07-24T00:58:11Z","published":"2023-06-25T02:24:30Z","title":"Chinese Fine-Grained Financial Sentiment Analysis with Large Language\n Models","summary":" Entity-level fine-grained sentiment analysis in the financial domain is a\ncrucial subtask of sentiment analysis and currently faces numerous challenges.\nThe primary challenge stems from the lack of high-quality and large-scale\nannotated corpora specifically designed for financial text sentiment analysis,\nwhich in turn limits the availability of data necessary for developing\neffective text processing techniques. Recent advancements in large language\nmodels (LLMs) have yielded remarkable performance in natural language\nprocessing tasks, primarily centered around language pattern matching. In this\npaper, we propose a novel and extensive Chinese fine-grained financial\nsentiment analysis dataset, FinChina SA, for enterprise early warning. We\nthoroughly evaluate and experiment with well-known existing open-source LLMs\nusing our dataset. We firmly believe that our dataset will serve as a valuable\nresource to advance the exploration of real-world financial sentiment analysis\ntasks, which should be the focus of future research. The FinChina SA dataset is\npublicly available at https://github.com/YerayL/FinChina-SA\n","authors":["Yinyu Lan","Yanru Wu","Wang Xu","Weiqiang Feng","Youhao Zhang"],"pdf_url":"https://arxiv.org/pdf/2306.14096v4.pdf","comment":"FinLLM Symposium at IJCAI 2023"},{"id":"http://arxiv.org/abs/2305.01788v3","updated":"2023-07-24T00:54:51Z","published":"2023-05-02T21:33:10Z","title":"Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation\n Incorporating Gloss Information","summary":" Visual Word Sense Disambiguation (VWSD) is a task to find the image that most\naccurately depicts the correct sense of the target word for the given context.\nPreviously, image-text matching models often suffered from recognizing\npolysemous words. This paper introduces an unsupervised VWSD approach that uses\ngloss information of an external lexical knowledge-base, especially the sense\ndefinitions. Specifically, we suggest employing Bayesian inference to\nincorporate the sense definitions when sense information of the answer is not\nprovided. In addition, to ameliorate the out-of-dictionary (OOD) issue, we\npropose a context-aware definition generation with GPT-3. Experimental results\nshow that the VWSD performance significantly increased with our Bayesian\ninference-based approach. In addition, our context-aware definition generation\nachieved prominent performance improvement in OOD examples exhibiting better\nperformance than the existing definition generation method.\n","authors":["Sunjae Kwon","Rishabh Garodia","Minhwa Lee","Zhichao Yang","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2305.01788v3.pdf","comment":"ACL 2023, https://aclanthology.org/2023.acl-long.88"},{"id":"http://arxiv.org/abs/2307.02591v2","updated":"2023-07-24T00:47:23Z","published":"2023-07-05T18:41:29Z","title":"ODD: A Benchmark Dataset for the NLP-based Opioid Related Aberrant\n Behavior Detection","summary":" Opioid related aberrant behaviors (ORAB) present novel risk factors for\nopioid overdose. 
Previously, ORAB have been mainly assessed by survey results\nand by monitoring drug administrations. Such methods, however, cannot scale up\nand do not cover the entire spectrum of aberrant behaviors. On the other hand,\nORAB are widely documented in electronic health record notes. This paper\nintroduces a novel biomedical natural language processing benchmark dataset\nnamed ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset\ncomprising more than 750 publicly available EHR notes. ODD has been designed\nto identify ORAB from patients' EHR notes and classify them into nine\ncategories: 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3)\nOpioids, 4) Indication, 5) Diagnosed opioid dependency, 6) Benzodiazepines, 7)\nMedication Changes, 8) Central Nervous System-related, and 9) Social\nDeterminants of Health. We explored two state-of-the-art natural language\nprocessing (NLP) models (finetuning pretrained language models and\nprompt-tuning approaches) to identify ORAB. Experimental results show that the\nprompt-tuning models outperformed the finetuning models in most categories and\nthe gains were especially higher among uncommon categories (Suggested aberrant\nbehavior, Diagnosed opioid dependency and Medication change). Although the best\nmodel achieved the highest area under the precision-recall curve of 83.92%,\nuncommon classes (Suggested Aberrant Behavior, Diagnosed Opioid Dependence, and\nMedication Change) still leave large room for performance improvement.\n","authors":["Sunjae Kwon","Xun Wang","Weisong Liu","Emily Druhl","Minhee L. Sung","Joel I. Reisman","Wenjun Li","Robert D. Kerns","William Becker","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2307.02591v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2307.13176v1","updated":"2023-07-24T23:53:13Z","published":"2023-07-24T23:53:13Z","title":"Schema-Driven Actionable Insight Generation and Smart Recommendation","summary":" In natural language generation (NLG), insight mining is seen as a\ndata-to-text task, where data is mined for interesting patterns and verbalised\ninto 'insight' statements. An 'over-generate and rank' paradigm is intuitively\nused to generate such insights. The multidimensionality and subjectivity of\nthis process make it challenging. This paper introduces a schema-driven method\nto generate actionable insights from data to drive growth and change. It also\nintroduces a technique to rank the insights to align with user interests based\non their feedback. We show preliminary qualitative results of the insights\ngenerated using our technique and demonstrate its ability to adapt to feedback.\n","authors":["Allmin Susaiyah","Aki Härmä","Milan Petković"],"pdf_url":"https://arxiv.org/pdf/2307.13176v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13173v1","updated":"2023-07-24T23:42:32Z","published":"2023-07-24T23:42:32Z","title":"Opinion Mining Using Population-tuned Generative Language Models","summary":" We present a novel method for mining opinions from text collections using\ngenerative language models trained on data collected from different\npopulations. We describe the basic definitions, methodology and a generic\nalgorithm for opinion insight mining. We demonstrate the performance of our\nmethod in an experiment where a pre-trained generative model is fine-tuned\nusing specifically tailored content with unnatural and fully annotated\nopinions. 
We show that our approach can learn and transfer the opinions to the\nsemantic classes while maintaining the proportion of polarisation. Finally, we\ndemonstrate the usage of an insight mining system to scale up the discovery of\nopinion insights from a real text corpus.\n","authors":["Allmin Susaiyah","Abhinay Pandya","Aki Härmä"],"pdf_url":"https://arxiv.org/pdf/2307.13173v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13128v1","updated":"2023-07-24T21:05:47Z","published":"2023-07-24T21:05:47Z","title":"Explaining Math Word Problem Solvers","summary":" Automated math word problem solvers based on neural networks have\nsuccessfully managed to obtain 70-80\\% accuracy in solving arithmetic word\nproblems. However, it has been shown that these solvers may rely on superficial\npatterns to obtain their equations. In order to determine what information math\nword problem solvers use to generate solutions, we remove parts of the input\nand measure the model's performance on the perturbed dataset. Our results show\nthat the model is not sensitive to the removal of many words from the input and\ncan still manage to find a correct answer when given a nonsense question. This\nindicates that automatic solvers do not follow the semantic logic of math word\nproblems, and may be overfitting to the presence of specific words.\n","authors":["Abby Newcomb","Jugal Kalita"],"pdf_url":"https://arxiv.org/pdf/2307.13128v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.15498v2","updated":"2023-07-24T20:08:20Z","published":"2021-06-29T15:25:33Z","title":"Classification of Consumer Belief Statements From Social Media","summary":" Social media offer plenty of information to perform market research in order\nto meet the requirements of customers. One way how this research is conducted\nis that a domain expert gathers and categorizes user-generated content into a\ncomplex and fine-grained class structure. In many of such cases, little data\nmeets complex annotations. It is not yet fully understood how this can be\nleveraged successfully for classification. We examine the classification\naccuracy of expert labels when used with a) many fine-grained classes and b)\nfew abstract classes. For scenario b) we compare abstract class labels given by\nthe domain expert as baseline and by automatic hierarchical clustering. We\ncompare this to another baseline where the entire class structure is given by a\ncompletely unsupervised clustering approach. By doing so, this work can serve\nas an example of how complex expert annotations are potentially beneficial and\ncan be utilized in the most optimal way for opinion mining in highly specific\ndomains. By exploring across a range of techniques and experiments, we find\nthat automated class abstraction approaches in particular the unsupervised\napproach performs remarkably well against domain expert baseline on text\nclassification tasks. 
This has the potential to inspire opinion mining\napplications in order to support market researchers in practice and to inspire\nfine-grained automated content analysis on a large scale.\n","authors":["Gerhard Johann Hagerer","Wenbin Le","Hannah Danner","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2106.15498v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2110.10575v2","updated":"2023-07-24T20:07:07Z","published":"2021-10-20T14:04:13Z","title":"SocialVisTUM: An Interactive Visualization Toolkit for Correlated Neural\n Topic Models on Social Media Opinion Mining","summary":" Recent research in opinion mining proposed word embedding-based topic\nmodeling methods that provide superior coherence compared to traditional topic\nmodeling. In this paper, we demonstrate how these methods can be used to\ndisplay correlated topic models on social media texts using SocialVisTUM, our\nproposed interactive visualization toolkit. It displays a graph with topics as\nnodes and their correlations as edges. Further details are displayed\ninteractively to support the exploration of large text collections, e.g.,\nrepresentative words and sentences of topics, topic and sentiment\ndistributions, hierarchical topic clustering, and customizable, predefined\ntopic labels. The toolkit optimizes automatically on custom data for optimal\ncoherence. We show a working instance of the toolkit on data crawled from\nEnglish social media discussions about organic food consumption. The\nvisualization confirms findings of a qualitative consumer research study.\nSocialVisTUM and its training procedures are accessible online.\n","authors":["Gerhard Johann Hagerer","Martin Kirchhoff","Hannah Danner","Robert Pesch","Mainak Ghosh","Archishman Roy","Jiaxi Zhao","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2110.10575v2.pdf","comment":"Demo paper accepted for publication on RANLP 2021; 8 pages, 5\n figures, 1 table"},{"id":"http://arxiv.org/abs/2110.15134v2","updated":"2023-07-24T20:05:38Z","published":"2021-10-28T14:09:44Z","title":"An Analysis of Programming Course Evaluations Before and After the\n Introduction of an Autograder","summary":" Commonly, introductory programming courses in higher education institutions\nhave hundreds of participating students eager to learn to program. The manual\neffort for reviewing the submitted source code and for providing feedback can\nno longer be managed. Manually reviewing the submitted homework can be\nsubjective and unfair, particularly if many tutors are responsible for grading.\nDifferent autograders can help in this situation; however, there is a lack of\nknowledge about how autograders can impact students' overall perception of\nprogramming classes and teaching. This is relevant for course organizers and\ninstitutions to keep their programming courses attractive while coping with\nincreasing students.\n This paper studies the answers to the standardized university evaluation\nquestionnaires of multiple large-scale foundational computer science courses\nwhich recently introduced autograding. The differences before and after this\nintervention are analyzed. By incorporating additional observations, we\nhypothesize how the autograder might have contributed to the significant\nchanges in the data, such as, improved interactions between tutors and\nstudents, improved overall course quality, improved learning success, increased\ntime spent, and reduced difficulty. 
This qualitative study aims to provide\nhypotheses for future research to define and conduct quantitative surveys and\ndata analysis. The autograder technology can be validated as a teaching method\nto improve student satisfaction with programming courses.\n","authors":["Gerhard Johann Hagerer","Laura Lahesoo","Miriam Anschütz","Stephan Krusche","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2110.15134v2.pdf","comment":"Accepted full paper article on IEEE ITHET 2021"},{"id":"http://arxiv.org/abs/2111.02259v3","updated":"2023-07-24T20:03:14Z","published":"2021-11-03T14:49:50Z","title":"A Case Study and Qualitative Analysis of Simple Cross-Lingual Opinion\n Mining","summary":" User-generated content from social media is produced in many languages,\nmaking it technically challenging to compare the discussed themes from one\ndomain across different cultures and regions. It is relevant for domains in a\nglobalized world, such as market research, where people from two nations and\nmarkets might have different requirements for a product. We propose a simple,\nmodern, and effective method for building a single topic model with sentiment\nanalysis capable of covering multiple languages simultaneously, based on a\npre-trained state-of-the-art deep neural network for natural language\nunderstanding. To demonstrate its feasibility, we apply the model to newspaper\narticles and user comments of a specific domain, i.e., organic food products\nand related consumption behavior. The themes match across languages.\nAdditionally, we obtain a high proportion of stable and domain-relevant\ntopics, a meaningful relation between topics and their respective textual\ncontents, and an interpretable representation for social media documents.\nMarketing can potentially benefit from our method, since it provides an\neasy-to-use means of addressing specific customer interests from different\nmarket regions around the globe. For reproducibility, we provide the code,\ndata, and results of our study.\n","authors":["Gerhard Johann Hagerer","Wing Sheung Leung","Qiaoxi Liu","Hannah Danner","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2111.02259v3.pdf","comment":"10 pages, 2 tables, 5 figures, full paper, peer-reviewed, published\n at KDIR/IC3k 2021 conference"},{"id":"http://arxiv.org/abs/2307.13106v1","updated":"2023-07-24T19:54:15Z","published":"2023-07-24T19:54:15Z","title":"How to use LLMs for Text Analysis","summary":" This guide introduces Large Language Models (LLM) as a highly versatile text\nanalysis method within the social sciences. As LLMs are easy-to-use, cheap,\nfast, and applicable to a broad range of text analysis tasks, ranging from text\nannotation and classification to sentiment analysis and critical discourse\nanalysis, many scholars believe that LLMs will transform how we do text\nanalysis. This how-to guide is aimed at students and researchers with limited\nprogramming experience, and offers a simple introduction to how LLMs can be\nused for text analysis in your own research project, as well as advice on best\npractices. We will go through each of the steps of analyzing textual data with\nLLMs using Python: installing the software, setting up the API, loading the\ndata, developing an analysis prompt, analyzing the text, and validating the\nresults. 
As an illustrative example, we will use the challenging task of\nidentifying populism in political texts, and show how LLMs move beyond the\nexisting state-of-the-art.\n","authors":["Petter Törnberg"],"pdf_url":"https://arxiv.org/pdf/2307.13106v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2111.02326v2","updated":"2023-07-24T19:44:53Z","published":"2021-11-03T16:20:16Z","title":"End-to-End Annotator Bias Approximation on Crowdsourced Single-Label\n Sentiment Analysis","summary":" Sentiment analysis is often a crowdsourcing task prone to subjective labels\ngiven by many annotators. It is not yet fully understood how the annotation\nbias of each annotator can be modeled correctly with state-of-the-art methods.\nHowever, resolving annotator bias precisely and reliably is the key to\nunderstand annotators' labeling behavior and to successfully resolve\ncorresponding individual misconceptions and wrongdoings regarding the\nannotation task. Our contribution is an explanation and improvement for precise\nneural end-to-end bias modeling and ground truth estimation, which reduces an\nundesired mismatch in that regard of the existing state-of-the-art.\nClassification experiments show that it has potential to improve accuracy in\ncases where each sample is annotated only by one single annotator. We provide\nthe whole source code publicly and release an own domain-specific sentiment\ndataset containing 10,000 sentences discussing organic food products. These are\ncrawled from social media and are singly labeled by 10 non-expert annotators.\n","authors":["Gerhard Johann Hagerer","David Szabo","Andreas Koch","Maria Luisa Ripoll Dominguez","Christian Widmer","Maximilian Wich","Hannah Danner","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2111.02326v2.pdf","comment":"10 pages, 2 figures, 2 tables, full conference paper, peer-reviewed"},{"id":"http://arxiv.org/abs/2305.17008v2","updated":"2023-07-24T19:18:25Z","published":"2023-05-26T15:09:11Z","title":"NormBank: A Knowledge Bank of Situational Social Norms","summary":" We present NormBank, a knowledge bank of 155k situational norms. This\nresource is designed to ground flexible normative reasoning for interactive,\nassistive, and collaborative AI systems. Unlike prior commonsense resources,\nNormBank grounds each inference within a multivalent sociocultural frame, which\nincludes the setting (e.g., restaurant), the agents' contingent roles (waiter,\ncustomer), their attributes (age, gender), and other physical, social, and\ncultural constraints (e.g., the temperature or the country of operation). In\ntotal, NormBank contains 63k unique constraints from a taxonomy that we\nintroduce and iteratively refine here. Constraints then apply in different\ncombinations to frame social norms. Under these manipulations, norms are\nnon-monotonic - one can cancel an inference by updating its frame even\nslightly. Still, we find evidence that neural models can help reliably extend\nthe scope and coverage of NormBank. 
We further demonstrate the utility of this\nresource with a series of transfer experiments.\n","authors":["Caleb Ziems","Jane Dwivedi-Yu","Yi-Chia Wang","Alon Halevy","Diyi Yang"],"pdf_url":"https://arxiv.org/pdf/2305.17008v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13085v1","updated":"2023-07-24T19:14:38Z","published":"2023-07-24T19:14:38Z","title":"Making Metadata More FAIR Using Large Language Models","summary":" With the global increase in experimental data artifacts, harnessing them in a\nunified fashion leads to a major stumbling block - bad metadata. To bridge this\ngap, this work presents a Natural Language Processing (NLP) informed\napplication, called FAIRMetaText, that compares metadata. Specifically,\nFAIRMetaText analyzes the natural language descriptions of metadata and\nprovides a mathematical similarity measure between two terms. This measure can\nthen be utilized for analyzing varied metadata, by suggesting terms for\ncompliance or grouping similar terms for identification of replaceable terms.\nThe efficacy of the algorithm is presented qualitatively and quantitatively on\npublicly available research artifacts and demonstrates large gains across\nmetadata-related tasks through an in-depth study of a wide variety of Large\nLanguage Models (LLMs). This software can drastically reduce the human effort\nin sifting through various natural language metadata while employing several\nexperimental datasets on the same topic.\n","authors":["Sowmya S. Sundaram","Mark A. Musen"],"pdf_url":"https://arxiv.org/pdf/2307.13085v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.00017v2","updated":"2023-07-24T18:46:22Z","published":"2023-05-30T15:15:40Z","title":"Towards Explainable and Language-Agnostic LLMs: Symbolic Reverse\n Engineering of Language at Scale","summary":" Large language models (LLMs) have achieved a milestone that undeniably\nchanged many held beliefs in artificial intelligence (AI). However, there\nremain many limitations of these LLMs when it comes to true language\nunderstanding, limitations that are a byproduct of the underlying architecture\nof deep neural networks. Moreover, and due to their subsymbolic nature,\nwhatever knowledge these models acquire about how language works will always be\nburied in billions of microfeatures (weights), none of which is meaningful on\nits own, making such models hopelessly unexplainable. To address these\nlimitations, we suggest combining the strength of symbolic representations\nwith what we believe to be the key to the success of LLMs, namely a successful\nbottom-up reverse engineering of language at scale. As such we argue for a\nbottom-up reverse engineering of language in a symbolic setting. Hints on what\nthis project amounts to have been suggested by several authors, and we discuss\nin some detail here how this project could be accomplished.\n","authors":["Walid S. Saba"],"pdf_url":"https://arxiv.org/pdf/2306.00017v2.pdf","comment":"Draft, preprint"},{"id":"http://arxiv.org/abs/2307.13018v1","updated":"2023-07-24T17:17:13Z","published":"2023-07-24T17:17:13Z","title":"The potential of LLMs for coding with low-resource and domain-specific\n programming languages","summary":" This paper presents a study on the feasibility of using large language models\n(LLM) for coding with low-resource and domain-specific programming languages\nthat typically lack the amount of data required for effective LLM processing\ntechniques. 
This study focuses on the econometric scripting language named\nhansl of the open-source software gretl and employs a proprietary LLM based on\nGPT-3.5. Our findings suggest that LLMs can be a useful tool for writing,\nunderstanding, improving, and documenting gretl code, which includes generating\ndescriptive docstrings for functions and providing precise explanations for\nabstract and poorly documented econometric code. While the LLM showcased\npromising docstring-to-code translation capability, we also identify some\nlimitations, such as its inability to improve certain sections of code and to\nwrite accurate unit tests. This study is a step towards leveraging the power of\nLLMs to facilitate software development in low-resource programming languages\nand ultimately to lower barriers to entry for their adoption.\n","authors":["Artur Tarassow"],"pdf_url":"https://arxiv.org/pdf/2307.13018v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.14361v1","updated":"2023-07-24T21:01:46Z","published":"2023-07-24T21:01:46Z","title":"A Hybrid Machine Learning Model for Classifying Gene Mutations in Cancer\n using LSTM, BiLSTM, CNN, GRU, and GloVe","summary":" This study presents an ensemble model combining LSTM, BiLSTM, CNN, GRU, and\nGloVe to classify gene mutations using Kaggle's Personalized Medicine:\nRedefining Cancer Treatment dataset. The results were compared against\nwell-known transformers such as BERT, Electra, Roberta, XLNet, Distilbert, and\ntheir LSTM ensembles. Our model outperformed all other models in terms of\naccuracy, precision, recall, F1 score, and Mean Squared Error. Surprisingly, it\nalso needed less training time, resulting in a perfect combination of\nperformance and efficiency. This study demonstrates the utility of ensemble\nmodels for difficult tasks such as gene mutation classification.\n","authors":["Sanad Aburass","Osama Dorgham","Jamil Al Shaqsi"],"pdf_url":"https://arxiv.org/pdf/2307.14361v1.pdf","comment":"6 pages, 7 figures and 2 tables"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2307.12981v1","updated":"2023-07-24T17:59:02Z","published":"2023-07-24T17:59:02Z","title":"3D-LLM: Injecting the 3D World into Large Language Models","summary":" Large language models (LLMs) and Vision-Language Models (VLMs) have been\nproven to excel at multiple tasks, such as commonsense reasoning. Powerful as\nthese models can be, they are not grounded in the 3D physical world, which\ninvolves richer concepts such as spatial relationships, affordances, physics,\nlayout, and so on. In this work, we propose to inject the 3D world into large\nlanguage models and introduce a whole new family of 3D-LLMs. Specifically,\n3D-LLMs can take 3D point clouds and their features as input and perform a\ndiverse set of 3D-related tasks, including captioning, dense captioning, 3D\nquestion answering, task decomposition, 3D grounding, 3D-assisted dialog,\nnavigation, and so on. Using three types of prompting mechanisms that we\ndesign, we are able to collect over 300k 3D-language data covering these tasks.\nTo efficiently train 3D-LLMs, we first utilize a 3D feature extractor that\nobtains 3D features from rendered multi-view images. Then, we use 2D VLMs as\nour backbones to train our 3D-LLMs. By introducing a 3D localization mechanism,\n3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show\nthat our model outperforms state-of-the-art baselines by a large margin (e.g.,\nthe BLEU-1 score surpasses state-of-the-art score by 9%). 
Furthermore,\nexperiments on our held-in datasets for 3D captioning, task composition, and\n3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative\nexamples also show that our model could perform more tasks beyond the scope of\nexisting LLMs and VLMs. Project Page: : https://vis-www.cs.umass.edu/3dllm/.\n","authors":["Yining Hong","Haoyu Zhen","Peihao Chen","Shuhong Zheng","Yilun Du","Zhenfang Chen","Chuang Gan"],"pdf_url":"https://arxiv.org/pdf/2307.12981v1.pdf","comment":"Project Page: : https://vis-www.cs.umass.edu/3dllm/"},{"id":"http://arxiv.org/abs/2209.05407v3","updated":"2023-07-24T17:58:31Z","published":"2022-09-12T16:59:36Z","title":"Segmenting Known Objects and Unseen Unknowns without Prior Knowledge","summary":" Panoptic segmentation methods assign a known class to each pixel given in\ninput. Even for state-of-the-art approaches, this inevitably enforces decisions\nthat systematically lead to wrong predictions for objects outside the training\ncategories. However, robustness against out-of-distribution samples and corner\ncases is crucial in safety-critical settings to avoid dangerous consequences.\nSince real-world datasets cannot contain enough data points to adequately\nsample the long tail of the underlying distribution, models must be able to\ndeal with unseen and unknown scenarios as well. Previous methods targeted this\nby re-identifying already-seen unlabeled objects. In this work, we propose the\nnecessary step to extend segmentation with a new setting which we term holistic\nsegmentation. Holistic segmentation aims to identify and separate objects of\nunseen unknown categories into instances, without any prior knowledge about\nthem, while performing panoptic segmentation of known classes. We tackle this\nnew problem with U3HS, which finds unknowns as highly uncertain regions and\nclusters their corresponding instance-aware embeddings into individual objects.\nBy doing so, for the first time in panoptic segmentation with unknown objects,\nour U3HS is trained without unknown categories, reducing assumptions and\nleaving the settings as unconstrained as in real-life scenarios. Extensive\nexperiments on public data from MS COCO, Cityscapes, and Lost&Found demonstrate\nthe effectiveness of U3HS for this new, challenging, and assumptions-free\nsetting called holistic segmentation.\n","authors":["Stefano Gasperini","Alvaro Marcos-Ramiro","Michael Schmidt","Nassir Navab","Benjamin Busam","Federico Tombari"],"pdf_url":"https://arxiv.org/pdf/2209.05407v3.pdf","comment":"Accepted at ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12980v1","updated":"2023-07-24T17:58:06Z","published":"2023-07-24T17:58:06Z","title":"A Systematic Survey of Prompt Engineering on Vision-Language Foundation\n Models","summary":" Prompt engineering is a technique that involves augmenting a large\npre-trained model with task-specific hints, known as prompts, to adapt the\nmodel to new tasks. Prompts can be created manually as natural language\ninstructions or generated automatically as either natural language instructions\nor vector representations. Prompt engineering enables the ability to perform\npredictions based solely on prompts without updating model parameters, and the\neasier application of large pre-trained models in real-world tasks. 
In past\nyears, Prompt engineering has been well-studied in natural language processing.\nRecently, it has also been intensively studied in vision-language modeling.\nHowever, there is currently a lack of a systematic overview of prompt\nengineering on pre-trained vision-language models. This paper aims to provide a\ncomprehensive survey of cutting-edge research in prompt engineering on three\ntypes of vision-language models: multimodal-to-text generation models (e.g.\nFlamingo), image-text matching models (e.g. CLIP), and text-to-image generation\nmodels (e.g. Stable Diffusion). For each type of model, a brief model summary,\nprompting methods, prompting-based applications, and the corresponding\nresponsibility and integrity issues are summarized and discussed. Furthermore,\nthe commonalities and differences between prompting on vision-language models,\nlanguage models, and vision models are also discussed. The challenges, future\ndirections, and research opportunities are summarized to foster future research\non this topic.\n","authors":["Jindong Gu","Zhen Han","Shuo Chen","Ahmad Beirami","Bailan He","Gengyuan Zhang","Ruotong Liao","Yao Qin","Volker Tresp","Philip Torr"],"pdf_url":"https://arxiv.org/pdf/2307.12980v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12972v1","updated":"2023-07-24T17:49:11Z","published":"2023-07-24T17:49:11Z","title":"DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting","summary":" In this paper, we propose a new operator, called 3D DeFormable Attention\n(DFA3D), for 2D-to-3D feature lifting, which transforms multi-view 2D image\nfeatures into a unified 3D space for 3D object detection. Existing feature\nlifting approaches, such as Lift-Splat-based and 2D attention-based, either use\nestimated depth to get pseudo LiDAR features and then splat them to a 3D space,\nwhich is a one-pass operation without feature refinement, or ignore depth and\nlift features by 2D attention mechanisms, which achieve finer semantics while\nsuffering from a depth ambiguity problem. In contrast, our DFA3D-based method\nfirst leverages the estimated depth to expand each view's 2D feature map to 3D\nand then utilizes DFA3D to aggregate features from the expanded 3D feature\nmaps. With the help of DFA3D, the depth ambiguity problem can be effectively\nalleviated from the root, and the lifted features can be progressively refined\nlayer by layer, thanks to the Transformer-like architecture. In addition, we\npropose a mathematically equivalent implementation of DFA3D which can\nsignificantly improve its memory efficiency and computational speed. We\nintegrate DFA3D into several methods that use 2D attention-based feature\nlifting with only a few modifications in code and evaluate on the nuScenes\ndataset. The experiment results show a consistent improvement of +1.41\\% mAP on\naverage, and up to +15.1\\% mAP improvement when high-quality depth information\nis available, demonstrating the superiority, applicability, and huge potential\nof DFA3D. 
The code is available at\nhttps://github.com/IDEA-Research/3D-deformable-attention.git.\n","authors":["Hongyang Li","Hao Zhang","Zhaoyang Zeng","Shilong Liu","Feng Li","Tianhe Ren","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12972v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12970v1","updated":"2023-07-24T17:49:04Z","published":"2023-07-24T17:49:04Z","title":"Volcanic ash delimitation using Artificial Intelligence based on Pix2Pix","summary":" Volcanic eruptions emit ash that can be harmful to human health and cause\ndamage to infrastructure, economic activities and the environment. The\ndelimitation of ash clouds allows to know their behavior and dispersion, which\nhelps in the prevention and mitigation of this phenomenon. Traditional methods\ntake advantage of specialized software programs to process the bands or\nchannels that compose the satellite images. However, their use is limited to\nexperts and demands a lot of time and significant computational resources. In\nrecent years, Artificial Intelligence has been a milestone in the computational\ntreatment of complex problems in different areas. In particular, Deep Learning\ntechniques allow automatic, fast and accurate processing of digital images. The\npresent work proposes the use of the Pix2Pix model, a type of generative\nadversarial network that, once trained, learns the mapping of input images to\noutput images. The architecture of such a network consisting of a generator and\na discriminator provides the versatility needed to produce black and white ash\ncloud images from multispectral satellite images. The evaluation of the model,\nbased on loss and accuracy plots, a confusion matrix, and visual inspection,\nindicates a satisfactory solution for accurate ash cloud delineation,\napplicable in any area of the world and becomes a useful tool in risk\nmanagement.\n","authors":["Christian Carrillo","Gissela Torres","Christian Mejia-Escobar"],"pdf_url":"https://arxiv.org/pdf/2307.12970v1.pdf","comment":"18 pages, in Spanish language, 15 figures"},{"id":"http://arxiv.org/abs/2307.12967v1","updated":"2023-07-24T17:45:40Z","published":"2023-07-24T17:45:40Z","title":"Learning Dense Correspondences between Photos and Sketches","summary":" Humans effortlessly grasp the connection between sketches and real-world\nobjects, even when these sketches are far from realistic. Moreover, human\nsketch understanding goes beyond categorization -- critically, it also entails\nunderstanding how individual elements within a sketch correspond to parts of\nthe physical world it represents. What are the computational ingredients needed\nto support this ability? Towards answering this question, we make two\ncontributions: first, we introduce a new sketch-photo correspondence benchmark,\n$\\textit{PSC6k}$, containing 150K annotations of 6250 sketch-photo pairs across\n125 object categories, augmenting the existing Sketchy dataset with\nfine-grained correspondence metadata. Second, we propose a self-supervised\nmethod for learning dense correspondences between sketch-photo pairs, building\nupon recent advances in correspondence learning for pairs of photos. Our model\nuses a spatial transformer network to estimate the warp flow between latent\nrepresentations of a sketch and photo extracted by a contrastive learning-based\nConvNet backbone. We found that this approach outperformed several strong\nbaselines and produced predictions that were quantitatively consistent with\nother warp-based methods. 
However, our benchmark also revealed systematic\ndifferences between predictions of the suite of models we tested and those of\nhumans. Taken together, our work suggests a promising path towards developing\nartificial systems that achieve more human-like understanding of visual images\nat different levels of abstraction. Project page:\nhttps://photo-sketch-correspondence.github.io\n","authors":["Xuanchen Lu","Xiaolong Wang","Judith E Fan"],"pdf_url":"https://arxiv.org/pdf/2307.12967v1.pdf","comment":"Accepted to ICML 2023. Project page:\n https://photo-sketch-correspondence.github.io"},{"id":"http://arxiv.org/abs/2307.12964v1","updated":"2023-07-24T17:43:13Z","published":"2023-07-24T17:43:13Z","title":"Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature\n Alignment","summary":" Text-to-video retrieval systems have recently made significant progress by\nutilizing pre-trained models trained on large-scale image-text pairs. However,\nmost of the latest methods primarily focus on the video modality while\ndisregarding the audio signal for this task. Nevertheless, a recent advancement\nby ECLIPSE has improved long-range text-to-video retrieval by developing an\naudiovisual video representation. Nonetheless, the objective of the\ntext-to-video retrieval task is to capture the complementary audio and video\ninformation that is pertinent to the text query rather than simply achieving\nbetter audio and video alignment. To address this issue, we introduce TEFAL, a\nTExt-conditioned Feature ALignment method that produces both audio and video\nrepresentations conditioned on the text query. Instead of using only an\naudiovisual attention block, which could suppress the audio information\nrelevant to the text query, our approach employs two independent cross-modal\nattention blocks that enable the text to attend to the audio and video\nrepresentations separately. Our proposed method's efficacy is demonstrated on\nfour benchmark datasets that include audio: MSR-VTT, LSMDC, VATEX, and\nCharades, and achieves better than state-of-the-art performance consistently\nacross the four datasets. This is attributed to the additional\ntext-query-conditioned audio representation and the complementary information\nit adds to the text-query-conditioned video representation.\n","authors":["Sarah Ibrahimi","Xiaohang Sun","Pichao Wang","Amanmeet Garg","Ashutosh Sanan","Mohamed Omar"],"pdf_url":"https://arxiv.org/pdf/2307.12964v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12941v1","updated":"2023-07-24T17:11:39Z","published":"2023-07-24T17:11:39Z","title":"On Privileged and Convergent Bases in Neural Network Representations","summary":" In this study, we investigate whether the representations learned by neural\nnetworks possess a privileged and convergent basis. Specifically, we examine\nthe significance of feature directions represented by individual neurons.\nFirst, we establish that arbitrary rotations of neural representations cannot\nbe inverted (unlike linear networks), indicating that they do not exhibit\ncomplete rotational invariance. Subsequently, we explore the possibility of\nmultiple bases achieving identical performance. To do this, we compare the\nbases of networks trained with the same parameters but with varying random\ninitializations. 
Our study reveals two findings: (1) Even in wide networks such\nas WideResNets, neural networks do not converge to a unique basis; (2) Basis\ncorrelation increases significantly when a few early layers of the network are\nfrozen identically.\n Furthermore, we analyze Linear Mode Connectivity, which has been studied as a\nmeasure of basis correlation. Our findings give evidence that while Linear Mode\nConnectivity improves with increased network width, this improvement is not due\nto an increase in basis correlation.\n","authors":["Davis Brown","Nikhil Vyas","Yamini Bansal"],"pdf_url":"https://arxiv.org/pdf/2307.12941v1.pdf","comment":"In the Workshop on High-dimensional Learning Dynamics at ICML 2023"},{"id":"http://arxiv.org/abs/2307.12917v1","updated":"2023-07-24T16:18:22Z","published":"2023-07-24T16:18:22Z","title":"Hierarchical Skeleton Meta-Prototype Contrastive Learning with Hard\n Skeleton Mining for Unsupervised Person Re-Identification","summary":" With rapid advancements in depth sensors and deep learning, skeleton-based\nperson re-identification (re-ID) models have recently achieved remarkable\nprogress with many advantages. Most existing solutions learn single-level\nskeleton features from body joints with the assumption of equal skeleton\nimportance, while they typically lack the ability to exploit more informative\nskeleton features from various levels such as limb level with more global body\npatterns. The label dependency of these methods also limits their flexibility\nin learning more general skeleton representations. This paper proposes a\ngeneric unsupervised Hierarchical skeleton Meta-Prototype Contrastive learning\n(Hi-MPC) approach with Hard Skeleton Mining (HSM) for person re-ID with\nunlabeled 3D skeletons. Firstly, we construct hierarchical representations of\nskeletons to model coarse-to-fine body and motion features from the levels of\nbody joints, components, and limbs. Then a hierarchical meta-prototype\ncontrastive learning model is proposed to cluster and contrast the most typical\nskeleton features (\"prototypes\") from different-level skeletons. By converting\noriginal prototypes into meta-prototypes with multiple homogeneous\ntransformations, we induce the model to learn the inherent consistency of\nprototypes to capture more effective skeleton features for person re-ID.\nFurthermore, we devise a hard skeleton mining mechanism to adaptively infer the\ninformative importance of each skeleton, so as to focus on harder skeletons to\nlearn more discriminative skeleton representations. Extensive evaluations on\nfive datasets demonstrate that our approach outperforms a wide variety of\nstate-of-the-art skeleton-based methods. We further show the general\napplicability of our method to cross-view person re-ID and RGB-based scenarios\nwith estimated skeletons.\n","authors":["Haocong Rao","Cyril Leung","Chunyan Miao"],"pdf_url":"https://arxiv.org/pdf/2307.12917v1.pdf","comment":"Accepted by International Journal of Computer Vision (IJCV). Codes\n are available at https://github.com/Kali-Hac/Hi-MPC. 
Supplemental materials\n will be included in the published version"},{"id":"http://arxiv.org/abs/2307.12914v1","updated":"2023-07-24T16:13:43Z","published":"2023-07-24T16:13:43Z","title":"Towards a Visual-Language Foundation Model for Computational Pathology","summary":" The accelerated adoption of digital pathology and advances in deep learning\nhave enabled the development of powerful models for various pathology tasks\nacross a diverse array of diseases and patient cohorts. However, model training\nis often difficult due to label scarcity in the medical domain and the model's\nusage is limited by the specific task and disease for which it is trained.\nAdditionally, most models in histopathology leverage only image data, a stark\ncontrast to how humans teach each other and reason about histopathologic\nentities. We introduce CONtrastive learning from Captions for Histopathology\n(CONCH), a visual-language foundation model developed using diverse sources of\nhistopathology images, biomedical text, and notably over 1.17 million\nimage-caption pairs via task-agnostic pretraining. Evaluated on a suite of 13\ndiverse benchmarks, CONCH can be transferred to a wide range of downstream\ntasks involving either or both histopathology images and text, achieving\nstate-of-the-art performance on histology image classification, segmentation,\ncaptioning, text-to-image and image-to-text retrieval. CONCH represents a\nsubstantial leap over concurrent visual-language pretrained systems for\nhistopathology, with the potential to directly facilitate a wide array of\nmachine learning-based workflows requiring minimal or no further supervised\nfine-tuning.\n","authors":["Ming Y. Lu","Bowen Chen","Drew F. K. Williamson","Richard J. Chen","Ivy Liang","Tong Ding","Guillaume Jaume","Igor Odintsov","Andrew Zhang","Long Phi Le","Georg Gerber","Anil V Parwani","Faisal Mahmood"],"pdf_url":"https://arxiv.org/pdf/2307.12914v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12909v1","updated":"2023-07-24T16:08:32Z","published":"2023-07-24T16:08:32Z","title":"Dyn-E: Local Appearance Editing of Dynamic Neural Radiance Fields","summary":" Recently, the editing of neural radiance fields (NeRFs) has gained\nconsiderable attention, but most prior works focus on static scenes while\nresearch on the appearance editing of dynamic scenes is relatively lacking. In\nthis paper, we propose a novel framework to edit the local appearance of\ndynamic NeRFs by manipulating pixels in a single frame of training video.\nSpecifically, to locally edit the appearance of dynamic NeRFs while preserving\nunedited regions, we introduce a local surface representation of the edited\nregion, which can be inserted into and rendered along with the original NeRF\nand warped to arbitrary other frames through a learned invertible motion\nrepresentation network. By employing our method, users without professional\nexpertise can easily add desired content to the appearance of a dynamic scene.\nWe extensively evaluate our approach on various scenes and show that our\napproach achieves spatially and temporally consistent editing results. 
Notably,\nour approach is versatile and applicable to different variants of dynamic NeRF\nrepresentations.\n","authors":["Shangzhan Zhang","Sida Peng","Yinji ShenTu","Qing Shuai","Tianrun Chen","Kaicheng Yu","Hujun Bao","Xiaowei Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.12909v1.pdf","comment":"project page: https://dyn-e.github.io/"},{"id":"http://arxiv.org/abs/2307.12907v1","updated":"2023-07-24T16:02:42Z","published":"2023-07-24T16:02:42Z","title":"GridMM: Grid Memory Map for Vision-and-Language Navigation","summary":" Vision-and-language navigation (VLN) enables the agent to navigate to a\nremote location following the natural language instruction in 3D environments.\nTo represent the previously visited environment, most approaches for VLN\nimplement memory using recurrent states, topological maps, or top-down semantic\nmaps. In contrast to these approaches, we build the top-down egocentric and\ndynamically growing Grid Memory Map (i.e., GridMM) to structure the visited\nenvironment. From a global perspective, historical observations are projected\ninto a unified grid map in a top-down view, which can better represent the\nspatial relations of the environment. From a local perspective, we further\npropose an instruction relevance aggregation method to capture fine-grained\nvisual clues in each grid region. Extensive experiments are conducted on both\nthe REVERIE, R2R, SOON datasets in the discrete environments, and the R2R-CE\ndataset in the continuous environments, showing the superiority of our proposed\nmethod.\n","authors":["Zihan Wang","Xiangyang Li","Jiahao Yang","Yeqi Liu","Shuqiang Jiang"],"pdf_url":"https://arxiv.org/pdf/2307.12907v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12900v1","updated":"2023-07-24T15:47:21Z","published":"2023-07-24T15:47:21Z","title":"Automotive Object Detection via Learning Sparse Events by Temporal\n Dynamics of Spiking Neurons","summary":" Event-based sensors, with their high temporal resolution (1us) and dynamical\nrange (120dB), have the potential to be deployed in high-speed platforms such\nas vehicles and drones. However, the highly sparse and fluctuating nature of\nevents poses challenges for conventional object detection techniques based on\nArtificial Neural Networks (ANNs). In contrast, Spiking Neural Networks (SNNs)\nare well-suited for representing event-based data due to their inherent\ntemporal dynamics. In particular, we demonstrate that the membrane potential\ndynamics can modulate network activity upon fluctuating events and strengthen\nfeatures of sparse input. In addition, the spike-triggered adaptive threshold\ncan stabilize training which further improves network performance. Based on\nthis, we develop an efficient spiking feature pyramid network for event-based\nobject detection. Our proposed SNN outperforms previous SNNs and sophisticated\nANNs with attention mechanisms, achieving a mean average precision (map50) of\n47.7% on the Gen1 benchmark dataset. This result significantly surpasses the\nprevious best SNN by 9.7% and demonstrates the potential of SNNs for\nevent-based vision. Our model has a concise architecture while maintaining high\naccuracy and much lower computation cost as a result of sparse computation. 
Our\ncode will be publicly available.\n","authors":["Hu Zhang","Luziwei Leng","Kaiwei Che","Qian Liu","Jie Cheng","Qinghai Guo","Jiangxing Liao","Ran Cheng"],"pdf_url":"https://arxiv.org/pdf/2307.12900v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2201.12803v3","updated":"2023-07-24T15:27:16Z","published":"2022-01-30T12:53:51Z","title":"Generalizing similarity in noisy setups: the DIBS phenomenon","summary":" This work uncovers an interplay among data density, noise, and the\ngeneralization ability in similarity learning. We consider Siamese Neural\nNetworks (SNNs), which are the basic form of contrastive learning, and explore\ntwo types of noise that can impact SNNs, Pair Label Noise (PLN) and Single\nLabel Noise (SLN). Our investigation reveals that SNNs exhibit double descent\nbehaviour regardless of the training setup and that it is further exacerbated\nby noise. We demonstrate that the density of data pairs is crucial for\ngeneralization. When SNNs are trained on sparse datasets with the same amount\nof PLN or SLN, they exhibit comparable generalization properties. However, when\nusing dense datasets, PLN cases generalize worse than SLN ones in the\noverparametrized region, leading to a phenomenon we call Density-Induced Break\nof Similarity (DIBS). In this regime, PLN similarity violation becomes\nmacroscopical, corrupting the dataset to the point where complete interpolation\ncannot be achieved, regardless of the number of model parameters. Our analysis\nalso delves into the correspondence between online optimization and offline\ngeneralization in similarity learning. The results show that this equivalence\nfails in the presence of label noise in all the scenarios considered.\n","authors":["Nayara Fonseca","Veronica Guidetti"],"pdf_url":"https://arxiv.org/pdf/2201.12803v3.pdf","comment":"v3: version accepted at ECAI 2023 + Supplementary Material"},{"id":"http://arxiv.org/abs/2307.12872v1","updated":"2023-07-24T15:10:22Z","published":"2023-07-24T15:10:22Z","title":"Data-free Black-box Attack based on Diffusion Model","summary":" Since the training data for the target model in a data-free black-box attack\nis not available, most recent schemes utilize GANs to generate data for\ntraining substitute model. However, these GANs-based schemes suffer from low\ntraining efficiency as the generator needs to be retrained for each target\nmodel during the substitute training process, as well as low generation\nquality. To overcome these limitations, we consider utilizing the diffusion\nmodel to generate data, and propose a data-free black-box attack scheme based\non diffusion model to improve the efficiency and accuracy of substitute\ntraining. Despite the data generated by the diffusion model exhibits high\nquality, it presents diverse domain distributions and contains many samples\nthat do not meet the discriminative criteria of the target model. To further\nfacilitate the diffusion model to generate data suitable for the target model,\nwe propose a Latent Code Augmentation (LCA) method to guide the diffusion model\nin generating data. With the guidance of LCA, the data generated by the\ndiffusion model not only meets the discriminative criteria of the target model\nbut also exhibits high diversity. 
By utilizing this data, it is possible to\ntrain substitute model that closely resemble the target model more efficiently.\nExtensive experiments demonstrate that our LCA achieves higher attack success\nrates and requires fewer query budgets compared to GANs-based schemes for\ndifferent target models.\n","authors":["Mingwen Shao","Lingzhuang Meng","Yuanjian Qiao","Lixu Zhang","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2307.12872v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12868v1","updated":"2023-07-24T15:06:42Z","published":"2023-07-24T15:06:42Z","title":"Understanding the Latent Space of Diffusion Models through the Lens of\n Riemannian Geometry","summary":" Despite the success of diffusion models (DMs), we still lack a thorough\nunderstanding of their latent space. To understand the latent space\n$\\mathbf{x}_t \\in \\mathcal{X}$, we analyze them from a geometrical perspective.\nSpecifically, we utilize the pullback metric to find the local latent basis in\n$\\mathcal{X}$ and their corresponding local tangent basis in $\\mathcal{H}$, the\nintermediate feature maps of DMs. The discovered latent basis enables\nunsupervised image editing capability through latent space traversal. We\ninvestigate the discovered structure from two perspectives. First, we examine\nhow geometric structure evolves over diffusion timesteps. Through analysis, we\nshow that 1) the model focuses on low-frequency components early in the\ngenerative process and attunes to high-frequency details later; 2) At early\ntimesteps, different samples share similar tangent spaces; and 3) The simpler\ndatasets that DMs trained on, the more consistent the tangent space for each\ntimestep. Second, we investigate how the geometric structure changes based on\ntext conditioning in Stable Diffusion. The results show that 1) similar prompts\nyield comparable tangent spaces; and 2) the model depends less on text\nconditions in later timesteps. To the best of our knowledge, this paper is the\nfirst to present image editing through $\\mathbf{x}$-space traversal and provide\nthorough analyses of the latent structure of DMs.\n","authors":["Yong-Hyun Park","Mingi Kwon","Jaewoong Choi","Junghyo Jo","Youngjung Uh"],"pdf_url":"https://arxiv.org/pdf/2307.12868v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09224v2","updated":"2023-07-24T15:05:55Z","published":"2023-06-15T16:03:01Z","title":"Encyclopedic VQA: Visual questions about detailed properties of\n fine-grained categories","summary":" We propose Encyclopedic-VQA, a large scale visual question answering (VQA)\ndataset featuring visual questions about detailed properties of fine-grained\ncategories and instances. It contains 221k unique question+answer pairs each\nmatched with (up to) 5 images, resulting in a total of 1M VQA samples.\nMoreover, our dataset comes with a controlled knowledge base derived from\nWikipedia, marking the evidence to support each answer. Empirically, we show\nthat our dataset poses a hard challenge for large vision+language models as\nthey perform poorly on our dataset: PaLI [14] is state-of-the-art on OK-VQA\n[37], yet it only achieves 13.0% accuracy on our dataset. Moreover, we\nexperimentally show that progress on answering our encyclopedic questions can\nbe achieved by augmenting large models with a mechanism that retrieves relevant\ninformation from the knowledge base. 
An oracle experiment with perfect\nretrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and\nan automatic retrieval-augmented prototype yields 48.8%. We believe that our\ndataset enables future research on retrieval-augmented vision+language models.\nIt is available at\nhttps://github.com/google-research/google-research/tree/master/encyclopedic_vqa .\n","authors":["Thomas Mensink","Jasper Uijlings","Lluis Castrejon","Arushi Goel","Felipe Cadar","Howard Zhou","Fei Sha","André Araujo","Vittorio Ferrari"],"pdf_url":"https://arxiv.org/pdf/2306.09224v2.pdf","comment":"ICCV'23"},{"id":"http://arxiv.org/abs/2307.12858v1","updated":"2023-07-24T14:57:40Z","published":"2023-07-24T14:57:40Z","title":"Treatment Outcome Prediction for Intracerebral Hemorrhage via Generative\n Prognostic Model with Imaging and Tabular Data","summary":" Intracerebral hemorrhage (ICH) is the second most common and deadliest form\nof stroke. Despite medical advances, predicting treatment outcomes for ICH\nremains a challenge. This paper proposes a novel prognostic model that utilizes\nboth imaging and tabular data to predict treatment outcome for ICH. Our model\nis trained on observational data collected from non-randomized controlled\ntrials, providing reliable predictions of treatment success. Specifically, we\npropose to employ a variational autoencoder model to generate a low-dimensional\nprognostic score, which can effectively address the selection bias resulting\nfrom the non-randomized controlled trials. Importantly, we develop a\nvariational distributions combination module that combines the information from\nimaging data, non-imaging clinical data, and treatment assignment to accurately\ngenerate the prognostic score. We conducted extensive experiments on a\nreal-world clinical dataset of intracerebral hemorrhage. Our proposed method\ndemonstrates a substantial improvement in treatment outcome prediction compared\nto existing state-of-the-art approaches. Code is available at\nhttps://github.com/med-air/TOP-GPM\n","authors":["Wenao Ma","Cheng Chen","Jill Abrigo","Calvin Hoi-Kwan Mak","Yuqi Gong","Nga Yan Chan","Chu Han","Zaiyi Liu","Qi Dou"],"pdf_url":"https://arxiv.org/pdf/2307.12858v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12854v1","updated":"2023-07-24T14:55:15Z","published":"2023-07-24T14:55:15Z","title":"Multiscale Video Pretraining for Long-Term Activity Forecasting","summary":" Long-term activity forecasting is an especially challenging research problem\nbecause it requires understanding the temporal relationships between observed\nactions, as well as the variability and complexity of human activities. Despite\nrelying on strong supervision via expensive human annotations, state-of-the-art\nforecasting approaches often generalize poorly to unseen data. To alleviate\nthis issue, we propose Multiscale Video Pretraining (MVP), a novel\nself-supervised pretraining approach that learns robust representations for\nforecasting by learning to predict contextualized representations of future\nvideo clips over multiple timescales. MVP is based on our observation that\nactions in videos have a multiscale nature, where atomic actions typically\noccur at a short timescale and more complex actions may span longer timescales.\nWe compare MVP to state-of-the-art self-supervised video learning approaches on\ndownstream long-term forecasting tasks including long-term action anticipation\nand video summary prediction. 
Our comprehensive experiments across the Ego4D\nand Epic-Kitchens-55/100 datasets demonstrate that MVP out-performs\nstate-of-the-art methods by significant margins. Notably, MVP obtains a\nrelative performance gain of over 20% accuracy in video summary forecasting\nover existing methods.\n","authors":["Reuben Tan","Matthias De Lange","Michael Iuzzolino","Bryan A. Plummer","Kate Saenko","Karl Ridgeway","Lorenzo Torresani"],"pdf_url":"https://arxiv.org/pdf/2307.12854v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.11630v3","updated":"2023-07-24T14:53:51Z","published":"2023-03-21T06:54:18Z","title":"BoxSnake: Polygonal Instance Segmentation with Box Supervision","summary":" Box-supervised instance segmentation has gained much attention as it requires\nonly simple box annotations instead of costly mask or polygon annotations.\nHowever, existing box-supervised instance segmentation models mainly focus on\nmask-based frameworks. We propose a new end-to-end training technique, termed\nBoxSnake, to achieve effective polygonal instance segmentation using only box\nannotations for the first time. Our method consists of two loss functions: (1)\na point-based unary loss that constrains the bounding box of predicted polygons\nto achieve coarse-grained segmentation; and (2) a distance-aware pairwise loss\nthat encourages the predicted polygons to fit the object boundaries. Compared\nwith the mask-based weakly-supervised methods, BoxSnake further reduces the\nperformance gap between the predicted segmentation and the bounding box, and\nshows significant superiority on the Cityscapes dataset. The code has been\navailable publicly.\n","authors":["Rui Yang","Lin Song","Yixiao Ge","Xiu Li"],"pdf_url":"https://arxiv.org/pdf/2303.11630v3.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12853v1","updated":"2023-07-24T14:53:23Z","published":"2023-07-24T14:53:23Z","title":"Spatiotemporal Modeling Encounters 3D Medical Image Analysis:\n Slice-Shift UNet with Multi-View Fusion","summary":" As a fundamental part of computational healthcare, Computer Tomography (CT)\nand Magnetic Resonance Imaging (MRI) provide volumetric data, making the\ndevelopment of algorithms for 3D image analysis a necessity. Despite being\ncomputationally cheap, 2D Convolutional Neural Networks can only extract\nspatial information. In contrast, 3D CNNs can extract three-dimensional\nfeatures, but they have higher computational costs and latency, which is a\nlimitation for clinical practice that requires fast and efficient models.\nInspired by the field of video action recognition we propose a new 2D-based\nmodel dubbed Slice SHift UNet (SSH-UNet) which encodes three-dimensional\nfeatures at 2D CNN's complexity. More precisely multi-view features are\ncollaboratively learned by performing 2D convolutions along the three\northogonal planes of a volume and imposing a weights-sharing mechanism. The\nthird dimension, which is neglected by the 2D convolution, is reincorporated by\nshifting a portion of the feature maps along the slices' axis. The\neffectiveness of our approach is validated in Multi-Modality Abdominal\nMulti-Organ Segmentation (AMOS) and Multi-Atlas Labeling Beyond the Cranial\nVault (BTCV) datasets, showing that SSH-UNet is more efficient while on par in\nperformance with state-of-the-art architectures.\n","authors":["C. I. Ugwu","S. Casarin","O. 
Lanz"],"pdf_url":"https://arxiv.org/pdf/2307.12853v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12845v1","updated":"2023-07-24T14:43:07Z","published":"2023-07-24T14:43:07Z","title":"Multi-View Vertebra Localization and Identification from CT Images","summary":" Accurately localizing and identifying vertebrae from CT images is crucial for\nvarious clinical applications. However, most existing efforts are performed on\n3D with cropping patch operation, suffering from the large computation costs\nand limited global information. In this paper, we propose a multi-view vertebra\nlocalization and identification from CT images, converting the 3D problem into\na 2D localization and identification task on different views. Without the\nlimitation of the 3D cropped patch, our method can learn the multi-view global\ninformation naturally. Moreover, to better capture the anatomical structure\ninformation from different view perspectives, a multi-view contrastive learning\nstrategy is developed to pre-train the backbone. Additionally, we further\npropose a Sequence Loss to maintain the sequential structure embedded along the\nvertebrae. Evaluation results demonstrate that, with only two 2D networks, our\nmethod can localize and identify vertebrae in CT images accurately, and\noutperforms the state-of-the-art methods consistently. Our code is available at\nhttps://github.com/ShanghaiTech-IMPACT/Multi-View-Vertebra-Localization-and-Identification-from-CT-Images.\n","authors":["Han Wu","Jiadong Zhang","Yu Fang","Zhentao Liu","Nizhuan Wang","Zhiming Cui","Dinggang Shen"],"pdf_url":"https://arxiv.org/pdf/2307.12845v1.pdf","comment":"MICCAI 2023"},{"id":"http://arxiv.org/abs/2306.15599v2","updated":"2023-07-24T14:41:40Z","published":"2023-06-27T16:37:37Z","title":"Coupling a Recurrent Neural Network to SPAD TCSPC Systems for Real-time\n Fluorescence Lifetime Imaging","summary":" Fluorescence lifetime imaging (FLI) has been receiving increased attention in\nrecent years as a powerful diagnostic technique in biological and medical\nresearch. However, existing FLI systems often suffer from a tradeoff between\nprocessing speed, accuracy, and robustness. In this paper, we propose a robust\napproach that enables fast FLI with no degradation of accuracy. The approach is\nbased on a SPAD TCSPC system coupled to a recurrent neural network (RNN) that\naccurately estimates the fluorescence lifetime directly from raw timestamps\nwithout building histograms, thereby drastically reducing transfer data volumes\nand hardware resource utilization, thus enabling FLI acquisition at video rate.\nWe train two variants of the RNN on a synthetic dataset and compare the results\nto those obtained using center-of-mass method (CMM) and least squares fitting\n(LS fitting). Results demonstrate that two RNN variants, gated recurrent unit\n(GRU) and long short-term memory (LSTM), are comparable to CMM and LS fitting\nin terms of accuracy, while outperforming them in background noise by a large\nmargin. To explore the ultimate limits of the approach, we derived the\nCramer-Rao lower bound of the measurement, showing that RNN yields lifetime\nestimations with near-optimal precision. Moreover, our FLI model, which is\npurely trained on synthetic datasets, works well with never-seen-before,\nreal-world data. To demonstrate real-time operation, we have built a FLI\nmicroscope based on Piccolo, a 32x32 SPAD sensor developed in our lab. 
Four\nquantized GRU cores, capable of processing up to 4 million photons per second,\nare deployed on a Xilinx Kintex-7 FPGA. Powered by the GRU, the FLI setup can\nretrieve real-time fluorescence lifetime images at up to 10 frames per second.\nThe proposed FLI system is promising and ideally suited for biomedical\napplications.\n","authors":["Yang Lin","Paul Mos","Andrei Ardelean","Claudio Bruschini","Edoardo Charbon"],"pdf_url":"https://arxiv.org/pdf/2306.15599v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09696v2","updated":"2023-07-24T14:36:24Z","published":"2023-07-19T00:41:39Z","title":"Towards Saner Deep Image Registration","summary":" With recent advances in computing hardware and surges of deep-learning\narchitectures, learning-based deep image registration methods have surpassed\ntheir traditional counterparts, in terms of metric performance and inference\ntime. However, these methods focus on improving performance measurements such\nas Dice, resulting in less attention given to model behaviors that are equally\ndesirable for registrations, especially for medical imaging. This paper\ninvestigates these behaviors for popular learning-based deep registrations\nunder a sanity-checking microscope. We find that most existing registrations\nsuffer from low inverse consistency and nondiscrimination of identical pairs\ndue to overly optimized image similarities. To rectify these behaviors, we\npropose a novel regularization-based sanity-enforcer method that imposes two\nsanity checks on the deep model to reduce its inverse consistency errors and\nincrease its discriminative power simultaneously. Moreover, we derive a set of\ntheoretical guarantees for our sanity-checked image registration method, with\nexperimental results supporting our theoretical findings and their\neffectiveness in increasing the sanity of models without sacrificing any\nperformance. Our code and models are available at\nhttps://github.com/tuffr5/Saner-deep-registration.\n","authors":["Bin Duan","Ming Zhong","Yan Yan"],"pdf_url":"https://arxiv.org/pdf/2307.09696v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12837v1","updated":"2023-07-24T14:35:46Z","published":"2023-07-24T14:35:46Z","title":"EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge: Mixed\n Sequences Prediction","summary":" This report presents the technical details of our approach for the\nEPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge in Action\nRecognition. Our approach is based on the idea that the order in which actions\nare performed is similar between the source and target domains. Based on this,\nwe generate a modified sequence by randomly combining actions from the source\nand target domains. As only unlabelled target data are available under the UDA\nsetting, we use a standard pseudo-labeling strategy for extracting action\nlabels for the target. We then ask the network to predict the resulting action\nsequence. This allows to integrate information from both domains during\ntraining and to achieve better transfer results on target. Additionally, to\nbetter incorporate sequence information, we use a language model to filter\nunlikely sequences. Lastly, we employed a co-occurrence matrix to eliminate\nunseen combinations of verbs and nouns. 
Our submission, labeled as 'sshayan',\ncan be found on the leaderboard, where it currently holds the 2nd position for\n'verb' and the 4th position for both 'noun' and 'action'.\n","authors":["Amirshayan Nasirimajd","Simone Alberto Peirone","Chiara Plizzari","Barbara Caputo"],"pdf_url":"https://arxiv.org/pdf/2307.12837v1.pdf","comment":"2nd place in the 2023 EPIC-KITCHENS-100 Unsupervised Domain\n Adaptation Challenge for Action Recognition"},{"id":"http://arxiv.org/abs/2307.12822v1","updated":"2023-07-24T14:19:36Z","published":"2023-07-24T14:19:36Z","title":"Learning Provably Robust Estimators for Inverse Problems via Jittering","summary":" Deep neural networks provide excellent performance for inverse problems such\nas denoising. However, neural networks can be sensitive to adversarial or\nworst-case perturbations. This raises the question of whether such networks can\nbe trained efficiently to be worst-case robust. In this paper, we investigate\nwhether jittering, a simple regularization technique that adds isotropic\nGaussian noise during training, is effective for learning worst-case robust\nestimators for inverse problems. While well studied for prediction in\nclassification tasks, the effectiveness of jittering for inverse problems has\nnot been systematically investigated. In this paper, we present a novel\nanalytical characterization of the optimal $\\ell_2$-worst-case robust estimator\nfor linear denoising and show that jittering yields optimal robust denoisers.\nFurthermore, we examine jittering empirically via training deep neural networks\n(U-nets) for natural image denoising, deconvolution, and accelerated magnetic\nresonance imaging (MRI). The results show that jittering significantly enhances\nthe worst-case robustness, but can be suboptimal for inverse problems beyond\ndenoising. Moreover, our results imply that training on real data which often\ncontains slight noise is somewhat robustness enhancing.\n","authors":["Anselm Krainovic","Mahdi Soltanolkotabi","Reinhard Heckel"],"pdf_url":"https://arxiv.org/pdf/2307.12822v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12813v1","updated":"2023-07-24T14:06:54Z","published":"2023-07-24T14:06:54Z","title":"Exposing the Troublemakers in Described Object Detection","summary":" Detecting objects based on language descriptions is a popular task that\nincludes Open-Vocabulary object Detection (OVD) and Referring Expression\nComprehension (REC). In this paper, we advance them to a more practical setting\ncalled Described Object Detection (DOD) by expanding category names to flexible\nlanguage expressions for OVD and overcoming the limitation of REC to only\ngrounding the pre-existing object. We establish the research foundation for DOD\ntasks by constructing a Description Detection Dataset ($D^3$), featuring\nflexible language expressions and annotating all described objects without\nomission. By evaluating previous SOTA methods on $D^3$, we find some\ntroublemakers that fail current REC, OVD, and bi-functional methods. REC\nmethods struggle with confidence scores, rejecting negative instances, and\nmulti-target scenarios, while OVD methods face constraints with long and\ncomplex descriptions. Recent bi-functional methods also do not work well on DOD\ndue to their separated training procedures and inference strategies for REC and\nOVD tasks. 
Building upon the aforementioned findings, we propose a baseline\nthat largely improves REC methods by reconstructing the training data and\nintroducing a binary classification sub-task, outperforming existing methods.\nData and code is available at https://github.com/shikras/d-cube.\n","authors":["Chi Xie","Zhao Zhang","Yixuan Wu","Feng Zhu","Rui Zhao","Shuang Liang"],"pdf_url":"https://arxiv.org/pdf/2307.12813v1.pdf","comment":"Preprint. Under review"},{"id":"http://arxiv.org/abs/2307.02148v2","updated":"2023-07-24T13:59:50Z","published":"2023-07-05T09:44:02Z","title":"Compound Attention and Neighbor Matching Network for Multi-contrast MRI\n Super-resolution","summary":" Multi-contrast magnetic resonance imaging (MRI) reflects information about\nhuman tissue from different perspectives and has many clinical applications. By\nutilizing the complementary information among different modalities,\nmulti-contrast super-resolution (SR) of MRI can achieve better results than\nsingle-image super-resolution. However, existing methods of multi-contrast MRI\nSR have the following shortcomings that may limit their performance: First,\nexisting methods either simply concatenate the reference and degraded features\nor exploit global feature-matching between them, which are unsuitable for\nmulti-contrast MRI SR. Second, although many recent methods employ transformers\nto capture long-range dependencies in the spatial dimension, they neglect that\nself-attention in the channel dimension is also important for low-level vision\ntasks. To address these shortcomings, we proposed a novel network architecture\nwith compound-attention and neighbor matching (CANM-Net) for multi-contrast MRI\nSR: The compound self-attention mechanism effectively captures the dependencies\nin both spatial and channel dimension; the neighborhood-based feature-matching\nmodules are exploited to match degraded features and adjacent reference\nfeatures and then fuse them to obtain the high-quality images. We conduct\nexperiments of SR tasks on the IXI, fastMRI, and real-world scanning datasets.\nThe CANM-Net outperforms state-of-the-art approaches in both retrospective and\nprospective experiments. Moreover, the robustness study in our work shows that\nthe CANM-Net still achieves good performance when the reference and degraded\nimages are imperfectly registered, proving good potential in clinical\napplications.\n","authors":["Wenxuan Chen","Sirui Wu","Shuai Wang","Zhongsen Li","Jia Yang","Huifeng Yao","Xiaomeng Li","Xiaolei Song"],"pdf_url":"https://arxiv.org/pdf/2307.02148v2.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2211.16761v3","updated":"2023-07-24T13:53:26Z","published":"2022-11-30T05:59:23Z","title":"Improving Cross-Modal Retrieval with Set of Diverse Embeddings","summary":" Cross-modal retrieval across image and text modalities is a challenging task\ndue to its inherent ambiguity: An image often exhibits various situations, and\na caption can be coupled with diverse images. Set-based embedding has been\nstudied as a solution to this problem. It seeks to encode a sample into a set\nof different embedding vectors that capture different semantics of the sample.\nIn this paper, we present a novel set-based embedding method, which is distinct\nfrom previous work in two aspects. 
First, we present a new similarity function\ncalled smooth-Chamfer similarity, which is designed to alleviate the side\neffects of existing similarity functions for set-based embedding. Second, we\npropose a novel set prediction module to produce a set of embedding vectors\nthat effectively captures diverse semantics of input by the slot attention\nmechanism. Our method is evaluated on the COCO and Flickr30K datasets across\ndifferent visual backbones, where it outperforms existing methods including\nones that demand substantially larger computation at inference.\n","authors":["Dongwon Kim","Namyup Kim","Suha Kwak"],"pdf_url":"https://arxiv.org/pdf/2211.16761v3.pdf","comment":"Accepted to CVPR 2023 (Highlight)"},{"id":"http://arxiv.org/abs/2307.12790v1","updated":"2023-07-24T13:39:21Z","published":"2023-07-24T13:39:21Z","title":"Compact & Capable: Harnessing Graph Neural Networks and Edge Convolution\n for Medical Image Classification","summary":" Graph-based neural network models are gaining traction in the field of\nrepresentation learning due to their ability to uncover latent topological\nrelationships between entities that are otherwise challenging to identify.\nThese models have been employed across a diverse range of domains, encompassing\ndrug discovery, protein interactions, semantic segmentation, and fluid dynamics\nresearch. In this study, we investigate the potential of Graph Neural Networks\n(GNNs) for medical image classification. We introduce a novel model that\ncombines GNNs and edge convolution, leveraging the interconnectedness of RGB\nchannel feature values to strongly represent connections between crucial graph\nnodes. Our proposed model not only performs on par with state-of-the-art Deep\nNeural Networks (DNNs) but does so with 1000 times fewer parameters, resulting\nin reduced training time and data requirements. We compare our Graph\nConvolutional Neural Network (GCNN) to pre-trained DNNs for classifying\nMedMNIST dataset classes, revealing promising prospects for GNNs in medical\nimage analysis. Our results also encourage further exploration of advanced\ngraph-based models such as Graph Attention Networks (GAT) and Graph\nAuto-Encoders in the medical imaging domain. The proposed model yields more\nreliable, interpretable, and accurate outcomes for tasks like semantic\nsegmentation and image classification compared to simpler GCNNs\n","authors":["Aryan Singh","Pepijn Van de Ven","Ciarán Eising","Patrick Denny"],"pdf_url":"https://arxiv.org/pdf/2307.12790v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.13170v4","updated":"2023-07-24T13:35:28Z","published":"2022-04-27T20:04:24Z","title":"AdaBest: Minimizing Client Drift in Federated Learning via Adaptive Bias\n Estimation","summary":" In Federated Learning (FL), a number of clients or devices collaborate to\ntrain a model without sharing their data. Models are optimized locally at each\nclient and further communicated to a central hub for aggregation. While FL is\nan appealing decentralized training paradigm, heterogeneity among data from\ndifferent clients can cause the local optimization to drift away from the\nglobal objective. In order to estimate and therefore remove this drift,\nvariance reduction techniques have been incorporated into FL optimization\nrecently. However, these approaches inaccurately estimate the clients' drift\nand ultimately fail to remove it properly. In this work, we propose an adaptive\nalgorithm that accurately estimates drift across clients. 
In comparison to\nprevious works, our approach necessitates less storage and communication\nbandwidth, as well as lower compute costs. Additionally, our proposed\nmethodology induces stability by constraining the norm of estimates for client\ndrift, making it more practical for large scale FL. Experimental findings\ndemonstrate that the proposed algorithm converges significantly faster and\nachieves higher accuracy than the baselines across various FL benchmarks.\n","authors":["Farshid Varno","Marzie Saghayi","Laya Rafiee Sevyeri","Sharut Gupta","Stan Matwin","Mohammad Havaei"],"pdf_url":"https://arxiv.org/pdf/2204.13170v4.pdf","comment":"Published as a conference paper at ECCV 2022; Corrected some typos in\n the text and a baseline algorithm"},{"id":"http://arxiv.org/abs/2303.12540v2","updated":"2023-07-24T13:35:16Z","published":"2023-03-22T13:16:37Z","title":"Deployment of Image Analysis Algorithms under Prevalence Shifts","summary":" Domain gaps are among the most relevant roadblocks in the clinical\ntranslation of machine learning (ML)-based solutions for medical image\nanalysis. While current research focuses on new training paradigms and network\narchitectures, little attention is given to the specific effect of prevalence\nshifts on an algorithm deployed in practice. Such discrepancies between class\nfrequencies in the data used for a method's development/validation and that in\nits deployment environment(s) are of great importance, for example in the\ncontext of artificial intelligence (AI) democratization, as disease prevalences\nmay vary widely across time and location. Our contribution is twofold. First,\nwe empirically demonstrate the potentially severe consequences of missing\nprevalence handling by analyzing (i) the extent of miscalibration, (ii) the\ndeviation of the decision threshold from the optimum, and (iii) the ability of\nvalidation metrics to reflect neural network performance on the deployment\npopulation as a function of the discrepancy between development and deployment\nprevalence. Second, we propose a workflow for prevalence-aware image\nclassification that uses estimated deployment prevalences to adjust a trained\nclassifier to a new environment, without requiring additional annotated\ndeployment data. Comprehensive experiments based on a diverse set of 30 medical\nclassification tasks showcase the benefit of the proposed workflow in\ngenerating better classifier decisions and more reliable performance estimates\ncompared to current practice.\n","authors":["Patrick Godau","Piotr Kalinowski","Evangelia Christodoulou","Annika Reinke","Minu Tizabi","Luciana Ferrer","Paul Jäger","Lena Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2303.12540v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12775v1","updated":"2023-07-24T13:24:56Z","published":"2023-07-24T13:24:56Z","title":"Is attention all you need in medical image analysis? A review","summary":" Medical imaging is a key component in clinical diagnosis, treatment planning\nand clinical trial design, accounting for almost 90% of all healthcare data.\nCNNs achieved performance gains in medical image analysis (MIA) over the last\nyears. CNNs can efficiently model local pixel interactions and be trained on\nsmall-scale MI data. The main disadvantage of typical CNN models is that they\nignore global pixel relationships within images, which limits their\ngeneralisation ability to understand out-of-distribution data with different\n'global' information. 
The recent progress of Artificial Intelligence gave rise\nto Transformers, which can learn global relationships from data. However, full\nTransformer models need to be trained on large-scale data and involve\ntremendous computational complexity. Attention and Transformer compartments\n(Transf/Attention) which can well maintain properties for modelling global\nrelationships, have been proposed as lighter alternatives of full Transformers.\nRecently, there is an increasing trend to co-pollinate complementary\nlocal-global properties from CNN and Transf/Attention architectures, which led\nto a new era of hybrid models. The past years have witnessed substantial growth\nin hybrid CNN-Transf/Attention models across diverse MIA problems. In this\nsystematic review, we survey existing hybrid CNN-Transf/Attention models,\nreview and unravel key architectural designs, analyse breakthroughs, and\nevaluate current and future opportunities as well as challenges. We also\nintroduced a comprehensive analysis framework on generalisation opportunities\nof scientific and clinical impact, based on which new data-driven domain\ngeneralisation and adaptation methods can be stimulated.\n","authors":["Giorgos Papanastasiou","Nikolaos Dikaios","Jiahao Huang","Chengjia Wang","Guang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12775v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12774v1","updated":"2023-07-24T13:24:19Z","published":"2023-07-24T13:24:19Z","title":"Fast Full-frame Video Stabilization with Iterative Optimization","summary":" Video stabilization refers to the problem of transforming a shaky video into\na visually pleasing one. The question of how to strike a good trade-off between\nvisual quality and computational speed has remained one of the open challenges\nin video stabilization. Inspired by the analogy between wobbly frames and\njigsaw puzzles, we propose an iterative optimization-based learning approach\nusing synthetic datasets for video stabilization, which consists of two\ninteracting submodules: motion trajectory smoothing and full-frame outpainting.\nFirst, we develop a two-level (coarse-to-fine) stabilizing algorithm based on\nthe probabilistic flow field. The confidence map associated with the estimated\noptical flow is exploited to guide the search for shared regions through\nbackpropagation. Second, we take a divide-and-conquer approach and propose a\nnovel multiframe fusion strategy to render full-frame stabilized views. An\nimportant new insight brought about by our iterative optimization approach is\nthat the target video can be interpreted as the fixed point of nonlinear\nmapping for video stabilization. We formulate video stabilization as a problem\nof minimizing the amount of jerkiness in motion trajectories, which guarantees\nconvergence with the help of fixed-point theory. Extensive experimental results\nare reported to demonstrate the superiority of the proposed approach in terms\nof computational speed and visual quality. The code will be available on\nGitHub.\n","authors":["Weiyue Zhao","Xin Li","Zhan Peng","Xianrui Luo","Xinyi Ye","Hao Lu","Zhiguo Cao"],"pdf_url":"https://arxiv.org/pdf/2307.12774v1.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2307.12761v1","updated":"2023-07-24T13:05:36Z","published":"2023-07-24T13:05:36Z","title":"LiDAR Meta Depth Completion","summary":" Depth estimation is one of the essential tasks to be addressed when creating\nmobile autonomous systems. 
While monocular depth estimation methods have\nimproved in recent times, depth completion provides more accurate and reliable\ndepth maps by additionally using sparse depth information from other sensors\nsuch as LiDAR. However, current methods are specifically trained for a single\nLiDAR sensor. As the scanning pattern differs between sensors, every new sensor\nwould require re-training a specialized depth completion model, which is\ncomputationally inefficient and not flexible. Therefore, we propose to\ndynamically adapt the depth completion model to the used sensor type enabling\nLiDAR adaptive depth completion. Specifically, we propose a meta depth\ncompletion network that uses data patterns derived from the data to learn a\ntask network to alter weights of the main depth completion network to solve a\ngiven depth completion task effectively. The method demonstrates a strong\ncapability to work on multiple LiDAR scanning patterns and can also generalize\nto scanning patterns that are unseen during training. While using a single\nmodel, our method yields significantly better results than a non-adaptive\nbaseline trained on different LiDAR patterns. It outperforms LiDAR-specific\nexpert models for very sparse cases. These advantages allow flexible deployment\nof a single depth completion model on different sensors, which could also prove\nvaluable to process the input of nascent LiDAR technology with adaptive instead\nof fixed scanning patterns.\n","authors":["Wolfgang Boettcher","Lukas Hoyer","Ozan Unal","Dengxin Dai"],"pdf_url":"https://arxiv.org/pdf/2307.12761v1.pdf","comment":"Accepted at IROS 2023"},{"id":"http://arxiv.org/abs/2209.11531v2","updated":"2023-07-24T13:04:48Z","published":"2022-09-23T11:36:32Z","title":"Deep Learning-based Anonymization of Chest Radiographs: A\n Utility-preserving Measure for Patient Privacy","summary":" Robust and reliable anonymization of chest radiographs constitutes an\nessential step before publishing large datasets of such for research purposes.\nThe conventional anonymization process is carried out by obscuring personal\ninformation in the images with black boxes and removing or replacing\nmeta-information. However, such simple measures retain biometric information in\nthe chest radiographs, allowing patients to be re-identified by a linkage\nattack. Therefore, there is an urgent need to obfuscate the biometric\ninformation appearing in the images. We propose the first deep learning-based\napproach (PriCheXy-Net) to targetedly anonymize chest radiographs while\nmaintaining data utility for diagnostic and machine learning purposes. Our\nmodel architecture is a composition of three independent neural networks that,\nwhen collectively used, allow for learning a deformation field that is able to\nimpede patient re-identification. Quantitative results on the ChestX-ray14\ndataset show a reduction of patient re-identification from 81.8% to 57.7% (AUC)\nafter re-training with little impact on the abnormality classification\nperformance. This indicates the ability to preserve underlying abnormality\npatterns while increasing patient privacy. 
Lastly, we compare our proposed\nanonymization approach with two other obfuscation-based methods (Privacy-Net,\nDP-Pix) and demonstrate the superiority of our method towards resolving the\nprivacy-utility trade-off for chest radiographs.\n","authors":["Kai Packhäuser","Sebastian Gündel","Florian Thamm","Felix Denzinger","Andreas Maier"],"pdf_url":"https://arxiv.org/pdf/2209.11531v2.pdf","comment":"Accepted at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.07620v2","updated":"2023-07-24T13:03:17Z","published":"2023-07-14T20:39:07Z","title":"Generalizable Embeddings with Cross-batch Metric Learning","summary":" Global average pooling (GAP) is a popular component in deep metric learning\n(DML) for aggregating features. Its effectiveness is often attributed to\ntreating each feature vector as a distinct semantic entity and GAP as a\ncombination of them. Albeit substantiated, such an explanation's algorithmic\nimplications to learn generalizable entities to represent unseen classes, a\ncrucial DML goal, remain unclear. To address this, we formulate GAP as a convex\ncombination of learnable prototypes. We then show that the prototype learning\ncan be expressed as a recursive process fitting a linear predictor to a batch\nof samples. Building on that perspective, we consider two batches of disjoint\nclasses at each iteration and regularize the learning by expressing the samples\nof a batch with the prototypes that are fitted to the other batch. We validate\nour approach on 4 popular DML benchmarks.\n","authors":["Yeti Z. Gurbuz","A. Aydin Alatan"],"pdf_url":"https://arxiv.org/pdf/2307.07620v2.pdf","comment":"\\c{opyright} 2023 IEEE. Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2307.12751v1","updated":"2023-07-24T12:42:45Z","published":"2023-07-24T12:42:45Z","title":"ICF-SRSR: Invertible scale-Conditional Function for Self-Supervised\n Real-world Single Image Super-Resolution","summary":" Single image super-resolution (SISR) is a challenging ill-posed problem that\naims to up-sample a given low-resolution (LR) image to a high-resolution (HR)\ncounterpart. Due to the difficulty in obtaining real LR-HR training pairs,\nrecent approaches are trained on simulated LR images degraded by simplified\ndown-sampling operators, e.g., bicubic. Such an approach can be problematic in\npractice because of the large gap between the synthesized and real-world LR\nimages. To alleviate the issue, we propose a novel Invertible scale-Conditional\nFunction (ICF), which can scale an input image and then restore the original\ninput with different scale conditions. By leveraging the proposed ICF, we\nconstruct a novel self-supervised SISR framework (ICF-SRSR) to handle the\nreal-world SR task without using any paired/unpaired training data.\nFurthermore, our ICF-SRSR can generate realistic and feasible LR-HR pairs,\nwhich can make existing supervised SISR networks more robust. Extensive\nexperiments demonstrate the effectiveness of the proposed method in handling\nSISR in a fully self-supervised manner. 
Our ICF-SRSR demonstrates superior\nperformance compared to the existing methods trained on synthetic paired images\nin real-world scenarios and exhibits comparable performance compared to\nstate-of-the-art supervised/unsupervised methods on public benchmark datasets.\n","authors":["Reyhaneh Neshatavar","Mohsen Yavartanoo","Sanghyun Son","Kyoung Mu Lee"],"pdf_url":"https://arxiv.org/pdf/2307.12751v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.09629v2","updated":"2023-07-24T12:33:09Z","published":"2023-02-19T17:15:56Z","title":"BiofilmScanner: A Computational Intelligence Approach to Obtain\n Bacterial Cell Morphological Attributes from Biofilm Image","summary":" Desulfovibrio alaskensis G20 (DA-G20) is utilized as a model for\nsulfate-reducing bacteria (SRB) that are associated with corrosion issues\ncaused by microorganisms. SRB-based biofilms are thought to be responsible for\nthe billion-dollar-per-year bio-corrosion of metal infrastructure.\nUnderstanding the extraction of the bacterial cells' shape and size properties\nin the SRB-biofilm at different growth stages will assist with the design of\nanti-corrosion techniques. However, numerous issues affect current approaches,\nincluding time-consuming geometric property extraction, low efficiency, and\nhigh error rates. This paper proposes BiofilmScanner, a Yolact-based deep\nlearning method integrated with invariant moments to address these problems.\nOur approach efficiently detects and segments bacterial cells in an SRB image\nwhile simultaneously invariant moments measure the geometric characteristics of\nthe segmented cells with low errors. The numerical experiments of the proposed\nmethod demonstrate that the BiofilmScanner is 2.1x and 6.8x faster than our\nearlier Mask-RCNN and DLv3+ methods for detecting, segmenting, and measuring\nthe geometric properties of the cell. Furthermore, the BiofilmScanner achieved\nan F1-score of 85.28% while Mask-RCNN and DLv3+ obtained F1-scores of 77.67%\nand 75.18%, respectively.\n","authors":["Md Hafizur Rahman","Md Ali Azam","Md Abir Hossen","Shankarachary Ragi","Venkataramana Gadhamshetty"],"pdf_url":"https://arxiv.org/pdf/2302.09629v2.pdf","comment":"Submitted to Pattern Recognition"},{"id":"http://arxiv.org/abs/2307.12732v1","updated":"2023-07-24T12:24:07Z","published":"2023-07-24T12:24:07Z","title":"CLIP-KD: An Empirical Study of Distilling CLIP Models","summary":" CLIP has become a promising language-supervised visual pre-training framework\nand achieves excellent performance over a wide range of tasks. This paper aims\nto distill small CLIP models supervised by a large teacher CLIP model. We\npropose several distillation strategies, including relation, feature, gradient\nand contrastive paradigm, to examine the impact on CLIP distillation. We show\nthat the simplest feature mimicry with MSE loss performs best. Moreover,\ninteractive contrastive learning and relation-based distillation are also\ncritical in performance improvement. We apply the unified method to distill\nseveral student networks trained on 15 million (image, text) pairs.\nDistillation improves the student CLIP models consistently over zero-shot\nImageNet classification and cross-modal retrieval benchmarks. We hope our\nempirical study will become an important baseline for future CLIP distillation\nresearch. 
The code is available at \\url{https://github.com/winycg/CLIP-KD}.\n","authors":["Chuanguang Yang","Zhulin An","Libo Huang","Junyu Bi","Xinqiang Yu","Han Yang","Yongjun Xu"],"pdf_url":"https://arxiv.org/pdf/2307.12732v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12730v1","updated":"2023-07-24T12:22:19Z","published":"2023-07-24T12:22:19Z","title":"COCO-O: A Benchmark for Object Detectors under Natural Distribution\n Shifts","summary":" Practical object detection application can lose its effectiveness on image\ninputs with natural distribution shifts. This problem leads the research\ncommunity to pay more attention on the robustness of detectors under\nOut-Of-Distribution (OOD) inputs. Existing works construct datasets to\nbenchmark the detector's OOD robustness for a specific application scenario,\ne.g., Autonomous Driving. However, these datasets lack universality and are\nhard to benchmark general detectors built on common tasks such as COCO. To give\na more comprehensive robustness assessment, we introduce\nCOCO-O(ut-of-distribution), a test dataset based on COCO with 6 types of\nnatural distribution shifts. COCO-O has a large distribution gap with training\ndata and results in a significant 55.7% relative performance drop on a Faster\nR-CNN detector. We leverage COCO-O to conduct experiments on more than 100\nmodern object detectors to investigate if their improvements are credible or\njust over-fitting to the COCO test set. Unfortunately, most classic detectors\nin early years do not exhibit strong OOD generalization. We further study the\nrobustness effect on recent breakthroughs of detector's architecture design,\naugmentation and pre-training techniques. Some empirical findings are revealed:\n1) Compared with detection head or neck, backbone is the most important part\nfor robustness; 2) An end-to-end detection transformer design brings no\nenhancement, and may even reduce robustness; 3) Large-scale foundation models\nhave made a great leap on robust object detection. We hope our COCO-O could\nprovide a rich testbed for robustness study of object detection. The dataset\nwill be available at\n\\url{https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o}.\n","authors":["Xiaofeng Mao","Yuefeng Chen","Yao Zhu","Da Chen","Hang Su","Rong Zhang","Hui Xue"],"pdf_url":"https://arxiv.org/pdf/2307.12730v1.pdf","comment":"To appear in ICCV2023,\n https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o"},{"id":"http://arxiv.org/abs/2307.12729v1","updated":"2023-07-24T12:21:33Z","published":"2023-07-24T12:21:33Z","title":"Persistent-Transient Duality: A Multi-mechanism Approach for Modeling\n Human-Object Interaction","summary":" Humans are highly adaptable, swiftly switching between different modes to\nprogressively handle different tasks, situations and contexts. In Human-object\ninteraction (HOI) activities, these modes can be attributed to two mechanisms:\n(1) the large-scale consistent plan for the whole activity and (2) the\nsmall-scale children interactive actions that start and end along the timeline.\nWhile neuroscience and cognitive science have confirmed this multi-mechanism\nnature of human behavior, machine modeling approaches for human motion are\ntrailing behind. While attempted to use gradually morphing structures (e.g.,\ngraph attention networks) to model the dynamic HOI patterns, they miss the\nexpeditious and discrete mode-switching nature of the human motion. 
To bridge\nthat gap, this work proposes to model two concurrent mechanisms that jointly\ncontrol human motion: the Persistent process that runs continually on the\nglobal scale, and the Transient sub-processes that operate intermittently on\nthe local context of the human while interacting with objects. These two\nmechanisms form an interactive Persistent-Transient Duality that\nsynergistically governs the activity sequences. We model this conceptual\nduality by a parent-child neural network of Persistent and Transient channels\nwith a dedicated neural module for dynamic mechanism switching. The framework\nis trialed on HOI motion forecasting. On two rich datasets and a wide variety\nof settings, the model consistently delivers superior performances, proving its\nsuitability for the challenge.\n","authors":["Hung Tran","Vuong Le","Svetha Venkatesh","Truyen Tran"],"pdf_url":"https://arxiv.org/pdf/2307.12729v1.pdf","comment":"Accepted at ICCV 2023"},{"id":"http://arxiv.org/abs/2303.12865v3","updated":"2023-07-24T12:08:50Z","published":"2023-03-22T18:59:48Z","title":"NeRF-GAN Distillation for Efficient 3D-Aware Generation with\n Convolutions","summary":" Pose-conditioned convolutional generative models struggle with high-quality\n3D-consistent image generation from single-view datasets, due to their lack of\nsufficient 3D priors. Recently, the integration of Neural Radiance Fields\n(NeRFs) and generative models, such as Generative Adversarial Networks (GANs),\nhas transformed 3D-aware generation from single-view images. NeRF-GANs exploit\nthe strong inductive bias of neural 3D representations and volumetric rendering\nat the cost of higher computational complexity. This study aims at revisiting\npose-conditioned 2D GANs for efficient 3D-aware generation at inference time by\ndistilling 3D knowledge from pretrained NeRF-GANs. We propose a simple and\neffective method, based on re-using the well-disentangled latent space of a\npre-trained NeRF-GAN in a pose-conditioned convolutional network to directly\ngenerate 3D-consistent images corresponding to the underlying 3D\nrepresentations. Experiments on several datasets demonstrate that the proposed\nmethod obtains results comparable with volumetric rendering in terms of quality\nand 3D consistency while benefiting from the computational advantage of\nconvolutional networks. The code will be available at:\nhttps://github.com/mshahbazi72/NeRF-GAN-Distillation\n","authors":["Mohamad Shahbazi","Evangelos Ntavelis","Alessio Tonioni","Edo Collins","Danda Pani Paudel","Martin Danelljan","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2303.12865v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12721v1","updated":"2023-07-24T12:03:50Z","published":"2023-07-24T12:03:50Z","title":"AMAE: Adaptation of Pre-Trained Masked Autoencoder for Dual-Distribution\n Anomaly Detection in Chest X-Rays","summary":" Unsupervised anomaly detection in medical images such as chest radiographs is\nstepping into the spotlight as it mitigates the scarcity of the labor-intensive\nand costly expert annotation of anomaly data. However, nearly all existing\nmethods are formulated as a one-class classification trained only on\nrepresentations from the normal class and discard a potentially significant\nportion of the unlabeled data. This paper focuses on a more practical setting,\ndual distribution anomaly detection for chest X-rays, using the entire training\ndata, including both normal and unlabeled images. 
Inspired by a modern\nself-supervised vision transformer model trained using partial image inputs to\nreconstruct missing image regions -- we propose AMAE, a two-stage algorithm for\nadaptation of the pre-trained masked autoencoder (MAE). Starting from MAE\ninitialization, AMAE first creates synthetic anomalies from only normal\ntraining images and trains a lightweight classifier on frozen transformer\nfeatures. Subsequently, we propose an adaptation strategy to leverage unlabeled\nimages containing anomalies. The adaptation scheme is accomplished by assigning\npseudo-labels to unlabeled images and using two separate MAE based modules to\nmodel the normative and anomalous distributions of pseudo-labeled images. The\neffectiveness of the proposed adaptation strategy is evaluated with different\nanomaly ratios in an unlabeled training set. AMAE leads to consistent\nperformance gains over competing self-supervised and dual distribution anomaly\ndetection methods, setting the new state-of-the-art on three public chest X-ray\nbenchmarks: RSNA, NIH-CXR, and VinDr-CXR.\n","authors":["Behzad Bozorgtabar","Dwarikanath Mahapatra","Jean-Philippe Thiran"],"pdf_url":"https://arxiv.org/pdf/2307.12721v1.pdf","comment":"To be presented at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.12718v1","updated":"2023-07-24T11:59:07Z","published":"2023-07-24T11:59:07Z","title":"CarPatch: A Synthetic Benchmark for Radiance Field Evaluation on Vehicle\n Components","summary":" Neural Radiance Fields (NeRFs) have gained widespread recognition as a highly\neffective technique for representing 3D reconstructions of objects and scenes\nderived from sets of images. Despite their efficiency, NeRF models can pose\nchallenges in certain scenarios such as vehicle inspection, where the lack of\nsufficient data or the presence of challenging elements (e.g. reflections)\nstrongly impact the accuracy of the reconstruction. To this aim, we introduce\nCarPatch, a novel synthetic benchmark of vehicles. In addition to a set of\nimages annotated with their intrinsic and extrinsic camera parameters, the\ncorresponding depth maps and semantic segmentation masks have been generated\nfor each view. Global and part-based metrics have been defined and used to\nevaluate, compare, and better characterize some state-of-the-art techniques.\nThe dataset is publicly released at\nhttps://aimagelab.ing.unimore.it/go/carpatch and can be used as an evaluation\nguide and as a baseline for future work on this challenging topic.\n","authors":["Davide Di Nucci","Alessandro Simoni","Matteo Tomei","Luca Ciuffreda","Roberto Vezzani","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2307.12718v1.pdf","comment":"Accepted at ICIAP2023"},{"id":"http://arxiv.org/abs/2307.12717v1","updated":"2023-07-24T11:58:58Z","published":"2023-07-24T11:58:58Z","title":"Dense Transformer based Enhanced Coding Network for Unsupervised Metal\n Artifact Reduction","summary":" CT images corrupted by metal artifacts have serious negative effects on\nclinical diagnosis. Considering the difficulty of collecting paired data with\nground truth in clinical settings, unsupervised methods for metal artifact\nreduction are of high interest. However, it is difficult for previous\nunsupervised methods to retain structural information from CT images while\nhandling the non-local characteristics of metal artifacts. To address these\nchallenges, we proposed a novel Dense Transformer based Enhanced Coding Network\n(DTEC-Net) for unsupervised metal artifact reduction. 
Specifically, we\nintroduce a Hierarchical Disentangling Encoder, supported by the high-order\ndense process, and transformer to obtain densely encoded sequences with\nlong-range correspondence. Then, we present a second-order disentanglement\nmethod to improve the dense sequence's decoding process. Extensive experiments\nand model discussions illustrate DTEC-Net's effectiveness, which outperforms\nthe previous state-of-the-art methods on a benchmark dataset, and greatly\nreduces metal artifacts while restoring richer texture details.\n","authors":["Wangduo Xie","Matthew B. Blaschko"],"pdf_url":"https://arxiv.org/pdf/2307.12717v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09340v3","updated":"2023-07-24T11:34:21Z","published":"2023-03-16T14:21:45Z","title":"Improving Automated Hemorrhage Detection in Sparse-view Computed\n Tomography via Deep Convolutional Neural Network based Artifact Reduction","summary":" Purpose: Sparse-view computed tomography (CT) is an effective way to reduce\ndose by lowering the total number of views acquired, albeit at the expense of\nimage quality, which, in turn, can impact the ability to detect diseases. We\nexplore deep learning-based artifact reduction in sparse-view cranial CT scans\nand its impact on automated hemorrhage detection. Methods: We trained a U-Net\nfor artefact reduction on simulated sparse-view cranial CT scans from 3000\npatients obtained from a public dataset and reconstructed with varying levels\nof sub-sampling. Additionally, we trained a convolutional neural network on\nfully sampled CT data from 17,545 patients for automated hemorrhage detection.\nWe evaluated the classification performance using the area under the receiver\noperator characteristic curves (AUC-ROCs) with corresponding 95% confidence\nintervals (CIs) and the DeLong test, along with confusion matrices. The\nperformance of the U-Net was compared to an analytical approach based on total\nvariation (TV). Results: The U-Net performed superior compared to unprocessed\nand TV-processed images with respect to image quality and automated hemorrhage\ndiagnosis. With U-Net post-processing, the number of views can be reduced from\n4096 (AUC-ROC: 0.974; 95% CI: 0.972-0.976) views to 512 views (0.973;\n0.971-0.975) with minimal decrease in hemorrhage detection (P<.001) and to 256\nviews (0.967; 0.964-0.969) with a slight performance decrease (P<.001).\nConclusion: The results suggest that U-Net based artifact reduction\nsubstantially enhances automated hemorrhage detection in sparse-view cranial\nCTs. Our findings highlight that appropriate post-processing is crucial for\noptimal image quality and diagnostic accuracy while minimizing radiation dose.\n","authors":["Johannes Thalhammer","Manuel Schultheiss","Tina Dorosti","Tobias Lasser","Franz Pfeiffer","Daniela Pfeiffer","Florian Schaff"],"pdf_url":"https://arxiv.org/pdf/2303.09340v3.pdf","comment":"11 pages, 6 figures, 1 table"},{"id":"http://arxiv.org/abs/2011.09094v3","updated":"2023-07-24T11:28:46Z","published":"2020-11-18T05:16:11Z","title":"UP-DETR: Unsupervised Pre-training for Object Detection with\n Transformers","summary":" DEtection TRansformer (DETR) for object detection reaches competitive\nperformance compared with Faster R-CNN via a transformer encoder-decoder\narchitecture. However, trained with scratch transformers, DETR needs\nlarge-scale training data and an extreme long training schedule even on COCO\ndataset. 
Inspired by the great success of pre-training transformers in natural\nlanguage processing, we propose a novel pretext task named random query patch\ndetection in Unsupervised Pre-training DETR (UP-DETR). Specifically, we\nrandomly crop patches from the given image and then feed them as queries to the\ndecoder. The model is pre-trained to detect these query patches from the input\nimage. During the pre-training, we address two critical issues: multi-task\nlearning and multi-query localization. (1) To trade off classification and\nlocalization preferences in the pretext task, we find that freezing the CNN\nbackbone is the prerequisite for the success of pre-training transformers. (2)\nTo perform multi-query localization, we develop UP-DETR with multi-query patch\ndetection with attention mask. Besides, UP-DETR also provides a unified\nperspective for fine-tuning object detection and one-shot detection tasks. In\nour experiments, UP-DETR significantly boosts the performance of DETR with\nfaster convergence and higher average precision on object detection, one-shot\ndetection and panoptic segmentation. Code and pre-training models:\nhttps://github.com/dddzg/up-detr.\n","authors":["Zhigang Dai","Bolun Cai","Yugeng Lin","Junying Chen"],"pdf_url":"https://arxiv.org/pdf/2011.09094v3.pdf","comment":"Accepted by TPAMI 2022 and CVPR 2021"},{"id":"http://arxiv.org/abs/2307.12698v1","updated":"2023-07-24T11:27:14Z","published":"2023-07-24T11:27:14Z","title":"MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised\n Learning of Motion and Content Features","summary":" Self-supervised learning of visual representations has been focusing on\nlearning content features, which do not capture object motion or location, and\nfocus on identifying and differentiating objects in images and videos. On the\nother hand, optical flow estimation is a task that does not involve\nunderstanding the content of the images on which it is estimated. We unify the\ntwo approaches and introduce MC-JEPA, a joint-embedding predictive architecture\nand self-supervised learning approach to jointly learn optical flow and content\nfeatures within a shared encoder, demonstrating that the two associated\nobjectives; the optical flow estimation objective and the self-supervised\nlearning objective; benefit from each other and thus learn content features\nthat incorporate motion information. The proposed approach achieves performance\non-par with existing unsupervised optical flow benchmarks, as well as with\ncommon self-supervised learning approaches on downstream tasks such as semantic\nsegmentation of images and videos.\n","authors":["Adrien Bardes","Jean Ponce","Yann LeCun"],"pdf_url":"https://arxiv.org/pdf/2307.12698v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.10763v3","updated":"2023-07-24T11:15:47Z","published":"2023-02-12T12:19:57Z","title":"Contrastive Learning and the Emergence of Attributes Associations","summary":" In response to an object presentation, supervised learning schemes generally\nrespond with a parsimonious label. Upon a similar presentation we humans\nrespond again with a label, but are flooded, in addition, by a myriad of\nassociations. A significant portion of these consist of the presented object\nattributes. Contrastive learning is a semi-supervised learning scheme based on\nthe application of identity preserving transformations on the object input\nrepresentations. 
It is conjectured in this work that these same applied\ntransformations preserve, in addition to the identity of the presented object,\nalso the identity of its semantically meaningful attributes. The corollary of\nthis is that the output representations of such a contrastive learning scheme\ncontain valuable information not only for the classification of the presented\nobject, but also for the presence or absence decision of any attribute of\ninterest. Simulation results which demonstrate this idea and the feasibility of\nthis conjecture are presented.\n","authors":["Daniel N. Nissani"],"pdf_url":"https://arxiv.org/pdf/2302.10763v3.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2304.02941v2","updated":"2023-07-24T10:57:15Z","published":"2023-04-06T08:56:18Z","title":"Dr. KID: Direct Remeshing and K-set Isometric Decomposition for Scalable\n Physicalization of Organic Shapes","summary":" Dr. KID is an algorithm that uses isometric decomposition for the\nphysicalization of potato-shaped organic models in a puzzle fashion. The\nalgorithm begins with creating a simple, regular triangular surface mesh of\norganic shapes, followed by iterative k-means clustering and remeshing. For\nclustering, we need similarity between triangles (segments) which is defined as\na distance function. The distance function maps each triangle's shape to a\nsingle point in the virtual 3D space. Thus, the distance between the triangles\nindicates their degree of dissimilarity. K-means clustering uses this distance\nand sorts of segments into k classes. After this, remeshing is applied to\nminimize the distance between triangles within the same cluster by making their\nshapes identical. Clustering and remeshing are repeated until the distance\nbetween triangles in the same cluster reaches an acceptable threshold. We adopt\na curvature-aware strategy to determine the surface thickness and finalize\npuzzle pieces for 3D printing. Identical hinges and holes are created for\nassembling the puzzle components. For smoother outcomes, we use triangle\nsubdivision along with curvature-aware clustering, generating curved triangular\npatches for 3D printing. Our algorithm was evaluated using various models, and\nthe 3D-printed results were analyzed. Findings indicate that our algorithm\nperforms reliably on target organic shapes with minimal loss of input geometry.\n","authors":["Dawar Khan","Ciril Bohak","Ivan Viola"],"pdf_url":"https://arxiv.org/pdf/2304.02941v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12676v1","updated":"2023-07-24T10:30:54Z","published":"2023-07-24T10:30:54Z","title":"Damage Vision Mining Opportunity for Imbalanced Anomaly Detection","summary":" In past decade, previous balanced datasets have been used to advance\nalgorithms for classification, object detection, semantic segmentation, and\nanomaly detection in industrial applications. Specifically, for condition-based\nmaintenance, automating visual inspection is crucial to ensure high quality.\nDeterioration prognostic attempts to optimize the fine decision process for\npredictive maintenance and proactive repair. In civil infrastructure and living\nenvironment, damage data mining cannot avoid the imbalanced data issue because\nof rare unseen events and high quality status by improved operations. For\nvisual inspection, deteriorated class acquired from the surface of concrete and\nsteel components are occasionally imbalanced. 
From numerous related surveys, we\nsummarize that imbalanced data problems can be categorized into four types; 1)\nmissing range of target and label valuables, 2) majority-minority class\nimbalance, 3) foreground-background of spatial imbalance, 4) long-tailed class\nof pixel-wise imbalance. Since 2015, there has been many imbalanced studies\nusing deep learning approaches that includes regression, image classification,\nobject detection, semantic segmentation. However, anomaly detection for\nimbalanced data is not yet well known. In the study, we highlight one-class\nanomaly detection application whether anomalous class or not, and demonstrate\nclear examples on imbalanced vision datasets: wooden, concrete deterioration,\nand disaster damage. We provide key results on damage vision mining advantage,\nhypothesizing that the more effective range of positive ratio, the higher\naccuracy gain of anomaly detection application. Finally, the applicability of\nthe damage learning methods, limitations, and future works are mentioned.\n","authors":["Takato Yasuno"],"pdf_url":"https://arxiv.org/pdf/2307.12676v1.pdf","comment":"12 pages, 14 figures, 8 tables"},{"id":"http://arxiv.org/abs/2307.12674v1","updated":"2023-07-24T10:24:13Z","published":"2023-07-24T10:24:13Z","title":"Industrial Segment Anything -- a Case Study in Aircraft Manufacturing,\n Intralogistics, Maintenance, Repair, and Overhaul","summary":" Deploying deep learning-based applications in specialized domains like the\naircraft production industry typically suffers from the training data\navailability problem. Only a few datasets represent non-everyday objects,\nsituations, and tasks. Recent advantages in research around Vision Foundation\nModels (VFM) opened a new area of tasks and models with high generalization\ncapabilities in non-semantic and semantic predictions. As recently demonstrated\nby the Segment Anything Project, exploiting VFM's zero-shot capabilities is a\npromising direction in tackling the boundaries spanned by data, context, and\nsensor variety. Although, investigating its application within specific domains\nis subject to ongoing research. This paper contributes here by surveying\napplications of the SAM in aircraft production-specific use cases. We include\nmanufacturing, intralogistics, as well as maintenance, repair, and overhaul\nprocesses, also representing a variety of other neighboring industrial domains.\nBesides presenting the various use cases, we further discuss the injection of\ndomain knowledge.\n","authors":["Keno Moenck","Arne Wendt","Philipp Prünte","Julian Koch","Arne Sahrhage","Johann Gierecker","Ole Schmedemann","Falko Kähler","Dirk Holst","Martin Gomse","Thorsten Schüppstuhl","Daniel Schoepflin"],"pdf_url":"https://arxiv.org/pdf/2307.12674v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12672v1","updated":"2023-07-24T10:20:14Z","published":"2023-07-24T10:20:14Z","title":"Global k-Space Interpolation for Dynamic MRI Reconstruction using Masked\n Image Modeling","summary":" In dynamic Magnetic Resonance Imaging (MRI), k-space is typically\nundersampled due to limited scan time, resulting in aliasing artifacts in the\nimage domain. Hence, dynamic MR reconstruction requires not only modeling\nspatial frequency components in the x and y directions of k-space but also\nconsidering temporal redundancy. Most previous works rely on image-domain\nregularizers (priors) to conduct MR reconstruction. 
In contrast, we focus on\ninterpolating the undersampled k-space before obtaining images with Fourier\ntransform. In this work, we connect masked image modeling with k-space\ninterpolation and propose a novel Transformer-based k-space Global\nInterpolation Network, termed k-GIN. Our k-GIN learns global dependencies among\nlow- and high-frequency components of 2D+t k-space and uses it to interpolate\nunsampled data. Further, we propose a novel k-space Iterative Refinement Module\n(k-IRM) to enhance the high-frequency components learning. We evaluate our\napproach on 92 in-house 2D+t cardiac MR subjects and compare it to MR\nreconstruction methods with image-domain regularizers. Experiments show that\nour proposed k-space interpolation method quantitatively and qualitatively\noutperforms baseline methods. Importantly, the proposed approach achieves\nsubstantially higher robustness and generalizability in cases of\nhighly-undersampled MR data.\n","authors":["Jiazhen Pan","Suprosanna Shit","Özgün Turgut","Wenqi Huang","Hongwei Bran Li","Nil Stolt-Ansó","Thomas Küstner","Kerstin Hammernik","Daniel Rueckert"],"pdf_url":"https://arxiv.org/pdf/2307.12672v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.07250v2","updated":"2023-07-24T10:10:25Z","published":"2023-04-14T16:58:23Z","title":"Fusing Structure from Motion and Simulation-Augmented Pose Regression\n from Optical Flow for Challenging Indoor Environments","summary":" The localization of objects is a crucial task in various applications such as\nrobotics, virtual and augmented reality, and the transportation of goods in\nwarehouses. Recent advances in deep learning have enabled the localization\nusing monocular visual cameras. While structure from motion (SfM) predicts the\nabsolute pose from a point cloud, absolute pose regression (APR) methods learn\na semantic understanding of the environment through neural networks. However,\nboth fields face challenges caused by the environment such as motion blur,\nlighting changes, repetitive patterns, and feature-less structures. This study\naims to address these challenges by incorporating additional information and\nregularizing the absolute pose using relative pose regression (RPR) methods.\nRPR methods suffer under different challenges, i.e., motion blur. The optical\nflow between consecutive images is computed using the Lucas-Kanade algorithm,\nand the relative pose is predicted using an auxiliary small recurrent\nconvolutional network. The fusion of absolute and relative poses is a complex\ntask due to the mismatch between the global and local coordinate systems.\nState-of-the-art methods fusing absolute and relative poses use pose graph\noptimization (PGO) to regularize the absolute pose predictions using relative\nposes. In this work, we propose recurrent fusion networks to optimally align\nabsolute and relative pose predictions to improve the absolute pose prediction.\nWe evaluate eight different recurrent units and construct a simulation\nenvironment to pre-train the APR and RPR networks for better generalized\ntraining. Additionally, we record a large database of different scenarios in a\nchallenging large-scale indoor environment that mimics a warehouse with\ntransportation robots. 
We conduct hyperparameter searches and experiments to\nshow the effectiveness of our recurrent fusion method compared to PGO.\n","authors":["Felix Ott","Lucas Heublein","David Rügamer","Bernd Bischl","Christopher Mutschler"],"pdf_url":"https://arxiv.org/pdf/2304.07250v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12656v1","updated":"2023-07-24T09:54:49Z","published":"2023-07-24T09:54:49Z","title":"A Theoretically Guaranteed Quaternion Weighted Schatten p-norm\n Minimization Method for Color Image Restoration","summary":" Inspired by the fact that the matrix formulated by nonlocal similar patches\nin a natural image is of low rank, the rank approximation issue has been\nextensively investigated over the past decades, among which weighted nuclear\nnorm minimization (WNNM) and weighted Schatten $p$-norm minimization (WSNM) are\ntwo prevailing methods that have shown great superiority in various image\nrestoration (IR) problems. Due to the physical characteristics of color images,\ncolor image restoration (CIR) is often a much more difficult task than its\ngrayscale image counterpart. However, when applied to CIR, the traditional\nWNNM/WSNM method only processes three color channels individually and fails to\nconsider their cross-channel correlations. Very recently, a quaternion-based\nWNNM approach (QWNNM) has been developed to mitigate this issue, which is\ncapable of representing the color image as a whole in the quaternion domain and\npreserving the inherent correlation among the three color channels. Despite its\nempirical success, unfortunately, the convergence behavior of QWNNM has not\nbeen strictly studied yet. In this paper, on the one hand, we extend WSNM\ninto the quaternion domain and correspondingly propose a novel quaternion-based\nWSNM model (QWSNM) for tackling the CIR problems. Extensive experiments on two\nrepresentative CIR tasks, including color image denoising and deblurring,\ndemonstrate that the proposed QWSNM method performs favorably against many\nstate-of-the-art alternatives, in both quantitative and qualitative\nevaluations. On the other hand, more importantly, we preliminarily provide a\ntheoretical convergence analysis, that is, by modifying the quaternion\nalternating direction method of multipliers (QADMM) through a simple\ncontinuation strategy, we theoretically prove that both the solution sequences\ngenerated by the QWNNM and QWSNM have fixed-point convergence guarantees.\n","authors":["Qing-Hua Zhang","Liang-Tian He","Yi-Lun Wang","Liang-Jian Deng","Jun Liu"],"pdf_url":"https://arxiv.org/pdf/2307.12656v1.pdf","comment":"46 pages, 10 figures; references added"},{"id":"http://arxiv.org/abs/2302.01162v5","updated":"2023-07-24T09:41:07Z","published":"2023-02-02T15:37:46Z","title":"Get3DHuman: Lifting StyleGAN-Human into a 3D Generative Model using\n Pixel-aligned Reconstruction Priors","summary":" Fast generation of high-quality 3D digital humans is important to a vast\nnumber of applications ranging from entertainment to professional concerns.\nRecent advances in differentiable rendering have enabled the training of 3D\ngenerative models without requiring 3D ground truths. However, the quality of\nthe generated 3D humans still has much room to improve in terms of both\nfidelity and diversity. In this paper, we present Get3DHuman, a novel 3D human\nframework that can significantly boost the realism and diversity of the\ngenerated outcomes by only using a limited budget of 3D ground-truth data. 
Our\nkey observation is that the 3D generator can profit from human-related priors\nlearned through 2D human generators and 3D reconstructors. Specifically, we\nbridge the latent space of Get3DHuman with that of StyleGAN-Human via a\nspecially-designed prior network, where the input latent code is mapped to the\nshape and texture feature volumes spanned by the pixel-aligned 3D\nreconstructor. The outcomes of the prior network are then leveraged as the\nsupervisory signals for the main generator network. To ensure effective\ntraining, we further propose three tailored losses applied to the generated\nfeature volumes and the intermediate feature maps. Extensive experiments\ndemonstrate that Get3DHuman greatly outperforms the other state-of-the-art\napproaches and can support a wide range of applications including shape\ninterpolation, shape re-texturing, and single-view reconstruction through\nlatent inversion.\n","authors":["Zhangyang Xiong","Di Kang","Derong Jin","Weikai Chen","Linchao Bao","Shuguang Cui","Xiaoguang Han"],"pdf_url":"https://arxiv.org/pdf/2302.01162v5.pdf","comment":"ICCV 2023, project page:\n https://x-zhangyang.github.io/2023_Get3DHuman/"},{"id":"http://arxiv.org/abs/2307.12644v1","updated":"2023-07-24T09:35:47Z","published":"2023-07-24T09:35:47Z","title":"Remote Bio-Sensing: Open Source Benchmark Framework for Fair Evaluation\n of rPPG","summary":" Remote Photoplethysmography (rPPG) is a technology that utilizes the light\nabsorption properties of hemoglobin, captured via camera, to analyze and\nmeasure blood volume pulse (BVP). By analyzing the measured BVP, various\nphysiological signals such as heart rate, stress levels, and blood pressure can\nbe derived, enabling applications such as the early prediction of\ncardiovascular diseases. rPPG is a rapidly evolving field as it allows the\nmeasurement of vital signals using camera-equipped devices without the need for\nadditional devices such as blood pressure monitors or pulse oximeters, and\nwithout the assistance of medical experts. Despite extensive efforts and\nadvances in this field, serious challenges remain, including issues related to\nskin color, camera characteristics, ambient lighting, and other sources of\nnoise, which degrade performance accuracy. We argue that fair and evaluable\nbenchmarking is urgently required to overcome these challenges and make any\nmeaningful progress from both academic and commercial perspectives. In most\nexisting work, models are trained, tested, and validated only on limited\ndatasets. Worse still, some studies lack available code or reproducibility,\nmaking it difficult to fairly evaluate and compare performance. Therefore, the\npurpose of this study is to provide a benchmarking framework to evaluate\nvarious rPPG techniques across a wide range of datasets for fair evaluation and\ncomparison, including both conventional non-deep neural network (non-DNN) and\ndeep neural network (DNN) methods. 
GitHub URL:\nhttps://github.com/remotebiosensing/rppg.\n","authors":["Dae Yeol Kim","Eunsu Goh","KwangKee Lee","JongEui Chae","JongHyeon Mun","Junyeong Na","Chae-bong Sohn","Do-Yup Kim"],"pdf_url":"https://arxiv.org/pdf/2307.12644v1.pdf","comment":"19 pages, 10 figures"},{"id":"http://arxiv.org/abs/2304.03981v2","updated":"2023-07-24T09:24:04Z","published":"2023-04-08T10:47:41Z","title":"Uncertainty-inspired Open Set Learning for Retinal Anomaly\n Identification","summary":" Failure to recognize samples from the classes unseen during training is a\nmajor limitation of artificial intelligence in the real-world implementation\nfor recognition and classification of retinal anomalies. We established an\nuncertainty-inspired open-set (UIOS) model, which was trained with fundus\nimages of 9 retinal conditions. Besides assessing the probability of each\ncategory, UIOS also calculated an uncertainty score to express its confidence.\nOur UIOS model with thresholding strategy achieved an F1 score of 99.55%,\n97.01% and 91.91% for the internal testing set, external target categories\n(TC)-JSIEC dataset and TC-unseen testing set, respectively, compared to the F1\nscore of 92.20%, 80.69% and 64.74% by the standard AI model. Furthermore, UIOS\ncorrectly predicted high uncertainty scores, which would prompt the need for a\nmanual check in the datasets of non-target categories retinal diseases,\nlow-quality fundus images, and non-fundus images. UIOS provides a robust method\nfor real-world screening of retinal anomalies.\n","authors":["Meng Wang","Tian Lin","Lianyu Wang","Aidi Lin","Ke Zou","Xinxing Xu","Yi Zhou","Yuanyuan Peng","Qingquan Meng","Yiming Qian","Guoyao Deng","Zhiqun Wu","Junhong Chen","Jianhong Lin","Mingzhi Zhang","Weifang Zhu","Changqing Zhang","Daoqiang Zhang","Rick Siow Mong Goh","Yong Liu","Chi Pui Pang","Xinjian Chen","Haoyu Chen","Huazhu Fu"],"pdf_url":"https://arxiv.org/pdf/2304.03981v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12637v1","updated":"2023-07-24T09:22:09Z","published":"2023-07-24T09:22:09Z","title":"PG-RCNN: Semantic Surface Point Generation for 3D Object Detection","summary":" One of the main challenges in LiDAR-based 3D object detection is that the\nsensors often fail to capture the complete spatial information about the\nobjects due to long distance and occlusion. Two-stage detectors with point\ncloud completion approaches tackle this problem by adding more points to the\nregions of interest (RoIs) with a pre-trained network. However, these methods\ngenerate dense point clouds of objects for all region proposals, assuming that\nobjects always exist in the RoIs. This leads to the indiscriminate point\ngeneration for incorrect proposals as well. Motivated by this, we propose Point\nGeneration R-CNN (PG-RCNN), a novel end-to-end detector that generates semantic\nsurface points of foreground objects for accurate detection. Our method uses a\njointly trained RoI point generation module to process the contextual\ninformation of RoIs and estimate the complete shape and displacement of\nforeground objects. For every generated point, PG-RCNN assigns a semantic\nfeature that indicates the estimated foreground probability. Extensive\nexperiments show that the point clouds generated by our method provide\ngeometrically and semantically rich information for refining false positive and\nmisaligned proposals. 
PG-RCNN achieves competitive performance on the KITTI\nbenchmark, with significantly fewer parameters than state-of-the-art models.\nThe code is available at https://github.com/quotation2520/PG-RCNN.\n","authors":["Inyong Koo","Inyoung Lee","Se-Ho Kim","Hee-Seon Kim","Woo-jin Jeon","Changick Kim"],"pdf_url":"https://arxiv.org/pdf/2307.12637v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.11643v2","updated":"2023-07-24T09:18:52Z","published":"2023-07-21T15:22:32Z","title":"Morphological Image Analysis and Feature Extraction for Reasoning with\n AI-based Defect Detection and Classification Models","summary":" As the use of artificial intelligence (AI) models becomes more prevalent in\nindustries such as engineering and manufacturing, it is essential that these\nmodels provide transparent reasoning behind their predictions. This paper\nproposes the AI-Reasoner, which extracts the morphological characteristics of\ndefects (DefChars) from images and utilises decision trees to reason with the\nDefChar values. Thereafter, the AI-Reasoner exports visualisations (i.e.\ncharts) and textual explanations to provide insights into outputs made by\nmask-based defect detection and classification models. It also provides\neffective mitigation strategies to enhance data pre-processing and overall\nmodel performance. The AI-Reasoner was tested on explaining the outputs of an\nIE Mask R-CNN model using a set of 366 images containing defects. The results\ndemonstrated its effectiveness in explaining the IE Mask R-CNN model's\npredictions. Overall, the proposed AI-Reasoner provides a solution for\nimproving the performance of AI models in industrial applications that require\ndefect analysis.\n","authors":["Jiajun Zhang","Georgina Cosma","Sarah Bugby","Axel Finke","Jason Watkins"],"pdf_url":"https://arxiv.org/pdf/2307.11643v2.pdf","comment":"8 pages, 3 figures, 5 tables; submitted to 2023 IEEE symposium series\n on computational intelligence (SSCI)"},{"id":"http://arxiv.org/abs/2307.12634v1","updated":"2023-07-24T09:16:05Z","published":"2023-07-24T09:16:05Z","title":"Automatic lobe segmentation using attentive cross entropy and end-to-end\n fissure generation","summary":" The automatic lung lobe segmentation algorithm is of great significance for\nthe diagnosis and treatment of lung diseases; however, it faces great\nchallenges due to the incompleteness of pulmonary fissures in lung CT images\nand the large variability of pathological features. Therefore, we propose a new\nautomatic lung lobe segmentation framework, in which we urge the model to pay\nattention to the area around the pulmonary fissure during the training process,\nwhich is realized by a task-specific loss function. In addition, we introduce\nan end-to-end pulmonary fissure generation method in the auxiliary pulmonary\nfissure segmentation task, without any additional network branch. Finally, we\npropose a registration-based loss function to alleviate the convergence\ndifficulty of the Dice loss supervised pulmonary fissure segmentation task. 
We\nachieve 97.83% and 94.75% dice scores on our private dataset STLB and public\nLUNA16 dataset respectively.\n","authors":["Qi Su","Na Wang","Jiawen Xie","Yinan Chen","Xiaofan Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12634v1.pdf","comment":"5 pages, 3 figures, published to 'IEEE International Symposium on\n Biomedical Imaging (ISBI) 2023'"},{"id":"http://arxiv.org/abs/2307.12630v1","updated":"2023-07-24T09:08:30Z","published":"2023-07-24T09:08:30Z","title":"Semi-Supervised Medical Image Segmentation with Co-Distribution\n Alignment","summary":" Medical image segmentation has made significant progress when a large amount\nof labeled data are available. However, annotating medical image segmentation\ndatasets is expensive due to the requirement of professional skills.\nAdditionally, classes are often unevenly distributed in medical images, which\nseverely affects the classification performance on minority classes. To address\nthese problems, this paper proposes Co-Distribution Alignment (Co-DA) for\nsemi-supervised medical image segmentation. Specifically, Co-DA aligns marginal\npredictions on unlabeled data to marginal predictions on labeled data in a\nclass-wise manner with two differently initialized models before using the\npseudo-labels generated by one model to supervise the other. Besides, we design\nan over-expectation cross-entropy loss for filtering the unlabeled pixels to\nreduce noise in their pseudo-labels. Quantitative and qualitative experiments\non three public datasets demonstrate that the proposed approach outperforms\nexisting state-of-the-art semi-supervised medical image segmentation methods on\nboth the 2D CaDIS dataset and the 3D LGE-MRI and ACDC datasets, achieving an\nmIoU of 0.8515 with only 24% labeled data on CaDIS, and a Dice score of 0.8824\nand 0.8773 with only 20% data on LGE-MRI and ACDC, respectively.\n","authors":["Tao Wang","Zhongzheng Huang","Jiawei Wu","Yuanzheng Cai","Zuoyong Li"],"pdf_url":"https://arxiv.org/pdf/2307.12630v1.pdf","comment":"Paper appears in Bioengineering 2023, 10(7), 869"},{"id":"http://arxiv.org/abs/2307.12622v1","updated":"2023-07-24T08:51:49Z","published":"2023-07-24T08:51:49Z","title":"Phase Match for Out-of-Distribution Generalization","summary":" The Fourier transform, serving as an explicit decomposition method for visual\nsignals, has been employed to explain the out-of-distribution generalization\nbehaviors of Convolutional Neural Networks (CNNs). Previous research and\nempirical studies have indicated that the amplitude spectrum plays a decisive\nrole in CNN recognition, but it is susceptible to disturbance caused by\ndistribution shifts. On the other hand, the phase spectrum preserves\nhighly-structured spatial information, which is crucial for visual\nrepresentation learning. In this paper, we aim to clarify the relationships\nbetween Domain Generalization (DG) and the frequency components by introducing\na Fourier-based structural causal model. Specifically, we interpret the phase\nspectrum as semi-causal factors and the amplitude spectrum as non-causal\nfactors. Building upon these observations, we propose Phase Match (PhaMa) to\naddress DG problems. 
Our method introduces perturbations on the amplitude\nspectrum and establishes spatial relationships to match the phase components.\nThrough experiments on multiple benchmarks, we demonstrate that our proposed\nmethod achieves state-of-the-art performance in domain generalization and\nout-of-distribution robustness tasks.\n","authors":["Chengming Hu","Rui Wang","Hao Chen","Zhouwang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12622v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12619v1","updated":"2023-07-24T08:49:20Z","published":"2023-07-24T08:49:20Z","title":"Sparse annotation strategies for segmentation of short axis cardiac MRI","summary":" Short axis cardiac MRI segmentation is a well-researched topic, with\nexcellent results achieved by state-of-the-art models in a supervised setting.\nHowever, annotating MRI volumes is time-consuming and expensive. Many different\napproaches (e.g. transfer learning, data augmentation, few-shot learning, etc.)\nhave emerged in an effort to use fewer annotated data and still achieve similar\nperformance as a fully supervised model. Nevertheless, to the best of our\nknowledge, none of these works focus on which slices of MRI volumes are most\nimportant to annotate for yielding the best segmentation results. In this\npaper, we investigate the effects of training with sparse volumes, i.e.\nreducing the number of cases annotated, and sparse annotations, i.e. reducing\nthe number of slices annotated per case. We evaluate the segmentation\nperformance using the state-of-the-art nnU-Net model on two public datasets to\nidentify which slices are the most important to annotate. We have shown that\ntraining on a significantly reduced dataset (48 annotated volumes) can give a\nDice score greater than 0.85 and results comparable to using the full dataset\n(160 and 240 volumes for each dataset respectively). In general, training on\nmore slice annotations provides more valuable information compared to training\non more volumes. Further, annotating slices from the middle of volumes yields\nthe most beneficial results in terms of segmentation performance, and the\napical region the worst. When evaluating the trade-off between annotating\nvolumes against slices, annotating as many slices as possible instead of\nannotating more volumes is a better strategy.\n","authors":["Josh Stein","Maxime Di Folco","Julia Schnabel"],"pdf_url":"https://arxiv.org/pdf/2307.12619v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12618v1","updated":"2023-07-24T08:47:45Z","published":"2023-07-24T08:47:45Z","title":"Attribute Regularized Soft Introspective VAE: Towards Cardiac Attribute\n Regularization Through MRI Domains","summary":" Deep generative models have emerged as influential instruments for data\ngeneration and manipulation. Enhancing the controllability of these models by\nselectively modifying data attributes has been a recent focus. Variational\nAutoencoders (VAEs) have shown promise in capturing hidden attributes but often\nproduce blurry reconstructions. Controlling these attributes through different\nimaging domains is difficult in medical imaging. Recently, Soft Introspective\nVAE leverage the benefits of both VAEs and Generative Adversarial Networks\n(GANs), which have demonstrated impressive image synthesis capabilities, by\nincorporating an adversarial loss into VAE training. In this work, we propose\nthe Attributed Soft Introspective VAE (Attri-SIVAE) by incorporating an\nattribute regularized loss, into the Soft-Intro VAE framework. 
We experimentally evaluate\nthe proposed method on cardiac MRI data from different domains,\nsuch as various scanner vendors and acquisition centers. The proposed method\nachieves similar performance in terms of reconstruction and regularization\ncompared to the state-of-the-art Attributed regularized VAE, but additionally\nsucceeds in keeping the same regularization level when tested on a\ndifferent dataset, unlike the compared method.\n","authors":["Maxime Di Folco","Cosmin Bercea","Julia A. Schnabel"],"pdf_url":"https://arxiv.org/pdf/2307.12618v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12616v1","updated":"2023-07-24T08:44:25Z","published":"2023-07-24T08:44:25Z","title":"CTVIS: Consistent Training for Online Video Instance Segmentation","summary":" The discrimination of instance embeddings plays a vital role in associating\ninstances across time for online video instance segmentation (VIS). Instance\nembedding learning is directly supervised by the contrastive loss computed upon\nthe contrastive items (CIs), which are sets of anchor/positive/negative\nembeddings. Recent online VIS methods leverage CIs sourced from one reference\nframe only, which we argue is insufficient for learning highly discriminative\nembeddings. Intuitively, a possible strategy to enhance CIs is replicating the\ninference phase during training. To this end, we propose a simple yet effective\ntraining strategy, called Consistent Training for Online VIS (CTVIS), which\nis devoted to aligning the training and inference pipelines in terms of building\nCIs. Specifically, CTVIS constructs CIs by referring, as in inference, to the\nmomentum-averaged embeddings and the memory bank storage mechanisms, and by adding\nnoise to the relevant embeddings. Such an extension allows a reliable\ncomparison between embeddings of current instances and the stable\nrepresentations of historical instances, thereby conferring an advantage in\nmodeling VIS challenges such as occlusion, re-identification, and deformation.\nEmpirically, CTVIS outstrips the SOTA VIS models by up to +5.0 points on three\nVIS benchmarks, including YTVIS19 (55.1% AP), YTVIS21 (50.1% AP) and OVIS\n(35.5% AP). Furthermore, we find that pseudo-videos transformed from images can\ntrain robust models surpassing fully-supervised ones.\n","authors":["Kaining Ying","Qing Zhong","Weian Mao","Zhenhua Wang","Hao Chen","Lin Yuanbo Wu","Yifan Liu","Chengxiang Fan","Yunzhi Zhuge","Chunhua Shen"],"pdf_url":"https://arxiv.org/pdf/2307.12616v1.pdf","comment":"Accepted by ICCV 2023. The code is available at\n https://github.com/KainingYing/CTVIS"},{"id":"http://arxiv.org/abs/2307.12612v1","updated":"2023-07-24T08:39:11Z","published":"2023-07-24T08:39:11Z","title":"Less is More: Focus Attention for Efficient DETR","summary":" DETR-like models have significantly boosted the performance of detectors and\neven outperformed classical convolutional models. However, treating all tokens\nequally without discrimination brings a redundant computational burden\nin the traditional encoder structure. Recent sparsification strategies\nexploit a subset of informative tokens to reduce attention complexity while\nmaintaining performance through a sparse encoder. But these methods tend to\nrely on unreliable model statistics. Moreover, simply reducing the token\npopulation hinders the detection performance to a large extent, limiting the\napplication of these sparse models. 
We propose Focus-DETR, which focuses\nattention on more informative tokens for a better trade-off between computation\nefficiency and model accuracy. Specifically, we reconstruct the encoder with\ndual attention, which includes a token scoring mechanism that considers both\nlocalization and category semantic information of the objects from multi-scale\nfeature maps. We efficiently abandon the background queries and enhance the\nsemantic interaction of the fine-grained object queries based on the scores.\nCompared with the state-of-the-art sparse DETR-like detectors under the same\nsetting, our Focus-DETR gets comparable complexity while achieving 50.4AP\n(+2.2) on COCO. The code is available at\nhttps://github.com/huawei-noah/noah-research/tree/master/Focus-DETR and\nhttps://gitee.com/mindspore/models/tree/master/research/cv/Focus-DETR.\n","authors":["Dehua Zheng","Wenhui Dong","Hailin Hu","Xinghao Chen","Yunhe Wang"],"pdf_url":"https://arxiv.org/pdf/2307.12612v1.pdf","comment":"8 pages, 6 figures, accepted to ICCV2023"},{"id":"http://arxiv.org/abs/2307.12607v1","updated":"2023-07-24T08:32:27Z","published":"2023-07-24T08:32:27Z","title":"ExWarp: Extrapolation and Warping-based Temporal Supersampling for\n High-frequency Displays","summary":" High-frequency displays are gaining immense popularity because of their\nincreasing use in video games and virtual reality applications. However, the\nissue is that the underlying GPUs cannot continuously generate frames at this\nhigh rate -- this results in a less smooth and responsive experience.\nFurthermore, if the frame rate is not synchronized with the refresh rate, the\nuser may experience screen tearing and stuttering. Previous works propose\nincreasing the frame rate to provide a smooth experience on modern displays by\npredicting new frames based on past or future frames. Interpolation and\nextrapolation are two widely used algorithms that predict new frames.\nInterpolation requires waiting for the future frame to make a prediction, which\nadds additional latency. On the other hand, extrapolation provides a better\nquality of experience because it relies solely on past frames -- it does not\nincur any additional latency. The simplest method to extrapolate a frame is to\nwarp the previous frame using motion vectors; however, the warped frame may\ncontain improperly rendered visual artifacts due to dynamic objects -- this\nmakes it very challenging to design such a scheme. Past work has used DNNs to\nget good accuracy, however, these approaches are slow. This paper proposes\nExwarp -- an approach based on reinforcement learning (RL) to intelligently\nchoose between the slower DNN-based extrapolation and faster warping-based\nmethods to increase the frame rate by 4x with an almost negligible reduction in\nthe perceived image quality.\n","authors":["Akanksha Dixit","Yashashwee Chakrabarty","Smruti R. Sarangi"],"pdf_url":"https://arxiv.org/pdf/2307.12607v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.07515v2","updated":"2023-07-24T08:10:52Z","published":"2023-04-15T09:39:52Z","title":"S3M: Scalable Statistical Shape Modeling through Unsupervised\n Correspondences","summary":" Statistical shape models (SSMs) are an established way to represent the\nanatomy of a population with various clinically relevant applications. However,\nthey typically require domain expertise, and labor-intensive landmark\nannotations to construct. 
We address these shortcomings by proposing an\nunsupervised method that leverages deep geometric features and functional\ncorrespondences to simultaneously learn local and global shape structures\nacross population anatomies. Our pipeline significantly improves unsupervised\ncorrespondence estimation for SSMs compared to baseline methods, even on highly\nirregular surface topologies. We demonstrate this for two different anatomical\nstructures: the thyroid and a multi-chamber heart dataset. Furthermore, our\nmethod is robust enough to learn from noisy neural network predictions,\npotentially enabling scaling SSMs to larger patient populations without manual\nsegmentation annotation.\n","authors":["Lennart Bastian","Alexander Baumann","Emily Hoppe","Vincent Bürgin","Ha Young Kim","Mahdi Saleh","Benjamin Busam","Nassir Navab"],"pdf_url":"https://arxiv.org/pdf/2304.07515v2.pdf","comment":"Accepted at MICCAI 2023. 13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.12591v1","updated":"2023-07-24T08:06:46Z","published":"2023-07-24T08:06:46Z","title":"SwinMM: Masked Multi-view with Swin Transformers for 3D Medical Image\n Segmentation","summary":" Recent advancements in large-scale Vision Transformers have made significant\nstrides in improving pre-trained models for medical image segmentation.\nHowever, these methods face a notable challenge in acquiring a substantial\namount of pre-training data, particularly within the medical field. To address\nthis limitation, we present Masked Multi-view with Swin Transformers (SwinMM),\na novel multi-view pipeline for enabling accurate and data-efficient\nself-supervised medical image analysis. Our strategy harnesses the potential of\nmulti-view information by incorporating two principal components. In the\npre-training phase, we deploy a masked multi-view encoder devised to\nconcurrently train masked multi-view observations through a range of diverse\nproxy tasks. These tasks span image reconstruction, rotation, contrastive\nlearning, and a novel task that employs a mutual learning paradigm. This new\ntask capitalizes on the consistency between predictions from various\nperspectives, enabling the extraction of hidden multi-view information from 3D\nmedical data. In the fine-tuning stage, a cross-view decoder is developed to\naggregate the multi-view information through a cross-attention block. Compared\nwith the previous state-of-the-art self-supervised learning method Swin UNETR,\nSwinMM demonstrates a notable advantage on several medical image segmentation\ntasks. It allows for a smooth integration of multi-view information,\nsignificantly boosting both the accuracy and data-efficiency of the model. Code\nand models are available at https://github.com/UCSC-VLAA/SwinMM/.\n","authors":["Yiqing Wang","Zihan Li","Jieru Mei","Zihao Wei","Li Liu","Chen Wang","Shengtian Sang","Alan Yuille","Cihang Xie","Yuyin Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.12591v1.pdf","comment":"MICCAI 2023; project page: https://github.com/UCSC-VLAA/SwinMM/"},{"id":"http://arxiv.org/abs/2307.12580v1","updated":"2023-07-24T07:51:40Z","published":"2023-07-24T07:51:40Z","title":"SL: Stable Learning in Source-Free Domain Adaption for Medical Image\n Segmentation","summary":" Deep learning techniques for medical image analysis usually suffer from the\ndomain shift between source and target data. Most existing works focus on\nunsupervised domain adaptation (UDA). However, in practical applications,\nprivacy issues are much more severe. 
For example, the data of different\nhospitals have domain shifts due to equipment problems, and data of the two\ndomains cannot be available simultaneously because of privacy. In this\nchallenge defined as Source-Free UDA, the previous UDA medical methods are\nlimited. Although a variety of medical source-free unsupervised domain adaption\n(MSFUDA) methods have been proposed, we found they fall into an over-fitting\ndilemma called \"longer training, worse performance.\" Therefore, we propose the\nStable Learning (SL) strategy to address the dilemma. SL is a scalable method\nand can be integrated with other research, which consists of Weight\nConsolidation and Entropy Increase. First, we apply Weight Consolidation to\nretain domain-invariant knowledge and then we design Entropy Increase to avoid\nover-learning. Comparative experiments prove the effectiveness of SL. We also\nhave done extensive ablation experiments. Besides, We will release codes\nincluding a variety of MSFUDA methods.\n","authors":["Yixin Chen","Yan Wang"],"pdf_url":"https://arxiv.org/pdf/2307.12580v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12577v1","updated":"2023-07-24T07:49:01Z","published":"2023-07-24T07:49:01Z","title":"PRIOR: Prototype Representation Joint Learning from Medical Images and\n Reports","summary":" Contrastive learning based vision-language joint pre-training has emerged as\na successful representation learning strategy. In this paper, we present a\nprototype representation learning framework incorporating both global and local\nalignment between medical images and reports. In contrast to standard global\nmulti-modality alignment methods, we employ a local alignment module for\nfine-grained representation. Furthermore, a cross-modality conditional\nreconstruction module is designed to interchange information across modalities\nin the training phase by reconstructing masked images and reports. For\nreconstructing long reports, a sentence-wise prototype memory bank is\nconstructed, enabling the network to focus on low-level localized visual and\nhigh-level clinical linguistic features. Additionally, a non-auto-regressive\ngeneration paradigm is proposed for reconstructing non-sequential reports.\nExperimental results on five downstream tasks, including supervised\nclassification, zero-shot classification, image-to-text retrieval, semantic\nsegmentation, and object detection, show the proposed method outperforms other\nstate-of-the-art methods across multiple datasets and under different dataset\nsize settings. 
The code is available at https://github.com/QtacierP/PRIOR.\n","authors":["Pujin Cheng","Li Lin","Junyan Lyu","Yijin Huang","Wenhan Luo","Xiaoying Tang"],"pdf_url":"https://arxiv.org/pdf/2307.12577v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12574v1","updated":"2023-07-24T07:46:06Z","published":"2023-07-24T07:46:06Z","title":"A Good Student is Cooperative and Reliable: CNN-Transformer\n Collaborative Learning for Semantic Segmentation","summary":" In this paper, we strive to answer the question \"how to collaboratively learn\nconvolutional neural network (CNN)-based and vision transformer (ViT)-based\nmodels by selecting and exchanging the reliable knowledge between them for\nsemantic segmentation?\" Accordingly, we propose an online knowledge\ndistillation (KD) framework that can simultaneously learn compact yet effective\nCNN-based and ViT-based models with two key technical breakthroughs to take\nfull advantage of CNNs and ViT while compensating their limitations. Firstly,\nwe propose heterogeneous feature distillation (HFD) to improve students'\nconsistency in low-layer feature space by mimicking heterogeneous features\nbetween CNNs and ViT. Secondly, to facilitate the two students to learn\nreliable knowledge from each other, we propose bidirectional selective\ndistillation (BSD) that can dynamically transfer selective knowledge. This is\nachieved by 1) region-wise BSD determining the directions of knowledge\ntransferred between the corresponding regions in the feature space and 2)\npixel-wise BSD discerning which of the prediction knowledge to be transferred\nin the logit space. Extensive experiments on three benchmark datasets\ndemonstrate that our proposed framework outperforms the state-of-the-art online\ndistillation methods by a large margin, and shows its efficacy in learning\ncollaboratively between ViT-based and CNN-based models.\n","authors":["Jinjing Zhu","Yunhao Luo","Xu Zheng","Hao Wang","Lin Wang"],"pdf_url":"https://arxiv.org/pdf/2307.12574v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2210.10495v3","updated":"2023-07-24T07:43:31Z","published":"2022-10-19T12:04:47Z","title":"ADPS: Asymmetric Distillation Post-Segmentation for Image Anomaly\n Detection","summary":" Knowledge Distillation-based Anomaly Detection (KDAD) methods rely on the\nteacher-student paradigm to detect and segment anomalous regions by contrasting\nthe unique features extracted by both networks. However, existing KDAD methods\nsuffer from two main limitations: 1) the student network can effortlessly\nreplicate the teacher network's representations, and 2) the features of the\nteacher network serve solely as a ``reference standard\" and are not fully\nleveraged. Toward this end, we depart from the established paradigm and instead\npropose an innovative approach called Asymmetric Distillation Post-Segmentation\n(ADPS). Our ADPS employs an asymmetric distillation paradigm that takes\ndistinct forms of the same image as the input of the teacher-student networks,\ndriving the student network to learn discriminating representations for\nanomalous regions.\n Meanwhile, a customized Weight Mask Block (WMB) is proposed to generate a\ncoarse anomaly localization mask that transfers the distilled knowledge\nacquired from the asymmetric paradigm to the teacher network. 
Equipped with\nWMB, the proposed Post-Segmentation Module (PSM) is able to effectively detect\nand segment abnormal regions with fine structures and clear boundaries.\nExperimental results demonstrate that the proposed ADPS outperforms the\nstate-of-the-art methods in detecting and segmenting anomalies. Surprisingly,\nADPS significantly improves Average Precision (AP) metric by 9% and 20% on the\nMVTec AD and KolektorSDD2 datasets, respectively.\n","authors":["Peng Xing","Hao Tang","Jinhui Tang","Zechao Li"],"pdf_url":"https://arxiv.org/pdf/2210.10495v3.pdf","comment":"11pages,9 figures"},{"id":"http://arxiv.org/abs/2307.12571v1","updated":"2023-07-24T07:39:22Z","published":"2023-07-24T07:39:22Z","title":"MataDoc: Margin and Text Aware Document Dewarping for Arbitrary Boundary","summary":" Document dewarping from a distorted camera-captured image is of great value\nfor OCR and document understanding. The document boundary plays an important\nrole which is more evident than the inner region in document dewarping. Current\nlearning-based methods mainly focus on complete boundary cases, leading to poor\ndocument correction performance of documents with incomplete boundaries. In\ncontrast to these methods, this paper proposes MataDoc, the first method\nfocusing on arbitrary boundary document dewarping with margin and text aware\nregularizations. Specifically, we design the margin regularization by\nexplicitly considering background consistency to enhance boundary perception.\nMoreover, we introduce word position consistency to keep text lines straight in\nrectified document images. To produce a comprehensive evaluation of MataDoc, we\npropose a novel benchmark ArbDoc, mainly consisting of document images with\narbitrary boundaries in four typical scenarios. Extensive experiments confirm\nthe superiority of MataDoc with consideration for the incomplete boundary on\nArbDoc and also demonstrate the effectiveness of the proposed method on\nDocUNet, DIR300, and WarpDoc datasets.\n","authors":["Beiya Dai","Xing li","Qunyi Xie","Yulin Li","Xiameng Qin","Chengquan Zhang","Kun Yao","Junyu Han"],"pdf_url":"https://arxiv.org/pdf/2307.12571v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2307.12560v1","updated":"2023-07-24T07:03:22Z","published":"2023-07-24T07:03:22Z","title":"Interpolating between Images with Diffusion Models","summary":" One little-explored frontier of image generation and editing is the task of\ninterpolating between two input images, a feature missing from all currently\ndeployed image generation pipelines. We argue that such a feature can expand\nthe creative applications of such models, and propose a method for zero-shot\ninterpolation using latent diffusion models. We apply interpolation in the\nlatent space at a sequence of decreasing noise levels, then perform denoising\nconditioned on interpolated text embeddings derived from textual inversion and\n(optionally) subject poses. For greater consistency, or to specify additional\ncriteria, we can generate several candidates and use CLIP to select the highest\nquality image. We obtain convincing interpolations across diverse subject\nposes, image styles, and image content, and show that standard quantitative\nmetrics such as FID are insufficient to measure the quality of an\ninterpolation. Code and data are available at\nhttps://clintonjwang.github.io/interpolation.\n","authors":["Clinton J. 
Wang","Polina Golland"],"pdf_url":"https://arxiv.org/pdf/2307.12560v1.pdf","comment":"Presented at ICML 2023 Workshop on Challenges of Deploying Generative\n AI"},{"id":"http://arxiv.org/abs/2203.01923v4","updated":"2023-07-24T06:59:56Z","published":"2022-03-03T18:56:08Z","title":"Recovering 3D Human Mesh from Monocular Images: A Survey","summary":" Estimating human pose and shape from monocular images is a long-standing\nproblem in computer vision. Since the release of statistical body models, 3D\nhuman mesh recovery has been drawing broader attention. With the same goal of\nobtaining well-aligned and physically plausible mesh results, two paradigms\nhave been developed to overcome challenges in the 2D-to-3D lifting process: i)\nan optimization-based paradigm, where different data terms and regularization\nterms are exploited as optimization objectives; and ii) a regression-based\nparadigm, where deep learning techniques are embraced to solve the problem in\nan end-to-end fashion. Meanwhile, continuous efforts are devoted to improving\nthe quality of 3D mesh labels for a wide range of datasets. Though remarkable\nprogress has been achieved in the past decade, the task is still challenging\ndue to flexible body motions, diverse appearances, complex environments, and\ninsufficient in-the-wild annotations. To the best of our knowledge, this is the\nfirst survey that focuses on the task of monocular 3D human mesh recovery. We\nstart with the introduction of body models and then elaborate recovery\nframeworks and training objectives by providing in-depth analyses of their\nstrengths and weaknesses. We also summarize datasets, evaluation metrics, and\nbenchmark results. Open issues and future directions are discussed in the end,\nhoping to motivate researchers and facilitate their research in this area. A\nregularly updated project page can be found at\nhttps://github.com/tinatiansjz/hmr-survey.\n","authors":["Yating Tian","Hongwen Zhang","Yebin Liu","Limin Wang"],"pdf_url":"https://arxiv.org/pdf/2203.01923v4.pdf","comment":"Accepted to IEEE TPAMI, Survey on monocular 3D human mesh recovery,\n Project page: https://github.com/tinatiansjz/hmr-survey"},{"id":"http://arxiv.org/abs/2307.12558v1","updated":"2023-07-24T06:51:07Z","published":"2023-07-24T06:51:07Z","title":"Revisiting Event-based Video Frame Interpolation","summary":" Dynamic vision sensors or event cameras provide rich complementary\ninformation for video frame interpolation. Existing state-of-the-art methods\nfollow the paradigm of combining both synthesis-based and warping networks.\nHowever, few of those methods fully respect the intrinsic characteristics of\nevents streams. Given that event cameras only encode intensity changes and\npolarity rather than color intensities, estimating optical flow from events is\narguably more difficult than from RGB information. We therefore propose to\nincorporate RGB information in an event-guided optical flow refinement\nstrategy. Moreover, in light of the quasi-continuous nature of the time signals\nprovided by event cameras, we propose a divide-and-conquer strategy in which\nevent-based intermediate frame synthesis happens incrementally in multiple\nsimplified stages rather than in a single, long stage. Extensive experiments on\nboth synthetic and real-world datasets show that these modifications lead to\nmore reliable and realistic intermediate frame results than previous video\nframe interpolation methods. 
Our findings underline that a careful\nconsideration of event characteristics such as high temporal density and\nelevated noise benefits interpolation accuracy.\n","authors":["Jiaben Chen","Yichen Zhu","Dongze Lian","Jiaqi Yang","Yifu Wang","Renrui Zhang","Xinhang Liu","Shenhan Qian","Laurent Kneip","Shenghua Gao"],"pdf_url":"https://arxiv.org/pdf/2307.12558v1.pdf","comment":"Accepted by IROS2023 Project Site:\n https://jiabenchen.github.io/revisit_event"},{"id":"http://arxiv.org/abs/2307.12548v1","updated":"2023-07-24T06:33:52Z","published":"2023-07-24T06:33:52Z","title":"MFMAN-YOLO: A Method for Detecting Pole-like Obstacles in Complex\n Environment","summary":" In real-world traffic, there are various uncertainties and complexities in\nroad and weather conditions. To solve the problem that the feature information\nof pole-like obstacles in complex environments is easily lost, resulting in low\ndetection accuracy and low real-time performance, a multi-scale hybrid\nattention mechanism detection algorithm is proposed in this paper. First, the\noptimal transport function Monge-Kantorovich (MK) is incorporated not only to\nsolve the problem of overlapping multiple prediction frames with optimal\nmatching but also the MK function can be regularized to prevent model\nover-fitting; then, the features at different scales are up-sampled separately\naccording to the optimized efficient multi-scale feature pyramid. Finally, the\nextraction of multi-scale feature space channel information is enhanced in\ncomplex environments based on the hybrid attention mechanism, which suppresses\nthe irrelevant complex environment background information and focuses the\nfeature information of pole-like obstacles. Meanwhile, this paper conducts real\nroad test experiments in a variety of complex environments. The experimental\nresults show that the detection precision, recall, and average precision of the\nmethod are 94.7%, 93.1%, and 97.4%, respectively, and the detection frame rate\nis 400 f/s. This research method can detect pole-like obstacles in a complex\nroad environment in real time and accurately, which further promotes innovation\nand progress in the field of automatic driving.\n","authors":["Lei Cai","Hao Wang","Congling Zhou","Yongqiang Wang","Boyu Liu"],"pdf_url":"https://arxiv.org/pdf/2307.12548v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2301.01482v5","updated":"2023-07-24T06:31:58Z","published":"2023-01-04T08:22:34Z","title":"Underwater Object Tracker: UOSTrack for Marine Organism Grasping of\n Underwater Vehicles","summary":" A visual single-object tracker is an indispensable component of underwater\nvehicles (UVs) in marine organism grasping tasks. Its accuracy and stability\nare imperative to guide the UVs to perform grasping behavior. Although\nsingle-object trackers show competitive performance in the challenge of\nunderwater image degradation, there are still issues with sample imbalance and\nexclusion of similar objects that need to be addressed for application in\nmarine organism grasping. This paper proposes Underwater OSTrack (UOSTrack),\nwhich consists of underwater image and open-air sequence hybrid training\n(UOHT), and motion-based post-processing (MBPP). The UOHT training paradigm is\ndesigned to train the sample-imbalanced underwater tracker so that the tracker\nis exposed to a great number of underwater domain training samples and learns\nthe feature expressions. The MBPP paradigm is proposed to exclude similar\nobjects. 
It uses the estimation box predicted with a Kalman filter and the\ncandidate boxes in the response map to relocate the lost tracked object in the\ncandidate area. UOSTrack achieves an average performance improvement of 4.41%\nand a maximum improvement of 7.98% over state-of-the-art methods on various\nbenchmarks. Field experiments have verified the accuracy and stability of our\nproposed UOSTrack for UVs in marine organism grasping tasks. More details can\nbe found at https://github.com/LiYunfengLYF/UOSTrack.\n","authors":["Yunfeng Li","Bo Wang","Ye Li","Zhuoyan Liu","Wei Huo","Yueming Li","Jian Cao"],"pdf_url":"https://arxiv.org/pdf/2301.01482v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12545v1","updated":"2023-07-24T06:22:37Z","published":"2023-07-24T06:22:37Z","title":"Towards Video Anomaly Retrieval from Video Anomaly Detection: New\n Benchmarks and Model","summary":" Video anomaly detection (VAD) has received increasing attention due to its\npotential applications; its current dominant tasks focus on detecting anomalies\nonline at the frame level, which can be roughly interpreted as binary\nor multiple event classification. However, such a setup that builds\nrelationships between complicated anomalous events and single labels, e.g.,\n``vandalism'', is superficial, since single labels are insufficient to\ncharacterize anomalous events. In reality, users tend to search for a specific\nvideo rather than a series of approximate videos. Therefore, retrieving\nanomalous events using detailed descriptions is practical and positive, but few\nstudies focus on this. In this context, we propose a novel task called Video\nAnomaly Retrieval (VAR), which aims to pragmatically retrieve relevant\nanomalous videos by cross-modal queries, e.g., language descriptions and\nsynchronous audio. Unlike current video retrieval, where videos are assumed\nto be temporally well-trimmed and of short duration, VAR is devised to retrieve\nlong untrimmed videos which may be partially relevant to the given query. To\nachieve this, we present two large-scale VAR benchmarks, UCFCrime-AR and\nXDViolence-AR, constructed on top of prevalent anomaly datasets. Meanwhile, we\ndesign a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we\npropose an anomaly-led sampling to focus on key segments in long untrimmed\nvideos. Then, we introduce an efficient pretext task to enhance semantic\nassociations between video-text fine-grained representations. Besides, we\nleverage two complementary alignments to further match cross-modal contents.\nExperimental results on two benchmarks reveal the challenges of the VAR task\nand also demonstrate the advantages of our tailored method.\n","authors":["Peng Wu","Jing Liu","Xiangteng He","Yuxin Peng","Peng Wang","Yanning Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12545v1.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2307.12542v1","updated":"2023-07-24T06:12:37Z","published":"2023-07-24T06:12:37Z","title":"Client-Level Differential Privacy via Adaptive Intermediary in Federated\n Medical Imaging","summary":" Despite recent progress in enhancing the privacy of federated learning (FL)\nvia differential privacy (DP), the trade-off of DP between privacy protection\nand performance is still underexplored for real-world medical scenarios. 
In this\npaper, we propose to optimize the trade-off under the context of client-level\nDP, which focuses on privacy during communications. However, FL for medical\nimaging involves typically much fewer participants (hospitals) than other\ndomains (e.g., mobile devices), thus ensuring clients be differentially private\nis much more challenging. To tackle this problem, we propose an adaptive\nintermediary strategy to improve performance without harming privacy.\nSpecifically, we theoretically find splitting clients into sub-clients, which\nserve as intermediaries between hospitals and the server, can mitigate the\nnoises introduced by DP without harming privacy. Our proposed approach is\nempirically evaluated on both classification and segmentation tasks using two\npublic datasets, and its effectiveness is demonstrated with significant\nperformance improvements and comprehensive analytical studies. Code is\navailable at: https://github.com/med-air/Client-DP-FL.\n","authors":["Meirui Jiang","Yuan Zhong","Anjie Le","Xiaoxiao Li","Qi Dou"],"pdf_url":"https://arxiv.org/pdf/2307.12542v1.pdf","comment":"Accepted by 26th International Conference on Medical Image Computing\n and Computer Assisted Intervention (MICCAI'23)"},{"id":"http://arxiv.org/abs/2303.05021v3","updated":"2023-07-24T06:06:27Z","published":"2023-03-09T03:48:24Z","title":"DiffusionDepth: Diffusion Denoising Approach for Monocular Depth\n Estimation","summary":" Monocular depth estimation is a challenging task that predicts the pixel-wise\ndepth from a single 2D image. Current methods typically model this problem as a\nregression or classification task. We propose DiffusionDepth, a new approach\nthat reformulates monocular depth estimation as a denoising diffusion process.\nIt learns an iterative denoising process to `denoise' random depth distribution\ninto a depth map with the guidance of monocular visual conditions. The process\nis performed in the latent space encoded by a dedicated depth encoder and\ndecoder. Instead of diffusing ground truth (GT) depth, the model learns to\nreverse the process of diffusing the refined depth of itself into random depth\ndistribution. This self-diffusion formulation overcomes the difficulty of\napplying generative models to sparse GT depth scenarios. The proposed approach\nbenefits this task by refining depth estimation step by step, which is superior\nfor generating accurate and highly detailed depth maps. Experimental results on\nKITTI and NYU-Depth-V2 datasets suggest that a simple yet efficient diffusion\napproach could reach state-of-the-art performance in both indoor and outdoor\nscenarios with acceptable inference time.\n","authors":["Yiqun Duan","Xianda Guo","Zheng Zhu"],"pdf_url":"https://arxiv.org/pdf/2303.05021v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12540v1","updated":"2023-07-24T06:04:12Z","published":"2023-07-24T06:04:12Z","title":"SelFormaly: Towards Task-Agnostic Unified Anomaly Detection","summary":" The core idea of visual anomaly detection is to learn the normality from\nnormal images, but previous works have been developed specifically for certain\ntasks, leading to fragmentation among various tasks: defect detection, semantic\nanomaly detection, multi-class anomaly detection, and anomaly clustering. This\none-task-one-model approach is resource-intensive and incurs high maintenance\ncosts as the number of tasks increases. This paper presents SelFormaly, a\nuniversal and powerful anomaly detection framework. 
We emphasize the necessity\nof our off-the-shelf approach by pointing out a suboptimal issue with\nfluctuating performance in previous online encoder-based methods. In addition,\nwe question the effectiveness of using ConvNets as previously employed in the\nliterature and confirm that self-supervised ViTs are suitable for unified\nanomaly detection. We introduce back-patch masking and discover the new role of\ntop k-ratio feature matching to achieve unified and powerful anomaly detection.\nBack-patch masking eliminates irrelevant regions that possibly hinder\ntarget-centric detection with representations of the scene layout. The top\nk-ratio feature matching unifies various anomaly levels and tasks. Finally,\nSelFormaly achieves state-of-the-art results across various datasets for all\nthe aforementioned tasks.\n","authors":["Yujin Lee","Harin Lim","Hyunsoo Yoon"],"pdf_url":"https://arxiv.org/pdf/2307.12540v1.pdf","comment":"11 pages, 7 figures"},{"id":"http://arxiv.org/abs/2307.12534v1","updated":"2023-07-24T05:43:34Z","published":"2023-07-24T05:43:34Z","title":"Towards Generalizable Deepfake Detection by Primary Region\n Regularization","summary":" The existing deepfake detection methods have reached a bottleneck in\ngeneralizing to unseen forgeries and manipulation approaches. Based on the\nobservation that the deepfake detectors exhibit a preference for overfitting\nthe specific primary regions in input, this paper enhances the generalization\ncapability from a novel regularization perspective. This can be simply achieved\nby augmenting the images through primary region removal, thereby preventing the\ndetector from over-relying on data bias. Our method consists of two stages,\nnamely the static localization for primary region maps, as well as the dynamic\nexploitation of primary region masks. The proposed method can be seamlessly\nintegrated into different backbones without affecting their inference\nefficiency. We conduct extensive experiments over three widely used deepfake\ndatasets - DFDC, DF-1.0, and Celeb-DF with five backbones. Our method\ndemonstrates an average performance improvement of 6% across different\nbackbones and performs competitively with several state-of-the-art baselines.\n","authors":["Harry Cheng","Yangyang Guo","Tianyi Wang","Liqiang Nie","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2307.12534v1.pdf","comment":"12 pages. Code and Dataset: https://github.com/xaCheng1996/PRLE"},{"id":"http://arxiv.org/abs/2307.12532v1","updated":"2023-07-24T05:36:19Z","published":"2023-07-24T05:36:19Z","title":"On the Connection between Pre-training Data Diversity and Fine-tuning\n Robustness","summary":" Pre-training has been widely adopted in deep learning to improve model\nperformance, especially when the training data for a target task is limited. In\nour work, we seek to understand the implications of this training strategy on\nthe generalization properties of downstream models. More specifically, we ask\nthe following question: how do properties of the pre-training distribution\naffect the robustness of a fine-tuned model? The properties we explore include\nthe label space, label semantics, image diversity, data domains, and data\nquantity of the pre-training distribution. We find that the primary factor\ninfluencing downstream effective robustness (Taori et al., 2020) is data\nquantity, while other factors have limited significance. 
For example, reducing\nthe number of ImageNet pre-training classes by 4x while increasing the number\nof images per class by 4x (that is, keeping total data quantity fixed) does not\nimpact the robustness of fine-tuned models. We demonstrate our findings on\npre-training distributions drawn from various natural and synthetic data\nsources, primarily using the iWildCam-WILDS distribution shift as a test for\ndownstream robustness.\n","authors":["Vivek Ramanujan","Thao Nguyen","Sewoong Oh","Ludwig Schmidt","Ali Farhadi"],"pdf_url":"https://arxiv.org/pdf/2307.12532v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.18246v3","updated":"2023-07-24T05:35:30Z","published":"2023-03-31T17:59:09Z","title":"3D Human Pose Estimation via Intuitive Physics","summary":" Estimating 3D humans from images often produces implausible bodies that lean,\nfloat, or penetrate the floor. Such methods ignore the fact that bodies are\ntypically supported by the scene. A physics engine can be used to enforce\nphysical plausibility, but these are not differentiable, rely on unrealistic\nproxy bodies, and are difficult to integrate into existing optimization and\nlearning frameworks. In contrast, we exploit novel intuitive-physics (IP) terms\nthat can be inferred from a 3D SMPL body interacting with the scene. Inspired\nby biomechanics, we infer the pressure heatmap on the body, the Center of\nPressure (CoP) from the heatmap, and the SMPL body's Center of Mass (CoM). With\nthese, we develop IPMAN, to estimate a 3D body from a color image in a \"stable\"\nconfiguration by encouraging plausible floor contact and overlapping CoP and\nCoM. Our IP terms are intuitive, easy to implement, fast to compute,\ndifferentiable, and can be integrated into existing optimization and regression\nmethods. We evaluate IPMAN on standard datasets and MoYo, a new dataset with\nsynchronized multi-view images, ground-truth 3D bodies with complex poses,\nbody-floor contact, CoM and pressure. IPMAN produces more plausible results\nthan the state of the art, improving accuracy for static poses, while not\nhurting dynamic ones. Code and data are available for research at\nhttps://ipman.is.tue.mpg.de.\n","authors":["Shashank Tripathi","Lea Müller","Chun-Hao P. Huang","Omid Taheri","Michael J. Black","Dimitrios Tzionas"],"pdf_url":"https://arxiv.org/pdf/2303.18246v3.pdf","comment":"Accepted in CVPR'23. Project page: https://ipman.is.tue.mpg.de"},{"id":"http://arxiv.org/abs/2307.12526v1","updated":"2023-07-24T04:56:23Z","published":"2023-07-24T04:56:23Z","title":"Rethinking Medical Report Generation: Disease Revealing Enhancement with\n Knowledge Graph","summary":" Knowledge Graph (KG) plays a crucial role in Medical Report Generation (MRG)\nbecause it reveals the relations among diseases and thus can be utilized to\nguide the generation process. However, constructing a comprehensive KG is\nlabor-intensive and its applications on the MRG process are under-explored. In\nthis study, we establish a complete KG on chest X-ray imaging that includes 137\ntypes of diseases and abnormalities. Based on this KG, we find that the current\nMRG data sets exhibit a long-tailed problem in disease distribution. To\nmitigate this problem, we introduce a novel augmentation strategy that enhances\nthe representation of disease types in the tail-end of the distribution. We\nfurther design a two-stage MRG approach, where a classifier is first trained to\ndetect whether the input images exhibit any abnormalities. 
The classified\nimages are then independently fed into two transformer-based generators,\nnamely, ``disease-specific generator\" and ``disease-free generator\" to generate\nthe corresponding reports. To enhance the clinical evaluation of whether the\ngenerated reports correctly describe the diseases appearing in the input image,\nwe propose diverse sensitivity (DS), a new metric that checks whether generated\ndiseases match ground truth and measures the diversity of all generated\ndiseases. Results show that the proposed two-stage generation framework and\naugmentation strategies improve DS by a considerable margin, indicating a\nnotable reduction in the long-tailed problem associated with under-represented\ndiseases.\n","authors":["Yixin Wang","Zihao Lin","Haoyu Dong"],"pdf_url":"https://arxiv.org/pdf/2307.12526v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12517v1","updated":"2023-07-24T04:21:51Z","published":"2023-07-24T04:21:51Z","title":"Entropy Transformer Networks: A Learning Approach via Tangent Bundle\n Data Manifold","summary":" This paper focuses on an accurate and fast interpolation approach for image\ntransformation employed in the design of CNN architectures. Standard Spatial\nTransformer Networks (STNs) use bilinear or linear interpolation as their\ninterpolation, with unrealistic assumptions about the underlying data\ndistributions, which leads to poor performance under scale variations.\nMoreover, STNs do not preserve the norm of gradients in propagation due to\ntheir dependency on sparse neighboring pixels. To address this problem, a novel\nEntropy STN (ESTN) is proposed that interpolates on the data manifold\ndistributions. In particular, random samples are generated for each pixel in\nassociation with the tangent space of the data manifold and construct a linear\napproximation of their intensity values with an entropy regularizer to compute\nthe transformer parameters. A simple yet effective technique is also proposed\nto normalize the non-zero values of the convolution operation, to fine-tune the\nlayers for gradients' norm-regularization during training. Experiments on\nchallenging benchmarks show that the proposed ESTN can improve predictive\naccuracy over a range of computer vision tasks, including image reconstruction,\nand classification, while reducing the computational cost.\n","authors":["Pourya Shamsolmoali","Masoumeh Zareapoor"],"pdf_url":"https://arxiv.org/pdf/2307.12517v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.12539v2","updated":"2023-07-24T04:20:37Z","published":"2023-04-25T03:12:54Z","title":"Text-guided Eyeglasses Manipulation with Spatial Constraints","summary":" Virtual try-on of eyeglasses involves placing eyeglasses of different shapes\nand styles onto a face image without physically trying them on. While existing\nmethods have shown impressive results, the variety of eyeglasses styles is\nlimited and the interactions are not always intuitive or efficient. To address\nthese limitations, we propose a Text-guided Eyeglasses Manipulation method that\nallows for control of the eyeglasses shape and style based on a binary mask and\ntext, respectively. Specifically, we introduce a mask encoder to extract mask\nconditions and a modulation module that enables simultaneous injection of text\nand mask conditions. This design allows for fine-grained control of the\neyeglasses' appearance based on both textual descriptions and spatial\nconstraints. 
Our approach includes a disentangled mapper and a decoupling\nstrategy that preserves irrelevant areas, resulting in better local editing. We\nemploy a two-stage training scheme to handle the different convergence speeds\nof the various modality conditions, successfully controlling both the shape and\nstyle of eyeglasses. Extensive comparison experiments and ablation analyses\ndemonstrate the effectiveness of our approach in achieving diverse eyeglasses\nstyles while preserving irrelevant areas.\n","authors":["Jiacheng Wang","Ping Liu","Jingen Liu","Wei Xu"],"pdf_url":"https://arxiv.org/pdf/2304.12539v2.pdf","comment":"Revised version: add some experiments"},{"id":"http://arxiv.org/abs/2307.11466v2","updated":"2023-07-24T03:35:03Z","published":"2023-07-21T10:02:02Z","title":"MatSpectNet: Material Segmentation Network with Domain-Aware and\n Physically-Constrained Hyperspectral Reconstruction","summary":" Achieving accurate material segmentation for 3-channel RGB images is\nchallenging due to the considerable variation in a material's appearance.\nHyperspectral images, which are sets of spectral measurements sampled at\nmultiple wavelengths, theoretically offer distinct information for material\nidentification, as variations in intensity of electromagnetic radiation\nreflected by a surface depend on the material composition of a scene. However,\nexisting hyperspectral datasets are impoverished regarding the number of images\nand material categories for the dense material segmentation task, and\ncollecting and annotating hyperspectral images with a spectral camera is\nprohibitively expensive. To address this, we propose a new model, the\nMatSpectNet, to segment materials with recovered hyperspectral images from RGB\nimages. The network leverages the principles of colour perception in modern\ncameras to constrain the reconstructed hyperspectral images and employs the\ndomain adaptation method to generalise the hyperspectral reconstruction\ncapability from a spectral recovery dataset to material segmentation datasets.\nThe reconstructed hyperspectral images are further filtered using learned\nresponse curves and enhanced with human perception. The performance of\nMatSpectNet is evaluated on the LMD dataset as well as the OpenSurfaces\ndataset. Our experiments demonstrate that MatSpectNet attains a 1.60% increase\nin average pixel accuracy and a 3.42% improvement in mean class accuracy\ncompared with the most recent publication. The project code is attached to the\nsupplementary material and will be published on GitHub.\n","authors":["Yuwen Heng","Yihong Wu","Jiawen Chen","Srinandan Dasmahapatra","Hansung Kim"],"pdf_url":"https://arxiv.org/pdf/2307.11466v2.pdf","comment":"7 pages main paper"},{"id":"http://arxiv.org/abs/2304.03483v2","updated":"2023-07-24T03:28:34Z","published":"2023-04-07T05:29:59Z","title":"RED-PSM: Regularization by Denoising of Partially Separable Models for\n Dynamic Imaging","summary":" Dynamic imaging addresses the recovery of a time-varying 2D or 3D object at\neach time instant using its undersampled measurements. In particular, in the\ncase of dynamic tomography, only a single projection at a single view angle may\nbe available at a time, making the problem severely ill-posed. In this work, we\npropose an approach, RED-PSM, which combines for the first time two powerful\ntechniques to address this challenging imaging problem. The first is\npartially separable models, which have been used to efficiently introduce a\nlow-rank prior for the spatio-temporal object. 
The second is the recent\nRegularization by Denoising (RED), which provides a flexible framework to\nexploit the impressive performance of state-of-the-art image denoising\nalgorithms, for various inverse problems. We propose a partially separable\nobjective with RED and a computationally efficient and scalable optimization\nscheme with variable splitting and ADMM. Theoretical analysis proves the\nconvergence of our objective to a value corresponding to a stationary point\nsatisfying the first-order optimality conditions. Convergence is accelerated by\na particular projection-domain-based initialization. We demonstrate the\nperformance and computational improvements of our proposed RED-PSM with a\nlearned image denoiser by comparing it to a recent deep-prior-based method\nknown as TD-DIP. Although the main focus is on dynamic tomography, we also show\nthe performance advantages of RED-PSM in a cardiac dynamic MRI setting.\n","authors":["Berk Iskender","Marc L. Klasky","Yoram Bresler"],"pdf_url":"https://arxiv.org/pdf/2304.03483v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12502v1","updated":"2023-07-24T03:27:41Z","published":"2023-07-24T03:27:41Z","title":"Cross Contrastive Feature Perturbation for Domain Generalization","summary":" Domain generalization (DG) aims to learn a robust model from source domains\nthat generalize well on unseen target domains. Recent studies focus on\ngenerating novel domain samples or features to diversify distributions\ncomplementary to source domains. Yet, these approaches can hardly deal with the\nrestriction that the samples synthesized from various domains can cause\nsemantic distortion. In this paper, we propose an online one-stage Cross\nContrasting Feature Perturbation (CCFP) framework to simulate domain shift by\ngenerating perturbed features in the latent space while regularizing the model\nprediction against domain shift. Different from the previous fixed synthesizing\nstrategy, we design modules with learnable feature perturbations and semantic\nconsistency constraints. In contrast to prior work, our method does not use any\ngenerative-based models or domain labels. We conduct extensive experiments on a\nstandard DomainBed benchmark with a strict evaluation protocol for a fair\ncomparison. Comprehensive experiments show that our method outperforms the\nprevious state-of-the-art, and quantitative analyses illustrate that our\napproach can alleviate the domain shift problem in out-of-distribution (OOD)\nscenarios.\n","authors":["Chenming Li","Daoan Zhang","Wenjian Huang","Jianguo Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12502v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.09186v4","updated":"2023-07-24T03:20:19Z","published":"2022-04-20T02:14:20Z","title":"Reconstruction-Aware Prior Distillation for Semi-supervised Point Cloud\n Completion","summary":" Real-world sensors often produce incomplete, irregular, and noisy point\nclouds, making point cloud completion increasingly important. However, most\nexisting completion methods rely on large paired datasets for training, which\nis labor-intensive. This paper proposes RaPD, a novel semi-supervised point\ncloud completion method that reduces the need for paired datasets. RaPD\nutilizes a two-stage training scheme, where a deep semantic prior is learned in\nstage 1 from unpaired complete and incomplete point clouds, and a\nsemi-supervised prior distillation process is introduced in stage 2 to train a\ncompletion network using only a small number of paired samples. 
Additionally, a\nself-supervised completion module is introduced to improve performance using\nunpaired incomplete point clouds. Experiments on multiple datasets show that\nRaPD outperforms previous methods in both homologous and heterologous\nscenarios.\n","authors":["Zhaoxin Fan","Yulin He","Zhicheng Wang","Kejian Wu","Hongyan Liu","Jun He"],"pdf_url":"https://arxiv.org/pdf/2204.09186v4.pdf","comment":"Accepted to IJCAI 2023"},{"id":"http://arxiv.org/abs/2307.12499v1","updated":"2023-07-24T03:10:02Z","published":"2023-07-24T03:10:02Z","title":"AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion\n Models","summary":" Unrestricted adversarial attacks present a serious threat to deep learning\nmodels and adversarial defense techniques. They pose severe security problems\nfor deep learning applications because they can effectively bypass defense\nmechanisms. However, previous attack methods often utilize Generative\nAdversarial Networks (GANs), which are not theoretically provable and thus\ngenerate unrealistic examples by incorporating adversarial objectives,\nespecially for large-scale datasets like ImageNet. In this paper, we propose a\nnew method, called AdvDiff, to generate unrestricted adversarial examples with\ndiffusion models. We design two novel adversarial guidance techniques to\nconduct adversarial sampling in the reverse generation process of diffusion\nmodels. These two techniques are effective and stable in generating high-quality,\nrealistic adversarial examples by integrating gradients of the target\nclassifier interpretably. Experimental results on MNIST and ImageNet datasets\ndemonstrate that AdvDiff is effective in generating unrestricted adversarial\nexamples, which outperforms GAN-based methods in terms of attack performance\nand generation quality.\n","authors":["Xuelong Dai","Kaisheng Liang","Bin Xiao"],"pdf_url":"https://arxiv.org/pdf/2307.12499v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.09417v2","updated":"2023-07-24T03:06:15Z","published":"2022-08-19T16:04:29Z","title":"Target-oriented Sentiment Classification with Sequential Cross-modal\n Semantic Graph","summary":" Multi-modal aspect-based sentiment classification (MABSC) is the task of\nclassifying the sentiment of a target entity mentioned in a sentence and an\nimage. However, previous methods failed to account for the fine-grained\nsemantic association between the image and the text, which resulted in limited\nidentification of fine-grained image aspects and opinions. To address these\nlimitations, in this paper we propose a new approach called SeqCSG, which\nenhances the encoder-decoder sentiment classification framework using\nsequential cross-modal semantic graphs. SeqCSG utilizes image captions and\nscene graphs to extract both global and local fine-grained image information\nand considers them as elements of the cross-modal semantic graph along with\ntokens from tweets. The sequential cross-modal semantic graph is represented as\na sequence with a multi-modal adjacency matrix indicating relationships between\nelements. Experimental results show that the approach outperforms existing\nmethods and achieves state-of-the-art performance on two standard datasets.\nFurther analysis has demonstrated that the model can implicitly learn the\ncorrelation between fine-grained information of the image and the text with the\ngiven target. Our code is available at https://github.com/zjukg/SeqCSG.\n","authors":["Yufeng Huang","Zhuo Chen","Jiaoyan Chen","Jeff Z. 
Pan","Zhen Yao","Wen Zhang"],"pdf_url":"https://arxiv.org/pdf/2208.09417v2.pdf","comment":"ICANN 2023, https://github.com/zjukg/SeqCSG"},{"id":"http://arxiv.org/abs/2307.11411v2","updated":"2023-07-24T02:57:01Z","published":"2023-07-21T08:10:26Z","title":"Deep Directly-Trained Spiking Neural Networks for Object Detection","summary":" Spiking neural networks (SNNs) are brain-inspired energy-efficient models\nthat encode information in spatiotemporal dynamics. Recently, deep SNNs trained\ndirectly have shown great success in achieving high performance on\nclassification tasks with very few time steps. However, how to design a\ndirectly-trained SNN for the regression task of object detection still remains\na challenging problem. To address this problem, we propose EMS-YOLO, a novel\ndirectly-trained SNN framework for object detection, which is the first trial\nto train a deep SNN with surrogate gradients for object detection rather than\nANN-SNN conversion strategies. Specifically, we design a full-spike residual\nblock, EMS-ResNet, which can effectively extend the depth of the\ndirectly-trained SNN with low power consumption. Furthermore, we theoretically\nanalyze and prove the EMS-ResNet could avoid gradient vanishing or exploding.\nThe results demonstrate that our approach outperforms the state-of-the-art\nANN-SNN conversion methods (at least 500 time steps) in extremely fewer time\nsteps (only 4 time steps). It is shown that our model could achieve comparable\nperformance to the ANN with the same architecture while consuming 5.83 times\nless energy on the frame-based COCO Dataset and the event-based Gen1 Dataset.\n","authors":["Qiaoyi Su","Yuhong Chou","Yifan Hu","Jianing Li","Shijie Mei","Ziyang Zhang","Guoqi Li"],"pdf_url":"https://arxiv.org/pdf/2307.11411v2.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2307.12493v1","updated":"2023-07-24T02:50:44Z","published":"2023-07-24T02:50:44Z","title":"TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition","summary":" Text-driven diffusion models have exhibited impressive generative\ncapabilities, enabling various image editing tasks. In this paper, we propose\nTF-ICON, a novel Training-Free Image COmpositioN framework that harnesses the\npower of text-driven diffusion models for cross-domain image-guided\ncomposition. This task aims to seamlessly integrate user-provided objects into\na specific visual context. Current diffusion-based methods often involve costly\ninstance-based optimization or finetuning of pretrained models on customized\ndatasets, which can potentially undermine their rich prior. In contrast,\nTF-ICON can leverage off-the-shelf diffusion models to perform cross-domain\nimage-guided composition without requiring additional training, finetuning, or\noptimization. Moreover, we introduce the exceptional prompt, which contains no\ninformation, to facilitate text-driven diffusion models in accurately inverting\nreal images into latent representations, forming the basis for compositing. Our\nexperiments show that equipping Stable Diffusion with the exceptional prompt\noutperforms state-of-the-art inversion methods on various datasets (CelebA-HQ,\nCOCO, and ImageNet), and that TF-ICON surpasses prior baselines in versatile\nvisual domains. 
Code is available at https://github.com/Shilin-LU/TF-ICON\n","authors":["Shilin Lu","Yanzhu Liu","Adams Wai-Kin Kong"],"pdf_url":"https://arxiv.org/pdf/2307.12493v1.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2307.00932v2","updated":"2023-07-24T01:57:52Z","published":"2023-07-03T11:13:28Z","title":"A large calcium-imaging dataset reveals a systematic V4 organization for\n natural scenes","summary":" The visual system evolved to process natural scenes, yet most of our\nunderstanding of the topology and function of visual cortex derives from\nstudies using artificial stimuli. To gain deeper insights into visual\nprocessing of natural scenes, we utilized widefield calcium-imaging of primate\nV4 in response to many natural images, generating a large dataset of\ncolumnar-scale responses. We used this dataset to build a digital twin of V4\nvia deep learning, generating a detailed topographical map of natural image\npreferences at each cortical position. The map revealed clustered functional\ndomains for specific classes of natural image features. These ranged from\nsurface-related attributes like color and texture to shape-related features\nsuch as edges, curvature, and facial features. We validated the model-predicted\ndomains with additional widefield calcium-imaging and single-cell resolution\ntwo-photon imaging. Our study illuminates the detailed topological organization\nand neural codes in V4 that represent natural scenes.\n","authors":["Tianye Wang","Haoxuan Yao","Tai Sing Lee","Jiayi Hong","Yang Li","Hongfei Jiang","Ian Max Andolina","Shiming Tang"],"pdf_url":"https://arxiv.org/pdf/2307.00932v2.pdf","comment":"39 pages, 14 figures"},{"id":"http://arxiv.org/abs/2305.01788v3","updated":"2023-07-24T00:54:51Z","published":"2023-05-02T21:33:10Z","title":"Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation\n Incorporating Gloss Information","summary":" Visual Word Sense Disambiguation (VWSD) is a task to find the image that most\naccurately depicts the correct sense of the target word for the given context.\nPreviously, image-text matching models often suffered from recognizing\npolysemous words. This paper introduces an unsupervised VWSD approach that uses\ngloss information of an external lexical knowledge-base, especially the sense\ndefinitions. Specifically, we suggest employing Bayesian inference to\nincorporate the sense definitions when sense information of the answer is not\nprovided. In addition, to ameliorate the out-of-dictionary (OOD) issue, we\npropose a context-aware definition generation with GPT-3. Experimental results\nshow that the VWSD performance significantly increased with our Bayesian\ninference-based approach. In addition, our context-aware definition generation\nachieved prominent performance improvement in OOD examples exhibiting better\nperformance than the existing definition generation method.\n","authors":["Sunjae Kwon","Rishabh Garodia","Minhwa Lee","Zhichao Yang","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2305.01788v3.pdf","comment":"ACL 2023, https://aclanthology.org/2023.acl-long.88"},{"id":"http://arxiv.org/abs/2307.12463v1","updated":"2023-07-24T00:53:46Z","published":"2023-07-24T00:53:46Z","title":"Rethinking Data Distillation: Do Not Overlook Calibration","summary":" Neural networks trained on distilled data often produce over-confident output\nand require correction by calibration methods. 
Existing calibration methods\nsuch as temperature scaling and mixup work well for networks trained on\noriginal large-scale data. However, we find that these methods fail to\ncalibrate networks trained on data distilled from large source datasets. In\nthis paper, we show that distilled data lead to networks that are not\ncalibratable due to (i) a more concentrated distribution of the maximum logits\nand (ii) the loss of information that is semantically meaningful but unrelated\nto classification tasks. To address this problem, we propose Masked Temperature\nScaling (MTS) and Masked Distillation Training (MDT), which mitigate the\nlimitations of distilled data and achieve better calibration results while\nmaintaining the efficiency of dataset distillation.\n","authors":["Dongyao Zhu","Bowen Lei","Jie Zhang","Yanbo Fang","Ruqi Zhang","Yiqun Xie","Dongkuan Xu"],"pdf_url":"https://arxiv.org/pdf/2307.12463v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2304.07916v2","updated":"2023-07-24T00:29:45Z","published":"2023-04-16T23:37:24Z","title":"GaitRef: Gait Recognition with Refined Sequential Skeletons","summary":" Identifying humans with their walking sequences, known as gait recognition,\nis a useful biometric understanding task as it can be observed from a long\ndistance and does not require cooperation from the subject. Two common\nmodalities used for representing the walking sequence of a person are\nsilhouettes and joint skeletons. Silhouette sequences, which record the\nboundary of the walking person in each frame, may suffer from the variant\nappearances from carried-on objects and clothes of the person. Framewise joint\ndetections are noisy and introduce some jitters that are not consistent with\nsequential detections. In this paper, we combine the silhouettes and skeletons\nand refine the framewise joint predictions for gait recognition with temporal\ninformation from the silhouette sequences. We show that the refined skeletons\ncan improve gait recognition performance without extra annotations. We compare\nour methods on four public datasets, CASIA-B, OUMVLP, Gait3D and GREW, and show\nstate-of-the-art performance.\n","authors":["Haidong Zhu","Wanrong Zheng","Zhaoheng Zheng","Ram Nevatia"],"pdf_url":"https://arxiv.org/pdf/2304.07916v2.pdf","comment":"IJCB 2023. Code is available at\n https://github.com/haidongz-usc/GaitRef"},{"id":"http://arxiv.org/abs/2307.12459v1","updated":"2023-07-24T00:03:09Z","published":"2023-07-24T00:03:09Z","title":"Robust face anti-spoofing framework with Convolutional Vision\n Transformer","summary":" Owing to the advances in image processing technology and large-scale\ndatasets, companies have implemented facial authentication processes, thereby\nstimulating increased focus on face anti-spoofing (FAS) against realistic\npresentation attacks. Recently, various attempts have been made to improve face\nrecognition performance using both global and local learning on face images;\nhowever, to the best of our knowledge, this is the first study to investigate\nwhether the robustness of FAS against domain shifts is improved by considering\nglobal information and local cues in face images captured using self-attention\nand convolutional layers. This study proposes a convolutional vision\ntransformer-based framework that achieves robust performance for various unseen\ndomain data. Our model resulted in 7.3%$p$ and 12.9%$p$ increases in FAS\nperformance compared to models using only a convolutional neural network or\nvision transformer, respectively. 
It also shows the highest average rank in\nsub-protocols of the cross-dataset setting over the other nine benchmark models for\ndomain generalization.\n","authors":["Yunseung Lee","Youngjun Kwak","Jinho Shin"],"pdf_url":"https://arxiv.org/pdf/2307.12459v1.pdf","comment":"ICIP 2023"},{"id":"http://arxiv.org/abs/2301.06363v2","updated":"2023-07-24T23:39:15Z","published":"2023-01-16T11:17:32Z","title":"A$^2$-UAV: Application-Aware Content and Network Optimization of\n Edge-Assisted UAV Systems","summary":" To perform advanced surveillance, Unmanned Aerial Vehicles (UAVs) require the\nexecution of edge-assisted computer vision (CV) tasks. In multi-hop UAV\nnetworks, the successful transmission of these tasks to the edge is severely\nchallenged due to severe bandwidth constraints. For this reason, we propose a\nnovel A$^2$-UAV framework to optimize the number of correctly executed tasks at\nthe edge. In stark contrast with existing art, we take an application-aware\napproach and formulate a novel Application-Aware Task Planning Problem\n(A$^2$-TPP) that takes into account (i) the relationship between deep neural\nnetwork (DNN) accuracy and image compression for the classes of interest based\non the available dataset, (ii) the target positions, (iii) the current\nenergy/position of the UAVs to optimize routing, data pre-processing and target\nassignment for each UAV. We demonstrate A$^2$-TPP is NP-Hard and propose a\npolynomial-time algorithm to solve it efficiently. We extensively evaluate\nA$^2$-UAV through real-world experiments with a testbed composed of four DJI\nMavic Air 2 UAVs. We consider state-of-the-art image classification tasks with\nfour different DNN models (i.e., DenseNet, ResNet152, ResNet50 and\nMobileNet-V2) and object detection tasks using YoloV4 trained on the ImageNet\ndataset. Results show that A$^2$-UAV attains on average around 38% more\naccomplished tasks than the state-of-the-art, with 400% more accomplished tasks\nwhen the number of targets increases significantly. To allow full\nreproducibility, we pledge to share datasets and code with the research\ncommunity.\n","authors":["Andrea Coletta","Flavio Giorgi","Gaia Maselli","Matteo Prata","Domenicomichele Silvestri","Jonathan Ashdown","Francesco Restuccia"],"pdf_url":"https://arxiv.org/pdf/2301.06363v2.pdf","comment":"Accepted to INFOCOM 2023"},{"id":"http://arxiv.org/abs/2307.13136v1","updated":"2023-07-24T21:29:48Z","published":"2023-07-24T21:29:48Z","title":"Does Progress On Object Recognition Benchmarks Improve Real-World\n Generalization?","summary":" For more than a decade, researchers have measured progress in object\nrecognition on ImageNet-based generalization benchmarks such as ImageNet-A, -C,\nand -R. Recent advances in foundation models, trained on orders of magnitude\nmore data, have begun to saturate these standard benchmarks, but remain brittle\nin practice. This suggests standard benchmarks, which tend to focus on\npredefined or synthetic changes, may not be sufficient for measuring real world\ngeneralization. Consequently, we propose studying generalization across\ngeography as a more realistic measure of progress using two datasets of objects\nfrom households across the globe. We conduct an extensive empirical evaluation\nof progress across nearly 100 vision models up to the most recent foundation\nmodels. 
We first identify a progress gap between standard benchmarks and\nreal-world, geographical shifts: progress on ImageNet results in up to 2.5x\nmore progress on standard generalization benchmarks than real-world\ndistribution shifts. Second, we study model generalization across geographies\nby measuring the disparities in performance across regions, a more fine-grained\nmeasure of real world generalization. We observe all models have large\ngeographic disparities, even foundation CLIP models, with differences of 7-20%\nin accuracy between regions. Counter to modern intuition, we discover progress\non standard benchmarks fails to improve geographic disparities and often\nexacerbates them: geographic disparities between the least performant models\nand today's best models have more than tripled. Our results suggest scaling\nalone is insufficient for consistent robustness to real-world distribution\nshifts. Finally, we highlight in early experiments how simple last layer\nretraining on more representative, curated data can complement scaling as a\npromising direction of future work, reducing geographic disparity on both\nbenchmarks by over two-thirds.\n","authors":["Megan Richards","Polina Kirichenko","Diane Bouchacourt","Mark Ibrahim"],"pdf_url":"https://arxiv.org/pdf/2307.13136v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13133v1","updated":"2023-07-24T21:22:58Z","published":"2023-07-24T21:22:58Z","title":"simPLE: a visuotactile method learned in simulation to precisely pick,\n localize, regrasp, and place objects","summary":" Existing robotic systems have a clear tension between generality and\nprecision. Deployed solutions for robotic manipulation tend to fall into the\nparadigm of one robot solving a single task, lacking precise generalization,\ni.e., the ability to solve many tasks without compromising on precision. This\npaper explores solutions for precise and general pick-and-place. In precise\npick-and-place, i.e. kitting, the robot transforms an unstructured arrangement\nof objects into an organized arrangement, which can facilitate further\nmanipulation. We propose simPLE (simulation to Pick Localize and PLacE) as a\nsolution to precise pick-and-place. simPLE learns to pick, regrasp and place\nobjects precisely, given only the object CAD model and no prior experience. We\ndevelop three main components: task-aware grasping, visuotactile perception,\nand regrasp planning. Task-aware grasping computes affordances of grasps that\nare stable, observable, and favorable to placing. The visuotactile perception\nmodel relies on matching real observations against a set of simulated ones\nthrough supervised learning. Finally, we compute the desired robot motion by\nsolving a shortest path problem on a graph of hand-to-hand regrasps. On a\ndual-arm robot equipped with visuotactile sensing, we demonstrate\npick-and-place of 15 diverse objects with simPLE. The objects span a wide range\nof shapes and simPLE achieves successful placements into structured\narrangements with 1mm clearance over 90% of the time for 6 objects, and over\n80% of the time for 11 objects. 
Videos are available at\nhttp://mcube.mit.edu/research/simPLE.html .\n","authors":["Maria Bauza","Antonia Bronars","Yifan Hou","Ian Taylor","Nikhil Chavan-Dafle","Alberto Rodriguez"],"pdf_url":"https://arxiv.org/pdf/2307.13133v1.pdf","comment":"33 pages, 6 figures, 2 tables, submitted to Science Robotics"},{"id":"http://arxiv.org/abs/2205.04691v3","updated":"2023-07-24T20:56:50Z","published":"2022-05-10T06:24:09Z","title":"An Asynchronous Event-Based Algorithm for Periodic Signals","summary":" Let $0\\leq\\tau_{1}\\leq\\tau_{2}\\leq\\cdots\\leq\\tau_{m}\\leq1$, originated from a\nuniform distribution. Let also $\\epsilon,\\delta\\in\\mathbb{R}$, and\n$d\\in\\mathbb{N}$. What is the probability of having more than $d$ adjacent\n$\\tau_{i}$-s pairs that the distance between them is $\\delta$, up to an error\n$\\epsilon$ ? In this paper we are going to show how this untreated theoretical\nprobabilistic problem arises naturally from the motivation of analyzing a\nsimple asynchronous algorithm for detection of signals with a known frequency,\nusing the novel technology of an event camera.\n","authors":["David El-Chai Ben-Ezra","Ron Arad","Ayelet Padowicz","Israel Tugendhaft"],"pdf_url":"https://arxiv.org/pdf/2205.04691v3.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2307.13125v1","updated":"2023-07-24T20:53:59Z","published":"2023-07-24T20:53:59Z","title":"Deep Learning Approaches for Data Augmentation in Medical Imaging: A\n Review","summary":" Deep learning has become a popular tool for medical image analysis, but the\nlimited availability of training data remains a major challenge, particularly\nin the medical field where data acquisition can be costly and subject to\nprivacy regulations. Data augmentation techniques offer a solution by\nartificially increasing the number of training samples, but these techniques\noften produce limited and unconvincing results. To address this issue, a\ngrowing number of studies have proposed the use of deep generative models to\ngenerate more realistic and diverse data that conform to the true distribution\nof the data. In this review, we focus on three types of deep generative models\nfor medical image augmentation: variational autoencoders, generative\nadversarial networks, and diffusion models. We provide an overview of the\ncurrent state of the art in each of these models and discuss their potential\nfor use in different downstream tasks in medical imaging, including\nclassification, segmentation, and cross-modal translation. We also evaluate the\nstrengths and limitations of each model and suggest directions for future\nresearch in this field. Our goal is to provide a comprehensive review about the\nuse of deep generative models for medical image augmentation and to highlight\nthe potential of these models for improving the performance of deep learning\nalgorithms in medical image analysis.\n","authors":["Aghiles Kebaili","Jérôme Lapuyade-Lahorgue","Su Ruan"],"pdf_url":"https://arxiv.org/pdf/2307.13125v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13110v1","updated":"2023-07-24T19:59:15Z","published":"2023-07-24T19:59:15Z","title":"Automatic Infant Respiration Estimation from Video: A Deep Flow-based\n Algorithm and a Novel Public Benchmark","summary":" Respiration is a critical vital sign for infants, and continuous respiratory\nmonitoring is particularly important for newborns. However, neonates are\nsensitive and contact-based sensors present challenges in comfort, hygiene, and\nskin health, especially for preterm babies. 
As a step toward fully automatic,\ncontinuous, and contactless respiratory monitoring, we develop a deep-learning\nmethod for estimating respiratory rate and waveform from plain video footage in\nnatural settings. Our automated infant respiration flow-based network\n(AIRFlowNet) combines video-extracted optical flow input and spatiotemporal\nconvolutional processing tuned to the infant domain. We support our model with\nthe first public annotated infant respiration dataset with 125 videos\n(AIR-125), drawn from eight infant subjects, set varied pose, lighting, and\ncamera conditions. We include manual respiration annotations and optimize\nAIRFlowNet training on them using a novel spectral bandpass loss function. When\ntrained and tested on the AIR-125 infant data, our method significantly\noutperforms other state-of-the-art methods in respiratory rate estimation,\nachieving a mean absolute error of $\\sim$2.9 breaths per minute, compared to\n$\\sim$4.7--6.2 for other public models designed for adult subjects and more\nuniform environments.\n","authors":["Sai Kumar Reddy Manne","Shaotong Zhu","Sarah Ostadabbas","Michael Wan"],"pdf_url":"https://arxiv.org/pdf/2307.13110v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.05799v2","updated":"2023-07-24T19:13:20Z","published":"2023-07-11T20:46:19Z","title":"3D Medical Image Segmentation based on multi-scale MPU-Net","summary":" The high cure rate of cancer is inextricably linked to physicians' accuracy\nin diagnosis and treatment, therefore a model that can accomplish\nhigh-precision tumor segmentation has become a necessity in many applications\nof the medical industry. It can effectively lower the rate of misdiagnosis\nwhile considerably lessening the burden on clinicians. However, fully automated\ntarget organ segmentation is problematic due to the irregular stereo structure\nof 3D volume organs. As a basic model for this class of real applications,\nU-Net excels. It can learn certain global and local features, but still lacks\nthe capacity to grasp spatial long-range relationships and contextual\ninformation at multiple scales. This paper proposes a tumor segmentation model\nMPU-Net for patient volume CT images, which is inspired by Transformer with a\nglobal attention mechanism. By combining image serialization with the Position\nAttention Module, the model attempts to comprehend deeper contextual\ndependencies and accomplish precise positioning. Each layer of the decoder is\nalso equipped with a multi-scale module and a cross-attention mechanism. The\ncapability of feature extraction and integration at different levels has been\nenhanced, and the hybrid loss function developed in this study can better\nexploit high-resolution characteristic information. Moreover, the suggested\narchitecture is tested and evaluated on the Liver Tumor Segmentation Challenge\n2017 (LiTS 2017) dataset. Compared with the benchmark model U-Net, MPU-Net\nshows excellent segmentation results. The dice, accuracy, precision,\nspecificity, IOU, and MCC metrics for the best model segmentation results are\n92.17%, 99.08%, 91.91%, 99.52%, 85.91%, and 91.74%, respectively. Outstanding\nindicators in various aspects illustrate the exceptional performance of this\nframework in automatic medical image segmentation.\n","authors":["Zeqiu. Yu","Shuo. Han","Ziheng. 
Song"],"pdf_url":"https://arxiv.org/pdf/2307.05799v2.pdf","comment":"37 pages"},{"id":"http://arxiv.org/abs/2307.13078v1","updated":"2023-07-24T18:59:46Z","published":"2023-07-24T18:59:46Z","title":"Adaptive Certified Training: Towards Better Accuracy-Robustness\n Tradeoffs","summary":" As deep learning models continue to advance and are increasingly utilized in\nreal-world systems, the issue of robustness remains a major challenge. Existing\ncertified training methods produce models that achieve high provable robustness\nguarantees at certain perturbation levels. However, the main problem of such\nmodels is a dramatically low standard accuracy, i.e. accuracy on clean\nunperturbed data, that makes them impractical. In this work, we consider a more\nrealistic perspective of maximizing the robustness of a model at certain levels\nof (high) standard accuracy. To this end, we propose a novel certified training\nmethod based on a key insight that training with adaptive certified radii helps\nto improve both the accuracy and robustness of the model, advancing\nstate-of-the-art accuracy-robustness tradeoffs. We demonstrate the\neffectiveness of the proposed method on MNIST, CIFAR-10, and TinyImageNet\ndatasets. Particularly, on CIFAR-10 and TinyImageNet, our method yields models\nwith up to two times higher robustness, measured as an average certified radius\nof a test set, at the same levels of standard accuracy compared to baseline\napproaches.\n","authors":["Zhakshylyk Nurlanov","Frank R. Schmidt","Florian Bernard"],"pdf_url":"https://arxiv.org/pdf/2307.13078v1.pdf","comment":"Presented at ICML 2023 workshop \"New Frontiers in Adversarial Machine\n Learning\""},{"id":"http://arxiv.org/abs/2307.09588v2","updated":"2023-07-24T18:52:54Z","published":"2023-07-18T19:51:28Z","title":"Automating Wood Species Detection and Classification in Microscopic\n Images of Fibrous Materials with Deep Learning","summary":" We have developed a methodology for the systematic generation of a large\nimage dataset of macerated wood references, which we used to generate image\ndata for nine hardwood genera. This is the basis for a substantial approach to\nautomate, for the first time, the identification of hardwood species in\nmicroscopic images of fibrous materials by deep learning. Our methodology\nincludes a flexible pipeline for easy annotation of vessel elements. We compare\nthe performance of different neural network architectures and hyperparameters.\nOur proposed method performs similarly well to human experts. In the future,\nthis will improve controls on global wood fiber product flows to protect\nforests.\n","authors":["Lars Nieradzik","Jördis Sieburg-Rockel","Stephanie Helmling","Janis Keuper","Thomas Weibel","Andrea Olbrich","Henrike Stephani"],"pdf_url":"https://arxiv.org/pdf/2307.09588v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13069v1","updated":"2023-07-24T18:50:49Z","published":"2023-07-24T18:50:49Z","title":"General-Purpose Multi-Modal OOD Detection Framework","summary":" Out-of-distribution (OOD) detection identifies test samples that differ from\nthe training data, which is critical to ensuring the safety and reliability of\nmachine learning (ML) systems. While a plethora of methods have been developed\nto detect uni-modal OOD samples, only a few have focused on multi-modal OOD\ndetection. Current contrastive learning-based methods primarily study\nmulti-modal OOD detection in a scenario where both a given image and its\ncorresponding textual description come from a new domain. 
However, real-world\ndeployments of ML systems may face more anomaly scenarios caused by multiple\nfactors like sensor faults, bad weather, and environmental changes. Hence, the\ngoal of this work is to simultaneously detect from multiple different OOD\nscenarios in a fine-grained manner. To reach this goal, we propose a\ngeneral-purpose weakly-supervised OOD detection framework, called WOOD, that\ncombines a binary classifier and a contrastive learning component to reap the\nbenefits of both. In order to better distinguish the latent representations of\nin-distribution (ID) and OOD samples, we adopt the Hinge loss to constrain\ntheir similarity. Furthermore, we develop a new scoring metric to integrate the\nprediction results from both the binary classifier and contrastive learning for\nidentifying OOD samples. We evaluate the proposed WOOD model on multiple\nreal-world datasets, and the experimental results demonstrate that the WOOD\nmodel outperforms the state-of-the-art methods for multi-modal OOD detection.\nImportantly, our approach is able to achieve high accuracy in OOD detection in\nthree different OOD scenarios simultaneously. The source code will be made\npublicly available upon publication.\n","authors":["Viet Duong","Qiong Wu","Zhengyi Zhou","Eric Zavesky","Jiahe Chen","Xiangzhou Liu","Wen-Ling Hsu","Huajie Shao"],"pdf_url":"https://arxiv.org/pdf/2307.13069v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13060v1","updated":"2023-07-24T18:19:39Z","published":"2023-07-24T18:19:39Z","title":"On the characteristics of natural hydraulic dampers: An image-based\n approach to study the fluid flow behaviour inside the human meniscal tissue","summary":" The meniscal tissue is a layered material with varying properties influenced\nby collagen content and arrangement. Understanding the relationship between\nstructure and properties is crucial for disease management, treatment\ndevelopment, and biomaterial design. The internal layer of the meniscus is\nsofter and more deformable than the outer layers, thanks to interconnected\ncollagen channels that guide fluid flow. To investigate these relationships, we\npropose a novel approach that combines Computational Fluid Dynamics (CFD) with\nImage Analysis (CFD-IA). We analyze fluid flow in the internal architecture of\nthe human meniscus across a range of inlet velocities (0.1mm/s to 1.6m/s) using\nhigh-resolution 3D micro-computed tomography scans. Statistical correlations\nare observed between architectural parameters (tortuosity, connectivity,\nporosity, pore size) and fluid flow parameters (Re number distribution,\npermeability). Some channels exhibit Re values of 1400 at an inlet velocity of\n1.6m/s, and a transition from Darcy's regime to a non-Darcian regime occurs\naround an inlet velocity of 0.02m/s. Location-dependent permeability ranges\nfrom 20-32 Darcy. Regression modelling reveals a strong correlation between\nfluid velocity and tortuosity at high inlet velocities, as well as with channel\ndiameter at low inlet velocities. At higher inlet velocities, flow paths\ndeviate more from the preferential direction, resulting in a decrease in the\nconcentration parameter by an average of 0.4. This research provides valuable\ninsights into the fluid flow behaviour within the meniscus and its structural\ninfluences.\n","authors":["J. Waghorne","F. P. Bonomo","A. Rabbani","D. Bell","O. 
Barrera"],"pdf_url":"https://arxiv.org/pdf/2307.13060v1.pdf","comment":"20 Pages, 5 Figures"},{"id":"http://arxiv.org/abs/2307.02625v2","updated":"2023-07-24T18:16:38Z","published":"2023-07-05T19:56:50Z","title":"Retinex-based Image Denoising / Contrast Enhancement using Gradient\n Graph Laplacian Regularizer","summary":" Images captured in poorly lit conditions are often corrupted by acquisition\nnoise. Leveraging recent advances in graph-based regularization, we propose a\nfast Retinex-based restoration scheme that denoises and contrast-enhances an\nimage. Specifically, by Retinex theory we first assume that each image pixel is\na multiplication of its reflectance and illumination components. We next assume\nthat the reflectance and illumination components are piecewise constant (PWC)\nand continuous piecewise planar (PWP) signals, which can be recovered via graph\nLaplacian regularizer (GLR) and gradient graph Laplacian regularizer (GGLR)\nrespectively. We formulate quadratic objectives regularized by GLR and GGLR,\nwhich are minimized alternately until convergence by solving linear systems --\nwith improved condition numbers via proposed preconditioners -- via conjugate\ngradient (CG) efficiently. Experimental results show that our algorithm\nachieves competitive visual image quality while reducing computation complexity\nnoticeably.\n","authors":["Yeganeh Gharedaghi","Gene Cheung","Xianming Liu"],"pdf_url":"https://arxiv.org/pdf/2307.02625v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13011v1","updated":"2023-07-24T13:47:30Z","published":"2023-07-24T13:47:30Z","title":"Maximal Independent Sets for Pooling in Graph Neural Networks","summary":" Convolutional Neural Networks (CNNs) have enabled major advances in image\nclassification through convolution and pooling. In particular, image pooling\ntransforms a connected discrete lattice into a reduced lattice with the same\nconnectivity and allows reduction functions to consider all pixels in an image.\nHowever, there is no pooling that satisfies these properties for graphs. In\nfact, traditional graph pooling methods suffer from at least one of the\nfollowing drawbacks: Graph disconnection or overconnection, low decimation\nratio, and deletion of large parts of graphs. In this paper, we present three\npooling methods based on the notion of maximal independent sets that avoid\nthese pitfalls. Our experimental results confirm the relevance of maximal\nindependent set constraints for graph pooling.\n","authors":["Stevan Stanovic","Benoit Gaüzère","Luc Brun"],"pdf_url":"https://arxiv.org/pdf/2307.13011v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2307.09683v2","updated":"2023-07-24T15:41:03Z","published":"2023-07-18T23:35:53Z","title":"PubMed and Beyond: Recent Advances and Best Practices in Biomedical\n Literature Search","summary":" Biomedical research yields a wealth of information, much of which is only\naccessible through the literature. Consequently, literature search is an\nessential tool for building on prior knowledge in clinical and biomedical\nresearch. Although recent improvements in artificial intelligence have expanded\nfunctionality beyond keyword-based search, these advances may be unfamiliar to\nclinicians and researchers. In response, we present a survey of literature\nsearch tools tailored to both general and specific information needs in\nbiomedicine, with the objective of helping readers efficiently fulfill their\ninformation needs. 
We first examine the widely used PubMed search engine,\ndiscussing recent improvements and continued challenges. We then describe\nliterature search tools catering to five specific information needs: 1.\nIdentifying high-quality clinical research for evidence-based medicine. 2.\nRetrieving gene-related information for precision medicine and genomics. 3.\nSearching by meaning, including natural language questions. 4. Locating related\narticles with literature recommendation. 5. Mining literature to discover\nassociations between concepts such as diseases and genetic variants.\nAdditionally, we cover practical considerations and best practices for choosing\nand using these tools. Finally, we provide a perspective on the future of\nliterature search engines, considering recent breakthroughs in large language\nmodels such as ChatGPT. In summary, our survey provides a comprehensive view of\nbiomedical literature search functionalities with 36 publicly available tools.\n","authors":["Qiao Jin","Robert Leaman","Zhiyong Lu"],"pdf_url":"https://arxiv.org/pdf/2307.09683v2.pdf","comment":"27 pages, 6 figures, 36 tools"},{"id":"http://arxiv.org/abs/2307.12810v1","updated":"2023-07-24T14:00:07Z","published":"2023-07-24T14:00:07Z","title":"HeteFedRec: Federated Recommender Systems with Model Heterogeneity","summary":" Owing to the nature of privacy protection, federated recommender systems\n(FedRecs) have garnered increasing interest in the realm of on-device\nrecommender systems. However, most existing FedRecs only allow participating\nclients to collaboratively train a recommendation model of the same public\nparameter size. Training a model of the same size for all clients can lead to\nsuboptimal performance since clients possess varying resources. For example,\nclients with limited training data may prefer to train a smaller recommendation\nmodel to avoid excessive data consumption, while clients with sufficient data\nwould benefit from a larger model to achieve higher recommendation accuracy. To\naddress the above challenge, this paper introduces HeteFedRec, a novel FedRec\nframework that enables the assignment of personalized model sizes to\nparticipants. In HeteFedRec, we present a heterogeneous recommendation model\naggregation strategy, including a unified dual-task learning mechanism and a\ndimensional decorrelation regularization, to allow knowledge aggregation among\nrecommender models of different sizes. Additionally, a relation-based ensemble\nknowledge distillation method is proposed to effectively distil knowledge from\nheterogeneous item embeddings. Extensive experiments conducted on three\nreal-world recommendation datasets demonstrate the effectiveness and efficiency\nof HeteFedRec in training federated recommender systems under heterogeneous\nsettings.\n","authors":["Wei Yuan","Liang Qu","Lizhen Cui","Yongxin Tong","Xiaofang Zhou","Hongzhi Yin"],"pdf_url":"https://arxiv.org/pdf/2307.12810v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12798v1","updated":"2023-07-24T13:51:19Z","published":"2023-07-24T13:51:19Z","title":"RRAML: Reinforced Retrieval Augmented Machine Learning","summary":" The emergence of large language models (LLMs) has revolutionized machine\nlearning and related fields, showcasing remarkable abilities in comprehending,\ngenerating, and manipulating human language. However, their conventional usage\nthrough API-based text prompt submissions imposes certain limitations in terms\nof context constraints and external source availability. 
To address these\nchallenges, we propose a novel framework called Reinforced Retrieval Augmented\nMachine Learning (RRAML). RRAML integrates the reasoning capabilities of LLMs\nwith supporting information retrieved by a purpose-built retriever from a vast\nuser-provided database. By leveraging recent advancements in reinforcement\nlearning, our method effectively addresses several critical challenges.\nFirstly, it circumvents the need for accessing LLM gradients. Secondly, our\nmethod alleviates the burden of retraining LLMs for specific tasks, as it is\noften impractical or impossible due to restricted access to the model and the\ncomputational intensity involved. Additionally we seamlessly link the\nretriever's task with the reasoner, mitigating hallucinations and reducing\nirrelevant, and potentially damaging retrieved documents. We believe that the\nresearch agenda outlined in this paper has the potential to profoundly impact\nthe field of AI, democratizing access to and utilization of LLMs for a wide\nrange of entities.\n","authors":["Andrea Bacciu","Florin Cocunasu","Federico Siciliano","Fabrizio Silvestri","Nicola Tonellotto","Giovanni Trappolini"],"pdf_url":"https://arxiv.org/pdf/2307.12798v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12756v1","updated":"2023-07-24T12:58:47Z","published":"2023-07-24T12:58:47Z","title":"Unbiased Delayed Feedback Label Correction for Conversion Rate\n Prediction","summary":" Conversion rate prediction is critical to many online applications such as\ndigital display advertising. To capture dynamic data distribution, industrial\nsystems often require retraining models on recent data daily or weekly.\nHowever, the delay of conversion behavior usually leads to incorrect labeling,\nwhich is called delayed feedback problem. Existing work may fail to introduce\nthe correct information about false negative samples due to data sparsity and\ndynamic data distribution. To directly introduce the correct feedback label\ninformation, we propose an Unbiased delayed feedback Label Correction framework\n(ULC), which uses an auxiliary model to correct labels for observed negative\nfeedback samples. Firstly, we theoretically prove that the label-corrected loss\nis an unbiased estimate of the oracle loss using true labels. Then, as there\nare no ready training data for label correction, counterfactual labeling is\nused to construct artificial training data. Furthermore, since counterfactual\nlabeling utilizes only partial training data, we design an embedding-based\nalternative training method to enhance performance. Comparative experiments on\nboth public and private datasets and detailed analyses show that our proposed\napproach effectively alleviates the delayed feedback problem and consistently\noutperforms the previous state-of-the-art methods.\n","authors":["Yifan Wang","Peijie Sun","Min Zhang","Qinglin Jia","Jingjie Li","Shaoping Ma"],"pdf_url":"https://arxiv.org/pdf/2307.12756v1.pdf","comment":"accepted by KDD 2023"},{"id":"http://arxiv.org/abs/2307.12576v1","updated":"2023-07-24T07:47:21Z","published":"2023-07-24T07:47:21Z","title":"Self-refining of Pseudo Labels for Music Source Separation with Noisy\n Labeled Data","summary":" Music source separation (MSS) faces challenges due to the limited\navailability of correctly-labeled individual instrument tracks. With the push\nto acquire larger datasets to improve MSS performance, the inevitability of\nencountering mislabeled individual instrument tracks becomes a significant\nchallenge to address. 
This paper introduces an automated technique for refining\nthe labels in a partially mislabeled dataset. Our proposed self-refining\ntechnique, employed with a noisy-labeled dataset, results in only a 1% accuracy\ndegradation in multi-label instrument recognition compared to a classifier\ntrained on a clean-labeled dataset. The study demonstrates the importance of\nrefining noisy-labeled data in MSS model training and shows that utilizing the\nrefined dataset leads to comparable results derived from a clean-labeled\ndataset. Notably, upon only access to a noisy dataset, MSS models trained on a\nself-refined dataset even outperform those trained on a dataset refined with a\nclassifier trained on clean labels.\n","authors":["Junghyun Koo","Yunkee Chae","Chang-Bin Jeon","Kyogu Lee"],"pdf_url":"https://arxiv.org/pdf/2307.12576v1.pdf","comment":"24th International Society for Music Information Retrieval Conference\n (ISMIR 2023)"},{"id":"http://arxiv.org/abs/2307.10617v3","updated":"2023-07-24T07:03:01Z","published":"2023-07-20T06:35:43Z","title":"Unmasking Falsehoods in Reviews: An Exploration of NLP Techniques","summary":" In the contemporary digital landscape, online reviews have become an\nindispensable tool for promoting products and services across various\nbusinesses. Marketers, advertisers, and online businesses have found incentives\nto create deceptive positive reviews for their products and negative reviews\nfor their competitors' offerings. As a result, the writing of deceptive reviews\nhas become an unavoidable practice for businesses seeking to promote themselves\nor undermine their rivals. Detecting such deceptive reviews has become an\nintense and ongoing area of research. This research paper proposes a machine\nlearning model to identify deceptive reviews, with a particular focus on\nrestaurants. This study delves into the performance of numerous experiments\nconducted on a dataset of restaurant reviews known as the Deceptive Opinion\nSpam Corpus. To accomplish this, an n-gram model and max features are developed\nto effectively identify deceptive content, particularly focusing on fake\nreviews. A benchmark study is undertaken to explore the performance of two\ndifferent feature extraction techniques, which are then coupled with five\ndistinct machine learning classification algorithms. The experimental results\nreveal that the passive aggressive classifier stands out among the various\nalgorithms, showcasing the highest accuracy not only in text classification but\nalso in identifying fake reviews. Moreover, the research delves into data\naugmentation and implements various deep learning techniques to further enhance\nthe process of detecting deceptive reviews. 
The findings shed light on the\nefficacy of the proposed machine learning approach and offer valuable insights\ninto dealing with deceptive reviews in the realm of online businesses.\n","authors":["Anusuya Baby Hari Krishnan"],"pdf_url":"https://arxiv.org/pdf/2307.10617v3.pdf","comment":"6 pages, 3 figures"},{"id":"http://arxiv.org/abs/2307.12518v1","updated":"2023-07-24T04:23:08Z","published":"2023-07-24T04:23:08Z","title":"FaFCNN: A General Disease Classification Framework Based on Feature\n Fusion Neural Networks","summary":" There are two fundamental problems in applying deep learning/machine learning\nmethods to disease classification tasks, one is the insufficient number and\npoor quality of training samples; another one is how to effectively fuse\nmultiple source features and thus train robust classification models. To\naddress these problems, inspired by the process of human learning knowledge, we\npropose the Feature-aware Fusion Correlation Neural Network (FaFCNN), which\nintroduces a feature-aware interaction module and a feature alignment module\nbased on domain adversarial learning. This is a general framework for disease\nclassification, and FaFCNN improves the way existing methods obtain sample\ncorrelation features. The experimental results show that training using\naugmented features obtained by pre-training gradient boosting decision tree\nyields more performance gains than random-forest based methods. On the\nlow-quality dataset with a large amount of missing data in our setup, FaFCNN\nobtains a consistently optimal performance compared to competitive baselines.\nIn addition, extensive experiments demonstrate the robustness of the proposed\nmethod and the effectiveness of each component of the model\\footnote{Accepted\nin IEEE SMC2023}.\n","authors":["Menglin Kong","Shaojie Zhao","Juan Cheng","Xingquan Li","Ri Su","Muzhou Hou","Cong Cao"],"pdf_url":"https://arxiv.org/pdf/2307.12518v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13165v1","updated":"2023-07-24T23:26:46Z","published":"2023-07-24T23:26:46Z","title":"Investigating the Robustness of Sequential Recommender Systems Against\n Training Data Perturbations: an Empirical Study","summary":" Sequential Recommender Systems (SRSs) have been widely used to model user\nbehavior over time, but their robustness in the face of perturbations to\ntraining data is a critical issue. In this paper, we conduct an empirical study\nto investigate the effects of removing items at different positions within a\ntemporally ordered sequence. We evaluate two different SRS models on multiple\ndatasets, measuring their performance using Normalized Discounted Cumulative\nGain (NDCG) and Rank Sensitivity List metrics. Our results demonstrate that\nremoving items at the end of the sequence significantly impacts performance,\nwith NDCG decreasing up to 60\\%, while removing items from the beginning or\nmiddle has no significant effect. 
These findings highlight the importance of\nconsidering the position of the perturbed items in the training data and shall\ninform the design of more robust SRSs.\n","authors":["Filippo Betello","Federico Siciliano","Pushkar Mishra","Fabrizio Silvestri"],"pdf_url":"https://arxiv.org/pdf/2307.13165v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.15498v2","updated":"2023-07-24T20:08:20Z","published":"2021-06-29T15:25:33Z","title":"Classification of Consumer Belief Statements From Social Media","summary":" Social media offer plenty of information to perform market research in order\nto meet the requirements of customers. One way how this research is conducted\nis that a domain expert gathers and categorizes user-generated content into a\ncomplex and fine-grained class structure. In many of such cases, little data\nmeets complex annotations. It is not yet fully understood how this can be\nleveraged successfully for classification. We examine the classification\naccuracy of expert labels when used with a) many fine-grained classes and b)\nfew abstract classes. For scenario b) we compare abstract class labels given by\nthe domain expert as baseline and by automatic hierarchical clustering. We\ncompare this to another baseline where the entire class structure is given by a\ncompletely unsupervised clustering approach. By doing so, this work can serve\nas an example of how complex expert annotations are potentially beneficial and\ncan be utilized in the most optimal way for opinion mining in highly specific\ndomains. By exploring across a range of techniques and experiments, we find\nthat automated class abstraction approaches in particular the unsupervised\napproach performs remarkably well against domain expert baseline on text\nclassification tasks. This has the potential to inspire opinion mining\napplications in order to support market researchers in practice and to inspire\nfine-grained automated content analysis on a large scale.\n","authors":["Gerhard Johann Hagerer","Wenbin Le","Hannah Danner","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2106.15498v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2111.02259v3","updated":"2023-07-24T20:03:14Z","published":"2021-11-03T14:49:50Z","title":"A Case Study and Qualitative Analysis of Simple Cross-Lingual Opinion\n Mining","summary":" User-generated content from social media is produced in many languages,\nmaking it technically challenging to compare the discussed themes from one\ndomain across different cultures and regions. It is relevant for domains in a\nglobalized world, such as market research, where people from two nations and\nmarkets might have different requirements for a product. We propose a simple,\nmodern, and effective method for building a single topic model with sentiment\nanalysis capable of covering multiple languages simultanteously, based on a\npre-trained state-of-the-art deep neural network for natural language\nunderstanding. To demonstrate its feasibility, we apply the model to newspaper\narticles and user comments of a specific domain, i.e., organic food products\nand related consumption behavior. 
The themes match across languages.\nAdditionally, we obtain an high proportion of stable and domain-relevant\ntopics, a meaningful relation between topics and their respective textual\ncontents, and an interpretable representation for social media documents.\nMarketing can potentially benefit from our method, since it provides an\neasy-to-use means of addressing specific customer interests from different\nmarket regions around the globe. For reproducibility, we provide the code,\ndata, and results of our study.\n","authors":["Gerhard Johann Hagerer","Wing Sheung Leung","Qiaoxi Liu","Hannah Danner","Georg Groh"],"pdf_url":"https://arxiv.org/pdf/2111.02259v3.pdf","comment":"10 pages, 2 tables, 5 figures, full paper, peer-reviewed, published\n at KDIR/IC3k 2021 conference"},{"id":"http://arxiv.org/abs/2304.04759v2","updated":"2023-07-24T18:10:09Z","published":"2023-04-07T23:10:39Z","title":"Similarity search in the blink of an eye with compressed indices","summary":" Nowadays, data is represented by vectors. Retrieving those vectors, among\nmillions and billions, that are similar to a given query is a ubiquitous\nproblem, known as similarity search, of relevance for a wide range of\napplications. Graph-based indices are currently the best performing techniques\nfor billion-scale similarity search. However, their random-access memory\npattern presents challenges to realize their full potential. In this work, we\npresent new techniques and systems for creating faster and smaller graph-based\nindices. To this end, we introduce a novel vector compression method,\nLocally-adaptive Vector Quantization (LVQ), that uses per-vector scaling and\nscalar quantization to improve search performance with fast similarity\ncomputations and a reduced effective bandwidth, while decreasing memory\nfootprint and barely impacting accuracy. LVQ, when combined with a new\nhigh-performance computing system for graph-based similarity search,\nestablishes the new state of the art in terms of performance and memory\nfootprint. For billions of vectors, LVQ outcompetes the second-best\nalternatives: (1) in the low-memory regime, by up to 20.7x in throughput with\nup to a 3x memory footprint reduction, and (2) in the high-throughput regime by\n5.8x with 1.4x less memory.\n","authors":["Cecilia Aguerrebere","Ishwar Bhati","Mark Hildebrand","Mariano Tepper","Ted Willke"],"pdf_url":"https://arxiv.org/pdf/2304.04759v2.pdf","comment":"VLDB 2023"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2307.12983v1","updated":"2023-07-24T17:59:37Z","published":"2023-07-24T17:59:37Z","title":"Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under\n Massively Parallel Simulation","summary":" Reinforcement learning is time-consuming for complex tasks due to the need\nfor large amounts of training data. Recent advances in GPU-based simulation,\nsuch as Isaac Gym, have sped up data collection thousands of times on a\ncommodity GPU. Most prior works used on-policy methods like PPO due to their\nsimplicity and ease of scaling. Off-policy methods are more data efficient but\nchallenging to scale, resulting in a longer wall-clock training time. This\npaper presents a Parallel $Q$-Learning (PQL) scheme that outperforms PPO in\nwall-clock time while maintaining superior sample efficiency of off-policy\nlearning. PQL achieves this by parallelizing data collection, policy learning,\nand value learning. 
Different from prior works on distributed off-policy\nlearning, such as Apex, our scheme is designed specifically for massively\nparallel GPU-based simulation and optimized to work on a single workstation. In\nexperiments, we demonstrate that $Q$-learning can be scaled to \\textit{tens of\nthousands of parallel environments} and investigate important factors affecting\nlearning speed. The code is available at https://github.com/Improbable-AI/pql.\n","authors":["Zechu Li","Tao Chen","Zhang-Wei Hong","Anurag Ajay","Pulkit Agrawal"],"pdf_url":"https://arxiv.org/pdf/2307.12983v1.pdf","comment":"Accepted by ICML 2023"},{"id":"http://arxiv.org/abs/2307.12981v1","updated":"2023-07-24T17:59:02Z","published":"2023-07-24T17:59:02Z","title":"3D-LLM: Injecting the 3D World into Large Language Models","summary":" Large language models (LLMs) and Vision-Language Models (VLMs) have been\nproven to excel at multiple tasks, such as commonsense reasoning. Powerful as\nthese models can be, they are not grounded in the 3D physical world, which\ninvolves richer concepts such as spatial relationships, affordances, physics,\nlayout, and so on. In this work, we propose to inject the 3D world into large\nlanguage models and introduce a whole new family of 3D-LLMs. Specifically,\n3D-LLMs can take 3D point clouds and their features as input and perform a\ndiverse set of 3D-related tasks, including captioning, dense captioning, 3D\nquestion answering, task decomposition, 3D grounding, 3D-assisted dialog,\nnavigation, and so on. Using three types of prompting mechanisms that we\ndesign, we are able to collect over 300k 3D-language data covering these tasks.\nTo efficiently train 3D-LLMs, we first utilize a 3D feature extractor that\nobtains 3D features from rendered multi- view images. Then, we use 2D VLMs as\nour backbones to train our 3D-LLMs. By introducing a 3D localization mechanism,\n3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show\nthat our model outperforms state-of-the-art baselines by a large margin (e.g.,\nthe BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore,\nexperiments on our held-in datasets for 3D captioning, task composition, and\n3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative\nexamples also show that our model could perform more tasks beyond the scope of\nexisting LLMs and VLMs. Project Page: : https://vis-www.cs.umass.edu/3dllm/.\n","authors":["Yining Hong","Haoyu Zhen","Peihao Chen","Shuhong Zheng","Yilun Du","Zhenfang Chen","Chuang Gan"],"pdf_url":"https://arxiv.org/pdf/2307.12981v1.pdf","comment":"Project Page: : https://vis-www.cs.umass.edu/3dllm/"},{"id":"http://arxiv.org/abs/2303.06147v2","updated":"2023-07-24T17:58:45Z","published":"2023-03-10T18:59:57Z","title":"Exphormer: Sparse Transformers for Graphs","summary":" Graph transformers have emerged as a promising architecture for a variety of\ngraph learning and representation tasks. Despite their successes, though, it\nremains challenging to scale graph transformers to large graphs while\nmaintaining accuracy competitive with message-passing networks. In this paper,\nwe introduce Exphormer, a framework for building powerful and scalable graph\ntransformers. 
Exphormer consists of a sparse attention mechanism based on two\nmechanisms: virtual global nodes and expander graphs, whose mathematical\ncharacteristics, such as spectral expansion, pseduorandomness, and sparsity,\nyield graph transformers with complexity only linear in the size of the graph,\nwhile allowing us to prove desirable theoretical properties of the resulting\ntransformer models. We show that incorporating Exphormer into the\nrecently-proposed GraphGPS framework produces models with competitive empirical\nresults on a wide variety of graph datasets, including state-of-the-art results\non three datasets. We also show that Exphormer can scale to datasets on larger\ngraphs than shown in previous graph transformer architectures. Code can be\nfound at \\url{https://github.com/hamed1375/Exphormer}.\n","authors":["Hamed Shirzad","Ameya Velingker","Balaji Venkatachalam","Danica J. Sutherland","Ali Kemal Sinop"],"pdf_url":"https://arxiv.org/pdf/2303.06147v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.05407v3","updated":"2023-07-24T17:58:31Z","published":"2022-09-12T16:59:36Z","title":"Segmenting Known Objects and Unseen Unknowns without Prior Knowledge","summary":" Panoptic segmentation methods assign a known class to each pixel given in\ninput. Even for state-of-the-art approaches, this inevitably enforces decisions\nthat systematically lead to wrong predictions for objects outside the training\ncategories. However, robustness against out-of-distribution samples and corner\ncases is crucial in safety-critical settings to avoid dangerous consequences.\nSince real-world datasets cannot contain enough data points to adequately\nsample the long tail of the underlying distribution, models must be able to\ndeal with unseen and unknown scenarios as well. Previous methods targeted this\nby re-identifying already-seen unlabeled objects. In this work, we propose the\nnecessary step to extend segmentation with a new setting which we term holistic\nsegmentation. Holistic segmentation aims to identify and separate objects of\nunseen unknown categories into instances, without any prior knowledge about\nthem, while performing panoptic segmentation of known classes. We tackle this\nnew problem with U3HS, which finds unknowns as highly uncertain regions and\nclusters their corresponding instance-aware embeddings into individual objects.\nBy doing so, for the first time in panoptic segmentation with unknown objects,\nour U3HS is trained without unknown categories, reducing assumptions and\nleaving the settings as unconstrained as in real-life scenarios. Extensive\nexperiments on public data from MS COCO, Cityscapes, and Lost&Found demonstrate\nthe effectiveness of U3HS for this new, challenging, and assumptions-free\nsetting called holistic segmentation.\n","authors":["Stefano Gasperini","Alvaro Marcos-Ramiro","Michael Schmidt","Nassir Navab","Benjamin Busam","Federico Tombari"],"pdf_url":"https://arxiv.org/pdf/2209.05407v3.pdf","comment":"Accepted at ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12979v1","updated":"2023-07-24T17:56:58Z","published":"2023-07-24T17:56:58Z","title":"An Isometric Stochastic Optimizer","summary":" The Adam optimizer is the standard choice in deep learning applications. I\npropose a simple explanation of Adam's success: it makes each parameter's step\nsize independent of the norms of the other parameters. 
Based on this principle\nI derive Iso, a new optimizer which makes the norm of a parameter's update\ninvariant to the application of any linear transformation to its inputs and\noutputs. I develop a variant of Iso called IsoAdam that allows optimal\nhyperparameters to be transferred from Adam, and demonstrate that IsoAdam\nobtains a speedup over Adam when training a small Transformer.\n","authors":["Jacob Jackson"],"pdf_url":"https://arxiv.org/pdf/2307.12979v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12975v1","updated":"2023-07-24T17:50:24Z","published":"2023-07-24T17:50:24Z","title":"Provable Benefits of Policy Learning from Human Preferences in\n Contextual Bandit Problems","summary":" A crucial task in decision-making problems is reward engineering. It is\ncommon in practice that no obvious choice of reward function exists. Thus, a\npopular approach is to introduce human feedback during training and leverage\nsuch feedback to learn a reward function. Among all policy learning methods\nthat use human feedback, preference-based methods have demonstrated substantial\nsuccess in recent empirical applications such as InstructGPT. In this work, we\ndevelop a theory that provably shows the benefits of preference-based methods\nin offline contextual bandits. In particular, we improve the modeling and\nsuboptimality analysis for running policy learning methods on human-scored\nsamples directly. Then, we compare it with the suboptimality guarantees of\npreference-based methods and show that preference-based methods enjoy lower\nsuboptimality.\n","authors":["Xiang Ji","Huazheng Wang","Minshuo Chen","Tuo Zhao","Mengdi Wang"],"pdf_url":"https://arxiv.org/pdf/2307.12975v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12971v1","updated":"2023-07-24T17:49:05Z","published":"2023-07-24T17:49:05Z","title":"Big Data - Supply Chain Management Framework for Forecasting: Data\n Preprocessing and Machine Learning Techniques","summary":" This article intends to systematically identify and comparatively analyze\nstate-of-the-art supply chain (SC) forecasting strategies and technologies. A\nnovel framework has been proposed incorporating Big Data Analytics in SC\nManagement (problem identification, data sources, exploratory data analysis,\nmachine-learning model training, hyperparameter tuning, performance evaluation,\nand optimization), forecasting effects on human-workforce, inventory, and\noverall SC. Initially, the need to collect data according to SC strategy and\nhow to collect them has been discussed. The article discusses the need for\ndifferent types of forecasting according to the period or SC objective. The SC\nKPIs and the error-measurement systems have been recommended to optimize the\ntop-performing model. The adverse effects of phantom inventory on forecasting\nand the dependence of managerial decisions on the SC KPIs for determining model\nperformance parameters and improving operations management, transparency, and\nplanning efficiency have been illustrated. The cyclic connection within the\nframework introduces preprocessing optimization based on the post-process KPIs,\noptimizing the overall control process (inventory management, workforce\ndetermination, cost, production and capacity planning). 
The contribution of\nthis research lies in the standard SC process framework proposal, recommended\nforecasting data analysis, forecasting effects on SC performance, machine\nlearning algorithms optimization followed, and in shedding light on future\nresearch.\n","authors":["Md Abrar Jahin","Md Sakib Hossain Shovon","Jungpil Shin","Istiyaque Ahmed Ridoy","Yoichi Tomioka","M. F. Mridha"],"pdf_url":"https://arxiv.org/pdf/2307.12971v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12968v1","updated":"2023-07-24T17:46:32Z","published":"2023-07-24T17:46:32Z","title":"A Connection between One-Step Regularization and Critic Regularization\n in Reinforcement Learning","summary":" As with any machine learning problem with limited data, effective offline RL\nalgorithms require careful regularization to avoid overfitting. One-step\nmethods perform regularization by doing just a single step of policy\nimprovement, while critic regularization methods do many steps of policy\nimprovement with a regularized objective. These methods appear distinct.\nOne-step methods, such as advantage-weighted regression and conditional\nbehavioral cloning, truncate policy iteration after just one step. This ``early\nstopping'' makes one-step RL simple and stable, but can limit its asymptotic\nperformance. Critic regularization typically requires more compute but has\nappealing lower-bound guarantees. In this paper, we draw a close connection\nbetween these methods: applying a multi-step critic regularization method with\na regularization coefficient of 1 yields the same policy as one-step RL. While\npractical implementations violate our assumptions and critic regularization is\ntypically applied with smaller regularization coefficients, our experiments\nnevertheless show that our analysis makes accurate, testable predictions about\npractical offline RL methods (CQL and one-step RL) with commonly-used\nhyperparameters. Our results that every problem can be solved with a single\nstep of policy improvement, but rather that one-step RL might be competitive\nwith critic regularization on RL problems that demand strong regularization.\n","authors":["Benjamin Eysenbach","Matthieu Geist","Sergey Levine","Ruslan Salakhutdinov"],"pdf_url":"https://arxiv.org/pdf/2307.12968v1.pdf","comment":"Accepted to ICML 2023. Video\n (https://www.youtube.com/watch?v=1xlixIHZ0R4) and code\n (https://github.com/ben-eysenbach/ac-connection)"},{"id":"http://arxiv.org/abs/2307.12967v1","updated":"2023-07-24T17:45:40Z","published":"2023-07-24T17:45:40Z","title":"Learning Dense Correspondences between Photos and Sketches","summary":" Humans effortlessly grasp the connection between sketches and real-world\nobjects, even when these sketches are far from realistic. Moreover, human\nsketch understanding goes beyond categorization -- critically, it also entails\nunderstanding how individual elements within a sketch correspond to parts of\nthe physical world it represents. What are the computational ingredients needed\nto support this ability? Towards answering this question, we make two\ncontributions: first, we introduce a new sketch-photo correspondence benchmark,\n$\\textit{PSC6k}$, containing 150K annotations of 6250 sketch-photo pairs across\n125 object categories, augmenting the existing Sketchy dataset with\nfine-grained correspondence metadata. Second, we propose a self-supervised\nmethod for learning dense correspondences between sketch-photo pairs, building\nupon recent advances in correspondence learning for pairs of photos. 
Our model\nuses a spatial transformer network to estimate the warp flow between latent\nrepresentations of a sketch and photo extracted by a contrastive learning-based\nConvNet backbone. We found that this approach outperformed several strong\nbaselines and produced predictions that were quantitatively consistent with\nother warp-based methods. However, our benchmark also revealed systematic\ndifferences between predictions of the suite of models we tested and those of\nhumans. Taken together, our work suggests a promising path towards developing\nartificial systems that achieve more human-like understanding of visual images\nat different levels of abstraction. Project page:\nhttps://photo-sketch-correspondence.github.io\n","authors":["Xuanchen Lu","Xiaolong Wang","Judith E Fan"],"pdf_url":"https://arxiv.org/pdf/2307.12967v1.pdf","comment":"Accepted to ICML 2023. Project page:\n https://photo-sketch-correspondence.github.io"},{"id":"http://arxiv.org/abs/2303.04245v2","updated":"2023-07-24T17:29:04Z","published":"2023-03-07T21:42:17Z","title":"How Do Transformers Learn Topic Structure: Towards a Mechanistic\n Understanding","summary":" While the successes of transformers across many domains are indisputable,\naccurate understanding of the learning mechanics is still largely lacking.\nTheir capabilities have been probed on benchmarks which include a variety of\nstructured and reasoning tasks -- but mathematical understanding is lagging\nsubstantially behind. Recent lines of work have begun studying representational\naspects of this question: that is, the size/depth/complexity of attention-based\nnetworks to perform certain tasks. However, there is no guarantee the learning\ndynamics will converge to the constructions proposed. In our paper, we provide\nfine-grained mechanistic understanding of how transformers learn \"semantic\nstructure\", understood as capturing co-occurrence structure of words.\nPrecisely, we show, through a combination of mathematical analysis and\nexperiments on Wikipedia data and synthetic data modeled by Latent Dirichlet\nAllocation (LDA), that the embedding layer and the self-attention layer encode\nthe topical structure. In the former case, this manifests as higher average\ninner product of embeddings between same-topic words. In the latter, it\nmanifests as higher average pairwise attention between same-topic words. The\nmathematical results involve several assumptions to make the analysis\ntractable, which we verify on data, and might be of independent interest as\nwell.\n","authors":["Yuchen Li","Yuanzhi Li","Andrej Risteski"],"pdf_url":"https://arxiv.org/pdf/2303.04245v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12943v1","updated":"2023-07-24T17:15:38Z","published":"2023-07-24T17:15:38Z","title":"Efficiently Sampling the PSD Cone with the Metric Dikin Walk","summary":" Semi-definite programs represent a frontier of efficient computation. While\nthere has been much progress on semi-definite optimization, with moderate-sized\ninstances currently solvable in practice by the interior-point method, the\nbasic problem of sampling semi-definite solutions remains a formidable\nchallenge. The direct application of known polynomial-time algorithms for\nsampling general convex bodies to semi-definite sampling leads to a\nprohibitively high running time. In addition, known general methods require an\nexpensive rounding phase as pre-processing. 
Here we analyze the Dikin walk, by\nfirst adapting it to general metrics, then devising suitable metrics for the\nPSD cone with affine constraints. The resulting mixing time and per-step\ncomplexity are considerably smaller, and by an appropriate choice of the\nmetric, the dependence on the number of constraints can be made\npolylogarithmic. We introduce a refined notion of self-concordant matrix\nfunctions and give rules for combining different metrics. Along the way, we\nfurther develop the theory of interior-point methods for sampling.\n","authors":["Yunbum Kook","Santosh S. Vempala"],"pdf_url":"https://arxiv.org/pdf/2307.12943v1.pdf","comment":"54 pages"},{"id":"http://arxiv.org/abs/2307.12941v1","updated":"2023-07-24T17:11:39Z","published":"2023-07-24T17:11:39Z","title":"On Privileged and Convergent Bases in Neural Network Representations","summary":" In this study, we investigate whether the representations learned by neural\nnetworks possess a privileged and convergent basis. Specifically, we examine\nthe significance of feature directions represented by individual neurons.\nFirst, we establish that arbitrary rotations of neural representations cannot\nbe inverted (unlike linear networks), indicating that they do not exhibit\ncomplete rotational invariance. Subsequently, we explore the possibility of\nmultiple bases achieving identical performance. To do this, we compare the\nbases of networks trained with the same parameters but with varying random\ninitializations. Our study reveals two findings: (1) Even in wide networks such\nas WideResNets, neural networks do not converge to a unique basis; (2) Basis\ncorrelation increases significantly when a few early layers of the network are\nfrozen identically.\n Furthermore, we analyze Linear Mode Connectivity, which has been studied as a\nmeasure of basis correlation. Our findings give evidence that while Linear Mode\nConnectivity improves with increased network width, this improvement is not due\nto an increase in basis correlation.\n","authors":["Davis Brown","Nikhil Vyas","Yamini Bansal"],"pdf_url":"https://arxiv.org/pdf/2307.12941v1.pdf","comment":"In the Workshop on High-dimensional Learning Dynamics at ICML 2023"},{"id":"http://arxiv.org/abs/2307.08572v3","updated":"2023-07-24T17:01:50Z","published":"2023-07-17T15:38:11Z","title":"Revisiting the Robustness of the Minimum Error Entropy Criterion: A\n Transfer Learning Case Study","summary":" Coping with distributional shifts is an important part of transfer learning\nmethods in order to perform well in real-life tasks. However, most of the\nexisting approaches in this area either focus on an ideal scenario in which the\ndata does not contain noises or employ a complicated training paradigm or model\ndesign to deal with distributional shifts. In this paper, we revisit the\nrobustness of the minimum error entropy (MEE) criterion, a widely used\nobjective in statistical signal processing to deal with non-Gaussian noises,\nand investigate its feasibility and usefulness in real-life transfer learning\nregression tasks, where distributional shifts are common. Specifically, we put\nforward a new theoretical result showing the robustness of MEE against\ncovariate shift. We also show that by simply replacing the mean squared error\n(MSE) loss with the MEE on basic transfer learning algorithms such as\nfine-tuning and linear probing, we can achieve competitive performance with\nrespect to state-of-the-art transfer learning algorithms. 
We justify our\narguments on both synthetic data and 5 real-world time-series data.\n","authors":["Luis Pedro Silvestrin","Shujian Yu","Mark Hoogendoorn"],"pdf_url":"https://arxiv.org/pdf/2307.08572v3.pdf","comment":"Manuscript accepted at ECAI-23. Code available at\n https://github.com/lpsilvestrin/mee-finetune"},{"id":"http://arxiv.org/abs/2307.12926v1","updated":"2023-07-24T16:36:04Z","published":"2023-07-24T16:36:04Z","title":"Contextual Bandits and Imitation Learning via Preference-Based Active\n Queries","summary":" We consider the problem of contextual bandits and imitation learning, where\nthe learner lacks direct knowledge of the executed action's reward. Instead,\nthe learner can actively query an expert at each round to compare two actions\nand receive noisy preference feedback. The learner's objective is two-fold: to\nminimize the regret associated with the executed actions, while simultaneously,\nminimizing the number of comparison queries made to the expert. In this paper,\nwe assume that the learner has access to a function class that can represent\nthe expert's preference model under appropriate link functions, and provide an\nalgorithm that leverages an online regression oracle with respect to this\nfunction class for choosing its actions and deciding when to query. For the\ncontextual bandit setting, our algorithm achieves a regret bound that combines\nthe best of both worlds, scaling as $O(\\min\\{\\sqrt{T}, d/\\Delta\\})$, where $T$\nrepresents the number of interactions, $d$ represents the eluder dimension of\nthe function class, and $\\Delta$ represents the minimum preference of the\noptimal action over any suboptimal action under all contexts. Our algorithm\ndoes not require the knowledge of $\\Delta$, and the obtained regret bound is\ncomparable to what can be achieved in the standard contextual bandits setting\nwhere the learner observes reward signals at each round. Additionally, our\nalgorithm makes only $O(\\min\\{T, d^2/\\Delta^2\\})$ queries to the expert. We\nthen extend our algorithm to the imitation learning setting, where the learning\nagent engages with an unknown environment in episodes of length $H$ each, and\nprovide similar guarantees for regret and query complexity. Interestingly, our\nalgorithm for imitation learning can even learn to outperform the underlying\nexpert, when it is suboptimal, highlighting a practical benefit of\npreference-based feedback in imitation learning.\n","authors":["Ayush Sekhari","Karthik Sridharan","Wen Sun","Runzhe Wu"],"pdf_url":"https://arxiv.org/pdf/2307.12926v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.12231v2","updated":"2023-07-24T16:00:37Z","published":"2023-04-24T16:18:22Z","title":"An Approximation Theory for Metric Space-Valued Functions With A View\n Towards Deep Learning","summary":" Motivated by the developing mathematics of deep learning, we build universal\nfunctions approximators of continuous maps between arbitrary Polish metric\nspaces $\\mathcal{X}$ and $\\mathcal{Y}$ using elementary functions between\nEuclidean spaces as building blocks. Earlier results assume that the target\nspace $\\mathcal{Y}$ is a topological vector space. We overcome this limitation\nby ``randomization'': our approximators output discrete probability measures\nover $\\mathcal{Y}$. 
When $\\mathcal{X}$ and $\\mathcal{Y}$ are Polish without\nadditional structure, we prove very general qualitative guarantees; when they\nhave suitable combinatorial structure, we prove quantitative guarantees for\nH\\\"{o}lder-like maps, including maps between finite graphs, solution operators\nto rough differential equations between certain Carnot groups, and continuous\nnon-linear operators between Banach spaces arising in inverse problems. In\nparticular, we show that the required number of Dirac measures is determined by\nthe combinatorial structure of $\\mathcal{X}$ and $\\mathcal{Y}$. For barycentric\n$\\mathcal{Y}$, including Banach spaces, $\\mathbb{R}$-trees, Hadamard manifolds,\nor Wasserstein spaces on Polish metric spaces, our approximators reduce to\n$\\mathcal{Y}$-valued functions. When the Euclidean approximators are neural\nnetworks, our constructions generalize transformer networks, providing a new\nprobabilistic viewpoint of geometric deep learning.\n","authors":["Anastasis Kratsios","Chong Liu","Matti Lassas","Maarten V. de Hoop","Ivan Dokmanić"],"pdf_url":"https://arxiv.org/pdf/2304.12231v2.pdf","comment":"14 Figures, 3 Tables, 78 Pages (Main 40, Proofs 26, Acknowledgments\n and References 12)"},{"id":"http://arxiv.org/abs/2307.12906v1","updated":"2023-07-24T15:59:36Z","published":"2023-07-24T15:59:36Z","title":"QAmplifyNet: Pushing the Boundaries of Supply Chain Backorder Prediction\n Using Interpretable Hybrid Quantum - Classical Neural Network","summary":" Supply chain management relies on accurate backorder prediction for\noptimizing inventory control, reducing costs, and enhancing customer\nsatisfaction. However, traditional machine-learning models struggle with\nlarge-scale datasets and complex relationships, hindering real-world data\ncollection. This research introduces a novel methodological framework for\nsupply chain backorder prediction, addressing the challenge of handling large\ndatasets. Our proposed model, QAmplifyNet, employs quantum-inspired techniques\nwithin a quantum-classical neural network to predict backorders effectively on\nshort and imbalanced datasets. Experimental evaluations on a benchmark dataset\ndemonstrate QAmplifyNet's superiority over classical models, quantum ensembles,\nquantum neural networks, and deep reinforcement learning. Its proficiency in\nhandling short, imbalanced datasets makes it an ideal solution for supply chain\nmanagement. To enhance model interpretability, we use Explainable Artificial\nIntelligence techniques. Practical implications include improved inventory\ncontrol, reduced backorders, and enhanced operational efficiency. QAmplifyNet\nseamlessly integrates into real-world supply chain management systems, enabling\nproactive decision-making and efficient resource allocation. Future work\ninvolves exploring additional quantum-inspired techniques, expanding the\ndataset, and investigating other supply chain applications. This research\nunlocks the potential of quantum computing in supply chain optimization and\npaves the way for further exploration of quantum-inspired machine learning\nmodels in supply chain management. Our framework and QAmplifyNet model offer a\nbreakthrough approach to supply chain backorder prediction, providing superior\nperformance and opening new avenues for leveraging quantum-inspired techniques\nin supply chain management.\n","authors":["Md Abrar Jahin","Md Sakib Hossain Shovon","Md. Saiful Islam","Jungpil Shin","M. F. 
Mridha","Yuichi Okuyama"],"pdf_url":"https://arxiv.org/pdf/2307.12906v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12904v1","updated":"2023-07-24T15:52:33Z","published":"2023-07-24T15:52:33Z","title":"Universal Approximation Theorem and error bounds for quantum neural\n networks and quantum reservoirs","summary":" Universal approximation theorems are the foundations of classical neural\nnetworks, providing theoretical guarantees that the latter are able to\napproximate maps of interest. Recent results have shown that this can also be\nachieved in a quantum setting, whereby classical functions can be approximated\nby parameterised quantum circuits. We provide here precise error bounds for\nspecific classes of functions and extend these results to the interesting new\nsetup of randomised quantum circuits, mimicking classical reservoir neural\nnetworks. Our results show in particular that a quantum neural network with\n$\\mathcal{O}(\\varepsilon^{-2})$ weights and $\\mathcal{O} (\\lceil\n\\log_2(\\varepsilon^{-1}) \\rceil)$ qubits suffices to achieve accuracy\n$\\varepsilon>0$ when approximating functions with integrable Fourier transform.\n","authors":["Lukas Gonon","Antoine Jacquier"],"pdf_url":"https://arxiv.org/pdf/2307.12904v1.pdf","comment":"20 pages, 0 figure"},{"id":"http://arxiv.org/abs/2206.02909v2","updated":"2023-07-24T15:47:59Z","published":"2022-06-06T21:14:01Z","title":"Self-supervised Learning for Human Activity Recognition Using 700,000\n Person-days of Wearable Data","summary":" Advances in deep learning for human activity recognition have been relatively\nlimited due to the lack of large labelled datasets. In this study, we leverage\nself-supervised learning techniques on the UK-Biobank activity tracker\ndataset--the largest of its kind to date--containing more than 700,000\nperson-days of unlabelled wearable sensor data. Our resulting activity\nrecognition model consistently outperformed strong baselines across seven\nbenchmark datasets, with an F1 relative improvement of 2.5%-100% (median\n18.4%), the largest improvements occurring in the smaller datasets. In contrast\nto previous studies, our results generalise across external datasets, devices,\nand environments. Our open-source model will help researchers and developers to\nbuild customisable and generalisable activity classifiers with high\nperformance.\n","authors":["Hang Yuan","Shing Chan","Andrew P. Creagh","Catherine Tong","David A. Clifton","Aiden Doherty"],"pdf_url":"https://arxiv.org/pdf/2206.02909v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12897v1","updated":"2023-07-24T15:44:30Z","published":"2023-07-24T15:44:30Z","title":"Anytime Model Selection in Linear Bandits","summary":" Model selection in the context of bandit optimization is a challenging\nproblem, as it requires balancing exploration and exploitation not only for\naction selection, but also for model selection. One natural approach is to rely\non online learning algorithms that treat different models as experts. Existing\nmethods, however, scale poorly ($\\text{poly}M$) with the number of models $M$\nin terms of their regret. Our key insight is that, for model selection in\nlinear bandits, we can emulate full-information feedback to the online learner\nwith a favorable bias-variance trade-off. This allows us to develop ALEXP,\nwhich has an exponentially improved ($\\log M$) dependence on $M$ for its\nregret. 
ALEXP has anytime guarantees on its regret, and neither requires\nknowledge of the horizon $n$, nor relies on an initial purely exploratory\nstage. Our approach utilizes a novel time-uniform analysis of the Lasso,\nestablishing a new connection between online learning and high-dimensional\nstatistics.\n","authors":["Parnian Kassraie","Aldo Pacchiano","Nicolas Emmenegger","Andreas Krause"],"pdf_url":"https://arxiv.org/pdf/2307.12897v1.pdf","comment":"37 pages, 7 figures"},{"id":"http://arxiv.org/abs/2307.12892v1","updated":"2023-07-24T15:42:33Z","published":"2023-07-24T15:42:33Z","title":"A Statistical View of Column Subset Selection","summary":" We consider the problem of selecting a small subset of representative\nvariables from a large dataset. In the computer science literature, this\ndimensionality reduction problem is typically formalized as Column Subset\nSelection (CSS). Meanwhile, the typical statistical formalization is to find an\ninformation-maximizing set of Principal Variables. This paper shows that these\ntwo approaches are equivalent, and moreover, both can be viewed as maximum\nlikelihood estimation within a certain semi-parametric model. Using these\nconnections, we show how to efficiently (1) perform CSS using only summary\nstatistics from the original dataset; (2) perform CSS in the presence of\nmissing and/or censored data; and (3) select the subset size for CSS in a\nhypothesis testing framework.\n","authors":["Anav Sood","Trevor Hastie"],"pdf_url":"https://arxiv.org/pdf/2307.12892v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.08649v3","updated":"2023-07-24T15:33:25Z","published":"2023-04-17T22:53:54Z","title":"Classification of US Supreme Court Cases using BERT-Based Techniques","summary":" Models based on bidirectional encoder representations from transformers\n(BERT) produce state of the art (SOTA) results on many natural language\nprocessing (NLP) tasks such as named entity recognition (NER), part-of-speech\n(POS) tagging etc. An interesting phenomenon occurs when classifying long\ndocuments such as those from the US supreme court where BERT-based models can\nbe considered difficult to use on a first-pass or out-of-the-box basis. In this\npaper, we experiment with several BERT-based classification techniques for US\nsupreme court decisions or supreme court database (SCDB) and compare them with\nthe previous SOTA results. We then compare our results specifically with SOTA\nmodels for long documents. We compare our results for two classification tasks:\n(1) a broad classification task with 15 categories and (2) a fine-grained\nclassification task with 279 categories. Our best result produces an accuracy\nof 80\\% on the 15 broad categories and 60\\% on the fine-grained 279 categories\nwhich marks an improvement of 8\\% and 28\\% respectively from previously\nreported SOTA results.\n","authors":["Shubham Vatsal","Adam Meyers","John E. Ortega"],"pdf_url":"https://arxiv.org/pdf/2304.08649v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2108.13628v2","updated":"2023-07-24T15:31:05Z","published":"2021-08-31T05:38:36Z","title":"Learning Optimal Prescriptive Trees from Observational Data","summary":" We consider the problem of learning an optimal prescriptive tree (i.e., an\ninterpretable treatment assignment policy in the form of a binary tree) of\nmoderate depth, from observational data. 
This problem arises in numerous\nsocially important domains such as public health and personalized medicine,\nwhere interpretable and data-driven interventions are sought based on data\ngathered in deployment -- through passive collection of data -- rather than\nfrom randomized trials. We propose a method for learning optimal prescriptive\ntrees using mixed-integer optimization (MIO) technology. We show that under\nmild conditions our method is asymptotically exact in the sense that it\nconverges to an optimal out-of-sample treatment assignment policy as the number\nof historical data samples tends to infinity. Contrary to existing literature,\nour approach: 1) does not require data to be randomized, 2) does not impose\nstringent assumptions on the learned trees, and 3) has the ability to model\ndomain specific constraints. Through extensive computational experiments, we\ndemonstrate that our asymptotic guarantees translate to significant performance\nimprovements in finite samples, as well as showcase our uniquely flexible\nmodeling power by incorporating budget and fairness constraints.\n","authors":["Nathanael Jo","Sina Aghaei","Andrés Gómez","Phebe Vayanos"],"pdf_url":"https://arxiv.org/pdf/2108.13628v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.11389v3","updated":"2023-07-24T15:28:34Z","published":"2022-08-24T09:26:12Z","title":"Approximate blocked Gibbs sampling for Bayesian neural networks","summary":" In this work, minibatch MCMC sampling for feedforward neural networks is made\nmore feasible. To this end, it is proposed to sample subgroups of parameters\nvia a blocked Gibbs sampling scheme. By partitioning the parameter space,\nsampling is possible irrespective of layer width. It is also possible to\nalleviate vanishing acceptance rates for increasing depth by reducing the\nproposal variance in deeper layers. Increasing the length of a non-convergent\nchain increases the predictive accuracy in classification tasks, so avoiding\nvanishing acceptance rates and consequently enabling longer chain runs have\npractical benefits. Moreover, non-convergent chain realizations aid in the\nquantification of predictive uncertainty. An open problem is how to perform\nminibatch MCMC sampling for feedforward neural networks in the presence of\naugmented data.\n","authors":["Theodore Papamarkou"],"pdf_url":"https://arxiv.org/pdf/2208.11389v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2201.12803v3","updated":"2023-07-24T15:27:16Z","published":"2022-01-30T12:53:51Z","title":"Generalizing similarity in noisy setups: the DIBS phenomenon","summary":" This work uncovers an interplay among data density, noise, and the\ngeneralization ability in similarity learning. We consider Siamese Neural\nNetworks (SNNs), which are the basic form of contrastive learning, and explore\ntwo types of noise that can impact SNNs, Pair Label Noise (PLN) and Single\nLabel Noise (SLN). Our investigation reveals that SNNs exhibit double descent\nbehaviour regardless of the training setup and that it is further exacerbated\nby noise. We demonstrate that the density of data pairs is crucial for\ngeneralization. When SNNs are trained on sparse datasets with the same amount\nof PLN or SLN, they exhibit comparable generalization properties. However, when\nusing dense datasets, PLN cases generalize worse than SLN ones in the\noverparametrized region, leading to a phenomenon we call Density-Induced Break\nof Similarity (DIBS). 
In this regime, PLN similarity violation becomes\nmacroscopical, corrupting the dataset to the point where complete interpolation\ncannot be achieved, regardless of the number of model parameters. Our analysis\nalso delves into the correspondence between online optimization and offline\ngeneralization in similarity learning. The results show that this equivalence\nfails in the presence of label noise in all the scenarios considered.\n","authors":["Nayara Fonseca","Veronica Guidetti"],"pdf_url":"https://arxiv.org/pdf/2201.12803v3.pdf","comment":"v3: version accepted at ECAI 2023 + Supplementary Material"},{"id":"http://arxiv.org/abs/2307.10490v3","updated":"2023-07-24T15:24:17Z","published":"2023-07-19T23:03:20Z","title":"(Ab)using Images and Sounds for Indirect Instruction Injection in\n Multi-Modal LLMs","summary":" We demonstrate how images and sounds can be used for indirect prompt and\ninstruction injection in multi-modal LLMs. An attacker generates an adversarial\nperturbation corresponding to the prompt and blends it into an image or audio\nrecording. When the user asks the (unmodified, benign) model about the\nperturbed image or audio, the perturbation steers the model to output the\nattacker-chosen text and/or make the subsequent dialog follow the attacker's\ninstruction. We illustrate this attack with several proof-of-concept examples\ntargeting LLaVa and PandaGPT.\n","authors":["Eugene Bagdasaryan","Tsung-Yin Hsieh","Ben Nassi","Vitaly Shmatikov"],"pdf_url":"https://arxiv.org/pdf/2307.10490v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.10891v2","updated":"2023-07-24T15:16:46Z","published":"2023-06-19T12:36:54Z","title":"Transformer Training Strategies for Forecasting Multiple Load Time\n Series","summary":" In the smart grid of the future, accurate load forecasts on the level of\nindividual clients can help to balance supply and demand locally and to prevent\ngrid outages. While the number of monitored clients will increase with the\nongoing smart meter rollout, the amount of data per client will always be\nlimited. We evaluate whether a Transformer load forecasting model benefits from\na transfer learning strategy, where a global univariate model is trained on the\nload time series from multiple clients. In experiments with two datasets\ncontaining load time series from several hundred clients, we find that the\nglobal training strategy is superior to the multivariate and local training\nstrategies used in related work. On average, the global training strategy\nresults in 21.8% and 12.8% lower forecasting errors than the two other\nstrategies, measured across forecasting horizons from one day to one month into\nthe future. A comparison to linear models, multi-layer perceptrons and LSTMs\nshows that Transformers are effective for load forecasting when they are\ntrained with the global training strategy.\n","authors":["Matthias Hertel","Maximilian Beichter","Benedikt Heidrich","Oliver Neumann","Benjamin Schäfer","Ralf Mikut","Veit Hagenmeyer"],"pdf_url":"https://arxiv.org/pdf/2306.10891v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12872v1","updated":"2023-07-24T15:10:22Z","published":"2023-07-24T15:10:22Z","title":"Data-free Black-box Attack based on Diffusion Model","summary":" Since the training data for the target model in a data-free black-box attack\nis not available, most recent schemes utilize GANs to generate data for\ntraining substitute model. 
However, these GANs-based schemes suffer from low\ntraining efficiency as the generator needs to be retrained for each target\nmodel during the substitute training process, as well as low generation\nquality. To overcome these limitations, we consider utilizing the diffusion\nmodel to generate data, and propose a data-free black-box attack scheme based\non diffusion model to improve the efficiency and accuracy of substitute\ntraining. Despite the data generated by the diffusion model exhibits high\nquality, it presents diverse domain distributions and contains many samples\nthat do not meet the discriminative criteria of the target model. To further\nfacilitate the diffusion model to generate data suitable for the target model,\nwe propose a Latent Code Augmentation (LCA) method to guide the diffusion model\nin generating data. With the guidance of LCA, the data generated by the\ndiffusion model not only meets the discriminative criteria of the target model\nbut also exhibits high diversity. By utilizing this data, it is possible to\ntrain substitute model that closely resemble the target model more efficiently.\nExtensive experiments demonstrate that our LCA achieves higher attack success\nrates and requires fewer query budgets compared to GANs-based schemes for\ndifferent target models.\n","authors":["Mingwen Shao","Lingzhuang Meng","Yuanjian Qiao","Lixu Zhang","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2307.12872v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12862v1","updated":"2023-07-24T15:02:03Z","published":"2023-07-24T15:02:03Z","title":"Stochastic Step-wise Feature Selection for Exponential Random Graph\n Models (ERGMs)","summary":" Statistical analysis of social networks provides valuable insights into\ncomplex network interactions across various scientific disciplines. However,\naccurate modeling of networks remains challenging due to the heavy\ncomputational burden and the need to account for observed network dependencies.\nExponential Random Graph Models (ERGMs) have emerged as a promising technique\nused in social network modeling to capture network dependencies by\nincorporating endogenous variables. Nevertheless, using ERGMs poses multiple\nchallenges, including the occurrence of ERGM degeneracy, which generates\nunrealistic and meaningless network structures. To address these challenges and\nenhance the modeling of collaboration networks, we propose and test a novel\napproach that focuses on endogenous variable selection within ERGMs. Our method\naims to overcome the computational burden and improve the accommodation of\nobserved network dependencies, thereby facilitating more accurate and\nmeaningful interpretations of network phenomena in various scientific fields.\nWe conduct empirical testing and rigorous analysis to contribute to the\nadvancement of statistical techniques and offer practical insights for network\nanalysis.\n","authors":["Helal El-Zaatari","Fei Yu","Michael R Kosorok"],"pdf_url":"https://arxiv.org/pdf/2307.12862v1.pdf","comment":"23 pages, 6 tables and 18 figures"},{"id":"http://arxiv.org/abs/2307.12856v1","updated":"2023-07-24T14:56:30Z","published":"2023-07-24T14:56:30Z","title":"A Real-World WebAgent with Planning, Long Context Understanding, and\n Program Synthesis","summary":" Pre-trained large language models (LLMs) have recently achieved better\ngeneralization and sample efficiency in autonomous web navigation. 
However, the\nperformance on real-world websites has still suffered from (1) open domainness,\n(2) limited context length, and (3) lack of inductive bias on HTML. We\nintroduce WebAgent, an LLM-driven agent that can complete the tasks on real\nwebsites following natural language instructions. WebAgent plans ahead by\ndecomposing instructions into canonical sub-instructions, summarizes long HTML\ndocuments into task-relevant snippets, and acts on websites via generated\nPython programs from those. We design WebAgent with Flan-U-PaLM, for grounded\ncode generation, and HTML-T5, new pre-trained LLMs for long HTML documents\nusing local and global attention mechanisms and a mixture of long-span\ndenoising objectives, for planning and summarization. We empirically\ndemonstrate that our recipe improves the success on a real website by over 50%,\nand that HTML-T5 is the best model to solve HTML-based tasks; achieving 14.9%\nhigher success rate than prior SoTA on the MiniWoB web navigation benchmark and\nbetter accuracy on offline task planning evaluation.\n","authors":["Izzeddin Gur","Hiroki Furuta","Austin Huang","Mustafa Safdari","Yutaka Matsuo","Douglas Eck","Aleksandra Faust"],"pdf_url":"https://arxiv.org/pdf/2307.12856v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12851v1","updated":"2023-07-24T14:51:54Z","published":"2023-07-24T14:51:54Z","title":"Early Neuron Alignment in Two-layer ReLU Networks with Small\n Initialization","summary":" This paper studies the problem of training a two-layer ReLU network for\nbinary classification using gradient flow with small initialization. We\nconsider a training dataset with well-separated input vectors: Any pair of\ninput data with the same label are positively correlated, and any pair with\ndifferent labels are negatively correlated. Our analysis shows that, during the\nearly phase of training, neurons in the first layer try to align with either\nthe positive data or the negative data, depending on its corresponding weight\non the second layer. A careful analysis of the neurons' directional dynamics\nallows us to provide an $\\mathcal{O}(\\frac{\\log n}{\\sqrt{\\mu}})$ upper bound on\nthe time it takes for all neurons to achieve good alignment with the input\ndata, where $n$ is the number of data points and $\\mu$ measures how well the\ndata are separated. After the early alignment phase, the loss converges to zero\nat a $\\mathcal{O}(\\frac{1}{t})$ rate, and the weight matrix on the first layer\nis approximately low-rank. Numerical experiments on the MNIST dataset\nillustrate our theoretical findings.\n","authors":["Hancheng Min","René Vidal","Enrique Mallada"],"pdf_url":"https://arxiv.org/pdf/2307.12851v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12840v1","updated":"2023-07-24T14:37:22Z","published":"2023-07-24T14:37:22Z","title":"Efficiently Learning One-Hidden-Layer ReLU Networks via Schur\n Polynomials","summary":" We study the problem of PAC learning a linear combination of $k$ ReLU\nactivations under the standard Gaussian distribution on $\\mathbb{R}^d$ with\nrespect to the square loss. Our main result is an efficient algorithm for this\nlearning task with sample and computational complexity $(dk/\\epsilon)^{O(k)}$,\nwhere $\\epsilon>0$ is the target accuracy. Prior work had given an algorithm\nfor this problem with complexity $(dk/\\epsilon)^{h(k)}$, where the function\n$h(k)$ scales super-polynomially in $k$. 
Interestingly, the complexity of our\nalgorithm is near-optimal within the class of Correlational Statistical Query\nalgorithms. At a high-level, our algorithm uses tensor decomposition to\nidentify a subspace such that all the $O(k)$-order moments are small in the\northogonal directions. Its analysis makes essential use of the theory of Schur\npolynomials to show that the higher-moment error tensors are small given that\nthe lower-order ones are.\n","authors":["Ilias Diakonikolas","Daniel M. Kane"],"pdf_url":"https://arxiv.org/pdf/2307.12840v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.08272v3","updated":"2023-07-24T14:28:11Z","published":"2023-03-14T23:26:55Z","title":"Automated patent extraction powers generative modeling in focused\n chemical spaces","summary":" Deep generative models have emerged as an exciting avenue for inverse\nmolecular design, with progress coming from the interplay between training\nalgorithms and molecular representations. One of the key challenges in their\napplicability to materials science and chemistry has been the lack of access to\nsizeable training datasets with property labels. Published patents contain the\nfirst disclosure of new materials prior to their publication in journals, and\nare a vast source of scientific knowledge that has remained relatively untapped\nin the field of data-driven molecular design. Because patents are filed seeking\nto protect specific uses, molecules in patents can be considered to be weakly\nlabeled into application classes. Furthermore, patents published by the US\nPatent and Trademark Office (USPTO) are downloadable and have machine-readable\ntext and molecular structures. In this work, we train domain-specific\ngenerative models using patent data sources by developing an automated pipeline\nto go from USPTO patent digital files to the generation of novel candidates\nwith minimal human intervention. We test the approach on two in-class extracted\ndatasets, one in organic electronics and another in tyrosine kinase inhibitors.\nWe then evaluate the ability of generative models trained on these in-class\ndatasets on two categories of tasks (distribution learning and property\noptimization), identify strengths and limitations, and suggest possible\nexplanations and remedies that could be used to overcome these in practice.\n","authors":["Akshay Subramanian","Kevin P. Greenman","Alexis Gervaix","Tzuhsiung Yang","Rafael Gómez-Bombarelli"],"pdf_url":"https://arxiv.org/pdf/2303.08272v3.pdf","comment":"Digital Discovery (2023)"},{"id":"http://arxiv.org/abs/2307.02620v2","updated":"2023-07-24T14:21:09Z","published":"2023-07-05T19:48:03Z","title":"Learning when to observe: A frugal reinforcement learning framework for\n a high-cost world","summary":" Reinforcement learning (RL) has been shown to learn sophisticated control\npolicies for complex tasks including games, robotics, heating and cooling\nsystems and text generation. The action-perception cycle in RL, however,\ngenerally assumes that a measurement of the state of the environment is\navailable at each time step without a cost. In applications such as materials\ndesign, deep-sea and planetary robot exploration and medicine, however, there\ncan be a high cost associated with measuring, or even approximating, the state\nof the environment. In this paper, we survey the recently growing literature\nthat adopts the perspective that an RL agent might not need, or even want, a\ncostly measurement at each time step. 
Within this context, we propose the Deep\nDynamic Multi-Step Observationless Agent (DMSOA), contrast it with the\nliterature and empirically evaluate it on OpenAI gym and Atari Pong\nenvironments. Our results, show that DMSOA learns a better policy with fewer\ndecision steps and measurements than the considered alternative from the\nliterature. The corresponding code is available at:\n\\url{https://github.com/cbellinger27/Learning-when-to-observe-in-RL\n","authors":["Colin Bellinger","Mark Crowley","Isaac Tamblyn"],"pdf_url":"https://arxiv.org/pdf/2307.02620v2.pdf","comment":"Accepted for presentation at ECML-PKDD 2023 workshop track:\n Simplification, Compression, Efficiency and Frugality for Artificial\n Intelligence (SCEFA)"},{"id":"http://arxiv.org/abs/2307.12822v1","updated":"2023-07-24T14:19:36Z","published":"2023-07-24T14:19:36Z","title":"Learning Provably Robust Estimators for Inverse Problems via Jittering","summary":" Deep neural networks provide excellent performance for inverse problems such\nas denoising. However, neural networks can be sensitive to adversarial or\nworst-case perturbations. This raises the question of whether such networks can\nbe trained efficiently to be worst-case robust. In this paper, we investigate\nwhether jittering, a simple regularization technique that adds isotropic\nGaussian noise during training, is effective for learning worst-case robust\nestimators for inverse problems. While well studied for prediction in\nclassification tasks, the effectiveness of jittering for inverse problems has\nnot been systematically investigated. In this paper, we present a novel\nanalytical characterization of the optimal $\\ell_2$-worst-case robust estimator\nfor linear denoising and show that jittering yields optimal robust denoisers.\nFurthermore, we examine jittering empirically via training deep neural networks\n(U-nets) for natural image denoising, deconvolution, and accelerated magnetic\nresonance imaging (MRI). The results show that jittering significantly enhances\nthe worst-case robustness, but can be suboptimal for inverse problems beyond\ndenoising. Moreover, our results imply that training on real data which often\ncontains slight noise is somewhat robustness enhancing.\n","authors":["Anselm Krainovic","Mahdi Soltanolkotabi","Reinhard Heckel"],"pdf_url":"https://arxiv.org/pdf/2307.12822v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.02813v2","updated":"2023-07-24T14:17:24Z","published":"2023-07-06T07:18:22Z","title":"CPDG: A Contrastive Pre-Training Method for Dynamic Graph Neural\n Networks","summary":" Dynamic graph data mining has gained popularity in recent years due to the\nrich information contained in dynamic graphs and their widespread use in the\nreal world. Despite the advances in dynamic graph neural networks (DGNNs), the\nrich information and diverse downstream tasks have posed significant\ndifficulties for the practical application of DGNNs in industrial scenarios. To\nthis end, in this paper, we propose to address them by pre-training and present\nthe Contrastive Pre-Training Method for Dynamic Graph Neural Networks (CPDG).\nCPDG tackles the challenges of pre-training for DGNNs, including generalization\ncapability and long-short term modeling capability, through a flexible\nstructural-temporal subgraph sampler along with structural-temporal contrastive\npre-training schemes. 
Extensive experiments conducted on both large-scale\nresearch and industrial dynamic graph datasets show that CPDG outperforms\nexisting methods in dynamic graph pre-training for various downstream tasks\nunder three transfer settings.\n","authors":["Yuanchen Bei","Hao Xu","Sheng Zhou","Huixuan Chi","Haishuai Wang","Mengdi Zhang","Zhao Li","Jiajun Bu"],"pdf_url":"https://arxiv.org/pdf/2307.02813v2.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.12797v1","updated":"2023-07-24T13:46:50Z","published":"2023-07-24T13:46:50Z","title":"Causal Fair Machine Learning via Rank-Preserving Interventional\n Distributions","summary":" A decision can be defined as fair if equal individuals are treated equally\nand unequals unequally. Adopting this definition, the task of designing machine\nlearning models that mitigate unfairness in automated decision-making systems\nmust include causal thinking when introducing protected attributes. Following a\nrecent proposal, we define individuals as being normatively equal if they are\nequal in a fictitious, normatively desired (FiND) world, where the protected\nattribute has no (direct or indirect) causal effect on the target. We propose\nrank-preserving interventional distributions to define an estimand of this FiND\nworld and a warping method for estimation. Evaluation criteria for both the\nmethod and resulting model are presented and validated through simulations and\nempirical data. With this, we show that our warping approach effectively\nidentifies the most discriminated individuals and mitigates unfairness.\n","authors":["Ludwig Bothmann","Susanne Dandl","Michael Schomaker"],"pdf_url":"https://arxiv.org/pdf/2307.12797v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.05018v3","updated":"2023-07-24T13:46:46Z","published":"2022-07-11T17:13:10Z","title":"Learning Temporally Extended Skills in Continuous Domains as Symbolic\n Actions for Planning","summary":" Problems which require both long-horizon planning and continuous control\ncapabilities pose significant challenges to existing reinforcement learning\nagents. In this paper we introduce a novel hierarchical reinforcement learning\nagent which links temporally extended skills for continuous control with a\nforward model in a symbolic discrete abstraction of the environment's state for\nplanning. We term our agent SEADS for Symbolic Effect-Aware Diverse Skills. We\nformulate an objective and corresponding algorithm which leads to unsupervised\nlearning of a diverse set of skills through intrinsic motivation given a known\nstate abstraction. The skills are jointly learned with the symbolic forward\nmodel which captures the effect of skill execution in the state abstraction.\nAfter training, we can leverage the skills as symbolic actions using the\nforward model for long-horizon planning and subsequently execute the plan using\nthe learned continuous-action control skills. The proposed algorithm learns\nskills and forward models that can be used to solve complex tasks which require\nboth continuous control and long-horizon planning capabilities with high\nsuccess rate. It compares favorably with other flat and hierarchical\nreinforcement learning baseline agents and is successfully demonstrated with a\nreal robot.\n","authors":["Jan Achterhold","Markus Krimmel","Joerg Stueckler"],"pdf_url":"https://arxiv.org/pdf/2207.05018v3.pdf","comment":"Project website (including video) is available at\n https://seads.is.tue.mpg.de/. 
(v2) Accepted for publication at the 6th\n Conference on Robot Learning (CoRL) 2022, Auckland, New Zealand. (v3) Added\n details on checkpointing (S.8.1), with references on p.7, p.8, p.21 to\n clarify number of env. steps of reported results"},{"id":"http://arxiv.org/abs/2307.12790v1","updated":"2023-07-24T13:39:21Z","published":"2023-07-24T13:39:21Z","title":"Compact & Capable: Harnessing Graph Neural Networks and Edge Convolution\n for Medical Image Classification","summary":" Graph-based neural network models are gaining traction in the field of\nrepresentation learning due to their ability to uncover latent topological\nrelationships between entities that are otherwise challenging to identify.\nThese models have been employed across a diverse range of domains, encompassing\ndrug discovery, protein interactions, semantic segmentation, and fluid dynamics\nresearch. In this study, we investigate the potential of Graph Neural Networks\n(GNNs) for medical image classification. We introduce a novel model that\ncombines GNNs and edge convolution, leveraging the interconnectedness of RGB\nchannel feature values to strongly represent connections between crucial graph\nnodes. Our proposed model not only performs on par with state-of-the-art Deep\nNeural Networks (DNNs) but does so with 1000 times fewer parameters, resulting\nin reduced training time and data requirements. We compare our Graph\nConvolutional Neural Network (GCNN) to pre-trained DNNs for classifying\nMedMNIST dataset classes, revealing promising prospects for GNNs in medical\nimage analysis. Our results also encourage further exploration of advanced\ngraph-based models such as Graph Attention Networks (GAT) and Graph\nAuto-Encoders in the medical imaging domain. The proposed model yields more\nreliable, interpretable, and accurate outcomes for tasks like semantic\nsegmentation and image classification compared to simpler GCNNs\n","authors":["Aryan Singh","Pepijn Van de Ven","Ciarán Eising","Patrick Denny"],"pdf_url":"https://arxiv.org/pdf/2307.12790v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.13170v4","updated":"2023-07-24T13:35:28Z","published":"2022-04-27T20:04:24Z","title":"AdaBest: Minimizing Client Drift in Federated Learning via Adaptive Bias\n Estimation","summary":" In Federated Learning (FL), a number of clients or devices collaborate to\ntrain a model without sharing their data. Models are optimized locally at each\nclient and further communicated to a central hub for aggregation. While FL is\nan appealing decentralized training paradigm, heterogeneity among data from\ndifferent clients can cause the local optimization to drift away from the\nglobal objective. In order to estimate and therefore remove this drift,\nvariance reduction techniques have been incorporated into FL optimization\nrecently. However, these approaches inaccurately estimate the clients' drift\nand ultimately fail to remove it properly. In this work, we propose an adaptive\nalgorithm that accurately estimates drift across clients. In comparison to\nprevious works, our approach necessitates less storage and communication\nbandwidth, as well as lower compute costs. Additionally, our proposed\nmethodology induces stability by constraining the norm of estimates for client\ndrift, making it more practical for large scale FL. 
Experimental findings\ndemonstrate that the proposed algorithm converges significantly faster and\nachieves higher accuracy than the baselines across various FL benchmarks.\n","authors":["Farshid Varno","Marzie Saghayi","Laya Rafiee Sevyeri","Sharut Gupta","Stan Matwin","Mohammad Havaei"],"pdf_url":"https://arxiv.org/pdf/2204.13170v4.pdf","comment":"Published as a conference paper at ECCV 2022; Corrected some typos in\n the text and a baseline algorithm"},{"id":"http://arxiv.org/abs/2307.12788v1","updated":"2023-07-24T13:35:18Z","published":"2023-07-24T13:35:18Z","title":"Analyzing the Strategy of Propaganda using Inverse Reinforcement\n Learning: Evidence from the 2022 Russian Invasion of Ukraine","summary":" The 2022 Russian invasion of Ukraine was accompanied by a large-scale,\npro-Russian propaganda campaign on social media. However, the strategy behind\nthe dissemination of propaganda has remained unclear, particularly how the\nonline discourse was strategically shaped by the propagandists' community.\nHere, we analyze the strategy of the Twitter community using an inverse\nreinforcement learning (IRL) approach. Specifically, IRL allows us to model\nonline behavior as a Markov decision process, where the goal is to infer the\nunderlying reward structure that guides propagandists when interacting with\nusers with a supporting or opposing stance toward the invasion. Thereby, we aim\nto understand empirically whether and how between-user interactions are\nstrategically used to promote the proliferation of Russian propaganda. For\nthis, we leverage a large-scale dataset with 349,455 posts with pro-Russian\npropaganda from 132,131 users. We show that bots and humans follow a different\nstrategy: bots respond predominantly to pro-invasion messages, suggesting that\nthey seek to drive virality; while messages indicating opposition primarily\nelicit responses from humans, suggesting that they tend to engage in critical\ndiscussions. To the best of our knowledge, this is the first study analyzing\nthe strategy behind propaganda from the 2022 Russian invasion of Ukraine\nthrough the lens of IRL.\n","authors":["Dominique Geissler","Stefan Feuerriegel"],"pdf_url":"https://arxiv.org/pdf/2307.12788v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.12540v2","updated":"2023-07-24T13:35:16Z","published":"2023-03-22T13:16:37Z","title":"Deployment of Image Analysis Algorithms under Prevalence Shifts","summary":" Domain gaps are among the most relevant roadblocks in the clinical\ntranslation of machine learning (ML)-based solutions for medical image\nanalysis. While current research focuses on new training paradigms and network\narchitectures, little attention is given to the specific effect of prevalence\nshifts on an algorithm deployed in practice. Such discrepancies between class\nfrequencies in the data used for a method's development/validation and that in\nits deployment environment(s) are of great importance, for example in the\ncontext of artificial intelligence (AI) democratization, as disease prevalences\nmay vary widely across time and location. Our contribution is twofold. First,\nwe empirically demonstrate the potentially severe consequences of missing\nprevalence handling by analyzing (i) the extent of miscalibration, (ii) the\ndeviation of the decision threshold from the optimum, and (iii) the ability of\nvalidation metrics to reflect neural network performance on the deployment\npopulation as a function of the discrepancy between development and deployment\nprevalence. 
Second, we propose a workflow for prevalence-aware image\nclassification that uses estimated deployment prevalences to adjust a trained\nclassifier to a new environment, without requiring additional annotated\ndeployment data. Comprehensive experiments based on a diverse set of 30 medical\nclassification tasks showcase the benefit of the proposed workflow in\ngenerating better classifier decisions and more reliable performance estimates\ncompared to current practice.\n","authors":["Patrick Godau","Piotr Kalinowski","Evangelia Christodoulou","Annika Reinke","Minu Tizabi","Luciana Ferrer","Paul Jäger","Lena Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2303.12540v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12775v1","updated":"2023-07-24T13:24:56Z","published":"2023-07-24T13:24:56Z","title":"Is attention all you need in medical image analysis? A review","summary":" Medical imaging is a key component in clinical diagnosis, treatment planning\nand clinical trial design, accounting for almost 90% of all healthcare data.\nCNNs achieved performance gains in medical image analysis (MIA) over the last\nyears. CNNs can efficiently model local pixel interactions and be trained on\nsmall-scale MI data. The main disadvantage of typical CNN models is that they\nignore global pixel relationships within images, which limits their\ngeneralisation ability to understand out-of-distribution data with different\n'global' information. The recent progress of Artificial Intelligence gave rise\nto Transformers, which can learn global relationships from data. However, full\nTransformer models need to be trained on large-scale data and involve\ntremendous computational complexity. Attention and Transformer compartments\n(Transf/Attention) which can well maintain properties for modelling global\nrelationships, have been proposed as lighter alternatives of full Transformers.\nRecently, there is an increasing trend to co-pollinate complementary\nlocal-global properties from CNN and Transf/Attention architectures, which led\nto a new era of hybrid models. The past years have witnessed substantial growth\nin hybrid CNN-Transf/Attention models across diverse MIA problems. In this\nsystematic review, we survey existing hybrid CNN-Transf/Attention models,\nreview and unravel key architectural designs, analyse breakthroughs, and\nevaluate current and future opportunities as well as challenges. We also\nintroduced a comprehensive analysis framework on generalisation opportunities\nof scientific and clinical impact, based on which new data-driven domain\ngeneralisation and adaptation methods can be stimulated.\n","authors":["Giorgos Papanastasiou","Nikolaos Dikaios","Jiahao Huang","Chengjia Wang","Guang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12775v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12771v1","updated":"2023-07-24T13:19:15Z","published":"2023-07-24T13:19:15Z","title":"Detecting disturbances in network-coupled dynamical systems with machine\n learning","summary":" Identifying disturbances in network-coupled dynamical systems without\nknowledge of the disturbances or underlying dynamics is a problem with a wide\nrange of applications. For example, one might want to know which nodes in the\nnetwork are being disturbed and identify the type of disturbance. Here we\npresent a model-free method based on machine learning to identify such unknown\ndisturbances based only on prior observations of the system when forced by a\nknown training function. 
We find that this method is able to identify the\nlocations and properties of many different types of unknown disturbances using\na variety of known forcing functions. We illustrate our results both with\nlinear and nonlinear disturbances using food web and neuronal activity models.\nFinally, we discuss how to scale our method to large networks.\n","authors":["Per Sebastian Skardal","Juan G. Restrepo"],"pdf_url":"https://arxiv.org/pdf/2307.12771v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.05732v6","updated":"2023-07-24T13:15:14Z","published":"2022-09-13T04:58:35Z","title":"Rényi Divergence Deep Mutual Learning","summary":" This paper revisits Deep Mutual Learning (DML), a simple yet effective\ncomputing paradigm. We propose using R\\'{e}nyi divergence instead of the KL\ndivergence, which is more flexible and tunable, to improve vanilla DML. This\nmodification is able to consistently improve performance over vanilla DML with\nlimited additional complexity. The convergence properties of the proposed\nparadigm are analyzed theoretically, and Stochastic Gradient Descent with a\nconstant learning rate is shown to converge with $\\mathcal{O}(1)$-bias in the\nworst case scenario for nonconvex optimization tasks. That is, learning will\nreach nearby local optima but continue searching within a bounded scope, which\nmay help mitigate overfitting. Finally, our extensive empirical results\ndemonstrate the advantage of combining DML and R\\'{e}nyi divergence, leading to\nfurther improvement in model generalization.\n","authors":["Weipeng Huang","Junjie Tao","Changbo Deng","Ming Fan","Wenqiang Wan","Qi Xiong","Guangyuan Piao"],"pdf_url":"https://arxiv.org/pdf/2209.05732v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.11531v2","updated":"2023-07-24T13:04:48Z","published":"2022-09-23T11:36:32Z","title":"Deep Learning-based Anonymization of Chest Radiographs: A\n Utility-preserving Measure for Patient Privacy","summary":" Robust and reliable anonymization of chest radiographs constitutes an\nessential step before publishing large datasets of such for research purposes.\nThe conventional anonymization process is carried out by obscuring personal\ninformation in the images with black boxes and removing or replacing\nmeta-information. However, such simple measures retain biometric information in\nthe chest radiographs, allowing patients to be re-identified by a linkage\nattack. Therefore, there is an urgent need to obfuscate the biometric\ninformation appearing in the images. We propose the first deep learning-based\napproach (PriCheXy-Net) to targetedly anonymize chest radiographs while\nmaintaining data utility for diagnostic and machine learning purposes. Our\nmodel architecture is a composition of three independent neural networks that,\nwhen collectively used, allow for learning a deformation field that is able to\nimpede patient re-identification. Quantitative results on the ChestX-ray14\ndataset show a reduction of patient re-identification from 81.8% to 57.7% (AUC)\nafter re-training with little impact on the abnormality classification\nperformance. This indicates the ability to preserve underlying abnormality\npatterns while increasing patient privacy. 
Lastly, we compare our proposed\nanonymization approach with two other obfuscation-based methods (Privacy-Net,\nDP-Pix) and demonstrate the superiority of our method towards resolving the\nprivacy-utility trade-off for chest radiographs.\n","authors":["Kai Packhäuser","Sebastian Gündel","Florian Thamm","Felix Denzinger","Andreas Maier"],"pdf_url":"https://arxiv.org/pdf/2209.11531v2.pdf","comment":"Accepted at MICCAI 2023"},{"id":"http://arxiv.org/abs/2307.07620v2","updated":"2023-07-24T13:03:17Z","published":"2023-07-14T20:39:07Z","title":"Generalizable Embeddings with Cross-batch Metric Learning","summary":" Global average pooling (GAP) is a popular component in deep metric learning\n(DML) for aggregating features. Its effectiveness is often attributed to\ntreating each feature vector as a distinct semantic entity and GAP as a\ncombination of them. Albeit substantiated, such an explanation's algorithmic\nimplications to learn generalizable entities to represent unseen classes, a\ncrucial DML goal, remain unclear. To address this, we formulate GAP as a convex\ncombination of learnable prototypes. We then show that the prototype learning\ncan be expressed as a recursive process fitting a linear predictor to a batch\nof samples. Building on that perspective, we consider two batches of disjoint\nclasses at each iteration and regularize the learning by expressing the samples\nof a batch with the prototypes that are fitted to the other batch. We validate\nour approach on 4 popular DML benchmarks.\n","authors":["Yeti Z. Gurbuz","A. Aydin Alatan"],"pdf_url":"https://arxiv.org/pdf/2307.07620v2.pdf","comment":"\\c{opyright} 2023 IEEE. Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2212.07368v3","updated":"2023-07-24T12:53:23Z","published":"2022-12-14T17:46:17Z","title":"Shuffled Multi-Channel Sparse Signal Recovery","summary":" Mismatches between samples and their respective channel or target commonly\narise in several real-world applications. For instance, whole-brain calcium\nimaging of freely moving organisms, multiple-target tracking or multi-person\ncontactless vital sign monitoring may be severely affected by mismatched\nsample-channel assignments. To systematically address this fundamental problem,\nwe pose it as a signal reconstruction problem where we have lost\ncorrespondences between the samples and their respective channels. Assuming\nthat we have a sensing matrix for the underlying signals, we show that the\nproblem is equivalent to a structured unlabeled sensing problem, and establish\nsufficient conditions for unique recovery. To the best of our knowledge, a\nsampling result for the reconstruction of shuffled multi-channel signals has\nnot been considered in the literature and existing methods for unlabeled\nsensing cannot be directly applied. We extend our results to the case where the\nsignals admit a sparse representation in an overcomplete dictionary (i.e., the\nsensing matrix is not precisely known), and derive sufficient conditions for\nthe reconstruction of shuffled sparse signals. We propose a robust\nreconstruction method that combines sparse signal recovery with robust linear\nregression for the two-channel case. 
The performance and robustness of the\nproposed approach is illustrated in an application related to whole-brain\ncalcium imaging. The proposed methodology can be generalized to sparse signal\nrepresentations other than the ones considered in this work to be applied in a\nvariety of real-world problems with imprecise measurement or channel\nassignment.\n","authors":["Taulant Koka","Manolis C. Tsakiris","Michael Muma","Benjamín Béjar Haro"],"pdf_url":"https://arxiv.org/pdf/2212.07368v3.pdf","comment":"Submitted to TSP"},{"id":"http://arxiv.org/abs/2307.12754v1","updated":"2023-07-24T12:52:55Z","published":"2023-07-24T12:52:55Z","title":"Nonparametric Linear Feature Learning in Regression Through\n Regularisation","summary":" Representation learning plays a crucial role in automated feature selection,\nparticularly in the context of high-dimensional data, where non-parametric\nmethods often struggle. In this study, we focus on supervised learning\nscenarios where the pertinent information resides within a lower-dimensional\nlinear subspace of the data, namely the multi-index model. If this subspace\nwere known, it would greatly enhance prediction, computation, and\ninterpretation. To address this challenge, we propose a novel method for linear\nfeature learning with non-parametric prediction, which simultaneously estimates\nthe prediction function and the linear subspace. Our approach employs empirical\nrisk minimisation, augmented with a penalty on function derivatives, ensuring\nversatility. Leveraging the orthogonality and rotation invariance properties of\nHermite polynomials, we introduce our estimator, named RegFeaL. By utilising\nalternative minimisation, we iteratively rotate the data to improve alignment\nwith leading directions and accurately estimate the relevant dimension in\npractical settings. We establish that our method yields a consistent estimator\nof the prediction function with explicit rates. Additionally, we provide\nempirical results demonstrating the performance of RegFeaL in various\nexperiments.\n","authors":["Bertille Follain","Umut Simsekli","Francis Bach"],"pdf_url":"https://arxiv.org/pdf/2307.12754v1.pdf","comment":"43 pages, 16 figures"},{"id":"http://arxiv.org/abs/2307.12745v1","updated":"2023-07-24T12:36:05Z","published":"2023-07-24T12:36:05Z","title":"Concept-based explainability for an EEG transformer model","summary":" Deep learning models are complex due to their size, structure, and inherent\nrandomness in training procedures. Additional complexity arises from the\nselection of datasets and inductive biases. Addressing these challenges for\nexplainability, Kim et al. (2018) introduced Concept Activation Vectors (CAVs),\nwhich aim to understand deep models' internal states in terms of human-aligned\nconcepts. These concepts correspond to directions in latent space, identified\nusing linear discriminants. Although this method was first applied to image\nclassification, it was later adapted to other domains, including natural\nlanguage processing. In this work, we attempt to apply the method to\nelectroencephalogram (EEG) data for explainability in Kostas et al.'s BENDR\n(2021), a large-scale transformer model. A crucial part of this endeavor\ninvolves defining the explanatory concepts and selecting relevant datasets to\nground concepts in the latent space. Our focus is on two mechanisms for EEG\nconcept formation: the use of externally labeled EEG datasets, and the\napplication of anatomically defined concepts. 
The former approach is a\nstraightforward generalization of methods used in image classification, while\nthe latter is novel and specific to EEG. We present evidence that both\napproaches to concept formation yield valuable insights into the\nrepresentations learned by deep EEG models.\n","authors":["Anders Gjølbye Madsen","William Theodor Lehn-Schiøler","Áshildur Jónsdóttir","Bergdís Arnardóttir","Lars Kai Hansen"],"pdf_url":"https://arxiv.org/pdf/2307.12745v1.pdf","comment":"To appear in proceedings of 2023 IEEE International workshop on\n Machine Learning for Signal Processing"},{"id":"http://arxiv.org/abs/2207.09657v3","updated":"2023-07-24T12:35:18Z","published":"2022-07-20T05:22:26Z","title":"Reducing Training Time in Cross-Silo Federated Learning using Multigraph\n Topology","summary":" Federated learning is an active research topic since it enables several\nparticipants to jointly train a model without sharing local data. Currently,\ncross-silo federated learning is a popular training setting that utilizes a few\nhundred reliable data silos with high-speed access links to training a model.\nWhile this approach has been widely applied in real-world scenarios, designing\na robust topology to reduce the training time remains an open problem. In this\npaper, we present a new multigraph topology for cross-silo federated learning.\nWe first construct the multigraph using the overlay graph. We then parse this\nmultigraph into different simple graphs with isolated nodes. The existence of\nisolated nodes allows us to perform model aggregation without waiting for other\nnodes, hence effectively reducing the training time. Intensive experiments on\nthree public datasets show that our proposed method significantly reduces the\ntraining time compared with recent state-of-the-art topologies while\nmaintaining the accuracy of the learned model. Our code can be found at\nhttps://github.com/aioz-ai/MultigraphFL\n","authors":["Tuong Do","Binh X. Nguyen","Vuong Pham","Toan Tran","Erman Tjiputra","Quang Tran","Anh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2207.09657v3.pdf","comment":"accepted in ICCV 2023"},{"id":"http://arxiv.org/abs/2302.09629v2","updated":"2023-07-24T12:33:09Z","published":"2023-02-19T17:15:56Z","title":"BiofilmScanner: A Computational Intelligence Approach to Obtain\n Bacterial Cell Morphological Attributes from Biofilm Image","summary":" Desulfovibrio alaskensis G20 (DA-G20) is utilized as a model for\nsulfate-reducing bacteria (SRB) that are associated with corrosion issues\ncaused by microorganisms. SRB-based biofilms are thought to be responsible for\nthe billion-dollar-per-year bio-corrosion of metal infrastructure.\nUnderstanding the extraction of the bacterial cells' shape and size properties\nin the SRB-biofilm at different growth stages will assist with the design of\nanti-corrosion techniques. However, numerous issues affect current approaches,\nincluding time-consuming geometric property extraction, low efficiency, and\nhigh error rates. This paper proposes BiofilScanner, a Yolact-based deep\nlearning method integrated with invariant moments to address these problems.\nOur approach efficiently detects and segments bacterial cells in an SRB image\nwhile simultaneously invariant moments measure the geometric characteristics of\nthe segmented cells with low errors. 
The numerical experiments of the proposed\nmethod demonstrate that the BiofilmScanner is 2.1x and 6.8x faster than our\nearlier Mask-RCNN and DLv3+ methods for detecting, segmenting, and measuring\nthe geometric properties of the cell. Furthermore, the BiofilmScanner achieved\nan F1-score of 85.28% while Mask-RCNN and DLv3+ obtained F1-scores of 77.67%\nand 75.18%, respectively.\n","authors":["Md Hafizur Rahman","Md Ali Azam","Md Abir Hossen","Shankarachary Ragi","Venkataramana Gadhamshetty"],"pdf_url":"https://arxiv.org/pdf/2302.09629v2.pdf","comment":"Submitted to Pattern Recognition"},{"id":"http://arxiv.org/abs/2306.16177v3","updated":"2023-07-24T12:32:58Z","published":"2023-06-28T12:58:42Z","title":"Defining data science: a new field of inquiry","summary":" Data science is not a science. It is a research paradigm. Its power, scope,\nand scale will surpass science, our most powerful research paradigm, to enable\nknowledge discovery and change our world. We have yet to understand and define\nit, vital to realizing its potential and managing its risks. Modern data\nscience is in its infancy. Emerging slowly since 1962 and rapidly since 2000,\nit is a fundamentally new field of inquiry, one of the most active, powerful,\nand rapidly evolving 21st century innovations. Due to its value, power, and\napplicability, it is emerging in over 40 disciplines, hundreds of research\nareas, and thousands of applications. Millions of data science publications\ncontain myriad definitions of data science and data science problem solving.\nDue to its infancy, many definitions are independent, application specific,\nmutually incomplete, redundant, or inconsistent, hence so is data science. This\nresearch addresses this data science multiple definitions challenge by\nproposing the development of coherent, unified definition based on a data\nscience reference framework using a data science journal for the data science\ncommunity to achieve such a definition. This paper provides candidate\ndefinitions for essential data science artifacts that are required to discuss\nsuch a definition. They are based on the classical research paradigm concept\nconsisting of a philosophy of data science, the data science problem solving\nparadigm, and the six component data science reference framework (axiology,\nontology, epistemology, methodology, methods, technology) that is a frequently\ncalled for unifying framework with which to define, unify, and evolve data\nscience. It presents challenges for defining data science, solution approaches,\ni.e., means for defining data science, and their requirements and benefits as\nthe basis of a comprehensive solution.\n","authors":["Michael L Brodie"],"pdf_url":"https://arxiv.org/pdf/2306.16177v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.12865v3","updated":"2023-07-24T12:08:50Z","published":"2023-03-22T18:59:48Z","title":"NeRF-GAN Distillation for Efficient 3D-Aware Generation with\n Convolutions","summary":" Pose-conditioned convolutional generative models struggle with high-quality\n3D-consistent image generation from single-view datasets, due to their lack of\nsufficient 3D priors. Recently, the integration of Neural Radiance Fields\n(NeRFs) and generative models, such as Generative Adversarial Networks (GANs),\nhas transformed 3D-aware generation from single-view images. NeRF-GANs exploit\nthe strong inductive bias of neural 3D representations and volumetric rendering\nat the cost of higher computational complexity. 
This study aims at revisiting\npose-conditioned 2D GANs for efficient 3D-aware generation at inference time by\ndistilling 3D knowledge from pretrained NeRF-GANs. We propose a simple and\neffective method, based on re-using the well-disentangled latent space of a\npre-trained NeRF-GAN in a pose-conditioned convolutional network to directly\ngenerate 3D-consistent images corresponding to the underlying 3D\nrepresentations. Experiments on several datasets demonstrate that the proposed\nmethod obtains results comparable with volumetric rendering in terms of quality\nand 3D consistency while benefiting from the computational advantage of\nconvolutional networks. The code will be available at:\nhttps://github.com/mshahbazi72/NeRF-GAN-Distillation\n","authors":["Mohamad Shahbazi","Evangelos Ntavelis","Alessio Tonioni","Edo Collins","Danda Pani Paudel","Martin Danelljan","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2303.12865v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12716v1","updated":"2023-07-24T11:55:32Z","published":"2023-07-24T11:55:32Z","title":"Safety Performance of Neural Networks in the Presence of Covariate Shift","summary":" Covariate shift may impact the operational safety performance of neural\nnetworks. A re-evaluation of the safety performance, however, requires\ncollecting new operational data and creating corresponding ground truth labels,\nwhich often is not possible during operation. We are therefore proposing to\nreshape the initial test set, as used for the safety performance evaluation\nprior to deployment, based on an approximation of the operational data. This\napproximation is obtained by observing and learning the distribution of\nactivation patterns of neurons in the network during operation. The reshaped\ntest set reflects the distribution of neuron activation values as observed\nduring operation, and may therefore be used for re-evaluating safety\nperformance in the presence of covariate shift. First, we derive conservative\nbounds on the values of neurons by applying finite binning and static dataflow\nanalysis. Second, we formulate a mixed integer linear programming (MILP)\nconstraint for constructing the minimum set of data points to be removed in the\ntest set, such that the difference between the discretized test and operational\ndistributions is bounded. We discuss potential benefits and limitations of this\nconstraint-based approach based on our initial experience with an implemented\nresearch prototype.\n","authors":["Chih-Hong Cheng","Harald Ruess","Konstantinos Theodorou"],"pdf_url":"https://arxiv.org/pdf/2307.12716v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.13871v2","updated":"2023-07-24T11:44:01Z","published":"2023-04-26T23:34:40Z","title":"Typical and atypical solutions in non-convex neural networks with\n discrete and continuous weights","summary":" We study the binary and continuous negative-margin perceptrons as simple\nnon-convex neural network models learning random rules and associations. We\nanalyze the geometry of the landscape of solutions in both models and find\nimportant similarities and differences. Both models exhibit subdominant\nminimizers which are extremely flat and wide. These minimizers coexist with a\nbackground of dominant solutions which are composed by an exponential number of\nalgorithmically inaccessible small clusters for the binary case (the frozen\n1-RSB phase) or a hierarchical structure of clusters of different sizes for the\nspherical case (the full RSB phase). 
In both cases, when a certain threshold in\nconstraint density is crossed, the local entropy of the wide flat minima\nbecomes non-monotonic, indicating a break-up of the space of robust solutions\ninto disconnected components. This has a strong impact on the behavior of\nalgorithms in binary models, which cannot access the remaining isolated\nclusters. For the spherical case the behaviour is different, since even beyond\nthe disappearance of the wide flat minima the remaining solutions are shown to\nalways be surrounded by a large number of other solutions at any distance, up\nto capacity. Indeed, we exhibit numerical evidence that algorithms seem to find\nsolutions up to the SAT/UNSAT transition, that we compute here using an 1RSB\napproximation. For both models, the generalization performance as a learning\ndevice is shown to be greatly improved by the existence of wide flat minimizers\neven when trained in the highly underconstrained regime of very negative\nmargins.\n","authors":["Carlo Baldassi","Enrico M. Malatesta","Gabriele Perugini","Riccardo Zecchina"],"pdf_url":"https://arxiv.org/pdf/2304.13871v2.pdf","comment":"34 pages, 13 figures"},{"id":"http://arxiv.org/abs/2210.17230v3","updated":"2023-07-24T11:43:26Z","published":"2022-10-31T11:15:48Z","title":"Lipschitz-regularized gradient flows and generative particle algorithms\n for high-dimensional scarce data","summary":" We build a new class of generative algorithms capable of efficiently learning\nan arbitrary target distribution from possibly scarce, high-dimensional data\nand subsequently generate new samples. These generative algorithms are\nparticle-based and are constructed as gradient flows of Lipschitz-regularized\nKullback-Leibler or other $f$-divergences, where data from a source\ndistribution can be stably transported as particles, towards the vicinity of\nthe target distribution. As a highlighted result in data integration, we\ndemonstrate that the proposed algorithms correctly transport gene expression\ndata points with dimension exceeding 54K, while the sample size is typically\nonly in the hundreds.\n","authors":["Hyemin Gu","Panagiota Birmpa","Yannis Pantazis","Luc Rey-Bellet","Markos A. Katsoulakis"],"pdf_url":"https://arxiv.org/pdf/2210.17230v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12703v1","updated":"2023-07-24T11:37:02Z","published":"2023-07-24T11:37:02Z","title":"Policy Gradient Optimal Correlation Search for Variance Reduction in\n Monte Carlo simulation and Maximum Optimal Transport","summary":" We propose a new algorithm for variance reduction when estimating $f(X_T)$\nwhere $X$ is the solution to some stochastic differential equation and $f$ is a\ntest function. The new estimator is $(f(X^1_T) + f(X^2_T))/2$, where $X^1$ and\n$X^2$ have same marginal law as $X$ but are pathwise correlated so that to\nreduce the variance. The optimal correlation function $\\rho$ is approximated by\na deep neural network and is calibrated along the trajectories of $(X^1, X^2)$\nby policy gradient and reinforcement learning techniques. 
Finding an optimal\ncoupling given marginal laws has links with maximum optimal transport.\n","authors":["Pierre Bras","Gilles Pagès"],"pdf_url":"https://arxiv.org/pdf/2307.12703v1.pdf","comment":"7 pages"},{"id":"http://arxiv.org/abs/2303.09340v3","updated":"2023-07-24T11:34:21Z","published":"2023-03-16T14:21:45Z","title":"Improving Automated Hemorrhage Detection in Sparse-view Computed\n Tomography via Deep Convolutional Neural Network based Artifact Reduction","summary":" Purpose: Sparse-view computed tomography (CT) is an effective way to reduce\ndose by lowering the total number of views acquired, albeit at the expense of\nimage quality, which, in turn, can impact the ability to detect diseases. We\nexplore deep learning-based artifact reduction in sparse-view cranial CT scans\nand its impact on automated hemorrhage detection. Methods: We trained a U-Net\nfor artefact reduction on simulated sparse-view cranial CT scans from 3000\npatients obtained from a public dataset and reconstructed with varying levels\nof sub-sampling. Additionally, we trained a convolutional neural network on\nfully sampled CT data from 17,545 patients for automated hemorrhage detection.\nWe evaluated the classification performance using the area under the receiver\noperator characteristic curves (AUC-ROCs) with corresponding 95% confidence\nintervals (CIs) and the DeLong test, along with confusion matrices. The\nperformance of the U-Net was compared to an analytical approach based on total\nvariation (TV). Results: The U-Net performed superior compared to unprocessed\nand TV-processed images with respect to image quality and automated hemorrhage\ndiagnosis. With U-Net post-processing, the number of views can be reduced from\n4096 (AUC-ROC: 0.974; 95% CI: 0.972-0.976) views to 512 views (0.973;\n0.971-0.975) with minimal decrease in hemorrhage detection (P<.001) and to 256\nviews (0.967; 0.964-0.969) with a slight performance decrease (P<.001).\nConclusion: The results suggest that U-Net based artifact reduction\nsubstantially enhances automated hemorrhage detection in sparse-view cranial\nCTs. Our findings highlight that appropriate post-processing is crucial for\noptimal image quality and diagnostic accuracy while minimizing radiation dose.\n","authors":["Johannes Thalhammer","Manuel Schultheiss","Tina Dorosti","Tobias Lasser","Franz Pfeiffer","Daniela Pfeiffer","Florian Schaff"],"pdf_url":"https://arxiv.org/pdf/2303.09340v3.pdf","comment":"11 pages, 6 figures, 1 table"},{"id":"http://arxiv.org/abs/2307.12698v1","updated":"2023-07-24T11:27:14Z","published":"2023-07-24T11:27:14Z","title":"MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised\n Learning of Motion and Content Features","summary":" Self-supervised learning of visual representations has been focusing on\nlearning content features, which do not capture object motion or location, and\nfocus on identifying and differentiating objects in images and videos. On the\nother hand, optical flow estimation is a task that does not involve\nunderstanding the content of the images on which it is estimated. We unify the\ntwo approaches and introduce MC-JEPA, a joint-embedding predictive architecture\nand self-supervised learning approach to jointly learn optical flow and content\nfeatures within a shared encoder, demonstrating that the two associated\nobjectives; the optical flow estimation objective and the self-supervised\nlearning objective; benefit from each other and thus learn content features\nthat incorporate motion information. 
The proposed approach achieves performance\non-par with existing unsupervised optical flow benchmarks, as well as with\ncommon self-supervised learning approaches on downstream tasks such as semantic\nsegmentation of images and videos.\n","authors":["Adrien Bardes","Jean Ponce","Yann LeCun"],"pdf_url":"https://arxiv.org/pdf/2307.12698v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.10763v3","updated":"2023-07-24T11:15:47Z","published":"2023-02-12T12:19:57Z","title":"Contrastive Learning and the Emergence of Attributes Associations","summary":" In response to an object presentation, supervised learning schemes generally\nrespond with a parsimonious label. Upon a similar presentation we humans\nrespond again with a label, but are flooded, in addition, by a myriad of\nassociations. A significant portion of these consist of the presented object\nattributes. Contrastive learning is a semi-supervised learning scheme based on\nthe application of identity preserving transformations on the object input\nrepresentations. It is conjectured in this work that these same applied\ntransformations preserve, in addition to the identity of the presented object,\nalso the identity of its semantically meaningful attributes. The corollary of\nthis is that the output representations of such a contrastive learning scheme\ncontain valuable information not only for the classification of the presented\nobject, but also for the presence or absence decision of any attribute of\ninterest. Simulation results which demonstrate this idea and the feasibility of\nthis conjecture are presented.\n","authors":["Daniel N. Nissani"],"pdf_url":"https://arxiv.org/pdf/2302.10763v3.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2210.12583v2","updated":"2023-07-24T11:13:21Z","published":"2022-10-23T00:45:05Z","title":"Active Learning of Discrete-Time Dynamics for Uncertainty-Aware Model\n Predictive Control","summary":" Model-based control requires an accurate model of the system dynamics for\nprecisely and safely controlling the robot in complex and dynamic environments.\nMoreover, in the presence of variations in the operating conditions, the model\nshould be continuously refined to compensate for dynamics changes. In this\npaper, we present a self-supervised learning approach that actively models the\ndynamics of nonlinear robotic systems. We combine offline learning from past\nexperience and online learning from current robot interaction with the unknown\nenvironment. These two ingredients enable a highly sample-efficient and\nadaptive learning process, capable of accurately inferring model dynamics in\nreal-time even in operating regimes that greatly differ from the training\ndistribution. Moreover, we design an uncertainty-aware model predictive\ncontroller that is heuristically conditioned to the aleatoric (data)\nuncertainty of the learned dynamics. This controller actively chooses the\noptimal control actions that (i) optimize the control performance and (ii)\nimprove the efficiency of online learning sample collection. We demonstrate the\neffectiveness of our method through a series of challenging real-world\nexperiments using a quadrotor system. 
Our approach showcases high resilience\nand generalization capabilities by consistently adapting to unseen flight\nconditions, while it significantly outperforms classical and adaptive control\nbaselines.\n","authors":["Alessandro Saviolo","Jonathan Frey","Abhishek Rathod","Moritz Diehl","Giuseppe Loianno"],"pdf_url":"https://arxiv.org/pdf/2210.12583v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12689v1","updated":"2023-07-24T11:04:22Z","published":"2023-07-24T11:04:22Z","title":"Addressing the Impact of Localized Training Data in Graph Neural\n Networks","summary":" Graph Neural Networks (GNNs) have achieved notable success in learning from\ngraph-structured data, owing to their ability to capture intricate dependencies\nand relationships between nodes. They excel in various applications, including\nsemi-supervised node classification, link prediction, and graph generation.\nHowever, it is important to acknowledge that the majority of state-of-the-art\nGNN models are built upon the assumption of an in-distribution setting, which\nhinders their performance on real-world graphs with dynamic structures. In this\narticle, we aim to assess the impact of training GNNs on localized subsets of\nthe graph. Such restricted training data may lead to a model that performs well\nin the specific region it was trained on but fails to generalize and make\naccurate predictions for the entire graph. In the context of graph-based\nsemi-supervised learning (SSL), resource constraints often lead to scenarios\nwhere the dataset is large, but only a portion of it can be labeled, affecting\nthe model's performance. This limitation affects tasks like anomaly detection\nor spam detection when labeling processes are biased or influenced by human\nsubjectivity. To tackle the challenges posed by localized training data, we\napproach the problem as an out-of-distribution (OOD) data issue by by aligning\nthe distributions between the training data, which represents a small portion\nof labeled data, and the graph inference process that involves making\npredictions for the entire graph. We propose a regularization method to\nminimize distributional discrepancies between localized training data and graph\ninference, improving model performance on OOD data. Extensive tests on popular\nGNN models show significant performance improvement on three citation GNN\nbenchmark datasets. The regularization approach effectively enhances model\nadaptation and generalization, overcoming challenges posed by OOD data.\n","authors":["Singh Akansha"],"pdf_url":"https://arxiv.org/pdf/2307.12689v1.pdf","comment":"6 pages, 4 figures"},{"id":"http://arxiv.org/abs/2307.12679v1","updated":"2023-07-24T10:33:32Z","published":"2023-07-24T10:33:32Z","title":"An Estimator for the Sensitivity to Perturbations of Deep Neural\n Networks","summary":" For Deep Neural Networks (DNNs) to become useful in safety-critical\napplications, such as self-driving cars and disease diagnosis, they must be\nstable to perturbations in input and model parameters. Characterizing the\nsensitivity of a DNN to perturbations is necessary to determine minimal\nbit-width precision that may be used to safely represent the network. However,\nno general result exists that is capable of predicting the sensitivity of a\ngiven DNN to round-off error, noise, or other perturbations in input. This\npaper derives an estimator that can predict such quantities. 
The estimator is\nderived via inequalities and matrix norms, and the resulting quantity is\nroughly analogous to a condition number for the entire neural network. An\napproximation of the estimator is tested on two Convolutional Neural Networks,\nAlexNet and VGG-19, using the ImageNet dataset. For each of these networks, the\ntightness of the estimator is explored via random perturbations and adversarial\nattacks.\n","authors":["Naman Maheshwari","Nicholas Malaya","Scott Moe","Jaydeep P. Kulkarni","Sudhanva Gurumurthi"],"pdf_url":"https://arxiv.org/pdf/2307.12679v1.pdf","comment":"Actual work and paper concluded in January 2019"},{"id":"http://arxiv.org/abs/2307.12672v1","updated":"2023-07-24T10:20:14Z","published":"2023-07-24T10:20:14Z","title":"Global k-Space Interpolation for Dynamic MRI Reconstruction using Masked\n Image Modeling","summary":" In dynamic Magnetic Resonance Imaging (MRI), k-space is typically\nundersampled due to limited scan time, resulting in aliasing artifacts in the\nimage domain. Hence, dynamic MR reconstruction requires not only modeling\nspatial frequency components in the x and y directions of k-space but also\nconsidering temporal redundancy. Most previous works rely on image-domain\nregularizers (priors) to conduct MR reconstruction. In contrast, we focus on\ninterpolating the undersampled k-space before obtaining images with Fourier\ntransform. In this work, we connect masked image modeling with k-space\ninterpolation and propose a novel Transformer-based k-space Global\nInterpolation Network, termed k-GIN. Our k-GIN learns global dependencies among\nlow- and high-frequency components of 2D+t k-space and uses it to interpolate\nunsampled data. Further, we propose a novel k-space Iterative Refinement Module\n(k-IRM) to enhance the high-frequency components learning. We evaluate our\napproach on 92 in-house 2D+t cardiac MR subjects and compare it to MR\nreconstruction methods with image-domain regularizers. Experiments show that\nour proposed k-space interpolation method quantitatively and qualitatively\noutperforms baseline methods. Importantly, the proposed approach achieves\nsubstantially higher robustness and generalizability in cases of\nhighly-undersampled MR data.\n","authors":["Jiazhen Pan","Suprosanna Shit","Özgün Turgut","Wenqi Huang","Hongwei Bran Li","Nil Stolt-Ansó","Thomas Küstner","Kerstin Hammernik","Daniel Rueckert"],"pdf_url":"https://arxiv.org/pdf/2307.12672v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12667v1","updated":"2023-07-24T10:14:51Z","published":"2023-07-24T10:14:51Z","title":"TransFusion: Generating Long, High Fidelity Time Series using Diffusion\n Models with Transformers","summary":" The generation of high-quality, long-sequenced time-series data is essential\ndue to its wide range of applications. In the past, standalone Recurrent and\nConvolutional Neural Network-based Generative Adversarial Networks (GAN) were\nused to synthesize time-series data. However, they are inadequate for\ngenerating long sequences of time-series data due to limitations in the\narchitecture. Furthermore, GANs are well known for their training instability\nand mode collapse problem. To address this, we propose TransFusion, a\ndiffusion, and transformers-based generative model to generate high-quality\nlong-sequence time-series data. We have stretched the sequence length to 384,\nand generated high-quality synthetic data. To the best of our knowledge, this\nis the first study that has been done with this long-sequence length. 
Also, we\nintroduce two evaluation metrics to evaluate the quality of the synthetic data\nas well as its predictive characteristics. We evaluate TransFusion with a wide\nvariety of visual and empirical metrics, and TransFusion outperforms the\nprevious state-of-the-art by a significant margin.\n","authors":["Md Fahim Sikder","Resmi Ramachandranpillai","Fredrik Heintz"],"pdf_url":"https://arxiv.org/pdf/2307.12667v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12660v1","updated":"2023-07-24T10:04:27Z","published":"2023-07-24T10:04:27Z","title":"Online Continual Learning in Keyword Spotting for Low-Resource Devices\n via Pooling High-Order Temporal Statistics","summary":" Keyword Spotting (KWS) models on embedded devices should adapt fast to new\nuser-defined words without forgetting previous ones. Embedded devices have\nlimited storage and computational resources, thus, they cannot save samples or\nupdate large models. We consider the setup of embedded online continual\nlearning (EOCL), where KWS models with frozen backbone are trained to\nincrementally recognize new words from a non-repeated stream of samples, seen\none at a time. To this end, we propose Temporal Aware Pooling (TAP) which\nconstructs an enriched feature space computing high-order moments of speech\nfeatures extracted by a pre-trained backbone. Our method, TAP-SLDA, updates a\nGaussian model for each class on the enriched feature space to effectively use\naudio representations. In experimental analyses, TAP-SLDA outperforms\ncompetitors on several setups, backbones, and baselines, bringing a relative\naverage gain of 11.3% on the GSC dataset.\n","authors":["Umberto Michieli","Pablo Peso Parada","Mete Ozay"],"pdf_url":"https://arxiv.org/pdf/2307.12660v1.pdf","comment":"INTERSPEECH 2023"},{"id":"http://arxiv.org/abs/2306.12231v2","updated":"2023-07-24T09:36:05Z","published":"2023-06-21T12:44:52Z","title":"Predicting protein variants with equivariant graph neural networks","summary":" Pre-trained models have been successful in many protein engineering tasks.\nMost notably, sequence-based models have achieved state-of-the-art performance\non protein fitness prediction while structure-based models have been used\nexperimentally to develop proteins with enhanced functions. However, there is a\nresearch gap in comparing structure- and sequence-based methods for predicting\nprotein variants that are better than the wildtype protein. This paper aims to\naddress this gap by conducting a comparative study between the abilities of\nequivariant graph neural networks (EGNNs) and sequence-based approaches to\nidentify promising amino-acid mutations. The results show that our proposed\nstructural approach achieves a competitive performance to sequence-based\nmethods while being trained on significantly fewer molecules. 
Additionally, we\nfind that combining assay labelled data with structure pre-trained models\nyields similar trends as with sequence pre-trained models.\n Our code and trained models can be found at:\nhttps://github.com/semiluna/partIII-amino-acid-prediction.\n","authors":["Antonia Boca","Simon Mathis"],"pdf_url":"https://arxiv.org/pdf/2306.12231v2.pdf","comment":"4 pages, 2 figures, accepted to the 2023 ICML Workshop on\n Computational Biology"},{"id":"http://arxiv.org/abs/2307.12644v1","updated":"2023-07-24T09:35:47Z","published":"2023-07-24T09:35:47Z","title":"Remote Bio-Sensing: Open Source Benchmark Framework for Fair Evaluation\n of rPPG","summary":" Remote Photoplethysmography (rPPG) is a technology that utilizes the light\nabsorption properties of hemoglobin, captured via camera, to analyze and\nmeasure blood volume pulse (BVP). By analyzing the measured BVP, various\nphysiological signals such as heart rate, stress levels, and blood pressure can\nbe derived, enabling applications such as the early prediction of\ncardiovascular diseases. rPPG is a rapidly evolving field as it allows the\nmeasurement of vital signals using camera-equipped devices without the need for\nadditional devices such as blood pressure monitors or pulse oximeters, and\nwithout the assistance of medical experts. Despite extensive efforts and\nadvances in this field, serious challenges remain, including issues related to\nskin color, camera characteristics, ambient lighting, and other sources of\nnoise, which degrade performance accuracy. We argue that fair and evaluable\nbenchmarking is urgently required to overcome these challenges and make any\nmeaningful progress from both academic and commercial perspectives. In most\nexisting work, models are trained, tested, and validated only on limited\ndatasets. Worse still, some studies lack available code or reproducibility,\nmaking it difficult to fairly evaluate and compare performance. Therefore, the\npurpose of this study is to provide a benchmarking framework to evaluate\nvarious rPPG techniques across a wide range of datasets for fair evaluation and\ncomparison, including both conventional non-deep neural network (non-DNN) and\ndeep neural network (DNN) methods. GitHub URL:\nhttps://github.com/remotebiosensing/rppg.\n","authors":["Dae Yeol Kim","Eunsu Goh","KwangKee Lee","JongEui Chae","JongHyeon Mun","Junyeong Na","Chae-bong Sohn","Do-Yup Kim"],"pdf_url":"https://arxiv.org/pdf/2307.12644v1.pdf","comment":"19 pages, 10 figures"},{"id":"http://arxiv.org/abs/2307.12639v1","updated":"2023-07-24T09:30:30Z","published":"2023-07-24T09:30:30Z","title":"Fake News Detection Through Graph-based Neural Networks: A Survey","summary":" The popularity of online social networks has enabled rapid dissemination of\ninformation. People now can share and consume information much more rapidly\nthan ever before. However, low-quality and/or accidentally/deliberately fake\ninformation can also spread rapidly. This can lead to considerable and negative\nimpacts on society. Identifying, labelling and debunking online misinformation\nas early as possible has become an increasingly urgent problem. Many methods\nhave been proposed to detect fake news including many deep learning and\ngraph-based approaches. In recent years, graph-based methods have yielded\nstrong results, as they can closely model the social context and propagation\nprocess of online news. 
In this paper, we present a systematic review of fake\nnews detection studies based on graph-based and deep learning-based techniques.\nWe classify existing graph-based methods into knowledge-driven methods,\npropagation-based methods, and heterogeneous social context-based methods,\ndepending on how a graph structure is constructed to model news related\ninformation flows. We further discuss the challenges and open problems in\ngraph-based fake news detection and identify future research directions.\n","authors":["Shuzhi Gong","Richard O. Sinnott","Jianzhong Qi","Cecile Paris"],"pdf_url":"https://arxiv.org/pdf/2307.12639v1.pdf","comment":"18 pages, 3 tables, 7 figures"},{"id":"http://arxiv.org/abs/2304.03981v2","updated":"2023-07-24T09:24:04Z","published":"2023-04-08T10:47:41Z","title":"Uncertainty-inspired Open Set Learning for Retinal Anomaly\n Identification","summary":" Failure to recognize samples from the classes unseen during training is a\nmajor limitation of artificial intelligence in the real-world implementation\nfor recognition and classification of retinal anomalies. We established an\nuncertainty-inspired open-set (UIOS) model, which was trained with fundus\nimages of 9 retinal conditions. Besides assessing the probability of each\ncategory, UIOS also calculated an uncertainty score to express its confidence.\nOur UIOS model with thresholding strategy achieved an F1 score of 99.55%,\n97.01% and 91.91% for the internal testing set, external target categories\n(TC)-JSIEC dataset and TC-unseen testing set, respectively, compared to the F1\nscore of 92.20%, 80.69% and 64.74% by the standard AI model. Furthermore, UIOS\ncorrectly predicted high uncertainty scores, which would prompt the need for a\nmanual check in the datasets of non-target categories retinal diseases,\nlow-quality fundus images, and non-fundus images. UIOS provides a robust method\nfor real-world screening of retinal anomalies.\n","authors":["Meng Wang","Tian Lin","Lianyu Wang","Aidi Lin","Ke Zou","Xinxing Xu","Yi Zhou","Yuanyuan Peng","Qingquan Meng","Yiming Qian","Guoyao Deng","Zhiqun Wu","Junhong Chen","Jianhong Lin","Mingzhi Zhang","Weifang Zhu","Changqing Zhang","Daoqiang Zhang","Rick Siow Mong Goh","Yong Liu","Chi Pui Pang","Xinjian Chen","Haoyu Chen","Huazhu Fu"],"pdf_url":"https://arxiv.org/pdf/2304.03981v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12636v1","updated":"2023-07-24T09:19:38Z","published":"2023-07-24T09:19:38Z","title":"Identifying drivers and mitigators for congestion and redispatch in the\n German electric power system with explainable AI","summary":" The transition to a sustainable energy supply challenges the operation of\nelectric power systems in manifold ways. Transmission grid loads increase as\nwind and solar power are often installed far away from the consumers. In\nextreme cases, system operators must intervene via countertrading or redispatch\nto ensure grid stability. In this article, we provide a data-driven analysis of\ncongestion in the German transmission grid. We develop an explainable machine\nlearning model to predict the volume of redispatch and countertrade on an\nhourly basis. The model reveals factors that drive or mitigate grid congestion\nand quantifies their impact. We show that, as expected, wind power generation\nis the main driver, but hydropower and cross-border electricity trading also\nplay an essential role. Solar power, on the other hand, has no mitigating\neffect. 
Our results suggest that a change to the market design would alleviate\ncongestion.\n","authors":["Maurizio Titz","Sebastian Pütz","Dirk Witthaut"],"pdf_url":"https://arxiv.org/pdf/2307.12636v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.14430v3","updated":"2023-07-24T09:15:02Z","published":"2022-09-28T21:31:43Z","title":"Minimax Optimal Kernel Operator Learning via Multilevel Training","summary":" Learning mappings between infinite-dimensional function spaces has achieved\nempirical success in many disciplines of machine learning, including generative\nmodeling, functional data analysis, causal inference, and multi-agent\nreinforcement learning. In this paper, we study the statistical limit of\nlearning a Hilbert-Schmidt operator between two infinite-dimensional Sobolev\nreproducing kernel Hilbert spaces. We establish the information-theoretic lower\nbound in terms of the Sobolev Hilbert-Schmidt norm and show that a\nregularization that learns the spectral components below the bias contour and\nignores the ones that are above the variance contour can achieve the optimal\nlearning rate. At the same time, the spectral components between the bias and\nvariance contours give us flexibility in designing computationally feasible\nmachine learning algorithms. Based on this observation, we develop a multilevel\nkernel operator learning algorithm that is optimal when learning linear\noperators between infinite-dimensional function spaces.\n","authors":["Jikai Jin","Yiping Lu","Jose Blanchet","Lexing Ying"],"pdf_url":"https://arxiv.org/pdf/2209.14430v3.pdf","comment":"ICLR 2023 spotlight"},{"id":"http://arxiv.org/abs/2307.12625v1","updated":"2023-07-24T08:56:25Z","published":"2023-07-24T08:56:25Z","title":"De-confounding Representation Learning for Counterfactual Inference on\n Continuous Treatment via Generative Adversarial Network","summary":" Counterfactual inference for continuous rather than binary treatment\nvariables is more common in real-world causal inference tasks. While there are\nalready some sample reweighting methods based on Marginal Structural Model for\neliminating the confounding bias, they generally focus on removing the\ntreatment's linear dependence on confounders and rely on the accuracy of the\nassumed parametric models, which are usually unverifiable. In this paper, we\npropose a de-confounding representation learning (DRL) framework for\ncounterfactual outcome estimation of continuous treatment by generating the\nrepresentations of covariates disentangled with the treatment variables. The\nDRL is a non-parametric model that eliminates both linear and nonlinear\ndependence between treatment and covariates. Specifically, we train the\ncorrelations between the de-confounded representations and the treatment\nvariables against the correlations between the covariate representations and\nthe treatment variables to eliminate confounding bias. Further, a\ncounterfactual inference network is embedded into the framework to make the\nlearned representations serve both de-confounding and trusted inference.\nExtensive experiments on synthetic datasets show that the DRL model performs\nsuperiorly in learning de-confounding representations and outperforms\nstate-of-the-art counterfactual inference models for continuous treatment\nvariables. 
In addition, we apply the DRL model to a real-world medical dataset\nMIMIC and demonstrate a detailed causal relationship between red cell width\ndistribution and mortality.\n","authors":["Yonghe Zhao","Qiang Huang","Haolong Zeng","Yun Pen","Huiyan Sun"],"pdf_url":"https://arxiv.org/pdf/2307.12625v1.pdf","comment":"15 pages,4 figures"},{"id":"http://arxiv.org/abs/2307.12617v1","updated":"2023-07-24T08:46:12Z","published":"2023-07-24T08:46:12Z","title":"Predicting Ordinary Differential Equations with Transformers","summary":" We develop a transformer-based sequence-to-sequence model that recovers\nscalar ordinary differential equations (ODEs) in symbolic form from irregularly\nsampled and noisy observations of a single solution trajectory. We demonstrate\nin extensive empirical evaluations that our model performs better or on par\nwith existing methods in terms of accurate recovery across various settings.\nMoreover, our method is efficiently scalable: after one-time pretraining on a\nlarge set of ODEs, we can infer the governing law of a new observed solution in\na few forward passes of the model.\n","authors":["Sören Becker","Michal Klein","Alexander Neitz","Giambattista Parascandolo","Niki Kilbertus"],"pdf_url":"https://arxiv.org/pdf/2307.12617v1.pdf","comment":"Published at ICML 2023"},{"id":"http://arxiv.org/abs/2307.09458v3","updated":"2023-07-24T08:32:40Z","published":"2023-07-18T17:39:04Z","title":"Does Circuit Analysis Interpretability Scale? Evidence from Multiple\n Choice Capabilities in Chinchilla","summary":" \\emph{Circuit analysis} is a promising technique for understanding the\ninternal mechanisms of language models. However, existing analyses are done in\nsmall models far from the state of the art. To address this, we present a case\nstudy of circuit analysis in the 70B Chinchilla model, aiming to test the\nscalability of circuit analysis. In particular, we study multiple-choice\nquestion answering, and investigate Chinchilla's capability to identify the\ncorrect answer \\emph{label} given knowledge of the correct answer \\emph{text}.\nWe find that the existing techniques of logit attribution, attention pattern\nvisualization, and activation patching naturally scale to Chinchilla, allowing\nus to identify and categorize a small set of `output nodes' (attention heads\nand MLPs).\n We further study the `correct letter' category of attention heads aiming to\nunderstand the semantics of their features, with mixed results. For normal\nmultiple-choice question answers, we significantly compress the query, key and\nvalue subspaces of the head without loss of performance when operating on the\nanswer labels for multiple-choice questions, and we show that the query and key\nsubspaces represent an `Nth item in an enumeration' feature to at least some\nextent. 
However, when we attempt to use this explanation to understand the\nheads' behaviour on a more general distribution including randomized answer\nlabels, we find that it is only a partial explanation, suggesting there is more\nto learn about the operation of `correct letter' heads on multiple choice\nquestion answering.\n","authors":["Tom Lieberum","Matthew Rahtz","János Kramár","Neel Nanda","Geoffrey Irving","Rohin Shah","Vladimir Mikulik"],"pdf_url":"https://arxiv.org/pdf/2307.09458v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12607v1","updated":"2023-07-24T08:32:27Z","published":"2023-07-24T08:32:27Z","title":"ExWarp: Extrapolation and Warping-based Temporal Supersampling for\n High-frequency Displays","summary":" High-frequency displays are gaining immense popularity because of their\nincreasing use in video games and virtual reality applications. However, the\nissue is that the underlying GPUs cannot continuously generate frames at this\nhigh rate -- this results in a less smooth and responsive experience.\nFurthermore, if the frame rate is not synchronized with the refresh rate, the\nuser may experience screen tearing and stuttering. Previous works propose\nincreasing the frame rate to provide a smooth experience on modern displays by\npredicting new frames based on past or future frames. Interpolation and\nextrapolation are two widely used algorithms that predict new frames.\nInterpolation requires waiting for the future frame to make a prediction, which\nadds additional latency. On the other hand, extrapolation provides a better\nquality of experience because it relies solely on past frames -- it does not\nincur any additional latency. The simplest method to extrapolate a frame is to\nwarp the previous frame using motion vectors; however, the warped frame may\ncontain improperly rendered visual artifacts due to dynamic objects -- this\nmakes it very challenging to design such a scheme. Past work has used DNNs to\nget good accuracy, however, these approaches are slow. This paper proposes\nExwarp -- an approach based on reinforcement learning (RL) to intelligently\nchoose between the slower DNN-based extrapolation and faster warping-based\nmethods to increase the frame rate by 4x with an almost negligible reduction in\nthe perceived image quality.\n","authors":["Akanksha Dixit","Yashashwee Chakrabarty","Smruti R. Sarangi"],"pdf_url":"https://arxiv.org/pdf/2307.12607v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12601v1","updated":"2023-07-24T08:21:13Z","published":"2023-07-24T08:21:13Z","title":"Concept backpropagation: An Explainable AI approach for visualising\n learned concepts in neural network models","summary":" Neural network models are widely used in a variety of domains, often as\nblack-box solutions, since they are not directly interpretable for humans. The\nfield of explainable artificial intelligence aims at developing explanation\nmethods to address this challenge, and several approaches have been developed\nover the recent years, including methods for investigating what type of\nknowledge these models internalise during the training process. Among these,\nthe method of concept detection, investigates which \\emph{concepts} neural\nnetwork models learn to represent in order to complete their tasks. In this\nwork, we present an extension to the method of concept detection, named\n\\emph{concept backpropagation}, which provides a way of analysing how the\ninformation representing a given concept is internalised in a given neural\nnetwork model. 
In this approach, the model input is perturbed in a manner\nguided by a trained concept probe for the described model, such that the\nconcept of interest is maximised. This allows for the visualisation of the\ndetected concept directly in the input space of the model, which in turn makes\nit possible to see what information the model depends on for representing the\ndescribed concept. We present results for this method applied to a various set\nof input modalities, and discuss how our proposed method can be used to\nvisualise what information trained concept probes use, and the degree as to\nwhich the representation of the probed concept is entangled within the neural\nnetwork model itself.\n","authors":["Patrik Hammersborg","Inga Strümke"],"pdf_url":"https://arxiv.org/pdf/2307.12601v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12594v1","updated":"2023-07-24T08:11:59Z","published":"2023-07-24T08:11:59Z","title":"Optimized data collection and analysis process for studying\n solar-thermal desalination by machine learning","summary":" An effective interdisciplinary study between machine learning and\nsolar-thermal desalination requires a sufficiently large and well-analyzed\nexperimental datasets. This study develops a modified dataset collection and\nanalysis process for studying solar-thermal desalination by machine learning.\nBased on the optimized water condensation and collection process, the proposed\nexperimental method collects over one thousand datasets, which is ten times\nmore than the average number of datasets in previous works, by accelerating\ndata collection and reducing the time by 83.3%. On the other hand, the effects\nof dataset features are investigated by using three different algorithms,\nincluding artificial neural networks, multiple linear regressions, and random\nforests. The investigation focuses on the effects of dataset size and range on\nprediction accuracy, factor importance ranking, and the model's generalization\nability. The results demonstrate that a larger dataset can significantly\nimprove prediction accuracy when using artificial neural networks and random\nforests. Additionally, the study highlights the significant impact of dataset\nsize and range on ranking the importance of influence factors. Furthermore, the\nstudy reveals that the extrapolation data range significantly affects the\nextrapolation accuracy of artificial neural networks. Based on the results,\nmassive dataset collection and analysis of dataset feature effects are\nimportant steps in an effective and consistent machine learning process flow\nfor solar-thermal desalination, which can promote machine learning as a more\ngeneral tool in the field of solar-thermal desalination.\n","authors":["Guilong Peng","Senshan Sun","Yangjun Qin","Zhenwei Xu","Juxin Du","Swellam W. sharshir","A. W. Kandel","A. E. Kabeel","Nuo Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12594v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.07515v2","updated":"2023-07-24T08:10:52Z","published":"2023-04-15T09:39:52Z","title":"S3M: Scalable Statistical Shape Modeling through Unsupervised\n Correspondences","summary":" Statistical shape models (SSMs) are an established way to represent the\nanatomy of a population with various clinically relevant applications. However,\nthey typically require domain expertise, and labor-intensive landmark\nannotations to construct. 
We address these shortcomings by proposing an\nunsupervised method that leverages deep geometric features and functional\ncorrespondences to simultaneously learn local and global shape structures\nacross population anatomies. Our pipeline significantly improves unsupervised\ncorrespondence estimation for SSMs compared to baseline methods, even on highly\nirregular surface topologies. We demonstrate this for two different anatomical\nstructures: the thyroid and a multi-chamber heart dataset. Furthermore, our\nmethod is robust enough to learn from noisy neural network predictions,\npotentially enabling scaling SSMs to larger patient populations without manual\nsegmentation annotation.\n","authors":["Lennart Bastian","Alexander Baumann","Emily Hoppe","Vincent Bürgin","Ha Young Kim","Mahdi Saleh","Benjamin Busam","Nassir Navab"],"pdf_url":"https://arxiv.org/pdf/2304.07515v2.pdf","comment":"Accepted at MICCAI 2023. 13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.12586v1","updated":"2023-07-24T07:58:18Z","published":"2023-07-24T07:58:18Z","title":"InVAErt networks: a data-driven framework for emulation, inference and\n identifiability analysis","summary":" Use of generative models and deep learning for physics-based systems is\ncurrently dominated by the task of emulation. However, the remarkable\nflexibility offered by data-driven architectures would suggest to extend this\nrepresentation to other aspects of system synthesis including model inversion\nand identifiability. We introduce inVAErt (pronounced \\emph{invert}) networks,\na comprehensive framework for data-driven analysis and synthesis of parametric\nphysical systems which uses a deterministic encoder and decoder to represent\nthe forward and inverse solution maps, normalizing flow to capture the\nprobabilistic distribution of system outputs, and a variational encoder\ndesigned to learn a compact latent representation for the lack of bijectivity\nbetween inputs and outputs. We formally investigate the selection of penalty\ncoefficients in the loss function and strategies for latent space sampling,\nsince we find that these significantly affect both training and testing\nperformance. We validate our framework through extensive numerical examples,\nincluding simple linear, nonlinear, and periodic maps, dynamical systems, and\nspatio-temporal PDEs.\n","authors":["Guoxiang Grayson Tong","Carlos A. Sing Long","Daniele E. Schiavazzi"],"pdf_url":"https://arxiv.org/pdf/2307.12586v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09087v3","updated":"2023-07-24T07:55:19Z","published":"2023-06-15T12:33:39Z","title":"Deep learning based Meta-modeling for Multi-objective Technology\n Optimization of Electrical Machines","summary":" Optimization of rotating electrical machines is both time- and\ncomputationally expensive. Because of the different parametrization, design\noptimization is commonly executed separately for each machine technology. In\nthis paper, we present the application of a variational auto-encoder (VAE) to\noptimize two different machine technologies simultaneously, namely an\nasynchronous machine and a permanent magnet synchronous machine. After\ntraining, we employ a deep neural network and a decoder as meta-models to\npredict global key performance indicators (KPIs) and generate associated new\ndesigns, respectively, through unified latent space in the optimization loop.\nNumerical results demonstrate concurrent parametric multi-objective technology\noptimization in the high-dimensional design space. 
The VAE-based approach is\nquantitatively compared to a classical deep learning-based direct approach for\nKPIs prediction.\n","authors":["Vivek Parekh","Dominik Flore","Sebastian Schöps"],"pdf_url":"https://arxiv.org/pdf/2306.09087v3.pdf","comment":"12 pages, 15 figures"},{"id":"http://arxiv.org/abs/2307.12576v1","updated":"2023-07-24T07:47:21Z","published":"2023-07-24T07:47:21Z","title":"Self-refining of Pseudo Labels for Music Source Separation with Noisy\n Labeled Data","summary":" Music source separation (MSS) faces challenges due to the limited\navailability of correctly-labeled individual instrument tracks. With the push\nto acquire larger datasets to improve MSS performance, the inevitability of\nencountering mislabeled individual instrument tracks becomes a significant\nchallenge to address. This paper introduces an automated technique for refining\nthe labels in a partially mislabeled dataset. Our proposed self-refining\ntechnique, employed with a noisy-labeled dataset, results in only a 1% accuracy\ndegradation in multi-label instrument recognition compared to a classifier\ntrained on a clean-labeled dataset. The study demonstrates the importance of\nrefining noisy-labeled data in MSS model training and shows that utilizing the\nrefined dataset leads to comparable results derived from a clean-labeled\ndataset. Notably, upon only access to a noisy dataset, MSS models trained on a\nself-refined dataset even outperform those trained on a dataset refined with a\nclassifier trained on clean labels.\n","authors":["Junghyun Koo","Yunkee Chae","Chang-Bin Jeon","Kyogu Lee"],"pdf_url":"https://arxiv.org/pdf/2307.12576v1.pdf","comment":"24th International Society for Music Information Retrieval Conference\n (ISMIR 2023)"},{"id":"http://arxiv.org/abs/2306.16264v2","updated":"2023-07-24T07:30:53Z","published":"2023-06-28T14:46:55Z","title":"Deep Unfolded Simulated Bifurcation for Massive MIMO Signal Detection","summary":" Multiple-input multiple-output (MIMO) is a key ingredient of next-generation\nwireless communications. Recently, various MIMO signal detectors based on deep\nlearning techniques and quantum(-inspired) algorithms have been proposed to\nimprove the detection performance compared with conventional detectors. This\npaper focuses on the simulated bifurcation (SB) algorithm, a quantum-inspired\nalgorithm. This paper proposes two techniques to improve its detection\nperformance. The first is modifying the algorithm inspired by the\nLevenberg-Marquardt algorithm to eliminate local minima of maximum likelihood\ndetection. The second is the use of deep unfolding, a deep learning technique\nto train the internal parameters of an iterative algorithm. We propose a\ndeep-unfolded SB by making the update rule of SB differentiable. The numerical\nresults show that these proposed detectors significantly improve the signal\ndetection performance in massive MIMO systems.\n","authors":["Satoshi Takabe"],"pdf_url":"https://arxiv.org/pdf/2306.16264v2.pdf","comment":"5pages, 4 figures; codes are available at\n https://github.com/s-takabe/unfolded_simbif"},{"id":"http://arxiv.org/abs/2307.12564v1","updated":"2023-07-24T07:17:33Z","published":"2023-07-24T07:17:33Z","title":"Towards Generalising Neural Topical Representations","summary":" Topic models have evolved from conventional Bayesian probabilistic models to\nNeural Topic Models (NTMs) over the last two decades. 
Although NTMs have\nachieved promising performance when trained and tested on a specific corpus,\ntheir generalisation ability across corpora is rarely studied. In practice, we\noften expect that an NTM trained on a source corpus can still produce quality\ntopical representation for documents in a different target corpus without\nretraining. In this work, we aim to improve NTMs further so that their benefits\ngeneralise reliably across corpora and tasks. To do so, we propose to model\nsimilar documents by minimising their semantical distance when training NTMs.\nSpecifically, similar documents are created by data augmentation during\ntraining; The semantical distance between documents is measured by the\nHierarchical Topic Transport Distance (HOTT), which computes the Optimal\nTransport (OT) distance between the topical representations. Our framework can\nbe readily applied to most NTMs as a plug-and-play module. Extensive\nexperiments show that our framework significantly improves the generalisation\nability regarding neural topical representation across corpora.\n","authors":["Xiaohao Yang","He Zhao","Dinh Phung","Lan Du"],"pdf_url":"https://arxiv.org/pdf/2307.12564v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.09251v2","updated":"2023-07-24T07:08:59Z","published":"2022-11-16T22:50:40Z","title":"Learning-Augmented B-Trees","summary":" We study learning-augmented binary search trees (BSTs) and B-Trees via Treaps\nwith composite priorities. The result is a simple search tree where the depth\nof each item is determined by its predicted weight $w_x$. To achieve the\nresult, each item $x$ has its composite priority\n$-\\lfloor\\log\\log(1/w_x)\\rfloor + U(0, 1)$ where $U(0, 1)$ is the uniform\nrandom variable. This generalizes the recent learning-augmented BSTs\n[Lin-Luo-Woodruff ICML`22], which only work for Zipfian distributions, to\narbitrary inputs and predictions. It also gives the first B-Tree data structure\nthat can provably take advantage of localities in the access sequence via\nonline self-reorganization. The data structure is robust to prediction errors\nand handles insertions, deletions, as well as prediction updates.\n","authors":["Xinyuan Cao","Jingbang Chen","Li Chen","Chris Lambert","Richard Peng","Daniel Sleator"],"pdf_url":"https://arxiv.org/pdf/2211.09251v2.pdf","comment":"25 pages"},{"id":"http://arxiv.org/abs/2307.10617v3","updated":"2023-07-24T07:03:01Z","published":"2023-07-20T06:35:43Z","title":"Unmasking Falsehoods in Reviews: An Exploration of NLP Techniques","summary":" In the contemporary digital landscape, online reviews have become an\nindispensable tool for promoting products and services across various\nbusinesses. Marketers, advertisers, and online businesses have found incentives\nto create deceptive positive reviews for their products and negative reviews\nfor their competitors' offerings. As a result, the writing of deceptive reviews\nhas become an unavoidable practice for businesses seeking to promote themselves\nor undermine their rivals. Detecting such deceptive reviews has become an\nintense and ongoing area of research. This research paper proposes a machine\nlearning model to identify deceptive reviews, with a particular focus on\nrestaurants. This study delves into the performance of numerous experiments\nconducted on a dataset of restaurant reviews known as the Deceptive Opinion\nSpam Corpus. 
To accomplish this, an n-gram model and max features are developed\nto effectively identify deceptive content, particularly focusing on fake\nreviews. A benchmark study is undertaken to explore the performance of two\ndifferent feature extraction techniques, which are then coupled with five\ndistinct machine learning classification algorithms. The experimental results\nreveal that the passive aggressive classifier stands out among the various\nalgorithms, showcasing the highest accuracy not only in text classification but\nalso in identifying fake reviews. Moreover, the research delves into data\naugmentation and implements various deep learning techniques to further enhance\nthe process of detecting deceptive reviews. The findings shed light on the\nefficacy of the proposed machine learning approach and offer valuable insights\ninto dealing with deceptive reviews in the realm of online businesses.\n","authors":["Anusuya Baby Hari Krishnan"],"pdf_url":"https://arxiv.org/pdf/2307.10617v3.pdf","comment":"6 pages, 3 figures"},{"id":"http://arxiv.org/abs/2307.12555v1","updated":"2023-07-24T06:41:59Z","published":"2023-07-24T06:41:59Z","title":"Homophily-Driven Sanitation View for Robust Graph Contrastive Learning","summary":" We investigate adversarial robustness of unsupervised Graph Contrastive\nLearning (GCL) against structural attacks. First, we provide a comprehensive\nempirical and theoretical analysis of existing attacks, revealing how and why\nthey downgrade the performance of GCL. Inspired by our analytic results, we\npresent a robust GCL framework that integrates a homophily-driven sanitation\nview, which can be learned jointly with contrastive learning. A key challenge\nthis poses, however, is the non-differentiable nature of the sanitation\nobjective. To address this challenge, we propose a series of techniques to\nenable gradient-based end-to-end robust GCL. Moreover, we develop a fully\nunsupervised hyperparameter tuning method which, unlike prior approaches, does\nnot require knowledge of node labels. We conduct extensive experiments to\nevaluate the performance of our proposed model, GCHS (Graph Contrastive\nLearning with Homophily-driven Sanitation View), against two state of the art\nstructural attacks on GCL. Our results demonstrate that GCHS consistently\noutperforms all state of the art baselines in terms of the quality of generated\nnode embeddings as well as performance on two important downstream tasks.\n","authors":["Yulin Zhu","Xing Ai","Yevgeniy Vorobeychik","Kai Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.12555v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12551v1","updated":"2023-07-24T06:38:10Z","published":"2023-07-24T06:38:10Z","title":"Continuation Path Learning for Homotopy Optimization","summary":" Homotopy optimization is a traditional method to deal with a complicated\noptimization problem by solving a sequence of easy-to-hard surrogate\nsubproblems. However, this method can be very sensitive to the continuation\nschedule design and might lead to a suboptimal solution to the original\nproblem. In addition, the intermediate solutions, often ignored by classic\nhomotopy optimization, could be useful for many real-world applications. In\nthis work, we propose a novel model-based approach to learn the whole\ncontinuation path for homotopy optimization, which contains infinite\nintermediate solutions for any surrogate subproblems. 
Rather than the classic\nunidirectional easy-to-hard optimization, our method can simultaneously\noptimize the original problem and all surrogate subproblems in a collaborative\nmanner. The proposed model also supports real-time generation of any\nintermediate solution, which could be desirable for many applications.\nExperimental studies on different problems show that our proposed method can\nsignificantly improve the performance of homotopy optimization and provide\nextra helpful information to support better decision-making.\n","authors":["Xi Lin","Zhiyuan Yang","Xiaoyuan Zhang","Qingfu Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.12551v1.pdf","comment":"Accepted by the 40th International Conference on Machine Learning\n (ICML 2023)"},{"id":"http://arxiv.org/abs/2304.12438v2","updated":"2023-07-24T06:19:17Z","published":"2023-04-24T20:24:07Z","title":"Stochastic MPC for energy hubs using data driven demand forecasting","summary":" Energy hubs convert and distribute energy resources by combining different\nenergy inputs through multiple conversion and storage components. The optimal\noperation of the energy hub exploits its flexibility to increase the energy\nefficiency and reduce the operational costs. However, uncertainties in the\ndemand present challenges to energy hub optimization. In this paper, we propose\na stochastic MPC controller to minimize energy costs using chance constraints\nfor the uncertain electricity and thermal demands. Historical data is used to\nbuild a demand prediction model based on Gaussian processes to generate a\nforecast of the future electricity and heat demands. The stochastic\noptimization problem is solved via the Scenario Approach by sampling multi-step\ndemand trajectories from the derived prediction model. The performance of the\nproposed predictor and of the stochastic controller is verified on a simulated\nenergy hub model and demand data from a real building.\n","authors":["Varsha Behrunani","Francesco Micheli","Jonas Mehr","Philipp Heer","John Lygeros"],"pdf_url":"https://arxiv.org/pdf/2304.12438v2.pdf","comment":"6 pages, 5 figures. Submitted to IFAC World Congress 2023"},{"id":"http://arxiv.org/abs/2211.09710v3","updated":"2023-07-24T05:39:27Z","published":"2022-11-17T17:45:59Z","title":"Style Classification of Rabbinic Literature for Detection of Lost\n Midrash Tanhuma Material","summary":" Midrash collections are complex rabbinic works that consist of text in\nmultiple languages, which evolved through long processes of unstable oral and\nwritten transmission. Determining the origin of a given passage in such a\ncompilation is not always straightforward and is often a matter of dispute\namong scholars, yet it is essential for scholars' understanding of the passage\nand its relationship to other texts in the rabbinic corpus. To help solve this\nproblem, we propose a system for classification of rabbinic literature based on\nits style, leveraging recent advances in natural language processing for Hebrew\ntexts. 
Additionally, we demonstrate how this method can be applied to uncover\nlost material from a specific midrash genre, Tan\\d{h}uma-Yelammedenu, that has\nbeen preserved in later anthologies.\n","authors":["Shlomo Tannor","Nachum Dershowitz","Moshe Lavee"],"pdf_url":"https://arxiv.org/pdf/2211.09710v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12532v1","updated":"2023-07-24T05:36:19Z","published":"2023-07-24T05:36:19Z","title":"On the Connection between Pre-training Data Diversity and Fine-tuning\n Robustness","summary":" Pre-training has been widely adopted in deep learning to improve model\nperformance, especially when the training data for a target task is limited. In\nour work, we seek to understand the implications of this training strategy on\nthe generalization properties of downstream models. More specifically, we ask\nthe following question: how do properties of the pre-training distribution\naffect the robustness of a fine-tuned model? The properties we explore include\nthe label space, label semantics, image diversity, data domains, and data\nquantity of the pre-training distribution. We find that the primary factor\ninfluencing downstream effective robustness (Taori et al., 2020) is data\nquantity, while other factors have limited significance. For example, reducing\nthe number of ImageNet pre-training classes by 4x while increasing the number\nof images per class by 4x (that is, keeping total data quantity fixed) does not\nimpact the robustness of fine-tuned models. We demonstrate our findings on\npre-training distributions drawn from various natural and synthetic data\nsources, primarily using the iWildCam-WILDS distribution shift as a test for\ndownstream robustness.\n","authors":["Vivek Ramanujan","Thao Nguyen","Sewoong Oh","Ludwig Schmidt","Ali Farhadi"],"pdf_url":"https://arxiv.org/pdf/2307.12532v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12526v1","updated":"2023-07-24T04:56:23Z","published":"2023-07-24T04:56:23Z","title":"Rethinking Medical Report Generation: Disease Revealing Enhancement with\n Knowledge Graph","summary":" Knowledge Graph (KG) plays a crucial role in Medical Report Generation (MRG)\nbecause it reveals the relations among diseases and thus can be utilized to\nguide the generation process. However, constructing a comprehensive KG is\nlabor-intensive and its applications on the MRG process are under-explored. In\nthis study, we establish a complete KG on chest X-ray imaging that includes 137\ntypes of diseases and abnormalities. Based on this KG, we find that the current\nMRG data sets exhibit a long-tailed problem in disease distribution. To\nmitigate this problem, we introduce a novel augmentation strategy that enhances\nthe representation of disease types in the tail-end of the distribution. We\nfurther design a two-stage MRG approach, where a classifier is first trained to\ndetect whether the input images exhibit any abnormalities. The classified\nimages are then independently fed into two transformer-based generators,\nnamely, ``disease-specific generator\" and ``disease-free generator\" to generate\nthe corresponding reports. To enhance the clinical evaluation of whether the\ngenerated reports correctly describe the diseases appearing in the input image,\nwe propose diverse sensitivity (DS), a new metric that checks whether generated\ndiseases match ground truth and measures the diversity of all generated\ndiseases. 
Results show that the proposed two-stage generation framework and\naugmentation strategies improve DS by a considerable margin, indicating a\nnotable reduction in the long-tailed problem associated with under-represented\ndiseases.\n","authors":["Yixin Wang","Zihao Lin","Haoyu Dong"],"pdf_url":"https://arxiv.org/pdf/2307.12526v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12524v1","updated":"2023-07-24T04:46:22Z","published":"2023-07-24T04:46:22Z","title":"Landslide Surface Displacement Prediction Based on VSXC-LSTM Algorithm","summary":" Landslide is a natural disaster that can easily threaten local ecology,\npeople's lives and property. In this paper, we conduct modelling research on\nreal unidirectional surface displacement data of recent landslides in the\nresearch area and propose a time series prediction framework named\nVMD-SegSigmoid-XGBoost-ClusterLSTM (VSXC-LSTM) based on variational mode\ndecomposition, which can predict the landslide surface displacement more\naccurately. The model performs well on the test set. Except for the random item\nsubsequence that is hard to fit, the root mean square error (RMSE) and the mean\nabsolute percentage error (MAPE) of the trend item subsequence and the periodic\nitem subsequence are both less than 0.1, and the RMSE is as low as 0.006 for\nthe periodic item prediction module based on XGBoost\\footnote{Accepted in\nICANN2023}.\n","authors":["Menglin Kong","Ruichen Li","Fan Liu","Xingquan Li","Juan Cheng","Muzhou Hou","Cong Cao"],"pdf_url":"https://arxiv.org/pdf/2307.12524v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12520v1","updated":"2023-07-24T04:29:43Z","published":"2023-07-24T04:29:43Z","title":"Lost In Translation: Generating Adversarial Examples Robust to\n Round-Trip Translation","summary":" Language Models today provide a high accuracy across a large number of\ndownstream tasks. However, they remain susceptible to adversarial attacks,\nparticularly against those where the adversarial examples maintain considerable\nsimilarity to the original text. Given the multilingual nature of text, the\neffectiveness of adversarial examples across translations and how machine\ntranslations can improve the robustness of adversarial examples remain largely\nunexplored. In this paper, we present a comprehensive study on the robustness\nof current text adversarial attacks to round-trip translation. We demonstrate\nthat 6 state-of-the-art text-based adversarial attacks do not maintain their\nefficacy after round-trip translation. Furthermore, we introduce an\nintervention-based solution to this problem, by integrating Machine Translation\ninto the process of adversarial example generation and demonstrating increased\nrobustness to round-trip translation. 
Our results indicate that finding\nadversarial examples robust to translation can help identify the insufficiency\nof language models that is common across languages, and motivate further\nresearch into multilingual adversarial attacks.\n","authors":["Neel Bhandari","Pin-Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2307.12520v1.pdf","comment":"Published at International Conference on Acoustics, Speech, and\n Signal Processing (ICASSP) 2023"},{"id":"http://arxiv.org/abs/2307.12519v1","updated":"2023-07-24T04:29:00Z","published":"2023-07-24T04:29:00Z","title":"DEPHN: Different Expression Parallel Heterogeneous Network using virtual\n gradient optimization for Multi-task Learning","summary":" Recommendation system algorithm based on multi-task learning (MTL) is the\nmajor method for Internet operators to understand users and predict their\nbehaviors in the multi-behavior scenario of platform. Task correlation is an\nimportant consideration of MTL goals, traditional models use shared-bottom\nmodels and gating experts to realize shared representation learning and\ninformation differentiation. However, the relationships between real-world tasks\nare often more complex than existing methods can properly handle when sharing\ninformation. In this paper, we propose a Different Expression Parallel\nHeterogeneous Network (DEPHN) to model multiple tasks simultaneously. DEPHN\nconstructs the experts at the bottom of the model by using different feature\ninteraction methods to improve the generalization ability of the shared\ninformation flow. In view of the model's differentiating ability for different\ntask information flows, DEPHN uses feature explicit mapping and virtual\ngradient coefficient for expert gating during the training process, and\nadaptively adjusts the learning intensity of the gated unit by considering the\ndifference of gating values and task correlation. Extensive experiments on\nartificial and real-world datasets demonstrate that our proposed method can\ncapture task correlation in complex situations and achieve better performance\nthan baseline models\\footnote{Accepted in IJCNN2023}.\n","authors":["Menglin Kong","Ri Su","Shaojie Zhao","Muzhou Hou"],"pdf_url":"https://arxiv.org/pdf/2307.12519v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12518v1","updated":"2023-07-24T04:23:08Z","published":"2023-07-24T04:23:08Z","title":"FaFCNN: A General Disease Classification Framework Based on Feature\n Fusion Neural Networks","summary":" There are two fundamental problems in applying deep learning/machine learning\nmethods to disease classification tasks, one is the insufficient number and\npoor quality of training samples; another one is how to effectively fuse\nmultiple source features and thus train robust classification models. To\naddress these problems, inspired by the process of human learning knowledge, we\npropose the Feature-aware Fusion Correlation Neural Network (FaFCNN), which\nintroduces a feature-aware interaction module and a feature alignment module\nbased on domain adversarial learning. This is a general framework for disease\nclassification, and FaFCNN improves the way existing methods obtain sample\ncorrelation features. The experimental results show that training using\naugmented features obtained by pre-training gradient boosting decision tree\nyields more performance gains than random-forest based methods. 
On the\nlow-quality dataset with a large amount of missing data in our setup, FaFCNN\nobtains a consistently optimal performance compared to competitive baselines.\nIn addition, extensive experiments demonstrate the robustness of the proposed\nmethod and the effectiveness of each component of the model\\footnote{Accepted\nin IEEE SMC2023}.\n","authors":["Menglin Kong","Shaojie Zhao","Juan Cheng","Xingquan Li","Ri Su","Muzhou Hou","Cong Cao"],"pdf_url":"https://arxiv.org/pdf/2307.12518v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12510v1","updated":"2023-07-24T03:52:11Z","published":"2023-07-24T03:52:11Z","title":"An Empirical Evaluation of Temporal Graph Benchmark","summary":" In this paper, we conduct an empirical evaluation of Temporal Graph Benchmark\n(TGB) by extending our Dynamic Graph Library (DyGLib) to TGB. Compared with\nTGB, we include eleven popular dynamic graph learning methods for more\nexhaustive comparisons. Through the experiments, we find that (1) some issues\nneed to be addressed in the current version of TGB, including mismatched data\nstatistics, inaccurate evaluation metric computation, and so on; (2) different\nmodels depict varying performance across various datasets, which is in line\nwith previous observations; (3) the performance of some baselines can be\nsignificantly improved over the reported results in TGB when using DyGLib. This\nwork aims to ease the researchers' efforts in evaluating various dynamic graph\nlearning methods on TGB and attempts to offer results that can be directly\nreferenced in the follow-up research. All the used resources in this project\nare publicly available at https://github.com/yule-BUAA/DyGLib_TGB. This work is\nin progress, and feedback from the community is welcomed for improvements.\n","authors":["Le Yu"],"pdf_url":"https://arxiv.org/pdf/2307.12510v1.pdf","comment":"preprint, in progress"},{"id":"http://arxiv.org/abs/2304.03483v2","updated":"2023-07-24T03:28:34Z","published":"2023-04-07T05:29:59Z","title":"RED-PSM: Regularization by Denoising of Partially Separable Models for\n Dynamic Imaging","summary":" Dynamic imaging addresses the recovery of a time-varying 2D or 3D object at\neach time instant using its undersampled measurements. In particular, in the\ncase of dynamic tomography, only a single projection at a single view angle may\nbe available at a time, making the problem severely ill-posed. In this work, we\npropose an approach, RED-PSM, which combines for the first time two powerful\ntechniques to address this challenging imaging problem. The first, are\npartially separable models, which have been used to efficiently introduce a\nlow-rank prior for the spatio-temporal object. The second is the recent\nRegularization by Denoising (RED), which provides a flexible framework to\nexploit the impressive performance of state-of-the-art image denoising\nalgorithms, for various inverse problems. We propose a partially separable\nobjective with RED and a computationally efficient and scalable optimization\nscheme with variable splitting and ADMM. Theoretical analysis proves the\nconvergence of our objective to a value corresponding to a stationary point\nsatisfying the first-order optimality conditions. Convergence is accelerated by\na particular projection-domain-based initialization. We demonstrate the\nperformance and computational improvements of our proposed RED-PSM with a\nlearned image denoiser by comparing it to a recent deep-prior-based method\nknown as TD-DIP. 
Although the main focus is on dynamic tomography, we also show\nthe performance advantages of RED-PSM in a cardiac dynamic MRI setting.\n","authors":["Berk Iskender","Marc L. Klasky","Yoram Bresler"],"pdf_url":"https://arxiv.org/pdf/2304.03483v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12499v1","updated":"2023-07-24T03:10:02Z","published":"2023-07-24T03:10:02Z","title":"AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion\n Models","summary":" Unrestricted adversarial attacks present a serious threat to deep learning\nmodels and adversarial defense techniques. They pose severe security problems\nfor deep learning applications because they can effectively bypass defense\nmechanisms. However, previous attack methods often utilize Generative\nAdversarial Networks (GANs), which are not theoretically provable and thus\ngenerate unrealistic examples by incorporating adversarial objectives,\nespecially for large-scale datasets like ImageNet. In this paper, we propose a\nnew method, called AdvDiff, to generate unrestricted adversarial examples with\ndiffusion models. We design two novel adversarial guidance techniques to\nconduct adversarial sampling in the reverse generation process of diffusion\nmodels. These two techniques are effective and stable to generate high-quality,\nrealistic adversarial examples by integrating gradients of the target\nclassifier interpretably. Experimental results on MNIST and ImageNet datasets\ndemonstrate that AdvDiff is effective to generate unrestricted adversarial\nexamples, which outperforms GAN-based methods in terms of attack performance\nand generation quality.\n","authors":["Xuelong Dai","Kaisheng Liang","Bin Xiao"],"pdf_url":"https://arxiv.org/pdf/2307.12499v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12496v1","updated":"2023-07-24T03:04:10Z","published":"2023-07-24T03:04:10Z","title":"A faster and simpler algorithm for learning shallow networks","summary":" We revisit the well-studied problem of learning a linear combination of $k$\nReLU activations given labeled examples drawn from the standard $d$-dimensional\nGaussian measure. Chen et al. [CDG+23] recently gave the first algorithm for\nthis problem to run in $\\text{poly}(d,1/\\varepsilon)$ time when $k = O(1)$,\nwhere $\\varepsilon$ is the target error. More precisely, their algorithm runs\nin time $(d/\\varepsilon)^{\\mathrm{quasipoly}(k)}$ and learns over multiple\nstages. Here we show that a much simpler one-stage version of their algorithm\nsuffices, and moreover its runtime is only $(d/\\varepsilon)^{O(k^2)}$.\n","authors":["Sitan Chen","Shyam Narayanan"],"pdf_url":"https://arxiv.org/pdf/2307.12496v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2307.12491v1","updated":"2023-07-24T02:50:19Z","published":"2023-07-24T02:50:19Z","title":"Learning Universal and Robust 3D Molecular Representations with Graph\n Convolutional Networks","summary":" To learn accurate representations of molecules, it is essential to consider\nboth chemical and geometric features. To encode geometric information, many\ndescriptors have been proposed in constrained circumstances for specific types\nof molecules and do not have the properties to be ``robust\": 1. Invariant to\nrotations and translations; 2. Injective when embedding molecular structures.\nIn this work, we propose a universal and robust Directional Node Pair (DNP)\ndescriptor based on the graph representations of 3D molecules. 
Our DNP\ndescriptor is robust compared to previous ones and can be applied to multiple\nmolecular types. To combine the DNP descriptor and chemical features in\nmolecules, we construct the Robust Molecular Graph Convolutional Network\n(RoM-GCN) which is capable to take both node and edge features into\nconsideration when generating molecule representations. We evaluate our model\non protein and small molecule datasets. Our results validate the superiority of\nthe DNP descriptor in incorporating 3D geometric information of molecules.\nRoM-GCN outperforms all compared baselines.\n","authors":["Shuo Zhang","Yang Liu","Li Xie","Lei Xie"],"pdf_url":"https://arxiv.org/pdf/2307.12491v1.pdf","comment":"Preprint. Work in progress"},{"id":"http://arxiv.org/abs/2307.01482v2","updated":"2023-07-24T02:40:29Z","published":"2023-07-04T05:19:19Z","title":"Nexus sine qua non: Essentially Connected Networks for Traffic\n Forecasting","summary":" Spatial-temporal graph neural networks (STGNNs) have become the de facto\nmodels for learning spatiotemporal representations of traffic flow. However,\nmodern STGNNs often contain superfluous or obscure components, along with\ncomplex techniques, posing significant challenges in terms of complexity and\nscalability. Such concerns prompt us to rethink the design of neural\narchitectures and to identify the key challenges in traffic forecasting as\nspatial-temporal contextualization. Here, we present an essentially connected\nmodel based on an efficient message-passing backbone, powered by learnable node\nembedding, without any complex sequential techniques such as TCNs, RNNs, and\nTransformers. Intriguingly, empirical results demonstrate how a simple and\nelegant model with contextualization capability compares favorably w.r.t. the\nstate-of-the-art with elaborate structures, while being much more interpretable\nand computationally efficient for traffic forecasting. We anticipate that our\nfindings will open new horizons for further research to explore the possibility\nof creating simple but effective neural forecasting architectures.\n","authors":["Tong Nie","Guoyang Qin","Yunpeng Wang","Jian Sun"],"pdf_url":"https://arxiv.org/pdf/2307.01482v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.04893v2","updated":"2023-07-24T02:38:09Z","published":"2023-07-10T20:31:23Z","title":"Choosing Well Your Opponents: How to Guide the Synthesis of Programmatic\n Strategies","summary":" This paper introduces Local Learner (2L), an algorithm for providing a set of\nreference strategies to guide the search for programmatic strategies in\ntwo-player zero-sum games. Previous learning algorithms, such as Iterated Best\nResponse (IBR), Fictitious Play (FP), and Double-Oracle (DO), can be\ncomputationally expensive or miss important information for guiding search\nalgorithms. 2L actively selects a set of reference strategies to improve the\nsearch signal. We empirically demonstrate the advantages of our approach while\nguiding a local search algorithm for synthesizing strategies in three games,\nincluding MicroRTS, a challenging real-time strategy game. Results show that 2L\nlearns reference strategies that provide a stronger search signal than IBR, FP,\nand DO. We also simulate a tournament of MicroRTS, where a synthesizer using 2L\noutperformed the winners of the two latest MicroRTS competitions, which were\nprogrammatic strategies written by human programmers.\n","authors":["Rubens O. Moraes","David S. Aleixo","Lucas N. Ferreira","Levi H. S. 
Lelis"],"pdf_url":"https://arxiv.org/pdf/2307.04893v2.pdf","comment":"International Joint Conference on Artificial Intelligence (IJCAI)\n 2023"},{"id":"http://arxiv.org/abs/2307.12480v1","updated":"2023-07-24T02:28:50Z","published":"2023-07-24T02:28:50Z","title":"Learning Resource Allocation Policy: Vertex-GNN or Edge-GNN?","summary":" Graph neural networks (GNNs) update the hidden representations of vertices\n(called Vertex-GNNs) or hidden representations of edges (called Edge-GNNs) by\nprocessing and pooling the information of neighboring vertices and edges and\ncombining to incorporate graph topology. When learning resource allocation\npolicies, GNNs cannot perform well if their expressive power are weak, i.e., if\nthey cannot differentiate all input features such as channel matrices. In this\npaper, we analyze the expressive power of the Vertex-GNNs and Edge-GNNs for\nlearning three representative wireless policies: link scheduling, power\ncontrol, and precoding policies. We find that the expressive power of the GNNs\ndepend on the linearity and output dimensions of the processing and combination\nfunctions. When linear processors are used, the Vertex-GNNs cannot\ndifferentiate all channel matrices due to the loss of channel information,\nwhile the Edge-GNNs can. When learning the precoding policy, even the\nVertex-GNNs with non-linear processors may not be with strong expressive\nability due to the dimension compression. We proceed to provide necessary\nconditions for the GNNs to well learn the precoding policy. Simulation results\nvalidate the analyses and show that the Edge-GNNs can achieve the same\nperformance as the Vertex-GNNs with much lower training and inference time.\n","authors":["Yao Peng","Jia Guo","Chenyang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.12480v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.16392v2","updated":"2023-07-24T02:05:50Z","published":"2022-10-28T20:13:00Z","title":"Physics-aware Graph Neural Network for Accurate RNA 3D Structure\n Prediction","summary":" Biological functions of RNAs are determined by their three-dimensional (3D)\nstructures. Thus, given the limited number of experimentally determined RNA\nstructures, the prediction of RNA structures will facilitate elucidating RNA\nfunctions and RNA-targeted drug discovery, but remains a challenging task. In\nthis work, we propose a Graph Neural Network (GNN)-based scoring function\ntrained only with the atomic types and coordinates on limited solved RNA 3D\nstructures for distinguishing accurate structural models. The proposed\nPhysics-aware Multiplex Graph Neural Network (PaxNet) separately models the\nlocal and non-local interactions inspired by molecular mechanics. Furthermore,\nPaxNet contains an attention-based fusion module that learns the individual\ncontribution of each interaction type for the final prediction. We rigorously\nevaluate the performance of PaxNet on two benchmarks and compare it with\nseveral state-of-the-art baselines. The results show that PaxNet significantly\noutperforms all the baselines overall, and demonstrate the potential of PaxNet\nfor improving the 3D structure modeling of RNA and other macromolecules. 
Our\ncode is available at https://github.com/zetayue/Physics-aware-Multiplex-GNN.\n","authors":["Shuo Zhang","Yang Liu","Lei Xie"],"pdf_url":"https://arxiv.org/pdf/2210.16392v2.pdf","comment":"Accepted by the Machine Learning for Structural Biology Workshop\n (MLSB) at the 36th Conference on Neural Information Processing Systems\n (NeurIPS 2022)"},{"id":"http://arxiv.org/abs/2307.12472v1","updated":"2023-07-24T01:58:48Z","published":"2023-07-24T01:58:48Z","title":"Model-free generalized fiducial inference","summary":" Motivated by the need for the development of safe and reliable methods for\nuncertainty quantification in machine learning, I propose and develop ideas for\na model-free statistical framework for imprecise probabilistic prediction\ninference. This framework facilitates uncertainty quantification in the form of\nprediction sets that offer finite-sample control of type 1 errors, a property\nshared with conformal prediction sets, but this new approach also offers more\nversatile tools for imprecise probabilistic reasoning. Furthermore, I propose\nand consider the theoretical and empirical properties of a precise\nprobabilistic approximation to the model-free imprecise framework.\nApproximating a belief/plausibility measure pair by an [optimal in some sense]\nprobability measure in the credal set is a critical resolution needed for the\nbroader adoption of imprecise probabilistic approaches to inference in the\nstatistical and machine learning communities. More generally, it remains largely\nundetermined in the statistical and machine learning literatures how to properly\nquantify uncertainty, in that there is no generally accepted standard of\naccountability for stated uncertainties. The research I present in this\nmanuscript is aimed at motivating a framework for statistical inference with\nreliability and accountability as the guiding principles.\n","authors":["Jonathan P Williams"],"pdf_url":"https://arxiv.org/pdf/2307.12472v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12463v1","updated":"2023-07-24T00:53:46Z","published":"2023-07-24T00:53:46Z","title":"Rethinking Data Distillation: Do Not Overlook Calibration","summary":" Neural networks trained on distilled data often produce over-confident outputs\nand require correction by calibration methods. Existing calibration methods\nsuch as temperature scaling and mixup work well for networks trained on\noriginal large-scale data. However, we find that these methods fail to\ncalibrate networks trained on data distilled from large source datasets. In\nthis paper, we show that distilled data lead to networks that are not\ncalibratable due to (i) a more concentrated distribution of the maximum logits\nand (ii) the loss of information that is semantically meaningful but unrelated\nto classification tasks.
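For context on the calibration abstract above: the standard temperature-scaling baseline it mentions fits a single scalar temperature on held-out logits. The sketch below is a minimal NumPy version of that baseline (using grid search over T rather than gradient descent, with toy data); it is not the paper's proposed Masked Temperature Scaling.

```python
# Minimal sketch of standard temperature scaling (the baseline the abstract
# refers to), NOT the paper's Masked Temperature Scaling. A single scalar T is
# fit on held-out logits by grid search over the validation negative log-likelihood.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of the labels under temperature-scaled probabilities."""
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Return the temperature that minimizes the validation NLL."""
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))

# Toy usage: a classifier that is ~70% accurate but nearly 100% confident
# should be softened, i.e. the fitted temperature should come out above 1.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=512)
peak = np.where(rng.random(512) < 0.7, labels, rng.integers(0, 10, size=512))
logits = 8.0 * np.eye(10)[peak] + rng.normal(size=(512, 10))
print("fitted temperature:", fit_temperature(logits, labels))
```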
To address this problem, we propose Masked Temperature\nScaling (MTS) and Masked Distillation Training (MDT) which mitigate the\nlimitations of distilled data and achieve better calibration results while\nmaintaining the efficiency of dataset distillation.\n","authors":["Dongyao Zhu","Bowen Lei","Jie Zhang","Yanbo Fang","Ruqi Zhang","Yiqun Xie","Dongkuan Xu"],"pdf_url":"https://arxiv.org/pdf/2307.12463v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2307.12461v1","updated":"2023-07-24T00:16:50Z","published":"2023-07-24T00:16:50Z","title":"Rates of Approximation by ReLU Shallow Neural Networks","summary":" Neural networks activated by the rectified linear unit (ReLU) play a central\nrole in the recent development of deep learning. The topic of approximating\nfunctions from H\\\"older spaces by these networks is crucial for understanding\nthe efficiency of the induced learning algorithms. Although the topic has been\nwell investigated in the setting of deep neural networks with many layers of\nhidden neurons, it is still open for shallow networks having only one hidden\nlayer. In this paper, we provide rates of uniform approximation by these\nnetworks. We show that ReLU shallow neural networks with $m$ hidden neurons can\nuniformly approximate functions from the H\\\"older space $W_\\infty^r([-1, 1]^d)$\nwith rates $O((\\log m)^{\\frac{1}{2} +d}m^{-\\frac{r}{d}\\frac{d+2}{d+4}})$ when\n$r