diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/cache.json b/cache.json new file mode 100644 index 00000000..821c9f5c --- /dev/null +++ b/cache.json @@ -0,0 +1 @@ +{"2023-08-07T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2212.09597v6","updated":"2023-08-07T17:50:52Z","published":"2022-12-19T16:32:42Z","title":"Reasoning with Language Model Prompting: A Survey","summary":" Reasoning, as an essential ability for complex problem-solving, can provide\nback-end support for various real-world applications, such as medical\ndiagnosis, negotiation, etc. This paper provides a comprehensive survey of\ncutting-edge research on reasoning with language model prompting. We introduce\nresearch works with comparisons and summaries and provide systematic resources\nto help beginners. We also discuss the potential reasons for emerging such\nreasoning abilities and highlight future research directions. Resources are\navailable at https://github.com/zjunlp/Prompt4ReasoningPapers (updated\nperiodically).\n","authors":["Shuofei Qiao","Yixin Ou","Ningyu Zhang","Xiang Chen","Yunzhi Yao","Shumin Deng","Chuanqi Tan","Fei Huang","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2212.09597v6.pdf","comment":"ACL 2023, fixed Equation 2"},{"id":"http://arxiv.org/abs/2308.03742v1","updated":"2023-08-07T17:46:49Z","published":"2023-08-07T17:46:49Z","title":"What about translation? New coding system for content analysis on the\n perception of literary translation around the political transformation in\n 1989 in Hungary as a classification problem on an unbalanced dataset","summary":" To track trends in the perception of literary translation around the\npolitical transformation in 1989 in Hungary, a coding system was developed on\nthe paragraphs of the 1980-1999 issues of the literary journal Alf\\\"old. This\npaper describes how we trained BERT models to carry over the coding system to\nthe 1980-1999 issues of the literary journal Nagyvil\\'ag. We use extensive\nhyperparameter tuning, loss functions robust to label unbalance, 10-fold\ncross-validation for precise evaluations and a model ensemble for prediction,\nmanual validation on the predict set, a new calibration method to better\npredict label counts for sections of the Nagyvil\\'ag corpus, and to study the\nrelations between labels, we construct label relation networks.\n","authors":["Dalma Galambos","Pál Zsámboki"],"pdf_url":"https://arxiv.org/pdf/2308.03742v1.pdf","comment":"14 pages, 7 figures"},{"id":"http://arxiv.org/abs/2301.09656v3","updated":"2023-08-07T17:40:40Z","published":"2023-01-23T19:00:02Z","title":"Selective Explanations: Leveraging Human Input to Align Explainable AI","summary":" While a vast collection of explainable AI (XAI) algorithms have been\ndeveloped in recent years, they are often criticized for significant gaps with\nhow humans produce and consume explanations. As a result, current XAI\ntechniques are often found to be hard to use and lack effectiveness. In this\nwork, we attempt to close these gaps by making AI explanations selective -- a\nfundamental property of human explanations -- by selectively presenting a\nsubset from a large set of model reasons based on what aligns with the\nrecipient's preferences. We propose a general framework for generating\nselective explanations by leveraging human input on a small sample. This\nframework opens up a rich design space that accounts for different selectivity\ngoals, types of input, and more. 
As a showcase, we use a decision-support task\nto explore selective explanations based on what the decision-maker would\nconsider relevant to the decision task. We conducted two experimental studies\nto examine three out of a broader possible set of paradigms based on our\nproposed framework: in Study 1, we ask the participants to provide their own\ninput to generate selective explanations, with either open-ended or\ncritique-based input. In Study 2, we show participants selective explanations\nbased on input from a panel of similar users (annotators). Our experiments\ndemonstrate the promise of selective explanations in reducing over-reliance on\nAI and improving decision outcomes and subjective perceptions of the AI, but\nalso paint a nuanced picture that attributes some of these positive effects to\nthe opportunity to provide one's own input to augment AI explanations. Overall,\nour work proposes a novel XAI framework inspired by human communication\nbehaviors and demonstrates its potentials to encourage future work to better\nalign AI explanations with human production and consumption of explanations.\n","authors":["Vivian Lai","Yiming Zhang","Chacha Chen","Q. Vera Liao","Chenhao Tan"],"pdf_url":"https://arxiv.org/pdf/2301.09656v3.pdf","comment":"21 pages, 25 figures"},{"id":"http://arxiv.org/abs/2307.14361v2","updated":"2023-08-07T17:09:07Z","published":"2023-07-24T21:01:46Z","title":"A Hybrid Machine Learning Model for Classifying Gene Mutations in Cancer\n using LSTM, BiLSTM, CNN, GRU, and GloVe","summary":" This study presents an ensemble model combining LSTM, BiLSTM, CNN, GRU, and\nGloVe to classify gene mutations using Kaggle's Personalized Medicine:\nRedefining Cancer Treatment dataset. The results were compared against\nwell-known transformers like as BERT, Electra, Roberta, XLNet, Distilbert, and\ntheir LSTM ensembles. Our model outperformed all other models in terms of\naccuracy, precision, recall, F1 score, and Mean Squared Error. Surprisingly, it\nalso needed less training time, resulting in a perfect combination of\nperformance and efficiency. This study demonstrates the utility of ensemble\nmodels for difficult tasks such as gene mutation classification.\n","authors":["Sanad Aburass","Osama Dorgham","Jamil Al Shaqsi"],"pdf_url":"https://arxiv.org/pdf/2307.14361v2.pdf","comment":"6 pages, 7 figures and 2 tables"},{"id":"http://arxiv.org/abs/2308.03688v1","updated":"2023-08-07T16:08:11Z","published":"2023-08-07T16:08:11Z","title":"AgentBench: Evaluating LLMs as Agents","summary":" Large Language Models (LLMs) are becoming increasingly smart and autonomous,\ntargeting real-world pragmatic missions beyond traditional NLP tasks. As a\nresult, there has been an urgent need to evaluate LLMs as agents on challenging\ntasks in interactive environments. We present AgentBench, a multi-dimensional\nevolving benchmark that currently consists of 8 distinct environments to assess\nLLM-as-Agent's reasoning and decision-making abilities in a multi-turn\nopen-ended generation setting. Our extensive test over 25 LLMs (including APIs\nand open-sourced models) shows that, while top commercial LLMs present a strong\nability of acting as agents in complex environments, there is a significant\ndisparity in performance between them and open-sourced competitors. It also\nserves as a component of an ongoing project with wider coverage and deeper\nconsideration towards systematic LLM evaluation. 
Datasets, environments, and an\nintegrated evaluation package for AgentBench are released at\nhttps://github.com/THUDM/AgentBench\n","authors":["Xiao Liu","Hao Yu","Hanchen Zhang","Yifan Xu","Xuanyu Lei","Hanyu Lai","Yu Gu","Hangliang Ding","Kaiwen Men","Kejuan Yang","Shudan Zhang","Xiang Deng","Aohan Zeng","Zhengxiao Du","Chenhui Zhang","Sheng Shen","Tianjun Zhang","Yu Su","Huan Sun","Minlie Huang","Yuxiao Dong","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2308.03688v1.pdf","comment":"38 pages"},{"id":"http://arxiv.org/abs/2308.03660v1","updated":"2023-08-07T15:20:20Z","published":"2023-08-07T15:20:20Z","title":"Detecting Spells in Fantasy Literature with a Transformer Based\n Artificial Intelligence","summary":" Transformer architectures and models have made significant progress in\nlanguage-based tasks. In this area, is BERT one of the most widely used and\nfreely available transformer architecture. In our work, we use BERT for\ncontext-based phrase recognition of magic spells in the Harry Potter novel\nseries. Spells are a common part of active magic in fantasy novels. Typically,\nspells are used in a specific context to achieve a supernatural effect. A\nseries of investigations were conducted to see if a Transformer architecture\ncould recognize such phrases based on their context in the Harry Potter saga.\nFor our studies a pre-trained BERT model was used and fine-tuned utilising\ndifferent datasets and training methods to identify the searched context. By\nconsidering different approaches for sequence classification as well as token\nclassification, it is shown that the context of spells can be recognised.\nAccording to our investigations, the examined sequence length for fine-tuning\nand validation of the model plays a significant role in context recognition.\nBased on this, we have investigated whether spells have overarching properties\nthat allow a transfer of the neural network models to other fantasy universes\nas well. The application of our model showed promising results and is worth to\nbe deepened in subsequent studies.\n","authors":["Marcel Moravek","Alexander Zender","Andreas Müller"],"pdf_url":"https://arxiv.org/pdf/2308.03660v1.pdf","comment":"18 pages, 11 figures, 13 tables"},{"id":"http://arxiv.org/abs/2308.03656v1","updated":"2023-08-07T15:18:30Z","published":"2023-08-07T15:18:30Z","title":"Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using\n EmotionBench","summary":" Recently, the community has witnessed the advancement of Large Language\nModels (LLMs), which have shown remarkable performance on various downstream\ntasks. Led by powerful models like ChatGPT and Claude, LLMs are revolutionizing\nhow users engage with software, assuming more than mere tools but intelligent\nassistants. Consequently, evaluating LLMs' anthropomorphic capabilities becomes\nincreasingly important in contemporary discourse. Utilizing the emotion\nappraisal theory from psychology, we propose to evaluate the empathy ability of\nLLMs, i.e., how their feelings change when presented with specific situations.\nAfter a careful and comprehensive survey, we collect a dataset containing over\n400 situations that have proven effective in eliciting the eight emotions\ncentral to our study. Categorizing the situations into 36 factors, we conduct a\nhuman evaluation involving more than 1,200 subjects worldwide. 
With the human\nevaluation results as references, our evaluation includes five LLMs, covering\nboth commercial and open-source models, including variations in model sizes,\nfeaturing the latest iterations, such as GPT-4 and LLaMA 2. A conclusion can be\ndrawn from the results that, despite several misalignments, LLMs can generally\nrespond appropriately to certain situations. Nevertheless, they fall short in\nalignment with the emotional behaviors of human beings and cannot establish\nconnections between similar situations. Our collected dataset of situations,\nthe human evaluation results, and the code of our testing framework, dubbed\nEmotionBench, is made publicly in https://github.com/CUHK-ARISE/EmotionBench.\nWe aspire to contribute to the advancement of LLMs regarding better alignment\nwith the emotional behaviors of human beings, thereby enhancing their utility\nand applicability as intelligent assistants.\n","authors":["Jen-tse Huang","Man Ho Lam","Eric John Li","Shujie Ren","Wenxuan Wang","Wenxiang Jiao","Zhaopeng Tu","Michael R. Lyu"],"pdf_url":"https://arxiv.org/pdf/2308.03656v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2308.00121v2","updated":"2023-08-07T14:57:11Z","published":"2023-07-24T19:59:22Z","title":"Getting pwn'd by AI: Penetration Testing with Large Language Models","summary":" The field of software security testing, more specifically penetration\ntesting, is an activity that requires high levels of expertise and involves\nmany manual testing and analysis steps. This paper explores the potential usage\nof large-language models, such as GPT3.5, to augment penetration testers with\nAI sparring partners. We explore the feasibility of supplementing penetration\ntesters with AI models for two distinct use cases: high-level task planning for\nsecurity testing assignments and low-level vulnerability hunting within a\nvulnerable virtual machine. For the latter, we implemented a closed-feedback\nloop between LLM-generated low-level actions with a vulnerable virtual machine\n(connected through SSH) and allowed the LLM to analyze the machine state for\nvulnerabilities and suggest concrete attack vectors which were automatically\nexecuted within the virtual machine. We discuss promising initial results,\ndetail avenues for improvement, and close deliberating on the ethics of\nproviding AI-based sparring partners.\n","authors":["Andreas Happe","Jürgen Cito"],"pdf_url":"https://arxiv.org/pdf/2308.00121v2.pdf","comment":"5 pages, 1 figure, vision paper FSE'23"},{"id":"http://arxiv.org/abs/2308.03638v1","updated":"2023-08-07T14:42:49Z","published":"2023-08-07T14:42:49Z","title":"KITLM: Domain-Specific Knowledge InTegration into Language Models for\n Question Answering","summary":" Large language models (LLMs) have demonstrated remarkable performance in a\nwide range of natural language tasks. However, as these models continue to grow\nin size, they face significant challenges in terms of computational costs.\nAdditionally, LLMs often lack efficient domain-specific understanding, which is\nparticularly crucial in specialized fields such as aviation and healthcare. To\nboost the domain-specific understanding, we propose, KITLM, a novel knowledge\nbase integration approach into language model through relevant information\ninfusion. By integrating pertinent knowledge, not only the performance of the\nlanguage model is greatly enhanced, but the model size requirement is also\nsignificantly reduced while achieving comparable performance. 
Our proposed\nknowledge-infused model surpasses the performance of both GPT-3.5-turbo and the\nstate-of-the-art knowledge infusion method, SKILL, achieving over 1.5 times\nimprovement in exact match scores on the MetaQA. KITLM showed a similar\nperformance boost in the aviation domain with AeroQA. The drastic performance\nimprovement of KITLM over the existing methods can be attributed to the\ninfusion of relevant knowledge while mitigating noise. In addition, we release\ntwo curated datasets to accelerate knowledge infusion research in specialized\nfields: a) AeroQA, a new benchmark dataset designed for multi-hop\nquestion-answering within the aviation domain, and b) Aviation Corpus, a\ndataset constructed from unstructured text extracted from the National\nTransportation Safety Board reports. Our research contributes to advancing the\nfield of domain-specific language understanding and showcases the potential of\nknowledge infusion techniques in improving the performance of language models\non question-answering.\n","authors":["Ankush Agarwal","Sakharam Gawade","Amar Prakash Azad","Pushpak Bhattacharyya"],"pdf_url":"https://arxiv.org/pdf/2308.03638v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03629v1","updated":"2023-08-07T14:36:03Z","published":"2023-08-07T14:36:03Z","title":"MedMine: Examining Pre-trained Language Models on Medication Mining","summary":" Automatic medication mining from clinical and biomedical text has become a\npopular topic due to its real impact on healthcare applications and the recent\ndevelopment of powerful language models (LMs). However, fully-automatic\nextraction models still face obstacles to be overcome such that they can be\ndeployed directly into clinical practice for better impacts. Such obstacles\ninclude their imbalanced performances on different entity types and clinical\nevents. In this work, we examine current state-of-the-art pre-trained language\nmodels (PLMs) on such tasks, via fine-tuning including the monolingual model\nMed7 and multilingual large language model (LLM) XLM-RoBERTa. We compare their\nadvantages and drawbacks using historical medication mining shared task data\nsets from n2c2-2018 challenges. We report the findings we get from these\nfine-tuning experiments such that they can facilitate future research on\naddressing them, for instance, how to combine their outputs, merge such models,\nor improve their overall accuracy by ensemble learning and data augmentation.\nMedMine is part of the M3 Initiative \\url{https://github.com/HECTA-UoM/M3}\n","authors":["Haifa Alrdahi","Lifeng Han","Hendrik Šuvalov","Goran Nenadic"],"pdf_url":"https://arxiv.org/pdf/2308.03629v1.pdf","comment":"Open Research Project. 7 pages, 1 figure, 5 tables"},{"id":"http://arxiv.org/abs/2308.03601v1","updated":"2023-08-07T14:04:15Z","published":"2023-08-07T14:04:15Z","title":"Negative Lexical Constraints in Neural Machine Translation","summary":" This paper explores negative lexical constraining in English to Czech neural\nmachine translation. Negative lexical constraining is used to prohibit certain\nwords or expressions in the translation produced by the neural translation\nmodel. We compared various methods based on modifying either the decoding\nprocess or the training data. The comparison was performed on two tasks:\nparaphrasing and feedback-based translation refinement. 
We also studied to\nwhich extent these methods \"evade\" the constraints presented to the model\n(usually in the dictionary form) by generating a different surface form of a\ngiven constraint.We propose a way to mitigate the issue through training with\nstemmed negative constraints to counter the model's ability to induce a variety\nof the surface forms of a word that can result in bypassing the constraint. We\ndemonstrate that our method improves the constraining, although the problem\nstill persists in many cases.\n","authors":["Josef Jon","Dušan Variš","Michal Novák","João Paulo Aires","Ondřej Bojar"],"pdf_url":"https://arxiv.org/pdf/2308.03601v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03582v1","updated":"2023-08-07T13:38:54Z","published":"2023-08-07T13:38:54Z","title":"WIKITIDE: A Wikipedia-Based Timestamped Definition Pairs Dataset","summary":" A fundamental challenge in the current NLP context, dominated by language\nmodels, comes from the inflexibility of current architectures to 'learn' new\ninformation. While model-centric solutions like continual learning or\nparameter-efficient fine tuning are available, the question still remains of\nhow to reliably identify changes in language or in the world. In this paper, we\npropose WikiTiDe, a dataset derived from pairs of timestamped definitions\nextracted from Wikipedia. We argue that such resource can be helpful for\naccelerating diachronic NLP, specifically, for training models able to scan\nknowledge resources for core updates concerning a concept, an event, or a named\nentity. Our proposed end-to-end method is fully automatic, and leverages a\nbootstrapping algorithm for gradually creating a high-quality dataset. Our\nresults suggest that bootstrapping the seed version of WikiTiDe leads to better\nfine-tuned models. We also leverage fine-tuned models in a number of downstream\ntasks, showing promising results with respect to competitive baselines.\n","authors":["Hsuvas Borkakoty","Luis Espinosa-Anke"],"pdf_url":"https://arxiv.org/pdf/2308.03582v1.pdf","comment":"Accepted by RANLP 2023 main conference"},{"id":"http://arxiv.org/abs/2308.03581v1","updated":"2023-08-07T13:37:05Z","published":"2023-08-07T13:37:05Z","title":"Towards Controllable Natural Language Inference through Lexical\n Inference Types","summary":" Explainable natural language inference aims to provide a mechanism to produce\nexplanatory (abductive) inference chains which ground claims to their\nsupporting premises. A recent corpus called EntailmentBank strives to advance\nthis task by explaining the answer to a question using an entailment tree\n\\cite{dalvi2021explaining}. They employ the T5 model to directly generate the\ntree, which can explain how the answer is inferred. However, it lacks the\nability to explain and control the generation of intermediate steps, which is\ncrucial for the multi-hop inference process. % One recent corpus,\nEntailmentBank, aims to push this task forward by explaining an answer to a\nquestion according to an entailment tree \\cite{dalvi2021explaining}. They\nemploy T5 to generate the tree directly, which can explain how the answer is\ninferred but cannot explain how the intermediate is generated, which is\nessential to the multi-hop inference process. In this work, we focus on\nproposing a controlled natural language inference architecture for\nmulti-premise explanatory inference. 
To improve control and enable explanatory\nanalysis over the generation, we define lexical inference types based on\nAbstract Meaning Representation (AMR) graph and modify the architecture of T5\nto learn a latent sentence representation (T5 bottleneck) conditioned on said\ntype information. We also deliver a dataset of approximately 5000 annotated\nexplanatory inference steps, with well-grounded lexical-symbolic operations.\nExperimental results indicate that the inference typing induced at the T5\nbottleneck can help T5 to generate a conclusion under explicit control.\n","authors":["Yingji Zhang","Danilo S. Carvalho","Ian Pratt-Hartmann","Andre Freitas"],"pdf_url":"https://arxiv.org/pdf/2308.03581v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12375v2","updated":"2023-08-07T13:22:01Z","published":"2023-07-23T16:54:41Z","title":"In-Context Learning in Large Language Models Learns Label Relationships\n but Is Not Conventional Learning","summary":" The performance of Large Language Models (LLMs) on downstream tasks often\nimproves significantly when including examples of the input-label relationship\nin the context. However, there is currently no consensus about how this\nin-context learning (ICL) ability of LLMs works: for example, while Xie et al.\n(2021) liken ICL to a general-purpose learning algorithm, Min et al. (2022b)\nargue ICL does not even learn label relationships from in-context examples. In\nthis paper, we study (1) how labels of in-context examples affect predictions,\n(2) how label relationships learned during pre-training interact with\ninput-label examples provided in-context, and (3) how ICL aggregates label\ninformation across in-context examples. Our findings suggests LLMs usually\nincorporate information from in-context labels, but that pre-training and\nin-context label relationships are treated differently, and that the model does\nnot consider all in-context information equally. Our results give insights into\nunderstanding and aligning LLM behavior.\n","authors":["Jannik Kossen","Tom Rainforth","Yarin Gal"],"pdf_url":"https://arxiv.org/pdf/2307.12375v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03565v1","updated":"2023-08-07T13:16:42Z","published":"2023-08-07T13:16:42Z","title":"Topological Interpretations of GPT-3","summary":" This is an experiential study of investigating a consistent method for\nderiving the correlation between sentence vector and semantic meaning of a\nsentence. We first used three state-of-the-art word/sentence embedding methods\nincluding GPT-3, Word2Vec, and Sentence-BERT, to embed plain text sentence\nstrings into high dimensional spaces. Then we compute the pairwise distance\nbetween any possible combination of two sentence vectors in an embedding space\nand map them into a matrix. Based on each distance matrix, we compute the\ncorrelation of distances of a sentence vector with respect to the other\nsentence vectors in an embedding space. Then we compute the correlation of each\npair of the distance matrices. We observed correlations of the same sentence in\ndifferent embedding spaces and correlations of different sentences in the same\nembedding space. 
These observations are consistent with our hypothesis and take\nus to the next stage.\n","authors":["Tianyi Sun","Bradley Nelson"],"pdf_url":"https://arxiv.org/pdf/2308.03565v1.pdf","comment":"70 pages"},{"id":"http://arxiv.org/abs/2308.03558v1","updated":"2023-08-07T13:10:35Z","published":"2023-08-07T13:10:35Z","title":"Mondrian: Prompt Abstraction Attack Against Large Language Models for\n Cheaper API Pricing","summary":" The Machine Learning as a Service (MLaaS) market is rapidly expanding and\nbecoming more mature. For example, OpenAI's ChatGPT is an advanced large\nlanguage model (LLM) that generates responses for various queries with\nassociated fees. Although these models can deliver satisfactory performance,\nthey are far from perfect. Researchers have long studied the vulnerabilities\nand limitations of LLMs, such as adversarial attacks and model toxicity.\nInevitably, commercial ML models are also not exempt from such issues, which\ncan be problematic as MLaaS continues to grow. In this paper, we discover a new\nattack strategy against LLM APIs, namely the prompt abstraction attack.\nSpecifically, we propose Mondrian, a simple and straightforward method that\nabstracts sentences, which can lower the cost of using LLM APIs. In this\napproach, the adversary first creates a pseudo API (with a lower established\nprice) to serve as the proxy of the target API (with a higher established\nprice). Next, the pseudo API leverages Mondrian to modify the user query,\nobtain the abstracted response from the target API, and forward it back to the\nend user. Our results show that Mondrian successfully reduces user queries'\ntoken length ranging from 13% to 23% across various tasks, including text\nclassification, generation, and question answering. Meanwhile, these abstracted\nqueries do not significantly affect the utility of task-specific and general\nlanguage models like ChatGPT. Mondrian also reduces instruction prompts' token\nlength by at least 11% without compromising output quality. As a result, the\nprompt abstraction attack enables the adversary to profit without bearing the\ncost of API development and deployment.\n","authors":["Wai Man Si","Michael Backes","Yang Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03558v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03549v1","updated":"2023-08-07T12:56:13Z","published":"2023-08-07T12:56:13Z","title":"Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language\n Model through Expert Feedback and Real-world Multi-turn Dialogue","summary":" Recent advances in Large Language Models (LLMs) have achieved remarkable\nbreakthroughs in understanding and responding to user intents. However, their\nperformance lag behind general use cases in some expertise domains, such as\nChinese medicine. Existing efforts to incorporate Chinese medicine into LLMs\nrely on Supervised Fine-Tuning (SFT) with single-turn and distilled dialogue\ndata. These models lack the ability for doctor-like proactive inquiry and\nmulti-turn comprehension and cannot always align responses with safety and\nprofessionalism experts. In this work, we introduce Zhongjing, the first\nChinese medical LLaMA-based LLM that implements an entire training pipeline\nfrom pre-training to reinforcement learning with human feedback (RLHF).\nAdditionally, we introduce a Chinese multi-turn medical dialogue dataset of\n70,000 authentic doctor-patient dialogues, CMtMedQA, which significantly\nenhances the model's capability for complex dialogue and proactive inquiry\ninitiation. 
We define a refined annotation rule and evaluation criteria given\nthe biomedical domain's unique characteristics. Results show that our model\noutperforms baselines in various capacities and matches the performance of\nChatGPT in a few abilities, despite having 50x training data with previous best\nmodel and 100x parameters with ChatGPT. RLHF further improves the model's\ninstruction-following ability and safety. We also release our code, datasets\nand model for further research.\n","authors":["Songhua Yang","Hanjia Zhao","Senbin Zhu","Guangyu Zhou","Hongfei Xu","Yuxiang Jia","Hongying Zan"],"pdf_url":"https://arxiv.org/pdf/2308.03549v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03531v1","updated":"2023-08-07T12:30:00Z","published":"2023-08-07T12:30:00Z","title":"Measuring Variety, Balance, and Disparity: An Analysis of Media Coverage\n of the 2021 German Federal Election","summary":" Determining and measuring diversity in news articles is important for a\nnumber of reasons, including preventing filter bubbles and fueling public\ndiscourse, especially before elections. So far, the identification and analysis\nof diversity have been illuminated in a variety of ways, such as measuring the\noverlap of words or topics between news articles related to US elections.\nHowever, the question of how diversity in news articles can be measured\nholistically, i.e., with respect to (1) variety, (2) balance, and (3)\ndisparity, considering individuals, parties, and topics, has not been\naddressed. In this paper, we present a framework for determining diversity in\nnews articles according to these dimensions. Furthermore, we create and provide\na dataset of Google Top Stories, encompassing more than 26,000 unique headlines\nfrom more than 900 news outlets collected within two weeks before and after the\n2021 German federal election. While we observe high diversity for more general\nsearch terms (e.g., \"election\"), a range of search terms (\"education,\"\n\"Europe,\" \"climate protection,\" \"government\") resulted in news articles with\nhigh diversity in two out of three dimensions. This reflects a more subjective,\ndedicated discussion on rather future-oriented topics.\n","authors":["Michael Färber","Jannik Schwade","Adam Jatowt"],"pdf_url":"https://arxiv.org/pdf/2308.03531v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03519v1","updated":"2023-08-07T12:13:25Z","published":"2023-08-07T12:13:25Z","title":"Vocab-Expander: A System for Creating Domain-Specific Vocabularies Based\n on Word Embeddings","summary":" In this paper, we propose Vocab-Expander at https://vocab-expander.com, an\nonline tool that enables end-users (e.g., technology scouts) to create and\nexpand a vocabulary of their domain of interest. It utilizes an ensemble of\nstate-of-the-art word embedding techniques based on web text and ConceptNet, a\ncommon-sense knowledge base, to suggest related terms for already given terms.\nThe system has an easy-to-use interface that allows users to quickly confirm or\nreject term suggestions. 
Vocab-Expander offers a variety of potential use\ncases, such as improving concept-based information retrieval in technology and\ninnovation management, enhancing communication and collaboration within\norganizations or interdisciplinary projects, and creating vocabularies for\nspecific courses in education.\n","authors":["Michael Färber","Nicholas Popovic"],"pdf_url":"https://arxiv.org/pdf/2308.03519v1.pdf","comment":"accepted at RANLP'23"},{"id":"http://arxiv.org/abs/2307.00925v4","updated":"2023-08-07T11:40:59Z","published":"2023-07-03T10:53:05Z","title":"Automatic Design of Semantic Similarity Ensembles Using Grammatical\n Evolution","summary":" Semantic similarity measures are widely used in natural language processing\nto catalyze various computer-related tasks. However, no single semantic\nsimilarity measure is the most appropriate for all tasks, and researchers often\nuse ensemble strategies to ensure performance. This research work proposes a\nmethod for automatically designing semantic similarity ensembles. In fact, our\nproposed method uses grammatical evolution, for the first time, to\nautomatically select and aggregate measures from a pool of candidates to create\nan ensemble that maximizes correlation to human judgment. The method is\nevaluated on several benchmark datasets and compared to state-of-the-art\nensembles, showing that it can significantly improve similarity assessment\naccuracy and outperform existing methods in some cases. As a result, our\nresearch demonstrates the potential of using grammatical evolution to\nautomatically compare text and prove the benefits of using ensembles for\nsemantic similarity tasks. The source code that illustrates our approach can be\ndownloaded from https://github.com/jorge-martinez-gil/sesige.\n","authors":["Jorge Martinez-Gil"],"pdf_url":"https://arxiv.org/pdf/2307.00925v4.pdf","comment":"29 pages"},{"id":"http://arxiv.org/abs/2211.08264v2","updated":"2023-08-07T11:22:16Z","published":"2022-11-15T16:14:39Z","title":"QAmeleon: Multilingual QA with Only 5 Examples","summary":" The availability of large, high-quality datasets has been one of the main\ndrivers of recent progress in question answering (QA). Such annotated datasets\nhowever are difficult and costly to collect, and rarely exist in languages\nother than English, rendering QA technology inaccessible to underrepresented\nlanguages. An alternative to building large monolingual training datasets is to\nleverage pre-trained language models (PLMs) under a few-shot learning setting.\nOur approach, QAmeleon, uses a PLM to automatically generate multilingual data\nupon which QA models are trained, thus avoiding costly annotation. Prompt\ntuning the PLM for data synthesis with only five examples per language delivers\naccuracy superior to translation-based baselines, bridges nearly 60% of the gap\nbetween an English-only baseline and a fully supervised upper bound trained on\nalmost 50,000 hand labeled examples, and always leads to substantial\nimprovements compared to fine-tuning a QA model directly on labeled examples in\nlow resource settings. 
Experiments on the TyDiQA-GoldP and MLQA benchmarks show\nthat few-shot prompt tuning for data synthesis scales across languages and is a\nviable alternative to large-scale annotation.\n","authors":["Priyanka Agrawal","Chris Alberti","Fantine Huot","Joshua Maynez","Ji Ma","Sebastian Ruder","Kuzman Ganchev","Dipanjan Das","Mirella Lapata"],"pdf_url":"https://arxiv.org/pdf/2211.08264v2.pdf","comment":"To Appear at Transactions of Association for Computational\n Linguistics (TACL)"},{"id":"http://arxiv.org/abs/2301.05880v2","updated":"2023-08-07T10:36:44Z","published":"2023-01-14T10:18:22Z","title":"TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real\n World","summary":" To facilitate the research on intelligent and human-like chatbots with\nmulti-modal context, we introduce a new video-based multi-modal dialogue\ndataset, called TikTalk. We collect 38K videos from a popular video-sharing\nplatform, along with 367K conversations posted by users beneath them. Users\nengage in spontaneous conversations based on their multi-modal experiences from\nwatching videos, which helps recreate real-world chitchat context. Compared to\nprevious multi-modal dialogue datasets, the richer context types in TikTalk\nlead to more diverse conversations, but also increase the difficulty in\ncapturing human interests from intricate multi-modal information to generate\npersonalized responses. Moreover, external knowledge is more frequently evoked\nin our dataset. These facts reveal new challenges for multi-modal dialogue\nmodels. We quantitatively demonstrate the characteristics of TikTalk, propose a\nvideo-based multi-modal chitchat task, and evaluate several dialogue baselines.\nExperimental results indicate that the models incorporating large language\nmodels (LLM) can generate more diverse responses, while the model utilizing\nknowledge graphs to introduce external knowledge performs the best overall.\nFurthermore, no existing model can solve all the above challenges well. There\nis still a large room for future improvements, even for LLM with visual\nextensions. Our dataset is available at\n\\url{https://ruc-aimind.github.io/projects/TikTalk/}.\n","authors":["Hongpeng Lin","Ludan Ruan","Wenke Xia","Peiyu Liu","Jingyuan Wen","Yixin Xu","Di Hu","Ruihua Song","Wayne Xin Zhao","Qin Jin","Zhiwu Lu"],"pdf_url":"https://arxiv.org/pdf/2301.05880v2.pdf","comment":"Accepted to ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2308.03449v1","updated":"2023-08-07T10:11:42Z","published":"2023-08-07T10:11:42Z","title":"Knowledge-preserving Pruning for Pre-trained Language Models without\n Retraining","summary":" Given a pre-trained language model, how can we efficiently compress it\nwithout retraining? Retraining-free structured pruning algorithms are crucial\nin pre-trained language model compression due to their significantly reduced\npruning cost and capability to prune large language models. However, existing\nretraining-free algorithms encounter severe accuracy degradation, as they fail\nto preserve the useful knowledge of pre-trained models. In this paper, we\npropose K-pruning (Knowledge-preserving pruning), an accurate retraining-free\nstructured pruning algorithm for pre-trained language models. K-pruning\nidentifies and prunes attention heads and neurons deemed to be superfluous,\nbased on the amount of their inherent knowledge. K-pruning applies an iterative\nprocess of pruning followed by knowledge reconstruction for each sub-layer to\npreserve the knowledge of the pre-trained models. 
Consequently, K-pruning shows\nup to 58.02%p higher F1 score than existing retraining-free pruning algorithms\nunder a high compression rate of 80% on the SQuAD benchmark.\n","authors":["Seungcheol Park","Hojun Choi","U Kang"],"pdf_url":"https://arxiv.org/pdf/2308.03449v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.01633v2","updated":"2023-08-07T09:54:55Z","published":"2023-05-02T17:46:12Z","title":"Missing Information, Unresponsive Authors, Experimental Flaws: The\n Impossibility of Assessing the Reproducibility of Previous Human Evaluations\n in NLP","summary":" We report our efforts in identifying a set of previous human evaluations in\nNLP that would be suitable for a coordinated study examining what makes human\nevaluations in NLP more/less reproducible. We present our results and findings,\nwhich include that just 13\\% of papers had (i) sufficiently low barriers to\nreproduction, and (ii) enough obtainable information, to be considered for\nreproduction, and that all but one of the experiments we selected for\nreproduction was discovered to have flaws that made the meaningfulness of\nconducting a reproduction questionable. As a result, we had to change our\ncoordinated study design from a reproduce approach to a\nstandardise-then-reproduce-twice approach. Our overall (negative) finding that\nthe great majority of human evaluations in NLP is not repeatable and/or not\nreproducible and/or too flawed to justify reproduction, paints a dire picture,\nbut presents an opportunity for a rethink about how to design and report human\nevaluations in NLP.\n","authors":["Anya Belz","Craig Thomson","Ehud Reiter","Gavin Abercrombie","Jose M. Alonso-Moral","Mohammad Arvan","Anouck Braggaar","Mark Cieliebak","Elizabeth Clark","Kees van Deemter","Tanvi Dinkar","Ondřej Dušek","Steffen Eger","Qixiang Fang","Mingqi Gao","Albert Gatt","Dimitra Gkatzia","Javier González-Corbelle","Dirk Hovy","Manuela Hürlimann","Takumi Ito","John D. Kelleher","Filip Klubicka","Emiel Krahmer","Huiyuan Lai","Chris van der Lee","Yiru Li","Saad Mahamood","Margot Mieskes","Emiel van Miltenburg","Pablo Mosteiro","Malvina Nissim","Natalie Parde","Ondřej Plátek","Verena Rieser","Jie Ruan","Joel Tetreault","Antonio Toral","Xiaojun Wan","Leo Wanner","Lewis Watson","Diyi Yang"],"pdf_url":"https://arxiv.org/pdf/2305.01633v2.pdf","comment":"5 pages plus appendix, 4 tables, 1 figure. To appear at \"Workshop on\n Insights from Negative Results in NLP\" (co-located with EACL2023). Updated\n author list and acknowledgements"},{"id":"http://arxiv.org/abs/2308.03429v1","updated":"2023-08-07T09:24:24Z","published":"2023-08-07T09:24:24Z","title":"RCMHA: Relative Convolutional Multi-Head Attention for Natural Language\n Modelling","summary":" The Attention module finds common usage in language modeling, presenting\ndistinct challenges within the broader scope of Natural Language Processing.\nMulti-Head Attention (MHA) employs an absolute positional encoding, which\nimposes limitations on token length and entails substantial memory consumption\nduring the processing of embedded inputs. The current remedy proposed by\nresearchers involves the utilization of relative positional encoding, similar\nto the approach adopted in Transformer-XL or Relative Multi-Head Attention\n(RMHA), albeit the employed architecture consumes considerable memory\nresources. 
To address these challenges, this study endeavors to refine MHA,\nleveraging relative positional encoding in conjunction with the Depth-Wise\nConvolutional Layer architecture, which promises heightened accuracy coupled\nwith minimized memory usage. The proposed RCMHA framework entails the\nmodification of two integral components: firstly, the application of the\nDepth-Wise Convolutional Layer to the input embedding, encompassing Query, Key,\nand Value parameters; secondly, the incorporation of Relative Positional\nEncoding into the attention scoring phase, harmoniously integrated with Scaled\nDot-Product Attention. Empirical experiments underscore the advantages of\nRCMHA, wherein it exhibits superior accuracy, boasting a score of 0.572 in\ncomparison to alternative attention modules such as MHA, Multi-DConv-Head\nAttention (MDHA), and RMHA. Concerning memory utilization, RMHA emerges as the\nmost frugal, demonstrating an average consumption of 2.98 GB, surpassing RMHA\nwhich necessitates 3.5 GB.\n","authors":["Herman Sugiharto"," Aradea","Husni Mubarok"],"pdf_url":"https://arxiv.org/pdf/2308.03429v1.pdf","comment":"13 pages, 13 figures, 6 tables"},{"id":"http://arxiv.org/abs/2308.03423v1","updated":"2023-08-07T09:19:59Z","published":"2023-08-07T09:19:59Z","title":"Boosting Chinese ASR Error Correction with Dynamic Error Scaling\n Mechanism","summary":" Chinese Automatic Speech Recognition (ASR) error correction presents\nsignificant challenges due to the Chinese language's unique features, including\na large character set and borderless, morpheme-based structure. Current\nmainstream models often struggle with effectively utilizing word-level features\nand phonetic information. This paper introduces a novel approach that\nincorporates a dynamic error scaling mechanism to detect and correct\nphonetically erroneous text generated by ASR output. This mechanism operates by\ndynamically fusing word-level features and phonetic information, thereby\nenriching the model with additional semantic data. Furthermore, our method\nimplements unique error reduction and amplification strategies to address the\nissues of matching wrong words caused by incorrect characters. Experimental\nresults indicate substantial improvements in ASR error correction,\ndemonstrating the effectiveness of our proposed method and yielding promising\nresults on established datasets.\n","authors":["Jiaxin Fan","Yong Zhang","Hanzhang Li","Jianzong Wang","Zhitao Li","Sheng Ouyang","Ning Cheng","Jing Xiao"],"pdf_url":"https://arxiv.org/pdf/2308.03423v1.pdf","comment":"Accepted by 24th Annual Conference of the International Speech\n Communication Association (INTERSPEECH 2023)"},{"id":"http://arxiv.org/abs/2306.11518v2","updated":"2023-08-07T09:17:43Z","published":"2023-06-20T13:12:58Z","title":"One model to rule them all: ranking Slovene summarizers","summary":" Text summarization is an essential task in natural language processing, and\nresearchers have developed various approaches over the years, ranging from\nrule-based systems to neural networks. However, there is no single model or\napproach that performs well on every type of text. We propose a system that\nrecommends the most suitable summarization model for a given text. The proposed\nsystem employs a fully connected neural network that analyzes the input content\nand predicts which summarizer should score the best in terms of ROUGE score for\na given input. 
The meta-model selects among four different summarization\nmodels, developed for the Slovene language, using different properties of the\ninput, in particular its Doc2Vec document representation. The four Slovene\nsummarization models deal with different challenges associated with text\nsummarization in a less-resourced language. We evaluate the proposed SloMetaSum\nmodel performance automatically and parts of it manually. The results show that\nthe system successfully automates the step of manually selecting the best\nmodel.\n","authors":["Aleš Žagar","Marko Robnik-Šikonja"],"pdf_url":"https://arxiv.org/pdf/2306.11518v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03422v1","updated":"2023-08-07T09:15:03Z","published":"2023-08-07T09:15:03Z","title":"Prompt Guided Copy Mechanism for Conversational Question Answering","summary":" Conversational Question Answering (CQA) is a challenging task that aims to\ngenerate natural answers for conversational flow questions. In this paper, we\npropose a pluggable approach for extractive methods that introduces a novel\nprompt-guided copy mechanism to improve the fluency and appropriateness of the\nextracted answers. Our approach uses prompts to link questions to answers and\nemploys attention to guide the copy mechanism to verify the naturalness of\nextracted answers, making necessary edits to ensure that the answers are fluent\nand appropriate. The three prompts, including a question-rationale relationship\nprompt, a question description prompt, and a conversation history prompt,\nenhance the copy mechanism's performance. Our experiments demonstrate that this\napproach effectively promotes the generation of natural answers and achieves\ngood results in the CoQA challenge.\n","authors":["Yong Zhang","Zhitao Li","Jianzong Wang","Yiming Gao","Ning Cheng","Fengying Yu","Jing Xiao"],"pdf_url":"https://arxiv.org/pdf/2308.03422v1.pdf","comment":"Accepted by 24th Annual Conference of the International Speech\n Communication Association (INTERSPEECH 2023)"},{"id":"http://arxiv.org/abs/2308.03421v1","updated":"2023-08-07T09:14:33Z","published":"2023-08-07T09:14:33Z","title":"RecycleGPT: An Autoregressive Language Model with Recyclable Module","summary":" Existing large language models have to run K times to generate a sequence of\nK tokens. In this paper, we present RecycleGPT, a generative language model\nwith fast decoding speed by recycling pre-generated model states without\nrunning the whole model in multiple steps. Our approach relies on the\nobservation that adjacent tokens in a sequence usually have strong correlations\nand the next token in a sequence can be reasonably guessed or inferred based on\nthe preceding ones. Through theoretical evaluations and practical tests on\ndownstream text generation tasks, we demonstrate the effectiveness of our\napproach in lowering inference latency, achieving up to 1.4x speedup while\npreserving high performance.\n","authors":["Yufan Jiang","Qiaozhi He","Xiaomin Zhuang","Zhihua Wu","Kunpeng Wang","Wenlai Zhao","Guangwen Yang"],"pdf_url":"https://arxiv.org/pdf/2308.03421v1.pdf","comment":"Technical Report"},{"id":"http://arxiv.org/abs/2307.10511v2","updated":"2023-08-07T09:08:23Z","published":"2023-07-20T00:36:41Z","title":"General Debiasing for Multimodal Sentiment Analysis","summary":" Existing work on Multimodal Sentiment Analysis (MSA) utilizes multimodal\ninformation for prediction yet unavoidably suffers from fitting the spurious\ncorrelations between multimodal features and sentiment labels. 
For example, if\nmost videos with a blue background have positive labels in a dataset, the model\nwill rely on such correlations for prediction, while \"blue background\" is not a\nsentiment-related feature. To address this problem, we define a general\ndebiasing MSA task, which aims to enhance the Out-Of-Distribution (OOD)\ngeneralization ability of MSA models by reducing their reliance on spurious\ncorrelations. To this end, we propose a general debiasing framework based on\nInverse Probability Weighting (IPW), which adaptively assigns small weights to\nthe samples with larger bias (i.e., the severer spurious correlations). The key\nto this debiasing framework is to estimate the bias of each sample, which is\nachieved by two steps: 1) disentangling the robust features and biased features\nin each modality, and 2) utilizing the biased features to estimate the bias.\nFinally, we employ IPW to reduce the effects of large-biased samples,\nfacilitating robust feature learning for sentiment prediction. To examine the\nmodel's generalization ability, we keep the original testing sets on two\nbenchmarks and additionally construct multiple unimodal and multimodal OOD\ntesting sets. The empirical results demonstrate the superior generalization\nability of our proposed framework. We have released the code and data to\nfacilitate the reproduction https://github.com/Teng-Sun/GEAR.\n","authors":["Teng Sun","Juntong Ni","Wenjie Wang","Liqiang Jing","Yinwei Wei","Liqiang Nie"],"pdf_url":"https://arxiv.org/pdf/2307.10511v2.pdf","comment":"Accepted at ACM MM 2023"},{"id":"http://arxiv.org/abs/2308.03415v1","updated":"2023-08-07T09:06:20Z","published":"2023-08-07T09:06:20Z","title":"End-to-End Evaluation for Low-Latency Simultaneous Speech Translation","summary":" The challenge of low-latency speech translation has recently draw significant\ninterest in the research community as shown by several publications and shared\ntasks. Therefore, it is essential to evaluate these different approaches in\nrealistic scenarios. However, currently only specific aspects of the systems\nare evaluated and often it is not possible to compare different approaches.\n In this work, we propose the first framework to perform and evaluate the\nvarious aspects of low-latency speech translation under realistic conditions.\nThe evaluation is carried out in an end-to-end fashion. This includes the\nsegmentation of the audio as well as the run-time of the different components.\n Secondly, we compare different approaches to low-latency speech translation\nusing this framework. We evaluate models with the option to revise the output\nas well as methods with fixed output. Furthermore, we directly compare\nstate-of-the-art cascaded as well as end-to-end systems. 
Finally, the framework\nallows to automatically evaluate the translation quality as well as latency and\nalso provides a web interface to show the low-latency model outputs to the\nuser.\n","authors":["Christian Huber","Tu Anh Dinh","Carlos Mullov","Ngoc Quan Pham","Thai Binh Nguyen","Fabian Retkowski","Stefan Constantin","Enes Yavuz Ugan","Danni Liu","Zhaolin Li","Sai Koneru","Jan Niehues","Alexander Waibel"],"pdf_url":"https://arxiv.org/pdf/2308.03415v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.08283v3","updated":"2023-08-07T08:32:54Z","published":"2022-12-16T05:10:09Z","title":"SceneGATE: Scene-Graph based co-Attention networks for TExt visual\n question answering","summary":" Most TextVQA approaches focus on the integration of objects, scene texts and\nquestion words by a simple transformer encoder. But this fails to capture the\nsemantic relations between different modalities. The paper proposes a Scene\nGraph based co-Attention Network (SceneGATE) for TextVQA, which reveals the\nsemantic relations among the objects, Optical Character Recognition (OCR)\ntokens and the question words. It is achieved by a TextVQA-based scene graph\nthat discovers the underlying semantics of an image. We created a\nguided-attention module to capture the intra-modal interplay between the\nlanguage and the vision as a guidance for inter-modal interactions. To make\nexplicit teaching of the relations between the two modalities, we proposed and\nintegrated two attention modules, namely a scene graph-based semantic\nrelation-aware attention and a positional relation-aware attention. We\nconducted extensive experiments on two benchmark datasets, Text-VQA and ST-VQA.\nIt is shown that our SceneGATE method outperformed existing ones because of the\nscene graph and its attention modules.\n","authors":["Feiqi Cao","Siwen Luo","Felipe Nunez","Zean Wen","Josiah Poon","Caren Han"],"pdf_url":"https://arxiv.org/pdf/2212.08283v3.pdf","comment":"Published in Robotics (Q1, SCI indexed Journal):\n https://www.mdpi.com/2218-6581/12/4/114"},{"id":"http://arxiv.org/abs/2207.14116v4","updated":"2023-08-07T07:54:45Z","published":"2022-07-28T14:30:06Z","title":"Claim-Dissector: An Interpretable Fact-Checking System with Joint\n Re-ranking and Veracity Prediction","summary":" We present Claim-Dissector: a novel latent variable model for fact-checking\nand analysis, which given a claim and a set of retrieved evidences jointly\nlearns to identify: (i) the relevant evidences to the given claim, (ii) the\nveracity of the claim. We propose to disentangle the per-evidence relevance\nprobability and its contribution to the final veracity probability in an\ninterpretable way -- the final veracity probability is proportional to a linear\nensemble of per-evidence relevance probabilities. In this way, the individual\ncontributions of evidences towards the final predicted probability can be\nidentified. In per-evidence relevance probability, our model can further\ndistinguish whether each relevant evidence is supporting (S) or refuting (R)\nthe claim. This allows to quantify how much the S/R probability contributes to\nthe final verdict or to detect disagreeing evidence.\n Despite its interpretable nature, our system achieves results competitive\nwith state-of-the-art on the FEVER dataset, as compared to typical two-stage\nsystem pipelines, while using significantly fewer parameters. It also sets new\nstate-of-the-art on FAVIQ and RealFC datasets. 
Furthermore, our analysis shows\nthat our model can learn fine-grained relevance cues while using coarse-grained\nsupervision, and we demonstrate it in 2 ways. (i) We show that our model can\nachieve competitive sentence recall while using only paragraph-level relevance\nsupervision. (ii) Traversing towards the finest granularity of relevance, we\nshow that our model is capable of identifying relevance at the token level. To\ndo this, we present a new benchmark TLR-FEVER focusing on token-level\ninterpretability -- humans annotate tokens in relevant evidences they\nconsidered essential when making their judgment. Then we measure how similar\nare these annotations to the tokens our model is focusing on.\n","authors":["Martin Fajcik","Petr Motlicek","Pavel Smrz"],"pdf_url":"https://arxiv.org/pdf/2207.14116v4.pdf","comment":"updated acknowledgement"},{"id":"http://arxiv.org/abs/2304.14104v2","updated":"2023-08-07T07:52:35Z","published":"2023-04-27T11:32:48Z","title":"Learning Human-Human Interactions in Images from Weak Textual\n Supervision","summary":" Interactions between humans are diverse and context-dependent, but previous\nworks have treated them as categorical, disregarding the heavy tail of possible\ninteractions. We propose a new paradigm of learning human-human interactions as\nfree text from a single still image, allowing for flexibility in modeling the\nunlimited space of situations and relationships between people. To overcome the\nabsence of data labelled specifically for this task, we use knowledge\ndistillation applied to synthetic caption data produced by a large language\nmodel without explicit supervision. We show that the pseudo-labels produced by\nthis procedure can be used to train a captioning model to effectively\nunderstand human-human interactions in images, as measured by a variety of\nmetrics that measure textual and semantic faithfulness and factual groundedness\nof our predictions. We further show that our approach outperforms SOTA image\ncaptioning and situation recognition models on this task. We will release our\ncode and pseudo-labels along with Waldo and Wenda, a manually-curated test set\nfor still image human-human interaction understanding.\n","authors":["Morris Alper","Hadar Averbuch-Elor"],"pdf_url":"https://arxiv.org/pdf/2304.14104v2.pdf","comment":"To be presented at ICCV 2023. Project webpage:\n https://learning-interactions.github.io"},{"id":"http://arxiv.org/abs/2308.03365v1","updated":"2023-08-07T07:39:43Z","published":"2023-08-07T07:39:43Z","title":"Improving Few-shot and Zero-shot Entity Linking with Coarse-to-Fine\n Lexicon-based Retriever","summary":" Few-shot and zero-shot entity linking focus on the tail and emerging\nentities, which are more challenging but closer to real-world scenarios. The\nmainstream method is the ''retrieve and rerank'' two-stage framework. In this\npaper, we propose a coarse-to-fine lexicon-based retriever to retrieve entity\ncandidates in an effective manner, which operates in two layers. The first\nlayer retrieves coarse-grained candidates by leveraging entity names, while the\nsecond layer narrows down the search to fine-grained candidates within the\ncoarse-grained ones. In addition, this second layer utilizes entity\ndescriptions to effectively disambiguate tail or new entities that share names\nwith existing popular entities. Experimental results indicate that our approach\ncan obtain superior performance without requiring extensive finetuning in the\nretrieval stage. 
Notably, our approach ranks the 1st in NLPCC 2023 Shared Task\n6 on Chinese Few-shot and Zero-shot Entity Linking.\n","authors":["Shijue Huang","Bingbing Wang","Libo Qin","Qin Zhao","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2308.03365v1.pdf","comment":"Accepted to NLPCC2023"},{"id":"http://arxiv.org/abs/2308.03360v1","updated":"2023-08-07T07:29:49Z","published":"2023-08-07T07:29:49Z","title":"Coupling Symbolic Reasoning with Language Modeling for Efficient\n Longitudinal Understanding of Unstructured Electronic Medical Records","summary":" The application of Artificial Intelligence (AI) in healthcare has been\nrevolutionary, especially with the recent advancements in transformer-based\nLarge Language Models (LLMs). However, the task of understanding unstructured\nelectronic medical records remains a challenge given the nature of the records\n(e.g., disorganization, inconsistency, and redundancy) and the inability of\nLLMs to derive reasoning paradigms that allow for comprehensive understanding\nof medical variables. In this work, we examine the power of coupling symbolic\nreasoning with language modeling toward improved understanding of unstructured\nclinical texts. We show that such a combination improves the extraction of\nseveral medical variables from unstructured records. In addition, we show that\nthe state-of-the-art commercially-free LLMs enjoy retrieval capabilities\ncomparable to those provided by their commercial counterparts. Finally, we\nelaborate on the need for LLM steering through the application of symbolic\nreasoning as the exclusive use of LLMs results in the lowest performance.\n","authors":["Shivani Shekhar","Simran Tiwari","T. C. Rensink","Ramy Eskander","Wael Salloum"],"pdf_url":"https://arxiv.org/pdf/2308.03360v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03349v1","updated":"2023-08-07T07:03:49Z","published":"2023-08-07T07:03:49Z","title":"SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering\n Dataset for Scientific Graphs","summary":" In this work, we present SciGraphQA, a synthetic multi-turn question-answer\ndataset related to academic graphs. SciGraphQA is 13 times larger than\nChartVQA, the previously largest chart-visual question-answering dataset. It is\nalso the largest open-sourced chart VQA dataset with non-synthetic charts. To\nbuild our dataset, we selected 290,000 Computer Science or Machine Learning\nArXiv papers published between 2010 and 2020, and then used Palm-2 to generate\n295K samples of open-vocabulary multi-turn question-answering dialogues about\nthe graphs. As context, we provided the text-only Palm-2 with paper title,\nabstract, paragraph mentioning the graph, and rich text contextual data from\nthe graph itself, obtaining dialogues with an average 2.23 question-answer\nturns for each graph. We asked GPT-4 to assess the matching quality of our\nquestion-answer turns given the paper's context, obtaining an average rating of\n8.7/10 on our 3K test set. We evaluated the 0-shot capability of the most\npopular MLLM models such as LLaVa, mPLUGowl, BLIP-2, and openFlamingo's on our\ndataset, finding LLaVA-13B being the most performant with a CIDEr score of\n0.08. We further enriched the question prompts for LLAVA by including the\nserialized data tables extracted from the graphs using the DePlot model,\nboosting LLaVA's 0-shot CIDEr to 0.15. To verify the validity of our dataset,\nwe also fine-tuned LLaVa using our dataset, reaching a substantially higher\nCIDEr score of 0.26. 
We anticipate further accuracy improvement by including\nsegmentation mask tokens and leveraging larger LLM backbones coupled with\nemergent prompting techniques. Our code and data are open-sourced.\n","authors":["Shengzhi Li","Nima Tajbakhsh"],"pdf_url":"https://arxiv.org/pdf/2308.03349v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.03960v2","updated":"2023-08-07T06:35:25Z","published":"2023-05-06T07:06:47Z","title":"Beyond Rule-based Named Entity Recognition and Relation Extraction for\n Process Model Generation from Natural Language Text","summary":" Process-aware information systems offer extensive advantages to companies,\nfacilitating planning, operations, and optimization of day-to-day business\nactivities. However, the time-consuming but required step of designing formal\nbusiness process models often hampers the potential of these systems. To\novercome this challenge, automated generation of business process models from\nnatural language text has emerged as a promising approach to expedite this\nstep. Generally two crucial subtasks have to be solved: extracting\nprocess-relevant information from natural language and creating the actual\nmodel. Approaches towards the first subtask are rule based methods, highly\noptimized for specific domains, but hard to adapt to related applications. To\nsolve this issue, we present an extension to an existing pipeline, to make it\nentirely data driven. We demonstrate the competitiveness of our improved\npipeline, which not only eliminates the substantial overhead associated with\nfeature engineering and rule definition, but also enables adaptation to\ndifferent datasets, entity and relation types, and new domains. Additionally,\nthe largest available dataset (PET) for the first subtask, contains no\ninformation about linguistic references between mentions of entities in the\nprocess description. Yet, the resolution of these mentions into a single visual\nelement is essential for high quality process models. We propose an extension\nto the PET dataset that incorporates information about linguistic references\nand a corresponding method for resolving them. Finally, we provide a detailed\nanalysis of the inherent challenges in the dataset at hand.\n","authors":["Julian Neuberger","Lars Ackermann","Stefan Jablonski"],"pdf_url":"https://arxiv.org/pdf/2305.03960v2.pdf","comment":"Currently under review for CoopIS23"},{"id":"http://arxiv.org/abs/2305.18462v2","updated":"2023-08-07T06:32:56Z","published":"2023-05-29T07:06:03Z","title":"Membership Inference Attacks against Language Models via Neighbourhood\n Comparison","summary":" Membership Inference attacks (MIAs) aim to predict whether a data sample was\npresent in the training data of a machine learning model or not, and are widely\nused for assessing the privacy risks of language models. Most existing attacks\nrely on the observation that models tend to assign higher probabilities to\ntheir training samples than non-training points. However, simple thresholding\nof the model score in isolation tends to lead to high false-positive rates as\nit does not account for the intrinsic complexity of a sample. Recent work has\ndemonstrated that reference-based attacks which compare model scores to those\nobtained from a reference model trained on similar data can substantially\nimprove the performance of MIAs. 
However, in order to train reference models,\nattacks of this kind make the strong and arguably unrealistic assumption that\nan adversary has access to samples closely resembling the original training\ndata. Therefore, we investigate their performance in more realistic scenarios\nand find that they are highly fragile in relation to the data distribution used\nto train reference models. To investigate whether this fragility provides a\nlayer of safety, we propose and evaluate neighbourhood attacks, which compare\nmodel scores for a given sample to scores of synthetically generated neighbour\ntexts and therefore eliminate the need for access to the training data\ndistribution. We show that, in addition to being competitive with\nreference-based attacks that have perfect knowledge about the training data\ndistribution, our attack clearly outperforms existing reference-free attacks as\nwell as reference-based attacks with imperfect knowledge, which demonstrates\nthe need for a reevaluation of the threat model of adversarial attacks.\n","authors":["Justus Mattern","Fatemehsadat Mireshghallah","Zhijing Jin","Bernhard Schölkopf","Mrinmaya Sachan","Taylor Berg-Kirkpatrick"],"pdf_url":"https://arxiv.org/pdf/2305.18462v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.02047v2","updated":"2023-08-07T06:21:31Z","published":"2023-07-05T06:05:36Z","title":"CAME: Confidence-guided Adaptive Memory Efficient Optimization","summary":" Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent\nperformance in the training of large language models. Nevertheless, the need\nfor adaptivity requires maintaining second-moment estimates of the\nper-parameter gradients, which entails a high cost of extra memory overheads.\nTo solve this problem, several memory-efficient optimizers (e.g., Adafactor)\nhave been proposed to obtain a drastic reduction in auxiliary memory usage, but\nwith a performance penalty. In this paper, we first study a confidence-guided\nstrategy to reduce the instability of existing memory efficient optimizers.\nBased on this strategy, we propose CAME to simultaneously achieve two goals:\nfast convergence as in traditional adaptive methods, and low memory usage as in\nmemory-efficient methods. Extensive experiments demonstrate the training\nstability and superior performance of CAME across various NLP tasks such as\nBERT and GPT-2 training. Notably, for BERT pre-training on the large batch size\nof 32,768, our proposed optimizer attains faster convergence and higher\naccuracy compared with the Adam optimizer. The implementation of CAME is\npublicly available.\n","authors":["Yang Luo","Xiaozhe Ren","Zangwei Zheng","Zhuo Jiang","Xin Jiang","Yang You"],"pdf_url":"https://arxiv.org/pdf/2307.02047v2.pdf","comment":"Accepted by ACL 2023"},{"id":"http://arxiv.org/abs/2308.03311v1","updated":"2023-08-07T05:40:01Z","published":"2023-08-07T05:40:01Z","title":"CrossTalk: Enhancing Communication and Collaboration in\n Videoconferencing with Intent Recognition from Conversational Speech","summary":" Despite the advances and ubiquity of digital communication media such as\nvideoconferencing and virtual reality, they remain oblivious to the rich\nintentions expressed by users. 
Beyond transmitting audio, videos, and messages,\nwe envision digital communication media as proactive facilitators that can\nprovide unobtrusive assistance to enhance communication and collaboration.\nInformed by the results of a formative study, we propose three key design\nconcepts to explore the systematic integration of intelligence into\ncommunication and collaboration, including the panel substrate, language-based\nintent recognition, and lightweight interaction techniques. We developed\nCrossTalk, a videoconferencing system that instantiates these concepts, which\nwas found to enable a more fluid and flexible communication and collaboration\nexperience.\n","authors":["Haijun Xia","Tony Wang","Aditya Gunturu","Peiling Jiang","William Duan","Xiaoshuo Yao"],"pdf_url":"https://arxiv.org/pdf/2308.03311v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03303v1","updated":"2023-08-07T05:12:27Z","published":"2023-08-07T05:12:27Z","title":"LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models\n Fine-tuning","summary":" The low-rank adaptation (LoRA) method can largely reduce the amount of\ntrainable parameters for fine-tuning large language models (LLMs), however, it\nstill requires expensive activation memory to update low-rank weights. Reducing\nthe number of LoRA layers or using activation recomputation could harm the\nfine-tuning performance or increase the computational overhead. In this work,\nwe present LoRA-FA, a memory-efficient fine-tuning method that reduces the\nactivation memory without performance degradation and expensive recomputation.\nLoRA-FA chooses to freeze the projection-down weight of $A$ and update the\nprojection-up weight of $B$ in each LoRA layer. It ensures the change of model\nweight reside in a low-rank space during LLMs fine-tuning, while eliminating\nthe requirement to store full-rank input activations. We conduct extensive\nexperiments across multiple model types (RoBERTa, T5, LLaMA) and model scales.\nOur results show that LoRA-FA can always achieve close fine-tuning accuracy\nacross different tasks compared to full parameter fine-tuning and LoRA.\nFurthermore, LoRA-FA can reduce the overall memory cost by up to 1.4$\\times$\ncompared to LoRA.\n","authors":["Longteng Zhang","Lin Zhang","Shaohuai Shi","Xiaowen Chu","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2308.03303v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2308.03296v1","updated":"2023-08-07T04:47:42Z","published":"2023-08-07T04:47:42Z","title":"Studying Large Language Model Generalization with Influence Functions","summary":" When trying to gain better visibility into a machine learning model in order\nto understand and mitigate the associated risks, a potentially valuable source\nof evidence is: which training examples most contribute to a given behavior?\nInfluence functions aim to answer a counterfactual: how would the model's\nparameters (and hence its outputs) change if a given sequence were added to the\ntraining set? While influence functions have produced insights for small\nmodels, they are difficult to scale to large language models (LLMs) due to the\ndifficulty of computing an inverse-Hessian-vector product (IHVP). We use the\nEigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC)\napproximation to scale influence functions up to LLMs with up to 52 billion\nparameters. In our experiments, EK-FAC achieves similar accuracy to traditional\ninfluence function estimators despite the IHVP computation being orders of\nmagnitude faster. 
We investigate two algorithmic techniques to reduce the cost\nof computing gradients of candidate training sequences: TF-IDF filtering and\nquery batching. We use influence functions to investigate the generalization\npatterns of LLMs, including the sparsity of the influence patterns, increasing\nabstraction with scale, math and programming abilities, cross-lingual\ngeneralization, and role-playing behavior. Despite many apparently\nsophisticated forms of generalization, we identify a surprising limitation:\ninfluences decay to near-zero when the order of key phrases is flipped.\nOverall, influence functions give us a powerful new tool for studying the\ngeneralization properties of LLMs.\n","authors":["Roger Grosse","Juhan Bae","Cem Anil","Nelson Elhage","Alex Tamkin","Amirhossein Tajdini","Benoit Steiner","Dustin Li","Esin Durmus","Ethan Perez","Evan Hubinger","Kamilė Lukošiūtė","Karina Nguyen","Nicholas Joseph","Sam McCandlish","Jared Kaplan","Samuel R. Bowman"],"pdf_url":"https://arxiv.org/pdf/2308.03296v1.pdf","comment":"119 pages, 47 figures, 22 tables"},{"id":"http://arxiv.org/abs/2308.03293v1","updated":"2023-08-07T04:42:36Z","published":"2023-08-07T04:42:36Z","title":"Dialogue Systems Can Generate Appropriate Responses without the Use of\n Question Marks? -- Investigation of the Effects of Question Marks on Dialogue\n Systems","summary":" When individuals engage in spoken discourse, various phenomena can be\nobserved that differ from those that are apparent in text-based conversation.\nWhile written communication commonly uses a question mark to denote a query, in\nspoken discourse, queries are frequently indicated by a rising intonation at\nthe end of a sentence. However, numerous speech recognition engines do not\nappend a question mark to recognized queries, presenting a challenge when\ncreating a spoken dialogue system. Specifically, the absence of a question mark\nat the end of a sentence can impede the generation of appropriate responses to\nqueries in spoken dialogue systems. Hence, we investigate the impact of\nquestion marks on dialogue systems, with the results showing that they have a\nsignificant impact. Moreover, we analyze specific examples in an effort to\ndetermine which types of utterances have the impact on dialogue systems.\n","authors":["Tomoya Mizumoto","Takato Yamazaki","Katsumasa Yoshikawa","Masaya Ohagi","Toshiki Kawamoto","Toshinori Sato"],"pdf_url":"https://arxiv.org/pdf/2308.03293v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03291v1","updated":"2023-08-07T04:20:38Z","published":"2023-08-07T04:20:38Z","title":"SynJax: Structured Probability Distributions for JAX","summary":" The development of deep learning software libraries enabled significant\nprogress in the field by allowing users to focus on modeling, while letting the\nlibrary to take care of the tedious and time-consuming task of optimizing\nexecution for modern hardware accelerators. However, this has benefited only\nparticular types of deep learning models, such as Transformers, whose\nprimitives map easily to the vectorized computation. The models that explicitly\naccount for structured objects, such as trees and segmentations, did not\nbenefit equally because they require custom algorithms that are difficult to\nimplement in a vectorized form.\n SynJax directly addresses this problem by providing an efficient vectorized\nimplementation of inference algorithms for structured distributions covering\nalignment, tagging, segmentation, constituency trees and spanning trees. 
With\nSynJax we can build large-scale differentiable models that explicitly model\nstructure in the data. The code is available at\nhttps://github.com/deepmind/synjax.\n","authors":["Miloš Stanojević","Laurent Sartran"],"pdf_url":"https://arxiv.org/pdf/2308.03291v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03281v1","updated":"2023-08-07T03:52:59Z","published":"2023-08-07T03:52:59Z","title":"Towards General Text Embeddings with Multi-stage Contrastive Learning","summary":" We present GTE, a general-purpose text embedding model trained with\nmulti-stage contrastive learning. In line with recent advancements in unifying\nvarious NLP tasks into a single format, we train a unified text embedding model\nby employing contrastive learning over a diverse mixture of datasets from\nmultiple sources. By significantly increasing the number of training data\nduring both unsupervised pre-training and supervised fine-tuning stages, we\nachieve substantial performance gains over existing embedding models. Notably,\neven with a relatively modest parameter count of 110M, GTE$_\\text{base}$\noutperforms the black-box embedding API provided by OpenAI and even surpasses\n10x larger text embedding models on the massive text embedding benchmark.\nFurthermore, without additional fine-tuning on each programming language\nindividually, our model outperforms previous best code retrievers of similar\nsize by treating code as text. In summary, our model achieves impressive\nresults by effectively harnessing multi-stage contrastive learning, offering a\npowerful and efficient text embedding model with broad applicability across\nvarious NLP and code-related tasks.\n","authors":["Zehan Li","Xin Zhang","Yanzhao Zhang","Dingkun Long","Pengjun Xie","Meishan Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03281v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03279v1","updated":"2023-08-07T03:39:52Z","published":"2023-08-07T03:39:52Z","title":"UniversalNER: Targeted Distillation from Large Language Models for Open\n Named Entity Recognition","summary":" Large language models (LLMs) have demonstrated remarkable generalizability,\nsuch as understanding arbitrary entities and relations. Instruction tuning has\nproven effective for distilling LLMs into more cost-efficient models such as\nAlpaca and Vicuna. Yet such student models still trail the original LLMs by\nlarge margins in downstream applications. In this paper, we explore targeted\ndistillation with mission-focused instruction tuning to train student models\nthat can excel in a broad application class such as open information\nextraction. Using named entity recognition (NER) for case study, we show how\nChatGPT can be distilled into much smaller UniversalNER models for open NER.\nFor evaluation, we assemble the largest NER benchmark to date, comprising 43\ndatasets across 9 diverse domains such as biomedicine, programming, social\nmedia, law, finance. Without using any direct supervision, UniversalNER attains\nremarkable NER accuracy across tens of thousands of entity types, outperforming\ngeneral instruction-tuned models such as Alpaca and Vicuna by over 30 absolute\nF1 points in average. With a tiny fraction of parameters, UniversalNER not only\nacquires ChatGPT's capability in recognizing arbitrary entity types, but also\noutperforms its NER accuracy by 7-9 absolute F1 points in average. 
Remarkably,\nUniversalNER even outperforms by a large margin state-of-the-art multi-task\ninstruction-tuned systems such as InstructUIE, which uses supervised NER\nexamples. We also conduct thorough ablation studies to assess the impact of\nvarious components in our distillation approach. We will release the\ndistillation recipe, data, and UniversalNER models to facilitate future\nresearch on targeted distillation.\n","authors":["Wenxuan Zhou","Sheng Zhang","Yu Gu","Muhao Chen","Hoifung Poon"],"pdf_url":"https://arxiv.org/pdf/2308.03279v1.pdf","comment":"Project page: https://universal-ner.github.io/"},{"id":"http://arxiv.org/abs/2308.03277v1","updated":"2023-08-07T03:37:31Z","published":"2023-08-07T03:37:31Z","title":"From Ambiguity to Explicitness: NLP-Assisted 5G Specification\n Abstraction for Formal Analysis","summary":" Formal method-based analysis of the 5G Wireless Communication Protocol is\ncrucial for identifying logical vulnerabilities and facilitating an\nall-encompassing security assessment, especially in the design phase. Natural\nLanguage Processing (NLP) assisted techniques and most of the tools are not\nwidely adopted by the industry and research community. Traditional formal\nverification through a mathematics approach heavily relied on manual logical\nabstraction prone to being time-consuming, and error-prone. The reason that the\nNLP-assisted method did not apply in industrial research may be due to the\nambiguity in the natural language of the protocol designs nature is\ncontroversial to the explicitness of formal verification. To address the\nchallenge of adopting the formal methods in protocol designs, targeting (3GPP)\nprotocols that are written in natural language, in this study, we propose a\nhybrid approach to streamline the analysis of protocols. We introduce a\ntwo-step pipeline that first uses NLP tools to construct data and then uses\nconstructed data to extract identifiers and formal properties by using the NLP\nmodel. The identifiers and formal properties are further used for formal\nanalysis. We implemented three models that take different dependencies between\nidentifiers and formal properties as criteria. Our results of the optimal model\nreach valid accuracy of 39% for identifier extraction and 42% for formal\nproperties predictions. Our work is proof of concept for an efficient procedure\nin performing formal analysis for largescale complicate specification and\nprotocol analysis, especially for 5G and nextG communications.\n","authors":["Shiyu Yuan","Jingda Yang","Sudhanshu Arya","Carlo Lipizzi","Ying Wang"],"pdf_url":"https://arxiv.org/pdf/2308.03277v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03275v1","updated":"2023-08-07T03:34:01Z","published":"2023-08-07T03:34:01Z","title":"Adapter-based Selective Knowledge Distillation for Federated\n Multi-domain Meeting Summarization","summary":" Meeting summarization has emerged as a promising technique for providing\nusers with condensed summaries. However, existing work has focused on training\nmodels on centralized data, neglecting real-world scenarios where meeting data\nare infeasible to collect centrally, due to their sensitive nature. This gap\nmotivates us to explore federated learning for meeting summarization. Two\ncritical challenges impede progress. First, state-of-the-art summarizers are\nbased on parameter-heavy pre-trained models. Exchanging such a model's\nparameters across clients imposes large bandwidth costs. 
Second, as real-world\nmeeting data belong to various domains and are distributed across clients, they\nare instances of non-identically and independently distributed (non-IID). IID\nassumptions do not hold, which changes which forms of learning algorithms best\napply. To address this, we propose Adapter-based Federated Selective Knowledge\nDistillation (AdaFedSelecKD) for training performant client models.\nSpecifically, we develop an adapter-based summarization model where two\nadapters cooperatively facilitate learning using fewer parameters to reduce\ncommunication costs. Then, we devise a selective knowledge distillation\nstrategy, assisting clients in robustly handling domain-focused modelling on\ntheir own data, while leveraging global parameters based on non-IID data.\nExtensive experiments on the QMSum benchmark demonstrate AdaFedSelecKD can\nachieve comparable performance with powerful centralized training methods, and\nshows its generalizability and robustness.\n","authors":["Xiachong Feng","Xiaocheng Feng","Xiyuan Du","Min-Yen Kan","Bing Qin"],"pdf_url":"https://arxiv.org/pdf/2308.03275v1.pdf","comment":"This work has been submitted to the IEEE TASLP for possible\n publication. Copyright may be transferred without notice, after which this\n version may no longer be accessible"},{"id":"http://arxiv.org/abs/2103.00676v2","updated":"2023-08-07T03:25:37Z","published":"2021-03-01T01:00:09Z","title":"Token-Modification Adversarial Attacks for Natural Language Processing:\n A Survey","summary":" There are now many adversarial attacks for natural language processing\nsystems. Of these, a vast majority achieve success by modifying individual\ndocument tokens, which we call here a token-modification attack. Each\ntoken-modification attack is defined by a specific combination of fundamental\ncomponents, such as a constraint on the adversary or a particular search\nalgorithm. Motivated by this observation, we survey existing token-modification\nattacks and extract the components of each. We use an attack-independent\nframework to structure our survey which results in an effective categorisation\nof the field and an easy comparison of components. This survey aims to guide\nnew researchers to this field and spark further research into individual attack\ncomponents.\n","authors":["Tom Roth","Yansong Gao","Alsharif Abuadbba","Surya Nepal","Wei Liu"],"pdf_url":"https://arxiv.org/pdf/2103.00676v2.pdf","comment":"Version 2: updated"},{"id":"http://arxiv.org/abs/2308.03269v1","updated":"2023-08-07T03:19:59Z","published":"2023-08-07T03:19:59Z","title":"Simple Rule Injection for ComplEx Embeddings","summary":" Recent works in neural knowledge graph inference attempt to combine logic\nrules with knowledge graph embeddings to benefit from prior knowledge. However,\nthey usually cannot avoid rule grounding, and injecting a diverse set of rules\nhas still not been thoroughly explored. In this work, we propose InjEx, a\nmechanism to inject multiple types of rules through simple constraints, which\ncapture definite Horn rules. To start, we theoretically prove that InjEx can\ninject such rules. 
Next, to demonstrate that InjEx infuses interpretable prior\nknowledge into the embedding space, we evaluate InjEx on both the knowledge\ngraph completion (KGC) and few-shot knowledge graph completion (FKGC) settings.\nOur experimental results reveal that InjEx outperforms both baseline KGC models\nas well as specialized few-shot models while maintaining its scalability and\nefficiency.\n","authors":["Haodi Ma","Anthony Colas","Yuejie Wang","Ali Sadeghian","Daisy Zhe Wang"],"pdf_url":"https://arxiv.org/pdf/2308.03269v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03266v1","updated":"2023-08-07T03:12:27Z","published":"2023-08-07T03:12:27Z","title":"SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and\n Effective Hotword Customization Ability","summary":" Hotword customization is one of the important issues remained in ASR field -\nit is of value to enable users of ASR systems to customize names of entities,\npersons and other phrases. The past few years have seen both implicit and\nexplicit modeling strategies for ASR contextualization developed. While these\napproaches have performed adequately, they still exhibit certain shortcomings,\nsuch as instability in effectiveness, especially in non-autoregressive ASR\nmodels. In this paper we propose Semantic-augmented Contextual-Paraformer\n(SeACo-Paraformer) a novel NAR based ASR system with flexible and effective\nhotword customization ability. It combines the accuracy of the AED-based model,\nthe efficiency of the NAR model, and the excellent performance in\ncontextualization. In tens of thousands of hours industrial big data\nexperiments, our proposed model outperforms strong baselines in customization\nand general ASR tasks. Besides, we explore an efficient way to filter large\nscale incoming hotwords for further improvement.\n","authors":["Xian Shi","Yexin Yang","Zerui Li","Shiliang Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03266v1.pdf","comment":"early draft"},{"id":"http://arxiv.org/abs/2305.02394v2","updated":"2023-08-07T03:07:59Z","published":"2023-05-03T19:29:26Z","title":"Defending against Insertion-based Textual Backdoor Attacks via\n Attribution","summary":" Textual backdoor attack, as a novel attack model, has been shown to be\neffective in adding a backdoor to the model during training. Defending against\nsuch backdoor attacks has become urgent and important. In this paper, we\npropose AttDef, an efficient attribution-based pipeline to defend against two\ninsertion-based poisoning attacks, BadNL and InSent. Specifically, we regard\nthe tokens with larger attribution scores as potential triggers since larger\nattribution words contribute more to the false prediction results and therefore\nare more likely to be poison triggers. Additionally, we further utilize an\nexternal pre-trained language model to distinguish whether input is poisoned or\nnot. We show that our proposed method can generalize sufficiently well in two\ncommon attack scenarios (poisoning training data and testing data), which\nconsistently improves previous methods. For instance, AttDef can successfully\nmitigate both attacks with an average accuracy of 79.97% (56.59% up) and 48.34%\n(3.99% up) under pre-training and post-training attack defense respectively,\nachieving the new state-of-the-art performance on prediction recovery over four\nbenchmark datasets.\n","authors":["Jiazhao Li","Zhuofeng Wu","Wei Ping","Chaowei Xiao","V. G. Vinod Vydiswaran"],"pdf_url":"https://arxiv.org/pdf/2305.02394v2.pdf","comment":"Findings of ACL 2023. 
Camera-ready version"},{"id":"http://arxiv.org/abs/2212.08632v2","updated":"2023-08-07T03:02:06Z","published":"2022-12-16T18:12:04Z","title":"Enhancing Multi-modal and Multi-hop Question Answering via Structured\n Knowledge and Unified Retrieval-Generation","summary":" Multi-modal multi-hop question answering involves answering a question by\nreasoning over multiple input sources from different modalities. Existing\nmethods often retrieve evidences separately and then use a language model to\ngenerate an answer based on the retrieved evidences, and thus do not adequately\nconnect candidates and are unable to model the interdependent relations during\nretrieval. Moreover, the pipelined approaches of retrieval and generation might\nresult in poor generation performance when retrieval performance is low. To\naddress these issues, we propose a Structured Knowledge and Unified\nRetrieval-Generation (SKURG) approach. SKURG employs an Entity-centered Fusion\nEncoder to align sources from different modalities using shared entities. It\nthen uses a unified Retrieval-Generation Decoder to integrate intermediate\nretrieval results for answer generation and also adaptively determine the\nnumber of retrieval steps. Extensive experiments on two representative\nmulti-modal multi-hop QA datasets MultimodalQA and WebQA demonstrate that SKURG\noutperforms the state-of-the-art models in both source retrieval and answer\ngeneration performance with fewer parameters. Our code is available at\nhttps://github.com/HITsz-TMG/SKURG.\n","authors":["Qian Yang","Qian Chen","Wen Wang","Baotian Hu","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2212.08632v2.pdf","comment":"Accepted by ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2308.02180v2","updated":"2023-08-07T02:53:06Z","published":"2023-08-04T07:51:15Z","title":"Scaling Clinical Trial Matching Using Large Language Models: A Case\n Study in Oncology","summary":" Clinical trial matching is a key process in health delivery and discovery. In\npractice, it is plagued by overwhelming unstructured data and unscalable manual\nprocessing. In this paper, we conduct a systematic study on scaling clinical\ntrial matching using large language models (LLMs), with oncology as the focus\narea. Our study is grounded in a clinical trial matching system currently in\ntest deployment at a large U.S. health network. Initial findings are promising:\nout of box, cutting-edge LLMs, such as GPT-4, can already structure elaborate\neligibility criteria of clinical trials and extract complex matching logic\n(e.g., nested AND/OR/NOT). While still far from perfect, LLMs substantially\noutperform prior strong baselines and may serve as a preliminary solution to\nhelp triage patient-trial candidates with humans in the loop. 
Our study also\nreveals a few significant growth areas for applying LLMs to end-to-end clinical\ntrial matching, such as context limitation and accuracy, especially in\nstructuring patient information from longitudinal medical records.\n","authors":["Cliff Wong","Sheng Zhang","Yu Gu","Christine Moung","Jacob Abel","Naoto Usuyama","Roshanthi Weerasinghe","Brian Piening","Tristan Naumann","Carlo Bifulco","Hoifung Poon"],"pdf_url":"https://arxiv.org/pdf/2308.02180v2.pdf","comment":"24 pages, 5 figures, accepted at Machine Learning for Healthcare\n (MLHC) 2023"},{"id":"http://arxiv.org/abs/2308.03253v1","updated":"2023-08-07T02:18:23Z","published":"2023-08-07T02:18:23Z","title":"PaniniQA: Enhancing Patient Education Through Interactive Question\n Answering","summary":" Patient portal allows discharged patients to access their personalized\ndischarge instructions in electronic health records (EHRs). However, many\npatients have difficulty understanding or memorizing their discharge\ninstructions. In this paper, we present PaniniQA, a patient-centric interactive\nquestion answering system designed to help patients understand their discharge\ninstructions. PaniniQA first identifies important clinical content from\npatients' discharge instructions and then formulates patient-specific\neducational questions. In addition, PaniniQA is also equipped with answer\nverification functionality to provide timely feedback to correct patients'\nmisunderstandings. Our comprehensive automatic and human evaluation results\ndemonstrate our PaniniQA is capable of improving patients' mastery of their\nmedical instructions through effective interactions\n","authors":["Pengshan Cai","Zonghai Yao","Fei Liu","Dakuo Wang","Meghan Reilly","Huixue Zhou","Lingxi Li","Yi Cao","Alok Kapoor","Adarsha Bajracharya","Dan Berlowitz","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2308.03253v1.pdf","comment":"Accepted to TACL 2023. This arXiv version is a pre-MIT Press\n publication version"},{"id":"http://arxiv.org/abs/2308.03235v1","updated":"2023-08-07T01:10:50Z","published":"2023-08-07T01:10:50Z","title":"Analysis of the Evolution of Advanced Transformer-Based Language Models:\n Experiments on Opinion Mining","summary":" Opinion mining, also known as sentiment analysis, is a subfield of natural\nlanguage processing (NLP) that focuses on identifying and extracting subjective\ninformation in textual material. This can include determining the overall\nsentiment of a piece of text (e.g., positive or negative), as well as\nidentifying specific emotions or opinions expressed in the text, that involves\nthe use of advanced machine and deep learning techniques. Recently,\ntransformer-based language models make this task of human emotion analysis\nintuitive, thanks to the attention mechanism and parallel computation. These\nadvantages make such models very powerful on linguistic tasks, unlike recurrent\nneural networks that spend a lot of time on sequential processing, making them\nprone to fail when it comes to processing long text. The scope of our paper\naims to study the behaviour of the cutting-edge Transformer-based language\nmodels on opinion mining and provide a high-level comparison between them to\nhighlight their key particularities. 
Additionally, our comparative study shows\nleads and paves the way for production engineers regarding the approach to\nfocus on and is useful for researchers as it provides guidelines for future\nresearch subjects.\n","authors":["Nour Eddine Zekaoui","Siham Yousfi","Maryem Rhanoui","Mounia Mikram"],"pdf_url":"https://arxiv.org/pdf/2308.03235v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03234v1","updated":"2023-08-07T01:03:04Z","published":"2023-08-07T01:03:04Z","title":"Exploring Automated Distractor and Feedback Generation for Math\n Multiple-choice Questions via In-context Learning","summary":" Multiple-choice questions (MCQs) are ubiquitous in almost all levels of\neducation since they are easy to administer, grade, and are a reliable format\nin both assessments and practices. An important aspect of MCQs is the\ndistractors, i.e., incorrect options that are designed to target specific\nmisconceptions or insufficient knowledge among students. To date, the task of\ncrafting high-quality distractors has largely remained a labor-intensive\nprocess for teachers and learning content designers, which has limited\nscalability. In this work, we explore the task of automated distractor and\ncorresponding feedback message generation in math MCQs using large language\nmodels. We establish a formulation of these two tasks and propose a simple,\nin-context learning-based solution. Moreover, we explore using two non-standard\nmetrics to evaluate the quality of the generated distractors and feedback\nmessages. We conduct extensive experiments on these tasks using a real-world\nMCQ dataset that contains student response information. Our findings suggest\nthat there is a lot of room for improvement in automated distractor and\nfeedback generation. We also outline several directions for future work\n","authors":["Hunter McNichols","Wanyong Feng","Jaewook Lee","Alexander Scarlatos","Digory Smith","Simon Woodhead","Andrew Lan"],"pdf_url":"https://arxiv.org/pdf/2308.03234v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03958v1","updated":"2023-08-07T23:48:36Z","published":"2023-08-07T23:48:36Z","title":"Simple synthetic data reduces sycophancy in large language models","summary":" Sycophancy is an undesirable behavior where models tailor their responses to\nfollow a human user's view even when that view is not objectively correct\n(e.g., adapting liberal views once a user reveals that they are liberal). In\nthis paper, we study the prevalence of sycophancy in language models and\npropose a simple synthetic-data intervention to reduce this behavior.\n First, on a set of three sycophancy tasks (Perez et al., 2022) where models\nare asked for an opinion on statements with no correct answers (e.g.,\npolitics), we observe that both model scaling and instruction tuning\nsignificantly increase sycophancy for PaLM models up to 540B parameters.\nSecond, we extend sycophancy evaluations to simple addition statements that are\nobjectively incorrect, finding that despite knowing that these statements are\nwrong, language models will still agree with them if the user does as well.\n To reduce sycophancy, we present a straightforward synthetic-data\nintervention that takes public NLP tasks and encourages models to be robust to\nuser opinions on these tasks. Adding these data in a lightweight finetuning\nstep can significantly reduce sycophantic behavior on held-out prompts. 
Code\nfor generating synthetic data for intervention can be found at\nhttps://github.com/google/sycophancy-intervention.\n","authors":["Jerry Wei","Da Huang","Yifeng Lu","Denny Zhou","Quoc V. Le"],"pdf_url":"https://arxiv.org/pdf/2308.03958v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03929v1","updated":"2023-08-07T22:13:30Z","published":"2023-08-07T22:13:30Z","title":"Establishing Trust in ChatGPT BioMedical Generated Text: An\n Ontology-Based Knowledge Graph to Validate Disease-Symptom Links","summary":" Methods: Through an innovative approach, we construct ontology-based\nknowledge graphs from authentic medical literature and AI-generated content.\nOur goal is to distinguish factual information from unverified data. We\ncompiled two datasets: one from biomedical literature using a \"human disease\nand symptoms\" query, and another generated by ChatGPT, simulating articles.\nWith these datasets (PubMed and ChatGPT), we curated 10 sets of 250 abstracts\neach, selected randomly with a specific seed. Our method focuses on utilizing\ndisease ontology (DOID) and symptom ontology (SYMP) to build knowledge graphs,\nrobust mathematical models that facilitate unbiased comparisons. By employing\nour fact-checking algorithms and network centrality metrics, we conducted GPT\ndisease-symptoms link analysis to quantify the accuracy of factual knowledge\namid noise, hypotheses, and significant findings.\n Results: The findings obtained from the comparison of diverse ChatGPT\nknowledge graphs with their PubMed counterparts revealed some interesting\nobservations. While PubMed knowledge graphs exhibit a wealth of disease-symptom\nterms, it is surprising to observe that some ChatGPT graphs surpass them in the\nnumber of connections. Furthermore, some GPT graphs are demonstrating supremacy\nof the centrality scores, especially for the overlapping nodes. This striking\ncontrast indicates the untapped potential of knowledge that can be derived from\nAI-generated content, awaiting verification. Out of all the graphs, the factual\nlink ratio between any two graphs reached its peak at 60%.\n Conclusions: An intriguing insight from our findings was the striking number\nof links among terms in the knowledge graph generated from ChatGPT datasets,\nsurpassing some of those in its PubMed counterpart. This early discovery has\nprompted further investigation using universal network metrics to unveil the\nnew knowledge the links may hold.\n","authors":["Ahmed Abdeen Hamed","Alessandro Crimi","Magdalena M. Misiak","Byung Suk Lee"],"pdf_url":"https://arxiv.org/pdf/2308.03929v1.pdf","comment":"7 Pages, 3 algorithms, 4 tables, and 7 figures"},{"id":"http://arxiv.org/abs/2308.02013v2","updated":"2023-08-07T21:34:44Z","published":"2023-08-03T20:08:23Z","title":"Federated Representation Learning for Automatic Speech Recognition","summary":" Federated Learning (FL) is a privacy-preserving paradigm, allowing edge\ndevices to learn collaboratively without sharing data. Edge devices like Alexa\nand Siri are prospective sources of unlabeled audio data that can be tapped to\nlearn robust audio representations. In this work, we bring Self-supervised\nLearning (SSL) and FL together to learn representations for Automatic Speech\nRecognition respecting data privacy constraints. We use the speaker and chapter\ninformation in the unlabeled speech dataset, Libri-Light, to simulate non-IID\nspeaker-siloed data distributions and pre-train an LSTM encoder with the\nContrastive Predictive Coding framework with FedSGD. 
We show that the\npre-trained ASR encoder in FL performs as well as a centrally pre-trained model\nand produces an improvement of 12-15% (WER) compared to no pre-training. We\nfurther adapt the federated pre-trained models to a new language, French, and\nshow a 20% (WER) improvement over no pre-training.\n","authors":["Guruprasad V Ramesh","Gopinath Chennupati","Milind Rao","Anit Kumar Sahu","Ariya Rastrow","Jasha Droppo"],"pdf_url":"https://arxiv.org/pdf/2308.02013v2.pdf","comment":"Accepted at ISCA SPSC Symposium 3rd Symposium on Security and Privacy\n in Speech Communication, 2023"},{"id":"http://arxiv.org/abs/2308.03917v1","updated":"2023-08-07T21:29:51Z","published":"2023-08-07T21:29:51Z","title":"Universal Automatic Phonetic Transcription into the International\n Phonetic Alphabet","summary":" This paper presents a state-of-the-art model for transcribing speech in any\nlanguage into the International Phonetic Alphabet (IPA). Transcription of\nspoken languages into IPA is an essential yet time-consuming process in\nlanguage documentation, and even partially automating this process has the\npotential to drastically speed up the documentation of endangered languages.\nLike the previous best speech-to-IPA model (Wav2Vec2Phoneme), our model is\nbased on wav2vec 2.0 and is fine-tuned to predict IPA from audio input. We use\ntraining data from seven languages from CommonVoice 11.0, transcribed into IPA\nsemi-automatically. Although this training dataset is much smaller than\nWav2Vec2Phoneme's, its higher quality lets our model achieve comparable or\nbetter results. Furthermore, we show that the quality of our universal\nspeech-to-IPA models is close to that of human annotators.\n","authors":["Chihiro Taguchi","Yusuke Sakai","Parisa Haghani","David Chiang"],"pdf_url":"https://arxiv.org/pdf/2308.03917v1.pdf","comment":"5 pages, 7 tables"},{"id":"http://arxiv.org/abs/2308.03905v1","updated":"2023-08-07T20:43:42Z","published":"2023-08-07T20:43:42Z","title":"Intelligent Assistant Language Understanding On Device","summary":" It has recently become feasible to run personal digital assistants on phones\nand other personal devices. In this paper we describe a design for a natural\nlanguage understanding system that runs on device. In comparison to a\nserver-based assistant, this system is more private, more reliable, faster,\nmore expressive, and more accurate. We describe what led to key choices about\narchitecture and technologies. For example, some approaches in the dialog\nsystems literature are difficult to maintain over time in a deployment setting.\nWe hope that sharing learnings from our practical experiences may help inform\nfuture work in the research community.\n","authors":["Cecilia Aas","Hisham Abdelsalam","Irina Belousova","Shruti Bhargava","Jianpeng Cheng","Robert Daland","Joris Driesen","Federico Flego","Tristan Guigue","Anders Johannsen","Partha Lal","Jiarui Lu","Joel Ruben Antony Moniz","Nathan Perkins","Dhivya Piraviperumal","Stephen Pulman","Diarmuid Ó Séaghdha","David Q. Sun","John Torr","Marco Del Vecchio","Jay Wacker","Jason D. Williams","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2308.03905v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03891v1","updated":"2023-08-07T19:50:59Z","published":"2023-08-07T19:50:59Z","title":"A Cross-Domain Evaluation of Approaches for Causal Knowledge Extraction","summary":" Causal knowledge extraction is the task of extracting relevant causes and\neffects from text by detecting the causal relation. 
Although this task is\nimportant for language understanding and knowledge discovery, recent works in\nthis domain have largely focused on binary classification of a text segment as\ncausal or non-causal. In this regard, we perform a thorough analysis of three\nsequence tagging models for causal knowledge extraction and compare it with a\nspan based approach to causality extraction. Our experiments show that\nembeddings from pre-trained language models (e.g. BERT) provide a significant\nperformance boost on this task compared to previous state-of-the-art models\nwith complex architectures. We observe that span based models perform better\nthan simple sequence tagging models based on BERT across all 4 data sets from\ndiverse domains with different types of cause-effect phrases.\n","authors":["Anik Saha","Oktie Hassanzadeh","Alex Gittens","Jian Ni","Kavitha Srinivas","Bulent Yener"],"pdf_url":"https://arxiv.org/pdf/2308.03891v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03883v1","updated":"2023-08-07T19:26:09Z","published":"2023-08-07T19:26:09Z","title":"Generative Benchmark Creation for Table Union Search","summary":" Data management has traditionally relied on synthetic data generators to\ngenerate structured benchmarks, like the TPC suite, where we can control\nimportant parameters like data size and its distribution precisely. These\nbenchmarks were central to the success and adoption of database management\nsystems. But more and more, data management problems are of a semantic nature.\nAn important example is finding tables that can be unioned. While any two\ntables with the same cardinality can be unioned, table union search is the\nproblem of finding tables whose union is semantically coherent. Semantic\nproblems cannot be benchmarked using synthetic data. Our current methods for\ncreating benchmarks involve the manual curation and labeling of real data.\nThese methods are not robust or scalable and perhaps more importantly, it is\nnot clear how robust the created benchmarks are. We propose to use generative\nAI models to create structured data benchmarks for table union search. We\npresent a novel method for using generative models to create tables with\nspecified properties. Using this method, we create a new benchmark containing\npairs of tables that are both unionable and non-unionable but related. We\nthoroughly evaluate recent existing table union search methods over existing\nbenchmarks and our new benchmark. We also present and evaluate a new table\nsearch methods based on recent large language models over all benchmarks. We\nshow that the new benchmark is more challenging for all methods than\nhand-curated benchmarks, specifically, the top-performing method achieves a\nMean Average Precision of around 60%, over 30% less than its performance on\nexisting manually created benchmarks. We examine why this is the case and show\nthat the new benchmark permits more detailed analysis of methods, including a\nstudy of both false positives and false negatives that were not possible with\nexisting benchmarks.\n","authors":["Koyena Pal","Aamod Khatiwada","Roee Shraga","Renée J. 
Miller"],"pdf_url":"https://arxiv.org/pdf/2308.03883v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03869v1","updated":"2023-08-07T18:40:13Z","published":"2023-08-07T18:40:13Z","title":"Semantic Equivalence of e-Commerce Queries","summary":" Search query variation poses a challenge in e-commerce search, as equivalent\nsearch intents can be expressed through different queries with surface-level\ndifferences. This paper introduces a framework to recognize and leverage query\nequivalence to enhance searcher and business outcomes. The proposed approach\naddresses three key problems: mapping queries to vector representations of\nsearch intent, identifying nearest neighbor queries expressing equivalent or\nsimilar intent, and optimizing for user or business objectives. The framework\nutilizes both surface similarity and behavioral similarity to determine query\nequivalence. Surface similarity involves canonicalizing queries based on word\ninflection, word order, compounding, and noise words. Behavioral similarity\nleverages historical search behavior to generate vector representations of\nquery intent. An offline process is used to train a sentence similarity model,\nwhile an online nearest neighbor approach supports processing of unseen\nqueries. Experimental evaluations demonstrate the effectiveness of the proposed\napproach, outperforming popular sentence transformer models and achieving a\nPearson correlation of 0.85 for query similarity. The results highlight the\npotential of leveraging historical behavior data and training models to\nrecognize and utilize query equivalence in e-commerce search, leading to\nimproved user experiences and business outcomes. Further advancements and\nbenchmark datasets are encouraged to facilitate the development of solutions\nfor this critical problem in the e-commerce domain.\n","authors":["Aritra Mandal","Daniel Tunkelang","Zhe Wu"],"pdf_url":"https://arxiv.org/pdf/2308.03869v1.pdf","comment":"The 6th Workshop on e-Commerce and NLP"},{"id":"http://arxiv.org/abs/2308.03866v1","updated":"2023-08-07T18:27:54Z","published":"2023-08-07T18:27:54Z","title":"Trusting Language Models in Education","summary":" Language Models are being widely used in Education. Even though modern deep\nlearning models achieve very good performance on question-answering tasks,\nsometimes they make errors. To avoid misleading students by showing wrong\nanswers, it is important to calibrate the confidence - that is, the prediction\nprobability - of these models. In our work, we propose to use an XGBoost on top\nof BERT to output the corrected probabilities, using features based on the\nattention mechanism. Our hypothesis is that the level of uncertainty contained\nin the flow of attention is related to the quality of the model's response\nitself.\n","authors":["Jogi Suda Neto","Li Deng","Thejaswi Raya","Reza Shahbazi","Nick Liu","Adhitya Venkatesh","Miral Shah","Neeru Khosla","Rodrigo Capobianco Guido"],"pdf_url":"https://arxiv.org/pdf/2308.03866v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03864v1","updated":"2023-08-07T18:25:00Z","published":"2023-08-07T18:25:00Z","title":"Storyfier: Exploring Vocabulary Learning Support with Text Generation\n Models","summary":" Vocabulary learning support tools have widely exploited existing materials,\ne.g., stories or video clips, as contexts to help users memorize each target\nword. However, these tools could not provide a coherent context for any target\nwords of learners' interests, and they seldom help practice word usage. 
In this\npaper, we work with teachers and students to iteratively develop Storyfier,\nwhich leverages text generation models to enable learners to read a generated\nstory that covers any target words, conduct a story cloze test, and use these\nwords to write a new story with adaptive AI assistance. Our within-subjects\nstudy (N=28) shows that learners generally favor the generated stories for\nconnecting target words and writing assistance for easing their learning\nworkload. However, in the read-cloze-write learning sessions, participants\nusing Storyfier perform worse in recalling and using target words than learning\nwith a baseline tool without our AI features. We discuss insights into\nsupporting learning tasks with generative models.\n","authors":["Zhenhui Peng","Xingbo Wang","Qiushi Han","Junkai Zhu","Xiaojuan Ma","Huamin Qu"],"pdf_url":"https://arxiv.org/pdf/2308.03864v1.pdf","comment":"To appear at the 2023 ACM Symposium on User Interface Software and\n Technology (UIST); 16 pages (7 figures, 23 tables)"},{"id":"http://arxiv.org/abs/2308.03853v1","updated":"2023-08-07T18:03:10Z","published":"2023-08-07T18:03:10Z","title":"Extracting detailed oncologic history and treatment plan from medical\n oncology notes with large language models","summary":" Both medical care and observational studies in oncology require a thorough\nunderstanding of a patient's disease progression and treatment history, often\nelaborately documented in clinical notes. Despite their vital role, no current\noncology information representation and annotation schema fully encapsulates\nthe diversity of information recorded within these notes. Although large\nlanguage models (LLMs) have recently exhibited impressive performance on\nvarious medical natural language processing tasks, due to the current lack of\ncomprehensively annotated oncology datasets, an extensive evaluation of LLMs in\nextracting and reasoning with the complex rhetoric in oncology notes remains\nunderstudied. We developed a detailed schema for annotating textual oncology\ninformation, encompassing patient characteristics, tumor characteristics,\ntests, treatments, and temporality. Using a corpus of 10 de-identified breast\ncancer progress notes at University of California, San Francisco, we applied\nthis schema to assess the abilities of three recently-released LLMs (GPT-4,\nGPT-3.5-turbo, and FLAN-UL2) to perform zero-shot extraction of detailed\noncological history from two narrative sections of clinical progress notes. Our\nteam annotated 2750 entities, 2874 modifiers, and 1623 relationships. The GPT-4\nmodel exhibited overall best performance, with an average BLEU score of 0.69,\nan average ROUGE score of 0.72, and an average accuracy of 67% on complex tasks\n(expert manual evaluation). Notably, it was proficient in tumor characteristic\nand medication extraction, and demonstrated superior performance in inferring\nsymptoms due to cancer and considerations of future medications. The analysis\ndemonstrates that GPT-4 is potentially already usable to extract important\nfacts from cancer progress notes needed for clinical research, complex\npopulation management, and documenting quality patient care.\n","authors":["Madhumita Sushil","Vanessa E. Kennedy","Brenda Y. Miao","Divneet Mandair","Travis Zack","Atul J. 
Butte"],"pdf_url":"https://arxiv.org/pdf/2308.03853v1.pdf","comment":"Source code available at:\n https://github.com/MadhumitaSushil/OncLLMExtraction"},{"id":"http://arxiv.org/abs/2308.03311v1","updated":"2023-08-07T05:40:01Z","published":"2023-08-07T05:40:01Z","title":"CrossTalk: Intelligent Substrates for Language-Oriented Interaction in\n Video-Based Communication and Collaboration","summary":" Despite the advances and ubiquity of digital communication media such as\nvideoconferencing and virtual reality, they remain oblivious to the rich\nintentions expressed by users. Beyond transmitting audio, videos, and messages,\nwe envision digital communication media as proactive facilitators that can\nprovide unobtrusive assistance to enhance communication and collaboration.\nInformed by the results of a formative study, we propose three key design\nconcepts to explore the systematic integration of intelligence into\ncommunication and collaboration, including the panel substrate, language-based\nintent recognition, and lightweight interaction techniques. We developed\nCrossTalk, a videoconferencing system that instantiates these concepts, which\nwas found to enable a more fluid and flexible communication and collaboration\nexperience.\n","authors":["Haijun Xia","Tony Wang","Aditya Gunturu","Peiling Jiang","William Duan","Xiaoshuo Yao"],"pdf_url":"https://arxiv.org/pdf/2308.03311v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2308.03757v1","updated":"2023-08-07T17:59:59Z","published":"2023-08-07T17:59:59Z","title":"3D Motion Magnification: Visualizing Subtle Motions with Time Varying\n Radiance Fields","summary":" Motion magnification helps us visualize subtle, imperceptible motion.\nHowever, prior methods only work for 2D videos captured with a fixed camera. We\npresent a 3D motion magnification method that can magnify subtle motions from\nscenes captured by a moving camera, while supporting novel view rendering. We\nrepresent the scene with time-varying radiance fields and leverage the Eulerian\nprinciple for motion magnification to extract and amplify the variation of the\nembedding of a fixed point over time. We study and validate our proposed\nprinciple for 3D motion magnification using both implicit and tri-plane-based\nradiance fields as our underlying 3D scene representation. We evaluate the\neffectiveness of our method on both synthetic and real-world scenes captured\nunder various camera setups.\n","authors":["Brandon Y. Feng","Hadi Alzayer","Michael Rubinstein","William T. Freeman","Jia-Bin Huang"],"pdf_url":"https://arxiv.org/pdf/2308.03757v1.pdf","comment":"ICCV 2023. See the project page at\n https://3d-motion-magnification.github.io"},{"id":"http://arxiv.org/abs/2209.11359v4","updated":"2023-08-07T17:59:53Z","published":"2022-09-23T01:09:06Z","title":"CUTS: A Fully Unsupervised Framework for Medical Image Segmentation","summary":" In this work we introduce CUTS (Contrastive and Unsupervised Training for\nSegmentation), a fully unsupervised deep learning framework for medical image\nsegmentation to better utilize the vast majority of imaging data that is not\nlabeled or annotated. We utilize self-supervision from pixels and their local\nneighborhoods in the images themselves. Our unsupervised approach optimizes a\ntraining objective that leverages concepts from contrastive learning and\nautoencoding. Our framework segments medical images with a novel two-stage\napproach without relying on any labeled data at any stage. 
The first stage\ninvolves the creation of a \"pixel-centered patch\" that embeds every pixel along\nwith its surrounding patch, using a vector representation in a high-dimensional\nlatent embedding space. The second stage utilizes diffusion condensation, a\nmulti-scale topological data analysis approach, to dynamically coarse-grain\nthese embedding vectors at all levels of granularity. The final outcome is a\nseries of coarse-to-fine segmentations that highlight image structures at\nvarious scales. In this work, we show successful multi-scale segmentation on\nnatural images, retinal fundus images, and brain MRI images. Our framework\ndelineates structures and patterns at different scales which, in the cases of\nmedical images, may carry distinct information relevant to clinical\ninterpretation. Quantitatively, our framework demonstrates improvements ranging\nfrom 10% to 200% on dice coefficient and Hausdorff distance compared to\nexisting unsupervised methods across three medical image datasets. As we tackle\nthe problem of segmenting medical images at multiple meaningful granularities\nwithout relying on any label, we hope to demonstrate the possibility to\ncircumvent tedious and repetitive manual annotations in future practice.\n","authors":["Chen Liu","Matthew Amodio","Liangbo L. Shen","Feng Gao","Arman Avesta","Sanjay Aneja","Jay C. Wang","Lucian V. Del Priore","Smita Krishnaswamy"],"pdf_url":"https://arxiv.org/pdf/2209.11359v4.pdf","comment":"Included new dataset. Ensured evaluation consistency among competing\n methods"},{"id":"http://arxiv.org/abs/2308.03755v1","updated":"2023-08-07T17:59:48Z","published":"2023-08-07T17:59:48Z","title":"FSD V2: Improving Fully Sparse 3D Object Detection with Virtual Voxels","summary":" LiDAR-based fully sparse architecture has garnered increasing attention.\nFSDv1 stands out as a representative work, achieving impressive efficacy and\nefficiency, albeit with intricate structures and handcrafted designs. In this\npaper, we present FSDv2, an evolution that aims to simplify the previous FSDv1\nwhile eliminating the inductive bias introduced by its handcrafted\ninstance-level representation, thus promoting better general applicability. To\nthis end, we introduce the concept of \\textbf{virtual voxels}, which takes over\nthe clustering-based instance segmentation in FSDv1. Virtual voxels not only\naddress the notorious issue of the Center Feature Missing problem in fully\nsparse detectors but also endow the framework with a more elegant and\nstreamlined approach. Consequently, we develop a suite of components to\ncomplement the virtual voxel concept, including a virtual voxel encoder, a\nvirtual voxel mixer, and a virtual voxel assignment strategy. Through empirical\nvalidation, we demonstrate that the virtual voxel mechanism is functionally\nsimilar to the handcrafted clustering in FSDv1 while being more general. We\nconduct experiments on three large-scale datasets: Waymo Open Dataset,\nArgoverse 2 dataset, and nuScenes dataset. Our results showcase\nstate-of-the-art performance on all three datasets, highlighting the\nsuperiority of FSDv2 in long-range scenarios and its general applicability to\nachieve competitive performance across diverse scenarios. Moreover, we provide\ncomprehensive experimental analysis to elucidate the workings of FSDv2. 
To\nfoster reproducibility and further research, we have open-sourced FSDv2 at\nhttps://github.com/tusen-ai/SST.\n","authors":["Lue Fan","Feng Wang","Naiyan Wang","Zhaoxiang Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03755v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.09027v3","updated":"2023-08-07T17:56:54Z","published":"2022-11-12T10:12:17Z","title":"LLEDA -- Lifelong Self-Supervised Domain Adaptation","summary":" Humans and animals have the ability to continuously learn new information\nover their lifetime without losing previously acquired knowledge. However,\nartificial neural networks struggle with this due to new information\nconflicting with old knowledge, resulting in catastrophic forgetting. The\ncomplementary learning systems (CLS) theory suggests that the interplay between\nhippocampus and neocortex systems enables long-term and efficient learning in\nthe mammalian brain, with memory replay facilitating the interaction between\nthese two systems to reduce forgetting. The proposed Lifelong Self-Supervised\nDomain Adaptation (LLEDA) framework draws inspiration from the CLS theory and\nmimics the interaction between two networks: a DA network inspired by the\nhippocampus that quickly adjusts to changes in data distribution and an SSL\nnetwork inspired by the neocortex that gradually learns domain-agnostic general\nrepresentations. LLEDA's latent replay technique facilitates communication\nbetween these two networks by reactivating and replaying the past memory latent\nrepresentations to stabilise long-term generalisation and retention without\ninterfering with the previously learned information. Extensive experiments\ndemonstrate that the proposed method outperforms several other methods\nresulting in a long-term adaptation while being less prone to catastrophic\nforgetting when transferred to new domains.\n","authors":["Mamatha Thota","Dewei Yi","Georgios Leontidis"],"pdf_url":"https://arxiv.org/pdf/2211.09027v3.pdf","comment":"19 pages, 6 figures, 6 tables; V2 added more experiments on more\n domains and fixed typos"},{"id":"http://arxiv.org/abs/2308.03747v1","updated":"2023-08-07T17:53:21Z","published":"2023-08-07T17:53:21Z","title":"Mask Frozen-DETR: High Quality Instance Segmentation with One GPU","summary":" In this paper, we aim to study how to build a strong instance segmenter with\nminimal training time and GPUs, as opposed to the majority of current\napproaches that pursue more accurate instance segmenter by building more\nadvanced frameworks at the cost of longer training time and higher GPU\nrequirements. To achieve this, we introduce a simple and general framework,\ntermed Mask Frozen-DETR, which can convert any existing DETR-based object\ndetection model into a powerful instance segmentation model. Our method only\nrequires training an additional lightweight mask network that predicts instance\nmasks within the bounding boxes given by a frozen DETR-based object detector.\nRemarkably, our method outperforms the state-of-the-art instance segmentation\nmethod Mask DINO in terms of performance on the COCO test-dev split (55.3% vs.\n54.7%) while being over 10X times faster to train. 
Furthermore, all of our\nexperiments can be trained using only one Tesla V100 GPU with 16 GB of memory,\ndemonstrating the significant efficiency of our proposed framework.\n","authors":["Zhanhao Liang","Yuhui Yuan"],"pdf_url":"https://arxiv.org/pdf/2308.03747v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.01390v2","updated":"2023-08-07T17:53:09Z","published":"2023-08-02T19:10:23Z","title":"OpenFlamingo: An Open-Source Framework for Training Large Autoregressive\n Vision-Language Models","summary":" We introduce OpenFlamingo, a family of autoregressive vision-language models\nranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce\nan open-source replication of DeepMind's Flamingo models. On seven\nvision-language datasets, OpenFlamingo models average between 80 - 89% of\ncorresponding Flamingo performance. This technical report describes our models,\ntraining data, hyperparameters, and evaluation suite. We share our models and\ncode at https://github.com/mlfoundations/open_flamingo.\n","authors":["Anas Awadalla","Irena Gao","Josh Gardner","Jack Hessel","Yusuf Hanafy","Wanrong Zhu","Kalyani Marathe","Yonatan Bitton","Samir Gadre","Shiori Sagawa","Jenia Jitsev","Simon Kornblith","Pang Wei Koh","Gabriel Ilharco","Mitchell Wortsman","Ludwig Schmidt"],"pdf_url":"https://arxiv.org/pdf/2308.01390v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.09597v6","updated":"2023-08-07T17:50:52Z","published":"2022-12-19T16:32:42Z","title":"Reasoning with Language Model Prompting: A Survey","summary":" Reasoning, as an essential ability for complex problem-solving, can provide\nback-end support for various real-world applications, such as medical\ndiagnosis, negotiation, etc. This paper provides a comprehensive survey of\ncutting-edge research on reasoning with language model prompting. We introduce\nresearch works with comparisons and summaries and provide systematic resources\nto help beginners. We also discuss the potential reasons for emerging such\nreasoning abilities and highlight future research directions. Resources are\navailable at https://github.com/zjunlp/Prompt4ReasoningPapers (updated\nperiodically).\n","authors":["Shuofei Qiao","Yixin Ou","Ningyu Zhang","Xiang Chen","Yunzhi Yao","Shumin Deng","Chuanqi Tan","Fei Huang","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2212.09597v6.pdf","comment":"ACL 2023, fixed Equation 2"},{"id":"http://arxiv.org/abs/2308.03729v1","updated":"2023-08-07T17:17:05Z","published":"2023-08-07T17:17:05Z","title":"Tiny LVLM-eHub: Early Multimodal Experiments with Bard","summary":" Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated\nsignificant progress in tackling complex multimodal tasks. Among these\ncutting-edge developments, Google's Bard stands out for its remarkable\nmultimodal capabilities, promoting comprehensive comprehension and reasoning\nacross various domains. This work presents an early and holistic evaluation of\nLVLMs' multimodal abilities, with a particular focus on Bard, by proposing a\nlightweight variant of LVLM-eHub, named Tiny LVLM-eHub. In comparison to the\nvanilla version, Tiny LVLM-eHub possesses several appealing properties.\nFirstly, it provides a systematic assessment of six categories of multimodal\ncapabilities, including visual perception, visual knowledge acquisition, visual\nreasoning, visual commonsense, object hallucination, and embodied intelligence,\nthrough quantitative evaluation of $42$ standard text-related visual\nbenchmarks. 
Secondly, it conducts an in-depth analysis of LVLMs' predictions\nusing the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and\naccurate evaluation and exhibits improved alignment with human evaluation\ncompared to the word matching approach. Thirdly, it comprises a mere $2.1$K\nimage-text pairs, facilitating ease of use for practitioners to evaluate their\nown offline LVLMs. Through extensive experimental analysis, this study\ndemonstrates that Bard outperforms previous LVLMs in most multimodal\ncapabilities except object hallucination, to which Bard is still susceptible.\nTiny LVLM-eHub serves as a baseline evaluation for various LVLMs and encourages\ninnovative strategies aimed at advancing multimodal techniques. Our project is\npublicly available at \\url{https://github.com/OpenGVLab/Multi-Modality-Arena}.\n","authors":["Wenqi Shao","Yutao Hu","Peng Gao","Meng Lei","Kaipeng Zhang","Fanqing Meng","Peng Xu","Siyuan Huang","Hongsheng Li","Yu Qiao","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2308.03729v1.pdf","comment":"24 pages, 24 figures, 7 Tables. Project Page:\n http://lvlm-ehub.opengvlab.com/"},{"id":"http://arxiv.org/abs/2308.03726v1","updated":"2023-08-07T17:12:54Z","published":"2023-08-07T17:12:54Z","title":"AdaptiveSAM: Towards Efficient Tuning of SAM for Surgical Scene\n Segmentation","summary":" Segmentation is a fundamental problem in surgical scene analysis using\nartificial intelligence. However, the inherent data scarcity in this domain\nmakes it challenging to adapt traditional segmentation techniques for this\ntask. To tackle this issue, current research employs pretrained models and\nfinetunes them on the given data. Even so, these require training deep networks\nwith millions of parameters every time new data becomes available. A recently\npublished foundation model, Segment-Anything (SAM), generalizes well to a large\nvariety of natural images, hence tackling this challenge to a reasonable\nextent. However, SAM does not generalize well to the medical domain as is\nwithout utilizing a large amount of compute resources for fine-tuning and using\ntask-specific prompts. Moreover, these prompts are in the form of\nbounding-boxes or foreground/background points that need to be annotated\nexplicitly for every image, making this solution increasingly tedious with\nhigher data size. In this work, we propose AdaptiveSAM - an adaptive\nmodification of SAM that can adjust to new datasets quickly and efficiently,\nwhile enabling text-prompted segmentation. For finetuning AdaptiveSAM, we\npropose an approach called bias-tuning that requires a significantly smaller\nnumber of trainable parameters than SAM (less than 2\\%). At the same time,\nAdaptiveSAM requires negligible expert intervention since it uses free-form\ntext as prompt and can segment the object of interest with just the label name\nas prompt. Our experiments show that AdaptiveSAM outperforms current\nstate-of-the-art methods on various medical imaging datasets including surgery,\nultrasound and X-ray. Code is available at\nhttps://github.com/JayParanjape/biastuning\n","authors":["Jay N. Paranjape","Nithin Gopalakrishnan Nair","Shameema Sikder","S. Swaroop Vedula","Vishal M. 
Patel"],"pdf_url":"https://arxiv.org/pdf/2308.03726v1.pdf","comment":"10 pages, 6 figures, 5 tables"},{"id":"http://arxiv.org/abs/2308.03725v1","updated":"2023-08-07T17:07:48Z","published":"2023-08-07T17:07:48Z","title":"Efficient Temporal Sentence Grounding in Videos with Multi-Teacher\n Knowledge Distillation","summary":" Temporal Sentence Grounding in Videos (TSGV) aims to detect the event\ntimestamps described by the natural language query from untrimmed videos. This\npaper discusses the challenge of achieving efficient computation in TSGV models\nwhile maintaining high performance. Most existing approaches exquisitely design\ncomplex architectures to improve accuracy with extra layers and loss, suffering\nfrom inefficiency and heaviness. Although some works have noticed that, they\nonly make an issue of feature fusion layers, which can hardly enjoy the\nhighspeed merit in the whole clunky network. To tackle this problem, we propose\na novel efficient multi-teacher model (EMTM) based on knowledge distillation to\ntransfer diverse knowledge from both heterogeneous and isomorphic networks.\nSpecifically, We first unify different outputs of the heterogeneous models into\none single form. Next, a Knowledge Aggregation Unit (KAU) is built to acquire\nhigh-quality integrated soft labels from multiple teachers. After that, the KAU\nmodule leverages the multi-scale video and global query information to\nadaptively determine the weights of different teachers. A Shared Encoder\nstrategy is then proposed to solve the problem that the student shallow layers\nhardly benefit from teachers, in which an isomorphic teacher is collaboratively\ntrained with the student to align their hidden states. Extensive experimental\nresults on three popular TSGV benchmarks demonstrate that our method is both\neffective and efficient without bells and whistles.\n","authors":["Renjie Liang","Yiming Yang","Hui Lu","Li Li"],"pdf_url":"https://arxiv.org/pdf/2308.03725v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03723v1","updated":"2023-08-07T16:58:48Z","published":"2023-08-07T16:58:48Z","title":"Dimensionality Reduction for Improving Out-of-Distribution Detection in\n Medical Image Segmentation","summary":" Clinically deployed segmentation models are known to fail on data outside of\ntheir training distribution. As these models perform well on most cases, it is\nimperative to detect out-of-distribution (OOD) images at inference to protect\nagainst automation bias. This work applies the Mahalanobis distance post hoc to\nthe bottleneck features of a Swin UNETR model that segments the liver on\nT1-weighted magnetic resonance imaging. By reducing the dimensions of the\nbottleneck features with principal component analysis, OOD images were detected\nwith high performance and minimal computational load.\n","authors":["McKell Woodland","Nihil Patel","Mais Al Taie","Joshua P. Yung","Tucker J. Netherton","Ankit B. Patel","Kristy K. Brock"],"pdf_url":"https://arxiv.org/pdf/2308.03723v1.pdf","comment":"This preprint has not undergone peer review or any post-submission\n improvements or corrections. 
The Version of Record of this contribution will\n be published in the Proceedings of Uncertainty for Safe Utilization of\n Machine Learning in Medical Imaging (5th International Workshop) - Held in\n conjunction with MICCAI 2023"},{"id":"http://arxiv.org/abs/2308.03718v1","updated":"2023-08-07T16:43:46Z","published":"2023-08-07T16:43:46Z","title":"SEM-GAT: Explainable Semantic Pose Estimation using Learned Graph\n Attention","summary":" This paper proposes a GNN-based method for exploiting semantics and local\ngeometry to guide the identification of reliable pointcloud registration\ncandidates. Semantic and morphological features of the environment serve as key\nreference points for registration, enabling accurate lidar-based pose\nestimation. Our novel lightweight static graph structure informs our\nattention-based keypoint node aggregation GNN network by identifying semantic\ninstance-based relationships, acting as inductive bias to significantly reduce\nthe computational burden of pointcloud registration. By connecting candidate\nnodes and exploiting cross-graph attention, we identify confidence scores for\nall potential registration correspondences, estimating the displacement between\npointcloud scans. Our pipeline enables introspective analysis of the model's\nperformance by correlating it with the individual contributions of local\nstructures in the environment, providing valuable insights into the system's\nbehaviour. We test our method on the KITTI odometry dataset, achieving\ncompetitive accuracy compared to benchmark methods and a higher track\nsmoothness while relying on significantly fewer network parameters.\n","authors":["Efimia Panagiotaki","Daniele De Martini","Georgi Pramatarov","Matthew Gadd","Lars Kunze"],"pdf_url":"https://arxiv.org/pdf/2308.03718v1.pdf","comment":"8 pages, 5 figures"},{"id":"http://arxiv.org/abs/2308.03717v1","updated":"2023-08-07T16:40:19Z","published":"2023-08-07T16:40:19Z","title":"Automated Real Time Delineation of Supraclavicular Brachial Plexus in\n Neck Ultrasonography Videos: A Deep Learning Approach","summary":" Peripheral nerve blocks are crucial to treatment of post-surgical pain and\nare associated with reduction in perioperative opioid use and hospital stay.\nAccurate interpretation of sono-anatomy is critical for the success of\nultrasound (US) guided peripheral nerve blocks and can be challenging to the\nnew operators. This prospective study enrolled 227 subjects who were\nsystematically scanned for supraclavicular and interscalene brachial plexus in\nvarious settings using three different US machines to create a dataset of 227\nunique videos. In total, 41,000 video frames were annotated by experienced\nanaesthesiologists using partial automation with object tracking and active\ncontour algorithms. Four baseline neural network models were trained on the\ndataset and their performance was evaluated for object detection and\nsegmentation tasks. Generalizability of the best suited model was then tested\non the datasets constructed from separate US scanners with and without\nfine-tuning. The results demonstrate that deep learning models can be leveraged\nfor real time segmentation of supraclavicular brachial plexus in neck\nultrasonography videos with high accuracy and reliability. Model was also\ntested for its ability to differentiate between supraclavicular and adjoining\ninterscalene brachial plexus. 
The entire dataset has been released publicly for\nfurther study by the research community.\n","authors":["Abhay Tyagi","Abhishek Tyagi","Manpreet Kaur","Jayanthi Sivaswami","Richa Aggarwal","Kapil Dev Soni","Anjan Trikha"],"pdf_url":"https://arxiv.org/pdf/2308.03717v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03712v1","updated":"2023-08-07T16:31:38Z","published":"2023-08-07T16:31:38Z","title":"Scaling may be all you need for achieving human-level object recognition\n capacity with human-like visual experience","summary":" This paper asks whether current self-supervised learning methods, if\nsufficiently scaled up, would be able to reach human-level visual object\nrecognition capabilities with the same type and amount of visual experience\nhumans learn from. Previous work on this question only considered the scaling\nof data size. Here, we consider the simultaneous scaling of data size, model\nsize, and image resolution. We perform a scaling experiment with vision\ntransformers up to 633M parameters in size (ViT-H/14) trained with up to 5K\nhours of human-like video data (long, continuous, mostly egocentric videos)\nwith image resolutions of up to 476x476 pixels. The efficiency of masked\nautoencoders (MAEs) as a self-supervised learning algorithm makes it possible\nto run this scaling experiment on an unassuming academic budget. We find that\nit is feasible to reach human-level object recognition capacity at sub-human\nscales of model size, data size, and image size, if these factors are scaled up\nsimultaneously. To give a concrete example, we estimate that a 2.5B parameter\nViT model trained with 20K hours (2.3 years) of human-like video data with a\nspatial resolution of 952x952 pixels should be able to reach human-level\naccuracy on ImageNet. Human-level competence is thus achievable for a\nfundamental perceptual capability from human-like perceptual experience\n(human-like in both amount and type) with extremely generic learning algorithms\nand architectures and without any substantive inductive biases.\n","authors":["A. Emin Orhan"],"pdf_url":"https://arxiv.org/pdf/2308.03712v1.pdf","comment":"7 pages, 3 figures, 2 tables; code & models available from\n https://github.com/eminorhan/humanlike-vits"},{"id":"http://arxiv.org/abs/2308.03709v1","updated":"2023-08-07T16:30:24Z","published":"2023-08-07T16:30:24Z","title":"Prototype Learning for Out-of-Distribution Polyp Segmentation","summary":" Existing polyp segmentation models from colonoscopy images often fail to\nprovide reliable segmentation results on datasets from different centers,\nlimiting their applicability. Our objective in this study is to create a robust\nand well-generalized segmentation model named PrototypeLab that can assist in\npolyp segmentation. To achieve this, we incorporate various lighting modes such\nas White light imaging (WLI), Blue light imaging (BLI), Linked color imaging\n(LCI), and Flexible spectral imaging color enhancement (FICE) into our new\nsegmentation model, that learns to create prototypes for each class of object\npresent in the images. These prototypes represent the characteristic features\nof the objects, such as their shape, texture, color. Our model is designed to\nperform effectively on out-of-distribution (OOD) datasets from multiple\ncenters. We first generate a coarse mask that is used to learn prototypes for\nthe main object class, which are then employed to generate the final\nsegmentation mask. 
By using prototypes to represent the main class, our\napproach handles the variability present in the medical images and generalize\nwell to new data since prototype capture the underlying distribution of the\ndata. PrototypeLab offers a promising solution with a dice coefficient of\n$\\geq$ 90\\% and mIoU $\\geq$ 85\\% with a near real-time processing speed for\npolyp segmentation. It achieved superior performance on OOD datasets compared\nto 16 state-of-the-art image segmentation architectures, potentially improving\nclinical outcomes. Codes are available at\nhttps://github.com/xxxxx/PrototypeLab.\n","authors":["Nikhil Kumar Tomar","Debesh Jha","Ulas Bagci"],"pdf_url":"https://arxiv.org/pdf/2308.03709v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03703v1","updated":"2023-08-07T16:22:47Z","published":"2023-08-07T16:22:47Z","title":"Video-based Person Re-identification with Long Short-Term Representation\n Learning","summary":" Video-based person Re-Identification (V-ReID) aims to retrieve specific\npersons from raw videos captured by non-overlapped cameras. As a fundamental\ntask, it spreads many multimedia and computer vision applications. However, due\nto the variations of persons and scenes, there are still many obstacles that\nmust be overcome for high performance. In this work, we notice that both the\nlong-term and short-term information of persons are important for robust video\nrepresentations. Thus, we propose a novel deep learning framework named Long\nShort-Term Representation Learning (LSTRL) for effective V-ReID. More\nspecifically, to extract long-term representations, we propose a\nMulti-granularity Appearance Extractor (MAE), in which four granularity\nappearances are effectively captured across multiple frames. Meanwhile, to\nextract short-term representations, we propose a Bi-direction Motion Estimator\n(BME), in which reciprocal motion information is efficiently extracted from\nconsecutive frames. The MAE and BME are plug-and-play and can be easily\ninserted into existing networks for efficient feature learning. As a result,\nthey significantly improve the feature representation ability for V-ReID.\nExtensive experiments on three widely used benchmarks show that our proposed\napproach can deliver better performances than most state-of-the-arts.\n","authors":["Xuehu Liu","Pingping Zhang","Huchuan Lu"],"pdf_url":"https://arxiv.org/pdf/2308.03703v1.pdf","comment":"This work is accepted by ICIG2023, including 13 pages, 5 figures and\n 5 tables. Modifications may be performed for further improvements"},{"id":"http://arxiv.org/abs/2308.03698v1","updated":"2023-08-07T16:14:27Z","published":"2023-08-07T16:14:27Z","title":"Screen-based 3D Subjective Experiment Software","summary":" Recently, widespread 3D graphics (e.g., point clouds and meshes) have drawn\nconsiderable efforts from academia and industry to assess their perceptual\nquality by conducting subjective experiments. However, lacking a handy software\nfor 3D subjective experiments complicates the construction of 3D graphics\nquality assessment datasets, thus hindering the prosperity of relevant fields.\nIn this paper, we develop a powerful platform with which users can flexibly\ndesign their 3D subjective methodologies and build high-quality datasets,\neasing a broad spectrum of 3D graphics subjective quality study. 
To accurately\nillustrate the perceptual quality differences of 3D stimuli, our software can\nsimultaneously render the source stimulus and impaired stimulus and allows both\nstimuli to respond synchronously to viewer interactions. Compared with amateur\n3D visualization tool-based or image/video rendering-based schemes, our\napproach embodies typical 3D applications while minimizing cognitive overload\nduring subjective experiments. We organized a subjective experiment involving\n40 participants to verify the validity of the proposed software. Experimental\nanalyses demonstrate that subjective tests on our software can produce\nreasonable subjective quality scores of 3D models. All resources in this paper\ncan be found at https://openi.pcl.ac.cn/OpenDatasets/3DQA.\n","authors":["Songlin Fan","Wei Gao"],"pdf_url":"https://arxiv.org/pdf/2308.03698v1.pdf","comment":"Accepted to ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2308.03685v1","updated":"2023-08-07T16:00:22Z","published":"2023-08-07T16:00:22Z","title":"Learning Concise and Descriptive Attributes for Visual Recognition","summary":" Recent advances in foundation models present new opportunities for\ninterpretable visual recognition -- one can first query Large Language Models\n(LLMs) to obtain a set of attributes that describe each class, then apply\nvision-language models to classify images via these attributes. Pioneering work\nshows that querying thousands of attributes can achieve performance competitive\nwith image features. However, our further investigation on 8 datasets reveals\nthat LLM-generated attributes in a large quantity perform almost the same as\nrandom words. This surprising finding suggests that significant noise may be\npresent in these attributes. We hypothesize that there exist subsets of\nattributes that can maintain the classification performance with much smaller\nsizes, and propose a novel learning-to-search method to discover those concise\nsets of attributes. As a result, on the CUB dataset, our method achieves\nperformance close to that of massive LLM-generated attributes (e.g., 10k\nattributes for CUB), yet using only 32 attributes in total to distinguish 200\nbird species. Furthermore, our new paradigm demonstrates several additional\nbenefits: higher interpretability and interactivity for humans, and the ability\nto summarize knowledge for a recognition task.\n","authors":["An Yan","Yu Wang","Yiwu Zhong","Chengyu Dong","Zexue He","Yujie Lu","William Wang","Jingbo Shang","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2308.03685v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2308.03670v1","updated":"2023-08-07T15:44:58Z","published":"2023-08-07T15:44:58Z","title":"Improving FHB Screening in Wheat Breeding Using an Efficient Transformer\n Model","summary":" Fusarium head blight is a devastating disease that causes significant\neconomic losses annually on small grains. Efficiency, accuracy, and timely\ndetection of FHB in the resistance screening are critical for wheat and barley\nbreeding programs. In recent years, various image processing techniques have\nbeen developed using supervised machine learning algorithms for the early\ndetection of FHB. The state-of-the-art convolutional neural network-based\nmethods, such as U-Net, employ a series of encoding blocks to create a local\nrepresentation and a series of decoding blocks to capture the semantic\nrelations. 
However, these methods are not often capable of long-range modeling\ndependencies inside the input data, and their ability to model multi-scale\nobjects with significant variations in texture and shape is limited. Vision\ntransformers as alternative architectures with innate global self-attention\nmechanisms for sequence-to-sequence prediction, due to insufficient low-level\ndetails, may also limit localization capabilities. To overcome these\nlimitations, a new Context Bridge is proposed to integrate the local\nrepresentation capability of the U-Net network in the transformer model. In\naddition, the standard attention mechanism of the original transformer is\nreplaced with Efficient Self-attention, which is less complicated than other\nstate-of-the-art methods. To train the proposed network, 12,000 wheat images\nfrom an FHB-inoculated wheat field at the SDSU research farm in Volga, SD, were\ncaptured. In addition to healthy and unhealthy plants, these images encompass\nvarious stages of the disease. A team of expert pathologists annotated the\nimages for training and evaluating the developed model. As a result, the\neffectiveness of the transformer-based method for FHB-disease detection,\nthrough extensive experiments across typical tasks for plant image\nsegmentation, is demonstrated.\n","authors":["Babak Azad","Ahmed Abdalla","Kwanghee Won","Ali Mirzakhani Nafchi"],"pdf_url":"https://arxiv.org/pdf/2308.03670v1.pdf","comment":"10 pages, 5 figures, 1 table. Presented at the 2023 ASABE Annual\n International Meeting conference in Omaha, Nebraska. Also available at\n https://elibrary.asabe.org/abstract.asp?aid=54149"},{"id":"http://arxiv.org/abs/2307.16177v2","updated":"2023-08-07T15:22:37Z","published":"2023-07-30T09:15:38Z","title":"Fusing VHR Post-disaster Aerial Imagery and LiDAR Data for Roof\n Classification in the Caribbean using CNNs","summary":" Accurate and up-to-date information on building characteristics is essential\nfor vulnerability assessment; however, the high costs and long timeframes\nassociated with conducting traditional field surveys can be an obstacle to\nobtaining critical exposure datasets needed for disaster risk management. In\nthis work, we leverage deep learning techniques for the automated\nclassification of roof characteristics from very high-resolution orthophotos\nand airborne LiDAR data obtained in Dominica following Hurricane Maria in 2017.\nWe demonstrate that the fusion of multimodal earth observation data performs\nbetter than using any single data source alone. Using our proposed methods, we\nachieve F1 scores of 0.93 and 0.92 for roof type and roof material\nclassification, respectively. This work is intended to help governments produce\nmore timely building information to improve resilience and disaster response in\nthe Caribbean.\n","authors":["Isabelle Tingzon","Nuala Margaret Cowan","Pierre Chrzanowski"],"pdf_url":"https://arxiv.org/pdf/2307.16177v2.pdf","comment":"2023 ICCV Humanitarian Assistance and Disaster Response Workshop"},{"id":"http://arxiv.org/abs/2308.03654v1","updated":"2023-08-07T15:10:21Z","published":"2023-08-07T15:10:21Z","title":"FFF: Fragments-Guided Flexible Fitting for Building Complete Protein\n Structures","summary":" Cryo-electron microscopy (cryo-EM) is a technique for reconstructing the\n3-dimensional (3D) structure of biomolecules (especially large protein\ncomplexes and molecular assemblies). As the resolution increases to the\nnear-atomic scale, building protein structures de novo from cryo-EM maps\nbecomes possible. 
Recently, recognition-based de novo building methods have\nshown the potential to streamline this process. However, it cannot build a\ncomplete structure due to the low signal-to-noise ratio (SNR) problem. At the\nsame time, AlphaFold has led to a great breakthrough in predicting protein\nstructures. This has inspired us to combine fragment recognition and structure\nprediction methods to build a complete structure. In this paper, we propose a\nnew method named FFF that bridges protein structure prediction and protein\nstructure recognition with flexible fitting. First, a multi-level recognition\nnetwork is used to capture various structural features from the input 3D\ncryo-EM map. Next, protein structural fragments are generated using pseudo\npeptide vectors and a protein sequence alignment method based on these\nextracted features. Finally, a complete structural model is constructed using\nthe predicted protein fragments via flexible fitting. Based on our benchmark\ntests, FFF outperforms the baseline methods for building complete protein\nstructures.\n","authors":["Weijie Chen","Xinyan Wang","Yuhang Wang"],"pdf_url":"https://arxiv.org/pdf/2308.03654v1.pdf","comment":"Published in the Proceedings of the IEEE/CVF Conference on Computer\n Vision and Pattern Recognition (CVPR), 2023"},{"id":"http://arxiv.org/abs/2308.03652v1","updated":"2023-08-07T15:07:21Z","published":"2023-08-07T15:07:21Z","title":"WarpEM: Dynamic Time Warping for Accurate Catheter Registration in\n EM-guided Procedures","summary":" Accurate catheter tracking is crucial during minimally invasive endovascular\nprocedures (MIEP), and electromagnetic (EM) tracking is a widely used\ntechnology that serves this purpose. However, registration between preoperative\nimages and the EM tracking system is often challenging. Existing registration\nmethods typically require manual interactions, which can be time-consuming,\nincrease the risk of errors and change the procedural workflow. Although\nseveral registration methods are available for catheter tracking, such as\nmarker-based and path-based approaches, their limitations can impact the\naccuracy of the resulting tracking solution, consequently, the outcome of the\nmedical procedure.\n This paper introduces a novel automated catheter registration method for\nEM-guided MIEP. The method utilizes 3D signal temporal analysis, such as\nDynamic Time Warping (DTW) algorithms, to improve registration accuracy and\nreliability compared to existing methods. DTW can accurately warp and match\nEM-tracked paths to the vessel's centerline, making it particularly suitable\nfor registration. The introduced registration method is evaluated for accuracy\nin a vascular phantom using a marker-based registration as the ground truth.\nThe results indicate that the DTW method yields accurate and reliable\nregistration outcomes, with a mean error of $2.22$mm. 
The introduced\nregistration method presents several advantages over state-of-the-art methods,\nsuch as high registration accuracy, no initialization required, and increased\nautomation.\n","authors":["Ardit Ramadani","Peter Ewert","Heribert Schunkert","Nassir Navab"],"pdf_url":"https://arxiv.org/pdf/2308.03652v1.pdf","comment":"The 26th International Conference on Medical Image Computing and\n Computer Assisted Intervention, MICCAI 2023"},{"id":"http://arxiv.org/abs/2308.03631v1","updated":"2023-08-07T14:36:49Z","published":"2023-08-07T14:36:49Z","title":"Segmentation Framework for Heat Loss Identification in Thermal Images:\n Empowering Scottish Retrofitting and Thermographic Survey Companies","summary":" Retrofitting and thermographic survey (TS) companies in Scotland collaborate\nwith social housing providers to tackle fuel poverty. They employ ground-level\ninfrared (IR) camera-based-TSs (GIRTSs) for collecting thermal images to\nidenti-fy the heat loss sources resulting from poor insulation. However, this\nidentifica-tion process is labor-intensive and time-consuming, necessitating\nextensive data processing. To automate this, an AI-driven approach is\nnecessary. Therefore, this study proposes a deep learning (DL)-based\nsegmentation framework using the Mask Region Proposal Convolutional Neural\nNetwork (Mask RCNN) to validate its applicability to these thermal images. The\nobjective of the framework is to au-tomatically identify, and crop heat loss\nsources caused by weak insulation, while also eliminating obstructive objects\npresent in those images. By doing so, it min-imizes labor-intensive tasks and\nprovides an automated, consistent, and reliable solution. To validate the\nproposed framework, approximately 2500 thermal imag-es were collected in\ncollaboration with industrial TS partner. Then, 1800 repre-sentative images\nwere carefully selected with the assistance of experts and anno-tated to\nhighlight the target objects (TO) to form the final dataset. Subsequently, a\ntransfer learning strategy was employed to train the dataset, progressively\naug-menting the training data volume and fine-tuning the pre-trained baseline\nMask RCNN. As a result, the final fine-tuned model achieved a mean average\nprecision (mAP) score of 77.2% for segmenting the TO, demonstrating the\nsignificant po-tential of proposed framework in accurately quantifying energy\nloss in Scottish homes.\n","authors":["Md Junayed Hasan","Eyad Elyan","Yijun Yan","Jinchang Ren","Md Mostafa Kamal Sarker"],"pdf_url":"https://arxiv.org/pdf/2308.03631v1.pdf","comment":"9 Pages, 3 Figures, Accepted from the conference - BICS 2023: 2023\n International Conference on Brain-Inspired Cognitive Systems Kuala Lumpur,\n Malaysia, August 5-6, 2023 [peer-reviewed]"},{"id":"http://arxiv.org/abs/2308.03624v1","updated":"2023-08-07T14:31:07Z","published":"2023-08-07T14:31:07Z","title":"MOMA-Force: Visual-Force Imitation for Real-World Mobile Manipulation","summary":" In this paper, we present a novel method for mobile manipulators to perform\nmultiple contact-rich manipulation tasks. While learning-based methods have the\npotential to generate actions in an end-to-end manner, they often suffer from\ninsufficient action accuracy and robustness against noise. On the other hand,\nclassical control-based methods can enhance system robustness, but at the cost\nof extensive parameter tuning. 
To address these challenges, we present\nMOMA-Force, a visual-force imitation method that seamlessly combines\nrepresentation learning for perception, imitation learning for complex motion\ngeneration, and admittance whole-body control for system robustness and\ncontrollability. MOMA-Force enables a mobile manipulator to learn multiple\ncomplex contact-rich tasks with high success rates and small contact forces. In\na real household setting, our method outperforms baseline methods in terms of\ntask success rates. Moreover, our method achieves smaller contact forces and\nsmaller force variances compared to baseline methods without force imitation.\nOverall, we offer a promising approach for efficient and robust mobile\nmanipulation in the real world. Videos and more details can be found on\n\\url{https://visual-force-imitation.github.io}\n","authors":["Taozheng Yang","Ya Jing","Hongtao Wu","Jiafeng Xu","Kuankuan Sima","Guangzeng Chen","Qie Sima","Tao Kong"],"pdf_url":"https://arxiv.org/pdf/2308.03624v1.pdf","comment":"IEEE/RSJ International Conference on Intelligent Robots and Systems\n (IROS), 2023"},{"id":"http://arxiv.org/abs/2308.03620v1","updated":"2023-08-07T14:24:52Z","published":"2023-08-07T14:24:52Z","title":"Exploring Visual Pre-training for Robot Manipulation: Datasets, Models\n and Methods","summary":" Visual pre-training with large-scale real-world data has made great progress\nin recent years, showing great potential in robot learning with pixel\nobservations. However, the recipes of visual pre-training for robot\nmanipulation tasks are yet to be built. In this paper, we thoroughly\ninvestigate the effects of visual pre-training strategies on robot manipulation\ntasks from three fundamental perspectives: pre-training datasets, model\narchitectures and training methods. Several significant experimental findings\nare provided that are beneficial for robot learning. Further, we propose a\nvisual pre-training scheme for robot manipulation termed Vi-PRoM, which\ncombines self-supervised learning and supervised learning. Concretely, the\nformer employs contrastive learning to acquire underlying patterns from\nlarge-scale unlabeled data, while the latter aims learning visual semantics and\ntemporal dynamics. Extensive experiments on robot manipulations in various\nsimulation environments and the real robot demonstrate the superiority of the\nproposed scheme. Videos and more details can be found on\n\\url{https://explore-pretrain-robot.github.io}.\n","authors":["Ya Jing","Xuelin Zhu","Xingbin Liu","Qie Sima","Taozheng Yang","Yunhai Feng","Tao Kong"],"pdf_url":"https://arxiv.org/pdf/2308.03620v1.pdf","comment":"IEEE/RSJ International Conference on Intelligent Robots and Systems\n (IROS), 2023"},{"id":"http://arxiv.org/abs/2308.03613v1","updated":"2023-08-07T14:16:52Z","published":"2023-08-07T14:16:52Z","title":"Adaptive Semi-Supervised Segmentation of Brain Vessels with Ambiguous\n Labels","summary":" Accurate segmentation of brain vessels is crucial for cerebrovascular disease\ndiagnosis and treatment. However, existing methods face challenges in capturing\nsmall vessels and handling datasets that are partially or ambiguously\nannotated. In this paper, we propose an adaptive semi-supervised approach to\naddress these challenges. Our approach incorporates innovative techniques\nincluding progressive semi-supervised learning, adaptative training strategy,\nand boundary enhancement. 
Experimental results on 3DRA datasets demonstrate the\nsuperiority of our method in terms of mesh-based segmentation metrics. By\nleveraging the partially and ambiguously labeled data, which only annotates the\nmain vessels, our method achieves impressive segmentation performance on\nmislabeled fine vessels, showcasing its potential for clinical applications.\n","authors":["Fengming Lin","Yan Xia","Nishant Ravikumar","Qiongyao Liu","Michael MacRaild","Alejandro F Frangi"],"pdf_url":"https://arxiv.org/pdf/2308.03613v1.pdf","comment":"Accepted by DALI MICCAI 2023"},{"id":"http://arxiv.org/abs/2308.03610v1","updated":"2023-08-07T14:09:46Z","published":"2023-08-07T14:09:46Z","title":"AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose","summary":" Creating expressive, diverse and high-quality 3D avatars from highly\ncustomized text descriptions and pose guidance is a challenging task, due to\nthe intricacy of modeling and texturing in 3D that ensure details and various\nstyles (realistic, fictional, etc). We present AvatarVerse, a stable pipeline\nfor generating expressive high-quality 3D avatars from nothing but text\ndescriptions and pose guidance. In specific, we introduce a 2D diffusion model\nconditioned on DensePose signal to establish 3D pose control of avatars through\n2D images, which enhances view consistency from partially observed scenarios.\nIt addresses the infamous Janus Problem and significantly stablizes the\ngeneration process. Moreover, we propose a progressive high-resolution 3D\nsynthesis strategy, which obtains substantial improvement over the quality of\nthe created 3D avatars. To this end, the proposed AvatarVerse pipeline achieves\nzero-shot 3D modeling of 3D avatars that are not only more expressive, but also\nin higher quality and fidelity than previous works. Rigorous qualitative\nevaluations and user studies showcase AvatarVerse's superiority in synthesizing\nhigh-fidelity 3D avatars, leading to a new standard in high-quality and stable\n3D avatar creation. Our project page is: https://avatarverse3d.github.io\n","authors":["Huichao Zhang","Bowen Chen","Hao Yang","Liao Qu","Xu Wang","Li Chen","Chao Long","Feida Zhu","Kang Du","Min Zheng"],"pdf_url":"https://arxiv.org/pdf/2308.03610v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03608v1","updated":"2023-08-07T14:09:08Z","published":"2023-08-07T14:09:08Z","title":"Recurrent Self-Supervised Video Denoising with Denser Receptive Field","summary":" Self-supervised video denoising has seen decent progress through the use of\nblind spot networks. However, under their blind spot constraints, previous\nself-supervised video denoising methods suffer from significant information\nloss and texture destruction in either the whole reference frame or neighbor\nframes, due to their inadequate consideration of the receptive field. Moreover,\nthe limited number of available neighbor frames in previous methods leads to\nthe discarding of distant temporal information. Nonetheless, simply adopting\nexisting recurrent frameworks does not work, since they easily break the\nconstraints on the receptive field imposed by self-supervision. In this paper,\nwe propose RDRF for self-supervised video denoising, which not only fully\nexploits both the reference and neighbor frames with a denser receptive field,\nbut also better leverages the temporal information from both local and distant\nneighbor features. 
First, towards a comprehensive utilization of information\nfrom both reference and neighbor frames, RDRF realizes a denser receptive field\nby taking more neighbor pixels along the spatial and temporal dimensions.\nSecond, it features a self-supervised recurrent video denoising framework,\nwhich concurrently integrates distant and near-neighbor temporal features. This\nenables long-term bidirectional information aggregation, while mitigating error\naccumulation in the plain recurrent framework. Our method exhibits superior\nperformance on both synthetic and real video denoising datasets. Codes will be\navailable at https://github.com/Wang-XIaoDingdd/RDRF.\n","authors":["Zichun Wang","Yulun Zhang","Debing Zhang","Ying Fu"],"pdf_url":"https://arxiv.org/pdf/2308.03608v1.pdf","comment":"Accepted to ACMMM 2023"},{"id":"http://arxiv.org/abs/2303.14643v2","updated":"2023-08-07T14:08:44Z","published":"2023-03-26T06:59:23Z","title":"POAR: Towards Open Vocabulary Pedestrian Attribute Recognition","summary":" Pedestrian attribute recognition (PAR) aims to predict the attributes of a\ntarget pedestrian in a surveillance system. Existing methods address the PAR\nproblem by training a multi-label classifier with predefined attribute classes.\nHowever, it is impossible to exhaust all pedestrian attributes in the real\nworld. To tackle this problem, we develop a novel pedestrian open-attribute\nrecognition (POAR) framework. Our key idea is to formulate the POAR problem as\nan image-text search problem. We design a Transformer-based image encoder with\na masking strategy. A set of attribute tokens are introduced to focus on\nspecific pedestrian parts (e.g., head, upper body, lower body, feet, etc.) and\nencode corresponding attributes into visual embeddings. Each attribute category\nis described as a natural language sentence and encoded by the text encoder.\nThen, we compute the similarity between the visual and text embeddings of\nattributes to find the best attribute descriptions for the input images.\nDifferent from existing methods that learn a specific classifier for each\nattribute category, we model the pedestrian at a part-level and explore the\nsearching method to handle the unseen attributes. Finally, a many-to-many\ncontrastive (MTMC) loss with masked tokens is proposed to train the network\nsince a pedestrian image can comprise multiple attributes. Extensive\nexperiments have been conducted on benchmark PAR datasets with an\nopen-attribute setting. The results verified the effectiveness of the proposed\nPOAR method, which can form a strong baseline for the POAR task. Our code is\navailable at \\url{https://github.com/IvyYZ/POAR}.\n","authors":["Yue Zhang","Suchen Wang","Shichao Kan","Zhenyu Weng","Yigang Cen","Yap-peng Tan"],"pdf_url":"https://arxiv.org/pdf/2303.14643v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03594v1","updated":"2023-08-07T13:52:21Z","published":"2023-08-07T13:52:21Z","title":"FeatEnHancer: Enhancing Hierarchical Features for Object Detection and\n Beyond Under Low-Light Vision","summary":" Extracting useful visual cues for the downstream tasks is especially\nchallenging under low-light vision. Prior works create enhanced representations\nby either correlating visual quality with machine perception or designing\nillumination-degrading transformation methods that require pre-training on\nsynthetic datasets. We argue that optimizing enhanced image representation\npertaining to the loss of the downstream task can result in more expressive\nrepresentations. 
Therefore, in this work, we propose a novel module,\nFeatEnHancer, that hierarchically combines multiscale features using\nmultiheaded attention guided by task-related loss function to create suitable\nrepresentations. Furthermore, our intra-scale enhancement improves the quality\nof features extracted at each scale or level, as well as combines features from\ndifferent scales in a way that reflects their relative importance for the task\nat hand. FeatEnHancer is a general-purpose plug-and-play module and can be\nincorporated into any low-light vision pipeline. We show with extensive\nexperimentation that the enhanced representation produced with FeatEnHancer\nsignificantly and consistently improves results in several low-light vision\ntasks, including dark object detection (+5.7 mAP on ExDark), face detection\n(+1.5 mAPon DARK FACE), nighttime semantic segmentation (+5.1 mIoU on ACDC ),\nand video object detection (+1.8 mAP on DarkVision), highlighting the\neffectiveness of enhancing hierarchical features under low-light vision.\n","authors":["Khurram Azeem Hashmi","Goutham Kallempudi","Didier Stricker","Muhammamd Zeshan Afzal"],"pdf_url":"https://arxiv.org/pdf/2308.03594v1.pdf","comment":"19 pages, 9 Figures, and 10 Tables. Accepted at ICCV2023"},{"id":"http://arxiv.org/abs/2308.03586v1","updated":"2023-08-07T13:44:44Z","published":"2023-08-07T13:44:44Z","title":"SoilNet: An Attention-based Spatio-temporal Deep Learning Framework for\n Soil Organic Carbon Prediction with Digital Soil Mapping in Europe","summary":" Digital soil mapping (DSM) is an advanced approach that integrates\nstatistical modeling and cutting-edge technologies, including machine learning\n(ML) methods, to accurately depict soil properties and their spatial\ndistribution. Soil organic carbon (SOC) is a crucial soil attribute providing\nvaluable insights into soil health, nutrient cycling, greenhouse gas emissions,\nand overall ecosystem productivity. This study highlights the significance of\nspatial-temporal deep learning (DL) techniques within the DSM framework. A\nnovel architecture is proposed, incorporating spatial information using a base\nconvolutional neural network (CNN) model and spatial attention mechanism, along\nwith climate temporal information using a long short-term memory (LSTM)\nnetwork, for SOC prediction across Europe. The model utilizes a comprehensive\nset of environmental features, including Landsat-8 images, topography, remote\nsensing indices, and climate time series, as input features. Results\ndemonstrate that the proposed framework outperforms conventional ML approaches\nlike random forest commonly used in DSM, yielding lower root mean square error\n(RMSE). This model is a robust tool for predicting SOC and could be applied to\nother soil properties, thereby contributing to the advancement of DSM\ntechniques and facilitating land management and decision-making processes based\non accurate information.\n","authors":["Nafiseh Kakhani","Moien Rangzan","Ali Jamali","Sara Attarchi","Seyed Kazem Alavipanah","Thomas Scholten"],"pdf_url":"https://arxiv.org/pdf/2308.03586v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2308.03580v1","updated":"2023-08-07T13:35:53Z","published":"2023-08-07T13:35:53Z","title":"Revealing the Underlying Patterns: Investigating Dataset Similarity,\n Performance, and Generalization","summary":" Supervised deep learning models require significant amount of labelled data\nto achieve an acceptable performance on a specific task. 
However, when tested\non unseen data, the models may not perform well. Therefore, the models need to\nbe trained with additional and varying labelled data to improve the\ngeneralization. In this work, our goal is to understand the models, their\nperformance and generalization. We establish image-image, dataset-dataset, and\nimage-dataset distances to gain insights into the model's behavior. Our\nproposed distance metric when combined with model performance can help in\nselecting an appropriate model/architecture from a pool of candidate\narchitectures. We have shown that the generalization of these models can be\nimproved by only adding a small number of unseen images (say 1, 3 or 7) into\nthe training set. Our proposed approach reduces training and annotation costs\nwhile providing an estimate of model performance on unseen data in dynamic\nenvironments.\n","authors":["Akshit Achara","Ram Krishna Pandey"],"pdf_url":"https://arxiv.org/pdf/2308.03580v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.08083v4","updated":"2023-08-07T13:24:06Z","published":"2022-06-16T10:53:18Z","title":"CARLANE: A Lane Detection Benchmark for Unsupervised Domain Adaptation\n from Simulation to multiple Real-World Domains","summary":" Unsupervised Domain Adaptation demonstrates great potential to mitigate\ndomain shifts by transferring models from labeled source domains to unlabeled\ntarget domains. While Unsupervised Domain Adaptation has been applied to a wide\nvariety of complex vision tasks, only few works focus on lane detection for\nautonomous driving. This can be attributed to the lack of publicly available\ndatasets. To facilitate research in these directions, we propose CARLANE, a\n3-way sim-to-real domain adaptation benchmark for 2D lane detection. CARLANE\nencompasses the single-target datasets MoLane and TuLane and the multi-target\ndataset MuLane. These datasets are built from three different domains, which\ncover diverse scenes and contain a total of 163K unique images, 118K of which\nare annotated. In addition we evaluate and report systematic baselines,\nincluding our own method, which builds upon Prototypical Cross-domain\nSelf-supervised Learning. We find that false positive and false negative rates\nof the evaluated domain adaptation methods are high compared to those of fully\nsupervised baselines. This affirms the need for benchmarks such as CARLANE to\nfurther strengthen research in Unsupervised Domain Adaptation for lane\ndetection. CARLANE, all evaluated models and the corresponding implementations\nare publicly available at https://carlanebenchmark.github.io.\n","authors":["Julian Gebele","Bonifaz Stuhr","Johann Haselberger"],"pdf_url":"https://arxiv.org/pdf/2206.08083v4.pdf","comment":"36th Conference on Neural Information Processing Systems (NeurIPS\n 2022) Track on Datasets and Benchmarks, 22 pages, 11 figures"},{"id":"http://arxiv.org/abs/2304.09534v2","updated":"2023-08-07T13:00:47Z","published":"2023-04-19T09:52:50Z","title":"Realistic Data Enrichment for Robust Image Segmentation in\n Histopathology","summary":" Poor performance of quantitative analysis in histopathological Whole Slide\nImages (WSI) has been a significant obstacle in clinical practice. Annotating\nlarge-scale WSIs manually is a demanding and time-consuming task, unlikely to\nyield the expected results when used for fully supervised learning systems.\nRarely observed disease patterns and large differences in object scales are\ndifficult to model through conventional patient intake. 
Prior methods either\nfall back to direct disease classification, which only requires learning a few\nfactors per image, or report on average image segmentation performance, which\nis highly biased towards majority observations. Geometric image augmentation is\ncommonly used to improve robustness for average case predictions and to enrich\nlimited datasets. So far no method provided sampling of a realistic posterior\ndistribution to improve stability, e.g. for the segmentation of imbalanced\nobjects within images. Therefore, we propose a new approach, based on diffusion\nmodels, which can enrich an imbalanced dataset with plausible examples from\nunderrepresented groups by conditioning on segmentation maps. Our method can\nsimply expand limited clinical datasets making them suitable to train machine\nlearning pipelines, and provides an interpretable and human-controllable way of\ngenerating histopathology images that are indistinguishable from real ones to\nhuman experts. We validate our findings on two datasets, one from the public\ndomain and one from a Kidney Transplant study.\n","authors":["Sarah Cechnicka","James Ball","Hadrien Reynaud","Callum Arthurs","Candice Roufosse","Bernhard Kainz"],"pdf_url":"https://arxiv.org/pdf/2304.09534v2.pdf","comment":"11 pages, 2 figures, 1 table"},{"id":"http://arxiv.org/abs/2308.03529v1","updated":"2023-08-07T12:26:34Z","published":"2023-08-07T12:26:34Z","title":"Feature Decoupling-Recycling Network for Fast Interactive Segmentation","summary":" Recent interactive segmentation methods iteratively take source image, user\nguidance and previously predicted mask as the input without considering the\ninvariant nature of the source image. As a result, extracting features from the\nsource image is repeated in each interaction, resulting in substantial\ncomputational redundancy. In this work, we propose the Feature\nDecoupling-Recycling Network (FDRN), which decouples the modeling components\nbased on their intrinsic discrepancies and then recycles components for each\nuser interaction. Thus, the efficiency of the whole interactive process can be\nsignificantly improved. To be specific, we apply the Decoupling-Recycling\nstrategy from three perspectives to address three types of discrepancies,\nrespectively. First, our model decouples the learning of source image semantics\nfrom the encoding of user guidance to process two types of input domains\nseparately. Second, FDRN decouples high-level and low-level features from\nstratified semantic representations to enhance feature learning. Third, during\nthe encoding of user guidance, current user guidance is decoupled from\nhistorical guidance to highlight the effect of current user guidance. 
We\nconduct extensive experiments on 6 datasets from different domains and\nmodalities, which demonstrate the following merits of our model: 1) superior\nefficiency than other methods, particularly advantageous in challenging\nscenarios requiring long-term interactions (up to 4.25x faster), while\nachieving favorable segmentation performance; 2) strong applicability to\nvarious methods serving as a universal enhancement technique; 3) well\ncross-task generalizability, e.g., to medical image segmentation, and\nrobustness against misleading user guidance.\n","authors":["Huimin Zeng","Weinong Wang","Xin Tao","Zhiwei Xiong","Yu-Wing Tai","Wenjie Pei"],"pdf_url":"https://arxiv.org/pdf/2308.03529v1.pdf","comment":"Accepted to ACM MM 2023"},{"id":"http://arxiv.org/abs/2307.14863v2","updated":"2023-08-07T12:13:05Z","published":"2023-07-27T13:49:27Z","title":"IML-ViT: Benchmarking Image Manipulation Localization by Vision\n Transformer","summary":" Advanced image tampering techniques are increasingly challenging the\ntrustworthiness of multimedia, leading to the development of Image Manipulation\nLocalization (IML). But what makes a good IML model? The answer lies in the way\nto capture artifacts. Exploiting artifacts requires the model to extract\nnon-semantic discrepancies between manipulated and authentic regions,\nnecessitating explicit comparisons between the two areas. With the\nself-attention mechanism, naturally, the Transformer should be a better\ncandidate to capture artifacts. However, due to limited datasets, there is\ncurrently no pure ViT-based approach for IML to serve as a benchmark, and CNNs\ndominate the entire task. Nevertheless, CNNs suffer from weak long-range and\nnon-semantic modeling. To bridge this gap, based on the fact that artifacts are\nsensitive to image resolution, amplified under multi-scale features, and\nmassive at the manipulation border, we formulate the answer to the former\nquestion as building a ViT with high-resolution capacity, multi-scale feature\nextraction capability, and manipulation edge supervision that could converge\nwith a small amount of data. We term this simple but effective ViT paradigm\nIML-ViT, which has significant potential to become a new benchmark for IML.\nExtensive experiments on five benchmark datasets verified our model outperforms\nthe state-of-the-art manipulation localization methods.Code and models are\navailable at \\url{https://github.com/SunnyHaze/IML-ViT}.\n","authors":["Xiaochen Ma","Bo Du","Zhuohang Jiang","Ahmed Y. Al Hammadi","Jizhe Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.14863v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03515v1","updated":"2023-08-07T12:11:04Z","published":"2023-08-07T12:11:04Z","title":"Keyword Spotting Simplified: A Segmentation-Free Approach using\n Character Counting and CTC re-scoring","summary":" Recent advances in segmentation-free keyword spotting treat this problem\nw.r.t. an object detection paradigm and borrow from state-of-the-art detection\nsystems to simultaneously propose a word bounding box proposal mechanism and\ncompute a corresponding representation. Contrary to the norm of such methods\nthat rely on complex and large DNN models, we propose a novel segmentation-free\nsystem that efficiently scans a document image to find rectangular areas that\ninclude the query information. The underlying model is simple and compact,\npredicting character occurrences over rectangular areas through an implicitly\nlearned scale map, trained on word-level annotated images. 
The proposed\ndocument scanning is then performed using this character counting in a\ncost-effective manner via integral images and binary search. Finally, the\nretrieval similarity by character counting is refined by a pyramidal\nrepresentation and a CTC-based re-scoring algorithm, fully utilizing the\ntrained CNN model. Experimental validation on two widely-used datasets shows\nthat our method achieves state-of-the-art results outperforming the more\ncomplex alternatives, despite the simplicity of the underlying model.\n","authors":["George Retsinas","Giorgos Sfikas","Christophoros Nikou"],"pdf_url":"https://arxiv.org/pdf/2308.03515v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03495v1","updated":"2023-08-07T11:42:50Z","published":"2023-08-07T11:42:50Z","title":"Balanced Face Dataset: Guiding StyleGAN to Generate Labeled Synthetic\n Face Image Dataset for Underrepresented Group","summary":" For a machine learning model to generalize effectively to unseen data within\na particular problem domain, it is well-understood that the data needs to be of\nsufficient size and representative of real-world scenarios. Nonetheless,\nreal-world datasets frequently have overrepresented and underrepresented\ngroups. One solution to mitigate bias in machine learning is to leverage a\ndiverse and representative dataset. Training a model on a dataset that covers\nall demographics is crucial to reducing bias in machine learning. However,\ncollecting and labeling large-scale datasets has been challenging, prompting\nthe use of synthetic data generation and active labeling to decrease the costs\nof manual labeling. The focus of this study was to generate a robust face image\ndataset using the StyleGAN model. In order to achieve a balanced distribution\nof the dataset among different demographic groups, a synthetic dataset was\ncreated by controlling the generation process of StyleGaN and annotated for\ndifferent downstream tasks.\n","authors":["Kidist Amde Mekonnen"],"pdf_url":"https://arxiv.org/pdf/2308.03495v1.pdf","comment":"7 pages, 7 figures,submitted to AMLD Africa 2021 conference"},{"id":"http://arxiv.org/abs/2208.11176v3","updated":"2023-08-07T11:36:16Z","published":"2022-08-23T20:04:17Z","title":"A Study on the Impact of Data Augmentation for Training Convolutional\n Neural Networks in the Presence of Noisy Labels","summary":" Label noise is common in large real-world datasets, and its presence harms\nthe training process of deep neural networks. Although several works have\nfocused on the training strategies to address this problem, there are few\nstudies that evaluate the impact of data augmentation as a design choice for\ntraining deep neural networks. In this work, we analyse the model robustness\nwhen using different data augmentations and their improvement on the training\nwith the presence of noisy labels. We evaluate state-of-the-art and classical\ndata augmentation strategies with different levels of synthetic noise for the\ndatasets MNist, CIFAR-10, CIFAR-100, and the real-world dataset Clothing1M. We\nevaluate the methods using the accuracy metric. Results show that the\nappropriate selection of data augmentation can drastically improve the model\nrobustness to label noise, increasing up to 177.84% of relative best test\naccuracy compared to the baseline with no augmentation, and an increase of up\nto 6% in absolute value with the state-of-the-art DivideMix training strategy.\n","authors":["Emeson Santana","Gustavo Carneiro","Filipe R. 
Cordeiro"],"pdf_url":"https://arxiv.org/pdf/2208.11176v3.pdf","comment":"Paper accepted at SIBGRAPI 2022"},{"id":"http://arxiv.org/abs/2308.03492v1","updated":"2023-08-07T11:34:27Z","published":"2023-08-07T11:34:27Z","title":"Learning Photometric Feature Transform for Free-form Object Scan","summary":" We propose a novel framework to automatically learn to aggregate and\ntransform photometric measurements from multiple unstructured views into\nspatially distinctive and view-invariant low-level features, which are fed to a\nmulti-view stereo method to enhance 3D reconstruction. The illumination\nconditions during acquisition and the feature transform are jointly trained on\na large amount of synthetic data. We further build a system to reconstruct the\ngeometry and anisotropic reflectance of a variety of challenging objects from\nhand-held scans. The effectiveness of the system is demonstrated with a\nlightweight prototype, consisting of a camera and an array of LEDs, as well as\nan off-the-shelf tablet. Our results are validated against reconstructions from\na professional 3D scanner and photographs, and compare favorably with\nstate-of-the-art techniques.\n","authors":["Xiang Feng","Kaizhang Kang","Fan Pei","Huakeng Ding","Jinjiang You","Ping Tan","Kun Zhou","Hongzhi Wu"],"pdf_url":"https://arxiv.org/pdf/2308.03492v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.13984v4","updated":"2023-08-07T11:29:26Z","published":"2022-10-24T07:43:59Z","title":"Abductive Action Inference","summary":" Abductive reasoning aims to make the most likely inference for a given set of\nincomplete observations. In this paper, we introduce a novel research task\nknown as \"abductive action inference\" which addresses the question of which\nactions were executed by a human to reach a specific state shown in a single\nsnapshot. The research explores three key abductive inference problems: action\nset prediction, action sequence prediction, and abductive action verification.\nTo tackle these challenging tasks, we investigate various models, including\nestablished ones such as Transformers, Graph Neural Networks, CLIP, BLIP, GPT3,\nend-to-end trained Slow-Fast, Resnet50-3D, and ViT models. Furthermore, the\npaper introduces several innovative models tailored for abductive action\ninference, including a relational graph neural network, a relational bilinear\npooling model, a relational rule-based inference model, a relational GPT-3\nprompt method, and a relational Transformer model. Notably, the newly proposed\nobject-relational bilinear graph encoder-decoder (BiGED) model emerges as the\nmost effective among all methods evaluated, demonstrating good proficiency in\nhandling the intricacies of the Action Genome dataset. The contributions of\nthis research offer significant progress toward comprehending the implications\nof human actions and making highly plausible inferences concerning the outcomes\nof these actions.\n","authors":["Clement Tan","Chai Kiat Yeo","Cheston Tan","Basura Fernando"],"pdf_url":"https://arxiv.org/pdf/2210.13984v4.pdf","comment":"16 pages, 9 figures"},{"id":"http://arxiv.org/abs/2308.03486v1","updated":"2023-08-07T11:28:36Z","published":"2023-08-07T11:28:36Z","title":"Improving Mass Detection in Mammography Images: A Study of Weakly\n Supervised Learning and Class Activation Map Methods","summary":" In recent years, weakly supervised models have aided in mass detection using\nmammography images, decreasing the need for pixel-level annotations. 
However,\nmost existing models in the literature rely on Class Activation Maps (CAM) as\nthe activation method, overlooking the potential benefits of exploring other\nactivation techniques. This work presents a study that explores and compares\ndifferent activation maps in conjunction with state-of-the-art methods for\nweakly supervised training in mammography images. Specifically, we investigate\nCAM, GradCAM, GradCAM++, XGradCAM, and LayerCAM methods within the framework of\nthe GMIC model for mass detection in mammography images. The evaluation is\nconducted on the VinDr-Mammo dataset, utilizing the metrics Accuracy, True\nPositive Rate (TPR), False Negative Rate (FNR), and False Positive Per Image\n(FPPI). Results show that using different strategies of activation maps during\ntraining and test stages leads to an improvement of the model. With this\nstrategy, we improve the results of the GMIC method, decreasing the FPPI value\nand increasing TPR.\n","authors":["Vicente Sampaio","Filipe R. Cordeiro"],"pdf_url":"https://arxiv.org/pdf/2308.03486v1.pdf","comment":"Accepted for publication at SIBGRAPI 20203"},{"id":"http://arxiv.org/abs/2307.08265v2","updated":"2023-08-07T11:21:31Z","published":"2023-07-17T06:14:19Z","title":"Extreme Image Compression using Fine-tuned VQGAN Models","summary":" Recent advances in generative compression methods have demonstrated\nremarkable progress in enhancing the perceptual quality of compressed data,\nespecially in scenarios with low bitrates. Nevertheless, their efficacy and\napplicability in achieving extreme compression ratios ($<0.1$ bpp) still remain\nconstrained. In this work, we propose a simple yet effective coding framework\nby introducing vector quantization (VQ)-based generative models into the image\ncompression domain. The main insight is that the codebook learned by the VQGAN\nmodel yields strong expressive capacity, facilitating efficient compression of\ncontinuous information in the latent space while maintaining reconstruction\nquality. Specifically, an image can be represented as VQ-indices by finding the\nnearest codeword, which can be encoded using lossless compression methods into\nbitstreams. We then propose clustering a pre-trained large-scale codebook into\nsmaller codebooks using the K-means algorithm. This enables images to be\nrepresented as diverse ranges of VQ-indices maps, resulting in variable\nbitrates and different levels of reconstruction quality. Extensive qualitative\nand quantitative experiments on various datasets demonstrate that the proposed\nframework outperforms the state-of-the-art codecs in terms of perceptual\nquality-oriented metrics and human perception under extremely low bitrates.\n","authors":["Qi Mao","Tinghan Yang","Yinuo Zhang","Shuyin Pan","Meng Wang","Shiqi Wang","Siwei Ma"],"pdf_url":"https://arxiv.org/pdf/2307.08265v2.pdf","comment":"Generative Compression, Extreme Compression, VQGANs, Low Bitrate"},{"id":"http://arxiv.org/abs/2308.03476v1","updated":"2023-08-07T11:09:12Z","published":"2023-08-07T11:09:12Z","title":"Exploring the Physical World Adversarial Robustness of Vehicle Detection","summary":" Adversarial attacks can compromise the robustness of real-world detection\nmodels. However, evaluating these models under real-world conditions poses\nchallenges due to resource-intensive experiments. 
Virtual simulations offer an\nalternative, but the absence of standardized benchmarks hampers progress.\nAddressing this, we propose an innovative instant-level data generation\npipeline using the CARLA simulator. Through this pipeline, we establish the\nDiscrete and Continuous Instant-level (DCI) dataset, enabling comprehensive\nexperiments involving three detection models and three physical adversarial\nattacks. Our findings highlight diverse model performances under adversarial\nconditions. Yolo v6 demonstrates remarkable resilience, experiencing just a\nmarginal 6.59% average drop in average precision (AP). In contrast, the ASA\nattack yields a substantial 14.51% average AP reduction, twice the effect of\nother algorithms. We also note that static scenes yield higher recognition AP\nvalues, and outcomes remain relatively consistent across varying weather\nconditions. Intriguingly, our study suggests that advancements in adversarial\nattack algorithms may be approaching its ``limitation''.In summary, our work\nunderscores the significance of adversarial attacks in real-world contexts and\nintroduces the DCI dataset as a versatile benchmark. Our findings provide\nvaluable insights for enhancing the robustness of detection models and offer\nguidance for future research endeavors in the realm of adversarial attacks.\n","authors":["Wei Jiang","Tianyuan Zhang","Shuangcheng Liu","Weiyu Ji","Zichao Zhang","Gang Xiao"],"pdf_url":"https://arxiv.org/pdf/2308.03476v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03471v1","updated":"2023-08-07T10:57:20Z","published":"2023-08-07T10:57:20Z","title":"Deepfake Detection: A Comparative Analysis","summary":" This paper present a comprehensive comparative analysis of supervised and\nself-supervised models for deepfake detection. We evaluate eight supervised\ndeep learning architectures and two transformer-based models pre-trained using\nself-supervised strategies (DINO, CLIP) on four benchmarks (FakeAVCeleb,\nCelebDF-V2, DFDC, and FaceForensics++). Our analysis includes intra-dataset and\ninter-dataset evaluations, examining the best performing models, generalisation\ncapabilities, and impact of augmentations. We also investigate the trade-off\nbetween model size and performance. Our main goal is to provide insights into\nthe effectiveness of different deep learning architectures (transformers,\nCNNs), training strategies (supervised, self-supervised), and deepfake\ndetection benchmarks. These insights can help guide the development of more\naccurate and reliable deepfake detection systems, which are crucial in\nmitigating the harmful impact of deepfakes on individuals and society.\n","authors":["Sohail Ahmed Khan","Duc-Tien Dang-Nguyen"],"pdf_url":"https://arxiv.org/pdf/2308.03471v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03467v1","updated":"2023-08-07T10:47:08Z","published":"2023-08-07T10:47:08Z","title":"RoadScan: A Novel and Robust Transfer Learning Framework for Autonomous\n Pothole Detection in Roads","summary":" This research paper presents a novel approach to pothole detection using Deep\nLearning and Image Processing techniques. The proposed system leverages the\nVGG16 model for feature extraction and utilizes a custom Siamese network with\ntriplet loss, referred to as RoadScan. The system aims to address the critical\nissue of potholes on roads, which pose significant risks to road users.\nAccidents due to potholes on the roads have led to numerous accidents. 
Although\nit is necessary to completely remove potholes, it is a time-consuming process.\nHence, a general road user should be able to detect potholes from a safe\ndistance in order to avoid damage. Existing methods for pothole detection\nheavily rely on object detection algorithms which tend to have a high chance of\nfailure owing to the similarity in structures and textures of a road and a\npothole. Additionally, these systems utilize millions of parameters thereby\nmaking the model difficult to use in small-scale applications for the general\ncitizen. By analyzing diverse image processing methods and various\nhigh-performing networks, the proposed model achieves remarkable performance in\naccurately detecting potholes. Evaluation metrics such as accuracy, EER,\nprecision, recall, and AUROC validate the effectiveness of the system.\nAdditionally, the proposed model demonstrates computational efficiency and\ncost-effectiveness by utilizing fewer parameters and data for training. The\nresearch highlights the importance of technology in the transportation sector\nand its potential to enhance road safety and convenience. The network proposed\nin this model performs with a 96.12 % accuracy, 3.89 % EER, and a 0.988 AUROC\nvalue, which is highly competitive with other state-of-the-art works.\n","authors":["Guruprasad Parasnis","Anmol Chokshi","Kailas Devadkar"],"pdf_url":"https://arxiv.org/pdf/2308.03467v1.pdf","comment":"6 pages, 5 figures"},{"id":"http://arxiv.org/abs/2307.15063v2","updated":"2023-08-07T10:43:33Z","published":"2023-07-27T17:59:59Z","title":"To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation","summary":" The goal of Online Domain Adaptation for semantic segmentation is to handle\nunforeseeable domain changes that occur during deployment, like sudden weather\nevents. However, the high computational costs associated with brute-force\nadaptation make this paradigm unfeasible for real-world applications. In this\npaper we propose HAMLET, a Hardware-Aware Modular Least Expensive Training\nframework for real-time domain adaptation. Our approach includes a\nhardware-aware back-propagation orchestration agent (HAMT) and a dedicated\ndomain-shift detector that enables active control over when and how the model\nis adapted (LT). Thanks to these advancements, our approach is capable of\nperforming semantic segmentation while simultaneously adapting at more than\n29FPS on a single consumer-grade GPU. Our framework's encouraging accuracy and\nspeed trade-off is demonstrated on OnDA and SHIFT benchmarks through\nexperimental results.\n","authors":["Marc Botet Colomer","Pier Luigi Dovesi","Theodoros Panagiotakopoulos","Joao Frederico Carvalho","Linus Härenstam-Nielsen","Hossein Azizpour","Hedvig Kjellström","Daniel Cremers","Matteo Poggi"],"pdf_url":"https://arxiv.org/pdf/2307.15063v2.pdf","comment":"ICCV 2023. The first two authors contributed equally. Project page:\n https://marcbotet.github.io/hamlet-web/"},{"id":"http://arxiv.org/abs/2308.03463v1","updated":"2023-08-07T10:41:52Z","published":"2023-08-07T10:41:52Z","title":"DiffSynth: Latent In-Iteration Deflickering for Realistic Video\n Synthesis","summary":" In recent years, diffusion models have emerged as the most powerful approach\nin image synthesis. 
However, applying these models directly to video synthesis\npresents challenges, as it often leads to noticeable flickering contents.\nAlthough recently proposed zero-shot methods can alleviate flicker to some\nextent, we still struggle to generate coherent videos. In this paper, we\npropose DiffSynth, a novel approach that aims to convert image synthesis\npipelines to video synthesis pipelines. DiffSynth consists of two key\ncomponents: a latent in-iteration deflickering framework and a video\ndeflickering algorithm. The latent in-iteration deflickering framework applies\nvideo deflickering to the latent space of diffusion models, effectively\npreventing flicker accumulation in intermediate steps. Additionally, we propose\na video deflickering algorithm, named patch blending algorithm, that remaps\nobjects in different frames and blends them together to enhance video\nconsistency. One of the notable advantages of DiffSynth is its general\napplicability to various video synthesis tasks, including text-guided video\nstylization, fashion video synthesis, image-guided video stylization, video\nrestoring, and 3D rendering. In the task of text-guided video stylization, we\nmake it possible to synthesize high-quality videos without cherry-picking. The\nexperimental results demonstrate the effectiveness of DiffSynth. All videos can\nbe viewed on our project page. Source codes will also be released.\n","authors":["Zhongjie Duan","Lizhou You","Chengyu Wang","Cen Chen","Ziheng Wu","Weining Qian","Jun Huang","Fei Chao","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2308.03463v1.pdf","comment":"9 pages, 6 figures"},{"id":"http://arxiv.org/abs/2210.15808v2","updated":"2023-08-07T10:33:34Z","published":"2022-10-28T00:03:43Z","title":"Hyper-Connected Transformer Network for Multi-Modality PET-CT\n Segmentation","summary":" [18F]-Fluorodeoxyglucose (FDG) positron emission tomography - computed\ntomography (PET-CT) has become the imaging modality of choice for diagnosing\nmany cancers. Co-learning complementary PET-CT imaging features is a\nfundamental requirement for automatic tumor segmentation and for developing\ncomputer aided cancer diagnosis systems. In this study, we propose a\nhyper-connected transformer (HCT) network that integrates a transformer network\n(TN) with a hyper connected fusion for multi-modality PET-CT images. The TN was\nleveraged for its ability to provide global dependencies in image feature\nlearning, which was achieved by using image patch embeddings with a\nself-attention mechanism to capture image-wide contextual information. We\nextended the single-modality definition of TN with multiple TN based branches\nto separately extract image features. We also introduced a hyper connected\nfusion to fuse the contextual and complementary image features across multiple\ntransformers in an iterative manner. 
Our results with two clinical datasets\nshow that HCT achieved better performance in segmentation accuracy when\ncompared to the existing methods.\n","authors":["Lei Bi","Michael Fulham","Shaoli Song","David Dagan Feng","Jinman Kim"],"pdf_url":"https://arxiv.org/pdf/2210.15808v2.pdf","comment":"EMBC 2023"},{"id":"http://arxiv.org/abs/2308.03457v1","updated":"2023-08-07T10:25:54Z","published":"2023-08-07T10:25:54Z","title":"Cross-Silo Prototypical Calibration for Federated Learning with Non-IID\n Data","summary":" Federated Learning aims to learn a global model on the server side that\ngeneralizes to all clients in a privacy-preserving manner, by leveraging the\nlocal models from different clients. Existing solutions focus on either\nregularizing the objective functions among clients or improving the aggregation\nmechanism for the improved model generalization capability. However, their\nperformance is typically limited by the dataset biases, such as the\nheterogeneous data distributions and the missing classes. To address this\nissue, this paper presents a cross-silo prototypical calibration method\n(FedCSPC), which takes additional prototype information from the clients to\nlearn a unified feature space on the server side. Specifically, FedCSPC first\nemploys the Data Prototypical Modeling (DPM) module to learn data patterns via\nclustering to aid calibration. Subsequently, the cross-silo prototypical\ncalibration (CSPC) module develops an augmented contrastive learning method to\nimprove the robustness of the calibration, which can effectively project\ncross-source features into a consistent space while maintaining clear decision\nboundaries. Moreover, the CSPC module's ease of implementation and\nplug-and-play characteristics make it even more remarkable. Experiments were\nconducted on four datasets in terms of performance comparison, ablation study,\nin-depth analysis and case study, and the results verified that FedCSPC is\ncapable of learning the consistent features across different data sources of\nthe same class under the guidance of calibrated model, which leads to better\nperformance than the state-of-the-art methods. The source codes have been\nreleased at https://github.com/qizhuang-qz/FedCSPC.\n","authors":["Zhuang Qi","Lei Meng","Zitan Chen","Han Hu","Hui Lin","Xiangxu Meng"],"pdf_url":"https://arxiv.org/pdf/2308.03457v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.12043v2","updated":"2023-08-07T10:20:59Z","published":"2023-04-24T12:38:09Z","title":"MixPro: Data Augmentation with MaskMix and Progressive Attention\n Labeling for Vision Transformer","summary":" The recently proposed data augmentation TransMix employs attention labels to\nhelp visual transformers (ViT) achieve better robustness and performance.\nHowever, TransMix is deficient in two aspects: 1) The image cropping method of\nTransMix may not be suitable for ViTs. 2) At the early stage of training, the\nmodel produces unreliable attention maps. TransMix uses unreliable attention\nmaps to compute mixed attention labels that can affect the model. To address\nthe aforementioned issues, we propose MaskMix and Progressive Attention\nLabeling (PAL) in image and label space, respectively. In detail, from the\nperspective of image space, we design MaskMix, which mixes two images based on\na patch-like grid mask. In particular, the size of each mask patch is\nadjustable and is a multiple of the image patch size, which ensures each image\npatch comes from only one image and contains more global contents. 
From the\nperspective of label space, we design PAL, which utilizes a progressive factor\nto dynamically re-weight the attention weights of the mixed attention label.\nFinally, we combine MaskMix and Progressive Attention Labeling as our new data\naugmentation method, named MixPro. The experimental results show that our\nmethod can improve various ViT-based models at scales on ImageNet\nclassification (73.8\\% top-1 accuracy based on DeiT-T for 300 epochs). After\nbeing pre-trained with MixPro on ImageNet, the ViT-based models also\ndemonstrate better transferability to semantic segmentation, object detection,\nand instance segmentation. Furthermore, compared to TransMix, MixPro also shows\nstronger robustness on several benchmarks. The code is available at\nhttps://github.com/fistyee/MixPro.\n","authors":["Qihao Zhao","Yangyu Huang","Wei Hu","Fan Zhang","Jun Liu"],"pdf_url":"https://arxiv.org/pdf/2304.12043v2.pdf","comment":"ICLR 2023, 16 pages, 6 figures"},{"id":"http://arxiv.org/abs/2305.07176v2","updated":"2023-08-07T10:09:21Z","published":"2023-05-11T23:12:13Z","title":"Automatic Radiology Report Generation by Learning with Increasingly Hard\n Negatives","summary":" Automatic radiology report generation is challenging as medical images or\nreports are usually similar to each other due to the common content of anatomy.\nThis makes a model hard to capture the uniqueness of individual images and is\nprone to producing undesired generic or mismatched reports. This situation\ncalls for learning more discriminative features that could capture even\nfine-grained mismatches between images and reports. To achieve this, this paper\nproposes a novel framework to learn discriminative image and report features by\ndistinguishing them from their closest peers, i.e., hard negatives. Especially,\nto attain more discriminative features, we gradually raise the difficulty of\nsuch a learning task by creating increasingly hard negative reports for each\nimage in the feature space during training, respectively. By treating the\nincreasingly hard negatives as auxiliary variables, we formulate this process\nas a min-max alternating optimisation problem. At each iteration, conditioned\non a given set of hard negative reports, image and report features are learned\nas usual by minimising the loss functions related to report generation. After\nthat, a new set of harder negative reports will be created by maximising a loss\nreflecting image-report alignment. By solving this optimisation, we attain a\nmodel that can generate more specific and accurate reports. It is noteworthy\nthat our framework enhances discriminative feature learning without introducing\nextra network weights. Also, in contrast to the existing way of generating hard\nnegatives, our framework extends beyond the granularity of the dataset by\ngenerating harder samples out of the training set. 
Experimental study on\nbenchmark datasets verifies the efficacy of our framework and shows that it can\nserve as a plug-in to readily improve existing medical report generation\nmodels.\n","authors":["Bhanu Prakash Voutharoja","Lei Wang","Luping Zhou"],"pdf_url":"https://arxiv.org/pdf/2305.07176v2.pdf","comment":"Accepted to European Conference on Artificial Intelligence (ECAI)\n 2023"},{"id":"http://arxiv.org/abs/2308.03448v1","updated":"2023-08-07T10:09:11Z","published":"2023-08-07T10:09:11Z","title":"Lighting Every Darkness in Two Pairs: A Calibration-Free Pipeline for\n RAW Denoising","summary":" Calibration-based methods have dominated RAW image denoising under extremely\nlow-light environments. However, these methods suffer from several main\ndeficiencies: 1) the calibration procedure is laborious and time-consuming, 2)\ndenoisers for different cameras are difficult to transfer, and 3) the\ndiscrepancy between synthetic noise and real noise is enlarged by high digital\ngain. To overcome the above shortcomings, we propose a calibration-free\npipeline for Lighting Every Drakness (LED), regardless of the digital gain or\ncamera sensor. Instead of calibrating the noise parameters and training\nrepeatedly, our method could adapt to a target camera only with few-shot paired\ndata and fine-tuning. In addition, well-designed structural modification during\nboth stages alleviates the domain gap between synthetic and real noise without\nany extra computational cost. With 2 pairs for each additional digital gain (in\ntotal 6 pairs) and 0.5% iterations, our method achieves superior performance\nover other calibration-based methods. Our code is available at\nhttps://github.com/Srameo/LED .\n","authors":["Xin Jin","Jia-Wen Xiao","Ling-Hao Han","Chunle Guo","Ruixun Zhang","Xialei Liu","Chongyi Li"],"pdf_url":"https://arxiv.org/pdf/2308.03448v1.pdf","comment":"Accepted to ICCV2023"},{"id":"http://arxiv.org/abs/2306.09780v2","updated":"2023-08-07T09:25:55Z","published":"2023-06-16T11:33:47Z","title":"Understanding Deep Generative Models with Generalized Empirical\n Likelihoods","summary":" Understanding how well a deep generative model captures a distribution of\nhigh-dimensional data remains an important open challenge. It is especially\ndifficult for certain model classes, such as Generative Adversarial Networks\nand Diffusion Models, whose models do not admit exact likelihoods. In this\nwork, we demonstrate that generalized empirical likelihood (GEL) methods offer\na family of diagnostic tools that can identify many deficiencies of deep\ngenerative models (DGMs). We show, with appropriate specification of moment\nconditions, that the proposed method can identify which modes have been\ndropped, the degree to which DGMs are mode imbalanced, and whether DGMs\nsufficiently capture intra-class diversity. We show how to combine techniques\nfrom Maximum Mean Discrepancy and Generalized Empirical Likelihood to create\nnot only distribution tests that retain per-sample interpretability, but also\nmetrics that include label information. We find that such tests predict the\ndegree of mode dropping and mode imbalance up to 60% better than metrics such\nas improved precision/recall. 
We provide an implementation at\nhttps://github.com/deepmind/understanding_deep_generative_models_with_generalized_empirical_likelihood/.\n","authors":["Suman Ravuri","Mélanie Rey","Shakir Mohamed","Marc Deisenroth"],"pdf_url":"https://arxiv.org/pdf/2306.09780v2.pdf","comment":"Computer Vision and Pattern Recognition 2023 (Highlight, top 2.6% of\n submissions)"},{"id":"http://arxiv.org/abs/2308.03413v1","updated":"2023-08-07T09:03:35Z","published":"2023-08-07T09:03:35Z","title":"GaFET: Learning Geometry-aware Facial Expression Translation from\n In-The-Wild Images","summary":" While current face animation methods can manipulate expressions individually,\nthey suffer from several limitations. The expressions manipulated by some\nmotion-based facial reenactment models are crude. Other ideas modeled with\nfacial action units cannot generalize to arbitrary expressions not covered by\nannotations. In this paper, we introduce a novel Geometry-aware Facial\nExpression Translation (GaFET) framework, which is based on parametric 3D\nfacial representations and can stably decoupled expression. Among them, a\nMulti-level Feature Aligned Transformer is proposed to complement non-geometric\nfacial detail features while addressing the alignment challenge of spatial\nfeatures. Further, we design a De-expression model based on StyleGAN, in order\nto reduce the learning difficulty of GaFET in unpaired \"in-the-wild\" images.\nExtensive qualitative and quantitative experiments demonstrate that we achieve\nhigher-quality and more accurate facial expression transfer results compared to\nstate-of-the-art methods, and demonstrate applicability of various poses and\ncomplex textures. Besides, videos or annotated training data are omitted,\nmaking our method easier to use and generalize.\n","authors":["Tianxiang Ma","Bingchuan Li","Qian He","Jing Dong","Tieniu Tan"],"pdf_url":"https://arxiv.org/pdf/2308.03413v1.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2308.03411v1","updated":"2023-08-07T09:02:26Z","published":"2023-08-07T09:02:26Z","title":"A Horse with no Labels: Self-Supervised Horse Pose Estimation from\n Unlabelled Images and Synthetic Prior","summary":" Obtaining labelled data to train deep learning methods for estimating animal\npose is challenging. Recently, synthetic data has been widely used for pose\nestimation tasks, but most methods still rely on supervised learning paradigms\nutilising synthetic images and labels. Can training be fully unsupervised? Is a\ntiny synthetic dataset sufficient? What are the minimum assumptions that we\ncould make for estimating animal pose? Our proposal addresses these questions\nthrough a simple yet effective self-supervised method that only assumes the\navailability of unlabelled images and a small set of synthetic 2D poses. We\ncompletely remove the need for any 3D or 2D pose annotations (or complex 3D\nanimal models), and surprisingly our approach can still learn accurate 3D and\n2D poses simultaneously. We train our method with unlabelled images of horses\nmainly collected for YouTube videos and a prior consisting of 2D synthetic\nposes. The latter is three times smaller than the number of images needed for\ntraining. We test our method on a challenging set of horse images and evaluate\nthe predicted 3D and 2D poses. We demonstrate that it is possible to learn\naccurate animal poses even with as few assumptions as unlabelled images and a\nsmall set of 2D poses generated from synthetic data. 
Given the minimum\nrequirements and the abundance of unlabelled data, our method could be easily\ndeployed to different animals.\n","authors":["Jose Sosa","David Hogg"],"pdf_url":"https://arxiv.org/pdf/2308.03411v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03409v1","updated":"2023-08-07T08:55:48Z","published":"2023-08-07T08:55:48Z","title":"DiT: Efficient Vision Transformers with Dynamic Token Routing","summary":" Recently, the tokens of images share the same static data flow in many dense\nnetworks. However, challenges arise from the variance among the objects in\nimages, such as large variations in the spatial scale and difficulties of\nrecognition for visual entities. In this paper, we propose a data-dependent\ntoken routing strategy to elaborate the routing paths of image tokens for\nDynamic Vision Transformer, dubbed DiT. The proposed framework generates a\ndata-dependent path per token, adapting to the object scales and visual\ndiscrimination of tokens. In feed-forward, the differentiable routing gates are\ndesigned to select the scaling paths and feature transformation paths for image\ntokens, leading to multi-path feature propagation. In this way, the impact of\nobject scales and visual discrimination of image representation can be\ncarefully tuned. Moreover, the computational cost can be further reduced by\ngiving budget constraints to the routing gate and early-stopping of feature\nextraction. In experiments, our DiT achieves superior performance and favorable\ncomplexity/accuracy trade-offs than many SoTA methods on ImageNet\nclassification, object detection, instance segmentation, and semantic\nsegmentation. Particularly, the DiT-B5 obtains 84.8\\% top-1 Acc on ImageNet\nwith 10.3 GFLOPs, which is 1.0\\% higher than that of the SoTA method with\nsimilar computational complexity. These extensive results demonstrate that DiT\ncan serve as versatile backbones for various vision tasks.\n","authors":["Yuchen Ma","Zhengcong Fei","Junshi Huang"],"pdf_url":"https://arxiv.org/pdf/2308.03409v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03407v1","updated":"2023-08-07T08:48:46Z","published":"2023-08-07T08:48:46Z","title":"Spatially Varying Nanophotonic Neural Networks","summary":" The explosive growth of computation and energy cost of artificial\nintelligence has spurred strong interests in new computing modalities as\npotential alternatives to conventional electronic processors. Photonic\nprocessors that execute operations using photons instead of electrons, have\npromised to enable optical neural networks with ultra-low latency and power\nconsumption. However, existing optical neural networks, limited by the\nunderlying network designs, have achieved image recognition accuracy much lower\nthan state-of-the-art electronic neural networks. In this work, we close this\ngap by introducing a large-kernel spatially-varying convolutional neural\nnetwork learned via low-dimensional reparameterization techniques. We\nexperimentally instantiate the network with a flat meta-optical system that\nencompasses an array of nanophotonic structures designed to induce\nangle-dependent responses. 
Combined with an extremely lightweight electronic\nbackend with approximately 2K parameters we demonstrate a nanophotonic neural\nnetwork reaches 73.80\\% blind test classification accuracy on CIFAR-10 dataset,\nand, as such, the first time, an optical neural network outperforms the first\nmodern digital neural network -- AlexNet (72.64\\%) with 57M parameters,\nbringing optical neural network into modern deep learning era.\n","authors":["Kaixuan Wei","Xiao Li","Johannes Froech","Praneeth Chakravarthula","James Whitehead","Ethan Tseng","Arka Majumdar","Felix Heide"],"pdf_url":"https://arxiv.org/pdf/2308.03407v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.08283v3","updated":"2023-08-07T08:32:54Z","published":"2022-12-16T05:10:09Z","title":"SceneGATE: Scene-Graph based co-Attention networks for TExt visual\n question answering","summary":" Most TextVQA approaches focus on the integration of objects, scene texts and\nquestion words by a simple transformer encoder. But this fails to capture the\nsemantic relations between different modalities. The paper proposes a Scene\nGraph based co-Attention Network (SceneGATE) for TextVQA, which reveals the\nsemantic relations among the objects, Optical Character Recognition (OCR)\ntokens and the question words. It is achieved by a TextVQA-based scene graph\nthat discovers the underlying semantics of an image. We created a\nguided-attention module to capture the intra-modal interplay between the\nlanguage and the vision as a guidance for inter-modal interactions. To make\nexplicit teaching of the relations between the two modalities, we proposed and\nintegrated two attention modules, namely a scene graph-based semantic\nrelation-aware attention and a positional relation-aware attention. We\nconducted extensive experiments on two benchmark datasets, Text-VQA and ST-VQA.\nIt is shown that our SceneGATE method outperformed existing ones because of the\nscene graph and its attention modules.\n","authors":["Feiqi Cao","Siwen Luo","Felipe Nunez","Zean Wen","Josiah Poon","Caren Han"],"pdf_url":"https://arxiv.org/pdf/2212.08283v3.pdf","comment":"Published in Robotics (Q1, SCI indexed Journal):\n https://www.mdpi.com/2218-6581/12/4/114"},{"id":"http://arxiv.org/abs/2307.13294v2","updated":"2023-08-07T08:12:57Z","published":"2023-07-25T07:20:21Z","title":"Imperceptible Physical Attack against Face Recognition Systems via LED\n Illumination Modulation","summary":" Although face recognition starts to play an important role in our daily life,\nwe need to pay attention that data-driven face recognition vision systems are\nvulnerable to adversarial attacks. However, the current two categories of\nadversarial attacks, namely digital attacks and physical attacks both have\ndrawbacks, with the former ones impractical and the latter one conspicuous,\nhigh-computational and inexecutable. To address the issues, we propose a\npractical, executable, inconspicuous and low computational adversarial attack\nbased on LED illumination modulation. To fool the systems, the proposed attack\ngenerates imperceptible luminance changes to human eyes through fast intensity\nmodulation of scene LED illumination and uses the rolling shutter effect of\nCMOS image sensors in face recognition systems to implant luminance information\nperturbation to the captured face images. In summary,we present a\ndenial-of-service (DoS) attack for face detection and a dodging attack for face\nverification. 
We also evaluate their effectiveness against well-known face\ndetection models, Dlib, MTCNN and RetinaFace , and face verification models,\nDlib, FaceNet,and ArcFace.The extensive experiments show that the success rates\nof DoS attacks against face detection models reach 97.67%, 100%, and 100%,\nrespectively, and the success rates of dodging attacks against all face\nverification models reach 100%.\n","authors":["Junbin Fang","Canjian Jiang","You Jiang","Puxi Lin","Zhaojie Chen","Yujing Sun","Siu-Ming Yiu","Zoe L. Jiang"],"pdf_url":"https://arxiv.org/pdf/2307.13294v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09036v2","updated":"2023-08-07T08:10:55Z","published":"2023-03-16T02:18:41Z","title":"Mimic3D: Thriving 3D-Aware GANs via 3D-to-2D Imitation","summary":" Generating images with both photorealism and multiview 3D consistency is\ncrucial for 3D-aware GANs, yet existing methods struggle to achieve them\nsimultaneously. Improving the photorealism via CNN-based 2D super-resolution\ncan break the strict 3D consistency, while keeping the 3D consistency by\nlearning high-resolution 3D representations for direct rendering often\ncompromises image quality. In this paper, we propose a novel learning strategy,\nnamely 3D-to-2D imitation, which enables a 3D-aware GAN to generate\nhigh-quality images while maintaining their strict 3D consistency, by letting\nthe images synthesized by the generator's 3D rendering branch to mimic those\ngenerated by its 2D super-resolution branch. We also introduce 3D-aware\nconvolutions into the generator for better 3D representation learning, which\nfurther improves the image generation quality. With the above strategies, our\nmethod reaches FID scores of 5.4 and 4.3 on FFHQ and AFHQ-v2 Cats,\nrespectively, at 512x512 resolution, largely outperforming existing 3D-aware\nGANs using direct 3D rendering and coming very close to the previous\nstate-of-the-art method that leverages 2D super-resolution. Project website:\nhttps://seanchenxy.github.io/Mimic3DWeb.\n","authors":["Xingyu Chen","Yu Deng","Baoyuan Wang"],"pdf_url":"https://arxiv.org/pdf/2303.09036v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03382v1","updated":"2023-08-07T08:03:20Z","published":"2023-08-07T08:03:20Z","title":"Enhancing Nucleus Segmentation with HARU-Net: A Hybrid Attention Based\n Residual U-Blocks Network","summary":" Nucleus image segmentation is a crucial step in the analysis, pathological\ndiagnosis, and classification, which heavily relies on the quality of nucleus\nsegmentation. However, the complexity of issues such as variations in nucleus\nsize, blurred nucleus contours, uneven staining, cell clustering, and\noverlapping cells poses significant challenges. Current methods for nucleus\nsegmentation primarily rely on nuclear morphology or contour-based approaches.\nNuclear morphology-based methods exhibit limited generalization ability and\nstruggle to effectively predict irregular-shaped nuclei, while contour-based\nextraction methods face challenges in accurately segmenting overlapping nuclei.\nTo address the aforementioned issues, we propose a dual-branch network using\nhybrid attention based residual U-blocks for nucleus instance segmentation. The\nnetwork simultaneously predicts target information and target contours.\nAdditionally, we introduce a post-processing method that combines the target\ninformation and target contours to distinguish overlapping nuclei and generate\nan instance segmentation image. 
Within the network, we propose a context fusion\nblock (CF-block) that effectively extracts and merges contextual information\nfrom the network. Extensive quantitative evaluations are conducted to assess\nthe performance of our method. Experimental results demonstrate the superior\nperformance of the proposed method compared to state-of-the-art approaches on\nthe BNS, MoNuSeg, CoNSeg, and CPM-17 datasets.\n","authors":["Junzhou Chen","Qian Huang","Yulin Chen","Linyi Qian","Chengyuan Yu"],"pdf_url":"https://arxiv.org/pdf/2308.03382v1.pdf","comment":"Nucleus segmentation, Deep learning, Instance segmentation, Medical\n imaging, Dual-Branch network"},{"id":"http://arxiv.org/abs/2308.01661v3","updated":"2023-08-07T08:00:36Z","published":"2023-08-03T09:56:31Z","title":"BEVControl: Accurately Controlling Street-view Elements with\n Multi-perspective Consistency via BEV Sketch Layout","summary":" Using synthesized images to boost the performance of perception models is a\nlong-standing research challenge in computer vision. It becomes more eminent in\nvisual-centric autonomous driving systems with multi-view cameras as some\nlong-tail scenarios can never be collected. Guided by the BEV segmentation\nlayouts, the existing generative networks seem to synthesize photo-realistic\nstreet-view images when evaluated solely on scene-level metrics. However, once\nzoom-in, they usually fail to produce accurate foreground and background\ndetails such as heading. To this end, we propose a two-stage generative method,\ndubbed BEVControl, that can generate accurate foreground and background\ncontents. In contrast to segmentation-like input, it also supports sketch style\ninput, which is more flexible for humans to edit. In addition, we propose a\ncomprehensive multi-level evaluation protocol to fairly compare the quality of\nthe generated scene, foreground object, and background geometry. Our extensive\nexperiments show that our BEVControl surpasses the state-of-the-art method,\nBEVGen, by a significant margin, from 5.89 to 26.80 on foreground segmentation\nmIoU. In addition, we show that using images generated by BEVControl to train\nthe downstream perception model, it achieves on average 1.29 improvement in NDS\nscore.\n","authors":["Kairui Yang","Enhui Ma","Jibin Peng","Qing Guo","Di Lin","Kaicheng Yu"],"pdf_url":"https://arxiv.org/pdf/2308.01661v3.pdf","comment":"13 pages, 8 figures"},{"id":"http://arxiv.org/abs/2308.03381v1","updated":"2023-08-07T07:59:56Z","published":"2023-08-07T07:59:56Z","title":"Bilevel Generative Learning for Low-Light Vision","summary":" Recently, there has been a growing interest in constructing deep learning\nschemes for Low-Light Vision (LLV). Existing techniques primarily focus on\ndesigning task-specific and data-dependent vision models on the standard RGB\ndomain, which inherently contain latent data associations. In this study, we\npropose a generic low-light vision solution by introducing a generative block\nto convert data from the RAW to the RGB domain. This novel approach connects\ndiverse vision problems by explicitly depicting data generation, which is the\nfirst in the field. To precisely characterize the latent correspondence between\nthe generative procedure and the vision task, we establish a bilevel model with\nthe parameters of the generative block defined as the upper level and the\nparameters of the vision task defined as the lower level. 
We further develop\ntwo types of learning strategies targeting different goals, namely low cost and\nhigh accuracy, to acquire a new bilevel generative learning paradigm. The\ngenerative blocks embrace a strong generalization ability in other low-light\nvision tasks through the bilevel optimization on enhancement tasks. Extensive\nexperimental evaluations on three representative low-light vision tasks, namely\nenhancement, detection, and segmentation, fully demonstrate the superiority of\nour proposed approach. The code will be available at\nhttps://github.com/Yingchi1998/BGL.\n","authors":["Yingchi Liu","Zhu Liu","Long Ma","Jinyuan Liu","Xin Fan","Zhongxuan Luo","Risheng Liu"],"pdf_url":"https://arxiv.org/pdf/2308.03381v1.pdf","comment":"Accepted by ACM MM'2023, The code will be available at\n https://github.com/Yingchi1998/BGL"},{"id":"http://arxiv.org/abs/2308.03375v1","updated":"2023-08-07T07:54:32Z","published":"2023-08-07T07:54:32Z","title":"VR-based body tracking to stimulate musculoskeletal training","summary":" Training helps to maintain and improve sufficient muscle function, body\ncontrol, and body coordination. These are important to reduce the risk of\nfracture incidents caused by falls, especially for the elderly or people\nrecovering from injury. Virtual reality training can offer a cost-effective and\nindividualized training experience. We present an application for the HoloLens\n2 to enable musculoskeletal training for elderly and impaired persons to allow\nfor autonomous training and automatic progress evaluation. We designed a\nvirtual downhill skiing scenario that is controlled by body movement to\nstimulate balance and body control. By adapting the parameters of the ski\nslope, we can tailor the intensity of the training to individual users. In this\nwork, we evaluate whether the movement data of the HoloLens 2 alone is\nsufficient to control and predict body movement and joint angles during\nmusculoskeletal training. We record the movements of 10 healthy volunteers with\nexternal tracking cameras and track a set of body and joint angles of the\nparticipant during training. We estimate correlation coefficients and\nsystematically analyze whether whole body movement can be derived from the\nmovement data of the HoloLens 2. No participant reports movement sickness\neffects and all were able to quickly interact and control their movement during\nskiing. Our results show a high correlation between HoloLens 2 movement data\nand the external tracking of the upper body movement and joint angles of the\nlower limbs.\n","authors":["M. Neidhardt","S. Gerlach F. N. Schmidt","I. A. K. Fiedler","S. Grube","B. Busse","A. Schlaefer"],"pdf_url":"https://arxiv.org/pdf/2308.03375v1.pdf","comment":"Conference"},{"id":"http://arxiv.org/abs/2308.03374v1","updated":"2023-08-07T07:53:39Z","published":"2023-08-07T07:53:39Z","title":"Heterogeneous Forgetting Compensation for Class-Incremental Learning","summary":" Class-incremental learning (CIL) has achieved remarkable successes in\nlearning new classes consecutively while overcoming catastrophic forgetting on\nold categories. However, most existing CIL methods unreasonably assume that all\nold categories have the same forgetting pace, and neglect negative influence of\nforgetting heterogeneity among different old classes on forgetting\ncompensation. 
To surmount the above challenges, we develop a novel\nHeterogeneous Forgetting Compensation (HFC) model, which can resolve\nheterogeneous forgetting of easy-to-forget and hard-to-forget old categories\nfrom both representation and gradient aspects. Specifically, we design a\ntask-semantic aggregation block to alleviate heterogeneous forgetting from\nrepresentation aspect. It aggregates local category information within each\ntask to learn task-shared global representations. Moreover, we develop two\nnovel plug-and-play losses: a gradient-balanced forgetting compensation loss\nand a gradient-balanced relation distillation loss to alleviate forgetting from\ngradient aspect. They consider gradient-balanced compensation to rectify\nforgetting heterogeneity of old categories and heterogeneous relation\nconsistency. Experiments on several representative datasets illustrate\neffectiveness of our HFC model. The code is available at\nhttps://github.com/JiahuaDong/HFC.\n","authors":["Jiahua Dong","Wenqi Liang","Yang Cong","Gan Sun"],"pdf_url":"https://arxiv.org/pdf/2308.03374v1.pdf","comment":"Accepted to ICCV2023"},{"id":"http://arxiv.org/abs/2304.14104v2","updated":"2023-08-07T07:52:35Z","published":"2023-04-27T11:32:48Z","title":"Learning Human-Human Interactions in Images from Weak Textual\n Supervision","summary":" Interactions between humans are diverse and context-dependent, but previous\nworks have treated them as categorical, disregarding the heavy tail of possible\ninteractions. We propose a new paradigm of learning human-human interactions as\nfree text from a single still image, allowing for flexibility in modeling the\nunlimited space of situations and relationships between people. To overcome the\nabsence of data labelled specifically for this task, we use knowledge\ndistillation applied to synthetic caption data produced by a large language\nmodel without explicit supervision. We show that the pseudo-labels produced by\nthis procedure can be used to train a captioning model to effectively\nunderstand human-human interactions in images, as measured by a variety of\nmetrics that measure textual and semantic faithfulness and factual groundedness\nof our predictions. We further show that our approach outperforms SOTA image\ncaptioning and situation recognition models on this task. We will release our\ncode and pseudo-labels along with Waldo and Wenda, a manually-curated test set\nfor still image human-human interaction understanding.\n","authors":["Morris Alper","Hadar Averbuch-Elor"],"pdf_url":"https://arxiv.org/pdf/2304.14104v2.pdf","comment":"To be presented at ICCV 2023. Project webpage:\n https://learning-interactions.github.io"},{"id":"http://arxiv.org/abs/2307.13925v3","updated":"2023-08-07T07:40:39Z","published":"2023-07-26T02:46:50Z","title":"EasyNet: An Easy Network for 3D Industrial Anomaly Detection","summary":" 3D anomaly detection is an emerging and vital computer vision task in\nindustrial manufacturing (IM). Recently many advanced algorithms have been\npublished, but most of them cannot meet the needs of IM. There are several\ndisadvantages: i) difficult to deploy on production lines since their\nalgorithms heavily rely on large pre-trained models; ii) hugely increase\nstorage overhead due to overuse of memory banks; iii) the inference speed\ncannot be achieved in real-time. 
To overcome these issues, we propose an easy\nand deployment-friendly network (called EasyNet) without using pre-trained\nmodels and memory banks: firstly, we design a multi-scale multi-modality\nfeature encoder-decoder to accurately reconstruct the segmentation maps of\nanomalous regions and encourage the interaction between RGB images and depth\nimages; secondly, we adopt a multi-modality anomaly segmentation network to\nachieve a precise anomaly map; thirdly, we propose an attention-based\ninformation entropy fusion module for feature fusion during inference, making\nit suitable for real-time deployment. Extensive experiments show that EasyNet\nachieves an anomaly detection AUROC of 92.6% without using pre-trained models\nand memory banks. In addition, EasyNet is faster than existing methods, with a\nhigh frame rate of 94.55 FPS on a Tesla V100 GPU.\n","authors":["Ruitao Chen","Guoyang Xie","Jiaqi Liu","Jinbao Wang","Ziqi Luo","Jinfan Wang","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2307.13925v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03364v1","updated":"2023-08-07T07:39:39Z","published":"2023-08-07T07:39:39Z","title":"Dual Aggregation Transformer for Image Super-Resolution","summary":" Transformer has recently gained considerable popularity in low-level vision\ntasks, including image super-resolution (SR). These networks utilize\nself-attention along different dimensions, spatial or channel, and achieve\nimpressive performance. This inspires us to combine the two dimensions in\nTransformer for a more powerful representation capability. Based on the above\nidea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT),\nfor image SR. Our DAT aggregates features across spatial and channel\ndimensions, in the inter-block and intra-block dual manner. Specifically, we\nalternately apply spatial and channel self-attention in consecutive Transformer\nblocks. The alternate strategy enables DAT to capture the global context and\nrealize inter-block feature aggregation. Furthermore, we propose the adaptive\ninteraction module (AIM) and the spatial-gate feed-forward network (SGFN) to\nachieve intra-block feature aggregation. AIM complements two self-attention\nmechanisms from corresponding dimensions. Meanwhile, SGFN introduces additional\nnon-linear spatial information in the feed-forward network. Extensive\nexperiments show that our DAT surpasses current methods. Code and models are\nobtainable at https://github.com/zhengchen1999/DAT.\n","authors":["Zheng Chen","Yulun Zhang","Jinjin Gu","Linghe Kong","Xiaokang Yang","Fisher Yu"],"pdf_url":"https://arxiv.org/pdf/2308.03364v1.pdf","comment":"Accepted to ICCV 2023. Code is available at\n https://github.com/zhengchen1999/DAT"},{"id":"http://arxiv.org/abs/2308.03359v1","updated":"2023-08-07T07:28:24Z","published":"2023-08-07T07:28:24Z","title":"Distortion-aware Transformer in 360° Salient Object Detection","summary":" With the emergence of VR and AR, 360{\\deg} data attracts increasing attention\nfrom the computer vision and multimedia communities. Typically, 360{\\deg} data\nis projected into 2D ERP (equirectangular projection) images for feature\nextraction. However, existing methods cannot handle the distortions that result\nfrom the projection, hindering the development of 360-data-based tasks.\nTherefore, in this paper, we propose a Transformer-based model called DATFormer\nto address the distortion problem. We tackle this issue from two perspectives.\nFirstly, we introduce two distortion-adaptive modules. 
The first is a\nDistortion Mapping Module, which guides the model to pre-adapt to distorted\nfeatures globally. The second module is a Distortion-Adaptive Attention Block\nthat reduces local distortions on multi-scale features. Secondly, to exploit\nthe unique characteristics of 360{\\deg} data, we present a learnable relation\nmatrix and use it as part of the positional embedding to further improve\nperformance. Extensive experiments are conducted on three public datasets, and\nthe results show that our model outperforms existing 2D SOD (salient object\ndetection) and 360 SOD methods.\n","authors":["Yinjie Zhao","Lichen Zhao","Qian Yu","Jing Zhang","Lu Sheng","Dong Xu"],"pdf_url":"https://arxiv.org/pdf/2308.03359v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2304.03532v2","updated":"2023-08-07T07:25:34Z","published":"2023-04-07T08:11:16Z","title":"Graph-Guided MLP-Mixer for Skeleton-Based Human Motion Prediction","summary":" In recent years, Graph Convolutional Networks (GCNs) have been widely used in\nhuman motion prediction, but their performance remains unsatisfactory.\nRecently, MLP-Mixer, initially developed for vision tasks, has been leveraged\ninto human motion prediction as a promising alternative to GCNs, which achieves\nboth better performance and better efficiency than GCNs. Unlike GCNs, which can\nexplicitly capture human skeleton's bone-joint structure by representing it as\na graph with edges and nodes, MLP-Mixer relies on fully connected layers and\nthus cannot explicitly model such graph-like structure of human's. To break\nthis limitation of MLP-Mixer's, we propose \\textit{Graph-Guided Mixer}, a novel\napproach that equips the original MLP-Mixer architecture with the capability to\nmodel graph structure. By incorporating graph guidance, our\n\\textit{Graph-Guided Mixer} can effectively capture and utilize the specific\nconnectivity patterns within human skeleton's graph representation. In this\npaper, first we uncover a theoretical connection between MLP-Mixer and GCN that\nis unexplored in existing research. Building on this theoretical connection,\nnext we present our proposed \\textit{Graph-Guided Mixer}, explaining how the\noriginal MLP-Mixer architecture is reinvented to incorporate guidance from\ngraph structure. Then we conduct an extensive evaluation on the Human3.6M,\nAMASS, and 3DPW datasets, which shows that our method achieves state-of-the-art\nperformance.\n","authors":["Xinshun Wang","Qiongjie Cui","Chen Chen","Shen Zhao","Mengyuan Liu"],"pdf_url":"https://arxiv.org/pdf/2304.03532v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03354v1","updated":"2023-08-07T07:23:43Z","published":"2023-08-07T07:23:43Z","title":"Energy-Guided Diffusion Model for CBCT-to-CT Synthesis","summary":" Cone Beam CT (CBCT) plays a crucial role in Adaptive Radiation Therapy (ART)\nby accurately providing radiation treatment when organ anatomy changes occur.\nHowever, CBCT images suffer from scatter noise and artifacts, making relying\nsolely on CBCT for precise dose calculation and accurate tissue localization\nchallenging. Therefore, there is a need to improve CBCT image quality and\nHounsfield Unit (HU) accuracy while preserving anatomical structures. To\nenhance the role and application value of CBCT in ART, we propose an\nenergy-guided diffusion model (EGDiff) and conduct experiments on a chest tumor\ndataset to generate synthetic CT (sCT) from CBCT. 
The experimental results\ndemonstrate impressive performance with an average absolute error of\n26.87$\\pm$6.14 HU, a structural similarity index measurement of 0.850$\\pm$0.03,\na peak signal-to-noise ratio of the sCT of 19.83$\\pm$1.39 dB, and a normalized\ncross-correlation of the sCT of 0.874$\\pm$0.04. These results indicate that our\nmethod outperforms state-of-the-art unsupervised synthesis methods in accuracy\nand visual quality, producing superior sCT images.\n","authors":["Linjie Fu","Xia Li","Xiuding Cai","Dong Miao","Yu Yao","Yali Shen"],"pdf_url":"https://arxiv.org/pdf/2308.03354v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03349v1","updated":"2023-08-07T07:03:49Z","published":"2023-08-07T07:03:49Z","title":"SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering\n Dataset for Scientific Graphs","summary":" In this work, we present SciGraphQA, a synthetic multi-turn question-answer\ndataset related to academic graphs. SciGraphQA is 13 times larger than\nChartVQA, the previously largest chart-visual question-answering dataset. It is\nalso the largest open-sourced chart VQA dataset with non-synthetic charts. To\nbuild our dataset, we selected 290,000 Computer Science or Machine Learning\nArXiv papers published between 2010 and 2020, and then used Palm-2 to generate\n295K samples of open-vocabulary multi-turn question-answering dialogues about\nthe graphs. As context, we provided the text-only Palm-2 with paper title,\nabstract, paragraph mentioning the graph, and rich text contextual data from\nthe graph itself, obtaining dialogues with an average 2.23 question-answer\nturns for each graph. We asked GPT-4 to assess the matching quality of our\nquestion-answer turns given the paper's context, obtaining an average rating of\n8.7/10 on our 3K test set. We evaluated the 0-shot capability of the most\npopular MLLM models such as LLaVa, mPLUGowl, BLIP-2, and openFlamingo's on our\ndataset, finding LLaVA-13B being the most performant with a CIDEr score of\n0.08. We further enriched the question prompts for LLAVA by including the\nserialized data tables extracted from the graphs using the DePlot model,\nboosting LLaVA's 0-shot CIDEr to 0.15. To verify the validity of our dataset,\nwe also fine-tuned LLaVa using our dataset, reaching a substantially higher\nCIDEr score of 0.26. We anticipate further accuracy improvement by including\nsegmentation mask tokens and leveraging larger LLM backbones coupled with\nemergent prompting techniques. Our code and data are open-sourced.\n","authors":["Shengzhi Li","Nima Tajbakhsh"],"pdf_url":"https://arxiv.org/pdf/2308.03349v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03348v1","updated":"2023-08-07T07:02:42Z","published":"2023-08-07T07:02:42Z","title":"Cooperative Colorization: Exploring Latent Cross-Domain Priors for NIR\n Image Spectrum Translation","summary":" Near-infrared (NIR) image spectrum translation is a challenging problem with\nmany promising applications. Existing methods struggle with the mapping\nambiguity between the NIR and the RGB domains, and generalize poorly due to the\nlimitations of models' learning capabilities and the unavailability of\nsufficient NIR-RGB image pairs for training. To address these challenges, we\npropose a cooperative learning paradigm that colorizes NIR images in parallel\nwith another proxy grayscale colorization task by exploring latent cross-domain\npriors (i.e., latent spectrum context priors and task domain priors), dubbed\nCoColor. 
The complementary statistical and semantic spectrum information from\nthese two task domains -- in the form of pre-trained colorization networks --\nis brought in as task domain priors. A bilateral domain translation module is\nsubsequently designed, in which intermittent NIR images are generated from\ngrayscale and colorized in parallel with authentic NIR images; and vice versa\nfor the grayscale images. These intermittent transformations act as latent\nspectrum context priors for efficient domain knowledge exchange. We\nprogressively fine-tune and fuse these modules with a series of pixel-level and\nfeature-level consistency constraints. Experiments show that our proposed\ncooperative learning framework produces satisfactory spectrum translation\noutputs with diverse colors and rich textures, and outperforms state-of-the-art\ncounterparts by 3.95dB and 4.66dB in terms of PSNR for the NIR and grayscale\ncolorization tasks, respectively.\n","authors":["Xingxing Yang","Jie Chen","Zaifeng Yang"],"pdf_url":"https://arxiv.org/pdf/2308.03348v1.pdf","comment":"Accepted by ACMMM 2023"},{"id":"http://arxiv.org/abs/2308.03340v1","updated":"2023-08-07T06:47:36Z","published":"2023-08-07T06:47:36Z","title":"A Hybrid CNN-Transformer Architecture with Frequency Domain Contrastive\n Learning for Image Deraining","summary":" Image deraining is a challenging task that involves restoring degraded images\naffected by rain streaks.\n","authors":["Cheng Wang","Wei Li"],"pdf_url":"https://arxiv.org/pdf/2308.03340v1.pdf","comment":"21 pages,6 figures"},{"id":"http://arxiv.org/abs/2209.10510v2","updated":"2023-08-07T06:40:13Z","published":"2022-09-21T17:15:58Z","title":"Learning to Relight Portrait Images via a Virtual Light Stage and\n Synthetic-to-Real Adaptation","summary":" Given a portrait image of a person and an environment map of the target\nlighting, portrait relighting aims to re-illuminate the person in the image as\nif the person appeared in an environment with the target lighting. To achieve\nhigh-quality results, recent methods rely on deep learning. An effective\napproach is to supervise the training of deep neural networks with a\nhigh-fidelity dataset of desired input-output pairs, captured with a light\nstage. However, acquiring such data requires an expensive special capture rig\nand time-consuming efforts, limiting access to only a few resourceful\nlaboratories. To address the limitation, we propose a new approach that can\nperform on par with the state-of-the-art (SOTA) relighting methods without\nrequiring a light stage. Our approach is based on the realization that a\nsuccessful relighting of a portrait image depends on two conditions. First, the\nmethod needs to mimic the behaviors of physically-based relighting. Second, the\noutput has to be photorealistic. To meet the first condition, we propose to\ntrain the relighting network with training data generated by a virtual light\nstage that performs physically-based rendering on various 3D synthetic humans\nunder different environment maps. To meet the second condition, we develop a\nnovel synthetic-to-real approach to bring photorealism to the relighting\nnetwork output. 
In addition to achieving SOTA results, our approach offers\nseveral advantages over the prior methods, including controllable glares on\nglasses and more temporally-consistent results for relighting videos.\n","authors":["Yu-Ying Yeh","Koki Nagano","Sameh Khamis","Jan Kautz","Ming-Yu Liu","Ting-Chun Wang"],"pdf_url":"https://arxiv.org/pdf/2209.10510v2.pdf","comment":"To appear in ACM Transactions on Graphics (SIGGRAPH Asia 2022). 21\n pages, 25 figures, 7 tables. Project page:\n https://research.nvidia.com/labs/dir/lumos/"},{"id":"http://arxiv.org/abs/2304.01198v2","updated":"2023-08-07T06:24:13Z","published":"2023-04-03T17:59:21Z","title":"Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network","summary":" Recently, the open-vocabulary semantic segmentation problem has attracted\nincreasing attention and the best performing methods are based on two-stream\nnetworks: one stream for proposal mask generation and the other for segment\nclassification using a pretrained visual-language model. However, existing\ntwo-stream methods require passing a great number of (up to a hundred) image\ncrops into the visual-language model, which is highly inefficient. To address\nthe problem, we propose a network that only needs a single pass through the\nvisual-language model for each input image. Specifically, we first propose a\nnovel network adaptation approach, termed patch severance, to restrict the\nharmful interference between the patch embeddings in the pre-trained visual\nencoder. We then propose classification anchor learning to encourage the\nnetwork to spatially focus on more discriminative features for classification.\nExtensive experiments demonstrate that the proposed method achieves outstanding\nperformance, surpassing state-of-the-art methods while being 4 to 7 times\nfaster at inference. Code: https://github.com/CongHan0808/DeOP.git\n","authors":["Cong Han","Yujie Zhong","Dengjie Li","Kai Han","Lin Ma"],"pdf_url":"https://arxiv.org/pdf/2304.01198v2.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2308.03322v1","updated":"2023-08-07T06:15:51Z","published":"2023-08-07T06:15:51Z","title":"Part-Aware Transformer for Generalizable Person Re-identification","summary":" Domain generalization person re-identification (DG-ReID) aims to train a\nmodel on source domains and generalize well on unseen domains. Vision\nTransformer usually yields better generalization ability than common CNN\nnetworks under distribution shifts. However, Transformer-based ReID models\ninevitably over-fit to domain-specific biases due to the supervised learning\nstrategy on the source domain. We observe that while the global images of\ndifferent IDs should have different features, their similar local parts (e.g.,\nblack backpack) are not bounded by this constraint. Motivated by this, we\npropose a pure Transformer model (termed Part-aware Transformer) for DG-ReID by\ndesigning a proxy task, named Cross-ID Similarity Learning (CSL), to mine local\nvisual information shared by different IDs. This proxy task allows the model to\nlearn generic features because it only cares about the visual similarity of the\nparts regardless of the ID labels, thus alleviating the side effect of\ndomain-specific biases. Based on the local similarity obtained in CSL, a\nPart-guided Self-Distillation (PSD) is proposed to further improve the\ngeneralization of global features. Our method achieves state-of-the-art\nperformance under most DG ReID settings. 
Under the Market$\\to$Duke setting, our\nmethod exceeds state-of-the-art by 10.9% and 12.8% in Rank1 and mAP,\nrespectively. The code is available at\nhttps://github.com/liyuke65535/Part-Aware-Transformer.\n","authors":["Hao Ni","Yuke Li","Heng Tao Shen","Jingkuan Song"],"pdf_url":"https://arxiv.org/pdf/2308.03322v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03321v1","updated":"2023-08-07T06:08:51Z","published":"2023-08-07T06:08:51Z","title":"AFN: Adaptive Fusion Normalization via Encoder-Decoder Framework","summary":" The success of deep learning is inseparable from normalization layers.\nResearchers have proposed various normalization functions, and each of them has\nboth advantages and disadvantages. In response, efforts have been made to\ndesign a unified normalization function that combines all normalization\nprocedures and mitigates their weaknesses. We also proposed a new normalization\nfunction called Adaptive Fusion Normalization. Through experiments, we\ndemonstrate AFN outperforms the previous normalization techniques in domain\ngeneralization and image classification tasks.\n","authors":["Zikai Zhou","Huanran Chen"],"pdf_url":"https://arxiv.org/pdf/2308.03321v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2106.01899 by other authors"},{"id":"http://arxiv.org/abs/2304.01199v2","updated":"2023-08-07T05:07:20Z","published":"2023-04-03T17:59:49Z","title":"On the Benefits of 3D Pose and Tracking for Human Action Recognition","summary":" In this work we study the benefits of using tracking and 3D poses for action\nrecognition. To achieve this, we take the Lagrangian view on analysing actions\nover a trajectory of human motion rather than at a fixed point in space. Taking\nthis stand allows us to use the tracklets of people to predict their actions.\nIn this spirit, first we show the benefits of using 3D pose to infer actions,\nand study person-person interactions. Subsequently, we propose a Lagrangian\nAction Recognition model by fusing 3D pose and contextualized appearance over\ntracklets. To this end, our method achieves state-of-the-art performance on the\nAVA v2.2 dataset on both pose only settings and on standard benchmark settings.\nWhen reasoning about the action using only pose cues, our pose model achieves\n+10.0 mAP gain over the corresponding state-of-the-art while our fused model\nhas a gain of +2.8 mAP over the best state-of-the-art model. Code and results\nare available at: https://brjathu.github.io/LART\n","authors":["Jathushan Rajasegaran","Georgios Pavlakos","Angjoo Kanazawa","Christoph Feichtenhofer","Jitendra Malik"],"pdf_url":"https://arxiv.org/pdf/2304.01199v2.pdf","comment":"CVPR2023 (project page: https://brjathu.github.io/LART)"},{"id":"http://arxiv.org/abs/2106.14490v3","updated":"2023-08-07T04:47:05Z","published":"2021-06-28T09:09:14Z","title":"Making Images Real Again: A Comprehensive Survey on Deep Image\n Composition","summary":" As a common image editing operation, image composition aims to combine the\nforeground from one image and another background image, resulting in a\ncomposite image. However, there are many issues that could make the composite\nimages unrealistic. These issues can be summarized as the inconsistency between\nforeground and background, which includes appearance inconsistency (e.g.,\nincompatible illumination), geometry inconsistency (e.g., unreasonable size),\nand semantic inconsistency (e.g., mismatched semantic context). 
The image\ncomposition task can be decomposed into multiple sub-tasks, in which each\nsub-task targets one or more issues. Specifically, object placement aims to\nfind reasonable scale, location, and shape for the foreground. Image blending\naims to address the unnatural boundary between foreground and background. Image\nharmonization aims to adjust the illumination statistics of foreground. Shadow\ngeneration aims to generate plausible shadow for the foreground. These\nsub-tasks can be executed sequentially or in parallel to acquire realistic\ncomposite images. To the best of our knowledge, there is no previous survey on\nimage composition. In this paper, we conduct a comprehensive survey of the\nsub-tasks and the combinatorial task of image composition. For each one, we\nsummarize the existing methods, available datasets, and common evaluation\nmetrics. Datasets and codes for image composition are summarized at\nhttps://github.com/bcmi/Awesome-Image-Composition.\n","authors":["Li Niu","Wenyan Cong","Liu Liu","Yan Hong","Bo Zhang","Jing Liang","Liqing Zhang"],"pdf_url":"https://arxiv.org/pdf/2106.14490v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.16415v2","updated":"2023-08-07T04:29:12Z","published":"2023-07-31T05:48:39Z","title":"DDG-Net: Discriminability-Driven Graph Network for Weakly-supervised\n Temporal Action Localization","summary":" Weakly-supervised temporal action localization (WTAL) is a practical yet\nchallenging task. Due to large-scale datasets, most existing methods use a\nnetwork pretrained on other datasets to extract features, which are not\nsuitable enough for WTAL. To address this problem, researchers design several\nmodules for feature enhancement, which improve the performance of the\nlocalization module, especially modeling the temporal relationship between\nsnippets. However, all of them neglect the adverse effects of ambiguous\ninformation, which would reduce the discriminability of others. Considering\nthis phenomenon, we propose Discriminability-Driven Graph Network (DDG-Net),\nwhich explicitly models ambiguous snippets and discriminative snippets with\nwell-designed connections, preventing the transmission of ambiguous information\nand enhancing the discriminability of snippet-level representations.\nAdditionally, we propose a feature consistency loss to prevent the assimilation\nof features and drive the graph convolution network to generate more\ndiscriminative representations. Extensive experiments on THUMOS14 and\nActivityNet1.2 benchmarks demonstrate the effectiveness of DDG-Net,\nestablishing new state-of-the-art results on both datasets. Source code is\navailable at \url{https://github.com/XiaojunTang22/ICCV2023-DDGNet}.\n","authors":["Xiaojun Tang","Junsong Fan","Chuanchen Luo","Zhaoxiang Zhang","Man Zhang","Zongyuan Yang"],"pdf_url":"https://arxiv.org/pdf/2307.16415v2.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2308.03290v1","updated":"2023-08-07T04:17:19Z","published":"2023-08-07T04:17:19Z","title":"FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization\n Search","summary":" Quantization has become a mainstream compression technique for reducing model\nsize, computational requirements, and energy consumption for modern deep neural\nnetworks (DNNs). With the improved numerical support in recent hardware,\nincluding multiple variants of integer and floating point, mixed-precision\nquantization has become necessary to achieve high-quality results with low\nmodel cost. 
Prior mixed-precision quantization methods have performed a\npost-training quantization search, which compromises on accuracy, or a\ndifferentiable quantization search, which leads to high memory usage from\nbranching. Therefore, we propose the first one-shot mixed-precision\nquantization search that eliminates the need for retraining in both integer and\nlow-precision floating point models. We evaluate our floating-point and integer\nquantization search (FLIQS) on multiple convolutional networks and vision\ntransformer models to discover Pareto-optimal models. Our approach discovers\nmodels that improve upon uniform precision, manual mixed-precision, and recent\ninteger quantization search methods. With the proposed integer quantization\nsearch, we increase the accuracy of ResNet-18 on ImageNet by 1.31% points and\nResNet-50 by 0.90% points with equivalent model cost over previous methods.\nAdditionally, for the first time, we explore a novel mixed-precision\nfloating-point search and improve MobileNetV2 by up to 0.98% points compared to\nprior state-of-the-art FP8 models. Finally, we extend FLIQS to simultaneously\nsearch a joint quantization and neural architecture space and improve the\nImageNet accuracy by 2.69% points with similar model cost on a MobileNetV2\nsearch space.\n","authors":["Jordan Dotzel","Gang Wu","Andrew Li","Muhammad Umar","Yun Ni","Mohamed S. Abdelfattah","Zhiru Zhang","Liqun Cheng","Martin G. Dixon","Norman P. Jouppi","Quoc V. Le","Sheng Li"],"pdf_url":"https://arxiv.org/pdf/2308.03290v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03286v1","updated":"2023-08-07T04:04:22Z","published":"2023-08-07T04:04:22Z","title":"Multi-Label Self-Supervised Learning with Scene Images","summary":" Self-supervised learning (SSL) methods targeting scene images have seen a\nrapid growth recently, and they mostly rely on either a dedicated dense\nmatching mechanism or a costly unsupervised object discovery module. This paper\nshows that instead of hinging on these strenuous operations, quality image\nrepresentations can be learned by treating scene/multi-label image SSL simply\nas a multi-label classification problem, which greatly simplifies the learning\nframework. Specifically, multiple binary pseudo-labels are assigned for each\ninput image by comparing its embeddings with those in two dictionaries, and the\nnetwork is optimized using the binary cross entropy loss. The proposed method\nis named Multi-Label Self-supervised learning (MLS). Visualizations\nqualitatively show that clearly the pseudo-labels by MLS can automatically find\nsemantically similar pseudo-positive pairs across different images to\nfacilitate contrastive learning. MLS learns high quality representations on\nMS-COCO and achieves state-of-the-art results on classification, detection and\nsegmentation benchmarks. 
At the same time, MLS is much simpler than existing\nmethods, making it easier to deploy and for further exploration.\n","authors":["Ke Zhu","Minghao Fu","Jianxin Wu"],"pdf_url":"https://arxiv.org/pdf/2308.03286v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03282v1","updated":"2023-08-07T03:56:15Z","published":"2023-08-07T03:56:15Z","title":"Environment-Invariant Curriculum Relation Learning for Fine-Grained\n Scene Graph Generation","summary":" The scene graph generation (SGG) task is designed to identify the predicates\nbased on the subject-object pairs.However,existing datasets generally include\ntwo imbalance cases: one is the class imbalance from the predicted predicates\nand another is the context imbalance from the given subject-object pairs, which\npresents significant challenges for SGG. Most existing methods focus on the\nimbalance of the predicted predicate while ignoring the imbalance of the\nsubject-object pairs, which could not achieve satisfactory results. To address\nthe two imbalance cases, we propose a novel Environment Invariant Curriculum\nRelation learning (EICR) method, which can be applied in a plug-and-play\nfashion to existing SGG methods. Concretely, to remove the imbalance of the\nsubject-object pairs, we first construct different distribution environments\nfor the subject-object pairs and learn a model invariant to the environment\nchanges. Then, we construct a class-balanced curriculum learning strategy to\nbalance the different environments to remove the predicate imbalance.\nComprehensive experiments conducted on VG and GQA datasets demonstrate that our\nEICR framework can be taken as a general strategy for various SGG models, and\nachieve significant improvements.\n","authors":["Yukuan Min","Aming Wu","Cheng Deng"],"pdf_url":"https://arxiv.org/pdf/2308.03282v1.pdf","comment":"ICCV2023. arXiv admin note: text overlap with arXiv:2203.11654 by\n other authors"},{"id":"http://arxiv.org/abs/2308.03280v1","updated":"2023-08-07T03:48:07Z","published":"2023-08-07T03:48:07Z","title":"Mirror-NeRF: Learning Neural Radiance Fields for Mirrors with\n Whitted-Style Ray Tracing","summary":" Recently, Neural Radiance Fields (NeRF) has exhibited significant success in\nnovel view synthesis, surface reconstruction, etc. However, since no physical\nreflection is considered in its rendering pipeline, NeRF mistakes the\nreflection in the mirror as a separate virtual scene, leading to the inaccurate\nreconstruction of the mirror and multi-view inconsistent reflections in the\nmirror. In this paper, we present a novel neural rendering framework, named\nMirror-NeRF, which is able to learn accurate geometry and reflection of the\nmirror and support various scene manipulation applications with mirrors, such\nas adding new objects or mirrors into the scene and synthesizing the\nreflections of these new objects in mirrors, controlling mirror roughness, etc.\nTo achieve this goal, we propose a unified radiance field by introducing the\nreflection probability and tracing rays following the light transport model of\nWhitted Ray Tracing, and also develop several techniques to facilitate the\nlearning process. Experiments and comparisons on both synthetic and real\ndatasets demonstrate the superiority of our method. 
The code and supplementary\nmaterial are available on the project webpage:\nhttps://zju3dv.github.io/Mirror-NeRF/.\n","authors":["Junyi Zeng","Chong Bao","Rui Chen","Zilong Dong","Guofeng Zhang","Hujun Bao","Zhaopeng Cui"],"pdf_url":"https://arxiv.org/pdf/2308.03280v1.pdf","comment":"Accepted to ACM Multimedia 2023. Project Page:\n https://zju3dv.github.io/Mirror-NeRF/"},{"id":"http://arxiv.org/abs/2308.03276v1","updated":"2023-08-07T03:35:47Z","published":"2023-08-07T03:35:47Z","title":"Spatialyze: A Geospatial Video Analytics System with Spatial-Aware\n Optimizations","summary":" Videos that are shot using commodity hardware such as phones and surveillance\ncameras record various metadata such as time and location. We encounter such\ngeospatial videos on a daily basis and such videos have been growing in volume\nsignificantly. Yet, we do not have data management systems that allow users to\ninteract with such data effectively.\n In this paper, we describe Spatialyze, a new framework for end-to-end\nquerying of geospatial videos. Spatialyze comes with a domain-specific language\nwhere users can construct geospatial video analytic workflows using a 3-step,\ndeclarative, build-filter-observe paradigm. Internally, Spatialyze leverages\nthe declarative nature of such workflows, the temporal-spatial metadata stored\nwith videos, and physical behavior of real-world objects to optimize the\nexecution of workflows. Our results using real-world videos and workflows show\nthat Spatialyze can reduce execution time by up to 5.3x, while maintaining up\nto 97.1% accuracy compared to unoptimized execution.\n","authors":["Chanwut Kittivorawong","Yongming Ge","Yousef Helal","Alvin Cheung"],"pdf_url":"https://arxiv.org/pdf/2308.03276v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03272v1","updated":"2023-08-07T03:27:04Z","published":"2023-08-07T03:27:04Z","title":"Feature-Suppressed Contrast for Self-Supervised Food Pre-training","summary":" Most previous approaches for analyzing food images have relied on extensively\nannotated datasets, resulting in significant human labeling expenses due to the\nvaried and intricate nature of such images. Inspired by the effectiveness of\ncontrastive self-supervised methods in utilizing unlabelled data, we\nexplore leveraging these techniques on unlabelled food images. In contrastive\nself-supervised methods, two views are randomly generated from an image by data\naugmentations. However, regarding food images, the two views tend to contain\nsimilar informative contents, causing large mutual information, which impedes\nthe efficacy of contrastive self-supervised learning. To address this problem,\nwe propose Feature Suppressed Contrast (FeaSC) to reduce mutual information\nbetween views. As the similar contents of the two views are salient or highly\nresponsive in the feature map, the proposed FeaSC uses a response-aware scheme\nto localize salient features in an unsupervised manner. By suppressing some\nsalient features in one view while leaving another contrast view unchanged, the\nmutual information between the two views is reduced, thereby enhancing the\neffectiveness of contrast learning for self-supervised food pre-training. As a\nplug-and-play module, the proposed method consistently improves BYOL and\nSimSiam by 1.70\% $\sim$ 6.69\% classification accuracy on four publicly\navailable food recognition datasets. 
Superior results have also been achieved\non downstream segmentation tasks, demonstrating the effectiveness of the\nproposed method.\n","authors":["Xinda Liu","Yaohui Zhu","Linhu Liu","Jiang Tian","Lili Wang"],"pdf_url":"https://arxiv.org/pdf/2308.03272v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11418v2","updated":"2023-08-07T03:18:31Z","published":"2023-07-21T08:22:14Z","title":"FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural\n Radiance Fields","summary":" As recent advances in Neural Radiance Fields (NeRF) have enabled\nhigh-fidelity 3D face reconstruction and novel view synthesis, its manipulation\nalso became an essential task in 3D vision. However, existing manipulation\nmethods require extensive human labor, such as a user-provided semantic mask\nand manual attribute search unsuitable for non-expert users. Instead, our\napproach is designed to require a single text to manipulate a face\nreconstructed with NeRF. To do so, we first train a scene manipulator, a latent\ncode-conditional deformable NeRF, over a dynamic scene to control a face\ndeformation using the latent code. However, representing a scene deformation\nwith a single latent code is unfavorable for compositing local deformations\nobserved in different instances. As so, our proposed Position-conditional\nAnchor Compositor (PAC) learns to represent a manipulated scene with spatially\nvarying latent codes. Their renderings with the scene manipulator are then\noptimized to yield high cosine similarity to a target text in CLIP embedding\nspace for text-driven manipulation. To the best of our knowledge, our approach\nis the first to address the text-driven manipulation of a face reconstructed\nwith NeRF. Extensive results, comparisons, and ablation studies demonstrate the\neffectiveness of our approach.\n","authors":["Sungwon Hwang","Junha Hyung","Daejin Kim","Min-Jung Kim","Jaegul Choo"],"pdf_url":"https://arxiv.org/pdf/2307.11418v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2308.03267v1","updated":"2023-08-07T03:16:24Z","published":"2023-08-07T03:16:24Z","title":"Redundancy-aware Transformer for Video Question Answering","summary":" This paper identifies two kinds of redundancy in the current VideoQA\nparadigm. Specifically, the current video encoders tend to holistically embed\nall video clues at different granularities in a hierarchical manner, which\ninevitably introduces \\textit{neighboring-frame redundancy} that can overwhelm\ndetailed visual clues at the object level. Subsequently, prevailing\nvision-language fusion designs introduce the \\textit{cross-modal redundancy} by\nexhaustively fusing all visual elements with question tokens without explicitly\ndifferentiating their pairwise vision-language interactions, thus making a\npernicious impact on the answering.\n To this end, we propose a novel transformer-based architecture, that aims to\nmodel VideoQA in a redundancy-aware manner. To address the neighboring-frame\nredundancy, we introduce a video encoder structure that emphasizes the\nobject-level change in neighboring frames, while adopting an out-of-neighboring\nmessage-passing scheme that imposes attention only on distant frames. As for\nthe cross-modal redundancy, we equip our fusion module with a novel adaptive\nsampling, which explicitly differentiates the vision-language interactions by\nidentifying a small subset of visual elements that exclusively support the\nanswer. 
Upon these advancements, we find this\n\\underline{R}edundancy-\\underline{a}ware trans\\underline{former} (RaFormer) can\nachieve state-of-the-art results on multiple VideoQA benchmarks.\n","authors":["Yicong Li","Xun Yang","An Zhang","Chun Feng","Xiang Wang","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2308.03267v1.pdf","comment":"Accepted to ACM MM23"},{"id":"http://arxiv.org/abs/2207.01405v4","updated":"2023-08-07T03:11:49Z","published":"2022-07-04T13:37:38Z","title":"I-ViT: Integer-only Quantization for Efficient Vision Transformer\n Inference","summary":" Vision Transformers (ViTs) have achieved state-of-the-art performance on\nvarious computer vision applications. However, these models have considerable\nstorage and computational overheads, making their deployment and efficient\ninference on edge devices challenging. Quantization is a promising approach to\nreducing model complexity, and the dyadic arithmetic pipeline can allow the\nquantized models to perform efficient integer-only inference. Unfortunately,\ndyadic arithmetic is based on the homogeneity condition in convolutional neural\nnetworks, which is not applicable to the non-linear components in ViTs, making\ninteger-only inference of ViTs an open issue. In this paper, we propose I-ViT,\nan integer-only quantization scheme for ViTs, to enable ViTs to perform the\nentire computational graph of inference with integer arithmetic and\nbit-shifting, and without any floating-point arithmetic. In I-ViT, linear\noperations (e.g., MatMul and Dense) follow the integer-only pipeline with\ndyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and\nLayerNorm) are approximated by the proposed light-weight integer-only\narithmetic methods. More specifically, I-ViT applies the proposed Shiftmax and\nShiftGELU, which are designed to use integer bit-shifting to approximate the\ncorresponding floating-point operations. We evaluate I-ViT on various benchmark\nmodels and the results show that integer-only INT8 quantization achieves\ncomparable (or even slightly higher) accuracy to the full-precision (FP)\nbaseline. Furthermore, we utilize TVM for practical hardware deployment on the\nGPU's integer arithmetic units, achieving 3.72$\\sim$4.11$\\times$ inference\nspeedup compared to the FP model. Code of both Pytorch and TVM is released at\nhttps://github.com/zkkli/I-ViT.\n","authors":["Zhikai Li","Qingyi Gu"],"pdf_url":"https://arxiv.org/pdf/2207.01405v4.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2212.08632v2","updated":"2023-08-07T03:02:06Z","published":"2022-12-16T18:12:04Z","title":"Enhancing Multi-modal and Multi-hop Question Answering via Structured\n Knowledge and Unified Retrieval-Generation","summary":" Multi-modal multi-hop question answering involves answering a question by\nreasoning over multiple input sources from different modalities. Existing\nmethods often retrieve evidences separately and then use a language model to\ngenerate an answer based on the retrieved evidences, and thus do not adequately\nconnect candidates and are unable to model the interdependent relations during\nretrieval. Moreover, the pipelined approaches of retrieval and generation might\nresult in poor generation performance when retrieval performance is low. To\naddress these issues, we propose a Structured Knowledge and Unified\nRetrieval-Generation (SKURG) approach. SKURG employs an Entity-centered Fusion\nEncoder to align sources from different modalities using shared entities. 
It\nthen uses a unified Retrieval-Generation Decoder to integrate intermediate\nretrieval results for answer generation and also adaptively determine the\nnumber of retrieval steps. Extensive experiments on two representative\nmulti-modal multi-hop QA datasets MultimodalQA and WebQA demonstrate that SKURG\noutperforms the state-of-the-art models in both source retrieval and answer\ngeneration performance with fewer parameters. Our code is available at\nhttps://github.com/HITsz-TMG/SKURG.\n","authors":["Qian Yang","Qian Chen","Wen Wang","Baotian Hu","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2212.08632v2.pdf","comment":"Accepted by ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2212.08254v2","updated":"2023-08-07T03:00:41Z","published":"2022-12-16T02:52:37Z","title":"RepQ-ViT: Scale Reparameterization for Post-Training Quantization of\n Vision Transformers","summary":" Post-training quantization (PTQ), which only requires a tiny dataset for\ncalibration without end-to-end retraining, is a light and practical model\ncompression technique. Recently, several PTQ schemes for vision transformers\n(ViTs) have been presented; unfortunately, they typically suffer from\nnon-trivial accuracy degradation, especially in low-bit cases. In this paper,\nwe propose RepQ-ViT, a novel PTQ framework for ViTs based on quantization scale\nreparameterization, to address the above issues. RepQ-ViT decouples the\nquantization and inference processes, where the former employs complex\nquantizers and the latter employs scale-reparameterized simplified quantizers.\nThis ensures both accurate quantization and efficient inference, which\ndistinguishes it from existing approaches that sacrifice quantization\nperformance to meet the target hardware. More specifically, we focus on two\ncomponents with extreme distributions: post-LayerNorm activations with severe\ninter-channel variation and post-Softmax activations with power-law features,\nand initially apply channel-wise quantization and log$\\sqrt{2}$ quantization,\nrespectively. Then, we reparameterize the scales to hardware-friendly\nlayer-wise quantization and log2 quantization for inference, with only slight\naccuracy or computational costs. Extensive experiments are conducted on\nmultiple vision tasks with different model variants, proving that RepQ-ViT,\nwithout hyperparameters and expensive reconstruction procedures, can outperform\nexisting strong baselines and encouragingly improve the accuracy of 4-bit PTQ\nof ViTs to a usable level. Code is available at\nhttps://github.com/zkkli/RepQ-ViT.\n","authors":["Zhikai Li","Junrui Xiao","Lianwei Yang","Qingyi Gu"],"pdf_url":"https://arxiv.org/pdf/2212.08254v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2308.03262v1","updated":"2023-08-07T02:57:48Z","published":"2023-08-07T02:57:48Z","title":"A Benchmark for Chinese-English Scene Text Image Super-resolution","summary":" Scene Text Image Super-resolution (STISR) aims to recover high-resolution\n(HR) scene text images with visually pleasant and readable text content from\nthe given low-resolution (LR) input. Most existing works focus on recovering\nEnglish texts, which have relatively simple character structures, while little\nwork has been done on the more challenging Chinese texts with diverse and\ncomplex character structures. In this paper, we propose a real-world\nChinese-English benchmark dataset, namely Real-CE, for the task of STISR with\nthe emphasis on restoring structurally complex Chinese characters. 
The\nbenchmark provides 1,935/783 real-world LR-HR text image pairs~(contains 33,789\ntext lines in total) for training/testing in 2$\\times$ and 4$\\times$ zooming\nmodes, complemented by detailed annotations, including detection boxes and text\ntranscripts. Moreover, we design an edge-aware learning method, which provides\nstructural supervision in image and feature domains, to effectively reconstruct\nthe dense structures of Chinese characters. We conduct experiments on the\nproposed Real-CE benchmark and evaluate the existing STISR models with and\nwithout our edge-aware loss. The benchmark, including data and source code, is\navailable at https://github.com/mjq11302010044/Real-CE.\n","authors":["Jianqi Ma","Zhetong Liang","Wangmeng Xiang","Xi Yang","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03262v1.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2308.02153v2","updated":"2023-08-07T02:33:21Z","published":"2023-08-04T06:20:20Z","title":"Robust Self-Supervised Extrinsic Self-Calibration","summary":" Autonomous vehicles and robots need to operate over a wide variety of\nscenarios in order to complete tasks efficiently and safely. Multi-camera\nself-supervised monocular depth estimation from videos is a promising way to\nreason about the environment, as it generates metrically scaled geometric\npredictions from visual data without requiring additional sensors. However,\nmost works assume well-calibrated extrinsics to fully leverage this\nmulti-camera setup, even though accurate and efficient calibration is still a\nchallenging problem. In this work, we introduce a novel method for extrinsic\ncalibration that builds upon the principles of self-supervised monocular depth\nand ego-motion learning. Our proposed curriculum learning strategy uses\nmonocular depth and pose estimators with velocity supervision to estimate\nextrinsics, and then jointly learns extrinsic calibration along with depth and\npose for a set of overlapping cameras rigidly attached to a moving vehicle.\nExperiments on a benchmark multi-camera dataset (DDAD) demonstrate that our\nmethod enables self-calibration in various scenes robustly and efficiently\ncompared to a traditional vision-based pose estimation pipeline. Furthermore,\nwe demonstrate the benefits of extrinsics self-calibration as a way to improve\ndepth prediction via joint optimization.\n","authors":["Takayuki Kanai","Igor Vasiljevic","Vitor Guizilini","Adrien Gaidon","Rares Ambrus"],"pdf_url":"https://arxiv.org/pdf/2308.02153v2.pdf","comment":"Project page: https://sites.google.com/view/tri-sesc"},{"id":"http://arxiv.org/abs/2308.03258v1","updated":"2023-08-07T02:30:47Z","published":"2023-08-07T02:30:47Z","title":"APBench: A Unified Benchmark for Availability Poisoning Attacks and\n Defenses","summary":" The efficacy of availability poisoning, a method of poisoning data by\ninjecting imperceptible perturbations to prevent its use in model training, has\nbeen a hot subject of investigation. Previous research suggested that it was\ndifficult to effectively counteract such poisoning attacks. However, the\nintroduction of various defense methods has challenged this notion. Due to the\nrapid progress in this field, the performance of different novel methods cannot\nbe accurately validated due to variations in experimental setups. To further\nevaluate the attack and defense capabilities of these poisoning methods, we\nhave developed a benchmark -- APBench for assessing the efficacy of adversarial\npoisoning. 
APBench consists of 9 state-of-the-art availability poisoning\nattacks, 8 defense algorithms, and 4 conventional data augmentation techniques.\nWe also have set up experiments with varying different poisoning ratios, and\nevaluated the attacks on multiple datasets and their transferability across\nmodel architectures. We further conducted a comprehensive evaluation of 2\nadditional attacks specifically targeting unsupervised models. Our results\nreveal the glaring inadequacy of existing attacks in safeguarding individual\nprivacy. APBench is open source and available to the deep learning community:\nhttps://github.com/lafeat/apbench.\n","authors":["Tianrui Qin","Xitong Gao","Juanjuan Zhao","Kejiang Ye","Cheng-Zhong Xu"],"pdf_url":"https://arxiv.org/pdf/2308.03258v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03256v1","updated":"2023-08-07T02:25:06Z","published":"2023-08-07T02:25:06Z","title":"Learning a Graph Neural Network with Cross Modality Interaction for\n Image Fusion","summary":" Infrared and visible image fusion has gradually proved to be a vital fork in\nthe field of multi-modality imaging technologies. In recent developments,\nresearchers not only focus on the quality of fused images but also evaluate\ntheir performance in downstream tasks. Nevertheless, the majority of methods\nseldom put their eyes on the mutual learning from different modalities,\nresulting in fused images lacking significant details and textures. To overcome\nthis issue, we propose an interactive graph neural network (GNN)-based\narchitecture between cross modality for fusion, called IGNet. Specifically, we\nfirst apply a multi-scale extractor to achieve shallow features, which are\nemployed as the necessary input to build graph structures. Then, the graph\ninteraction module can construct the extracted intermediate features of the\ninfrared/visible branch into graph structures. Meanwhile, the graph structures\nof two branches interact for cross-modality and semantic learning, so that\nfused images can maintain the important feature expressions and enhance the\nperformance of downstream tasks. Besides, the proposed leader nodes can improve\ninformation propagation in the same modality. Finally, we merge all graph\nfeatures to get the fusion result. Extensive experiments on different datasets\n(TNO, MFNet and M3FD) demonstrate that our IGNet can generate visually\nappealing fused images while scoring averagely 2.59% mAP@.5 and 7.77% mIoU\nhigher in detection and segmentation than the compared state-of-the-art\nmethods. The source code of the proposed IGNet can be available at\nhttps://github.com/lok-18/IGNet.\n","authors":["Jiawei Li","Jiansheng Chen","Jinyuan Liu","Huimin Ma"],"pdf_url":"https://arxiv.org/pdf/2308.03256v1.pdf","comment":"9 pages, 10 figures, ACM MM 2023"},{"id":"http://arxiv.org/abs/2308.03244v1","updated":"2023-08-07T01:43:25Z","published":"2023-08-07T01:43:25Z","title":"Mind the Gap: Improving Success Rate of Vision-and-Language Navigation\n by Revisiting Oracle Success Routes","summary":" Vision-and-Language Navigation (VLN) aims to navigate to the target location\nby following a given instruction. Unlike existing methods focused on predicting\na more accurate action at each step in navigation, in this paper, we make the\nfirst attempt to tackle a long-ignored problem in VLN: narrowing the gap\nbetween Success Rate (SR) and Oracle Success Rate (OSR). 
We observe a\nconsistently large gap (up to 9%) on four state-of-the-art VLN methods across\ntwo benchmark datasets: R2R and REVERIE. The high OSR indicates the robot agent\npasses the target location, while the low SR suggests the agent actually fails\nto stop at the target location at last. Instead of predicting actions directly,\nwe propose to mine the target location from a trajectory given by off-the-shelf\nVLN models. Specially, we design a multi-module transformer-based model for\nlearning compact discriminative trajectory viewpoint representation, which is\nused to predict the confidence of being a target location as described in the\ninstruction. The proposed method is evaluated on three widely-adopted datasets:\nR2R, REVERIE and NDH, and shows promising results, demonstrating the potential\nfor more future research.\n","authors":["Chongyang Zhao","Yuankai Qi","Qi Wu"],"pdf_url":"https://arxiv.org/pdf/2308.03244v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.12294v2","updated":"2023-08-07T01:21:19Z","published":"2022-12-23T12:51:42Z","title":"FFNeRV: Flow-Guided Frame-Wise Neural Representations for Videos","summary":" Neural fields, also known as coordinate-based or implicit neural\nrepresentations, have shown a remarkable capability of representing,\ngenerating, and manipulating various forms of signals. For video\nrepresentations, however, mapping pixel-wise coordinates to RGB colors has\nshown relatively low compression performance and slow convergence and inference\nspeed. Frame-wise video representation, which maps a temporal coordinate to its\nentire frame, has recently emerged as an alternative method to represent\nvideos, improving compression rates and encoding speed. While promising, it has\nstill failed to reach the performance of state-of-the-art video compression\nalgorithms. In this work, we propose FFNeRV, a novel method for incorporating\nflow information into frame-wise representations to exploit the temporal\nredundancy across the frames in videos inspired by the standard video codecs.\nFurthermore, we introduce a fully convolutional architecture, enabled by\none-dimensional temporal grids, improving the continuity of spatial features.\nExperimental results show that FFNeRV yields the best performance for video\ncompression and frame interpolation among the methods using frame-wise\nrepresentations or neural fields. To reduce the model size even further, we\ndevise a more compact convolutional architecture using the group and pointwise\nconvolutions. With model compression techniques, including quantization-aware\ntraining and entropy coding, FFNeRV outperforms widely-used standard video\ncodecs (H.264 and HEVC) and performs on par with state-of-the-art video\ncompression algorithms.\n","authors":["Joo Chan Lee","Daniel Rho","Jong Hwan Ko","Eunbyung Park"],"pdf_url":"https://arxiv.org/pdf/2212.12294v2.pdf","comment":"Our project page including code is available at\n https://maincold2.github.io/ffnerv/"},{"id":"http://arxiv.org/abs/2206.02659v5","updated":"2023-08-07T01:20:01Z","published":"2022-06-06T14:52:46Z","title":"Robust Fine-Tuning of Deep Neural Networks with Hessian-based\n Generalization Guarantees","summary":" We consider fine-tuning a pretrained deep neural network on a target task. We\nstudy the generalization properties of fine-tuning to understand the problem of\noverfitting, which has often been observed (e.g., when the target dataset is\nsmall or when the training labels are noisy). 
Existing generalization measures\nfor deep networks depend on notions such as distance from the initialization\n(i.e., the pretrained network) of the fine-tuned model and noise stability\nproperties of deep networks. This paper identifies a Hessian-based distance\nmeasure through PAC-Bayesian analysis, which is shown to correlate well with\nobserved generalization gaps of fine-tuned models. Theoretically, we prove\nHessian distance-based generalization bounds for fine-tuned models. We also\ndescribe an extended study of fine-tuning against label noise, where\noverfitting is against a critical problem; We present an algorithm and a\ngeneralization error guarantee for this algorithm under a class conditional\nindependent noise model. Empirically, we observe that the Hessian-based\ndistance measure can match the scale of the observed generalization gap of\nfine-tuned models in practice. We also test our algorithm on several image\nclassification tasks with noisy training labels, showing notable gains over\nprior methods, and the Hessian distance measure of the fine-tuned model\ndecreases substantially.\n","authors":["Haotian Ju","Dongyue Li","Hongyang R. Zhang"],"pdf_url":"https://arxiv.org/pdf/2206.02659v5.pdf","comment":"37 pages. Appeared in ICML 2022"},{"id":"http://arxiv.org/abs/2308.03950v1","updated":"2023-08-07T23:41:55Z","published":"2023-08-07T23:41:55Z","title":"Zero-shot Skeleton-based Action Recognition via Mutual Information\n Estimation and Maximization","summary":" Zero-shot skeleton-based action recognition aims to recognize actions of\nunseen categories after training on data of seen categories. The key is to\nbuild the connection between visual and semantic space from seen to unseen\nclasses. Previous studies have primarily focused on encoding sequences into a\nsingular feature vector, with subsequent mapping the features to an identical\nanchor point within the embedded space. Their performance is hindered by 1) the\nignorance of the global visual/semantic distribution alignment, which results\nin a limitation to capture the true interdependence between the two spaces. 2)\nthe negligence of temporal information since the frame-wise features with rich\naction clues are directly pooled into a single feature vector. We propose a new\nzero-shot skeleton-based action recognition method via mutual information (MI)\nestimation and maximization. Specifically, 1) we maximize the MI between visual\nand semantic space for distribution alignment; 2) we leverage the temporal\ninformation for estimating the MI by encouraging MI to increase as more frames\nare observed. Extensive experiments on three large-scale skeleton action\ndatasets confirm the effectiveness of our method. Code:\nhttps://github.com/YujieOuO/SMIE.\n","authors":["Yujie Zhou","Wenwen Qiang","Anyi Rao","Ning Lin","Bing Su","Jiaqi Wang"],"pdf_url":"https://arxiv.org/pdf/2308.03950v1.pdf","comment":"Accepted by ACM MM 2023"},{"id":"http://arxiv.org/abs/2204.11041v2","updated":"2023-08-07T22:47:07Z","published":"2022-04-23T10:19:58Z","title":"Learning by Erasing: Conditional Entropy based Transferable\n Out-Of-Distribution Detection","summary":" Out-of-distribution (OOD) detection is essential to handle the distribution\nshifts between training and test scenarios. For a new in-distribution (ID)\ndataset, existing methods require retraining to capture the dataset-specific\nfeature representation or data distribution. 
In this paper, we propose a deep\ngenerative models (DGM) based transferable OOD detection method, which is\nunnecessary to retrain on a new ID dataset. We design an image erasing strategy\nto equip exclusive conditional entropy distribution for each ID dataset, which\ndetermines the discrepancy of DGM's posteriori ucertainty distribution on\ndifferent ID datasets. Owing to the powerful representation capacity of\nconvolutional neural networks, the proposed model trained on complex dataset\ncan capture the above discrepancy between ID datasets without retraining and\nthus achieve transferable OOD detection. We validate the proposed method on\nfive datasets and verity that ours achieves comparable performance to the\nstate-of-the-art group based OOD detection methods that need to be retrained to\ndeploy on new ID datasets. Our code is available at\nhttps://github.com/oOHCIOo/CETOOD.\n","authors":["Meng Xing","Zhiyong Feng","Yong Su","Changjae Oh"],"pdf_url":"https://arxiv.org/pdf/2204.11041v2.pdf","comment":"update new experimental results"},{"id":"http://arxiv.org/abs/2308.03939v1","updated":"2023-08-07T22:44:26Z","published":"2023-08-07T22:44:26Z","title":"Deterministic Neural Illumination Mapping for Efficient Auto-White\n Balance Correction","summary":" Auto-white balance (AWB) correction is a critical operation in image signal\nprocessors for accurate and consistent color correction across various\nillumination scenarios. This paper presents a novel and efficient AWB\ncorrection method that achieves at least 35 times faster processing with\nequivalent or superior performance on high-resolution images for the current\nstate-of-the-art methods. Inspired by deterministic color style transfer, our\napproach introduces deterministic illumination color mapping, leveraging\nlearnable projection matrices for both canonical illumination form and\nAWB-corrected output. It involves feeding high-resolution images and\ncorresponding latent representations into a mapping module to derive a\ncanonical form, followed by another mapping module that maps the pixel values\nto those for the corrected version. This strategy is designed as\nresolution-agnostic and also enables seamless integration of any pre-trained\nAWB network as the backbone. Experimental results confirm the effectiveness of\nour approach, revealing significant performance improvements and reduced time\ncomplexity compared to state-of-the-art methods. Our method provides an\nefficient deep learning-based AWB correction solution, promising real-time,\nhigh-quality color correction for digital imaging applications. Source code is\navailable at https://github.com/birdortyedi/DeNIM/\n","authors":["Furkan Kınlı","Doğa Yılmaz","Barış Özcan","Furkan Kıraç"],"pdf_url":"https://arxiv.org/pdf/2308.03939v1.pdf","comment":"9 pages, 5 figures, ICCV 2023 Workshops (RCV 2023)"},{"id":"http://arxiv.org/abs/2308.03936v1","updated":"2023-08-07T22:39:44Z","published":"2023-08-07T22:39:44Z","title":"ALFA -- Leveraging All Levels of Feature Abstraction for Enhancing the\n Generalization of Histopathology Image Classification Across Unseen Hospitals","summary":" We propose an exhaustive methodology that leverages all levels of feature\nabstraction, targeting an enhancement in the generalizability of image\nclassification to unobserved hospitals. Our approach incorporates\naugmentation-based self-supervision with common distribution shifts in\nhistopathology scenarios serving as the pretext task. 
This enables us to derive\ninvariant features from training images without relying on training labels,\nthereby covering different abstraction levels. Moving onto the subsequent\nabstraction level, we employ a domain alignment module to facilitate further\nextraction of invariant features across varying training hospitals. To\nrepresent the highly specific features of participating hospitals, an encoder\nis trained to classify hospital labels, independent of their diagnostic labels.\nThe features from each of these encoders are subsequently disentangled to\nminimize redundancy and segregate the features. This representation, which\nspans a broad spectrum of semantic information, enables the development of a\nmodel demonstrating increased robustness to unseen images from disparate\ndistributions. Experimental results from the PACS dataset (a domain\ngeneralization benchmark), a synthetic dataset created by applying\nhistopathology-specific jitters to the MHIST dataset (defining different\ndomains with varied distribution shifts), and a Renal Cell Carcinoma dataset\nderived from four image repositories from TCGA, collectively indicate that our\nproposed model is adept at managing varying levels of image granularity. Thus,\nit shows improved generalizability when faced with new, out-of-distribution\nhospital images.\n","authors":["Milad Sikaroudi","Shahryar Rahnamayan","H. R. Tizhoosh"],"pdf_url":"https://arxiv.org/pdf/2308.03936v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.01248v2","updated":"2023-08-07T22:21:24Z","published":"2022-04-04T05:27:40Z","title":"Differentiable Rendering for Synthetic Aperture Radar Imagery","summary":" There is rising interest in differentiable rendering, which allows explicitly\nmodeling geometric priors and constraints in optimization pipelines using\nfirst-order methods such as backpropagation. Incorporating such domain\nknowledge can lead to deep neural networks that are trained more robustly and\nwith limited data, as well as the capability to solve ill-posed inverse\nproblems. Existing efforts in differentiable rendering have focused on imagery\nfrom electro-optical sensors, particularly conventional RGB-imagery. In this\nwork, we propose an approach for differentiable rendering of Synthetic Aperture\nRadar (SAR) imagery, which combines methods from 3D computer graphics with\nneural rendering. We demonstrate the approach on the inverse graphics problem\nof 3D Object Reconstruction from limited SAR imagery using high-fidelity\nsimulated SAR data.\n","authors":["Michael Wilmanski","Jonathan Tamir"],"pdf_url":"https://arxiv.org/pdf/2204.01248v2.pdf","comment":"This version of the manuscript is an updated preprint which has been\n recently accepted by IEEE Transactions on Aerospace Electronic Systems, but\n has not yet been published or processed by IEEE"},{"id":"http://arxiv.org/abs/2307.16074v2","updated":"2023-08-07T22:11:33Z","published":"2023-07-29T20:46:44Z","title":"Iterative Graph Filtering Network for 3D Human Pose Estimation","summary":" Graph convolutional networks (GCNs) have proven to be an effective approach\nfor 3D human pose estimation. By naturally modeling the skeleton structure of\nthe human body as a graph, GCNs are able to capture the spatial relationships\nbetween joints and learn an efficient representation of the underlying pose.\nHowever, most GCN-based methods use a shared weight matrix, making it\nchallenging to accurately capture the different and complex relationships\nbetween joints. 
In this paper, we introduce an iterative graph filtering\nframework for 3D human pose estimation, which aims to predict the 3D joint\npositions given a set of 2D joint locations in images. Our approach builds upon\nthe idea of iteratively solving graph filtering with Laplacian regularization\nvia the Gauss-Seidel iterative method. Motivated by this iterative solution, we\ndesign a Gauss-Seidel network (GS-Net) architecture, which makes use of weight\nand adjacency modulation, skip connection, and a pure convolutional block with\nlayer normalization. Adjacency modulation facilitates the learning of edges\nthat go beyond the inherent connections of body joints, resulting in an\nadjusted graph structure that reflects the human skeleton, while skip\nconnections help maintain crucial information from the input layer's initial\nfeatures as the network depth increases. We evaluate our proposed model on two\nstandard benchmark datasets, and compare it with a comprehensive set of strong\nbaseline methods for 3D human pose estimation. Our experimental results\ndemonstrate that our approach outperforms the baseline methods on both\ndatasets, achieving state-of-the-art performance. Furthermore, we conduct\nablation studies to analyze the contributions of different components of our\nmodel architecture and show that the skip connection and adjacency modulation\nhelp improve the model performance.\n","authors":["Zaedul Islam","A. Ben Hamza"],"pdf_url":"https://arxiv.org/pdf/2307.16074v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.05116v2","updated":"2023-08-07T21:50:04Z","published":"2022-12-09T20:45:09Z","title":"Leveraging Contextual Data Augmentation for Generalizable Melanoma\n Detection","summary":" While skin cancer detection has been a valuable deep learning application for\nyears, its evaluation has often neglected the context in which testing images\nare assessed. Traditional melanoma classifiers assume that their testing\nenvironments are comparable to the structured images they are trained on. This\npaper challenges this notion and argues that mole size, a critical attribute in\nprofessional dermatology, can be misleading in automated melanoma detection.\nWhile malignant melanomas tend to be larger than benign melanomas, relying\nsolely on size can be unreliable and even harmful when contextual scaling of\nimages is not possible. To address this issue, this implementation proposes a\ncustom model that performs various data augmentation procedures to prevent\noverfitting to incorrect parameters and simulate real-world usage of melanoma\ndetection applications. Multiple custom models employing different forms of\ndata augmentation are implemented to highlight the most significant features of\nmole classifiers. These implementations emphasize the importance of considering\nuser unpredictability when deploying such applications. The caution required\nwhen manually modifying data is acknowledged, as it can result in data loss and\nbiased conclusions. 
Additionally, the significance of data augmentation in both\nthe dermatology and deep learning communities is considered.\n","authors":["Nick DiSanto","Gavin Harding","Ethan Martinez","Benjamin Sanders"],"pdf_url":"https://arxiv.org/pdf/2212.05116v2.pdf","comment":"6 pages, 3 figures, 4 tables"},{"id":"http://arxiv.org/abs/2308.03908v1","updated":"2023-08-07T20:50:54Z","published":"2023-08-07T20:50:54Z","title":"ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings\n for Video Action Recognition","summary":" Video Action Recognition (VAR) is a challenging task due to its inherent\ncomplexities. Though different approaches have been explored in the literature,\ndesigning a unified framework to recognize a large number of human actions is\nstill a challenging problem. Recently, Multi-Modal Learning (MML) has\ndemonstrated promising results in this domain. In literature, 2D skeleton or\npose modality has often been used for this task, either independently or in\nconjunction with the visual information (RGB modality) present in videos.\nHowever, the combination of pose, visual information, and text attributes has\nnot been explored yet, though text and pose attributes independently have been\nproven to be effective in numerous computer vision tasks. In this paper, we\npresent the first pose augmented Vision-language model (VLM) for VAR. Notably,\nour scheme achieves an accuracy of 92.81% and 73.02% on two popular human video\naction recognition benchmark datasets, UCF-101 and HMDB-51, respectively, even\nwithout any video data pre-training, and an accuracy of 96.11% and 75.75% after\nkinetics pre-training.\n","authors":["Soumyabrata Chaudhuri","Saumik Bhattacharya"],"pdf_url":"https://arxiv.org/pdf/2308.03908v1.pdf","comment":"7 pages, 3 figures, 2 Tables"},{"id":"http://arxiv.org/abs/2308.03906v1","updated":"2023-08-07T20:48:07Z","published":"2023-08-07T20:48:07Z","title":"TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal\n Backdoored Models","summary":" We present a Multimodal Backdoor Defense technique TIJO (Trigger Inversion\nusing Joint Optimization). Recent work arXiv:2112.07668 has demonstrated\nsuccessful backdoor attacks on multimodal models for the Visual Question\nAnswering task. Their dual-key backdoor trigger is split across two modalities\n(image and text), such that the backdoor is activated if and only if the\ntrigger is present in both modalities. We propose TIJO that defends against\ndual-key attacks through a joint optimization that reverse-engineers the\ntrigger in both the image and text modalities. This joint optimization is\nchallenging in multimodal models due to the disconnected nature of the visual\npipeline which consists of an offline feature extractor, whose output is then\nfused with the text using a fusion module. The key insight enabling the joint\noptimization in TIJO is that the trigger inversion needs to be carried out in\nthe object detection box feature space as opposed to the pixel space. We\ndemonstrate the effectiveness of our method on the TrojVQA benchmark, where\nTIJO improves upon the state-of-the-art unimodal methods from an AUC of 0.6 to\n0.92 on multimodal dual-key backdoors. Furthermore, our method also improves\nupon the unimodal baselines on unimodal backdoors. We present ablation studies\nand qualitative results to provide insights into our algorithm such as the\ncritical importance of overlaying the inverted feature triggers on all visual\nfeatures during trigger inversion. 
The prototype implementation of TIJO is\navailable at https://github.com/SRI-CSL/TIJO.\n","authors":["Indranil Sur","Karan Sikka","Matthew Walmer","Kaushik Koneripalli","Anirban Roy","Xiao Lin","Ajay Divakaran","Susmit Jha"],"pdf_url":"https://arxiv.org/pdf/2308.03906v1.pdf","comment":"Published as conference paper at ICCV 2023. 13 pages, 6 figures, 7\n tables"},{"id":"http://arxiv.org/abs/2308.03900v1","updated":"2023-08-07T20:23:39Z","published":"2023-08-07T20:23:39Z","title":"Developability Approximation for Neural Implicits through Rank\n Minimization","summary":" Developability refers to the process of creating a surface without any\ntearing or shearing from a two-dimensional plane. It finds practical\napplications in the fabrication industry. An essential characteristic of a\ndevelopable 3D surface is its zero Gaussian curvature, which means that either\none or both of the principal curvatures are zero. This paper introduces a\nmethod for reconstructing an approximate developable surface from a neural\nimplicit surface. The central idea of our method involves incorporating a\nregularization term that operates on the second-order derivatives of the neural\nimplicits, effectively promoting zero Gaussian curvature. Implicit surfaces\noffer the advantage of smoother deformation with infinite resolution,\novercoming the high polygonal constraints of state-of-the-art methods using\ndiscrete representations. We draw inspiration from the properties of surface\ncurvature and employ rank minimization techniques derived from compressed\nsensing. Experimental results on both developable and non-developable surfaces,\nincluding those affected by noise, validate the generalizability of our method.\n","authors":["Pratheba Selvaraju"],"pdf_url":"https://arxiv.org/pdf/2308.03900v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12622v4","updated":"2023-08-07T20:10:51Z","published":"2023-07-24T08:51:49Z","title":"Phase Matching for Out-of-Distribution Generalization","summary":" The Fourier transform, serving as an explicit decomposition method for visual\nsignals, has been employed to explain the out-of-distribution generalization\nbehaviors of Convolutional Neural Networks (CNNs). Previous studies have\nindicated that the amplitude spectrum is susceptible to the disturbance caused\nby distribution shifts. On the other hand, the phase spectrum preserves\nhighly-structured spatial information, which is crucial for robust visual\nrepresentation learning. However, the spatial relationships of phase spectrum\nremain unexplored in previous researches. In this paper, we aim to clarify the\nrelationships between Domain Generalization (DG) and the frequency components,\nand explore the spatial relationships of the phase spectrum. Specifically, we\nfirst introduce a Fourier-based structural causal model which interprets the\nphase spectrum as semi-causal factors and the amplitude spectrum as non-causal\nfactors. Then, we propose Phase Matching (PhaMa) to address DG problems. Our\nmethod introduces perturbations on the amplitude spectrum and establishes\nspatial relationships to match the phase components. 
Through experiments on\nmultiple benchmarks, we demonstrate that our proposed method achieves\nstate-of-the-art performance in domain generalization and out-of-distribution\nrobustness tasks.\n","authors":["Chengming Hu","Yeqian Du","Rui Wang","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2307.12622v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.00501v4","updated":"2023-08-07T19:10:18Z","published":"2023-04-02T10:27:34Z","title":"A Comprehensive Review of YOLO: From YOLOv1 and Beyond","summary":" YOLO has become a central real-time object detection system for robotics,\ndriverless cars, and video monitoring applications. We present a comprehensive\nanalysis of YOLO's evolution, examining the innovations and contributions in\neach iteration from the original YOLO up to YOLOv8, YOLO-NAS, and YOLO with\nTransformers. We start by describing the standard metrics and postprocessing;\nthen, we discuss the major changes in network architecture and training tricks\nfor each model. Finally, we summarize the essential lessons from YOLO's\ndevelopment and provide a perspective on its future, highlighting potential\nresearch directions to enhance real-time object detection systems.\n","authors":["Juan Terven","Diana Cordova-Esparza"],"pdf_url":"https://arxiv.org/pdf/2304.00501v4.pdf","comment":"34 pages, 19 figures, 4 tables, submitted to ACM Computing Surveys.\n This version adds information about YOLO with transformers"},{"id":"http://arxiv.org/abs/2308.03867v1","updated":"2023-08-07T18:39:14Z","published":"2023-08-07T18:39:14Z","title":"From Sky to the Ground: A Large-scale Benchmark and Simple Baseline\n Towards Real Rain Removal","summary":" Learning-based image deraining methods have made great progress. However, the\nlack of large-scale high-quality paired training samples is the main bottleneck\nto hamper the real image deraining (RID). To address this dilemma and advance\nRID, we construct a Large-scale High-quality Paired real rain benchmark\n(LHP-Rain), including 3000 video sequences with 1 million high-resolution\n(1920*1080) frame pairs. The advantages of the proposed dataset over the\nexisting ones are three-fold: rain with higher-diversity and larger-scale,\nimage with higher-resolution and higher-quality ground-truth. Specifically, the\nreal rains in LHP-Rain not only contain the classical rain\nstreak/veiling/occlusion in the sky, but also the \\textbf{splashing on the\nground} overlooked by deraining community. Moreover, we propose a novel robust\nlow-rank tensor recovery model to generate the GT with better separating the\nstatic background from the dynamic rain. In addition, we design a simple\ntransformer-based single image deraining baseline, which simultaneously utilize\nthe self-attention and cross-layer attention within the image and rain layer\nwith discriminative feature representation. 
Extensive experiments verify the\nsuperiority of the proposed dataset and deraining method over state-of-the-art.\n","authors":["Yun Guo","Xueyao Xiao","Yi Chang","Shumin Deng","Luxin Yan"],"pdf_url":"https://arxiv.org/pdf/2308.03867v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2308.03865v1","updated":"2023-08-07T18:27:04Z","published":"2023-08-07T18:27:04Z","title":"DefCor-Net: Physics-Aware Ultrasound Deformation Correction","summary":" The recovery of morphologically accurate anatomical images from deformed ones\nis challenging in ultrasound (US) image acquisition, but crucial to accurate\nand consistent diagnosis, particularly in the emerging field of\ncomputer-assisted diagnosis. This article presents a novel anatomy-aware\ndeformation correction approach based on a coarse-to-fine, multi-scale deep\nneural network (DefCor-Net). To achieve pixel-wise performance, DefCor-Net\nincorporates biomedical knowledge by estimating pixel-wise stiffness online\nusing a U-shaped feature extractor. The deformation field is then computed\nusing polynomial regression by integrating the measured force applied by the US\nprobe. Based on real-time estimation of pixel-by-pixel tissue properties, the\nlearning-based approach enables the potential for anatomy-aware deformation\ncorrection. To demonstrate the effectiveness of the proposed DefCor-Net, images\nrecorded at multiple locations on forearms and upper arms of six volunteers are\nused to train and validate DefCor-Net. The results demonstrate that DefCor-Net\ncan significantly improve the accuracy of deformation correction to recover the\noriginal geometry (Dice Coefficient: from $14.3\\pm20.9$ to $82.6\\pm12.1$ when\nthe force is $6N$).\n","authors":["Zhongliang Jiang","Yue Zhou","Dongliang Cao","Nassir Navab"],"pdf_url":"https://arxiv.org/pdf/2308.03865v1.pdf","comment":"Accepted by MedIA. code is available"},{"id":"http://arxiv.org/abs/2308.03861v1","updated":"2023-08-07T18:15:03Z","published":"2023-08-07T18:15:03Z","title":"High-Throughput and Accurate 3D Scanning of Cattle Using Time-of-Flight\n Sensors and Deep Learning","summary":" We introduce a high throughput 3D scanning solution specifically designed to\nprecisely measure cattle phenotypes. This scanner leverages an array of depth\nsensors, i.e. time-of-flight (Tof) sensors, each governed by dedicated embedded\ndevices. The system excels at generating high-fidelity 3D point clouds, thus\nfacilitating an accurate mesh that faithfully reconstructs the cattle geometry\non the fly. In order to evaluate the performance of our system, we have\nimplemented a two-fold validation process. Initially, we test the scanner's\ncompetency in determining volume and surface area measurements within a\ncontrolled environment featuring known objects. Secondly, we explore the impact\nand necessity of multi-device synchronization when operating a series of\ntime-of-flight sensors. Based on the experimental results, the proposed system\nis capable of producing high-quality meshes of untamed cattle for livestock\nstudies.\n","authors":["Gbenga Omotara","Seyed Mohamad Ali Tousi","Jared Decker","Derek Brake","Guilherme N. 
DeSouza"],"pdf_url":"https://arxiv.org/pdf/2308.03861v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03826v1","updated":"2023-08-07T17:49:04Z","published":"2023-08-07T17:49:04Z","title":"Recurrent Multi-scale Transformer for High-Resolution Salient Object\n Detection","summary":" Salient Object Detection (SOD) aims to identify and segment the most\nconspicuous objects in an image or video. As an important pre-processing step,\nit has many potential applications in multimedia and vision tasks. With the\nadvance of imaging devices, SOD with high-resolution images is of great demand,\nrecently. However, traditional SOD methods are largely limited to\nlow-resolution images, making them difficult to adapt to the development of\nHigh-Resolution SOD (HRSOD). Although some HRSOD methods emerge, there are no\nlarge enough datasets for training and evaluating. Besides, current HRSOD\nmethods generally produce incomplete object regions and irregular object\nboundaries. To address above issues, in this work, we first propose a new\nHRS10K dataset, which contains 10,500 high-quality annotated images at 2K-8K\nresolution. As far as we know, it is the largest dataset for the HRSOD task,\nwhich will significantly help future works in training and evaluating models.\nFurthermore, to improve the HRSOD performance, we propose a novel Recurrent\nMulti-scale Transformer (RMFormer), which recurrently utilizes shared\nTransformers and multi-scale refinement architectures. Thus, high-resolution\nsaliency maps can be generated with the guidance of lower-resolution\npredictions. Extensive experiments on both high-resolution and low-resolution\nbenchmarks show the effectiveness and superiority of the proposed framework.\nThe source code and dataset are released at:\nhttps://github.com/DrowsyMon/RMFormer.\n","authors":["Xinhao Deng","Pingping Zhang","Wei Liu","Huchuan Lu"],"pdf_url":"https://arxiv.org/pdf/2308.03826v1.pdf","comment":"This work is accepted by ACM MM2023. More modifications may be\n performed for further improvements"},{"id":"http://arxiv.org/abs/2308.03821v1","updated":"2023-08-07T15:30:02Z","published":"2023-08-07T15:30:02Z","title":"Distributionally Robust Classification on a Data Budget","summary":" Real world uses of deep learning require predictable model behavior under\ndistribution shifts. Models such as CLIP show emergent natural distributional\nrobustness comparable to humans, but may require hundreds of millions of\ntraining samples. Can we train robust learners in a domain where data is\nlimited? To rigorously address this question, we introduce JANuS (Joint\nAnnotations and Names Set), a collection of four new training datasets with\nimages, labels, and corresponding captions, and perform a series of carefully\ncontrolled investigations of factors contributing to robustness in image\nclassification, then compare those results to findings derived from a\nlarge-scale meta-analysis. Using this approach, we show that standard ResNet-50\ntrained with the cross-entropy loss on 2.4 million image samples can attain\ncomparable robustness to a CLIP ResNet-50 trained on 400 million samples. To\nour knowledge, this is the first result showing (near) state-of-the-art\ndistributional robustness on limited data budgets. 
Our dataset is available at\n\\url{https://huggingface.co/datasets/penfever/JANuS_dataset}, and the code used\nto reproduce our experiments can be found at\n\\url{https://github.com/penfever/vlhub/}.\n","authors":["Benjamin Feuer","Ameya Joshi","Minh Pham","Chinmay Hegde"],"pdf_url":"https://arxiv.org/pdf/2308.03821v1.pdf","comment":"TMLR 2023; openreview link:\n https://openreview.net/forum?id=D5Z2E8CNsD"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2212.09597v6","updated":"2023-08-07T17:50:52Z","published":"2022-12-19T16:32:42Z","title":"Reasoning with Language Model Prompting: A Survey","summary":" Reasoning, as an essential ability for complex problem-solving, can provide\nback-end support for various real-world applications, such as medical\ndiagnosis, negotiation, etc. This paper provides a comprehensive survey of\ncutting-edge research on reasoning with language model prompting. We introduce\nresearch works with comparisons and summaries and provide systematic resources\nto help beginners. We also discuss the potential reasons for emerging such\nreasoning abilities and highlight future research directions. Resources are\navailable at https://github.com/zjunlp/Prompt4ReasoningPapers (updated\nperiodically).\n","authors":["Shuofei Qiao","Yixin Ou","Ningyu Zhang","Xiang Chen","Yunzhi Yao","Shumin Deng","Chuanqi Tan","Fei Huang","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2212.09597v6.pdf","comment":"ACL 2023, fixed Equation 2"},{"id":"http://arxiv.org/abs/2308.03735v1","updated":"2023-08-07T17:34:58Z","published":"2023-08-07T17:34:58Z","title":"Randomized algorithms for precise measurement of differentially-private,\n personalized recommendations","summary":" Personalized recommendations form an important part of today's internet\necosystem, helping artists and creators to reach interested users, and helping\nusers to discover new and engaging content. However, many users today are\nskeptical of platforms that personalize recommendations, in part due to\nhistorically careless treatment of personal data and data privacy. Now,\nbusinesses that rely on personalized recommendations are entering a new\nparadigm, where many of their systems must be overhauled to be privacy-first.\nIn this article, we propose an algorithm for personalized recommendations that\nfacilitates both precise and differentially-private measurement. We consider\nadvertising as an example application, and conduct offline experiments to\nquantify how the proposed privacy-preserving algorithm affects key metrics\nrelated to user experience, advertiser value, and platform revenue compared to\nthe extremes of both (private) non-personalized and non-private, personalized\nimplementations.\n","authors":["Allegra Laro","Yanqing Chen","Hao He","Babak Aghazadeh"],"pdf_url":"https://arxiv.org/pdf/2308.03735v1.pdf","comment":"Submitted to AAAI"},{"id":"http://arxiv.org/abs/2308.03734v1","updated":"2023-08-07T17:32:33Z","published":"2023-08-07T17:32:33Z","title":"Labeling without Seeing? Blind Annotation for Privacy-Preserving Entity\n Resolution","summary":" The entity resolution problem requires finding pairs across datasets that\nbelong to different owners but refer to the same entity in the real world. To\ntrain and evaluate solutions (either rule-based or machine-learning-based) to\nthe entity resolution problem, generating a ground truth dataset with entity\npairs or clusters is needed. 
However, such a data annotation process involves\nhumans as domain oracles to review the plaintext data for all candidate record\npairs from different parties, which inevitably infringes the privacy of data\nowners, especially in privacy-sensitive cases like medical records. To the best\nof our knowledge, there is no prior work on privacy-preserving ground truth\ndataset generation, especially in the domain of entity resolution. We propose a\nnovel blind annotation protocol based on homomorphic encryption that allows\ndomain oracles to collaboratively label ground truths without sharing data in\nplaintext with other parties. In addition, we design a domain-specific\neasy-to-use language that hides the sophisticated underlying homomorphic\nencryption layer. Rigorous proof of the privacy guarantee is provided and our\nempirical experiments via an annotation simulator indicate the feasibility of\nour privacy-preserving protocol (f-measure on average achieves more than 90\\%\ncompared with the real ground truths).\n","authors":["Yixiang Yao","Weizhao Jin","Srivatsan Ravi"],"pdf_url":"https://arxiv.org/pdf/2308.03734v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03588v1","updated":"2023-08-07T13:45:48Z","published":"2023-08-07T13:45:48Z","title":"Multi-View Graph Convolutional Network for Multimedia Recommendation","summary":" Multimedia recommendation has received much attention in recent years. It\nmodels user preferences based on both behavior information and item multimodal\ninformation. Though current GCN-based methods achieve notable success, they\nsuffer from two limitations: (1) Modality noise contamination to the item\nrepresentations. Existing methods often mix modality features and behavior\nfeatures in a single view (e.g., user-item view) for propagation, the noise in\nthe modality features may be amplified and coupled with behavior features. In\nthe end, it leads to poor feature discriminability; (2) Incomplete user\npreference modeling caused by equal treatment of modality features. Users often\nexhibit distinct modality preferences when purchasing different items. Equally\nfusing each modality feature ignores the relative importance among different\nmodalities, leading to the suboptimal user preference modeling. To tackle the\nabove issues, we propose a novel Multi-View Graph Convolutional Network for the\nmultimedia recommendation. Specifically, to avoid modality noise contamination,\nthe modality features are first purified with the aid of item behavior\ninformation. Then, the purified modality features of items and behavior\nfeatures are enriched in separate views, including the user-item view and the\nitem-item view. In this way, the distinguishability of features is enhanced.\nMeanwhile, a behavior-aware fuser is designed to comprehensively model user\npreferences by adaptively learning the relative importance of different\nmodality features. Furthermore, we equip the fuser with a self-supervised\nauxiliary task. 
This task is expected to maximize the mutual information\nbetween the fused multimodal features and behavior features, so as to capture\ncomplementary and supplementary preference information simultaneously.\nExtensive experiments on three public datasets demonstrate the effectiveness of\nour methods.\n","authors":["Penghang Yu","Zhiyi Tan","Guanming Lu","Bing-Kun Bao"],"pdf_url":"https://arxiv.org/pdf/2308.03588v1.pdf","comment":"MM'23"},{"id":"http://arxiv.org/abs/2308.03578v1","updated":"2023-08-07T13:35:02Z","published":"2023-08-07T13:35:02Z","title":"TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs","summary":" We introduce TeraHAC, a $(1+\\epsilon)$-approximate hierarchical agglomerative\nclustering (HAC) algorithm which scales to trillion-edge graphs. Our algorithm\nis based on a new approach to computing $(1+\\epsilon)$-approximate HAC, which\nis a novel combination of the nearest-neighbor chain algorithm and the notion\nof $(1+\\epsilon)$-approximate HAC. Our approach allows us to partition the\ngraph among multiple machines and make significant progress in computing the\nclustering within each partition before any communication with other partitions\nis needed.\n We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8\ntrillion edges. We show that TeraHAC requires over 100x fewer rounds compared\nto previously known approaches for computing HAC. It is up to 8.3x faster than\nSCC, the state-of-the-art distributed algorithm for hierarchical clustering,\nwhile achieving 1.16x higher quality. In fact, TeraHAC essentially retains the\nquality of the celebrated HAC algorithm while significantly improving the\nrunning time.\n","authors":["Laxman Dhulipala","Jason Lee","Jakub Łącki","Vahab Mirrokni"],"pdf_url":"https://arxiv.org/pdf/2308.03578v1.pdf","comment":"To appear at SIGMOD 2024"},{"id":"http://arxiv.org/abs/2308.03563v1","updated":"2023-08-07T13:15:33Z","published":"2023-08-07T13:15:33Z","title":"Global cognitive graph properties dynamics of hippocampal formation","summary":" In the present study we have used a set of methods and metrics to build a\ngraph of relative neural connections in a hippocampus of a rodent. A set of\ngraphs was built on top of time-sequenced data and analyzed in terms of\ndynamics of a connection genesis. The analysis has shown that during the\nprocess of a rodent exploring a novel environment, the relations between\nneurons constantly change which indicates that globally memory is constantly\nupdated even for known areas of space. Even if some neurons gain cognitive\nspecialization, the global network though remains relatively stable.\nAdditionally we suggest a set of methods for building a graph of cognitive\nneural network.\n","authors":["Konstantin Sorokin","Andrey Zaitsew","Aleksandr Levin","German Magai","Maxim Beketov","Vladimir Sotskov"],"pdf_url":"https://arxiv.org/pdf/2308.03563v1.pdf","comment":"12 pages, 6 figures, paper for DAMDID 2023 Conference"},{"id":"http://arxiv.org/abs/2308.03470v1","updated":"2023-08-07T10:56:57Z","published":"2023-08-07T10:56:57Z","title":"Uncertainty-aware Consistency Learning for Cold-Start Item\n Recommendation","summary":" Graph Neural Network (GNN)-based models have become the mainstream approach\nfor recommender systems. 
Despite the effectiveness, they are still suffering\nfrom the cold-start problem, i.e., recommend for few-interaction items.\nExisting GNN-based recommendation models to address the cold-start problem\nmainly focus on utilizing auxiliary features of users and items, leaving the\nuser-item interactions under-utilized. However, embeddings distributions of\ncold and warm items are still largely different, since cold items' embeddings\nare learned from lower-popularity interactions, while warm items' embeddings\nare from higher-popularity interactions. Thus, there is a seesaw phenomenon,\nwhere the recommendation performance for the cold and warm items cannot be\nimproved simultaneously. To this end, we proposed a Uncertainty-aware\nConsistency learning framework for Cold-start item recommendation (shorten as\nUCC) solely based on user-item interactions. Under this framework, we train the\nteacher model (generator) and student model (recommender) with consistency\nlearning, to ensure the cold items with additionally generated low-uncertainty\ninteractions can have similar distribution with the warm items. Therefore, the\nproposed framework improves the recommendation of cold and warm items at the\nsame time, without hurting any one of them. Extensive experiments on benchmark\ndatasets demonstrate that our proposed method significantly outperforms\nstate-of-the-art methods on both warm and cold items, with an average\nperformance improvement of 27.6%.\n","authors":["Taichi Liu","Chen Gao","Zhenyu Wang","Dong Li","Jianye Hao","Depeng Jin","Yong Li"],"pdf_url":"https://arxiv.org/pdf/2308.03470v1.pdf","comment":"Accepted by SIGIR 2023"},{"id":"http://arxiv.org/abs/2308.03443v1","updated":"2023-08-07T10:00:07Z","published":"2023-08-07T10:00:07Z","title":"Doubly Robust Estimator for Off-Policy Evaluation with Large Action\n Spaces","summary":" We study Off-Policy Evaluation (OPE) in contextual bandit settings with large\naction spaces. The benchmark estimators suffer from severe bias and variance\ntradeoffs. Parametric approaches suffer from bias due to difficulty specifying\nthe correct model, whereas ones with importance weight suffer from variance. To\novercome these limitations, Marginalized Inverse Propensity Scoring (MIPS) was\nproposed to mitigate the estimator's variance via embeddings of an action. To\nmake the estimator more accurate, we propose the doubly robust estimator of\nMIPS called the Marginalized Doubly Robust (MDR) estimator. Theoretical\nanalysis shows that the proposed estimator is unbiased under weaker assumptions\nthan MIPS while maintaining variance reduction against IPS, which was the main\nadvantage of MIPS. The empirical experiment verifies the supremacy of MDR\nagainst existing estimators.\n","authors":["Tatsuhiro Shimizu"],"pdf_url":"https://arxiv.org/pdf/2308.03443v1.pdf","comment":"6 pages, 1 figure"},{"id":"http://arxiv.org/abs/2308.03400v1","updated":"2023-08-07T08:38:15Z","published":"2023-08-07T08:38:15Z","title":"Hierarchical Contrastive Learning with Multiple Augmentation for\n Sequential Recommendation","summary":" Sequential recommendation addresses the issue of preference drift by\npredicting the next item based on the user's previous behaviors. 
Recently, a\npromising approach using contrastive learning has emerged, demonstrating its\neffectiveness in recommending items under sparse user-item interactions.\nSignificantly, the effectiveness of combinations of various augmentation\nmethods has been demonstrated in different domains, particularly in computer\nvision. However, when it comes to augmentation within a contrastive learning\nframework in sequential recommendation, previous research has only focused on\nlimited conditions and simple structures. Thus, it is still possible to extend\nexisting approaches to boost the effects of augmentation methods by using\nprogressed structures with the combinations of multiple augmentation methods.\nIn this work, we propose a novel framework called Hierarchical Contrastive\nLearning with Multiple Augmentation for Sequential Recommendation(HCLRec) to\novercome the aforementioned limitation. Our framework leverages existing\naugmentation methods hierarchically to improve performance. By combining\naugmentation methods continuously, we generate low-level and high-level view\npairs. We employ a Transformers-based model to encode the input sequence\neffectively. Furthermore, we introduce additional blocks consisting of\nTransformers and position-wise feed-forward network(PFFN) layers to learn the\ninvariance of the original sequences from hierarchically augmented views. We\npass the input sequence to subsequent layers based on the number of increment\nlevels applied to the views to handle various augmentation levels. Within each\nlayer, we compute contrastive loss between pairs of views at the same level.\nExtensive experiments demonstrate that our proposed method outperforms\nstate-of-the-art approaches and that HCLRec is robust even when faced with the\nproblem of sparse interaction.\n","authors":["Dongjun Lee","Donggeun Ko","Jaekwang Kim"],"pdf_url":"https://arxiv.org/pdf/2308.03400v1.pdf","comment":"10 pages, 4 figures"},{"id":"http://arxiv.org/abs/2308.03366v1","updated":"2023-08-07T07:41:01Z","published":"2023-08-07T07:41:01Z","title":"POSIT: Promotion of Semantic Item Tail via Adversarial Learning","summary":" In many recommender problems, a handful of popular items (e.g. movies/TV\nshows, news etc.) can be dominant in recommendations for many users. However,\nwe know that in a large catalog of items, users are likely interested in more\nthan what is popular. The dominance of popular items may mean that users will\nnot see items they would likely enjoy. In this paper, we propose a technique to\novercome this problem using adversarial machine learning. We define a metric to\ntranslate user-level utility metric in terms of an advantage/disadvantage over\nitems. We subsequently use that metric in an adversarial learning framework to\nsystematically promote disadvantaged items. The resulting algorithm identifies\nsemantically meaningful items that get promoted in the learning algorithm. In\nthe empirical study, we evaluate the proposed technique on three publicly\navailable datasets and four competitive baselines. 
The result shows that our\nproposed method not only improves the coverage, but also, surprisingly,\nimproves the overall performance.\n","authors":["Qiuling Xu","Pannaga Shivaswamy","Xiangyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03366v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2308.03333v1","updated":"2023-08-07T06:29:20Z","published":"2023-08-07T06:29:20Z","title":"Heterogeneous Knowledge Fusion: A Novel Approach for Personalized\n Recommendation via LLM","summary":" The analysis and mining of user heterogeneous behavior are of paramount\nimportance in recommendation systems. However, the conventional approach of\nincorporating various types of heterogeneous behavior into recommendation\nmodels leads to feature sparsity and knowledge fragmentation issues. To address\nthis challenge, we propose a novel approach for personalized recommendation via\nLarge Language Model (LLM), by extracting and fusing heterogeneous knowledge\nfrom user heterogeneous behavior information. In addition, by combining\nheterogeneous knowledge and recommendation tasks, instruction tuning is\nperformed on LLM for personalized recommendations. The experimental results\ndemonstrate that our method can effectively integrate user heterogeneous\nbehavior and significantly improve recommendation performance.\n","authors":["Bin Yin","Junjie Xie","Yu Qin","Zixiang Ding","Zhichao Feng","Xiang Li","Wei Lin"],"pdf_url":"https://arxiv.org/pdf/2308.03333v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03869v1","updated":"2023-08-07T18:40:13Z","published":"2023-08-07T18:40:13Z","title":"Semantic Equivalence of e-Commerce Queries","summary":" Search query variation poses a challenge in e-commerce search, as equivalent\nsearch intents can be expressed through different queries with surface-level\ndifferences. This paper introduces a framework to recognize and leverage query\nequivalence to enhance searcher and business outcomes. The proposed approach\naddresses three key problems: mapping queries to vector representations of\nsearch intent, identifying nearest neighbor queries expressing equivalent or\nsimilar intent, and optimizing for user or business objectives. The framework\nutilizes both surface similarity and behavioral similarity to determine query\nequivalence. Surface similarity involves canonicalizing queries based on word\ninflection, word order, compounding, and noise words. Behavioral similarity\nleverages historical search behavior to generate vector representations of\nquery intent. An offline process is used to train a sentence similarity model,\nwhile an online nearest neighbor approach supports processing of unseen\nqueries. Experimental evaluations demonstrate the effectiveness of the proposed\napproach, outperforming popular sentence transformer models and achieving a\nPearson correlation of 0.85 for query similarity. The results highlight the\npotential of leveraging historical behavior data and training models to\nrecognize and utilize query equivalence in e-commerce search, leading to\nimproved user experiences and business outcomes. 
Further advancements and\nbenchmark datasets are encouraged to facilitate the development of solutions\nfor this critical problem in the e-commerce domain.\n","authors":["Aritra Mandal","Daniel Tunkelang","Zhe Wu"],"pdf_url":"https://arxiv.org/pdf/2308.03869v1.pdf","comment":"The 6th Workshop on e-Commerce and NLP"},{"id":"http://arxiv.org/abs/2308.03855v1","updated":"2023-08-07T18:06:46Z","published":"2023-08-07T18:06:46Z","title":"Mobile Supply: The Last Piece of Jigsaw of Recommender System","summary":" Recommendation system is a fundamental functionality of online platforms.\nWith the development of computing power of mobile phones, some researchers have\ndeployed recommendation algorithms on users' devices to solve the problems of\ndata transmission delay and pagination mechanism. However, the existing\nedge-side mobile rankings cannot completely solve the problem of pagination\nmechanism. The mobile rankings can only sort the items on the current page, so\nit will not work if it is called once or twice. Besides, after the user has\nviewed the items of interest to the user on the current page, the user refresh\nto get a new page of items. This will make the mobile ranking model do a lot of\nuseless work and affect the user's immersive experience. In order to solve the\npagination mechanism problem, we propose a completely new module in the\npipeline of recommender named Mobile Supply. The pipeline of recommender system\nis extended to \"retrival->pre-ranking->ranking->re-ranking->Mobile\nSupply->mobile ranking\". Specifically, we introduce the concept of list value\nand use point-wise method to approximate list-wise estimation. We also design a\nnew mobile ranking named device-aware mobile ranking considering the difference\nof mobile devices tailored to the new pipeline. Extensive offline and online\nexperiments show the superiority of our proposed method and prove that Mobile\nSupply can further improve the performance of edge-side recommender system and\nuser experience. Mobile Supply has been deployed on the homepage page of a\nlarge-scale online food platform and has yielded considerable profits in our\nbusiness.\n","authors":["Zhenhao Jiang","Biao Zeng","Hao Feng","Jin Liu","Jie Zhang","Jia Jia","Ning Hu"],"pdf_url":"https://arxiv.org/pdf/2308.03855v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03842v1","updated":"2023-08-07T18:00:04Z","published":"2023-08-07T18:00:04Z","title":"Search Engine and Recommendation System for the Music Industry built\n with JinaAI","summary":" One of the most intriguing debates regarding a novel task is the development\nof search engines and recommendation-based systems in the music industry.\nStudies have shown a drastic depression in the search engine fields, due to\nconcerning factors such as speed, accuracy and the format of data given for\nquerying. Often people face difficulty in searching for a song solely based on\nthe title, hence a solution is proposed to complete a search analysis through a\nsingle query input and is matched with the lyrics of the songs present in the\ndatabase. Hence it is essential to incorporate cutting-edge technology tools\nfor developing a user-friendly search engine. Jina AI is an MLOps framework for\nbuilding neural search engines that are utilized, in order for the user to\nobtain accurate results. Jina AI effectively helps to maintain and enhance the\nquality of performance for the search engine for the query given. 
An effective\nsearch engine and a recommendation system for the music industry, built with\nJinaAI.\n","authors":["Ishita Gopalakrishnan","Sanjjushri Varshini R","Ponshriharini V"],"pdf_url":"https://arxiv.org/pdf/2308.03842v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2302.07181v2","updated":"2023-08-07T17:59:16Z","published":"2023-02-14T16:49:25Z","title":"Quantum algorithms applied to satellite mission planning for Earth\n observation","summary":" Earth imaging satellites are a crucial part of our everyday lives that enable\nglobal tracking of industrial activities. Use cases span many applications,\nfrom weather forecasting to digital maps, carbon footprint tracking, and\nvegetation monitoring. However, there are limitations; satellites are difficult\nto manufacture, expensive to maintain, and tricky to launch into orbit.\nTherefore, satellites must be employed efficiently. This poses a challenge\nknown as the satellite mission planning problem, which could be computationally\nprohibitive to solve on large scales. However, close-to-optimal algorithms,\nsuch as greedy reinforcement learning and optimization algorithms, can often\nprovide satisfactory resolutions. This paper introduces a set of quantum\nalgorithms to solve the mission planning problem and demonstrate an advantage\nover the classical algorithms implemented thus far. The problem is formulated\nas maximizing the number of high-priority tasks completed on real datasets\ncontaining thousands of tasks and multiple satellites. This work demonstrates\nthat through solution-chaining and clustering, optimization and machine\nlearning algorithms offer the greatest potential for optimal solutions. This\npaper notably illustrates that a hybridized quantum-enhanced reinforcement\nlearning agent can achieve a completion percentage of 98.5% over high-priority\ntasks, significantly improving over the baseline greedy methods with a\ncompletion rate of 75.8%. The results presented in this work pave the way to\nquantum-enabled solutions in the space industry and, more generally, future\nmission planning problems across industries.\n","authors":["Serge Rainjonneau","Igor Tokarev","Sergei Iudin","Saaketh Rayaprolu","Karan Pinto","Daria Lemtiuzhnikova","Miras Koblan","Egor Barashov","Mo Kordzanganeh","Markus Pflitsch","Alexey Melnikov"],"pdf_url":"https://arxiv.org/pdf/2302.07181v2.pdf","comment":"13 pages, 9 figures, 3 tables"},{"id":"http://arxiv.org/abs/2211.09027v3","updated":"2023-08-07T17:56:54Z","published":"2022-11-12T10:12:17Z","title":"LLEDA -- Lifelong Self-Supervised Domain Adaptation","summary":" Humans and animals have the ability to continuously learn new information\nover their lifetime without losing previously acquired knowledge. However,\nartificial neural networks struggle with this due to new information\nconflicting with old knowledge, resulting in catastrophic forgetting. The\ncomplementary learning systems (CLS) theory suggests that the interplay between\nhippocampus and neocortex systems enables long-term and efficient learning in\nthe mammalian brain, with memory replay facilitating the interaction between\nthese two systems to reduce forgetting. 
The proposed Lifelong Self-Supervised\nDomain Adaptation (LLEDA) framework draws inspiration from the CLS theory and\nmimics the interaction between two networks: a DA network inspired by the\nhippocampus that quickly adjusts to changes in data distribution and an SSL\nnetwork inspired by the neocortex that gradually learns domain-agnostic general\nrepresentations. LLEDA's latent replay technique facilitates communication\nbetween these two networks by reactivating and replaying the past memory latent\nrepresentations to stabilise long-term generalisation and retention without\ninterfering with the previously learned information. Extensive experiments\ndemonstrate that the proposed method outperforms several other methods\nresulting in a long-term adaptation while being less prone to catastrophic\nforgetting when transferred to new domains.\n","authors":["Mamatha Thota","Dewei Yi","Georgios Leontidis"],"pdf_url":"https://arxiv.org/pdf/2211.09027v3.pdf","comment":"19 pages, 6 figures, 6 tables; V2 added more experiments on more\n domains and fixed typos"},{"id":"http://arxiv.org/abs/2308.01390v2","updated":"2023-08-07T17:53:09Z","published":"2023-08-02T19:10:23Z","title":"OpenFlamingo: An Open-Source Framework for Training Large Autoregressive\n Vision-Language Models","summary":" We introduce OpenFlamingo, a family of autoregressive vision-language models\nranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce\nan open-source replication of DeepMind's Flamingo models. On seven\nvision-language datasets, OpenFlamingo models average between 80 - 89% of\ncorresponding Flamingo performance. This technical report describes our models,\ntraining data, hyperparameters, and evaluation suite. We share our models and\ncode at https://github.com/mlfoundations/open_flamingo.\n","authors":["Anas Awadalla","Irena Gao","Josh Gardner","Jack Hessel","Yusuf Hanafy","Wanrong Zhu","Kalyani Marathe","Yonatan Bitton","Samir Gadre","Shiori Sagawa","Jenia Jitsev","Simon Kornblith","Pang Wei Koh","Gabriel Ilharco","Mitchell Wortsman","Ludwig Schmidt"],"pdf_url":"https://arxiv.org/pdf/2308.01390v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03743v1","updated":"2023-08-07T17:51:09Z","published":"2023-08-07T17:51:09Z","title":"The Copycat Perceptron: Smashing Barriers Through Collective Learning","summary":" We characterize the equilibrium properties of a model of $y$ coupled binary\nperceptrons in the teacher-student scenario, subject to a suitable learning\nrule, with an explicit ferromagnetic coupling proportional to the Hamming\ndistance between the students' weights. In contrast to recent works, we analyze\na more general setting in which a thermal noise is present that affects the\ngeneralization performance of each student. 
Specifically, in the presence of a\nnonzero temperature, which assigns nonzero probability to configurations that\nmisclassify samples with respect to the teacher's prescription, we find that\nthe coupling of replicas leads to a shift of the phase diagram to smaller\nvalues of $\\alpha$: This suggests that the free energy landscape gets smoother\naround the solution with good generalization (i.e., the teacher) at a fixed\nfraction of reviewed examples, which allows local update algorithms such as\nSimulated Annealing to reach the solution before the dynamics gets frozen.\nFinally, from a learning perspective, these results suggest that more students\n(in this case, with the same amount of data) are able to learn the same rule\nwhen coupled together with a smaller amount of data.\n","authors":["Giovanni Catania","Aurélien Decelle","Beatriz Seoane"],"pdf_url":"https://arxiv.org/pdf/2308.03743v1.pdf","comment":"4 figures"},{"id":"http://arxiv.org/abs/2212.09597v6","updated":"2023-08-07T17:50:52Z","published":"2022-12-19T16:32:42Z","title":"Reasoning with Language Model Prompting: A Survey","summary":" Reasoning, as an essential ability for complex problem-solving, can provide\nback-end support for various real-world applications, such as medical\ndiagnosis, negotiation, etc. This paper provides a comprehensive survey of\ncutting-edge research on reasoning with language model prompting. We introduce\nresearch works with comparisons and summaries and provide systematic resources\nto help beginners. We also discuss the potential reasons for emerging such\nreasoning abilities and highlight future research directions. Resources are\navailable at https://github.com/zjunlp/Prompt4ReasoningPapers (updated\nperiodically).\n","authors":["Shuofei Qiao","Yixin Ou","Ningyu Zhang","Xiang Chen","Yunzhi Yao","Shumin Deng","Chuanqi Tan","Fei Huang","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2212.09597v6.pdf","comment":"ACL 2023, fixed Equation 2"},{"id":"http://arxiv.org/abs/2301.09656v3","updated":"2023-08-07T17:40:40Z","published":"2023-01-23T19:00:02Z","title":"Selective Explanations: Leveraging Human Input to Align Explainable AI","summary":" While a vast collection of explainable AI (XAI) algorithms have been\ndeveloped in recent years, they are often criticized for significant gaps with\nhow humans produce and consume explanations. As a result, current XAI\ntechniques are often found to be hard to use and lack effectiveness. In this\nwork, we attempt to close these gaps by making AI explanations selective -- a\nfundamental property of human explanations -- by selectively presenting a\nsubset from a large set of model reasons based on what aligns with the\nrecipient's preferences. We propose a general framework for generating\nselective explanations by leveraging human input on a small sample. This\nframework opens up a rich design space that accounts for different selectivity\ngoals, types of input, and more. As a showcase, we use a decision-support task\nto explore selective explanations based on what the decision-maker would\nconsider relevant to the decision task. We conducted two experimental studies\nto examine three out of a broader possible set of paradigms based on our\nproposed framework: in Study 1, we ask the participants to provide their own\ninput to generate selective explanations, with either open-ended or\ncritique-based input. In Study 2, we show participants selective explanations\nbased on input from a panel of similar users (annotators). 
Our experiments\ndemonstrate the promise of selective explanations in reducing over-reliance on\nAI and improving decision outcomes and subjective perceptions of the AI, but\nalso paint a nuanced picture that attributes some of these positive effects to\nthe opportunity to provide one's own input to augment AI explanations. Overall,\nour work proposes a novel XAI framework inspired by human communication\nbehaviors and demonstrates its potentials to encourage future work to better\nalign AI explanations with human production and consumption of explanations.\n","authors":["Vivian Lai","Yiming Zhang","Chacha Chen","Q. Vera Liao","Chenhao Tan"],"pdf_url":"https://arxiv.org/pdf/2301.09656v3.pdf","comment":"21 pages, 25 figures"},{"id":"http://arxiv.org/abs/2308.03735v1","updated":"2023-08-07T17:34:58Z","published":"2023-08-07T17:34:58Z","title":"Randomized algorithms for precise measurement of differentially-private,\n personalized recommendations","summary":" Personalized recommendations form an important part of today's internet\necosystem, helping artists and creators to reach interested users, and helping\nusers to discover new and engaging content. However, many users today are\nskeptical of platforms that personalize recommendations, in part due to\nhistorically careless treatment of personal data and data privacy. Now,\nbusinesses that rely on personalized recommendations are entering a new\nparadigm, where many of their systems must be overhauled to be privacy-first.\nIn this article, we propose an algorithm for personalized recommendations that\nfacilitates both precise and differentially-private measurement. We consider\nadvertising as an example application, and conduct offline experiments to\nquantify how the proposed privacy-preserving algorithm affects key metrics\nrelated to user experience, advertiser value, and platform revenue compared to\nthe extremes of both (private) non-personalized and non-private, personalized\nimplementations.\n","authors":["Allegra Laro","Yanqing Chen","Hao He","Babak Aghazadeh"],"pdf_url":"https://arxiv.org/pdf/2308.03735v1.pdf","comment":"Submitted to AAAI"},{"id":"http://arxiv.org/abs/2308.03730v1","updated":"2023-08-07T17:18:37Z","published":"2023-08-07T17:18:37Z","title":"SurvBeX: An explanation method of the machine learning survival models\n based on the Beran estimator","summary":" An explanation method called SurvBeX is proposed to interpret predictions of\nthe machine learning survival black-box models. The main idea behind the method\nis to use the modified Beran estimator as the surrogate explanation model.\nCoefficients, incorporated into Beran estimator, can be regarded as values of\nthe feature impacts on the black-box model prediction. Following the well-known\nLIME method, many points are generated in a local area around an example of\ninterest. For every generated example, the survival function of the black-box\nmodel is computed, and the survival function of the surrogate model (the Beran\nestimator) is constructed as a function of the explanation coefficients. In\norder to find the explanation coefficients, it is proposed to minimize the mean\ndistance between the survival functions of the black-box model and the Beran\nestimator produced by the generated examples. Many numerical experiments with\nsynthetic and real survival data demonstrate the SurvBeX efficiency and compare\nthe method with the well-known method SurvLIME. The method is also compared\nwith the method SurvSHAP. 
The code implementing SurvBeX is available at:\nhttps://github.com/DanilaEremenko/SurvBeX\n","authors":["Lev V. Utkin","Danila Y. Eremenko","Andrei V. Konstantinov"],"pdf_url":"https://arxiv.org/pdf/2308.03730v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.08049v9","updated":"2023-08-07T17:09:10Z","published":"2022-12-15T18:55:23Z","title":"Sliced Optimal Partial Transport","summary":" Optimal transport (OT) has become exceedingly popular in machine learning,\ndata science, and computer vision. The core assumption in the OT problem is the\nequal total amount of mass in source and target measures, which limits its\napplication. Optimal Partial Transport (OPT) is a recently proposed solution to\nthis limitation. Similar to the OT problem, the computation of OPT relies on\nsolving a linear programming problem (often in high dimensions), which can\nbecome computationally prohibitive. In this paper, we propose an efficient\nalgorithm for calculating the OPT problem between two non-negative measures in\none dimension. Next, following the idea of sliced OT distances, we utilize\nslicing to define the sliced OPT distance. Finally, we demonstrate the\ncomputational and accuracy benefits of the sliced OPT-based method in various\nnumerical experiments. In particular, we show an application of our proposed\nSliced-OPT in noisy point cloud registration.\n","authors":["Yikun Bai","Berhnard Schmitzer","Mathew Thorpe","Soheil Kolouri"],"pdf_url":"https://arxiv.org/pdf/2212.08049v9.pdf","comment":"modify the link of Github page"},{"id":"http://arxiv.org/abs/2307.14361v2","updated":"2023-08-07T17:09:07Z","published":"2023-07-24T21:01:46Z","title":"A Hybrid Machine Learning Model for Classifying Gene Mutations in Cancer\n using LSTM, BiLSTM, CNN, GRU, and GloVe","summary":" This study presents an ensemble model combining LSTM, BiLSTM, CNN, GRU, and\nGloVe to classify gene mutations using Kaggle's Personalized Medicine:\nRedefining Cancer Treatment dataset. The results were compared against\nwell-known transformers like as BERT, Electra, Roberta, XLNet, Distilbert, and\ntheir LSTM ensembles. Our model outperformed all other models in terms of\naccuracy, precision, recall, F1 score, and Mean Squared Error. Surprisingly, it\nalso needed less training time, resulting in a perfect combination of\nperformance and efficiency. This study demonstrates the utility of ensemble\nmodels for difficult tasks such as gene mutation classification.\n","authors":["Sanad Aburass","Osama Dorgham","Jamil Al Shaqsi"],"pdf_url":"https://arxiv.org/pdf/2307.14361v2.pdf","comment":"6 pages, 7 figures and 2 tables"},{"id":"http://arxiv.org/abs/2308.01157v2","updated":"2023-08-07T17:06:56Z","published":"2023-08-02T13:59:35Z","title":"LLMs Understand Glass-Box Models, Discover Surprises, and Suggest\n Repairs","summary":" We show that large language models (LLMs) are remarkably good at working with\ninterpretable models that decompose complex outcomes into univariate\ngraph-represented components. By adopting a hierarchical approach to reasoning,\nLLMs can provide comprehensive model-level summaries without ever requiring the\nentire model to fit in context. This approach enables LLMs to apply their\nextensive background knowledge to automate common tasks in data science such as\ndetecting anomalies that contradict prior knowledge, describing potential\nreasons for the anomalies, and suggesting repairs that would remove the\nanomalies. 
We use multiple examples in healthcare to demonstrate the utility of\nthese new capabilities of LLMs, with particular emphasis on Generalized\nAdditive Models (GAMs). Finally, we present the package $\\texttt{TalkToEBM}$ as\nan open-source LLM-GAM interface.\n","authors":["Benjamin J. Lengerich","Sebastian Bordt","Harsha Nori","Mark E. Nunnally","Yin Aphinyanaphongs","Manolis Kellis","Rich Caruana"],"pdf_url":"https://arxiv.org/pdf/2308.01157v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03723v1","updated":"2023-08-07T16:58:48Z","published":"2023-08-07T16:58:48Z","title":"Dimensionality Reduction for Improving Out-of-Distribution Detection in\n Medical Image Segmentation","summary":" Clinically deployed segmentation models are known to fail on data outside of\ntheir training distribution. As these models perform well on most cases, it is\nimperative to detect out-of-distribution (OOD) images at inference to protect\nagainst automation bias. This work applies the Mahalanobis distance post hoc to\nthe bottleneck features of a Swin UNETR model that segments the liver on\nT1-weighted magnetic resonance imaging. By reducing the dimensions of the\nbottleneck features with principal component analysis, OOD images were detected\nwith high performance and minimal computational load.\n","authors":["McKell Woodland","Nihil Patel","Mais Al Taie","Joshua P. Yung","Tucker J. Netherton","Ankit B. Patel","Kristy K. Brock"],"pdf_url":"https://arxiv.org/pdf/2308.03723v1.pdf","comment":"This preprint has not undergone peer review or any post-submission\n improvements or corrections. The Version of Record of this contribution will\n be published in the Proceedings of Uncertainty for Safe Utilization of\n Machine Learning in Medical Imaging (5th International Workshop) - Held in\n conjunction with MICCAI 2023"},{"id":"http://arxiv.org/abs/2308.02029v2","updated":"2023-08-07T16:36:59Z","published":"2023-08-03T20:45:11Z","title":"Deep Maxout Network-based Feature Fusion and Political Tangent Search\n Optimizer enabled Transfer Learning for Thalassemia Detection","summary":" Thalassemia is a heritable blood disorder which is the outcome of a genetic\ndefect causing lack of production of hemoglobin polypeptide chains. However,\nthere is less understanding of the precise frequency as well as sharing in\nthese areas. Knowing about the frequency of thalassemia occurrence and\ndependable mutations is thus a significant step in preventing, controlling, and\ntreatment planning. Here, Political Tangent Search Optimizer based Transfer\nLearning (PTSO_TL) is introduced for thalassemia detection. Initially, input\ndata obtained from a particular dataset is normalized in the data normalization\nstage. Quantile normalization is utilized in the data normalization stage, and\nthe data are then passed to the feature fusion phase, in which Weighted\nEuclidean Distance with Deep Maxout Network (DMN) is utilized. Thereafter, data\naugmentation is performed using the oversampling method to increase data\ndimensionality. Lastly, thalassemia detection is carried out by TL, wherein a\nconvolutional neural network (CNN) is utilized with hyperparameters from a\ntrained model such as Xception. TL is tuned by PTSO, and the training algorithm\nPTSO is presented by merging of Political Optimizer (PO) and Tangent Search\nAlgorithm (TSA). 
Furthermore, PTSO_TL obtained maximal precision, recall, and\nf-measure values of about 94.3%, 96.1%, and 95.2%, respectively.\n","authors":["Hemn Barzan Abdalla","Awder Ahmed","Guoquan Li","Nasser Mustafa","Abdur Rashid Sangi"],"pdf_url":"https://arxiv.org/pdf/2308.02029v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03713v1","updated":"2023-08-07T16:32:14Z","published":"2023-08-07T16:32:14Z","title":"Communication-Efficient Framework for Distributed Image Semantic\n Wireless Transmission","summary":" Multi-node communication, which refers to the interaction among multiple\ndevices, has attracted lots of attention in many Internet-of-Things (IoT)\nscenarios. However, its huge amounts of data flows and inflexibility for task\nextension have triggered the urgent requirement of communication-efficient\ndistributed data transmission frameworks. In this paper, inspired by the great\nsuperiorities on bandwidth reduction and task adaptation of semantic\ncommunications, we propose a federated learning-based semantic communication\n(FLSC) framework for multi-task distributed image transmission with IoT\ndevices. Federated learning enables the design of independent semantic\ncommunication link of each user while further improves the semantic extraction\nand task performance through global aggregation. Each link in FLSC is composed\nof a hierarchical vision transformer (HVT)-based extractor and a task-adaptive\ntranslator for coarse-to-fine semantic extraction and meaning translation\naccording to specific tasks. In order to extend the FLSC into more realistic\nconditions, we design a channel state information-based multiple-input\nmultiple-output transmission module to combat channel fading and noise.\nSimulation results show that the coarse semantic information can deal with a\nrange of image-level tasks. Moreover, especially in low signal-to-noise ratio\nand channel bandwidth ratio regimes, FLSC evidently outperforms the traditional\nscheme, e.g. about 10 peak signal-to-noise ratio gain in the 3 dB channel\ncondition.\n","authors":["Bingyan Xie","Yongpeng Wu","Yuxuan Shi","Derrick Wing Kwan Ng","Wenjun Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03713v1.pdf","comment":"This paper has been accepted by IEEE Internet of Things Journal"},{"id":"http://arxiv.org/abs/2308.03712v1","updated":"2023-08-07T16:31:38Z","published":"2023-08-07T16:31:38Z","title":"Scaling may be all you need for achieving human-level object recognition\n capacity with human-like visual experience","summary":" This paper asks whether current self-supervised learning methods, if\nsufficiently scaled up, would be able to reach human-level visual object\nrecognition capabilities with the same type and amount of visual experience\nhumans learn from. Previous work on this question only considered the scaling\nof data size. Here, we consider the simultaneous scaling of data size, model\nsize, and image resolution. We perform a scaling experiment with vision\ntransformers up to 633M parameters in size (ViT-H/14) trained with up to 5K\nhours of human-like video data (long, continuous, mostly egocentric videos)\nwith image resolutions of up to 476x476 pixels. The efficiency of masked\nautoencoders (MAEs) as a self-supervised learning algorithm makes it possible\nto run this scaling experiment on an unassuming academic budget. We find that\nit is feasible to reach human-level object recognition capacity at sub-human\nscales of model size, data size, and image size, if these factors are scaled up\nsimultaneously. 
To give a concrete example, we estimate that a 2.5B parameter\nViT model trained with 20K hours (2.3 years) of human-like video data with a\nspatial resolution of 952x952 pixels should be able to reach human-level\naccuracy on ImageNet. Human-level competence is thus achievable for a\nfundamental perceptual capability from human-like perceptual experience\n(human-like in both amount and type) with extremely generic learning algorithms\nand architectures and without any substantive inductive biases.\n","authors":["A. Emin Orhan"],"pdf_url":"https://arxiv.org/pdf/2308.03712v1.pdf","comment":"7 pages, 3 figures, 2 tables; code & models available from\n https://github.com/eminorhan/humanlike-vits"},{"id":"http://arxiv.org/abs/2308.03704v1","updated":"2023-08-07T16:22:59Z","published":"2023-08-07T16:22:59Z","title":"DeRisk: An Effective Deep Learning Framework for Credit Risk Prediction\n over Real-World Financial Data","summary":" Despite the tremendous advances achieved over the past years by deep learning\ntechniques, the latest risk prediction models for industrial applications still\nrely on highly hand-tuned, stage-wise statistical learning tools, such as\ngradient boosting and random forest methods. Different from images or\nlanguages, real-world financial data are high-dimensional, sparse, noisy and\nextremely imbalanced, which makes deep neural network models particularly\nchallenging to train and fragile in practice. In this work, we propose DeRisk,\nan effective deep learning risk prediction framework for credit risk prediction\non real-world financial data. DeRisk is the first deep risk prediction model\nthat outperforms statistical learning approaches deployed in our company's\nproduction system. We also perform extensive ablation studies on our method to\npresent the most critical factors for the empirical success of DeRisk.\n","authors":["Yancheng Liang","Jiajie Zhang","Hui Li","Xiaochen Liu","Yi Hu","Yong Wu","Jinyao Zhang","Yongyan Liu","Yi Wu"],"pdf_url":"https://arxiv.org/pdf/2308.03704v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.07448v2","updated":"2023-08-07T16:10:43Z","published":"2023-05-12T13:05:32Z","title":"Deep Deterministic Policy Gradient for End-to-End Communication Systems\n without Prior Channel Knowledge","summary":" The End-to-End (E2E) learning-based concept has recently been introduced to\njointly optimize both the transmitter and the receiver in wireless\ncommunication systems. Unfortunately, this E2E learning architecture requires a\nprior differentiable channel model to jointly train the deep neural networks\n(DNNs) at the transceivers, which is hard to obtain in practice. This paper\naims to solve this issue by developing a deep deterministic policy gradient\n(DDPG)-based framework. In particular, the proposed solution uses the loss\nvalue of the receiver DNN as the reward to train the transmitter DNN. The\nsimulation results then show that our proposed solution can jointly train the\ntransmitter and the receiver without requiring the prior channel model. 
In\naddition, we demonstrate that the proposed DDPG-based solution can achieve\nbetter detection performance compared to the state-of-the-art solutions.\n","authors":["Bolun Zhang","Nguyen Van Huynh"],"pdf_url":"https://arxiv.org/pdf/2305.07448v2.pdf","comment":"submitted to IEEE GLOBECOM 2023"},{"id":"http://arxiv.org/abs/2308.03688v1","updated":"2023-08-07T16:08:11Z","published":"2023-08-07T16:08:11Z","title":"AgentBench: Evaluating LLMs as Agents","summary":" Large Language Models (LLMs) are becoming increasingly smart and autonomous,\ntargeting real-world pragmatic missions beyond traditional NLP tasks. As a\nresult, there has been an urgent need to evaluate LLMs as agents on challenging\ntasks in interactive environments. We present AgentBench, a multi-dimensional\nevolving benchmark that currently consists of 8 distinct environments to assess\nLLM-as-Agent's reasoning and decision-making abilities in a multi-turn\nopen-ended generation setting. Our extensive test over 25 LLMs (including APIs\nand open-sourced models) shows that, while top commercial LLMs present a strong\nability of acting as agents in complex environments, there is a significant\ndisparity in performance between them and open-sourced competitors. It also\nserves as a component of an ongoing project with wider coverage and deeper\nconsideration towards systematic LLM evaluation. Datasets, environments, and an\nintegrated evaluation package for AgentBench are released at\nhttps://github.com/THUDM/AgentBench\n","authors":["Xiao Liu","Hao Yu","Hanchen Zhang","Yifan Xu","Xuanyu Lei","Hanyu Lai","Yu Gu","Hangliang Ding","Kaiwen Men","Kejuan Yang","Shudan Zhang","Xiang Deng","Aohan Zeng","Zhengxiao Du","Chenhui Zhang","Sheng Shen","Tianjun Zhang","Yu Su","Huan Sun","Minlie Huang","Yuxiao Dong","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2308.03688v1.pdf","comment":"38 pages"},{"id":"http://arxiv.org/abs/2308.00086v2","updated":"2023-08-07T16:04:02Z","published":"2023-07-28T10:33:12Z","title":"Unsupervised machine-learning shock-capturing technique for high-order\n solvers","summary":" We present a novel unsupervised machine learning shock capturing algorithm\nbased on Gaussian Mixture Models (GMMs). The proposed GMM sensor demonstrates\nremarkable accuracy in detecting shocks and is robust across diverse test cases\nwithout the need for parameter tuning. We compare the GMM-based sensor with\nstate-of-the-art alternatives. All methods are integrated into a high-order\ncompressible discontinuous Galerkin solver where artificial viscosity can be\nmodulated to capture shocks. Supersonic test cases, including high Reynolds\nnumbers, showcase the sensor's performance, demonstrating the same\neffectiveness as fine-tuned state-of-the-art sensors. The nodal DG approach\nallows for potential applications in sub-cell flux-differencing formulations,\nsupersonic feature detection, and mesh refinement. The adaptive nature and\nability to function without extensive training datasets make this GMM-based\nsensor suitable for complex geometries and varied flow configurations. 
Our\nstudy reveals the potential of unsupervised machine learning methods,\nexemplified by the GMM sensor, to improve the robustness and efficiency of\nadvanced CFD codes.\n","authors":["Andrés Mateo-Gabín","Kenza Tlales","Eusebio Valero","Esteban Ferrer","Gonzalo Rubio"],"pdf_url":"https://arxiv.org/pdf/2308.00086v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03687v1","updated":"2023-08-07T16:03:40Z","published":"2023-08-07T16:03:40Z","title":"Almost-sure convergence of iterates and multipliers in stochastic\n sequential quadratic optimization","summary":" Stochastic sequential quadratic optimization (SQP) methods for solving\ncontinuous optimization problems with nonlinear equality constraints have\nattracted attention recently, such as for solving large-scale data-fitting\nproblems subject to nonconvex constraints. However, for a recently proposed\nsubclass of such methods that is built on the popular stochastic-gradient\nmethodology from the unconstrained setting, convergence guarantees have been\nlimited to the asymptotic convergence of the expected value of a stationarity\nmeasure to zero. This is in contrast to the unconstrained setting in which\nalmost-sure convergence guarantees (of the gradient of the objective to zero)\ncan be proved for stochastic-gradient-based methods. In this paper, new\nalmost-sure convergence guarantees for the primal iterates, Lagrange\nmultipliers, and stationarity measures generated by a stochastic SQP algorithm\nin this subclass of methods are proved. It is shown that the error in the\nLagrange multipliers can be bounded by the distance of the primal iterate to a\nprimal stationary point plus the error in the latest stochastic gradient\nestimate. It is further shown that, subject to certain assumptions, this latter\nerror can be made to vanish by employing a running average of the Lagrange\nmultipliers that are computed during the run of the algorithm. The results of\nnumerical experiments are provided to demonstrate the proved theoretical\nguarantees.\n","authors":["Frank E. Curtis","Xin Jiang","Qi Wang"],"pdf_url":"https://arxiv.org/pdf/2308.03687v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03686v1","updated":"2023-08-07T16:01:14Z","published":"2023-08-07T16:01:14Z","title":"Linear Convergence Bounds for Diffusion Models via Stochastic\n Localization","summary":" Diffusion models are a powerful method for generating approximate samples\nfrom high-dimensional data distributions. Several recent results have provided\npolynomial bounds on the convergence rate of such models, assuming\n$L^2$-accurate score estimators. However, up until now the best known such\nbounds were either superlinear in the data dimension or required strong\nsmoothness assumptions. We provide the first convergence bounds which are\nlinear in the data dimension (up to logarithmic factors) assuming only finite\nsecond moments of the data distribution. We show that diffusion models require\nat most $\\tilde O(\\frac{d \\log^2(1/\\delta)}{\\varepsilon^2})$ steps to\napproximate an arbitrary data distribution on $\\mathbb{R}^d$ corrupted with\nGaussian noise of variance $\\delta$ to within $\\varepsilon^2$ in\nKullback--Leibler divergence. Our proof builds on the Girsanov-based methods of\nprevious works. 
We introduce a refined treatment of the error arising from the\ndiscretization of the reverse SDE, which is based on tools from stochastic\nlocalization.\n","authors":["Joe Benton","Valentin De Bortoli","Arnaud Doucet","George Deligiannidis"],"pdf_url":"https://arxiv.org/pdf/2308.03686v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03670v1","updated":"2023-08-07T15:44:58Z","published":"2023-08-07T15:44:58Z","title":"Improving FHB Screening in Wheat Breeding Using an Efficient Transformer\n Model","summary":" Fusarium head blight is a devastating disease that causes significant\neconomic losses annually on small grains. Efficiency, accuracy, and timely\ndetection of FHB in the resistance screening are critical for wheat and barley\nbreeding programs. In recent years, various image processing techniques have\nbeen developed using supervised machine learning algorithms for the early\ndetection of FHB. The state-of-the-art convolutional neural network-based\nmethods, such as U-Net, employ a series of encoding blocks to create a local\nrepresentation and a series of decoding blocks to capture the semantic\nrelations. However, these methods are often not capable of modeling long-range\ndependencies inside the input data, and their ability to model multi-scale\nobjects with significant variations in texture and shape is limited. Vision\ntransformers, as alternative architectures with innate global self-attention\nmechanisms for sequence-to-sequence prediction, may also have limited\nlocalization capabilities due to insufficient low-level details. To overcome these\nlimitations, a new Context Bridge is proposed to integrate the local\nrepresentation capability of the U-Net network in the transformer model. In\naddition, the standard attention mechanism of the original transformer is\nreplaced with Efficient Self-attention, which is less complicated than other\nstate-of-the-art methods. To train the proposed network, 12,000 wheat images\nfrom an FHB-inoculated wheat field at the SDSU research farm in Volga, SD, were\ncaptured. In addition to healthy and unhealthy plants, these images encompass\nvarious stages of the disease. A team of expert pathologists annotated the\nimages for training and evaluating the developed model. As a result, the\neffectiveness of the transformer-based method for FHB-disease detection,\nthrough extensive experiments across typical tasks for plant image\nsegmentation, is demonstrated.\n","authors":["Babak Azad","Ahmed Abdalla","Kwanghee Won","Ali Mirzakhani Nafchi"],"pdf_url":"https://arxiv.org/pdf/2308.03670v1.pdf","comment":"10 pages, 5 figures, 1 table. Presented at the 2023 ASABE Annual\n International Meeting conference in Omaha, Nebraska. Also available at\n https://elibrary.asabe.org/abstract.asp?aid=54149"},{"id":"http://arxiv.org/abs/2308.03669v1","updated":"2023-08-07T15:40:34Z","published":"2023-08-07T15:40:34Z","title":"Diffusion Model in Causal Inference with Unmeasured Confounders","summary":" We study how to extend the use of the diffusion model to answer the causal\nquestion from the observational data under the existence of unmeasured\nconfounders. In Pearl's framework of using a Directed Acyclic Graph (DAG) to\ncapture the causal intervention, a Diffusion-based Causal Model (DCM) was\nproposed incorporating the diffusion model to answer the causal questions more\naccurately, assuming that all of the confounders are observed. However,\nunmeasured confounders in practice exist, which hinders DCM from being\napplicable. 
To alleviate this limitation of DCM, we propose an extended model\ncalled Backdoor Criterion based DCM (BDCM), whose idea is rooted in the\nBackdoor criterion to find the variables in DAG to be included in the decoding\nprocess of the diffusion model so that we can extend DCM to the case with\nunmeasured confounders. Synthetic data experiment demonstrates that our\nproposed model captures the counterfactual distribution more precisely than DCM\nunder the unmeasured confounders.\n","authors":["Tatsuhiro Shimizu"],"pdf_url":"https://arxiv.org/pdf/2308.03669v1.pdf","comment":"6 pages, 7 figures"},{"id":"http://arxiv.org/abs/2304.05365v6","updated":"2023-08-07T15:39:37Z","published":"2023-04-11T17:20:37Z","title":"Did we personalize? Assessing personalization by an online reinforcement\n learning algorithm using resampling","summary":" There is a growing interest in using reinforcement learning (RL) to\npersonalize sequences of treatments in digital health to support users in\nadopting healthier behaviors. Such sequential decision-making problems involve\ndecisions about when to treat and how to treat based on the user's context\n(e.g., prior activity level, location, etc.). Online RL is a promising\ndata-driven approach for this problem as it learns based on each user's\nhistorical responses and uses that knowledge to personalize these decisions.\nHowever, to decide whether the RL algorithm should be included in an\n``optimized'' intervention for real-world deployment, we must assess the data\nevidence indicating that the RL algorithm is actually personalizing the\ntreatments to its users. Due to the stochasticity in the RL algorithm, one may\nget a false impression that it is learning in certain states and using this\nlearning to provide specific treatments. We use a working definition of\npersonalization and introduce a resampling-based methodology for investigating\nwhether the personalization exhibited by the RL algorithm is an artifact of the\nRL algorithm stochasticity. We illustrate our methodology with a case study by\nanalyzing the data from a physical activity clinical trial called HeartSteps,\nwhich included the use of an online RL algorithm. We demonstrate how our\napproach enhances data-driven truth-in-advertising of algorithm personalization\nboth across all users as well as within specific users in the study.\n","authors":["Susobhan Ghosh","Raphael Kim","Prasidh Chhabria","Raaz Dwivedi","Predrag Klasnja","Peng Liao","Kelly Zhang","Susan Murphy"],"pdf_url":"https://arxiv.org/pdf/2304.05365v6.pdf","comment":"The first two authors contributed equally"},{"id":"http://arxiv.org/abs/2308.03666v1","updated":"2023-08-07T15:35:32Z","published":"2023-08-07T15:35:32Z","title":"Bridging Trustworthiness and Open-World Learning: An Exploratory Neural\n Approach for Enhancing Interpretability, Generalization, and Robustness","summary":" As researchers strive to narrow the gap between machine intelligence and\nhuman through the development of artificial intelligence technologies, it is\nimperative that we recognize the critical importance of trustworthiness in\nopen-world, which has become ubiquitous in all aspects of daily life for\neveryone. However, several challenges may create a crisis of trust in current\nartificial intelligence systems that need to be bridged: 1) Insufficient\nexplanation of predictive results; 2) Inadequate generalization for learning\nmodels; 3) Poor adaptability to uncertain environments. 
Consequently, we\nexplore a neural program to bridge trustworthiness and open-world learning,\nextending from single-modal to multi-modal scenarios for readers. 1) To enhance\ndesign-level interpretability, we first customize trustworthy networks with\nspecific physical meanings; 2) We then design environmental well-being\ntask-interfaces via flexible learning regularizers for improving the\ngeneralization of trustworthy learning; 3) We propose to increase the\nrobustness of trustworthy learning by integrating open-world recognition losses\nwith agent mechanisms. Eventually, we enhance various trustworthy properties\nthrough the establishment of design-level explainability, environmental\nwell-being task-interfaces and open-world recognition programs. These designed\nopen-world protocols are applicable across a wide range of surroundings, under\nopen-world multimedia recognition scenarios with significant performance\nimprovements observed.\n","authors":["Shide Du","Zihan Fang","Shiyang Lan","Yanchao Tan","Manuel Günther","Shiping Wang","Wenzhong Guo"],"pdf_url":"https://arxiv.org/pdf/2308.03666v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03664v1","updated":"2023-08-07T15:28:39Z","published":"2023-08-07T15:28:39Z","title":"Two-stage Early Prediction Framework of Remaining Useful Life for\n Lithium-ion Batteries","summary":" Early prediction of remaining useful life (RUL) is crucial for effective\nbattery management across various industries, ranging from household appliances\nto large-scale applications. Accurate RUL prediction improves the reliability\nand maintainability of battery technology. However, existing methods have\nlimitations, including assumptions of data from the same sensors or\ndistribution, foreknowledge of the end of life (EOL), and neglect to determine\nthe first prediction cycle (FPC) to identify the start of the unhealthy stage.\nThis paper proposes a novel method for RUL prediction of Lithium-ion batteries.\nThe proposed framework comprises two stages: determining the FPC using a neural\nnetwork-based model to divide the degradation data into distinct health states\nand predicting the degradation pattern after the FPC to estimate the remaining\nuseful life as a percentage. Experimental results demonstrate that the proposed\nmethod outperforms conventional approaches in terms of RUL prediction.\nFurthermore, the proposed method shows promise for real-world scenarios,\nproviding improved accuracy and applicability for battery management.\n","authors":["Dhruv Mittal","Hymalai Bello","Bo Zhou","Mayank Shekhar Jha","Sungho Suh","Paul Lukowicz"],"pdf_url":"https://arxiv.org/pdf/2308.03664v1.pdf","comment":"Accepted at the 49th Annual Conference of the IEEE Industrial\n Electronics Society (IECON 2023)"},{"id":"http://arxiv.org/abs/2308.03661v1","updated":"2023-08-07T15:24:49Z","published":"2023-08-07T15:24:49Z","title":"Matrix Completion in Almost-Verification Time","summary":" We give a new framework for solving the fundamental problem of low-rank\nmatrix completion, i.e., approximating a rank-$r$ matrix $\\mathbf{M} \\in\n\\mathbb{R}^{m \\times n}$ (where $m \\ge n$) from random observations. First, we\nprovide an algorithm which completes $\\mathbf{M}$ on $99\\%$ of rows and columns\nunder no further assumptions on $\\mathbf{M}$ from $\\approx mr$ samples and\nusing $\\approx mr^2$ time. 
Then, assuming the row and column spans of\n$\\mathbf{M}$ satisfy additional regularity properties, we show how to boost\nthis partial completion guarantee to a full matrix completion algorithm by\naggregating solutions to regression problems involving the observations.\n In the well-studied setting where $\\mathbf{M}$ has incoherent row and column\nspans, our algorithms complete $\\mathbf{M}$ to high precision from\n$mr^{2+o(1)}$ observations in $mr^{3 + o(1)}$ time (omitting logarithmic\nfactors in problem parameters), improving upon the prior state-of-the-art\n[JN15] which used $\\approx mr^5$ samples and $\\approx mr^7$ time. Under an\nassumption on the row and column spans of $\\mathbf{M}$ we introduce (which is\nsatisfied by random subspaces with high probability), our sample complexity\nimproves to an almost information-theoretically optimal $mr^{1 + o(1)}$, and\nour runtime improves to $mr^{2 + o(1)}$. Our runtimes have the appealing\nproperty of matching the best known runtime to verify that a rank-$r$\ndecomposition $\\mathbf{U}\\mathbf{V}^\\top$ agrees with the sampled observations.\nWe also provide robust variants of our algorithms that, given random\nobservations from $\\mathbf{M} + \\mathbf{N}$ with $\\|\\mathbf{N}\\|_{F} \\le\n\\Delta$, complete $\\mathbf{M}$ to Frobenius norm distance $\\approx\nr^{1.5}\\Delta$ in the same runtimes as the noiseless setting. Prior noisy\nmatrix completion algorithms [CP10] only guaranteed a distance of $\\approx\n\\sqrt{n}\\Delta$.\n","authors":["Jonathan A. Kelner","Jerry Li","Allen Liu","Aaron Sidford","Kevin Tian"],"pdf_url":"https://arxiv.org/pdf/2308.03661v1.pdf","comment":"FOCS 2023"},{"id":"http://arxiv.org/abs/2308.03648v1","updated":"2023-08-07T14:58:53Z","published":"2023-08-07T14:58:53Z","title":"Generative Forests","summary":" Tabular data represents one of the most prevalent form of data. When it comes\nto data generation, many approaches would learn a density for the data\ngeneration process, but would not necessarily end up with a sampler, even less\nso being exact with respect to the underlying density. A second issue is on\nmodels: while complex modeling based on neural nets thrives in image or text\ngeneration (etc.), less is known for powerful generative models on tabular\ndata. A third problem is the visible chasm on tabular data between training\nalgorithms for supervised learning with remarkable properties (e.g. boosting),\nand a comparative lack of guarantees when it comes to data generation. In this\npaper, we tackle the three problems, introducing new tree-based generative\nmodels convenient for density modeling and tabular data generation that improve\non modeling capabilities of recent proposals, and a training algorithm which\nsimplifies the training setting of previous approaches and displays\nboosting-compliant convergence. 
This algorithm has the convenient property to\nrely on a supervised training scheme that can be implemented by a few tweaks to\nthe most popular induction scheme for decision tree induction with two classes.\nExperiments are provided on missing data imputation and comparing generated\ndata to real data, displaying the quality of the results obtained by our\napproach, in particular against state of the art.\n","authors":["Richard Nock","Mathieu Guillame-Bert"],"pdf_url":"https://arxiv.org/pdf/2308.03648v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.05174v5","updated":"2023-08-07T14:37:00Z","published":"2022-08-10T06:36:49Z","title":"FedOBD: Opportunistic Block Dropout for Efficiently Training Large-scale\n Neural Networks through Federated Learning","summary":" Large-scale neural networks possess considerable expressive power. They are\nwell-suited for complex learning tasks in industrial applications. However,\nlarge-scale models pose significant challenges for training under the current\nFederated Learning (FL) paradigm. Existing approaches for efficient FL training\noften leverage model parameter dropout. However, manipulating individual model\nparameters is not only inefficient in meaningfully reducing the communication\noverhead when training large-scale FL models, but may also be detrimental to\nthe scaling efforts and model performance as shown by recent research. To\naddress these issues, we propose the Federated Opportunistic Block Dropout\n(FedOBD) approach. The key novelty is that it decomposes large-scale models\ninto semantic blocks so that FL participants can opportunistically upload\nquantized blocks, which are deemed to be significant towards training the\nmodel, to the FL server for aggregation. Extensive experiments evaluating\nFedOBD against four state-of-the-art approaches based on multiple real-world\ndatasets show that it reduces the overall communication overhead by more than\n88% compared to the best performing baseline approach, while achieving the\nhighest test accuracy. To the best of our knowledge, FedOBD is the first\napproach to perform dropout on FL models at the block level rather than at the\nindividual parameter level.\n","authors":["Yuanyuan Chen","Zichen Chen","Pengcheng Wu","Han Yu"],"pdf_url":"https://arxiv.org/pdf/2208.05174v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03629v1","updated":"2023-08-07T14:36:03Z","published":"2023-08-07T14:36:03Z","title":"MedMine: Examining Pre-trained Language Models on Medication Mining","summary":" Automatic medication mining from clinical and biomedical text has become a\npopular topic due to its real impact on healthcare applications and the recent\ndevelopment of powerful language models (LMs). However, fully-automatic\nextraction models still face obstacles to be overcome such that they can be\ndeployed directly into clinical practice for better impacts. Such obstacles\ninclude their imbalanced performances on different entity types and clinical\nevents. In this work, we examine current state-of-the-art pre-trained language\nmodels (PLMs) on such tasks, via fine-tuning including the monolingual model\nMed7 and multilingual large language model (LLM) XLM-RoBERTa. We compare their\nadvantages and drawbacks using historical medication mining shared task data\nsets from n2c2-2018 challenges. 
We report the findings we get from these\nfine-tuning experiments such that they can facilitate future research on\naddressing them, for instance, how to combine their outputs, merge such models,\nor improve their overall accuracy by ensemble learning and data augmentation.\nMedMine is part of the M3 Initiative \\url{https://github.com/HECTA-UoM/M3}\n","authors":["Haifa Alrdahi","Lifeng Han","Hendrik Šuvalov","Goran Nenadic"],"pdf_url":"https://arxiv.org/pdf/2308.03629v1.pdf","comment":"Open Research Project. 7 pages, 1 figure, 5 tables"},{"id":"http://arxiv.org/abs/2303.12642v3","updated":"2023-08-07T14:29:03Z","published":"2023-03-22T15:23:22Z","title":"Democratising AI: Multiple Meanings, Goals, and Methods","summary":" Numerous parties are calling for the democratisation of AI, but the phrase is\nused to refer to a variety of goals, the pursuit of which sometimes conflict.\nThis paper identifies four kinds of AI democratisation that are commonly\ndiscussed: (1) the democratisation of AI use, (2) the democratisation of AI\ndevelopment, (3) the democratisation of AI profits, and (4) the democratisation\nof AI governance. Numerous goals and methods of achieving each form of\ndemocratisation are discussed. The main takeaway from this paper is that AI\ndemocratisation is a multifarious and sometimes conflicting concept that should\nnot be conflated with improving AI accessibility. If we want to move beyond\nambiguous commitments to democratising AI, to productive discussions of\nconcrete policies and trade-offs, then we need to recognise the principal role\nof the democratisation of AI governance in navigating tradeoffs and risks\nacross decisions around use, development, and profits.\n","authors":["Elizabeth Seger","Aviv Ovadya","Ben Garfinkel","Divya Siddarth","Allan Dafoe"],"pdf_url":"https://arxiv.org/pdf/2303.12642v3.pdf","comment":"V2 Changed second author affiliation; added citation to section 5.2;\n edit to author contribution statement; V3 camera ready version for conference\n proceedings. Minor content changes in response to reviewer comments"},{"id":"http://arxiv.org/abs/2308.03613v1","updated":"2023-08-07T14:16:52Z","published":"2023-08-07T14:16:52Z","title":"Adaptive Semi-Supervised Segmentation of Brain Vessels with Ambiguous\n Labels","summary":" Accurate segmentation of brain vessels is crucial for cerebrovascular disease\ndiagnosis and treatment. However, existing methods face challenges in capturing\nsmall vessels and handling datasets that are partially or ambiguously\nannotated. In this paper, we propose an adaptive semi-supervised approach to\naddress these challenges. Our approach incorporates innovative techniques\nincluding progressive semi-supervised learning, adaptative training strategy,\nand boundary enhancement. Experimental results on 3DRA datasets demonstrate the\nsuperiority of our method in terms of mesh-based segmentation metrics. 
By\nleveraging the partially and ambiguously labeled data, which only annotates the\nmain vessels, our method achieves impressive segmentation performance on\nmislabeled fine vessels, showcasing its potential for clinical applications.\n","authors":["Fengming Lin","Yan Xia","Nishant Ravikumar","Qiongyao Liu","Michael MacRaild","Alejandro F Frangi"],"pdf_url":"https://arxiv.org/pdf/2308.03613v1.pdf","comment":"Accepted by DALI MICCAI 2023"},{"id":"http://arxiv.org/abs/2308.03574v1","updated":"2023-08-07T13:25:48Z","published":"2023-08-07T13:25:48Z","title":"Generalized Early Stopping in Evolutionary Direct Policy Search","summary":" Lengthy evaluation times are common in many optimization problems such as\ndirect policy search tasks, especially when they involve conducting evaluations\nin the physical world, e.g. in robotics applications. Often, when evaluating a\nsolution over a fixed time period, it becomes clear that the objective value\nwill not increase with additional computation time (for example, when a\ntwo-wheeled robot continuously spins on the spot). In such cases, it makes\nsense to stop the evaluation early to save computation time. However, most\napproaches to stop the evaluation are problem-specific and need to be\nspecifically designed for the task at hand. Therefore, we propose an early\nstopping method for direct policy search. The proposed method only looks at the\nobjective value at each time step and requires no problem-specific knowledge.\n We test the introduced stopping criterion in five direct policy search\nenvironments drawn from games, robotics, and classic control domains, and show\nthat it can save up to 75% of the computation time. We also compare it with\nproblem-specific stopping criteria and demonstrate that it performs comparably\nwhile being more generally applicable.\n","authors":["Etor Arza","Leni K. Le Goff","Emma Hart"],"pdf_url":"https://arxiv.org/pdf/2308.03574v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03573v1","updated":"2023-08-07T13:24:52Z","published":"2023-08-07T13:24:52Z","title":"When Federated Learning meets Watermarking: A Comprehensive Overview of\n Techniques for Intellectual Property Protection","summary":" Federated Learning (FL) is a technique that allows multiple participants to\ncollaboratively train a Deep Neural Network (DNN) without the need of\ncentralizing their data. Among other advantages, it comes with\nprivacy-preserving properties making it attractive for application in sensitive\ncontexts, such as health care or the military. Although the data are not\nexplicitly exchanged, the training procedure requires sharing information about\nparticipants' models. This makes the individual models vulnerable to theft or\nunauthorized distribution by malicious actors. To address the issue of\nownership rights protection in the context of Machine Learning (ML), DNN\nWatermarking methods have been developed during the last five years. Most\nexisting works have focused on watermarking in a centralized manner, but only a\nfew methods have been designed for FL and its unique constraints. 
In this\npaper, we provide an overview of recent advancements in Federated Learning\nwatermarking, shedding light on the new challenges and opportunities that arise\nin this field.\n","authors":["Mohammed Lansari","Reda Bellafqira","Katarzyna Kapusta","Vincent Thouvenot","Olivier Bettan","Gouenou Coatrieux"],"pdf_url":"https://arxiv.org/pdf/2308.03573v1.pdf","comment":"2figures, 14pages, 3tables"},{"id":"http://arxiv.org/abs/2308.03572v1","updated":"2023-08-07T13:24:50Z","published":"2023-08-07T13:24:50Z","title":"Provably Efficient Learning in Partially Observable Contextual Bandit","summary":" In this paper, we investigate transfer learning in partially observable\ncontextual bandits, where agents have limited knowledge from other agents and\npartial information about hidden confounders. We first convert the problem to\nidentifying or partially identifying causal effects between actions and rewards\nthrough optimization problems. To solve these optimization problems, we\ndiscretize the original functional constraints of unknown distributions into\nlinear constraints, and sample compatible causal models via sequentially\nsolving linear programmings to obtain causal bounds with the consideration of\nestimation error. Our sampling algorithms provide desirable convergence results\nfor suitable sampling distributions. We then show how causal bounds can be\napplied to improving classical bandit algorithms and affect the regrets with\nrespect to the size of action sets and function spaces. Notably, in the task\nwith function approximation which allows us to handle general context\ndistributions, our method improves the order dependence on function space size\ncompared with previous literatures. We formally prove that our causally\nenhanced algorithms outperform classical bandit algorithms and achieve orders\nof magnitude faster convergence rates. Finally, we perform simulations that\ndemonstrate the efficiency of our strategy compared to the current\nstate-of-the-art methods. This research has the potential to enhance the\nperformance of contextual bandit agents in real-world applications where data\nis scarce and costly to obtain.\n","authors":["Xueping Gong","Jiheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03572v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2010.03104 by other authors"},{"id":"http://arxiv.org/abs/2206.08083v4","updated":"2023-08-07T13:24:06Z","published":"2022-06-16T10:53:18Z","title":"CARLANE: A Lane Detection Benchmark for Unsupervised Domain Adaptation\n from Simulation to multiple Real-World Domains","summary":" Unsupervised Domain Adaptation demonstrates great potential to mitigate\ndomain shifts by transferring models from labeled source domains to unlabeled\ntarget domains. While Unsupervised Domain Adaptation has been applied to a wide\nvariety of complex vision tasks, only few works focus on lane detection for\nautonomous driving. This can be attributed to the lack of publicly available\ndatasets. To facilitate research in these directions, we propose CARLANE, a\n3-way sim-to-real domain adaptation benchmark for 2D lane detection. CARLANE\nencompasses the single-target datasets MoLane and TuLane and the multi-target\ndataset MuLane. These datasets are built from three different domains, which\ncover diverse scenes and contain a total of 163K unique images, 118K of which\nare annotated. In addition we evaluate and report systematic baselines,\nincluding our own method, which builds upon Prototypical Cross-domain\nSelf-supervised Learning. 
We find that false positive and false negative rates\nof the evaluated domain adaptation methods are high compared to those of fully\nsupervised baselines. This affirms the need for benchmarks such as CARLANE to\nfurther strengthen research in Unsupervised Domain Adaptation for lane\ndetection. CARLANE, all evaluated models and the corresponding implementations\nare publicly available at https://carlanebenchmark.github.io.\n","authors":["Julian Gebele","Bonifaz Stuhr","Johann Haselberger"],"pdf_url":"https://arxiv.org/pdf/2206.08083v4.pdf","comment":"36th Conference on Neural Information Processing Systems (NeurIPS\n 2022) Track on Datasets and Benchmarks, 22 pages, 11 figures"},{"id":"http://arxiv.org/abs/2307.12375v2","updated":"2023-08-07T13:22:01Z","published":"2023-07-23T16:54:41Z","title":"In-Context Learning in Large Language Models Learns Label Relationships\n but Is Not Conventional Learning","summary":" The performance of Large Language Models (LLMs) on downstream tasks often\nimproves significantly when including examples of the input-label relationship\nin the context. However, there is currently no consensus about how this\nin-context learning (ICL) ability of LLMs works: for example, while Xie et al.\n(2021) liken ICL to a general-purpose learning algorithm, Min et al. (2022b)\nargue ICL does not even learn label relationships from in-context examples. In\nthis paper, we study (1) how labels of in-context examples affect predictions,\n(2) how label relationships learned during pre-training interact with\ninput-label examples provided in-context, and (3) how ICL aggregates label\ninformation across in-context examples. Our findings suggest that LLMs usually\nincorporate information from in-context labels, but that pre-training and\nin-context label relationships are treated differently, and that the model does\nnot consider all in-context information equally. Our results give insights into\nunderstanding and aligning LLM behavior.\n","authors":["Jannik Kossen","Tom Rainforth","Yarin Gal"],"pdf_url":"https://arxiv.org/pdf/2307.12375v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03570v1","updated":"2023-08-07T13:21:58Z","published":"2023-08-07T13:21:58Z","title":"Partial identification of kernel based two sample tests with mismeasured\n data","summary":" Nonparametric two-sample tests such as the Maximum Mean Discrepancy (MMD) are\noften used to detect differences between two distributions in machine learning\napplications. However, the majority of existing literature assumes that\nerror-free samples from the two distributions of interest are available. We\nrelax this assumption and study the estimation of the MMD under\n$\\epsilon$-contamination, where a possibly non-random $\\epsilon$ proportion of\none distribution is erroneously grouped with the other. We show that under\n$\\epsilon$-contamination, the typical estimate of the MMD is unreliable.\nInstead, we study partial identification of the MMD, and characterize sharp\nupper and lower bounds that contain the true, unknown MMD. We propose a method\nto estimate these bounds, and show that it gives estimates that converge to the\nsharpest possible bounds on the MMD as sample size increases, with a\nconvergence rate that is faster than alternative approaches. 
Using three\ndatasets, we empirically validate that our approach is superior to the\nalternatives: it gives tight bounds with a low false coverage rate.\n","authors":["Ron Nafshi","Maggie Makar"],"pdf_url":"https://arxiv.org/pdf/2308.03570v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.10923v2","updated":"2023-08-07T12:58:57Z","published":"2022-06-22T09:02:42Z","title":"FairGrad: Fairness Aware Gradient Descent","summary":" We address the problem of group fairness in classification, where the\nobjective is to learn models that do not unjustly discriminate against\nsubgroups of the population. Most existing approaches are limited to simple\nbinary tasks or involve difficult to implement training mechanisms which\nreduces their practical applicability. In this paper, we propose FairGrad, a\nmethod to enforce fairness based on a re-weighting scheme that iteratively\nlearns group specific weights based on whether they are advantaged or not.\nFairGrad is easy to implement, accommodates various standard fairness\ndefinitions, and comes with minimal overhead. Furthermore, we show that it is\ncompetitive with standard baselines over various datasets including ones used\nin natural language processing and computer vision.\n FairGrad is available as a PyPI package at -\nhttps://pypi.org/project/fairgrad\n","authors":["Gaurav Maheshwari","Michaël Perrot"],"pdf_url":"https://arxiv.org/pdf/2206.10923v2.pdf","comment":"Paper is accepted at Transactions on Machine Learning Research.\n Reviewed on OpenReview: https://openreview.net/forum?id=0f8tU3QwWD"},{"id":"http://arxiv.org/abs/2308.03542v1","updated":"2023-08-07T12:44:10Z","published":"2023-08-07T12:44:10Z","title":"A Transfer Learning Framework for Proactive Ramp Metering Performance\n Assessment","summary":" Transportation agencies need to assess ramp metering performance when\ndeploying or expanding a ramp metering system. The evaluation of a ramp\nmetering strategy is primarily centered around examining its impact on freeway\ntraffic mobility. One way these effects can be explored is by comparing traffic\nstates, such as the speed before and after the ramp metering strategy has been\naltered. Predicting freeway traffic states for the after scenarios following\nthe implementation of a new ramp metering control strategy could offer valuable\ninsights into the potential effectiveness of the target strategy. However, the\nuse of machine learning methods in predicting the freeway traffic state for the\nafter scenarios and evaluating the effectiveness of transportation policies or\ntraffic control strategies such as ramp metering is somewhat limited in the\ncurrent literature. To bridge the research gap, this study presents a framework\nfor predicting freeway traffic parameters (speed, occupancy, and flow rate) for\nthe after situations when a new ramp metering control strategy is implemented.\nBy learning the association between the spatial-temporal features of traffic\nstates in before and after situations for known freeway segments, the proposed\nframework can transfer this learning to predict the traffic parameters for new\nfreeway segments. The proposed framework is built upon a transfer learning\nmodel. 
Experimental results show that the proposed framework is feasible for\nuse as an alternative for predicting freeway traffic parameters to proactively\nevaluate ramp metering performance.\n","authors":["Xiaobo Ma","Adrian Cottam","Mohammad Razaur Rahman Shaon","Yao-Jan Wu"],"pdf_url":"https://arxiv.org/pdf/2308.03542v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03538v1","updated":"2023-08-07T12:36:30Z","published":"2023-08-07T12:36:30Z","title":"On-ramp and Off-ramp Traffic Flows Estimation Based on A Data-driven\n Transfer Learning Framework","summary":" To develop the most appropriate control strategy and monitor, maintain, and\nevaluate the traffic performance of the freeway weaving areas, state and local\nDepartments of Transportation need to have access to traffic flows at each pair\nof on-ramp and off-ramp. However, ramp flows are not always readily available\nto transportation agencies and little effort has been made to estimate these\nmissing flows in locations where no physical sensors are installed. To bridge\nthis research gap, a data-driven framework is proposed that can accurately\nestimate the missing ramp flows by solely using data collected from loop\ndetectors on freeway mainlines. The proposed framework employs a transfer\nlearning model. The transfer learning model relaxes the assumption that the\nunderlying data distributions of the source and target domains must be the\nsame. Therefore, the proposed framework can guarantee high-accuracy estimation\nof on-ramp and off-ramp flows on freeways with different traffic patterns,\ndistributions, and characteristics. Based on the experimental results, the flow\nestimation mean absolute errors range between 23.90 veh/h to 40.85 veh/h for\non-ramps, and 31.58 veh/h to 45.31 veh/h for off-ramps; the flow estimation\nroot mean square errors range between 34.55 veh/h to 57.77 veh/h for on-ramps,\nand 41.75 veh/h to 58.80 veh/h for off-ramps. Further, the comparison analysis\nshows that the proposed framework outperforms other conventional machine\nlearning models. The estimated ramp flows based on the proposed method can help\ntransportation agencies to enhance the operations of their ramp control\nstrategies for locations where physical sensors are not installed.\n","authors":["Xiaobo Ma","Abolfazl Karimpour","Yao-Jan Wu"],"pdf_url":"https://arxiv.org/pdf/2308.03538v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2108.11577v4","updated":"2023-08-07T12:33:20Z","published":"2021-08-26T04:42:24Z","title":"Machine Unlearning of Features and Labels","summary":" Removing information from a machine learning model is a non-trivial task that\nrequires to partially revert the training process. This task is unavoidable\nwhen sensitive data, such as credit card numbers or passwords, accidentally\nenter the model and need to be removed afterwards. Recently, different concepts\nfor machine unlearning have been proposed to address this problem. While these\napproaches are effective in removing individual data points, they do not scale\nto scenarios where larger groups of features and labels need to be reverted. In\nthis paper, we propose the first method for unlearning features and labels. Our\napproach builds on the concept of influence functions and realizes unlearning\nthrough closed-form updates of model parameters. It enables to adapt the\ninfluence of training data on a learning model retrospectively, thereby\ncorrecting data leaks and privacy issues. 
For learning models with strongly\nconvex loss functions, our method provides certified unlearning with\ntheoretical guarantees. For models with non-convex losses, we empirically show\nthat unlearning features and labels is effective and significantly faster than\nother strategies.\n","authors":["Alexander Warnecke","Lukas Pirch","Christian Wressnegger","Konrad Rieck"],"pdf_url":"https://arxiv.org/pdf/2108.11577v4.pdf","comment":"Network and Distributed System Security Symposium (NDSS) 2023"},{"id":"http://arxiv.org/abs/2308.03530v1","updated":"2023-08-07T12:27:19Z","published":"2023-08-07T12:27:19Z","title":"Deep Feature Learning for Wireless Spectrum Data","summary":" In recent years, the traditional feature engineering process for training\nmachine learning models is being automated by the feature extraction layers\nintegrated in deep learning architectures. In wireless networks, many studies\nwere conducted in automatic learning of feature representations for\ndomain-related challenges. However, most of the existing works assume some\nsupervision along the learning process by using labels to optimize the model.\nIn this paper, we investigate an approach to learning feature representations\nfor wireless transmission clustering in a completely unsupervised manner, i.e.\nrequiring no labels in the process. We propose a model based on convolutional\nneural networks that automatically learns a reduced dimensionality\nrepresentation of the input data with 99.3% less components compared to a\nbaseline principal component analysis (PCA). We show that the automatic\nrepresentation learning is able to extract fine-grained clusters containing the\nshapes of the wireless transmission bursts, while the baseline enables only\ngeneral separability of the data based on the background noise.\n","authors":["Ljupcho Milosheski","Gregor Cerar","Blaž Bertalanič","Carolina Fortuna","Mihael Mohorčič"],"pdf_url":"https://arxiv.org/pdf/2308.03530v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03526v1","updated":"2023-08-07T12:21:37Z","published":"2023-08-07T12:21:37Z","title":"AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning","summary":" StarCraft II is one of the most challenging simulated reinforcement learning\nenvironments; it is partially observable, stochastic, multi-agent, and\nmastering StarCraft II requires strategic planning over long time horizons with\nreal-time low-level execution. It also has an active professional competitive\nscene. StarCraft II is uniquely suited for advancing offline RL algorithms,\nboth because of its challenging nature and because Blizzard has released a\nmassive dataset of millions of StarCraft II games played by human players. This\npaper leverages that and establishes a benchmark, called AlphaStar Unplugged,\nintroducing unprecedented challenges for offline reinforcement learning. We\ndefine a dataset (a subset of Blizzard's release), tools standardizing an API\nfor machine learning methods, and an evaluation protocol. We also present\nbaseline agents, including behavior cloning, offline variants of actor-critic\nand MuZero. 
We improve the state of the art of agents using only offline data,\nand we achieve 90% win rate against previously published AlphaStar behavior\ncloning agent.\n","authors":["Michaël Mathieu","Sherjil Ozair","Srivatsan Srinivasan","Caglar Gulcehre","Shangtong Zhang","Ray Jiang","Tom Le Paine","Richard Powell","Konrad Żołna","Julian Schrittwieser","David Choi","Petko Georgiev","Daniel Toyama","Aja Huang","Roman Ring","Igor Babuschkin","Timo Ewalds","Mahyar Bordbar","Sarah Henderson","Sergio Gómez Colmenarejo","Aäron van den Oord","Wojciech Marian Czarnecki","Nando de Freitas","Oriol Vinyals"],"pdf_url":"https://arxiv.org/pdf/2308.03526v1.pdf","comment":"32 pages, 13 figures, previous version published as a NeurIPS 2021\n workshop: https://openreview.net/forum?id=Np8Pumfoty"},{"id":"http://arxiv.org/abs/2308.03514v1","updated":"2023-08-07T12:10:13Z","published":"2023-08-07T12:10:13Z","title":"Worker Activity Recognition in Manufacturing Line Using Near-body\n Electric Field","summary":" Manufacturing industries strive to improve production efficiency and product\nquality by deploying advanced sensing and control systems. Wearable sensors are\nemerging as a promising solution for achieving this goal, as they can provide\ncontinuous and unobtrusive monitoring of workers' activities in the\nmanufacturing line. This paper presents a novel wearable sensing prototype that\ncombines IMU and body capacitance sensing modules to recognize worker\nactivities in the manufacturing line. To handle these multimodal sensor data,\nwe propose and compare early, and late sensor data fusion approaches for\nmulti-channel time-series convolutional neural networks and deep convolutional\nLSTM. We evaluate the proposed hardware and neural network model by collecting\nand annotating sensor data using the proposed sensing prototype and Apple\nWatches in the testbed of the manufacturing line. Experimental results\ndemonstrate that our proposed methods achieve superior performance compared to\nthe baseline methods, indicating the potential of the proposed approach for\nreal-world applications in manufacturing industries. Furthermore, the proposed\nsensing prototype with a body capacitive sensor and feature fusion method\nimproves by 6.35%, yielding a 9.38% higher macro F1 score than the proposed\nsensing prototype without a body capacitive sensor and Apple Watch data,\nrespectively.\n","authors":["Sungho Suh","Vitor Fortes Rey","Sizhen Bian","Yu-Chi Huang","Jože M. Rožanec","Hooman Tavakoli Ghinani","Bo Zhou","Paul Lukowicz"],"pdf_url":"https://arxiv.org/pdf/2308.03514v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08674v3","updated":"2023-08-07T12:08:17Z","published":"2023-07-17T17:36:09Z","title":"TableGPT: Towards Unifying Tables, Nature Language and Commands into One\n GPT","summary":" Tables are prevalent in real-world databases, requiring significant time and\neffort for humans to analyze and manipulate. The advancements in large language\nmodels (LLMs) have made it possible to interact with tables using natural\nlanguage input, bringing this capability closer to reality. In this paper, we\npresent TableGPT, a unified fine-tuned framework that enables LLMs to\nunderstand and operate on tables using external functional commands. 
It\nintroduces the capability to seamlessly interact with tables, enabling a wide\nrange of functionalities such as question answering, data manipulation (e.g.,\ninsert, delete, query, and modify operations), data visualization, analysis\nreport generation, and automated prediction. TableGPT aims to provide\nconvenience and accessibility to users by empowering them to effortlessly\nleverage tabular data. At the core of TableGPT lies the novel concept of global\ntabular representations, which empowers LLMs to gain a comprehensive\nunderstanding of the entire table beyond meta-information. By jointly training\nLLMs on both table and text modalities, TableGPT achieves a deep understanding\nof tabular data and the ability to perform complex operations on tables through\nchain-of-command instructions. Importantly, TableGPT offers the advantage of\nbeing a self-contained system rather than relying on external API interfaces.\nMoreover, it supports efficient data process flow, query rejection (when\nappropriate) and private deployment, enabling faster domain data fine-tuning\nand ensuring data privacy, which enhances the framework's adaptability to\nspecific use cases.\n","authors":["Liangyu Zha","Junlin Zhou","Liyao Li","Rui Wang","Qingyi Huang","Saisai Yang","Jing Yuan","Changbao Su","Xiang Li","Aofeng Su","Tao Zhang","Chen Zhou","Kaizhe Shou","Miao Wang","Wufang Zhu","Guoshan Lu","Chao Ye","Yali Ye","Wentao Ye","Yiming Zhang","Xinglong Deng","Jie Xu","Haobo Wang","Gang Chen","Junbo Zhao"],"pdf_url":"https://arxiv.org/pdf/2307.08674v3.pdf","comment":"Technical Report"},{"id":"http://arxiv.org/abs/2302.00025v2","updated":"2023-08-07T12:06:43Z","published":"2023-01-31T19:00:28Z","title":"On the Within-Group Fairness of Screening Classifiers","summary":" Screening classifiers are increasingly used to identify qualified candidates\nin a variety of selection processes. In this context, it has been recently\nshown that, if a classifier is calibrated, one can identify the smallest set of\ncandidates which contains, in expectation, a desired number of qualified\ncandidates using a threshold decision rule. This lends support to focusing on\ncalibration as the only requirement for screening classifiers. In this paper,\nwe argue that screening policies that use calibrated classifiers may suffer\nfrom an understudied type of within-group unfairness -- they may unfairly treat\nqualified members within demographic groups of interest. Further, we argue that\nthis type of unfairness can be avoided if classifiers satisfy within-group\nmonotonicity, a natural monotonicity property within each of the groups. Then,\nwe introduce an efficient post-processing algorithm based on dynamic\nprogramming to minimally modify a given calibrated classifier so that its\nprobability estimates satisfy within-group monotonicity. We validate our\nalgorithm using US Census survey data and show that within-group monotonicity\ncan be often achieved at a small cost in terms of prediction granularity and\nshortlist size.\n","authors":["Nastaran Okati","Stratis Tsirtsis","Manuel Gomez Rodriguez"],"pdf_url":"https://arxiv.org/pdf/2302.00025v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.14311v3","updated":"2023-08-07T12:06:19Z","published":"2023-02-28T05:01:01Z","title":"Towards Memory- and Time-Efficient Backpropagation for Training Spiking\n Neural Networks","summary":" Spiking Neural Networks (SNNs) are promising energy-efficient models for\nneuromorphic computing. 
For training the non-differentiable SNN models, the\nbackpropagation through time (BPTT) with surrogate gradients (SG) method has\nachieved high performance. However, this method suffers from considerable\nmemory cost and training time during training. In this paper, we propose the\nSpatial Learning Through Time (SLTT) method that can achieve high performance\nwhile greatly improving training efficiency compared with BPTT. First, we show\nthat the backpropagation of SNNs through the temporal domain contributes just a\nlittle to the final calculated gradients. Thus, we propose to ignore the\nunimportant routes in the computational graph during backpropagation. The\nproposed method reduces the number of scalar multiplications and achieves a\nsmall memory occupation that is independent of the total time steps.\nFurthermore, we propose a variant of SLTT, called SLTT-K, that allows\nbackpropagation only at K time steps, then the required number of scalar\nmultiplications is further reduced and is independent of the total time steps.\nExperiments on both static and neuromorphic datasets demonstrate superior\ntraining efficiency and performance of our SLTT. In particular, our method\nachieves state-of-the-art accuracy on ImageNet, while the memory cost and\ntraining time are reduced by more than 70% and 50%, respectively, compared with\nBPTT.\n","authors":["Qingyan Meng","Mingqing Xiao","Shen Yan","Yisen Wang","Zhouchen Lin","Zhi-Quan Luo"],"pdf_url":"https://arxiv.org/pdf/2302.14311v3.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2308.03511v1","updated":"2023-08-07T12:05:55Z","published":"2023-08-07T12:05:55Z","title":"A data-driven approach to predict decision point choice during normal\n and evacuation wayfinding in multi-story buildings","summary":" Understanding pedestrian route choice behavior in complex buildings is\nimportant to ensure pedestrian safety. Previous studies have mostly used\ntraditional data collection methods and discrete choice modeling to understand\nthe influence of different factors on pedestrian route and exit choice,\nparticularly in simple indoor environments. However, research on pedestrian\nroute choice in complex buildings is still limited. This paper presents a\ndata-driven approach for understanding and predicting the pedestrian decision\npoint choice during normal and emergency wayfinding in a multi-story building.\nFor this, we first built an indoor network representation and proposed a data\nmapping technique to map VR coordinates to the indoor representation. We then\nused a well-established machine learning algorithm, namely the random forest\n(RF) model to predict pedestrian decision point choice along a route during\nfour wayfinding tasks in a multi-story building. Pedestrian behavioral data in\na multi-story building was collected by a Virtual Reality experiment. The\nresults show a much higher prediction accuracy of decision points using the RF\nmodel (i.e., 93% on average) compared to the logistic regression model. The\nhighest prediction accuracy was 96% for task 3. Additionally, we tested the\nmodel performance combining personal characteristics and we found that personal\ncharacteristics did not affect decision point choice. 
This paper demonstrates\nthe potential of applying a machine learning algorithm to study pedestrian\nroute choice behavior in complex indoor buildings.\n","authors":["Yan Feng","Panchamy Krishnakumari"],"pdf_url":"https://arxiv.org/pdf/2308.03511v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.12377v4","updated":"2023-08-07T12:02:40Z","published":"2022-07-25T17:46:09Z","title":"A novel Deep Learning approach for one-step Conformal Prediction\n approximation","summary":" Deep Learning predictions with measurable confidence are increasingly\ndesirable for real-world problems, especially in high-risk settings. The\nConformal Prediction (CP) framework is a versatile solution that guarantees a\nmaximum error rate given minimal constraints. In this paper, we propose a novel\nconformal loss function that approximates the traditionally two-step CP\napproach in a single step. By evaluating and penalising deviations from the\nstringent expected CP output distribution, a Deep Learning model may learn the\ndirect relationship between the input data and the conformal p-values. We carry\nout a comprehensive empirical evaluation to show our novel loss function's\ncompetitiveness for seven binary and multi-class prediction tasks on five\nbenchmark datasets. On the same datasets, our approach achieves significant\ntraining time reductions up to 86% compared to Aggregated Conformal Prediction\n(ACP), while maintaining comparable approximate validity and predictive\nefficiency.\n","authors":["Julia A. Meister","Khuong An Nguyen","Stelios Kapetanakis","Zhiyuan Luo"],"pdf_url":"https://arxiv.org/pdf/2207.12377v4.pdf","comment":"34 pages, 15 figures, 5 tables"},{"id":"http://arxiv.org/abs/2308.03495v1","updated":"2023-08-07T11:42:50Z","published":"2023-08-07T11:42:50Z","title":"Balanced Face Dataset: Guiding StyleGAN to Generate Labeled Synthetic\n Face Image Dataset for Underrepresented Group","summary":" For a machine learning model to generalize effectively to unseen data within\na particular problem domain, it is well-understood that the data needs to be of\nsufficient size and representative of real-world scenarios. Nonetheless,\nreal-world datasets frequently have overrepresented and underrepresented\ngroups. One solution to mitigate bias in machine learning is to leverage a\ndiverse and representative dataset. Training a model on a dataset that covers\nall demographics is crucial to reducing bias in machine learning. However,\ncollecting and labeling large-scale datasets has been challenging, prompting\nthe use of synthetic data generation and active labeling to decrease the costs\nof manual labeling. The focus of this study was to generate a robust face image\ndataset using the StyleGAN model. In order to achieve a balanced distribution\nof the dataset among different demographic groups, a synthetic dataset was\ncreated by controlling the generation process of StyleGAN and annotated for\ndifferent downstream tasks.\n","authors":["Kidist Amde Mekonnen"],"pdf_url":"https://arxiv.org/pdf/2308.03495v1.pdf","comment":"7 pages, 7 figures, submitted to AMLD Africa 2021 conference"},{"id":"http://arxiv.org/abs/2208.00953v2","updated":"2023-08-07T11:18:47Z","published":"2022-08-01T16:05:14Z","title":"Visual Interpretable and Explainable Deep Learning Models for Brain\n Tumor MRI and COVID-19 Chest X-ray Images","summary":" Deep learning shows promise for medical image analysis but lacks\ninterpretability, hindering adoption in healthcare. 
Attribution techniques that\nexplain model reasoning may increase trust in deep learning among clinical\nstakeholders. This paper aimed to evaluate attribution methods for illuminating\nhow deep neural networks analyze medical images. Using adaptive path-based\ngradient integration, we attributed predictions from brain tumor MRI and\nCOVID-19 chest X-ray datasets made by recent deep convolutional neural network\nmodels. The technique highlighted possible biomarkers, exposed model biases,\nand offered insights into the links between input and prediction. Our analysis\ndemonstrates the method's ability to elucidate model reasoning on these\ndatasets. The resulting attributions show promise for improving deep learning\ntransparency for domain experts by revealing the rationale behind predictions.\nThis study advances model interpretability to increase trust in deep learning\namong healthcare stakeholders.\n","authors":["Yusuf Brima","Marcellin Atemkeng"],"pdf_url":"https://arxiv.org/pdf/2208.00953v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.03345v3","updated":"2023-08-07T11:16:08Z","published":"2023-01-09T13:56:59Z","title":"Latent Spectral Regularization for Continual Learning","summary":" While biological intelligence grows organically as new knowledge is gathered\nthroughout life, Artificial Neural Networks forget catastrophically whenever\nthey face a changing training data distribution. Rehearsal-based Continual\nLearning (CL) approaches have been established as a versatile and reliable\nsolution to overcome this limitation; however, sudden input disruptions and\nmemory constraints are known to alter the consistency of their predictions. We\nstudy this phenomenon by investigating the geometric characteristics of the\nlearner's latent space and find that replayed data points of different classes\nincreasingly mix up, interfering with classification. Hence, we propose a\ngeometric regularizer that enforces weak requirements on the Laplacian spectrum\nof the latent space, promoting a partitioning behavior. We show that our\nproposal, called Continual Spectral Regularizer (CaSpeR), can be easily\ncombined with any rehearsal-based CL approach and improves the performance of\nSOTA methods on standard benchmarks. Finally, we conduct additional analysis to\nprovide insights into CaSpeR's effects and applicability.\n","authors":["Emanuele Frascaroli","Riccardo Benaglia","Matteo Boschini","Luca Moschella","Cosimo Fiorini","Emanuele Rodolà","Simone Calderara"],"pdf_url":"https://arxiv.org/pdf/2301.03345v3.pdf","comment":"8 pages, 3 figures"},{"id":"http://arxiv.org/abs/2308.03476v1","updated":"2023-08-07T11:09:12Z","published":"2023-08-07T11:09:12Z","title":"Exploring the Physical World Adversarial Robustness of Vehicle Detection","summary":" Adversarial attacks can compromise the robustness of real-world detection\nmodels. However, evaluating these models under real-world conditions poses\nchallenges due to resource-intensive experiments. Virtual simulations offer an\nalternative, but the absence of standardized benchmarks hampers progress.\nAddressing this, we propose an innovative instant-level data generation\npipeline using the CARLA simulator. Through this pipeline, we establish the\nDiscrete and Continuous Instant-level (DCI) dataset, enabling comprehensive\nexperiments involving three detection models and three physical adversarial\nattacks. Our findings highlight diverse model performances under adversarial\nconditions. 
Yolo v6 demonstrates remarkable resilience, experiencing just a\nmarginal 6.59% average drop in average precision (AP). In contrast, the ASA\nattack yields a substantial 14.51% average AP reduction, twice the effect of\nother algorithms. We also note that static scenes yield higher recognition AP\nvalues, and outcomes remain relatively consistent across varying weather\nconditions. Intriguingly, our study suggests that advancements in adversarial\nattack algorithms may be approaching their ``limitation''. In summary, our work\nunderscores the significance of adversarial attacks in real-world contexts and\nintroduces the DCI dataset as a versatile benchmark. Our findings provide\nvaluable insights for enhancing the robustness of detection models and offer\nguidance for future research endeavors in the realm of adversarial attacks.\n","authors":["Wei Jiang","Tianyuan Zhang","Shuangcheng Liu","Weiyu Ji","Zichao Zhang","Gang Xiao"],"pdf_url":"https://arxiv.org/pdf/2308.03476v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03472v1","updated":"2023-08-07T11:02:44Z","published":"2023-08-07T11:02:44Z","title":"How to forecast power generation in wind farms? Insights from leveraging\n hierarchical structure","summary":" Forecasting of renewable energy generation provides key insights which may\nhelp with decision-making towards global decarbonisation. Renewable energy\ngeneration can often be represented through cross-sectional hierarchies,\nwhereby a single farm may have multiple individual generators. Hierarchical\nforecasting through reconciliation has demonstrated a significant increase in\nthe quality of forecasts both theoretically and empirically. However, it is not\nevident whether forecasts generated by individual temporal and cross-sectional\naggregation can be superior to integrated cross-temporal forecasts and to\nindividual forecasts on more granular data. In this study, we investigate the\naccuracies of different cross-sectional and cross-temporal reconciliation\nmethods using both linear regression and gradient boosting machine learning for\nforecasting wind farm power generation. We found that cross-temporal\nreconciliation is superior to individual cross-sectional reconciliation at\nmultiple temporal aggregations. Cross-temporally reconciled machine learning\nbase forecasts also demonstrated a high accuracy at coarser temporal\ngranularities, which may encourage adoption for short-term wind forecasts. We\nalso show that linear regression can outperform machine learning models across\nmost levels in cross-sectional wind time series.\n","authors":["Lucas English","Mahdi Abolghasemi"],"pdf_url":"https://arxiv.org/pdf/2308.03472v1.pdf","comment":"22 pages, 11 figures"},{"id":"http://arxiv.org/abs/2306.08432v2","updated":"2023-08-07T10:58:21Z","published":"2023-06-14T11:02:08Z","title":"Batches Stabilize the Minimum Norm Risk in High Dimensional\n Overparameterized Linear Regression","summary":" Learning algorithms that divide the data into batches are prevalent in many\nmachine-learning applications, typically offering useful trade-offs between\ncomputational efficiency and performance. In this paper, we examine the\nbenefits of batch-partitioning through the lens of a minimum-norm\noverparameterized linear regression model with isotropic Gaussian features. 
We\nsuggest a natural small-batch version of the minimum-norm estimator, and derive\nan upper bound on its quadratic risk, showing it is inversely proportional to\nthe noise level as well as to the overparameterization ratio, for the optimal\nchoice of batch size. In contrast to minimum-norm, our estimator admits a\nstable risk behavior that is monotonically increasing in the\noverparameterization ratio, eliminating both the blowup at the interpolation\npoint and the double-descent phenomenon. Interestingly, we observe that this\nimplicit regularization offered by the batch partition is partially explained\nby feature overlap between the batches. Our bound is derived via a novel\ncombination of techniques, in particular normal approximation in the\nWasserstein metric of noisy projections over random subspaces.\n","authors":["Shahar Stein Ioushua","Inbar Hasidim","Ofer Shayevitz","Meir Feder"],"pdf_url":"https://arxiv.org/pdf/2306.08432v2.pdf","comment":"55 pages"},{"id":"http://arxiv.org/abs/2308.03464v1","updated":"2023-08-07T10:43:48Z","published":"2023-08-07T10:43:48Z","title":"Wide Gaps and Clustering Axioms","summary":" The widely applied k-means algorithm produces clusterings that violate our\nexpectations with respect to high/low similarity/density and is in conflict\nwith Kleinberg's axiomatic system for distance based clustering algorithms that\nformalizes those expectations in a natural way. k-means violates in particular\nthe consistency axiom. We hypothesise that this clash is due to the unstated\nexpectation that the data themselves should have the property of\nbeing clusterable in order to expect the algorithm clustering them to fit a\nclustering axiomatic system. To demonstrate this, we introduce two new\nclusterability properties, variational k-separability and residual\nk-separability, and show that Kleinberg's consistency axiom then holds for\nk-means operating in the Euclidean or non-Euclidean space. Furthermore, we\npropose extensions of the k-means algorithm that approximately fit Kleinberg's\nrichness axiom, which does not hold for k-means. In this way, we reconcile\nk-means with Kleinberg's axiomatic framework in Euclidean and non-Euclidean\nsettings. Besides contributing to the theory of axiomatic frameworks of\nclustering and to clusterability theory, a practical contribution is the\npossibility to construct datasets for testing algorithms that\noptimize the k-means cost function. This includes a method of constructing\nclusterable data with a global optimum known in advance.\n","authors":["Mieczysław A. Kłopotek"],"pdf_url":"https://arxiv.org/pdf/2308.03464v1.pdf","comment":"14 Theorems. arXiv admin note: substantial text overlap with\n arXiv:2211.17036"},{"id":"http://arxiv.org/abs/2308.03457v1","updated":"2023-08-07T10:25:54Z","published":"2023-08-07T10:25:54Z","title":"Cross-Silo Prototypical Calibration for Federated Learning with Non-IID\n Data","summary":" Federated Learning aims to learn a global model on the server side that\ngeneralizes to all clients in a privacy-preserving manner, by leveraging the\nlocal models from different clients. Existing solutions focus on either\nregularizing the objective functions among clients or improving the aggregation\nmechanism for the improved model generalization capability. However, their\nperformance is typically limited by the dataset biases, such as the\nheterogeneous data distributions and the missing classes. 
To address this\nissue, this paper presents a cross-silo prototypical calibration method\n(FedCSPC), which takes additional prototype information from the clients to\nlearn a unified feature space on the server side. Specifically, FedCSPC first\nemploys the Data Prototypical Modeling (DPM) module to learn data patterns via\nclustering to aid calibration. Subsequently, the cross-silo prototypical\ncalibration (CSPC) module develops an augmented contrastive learning method to\nimprove the robustness of the calibration, which can effectively project\ncross-source features into a consistent space while maintaining clear decision\nboundaries. Moreover, the CSPC module's ease of implementation and\nplug-and-play characteristics make it even more remarkable. Experiments were\nconducted on four datasets in terms of performance comparison, ablation study,\nin-depth analysis and case study, and the results verified that FedCSPC is\ncapable of learning the consistent features across different data sources of\nthe same class under the guidance of calibrated model, which leads to better\nperformance than the state-of-the-art methods. The source codes have been\nreleased at https://github.com/qizhuang-qz/FedCSPC.\n","authors":["Zhuang Qi","Lei Meng","Zitan Chen","Han Hu","Hui Lin","Xiangxu Meng"],"pdf_url":"https://arxiv.org/pdf/2308.03457v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.07176v2","updated":"2023-08-07T10:09:21Z","published":"2023-05-11T23:12:13Z","title":"Automatic Radiology Report Generation by Learning with Increasingly Hard\n Negatives","summary":" Automatic radiology report generation is challenging as medical images or\nreports are usually similar to each other due to the common content of anatomy.\nThis makes a model hard to capture the uniqueness of individual images and is\nprone to producing undesired generic or mismatched reports. This situation\ncalls for learning more discriminative features that could capture even\nfine-grained mismatches between images and reports. To achieve this, this paper\nproposes a novel framework to learn discriminative image and report features by\ndistinguishing them from their closest peers, i.e., hard negatives. Especially,\nto attain more discriminative features, we gradually raise the difficulty of\nsuch a learning task by creating increasingly hard negative reports for each\nimage in the feature space during training, respectively. By treating the\nincreasingly hard negatives as auxiliary variables, we formulate this process\nas a min-max alternating optimisation problem. At each iteration, conditioned\non a given set of hard negative reports, image and report features are learned\nas usual by minimising the loss functions related to report generation. After\nthat, a new set of harder negative reports will be created by maximising a loss\nreflecting image-report alignment. By solving this optimisation, we attain a\nmodel that can generate more specific and accurate reports. It is noteworthy\nthat our framework enhances discriminative feature learning without introducing\nextra network weights. Also, in contrast to the existing way of generating hard\nnegatives, our framework extends beyond the granularity of the dataset by\ngenerating harder samples out of the training set. 
An experimental study on\nbenchmark datasets verifies the efficacy of our framework and shows that it can\nserve as a plug-in to readily improve existing medical report generation\nmodels.\n","authors":["Bhanu Prakash Voutharoja","Lei Wang","Luping Zhou"],"pdf_url":"https://arxiv.org/pdf/2305.07176v2.pdf","comment":"Accepted to European Conference on Artificial Intelligence (ECAI)\n 2023"},{"id":"http://arxiv.org/abs/2306.07886v3","updated":"2023-08-07T10:01:49Z","published":"2023-06-13T16:25:30Z","title":"Symmetry & Critical Points for Symmetric Tensor Decomposition Problems","summary":" We consider the nonconvex optimization problem associated with the\ndecomposition of a real symmetric tensor into a sum of rank one terms. Use is\nmade of the rich symmetry structure to construct infinite families of critical\npoints represented by Puiseux series in the problem dimension, and so obtain\nprecise analytic estimates on the value of the objective function and the\nHessian spectrum. The results allow an analytic characterization of various\nobstructions to using local optimization methods, revealing in particular a\ncomplex array of saddles and minima differing by their symmetry, structure and\nanalytic properties. A desirable phenomenon, occurring for all critical points\nconsidered, concerns the number of negative Hessian eigenvalues increasing with\nthe value of the objective function. Our approach makes use of Newton polyhedra\nas well as results from real algebraic geometry, notably the Curve Selection\nLemma, to determine the extremal character of degenerate critical points,\nestablishing in particular the existence of infinite families of third-order\nsaddles which can significantly slow down the optimization process.\n","authors":["Yossi Arjevani","Gal Vinograd"],"pdf_url":"https://arxiv.org/pdf/2306.07886v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03443v1","updated":"2023-08-07T10:00:07Z","published":"2023-08-07T10:00:07Z","title":"Doubly Robust Estimator for Off-Policy Evaluation with Large Action\n Spaces","summary":" We study Off-Policy Evaluation (OPE) in contextual bandit settings with large\naction spaces. The benchmark estimators suffer from severe bias and variance\ntradeoffs. Parametric approaches suffer from bias due to difficulty specifying\nthe correct model, whereas importance-weighting estimators suffer from variance. To\novercome these limitations, Marginalized Inverse Propensity Scoring (MIPS) was\nproposed to mitigate the estimator's variance via embeddings of an action. To\nmake the estimator more accurate, we propose the doubly robust estimator of\nMIPS called the Marginalized Doubly Robust (MDR) estimator. Theoretical\nanalysis shows that the proposed estimator is unbiased under weaker assumptions\nthan MIPS while maintaining variance reduction against IPS, which was the main\nadvantage of MIPS. Empirical experiments verify the superiority of MDR\nover existing estimators.\n","authors":["Tatsuhiro Shimizu"],"pdf_url":"https://arxiv.org/pdf/2308.03443v1.pdf","comment":"6 pages, 1 figure"},{"id":"http://arxiv.org/abs/2301.09930v2","updated":"2023-08-07T09:48:44Z","published":"2023-01-24T11:27:17Z","title":"Quadruple-star systems are not always nested triples: a machine learning\n approach to dynamical stability","summary":" The dynamical stability of quadruple-star systems has traditionally been\ntreated as a problem involving two `nested' triples which constitute a\nquadruple. 
In this novel study, we employed a machine learning algorithm, the\nmulti-layer perceptron (MLP), to directly classify 2+2 and 3+1 quadruples based\non their stability (or long-term boundedness). The training data sets for the\nclassification, comprised of $5\\times10^5$ quadruples each, were integrated\nusing the highly accurate direct $N$-body code MSTAR. We also carried out a\nlimited parameter space study of zero-inclination systems to directly compare\nquadruples to triples. We found that both our quadruple MLP models perform\nbetter than a `nested' triple MLP approach, which is especially significant for\n3+1 quadruples. The classification accuracies for the 2+2 MLP and 3+1 MLP\nmodels are 94% and 93% respectively, while the scores for the `nested' triple\napproach are 88% and 66% respectively. This is a crucial implication for\nquadruple population synthesis studies. Our MLP models, which are very simple\nand almost instantaneous to implement, are available on GitHub, along with\nPython3 scripts to access them.\n","authors":["Pavan Vynatheya","Rosemary A. Mardling","Adrian S. Hamers"],"pdf_url":"https://arxiv.org/pdf/2301.09930v2.pdf","comment":"Accepted for publication by MNRAS"},{"id":"http://arxiv.org/abs/2306.09780v2","updated":"2023-08-07T09:25:55Z","published":"2023-06-16T11:33:47Z","title":"Understanding Deep Generative Models with Generalized Empirical\n Likelihoods","summary":" Understanding how well a deep generative model captures a distribution of\nhigh-dimensional data remains an important open challenge. It is especially\ndifficult for certain model classes, such as Generative Adversarial Networks\nand Diffusion Models, whose models do not admit exact likelihoods. In this\nwork, we demonstrate that generalized empirical likelihood (GEL) methods offer\na family of diagnostic tools that can identify many deficiencies of deep\ngenerative models (DGMs). We show, with appropriate specification of moment\nconditions, that the proposed method can identify which modes have been\ndropped, the degree to which DGMs are mode imbalanced, and whether DGMs\nsufficiently capture intra-class diversity. We show how to combine techniques\nfrom Maximum Mean Discrepancy and Generalized Empirical Likelihood to create\nnot only distribution tests that retain per-sample interpretability, but also\nmetrics that include label information. We find that such tests predict the\ndegree of mode dropping and mode imbalance up to 60% better than metrics such\nas improved precision/recall. We provide an implementation at\nhttps://github.com/deepmind/understanding_deep_generative_models_with_generalized_empirical_likelihood/.\n","authors":["Suman Ravuri","Mélanie Rey","Shakir Mohamed","Marc Deisenroth"],"pdf_url":"https://arxiv.org/pdf/2306.09780v2.pdf","comment":"Computer Vision and Pattern Recognition 2023 (Highlight, top 2.6% of\n submissions)"},{"id":"http://arxiv.org/abs/2210.14245v2","updated":"2023-08-07T09:09:48Z","published":"2022-10-25T18:00:25Z","title":"CaloFlow for CaloChallenge Dataset 1","summary":" CaloFlow is a new and promising approach to fast calorimeter simulation based\non normalizing flows. Applying CaloFlow to the photon and charged pion Geant4\nshowers of Dataset 1 of the Fast Calorimeter Simulation Challenge 2022, we show\nhow it can produce high-fidelity samples with a sampling time that is several\norders of magnitude faster than Geant4. 
We demonstrate the fidelity of the\nsamples using calorimeter shower images, histograms of high-level features, and\naggregate metrics such as a classifier trained to distinguish CaloFlow from\nGeant4 samples.\n","authors":["Claudius Krause","Ian Pang","David Shih"],"pdf_url":"https://arxiv.org/pdf/2210.14245v2.pdf","comment":"32 pages, 18 figures, v2: updated pion evaluation"},{"id":"http://arxiv.org/abs/2308.03417v1","updated":"2023-08-07T09:08:39Z","published":"2023-08-07T09:08:39Z","title":"PURL: Safe and Effective Sanitization of Link Decoration","summary":" While privacy-focused browsers have taken steps to block third-party cookies\nand browser fingerprinting, novel tracking methods that bypass existing\ndefenses continue to emerge. Since trackers need to exfiltrate information from\nthe client- to server-side through link decoration regardless of the tracking\ntechnique they employ, a promising orthogonal approach is to detect and\nsanitize tracking information in decorated links. We present PURL, a\nmachine-learning approach that leverages a cross-layer graph representation of\nwebpage execution to safely and effectively sanitize link decoration. Our\nevaluation shows that PURL significantly outperforms existing countermeasures\nin terms of accuracy and reducing website breakage while being robust to common\nevasion techniques. We use PURL to perform a measurement study on top-million\nwebsites. We find that link decorations are widely abused by well-known\nadvertisers and trackers to exfiltrate user information collected from browser\nstorage, email addresses, and scripts involved in fingerprinting.\n","authors":["Shaoor Munir","Patrick Lee","Umar Iqbal","Zubair Shafiq","Sandra Siby"],"pdf_url":"https://arxiv.org/pdf/2308.03417v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.12414v3","updated":"2023-08-07T08:54:11Z","published":"2023-03-22T09:23:29Z","title":"Delay-Aware Hierarchical Federated Learning","summary":" Federated learning has gained popularity as a means of training models\ndistributed across the wireless edge. The paper introduces delay-aware\nhierarchical federated learning (DFL) to improve the efficiency of distributed\nmachine learning (ML) model training by accounting for communication delays\nbetween edge and cloud. Different from traditional federated learning, DFL\nleverages multiple stochastic gradient descent iterations on device datasets\nwithin each global aggregation period and intermittently aggregates model\nparameters through edge servers in local subnetworks. During global\nsynchronization, the cloud server consolidates local models with the outdated\nglobal model using a local-global combiner, thus preserving crucial elements of\nboth, enhancing learning efficiency under the presence of delay. A set of\nconditions is obtained to achieve the sub-linear convergence rate of O(1/k).\nBased on these findings, an adaptive control algorithm is developed for DFL,\nimplementing policies to mitigate energy consumption and communication latency\nwhile aiming for a sublinear convergence rate. Numerical evaluations show DFL's\nsuperior performance in terms of faster global model convergence, reduced\nresource consumption, and robustness against communication delays compared to\nexisting FL algorithms. 
In summary, this proposed method offers improved\nefficiency and results when dealing with both convex and non-convex loss\nfunctions.\n","authors":["Frank Po-Chen Lin","Seyyedali Hosseinalipour","Nicolò Michelusi","Christopher Brinton"],"pdf_url":"https://arxiv.org/pdf/2303.12414v3.pdf","comment":"A condensed version of this paper was presented at IEEE Globecom 2020"},{"id":"http://arxiv.org/abs/2308.03404v1","updated":"2023-08-07T08:46:10Z","published":"2023-08-07T08:46:10Z","title":"Applied metamodelling for ATM performance simulations","summary":" The use of Air traffic management (ATM) simulators for planning and operations\ncan be challenging due to their modelling complexity. This paper presents XALM\n(eXplainable Active Learning Metamodel), a three-step framework integrating\nactive learning and SHAP (SHapley Additive exPlanations) values into simulation\nmetamodels for supporting ATM decision-making. XALM efficiently uncovers hidden\nrelationships among input and output variables in ATM simulators, which are usually\nof interest in policy analysis. Our experiments show XALM's predictive\nperformance comparable to the XGBoost metamodel with fewer simulations.\nAdditionally, XALM exhibits superior explanatory capabilities compared to\nnon-active learning metamodels.\n Using the `Mercury' (flight and passenger) ATM simulator, XALM is applied to\na real-world scenario in Paris Charles de Gaulle airport, extending an arrival\nmanager's range and scope by analysing six variables. This case study\nillustrates XALM's effectiveness in enhancing simulation interpretability and\nunderstanding variable interactions. By addressing computational challenges and\nimproving explainability, XALM complements traditional simulation-based\nanalyses.\n Lastly, we discuss two practical approaches for reducing the computational\nburden of the metamodelling further: we introduce a stopping criterion for\nactive learning based on the inherent uncertainty of the metamodel, and we show\nhow the simulations used for the metamodel can be reused across key performance\nindicators, thus decreasing the overall number of simulations needed.\n","authors":["Christoffer Riis","Francisco N. Antunes","Tatjana Bolić","Gérald Gurtner","Andrew Cook","Carlos Lima Azevedo","Francisco Câmara Pereira"],"pdf_url":"https://arxiv.org/pdf/2308.03404v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03403v1","updated":"2023-08-07T08:44:15Z","published":"2023-08-07T08:44:15Z","title":"Towards Machine Learning-based Fish Stock Assessment","summary":" The accurate assessment of fish stocks is crucial for sustainable fisheries\nmanagement. However, existing statistical stock assessment models can have low\nforecast performance of relevant stock parameters like recruitment or spawning\nstock biomass, especially in ecosystems that are changing due to global warming\nand other anthropogenic stressors. In this paper, we investigate the use of\nmachine learning models to improve the estimation and forecast of such stock\nparameters. We propose a hybrid model that combines classical statistical stock\nassessment models with supervised ML, specifically gradient boosted trees. Our\nhybrid model leverages the initial estimate provided by the classical model and\nuses the ML model to make a post-hoc correction to improve accuracy. We\nexperiment with five different stocks and find that the forecast accuracy of\nrecruitment and spawning stock biomass improves considerably in most cases.\n","authors":["Stefan Lüdtke","Maria E. 
Pierce"],"pdf_url":"https://arxiv.org/pdf/2308.03403v1.pdf","comment":"Accepted at Fragile Earth Workshop 2023"},{"id":"http://arxiv.org/abs/2307.12306v2","updated":"2023-08-07T08:36:45Z","published":"2023-07-23T12:18:12Z","title":"Tackling the Curse of Dimensionality with Physics-Informed Neural\n Networks","summary":" The curse-of-dimensionality (CoD) taxes computational resources heavily with\nexponentially increasing computational cost as the dimension increases. This\nposes great challenges in solving high-dimensional PDEs as Richard Bellman\nfirst pointed out over 60 years ago. While there has been some recent success\nin solving numerically partial differential equations (PDEs) in high\ndimensions, such computations are prohibitively expensive, and true scaling of\ngeneral nonlinear PDEs to high dimensions has never been achieved. In this\npaper, we develop a new method of scaling up physics-informed neural networks\n(PINNs) to solve arbitrary high-dimensional PDEs. The new method, called\nStochastic Dimension Gradient Descent (SDGD), decomposes a gradient of PDEs\ninto pieces corresponding to different dimensions and samples randomly a subset\nof these dimensional pieces in each iteration of training PINNs. We\ntheoretically prove the convergence guarantee and other desired properties of\nthe proposed method. We experimentally demonstrate that the proposed method\nallows us to solve many notoriously hard high-dimensional PDEs, including the\nHamilton-Jacobi-Bellman (HJB) and the Schr\\\"{o}dinger equations in thousands of\ndimensions very fast on a single GPU using the PINNs mesh-free approach. For\ninstance, we solve nontrivial nonlinear PDEs (one HJB equation and one\nBlack-Scholes equation) in 100,000 dimensions in 6 hours on a single GPU using\nSDGD with PINNs. Since SDGD is a general training methodology of PINNs, SDGD\ncan be applied to any current and future variants of PINNs to scale them up for\narbitrary high-dimensional PDEs.\n","authors":["Zheyuan Hu","Khemraj Shukla","George Em Karniadakis","Kenji Kawaguchi"],"pdf_url":"https://arxiv.org/pdf/2307.12306v2.pdf","comment":"37 pages, 8 figures"},{"id":"http://arxiv.org/abs/2308.03382v1","updated":"2023-08-07T08:03:20Z","published":"2023-08-07T08:03:20Z","title":"Enhancing Nucleus Segmentation with HARU-Net: A Hybrid Attention Based\n Residual U-Blocks Network","summary":" Nucleus image segmentation is a crucial step in the analysis, pathological\ndiagnosis, and classification, which heavily relies on the quality of nucleus\nsegmentation. However, the complexity of issues such as variations in nucleus\nsize, blurred nucleus contours, uneven staining, cell clustering, and\noverlapping cells poses significant challenges. Current methods for nucleus\nsegmentation primarily rely on nuclear morphology or contour-based approaches.\nNuclear morphology-based methods exhibit limited generalization ability and\nstruggle to effectively predict irregular-shaped nuclei, while contour-based\nextraction methods face challenges in accurately segmenting overlapping nuclei.\nTo address the aforementioned issues, we propose a dual-branch network using\nhybrid attention based residual U-blocks for nucleus instance segmentation. The\nnetwork simultaneously predicts target information and target contours.\nAdditionally, we introduce a post-processing method that combines the target\ninformation and target contours to distinguish overlapping nuclei and generate\nan instance segmentation image. 
Within the network, we propose a context fusion\nblock (CF-block) that effectively extracts and merges contextual information\nfrom the network. Extensive quantitative evaluations are conducted to assess\nthe performance of our method. Experimental results demonstrate the superior\nperformance of the proposed method compared to state-of-the-art approaches on\nthe BNS, MoNuSeg, CoNSeg, and CPM-17 datasets.\n","authors":["Junzhou Chen","Qian Huang","Yulin Chen","Linyi Qian","Chengyuan Yu"],"pdf_url":"https://arxiv.org/pdf/2308.03382v1.pdf","comment":"Nucleus segmentation, Deep learning, Instance segmentation, Medical\n imaging, Dual-Branch network"},{"id":"http://arxiv.org/abs/2304.14104v2","updated":"2023-08-07T07:52:35Z","published":"2023-04-27T11:32:48Z","title":"Learning Human-Human Interactions in Images from Weak Textual\n Supervision","summary":" Interactions between humans are diverse and context-dependent, but previous\nworks have treated them as categorical, disregarding the heavy tail of possible\ninteractions. We propose a new paradigm of learning human-human interactions as\nfree text from a single still image, allowing for flexibility in modeling the\nunlimited space of situations and relationships between people. To overcome the\nabsence of data labelled specifically for this task, we use knowledge\ndistillation applied to synthetic caption data produced by a large language\nmodel without explicit supervision. We show that the pseudo-labels produced by\nthis procedure can be used to train a captioning model to effectively\nunderstand human-human interactions in images, as measured by a variety of\nmetrics that measure textual and semantic faithfulness and factual groundedness\nof our predictions. We further show that our approach outperforms SOTA image\ncaptioning and situation recognition models on this task. We will release our\ncode and pseudo-labels along with Waldo and Wenda, a manually-curated test set\nfor still image human-human interaction understanding.\n","authors":["Morris Alper","Hadar Averbuch-Elor"],"pdf_url":"https://arxiv.org/pdf/2304.14104v2.pdf","comment":"To be presented at ICCV 2023. Project webpage:\n https://learning-interactions.github.io"},{"id":"http://arxiv.org/abs/2302.02807v2","updated":"2023-08-07T07:43:37Z","published":"2023-02-06T14:31:51Z","title":"Federated Survival Forests","summary":" Survival analysis is a subfield of statistics concerned with modeling the\noccurrence time of a particular event of interest for a population. Survival\nanalysis found widespread applications in healthcare, engineering, and social\nsciences. However, real-world applications involve survival datasets that are\ndistributed, incomplete, censored, and confidential. In this context, federated\nlearning can tremendously improve the performance of survival analysis\napplications. Federated learning provides a set of privacy-preserving\ntechniques to jointly train machine learning models on multiple datasets\nwithout compromising user privacy, leading to a better generalization\nperformance. However, despite the widespread development of federated learning\nin recent AI research, few studies focus on federated survival analysis. In\nthis work, we present a novel federated algorithm for survival analysis based\non one of the most successful survival models, the random survival forest. We\ncall the proposed method Federated Survival Forest (FedSurF). 
With a single\ncommunication round, FedSurF obtains a discriminative power comparable to\ndeep-learning-based federated models trained over hundreds of federated\niterations. Moreover, FedSurF retains all the advantages of random forests,\nnamely low computational cost and natural handling of missing values and\nincomplete datasets. These advantages are especially desirable in real-world\nfederated environments with multiple small datasets stored on devices with low\ncomputational capabilities. Numerical experiments compare FedSurF with\nstate-of-the-art survival models in federated networks, showing how FedSurF\noutperforms deep-learning-based federated algorithms in realistic environments\nwith non-identically distributed data.\n","authors":["Alberto Archetti","Matteo Matteucci"],"pdf_url":"https://arxiv.org/pdf/2302.02807v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.05628v2","updated":"2023-08-07T07:41:47Z","published":"2023-07-11T06:30:43Z","title":"DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence\n Analysis Tasks","summary":" GPT has been proven to be capable of extracting general information from\nlanguage sequences, thereby benefiting all downstream tasks. This motivates us\nto use pre-trained models to explore the hidden inherent information in DNA\nsequences. However, data and task requirements in DNA sequence analyses come in\ndifferent formats such as generation, prediction and regression, vary in\ncomplexity, and involve different modalities, such as nucleotide sequences\nand expression levels. Existing BERT-based models are mostly for\ngeneration tasks and use sequence data as input and output, thus cannot easily\nhandle various DNA analysis tasks in one single model. Herein, we propose a\ngeneralized pre-trained DNA model, DNAGPT, that was trained on over 200\nbillion base pairs from all mammals. We enhance the classic GPT model by\nadding a binary classification task (DNA sequence order) and a numerical regression\ntask (guanine-cytosine content prediction) in the pre-training period, and by\nextending the architecture with corresponding embedding layers and encoding\nheads. We also design a comprehensive token language to encode sequence, number\nand task-related information in the same token space. Therefore, DNAGPT can\nhandle versatile DNA analysis tasks and simultaneously process both\nsequence and numerical data. We have evaluated our model on genomic signal and\nregion recognition, pseudo-genome generation and mRNA abundance regression\ntasks. We demonstrate that, benefiting from pre-training, DNAGPT shows\nsuperior performance compared to existing models specially designed for various\ndownstream tasks.\n","authors":["Daoan Zhang","Weitong Zhang","Bing He","Yu Zhao","Jianguo Zhang","Chenchen Qin","Jianhua Yao"],"pdf_url":"https://arxiv.org/pdf/2307.05628v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03363v1","updated":"2023-08-07T07:37:26Z","published":"2023-08-07T07:37:26Z","title":"A reading survey on adversarial machine learning: Adversarial attacks\n and their understanding","summary":" Deep Learning has empowered us to train neural networks for complex data with\nhigh performance. However, with the growing research, several vulnerabilities\nin neural networks have been exposed. A particular branch of research,\nAdversarial Machine Learning, exploits and understands some of the\nvulnerabilities that cause the neural networks to misclassify for near original\ninput. 
A class of algorithms called adversarial attacks is proposed to make the\nneural networks misclassify for various tasks in different domains. With the\nextensive and growing research in adversarial attacks, it is crucial to\nunderstand the classification of adversarial attacks. This will help us\nunderstand the vulnerabilities in a systematic order and help us to mitigate\nthe effects of adversarial attacks. This article provides a survey of existing\nadversarial attacks and their understanding based on different perspectives. We\nalso provide a brief overview of existing adversarial defences and their\nlimitations in mitigating the effect of adversarial attacks. Further, we\nconclude with a discussion on the future research directions in the field of\nadversarial machine learning.\n","authors":["Shashank Kotyan"],"pdf_url":"https://arxiv.org/pdf/2308.03363v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.11312v3","updated":"2023-08-07T07:20:51Z","published":"2022-07-22T19:38:25Z","title":"HybMT: Hybrid Meta-Predictor based ML Algorithm for Fast Test Vector\n Generation","summary":" ML models are increasingly being used to increase the test coverage and\ndecrease the overall testing time. This field is still in its nascent stage and\nup till now there were no algorithms that could match or outperform commercial\ntools in terms of speed and accuracy for large circuits. We propose an ATPG\nalgorithm HybMT in this paper that finally breaks this barrier. Like sister\nmethods, we augment the classical PODEM algorithm that uses recursive\nbacktracking. We design a custom 2-level predictor that predicts the input net\nof a logic gate whose value needs to be set to ensure that the output is a\ngiven value (0 or 1). Our predictor chooses the output from among two\nfirst-level predictors, where the most effective one is a bespoke neural\nnetwork and the other is an SVM regressor. As compared to a popular,\nstate-of-the-art commercial ATPG tool, HybMT shows an overall reduction of\n56.6% in the CPU time without compromising on the fault coverage for the EPFL\nbenchmark circuits. HybMT also shows a speedup of 126.4% over the best ML-based\nalgorithm while obtaining an equal or better fault coverage for the EPFL\nbenchmark circuits.\n","authors":["Shruti Pandey"," Jayadeva","Smruti R. Sarangi"],"pdf_url":"https://arxiv.org/pdf/2207.11312v3.pdf","comment":"6 pages, 5 figures and 5 tables. Changes from the previous version:\n We modified our novel neural network model \"HybNN\" with a skip connection and\n found a significant improvement in the fault coverage and runtime of our\n HybMT-based PODEM algorithm. We train on the smaller ISCAS'85 circuits,\n report the results for the EPFL benchmark circuits (most recent and up to 70X\n large)"},{"id":"http://arxiv.org/abs/2303.01254v3","updated":"2023-08-07T07:07:25Z","published":"2023-02-13T10:33:21Z","title":"Privacy-Preserving Tree-Based Inference with TFHE","summary":" Privacy enhancing technologies (PETs) have been proposed as a way to protect\nthe privacy of data while still allowing for data analysis. In this work, we\nfocus on Fully Homomorphic Encryption (FHE), a powerful tool that allows for\narbitrary computations to be performed on encrypted data. FHE has received lots\nof attention in the past few years and has reached realistic execution times\nand correctness.\n More precisely, we explain in this paper how we apply FHE to tree-based\nmodels and get state-of-the-art solutions over encrypted tabular data. 
We show\nthat our method is applicable to a wide range of tree-based models, including\ndecision trees, random forests, and gradient boosted trees, and has been\nimplemented within the Concrete-ML library, which is open-source at\nhttps://github.com/zama-ai/concrete-ml. With a selected set of use-cases, we\ndemonstrate that our FHE version is very close to the unprotected version in\nterms of accuracy.\n","authors":["Jordan Frery","Andrei Stoian","Roman Bredehoft","Luis Montero","Celia Kherfallah","Benoit Chevallier-Mames","Arthur Meyre"],"pdf_url":"https://arxiv.org/pdf/2303.01254v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.10510v2","updated":"2023-08-07T06:40:13Z","published":"2022-09-21T17:15:58Z","title":"Learning to Relight Portrait Images via a Virtual Light Stage and\n Synthetic-to-Real Adaptation","summary":" Given a portrait image of a person and an environment map of the target\nlighting, portrait relighting aims to re-illuminate the person in the image as\nif the person appeared in an environment with the target lighting. To achieve\nhigh-quality results, recent methods rely on deep learning. An effective\napproach is to supervise the training of deep neural networks with a\nhigh-fidelity dataset of desired input-output pairs, captured with a light\nstage. However, acquiring such data requires an expensive special capture rig\nand time-consuming efforts, limiting access to only a few resourceful\nlaboratories. To address the limitation, we propose a new approach that can\nperform on par with the state-of-the-art (SOTA) relighting methods without\nrequiring a light stage. Our approach is based on the realization that a\nsuccessful relighting of a portrait image depends on two conditions. First, the\nmethod needs to mimic the behaviors of physically-based relighting. Second, the\noutput has to be photorealistic. To meet the first condition, we propose to\ntrain the relighting network with training data generated by a virtual light\nstage that performs physically-based rendering on various 3D synthetic humans\nunder different environment maps. To meet the second condition, we develop a\nnovel synthetic-to-real approach to bring photorealism to the relighting\nnetwork output. In addition to achieving SOTA results, our approach offers\nseveral advantages over the prior methods, including controllable glares on\nglasses and more temporally-consistent results for relighting videos.\n","authors":["Yu-Ying Yeh","Koki Nagano","Sameh Khamis","Jan Kautz","Ming-Yu Liu","Ting-Chun Wang"],"pdf_url":"https://arxiv.org/pdf/2209.10510v2.pdf","comment":"To appear in ACM Transactions on Graphics (SIGGRAPH Asia 2022). 21\n pages, 25 figures, 7 tables. Project page:\n https://research.nvidia.com/labs/dir/lumos/"},{"id":"http://arxiv.org/abs/2308.03337v1","updated":"2023-08-07T06:38:59Z","published":"2023-08-07T06:38:59Z","title":"Solving Falkner-Skan type equations via Legendre and Chebyshev Neural\n Blocks","summary":" In this paper, a new deep-learning architecture for solving the non-linear\nFalkner-Skan equation is proposed. Using Legendre and Chebyshev neural blocks,\nthis approach shows how orthogonal polynomials can be used in neural networks\nto increase the approximation capability of artificial neural networks. In\naddition, utilizing the mathematical properties of these functions, we overcome\nthe computational complexity of the backpropagation algorithm by using the\noperational matrices of the derivative. 
The efficiency of the proposed method\nis demonstrated by simulating various configurations of the Falkner-Skan\nequation.\n","authors":["Alireza Afzal Aghaei","Kourosh Parand","Ali Nikkhah","Shakila Jaberi"],"pdf_url":"https://arxiv.org/pdf/2308.03337v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.03960v2","updated":"2023-08-07T06:35:25Z","published":"2023-05-06T07:06:47Z","title":"Beyond Rule-based Named Entity Recognition and Relation Extraction for\n Process Model Generation from Natural Language Text","summary":" Process-aware information systems offer extensive advantages to companies,\nfacilitating planning, operations, and optimization of day-to-day business\nactivities. However, the time-consuming but required step of designing formal\nbusiness process models often hampers the potential of these systems. To\novercome this challenge, automated generation of business process models from\nnatural language text has emerged as a promising approach to expedite this\nstep. Generally, two crucial subtasks have to be solved: extracting\nprocess-relevant information from natural language and creating the actual\nmodel. Approaches towards the first subtask are rule-based methods, highly\noptimized for specific domains, but hard to adapt to related applications. To\nsolve this issue, we present an extension to an existing pipeline to make it\nentirely data-driven. We demonstrate the competitiveness of our improved\npipeline, which not only eliminates the substantial overhead associated with\nfeature engineering and rule definition, but also enables adaptation to\ndifferent datasets, entity and relation types, and new domains. Additionally,\nthe largest available dataset (PET) for the first subtask contains no\ninformation about linguistic references between mentions of entities in the\nprocess description. Yet, the resolution of these mentions into a single visual\nelement is essential for high-quality process models. We propose an extension\nto the PET dataset that incorporates information about linguistic references\nand a corresponding method for resolving them. Finally, we provide a detailed\nanalysis of the inherent challenges in the dataset at hand.\n","authors":["Julian Neuberger","Lars Ackermann","Stefan Jablonski"],"pdf_url":"https://arxiv.org/pdf/2305.03960v2.pdf","comment":"Currently under review for CoopIS23"},{"id":"http://arxiv.org/abs/2305.18462v2","updated":"2023-08-07T06:32:56Z","published":"2023-05-29T07:06:03Z","title":"Membership Inference Attacks against Language Models via Neighbourhood\n Comparison","summary":" Membership Inference attacks (MIAs) aim to predict whether a data sample was\npresent in the training data of a machine learning model or not, and are widely\nused for assessing the privacy risks of language models. Most existing attacks\nrely on the observation that models tend to assign higher probabilities to\ntheir training samples than non-training points. However, simple thresholding\nof the model score in isolation tends to lead to high false-positive rates as\nit does not account for the intrinsic complexity of a sample. Recent work has\ndemonstrated that reference-based attacks which compare model scores to those\nobtained from a reference model trained on similar data can substantially\nimprove the performance of MIAs. However, in order to train reference models,\nattacks of this kind make the strong and arguably unrealistic assumption that\nan adversary has access to samples closely resembling the original training\ndata. 
Therefore, we investigate their performance in more realistic scenarios\nand find that they are highly fragile in relation to the data distribution used\nto train reference models. To investigate whether this fragility provides a\nlayer of safety, we propose and evaluate neighbourhood attacks, which compare\nmodel scores for a given sample to scores of synthetically generated neighbour\ntexts and therefore eliminate the need for access to the training data\ndistribution. We show that, in addition to being competitive with\nreference-based attacks that have perfect knowledge about the training data\ndistribution, our attack clearly outperforms existing reference-free attacks as\nwell as reference-based attacks with imperfect knowledge, which demonstrates\nthe need for a reevaluation of the threat model of adversarial attacks.\n","authors":["Justus Mattern","Fatemehsadat Mireshghallah","Zhijing Jin","Bernhard Schölkopf","Mrinmaya Sachan","Taylor Berg-Kirkpatrick"],"pdf_url":"https://arxiv.org/pdf/2305.18462v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03330v1","updated":"2023-08-07T06:23:24Z","published":"2023-08-07T06:23:24Z","title":"Expediting Neural Network Verification via Network Reduction","summary":" A wide range of verification methods have been proposed to verify the safety\nproperties of deep neural networks ensuring that the networks function\ncorrectly in critical applications. However, many well-known verification tools\nstill struggle with complicated network architectures and large network sizes.\nIn this work, we propose a network reduction technique as a pre-processing\nmethod prior to verification. The proposed method reduces neural networks via\neliminating stable ReLU neurons, and transforming them into a sequential neural\nnetwork consisting of ReLU and Affine layers which can be handled by the most\nverification tools. We instantiate the reduction technique on the\nstate-of-the-art complete and incomplete verification tools, including\nalpha-beta-crown, VeriNet and PRIMA. Our experiments on a large set of\nbenchmarks indicate that the proposed technique can significantly reduce neural\nnetworks and speed up existing verification tools. Furthermore, the experiment\nresults also show that network reduction can improve the availability of\nexisting verification tools on many networks by reducing them into sequential\nneural networks.\n","authors":["Yuyi Zhong","Ruiwei Wang","Siau-Cheng Khoo"],"pdf_url":"https://arxiv.org/pdf/2308.03330v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.07912v2","updated":"2023-08-07T06:20:31Z","published":"2023-01-19T06:46:36Z","title":"Interval Reachability of Nonlinear Dynamical Systems with Neural Network\n Controllers","summary":" This paper proposes a computationally efficient framework, based on interval\nanalysis, for rigorous verification of nonlinear continuous-time dynamical\nsystems with neural network controllers. Given a neural network, we use an\nexisting verification algorithm to construct inclusion functions for its\ninput-output behavior. Inspired by mixed monotone theory, we embed the\nclosed-loop dynamics into a larger system using an inclusion function of the\nneural network and a decomposition function of the open-loop system. 
This\nembedding provides a scalable approach for safety analysis of the neural\ncontrol loop while preserving the nonlinear structure of the system.\n We show that one can efficiently compute hyper-rectangular\nover-approximations of the reachable sets using a single trajectory of the\nembedding system. We design an algorithm to leverage this computational\nadvantage through partitioning strategies, improving our reachable set\nestimates while balancing its runtime with tunable parameters. We demonstrate\nthe performance of this algorithm through two case studies. First, we\ndemonstrate this method's strength in complex nonlinear environments. Then, we\nshow that our approach matches the performance of the state-of-the-art\nverification algorithm for linear discretized systems.\n","authors":["Saber Jafarpour","Akash Harapanahalli","Samuel Coogan"],"pdf_url":"https://arxiv.org/pdf/2301.07912v2.pdf","comment":"Extended L4DC version with proofs"},{"id":"http://arxiv.org/abs/2308.03321v1","updated":"2023-08-07T06:08:51Z","published":"2023-08-07T06:08:51Z","title":"AFN: Adaptive Fusion Normalization via Encoder-Decoder Framework","summary":" The success of deep learning is inseparable from normalization layers.\nResearchers have proposed various normalization functions, and each of them has\nboth advantages and disadvantages. In response, efforts have been made to\ndesign a unified normalization function that combines all normalization\nprocedures and mitigates their weaknesses. We propose a new normalization\nfunction called Adaptive Fusion Normalization (AFN). Through experiments, we\ndemonstrate that AFN outperforms previous normalization techniques on domain\ngeneralization and image classification tasks.\n","authors":["Zikai Zhou","Huanran Chen"],"pdf_url":"https://arxiv.org/pdf/2308.03321v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2106.01899 by other authors"},{"id":"http://arxiv.org/abs/2308.03320v1","updated":"2023-08-07T06:07:04Z","published":"2023-08-07T06:07:04Z","title":"Binary Federated Learning with Client-Level Differential Privacy","summary":" Federated learning (FL) is a privacy-preserving collaborative learning\nframework, and differential privacy can be applied to further enhance its\nprivacy protection. Existing FL systems typically adopt Federated Average\n(FedAvg) as the training algorithm and implement differential privacy with a\nGaussian mechanism. However, the inherent privacy-utility trade-off in these\nsystems severely degrades the training performance if a tight privacy budget is\nenforced. Besides, the Gaussian mechanism requires model weights to be of high\nprecision. To improve communication efficiency and achieve a better\nprivacy-utility trade-off, we propose a communication-efficient FL training\nalgorithm with a differential privacy guarantee. Specifically, we propose to\nadopt binary neural networks (BNNs) and introduce discrete noise in the FL\nsetting. Binary model parameters are uploaded for higher communication\nefficiency and discrete noise is added to achieve client-level differential\nprivacy protection. The achieved performance guarantee is rigorously proved,\nand it is shown to depend on the level of discrete noise. 
Experimental results\non the MNIST and Fashion-MNIST datasets demonstrate that the proposed\ntraining algorithm achieves client-level privacy protection with a performance\ngain while enjoying the benefits of low communication overhead from binary\nmodel updates.\n","authors":["Lumin Liu","Jun Zhang","Shenghui Song","Khaled B. Letaief"],"pdf_url":"https://arxiv.org/pdf/2308.03320v1.pdf","comment":"6 pages, 6 figures, accepted by IEEE GLOBECOM 2023"},{"id":"http://arxiv.org/abs/2308.03317v1","updated":"2023-08-07T06:01:50Z","published":"2023-08-07T06:01:50Z","title":"HomOpt: A Homotopy-Based Hyperparameter Optimization Method","summary":" Machine learning has achieved remarkable success over the past couple of\ndecades, often attributed to a combination of algorithmic innovations and the\navailability of high-quality data at scale. However, a third critical\ncomponent is the fine-tuning of hyperparameters, which plays a pivotal role in\nachieving optimal model performance. Despite its significance, hyperparameter\noptimization (HPO) remains a challenging task for several reasons. Many HPO\ntechniques rely on naive search methods or assume that the loss function is\nsmooth and continuous, which may not always be the case. Traditional methods,\nlike grid search and Bayesian optimization, often struggle to quickly adapt and\nefficiently search the loss landscape. Grid search is computationally\nexpensive, while Bayesian optimization can be slow to prime. Since the search\nspace for HPO is frequently high-dimensional and non-convex, it is often\nchallenging to efficiently find a global minimum. Moreover, optimal\nhyperparameters can be sensitive to the specific dataset or task, further\ncomplicating the search process. To address these issues, we propose a new\nhyperparameter optimization method, HomOpt, using a data-driven approach based\non a generalized additive model (GAM) surrogate combined with homotopy\noptimization. This strategy augments established optimization methodologies to\nboost the performance and effectiveness of any given method with faster\nconvergence to the optimum on continuous, discrete, and categorical domain\nspaces. We compare the effectiveness of HomOpt applied to multiple optimization\ntechniques (e.g., Random Search, TPE, Bayes, and SMAC), showing improved\nobjective performance on many standardized machine learning benchmarks and\nchallenging open-set recognition tasks.\n","authors":["Sophia J. Abraham","Kehelwala D. G. Maduranga","Jeffery Kinnison","Zachariah Carmichael","Jonathan D. Hauenstein","Walter J. Scheirer"],"pdf_url":"https://arxiv.org/pdf/2308.03317v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03316v1","updated":"2023-08-07T05:58:40Z","published":"2023-08-07T05:58:40Z","title":"Deep Q-Network for Stochastic Process Environments","summary":" Reinforcement learning is a powerful approach for training an optimal policy\nto solve complex problems in a given system. This project aims to demonstrate\nthe application of reinforcement learning in stochastic process environments\nwith missing information, using Flappy Bird and a newly developed stock trading\nenvironment as case studies. We evaluate various structures of Deep Q-learning\nnetworks and identify the most suitable variant for the stochastic process\nenvironment. 
Additionally, we discuss the current challenges and propose\npotential improvements for further work in environment-building and\nreinforcement learning techniques.\n","authors":["Kuangheng He"],"pdf_url":"https://arxiv.org/pdf/2308.03316v1.pdf","comment":"5 pages, 3 figures"},{"id":"http://arxiv.org/abs/2303.03724v2","updated":"2023-08-07T05:52:36Z","published":"2023-03-07T08:16:46Z","title":"Learning Bipedal Walking for Humanoids with Current Feedback","summary":" Recent advances in deep reinforcement learning (RL) based techniques combined\nwith training in simulation have offered a new approach to developing robust\ncontrollers for legged robots. However, the application of such approaches to\nreal hardware has largely been limited to quadrupedal robots with direct-drive\nactuators and light-weight bipedal robots with low gear-ratio transmission\nsystems. Application to real, life-sized humanoid robots has been less common\narguably due to a large sim2real gap. In this paper, we present an approach for\neffectively overcoming the sim2real gap issue for humanoid robots arising from\ninaccurate torque-tracking at the actuator level. Our key idea is to utilize\nthe current feedback from the actuators on the real robot, after training the\npolicy in a simulation environment artificially degraded with poor\ntorque-tracking. Our approach successfully trains a unified, end-to-end policy\nin simulation that can be deployed on a real HRP-5P humanoid robot to achieve\nbipedal locomotion. Through ablations, we also show that a feedforward policy\narchitecture combined with targeted dynamics randomization is sufficient for\nzero-shot sim2real success, thus eliminating the need for computationally\nexpensive, memory-based network architectures. Finally, we validate the\nrobustness of the proposed RL policy by comparing its performance against a\nconventional model-based controller for walking on uneven terrain with the real\nrobot.\n","authors":["Rohan Pratap Singh","Zhaoming Xie","Pierre Gergondet","Fumio Kanehiro"],"pdf_url":"https://arxiv.org/pdf/2303.03724v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03312v1","updated":"2023-08-07T05:40:58Z","published":"2023-08-07T05:40:58Z","title":"Symmetry-Preserving Program Representations for Learning Code Semantics","summary":" Large Language Models (LLMs) have shown promise in automated program\nreasoning, a crucial aspect of many security tasks. However, existing LLM\narchitectures for code are often borrowed from other domains like natural\nlanguage processing, raising concerns about their generalization and robustness\nto unseen code. A key generalization challenge is to incorporate the knowledge\nof code semantics, including control and data flow, into the LLM architectures.\n Drawing inspiration from examples of convolution layers exploiting\ntranslation symmetry, we explore how code symmetries can enhance LLM\narchitectures for program analysis and modeling. We present a rigorous\ngroup-theoretic framework that formally defines code symmetries as\nsemantics-preserving transformations and provides techniques for precisely\nreasoning about symmetry preservation within LLM architectures. Using this\nframework, we introduce a novel variant of self-attention that preserves\nprogram symmetries, demonstrating its effectiveness in generalization and\nrobustness through detailed experimental evaluations across different binary\nand source code analysis tasks. 
Overall, our code symmetry framework offers\nrigorous and powerful reasoning techniques that can guide the future\ndevelopment of specialized LLMs for code and advance LLM-guided program\nreasoning tasks.\n","authors":["Kexin Pei","Weichen Li","Qirui Jin","Shuyang Liu","Scott Geng","Lorenzo Cavallaro","Junfeng Yang","Suman Jana"],"pdf_url":"https://arxiv.org/pdf/2308.03312v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03306v1","updated":"2023-08-07T05:22:33Z","published":"2023-08-07T05:22:33Z","title":"Implicit Graph Neural Diffusion Based on Constrained Dirichlet Energy\n Minimization","summary":" Implicit graph neural networks (GNNs) have emerged as a potential approach to\nenable GNNs to capture long-range dependencies effectively. However, poorly\ndesigned implicit GNN layers can experience over-smoothing or may have limited\nadaptability to learn data geometry, potentially hindering their performance in\ngraph learning problems. To address these issues, we introduce a geometric\nframework to design implicit graph diffusion layers based on a parameterized\ngraph Laplacian operator. Our framework allows learning the geometry of vertex\nand edge spaces, as well as the graph gradient operator from data. We further\nshow how implicit GNN layers can be viewed as the fixed-point solution of a\nDirichlet energy minimization problem and give conditions under which it may\nsuffer from over-smoothing. To overcome the over-smoothing problem, we design\nour implicit graph diffusion layer as the solution of a Dirichlet energy\nminimization problem with constraints on vertex features, enabling it to trade\noff smoothing with the preservation of node feature information. With an\nappropriate hyperparameter set to be larger than the largest eigenvalue of the\nparameterized graph Laplacian, our framework guarantees a unique equilibrium\nand quick convergence. Our models demonstrate better performance than leading\nimplicit and explicit GNNs on benchmark datasets for node and graph\nclassification tasks, with substantial accuracy improvements observed for some\ndatasets.\n","authors":["Guoji Fu","Mohammed Haroon Dupty","Yanfei Dong","Lee Wee Sun"],"pdf_url":"https://arxiv.org/pdf/2308.03306v1.pdf","comment":"33 pages"},{"id":"http://arxiv.org/abs/2308.03300v1","updated":"2023-08-07T05:05:49Z","published":"2023-08-07T05:05:49Z","title":"Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio\n Detection","summary":" Current fake audio detection algorithms have achieved promising performances\non most datasets. However, their performance may be significantly degraded when\ndealing with audio of a different dataset. The orthogonal weight modification\nto overcome catastrophic forgetting does not consider the similarity of genuine\naudio across different datasets. To overcome this limitation, we propose a\ncontinual learning algorithm for fake audio detection to overcome catastrophic\nforgetting, called Regularized Adaptive Weight Modification (RAWM). When\nfine-tuning a detection network, our approach adaptively computes the direction\nof weight modification according to the ratio of genuine utterances and fake\nutterances. The adaptive modification direction ensures the network can\neffectively detect fake audio on the new dataset while preserving its knowledge\nof old model, thus mitigating catastrophic forgetting. 
In addition, genuine\naudio collected from quite different acoustic conditions may skew its feature\ndistribution, so we introduce a regularization constraint to force the network\nto remember the old distribution in this regard. Our method can easily be\ngeneralized to related fields, like speech emotion recognition. We also\nevaluate our approach across multiple datasets and obtain a significant\nperformance improvement on cross-dataset experiments.\n","authors":["Xiaohui Zhang","Jiangyan Yi","Jianhua Tao","Chenglong Wang","Chuyuan Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03300v1.pdf","comment":"40th International Conference on Machine Learning (ICML 2023)"},{"id":"http://arxiv.org/abs/2308.03296v1","updated":"2023-08-07T04:47:42Z","published":"2023-08-07T04:47:42Z","title":"Studying Large Language Model Generalization with Influence Functions","summary":" When trying to gain better visibility into a machine learning model in order\nto understand and mitigate the associated risks, a potentially valuable source\nof evidence is: which training examples most contribute to a given behavior?\nInfluence functions aim to answer a counterfactual: how would the model's\nparameters (and hence its outputs) change if a given sequence were added to the\ntraining set? While influence functions have produced insights for small\nmodels, they are difficult to scale to large language models (LLMs) due to the\ndifficulty of computing an inverse-Hessian-vector product (IHVP). We use the\nEigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC)\napproximation to scale influence functions up to LLMs with up to 52 billion\nparameters. In our experiments, EK-FAC achieves similar accuracy to traditional\ninfluence function estimators despite the IHVP computation being orders of\nmagnitude faster. We investigate two algorithmic techniques to reduce the cost\nof computing gradients of candidate training sequences: TF-IDF filtering and\nquery batching. We use influence functions to investigate the generalization\npatterns of LLMs, including the sparsity of the influence patterns, increasing\nabstraction with scale, math and programming abilities, cross-lingual\ngeneralization, and role-playing behavior. Despite many apparently\nsophisticated forms of generalization, we identify a surprising limitation:\ninfluences decay to near-zero when the order of key phrases is flipped.\nOverall, influence functions give us a powerful new tool for studying the\ngeneralization properties of LLMs.\n","authors":["Roger Grosse","Juhan Bae","Cem Anil","Nelson Elhage","Alex Tamkin","Amirhossein Tajdini","Benoit Steiner","Dustin Li","Esin Durmus","Ethan Perez","Evan Hubinger","Kamilė Lukošiūtė","Karina Nguyen","Nicholas Joseph","Sam McCandlish","Jared Kaplan","Samuel R. Bowman"],"pdf_url":"https://arxiv.org/pdf/2308.03296v1.pdf","comment":"119 pages, 47 figures, 22 tables"},{"id":"http://arxiv.org/abs/2308.01814v2","updated":"2023-08-07T04:47:32Z","published":"2023-08-03T15:22:51Z","title":"Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit","summary":" Going beyond stochastic gradient descent (SGD), what new phenomena emerge in\nwide neural networks trained by adaptive optimizers like Adam? Here we show:\nThe same dichotomy between feature learning and kernel behaviors (as in SGD)\nholds for general optimizers as well, including Adam -- albeit with a nonlinear\nnotion of \"kernel.\" We derive the corresponding \"neural tangent\" and \"maximal\nupdate\" limits for any architecture. 
Two foundational advances underlie the\nabove results: 1) A new Tensor Program language, NEXORT, that can express how\nadaptive optimizers process gradients into updates. 2) The introduction of\nbra-ket notation to drastically simplify expressions and calculations in Tensor\nPrograms. This work summarizes and generalizes all previous results in the\nTensor Programs series of papers.\n","authors":["Greg Yang","Etai Littwin"],"pdf_url":"https://arxiv.org/pdf/2308.01814v2.pdf","comment":"This is the complete version of \"Adaptive Optimization in the\n Infinite-Width Limit\" in ICLR 2023,\n https://openreview.net/forum?id=zgVDqw9ZUES"},{"id":"http://arxiv.org/abs/2308.03295v1","updated":"2023-08-07T04:44:12Z","published":"2023-08-07T04:44:12Z","title":"DOMINO: Domain-invariant Hyperdimensional Classification for\n Multi-Sensor Time Series Data","summary":" With the rapid evolution of the Internet of Things, many real-world\napplications utilize heterogeneously connected sensors to capture time-series\ninformation. Edge-based machine learning (ML) methodologies are often employed\nto analyze locally collected data. However, a fundamental issue across\ndata-driven ML approaches is distribution shift. It occurs when a model is\ndeployed on a data distribution different from what it was trained on, and can\nsubstantially degrade model performance. Additionally, increasingly\nsophisticated deep neural networks (DNNs) have been proposed to capture spatial\nand temporal dependencies in multi-sensor time series data, requiring intensive\ncomputational resources beyond the capacity of today's edge devices. While\nbrain-inspired hyperdimensional computing (HDC) has been introduced as a\nlightweight solution for edge-based learning, existing HDCs are also vulnerable\nto the distribution shift challenge. In this paper, we propose DOMINO, a novel\nHDC learning framework addressing the distribution shift problem in noisy\nmulti-sensor time-series data. DOMINO leverages efficient and parallel matrix\noperations on high-dimensional space to dynamically identify and filter out\ndomain-variant dimensions. Our evaluation on a wide range of multi-sensor time\nseries classification tasks shows that DOMINO achieves on average 2.04% higher\naccuracy than state-of-the-art (SOTA) DNN-based domain generalization\ntechniques, and delivers 7.83x faster training and 26.94x faster inference.\nMore importantly, DOMINO performs notably better when learning from partially\nlabeled and highly imbalanced data, providing 10.93x higher robustness against\nhardware noises than SOTA DNNs.\n","authors":["Junyao Wang","Luke Chen","Mohammad Abdullah Al Faruque"],"pdf_url":"https://arxiv.org/pdf/2308.03295v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.02360v2","updated":"2023-08-07T04:32:21Z","published":"2023-08-04T14:52:22Z","title":"Intensity-free Integral-based Learning of Marked Temporal Point\n Processes","summary":" In the marked temporal point processes (MTPP), a core problem is to\nparameterize the conditional joint PDF (probability distribution function)\n$p^*(m,t)$ for inter-event time $t$ and mark $m$, conditioned on the history.\nThe majority of existing studies predefine intensity functions. Their utility\nis challenged by specifying the intensity function's proper form, which is\ncritical to balance expressiveness and processing efficiency. 
Recently, there\nare studies moving away from predefining the intensity function -- one models\n$p^*(t)$ and $p^*(m)$ separately, while the other focuses on temporal point\nprocesses (TPPs), which do not consider marks. This study aims to develop\nhigh-fidelity $p^*(m,t)$ for discrete events where the event marks are either\ncategorical or numeric in a multi-dimensional continuous space. We propose a\nsolution framework IFIB (\\underline{I}ntensity-\\underline{f}ree\n\\underline{I}ntegral-\\underline{b}ased process) that models conditional joint\nPDF $p^*(m,t)$ directly without intensity functions. It remarkably simplifies\nthe process to compel the essential mathematical restrictions. We show the\ndesired properties of IFIB and the superior experimental results of IFIB on\nreal-world and synthetic datasets. The code is available at\n\\url{https://github.com/StepinSilence/IFIB}.\n","authors":["Sishun Liu","Ke Deng","Xiuzhen Zhang","Yongli Ren"],"pdf_url":"https://arxiv.org/pdf/2308.02360v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03291v1","updated":"2023-08-07T04:20:38Z","published":"2023-08-07T04:20:38Z","title":"SynJax: Structured Probability Distributions for JAX","summary":" The development of deep learning software libraries enabled significant\nprogress in the field by allowing users to focus on modeling, while letting the\nlibrary to take care of the tedious and time-consuming task of optimizing\nexecution for modern hardware accelerators. However, this has benefited only\nparticular types of deep learning models, such as Transformers, whose\nprimitives map easily to the vectorized computation. The models that explicitly\naccount for structured objects, such as trees and segmentations, did not\nbenefit equally because they require custom algorithms that are difficult to\nimplement in a vectorized form.\n SynJax directly addresses this problem by providing an efficient vectorized\nimplementation of inference algorithms for structured distributions covering\nalignment, tagging, segmentation, constituency trees and spanning trees. With\nSynJax we can build large-scale differentiable models that explicitly model\nstructure in the data. The code is available at\nhttps://github.com/deepmind/synjax.\n","authors":["Miloš Stanojević","Laurent Sartran"],"pdf_url":"https://arxiv.org/pdf/2308.03291v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03290v1","updated":"2023-08-07T04:17:19Z","published":"2023-08-07T04:17:19Z","title":"FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization\n Search","summary":" Quantization has become a mainstream compression technique for reducing model\nsize, computational requirements, and energy consumption for modern deep neural\nnetworks (DNNs). With the improved numerical support in recent hardware,\nincluding multiple variants of integer and floating point, mixed-precision\nquantization has become necessary to achieve high-quality results with low\nmodel cost. Prior mixed-precision quantization methods have performed a\npost-training quantization search, which compromises on accuracy, or a\ndifferentiable quantization search, which leads to high memory usage from\nbranching. Therefore, we propose the first one-shot mixed-precision\nquantization search that eliminates the need for retraining in both integer and\nlow-precision floating point models. We evaluate our floating-point and integer\nquantization search (FLIQS) on multiple convolutional networks and vision\ntransformer models to discover Pareto-optimal models. 
Our approach discovers\nmodels that improve upon uniform precision, manual mixed-precision, and recent\ninteger quantization search methods. With the proposed integer quantization\nsearch, we increase the accuracy of ResNet-18 on ImageNet by 1.31% points and\nResNet-50 by 0.90% points with equivalent model cost over previous methods.\nAdditionally, for the first time, we explore a novel mixed-precision\nfloating-point search and improve MobileNetV2 by up to 0.98% points compared to\nprior state-of-the-art FP8 models. Finally, we extend FLIQS to simultaneously\nsearch a joint quantization and neural architecture space and improve the\nImageNet accuracy by 2.69% points with similar model cost on a MobileNetV2\nsearch space.\n","authors":["Jordan Dotzel","Gang Wu","Andrew Li","Muhammad Umar","Yun Ni","Mohamed S. Abdelfattah","Zhiru Zhang","Liqun Cheng","Martin G. Dixon","Norman P. Jouppi","Quoc V. Le","Sheng Li"],"pdf_url":"https://arxiv.org/pdf/2308.03290v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.15306v2","updated":"2023-08-07T04:07:06Z","published":"2022-06-30T14:24:32Z","title":"Transfer Learning with Deep Tabular Models","summary":" Recent work on deep learning for tabular data demonstrates the strong\nperformance of deep tabular models, often bridging the gap between gradient\nboosted decision trees and neural networks. Accuracy aside, a major advantage\nof neural models is that they learn reusable features and are easily fine-tuned\nin new domains. This property is often exploited in computer vision and natural\nlanguage applications, where transfer learning is indispensable when\ntask-specific training data is scarce. In this work, we demonstrate that\nupstream data gives tabular neural networks a decisive advantage over widely\nused GBDT models. We propose a realistic medical diagnosis benchmark for\ntabular transfer learning, and we present a how-to guide for using upstream\ndata to boost performance with a variety of tabular neural network\narchitectures. Finally, we propose a pseudo-feature method for cases where the\nupstream and downstream feature sets differ, a tabular-specific problem\nwidespread in real-world applications. Our code is available at\nhttps://github.com/LevinRoman/tabular-transfer-learning .\n","authors":["Roman Levin","Valeriia Cherepanova","Avi Schwarzschild","Arpit Bansal","C. Bayan Bruss","Tom Goldstein","Andrew Gordon Wilson","Micah Goldblum"],"pdf_url":"https://arxiv.org/pdf/2206.15306v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03283v1","updated":"2023-08-07T04:00:13Z","published":"2023-08-07T04:00:13Z","title":"High-rate discretely-modulated continuous-variable quantum key\n distribution using quantum machine learning","summary":" We propose a high-rate scheme for discretely-modulated continuous-variable\nquantum key distribution (DM CVQKD) using quantum machine learning\ntechnologies, which divides the whole CVQKD system into three parts, i.e., the\ninitialization part that is used for training and estimating quantum\nclassifier, the prediction part that is used for generating highly correlated\nraw keys, and the data-postprocessing part that generates the final secret key\nstring shared by Alice and Bob. To this end, a low-complexity quantum k-nearest\nneighbor (QkNN) classifier is designed for predicting the lossy\ndiscretely-modulated coherent states (DMCSs) at Bob's side. 
The performance of\nthe proposed QkNN-based CVQKD especially in terms of machine learning metrics\nand complexity is analyzed, and its theoretical security is proved by using\nsemi-definite program (SDP) method. Numerical simulation shows that the secret\nkey rate of our proposed scheme is explicitly superior to the existing DM CVQKD\nprotocols, and it can be further enhanced with the increase of modulation\nvariance.\n","authors":["Qin Liao","Jieyu Liu","Anqi Huang","Lei Huang","Zhuoying Fei","Xiquan Fu"],"pdf_url":"https://arxiv.org/pdf/2308.03283v1.pdf","comment":"18 pages, 17 figures"},{"id":"http://arxiv.org/abs/2212.09201v2","updated":"2023-08-07T03:33:28Z","published":"2022-12-19T00:42:21Z","title":"Spectral Regularized Kernel Two-Sample Tests","summary":" Over the last decade, an approach that has gained a lot of popularity to\ntackle non-parametric testing problems on general (i.e., non-Euclidean) domains\nis based on the notion of reproducing kernel Hilbert space (RKHS) embedding of\nprobability distributions. The main goal of our work is to understand the\noptimality of two-sample tests constructed based on this approach. First, we\nshow that the popular MMD (maximum mean discrepancy) two-sample test is not\noptimal in terms of the separation boundary measured in Hellinger distance.\nSecond, we propose a modification to the MMD test based on spectral\nregularization by taking into account the covariance information (which is not\ncaptured by the MMD test) and prove the proposed test to be minimax optimal\nwith a smaller separation boundary than that achieved by the MMD test. Third,\nwe propose an adaptive version of the above test which involves a data-driven\nstrategy to choose the regularization parameter and show the adaptive test to\nbe almost minimax optimal up to a logarithmic factor. Moreover, our results\nhold for the permutation variant of the test where the test threshold is chosen\nelegantly through the permutation of the samples. Through numerical experiments\non synthetic and real-world data, we demonstrate the superior performance of\nthe proposed test in comparison to the MMD test.\n","authors":["Omar Hagrass","Bharath K. Sriperumbudur","Bing Li"],"pdf_url":"https://arxiv.org/pdf/2212.09201v2.pdf","comment":"63 pages"},{"id":"http://arxiv.org/abs/2308.03274v1","updated":"2023-08-07T03:32:39Z","published":"2023-08-07T03:32:39Z","title":"DSformer: A Double Sampling Transformer for Multivariate Time Series\n Long-term Prediction","summary":" Multivariate time series long-term prediction, which aims to predict the\nchange of data in a long time, can provide references for decision-making.\nAlthough transformer-based models have made progress in this field, they\nusually do not make full use of three features of multivariate time series:\nglobal information, local information, and variables correlation. To\neffectively mine the above three features and establish a high-precision\nprediction model, we propose a double sampling transformer (DSformer), which\nconsists of the double sampling (DS) block and the temporal variable attention\n(TVA) block. Firstly, the DS block employs down sampling and piecewise sampling\nto transform the original series into feature vectors that focus on global\ninformation and local information respectively. Then, TVA block uses temporal\nattention and variable attention to mine these feature vectors from different\ndimensions and extract key information. 
Finally, based on a parallel structure,\nDSformer uses multiple TVA blocks to mine and integrate different features\nobtained from DS blocks respectively. The integrated feature information is\npassed to the generative decoder based on a multi-layer perceptron to realize\nmultivariate time series long-term prediction. Experimental results on nine\nreal-world datasets show that DSformer can outperform eight existing baselines.\n","authors":["Chengqing Yu","Fei Wang","Zezhi Shao","Tao Sun","Lin Wu","Yongjun Xu"],"pdf_url":"https://arxiv.org/pdf/2308.03274v1.pdf","comment":"Accepted by CIKM 2023 (FULL paper)"},{"id":"http://arxiv.org/abs/2103.00676v2","updated":"2023-08-07T03:25:37Z","published":"2021-03-01T01:00:09Z","title":"Token-Modification Adversarial Attacks for Natural Language Processing:\n A Survey","summary":" There are now many adversarial attacks for natural language processing\nsystems. Of these, a vast majority achieve success by modifying individual\ndocument tokens, which we call here a token-modification attack. Each\ntoken-modification attack is defined by a specific combination of fundamental\ncomponents, such as a constraint on the adversary or a particular search\nalgorithm. Motivated by this observation, we survey existing token-modification\nattacks and extract the components of each. We use an attack-independent\nframework to structure our survey which results in an effective categorisation\nof the field and an easy comparison of components. This survey aims to guide\nnew researchers to this field and spark further research into individual attack\ncomponents.\n","authors":["Tom Roth","Yansong Gao","Alsharif Abuadbba","Surya Nepal","Wei Liu"],"pdf_url":"https://arxiv.org/pdf/2103.00676v2.pdf","comment":"Version 2: updated"},{"id":"http://arxiv.org/abs/2308.03271v1","updated":"2023-08-07T03:23:46Z","published":"2023-08-07T03:23:46Z","title":"Local Structure-aware Graph Contrastive Representation Learning","summary":" Traditional Graph Neural Network (GNN), as a graph representation learning\nmethod, is constrained by label information. However, Graph Contrastive\nLearning (GCL) methods, which tackle the label problem effectively, mainly\nfocus on the feature information of the global graph or small subgraph\nstructure (e.g., the first-order neighborhood). In the paper, we propose a\nLocal Structure-aware Graph Contrastive representation Learning method (LS-GCL)\nto model the structural information of nodes from multiple views. Specifically,\nwe construct the semantic subgraphs that are not limited to the first-order\nneighbors. For the local view, the semantic subgraph of each target node is\ninput into a shared GNN encoder to obtain the target node embeddings at the\nsubgraph-level. Then, we use a pooling function to generate the subgraph-level\ngraph embeddings. For the global view, considering the original graph preserves\nindispensable semantic information of nodes, we leverage the shared GNN encoder\nto learn the target node embeddings at the global graph-level. The proposed\nLS-GCL model is optimized to maximize the common information among similar\ninstances at three various perspectives through a multi-level contrastive loss\nfunction. 
Experimental results on five datasets illustrate that our method\noutperforms state-of-the-art graph representation learning approaches for both\nnode classification and link prediction tasks.\n","authors":["Kai Yang","Yuan Liu","Zijuan Zhao","Peijin Ding","Wenqian Zhao"],"pdf_url":"https://arxiv.org/pdf/2308.03271v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03269v1","updated":"2023-08-07T03:19:59Z","published":"2023-08-07T03:19:59Z","title":"Simple Rule Injection for ComplEx Embeddings","summary":" Recent works in neural knowledge graph inference attempt to combine logic\nrules with knowledge graph embeddings to benefit from prior knowledge. However,\nthey usually cannot avoid rule grounding, and injecting a diverse set of rules\nhas still not been thoroughly explored. In this work, we propose InjEx, a\nmechanism to inject multiple types of rules through simple constraints, which\ncapture definite Horn rules. To start, we theoretically prove that InjEx can\ninject such rules. Next, to demonstrate that InjEx infuses interpretable prior\nknowledge into the embedding space, we evaluate InjEx on both the knowledge\ngraph completion (KGC) and few-shot knowledge graph completion (FKGC) settings.\nOur experimental results reveal that InjEx outperforms both baseline KGC models\nas well as specialized few-shot models while maintaining its scalability and\nefficiency.\n","authors":["Haodi Ma","Anthony Colas","Yuejie Wang","Ali Sadeghian","Daisy Zhe Wang"],"pdf_url":"https://arxiv.org/pdf/2308.03269v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.02394v2","updated":"2023-08-07T03:07:59Z","published":"2023-05-03T19:29:26Z","title":"Defending against Insertion-based Textual Backdoor Attacks via\n Attribution","summary":" Textual backdoor attack, as a novel attack model, has been shown to be\neffective in adding a backdoor to the model during training. Defending against\nsuch backdoor attacks has become urgent and important. In this paper, we\npropose AttDef, an efficient attribution-based pipeline to defend against two\ninsertion-based poisoning attacks, BadNL and InSent. Specifically, we regard\nthe tokens with larger attribution scores as potential triggers since larger\nattribution words contribute more to the false prediction results and therefore\nare more likely to be poison triggers. Additionally, we further utilize an\nexternal pre-trained language model to distinguish whether input is poisoned or\nnot. We show that our proposed method can generalize sufficiently well in two\ncommon attack scenarios (poisoning training data and testing data), which\nconsistently improves previous methods. For instance, AttDef can successfully\nmitigate both attacks with an average accuracy of 79.97% (56.59% up) and 48.34%\n(3.99% up) under pre-training and post-training attack defense respectively,\nachieving the new state-of-the-art performance on prediction recovery over four\nbenchmark datasets.\n","authors":["Jiazhao Li","Zhuofeng Wu","Wei Ping","Chaowei Xiao","V. G. Vinod Vydiswaran"],"pdf_url":"https://arxiv.org/pdf/2305.02394v2.pdf","comment":"Findings of ACL 2023. Camera-ready version"},{"id":"http://arxiv.org/abs/2212.08254v2","updated":"2023-08-07T03:00:41Z","published":"2022-12-16T02:52:37Z","title":"RepQ-ViT: Scale Reparameterization for Post-Training Quantization of\n Vision Transformers","summary":" Post-training quantization (PTQ), which only requires a tiny dataset for\ncalibration without end-to-end retraining, is a light and practical model\ncompression technique. 
Recently, several PTQ schemes for vision transformers\n(ViTs) have been presented; unfortunately, they typically suffer from\nnon-trivial accuracy degradation, especially in low-bit cases. In this paper,\nwe propose RepQ-ViT, a novel PTQ framework for ViTs based on quantization scale\nreparameterization, to address the above issues. RepQ-ViT decouples the\nquantization and inference processes, where the former employs complex\nquantizers and the latter employs scale-reparameterized simplified quantizers.\nThis ensures both accurate quantization and efficient inference, which\ndistinguishes it from existing approaches that sacrifice quantization\nperformance to meet the target hardware. More specifically, we focus on two\ncomponents with extreme distributions: post-LayerNorm activations with severe\ninter-channel variation and post-Softmax activations with power-law features,\nand initially apply channel-wise quantization and log$\\sqrt{2}$ quantization,\nrespectively. Then, we reparameterize the scales to hardware-friendly\nlayer-wise quantization and log2 quantization for inference, with only slight\naccuracy or computational costs. Extensive experiments are conducted on\nmultiple vision tasks with different model variants, proving that RepQ-ViT,\nwithout hyperparameters and expensive reconstruction procedures, can outperform\nexisting strong baselines and encouragingly improve the accuracy of 4-bit PTQ\nof ViTs to a usable level. Code is available at\nhttps://github.com/zkkli/RepQ-ViT.\n","authors":["Zhikai Li","Junrui Xiao","Lianwei Yang","Qingyi Gu"],"pdf_url":"https://arxiv.org/pdf/2212.08254v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2308.02180v2","updated":"2023-08-07T02:53:06Z","published":"2023-08-04T07:51:15Z","title":"Scaling Clinical Trial Matching Using Large Language Models: A Case\n Study in Oncology","summary":" Clinical trial matching is a key process in health delivery and discovery. In\npractice, it is plagued by overwhelming unstructured data and unscalable manual\nprocessing. In this paper, we conduct a systematic study on scaling clinical\ntrial matching using large language models (LLMs), with oncology as the focus\narea. Our study is grounded in a clinical trial matching system currently in\ntest deployment at a large U.S. health network. Initial findings are promising:\nout of box, cutting-edge LLMs, such as GPT-4, can already structure elaborate\neligibility criteria of clinical trials and extract complex matching logic\n(e.g., nested AND/OR/NOT). While still far from perfect, LLMs substantially\noutperform prior strong baselines and may serve as a preliminary solution to\nhelp triage patient-trial candidates with humans in the loop. 
Our study also\nreveals a few significant growth areas for applying LLMs to end-to-end clinical\ntrial matching, such as context limitation and accuracy, especially in\nstructuring patient information from longitudinal medical records.\n","authors":["Cliff Wong","Sheng Zhang","Yu Gu","Christine Moung","Jacob Abel","Naoto Usuyama","Roshanthi Weerasinghe","Brian Piening","Tristan Naumann","Carlo Bifulco","Hoifung Poon"],"pdf_url":"https://arxiv.org/pdf/2308.02180v2.pdf","comment":"24 pages, 5 figures, accepted at Machine Learning for Healthcare\n (MLHC) 2023"},{"id":"http://arxiv.org/abs/2308.03260v1","updated":"2023-08-07T02:42:21Z","published":"2023-08-07T02:42:21Z","title":"Exploring Different Time-series-Transformer (TST) Architectures: A Case\n Study in Battery Life Prediction for Electric Vehicles (EVs)","summary":" In recent years, battery technology for electric vehicles (EVs) has been a\nmajor focus, with a significant emphasis on developing new battery materials\nand chemistries. However, accurately predicting key battery parameters, such as\nstate-of-charge (SOC) and temperature, remains a challenge for constructing\nadvanced battery management systems (BMS). Existing battery models do not\ncomprehensively cover all parameters affecting battery performance, including\nnon-battery-related factors like ambient temperature, cabin temperature,\nelevation, and regenerative braking during EV operation. Due to the difficulty\nof incorporating these auxiliary parameters into traditional models, a\ndata-driven approach is suggested. Time-series-transformers (TSTs), leveraging\nmultiheaded attention and parallelization-friendly architecture, are explored\nalongside LSTM models. Novel TST architectures, including encoder TST + decoder\nLSTM and a hybrid TST-LSTM, are also developed and compared against existing\nmodels. A dataset comprising 72 driving trips in a BMW i3 (60 Ah) is used to\naddress battery life prediction in EVs, aiming to create accurate TST models\nthat incorporate environmental, battery, vehicle driving, and heating circuit\ndata to predict SOC and battery temperature for future time steps.\n","authors":["Niranjan Sitapure","Atharva Kulkarni"],"pdf_url":"https://arxiv.org/pdf/2308.03260v1.pdf","comment":"13 pages and 7 figures"},{"id":"http://arxiv.org/abs/2308.03259v1","updated":"2023-08-07T02:37:02Z","published":"2023-08-07T02:37:02Z","title":"Optimal Approximation and Learning Rates for Deep Convolutional Neural\n Networks","summary":" This paper focuses on approximation and learning performance analysis for\ndeep convolutional neural networks with zero-padding and max-pooling. We prove\nthat, to approximate $r$-smooth function, the approximation rates of deep\nconvolutional neural networks with depth $L$ are of order $ (L^2/\\log\nL)^{-2r/d} $, which is optimal up to a logarithmic factor. Furthermore, we\ndeduce almost optimal learning rates for implementing empirical risk\nminimization over deep convolutional neural networks.\n","authors":["Shao-Bo Lin"],"pdf_url":"https://arxiv.org/pdf/2308.03259v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2301.01470v5","updated":"2023-08-07T02:06:09Z","published":"2023-01-04T07:16:46Z","title":"Model Parameter Identification via a Hyperparameter Optimization Scheme\n for Autonomous Racing Systems","summary":" In this letter, we propose a model parameter identification method via a\nhyperparameter optimization scheme (MI-HPO). 
Our method adopts an efficient\nexplore-exploit strategy to identify the parameters of dynamic models in a\ndata-driven optimization manner. We utilize our method for model parameter\nidentification of the AV-21, a full-scaled autonomous race vehicle. We then\nincorporate the optimized parameters for the design of model-based planning and\ncontrol systems of our platform. In experiments, MI-HPO exhibits more than 13\ntimes faster convergence than traditional parameter identification methods.\nFurthermore, the parametric models learned via MI-HPO demonstrate good fitness\nto the given datasets and show generalization ability in unseen dynamic\nscenarios. We further conduct extensive field tests to validate our model-based\nsystem, demonstrating stable obstacle avoidance and high-speed driving up to\n217 km/h at the Indianapolis Motor Speedway and Las Vegas Motor Speedway. The\nsource code for our work and videos of the tests are available at\nhttps://github.com/hynkis/MI-HPO.\n","authors":["Hyunki Seong","Chanyoung Chung","David Hyunchul Shim"],"pdf_url":"https://arxiv.org/pdf/2301.01470v5.pdf","comment":"6 pages, 8 figures. Published in IEEE Control Systems Letters (L-CSS)"},{"id":"http://arxiv.org/abs/2304.06833v3","updated":"2023-08-07T01:41:25Z","published":"2023-04-13T21:54:53Z","title":"Estimate-Then-Optimize versus Integrated-Estimation-Optimization versus\n Sample Average Approximation: A Stochastic Dominance Perspective","summary":" In data-driven stochastic optimization, model parameters of the underlying\ndistribution need to be estimated from data in addition to the optimization\ntask. Recent literature considers integrating the estimation and optimization\nprocesses by selecting model parameters that lead to the best empirical\nobjective performance. This integrated approach, which we call\nintegrated-estimation-optimization (IEO), can be readily shown to outperform\nsimple estimate-then-optimize (ETO) when the model is misspecified. In this\npaper, we show that a reverse behavior appears when the model class is\nwell-specified and there is sufficient data. Specifically, for a general class\nof nonlinear stochastic optimization problems, we show that simple ETO\noutperforms IEO asymptotically when the model class covers the ground truth, in\nthe strong sense of stochastic dominance of the regret. Namely, the entire\ndistribution of the regret, not only its mean or other moments, is always\nbetter for ETO compared to IEO. Our results also apply to constrained,\ncontextual optimization problems where the decision depends on observed\nfeatures. Whenever applicable, we also demonstrate how standard sample average\napproximation (SAA) performs the worst when the model class is well-specified\nin terms of regret, and best when it is misspecified. Finally, we provide\nexperimental results to support our theoretical comparisons and illustrate when\nour insights hold in finite-sample regimes and under various degrees of\nmisspecification.\n","authors":["Adam N. Elmachtoub","Henry Lam","Haofeng Zhang","Yunfan Zhao"],"pdf_url":"https://arxiv.org/pdf/2304.06833v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03243v1","updated":"2023-08-07T01:41:21Z","published":"2023-08-07T01:41:21Z","title":"Unsupervised Adversarial Detection without Extra Model: Training Loss\n Should Change","summary":" Adversarial robustness poses a critical challenge in the deployment of deep\nlearning models for real-world applications. 
Traditional approaches to\nadversarial training and supervised detection rely on prior knowledge of attack\ntypes and access to labeled training data, which is often impractical. Existing\nunsupervised adversarial detection methods identify whether the target model\nworks properly, but they suffer from bad accuracies owing to the use of common\ncross-entropy training loss, which relies on unnecessary features and\nstrengthens adversarial attacks. We propose new training losses to reduce\nuseless features and the corresponding detection method without prior knowledge\nof adversarial attacks. The detection rate (true positive rate) against all\ngiven white-box attacks is above 93.9% except for attacks without limits\n(DF($\\infty$)), while the false positive rate is barely 2.5%. The proposed\nmethod works well in all tested attack types and the false positive rates are\neven better than the methods good at certain types.\n","authors":["Chien Cheng Chyou","Hung-Ting Su","Winston H. Hsu"],"pdf_url":"https://arxiv.org/pdf/2308.03243v1.pdf","comment":"AdvML in ICML 2023\n code:https://github.com/CycleBooster/Unsupervised-adversarial-detection-without-extra-model"},{"id":"http://arxiv.org/abs/2308.03239v1","updated":"2023-08-07T01:32:09Z","published":"2023-08-07T01:32:09Z","title":"Asynchronous Decentralized Q-Learning: Two Timescale Analysis By\n Persistence","summary":" Non-stationarity is a fundamental challenge in multi-agent reinforcement\nlearning (MARL), where agents update their behaviour as they learn. Many\ntheoretical advances in MARL avoid the challenge of non-stationarity by\ncoordinating the policy updates of agents in various ways, including\nsynchronizing times at which agents are allowed to revise their policies.\nSynchronization enables analysis of many MARL algorithms via multi-timescale\nmethods, but such synchrony is infeasible in many decentralized applications.\nIn this paper, we study an asynchronous variant of the decentralized Q-learning\nalgorithm, a recent MARL algorithm for stochastic games. We provide sufficient\nconditions under which the asynchronous algorithm drives play to equilibrium\nwith high probability. Our solution utilizes constant learning rates in the\nQ-factor update, which we show to be critical for relaxing the synchrony\nassumptions of earlier work. Our analysis also applies to asynchronous\ngeneralizations of a number of other algorithms from the regret testing\ntradition, whose performance is analyzed by multi-timescale methods that study\nMarkov chains obtained via policy update dynamics. This work extends the\napplicability of the decentralized Q-learning algorithm and its relatives to\nsettings in which parameters are selected in an independent manner, and tames\nnon-stationarity without imposing the coordination assumptions of prior work.\n","authors":["Bora Yongacoglu","Gürdal Arslan","Serdar Yüksel"],"pdf_url":"https://arxiv.org/pdf/2308.03239v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03236v1","updated":"2023-08-07T01:25:10Z","published":"2023-08-07T01:25:10Z","title":"G-Mix: A Generalized Mixup Learning Framework Towards Flat Minima","summary":" Deep neural networks (DNNs) have demonstrated promising results in various\ncomplex tasks. However, current DNNs encounter challenges with\nover-parameterization, especially when there is limited training data\navailable. To enhance the generalization capability of DNNs, the Mixup\ntechnique has gained popularity. Nevertheless, it still produces suboptimal\noutcomes. 
Inspired by the successful Sharpness-Aware Minimization (SAM)\napproach, which establishes a connection between the sharpness of the training\nloss landscape and model generalization, we propose a new learning framework\ncalled Generalized-Mixup, which combines the strengths of Mixup and SAM for\ntraining DNN models. The theoretical analysis provided demonstrates how the\ndeveloped G-Mix framework enhances generalization. Additionally, to further\noptimize DNN performance with the G-Mix framework, we introduce two novel\nalgorithms: Binary G-Mix and Decomposed G-Mix. These algorithms partition the\ntraining data into two subsets based on the sharpness-sensitivity of each\nexample to address the issue of \"manifold intrusion\" in Mixup. Both theoretical\nexplanations and experimental results reveal that the proposed BG-Mix and\nDG-Mix algorithms further enhance model generalization across multiple datasets\nand models, achieving state-of-the-art performance.\n","authors":["Xingyu Li","Bo Tang"],"pdf_url":"https://arxiv.org/pdf/2308.03236v1.pdf","comment":"19 pages, 23 figures"},{"id":"http://arxiv.org/abs/2212.12294v2","updated":"2023-08-07T01:21:19Z","published":"2022-12-23T12:51:42Z","title":"FFNeRV: Flow-Guided Frame-Wise Neural Representations for Videos","summary":" Neural fields, also known as coordinate-based or implicit neural\nrepresentations, have shown a remarkable capability of representing,\ngenerating, and manipulating various forms of signals. For video\nrepresentations, however, mapping pixel-wise coordinates to RGB colors has\nshown relatively low compression performance and slow convergence and inference\nspeed. Frame-wise video representation, which maps a temporal coordinate to its\nentire frame, has recently emerged as an alternative method to represent\nvideos, improving compression rates and encoding speed. While promising, it has\nstill failed to reach the performance of state-of-the-art video compression\nalgorithms. In this work, we propose FFNeRV, a novel method for incorporating\nflow information into frame-wise representations to exploit the temporal\nredundancy across the frames in videos inspired by the standard video codecs.\nFurthermore, we introduce a fully convolutional architecture, enabled by\none-dimensional temporal grids, improving the continuity of spatial features.\nExperimental results show that FFNeRV yields the best performance for video\ncompression and frame interpolation among the methods using frame-wise\nrepresentations or neural fields. To reduce the model size even further, we\ndevise a more compact convolutional architecture using the group and pointwise\nconvolutions. With model compression techniques, including quantization-aware\ntraining and entropy coding, FFNeRV outperforms widely-used standard video\ncodecs (H.264 and HEVC) and performs on par with state-of-the-art video\ncompression algorithms.\n","authors":["Joo Chan Lee","Daniel Rho","Jong Hwan Ko","Eunbyung Park"],"pdf_url":"https://arxiv.org/pdf/2212.12294v2.pdf","comment":"Our project page including code is available at\n https://maincold2.github.io/ffnerv/"},{"id":"http://arxiv.org/abs/2206.02659v5","updated":"2023-08-07T01:20:01Z","published":"2022-06-06T14:52:46Z","title":"Robust Fine-Tuning of Deep Neural Networks with Hessian-based\n Generalization Guarantees","summary":" We consider fine-tuning a pretrained deep neural network on a target task. 
We\nstudy the generalization properties of fine-tuning to understand the problem of\noverfitting, which has often been observed (e.g., when the target dataset is\nsmall or when the training labels are noisy). Existing generalization measures\nfor deep networks depend on notions such as distance from the initialization\n(i.e., the pretrained network) of the fine-tuned model and noise stability\nproperties of deep networks. This paper identifies a Hessian-based distance\nmeasure through PAC-Bayesian analysis, which is shown to correlate well with\nobserved generalization gaps of fine-tuned models. Theoretically, we prove\nHessian distance-based generalization bounds for fine-tuned models. We also\ndescribe an extended study of fine-tuning against label noise, where\noverfitting is a critical problem; we present an algorithm and a\ngeneralization error guarantee for this algorithm under a class conditional\nindependent noise model. Empirically, we observe that the Hessian-based\ndistance measure can match the scale of the observed generalization gap of\nfine-tuned models in practice. We also test our algorithm on several image\nclassification tasks with noisy training labels, showing notable gains over\nprior methods, and the Hessian distance measure of the fine-tuned model\ndecreases substantially.\n","authors":["Haotian Ju","Dongyue Li","Hongyang R. Zhang"],"pdf_url":"https://arxiv.org/pdf/2206.02659v5.pdf","comment":"37 pages. Appeared in ICML 2022"},{"id":"http://arxiv.org/abs/2308.03235v1","updated":"2023-08-07T01:10:50Z","published":"2023-08-07T01:10:50Z","title":"Analysis of the Evolution of Advanced Transformer-Based Language Models:\n Experiments on Opinion Mining","summary":" Opinion mining, also known as sentiment analysis, is a subfield of natural\nlanguage processing (NLP) that focuses on identifying and extracting subjective\ninformation in textual material. This can include determining the overall\nsentiment of a piece of text (e.g., positive or negative), as well as\nidentifying specific emotions or opinions expressed in the text, which involves\nthe use of advanced machine and deep learning techniques. Recently,\ntransformer-based language models have made this task of human emotion analysis\nintuitive, thanks to the attention mechanism and parallel computation. These\nadvantages make such models very powerful on linguistic tasks, unlike recurrent\nneural networks that spend a lot of time on sequential processing, making them\nprone to fail when it comes to processing long text. Our paper\naims to study the behaviour of the cutting-edge Transformer-based language\nmodels on opinion mining and provide a high-level comparison between them to\nhighlight their key particularities. Additionally, our comparative study provides\nleads and paves the way for production engineers regarding the approach to\nfocus on, and is useful for researchers as it provides guidelines for future\nresearch subjects.\n","authors":["Nour Eddine Zekaoui","Siham Yousfi","Maryem Rhanoui","Mounia Mikram"],"pdf_url":"https://arxiv.org/pdf/2308.03235v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03231v1","updated":"2023-08-07T00:30:29Z","published":"2023-08-07T00:30:29Z","title":"Imbalanced Large Graph Learning Framework for FPGA Logic Elements\n Packing Prediction","summary":" Packing is a required step in a typical FPGA CAD flow. It has a high impact on\nthe performance of FPGA placement and routing. 
Early prediction of packing\nresults can guide design optimization and expedite design closure. In this\nwork, we propose an imbalanced large graph learning framework, ImLG, for\nprediction of whether logic elements will be packed after placement.\nSpecifically, we propose dedicated feature extraction and feature aggregation\nmethods to enhance the node representation learning of circuit graphs. With\nimbalanced distribution of packed and unpacked logic elements, we further\npropose techniques such as graph oversampling and mini-batch training for this\nimbalanced learning task in large circuit graphs. Experimental results\ndemonstrate that our framework can improve the F1 score by 42.82% compared to\nthe most recent Gaussian-based prediction method. Physical design results show\nthat the proposed method can assist the placer in improving routed wirelength\nby 0.93% and SLICE occupation by 0.89%.\n","authors":["Zhixiong Di","Runzhe Tao","Lin Chen","Qiang Wu","Yibo Lin"],"pdf_url":"https://arxiv.org/pdf/2308.03231v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03230v1","updated":"2023-08-07T00:14:46Z","published":"2023-08-07T00:14:46Z","title":"Tractability of approximation by general shallow networks","summary":" In this paper, we present a sharper version of the results in the paper\nDimension independent bounds for general shallow networks; Neural Networks,\n\\textbf{123} (2020), 142-152. Let $\\mathbb{X}$ and $\\mathbb{Y}$ be compact\nmetric spaces. We consider approximation of functions of the form $\nx\\mapsto\\int_{\\mathbb{Y}} G( x, y)d\\tau( y)$, $ x\\in\\mathbb{X}$, by\n$G$-networks of the form $ x\\mapsto \\sum_{k=1}^n a_kG( x, y_k)$, $ y_1,\\cdots,\ny_n\\in\\mathbb{Y}$, $a_1,\\cdots, a_n\\in\\mathbb{R}$. Defining the dimensions of\n$\\mathbb{X}$ and $\\mathbb{Y}$ in terms of covering numbers, we obtain dimension\nindependent bounds on the degree of approximation in terms of $n$, where also\nthe constants involved are all dependent at most polynomially on the\ndimensions. Applications include approximation by power rectified linear unit\nnetworks, zonal function networks, certain radial basis function networks as\nwell as the important problem of function extension to higher dimensional\nspaces.\n","authors":["Hrushikesh Mhaskar","Tong Mao"],"pdf_url":"https://arxiv.org/pdf/2308.03230v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03960v1","updated":"2023-08-07T23:52:03Z","published":"2023-08-07T23:52:03Z","title":"Amortized Global Search for Efficient Preliminary Trajectory Design with\n Deep Generative Models","summary":" Preliminary trajectory design is a global search problem that seeks multiple\nqualitatively different solutions to a trajectory optimization problem. Due to\nits high dimensionality and non-convexity, and the frequent adjustment of\nproblem parameters, the global search becomes computationally demanding. In\nthis paper, we exploit the clustering structure in the solutions and propose an\namortized global search (AmorGS) framework. We use deep generative models to\npredict trajectory solutions that share similar structures with previously\nsolved problems, which accelerates the global search for unseen parameter\nvalues. 
Our method is evaluated using De Jong's 5th function and a low-thrust\ncircular restricted three-body problem.\n","authors":["Anjian Li","Amlan Sinha","Ryne Beeson"],"pdf_url":"https://arxiv.org/pdf/2308.03960v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03956v1","updated":"2023-08-07T23:46:14Z","published":"2023-08-07T23:46:14Z","title":"Fixed Inter-Neuron Covariability Induces Adversarial Robustness","summary":" The vulnerability to adversarial perturbations is a major flaw of Deep Neural\nNetworks (DNNs) that raises questions about their reliability in real-world\nscenarios. On the other hand, human perception, which DNNs are supposed to\nemulate, is highly robust to such perturbations, indicating that there may be\ncertain features of human perception that make it robust but are not\nrepresented in the current class of DNNs. One such feature is that the activity\nof biological neurons is correlated and the structure of this correlation tends\nto be rather rigid over long spans of time, even if it hampers performance and\nlearning. We hypothesize that integrating such constraints on the activations\nof a DNN would improve its adversarial robustness, and, to test this\nhypothesis, we have developed the Self-Consistent Activation (SCA) layer, which\ncomprises neurons whose activations are consistent with each other, as they\nconform to a fixed, but learned, covariability pattern. When evaluated on image\nand sound recognition tasks, the models with an SCA layer achieved high\naccuracy, and exhibited significantly greater robustness than multi-layer\nperceptron models to state-of-the-art Auto-PGD adversarial attacks\n\textit{without being trained on adversarially perturbed data}.\n","authors":["Muhammad Ahmed Shah","Bhiksha Raj"],"pdf_url":"https://arxiv.org/pdf/2308.03956v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03953v1","updated":"2023-08-07T23:44:35Z","published":"2023-08-07T23:44:35Z","title":"PMU measurements based short-term voltage stability assessment of power\n systems via deep transfer learning","summary":" Deep learning has emerged as an effective solution for addressing the\nchallenges of short-term voltage stability assessment (STVSA) in power systems.\nHowever, existing deep learning-based STVSA approaches face limitations in\nadapting to topological changes, sample labeling, and handling small datasets.\nTo overcome these challenges, this paper proposes a novel phasor measurement\nunit (PMU) measurements-based STVSA method by using deep transfer learning. The\nmethod leverages the real-time dynamic information captured by PMUs to create\nan initial dataset. It employs temporal ensembling for sample labeling and\nutilizes least squares generative adversarial networks (LSGAN) for data\naugmentation, enabling effective deep learning on small-scale datasets.\nAdditionally, the method enhances adaptability to topological changes by\nexploring connections between different faults. Experimental results on the\nIEEE 39-bus test system demonstrate that the proposed method improves model\nevaluation accuracy by approximately 20% through transfer learning, exhibiting\nstrong adaptability to topological changes. 
Leveraging the self-attention\nmechanism of the Transformer model, this approach offers significant advantages\nover shallow learning methods and other deep learning-based approaches.\n","authors":["Yang Li","Shitu Zhang","Yuanzheng Li","Jiting Cao","Shuyue Jia"],"pdf_url":"https://arxiv.org/pdf/2308.03953v1.pdf","comment":"Accepted by IEEE Transactions on Instrumentation & Measurement"},{"id":"http://arxiv.org/abs/2308.03945v1","updated":"2023-08-07T23:27:20Z","published":"2023-08-07T23:27:20Z","title":"The Prospect of Enhancing Large-Scale Heterogeneous Federated Learning\n with Transformers","summary":" Federated learning (FL) addresses data privacy concerns by enabling\ncollaborative training of AI models across distributed data owners. Wide\nadoption of FL faces the fundamental challenges of data heterogeneity and the\nlarge scale of data owners involved. In this paper, we investigate the prospect\nof Transformer-based FL models for achieving generalization and personalization\nin this setting. We conduct extensive comparative experiments involving FL with\nTransformers, ResNet, and personalized ResNet-based FL approaches under various\nscenarios. These experiments consider varying numbers of data owners to\ndemonstrate Transformers' advantages over deep neural networks in large-scale\nheterogeneous FL tasks. In addition, we analyze the superior performance of\nTransformers by comparing the Centered Kernel Alignment (CKA) representation\nsimilarity across different layers and FL models to gain insight into the\nreasons behind their promising capabilities.\n","authors":["Yulan Gao","Hao Sun","Zengxiang Li","Han Yu"],"pdf_url":"https://arxiv.org/pdf/2308.03945v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03944v1","updated":"2023-08-07T23:19:34Z","published":"2023-08-07T23:19:34Z","title":"GraPhSyM: Graph Physical Synthesis Model","summary":" In this work, we introduce GraPhSyM, a Graph Attention Network (GATv2) model\nfor fast and accurate estimation of post-physical synthesis circuit delay and\narea metrics from pre-physical synthesis circuit netlists. Once trained,\nGraPhSyM provides accurate visibility of final design metrics to early EDA\nstages, such as logic synthesis, without running the slow physical synthesis\nflow, enabling global co-optimization across stages. Additionally, the swift\nand precise feedback provided by GraPhSym is instrumental for\nmachine-learning-based EDA optimization frameworks. Given a gate-level netlist\nof a circuit represented as a graph, GraPhSyM utilizes graph structure,\nconnectivity, and electrical property features to predict the impact of\nphysical synthesis transformations such as buffer insertion and gate sizing.\nWhen trained on a dataset of 6000 prefix adder designs synthesized at an\naggressive delay target, GraPhSyM can accurately predict the post-synthesis\ndelay (98.3%) and area (96.1%) metrics of unseen adders with a fast 0.22s\ninference time. Furthermore, we illustrate the compositionality of GraPhSyM by\nemploying the model trained on a fixed delay target to accurately anticipate\npost-synthesis metrics at a variety of unseen delay targets. 
Lastly, we report\npromising generalization capabilities of the GraPhSyM model when it is\nevaluated on circuits different from the adders it was exclusively trained on.\nThe results show the potential for GraPhSyM to serve as a powerful tool for\nadvanced optimization techniques and as an oracle for EDA machine learning\nframeworks.\n","authors":["Ahmed Agiza","Rajarshi Roy","Teodor Dumitru Ene","Saad Godil","Sherief Reda","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2308.03944v1.pdf","comment":"Accepted at ICCAD'23"},{"id":"http://arxiv.org/abs/2308.00824v2","updated":"2023-08-07T22:47:33Z","published":"2023-08-01T20:22:53Z","title":"An Exact Kernel Equivalence for Finite Classification Models","summary":" We explore the equivalence between neural networks and kernel methods by\nderiving the first exact representation of any finite-size parametric\nclassification model trained with gradient descent as a kernel machine. We\ncompare our exact representation to the well-known Neural Tangent Kernel (NTK)\nand discuss approximation error relative to the NTK and other non-exact path\nkernel formulations. We experimentally demonstrate that the kernel can be\ncomputed for realistic networks up to machine precision. We use this exact\nkernel to show that our theoretical contribution can provide useful insights\ninto the predictions made by neural networks, particularly the way in which\nthey generalize.\n","authors":["Brian Bell","Michael Geyer","David Glickenstein","Amanda Fernandez","Juston Moore"],"pdf_url":"https://arxiv.org/pdf/2308.00824v2.pdf","comment":"TAG-ML at ICML 2023 in Proceedings. 8 pages, 6 figures, proofs in\n Appendix"},{"id":"http://arxiv.org/abs/2204.01248v2","updated":"2023-08-07T22:21:24Z","published":"2022-04-04T05:27:40Z","title":"Differentiable Rendering for Synthetic Aperture Radar Imagery","summary":" There is rising interest in differentiable rendering, which allows explicitly\nmodeling geometric priors and constraints in optimization pipelines using\nfirst-order methods such as backpropagation. Incorporating such domain\nknowledge can lead to deep neural networks that are trained more robustly and\nwith limited data, as well as the capability to solve ill-posed inverse\nproblems. Existing efforts in differentiable rendering have focused on imagery\nfrom electro-optical sensors, particularly conventional RGB-imagery. In this\nwork, we propose an approach for differentiable rendering of Synthetic Aperture\nRadar (SAR) imagery, which combines methods from 3D computer graphics with\nneural rendering. We demonstrate the approach on the inverse graphics problem\nof 3D Object Reconstruction from limited SAR imagery using high-fidelity\nsimulated SAR data.\n","authors":["Michael Wilmanski","Jonathan Tamir"],"pdf_url":"https://arxiv.org/pdf/2204.01248v2.pdf","comment":"This version of the manuscript is an updated preprint which has been\n recently accepted by IEEE Transactions on Aerospace Electronic Systems, but\n has not yet been published or processed by IEEE"},{"id":"http://arxiv.org/abs/2308.03928v1","updated":"2023-08-07T22:12:48Z","published":"2023-08-07T22:12:48Z","title":"Optimizing the switching operation in monoclonal antibody production:\n Economic MPC and reinforcement learning","summary":" Monoclonal antibodies (mAbs) have emerged as indispensable assets in\nmedicine, and are currently at the forefront of biopharmaceutical product\ndevelopment. 
However, the growing market demand and the substantial doses\nrequired for mAb clinical treatments necessitate significant progress in its\nlarge-scale production. Most of the processes for industrial mAb production\nrely on batch operations, which result in significant downtime. The shift\ntowards a fully continuous and integrated manufacturing process holds the\npotential to boost product yield and quality, while eliminating the extra\nexpenses associated with storing intermediate products. The integrated\ncontinuous mAb production process can be divided into the upstream and\ndownstream processes. One crucial aspect that ensures the continuity of the\nintegrated process is the switching of the capture columns, which are typically\nchromatography columns operated in a fed-batch manner downstream. Due to the\ndiscrete nature of the switching operation, advanced process control algorithms\nsuch as economic MPC (EMPC) are computationally difficult to implement. This is\nbecause an integer nonlinear program (INLP) needs to be solved online at each\nsampling time. This paper introduces two computationally-efficient approaches\nfor EMPC implementation, namely, a sigmoid function approximation approach and\na rectified linear unit (ReLU) approximation approach. It also explores the\napplication of deep reinforcement learning (DRL). These three methods are\ncompared to the traditional switching approach, which is based on a 1% product\nbreakthrough rule and which involves no optimization.\n","authors":["Sandra A. Obiri","Song Bo","Bernard T. Agyeman","Benjamin Decardi-Nelson","Jinfeng Liu"],"pdf_url":"https://arxiv.org/pdf/2308.03928v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.10331v3","updated":"2023-08-07T22:07:04Z","published":"2023-02-20T21:54:25Z","title":"Causal Razors","summary":" When performing causal discovery, assumptions have to be made on how the true\ncausal mechanism corresponds to the underlying joint probability distribution.\nThese assumptions are labeled as causal razors in this work. We review numerous\ncausal razors that appeared in the literature, and offer a comprehensive\nlogical comparison of them. In particular, we scrutinize an unpopular causal\nrazor, namely parameter minimality, in multinomial causal models and its\nlogical relations with other well-studied causal razors. Our logical result\nposes a dilemma in selecting a reasonable scoring criterion for score-based\ncausal search algorithms.\n","authors":["Wai-yin Lam"],"pdf_url":"https://arxiv.org/pdf/2302.10331v3.pdf","comment":"29 pages for the main paper. 14 pages for the supplementary materials"},{"id":"http://arxiv.org/abs/2308.02013v2","updated":"2023-08-07T21:34:44Z","published":"2023-08-03T20:08:23Z","title":"Federated Representation Learning for Automatic Speech Recognition","summary":" Federated Learning (FL) is a privacy-preserving paradigm, allowing edge\ndevices to learn collaboratively without sharing data. Edge devices like Alexa\nand Siri are prospective sources of unlabeled audio data that can be tapped to\nlearn robust audio representations. In this work, we bring Self-supervised\nLearning (SSL) and FL together to learn representations for Automatic Speech\nRecognition respecting data privacy constraints. We use the speaker and chapter\ninformation in the unlabeled speech dataset, Libri-Light, to simulate non-IID\nspeaker-siloed data distributions and pre-train an LSTM encoder with the\nContrastive Predictive Coding framework with FedSGD. 
We show that the\npre-trained ASR encoder in FL performs as well as a centrally pre-trained model\nand produces an improvement of 12-15% (WER) compared to no pre-training. We\nfurther adapt the federated pre-trained models to a new language, French, and\nshow a 20% (WER) improvement over no pre-training.\n","authors":["Guruprasad V Ramesh","Gopinath Chennupati","Milind Rao","Anit Kumar Sahu","Ariya Rastrow","Jasha Droppo"],"pdf_url":"https://arxiv.org/pdf/2308.02013v2.pdf","comment":"Accepted at ISCA SPSC Symposium 3rd Symposium on Security and Privacy\n in Speech Communication, 2023"},{"id":"http://arxiv.org/abs/2308.03915v1","updated":"2023-08-07T21:20:24Z","published":"2023-08-07T21:20:24Z","title":"Predicting and explaining nonlinear material response using deep\n Physically Guided Neural Networks with Internal Variables","summary":" Nonlinear materials are often difficult to model with classical state model\ntheory because they have a complex and sometimes inaccurate physical and\nmathematical description or we simply do not know how to describe such\nmaterials in terms of relations between external and internal variables. In\nmany disciplines, Neural Network methods have arisen as powerful tools to\nidentify very complex and non-linear correlations. In this work, we use the\nvery recently developed concept of Physically Guided Neural Networks with\nInternal Variables (PGNNIV) to discover constitutive laws using a model-free\napproach and training solely with measured force-displacement data. PGNNIVs\nmake a particular use of the physics of the problem to enforce constraints on\nspecific hidden layers and are able to make predictions without internal\nvariable data. We demonstrate that PGNNIVs are capable of predicting both\ninternal and external variables under unseen load scenarios, regardless of the\nnature of the material considered (linear, with hardening or softening behavior\nand hyperelastic), unravelling the constitutive law of the material hence\nexplaining its nature altogether, placing the method in what is known as\neXplainable Artificial Intelligence (XAI).\n","authors":["Javier Orera-Echeverria","Jacobo Ayensa-Jiménez","Manuel Doblare"],"pdf_url":"https://arxiv.org/pdf/2308.03915v1.pdf","comment":"Main text: 25 pages, 6 figures. Appendices: 13 pages, 12 figures"},{"id":"http://arxiv.org/abs/2112.04629v4","updated":"2023-08-07T21:06:18Z","published":"2021-12-09T00:08:09Z","title":"Transferability Properties of Graph Neural Networks","summary":" Graph neural networks (GNNs) are composed of layers consisting of graph\nconvolutions and pointwise nonlinearities. Due to their invariance and\nstability properties, GNNs are provably successful at learning representations\nfrom data supported on moderate-scale graphs. However, they are difficult to\nlearn on large-scale graphs. In this paper, we study the problem of training\nGNNs on graphs of moderate size and transferring them to large-scale graphs. We\nuse graph limits called graphons to define limit objects for graph filters and\nGNNs -- graphon filters and graphon neural networks (WNNs) -- which we\ninterpret as generative models for graph filters and GNNs. We then show that\ngraphon filters and WNNs can be approximated by graph filters and GNNs sampled\nfrom them on weighted and stochastic graphs. 
Because the error of these\napproximations can be upper bounded, by a triangle inequality argument we can\nfurther bound the error of transferring a graph filter or a GNN across graphs.\nOur results show that (i) the transference error decreases with the graph size,\nand (ii) that graph filters have a transferability-discriminability tradeoff\nthat in GNNs is alleviated by the scattering behavior of the nonlinearity.\nThese findings are demonstrated empirically in a movie recommendation problem\nand in a decentralized control task.\n","authors":["Luana Ruiz","Luiz F. O. Chamon","Alejandro Ribeiro"],"pdf_url":"https://arxiv.org/pdf/2112.04629v4.pdf","comment":"IEEE TSP"},{"id":"http://arxiv.org/abs/2308.03908v1","updated":"2023-08-07T20:50:54Z","published":"2023-08-07T20:50:54Z","title":"ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings\n for Video Action Recognition","summary":" Video Action Recognition (VAR) is a challenging task due to its inherent\ncomplexities. Though different approaches have been explored in the literature,\ndesigning a unified framework to recognize a large number of human actions is\nstill a challenging problem. Recently, Multi-Modal Learning (MML) has\ndemonstrated promising results in this domain. In literature, 2D skeleton or\npose modality has often been used for this task, either independently or in\nconjunction with the visual information (RGB modality) present in videos.\nHowever, the combination of pose, visual information, and text attributes has\nnot been explored yet, though text and pose attributes independently have been\nproven to be effective in numerous computer vision tasks. In this paper, we\npresent the first pose augmented Vision-language model (VLM) for VAR. Notably,\nour scheme achieves an accuracy of 92.81% and 73.02% on two popular human video\naction recognition benchmark datasets, UCF-101 and HMDB-51, respectively, even\nwithout any video data pre-training, and an accuracy of 96.11% and 75.75% after\nkinetics pre-training.\n","authors":["Soumyabrata Chaudhuri","Saumik Bhattacharya"],"pdf_url":"https://arxiv.org/pdf/2308.03908v1.pdf","comment":"7 pages, 3 figures, 2 Tables"},{"id":"http://arxiv.org/abs/2308.03907v1","updated":"2023-08-07T20:50:48Z","published":"2023-08-07T20:50:48Z","title":"Advancements In Crowd-Monitoring System: A Comprehensive Analysis of\n Systematic Approaches and Automation Algorithms: State-of-The-Art","summary":" Growing apprehensions surrounding public safety have captured the attention\nof numerous governments and security agencies across the globe. These entities\nare increasingly acknowledging the imperative need for reliable and secure\ncrowd-monitoring systems to address these concerns. Effectively managing human\ngatherings necessitates proactive measures to prevent unforeseen events or\ncomplications, ensuring a safe and well-coordinated environment. The scarcity\nof research focusing on crowd monitoring systems and their security\nimplications has given rise to a burgeoning area of investigation, exploring\npotential approaches to safeguard human congregations effectively. Crowd\nmonitoring systems depend on a bifurcated approach, encompassing vision-based\nand non-vision-based technologies. An in-depth analysis of these two\nmethodologies will be conducted in this research. The efficacy of these\napproaches is contingent upon the specific environment and temporal context in\nwhich they are deployed, as they each offer distinct advantages. 
This paper\nendeavors to present an in-depth analysis of the recent incorporation of\nartificial intelligence (AI) algorithms and models into automated systems,\nemphasizing their contemporary applications and effectiveness in various\ncontexts.\n","authors":["Mohammed Ameen","Richard Stone"],"pdf_url":"https://arxiv.org/pdf/2308.03907v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03905v1","updated":"2023-08-07T20:43:42Z","published":"2023-08-07T20:43:42Z","title":"Intelligent Assistant Language Understanding On Device","summary":" It has recently become feasible to run personal digital assistants on phones\nand other personal devices. In this paper we describe a design for a natural\nlanguage understanding system that runs on device. In comparison to a\nserver-based assistant, this system is more private, more reliable, faster,\nmore expressive, and more accurate. We describe what led to key choices about\narchitecture and technologies. For example, some approaches in the dialog\nsystems literature are difficult to maintain over time in a deployment setting.\nWe hope that sharing learnings from our practical experiences may help inform\nfuture work in the research community.\n","authors":["Cecilia Aas","Hisham Abdelsalam","Irina Belousova","Shruti Bhargava","Jianpeng Cheng","Robert Daland","Joris Driesen","Federico Flego","Tristan Guigue","Anders Johannsen","Partha Lal","Jiarui Lu","Joel Ruben Antony Moniz","Nathan Perkins","Dhivya Piraviperumal","Stephen Pulman","Diarmuid Ó Séaghdha","David Q. Sun","John Torr","Marco Del Vecchio","Jay Wacker","Jason D. Williams","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2308.03905v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03904v1","updated":"2023-08-07T20:41:19Z","published":"2023-08-07T20:41:19Z","title":"On genuine invariance learning without weight-tying","summary":" In this paper, we investigate properties and limitations of invariance\nlearned by neural networks from the data compared to the genuine invariance\nachieved through invariant weight-tying. To do so, we adopt a group theoretical\nperspective and analyze invariance learning in neural networks without\nweight-tying constraints. We demonstrate that even when a network learns to\ncorrectly classify samples on a group orbit, the underlying decision-making in\nsuch a model does not attain genuine invariance. Instead, learned invariance is\nstrongly conditioned on the input data, rendering it unreliable if the input\ndistribution shifts. We next demonstrate how to guide invariance learning\ntoward genuine invariance by regularizing the invariance of a model during\ntraining. To this end, we propose several metrics to quantify learned\ninvariance: (i) predictive distribution invariance, (ii) logit invariance, and\n(iii) saliency invariance similarity. We show that the invariance learned with\nthe invariance error regularization closely resembles the genuine invariance\nof weight-tying models and reliably holds even under a severe input\ndistribution shift. Closer analysis of the learned invariance also reveals the\nspectral decay phenomenon, in which a network chooses to achieve the invariance to\na specific transformation group by reducing the sensitivity to any input\nperturbation.\n","authors":["Artem Moskalev","Anna Sepliarskaia","Erik J. 
Bekkers","Arnold Smeulders"],"pdf_url":"https://arxiv.org/pdf/2308.03904v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03901v1","updated":"2023-08-07T20:28:22Z","published":"2023-08-07T20:28:22Z","title":"FLIPS: Federated Learning using Intelligent Participant Selection","summary":" This paper presents the design and implementation of FLIPS, a middleware\nsystem to manage data and participant heterogeneity in federated learning (FL)\ntraining workloads. In particular, we examine the benefits of label\ndistribution clustering on participant selection in federated learning. FLIPS\nclusters parties involved in an FL training job based on the label distribution\nof their data apriori, and during FL training, ensures that each cluster is\nequitably represented in the participants selected. FLIPS can support the most\ncommon FL algorithms, including FedAvg, FedProx, FedDyn, FedOpt and FedYogi. To\nmanage platform heterogeneity and dynamic resource availability, FLIPS\nincorporates a straggler management mechanism to handle changing capacities in\ndistributed, smart community applications. Privacy of label distributions,\nclustering and participant selection is ensured through a trusted execution\nenvironment (TEE). Our comprehensive empirical evaluation compares FLIPS with\nrandom participant selection, as well as two other \"smart\" selection mechanisms\n- Oort and gradient clustering using two real-world datasets, two different\nnon-IID distributions and three common FL algorithms (FedYogi, FedProx and\nFedAvg). We demonstrate that FLIPS significantly improves convergence,\nachieving higher accuracy by 17 - 20 % with 20 - 60 % lower communication\ncosts, and these benefits endure in the presence of straggler participants.\n","authors":["Rahul Atul Bhope","K. R. Jayaram","Nalini Venkatasubramanian","Ashish Verma","Gegi Thomas"],"pdf_url":"https://arxiv.org/pdf/2308.03901v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08496v2","updated":"2023-08-07T20:27:19Z","published":"2023-07-17T13:59:07Z","title":"Can We Trust Race Prediction?","summary":" In the absence of sensitive race and ethnicity data, researchers, regulators,\nand firms alike turn to proxies. In this paper, I train a Bidirectional Long\nShort-Term Memory (BiLSTM) model on a novel dataset of voter registration data\nfrom all 50 US states and create an ensemble that achieves up to 36.8% higher\nout of sample (OOS) F1 scores than the best performing machine learning models\nin the literature. Additionally, I construct the most comprehensive database of\nfirst and surname distributions in the US in order to improve the coverage and\naccuracy of Bayesian Improved Surname Geocoding (BISG) and Bayesian Improved\nFirstname Surname Geocoding (BIFSG). Finally, I provide the first high-quality\nbenchmark dataset in order to fairly compare existing models and aid future\nmodel developers.\n","authors":["Cangyuan Li"],"pdf_url":"https://arxiv.org/pdf/2307.08496v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13452v3","updated":"2023-08-07T19:57:38Z","published":"2023-05-22T19:52:08Z","title":"Measuring and Modeling Physical Intrinsic Motivation","summary":" Humans are interactive agents driven to seek out situations with interesting\nphysical dynamics. Here we formalize the functional form of physical intrinsic\nmotivation. We first collect ratings of how interesting humans find a variety\nof physics scenarios. 
We then model human interestingness responses by\nimplementing various hypotheses of intrinsic motivation, ranging from models that\nrely on simple scene features to models that depend on forward physics\nprediction. We find that the single best predictor of human responses is\nadversarial reward, a model derived from physical prediction loss. We also find\nthat simple scene feature models do not generalize their prediction of human\nresponses across all scenarios. Finally, linearly combining the adversarial\nmodel with the number of collisions in a scene leads to the greatest\nimprovement in predictivity of human responses, suggesting humans are driven\ntowards scenarios that result in high information gain and physical activity.\n","authors":["Julio Martinez","Felix Binder","Haoliang Wang","Nick Haber","Judith Fan","Daniel L. K. Yamins"],"pdf_url":"https://arxiv.org/pdf/2305.13452v3.pdf","comment":"6 pages, 5 figures, accepted to CogSci 2023 with full paper\n publication in the proceedings"},{"id":"http://arxiv.org/abs/2305.02640v3","updated":"2023-08-07T19:55:10Z","published":"2023-05-04T08:20:37Z","title":"Towards Causal Representation Learning and Deconfounding from Indefinite\n Data","summary":" We redefine causal data from two novel perspectives: the number of causal\nskeletons and the dimension of causal variables, thereby proposing three data\nparadigms. Among them, the indefinite data (like dialogues or video sources) is\ncharacterized by multi-skeleton structures and multi-value variables. Multiple\nskeletons induce low sample utilization, and multi-value variables rule out the usual\ndistribution assumptions, both leading to the fact that learning causal\nrepresentation from indefinite data is, as of yet, largely unexplored. We\ndesign the causal strength variational model to address these two problems.\nSpecifically, we leverage the causal strength instead of independent noise as\nthe latent variable to construct the evidence lower bound. By this design ethos,\nthe causal strengths of different skeletons are regarded as a distribution and\ncan be expressed as a single-valued causal graph matrix. Moreover, considering\nthe latent confounders, we disentangle the causal graph G into two relation\nsubgraphs O and C. O contains pure relations between observed variables, while\nC represents the relations from latent variables to observed variables. We\nimplement the above designs as a dynamic variational inference model, tailored\nto learn causal representation from indefinite data under latent confounding.\nFinally, we conduct comprehensive experiments on synthetic and real-world data\nto demonstrate the effectiveness of our method.\n","authors":["Hang Chen","Xinyu Yang","Qing Yang"],"pdf_url":"https://arxiv.org/pdf/2305.02640v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03892v1","updated":"2023-08-07T19:51:10Z","published":"2023-08-07T19:51:10Z","title":"Scalable and Equitable Math Problem Solving Strategy Prediction in Big\n Educational Data","summary":" Understanding a student's problem-solving strategy can have a significant\nimpact on effective math learning using Intelligent Tutoring Systems (ITSs) and\nAdaptive Instructional Systems (AISs). For instance, the ITS/AIS can better\npersonalize itself to correct specific misconceptions that are indicated by\nincorrect strategies, specific problems can be designed to improve strategies,\nand frustration can be minimized by adapting to a student's natural way of\nthinking rather than trying to fit a standard strategy for all. 
While it may be\npossible for human experts to identify strategies manually in classroom\nsettings with sufficient student interaction, it is not possible to scale this\nup to big data. Therefore, we leverage advances in Machine Learning and AI\nmethods to perform scalable strategy prediction that is also fair to students\nat all skill levels. Specifically, we develop an embedding called MVec where we\nlearn a representation based on the mastery of students. We then cluster these\nembeddings with a non-parametric clustering method where we progressively learn\nclusters such that we group together instances that have approximately\nsymmetrical strategies. The strategy prediction model is trained on instances\nsampled from these clusters. This ensures that we train the model over diverse\nstrategies and also that strategies from a particular group do not bias the DNN\nmodel, thus allowing it to optimize its parameters over all groups. Using real\nworld large-scale student interaction datasets from MATHia, we implement our\napproach using transformers and Node2Vec for learning the mastery embeddings\nand LSTMs for predicting strategies. We show that our approach can scale up to\nachieve high accuracy by training on a small sample of a large dataset and also\nhas predictive equality, i.e., it can predict strategies equally well for\nlearners at diverse skill levels.\n","authors":["Anup Shakya","Vasile Rus","Deepak Venugopal"],"pdf_url":"https://arxiv.org/pdf/2308.03892v1.pdf","comment":"12 pages, 7 figures Published as a full paper in the 16th\n International Conference on Educational Data Mining 2023"},{"id":"http://arxiv.org/abs/2301.00790v3","updated":"2023-08-07T19:44:14Z","published":"2022-12-30T17:19:00Z","title":"Online learning techniques for prediction of temporal tabular datasets\n with regime changes","summary":" The application of deep learning to non-stationary temporal datasets can lead\nto overfitted models that underperform under regime changes. In this work, we\npropose a modular machine learning pipeline for ranking predictions on temporal\npanel datasets which is robust under regime changes. The modularity of the\npipeline allows the use of different models, including Gradient Boosting\nDecision Trees (GBDTs) and Neural Networks, with and without feature\nengineering. We evaluate our framework on financial data for stock portfolio\nprediction, and find that GBDT models with dropout display high performance,\nrobustness and generalisability with reduced complexity and computational cost.\nWe then demonstrate how online learning techniques, which require no retraining\nof models, can be used post-prediction to enhance the results. First, we show\nthat dynamic feature projection improves robustness by reducing drawdown in\nregime changes. Second, we demonstrate that dynamical model ensembling based on\nselection of models with good recent performance leads to improved Sharpe and\nCalmar ratios of out-of-sample predictions. 
We also evaluate the robustness of\nour pipeline across different data splits and random seeds with good\nreproducibility.\n","authors":["Thomas Wong","Mauricio Barahona"],"pdf_url":"https://arxiv.org/pdf/2301.00790v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03883v1","updated":"2023-08-07T19:26:09Z","published":"2023-08-07T19:26:09Z","title":"Generative Benchmark Creation for Table Union Search","summary":" Data management has traditionally relied on synthetic data generators to\ngenerate structured benchmarks, like the TPC suite, where we can control\nimportant parameters like data size and its distribution precisely. These\nbenchmarks were central to the success and adoption of database management\nsystems. But more and more, data management problems are of a semantic nature.\nAn important example is finding tables that can be unioned. While any two\ntables with the same cardinality can be unioned, table union search is the\nproblem of finding tables whose union is semantically coherent. Semantic\nproblems cannot be benchmarked using synthetic data. Our current methods for\ncreating benchmarks involve the manual curation and labeling of real data.\nThese methods are not robust or scalable and, perhaps more importantly, it is\nnot clear how robust the created benchmarks are. We propose to use generative\nAI models to create structured data benchmarks for table union search. We\npresent a novel method for using generative models to create tables with\nspecified properties. Using this method, we create a new benchmark containing\npairs of tables that are unionable as well as pairs that are non-unionable but related. We\nthoroughly evaluate recent existing table union search methods over existing\nbenchmarks and our new benchmark. We also present and evaluate a new table\nsearch method based on recent large language models over all benchmarks. We\nshow that the new benchmark is more challenging for all methods than\nhand-curated benchmarks; specifically, the top-performing method achieves a\nMean Average Precision of around 60%, over 30% less than its performance on\nexisting manually created benchmarks. We examine why this is the case and show\nthat the new benchmark permits more detailed analysis of methods, including a\nstudy of both false positives and false negatives that was not possible with\nexisting benchmarks.\n","authors":["Koyena Pal","Aamod Khatiwada","Roee Shraga","Renée J. Miller"],"pdf_url":"https://arxiv.org/pdf/2308.03883v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03882v1","updated":"2023-08-07T19:24:47Z","published":"2023-08-07T19:24:47Z","title":"Exploiting Generalization in Offline Reinforcement Learning via Unseen\n State Augmentations","summary":" Offline reinforcement learning (RL) methods strike a balance between\nexploration and exploitation by conservative value estimation -- penalizing\nvalues of unseen states and actions. Model-free methods penalize values at all\nunseen actions, while model-based methods are able to further exploit unseen\nstates via model rollouts. However, such methods are handicapped in their\nability to find unseen states far away from the available offline data due to\ntwo factors -- (a) very short rollout horizons in models due to cascading model\nerrors, and (b) model rollouts originating solely from states observed in\noffline data. We relax the second assumption and present a novel unseen state\naugmentation strategy to allow exploitation of unseen states where the learned\nmodel and value estimates generalize. 
Our strategy finds unseen states by\nvalue-informed perturbations of seen states followed by filtering out states\nwith epistemic uncertainty estimates too high (high error) or too low (too\nsimilar to seen data). We observe improved performance in several offline RL\ntasks and find that our augmentation strategy consistently leads to overall\nlower average dataset Q-value estimates i.e. more conservative Q-value\nestimates than a baseline.\n","authors":["Nirbhay Modhe","Qiaozi Gao","Ashwin Kalyan","Dhruv Batra","Govind Thattai","Gaurav Sukhatme"],"pdf_url":"https://arxiv.org/pdf/2308.03882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03873v1","updated":"2023-08-07T18:50:57Z","published":"2023-08-07T18:50:57Z","title":"Evaluating and Explaining Large Language Models for Code Using Syntactic\n Structures","summary":" Large Language Models (LLMs) for code are a family of high-parameter,\ntransformer-based neural networks pre-trained on massive datasets of both\nnatural and programming languages. These models are rapidly being employed in\ncommercial AI-based developer tools, such as GitHub CoPilot. However, measuring\nand explaining their effectiveness on programming tasks is a challenging\nproposition, given their size and complexity. The methods for evaluating and\nexplaining LLMs for code are inextricably linked. That is, in order to explain\na model's predictions, they must be reliably mapped to fine-grained,\nunderstandable concepts. Once this mapping is achieved, new methods for\ndetailed model evaluations are possible. However, most current explainability\ntechniques and evaluation benchmarks focus on model robustness or individual\ntask performance, as opposed to interpreting model predictions.\n To this end, this paper introduces ASTxplainer, an explainability method\nspecific to LLMs for code that enables both new methods for LLM evaluation and\nvisualizations of LLM predictions that aid end-users in understanding model\npredictions. At its core, ASTxplainer provides an automated method for aligning\ntoken predictions with AST nodes, by extracting and aggregating normalized\nmodel logits within AST structures. To demonstrate the practical benefit of\nASTxplainer, we illustrate the insights that our framework can provide by\nperforming an empirical evaluation on 12 popular LLMs for code using a curated\ndataset of the most popular GitHub projects. Additionally, we perform a user\nstudy examining the usefulness of an ASTxplainer-derived visualization of model\npredictions aimed at enabling model users to explain predictions. The results\nof these studies illustrate the potential for ASTxplainer to provide insights\ninto LLM effectiveness, and aid end-users in understanding predictions.\n","authors":["David N Palacio","Alejandro Velasco","Daniel Rodriguez-Cardenas","Kevin Moran","Denys Poshyvanyk"],"pdf_url":"https://arxiv.org/pdf/2308.03873v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03869v1","updated":"2023-08-07T18:40:13Z","published":"2023-08-07T18:40:13Z","title":"Semantic Equivalence of e-Commerce Queries","summary":" Search query variation poses a challenge in e-commerce search, as equivalent\nsearch intents can be expressed through different queries with surface-level\ndifferences. This paper introduces a framework to recognize and leverage query\nequivalence to enhance searcher and business outcomes. 
The proposed approach\naddresses three key problems: mapping queries to vector representations of\nsearch intent, identifying nearest neighbor queries expressing equivalent or\nsimilar intent, and optimizing for user or business objectives. The framework\nutilizes both surface similarity and behavioral similarity to determine query\nequivalence. Surface similarity involves canonicalizing queries based on word\ninflection, word order, compounding, and noise words. Behavioral similarity\nleverages historical search behavior to generate vector representations of\nquery intent. An offline process is used to train a sentence similarity model,\nwhile an online nearest neighbor approach supports processing of unseen\nqueries. Experimental evaluations demonstrate the effectiveness of the proposed\napproach, outperforming popular sentence transformer models and achieving a\nPearson correlation of 0.85 for query similarity. The results highlight the\npotential of leveraging historical behavior data and training models to\nrecognize and utilize query equivalence in e-commerce search, leading to\nimproved user experiences and business outcomes. Further advancements and\nbenchmark datasets are encouraged to facilitate the development of solutions\nfor this critical problem in the e-commerce domain.\n","authors":["Aritra Mandal","Daniel Tunkelang","Zhe Wu"],"pdf_url":"https://arxiv.org/pdf/2308.03869v1.pdf","comment":"The 6th Workshop on e-Commerce and NLP"},{"id":"http://arxiv.org/abs/2308.03854v1","updated":"2023-08-07T18:04:12Z","published":"2023-08-07T18:04:12Z","title":"Revisiting Prompt Engineering via Declarative Crowdsourcing","summary":" Large language models (LLMs) are incredibly powerful at comprehending and\ngenerating data in the form of text, but are brittle and error-prone. There has\nbeen an advent of toolkits and recipes centered around so-called prompt\nengineering-the process of asking an LLM to do something via a series of\nprompts. However, for LLM-powered data processing workflows, in particular,\noptimizing for quality, while keeping cost bounded, is a tedious, manual\nprocess. We put forth a vision for declarative prompt engineering. We view LLMs\nlike crowd workers and leverage ideas from the declarative crowdsourcing\nliterature-including leveraging multiple prompting strategies, ensuring\ninternal consistency, and exploring hybrid-LLM-non-LLM approaches-to make\nprompt engineering a more principled process. Preliminary case studies on\nsorting, entity resolution, and imputation demonstrate the promise of our\napproach\n","authors":["Aditya G. Parameswaran","Shreya Shankar","Parth Asawa","Naman Jain","Yujie Wang"],"pdf_url":"https://arxiv.org/pdf/2308.03854v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03842v1","updated":"2023-08-07T18:00:04Z","published":"2023-08-07T18:00:04Z","title":"Search Engine and Recommendation System for the Music Industry built\n with JinaAI","summary":" One of the most intriguing debates regarding a novel task is the development\nof search engines and recommendation-based systems in the music industry.\nStudies have shown a drastic depression in the search engine fields, due to\nconcerning factors such as speed, accuracy and the format of data given for\nquerying. Often people face difficulty in searching for a song solely based on\nthe title, hence a solution is proposed to complete a search analysis through a\nsingle query input and is matched with the lyrics of the songs present in the\ndatabase. 
Hence it is essential to incorporate cutting-edge technology tools\nfor developing a user-friendly search engine. Jina AI is an MLOps framework for\nbuilding neural search engines that are utilized, in order for the user to\nobtain accurate results. Jina AI effectively helps to maintain and enhance the\nquality of performance for the search engine for the query given. An effective\nsearch engine and a recommendation system for the music industry, built with\nJinaAI.\n","authors":["Ishita Gopalakrishnan","Sanjjushri Varshini R","Ponshriharini V"],"pdf_url":"https://arxiv.org/pdf/2308.03842v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03825v1","updated":"2023-08-07T16:55:20Z","published":"2023-08-07T16:55:20Z","title":"\"Do Anything Now\": Characterizing and Evaluating In-The-Wild Jailbreak\n Prompts on Large Language Models","summary":" The misuse of large language models (LLMs) has garnered significant attention\nfrom the general public and LLM vendors. In response, efforts have been made to\nalign LLMs with human values and intent use. However, a particular type of\nadversarial prompts, known as jailbreak prompt, has emerged and continuously\nevolved to bypass the safeguards and elicit harmful content from LLMs. In this\npaper, we conduct the first measurement study on jailbreak prompts in the wild,\nwith 6,387 prompts collected from four platforms over six months. Leveraging\nnatural language processing technologies and graph-based community detection\nmethods, we discover unique characteristics of jailbreak prompts and their\nmajor attack strategies, such as prompt injection and privilege escalation. We\nalso observe that jailbreak prompts increasingly shift from public platforms to\nprivate ones, posing new challenges for LLM vendors in proactive detection. To\nassess the potential harm caused by jailbreak prompts, we create a question set\ncomprising 46,800 samples across 13 forbidden scenarios. Our experiments show\nthat current LLMs and safeguards cannot adequately defend jailbreak prompts in\nall scenarios. Particularly, we identify two highly effective jailbreak prompts\nwhich achieve 0.99 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and\nthey have persisted online for over 100 days. Our work sheds light on the\nsevere and evolving threat landscape of jailbreak prompts. We hope our study\ncan facilitate the research community and LLM vendors in promoting safer and\nregulated LLMs.\n","authors":["Xinyue Shen","Zeyuan Chen","Michael Backes","Yun Shen","Yang Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03825v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03821v1","updated":"2023-08-07T15:30:02Z","published":"2023-08-07T15:30:02Z","title":"Distributionally Robust Classification on a Data Budget","summary":" Real world uses of deep learning require predictable model behavior under\ndistribution shifts. Models such as CLIP show emergent natural distributional\nrobustness comparable to humans, but may require hundreds of millions of\ntraining samples. Can we train robust learners in a domain where data is\nlimited? To rigorously address this question, we introduce JANuS (Joint\nAnnotations and Names Set), a collection of four new training datasets with\nimages, labels, and corresponding captions, and perform a series of carefully\ncontrolled investigations of factors contributing to robustness in image\nclassification, then compare those results to findings derived from a\nlarge-scale meta-analysis. 
Using this approach, we show that standard ResNet-50\ntrained with the cross-entropy loss on 2.4 million image samples can attain\ncomparable robustness to a CLIP ResNet-50 trained on 400 million samples. To\nour knowledge, this is the first result showing (near) state-of-the-art\ndistributional robustness on limited data budgets. Our dataset is available at\n\\url{https://huggingface.co/datasets/penfever/JANuS_dataset}, and the code used\nto reproduce our experiments can be found at\n\\url{https://github.com/penfever/vlhub/}.\n","authors":["Benjamin Feuer","Ameya Joshi","Minh Pham","Chinmay Hegde"],"pdf_url":"https://arxiv.org/pdf/2308.03821v1.pdf","comment":"TMLR 2023; openreview link:\n https://openreview.net/forum?id=D5Z2E8CNsD"}],"Multimedia":[{"id":"http://arxiv.org/abs/2308.03703v1","updated":"2023-08-07T16:22:47Z","published":"2023-08-07T16:22:47Z","title":"Video-based Person Re-identification with Long Short-Term Representation\n Learning","summary":" Video-based person Re-Identification (V-ReID) aims to retrieve specific\npersons from raw videos captured by non-overlapped cameras. As a fundamental\ntask, it spreads many multimedia and computer vision applications. However, due\nto the variations of persons and scenes, there are still many obstacles that\nmust be overcome for high performance. In this work, we notice that both the\nlong-term and short-term information of persons are important for robust video\nrepresentations. Thus, we propose a novel deep learning framework named Long\nShort-Term Representation Learning (LSTRL) for effective V-ReID. More\nspecifically, to extract long-term representations, we propose a\nMulti-granularity Appearance Extractor (MAE), in which four granularity\nappearances are effectively captured across multiple frames. Meanwhile, to\nextract short-term representations, we propose a Bi-direction Motion Estimator\n(BME), in which reciprocal motion information is efficiently extracted from\nconsecutive frames. The MAE and BME are plug-and-play and can be easily\ninserted into existing networks for efficient feature learning. As a result,\nthey significantly improve the feature representation ability for V-ReID.\nExtensive experiments on three widely used benchmarks show that our proposed\napproach can deliver better performances than most state-of-the-arts.\n","authors":["Xuehu Liu","Pingping Zhang","Huchuan Lu"],"pdf_url":"https://arxiv.org/pdf/2308.03703v1.pdf","comment":"This work is accepted by ICIG2023, including 13 pages, 5 figures and\n 5 tables. Modifications may be performed for further improvements"},{"id":"http://arxiv.org/abs/2308.03643v1","updated":"2023-08-07T14:47:45Z","published":"2023-08-07T14:47:45Z","title":"Mamba: Bringing Multi-Dimensional ABR to WebRTC","summary":" Contemporary real-time video communication systems, such as WebRTC, use an\nadaptive bitrate (ABR) algorithm to assure high-quality and low-delay services,\ne.g., promptly adjusting video bitrate according to the instantaneous network\nbandwidth. However, target bitrate decisions in the network and bitrate control\nin the codec are typically incoordinated and simply ignoring the effect of\ninappropriate resolution and frame rate settings also leads to compromised\nresults in bitrate control, thus devastatingly deteriorating the quality of\nexperience (QoE). 
To tackle these challenges, Mamba, an end-to-end\nmulti-dimensional ABR algorithm is proposed, which utilizes multi-agent\nreinforcement learning (MARL) to maximize the user's QoE by adaptively and\ncollaboratively adjusting encoding factors including the quantization\nparameters (QP), resolution, and frame rate based on observed states such as\nnetwork conditions and video complexity information in a video conferencing\nsystem. We also introduce curriculum learning to improve the training\nefficiency of MARL. Both the in-lab and real-world evaluation results\ndemonstrate the remarkable efficacy of Mamba.\n","authors":["Yueheng Li","Zicheng Zhang","Hao Chen","Zhan Ma"],"pdf_url":"https://arxiv.org/pdf/2308.03643v1.pdf","comment":"In Proceedings of the 31st ACM International Conference on\n Multimedia, October 29-November 3, 2023, Ottawa, ON, Canada. ACM, New York,\n NY, USA, 9 pages"},{"id":"http://arxiv.org/abs/2308.03475v1","updated":"2023-08-07T11:05:59Z","published":"2023-08-07T11:05:59Z","title":"COPA: Efficient Vision-Language Pre-training Through Collaborative\n Object- and Patch-Text Alignment","summary":" Vision-Language Pre-training (VLP) methods based on object detection enjoy\nthe rich knowledge of fine-grained object-text alignment but at the cost of\ncomputationally expensive inference. Recent Visual-Transformer (ViT)-based\napproaches circumvent this issue while struggling with long visual sequences\nwithout detailed cross-modal alignment information. This paper introduces a\nViT-based VLP technique that efficiently incorporates object information\nthrough a novel patch-text alignment mechanism. Specifically, we convert\nobject-level signals into patch-level ones and devise a Patch-Text Alignment\npre-training task (PTA) to learn a text-aware patch detector. By using\noff-the-shelf delicate object annotations in 5\\% training images, we jointly\ntrain PTA with other conventional VLP objectives in an end-to-end manner,\nbypassing the high computational cost of object detection and yielding an\neffective patch detector that accurately detects text-relevant patches, thus\nconsiderably reducing patch sequences and accelerating computation within the\nViT backbone. Our experiments on a variety of widely-used benchmarks reveal\nthat our method achieves a speedup of nearly 88\\% compared to prior VLP models\nwhile maintaining competitive or superior performance on downstream tasks with\nsimilar model size and data scale.\n","authors":["Chaoya Jiang","Haiyang Xu","Wei Ye","Qinghao Ye","Chenliang Li","Ming Yan","Bin Bi","Shikun Zhang","Ji Zhang","Fei Huang"],"pdf_url":"https://arxiv.org/pdf/2308.03475v1.pdf","comment":"Accepted on ACM MM2023"},{"id":"http://arxiv.org/abs/2308.03463v1","updated":"2023-08-07T10:41:52Z","published":"2023-08-07T10:41:52Z","title":"DiffSynth: Latent In-Iteration Deflickering for Realistic Video\n Synthesis","summary":" In recent years, diffusion models have emerged as the most powerful approach\nin image synthesis. However, applying these models directly to video synthesis\npresents challenges, as it often leads to noticeable flickering contents.\nAlthough recently proposed zero-shot methods can alleviate flicker to some\nextent, we still struggle to generate coherent videos. In this paper, we\npropose DiffSynth, a novel approach that aims to convert image synthesis\npipelines to video synthesis pipelines. DiffSynth consists of two key\ncomponents: a latent in-iteration deflickering framework and a video\ndeflickering algorithm. 
The latent in-iteration deflickering framework applies\nvideo deflickering to the latent space of diffusion models, effectively\npreventing flicker accumulation in intermediate steps. Additionally, we propose\na video deflickering algorithm, named patch blending algorithm, that remaps\nobjects in different frames and blends them together to enhance video\nconsistency. One of the notable advantages of DiffSynth is its general\napplicability to various video synthesis tasks, including text-guided video\nstylization, fashion video synthesis, image-guided video stylization, video\nrestoring, and 3D rendering. In the task of text-guided video stylization, we\nmake it possible to synthesize high-quality videos without cherry-picking. The\nexperimental results demonstrate the effectiveness of DiffSynth. All videos can\nbe viewed on our project page. Source codes will also be released.\n","authors":["Zhongjie Duan","Lizhou You","Chengyu Wang","Cen Chen","Ziheng Wu","Weining Qian","Jun Huang","Fei Chao","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2308.03463v1.pdf","comment":"9 pages, 6 figures"},{"id":"http://arxiv.org/abs/2305.07176v2","updated":"2023-08-07T10:09:21Z","published":"2023-05-11T23:12:13Z","title":"Automatic Radiology Report Generation by Learning with Increasingly Hard\n Negatives","summary":" Automatic radiology report generation is challenging as medical images or\nreports are usually similar to each other due to the common content of anatomy.\nThis makes a model hard to capture the uniqueness of individual images and is\nprone to producing undesired generic or mismatched reports. This situation\ncalls for learning more discriminative features that could capture even\nfine-grained mismatches between images and reports. To achieve this, this paper\nproposes a novel framework to learn discriminative image and report features by\ndistinguishing them from their closest peers, i.e., hard negatives. Especially,\nto attain more discriminative features, we gradually raise the difficulty of\nsuch a learning task by creating increasingly hard negative reports for each\nimage in the feature space during training, respectively. By treating the\nincreasingly hard negatives as auxiliary variables, we formulate this process\nas a min-max alternating optimisation problem. At each iteration, conditioned\non a given set of hard negative reports, image and report features are learned\nas usual by minimising the loss functions related to report generation. After\nthat, a new set of harder negative reports will be created by maximising a loss\nreflecting image-report alignment. By solving this optimisation, we attain a\nmodel that can generate more specific and accurate reports. It is noteworthy\nthat our framework enhances discriminative feature learning without introducing\nextra network weights. Also, in contrast to the existing way of generating hard\nnegatives, our framework extends beyond the granularity of the dataset by\ngenerating harder samples out of the training set. 
Experimental study on\nbenchmark datasets verifies the efficacy of our framework and shows that it can\nserve as a plug-in to readily improve existing medical report generation\nmodels.\n","authors":["Bhanu Prakash Voutharoja","Lei Wang","Luping Zhou"],"pdf_url":"https://arxiv.org/pdf/2305.07176v2.pdf","comment":"Accepted to European Conference on Artificial Intelligence (ECAI)\n 2023"},{"id":"http://arxiv.org/abs/2308.03432v1","updated":"2023-08-07T09:26:36Z","published":"2023-08-07T09:26:36Z","title":"Cuing Without Sharing: A Federated Cued Speech Recognition Framework via\n Mutual Knowledge Distillation","summary":" Cued Speech (CS) is a visual coding tool to encode spoken languages at the\nphonetic level, which combines lip-reading and hand gestures to effectively\nassist communication among people with hearing impairments. The Automatic CS\nRecognition (ACSR) task aims to recognize CS videos into linguistic texts,\nwhich involves both lips and hands as two distinct modalities conveying\ncomplementary information. However, the traditional centralized training\napproach poses potential privacy risks due to the use of facial and gesture\nvideos in CS data. To address this issue, we propose a new Federated Cued\nSpeech Recognition (FedCSR) framework to train an ACSR model over the\ndecentralized CS data without sharing private information. In particular, a\nmutual knowledge distillation method is proposed to maintain cross-modal\nsemantic consistency of the Non-IID CS data, which ensures learning a unified\nfeature space for both linguistic and visual information. On the server side, a\nglobally shared linguistic model is trained to capture the long-term\ndependencies in the text sentences, which is aligned with the visual\ninformation from the local clients via visual-to-linguistic distillation. On\nthe client side, the visual model of each client is trained with its own local\ndata, assisted by linguistic-to-visual distillation treating the linguistic\nmodel as the teacher. To the best of our knowledge, this is the first approach\nto consider the federated ACSR task for privacy protection. Experimental\nresults on the Chinese CS dataset with multiple cuers demonstrate that our\napproach outperforms both mainstream federated learning baselines and existing\ncentralized state-of-the-art ACSR methods, achieving 9.7% performance\nimprovement for character error rate (CER) and 15.0% for word error rate (WER).\n","authors":["Yuxuan Zhang","Lei Liu","Li Liu"],"pdf_url":"https://arxiv.org/pdf/2308.03432v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03950v1","updated":"2023-08-07T23:41:55Z","published":"2023-08-07T23:41:55Z","title":"Zero-shot Skeleton-based Action Recognition via Mutual Information\n Estimation and Maximization","summary":" Zero-shot skeleton-based action recognition aims to recognize actions of\nunseen categories after training on data of seen categories. The key is to\nbuild the connection between visual and semantic space from seen to unseen\nclasses. Previous studies have primarily focused on encoding sequences into a\nsingular feature vector, with subsequent mapping the features to an identical\nanchor point within the embedded space. Their performance is hindered by 1) the\nignorance of the global visual/semantic distribution alignment, which results\nin a limitation to capture the true interdependence between the two spaces. 
2)\nthe negligence of temporal information since the frame-wise features with rich\naction clues are directly pooled into a single feature vector. We propose a new\nzero-shot skeleton-based action recognition method via mutual information (MI)\nestimation and maximization. Specifically, 1) we maximize the MI between visual\nand semantic space for distribution alignment; 2) we leverage the temporal\ninformation for estimating the MI by encouraging MI to increase as more frames\nare observed. Extensive experiments on three large-scale skeleton action\ndatasets confirm the effectiveness of our method. Code:\nhttps://github.com/YujieOuO/SMIE.\n","authors":["Yujie Zhou","Wenwen Qiang","Anyi Rao","Ning Lin","Bing Su","Jiaqi Wang"],"pdf_url":"https://arxiv.org/pdf/2308.03950v1.pdf","comment":"Accepted by ACM MM 2023"},{"id":"http://arxiv.org/abs/2308.03826v1","updated":"2023-08-07T17:49:04Z","published":"2023-08-07T17:49:04Z","title":"Recurrent Multi-scale Transformer for High-Resolution Salient Object\n Detection","summary":" Salient Object Detection (SOD) aims to identify and segment the most\nconspicuous objects in an image or video. As an important pre-processing step,\nit has many potential applications in multimedia and vision tasks. With the\nadvance of imaging devices, SOD with high-resolution images is of great demand,\nrecently. However, traditional SOD methods are largely limited to\nlow-resolution images, making them difficult to adapt to the development of\nHigh-Resolution SOD (HRSOD). Although some HRSOD methods emerge, there are no\nlarge enough datasets for training and evaluating. Besides, current HRSOD\nmethods generally produce incomplete object regions and irregular object\nboundaries. To address above issues, in this work, we first propose a new\nHRS10K dataset, which contains 10,500 high-quality annotated images at 2K-8K\nresolution. As far as we know, it is the largest dataset for the HRSOD task,\nwhich will significantly help future works in training and evaluating models.\nFurthermore, to improve the HRSOD performance, we propose a novel Recurrent\nMulti-scale Transformer (RMFormer), which recurrently utilizes shared\nTransformers and multi-scale refinement architectures. Thus, high-resolution\nsaliency maps can be generated with the guidance of lower-resolution\npredictions. Extensive experiments on both high-resolution and low-resolution\nbenchmarks show the effectiveness and superiority of the proposed framework.\nThe source code and dataset are released at:\nhttps://github.com/DrowsyMon/RMFormer.\n","authors":["Xinhao Deng","Pingping Zhang","Wei Liu","Huchuan Lu"],"pdf_url":"https://arxiv.org/pdf/2308.03826v1.pdf","comment":"This work is accepted by ACM MM2023. More modifications may be\n performed for further improvements"}]},"2023-08-08T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2308.04430v1","updated":"2023-08-08T17:58:15Z","published":"2023-08-08T17:58:15Z","title":"SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore","summary":" The legality of training language models (LMs) on copyrighted or otherwise\nrestricted data is under intense debate. However, as we show, model performance\nsignificantly degrades if trained only on low-risk text (e.g., out-of-copyright\nbooks or government documents), due to its limited size and domain coverage. We\npresent SILO, a new language model that manages this risk-performance tradeoff\nduring inference. 
SILO is built by (1) training a parametric LM on Open License\nCorpus (OLC), a new corpus we curate with 228B tokens of public domain and\npermissively licensed text and (2) augmenting it with a more general and easily\nmodifiable nonparametric datastore (e.g., containing copyrighted books or news)\nthat is only queried during inference. The datastore allows use of high-risk\ndata without training on it, supports sentence-level data attribution, and\nenables data producers to opt out from the model by removing content from the\nstore. These capabilities can foster compliance with data-use regulations such\nas the fair use doctrine in the United States and the GDPR in the European\nUnion. Our experiments show that the parametric LM struggles on domains not\ncovered by OLC. However, access to the datastore greatly improves out of domain\nperformance, closing 90% of the performance gap with an LM trained on the Pile,\na more diverse corpus with mostly high-risk text. We also analyze which\nnonparametric approach works best, where the remaining errors lie, and how\nperformance scales with datastore size. Our results suggest that it is possible\nto build high quality language models while mitigating their legal risk.\n","authors":["Sewon Min","Suchin Gururangan","Eric Wallace","Hannaneh Hajishirzi","Noah A. Smith","Luke Zettlemoyer"],"pdf_url":"https://arxiv.org/pdf/2308.04430v1.pdf","comment":"27 pages; 6 figures. Code, models, and data available at\n https://github.com/kernelmachine/silo-lm"},{"id":"http://arxiv.org/abs/2308.04424v1","updated":"2023-08-08T17:53:24Z","published":"2023-08-08T17:53:24Z","title":"A Bi-directional Multi-hop Inference Model for Joint Dialog Sentiment\n Classification and Act Recognition","summary":" The joint task of Dialog Sentiment Classification (DSC) and Act Recognition\n(DAR) aims to predict the sentiment label and act label for each utterance in a\ndialog simultaneously. However, current methods encode the dialog context in\nonly one direction, which limits their ability to thoroughly comprehend the\ncontext. Moreover, these methods overlook the explicit correlations between\nsentiment and act labels, which leads to an insufficient ability to capture\nrich sentiment and act clues and hinders effective and accurate reasoning. To\naddress these issues, we propose a Bi-directional Multi-hop Inference Model\n(BMIM) that leverages a feature selection network and a bi-directional\nmulti-hop inference network to iteratively extract and integrate rich sentiment\nand act clues in a bi-directional manner. We also employ contrastive learning\nand dual learning to explicitly model the correlations of sentiment and act\nlabels. Our experiments on two widely-used datasets show that BMIM outperforms\nstate-of-the-art baselines by at least 2.6% on F1 score in DAR and 1.4% on F1\nscore in DSC. Additionally, Our proposed model not only improves the\nperformance but also enhances the interpretability of the joint sentiment and\nact prediction task.\n","authors":["Li Zheng","Fei Li","Yuyang Chai","Chong Teng","Donghong Ji"],"pdf_url":"https://arxiv.org/pdf/2308.04424v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.15002v5","updated":"2023-08-08T17:39:57Z","published":"2023-07-27T16:57:32Z","title":"Gzip versus bag-of-words for text classification","summary":" The effectiveness of compression in text classification ('gzip') has recently\ngarnered lots of attention. 
In this note we show that `bag-of-words' approaches\ncan achieve similar or better results, and are more efficient.\n","authors":["Juri Opitz"],"pdf_url":"https://arxiv.org/pdf/2307.15002v5.pdf","comment":"improved writing, extended with more results"},{"id":"http://arxiv.org/abs/2308.04398v1","updated":"2023-08-08T17:01:42Z","published":"2023-08-08T17:01:42Z","title":"Character-level NMT and language similarity","summary":" We explore the effectiveness of character-level neural machine translation\nusing Transformer architecture for various levels of language similarity and\nsize of the training dataset on translation between Czech and Croatian, German,\nHungarian, Slovak, and Spanish. We evaluate the models using automatic MT\nmetrics and show that translation between similar languages benefits from\ncharacter-level input segmentation, while for less related languages,\ncharacter-level vanilla Transformer-base often lags behind subword-level\nsegmentation. We confirm previous findings that it is possible to close the gap\nby finetuning the already trained subword-level models to character-level.\n","authors":["Josef Jon","Ondřej Bojar"],"pdf_url":"https://arxiv.org/pdf/2308.04398v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04386v1","updated":"2023-08-08T16:41:16Z","published":"2023-08-08T16:41:16Z","title":"Learning Evaluation Models from Large Language Models for Sequence\n Generation","summary":" Large language models achieve state-of-the-art performance on sequence\ngeneration evaluation, but typically have a large number of parameters. This is\na computational challenge as presented by applying their evaluation capability\nat scale. To overcome the challenge, in this paper, we propose \\textbf{ECT}, an\n\\textbf{e}valuation \\textbf{c}apability \\textbf{t}ransfer method, to transfer\nthe evaluation capability from LLMs to relatively lightweight language models.\nBased on the proposed ECT, we learn various evaluation models from ChatGPT, and\nemploy them as reward models to improve sequence generation models via\nreinforcement learning and reranking approaches. Experimental results on\nmachine translation, text style transfer, and summarization tasks demonstrate\nthe effectiveness of our ECT. Notably, applying the learned evaluation models\nto sequence generation models results in better generated sequences as\nevaluated by commonly used metrics and ChatGPT.\n","authors":["Chenglong Wang","Hang Zhou","Kaiyan Chang","Tongran Liu","Chunliang Zhang","Quan Du","Tong Xiao","Jingbo Zhu"],"pdf_url":"https://arxiv.org/pdf/2308.04386v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.06713v2","updated":"2023-08-08T16:21:49Z","published":"2023-07-13T12:11:36Z","title":"Unsupervised Calibration through Prior Adaptation for Text\n Classification using Large Language Models","summary":" A wide variety of natural language tasks are currently being addressed with\nlarge-scale language models (LLMs). These models are usually trained with a\nvery large amount of unsupervised text data and adapted to perform a downstream\nnatural language task using methods like fine-tuning, calibration or in-context\nlearning. In this work, we propose an approach to adapt the prior class\ndistribution to perform text classification tasks without the need for labelled\nsamples and only few in-domain sample queries. The proposed approach treats the\nLLM as a black box, adding a stage where the model posteriors are calibrated to\nthe task. 
Results show that these methods outperform the un-adapted model for\ndifferent number of training shots in the prompt and a previous approach were\ncalibration is performed without using any adaptation data.\n","authors":["Lautaro Estienne"],"pdf_url":"https://arxiv.org/pdf/2307.06713v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04346v1","updated":"2023-08-08T15:46:27Z","published":"2023-08-08T15:46:27Z","title":"Unmasking Nationality Bias: A Study of Human Perception of Nationalities\n in AI-Generated Articles","summary":" We investigate the potential for nationality biases in natural language\nprocessing (NLP) models using human evaluation methods. Biased NLP models can\nperpetuate stereotypes and lead to algorithmic discrimination, posing a\nsignificant challenge to the fairness and justice of AI systems. Our study\nemploys a two-step mixed-methods approach that includes both quantitative and\nqualitative analysis to identify and understand the impact of nationality bias\nin a text generation model. Through our human-centered quantitative analysis,\nwe measure the extent of nationality bias in articles generated by AI sources.\nWe then conduct open-ended interviews with participants, performing qualitative\ncoding and thematic analysis to understand the implications of these biases on\nhuman readers. Our findings reveal that biased NLP models tend to replicate and\namplify existing societal biases, which can translate to harm if used in a\nsociotechnical setting. The qualitative analysis from our interviews offers\ninsights into the experience readers have when encountering such articles,\nhighlighting the potential to shift a reader's perception of a country. These\nfindings emphasize the critical role of public perception in shaping AI's\nimpact on society and the need to correct biases in AI systems.\n","authors":["Pranav Narayanan Venkit","Sanjana Gautam","Ruchi Panchanadikar","Ting-Hao `Kenneth' Huang","Shomir Wilson"],"pdf_url":"https://arxiv.org/pdf/2308.04346v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03629v2","updated":"2023-08-08T15:38:21Z","published":"2023-08-07T14:36:03Z","title":"MedMine: Examining Pre-trained Language Models on Medication Mining","summary":" Automatic medication mining from clinical and biomedical text has become a\npopular topic due to its real impact on healthcare applications and the recent\ndevelopment of powerful language models (LMs). However, fully-automatic\nextraction models still face obstacles to be overcome such that they can be\ndeployed directly into clinical practice for better impacts. Such obstacles\ninclude their imbalanced performances on different entity types and clinical\nevents. In this work, we examine current state-of-the-art pre-trained language\nmodels (PLMs) on such tasks, via fine-tuning including the monolingual model\nMed7 and multilingual large language model (LLM) XLM-RoBERTa. We compare their\nadvantages and drawbacks using historical medication mining shared task data\nsets from n2c2-2018 challenges. 
We report the findings we get from these\nfine-tuning experiments such that they can facilitate future research on\naddressing them, for instance, how to combine their outputs, merge such models,\nor improve their overall accuracy by ensemble learning and data augmentation.\nMedMine is part of the M3 Initiative \\url{https://github.com/HECTA-UoM/M3}\n","authors":["Haifa Alrdahi","Lifeng Han","Hendrik Šuvalov","Goran Nenadic"],"pdf_url":"https://arxiv.org/pdf/2308.03629v2.pdf","comment":"Open Research Project. 7 pages, 1 figure, 5 tables"},{"id":"http://arxiv.org/abs/2308.04333v1","updated":"2023-08-08T15:26:58Z","published":"2023-08-08T15:26:58Z","title":"Towards an AI to Win Ghana's National Science and Maths Quiz","summary":" Can an AI win Ghana's National Science and Maths Quiz (NSMQ)? That is the\nquestion we seek to answer in the NSMQ AI project, an open-source project that\nis building AI to compete live in the NSMQ and win. The NSMQ is an annual live\nscience and mathematics competition for senior secondary school students in\nGhana in which 3 teams of 2 students compete by answering questions across\nbiology, chemistry, physics, and math in 5 rounds over 5 progressive stages\nuntil a winning team is crowned for that year. The NSMQ is an exciting live\nquiz competition with interesting technical challenges across speech-to-text,\ntext-to-speech, question-answering, and human-computer interaction. In this\nongoing work that began in January 2023, we give an overview of the project,\ndescribe each of the teams, progress made thus far, and the next steps toward\nour planned launch and debut of the AI in October for NSMQ 2023. An AI that\nconquers this grand challenge can have real-world impact on education such as\nenabling millions of students across Africa to have one-on-one learning support\nfrom this AI.\n","authors":["George Boateng","Jonathan Abrefah Mensah","Kevin Takyi Yeboah","William Edor","Andrew Kojo Mensah-Onumah","Naafi Dasana Ibrahim","Nana Sam Yeboah"],"pdf_url":"https://arxiv.org/pdf/2308.04333v1.pdf","comment":"7 pages. Under review at Deep Learning Indaba and Black in AI\n Workshop @NeurIPS 2023"},{"id":"http://arxiv.org/abs/2308.04306v1","updated":"2023-08-08T14:51:16Z","published":"2023-08-08T14:51:16Z","title":"Deep Learning-Based Knowledge Injection for Metaphor Detection: A\n Comprehensive Review","summary":" The history of metaphor research also marks the evolution of knowledge\ninfusion research. With the continued advancement of deep learning techniques\nin recent years, the natural language processing community has shown great\ninterest in applying knowledge to successful results in metaphor recognition\ntasks. Although there has been a gradual increase in the number of approaches\ninvolving knowledge injection in the field of metaphor recognition, there is a\nlack of a complete review article on knowledge injection based approaches.\nTherefore, the goal of this paper is to provide a comprehensive review of\nresearch advances in the application of deep learning for knowledge injection\nin metaphor recognition tasks. In this paper, we systematically summarize and\ngeneralize the mainstream knowledge and knowledge injection principles, as well\nas review the datasets, evaluation metrics, and benchmark models used in\nmetaphor recognition tasks. 
Finally, we explore the current issues facing\nknowledge injection methods and provide an outlook on future research\ndirections.\n","authors":["Cheng Yang","Wenye Zhao","Qingbao Huang"],"pdf_url":"https://arxiv.org/pdf/2308.04306v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2308.04286v1","updated":"2023-08-08T14:29:35Z","published":"2023-08-08T14:29:35Z","title":"Comparative Analysis of the wav2vec 2.0 Feature Extractor","summary":" Automatic speech recognition (ASR) systems typically use handcrafted feature\nextraction pipelines. To avoid their inherent information loss and to achieve\nmore consistent modeling from speech to transcribed text, neural raw waveform\nfeature extractors (FEs) are an appealing approach. Also the wav2vec 2.0 model,\nwhich has recently gained large popularity, uses a convolutional FE which\noperates directly on the speech waveform. However, it is not yet studied\nextensively in the literature. In this work, we study its capability to replace\nthe standard feature extraction methods in a connectionist temporal\nclassification (CTC) ASR model and compare it to an alternative neural FE. We\nshow that both are competitive with traditional FEs on the LibriSpeech\nbenchmark and analyze the effect of the individual components. Furthermore, we\nanalyze the learned filters and show that the most important information for\nthe ASR system is obtained by a set of bandpass filters.\n","authors":["Peter Vieting","Ralf Schlüter","Hermann Ney"],"pdf_url":"https://arxiv.org/pdf/2308.04286v1.pdf","comment":"Accepted at ITG 2023"},{"id":"http://arxiv.org/abs/2308.04275v1","updated":"2023-08-08T14:17:17Z","published":"2023-08-08T14:17:17Z","title":"In-Context Alignment: Chat with Vanilla Language Models Before\n Fine-Tuning","summary":" In this note, we explore inference-time alignment through in-context\nlearning. We consider a vanilla pretrained language model Llama-2 before any\nfine-tuning and retrieve an average of 9 demonstration alignment examples when\nthe model is prompted to follow chat-style instructions. Compared to direct\nprompting, the in-context alignment without changing model weights leads to a\n7x increase in win-rate w.r.t. the text-davinci-003 model from OpenAI, making\nthe vanilla language model comparable to strong baselines with alignment\nfine-tuning.\n","authors":["Xiaochuang Han"],"pdf_url":"https://arxiv.org/pdf/2308.04275v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11661v2","updated":"2023-08-08T13:44:12Z","published":"2023-07-21T15:49:59Z","title":"Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts","summary":" Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have\nrevolutionized visual representation learning by providing good performance on\ndownstream datasets. VLMs are 0-shot adapted to a downstream dataset by\ndesigning prompts that are relevant to the dataset. Such prompt engineering\nmakes use of domain expertise and a validation dataset. Meanwhile, recent\ndevelopments in generative pretrained models like GPT-4 mean they can be used\nas advanced internet search tools. They can also be manipulated to provide\nvisual information in any structure. In this work, we show that GPT-4 can be\nused to generate text that is visually descriptive and how this can be used to\nadapt CLIP to downstream tasks. 
We show considerable improvements in 0-shot\ntransfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD\n(~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt.\nWe also design a simple few-shot adapter that learns to choose the best\npossible sentences to construct generalizable classifiers that outperform the\nrecently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized\nfine-grained datasets. The code, prompts, and auxiliary text dataset is\navailable at https://github.com/mayug/VDT-Adapter.\n","authors":["Mayug Maniparambil","Chris Vorster","Derek Molloy","Noel Murphy","Kevin McGuinness","Noel E. O'Connor"],"pdf_url":"https://arxiv.org/pdf/2307.11661v2.pdf","comment":"Paper accepted at ICCV-W 2023. V2 contains additional comparisons\n with concurrent works"},{"id":"http://arxiv.org/abs/2308.04255v1","updated":"2023-08-08T13:41:41Z","published":"2023-08-08T13:41:41Z","title":"CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic\n Languages","summary":" We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of\nthe South Slavic languages, which is based on the Stanza natural language\nprocessing pipeline. We describe the main improvements in CLASSLA-Stanza with\nrespect to Stanza, and give a detailed description of the model training\nprocess for the latest 2.1 release of the pipeline. We also report performance\nscores produced by the pipeline for different languages and varieties.\nCLASSLA-Stanza exhibits consistently high performance across all the supported\nlanguages and outperforms or expands its parent pipeline Stanza at all the\nsupported tasks. We also present the pipeline's new functionality enabling\nefficient processing of web data and the reasons that led to its\nimplementation.\n","authors":["Luka Terčon","Nikola Ljubešić"],"pdf_url":"https://arxiv.org/pdf/2308.04255v1.pdf","comment":"17 pages, 14 tables, 1 figure"},{"id":"http://arxiv.org/abs/2302.03512v3","updated":"2023-08-08T13:27:29Z","published":"2023-02-07T14:56:52Z","title":"A Survey on Arabic Named Entity Recognition: Past, Recent Advances, and\n Future Trends","summary":" As more and more Arabic texts emerged on the Internet, extracting important\ninformation from these Arabic texts is especially useful. As a fundamental\ntechnology, Named entity recognition (NER) serves as the core component in\ninformation extraction technology, while also playing a critical role in many\nother Natural Language Processing (NLP) systems, such as question answering and\nknowledge graph building. In this paper, we provide a comprehensive review of\nthe development of Arabic NER, especially the recent advances in deep learning\nand pre-trained language model. Specifically, we first introduce the background\nof Arabic NER, including the characteristics of Arabic and existing resources\nfor Arabic NER. Then, we systematically review the development of Arabic NER\nmethods. Traditional Arabic NER systems focus on feature engineering and\ndesigning domain-specific rules. In recent years, deep learning methods achieve\nsignificant progress by representing texts via continuous vector\nrepresentations. With the growth of pre-trained language model, Arabic NER\nyields better performance. 
Finally, we conclude the method gap between Arabic\nNER and NER methods from other languages, which helps outline future directions\nfor Arabic NER.\n","authors":["Xiaoye Qu","Yingjie Gu","Qingrong Xia","Zechang Li","Zhefeng Wang","Baoxing Huai"],"pdf_url":"https://arxiv.org/pdf/2302.03512v3.pdf","comment":"Accepted by IEEE TKDE"},{"id":"http://arxiv.org/abs/2308.04248v1","updated":"2023-08-08T13:26:53Z","published":"2023-08-08T13:26:53Z","title":"Gloss Alignment Using Word Embeddings","summary":" Capturing and annotating Sign language datasets is a time consuming and\ncostly process. Current datasets are orders of magnitude too small to\nsuccessfully train unconstrained \\acf{slt} models. As a result, research has\nturned to TV broadcast content as a source of large-scale training data,\nconsisting of both the sign language interpreter and the associated audio\nsubtitle. However, lack of sign language annotation limits the usability of\nthis data and has led to the development of automatic annotation techniques\nsuch as sign spotting. These spottings are aligned to the video rather than the\nsubtitle, which often results in a misalignment between the subtitle and\nspotted signs. In this paper we propose a method for aligning spottings with\ntheir corresponding subtitles using large spoken language models. Using a\nsingle modality means our method is computationally inexpensive and can be\nutilized in conjunction with existing alignment techniques. We quantitatively\ndemonstrate the effectiveness of our method on the \\acf{mdgs} and \\acf{bobsl}\ndatasets, recovering up to a 33.22 BLEU-1 score in word alignment.\n","authors":["Harry Walsh","Ozge Mercanoglu Sincan","Ben Saunders","Richard Bowden"],"pdf_url":"https://arxiv.org/pdf/2308.04248v1.pdf","comment":"4 pages, 4 figures, 2023 IEEE International Conference on Acoustics,\n Speech, and Signal Processing Workshops (ICASSPW)"},{"id":"http://arxiv.org/abs/2306.09841v3","updated":"2023-08-08T12:57:18Z","published":"2023-06-16T13:39:35Z","title":"Are Large Language Models Really Good Logical Reasoners? A Comprehensive\n Evaluation and Beyond","summary":" Logical reasoning consistently plays a fundamental and significant role in\nthe domains of knowledge engineering and artificial intelligence. Recently,\nLarge Language Models (LLMs) have emerged as a noteworthy innovation in natural\nlanguage processing (NLP), exhibiting impressive achievements across various\nclassic NLP tasks. However, the question of whether LLMs can effectively\naddress the task of logical reasoning, which requires gradual cognitive\ninference similar to human intelligence, remains unanswered. To this end, we\naim to bridge this gap and provide comprehensive evaluations in this paper.\nFirstly, to offer systematic evaluations, we select fifteen typical logical\nreasoning datasets and organize them into deductive, inductive, abductive and\nmixed-form reasoning settings. Considering the comprehensiveness of\nevaluations, we include three representative LLMs (i.e., text-davinci-003,\nChatGPT and BARD) and evaluate them on all selected datasets under zero-shot,\none-shot and three-shot settings. Secondly, different from previous evaluations\nrelying only on simple metrics (e.g., accuracy), we propose fine-level\nevaluations from objective and subjective manners, covering both answers and\nexplanations. 
Additionally, to uncover the logical flaws of LLMs, problematic\ncases will be attributed to five error types from two dimensions, i.e.,\nevidence selection process and reasoning process. Thirdly, to avoid the\ninfluences of knowledge bias and purely focus on benchmarking the logical\nreasoning capability of LLMs, we propose a new dataset with neutral content. It\ncontains 3,000 samples and covers deductive, inductive and abductive settings.\nBased on the in-depth evaluations, this paper finally forms a general\nevaluation scheme of logical reasoning capability from six dimensions. It\nreflects the pros and cons of LLMs and gives guiding directions for future\nworks.\n","authors":["Fangzhi Xu","Qika Lin","Jiawei Han","Tianzhe Zhao","Jun Liu","Erik Cambria"],"pdf_url":"https://arxiv.org/pdf/2306.09841v3.pdf","comment":"14 pages, 11 figures"},{"id":"http://arxiv.org/abs/2308.04215v1","updated":"2023-08-08T12:27:20Z","published":"2023-08-08T12:27:20Z","title":"Hybrid Retrieval-Augmented Generation for Real-time Composition\n Assistance","summary":" Retrieval augmented models show promise in enhancing traditional language\nmodels by improving their contextual understanding, integrating private data,\nand reducing hallucination. However, the processing time required for retrieval\naugmented large language models poses a challenge when applying them to tasks\nthat require real-time responses, such as composition assistance.\n To overcome this limitation, we propose the Hybrid Retrieval-Augmented\nGeneration (HybridRAG) framework that leverages a hybrid setting that combines\nboth client and cloud models. HybridRAG incorporates retrieval-augmented memory\ngenerated asynchronously by a Large Language Model (LLM) in the cloud. By\nintegrating this retrieval augmented memory, the client model acquires the\ncapability to generate highly effective responses, benefiting from the LLM's\ncapabilities. Furthermore, through asynchronous memory integration, the client\nmodel is capable of delivering real-time responses to user requests without the\nneed to wait for memory synchronization from the cloud. Our experiments on\nWikitext and Pile subsets show that HybridRAG achieves lower latency than a\ncloud-based retrieval-augmented LLM, while outperforming client-only models in\nutility.\n","authors":["Xuchao Zhang","Menglin Xia","Camille Couturier","Guoqing Zheng","Saravan Rajmohan","Victor Ruhle"],"pdf_url":"https://arxiv.org/pdf/2308.04215v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09998v3","updated":"2023-08-08T12:23:49Z","published":"2023-07-19T14:13:02Z","title":"Generating Mathematical Derivations with Large Language Models","summary":" The derivation of mathematical results in specialised fields, using Large\nLanguage Models (LLMs), is an emerging research direction that can help\nidentify models' limitations, and potentially support mathematical discovery.\nIn this paper, we leverage a symbolic engine to generate derivations of\nequations at scale, and investigate the capabilities of LLMs when deriving goal\nequations from premises. Specifically, we employ in-context learning for GPT\nand fine-tune a range of T5 models to compare the robustness and generalisation\nof pre-training strategies to specialised models. Empirical results show that\nfine-tuned FLAN-T5-large (MathT5) outperforms GPT models on all static and\nout-of-distribution test sets in conventional scores. 
However, an in-depth\nanalysis reveals that the fine-tuned models are more sensitive to perturbations\ninvolving unseen symbols and (to a lesser extent) changes to equation\nstructure. In addition, we analyse 1.7K equations, and over 200 derivations, to\nhighlight common reasoning errors such as the inclusion of incorrect,\nirrelevant, and redundant equations. Finally, we explore the suitability of\nexisting metrics for evaluating mathematical derivations and find evidence\nthat, while they can capture general properties such as sensitivity to\nperturbations, they fail to highlight fine-grained reasoning errors and\nessential differences between models. Overall, this work demonstrates that\ntraining models on synthetic data may improve their math capabilities beyond\nmuch larger LLMs, but current metrics are not appropriately assessing the\nquality of generated mathematical text.\n","authors":["Jordan Meadows","Marco Valentino","Andre Freitas"],"pdf_url":"https://arxiv.org/pdf/2307.09998v3.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2308.03565v2","updated":"2023-08-08T12:12:55Z","published":"2023-08-07T13:16:42Z","title":"Topological Interpretations of GPT-3","summary":" This is an experiential study of investigating a consistent method for\nderiving the correlation between sentence vector and semantic meaning of a\nsentence. We first used three state-of-the-art word/sentence embedding methods\nincluding GPT-3, Word2Vec, and Sentence-BERT, to embed plain text sentence\nstrings into high dimensional spaces. Then we compute the pairwise distance\nbetween any possible combination of two sentence vectors in an embedding space\nand map them into a matrix. Based on each distance matrix, we compute the\ncorrelation of distances of a sentence vector with respect to the other\nsentence vectors in an embedding space. Then we compute the correlation of each\npair of the distance matrices. We observed correlations of the same sentence in\ndifferent embedding spaces and correlations of different sentences in the same\nembedding space. These observations are consistent with our hypothesis and take\nus to the next stage.\n","authors":["Tianyi Sun","Bradley Nelson"],"pdf_url":"https://arxiv.org/pdf/2308.03565v2.pdf","comment":"70 pages"},{"id":"http://arxiv.org/abs/2305.10652v2","updated":"2023-08-08T11:10:32Z","published":"2023-05-18T02:19:05Z","title":"Speech Separation based on Contrastive Learning and Deep Modularization","summary":" The current monaural state of the art tools for speech separation relies on\nsupervised learning. This means that they must deal with permutation problem,\nthey are impacted by the mismatch on the number of speakers used in training\nand inference. Moreover, their performance heavily relies on the presence of\nhigh-quality labelled data. These problems can be effectively addressed by\nemploying a fully unsupervised technique for speech separation. In this paper,\nwe use contrastive learning to establish the representations of frames then use\nthe learned representations in the downstream deep modularization task.\nConcretely, we demonstrate experimentally that in speech separation, different\nframes of a speaker can be viewed as augmentations of a given hidden standard\nframe of that speaker. The frames of a speaker contain enough prosodic\ninformation overlap which is key in speech separation. Based on this, we\nimplement a self-supervised learning to learn to minimize the distance between\nframes belonging to a given speaker. 
The learned representations are used in a\ndownstream deep modularization task to cluster frames based on speaker\nidentity. Evaluation of the developed technique on WSJ0-2mix and WSJ0-3mix\nshows that the technique attains SI-SNRi and SDRi of 20.8 and 21.0 respectively\nin WSJ0-2mix. In WSJ0-3mix, it attains SI-SNRi and SDRi of 20.7 and 20.7\nrespectively in WSJ0-2mix. Its greatest strength being that as the number of\nspeakers increase, its performance does not degrade significantly.\n","authors":["Peter Ochieng"],"pdf_url":"https://arxiv.org/pdf/2305.10652v2.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2212.00369"},{"id":"http://arxiv.org/abs/2308.04180v1","updated":"2023-08-08T10:42:33Z","published":"2023-08-08T10:42:33Z","title":"Studying Socially Unacceptable Discourse Classification (SUD) through\n different eyes: \"Are we on the same page ?\"","summary":" We study Socially Unacceptable Discourse (SUD) characterization and detection\nin online text. We first build and present a novel corpus that contains a large\nvariety of manually annotated texts from different online sources used so far\nin state-of-the-art Machine learning (ML) SUD detection solutions. This global\ncontext allows us to test the generalization ability of SUD classifiers that\nacquire knowledge around the same SUD categories, but from different contexts.\nFrom this perspective, we can analyze how (possibly) different annotation\nmodalities influence SUD learning by discussing open challenges and open\nresearch directions. We also provide several data insights which can support\ndomain experts in the annotation task.\n","authors":["Bruno Machado Carneiro","Michele Linardi","Julien Longhi"],"pdf_url":"https://arxiv.org/pdf/2308.04180v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04176v1","updated":"2023-08-08T10:23:04Z","published":"2023-08-08T10:23:04Z","title":"On Monotonic Aggregation for Open-domain QA","summary":" Question answering (QA) is a critical task for speech-based retrieval from\nknowledge sources, by sifting only the answers without requiring to read\nsupporting documents. Specifically, open-domain QA aims to answer user\nquestions on unrestricted knowledge sources. Ideally, adding a source should\nnot decrease the accuracy, but we find this property (denoted as\n\"monotonicity\") does not hold for current state-of-the-art methods. We identify\nthe cause, and based on that we propose Judge-Specialist framework. Our\nframework consists of (1) specialist retrievers/readers to cover individual\nsources, and (2) judge, a dedicated language model to select the final answer.\nOur experiments show that our framework not only ensures monotonicity, but also\noutperforms state-of-the-art multi-source QA methods on Natural Questions.\nAdditionally, we show that our models robustly preserve the monotonicity\nagainst noise from speech recognition. We publicly release our code and\nsetting.\n","authors":["Sang-eun Han","Yeonseok Jeong","Seung-won Hwang","Kyungjae Lee"],"pdf_url":"https://arxiv.org/pdf/2308.04176v1.pdf","comment":"INTERSPEECH 2023 Camera Ready"},{"id":"http://arxiv.org/abs/2306.02864v2","updated":"2023-08-08T09:48:36Z","published":"2023-06-05T13:35:01Z","title":"Leveraging Large Language Models for Topic Classification in the Domain\n of Public Affairs","summary":" The analysis of public affairs documents is crucial for citizens as it\npromotes transparency, accountability, and informed decision-making. 
It allows\ncitizens to understand government policies, participate in public discourse,\nand hold representatives accountable. This is crucial, and sometimes a matter\nof life or death, for companies whose operations depend on certain regulations.\nLarge Language Models (LLMs) have the potential to greatly enhance the analysis\nof public affairs documents by effectively processing and understanding the\ncomplex language used in such documents. In this work, we analyze the\nperformance of LLMs in classifying public affairs documents. As a natural\nmulti-label task, the classification of these documents presents important\nchallenges. In this work, we use a regex-powered tool to collect a database of\npublic affairs documents with more than 33K samples and 22.5M tokens. Our\nexperiments assess the performance of 4 different Spanish LLMs to classify up\nto 30 different topics in the data in different configurations. The results\nshow that LLMs can be of great use to process domain-specific documents, such\nas those in the domain of public affairs.\n","authors":["Alejandro Peña","Aythami Morales","Julian Fierrez","Ignacio Serna","Javier Ortega-Garcia","Iñigo Puente","Jorge Cordova","Gonzalo Cordova"],"pdf_url":"https://arxiv.org/pdf/2306.02864v2.pdf","comment":"Accepted in ICDAR 2023 Workshop on Automatic Domain-Adapted and\n Personalized Document Analysis"},{"id":"http://arxiv.org/abs/2308.02582v2","updated":"2023-08-08T08:57:20Z","published":"2023-08-01T05:31:36Z","title":"Adapt and Decompose: Efficient Generalization of Text-to-SQL via Domain\n Adapted Least-To-Most Prompting","summary":" Cross-domain and cross-compositional generalization of Text-to-SQL semantic\nparsing is a challenging task. Existing Large Language Model (LLM) based\nsolutions rely on inference-time retrieval of few-shot exemplars from the\ntraining set to synthesize a run-time prompt for each Natural Language (NL)\ntest query. In contrast, we devise an algorithm which performs offline sampling\nof a minimal set of few-shots from the training data, with complete coverage of\nSQL clauses, operators and functions, and maximal domain coverage within the\nallowed token length. This allows for synthesis of a fixed Generic Prompt (GP),\nwith a diverse set of exemplars common across NL test queries, avoiding\nexpensive test time exemplar retrieval. We further auto-adapt the GP to the\ntarget database domain (DA-GP), to better handle cross-domain generalization;\nfollowed by a decomposed Least-To-Most-Prompting (LTMP-DA-GP) to handle\ncross-compositional generalization. The synthesis of LTMP-DA-GP is an offline\ntask, to be performed one-time per new database with minimal human\nintervention. Our approach demonstrates superior performance on the KaggleDBQA\ndataset, designed to evaluate generalizability for the Text-to-SQL task. 
We\nfurther showcase consistent performance improvement of LTMP-DA-GP over GP,\nacross LLMs and databases of KaggleDBQA, highlighting the efficacy and model\nagnostic benefits of our prompt based adapt and decompose approach.\n","authors":["Aseem Arora","Shabbirhussain Bhaisaheb","Manasi Patwardhan","Lovekesh Vig","Gautam Shroff"],"pdf_url":"https://arxiv.org/pdf/2308.02582v2.pdf","comment":"22 Pages"},{"id":"http://arxiv.org/abs/2308.04138v1","updated":"2023-08-08T08:57:01Z","published":"2023-08-08T08:57:01Z","title":"Large Language Model Prompt Chaining for Long Legal Document\n Classification","summary":" Prompting is used to guide or steer a language model in generating an\nappropriate response that is consistent with the desired outcome. Chaining is a\nstrategy used to decompose complex tasks into smaller, manageable components.\nIn this study, we utilize prompt chaining for extensive legal document\nclassification tasks, which present difficulties due to their intricate\ndomain-specific language and considerable length. Our approach begins with the\ncreation of a concise summary of the original document, followed by a semantic\nsearch for related exemplar texts and their corresponding annotations from a\ntraining corpus. Finally, we prompt for a label - based on the task - to\nassign, by leveraging the in-context learning from the few-shot prompt. We\ndemonstrate that through prompt chaining, we can not only enhance the\nperformance over zero-shot, but also surpass the micro-F1 score achieved by\nlarger models, such as ChatGPT zero-shot, using smaller models.\n","authors":["Dietrich Trautmann"],"pdf_url":"https://arxiv.org/pdf/2308.04138v1.pdf","comment":"SwissText 2023 Late Breaking Work (Generative AI & LLM)"},{"id":"http://arxiv.org/abs/2308.04124v1","updated":"2023-08-08T08:27:57Z","published":"2023-08-08T08:27:57Z","title":"Social Media, Topic Modeling and Sentiment Analysis in Municipal\n Decision Support","summary":" Many cities around the world are aspiring to become smart cities. However, smart\ninitiatives often give little weight to the opinions of average citizens.\n Social media are one of the most important sources of citizen opinions. This\npaper presents a prototype of a framework for processing social media posts\nwith municipal decision-making in mind. The framework consists of a sequence of\nthree steps: (1) determining the sentiment polarity of each social media post,\n(2) identifying prevalent topics and mapping these topics to individual posts,\nand (3) aggregating these two pieces of information into a fuzzy number\nrepresenting the overall sentiment expressed towards each topic. Optionally,\nthe fuzzy number can be reduced into a tuple of two real numbers indicating the\n\"amount\" of positive and negative opinion expressed towards each topic.\n The framework is demonstrated on tweets published from Ostrava, Czechia over\na period of about two months. This application illustrates how fuzzy numbers\nrepresent sentiment in a richer way and capture the diversity of opinions\nexpressed on social media.\n","authors":["Miloš Švaňa"],"pdf_url":"https://arxiv.org/pdf/2308.04124v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.07748v4","updated":"2023-08-08T08:08:12Z","published":"2023-02-15T15:54:01Z","title":"Whats New? 
Identifying the Unfolding of New Events in Narratives","summary":" Narratives include a rich source of events unfolding over time and context.\nAutomatic understanding of these events provides a summarised comprehension of\nthe narrative for further computation (such as reasoning). In this paper, we\nstudy the Information Status (IS) of the events and propose a novel challenging\ntask: the automatic identification of new events in a narrative. We define an\nevent as a triplet of subject, predicate, and object. The event is categorized\nas new with respect to the discourse context and whether it can be inferred\nthrough commonsense reasoning. We annotated a publicly available corpus of\nnarratives with the new events at sentence level using human annotators. We\npresent the annotation protocol and study the quality of the annotation and the\ndifficulty of the task. We publish the annotated dataset, annotation materials,\nand machine learning baseline models for the task of new event extraction for\nnarrative understanding.\n","authors":["Seyed Mahed Mousavi","Shohei Tanaka","Gabriel Roccabruna","Koichiro Yoshino","Satoshi Nakamura","Giuseppe Riccardi"],"pdf_url":"https://arxiv.org/pdf/2302.07748v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04114v1","updated":"2023-08-08T08:00:52Z","published":"2023-08-08T08:00:52Z","title":"Collective Human Opinions in Semantic Textual Similarity","summary":" Despite the subjective nature of semantic textual similarity (STS) and\npervasive disagreements in STS annotation, existing benchmarks have used\naveraged human ratings as the gold standard. Averaging masks the true\ndistribution of human opinions on examples of low agreement, and prevents\nmodels from capturing the semantic vagueness that the individual ratings\nrepresent. In this work, we introduce USTS, the first Uncertainty-aware STS\ndataset with ~15,000 Chinese sentence pairs and 150,000 labels, to study\ncollective human opinions in STS. Analysis reveals that neither a scalar nor a\nsingle Gaussian fits a set of observed judgements adequately. We further show\nthat current STS models cannot capture the variance caused by human\ndisagreement on individual instances, but rather reflect the predictive\nconfidence over the aggregate dataset.\n","authors":["Yuxia Wang","Shimin Tao","Ning Xie","Hao Yang","Timothy Baldwin","Karin Verspoor"],"pdf_url":"https://arxiv.org/pdf/2308.04114v1.pdf","comment":"16 pages, 7 figures"},{"id":"http://arxiv.org/abs/2308.03421v2","updated":"2023-08-08T07:58:06Z","published":"2023-08-07T09:14:33Z","title":"RecycleGPT: An Autoregressive Language Model with Recyclable Module","summary":" Existing large language models have to run K times to generate a sequence of\nK tokens. In this paper, we present RecycleGPT, a generative language model\nwith fast decoding speed by recycling pre-generated model states without\nrunning the whole model in multiple steps. Our approach relies on the\nobservation that adjacent tokens in a sequence usually have strong correlations\nand the next token in a sequence can be reasonably guessed or inferred based on\nthe preceding ones. 
Experiments and analysis demonstrate the effectiveness of\nour approach in lowering inference latency, achieving up to 1.4x speedup while\npreserving high performance.\n","authors":["Yufan Jiang","Qiaozhi He","Xiaomin Zhuang","Zhihua Wu","Kunpeng Wang","Wenlai Zhao","Guangwen Yang"],"pdf_url":"https://arxiv.org/pdf/2308.03421v2.pdf","comment":"Technical Report"},{"id":"http://arxiv.org/abs/2308.04109v1","updated":"2023-08-08T07:47:10Z","published":"2023-08-08T07:47:10Z","title":"I-WAS: a Data Augmentation Method with GPT-2 for Simile Detection","summary":" Simile detection is a valuable task for many natural language processing\n(NLP)-based applications, particularly in the field of literature. However,\nexisting research on simile detection often relies on corpora that are limited\nin size and do not adequately represent the full range of simile forms. To\naddress this issue, we propose a simile data augmentation method based on\n\\textbf{W}ord replacement And Sentence completion using the GPT-2 language\nmodel. Our iterative process, called I-WAS, is designed to improve the quality\nof the augmented sentences. To better evaluate the performance of our method in\nreal-world applications, we have compiled a corpus containing a more diverse\nset of simile forms for experimentation. Our experimental results demonstrate\nthe effectiveness of our proposed data augmentation method for simile\ndetection.\n","authors":["Yongzhu Chang","Rongsheng Zhang","Jiashu Pu"],"pdf_url":"https://arxiv.org/pdf/2308.04109v1.pdf","comment":"15 pages, 1 figure"},{"id":"http://arxiv.org/abs/2201.05337v4","updated":"2023-08-08T06:50:57Z","published":"2022-01-14T08:32:20Z","title":"A Survey of Controllable Text Generation using Transformer-based\n Pre-trained Language Models","summary":" Controllable Text Generation (CTG) is an emerging area in the field of natural\nlanguage generation (NLG). It is regarded as crucial for the development of\nadvanced text generation technologies that better meet the specific constraints\nin practical applications. In recent years, methods using large-scale\npre-trained language models (PLMs), in particular the widely used\ntransformer-based PLMs, have become a new paradigm of NLG, allowing generation\nof more diverse and fluent text. However, due to the limited level of\ninterpretability of deep neural networks, the controllability of these methods\nneeds to be guaranteed. To this end, controllable text generation using\ntransformer-based PLMs has become a rapidly growing yet challenging new\nresearch hotspot. A diverse range of approaches have emerged in the past 3-4\nyears, targeting different CTG tasks that require different types of controlled\nconstraints. In this paper, we present a systematic critical review on the\ncommon tasks, main approaches, and evaluation methods in this area. Finally, we\ndiscuss the challenges that the field is facing, and put forward various\npromising future directions. To the best of our knowledge, this is the first\nsurvey paper to summarize the state-of-the-art CTG techniques from the\nperspective of Transformer-based PLMs. 
We hope it can help researchers and\npractitioners in the related fields to quickly track the academic and\ntechnological frontier, providing them with a landscape of the area and a\nroadmap for future research.\n","authors":["Hanqing Zhang","Haolin Song","Shaoyu Li","Ming Zhou","Dawei Song"],"pdf_url":"https://arxiv.org/pdf/2201.05337v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04076v1","updated":"2023-08-08T06:21:58Z","published":"2023-08-08T06:21:58Z","title":"DataTales: Investigating the use of Large Language Models for Authoring\n Data-Driven Articles","summary":" Authoring data-driven articles is a complex process requiring authors to not\nonly analyze data for insights but also craft a cohesive narrative that\neffectively communicates the insights. Text generation capabilities of\ncontemporary large language models (LLMs) present an opportunity to assist the\nauthoring of data-driven articles and expedite the writing process. In this\nwork, we investigate the feasibility and perceived value of leveraging LLMs to\nsupport authors of data-driven articles. We designed a prototype system,\nDataTales, that leverages a LLM to generate textual narratives accompanying a\ngiven chart. Using DataTales as a design probe, we conducted a qualitative\nstudy with 11 professionals to evaluate the concept, from which we distilled\naffordances and opportunities to further integrate LLMs as valuable data-driven\narticle authoring assistants.\n","authors":["Nicole Sultanum","Arjun Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2308.04076v1.pdf","comment":"4 pages, 3 figures"},{"id":"http://arxiv.org/abs/2308.04052v1","updated":"2023-08-08T05:16:51Z","published":"2023-08-08T05:16:51Z","title":"The Five-Dollar Model: Generating Game Maps and Sprites from Sentence\n Embeddings","summary":" The five-dollar model is a lightweight text-to-image generative architecture\nthat generates low dimensional images from an encoded text prompt. This model\ncan successfully generate accurate and aesthetically pleasing content in low\ndimensional domains, with limited amounts of training data. Despite the small\nsize of both the model and datasets, the generated images are still able to\nmaintain the encoded semantic meaning of the textual prompt. We apply this\nmodel to three small datasets: pixel art video game maps, video game sprite\nimages, and down-scaled emoji images and apply novel augmentation strategies to\nimprove the performance of our model on these limited datasets. We evaluate our\nmodels performance using cosine similarity score between text-image pairs\ngenerated by the CLIP VIT-B/32 model.\n","authors":["Timothy Merino","Roman Negri","Dipika Rajesh","M Charity","Julian Togelius"],"pdf_url":"https://arxiv.org/pdf/2308.04052v1.pdf","comment":"to be published in AIIDE 2023"},{"id":"http://arxiv.org/abs/2308.04041v1","updated":"2023-08-08T04:37:41Z","published":"2023-08-08T04:37:41Z","title":"InfeRE: Step-by-Step Regex Generation via Chain of Inference","summary":" Automatically generating regular expressions (abbrev. regexes) from natural\nlanguage description (NL2RE) has been an emerging research area. Prior studies\ntreat regex as a linear sequence of tokens and generate the final expressions\nautoregressively in a single pass. They did not take into account the\nstep-by-step internal text-matching processes behind the final results. This\nsignificantly hinders the efficacy and interpretability of regex generation by\nneural language models. 
In this paper, we propose a new paradigm called InfeRE,\nwhich decomposes the generation of regexes into chains of step-by-step\ninference. To enhance the robustness, we introduce a self-consistency decoding\nmechanism that ensembles multiple outputs sampled from different models. We\nevaluate InfeRE on two publicly available datasets, NL-RX-Turk and KB13, and\ncompare the results with state-of-the-art approaches and the popular tree-based\ngeneration approach TRANX. Experimental results show that InfeRE substantially\noutperforms previous baselines, yielding 16.3% and 14.7% improvement in DFA@5\naccuracy on two datasets, respectively. Particularly, InfeRE outperforms the\npopular tree-based generation approach by 18.1% and 11.3% on both datasets,\nrespectively, in terms of DFA@5 accuracy.\n","authors":["Shuai Zhang","Xiaodong Gu","Yuting Chen","Beijun Shen"],"pdf_url":"https://arxiv.org/pdf/2308.04041v1.pdf","comment":"This paper has been accepted by ASE'23"},{"id":"http://arxiv.org/abs/2308.04037v1","updated":"2023-08-08T04:27:34Z","published":"2023-08-08T04:27:34Z","title":"A Comparative Study on TF-IDF feature Weighting Method and its Analysis\n using Unstructured Dataset","summary":" Text Classification is the process of categorizing text into the relevant\ncategories, and its algorithms are at the core of many Natural Language\nProcessing (NLP) applications. Term Frequency-Inverse Document Frequency (TF-IDF) and NLP\nare the most highly used information retrieval methods in text classification.\nWe have investigated and analyzed the feature weighting method for text\nclassification on unstructured data. The proposed model considered two features,\nN-Grams and TF-IDF, on the IMDB movie reviews and Amazon Alexa reviews dataset\nfor sentiment analysis. Then we used state-of-the-art classifiers to\nvalidate the method, i.e., Support Vector Machine (SVM), Logistic Regression,\nMultinomial Naive Bayes (Multinomial NB), Random Forest, Decision Tree, and\nk-nearest neighbors (KNN). Between the two feature extraction methods, TF-IDF features\nyielded a significant performance increase over N-Grams. TF-IDF achieved the maximum accuracy (93.81%), precision (94.20%), recall\n(93.81%), and F1-score (91.99%) with the Random Forest classifier.\n","authors":["Mamata Das","Selvakumar K.","P. J. A. Alphonse"],"pdf_url":"https://arxiv.org/pdf/2308.04037v1.pdf","comment":"10 pages, 3 figures, COLINS-2021, 5th International Conference on\n Computational Linguistics and Intelligent Systems, April 22-23, 2021,\n Kharkiv, Ukraine"},{"id":"http://arxiv.org/abs/2307.10457v3","updated":"2023-08-08T04:18:34Z","published":"2023-07-19T21:00:16Z","title":"Improving the Reusability of Pre-trained Language Models in Real-world\n Applications","summary":" The reusability of state-of-the-art Pre-trained Language Models (PLMs) is\noften limited by their generalization problem, where their performance\ndrastically decreases when evaluated on examples that differ from the training\ndataset, known as Out-of-Distribution (OOD)/unseen examples. This limitation\narises from PLMs' reliance on spurious correlations, which work well for\nfrequent example types but not for general examples. To address this issue, we\npropose a training approach called Mask-tuning, which integrates Masked\nLanguage Modeling (MLM) training objectives into the fine-tuning process to\nenhance PLMs' generalization. 
Comprehensive experiments demonstrate that\nMask-tuning surpasses current state-of-the-art techniques and enhances PLMs'\ngeneralization on OOD datasets while improving their performance on\nin-distribution datasets. The findings suggest that Mask-tuning improves the\nreusability of PLMs on unseen data, making them more practical and effective\nfor real-world applications.\n","authors":["Somayeh Ghanbarzadeh","Hamid Palangi","Yan Huang","Radames Cruz Moreno","Hamed Khanpour"],"pdf_url":"https://arxiv.org/pdf/2307.10457v3.pdf","comment":"Accepted as a long paper and awarded as the BEST Resaerch Paper in\n IEEE IRI'23 (IEEE 24th International conference on Information Reuse and\n Integrationfor Data Science)"},{"id":"http://arxiv.org/abs/2308.04028v1","updated":"2023-08-08T04:06:11Z","published":"2023-08-08T04:06:11Z","title":"Top K Relevant Passage Retrieval for Biomedical Question Answering","summary":" Question answering is a task that answers factoid questions using a large\ncollection of documents. It aims to provide precise answers in response to the\nuser's questions in natural language. Question answering relies on efficient\npassage retrieval to select candidate contexts, where traditional sparse vector\nspace models, such as TF-IDF or BM25, are the de facto method. On the web,\nthere is no single article that could provide all the possible answers\navailable on the internet to the question of the problem asked by the user. The\nexisting Dense Passage Retrieval model has been trained on Wikipedia dump from\nDec. 20, 2018, as the source documents for answering questions. Question\nanswering (QA) has made big strides with several open-domain and machine\ncomprehension systems built using large-scale annotated datasets. However, in\nthe clinical domain, this problem remains relatively unexplored. According to\nmultiple surveys, Biomedical Questions cannot be answered correctly from\nWikipedia Articles. In this work, we work on the existing DPR framework for the\nbiomedical domain and retrieve answers from the Pubmed articles which is a\nreliable source to answer medical questions. When evaluated on a BioASQ QA\ndataset, our fine-tuned dense retriever results in a 0.81 F1 score.\n","authors":["Shashank Gupta"],"pdf_url":"https://arxiv.org/pdf/2308.04028v1.pdf","comment":"6 pages, 5 figures. arXiv admin note: text overlap with\n arXiv:2004.04906 by other authors"},{"id":"http://arxiv.org/abs/2306.07848v6","updated":"2023-08-08T03:41:47Z","published":"2023-06-13T15:28:10Z","title":"GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio\n Pretraining for Speech Emotion Recognition","summary":" Contrastive learning based cross-modality pretraining approaches have\nrecently exhibited impressive success in diverse fields. In this paper, we\npropose GEmo-CLAP, a kind of gender-attribute-enhanced contrastive\nlanguage-audio pretraining (CLAP) method for speech emotion recognition.\nSpecifically, a novel emotion CLAP model (Emo-CLAP) is first built, utilizing\npre-trained WavLM and RoBERTa models. Second, given the significance of the\ngender attribute in speech emotion modeling, two novel soft label based\nGEmo-CLAP (SL-GEmo-CLAP) and multi-task learning based GEmo-CLAP (ML-GEmo-CLAP)\nmodels are further proposed to integrate emotion and gender information of\nspeech signals, forming more reasonable objectives. 
Extensive experiments on\nIEMOCAP show that our two proposed GEmo-CLAP models consistently outperform the\nbaseline Emo-CLAP, while also achieving the best recognition performance\ncompared with recent state-of-the-art methods. Noticeably, the proposed\nSL-GEmo-CLAP model achieves the best UAR of 81.43\\% and WAR of 83.16\\%, which\nperforms better than other state-of-the-art SER methods by at least 3\\%.\n","authors":["Yu Pan","Yanni Hu","Yuguang Yang","Jixun Yao","Wen Fei","Lei Ma","Heng Lu"],"pdf_url":"https://arxiv.org/pdf/2306.07848v6.pdf","comment":"5 pages"},{"id":"http://arxiv.org/abs/2308.01681v2","updated":"2023-08-08T03:19:10Z","published":"2023-08-03T10:48:30Z","title":"NBIAS: A Natural Language Processing Framework for Bias Identification\n in Text","summary":" Bias in textual data can lead to skewed interpretations and outcomes when the\ndata is used. These biases could perpetuate stereotypes, discrimination, or\nother forms of unfair treatment. An algorithm trained on biased data ends up\nmaking decisions that disproportionately impact a certain group of people.\nTherefore, it is crucial to detect and remove these biases to ensure the fair\nand ethical use of data. To this end, we develop a comprehensive and robust\nframework \\textsc{Nbias} that consists of a data layer, corpus construction layer,\nmodel development layer, and an evaluation layer. The dataset is constructed by\ncollecting diverse data from various fields, including social media,\nhealthcare, and job hiring portals. As such, we applied a transformer-based\ntoken classification model that is able to identify bias words/phrases through\na unique named entity. In the assessment procedure, we incorporate a blend of\nquantitative and qualitative evaluations to gauge the effectiveness of our\nmodels. We achieve accuracy improvements ranging from 1% to 8% compared to\nbaselines. We are also able to generate a robust understanding of the model\nfunctioning, capturing not only numerical data but also the quality and\nintricacies of its performance. The proposed approach is applicable to a\nvariety of biases and contributes to the fair and ethical use of textual data.\n","authors":["Shaina Raza","Muskan Garg","Deepak John Reji","Syed Raza Bashir","Chen Ding"],"pdf_url":"https://arxiv.org/pdf/2308.01681v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2308.04014v1","updated":"2023-08-08T03:18:18Z","published":"2023-08-08T03:18:18Z","title":"Continual Pre-Training of Large Language Models: How to (re)warm your\n model?","summary":" Large language models (LLMs) are routinely pre-trained on billions of tokens,\nonly to restart the process over again once new data becomes available. A much\ncheaper and more efficient solution would be to enable the continual\npre-training of these models, i.e. updating pre-trained models with new data\ninstead of re-training them from scratch. However, the distribution shift\ninduced by novel data typically results in degraded performance on past data.\nTaking a step towards efficient continual pre-training, in this work, we\nexamine the effect of different warm-up strategies. Our hypothesis is that the\nlearning rate must be re-increased to improve compute efficiency when training\non a new dataset. We study the warmup phase of models pre-trained on the Pile\n(upstream data, 300B tokens) as we continue to pre-train on SlimPajama\n(downstream data, 297B tokens), following a linear warmup and cosine decay\nschedule. 
We conduct all experiments on the Pythia 410M language model\narchitecture and evaluate performance through validation perplexity. We\nexperiment with different pre-training checkpoints, various maximum learning\nrates, and various warmup lengths. Our results show that while rewarming models\nfirst increases the loss on upstream and downstream data, in the longer run it\nimproves the downstream performance, outperforming models trained from\nscratch$\\unicode{x2013}$even for a large downstream dataset.\n","authors":["Kshitij Gupta","Benjamin Thérien","Adam Ibrahim","Mats L. Richter","Quentin Anthony","Eugene Belilovsky","Irina Rish","Timothée Lesort"],"pdf_url":"https://arxiv.org/pdf/2308.04014v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03131v2","updated":"2023-08-08T02:01:14Z","published":"2023-08-06T14:49:26Z","title":"Towards Multiple References Era -- Addressing Data Leakage and Limited\n Reference Diversity in NLG Evaluation","summary":" N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely\nutilized across a range of natural language generation (NLG) tasks. However,\nrecent studies have revealed a weak correlation between these matching-based\nmetrics and human evaluations, especially when compared with neural-based\nmetrics like BLEURT. In this paper, we conjecture that the performance\nbottleneck in matching-based metrics may be caused by the limited diversity of\nreferences. To address this issue, we propose to utilize \\textit{multiple\nreferences} to enhance the consistency between these metrics and human\nevaluations. Within the WMT Metrics benchmarks, we observe that the\nmulti-references F200spBLEU surpasses the conventional single-reference one by\nan accuracy improvement of 7.2\\%. Remarkably, it also exceeds the neural-based\nBERTscore by an accuracy enhancement of 3.9\\%. Moreover, we observe that the\ndata leakage issue in large language models (LLMs) can be mitigated to a large\nextent by our multi-reference metric. We release the code and data at\n\\url{https://github.com/SefaZeng/LLM-Ref}\n","authors":["Xianfeng Zeng","Yijin Liu","Fandong Meng","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2308.03131v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03983v1","updated":"2023-08-08T02:00:43Z","published":"2023-08-08T02:00:43Z","title":"SimplyRetrieve: A Private and Lightweight Retrieval-Centric Generative\n AI Tool","summary":" Large Language Model (LLM) based Generative AI systems have seen significant\nprogress in recent years. Integrating a knowledge retrieval architecture allows\nfor seamless integration of private data into publicly available Generative AI\nsystems using pre-trained LLM without requiring additional model fine-tuning.\nMoreover, Retrieval-Centric Generation (RCG) approach, a promising future\nresearch direction that explicitly separates roles of LLMs and retrievers in\ncontext interpretation and knowledge memorization, potentially leads to more\nefficient implementation. SimplyRetrieve is an open-source tool with the goal\nof providing a localized, lightweight, and user-friendly interface to these\nsophisticated advancements to the machine learning community. SimplyRetrieve\nfeatures a GUI and API based RCG platform, assisted by a Private Knowledge Base\nConstructor and a Retrieval Tuning Module. By leveraging these capabilities,\nusers can explore the potential of RCG for improving generative AI performance\nwhile maintaining privacy standards. 
The tool is available at\nhttps://github.com/RCGAI/SimplyRetrieve with an MIT license.\n","authors":["Youyang Ng","Daisuke Miyashita","Yasuto Hoshi","Yasuhiro Morioka","Osamu Torii","Tomoya Kodama","Jun Deguchi"],"pdf_url":"https://arxiv.org/pdf/2308.03983v1.pdf","comment":"12 pages, 6 figures"},{"id":"http://arxiv.org/abs/2308.04625v1","updated":"2023-08-08T23:31:10Z","published":"2023-08-08T23:31:10Z","title":"A Comparative Study of Sentence Embedding Models for Assessing Semantic\n Variation","summary":" Analyzing the pattern of semantic variation in long real-world texts such as\nbooks or transcripts is interesting from the stylistic, cognitive, and\nlinguistic perspectives. It is also useful for applications such as text\nsegmentation, document summarization, and detection of semantic novelty. The\nrecent emergence of several vector-space methods for sentence embedding has\nmade such analysis feasible. However, this raises the issue of how consistent\nand meaningful the semantic representations produced by various methods are in\nthemselves. In this paper, we compare several recent sentence embedding methods\nvia time-series of semantic similarity between successive sentences and\nmatrices of pairwise sentence similarity for multiple books of literature. In\ncontrast to previous work using target tasks and curated datasets to compare\nsentence embedding methods, our approach provides an evaluation of the methods\n'in the wild'. We find that most of the sentence embedding methods considered\ndo infer highly correlated patterns of semantic similarity in a given document,\nbut show interesting differences.\n","authors":["Deven M. Mistry","Ali A. Minai"],"pdf_url":"https://arxiv.org/pdf/2308.04625v1.pdf","comment":"12 pages, 6 figures, Accepted for publication in the Proceedings of\n the 2023 International Conference on Artificial Neural Networks, Heraklion,\n Greece, September 26-29, 2023"},{"id":"http://arxiv.org/abs/2308.04624v1","updated":"2023-08-08T23:30:20Z","published":"2023-08-08T23:30:20Z","title":"Benchmarking LLM powered Chatbots: Methods and Metrics","summary":" Autonomous conversational agents, i.e. chatbots, are becoming an increasingly\ncommon mechanism for enterprises to provide support to customers and partners.\nIn order to rate chatbots, especially ones powered by Generative AI tools like\nLarge Language Models (LLMs), we need to be able to accurately assess their\nperformance. This is where chatbot benchmarking becomes important. In this\npaper, we propose the use of a novel benchmark that we call the E2E (End to\nEnd) benchmark, and show how the E2E benchmark can be used to evaluate accuracy\nand usefulness of the answers provided by chatbots, especially ones powered by\nLLMs. We evaluate an example chatbot at different levels of sophistication\nbased on both our E2E benchmark, as well as other available metrics commonly\nused in the state of the art, and observe that the proposed benchmark shows better\nresults compared to others. In addition, while some metrics proved to be\nunpredictable, the metric associated with the E2E benchmark, which uses cosine\nsimilarity, performed well in evaluating chatbots. 
The performance of our best\nmodels shows that there are several benefits of using the cosine similarity\nscore as a metric in the E2E benchmark.\n","authors":["Debarag Banerjee","Pooja Singh","Arjun Avadhanam","Saksham Srivastava"],"pdf_url":"https://arxiv.org/pdf/2308.04624v1.pdf","comment":"8 pages, 14 figures"},{"id":"http://arxiv.org/abs/2308.04623v1","updated":"2023-08-08T23:29:55Z","published":"2023-08-08T23:29:55Z","title":"Accelerating LLM Inference with Staged Speculative Decoding","summary":" Recent advances with large language models (LLM) illustrate their diverse\ncapabilities. We propose a novel algorithm, staged speculative decoding, to\naccelerate LLM inference in small-batch, on-device scenarios. We address the\nlow arithmetic intensity of small-batch inference by improving upon previous\nwork in speculative decoding. First, we restructure the speculative batch as a\ntree, which reduces generation costs and increases the expected tokens per\nbatch. Second, we add a second stage of speculative decoding. Taken together,\nwe reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L\nmodel while perfectly preserving output quality.\n","authors":["Benjamin Spector","Chris Re"],"pdf_url":"https://arxiv.org/pdf/2308.04623v1.pdf","comment":"Published at ES-FOMO at ICML 2023"},{"id":"http://arxiv.org/abs/2307.07415v2","updated":"2023-08-08T21:26:53Z","published":"2023-07-13T00:49:27Z","title":"AutoHint: Automatic Prompt Optimization with Hint Generation","summary":" This paper presents AutoHint, a novel framework for automatic prompt\nengineering and optimization for Large Language Models (LLM). While LLMs have\ndemonstrated remarkable ability in achieving high-quality annotation in various\ntasks, the key to applying this ability to specific tasks lies in developing\nhigh-quality prompts. Thus we propose a framework to inherit the merits of both\nin-context learning and zero-shot learning by incorporating enriched\ninstructions derived from input-output demonstrations to optimize original\nprompt. We refer to the enrichment as the hint and propose a framework to\nautomatically generate the hint from labeled data. More concretely, starting\nfrom an initial prompt, our method first instructs a LLM to deduce new hints\nfor selected samples from incorrect predictions, and then summarizes from\nper-sample hints and adds the results back to the initial prompt to form a new,\nenriched instruction. The proposed method is evaluated on the BIG-Bench\nInstruction Induction dataset for both zero-shot and few-short prompts, where\nexperiments demonstrate our method is able to significantly boost accuracy for\nmultiple tasks.\n","authors":["Hong Sun","Xue Li","Yinchuan Xu","Youkow Homma","Qi Cao","Min Wu","Jian Jiao","Denis Charles"],"pdf_url":"https://arxiv.org/pdf/2307.07415v2.pdf","comment":"KDD 2023: Foundations and Applications in Large-scale AI\n Models-Pre-training, Fine-tuning, and Prompt-based Learning workshop"},{"id":"http://arxiv.org/abs/2308.04592v1","updated":"2023-08-08T21:23:23Z","published":"2023-08-08T21:23:23Z","title":"Shepherd: A Critic for Language Model Generation","summary":" As large language models improve, there is increasing interest in techniques\nthat leverage these models' capabilities to refine their own outputs. 
In this\nwork, we introduce Shepherd, a language model specifically tuned to critique\nresponses and suggest refinements, extending beyond the capabilities of an\nuntuned model to identify diverse errors and provide suggestions to remedy\nthem. At the core of our approach is a high quality feedback dataset, which we\ncurate from community feedback and human annotations. Even though Shepherd is\nsmall (7B parameters), its critiques are either equivalent or preferred to\nthose from established models including ChatGPT. Using GPT-4 for evaluation,\nShepherd reaches an average win-rate of 53-87% compared to competitive\nalternatives. In human evaluation, Shepherd strictly outperforms other models\nand on average closely ties with ChatGPT.\n","authors":["Tianlu Wang","Ping Yu","Xiaoqing Ellen Tan","Sean O'Brien","Ramakanth Pasunuru","Jane Dwivedi-Yu","Olga Golovneva","Luke Zettlemoyer","Maryam Fazel-Zarandi","Asli Celikyilmaz"],"pdf_url":"https://arxiv.org/pdf/2308.04592v1.pdf","comment":"7 figures, 7 tables"},{"id":"http://arxiv.org/abs/2308.04566v1","updated":"2023-08-08T20:29:13Z","published":"2023-08-08T20:29:13Z","title":"Single-Sentence Reader: A Novel Approach for Addressing Answer Position\n Bias","summary":" Machine Reading Comprehension (MRC) models tend to take advantage of spurious\ncorrelations (also known as dataset bias or annotation artifacts in the\nresearch community). Consequently, these models may perform the MRC task\nwithout fully comprehending the given context and question, which is\nundesirable since it may result in low robustness against distribution shift.\nThis paper delves into the concept of answer-position bias, where a significant\npercentage of training questions have answers located solely in the first\nsentence of the context. We propose a Single-Sentence Reader as a new approach\nfor addressing answer position bias in MRC. We implement this approach using\nsix different models and thoroughly analyze their performance. Remarkably, our\nproposed Single-Sentence Readers achieve results that nearly match those of\nmodels trained on conventional training sets, proving their effectiveness. Our\nstudy also discusses several challenges our Single-Sentence Readers encounter\nand proposes a potential solution.\n","authors":["Son Quoc Tran","Matt Kretchmar"],"pdf_url":"https://arxiv.org/pdf/2308.04566v1.pdf","comment":"11 pages, 5 tables, 2 figures. arXiv admin note: text overlap with\n arXiv:2211.16220 by other authors"},{"id":"http://arxiv.org/abs/2308.04534v1","updated":"2023-08-08T18:56:52Z","published":"2023-08-08T18:56:52Z","title":"Ahead of the Text: Leveraging Entity Preposition for Financial Relation\n Extraction","summary":" In the context of the ACM KDF-SIGIR 2023 competition, we undertook an entity\nrelation task on a dataset of financial entity relations called REFind. Our\ntop-performing solution involved a multi-step approach. Initially, we inserted\nthe provided entities at their corresponding locations within the text.\nSubsequently, we fine-tuned the transformer-based language model roberta-large\nfor text classification by utilizing a labeled training set to predict the\nentity relations. Lastly, we implemented a post-processing phase to identify\nand handle improbable predictions generated by the model. 
As a result of our\nmethodology, we achieved the 1st place ranking on the competition's public\nleaderboard.\n","authors":["Stefan Pasch","Dimitrios Petridis"],"pdf_url":"https://arxiv.org/pdf/2308.04534v1.pdf","comment":"Stefan Pasch, Dimitrios Petridis 2023. Ahead of the Text: Leveraging\n Entity Preposition for Financial Relation Extraction. ACM SIGIR: The 4th\n Workshop on Knowledge Discovery from Unstructured Data in Financial Services\n (SIGIR-KDF '23)"},{"id":"http://arxiv.org/abs/2308.04519v1","updated":"2023-08-08T18:35:22Z","published":"2023-08-08T18:35:22Z","title":"DisCoCat for Donkey Sentences","summary":" We demonstrate how to parse Geach's Donkey sentences in a compositional\ndistributional model of meaning. We build on previous work on the DisCoCat\n(Distributional Compositional Categorical) framework, including extensions that\nmodel discourse, determiners, and relative pronouns. We present a type-logical\nsyntax for parsing donkey sentences, for which we define both relational and\nvector space semantics.\n","authors":["Lachlan McPheat","Daphne Wang"],"pdf_url":"https://arxiv.org/pdf/2308.04519v1.pdf","comment":"In Proceedings AMSLO 2023, arXiv:2308.03679"},{"id":"http://arxiv.org/abs/2308.04502v1","updated":"2023-08-08T18:11:27Z","published":"2023-08-08T18:11:27Z","title":"Revisiting Disentanglement and Fusion on Modality and Context in\n Conversational Multimodal Emotion Recognition","summary":" It has been a hot research topic to enable machines to understand human\nemotions in multimodal contexts under dialogue scenarios, which is tasked with\nmultimodal emotion analysis in conversation (MM-ERC). MM-ERC has received\nconsistent attention in recent years, where a diverse range of methods has been\nproposed for securing better task performance. Most existing works treat MM-ERC\nas a standard multimodal classification problem and perform multimodal feature\ndisentanglement and fusion for maximizing feature utility. Yet after revisiting\nthe characteristic of MM-ERC, we argue that both the feature multimodality and\nconversational contextualization should be properly modeled simultaneously\nduring the feature disentanglement and fusion steps. In this work, we target\nfurther pushing the task performance by taking full consideration of the above\ninsights. On the one hand, during feature disentanglement, based on the\ncontrastive learning technique, we devise a Dual-level Disentanglement\nMechanism (DDM) to decouple the features into both the modality space and\nutterance space. On the other hand, during the feature fusion stage, we propose\na Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism\n(CRM) for multimodal and context integration, respectively. They together\nschedule the proper integrations of multimodal and context features.\nSpecifically, CFM explicitly manages the multimodal feature contributions\ndynamically, while CRM flexibly coordinates the introduction of dialogue\ncontexts. On two public MM-ERC datasets, our system achieves new\nstate-of-the-art performance consistently. Further analyses demonstrate that\nall our proposed mechanisms greatly facilitate the MM-ERC task by making full\nuse of the multimodal and context features adaptively. 
Note that our proposed\nmethods have the great potential to facilitate a broader range of other\nconversational multimodal tasks.\n","authors":["Bobo Li","Hao Fei","Lizi Liao","Yu Zhao","Chong Teng","Tat-Seng Chua","Donghong Ji","Fei Li"],"pdf_url":"https://arxiv.org/pdf/2308.04502v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.04673v2","updated":"2023-08-08T18:04:11Z","published":"2023-03-08T15:52:14Z","title":"Cost-Effective Hyperparameter Optimization for Large Language Model\n Generation Inference","summary":" Large Language Models (LLMs) have sparked significant interest in their\ngenerative capabilities, leading to the development of various commercial\napplications. The high cost of using the models drives application builders to\nmaximize the value of generation under a limited inference budget. This paper\npresents a study of optimizing inference hyperparameters such as the number of\nresponses, temperature and max tokens, which significantly affects the\nutility/cost of text generation. We design a framework named EcoOptiGen which\nleverages economical hyperparameter optimization and cost-based pruning.\nExperiments with the GPT-3.5/GPT-4 models on a variety of tasks verify its\neffectiveness. EcoOptiGen is implemented in the `autogen' package of the FLAML\nlibrary: \\url{https://aka.ms/autogen}.\n","authors":["Chi Wang","Susan Xueqing Liu","Ahmed H. Awadallah"],"pdf_url":"https://arxiv.org/pdf/2303.04673v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04498v1","updated":"2023-08-08T18:03:29Z","published":"2023-08-08T18:03:29Z","title":"DialogRE^C+: An Extension of DialogRE to Investigate How Much\n Coreference Helps Relation Extraction in Dialogs","summary":" Dialogue relation extraction (DRE) that identifies the relations between\nargument pairs in dialogue text, suffers much from the frequent occurrence of\npersonal pronouns, or entity and speaker coreference. This work introduces a\nnew benchmark dataset DialogRE^C+, introducing coreference resolution into the\nDRE scenario. With the aid of high-quality coreference knowledge, the reasoning\nof argument relations is expected to be enhanced. In DialogRE^C+ dataset, we\nmanually annotate total 5,068 coreference chains over 36,369 argument mentions\nbased on the existing DialogRE data, where four different coreference chain\ntypes namely speaker chain, person chain, location chain and organization chain\nare explicitly marked. We further develop 4 coreference-enhanced graph-based\nDRE models, which learn effective coreference representations for improving the\nDRE task. We also train a coreference resolution model based on our annotations\nand evaluate the effect of automatically extracted coreference chains\ndemonstrating the practicality of our dataset and its potential to other\ndomains and tasks.\n","authors":["Yiyun Xiong","Mengwei Dai","Fei Li","Hao Fei","Bobo Li","Shengqiong Wu","Donghong Ji","Chong Teng"],"pdf_url":"https://arxiv.org/pdf/2308.04498v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.00595v2","updated":"2023-08-08T17:16:03Z","published":"2023-03-01T15:35:32Z","title":"A Universal Question-Answering Platform for Knowledge Graphs","summary":" Knowledge from diverse application domains is organized as knowledge graphs\n(KGs) that are stored in RDF engines accessible in the web via SPARQL\nendpoints. Expressing a well-formed SPARQL query requires information about the\ngraph structure and the exact URIs of its components, which is impractical for\nthe average user. 
Question answering (QA) systems assist by translating natural\nlanguage questions to SPARQL. Existing QA systems are typically based on\napplication-specific human-curated rules, or require prior information,\nexpensive pre-processing and model adaptation for each targeted KG. Therefore,\nthey are hard to generalize to a broad set of applications and KGs.\n In this paper, we propose KGQAn, a universal QA system that does not need to\nbe tailored to each target KG. Instead of curated rules, KGQAn introduces a\nnovel formalization of question understanding as a text generation problem to\nconvert a question into an intermediate abstract representation via a neural\nsequence-to-sequence model. We also develop a just-in-time linker that maps at\nquery time the abstract representation to a SPARQL query for a specific KG,\nusing only the publicly accessible APIs and the existing indices of the RDF\nstore, without requiring any pre-processing. Our experiments with several real\nKGs demonstrate that KGQAn is easily deployed and outperforms by a large margin\nthe state-of-the-art in terms of quality of answers and processing time,\nespecially for arbitrary KGs, unseen during the training.\n","authors":["Reham Omar","Ishika Dhall","Panos Kalnis","Essam Mansour"],"pdf_url":"https://arxiv.org/pdf/2303.00595v2.pdf","comment":"The paper is accepted to SIGMOD 2023"},{"id":"http://arxiv.org/abs/2308.04226v1","updated":"2023-08-08T12:45:01Z","published":"2023-08-08T12:45:01Z","title":"OpinionConv: Conversational Product Search with Grounded Opinions","summary":" When searching for products, the opinions of others play an important role in\nmaking informed decisions. Subjective experiences about a product can be a\nvaluable source of information. This is also true in sales conversations, where\na customer and a sales assistant exchange facts and opinions about products.\nHowever, training an AI for such conversations is complicated by the fact that\nlanguage models do not possess authentic opinions for their lack of real-world\nexperience. We address this problem by leveraging product reviews as a rich\nsource of product opinions to ground conversational AI in true subjective\nnarratives. With OpinionConv, we develop the first conversational AI for\nsimulating sales conversations. To validate the generated conversations, we\nconduct several user studies showing that the generated opinions are perceived\nas realistic. Our assessors also confirm the importance of opinions as an\ninformative basis for decision-making.\n","authors":["Vahid Sadiri Javadi","Martin Potthast","Lucie Flek"],"pdf_url":"https://arxiv.org/pdf/2308.04226v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2308.04431v1","updated":"2023-08-08T17:58:45Z","published":"2023-08-08T17:58:45Z","title":"When More is Less: Incorporating Additional Datasets Can Hurt\n Performance By Introducing Spurious Correlations","summary":" In machine learning, incorporating more data is often seen as a reliable\nstrategy for improving model performance; this work challenges that notion by\ndemonstrating that the addition of external datasets in many cases can hurt the\nresulting model's performance. In a large-scale empirical study across\ncombinations of four different open-source chest x-ray datasets and 9 different\nlabels, we demonstrate that in 43% of settings, a model trained on data from\ntwo hospitals has poorer worst group accuracy over both hospitals than a model\ntrained on just a single hospital's data. 
This surprising result occurs even\nthough the added hospital makes the training distribution more similar to the\ntest distribution. We explain that this phenomenon arises from the spurious\ncorrelation that emerges between the disease and hospital, due to\nhospital-specific image artifacts. We highlight the trade-off one encounters\nwhen training on multiple datasets, between the obvious benefit of additional\ndata and insidious cost of the introduced spurious correlation. In some cases,\nbalancing the dataset can remove the spurious correlation and improve\nperformance, but it is not always an effective strategy. We contextualize our\nresults within the literature on spurious correlations to help explain these\noutcomes. Our experiments underscore the importance of exercising caution when\nselecting training data for machine learning models, especially in settings\nwhere there is a risk of spurious correlations such as with medical imaging.\nThe risks outlined highlight the need for careful data selection and model\nevaluation in future research and practice.\n","authors":["Rhys Compton","Lily Zhang","Aahlad Puli","Rajesh Ranganath"],"pdf_url":"https://arxiv.org/pdf/2308.04431v1.pdf","comment":"Accepted at MLHC 2023"},{"id":"http://arxiv.org/abs/2308.04426v1","updated":"2023-08-08T17:55:30Z","published":"2023-08-08T17:55:30Z","title":"A Deep-Learning Method Using Auto-encoder and Generative Adversarial\n Network for Anomaly Detection on Ancient Stone Stele Surfaces","summary":" Accurate detection of natural deterioration and man-made damage on the\nsurfaces of ancient stele in the first instance is essential for their\npreventive conservation. Existing methods for cultural heritage preservation\nare not able to achieve this goal perfectly due to the difficulty of balancing\naccuracy, efficiency, timeliness, and cost. This paper presents a deep-learning\nmethod to automatically detect above mentioned emergencies on ancient stone\nstele in real time, employing autoencoder (AE) and generative adversarial\nnetwork (GAN). The proposed method overcomes the limitations of existing\nmethods by requiring no extensive anomaly samples while enabling comprehensive\ndetection of unpredictable anomalies. the method includes stages of monitoring,\ndata acquisition, pre-processing, model structuring, and post-processing.\nTaking the Longmen Grottoes' stone steles as a case study, an unsupervised\nlearning model based on AE and GAN architectures is proposed and validated with\na reconstruction accuracy of 99.74\\%. The method's evaluation revealed the\nproficient detection of seven artificially designed anomalies and demonstrated\nprecision and reliability without false alarms. This research provides novel\nideas and possibilities for the application of deep learning in the field of\ncultural heritage.\n","authors":["Yikun Liu","Yuning Wang","Cheng Liu"],"pdf_url":"https://arxiv.org/pdf/2308.04426v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.04170v3","updated":"2023-08-08T17:49:29Z","published":"2023-05-07T03:00:06Z","title":"YOLOCS: Object Detection based on Dense Channel Compression for Feature\n Spatial Solidification","summary":" In this study, we examine the associations between channel features and\nconvolutional kernels during the processes of feature purification and gradient\nbackpropagation, with a focus on the forward and backward propagation within\nthe network. Consequently, we propose a method called Dense Channel Compression\nfor Feature Spatial Solidification. 
Drawing upon the central concept of this\nmethod, we introduce two innovative modules for backbone and head networks: the\nDense Channel Compression for Feature Spatial Solidification Structure (DCFS)\nand the Asymmetric Multi-Level Compression Decoupled Head (ADH). When\nintegrated into the YOLOv5 model, these two modules demonstrate exceptional\nperformance, resulting in a modified model referred to as YOLOCS. Evaluated on\nthe MSCOCO dataset, the large, medium, and small YOLOCS models yield AP of\n50.1%, 47.6%, and 42.5%, respectively. Maintaining inference speeds remarkably\nsimilar to those of the YOLOv5 model, the large, medium, and small YOLOCS\nmodels surpass the YOLOv5 model's AP by 1.1%, 2.3%, and 5.2%, respectively.\n","authors":["Lin Huang","Weisheng Li","Linlin Shen","Haojie Fu","Xue Xiao","Suihan Xiao"],"pdf_url":"https://arxiv.org/pdf/2305.04170v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04417v1","updated":"2023-08-08T17:34:28Z","published":"2023-08-08T17:34:28Z","title":"DiffCR: A Fast Conditional Diffusion Framework for Cloud Removal from\n Optical Satellite Images","summary":" Optical satellite images are a critical data source; however, cloud cover\noften compromises their quality, hindering image applications and analysis.\nConsequently, effectively removing clouds from optical satellite images has\nemerged as a prominent research direction. While recent advancements in cloud\nremoval primarily rely on generative adversarial networks, which may yield\nsuboptimal image quality, diffusion models have demonstrated remarkable success\nin diverse image-generation tasks, showcasing their potential in addressing\nthis challenge. This paper presents a novel framework called DiffCR, which\nleverages conditional guided diffusion with deep convolutional networks for\nhigh-performance cloud removal for optical satellite imagery. Specifically, we\nintroduce a decoupled encoder for conditional image feature extraction,\nproviding a robust color representation to ensure the close similarity of\nappearance information between the conditional input and the synthesized\noutput. Moreover, we propose a novel and efficient time and condition fusion\nblock within the cloud removal model to accurately simulate the correspondence\nbetween the appearance in the conditional image and the target image at a low\ncomputational cost. Extensive experimental evaluations on two commonly used\nbenchmark datasets demonstrate that DiffCR consistently achieves\nstate-of-the-art performance on all metrics, with parameter and computational\ncomplexities amounting to only 5.1% and 5.4%, respectively, of those previous\nbest methods. The source code, pre-trained models, and all the experimental\nresults will be publicly available at https://github.com/XavierJiezou/DiffCR\nupon the paper's acceptance of this work.\n","authors":["Xuechao Zou","Kai Li","Junliang Xing","Yu Zhang","Shiying Wang","Lei Jin","Pin Tao"],"pdf_url":"https://arxiv.org/pdf/2308.04417v1.pdf","comment":"13 pages, 7 figures"},{"id":"http://arxiv.org/abs/2306.09345v2","updated":"2023-08-08T17:26:58Z","published":"2023-06-15T17:59:51Z","title":"Evaluating Data Attribution for Text-to-Image Models","summary":" While large text-to-image models are able to synthesize \"novel\" images, these\nimages are necessarily a reflection of the training data. 
The problem of data\nattribution in such models -- which of the images in the training set are most\nresponsible for the appearance of a given generated image -- is a difficult yet\nimportant one. As an initial step toward this problem, we evaluate attribution\nthrough \"customization\" methods, which tune an existing large-scale model\ntoward a given exemplar object or style. Our key insight is that this allows us\nto efficiently create synthetic images that are computationally influenced by\nthe exemplar by construction. With our new dataset of such exemplar-influenced\nimages, we are able to evaluate various data attribution algorithms and\ndifferent possible feature spaces. Furthermore, by training on our dataset, we\ncan tune standard models, such as DINO, CLIP, and ViT, toward the attribution\nproblem. Even though the procedure is tuned towards small exemplar sets, we\nshow generalization to larger sets. Finally, by taking into account the\ninherent uncertainty of the problem, we can assign soft attribution scores over\na set of training images.\n","authors":["Sheng-Yu Wang","Alexei A. Efros","Jun-Yan Zhu","Richard Zhang"],"pdf_url":"https://arxiv.org/pdf/2306.09345v2.pdf","comment":"Updated v2 -- ICCV 2023 camera ready version. Project page:\n https://peterwang512.github.io/GenDataAttribution Code:\n https://github.com/PeterWang512/GenDataAttribution"},{"id":"http://arxiv.org/abs/2308.04413v1","updated":"2023-08-08T17:18:59Z","published":"2023-08-08T17:18:59Z","title":"Digging into Depth Priors for Outdoor Neural Radiance Fields","summary":" Neural Radiance Fields (NeRF) have demonstrated impressive performance in\nvision and graphics tasks, such as novel view synthesis and immersive reality.\nHowever, the shape-radiance ambiguity of radiance fields remains a challenge,\nespecially in the sparse viewpoints setting. Recent work resorts to integrating\ndepth priors into outdoor NeRF training to alleviate the issue. However, the\ncriteria for selecting depth priors and the relative merits of different priors\nhave not been thoroughly investigated. Moreover, the relative merits of\nselecting different approaches to use the depth priors is also an unexplored\nproblem. In this paper, we provide a comprehensive study and evaluation of\nemploying depth priors to outdoor neural radiance fields, covering common depth\nsensing technologies and most application ways. Specifically, we conduct\nextensive experiments with two representative NeRF methods equipped with four\ncommonly-used depth priors and different depth usages on two widely used\noutdoor datasets. Our experimental results reveal several interesting findings\nthat can potentially benefit practitioners and researchers in training their\nNeRF models with depth priors. Project Page:\nhttps://cwchenwang.github.io/outdoor-nerf-depth\n","authors":["Chen Wang","Jiadai Sun","Lina Liu","Chenming Wu","Zhelun Shen","Dayan Wu","Yuchao Dai","Liangjun Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.04413v1.pdf","comment":"Accepted to ACM MM 2023. Project Page:\n https://cwchenwang.github.io/outdoor-nerf-depth"},{"id":"http://arxiv.org/abs/2308.04409v1","updated":"2023-08-08T17:14:14Z","published":"2023-08-08T17:14:14Z","title":"V-DETR: DETR with Vertex Relative Position Encoding for 3D Object\n Detection","summary":" We introduce a highly performant 3D object detector for point clouds using\nthe DETR framework. 
The prior attempts all end up with suboptimal results\nbecause they fail to learn accurate inductive biases from the limited scale of\ntraining data. In particular, the queries often attend to points that are far\naway from the target objects, violating the locality principle in object\ndetection. To address the limitation, we introduce a novel 3D Vertex Relative\nPosition Encoding (3DV-RPE) method which computes position encoding for each\npoint based on its relative position to the 3D boxes predicted by the queries\nin each decoder layer, thus providing clear information to guide the model to\nfocus on points near the objects, in accordance with the principle of locality.\nIn addition, we systematically improve the pipeline from various aspects such\nas data normalization based on our understanding of the task. We show\nexceptional results on the challenging ScanNetV2 benchmark, achieving\nsignificant improvements over the previous 3DETR in\n$\\rm{AP}_{25}$/$\\rm{AP}_{50}$ from 65.0\\%/47.0\\% to 77.8\\%/66.0\\%,\nrespectively. In addition, our method sets a new record on ScanNetV2 and SUN\nRGB-D datasets.Code will be released at http://github.com/yichaoshen-MS/V-DETR.\n","authors":["Yichao Shen","Zigang Geng","Yuhui Yuan","Yutong Lin","Ze Liu","Chunyu Wang","Han Hu","Nanning Zheng","Baining Guo"],"pdf_url":"https://arxiv.org/pdf/2308.04409v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04402v1","updated":"2023-08-08T17:04:53Z","published":"2023-08-08T17:04:53Z","title":"Person Re-Identification without Identification via Event Anonymization","summary":" Wide-scale use of visual surveillance in public spaces puts individual\nprivacy at stake while increasing resource consumption (energy, bandwidth, and\ncomputation). Neuromorphic vision sensors (event-cameras) have been recently\nconsidered a valid solution to the privacy issue because they do not capture\ndetailed RGB visual information of the subjects in the scene. However, recent\ndeep learning architectures have been able to reconstruct images from event\ncameras with high fidelity, reintroducing a potential threat to privacy for\nevent-based vision applications. In this paper, we aim to anonymize\nevent-streams to protect the identity of human subjects against such image\nreconstruction attacks. To achieve this, we propose an end-to-end network\narchitecture jointly optimized for the twofold objective of preserving privacy\nand performing a downstream task such as person ReId. Our network learns to\nscramble events, enforcing the degradation of images recovered from the privacy\nattacker. In this work, we also bring to the community the first ever\nevent-based person ReId dataset gathered to evaluate the performance of our\napproach. We validate our approach with extensive experiments and report\nresults on the synthetic event data simulated from the publicly available\nSoftBio dataset and our proposed Event-ReId dataset.\n","authors":["Shafiq Ahmad","Pietro Morerio","Alessio Del Bue"],"pdf_url":"https://arxiv.org/pdf/2308.04402v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04397v1","updated":"2023-08-08T17:01:33Z","published":"2023-08-08T17:01:33Z","title":"LEFormer: A Hybrid CNN-Transformer Architecture for Accurate Lake\n Extraction from Remote Sensing Imagery","summary":" Lake extraction from remote sensing imagery is challenging due to the complex\nshapes of lakes and the presence of noise. Existing methods suffer from blurred\nsegmentation boundaries and poor foreground modeling. 
In this paper, we propose\na hybrid CNN-Transformer architecture, called LEFormer, for accurate lake\nextraction. LEFormer contains four main modules: CNN encoder, Transformer\nencoder, cross-encoder fusion, and lightweight decoder. The CNN encoder\nrecovers local spatial information and improves fine-scale details.\nSimultaneously, the Transformer encoder captures long-range dependencies\nbetween sequences of any length, allowing it to obtain global features and\ncontext information better. Finally, a lightweight decoder is employed for mask\nprediction. We evaluate the performance and efficiency of LEFormer on two\ndatasets, the Surface Water (SW) and the Qinghai-Tibet Plateau Lake (QTPL).\nExperimental results show that LEFormer consistently achieves state-of-the-art\n(SOTA) performance and efficiency on these two datasets, outperforming existing\nmethods. Specifically, LEFormer achieves 90.86% and 97.42% mIoU on the SW and\nQTPL datasets with a parameter count of 3.61M, respectively, while being 20x\nsmaller than the previous SOTA method.\n","authors":["Ben Chen","Xuechao Zou","Yu Zhang","Jiayu Li","Kai Li","Pin Tao"],"pdf_url":"https://arxiv.org/pdf/2308.04397v1.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2308.04395v1","updated":"2023-08-08T17:00:11Z","published":"2023-08-08T17:00:11Z","title":"Data Augmentation-Based Unsupervised Domain Adaptation In Medical\n Imaging","summary":" Deep learning-based models in medical imaging often struggle to generalize\neffectively to new scans due to data heterogeneity arising from differences in\nhardware, acquisition parameters, population, and artifacts. This limitation\npresents a significant challenge in adopting machine learning models for\nclinical practice. We propose an unsupervised method for robust domain\nadaptation in brain MRI segmentation by leveraging MRI-specific augmentation\ntechniques. To evaluate the effectiveness of our method, we conduct extensive\nexperiments across diverse datasets, modalities, and segmentation tasks,\ncomparing against the state-of-the-art methods. The results show that our\nproposed approach achieves high accuracy, exhibits broad applicability, and\nshowcases remarkable robustness against domain shift in various tasks,\nsurpassing the state-of-the-art performance in the majority of cases.\n","authors":["Sebastian Nørgaard Llambias","Mads Nielsen","Mostafa Mehdipour Ghazi"],"pdf_url":"https://arxiv.org/pdf/2308.04395v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04383v1","updated":"2023-08-08T16:37:24Z","published":"2023-08-08T16:37:24Z","title":"DELFlow: Dense Efficient Learning of Scene Flow for Large-Scale Point\n Clouds","summary":" Point clouds are naturally sparse, while image pixels are dense. The\ninconsistency limits feature fusion from both modalities for point-wise scene\nflow estimation. Previous methods rarely predict scene flow from the entire\npoint clouds of the scene with one-time inference due to the memory\ninefficiency and heavy overhead from distance calculation and sorting involved\nin commonly used farthest point sampling, KNN, and ball query algorithms for\nlocal feature aggregation. 
To mitigate these issues in scene flow learning, we\nregularize raw points to a dense format by storing 3D coordinates in 2D grids.\nUnlike the sampling operation commonly used in existing works, the dense 2D\nrepresentation 1) preserves most points in the given scene, 2) brings in a\nsignificant boost of efficiency, and 3) eliminates the density gap between\npoints and pixels, allowing us to perform effective feature fusion. We also\npresent a novel warping projection technique to alleviate the information loss\nproblem resulting from the fact that multiple points could be mapped into one\ngrid during projection when computing cost volume. Sufficient experiments\ndemonstrate the efficiency and effectiveness of our method, outperforming the\nprior-arts on the FlyingThings3D and KITTI datasets.\n","authors":["Chensheng Peng","Guangming Wang","Xian Wan Lo","Xinrui Wu","Chenfeng Xu","Masayoshi Tomizuka","Wei Zhan","Hesheng Wang"],"pdf_url":"https://arxiv.org/pdf/2308.04383v1.pdf","comment":"Accepted by ICCV2023. Codes will be released at\n https://github.com/IRMVLab/DELFlow"},{"id":"http://arxiv.org/abs/2308.04380v1","updated":"2023-08-08T16:31:43Z","published":"2023-08-08T16:31:43Z","title":"Your Negative May not Be True Negative: Boosting Image-Text Matching\n with False Negative Elimination","summary":" Most existing image-text matching methods adopt triplet loss as the\noptimization objective, and choosing a proper negative sample for the triplet\nof (anchor, positive, negative) is important for effectively training the\nmodel, e.g., hard negatives make the model learn efficiently and effectively.\nHowever, we observe that existing methods mainly employ the most similar\nsamples as hard negatives, which may not be true negatives. In other words, the\nsamples with high similarity but not paired with the anchor may reserve\npositive semantic associations, and we call them false negatives. Repelling\nthese false negatives in triplet loss would mislead the semantic representation\nlearning and result in inferior retrieval performance. In this paper, we\npropose a novel False Negative Elimination (FNE) strategy to select negatives\nvia sampling, which could alleviate the problem introduced by false negatives.\nSpecifically, we first construct the distributions of positive and negative\nsamples separately via their similarities with the anchor, based on the\nfeatures extracted from image and text encoders. Then we calculate the false\nnegative probability of a given sample based on its similarity with the anchor\nand the above distributions via the Bayes' rule, which is employed as the\nsampling weight during the negative sampling process. Since there may not exist any\nfalse negative in a small batch size, we design a memory module with momentum\nto retain a large negative buffer and implement our negative sampling strategy\nspanning over the buffer. In addition, to make the model focus on hard\nnegatives, we reassign the sampling weights for the simple negatives with a\ncut-down strategy. The extensive experiments are conducted on Flickr30K and\nMS-COCO, and the results demonstrate the superiority of our proposed false\nnegative elimination strategy. 
The code is available at\nhttps://github.com/LuminosityX/FNE.\n","authors":["Haoxuan Li","Yi Bin","Junrong Liao","Yang Yang","Heng Tao Shen"],"pdf_url":"https://arxiv.org/pdf/2308.04380v1.pdf","comment":"Accepted at ACM MM 2023"},{"id":"http://arxiv.org/abs/2308.04373v1","updated":"2023-08-08T16:22:44Z","published":"2023-08-08T16:22:44Z","title":"Pelta: Shielding Transformers to Mitigate Evasion Attacks in Federated\n Learning","summary":" The main premise of federated learning is that machine learning model updates\nare computed locally, in particular to preserve user data privacy, as those\nnever leave the perimeter of their device. This mechanism supposes the general\nmodel, once aggregated, to be broadcast to collaborating and non malicious\nnodes. However, without proper defenses, compromised clients can easily probe\nthe model inside their local memory in search of adversarial examples. For\ninstance, considering image-based applications, adversarial examples consist of\nimperceptibly perturbed images (to the human eye) misclassified by the local\nmodel, which can be later presented to a victim node's counterpart model to\nreplicate the attack. To mitigate such malicious probing, we introduce Pelta, a\nnovel shielding mechanism leveraging trusted hardware. By harnessing the\ncapabilities of Trusted Execution Environments (TEEs), Pelta masks part of the\nback-propagation chain rule, otherwise typically exploited by attackers for the\ndesign of malicious samples. We evaluate Pelta on a state of the art ensemble\nmodel and demonstrate its effectiveness against the Self Attention Gradient\nadversarial Attack.\n","authors":["Simon Queyrut","Yérom-David Bromberg","Valerio Schiavoni"],"pdf_url":"https://arxiv.org/pdf/2308.04373v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04370v1","updated":"2023-08-08T16:17:46Z","published":"2023-08-08T16:17:46Z","title":"When Super-Resolution Meets Camouflaged Object Detection: A Comparison\n Study","summary":" Super Resolution (SR) and Camouflaged Object Detection (COD) are two hot\ntopics in computer vision with various joint applications. For instance,\nlow-resolution surveillance images can be successively processed by\nsuper-resolution techniques and camouflaged object detection. However, in\nprevious work, these two areas are always studied in isolation. In this paper,\nwe, for the first time, conduct an integrated comparative evaluation for both.\nSpecifically, we benchmark different super-resolution methods on commonly used\nCOD datasets, and meanwhile, we evaluate the robustness of different COD models\nby using COD data processed by SR methods. Our goal is to bridge these two\ndomains, discover novel experimental phenomena, summarize new experim.\n","authors":["Juan Wen","Shupeng Cheng","Peng Xu","Bowen Zhou","Radu Timofte","Weiyan Hou","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2308.04370v1.pdf","comment":"23 pages with 8 figures"},{"id":"http://arxiv.org/abs/2308.04369v1","updated":"2023-08-08T16:15:35Z","published":"2023-08-08T16:15:35Z","title":"SSTFormer: Bridging Spiking Neural Network and Memory Support\n Transformer for Frame-Event based Recognition","summary":" Event camera-based pattern recognition is a newly arising research topic in\nrecent years. Current researchers usually transform the event streams into\nimages, graphs, or voxels, and adopt deep neural networks for event-based\nclassification. 
Although good performance can be achieved on simple event\nrecognition datasets, their results may still be limited due to the\nfollowing two issues. Firstly, they adopt spatial sparse event streams for\nrecognition only, which may fail to capture the color and detailed texture\ninformation well. Secondly, they adopt either Spiking Neural Networks (SNN) for\nenergy-efficient recognition with suboptimal results, or Artificial Neural\nNetworks (ANN) for energy-intensive, high-performance recognition. However,\nfew of them consider achieving a balance between these two aspects. In this\npaper, we formally propose to recognize patterns by fusing RGB frames and event\nstreams simultaneously and propose a new RGB frame-event recognition framework\nto address the aforementioned issues. The proposed method contains four main\nmodules, i.e., memory support Transformer network for RGB frame encoding,\nspiking neural network for raw event stream encoding, multi-modal bottleneck\nfusion module for RGB-Event feature aggregation, and prediction head. Due to\nthe scarcity of RGB-Event based classification datasets, we also propose a\nlarge-scale PokerEvent dataset which contains 114 classes, and 27102\nframe-event pairs recorded using a DVS346 event camera. Extensive experiments\non two RGB-Event based classification datasets fully validated the\neffectiveness of our proposed framework. We hope this work will boost the\ndevelopment of pattern recognition by fusing RGB frames and event streams. Both\nour dataset and source code of this work will be released at\nhttps://github.com/Event-AHU/SSTFormer.\n","authors":["Xiao Wang","Zongzhen Wu","Yao Rong","Lin Zhu","Bo Jiang","Jin Tang","Yonghong Tian"],"pdf_url":"https://arxiv.org/pdf/2308.04369v1.pdf","comment":"In Peer Review"},{"id":"http://arxiv.org/abs/2303.09040v2","updated":"2023-08-08T16:14:32Z","published":"2023-03-16T02:24:31Z","title":"Hybrid Spectral Denoising Transformer with Guided Attention","summary":" In this paper, we present a Hybrid Spectral Denoising Transformer (HSDT) for\nhyperspectral image denoising. Challenges in adapting transformer for HSI arise\nfrom the capabilities to tackle existing limitations of CNN-based methods in\ncapturing the global and local spatial-spectral correlations while maintaining\nefficiency and flexibility. To address these issues, we introduce a hybrid\napproach that combines the advantages of both models with a Spatial-Spectral\nSeparable Convolution (S3Conv), Guided Spectral Self-Attention (GSSA), and\nSelf-Modulated Feed-Forward Network (SM-FFN). Our S3Conv works as a lightweight\nalternative to 3D convolution, which extracts more spatial-spectral correlated\nfeatures while keeping the flexibility to tackle HSIs with an arbitrary number\nof bands. These features are then adaptively processed by GSSA which performs\n3D self-attention across the spectral bands, guided by a set of learnable\nqueries that encode the spectral signatures. This not only enriches our model\nwith powerful capabilities for identifying global spectral correlations but\nalso maintains linear complexity. Moreover, our SM-FFN proposes the\nself-modulation that intensifies the activations of more informative regions,\nwhich further strengthens the aggregated features. Extensive experiments are\nconducted on various datasets under both simulated and real-world noise, and it\nshows that our HSDT significantly outperforms the existing state-of-the-art\nmethods while maintaining low computational overhead. 
Code is at https:\n//github.com/Zeqiang-Lai/HSDT.\n","authors":["Zeqiang Lai","Chenggang Yan","Ying Fu"],"pdf_url":"https://arxiv.org/pdf/2303.09040v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2304.07916v3","updated":"2023-08-08T16:06:11Z","published":"2023-04-16T23:37:24Z","title":"GaitRef: Gait Recognition with Refined Sequential Skeletons","summary":" Identifying humans with their walking sequences, known as gait recognition,\nis a useful biometric understanding task as it can be observed from a long\ndistance and does not require cooperation from the subject. Two common\nmodalities used for representing the walking sequence of a person are\nsilhouettes and joint skeletons. Silhouette sequences, which record the\nboundary of the walking person in each frame, may suffer from the variant\nappearances from carried-on objects and clothes of the person. Framewise joint\ndetections are noisy and introduce some jitters that are not consistent with\nsequential detections. In this paper, we combine the silhouettes and skeletons\nand refine the framewise joint predictions for gait recognition. With temporal\ninformation from the silhouette sequences, we show that the refined skeletons\ncan improve gait recognition performance without extra annotations. We compare\nour methods on four public datasets, CASIA-B, OUMVLP, Gait3D and GREW, and show\nstate-of-the-art performance.\n","authors":["Haidong Zhu","Wanrong Zheng","Zhaoheng Zheng","Ram Nevatia"],"pdf_url":"https://arxiv.org/pdf/2304.07916v3.pdf","comment":"IJCB 2023 oral. Code is available at\n https://github.com/haidongz-usc/GaitRef"},{"id":"http://arxiv.org/abs/2303.16565v2","updated":"2023-08-08T16:01:41Z","published":"2023-03-29T09:47:48Z","title":"PMAA: A Progressive Multi-scale Attention Autoencoder Model for\n High-performance Cloud Removal from Multi-temporal Satellite Imagery","summary":" Satellite imagery analysis plays a pivotal role in remote sensing; however,\ninformation loss due to cloud cover significantly impedes its application.\nAlthough existing deep cloud removal models have achieved notable outcomes,\nthey scarcely consider contextual information. This study introduces a\nhigh-performance cloud removal architecture, termed Progressive Multi-scale\nAttention Autoencoder (PMAA), which concurrently harnesses global and local\ninformation to construct robust contextual dependencies using a novel\nMulti-scale Attention Module (MAM) and a novel Local Interaction Module (LIM).\nPMAA establishes long-range dependencies of multi-scale features using MAM and\nmodulates the reconstruction of fine-grained details utilizing LIM, enabling\nsimultaneous representation of fine- and coarse-grained features at the same\nlevel. With the help of diverse and multi-scale features, PMAA consistently\noutperforms the previous state-of-the-art model CTGAN on two benchmark\ndatasets. Moreover, PMAA boasts considerable efficiency advantages, with only\n0.5% and 14.6% of the parameters and computational complexity of CTGAN,\nrespectively. These comprehensive results underscore PMAA's potential as a\nlightweight cloud removal network suitable for deployment on edge devices to\naccomplish large-scale cloud removal tasks. 
Our source code and pre-trained\nmodels are available at https://github.com/XavierJiezou/PMAA.\n","authors":["Xuechao Zou","Kai Li","Junliang Xing","Pin Tao","Yachao Cui"],"pdf_url":"https://arxiv.org/pdf/2303.16565v2.pdf","comment":"Accepted by ECAI 2023"},{"id":"http://arxiv.org/abs/2308.04356v1","updated":"2023-08-08T16:01:11Z","published":"2023-08-08T16:01:11Z","title":"Learning Unbiased Image Segmentation: A Case Study with Plain Knee\n Radiographs","summary":" Automatic segmentation of knee bony anatomy is essential in orthopedics, and\nit has been around for several years in both pre-operative and post-operative\nsettings. While deep learning algorithms have demonstrated exceptional\nperformance in medical image analysis, the assessment of fairness and potential\nbiases within these models remains limited. This study aims to revisit deep\nlearning-powered knee-bony anatomy segmentation using plain radiographs to\nuncover visible gender and racial biases. The current contribution offers the\npotential to advance our understanding of biases, and it provides practical\ninsights for researchers and practitioners in medical imaging. The proposed\nmitigation strategies mitigate gender and racial biases, ensuring fair and\nunbiased segmentation results. Furthermore, this work promotes equal access to\naccurate diagnoses and treatment outcomes for diverse patient populations,\nfostering equitable and inclusive healthcare provision.\n","authors":["Nickolas Littlefield","Johannes F. Plate","Kurt R. Weiss","Ines Lohse","Avani Chhabra","Ismaeel A. Siddiqui","Zoe Menezes","George Mastorakos","Sakshi Mehul Thakar","Mehrnaz Abedian","Matthew F. Gong","Luke A. Carlson","Hamidreza Moradi","Soheyla Amirian","Ahmad P. Tafti"],"pdf_url":"https://arxiv.org/pdf/2308.04356v1.pdf","comment":"This paper has been accepted by IEEE BHI 2023"},{"id":"http://arxiv.org/abs/2308.04352v1","updated":"2023-08-08T15:59:17Z","published":"2023-08-08T15:59:17Z","title":"3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment","summary":" 3D vision-language grounding (3D-VL) is an emerging field that aims to\nconnect the 3D physical world with natural language, which is crucial for\nachieving embodied intelligence. Current 3D-VL models rely heavily on\nsophisticated modules, auxiliary losses, and optimization tricks, which calls\nfor a simple and unified model. In this paper, we propose 3D-VisTA, a\npre-trained Transformer for 3D Vision and Text Alignment that can be easily\nadapted to various downstream tasks. 3D-VisTA simply utilizes self-attention\nlayers for both single-modal modeling and multi-modal fusion without any\nsophisticated task-specific design. To further enhance its performance on 3D-VL\ntasks, we construct ScanScribe, the first large-scale 3D scene-text pairs\ndataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185\nunique indoor scenes originating from ScanNet and 3R-Scan datasets, along with\npaired 278K scene descriptions generated from existing 3D-VL tasks, templates,\nand GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object\nmodeling and scene-text matching. It achieves state-of-the-art results on\nvarious 3D-VL tasks, ranging from visual grounding and dense captioning to\nquestion answering and situated reasoning. 
Moreover, 3D-VisTA demonstrates\nsuperior data efficiency, obtaining strong performance even with limited\nannotations during downstream task fine-tuning.\n","authors":["Ziyu Zhu","Xiaojian Ma","Yixin Chen","Zhidong Deng","Siyuan Huang","Qing Li"],"pdf_url":"https://arxiv.org/pdf/2308.04352v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.16181v3","updated":"2023-08-08T15:50:35Z","published":"2023-06-28T13:03:43Z","title":"Learning to Pan-sharpening with Memories of Spatial Details","summary":" Pan-sharpening, as one of the most commonly used techniques in remote sensing\nsystems, aims to inject spatial details from panchromatic images into\nmultispectral images (MS) to obtain high-resolution multispectral images. Since\ndeep learning has received widespread attention because of its powerful fitting\nability and efficient feature extraction, a variety of pan-sharpening methods\nhave been proposed to achieve remarkable performance. However, current\npan-sharpening methods usually require the paired panchromatic (PAN) and MS\nimages as input, which limits their usage in some scenarios. To address this\nissue, in this paper we observe that the spatial details from PAN images are\nmainly high-frequency cues, i.e., the edges reflect the contour of input PAN\nimages. This motivates us to develop a PAN-agnostic representation to store\nsome base edges, so as to compose the contour for the corresponding PAN image\nvia them. As a result, we can perform the pan-sharpening task with only the MS\nimage when inference. To this end, a memory-based network is adapted to extract\nand memorize the spatial details during the training phase and is used to\nreplace the process of obtaining spatial information from PAN images when\ninference, which is called Memory-based Spatial Details Network (MSDN).\nFinally, we integrate the proposed MSDN module into the existing deep\nlearning-based pan-sharpening methods to achieve an end-to-end pan-sharpening\nnetwork. With extensive experiments on the Gaofen1 and WorldView-4 satellites,\nwe verify that our method constructs good spatial details without PAN images\nand achieves the best performance. The code is available at\nhttps://github.com/Zhao-Tian-yi/Learning-to-Pan-sharpening-with-Memories-of-Spatial-Details.git.\n","authors":["Maoxun Yuan","Tianyi Zhao","Bo Li","Xingxing Wei"],"pdf_url":"https://arxiv.org/pdf/2306.16181v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.04517v2","updated":"2023-08-08T15:50:11Z","published":"2023-05-08T07:22:37Z","title":"DiffBFR: Bootstrapping Diffusion Model Towards Blind Face Restoration","summary":" Blind face restoration (BFR) is important while challenging. Prior works\nprefer to exploit GAN-based frameworks to tackle this task due to the balance\nof quality and efficiency. However, these methods suffer from poor stability\nand adaptability to long-tail distribution, failing to simultaneously retain\nsource identity and restore detail. We propose DiffBFR to introduce Diffusion\nProbabilistic Model (DPM) for BFR to tackle the above problem, given its\nsuperiority over GAN in aspects of avoiding training collapse and generating\nlong-tail distribution. DiffBFR utilizes a two-step design, that first restores\nidentity information from low-quality images and then enhances texture details\naccording to the distribution of real faces. This design is implemented with\ntwo key components: 1) Identity Restoration Module (IRM) for preserving the\nface details in results. 
Instead of denoising from pure Gaussian random\ndistribution with LQ images as the condition during the reverse process, we\npropose a novel truncated sampling method which starts from LQ images with part\nnoise added. We theoretically prove that this change shrinks the evidence lower\nbound of DPM and then restores more original details. With theoretical proof,\ntwo cascade conditional DPMs with different input sizes are introduced to\nstrengthen this sampling effect and reduce training difficulty in the\nhigh-resolution image generated directly. 2) Texture Enhancement Module (TEM)\nfor polishing the texture of the image. Here an unconditional DPM, a LQ-free\nmodel, is introduced to further force the restorations to appear realistic. We\ntheoretically proved that this unconditional DPM trained on pure HQ images\ncontributes to justifying the correct distribution of inference images output\nfrom IRM in pixel-level space. Truncated sampling with fractional time step is\nutilized to polish pixel-level textures while preserving identity information.\n","authors":["Xinmin Qiu","Congying Han","Zicheng Zhang","Bonan Li","Tiande Guo","Xuecheng Nie"],"pdf_url":"https://arxiv.org/pdf/2305.04517v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04343v1","updated":"2023-08-08T15:43:59Z","published":"2023-08-08T15:43:59Z","title":"Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval","summary":" Most existing cross-modal retrieval methods employ two-stream encoders with\ndifferent architectures for images and texts, \\textit{e.g.}, CNN for images and\nRNN/Transformer for texts. Such discrepancy in architectures may induce\ndifferent semantic distribution spaces and limit the interactions between\nimages and texts, and further result in inferior alignment between images and\ntexts. To fill this research gap, inspired by recent advances of Transformers\nin vision tasks, we propose to unify the encoder architectures with\nTransformers for both modalities. Specifically, we design a cross-modal\nretrieval framework purely based on two-stream Transformers, dubbed\n\\textbf{Hierarchical Alignment Transformers (HAT)}, which consists of an image\nTransformer, a text Transformer, and a hierarchical alignment module. With such\nidentical architectures, the encoders could produce representations with more\nsimilar characteristics for images and texts, and make the interactions and\nalignments between them much easier. Besides, to leverage the rich semantics,\nwe devise a hierarchical alignment scheme to explore multi-level\ncorrespondences of different layers between images and texts. To evaluate the\neffectiveness of the proposed HAT, we conduct extensive experiments on two\nbenchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that\nHAT outperforms SOTA baselines by a large margin. Specifically, on two key\ntasks, \\textit{i.e.}, image-to-text and text-to-image retrieval, HAT achieves\n7.6\\% and 16.7\\% relative score improvement of Recall@1 on MSCOCO, and 4.4\\%\nand 11.6\\% on Flickr30k respectively. 
The code is available at\n\\url{https://github.com/LuminosityX/HAT}.\n","authors":["Yi Bin","Haoxuan Li","Yahui Xu","Xing Xu","Yang Yang","Heng Tao Shen"],"pdf_url":"https://arxiv.org/pdf/2308.04343v1.pdf","comment":"Accepted at ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2308.04340v1","updated":"2023-08-08T15:36:57Z","published":"2023-08-08T15:36:57Z","title":"A Lightweight and Accurate Face Detection Algorithm Based on Retinaface","summary":" In this paper, we propose a lightweight and accurate face detection algorithm\nLAFD (Light and accurate face detection) based on Retinaface. Backbone network\nin the algorithm is a modified MobileNetV3 network which adjusts the size of\nthe convolution kernel, the channel expansion multiplier of the inverted\nresiduals block and the use of the SE attention mechanism. Deformable\nconvolution network(DCN) is introduced in the context module and the algorithm\nuses focal loss function instead of cross-entropy loss function as the\nclassification loss function of the model. The test results on the WIDERFACE\ndataset indicate that the average accuracy of LAFD is 94.1%, 92.2% and 82.1%\nfor the \"easy\", \"medium\" and \"hard\" validation subsets respectively with an\nimprovement of 3.4%, 4.0% and 8.3% compared to Retinaface and 3.1%, 4.1% and\n4.1% higher than the well-performing lightweight model, LFFD. If the input\nimage is pre-processed and scaled to 1560px in length or 1200px in width, the\nmodel achieves an average accuracy of 86.2% on the 'hard' validation subset.\nThe model is lightweight, with a size of only 10.2MB.\n","authors":["Baozhu Liu","Hewei Yu"],"pdf_url":"https://arxiv.org/pdf/2308.04340v1.pdf","comment":"14 pages, 5 figures, 7 tables"},{"id":"http://arxiv.org/abs/2308.04337v1","updated":"2023-08-08T15:30:08Z","published":"2023-08-08T15:30:08Z","title":"Pengembangan Model untuk Mendeteksi Kerusakan pada Terumbu Karang dengan\n Klasifikasi Citra","summary":" The abundant biodiversity of coral reefs in Indonesian waters is a valuable\nasset that needs to be preserved. Rapid climate change and uncontrolled human\nactivities have led to the degradation of coral reef ecosystems, including\ncoral bleaching, which is a critical indicator of coral health conditions.\nTherefore, this research aims to develop an accurate classification model to\ndistinguish between healthy corals and corals experiencing bleaching. This\nstudy utilizes a specialized dataset consisting of 923 images collected from\nFlickr using the Flickr API. The dataset comprises two distinct classes:\nhealthy corals (438 images) and bleached corals (485 images). These images have\nbeen resized to a maximum of 300 pixels in width or height, whichever is\nlarger, to maintain consistent sizes across the dataset.\n The method employed in this research involves the use of machine learning\nmodels, particularly convolutional neural networks (CNN), to recognize and\ndifferentiate visual patterns associated with healthy and bleached corals. In\nthis context, the dataset can be used to train and test various classification\nmodels to achieve optimal results. By leveraging the ResNet model, it was found\nthat a from-scratch ResNet model can outperform pretrained models in terms of\nprecision and accuracy. The success in developing accurate classification\nmodels will greatly benefit researchers and marine biologists in gaining a\nbetter understanding of coral reef health. 
These models can also be employed to\nmonitor changes in the coral reef environment, thereby making a significant\ncontribution to conservation and ecosystem restoration efforts that have\nfar-reaching impacts on life.\n","authors":["Fadhil Muhammad","Alif Bintang Elfandra","Iqbal Pahlevi Amin","Alfan Farizki Wicaksono"],"pdf_url":"https://arxiv.org/pdf/2308.04337v1.pdf","comment":"in Indonesian language"},{"id":"http://arxiv.org/abs/2305.12522v2","updated":"2023-08-08T15:22:26Z","published":"2023-05-21T17:46:28Z","title":"P-NOC: Adversarial CAM Generation for Weakly Supervised Semantic\n Segmentation","summary":" To mitigate the necessity for large amounts of supervised segmentation\nannotation sets, multiple Weakly Supervised Semantic Segmentation (WSSS)\nstrategies have been devised. These will often rely on advanced data and model\nregularization strategies to instigate the development of useful properties\n(e.g., prediction completeness and fidelity to semantic boundaries) in\nsegmentation priors, notwithstanding the lack of annotated information. In this\nwork, we first create a strong baseline by analyzing complementary WSSS\ntechniques and regularizing strategies, considering their strengths and\nlimitations. We then propose a new Class-specific Adversarial Erasing strategy,\ncomprising two adversarial CAM generating networks being gradually refined to\nproduce robust semantic segmentation proposals. Empirical results suggest that\nour approach induces substantial improvement in the effectiveness of the\nbaseline, resulting in a noticeable improvement over both Pascal VOC 2012 and\nMS COCO 2014 datasets.\n","authors":["Lucas David","Helio Pedrini","Zanoni Dias"],"pdf_url":"https://arxiv.org/pdf/2305.12522v2.pdf","comment":"19 pages, 10 figures"},{"id":"http://arxiv.org/abs/2308.04322v1","updated":"2023-08-08T15:15:51Z","published":"2023-08-08T15:15:51Z","title":"Domain Adaptive Person Search via GAN-based Scene Synthesis for\n Cross-scene Videos","summary":" Person search has recently been a challenging task in the computer vision\ndomain, which aims to search specific pedestrians from real\ncameras.Nevertheless, most surveillance videos comprise only a handful of\nimages of each pedestrian, which often feature identical backgrounds and\nclothing. Hence, it is difficult to learn more discriminative features for\nperson search in real scenes. To tackle this challenge, we draw on Generative\nAdversarial Networks (GAN) to synthesize data from surveillance videos. GAN has\nthrived in computer vision problems because it produces high-quality images\nefficiently. We merely alter the popular Fast R-CNN model, which is capable of\nprocessing videos and yielding accurate detection outcomes. In order to\nappropriately relieve the pressure brought by the two-stage model, we design an\nAssisted-Identity Query Module (AIDQ) to provide positive images for the behind\npart. Besides, the proposed novel GAN-based Scene Synthesis model that can\nsynthesize high-quality cross-id person images for person search tasks. In\norder to facilitate the feature learning of the GAN-based Scene Synthesis\nmodel, we adopt an online learning strategy that collaboratively learns the\nsynthesized images and original images. 
Extensive experiments on two widely\nused person search benchmarks, CUHK-SYSU and PRW, have shown that our method\nhas achieved great performance, and the extensive ablation study further\njustifies our GAN-synthetic data can effectively increase the variability of\nthe datasets and be more realistic.\n","authors":["Huibing Wang","Tianxiang Cui","Mingze Yao","Huijuan Pang","Yushan Du"],"pdf_url":"https://arxiv.org/pdf/2308.04322v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04321v1","updated":"2023-08-08T15:14:23Z","published":"2023-08-08T15:14:23Z","title":"All-pairs Consistency Learning for Weakly Supervised Semantic\n Segmentation","summary":" In this work, we propose a new transformer-based regularization to better\nlocalize objects for Weakly supervised semantic segmentation (WSSS). In\nimage-level WSSS, Class Activation Map (CAM) is adopted to generate object\nlocalization as pseudo segmentation labels. To address the partial activation\nissue of the CAMs, consistency regularization is employed to maintain\nactivation intensity invariance across various image augmentations. However,\nsuch methods ignore pair-wise relations among regions within each CAM, which\ncapture context and should also be invariant across image views. To this end,\nwe propose a new all-pairs consistency regularization (ACR). Given a pair of\naugmented views, our approach regularizes the activation intensities between a\npair of augmented views, while also ensuring that the affinity across regions\nwithin each view remains consistent. We adopt vision transformers as the\nself-attention mechanism naturally embeds pair-wise affinity. This enables us\nto simply regularize the distance between the attention matrices of augmented\nimage pairs. Additionally, we introduce a novel class-wise localization method\nthat leverages the gradients of the class token. Our method can be seamlessly\nintegrated into existing WSSS methods using transformers without modifying the\narchitectures. We evaluate our method on PASCAL VOC and MS COCO datasets. Our\nmethod produces noticeably better class localization maps (67.3% mIoU on PASCAL\nVOC train), resulting in superior WSSS performances.\n","authors":["Weixuan Sun","Yanhao Zhang","Zhen Qin","Zheyuan Liu","Lin Cheng","Fanyi Wang","Yiran Zhong","Nick Barnes"],"pdf_url":"https://arxiv.org/pdf/2308.04321v1.pdf","comment":"ICCV 2023 workshop"},{"id":"http://arxiv.org/abs/2307.07873v3","updated":"2023-08-08T15:13:22Z","published":"2023-07-15T19:20:49Z","title":"Why Does Little Robustness Help? Understanding Adversarial\n Transferability From Surrogate Training","summary":" Adversarial examples (AEs) for DNNs have been shown to be transferable: AEs\nthat successfully fool white-box surrogate models can also deceive other\nblack-box models with different architectures. Although a bunch of empirical\nstudies have provided guidance on generating highly transferable AEs, many of\nthese findings lack explanations and even lead to inconsistent advice. In this\npaper, we take a further step towards understanding adversarial\ntransferability, with a particular focus on surrogate aspects. Starting from\nthe intriguing little robustness phenomenon, where models adversarially trained\nwith mildly perturbed adversarial samples can serve as better surrogates, we\nattribute it to a trade-off between two predominant factors: model smoothness\nand gradient similarity. Our investigations focus on their joint effects,\nrather than their separate correlations with transferability. 
Through a series\nof theoretical and empirical analyses, we conjecture that the data distribution\nshift in adversarial training explains the degradation of gradient similarity.\nBuilding on these insights, we explore the impacts of data augmentation and\ngradient regularization on transferability and identify that the trade-off\ngenerally exists in the various training mechanisms, thus building a\ncomprehensive blueprint for the regulation mechanism behind transferability.\nFinally, we provide a general route for constructing better surrogates to boost\ntransferability which optimizes both model smoothness and gradient similarity\nsimultaneously, e.g., the combination of input gradient regularization and\nsharpness-aware minimization (SAM), validated by extensive experiments. In\nsummary, we call for attention to the united impacts of these two factors for\nlaunching effective transfer attacks, rather than optimizing one while ignoring\nthe other, and emphasize the crucial role of manipulating surrogate models.\n","authors":["Yechao Zhang","Shengshan Hu","Leo Yu Zhang","Junyu Shi","Minghui Li","Xiaogeng Liu","Wei Wan","Hai Jin"],"pdf_url":"https://arxiv.org/pdf/2307.07873v3.pdf","comment":"Accepted by IEEE Symposium on Security and Privacy (Oakland) 2024; 21\n pages, 11 figures, 13 tables"},{"id":"http://arxiv.org/abs/2308.02781v2","updated":"2023-08-08T14:54:36Z","published":"2023-08-05T03:21:12Z","title":"A Voting-Stacking Ensemble of Inception Networks for Cervical Cytology\n Classification","summary":" Cervical cancer is one of the most severe diseases threatening women's\nhealth. Early detection and diagnosis can significantly reduce cancer risk, in\nwhich cervical cytology classification is indispensable. Researchers have\nrecently designed many networks for automated cervical cancer diagnosis, but\nthe limited accuracy and bulky size of these individual models cannot meet\npractical application needs. To address this issue, we propose a\nVoting-Stacking ensemble strategy, which employs three Inception networks as\nbase learners and integrates their outputs through a voting ensemble. The\nsamples misclassified by the ensemble model generate a new training set on\nwhich a linear classification model is trained as the meta-learner and performs\nthe final predictions. In addition, a multi-level Stacking ensemble framework\nis designed to improve performance further. The method is evaluated on the\nSIPakMed, Herlev, and Mendeley datasets, achieving accuracies of 100%, 100%,\nand 100%, respectively. The experimental results outperform the current\nstate-of-the-art (SOTA) methods, demonstrating its potential for reducing\nscreening workload and helping pathologists detect cervical cancer.\n","authors":["Linyi Qian","Qian Huang","Yulin Chen","Junzhou Chen"],"pdf_url":"https://arxiv.org/pdf/2308.02781v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12344v2","updated":"2023-08-08T14:52:39Z","published":"2023-07-23T14:43:17Z","title":"Right for the Wrong Reason: Can Interpretable ML Techniques Detect\n Spurious Correlations?","summary":" While deep neural network models offer unmatched classification performance,\nthey are prone to learning spurious correlations in the data. Such dependencies\non confounding information can be difficult to detect using performance metrics\nif the test data comes from the same distribution as the training data.\nInterpretable ML methods such as post-hoc explanations or inherently\ninterpretable classifiers promise to identify faulty model reasoning. 
However,\nthere is mixed evidence whether many of these techniques are actually able to\ndo so. In this paper, we propose a rigorous evaluation strategy to assess an\nexplanation technique's ability to correctly identify spurious correlations.\nUsing this strategy, we evaluate five post-hoc explanation techniques and one\ninherently interpretable method for their ability to detect three types of\nartificially added confounders in a chest x-ray diagnosis task. We find that\nthe post-hoc technique SHAP, as well as the inherently interpretable Attri-Net\nprovide the best performance and can be used to reliably identify faulty model\nbehavior.\n","authors":["Susu Sun","Lisa M. Koch","Christian F. Baumgartner"],"pdf_url":"https://arxiv.org/pdf/2307.12344v2.pdf","comment":"Accepted to MICCAI 2023"},{"id":"http://arxiv.org/abs/2303.00500v2","updated":"2023-08-08T14:50:50Z","published":"2023-03-01T13:32:55Z","title":"Inherently Interpretable Multi-Label Classification Using Class-Specific\n Counterfactuals","summary":" Interpretability is essential for machine learning algorithms in high-stakes\napplication fields such as medical image analysis. However, high-performing\nblack-box neural networks do not provide explanations for their predictions,\nwhich can lead to mistrust and suboptimal human-ML collaboration. Post-hoc\nexplanation techniques, which are widely used in practice, have been shown to\nsuffer from severe conceptual problems. Furthermore, as we show in this paper,\ncurrent explanation techniques do not perform adequately in the multi-label\nscenario, in which multiple medical findings may co-occur in a single image. We\npropose Attri-Net, an inherently interpretable model for multi-label\nclassification. Attri-Net is a powerful classifier that provides transparent,\ntrustworthy, and human-understandable explanations. The model first generates\nclass-specific attribution maps based on counterfactuals to identify which\nimage regions correspond to certain medical findings. Then a simple logistic\nregression classifier is used to make predictions based solely on these\nattribution maps. We compare Attri-Net to five post-hoc explanation techniques\nand one inherently interpretable classifier on three chest X-ray datasets. We\nfind that Attri-Net produces high-quality multi-label explanations consistent\nwith clinical knowledge and has comparable classification performance to\nstate-of-the-art classification models.\n","authors":["Susu Sun","Stefano Woerner","Andreas Maier","Lisa M. Koch","Christian F. Baumgartner"],"pdf_url":"https://arxiv.org/pdf/2303.00500v2.pdf","comment":"Accepted to MIDL 2023"},{"id":"http://arxiv.org/abs/2308.04303v1","updated":"2023-08-08T14:49:44Z","published":"2023-08-08T14:49:44Z","title":"Vehicle Motion Forecasting using Prior Information and Semantic-assisted\n Occupancy Grid Maps","summary":" Motion prediction is a challenging task for autonomous vehicles due to\nuncertainty in the sensor data, the non-deterministic nature of future, and\ncomplex behavior of agents. In this paper, we tackle this problem by\nrepresenting the scene as dynamic occupancy grid maps (DOGMs), associating\nsemantic labels to the occupied cells and incorporating map information. We\npropose a novel framework that combines deep-learning-based spatio-temporal and\nprobabilistic approaches to predict vehicle behaviors.Contrary to the\nconventional OGM prediction methods, evaluation of our work is conducted\nagainst the ground truth annotations. 
We experiment and validate our results on\nreal-world NuScenes dataset and show that our model shows superior ability to\npredict both static and dynamic vehicles compared to OGM predictions.\nFurthermore, we perform an ablation study and assess the role of semantic\nlabels and map in the architecture.\n","authors":["Rabbia Asghar","Manuel Diaz-Zapata","Lukas Rummelhard","Anne Spalanzani","Christian Laugier"],"pdf_url":"https://arxiv.org/pdf/2308.04303v1.pdf","comment":"Accepted to the 2023 IEEE/RSJ International Conference on Intelligent\n Robots and Systems (IROS 2023)"},{"id":"http://arxiv.org/abs/2308.04288v1","updated":"2023-08-08T14:32:38Z","published":"2023-08-08T14:32:38Z","title":"Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual\n Try-On","summary":" Fabricating and designing 3D garments has become extremely demanding with the\nincreasing need for synthesizing realistic dressed persons for a variety of\napplications, e.g. 3D virtual try-on, digitalization of 2D clothes into 3D\napparel, and cloth animation. It thus necessitates a simple and straightforward\npipeline to obtain high-quality texture from simple input, such as 2D reference\nimages. Since traditional warping-based texture generation methods require a\nsignificant number of control points to be manually selected for each type of\ngarment, which can be a time-consuming and tedious process. We propose a novel\nmethod, called Cloth2Tex, which eliminates the human burden in this process.\nCloth2Tex is a self-supervised method that generates texture maps with\nreasonable layout and structural consistency. Another key feature of Cloth2Tex\nis that it can be used to support high-fidelity texture inpainting. This is\ndone by combining Cloth2Tex with a prevailing latent diffusion model. We\nevaluate our approach both qualitatively and quantitatively and demonstrate\nthat Cloth2Tex can generate high-quality texture maps and achieve the best\nvisual effects in comparison to other methods. Project page:\ntomguluson92.github.io/projects/cloth2tex/\n","authors":["Daiheng Gao","Xu Chen","Xindi Zhang","Qi Wang","Ke Sun","Bang Zhang","Liefeng Bo","Qixing Huang"],"pdf_url":"https://arxiv.org/pdf/2308.04288v1.pdf","comment":"15 pages, 15 figures"},{"id":"http://arxiv.org/abs/2212.04780v3","updated":"2023-08-08T14:30:05Z","published":"2022-12-09T11:18:40Z","title":"Genie: Show Me the Data for Quantization","summary":" Zero-shot quantization is a promising approach for developing lightweight\ndeep neural networks when data is inaccessible owing to various reasons,\nincluding cost and issues related to privacy. By exploiting the learned\nparameters ($\\mu$ and $\\sigma$) of batch normalization layers in an\nFP32-pre-trained model, zero-shot quantization schemes focus on generating\nsynthetic data. Subsequently, they distill knowledge from the pre-trained model\n(teacher) to the quantized model (student) such that the quantized model can be\noptimized with the synthetic dataset. However, thus far, zero-shot quantization\nhas primarily been discussed in the context of quantization-aware training\nmethods, which require task-specific losses and long-term optimization as much\nas retraining. We thus introduce a post-training quantization scheme for\nzero-shot quantization that produces high-quality quantized networks within a\nfew hours. Furthermore, we propose a framework called Genie~that generates data\nsuited for quantization. 
With the data synthesized by Genie, we can produce\nrobust quantized models without real datasets, which is comparable to few-shot\nquantization. We also propose a post-training quantization algorithm to enhance\nthe performance of quantized models. By combining them, we can bridge the gap\nbetween zero-shot and few-shot quantization while significantly improving the\nquantization performance compared to that of existing approaches. In other\nwords, we can obtain a unique state-of-the-art zero-shot quantization approach.\nThe code is available at \\url{https://github.com/SamsungLabs/Genie}.\n","authors":["Yongkweon Jeon","Chungman Lee","Ho-young Kim"],"pdf_url":"https://arxiv.org/pdf/2212.04780v3.pdf","comment":"Accepted by CVPR 2023, https://github.com/SamsungLabs/Genie"},{"id":"http://arxiv.org/abs/2308.04283v1","updated":"2023-08-08T14:25:13Z","published":"2023-08-08T14:25:13Z","title":"Vision-Based Autonomous Navigation for Unmanned Surface Vessel in\n Extreme Marine Conditions","summary":" Visual perception is an important component for autonomous navigation of\nunmanned surface vessels (USV), particularly for the tasks related to\nautonomous inspection and tracking. These tasks involve vision-based navigation\ntechniques to identify the target for navigation. Reduced visibility under\nextreme weather conditions in marine environments makes it difficult for\nvision-based approaches to work properly. To overcome these issues, this paper\npresents an autonomous vision-based navigation framework for tracking target\nobjects in extreme marine conditions. The proposed framework consists of an\nintegrated perception pipeline that uses a generative adversarial network (GAN)\nto remove noise and highlight the object features before passing them to the\nobject detector (i.e., YOLOv5). The detected visual features are then used by\nthe USV to track the target. The proposed framework has been thoroughly tested\nin simulation under extremely reduced visibility due to sandstorms and fog. The\nresults are compared with state-of-the-art de-hazing methods across the\nbenchmarked MBZIRC simulation dataset, on which the proposed scheme has\noutperformed the existing methods across various metrics.\n","authors":["Muhayyuddin Ahmed","Ahsan Baidar Bakht","Taimur Hassan","Waseem Akram","Ahmed Humais","Lakmal Seneviratne","Shaoming He","Defu Lin","Irfan Hussain"],"pdf_url":"https://arxiv.org/pdf/2308.04283v1.pdf","comment":"IEEE/RSJ International Conference on Intelligent Robots (IROS-2023)"},{"id":"http://arxiv.org/abs/2308.04269v1","updated":"2023-08-08T14:10:16Z","published":"2023-08-08T14:10:16Z","title":"Lossy and Lossless (L$^2$) Post-training Model Size Compression","summary":" Deep neural networks have delivered remarkable performance and have been\nwidely used in various visual tasks. However, their huge size causes\nsignificant inconvenience for transmission and storage. Many previous studies\nhave explored model size compression. However, these studies often approach\nvarious lossy and lossless compression methods in isolation, leading to\nchallenges in achieving high compression ratios efficiently. This work proposes\na post-training model size compression method that combines lossy and lossless\ncompression in a unified way. We first propose a unified parametric weight\ntransformation, which ensures different lossy compression methods can be\nperformed jointly in a post-training manner. 
Then, a dedicated differentiable\ncounter is introduced to guide the optimization of lossy compression to arrive\nat a more suitable point for later lossless compression. Additionally, our\nmethod can easily control a desired global compression ratio and allocate\nadaptive ratios for different layers. Finally, our method can achieve a stable\n$10\\times$ compression ratio without sacrificing accuracy and a $20\\times$\ncompression ratio with minor accuracy loss in a short time. Our code is\navailable at https://github.com/ModelTC/L2_Compression .\n","authors":["Yumeng Shi","Shihao Bai","Xiuying Wei","Ruihao Gong","Jianlei Yang"],"pdf_url":"https://arxiv.org/pdf/2308.04269v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04262v1","updated":"2023-08-08T13:59:16Z","published":"2023-08-08T13:59:16Z","title":"SDLFormer: A Sparse and Dense Locality-enhanced Transformer for\n Accelerated MR Image Reconstruction","summary":" Transformers have emerged as viable alternatives to convolutional neural\nnetworks owing to their ability to learn non-local region relationships in the\nspatial domain. The self-attention mechanism of the transformer enables\ntransformers to capture long-range dependencies in the images, which might be\ndesirable for accelerated MRI image reconstruction as the effect of\nundersampling is non-local in the image domain. Despite its computational\nefficiency, the window-based transformers suffer from restricted receptive\nfields as the dependencies are limited to within the scope of the image\nwindows. We propose a window-based transformer network that integrates dilated\nattention mechanism and convolution for accelerated MRI image reconstruction.\nThe proposed network consists of dilated and dense neighborhood attention\ntransformers to enhance the distant neighborhood pixel relationship and\nintroduce depth-wise convolutions within the transformer module to learn\nlow-level translation invariant features for accelerated MRI image\nreconstruction. The proposed model is trained in a self-supervised manner. We\nperform extensive experiments for multi-coil MRI acceleration for coronal PD,\ncoronal PDFS and axial T2 contrasts with 4x and 5x under-sampling in\nself-supervised learning based on k-space splitting. We compare our method\nagainst other reconstruction architectures and the parallel domain\nself-supervised learning baseline. Results show that the proposed model\nexhibits improvement margins of (i) around 1.40 dB in PSNR and around 0.028 in\nSSIM on average over other architectures (ii) around 1.44 dB in PSNR and around\n0.029 in SSIM over parallel domain self-supervised learning. The code is\navailable at https://github.com/rahul-gs-16/sdlformer.git\n","authors":["Rahul G. S.","Sriprabha Ramnarayanan","Mohammad Al Fahim","Keerthi Ram","Preejith S. P","Mohanasankar Sivaprakasam"],"pdf_url":"https://arxiv.org/pdf/2308.04262v1.pdf","comment":"Accepted at MICCAI workshop MILLanD 2023 Medical Image Learning with\n noisy and Limited Data"},{"id":"http://arxiv.org/abs/2307.11661v2","updated":"2023-08-08T13:44:12Z","published":"2023-07-21T15:49:59Z","title":"Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts","summary":" Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have\nrevolutionized visual representation learning by providing good performance on\ndownstream datasets. VLMs are 0-shot adapted to a downstream dataset by\ndesigning prompts that are relevant to the dataset. 
Such prompt engineering\nmakes use of domain expertise and a validation dataset. Meanwhile, recent\ndevelopments in generative pretrained models like GPT-4 mean they can be used\nas advanced internet search tools. They can also be manipulated to provide\nvisual information in any structure. In this work, we show that GPT-4 can be\nused to generate text that is visually descriptive and how this can be used to\nadapt CLIP to downstream tasks. We show considerable improvements in 0-shot\ntransfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD\n(~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt.\nWe also design a simple few-shot adapter that learns to choose the best\npossible sentences to construct generalizable classifiers that outperform the\nrecently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized\nfine-grained datasets. The code, prompts, and auxiliary text dataset is\navailable at https://github.com/mayug/VDT-Adapter.\n","authors":["Mayug Maniparambil","Chris Vorster","Derek Molloy","Noel Murphy","Kevin McGuinness","Noel E. O'Connor"],"pdf_url":"https://arxiv.org/pdf/2307.11661v2.pdf","comment":"Paper accepted at ICCV-W 2023. V2 contains additional comparisons\n with concurrent works"},{"id":"http://arxiv.org/abs/2308.04252v1","updated":"2023-08-08T13:38:50Z","published":"2023-08-08T13:38:50Z","title":"Blur aware metric depth estimation with multi-focus plenoptic cameras","summary":" While a traditional camera only captures one point of view of a scene, a\nplenoptic or light-field camera, is able to capture spatial and angular\ninformation in a single snapshot, enabling depth estimation from a single\nacquisition. In this paper, we present a new metric depth estimation algorithm\nusing only raw images from a multi-focus plenoptic camera. The proposed\napproach is especially suited for the multi-focus configuration where several\nmicro-lenses with different focal lengths are used. The main goal of our blur\naware depth estimation (BLADE) approach is to improve disparity estimation for\ndefocus stereo images by integrating both correspondence and defocus cues. We\nthus leverage blur information where it was previously considered a drawback.\nWe explicitly derive an inverse projection model including the defocus blur\nproviding depth estimates up to a scale factor. A method to calibrate the\ninverse model is then proposed. We thus take into account depth scaling to\nachieve precise and accurate metric depth estimates. Our results show that\nintroducing defocus cues improves the depth estimation. We demonstrate the\neffectiveness of our framework and depth scaling calibration on relative depth\nestimation setups and on real-world 3D complex scenes with ground truth\nacquired with a 3D lidar scanner.\n","authors":["Mathieu Labussière","Céline Teulière","Omar Ait-Aider"],"pdf_url":"https://arxiv.org/pdf/2308.04252v1.pdf","comment":"21 pages, 12 Figures, 3 Tables"},{"id":"http://arxiv.org/abs/2308.04249v1","updated":"2023-08-08T13:28:34Z","published":"2023-08-08T13:28:34Z","title":"MindDiffuser: Controlled Image Reconstruction from Human Brain Activity\n with Semantic and Structural Diffusion","summary":" Reconstructing visual stimuli from brain recordings has been a meaningful and\nchallenging task. Especially, the achievement of precise and controllable image\nreconstruction bears great significance in propelling the progress and\nutilization of brain-computer interfaces. 
Despite the advancements in complex\nimage reconstruction techniques, the challenge persists in achieving a cohesive\nalignment of both semantic (concepts and objects) and structure (position,\norientation, and size) with the image stimuli. To address the aforementioned\nissue, we propose a two-stage image reconstruction model called MindDiffuser.\nIn Stage 1, the VQ-VAE latent representations and the CLIP text embeddings\ndecoded from fMRI are put into Stable Diffusion, which yields a preliminary\nimage that contains semantic information. In Stage 2, we utilize the CLIP\nvisual feature decoded from fMRI as supervisory information, and continually\nadjust the two feature vectors decoded in Stage 1 through backpropagation to\nalign the structural information. The results of both qualitative and\nquantitative analyses demonstrate that our model has surpassed the current\nstate-of-the-art models on Natural Scenes Dataset (NSD). The subsequent\nexperimental findings corroborate the neurobiological plausibility of the\nmodel, as evidenced by the interpretability of the multimodal feature employed,\nwhich align with the corresponding brain responses.\n","authors":["Yizhuo Lu","Changde Du","Qiongyi zhou","Dianpeng Wang","Huiguang He"],"pdf_url":"https://arxiv.org/pdf/2308.04249v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2303.14139"},{"id":"http://arxiv.org/abs/2308.04243v1","updated":"2023-08-08T13:17:20Z","published":"2023-08-08T13:17:20Z","title":"AICSD: Adaptive Inter-Class Similarity Distillation for Semantic\n Segmentation","summary":" In recent years, deep neural networks have achieved remarkable accuracy in\ncomputer vision tasks. With inference time being a crucial factor, particularly\nin dense prediction tasks such as semantic segmentation, knowledge distillation\nhas emerged as a successful technique for improving the accuracy of lightweight\nstudent networks. The existing methods often neglect the information in\nchannels and among different classes. To overcome these limitations, this paper\nproposes a novel method called Inter-Class Similarity Distillation (ICSD) for\nthe purpose of knowledge distillation. The proposed method transfers high-order\nrelations from the teacher network to the student network by independently\ncomputing intra-class distributions for each class from network outputs. This\nis followed by calculating inter-class similarity matrices for distillation\nusing KL divergence between distributions of each pair of classes. To further\nimprove the effectiveness of the proposed method, an Adaptive Loss Weighting\n(ALW) training strategy is proposed. Unlike existing methods, the ALW strategy\ngradually reduces the influence of the teacher network towards the end of\ntraining process to account for errors in teacher's predictions. Extensive\nexperiments conducted on two well-known datasets for semantic segmentation,\nCityscapes and Pascal VOC 2012, validate the effectiveness of the proposed\nmethod in terms of mIoU and pixel accuracy. The proposed method outperforms\nmost of existing knowledge distillation methods as demonstrated by both\nquantitative and qualitative evaluations. Code is available at:\nhttps://github.com/AmirMansurian/AICSD\n","authors":["Amir M. 
Mansourian","Rozhan Ahmadi","Shohreh Kasaei"],"pdf_url":"https://arxiv.org/pdf/2308.04243v1.pdf","comment":"10 pages, 5 figures, 5 tables"},{"id":"http://arxiv.org/abs/2307.09724v3","updated":"2023-08-08T13:14:26Z","published":"2023-07-19T02:26:20Z","title":"AesPA-Net: Aesthetic Pattern-Aware Style Transfer Networks","summary":" To deliver the artistic expression of the target style, recent studies\nexploit the attention mechanism owing to its ability to map the local patches\nof the style image to the corresponding patches of the content image. However,\nbecause of the low semantic correspondence between arbitrary content and\nartworks, the attention module repeatedly abuses specific local patches from\nthe style image, resulting in disharmonious and evident repetitive artifacts.\nTo overcome this limitation and accomplish impeccable artistic style transfer,\nwe focus on enhancing the attention mechanism and capturing the rhythm of\npatterns that organize the style. In this paper, we introduce a novel metric,\nnamely pattern repeatability, that quantifies the repetition of patterns in the\nstyle image. Based on the pattern repeatability, we propose Aesthetic\nPattern-Aware style transfer Networks (AesPA-Net) that discover the sweet spot\nof local and global style expressions. In addition, we propose a novel\nself-supervisory task to encourage the attention mechanism to learn precise and\nmeaningful semantic correspondence. Lastly, we introduce the patch-wise style\nloss to transfer the elaborate rhythm of local patterns. Through qualitative\nand quantitative evaluations, we verify the reliability of the proposed pattern\nrepeatability that aligns with human perception, and demonstrate the\nsuperiority of the proposed framework.\n","authors":["Kibeom Hong","Seogkyu Jeon","Junsoo Lee","Namhyuk Ahn","Kunhee Kim","Pilhyeon Lee","Daesik Kim","Youngjung Uh","Hyeran Byun"],"pdf_url":"https://arxiv.org/pdf/2307.09724v3.pdf","comment":"Accepted by ICCV 2023. Code is available at this\n https://github.com/Kibeom-Hong/AesPA-Net"},{"id":"http://arxiv.org/abs/2304.08134v3","updated":"2023-08-08T12:57:36Z","published":"2023-04-17T10:29:26Z","title":"Tackling Face Verification Edge Cases: In-Depth Analysis and\n Human-Machine Fusion Approach","summary":" Nowadays, face recognition systems surpass human performance on several\ndatasets. However, there are still edge cases that the machine can't correctly\nclassify. This paper investigates the effect of a combination of machine and\nhuman operators in the face verification task. First, we look closer at the\nedge cases for several state-of-the-art models to discover common datasets'\nchallenging settings. Then, we conduct a study with 60 participants on these\nselected tasks with humans and provide an extensive analysis. Finally, we\ndemonstrate that combining machine and human decisions can further improve the\nperformance of state-of-the-art face verification systems on various benchmark\ndatasets. Code and data are publicly available on GitHub.\n","authors":["Martin Knoche","Gerhard Rigoll"],"pdf_url":"https://arxiv.org/pdf/2304.08134v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04232v1","updated":"2023-08-08T12:54:05Z","published":"2023-08-08T12:54:05Z","title":"A Comparative Study of Image-to-Image Translation Using GANs for\n Synthetic Child Race Data","summary":" The lack of ethnic diversity in data has been a limiting factor of face\nrecognition techniques in the literature. 
This is particularly the case for\nchildren where data samples are scarce and presents a challenge when seeking to\nadapt machine vision algorithms that are trained on adult data to work on\nchildren. This work proposes the utilization of image-to-image transformation\nto synthesize data of different races and thus adjust the ethnicity of\nchildren's face data. We consider ethnicity as a style and compare three\ndifferent Image-to-Image neural network based methods, specifically pix2pix,\nCycleGAN, and CUT networks to implement Caucasian child data and Asian child\ndata conversion. Experimental validation results on synthetic data demonstrate\nthe feasibility of using image-to-image transformation methods to generate\nvarious synthetic child data samples with broader ethnic diversity.\n","authors":["Wang Yao","Muhammad Ali Farooq","Joseph Lemley","Peter Corcoran"],"pdf_url":"https://arxiv.org/pdf/2308.04232v1.pdf","comment":"The Paper is accepted in 25th Irish Machine Vision and Image\n Processing Conference (IMVIP23)"},{"id":"http://arxiv.org/abs/2308.04224v1","updated":"2023-08-08T12:43:26Z","published":"2023-08-08T12:43:26Z","title":"Will your Doorbell Camera still recognize you as you grow old","summary":" Robust authentication for low-power consumer devices such as doorbell cameras\nposes a valuable and unique challenge. This work explores the effect of age and\naging on the performance of facial authentication methods. Two public age\ndatasets, AgeDB and Morph-II have been used as baselines in this work. A\nphoto-realistic age transformation method has been employed to augment a set of\nhigh-quality facial images with various age effects. Then the effect of these\nsynthetic aging data on the high-performance deep-learning-based face\nrecognition model is quantified by using various metrics including Receiver\nOperating Characteristic (ROC) curves and match score distributions.\nExperimental results demonstrate that long-term age effects are still a\nsignificant challenge for the state-of-the-art facial authentication method.\n","authors":["Wang Yao","Muhammad Ali Farooq","Joseph Lemley","Peter Corcoran"],"pdf_url":"https://arxiv.org/pdf/2308.04224v1.pdf","comment":"The Paper is accepted in 25th Irish Machine Vision and Image\n Processing Conference (IMVIP23)"},{"id":"http://arxiv.org/abs/2308.04218v1","updated":"2023-08-08T12:30:36Z","published":"2023-08-08T12:30:36Z","title":"AquaSAM: Underwater Image Foreground Segmentation","summary":" The Segment Anything Model (SAM) has revolutionized natural image\nsegmentation, nevertheless, its performance on underwater images is still\nrestricted. This work presents AquaSAM, the first attempt to extend the success\nof SAM on underwater images with the purpose of creating a versatile method for\nthe segmentation of various underwater targets. To achieve this, we begin by\nclassifying and extracting various labels automatically in SUIM dataset.\nSubsequently, we develop a straightforward fine-tuning method to adapt SAM to\ngeneral foreground underwater image segmentation. Through extensive experiments\ninvolving eight segmentation tasks like human divers, we demonstrate that\nAquaSAM outperforms the default SAM model especially at hard tasks like coral\nreefs. 
AquaSAM achieves an average improvement of 7.13% in Dice Similarity\nCoefficient (DSC) and an average improvement of 8.27% in mIoU on underwater\nsegmentation tasks.\n","authors":["Muduo Xu","Jianhao Su","Yutao Liu"],"pdf_url":"https://arxiv.org/pdf/2308.04218v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04207v1","updated":"2023-08-08T12:17:02Z","published":"2023-08-08T12:17:02Z","title":"Robust retrieval of material chemical states in X-ray microspectroscopy","summary":" X-ray microspectroscopic techniques are essential for studying morphological\nand chemical changes in materials, providing high-resolution structural and\nspectroscopic information. However, the practical data analysis for reliably\nretrieving the chemical states remains a major obstacle to accelerating the\nfundamental understanding of materials in many research fields. In this work,\nwe propose a novel data formulation model for X-ray microspectroscopy and\ndevelop a dedicated unmixing framework to solve this problem, which is robust\nto noise and spectral variability. Moreover, this framework is not limited to\nthe analysis of two-state material chemistry, making it an effective\nalternative to conventional and widely-used methods. In addition, an\nalternative directional multiplier method with provable convergence is applied\nto obtain the solution efficiently. Our framework can accurately identify and\ncharacterize chemical states in complex and heterogeneous samples, even under\nchallenging conditions such as low signal-to-noise ratios and overlapping\nspectral features. Extensive experimental results on simulated and real\ndatasets demonstrate its effectiveness and reliability.\n","authors":["Ting Wang","Xiaotong Wu","Jizhou Li","Chao Wang"],"pdf_url":"https://arxiv.org/pdf/2308.04207v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2308.04206v1","updated":"2023-08-08T12:12:30Z","published":"2023-08-08T12:12:30Z","title":"Exploring Transformers for Open-world Instance Segmentation","summary":" Open-world instance segmentation is a rising task, which aims to segment all\nobjects in the image by learning from a limited number of base-category\nobjects. This task is challenging, as the number of unseen categories could be\nhundreds of times larger than that of seen categories. Recently, the DETR-like\nmodels have been extensively studied in the closed world while remaining\nunexplored in the open world. In this paper, we utilize the Transformer for\nopen-world instance segmentation and present SWORD. Firstly, we introduce\nattaching the stop-gradient operation before the classification head and\nfurther add IoU heads for discovering novel objects. We demonstrate that a\nsimple stop-gradient operation not only prevents the novel objects from being\nsuppressed as background, but also allows the network to enjoy the merit of\nheuristic label assignment. Secondly, we propose a novel contrastive learning\nframework to enlarge the representations between objects and background.\nSpecifically, we maintain a universal object queue to obtain the object center,\nand dynamically select positive and negative samples from the object queries\nfor contrastive learning. While previous works focus only on pursuing average\nrecall and neglect average precision, we show the prominence of SWORD by giving\nconsideration to both criteria. 
Our models achieve state-of-the-art performance\nin various open-world cross-category and cross-dataset generalizations.\nParticularly, in VOC to non-VOC setup, our method sets new state-of-the-art\nresults of 40.0% on ARb100 and 34.9% on ARm100. For COCO to UVO generalization,\nSWORD significantly outperforms the previous best open-world model by 5.9% on\nAPm and 8.1% on ARm100.\n","authors":["Jiannan Wu","Yi Jiang","Bin Yan","Huchuan Lu","Zehuan Yuan","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2308.04206v1.pdf","comment":"Accepted by ICCV2023. 16 pages"},{"id":"http://arxiv.org/abs/2302.00290v2","updated":"2023-08-08T11:59:25Z","published":"2023-02-01T07:45:10Z","title":"MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely\n Coupled Fusion and Modality-Balanced Optimization","summary":" Multispectral pedestrian detection is an important task for many\naround-the-clock applications, since the visible and thermal modalities can\nprovide complementary information especially under low light conditions. Most\nof the available multispectral pedestrian detectors are based on non-end-to-end\ndetectors, while in this paper, we propose MultiSpectral pedestrian DEtection\nTRansformer (MS-DETR), an end-to-end multispectral pedestrian detector, which\nextends DETR into the field of multi-modal detection. MS-DETR consists of two\nmodality-specific backbones and Transformer encoders, followed by a multi-modal\nTransformer decoder, and the visible and thermal features are fused in the\nmulti-modal Transformer decoder. To well resist the misalignment between\nmulti-modal images, we design a loosely coupled fusion strategy by sparsely\nsampling some keypoints from multi-modal features independently and fusing them\nwith adaptively learned attention weights. Moreover, based on the insight that\nnot only different modalities, but also different pedestrian instances tend to\nhave different confidence scores to final detection, we further propose an\ninstance-aware modality-balanced optimization strategy, which preserves visible\nand thermal decoder branches and aligns their predicted slots through an\ninstance-wise dynamic loss. Our end-to-end MS-DETR shows superior performance\non the challenging KAIST, CVC-14 and LLVIP benchmark datasets. The source code\nis available at https://github.com/YinghuiXing/MS-DETR .\n","authors":["Yinghui Xing","Song Wang","Shizhou Zhang","Guoqiang Liang","Xiuwei Zhang","Yanning Zhang"],"pdf_url":"https://arxiv.org/pdf/2302.00290v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04197v1","updated":"2023-08-08T11:49:04Z","published":"2023-08-08T11:49:04Z","title":"D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with\n Glance Annotation","summary":" Temporal sentence grounding (TSG) aims to locate a specific moment from an\nuntrimmed video with a given natural language query. Recently, weakly\nsupervised methods still have a large performance gap compared to fully\nsupervised ones, while the latter requires laborious timestamp annotations. In\nthis study, we aim to reduce the annotation cost yet keep competitive\nperformance for TSG task compared to fully supervised ones. To achieve this\ngoal, we investigate a recently proposed glance-supervised temporal sentence\ngrounding task, which requires only single frame annotation (referred to as\nglance annotation) for each query. 
Under this setup, we propose a Dynamic\nGaussian prior based Grounding framework with Glance annotation (D3G), which\nconsists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and\na Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL samples\nreliable positive moments from a 2D temporal map via jointly leveraging\nGaussian prior and semantic consistency, which contributes to aligning the\npositive sentence-moment pairs in the joint embedding space. Moreover, to\nalleviate the annotation bias resulting from glance annotation and model\ncomplex queries consisting of multiple events, we propose the DGA module, which\nadjusts the distribution dynamically to approximate the ground truth of target\nmoments. Extensive experiments on three challenging benchmarks verify the\neffectiveness of the proposed D3G. It outperforms the state-of-the-art weakly\nsupervised methods by a large margin and narrows the performance gap compared\nto fully supervised methods. Code is available at\nhttps://github.com/solicucu/D3G.\n","authors":["Hanjun Li","Xiujun Shu","Sunan He","Ruizhi Qiao","Wei Wen","Taian Guo","Bei Gan","Xing Sun"],"pdf_url":"https://arxiv.org/pdf/2308.04197v1.pdf","comment":"ICCV2023"},{"id":"http://arxiv.org/abs/2307.09788v2","updated":"2023-08-08T11:36:26Z","published":"2023-07-19T07:11:45Z","title":"Density-invariant Features for Distant Point Cloud Registration","summary":" Registration of distant outdoor LiDAR point clouds is crucial to extending\nthe 3D vision of collaborative autonomous vehicles, and yet is challenging due\nto small overlapping area and a huge disparity between observed point\ndensities. In this paper, we propose Group-wise Contrastive Learning (GCL)\nscheme to extract density-invariant geometric features to register distant\noutdoor LiDAR point clouds. We mark through theoretical analysis and\nexperiments that, contrastive positives should be independent and identically\ndistributed (i.i.d.), in order to train densityinvariant feature extractors. We\npropose upon the conclusion a simple yet effective training scheme to force the\nfeature of multiple point clouds in the same spatial location (referred to as\npositive groups) to be similar, which naturally avoids the sampling bias\nintroduced by a pair of point clouds to conform with the i.i.d. principle. The\nresulting fully-convolutional feature extractor is more powerful and\ndensity-invariant than state-of-the-art methods, improving the registration\nrecall of distant scenarios on KITTI and nuScenes benchmarks by 40.9% and\n26.9%, respectively. Code is available at https://github.com/liuQuan98/GCL.\n","authors":["Quan Liu","Hongzi Zhu","Yunsong Zhou","Hongyang Li","Shan Chang","Minyi Guo"],"pdf_url":"https://arxiv.org/pdf/2307.09788v2.pdf","comment":"In Proceedings of the IEEE/CVF International Conference on Computer\n Vision (ICCV), 2023"},{"id":"http://arxiv.org/abs/2308.04188v1","updated":"2023-08-08T11:23:56Z","published":"2023-08-08T11:23:56Z","title":"Image Copy-Move Forgery Detection via Deep Cross-Scale PatchMatch","summary":" The recently developed deep algorithms achieve promising progress in the\nfield of image copy-move forgery detection (CMFD). However, they have limited\ngeneralizability in some practical scenarios, where the copy-move objects may\nnot appear in the training images or cloned regions are from the background. 
To\naddress the above issues, in this work, we propose a novel end-to-end CMFD\nframework by integrating merits from both conventional and deep methods.\nSpecifically, we design a deep cross-scale patchmatch method tailored for CMFD\nto localize copy-move regions. In contrast to existing deep models, our scheme\naims to seek explicit and reliable point-to-point matching between source and\ntarget regions using features extracted from high-resolution scales. Further,\nwe develop a manipulation region location branch for source/target separation.\nThe proposed CMFD framework is completely differentiable and can be trained in\nan end-to-end manner. Extensive experimental results demonstrate the high\ngeneralizability of our method to different copy-move contents, and the\nproposed scheme achieves significantly better performance than existing\napproaches.\n","authors":["Yingjie He","Yuanman Li","Changsheng Chen","Xia Li"],"pdf_url":"https://arxiv.org/pdf/2308.04188v1.pdf","comment":"6 pages, 4 figures, accepted by ICME2023"},{"id":"http://arxiv.org/abs/2209.14915v2","updated":"2023-08-08T10:30:54Z","published":"2022-09-29T16:22:46Z","title":"Spiking Neural Networks for event-based action recognition: A new task\n to understand their advantage","summary":" Spiking Neural Networks (SNN) are characterised by their unique temporal\ndynamics, but the properties and advantages of such computations are still not\nwell understood. In order to provide answers, in this work we demonstrate how\nSpiking neurons can enable temporal feature extraction in feed-forward neural\nnetworks without the need for recurrent synapses, showing how their\nbio-inspired computing principles can be successfully exploited beyond energy\nefficiency gains and evidencing their differences with respect to conventional\nneurons. This is demonstrated by proposing a new task, DVS-Gesture-Chain\n(DVS-GC), which allows, for the first time, to evaluate the perception of\ntemporal dependencies in a real event-based action recognition dataset. Our\nstudy proves how the widely used DVS Gesture benchmark could be solved by\nnetworks without temporal feature extraction, unlike the new DVS-GC which\ndemands an understanding of the ordering of the events. Furthermore, this setup\nallowed us to unveil the role of the leakage rate in spiking neurons for\ntemporal processing tasks and demonstrated the benefits of \"hard reset\"\nmechanisms. Additionally, we also show how time-dependent weights and\nnormalization can lead to understanding order by means of temporal attention.\n","authors":["Alex Vicente-Sola","Davide L. Manna","Paul Kirkland","Gaetano Di Caterina","Trevor Bihl"],"pdf_url":"https://arxiv.org/pdf/2209.14915v2.pdf","comment":"New article superseding the one in previous versions"},{"id":"http://arxiv.org/abs/2308.04177v1","updated":"2023-08-08T10:30:34Z","published":"2023-08-08T10:30:34Z","title":"How Generalizable are Deepfake Detectors? An Empirical Study","summary":" Deepfake videos and images are becoming increasingly credible, posing a\nsignificant threat given their potential to facilitate fraud or bypass access\ncontrol systems. This has motivated the development of deepfake detection\nmethods, in which deep learning models are trained to distinguish between real\nand synthesized footage. Unfortunately, existing detection models struggle to\ngeneralize to deepfakes from datasets they were not trained on, but little work\nhas been done to examine why or how this limitation can be addressed. 
In this\npaper, we present the first empirical study on the generalizability of deepfake\ndetectors, an essential goal for detectors to stay one step ahead of attackers.\nOur study utilizes six deepfake datasets, five deepfake detection methods, and\ntwo model augmentation approaches, confirming that detectors do not generalize\nin zero-shot settings. Additionally, we find that detectors are learning\nunwanted properties specific to synthesis methods and struggling to extract\ndiscriminative features, limiting their ability to generalize. Finally, we find\nthat there are neurons universally contributing to detection across seen and\nunseen datasets, illuminating a possible path forward to zero-shot\ngeneralizability.\n","authors":["Boquan Li","Jun Sun","Christopher M. Poskitt"],"pdf_url":"https://arxiv.org/pdf/2308.04177v1.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2301.10227v2","updated":"2023-08-08T10:18:04Z","published":"2023-01-02T14:17:08Z","title":"Denoising Diffusion Probabilistic Models for Generation of Realistic\n Fully-Annotated Microscopy Image Data Sets","summary":" Recent advances in computer vision have led to significant progress in the\ngeneration of realistic image data, with denoising diffusion probabilistic\nmodels proving to be a particularly effective method. In this study, we\ndemonstrate that diffusion models can effectively generate fully-annotated\nmicroscopy image data sets through an unsupervised and intuitive approach,\nusing rough sketches of desired structures as the starting point. The proposed\npipeline helps to reduce the reliance on manual annotations when training deep\nlearning-based segmentation approaches and enables the segmentation of diverse\ndatasets without the need for human annotations. This approach holds great\npromise in streamlining the data generation process and enabling a more\nefficient and scalable training of segmentation models, as we show in the\nexample of different practical experiments involving various organisms and cell\ntypes.\n","authors":["Dennis Eschweiler","Rüveyda Yilmaz","Matisse Baumann","Ina Laube","Rijo Roy","Abin Jose","Daniel Brückner","Johannes Stegmaier"],"pdf_url":"https://arxiv.org/pdf/2301.10227v2.pdf","comment":"9 pages, 2 figures"},{"id":"http://arxiv.org/abs/2301.05609v4","updated":"2023-08-08T10:04:14Z","published":"2023-01-13T15:24:40Z","title":"Co-manipulation of soft-materials estimating deformation from depth\n images","summary":" Human-robot co-manipulation of soft materials, such as fabrics, composites,\nand sheets of paper/cardboard, is a challenging operation that presents several\nrelevant industrial applications. Estimating the deformation state of the\nco-manipulated material is one of the main challenges. Viable methods provide\nthe indirect measure by calculating the human-robot relative distance. In this\npaper, we develop a data-driven model to estimate the deformation state of the\nmaterial from a depth image through a Convolutional Neural Network (CNN).\nFirst, we define the deformation state of the material as the relative\nroto-translation from the current robot pose and a human grasping position. 
The\nmodel estimates the current deformation state through a Convolutional Neural\nNetwork, specifically a DenseNet-121 pretrained on ImageNet. The delta between\nthe current and the desired deformation state is fed to the robot controller\nthat outputs twist commands. The paper describes the developed approach to\nacquire and preprocess the dataset and to train the model. The model is\ncompared with the current state-of-the-art method based on a skeletal tracker\nfrom cameras. Results show that our approach achieves better performance and\navoids the various drawbacks caused by using a skeletal tracker. Finally, we\nalso studied the model performance according to different architectures and\ndataset dimensions to minimize the time required for dataset acquisition.\n","authors":["Giorgio Nicola","Enrico Villagrossi","Nicola Pedrocchi"],"pdf_url":"https://arxiv.org/pdf/2301.05609v4.pdf","comment":"Pre-print, Accepted to Robotics and Computer Integrated Manufacturing"},{"id":"http://arxiv.org/abs/2308.04168v1","updated":"2023-08-08T09:58:22Z","published":"2023-08-08T09:58:22Z","title":"EFaR 2023: Efficient Face Recognition Competition","summary":" This paper presents the summary of the Efficient Face Recognition Competition\n(EFaR) held at the 2023 International Joint Conference on Biometrics (IJCB\n2023). The competition received 17 submissions from 6 different teams. To drive\nfurther development of efficient face recognition models, the submitted\nsolutions are ranked based on a weighted score of the achieved verification\naccuracies on a diverse set of benchmarks, as well as the deployability given\nby the number of floating-point operations and model size. The evaluation of\nsubmissions is extended to bias, cross-quality, and large-scale recognition\nbenchmarks. Overall, the paper gives an overview of the achieved performance\nvalues of the submitted solutions as well as a diverse set of baselines. The\nsubmitted solutions use small, efficient network architectures to reduce the\ncomputational cost; some solutions apply model quantization. An outlook on\npossible techniques that are underrepresented in current solutions is given as\nwell.\n","authors":["Jan Niklas Kolf","Fadi Boutros","Jurek Elliesen","Markus Theuerkauf","Naser Damer","Mohamad Alansari","Oussama Abdul Hay","Sara Alansari","Sajid Javed","Naoufel Werghi","Klemen Grm","Vitomir Štruc","Fernando Alonso-Fernandez","Kevin Hernandez Diaz","Josef Bigun","Anjith George","Christophe Ecabert","Hatef Otroshi Shahreza","Ketan Kotwal","Sébastien Marcel","Iurii Medvedev","Bo Jin","Diogo Nunes","Ahmad Hassanpour","Pankaj Khatiwada","Aafan Ahmad Toor","Bian Yang"],"pdf_url":"https://arxiv.org/pdf/2308.04168v1.pdf","comment":"Accepted at IJCB 2023"},{"id":"http://arxiv.org/abs/2308.04163v1","updated":"2023-08-08T09:50:44Z","published":"2023-08-08T09:50:44Z","title":"Under-Display Camera Image Restoration with Scattering Effect","summary":" The under-display camera (UDC) provides consumers with a full-screen visual\nexperience without any obstruction due to notches or punched holes. However,\nthe semi-transparent nature of the display inevitably introduces severe\ndegradation into UDC images. In this work, we address the UDC image restoration\nproblem with the specific consideration of the scattering effect caused by the\ndisplay. We explicitly model the scattering effect by treating the display as a\npiece of homogeneous scattering medium. 
With the physical model of the\nscattering effect, we improve the image formation pipeline for the image\nsynthesis to construct a realistic UDC dataset with ground truths. To suppress\nthe scattering effect for the eventual UDC image recovery, a two-branch\nrestoration network is designed. More specifically, the scattering branch\nleverages global modeling capabilities of the channel-wise self-attention to\nestimate parameters of the scattering effect from degraded images. While the\nimage branch exploits the local representation advantage of CNN to recover\nclear scenes, implicitly guided by the scattering branch. Extensive experiments\nare conducted on both real-world and synthesized data, demonstrating the\nsuperiority of the proposed method over the state-of-the-art UDC restoration\ntechniques. The source code and dataset are available at\n\\url{https://github.com/NamecantbeNULL/SRUDC}.\n","authors":["Binbin Song","Xiangyu Chen","Shuning Xu","Jiantao Zhou"],"pdf_url":"https://arxiv.org/pdf/2308.04163v1.pdf","comment":"Accepted to ICCV2023"},{"id":"http://arxiv.org/abs/2308.04162v1","updated":"2023-08-08T09:48:00Z","published":"2023-08-08T09:48:00Z","title":"EPCFormer: Expression Prompt Collaboration Transformer for Universal\n Referring Video Object Segmentation","summary":" Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object\nSegmentation (R-VOS) are two highly-related tasks, which both aim to segment\nspecific objects from video sequences according to user-provided expression\nprompts. However, due to the challenges in modeling representations for\ndifferent modalities, contemporary methods struggle to strike a balance between\ninteraction flexibility and high-precision localization and segmentation. In\nthis paper, we address this problem from two perspectives: the alignment\nrepresentation of audio and text and the deep interaction among audio, text,\nand visual features. First, we propose a universal architecture, the Expression\nPrompt Collaboration Transformer, herein EPCFormer. Next, we propose an\nExpression Alignment (EA) mechanism for audio and text expressions. By\nintroducing contrastive learning for audio and text expressions, the proposed\nEPCFormer realizes comprehension of the semantic equivalence between audio and\ntext expressions denoting the same objects. Then, to facilitate deep\ninteractions among audio, text, and video features, we introduce an\nExpression-Visual Attention (EVA) mechanism. The knowledge of video object\nsegmentation in terms of the expression prompts can seamlessly transfer between\nthe two tasks by deeply exploring complementary cues between text and audio.\nExperiments on well-recognized benchmarks demonstrate that our universal\nEPCFormer attains state-of-the-art results on both tasks. The source code of\nEPCFormer will be made publicly available at\nhttps://github.com/lab206/EPCFormer.\n","authors":["Jiajun Chen","Jiacheng Lin","Zhiqiang Xiao","Haolong Fu","Ke Nai","Kailun Yang","Zhiyong Li"],"pdf_url":"https://arxiv.org/pdf/2308.04162v1.pdf","comment":"The source code will be made publicly available at\n https://github.com/lab206/EPCFormer"},{"id":"http://arxiv.org/abs/2306.10046v2","updated":"2023-08-08T09:46:21Z","published":"2023-06-12T08:21:50Z","title":"Document Layout Annotation: Database and Benchmark in the Domain of\n Public Affairs","summary":" Every day, thousands of digital documents are generated with useful\ninformation for companies, public organizations, and citizens. 
Given the\nimpossibility of processing them manually, the automatic processing of these\ndocuments is becoming increasingly necessary in certain sectors. However, this\ntask remains challenging, since in most cases a text-only based parsing is not\nenough to fully understand the information presented through different\ncomponents of varying significance. In this regard, Document Layout Analysis\n(DLA) has been an interesting research field for many years, which aims to\ndetect and classify the basic components of a document. In this work, we used a\nprocedure to semi-automatically annotate digital documents with different\nlayout labels, including 4 basic layout blocks and 4 text categories. We apply\nthis procedure to collect a novel database for DLA in the public affairs\ndomain, using a set of 24 data sources from the Spanish Administration. The\ndatabase comprises 37.9K documents with more than 441K document pages, and more\nthan 8M labels associated to 8 layout block units. The results of our\nexperiments validate the proposed text labeling procedure with accuracy up to\n99%.\n","authors":["Alejandro Peña","Aythami Morales","Julian Fierrez","Javier Ortega-Garcia","Marcos Grande","Iñigo Puente","Jorge Cordova","Gonzalo Cordova"],"pdf_url":"https://arxiv.org/pdf/2306.10046v2.pdf","comment":"Accepted in ICDAR 2023 Workshop on Machine Vision and NLP for\n Document Analysis"},{"id":"http://arxiv.org/abs/2308.04156v1","updated":"2023-08-08T09:37:18Z","published":"2023-08-08T09:37:18Z","title":"Towards Top-Down Stereoscopic Image Quality Assessment via Stereo\n Attention","summary":" Stereoscopic image quality assessment (SIQA) plays a crucial role in\nevaluating and improving the visual experience of 3D content. Existing\nbinocular properties and attention-based methods for SIQA have achieved\npromising performance. However, these bottom-up approaches are inadequate in\nexploiting the inherent characteristics of the human visual system (HVS). This\npaper presents a novel network for SIQA via stereo attention, employing a\ntop-down perspective to guide the quality assessment process. Our proposed\nmethod realizes the guidance from high-level binocular signals down to\nlow-level monocular signals, while the binocular and monocular information can\nbe calibrated progressively throughout the processing pipeline. We design a\ngeneralized Stereo AttenTion (SAT) block to implement the top-down philosophy\nin stereo perception. This block utilizes the fusion-generated attention map as\na high-level binocular modulator, influencing the representation of two\nlow-level monocular features. Additionally, we introduce an Energy Coefficient\n(EC) to account for recent findings indicating that binocular responses in the\nprimate primary visual cortex are less than the sum of monocular responses. The\nadaptive EC can tune the magnitude of binocular response flexibly, thus\nenhancing the formation of robust binocular features within our framework. To\nextract the most discriminative quality information from the summation and\nsubtraction of the two branches of monocular features, we utilize a\ndual-pooling strategy that applies min-pooling and max-pooling operations to\nthe respective branches. Experimental results highlight the superiority of our\ntop-down method in simulating the property of visual perception and advancing\nthe state-of-the-art in the SIQA field. 
The code of this work is available at\nhttps://github.com/Fanning-Zhang/SATNet.\n","authors":["Huilin Zhang","Sumei Li","Yongli Chang"],"pdf_url":"https://arxiv.org/pdf/2308.04156v1.pdf","comment":"13 pages, 4 figures"},{"id":"http://arxiv.org/abs/2308.04152v1","updated":"2023-08-08T09:32:43Z","published":"2023-08-08T09:32:43Z","title":"Empowering Vision-Language Models to Follow Interleaved Vision-Language\n Instructions","summary":" Multimodal Large Language Models (MLLMs) have recently sparked significant\ninterest, which demonstrates emergent capabilities to serve as a\ngeneral-purpose model for various vision-language tasks. However, existing\nmethods mainly focus on limited types of instructions with a single image as\nvisual context, which hinders the widespread availability of MLLMs. In this\npaper, we introduce the I4 benchmark to comprehensively evaluate the\ninstruction following ability on complicated interleaved vision-language\ninstructions, which involve intricate image-text sequential context, covering a\ndiverse range of scenarios (e.g., visually-rich webpages/textbooks, lecture\nslides, embodied dialogue). Systematic evaluation on our I4 benchmark reveals a\ncommon defect of existing methods: the Visual Prompt Generator (VPG) trained on\nimage-captioning alignment objective tends to attend to common foreground\ninformation for captioning but struggles to extract specific information\nrequired by particular tasks. To address this issue, we propose a generic and\nlightweight controllable knowledge re-injection module, which utilizes the\nsophisticated reasoning ability of LLMs to control the VPG to conditionally\nextract instruction-specific visual information and re-inject it into the LLM.\nFurther, we introduce an annotation-free cross-attention guided counterfactual\nimage training strategy to methodically learn the proposed module by\ncollaborating a cascade of foundation models. Enhanced by the proposed module\nand training strategy, we present Cheetah, a MLLM that can effectively handle a\nwide variety of interleaved vision-language instructions and achieves\nstate-of-the-art zero-shot performance across all tasks of I4, without\nhigh-quality multimodal instruction tuning data. Moreover, Cheetah also\nexhibits competitive performance compared with state-of-the-art instruction\ntuned models on concurrent MME benchmark.\n","authors":["Juncheng Li","Kaihang Pan","Zhiqi Ge","Minghe Gao","Hanwang Zhang","Wei Ji","Wenqiao Zhang","Tat-Seng Chua","Siliang Tang","Yueting Zhuang"],"pdf_url":"https://arxiv.org/pdf/2308.04152v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04151v1","updated":"2023-08-08T09:32:15Z","published":"2023-08-08T09:32:15Z","title":"Application for White Spot Syndrome Virus (WSSV) Monitoring using Edge\n Machine Learning","summary":" The aquaculture industry, strongly reliant on shrimp exports, faces\nchallenges due to viral infections like the White Spot Syndrome Virus (WSSV)\nthat severely impact output yields. In this context, computer vision can play a\nsignificant role in identifying features not immediately evident to skilled or\nuntrained eyes, potentially reducing the time required to report WSSV\ninfections. In this study, the challenge of limited data for WSSV recognition\nwas addressed. A mobile application dedicated to data collection and monitoring\nwas developed to facilitate the creation of an image dataset to train a WSSV\nrecognition model and improve country-wide disease surveillance. 
The study also\nincludes a thorough analysis of WSSV recognition to address the challenge of\nimbalanced learning and on-device inference. The models explored,\nMobileNetV3-Small and EfficientNetV2-B0, gained an F1-Score of 0.72 and 0.99\nrespectively. The saliency heatmaps of both models were also observed to\nuncover the \"black-box\" nature of these models and to gain insight as to what\nfeatures in the images are most important in making a prediction. These results\nhighlight the effectiveness and limitations of using models designed for\nresource-constrained devices and balancing their performance in accurately\nrecognizing WSSV, providing valuable information and direction in the use of\ncomputer vision in this domain.\n","authors":["Lorenzo S. Querol","Macario O. Cordel II","Dan Jeric A. Rustia","Mary Nia M. Santos"],"pdf_url":"https://arxiv.org/pdf/2308.04151v1.pdf","comment":"6 pages, 7 figures, conference"},{"id":"http://arxiv.org/abs/2308.02632v2","updated":"2023-08-08T09:21:40Z","published":"2023-08-04T17:44:27Z","title":"Generation of Realistic Synthetic Raw Radar Data for Automated Driving\n Applications using Generative Adversarial Networks","summary":" The main approaches for simulating FMCW radar are based on ray tracing, which\nis usually computationally intensive and do not account for background noise.\nThis work proposes a faster method for FMCW radar simulation capable of\ngenerating synthetic raw radar data using generative adversarial networks\n(GAN). The code and pre-trained weights are open-source and available on\nGitHub. This method generates 16 simultaneous chirps, which allows the\ngenerated data to be used for the further development of algorithms for\nprocessing radar data (filtering and clustering). This can increase the\npotential for data augmentation, e.g., by generating data in non-existent or\nsafety-critical scenarios that are not reproducible in real life. In this work,\nthe GAN was trained with radar measurements of a motorcycle and used to\ngenerate synthetic raw radar data of a motorcycle traveling in a straight line.\nFor generating this data, the distance of the motorcycle and Gaussian noise are\nused as input to the neural network. The synthetic generated radar chirps were\nevaluated using the Frechet Inception Distance (FID). Then, the Range-Azimuth\n(RA) map is calculated twice: first, based on synthetic data using this GAN\nand, second, based on real data. Based on these RA maps, an algorithm with\nadaptive threshold and edge detection is used for object detection. The results\nhave shown that the data is realistic in terms of coherent radar reflections of\nthe motorcycle and background noise based on the comparison of chirps, the RA\nmaps and the object detection results. Thus, the proposed method in this work\nhas shown to minimize the simulation-to-reality gap for the generation of radar\ndata.\n","authors":["Eduardo C. Fidelis","Fabio Reway","Herick Y. S. Ribeiro","Pietro L. Campos","Werner Huber","Christian Icking","Lester A. 
Faria","Torsten Schön"],"pdf_url":"https://arxiv.org/pdf/2308.02632v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2002.03729v3","updated":"2023-08-08T09:18:57Z","published":"2020-01-16T09:38:50Z","title":"A lightweight target detection algorithm based on Mobilenet Convolution","summary":" Target detection algorithms based on deep learning require high-end GPU\nconfigurations, sometimes even high-performance deep learning workstations,\nwhich not only increases cost but also greatly limits practical deployment.\nThis paper introduces a lightweight target detection algorithm that balances\naccuracy and computational efficiency, with MobileNet as the backbone. The\nprocessing speed is 30fps on the RTX2060 card for images with a resolution of\n320*320.\n","authors":["Shengquan Wang"],"pdf_url":"https://arxiv.org/pdf/2002.03729v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04142v1","updated":"2023-08-08T09:03:46Z","published":"2023-08-08T09:03:46Z","title":"Class-level Structural Relation Modelling and Smoothing for Visual\n Representation Learning","summary":" Representation learning for images has been advanced by recent progress in\nmore complex neural models such as the Vision Transformers and new learning\ntheories such as the structural causal models. However, these models mainly\nrely on the classification loss to implicitly regularize the class-level data\ndistributions, and they may face difficulties when handling classes with\ndiverse visual patterns. We argue that the incorporation of the structural\ninformation between data samples may improve this situation. To achieve this\ngoal, this paper presents a framework termed \\textbf{C}lass-level Structural\nRelation Modeling and Smoothing for Visual Representation Learning (CSRMS),\nwhich includes the Class-level Relation Modelling, Class-aware Graph Sampling,\nand Relational Graph-Guided Representation Learning modules to model a\nrelational graph of the entire dataset and perform class-aware smoothing and\nregularization operations to alleviate the issue of intra-class visual\ndiversity and inter-class similarity. Specifically, the Class-level Relation\nModelling module uses a clustering algorithm to learn the data distributions in\nthe feature space and identify three types of class-level sample relations for\nthe training set; Class-aware Graph Sampling module extends typical training\nbatch construction process with three strategies to sample dataset-level\nsub-graphs; and Relational Graph-Guided Representation Learning module employs\na graph convolution network with knowledge-guided smoothing operations to ease\nthe projection from different visual patterns to the same class. 
Experiments\ndemonstrate the effectiveness of structured knowledge modelling for enhanced\nrepresentation learning and show that CSRMS can be incorporated with any\nstate-of-the-art visual representation learning models for performance gains.\nThe source codes and demos have been released at\nhttps://github.com/czt117/CSRMS.\n","authors":["Zitan Chen","Zhuang Qi","Xiao Cao","Xiangxian Li","Xiangxu Meng","Lei Meng"],"pdf_url":"https://arxiv.org/pdf/2308.04142v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04137v1","updated":"2023-08-08T08:50:27Z","published":"2023-08-08T08:50:27Z","title":"Comprehensive Assessment of the Performance of Deep Learning Classifiers\n Reveals a Surprising Lack of Robustness","summary":" Reliable and robust evaluation methods are a necessary first step towards\ndeveloping machine learning models that are themselves robust and reliable.\nUnfortunately, current evaluation protocols typically used to assess\nclassifiers fail to comprehensively evaluate performance as they tend to rely\non limited types of test data, and ignore others. For example, using the\nstandard test data fails to evaluate the predictions made by the classifier to\nsamples from classes it was not trained on. On the other hand, testing with\ndata containing samples from unknown classes fails to evaluate how well the\nclassifier can predict the labels for known classes. This article advocates\nbench-marking performance using a wide range of different types of data and\nusing a single metric that can be applied to all such data types to produce a\nconsistent evaluation of performance. Using such a benchmark it is found that\ncurrent deep neural networks, including those trained with methods that are\nbelieved to produce state-of-the-art robustness, are extremely vulnerable to\nmaking mistakes on certain types of data. This means that such models will be\nunreliable in real-world scenarios where they may encounter data from many\ndifferent domains, and that they are insecure as they can easily be fooled into\nmaking the wrong decisions. It is hoped that these results will motivate the\nwider adoption of more comprehensive testing methods that will, in turn, lead\nto the development of more robust machine learning methods in the future.\n Code is available at:\n\\url{https://codeberg.org/mwspratling/RobustnessEvaluation}\n","authors":["Michael W. Spratling"],"pdf_url":"https://arxiv.org/pdf/2308.04137v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.18651v3","updated":"2023-08-08T08:48:48Z","published":"2023-05-29T23:06:05Z","title":"UMD: Unsupervised Model Detection for X2X Backdoor Attacks","summary":" Backdoor (Trojan) attack is a common threat to deep neural networks, where\nsamples from one or more source classes embedded with a backdoor trigger will\nbe misclassified to adversarial target classes. Existing methods for detecting\nwhether a classifier is backdoor attacked are mostly designed for attacks with\na single adversarial target (e.g., all-to-one attack). To the best of our\nknowledge, without supervision, no existing methods can effectively address the\nmore general X2X attack with an arbitrary number of source classes, each paired\nwith an arbitrary target class. 
In this paper, we propose UMD, the first\nUnsupervised Model Detection method that effectively detects X2X backdoor\nattacks via a joint inference of the adversarial (source, target) class pairs.\nIn particular, we first define a novel transferability statistic to measure and\nselect a subset of putative backdoor class pairs based on a proposed clustering\napproach. Then, these selected class pairs are jointly assessed based on an\naggregation of their reverse-engineered trigger size for detection inference,\nusing a robust and unsupervised anomaly detector we proposed. We conduct\ncomprehensive evaluations on CIFAR-10, GTSRB, and Imagenette dataset, and show\nthat our unsupervised UMD outperforms SOTA detectors (even with supervision) by\n17%, 4%, and 8%, respectively, in terms of the detection accuracy against\ndiverse X2X attacks. We also show the strong detection performance of UMD\nagainst several strong adaptive attacks.\n","authors":["Zhen Xiang","Zidi Xiong","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2305.18651v3.pdf","comment":"Proceedings of the 40th International Conference on Machine Learning"},{"id":"http://arxiv.org/abs/2308.04126v1","updated":"2023-08-08T08:30:16Z","published":"2023-08-08T08:30:16Z","title":"OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion\n and Infinite Data Generation","summary":" This paper presents OmniDataComposer, an innovative approach for multimodal\ndata fusion and unlimited data generation with an intent to refine and\nuncomplicate interplay among diverse data modalities. Coming to the core\nbreakthrough, it introduces a cohesive data structure proficient in processing\nand merging multimodal data inputs, which include video, audio, and text. Our\ncrafted algorithm leverages advancements across multiple operations such as\nvideo/image caption extraction, dense caption extraction, Automatic Speech\nRecognition (ASR), Optical Character Recognition (OCR), Recognize Anything\nModel(RAM), and object tracking. OmniDataComposer is capable of identifying\nover 6400 categories of objects, substantially broadening the spectrum of\nvisual information. It amalgamates these diverse modalities, promoting\nreciprocal enhancement among modalities and facilitating cross-modal data\ncorrection. \\textbf{The final output metamorphoses each video input into an\nelaborate sequential document}, virtually transmuting videos into thorough\nnarratives, making them easier to be processed by large language models. Future\nprospects include optimizing datasets for each modality to encourage unlimited\ndata generation. This robust base will offer priceless insights to models like\nChatGPT, enabling them to create higher quality datasets for video captioning\nand easing question-answering tasks based on video content. OmniDataComposer\ninaugurates a new stage in multimodal learning, imparting enormous potential\nfor augmenting AI's understanding and generation of complex, real-world data.\n","authors":["Dongyang Yu","Shihao Wang","Yuan Fang","Wangpeng An"],"pdf_url":"https://arxiv.org/pdf/2308.04126v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.14508v3","updated":"2023-08-08T08:29:12Z","published":"2022-08-30T19:32:07Z","title":"Swin-transformer-yolov5 For Real-time Wine Grape Bunch Detection","summary":" In this research, an integrated detection model, Swin-transformer-YOLOv5 or\nSwin-T-YOLOv5, was proposed for real-time wine grape bunch detection to inherit\nthe advantages from both YOLOv5 and Swin-transformer. 
The research was\nconducted on two different grape varieties of Chardonnay (always white berry\nskin) and Merlot (white or white-red mix berry skin when immature; red when\nmatured) from July to September in 2019. To verify the superiority of\nSwin-T-YOLOv5, its performance was compared against several commonly\nused/competitive object detectors, including Faster R-CNN, YOLOv3, YOLOv4, and\nYOLOv5. All models were assessed under different test conditions, including two\ndifferent weather conditions (sunny and cloudy), two different berry maturity\nstages (immature and mature), and three different sunlight\ndirections/intensities (morning, noon, and afternoon) for a comprehensive\ncomparison. Additionally, the predicted number of grape bunches by\nSwin-T-YOLOv5 was further compared with ground truth values, including both\nin-field manual counting and manual labeling during the annotation process.\nResults showed that the proposed Swin-T-YOLOv5 outperformed all other studied\nmodels for grape bunch detection, with up to 97% of mean Average Precision\n(mAP) and 0.89 of F1-score when the weather was cloudy. This mAP was\napproximately 44%, 18%, 14%, and 4% greater than Faster R-CNN, YOLOv3, YOLOv4,\nand YOLOv5, respectively. Swin-T-YOLOv5 achieved its lowest mAP (90%) and\nF1-score (0.82) when detecting immature berries, where the mAP was\napproximately 40%, 5%, 3%, and 1% greater than that of the same four detectors.\nFurthermore, Swin-T-YOLOv5 performed better on the Chardonnay variety, achieving up to 0.91\nof R2 and 2.36 root mean square error (RMSE) when comparing the predictions\nwith ground truth. However, it underperformed on the Merlot variety, achieving\nonly up to 0.70 of R2 and 3.30 of RMSE.\n","authors":["Shenglian Lu","Xiaoyu Liu","Zixaun He","Wenbo Liu","Xin Zhang","Manoj Karkee"],"pdf_url":"https://arxiv.org/pdf/2208.14508v3.pdf","comment":"30 pages; 15 figures;Corresponding author: Xin Zhang Department of\n Agricultural and Biological Engineering Mississippi State University\n Mississippi State, MS 39762, USA (xzhang@abe.msstate.edu)"},{"id":"http://arxiv.org/abs/2301.11514v4","updated":"2023-08-08T08:26:20Z","published":"2023-01-27T03:18:09Z","title":"Deep Industrial Image Anomaly Detection: A Survey","summary":" The recent rapid development of deep learning has laid a milestone in\nindustrial Image Anomaly Detection (IAD). In this paper, we provide a\ncomprehensive review of deep learning-based image anomaly detection techniques,\nfrom the perspectives of neural network architectures, levels of supervision,\nloss functions, metrics and datasets. In addition, we extract a new setting\nfrom industrial manufacturing and review the current IAD approaches under our\nproposed new setting. Moreover, we highlight several open challenges for\nimage anomaly detection. The merits and downsides of representative network\narchitectures under varying supervision are discussed. Finally, we summarize\nthe research findings and point out future research directions. 
More resources\nare available at\nhttps://github.com/M-3LAB/awesome-industrial-anomaly-detection.\n","authors":["Jiaqi Liu","Guoyang Xie","Jingbao Wang","Shangnian Li","Chengjie Wang","Feng Zheng","Yaochu Jin"],"pdf_url":"https://arxiv.org/pdf/2301.11514v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04118v1","updated":"2023-08-08T08:17:39Z","published":"2023-08-08T08:17:39Z","title":"Multimodal Color Recommendation in Vector Graphic Documents","summary":" Color selection plays a critical role in graphic document design and requires\nsufficient consideration of various contexts. However, recommending appropriate\ncolors which harmonize with the other colors and textual contexts in documents\nis a challenging task, even for experienced designers. In this study, we\npropose a multimodal masked color model that integrates both color and textual\ncontexts to provide text-aware color recommendation for graphic documents. Our\nproposed model comprises self-attention networks to capture the relationships\nbetween colors in multiple palettes, and cross-attention networks that\nincorporate both color and CLIP-based text representations. Our proposed method\nprimarily focuses on color palette completion, which recommends colors based on\nthe given colors and text. Additionally, it is applicable for another color\nrecommendation task, full palette generation, which generates a complete color\npalette corresponding to the given text. Experimental results demonstrate that\nour proposed approach surpasses previous color palette completion methods on\naccuracy, color distribution, and user experience, as well as full palette\ngeneration methods concerning color diversity and similarity to the ground\ntruth palettes.\n","authors":["Qianru Qiu","Xueting Wang","Mayu Otani"],"pdf_url":"https://arxiv.org/pdf/2308.04118v1.pdf","comment":"Accepted to ACM MM 2023"},{"id":"http://arxiv.org/abs/2303.06209v2","updated":"2023-08-08T08:06:48Z","published":"2023-03-10T21:17:14Z","title":"SemARFlow: Injecting Semantics into Unsupervised Optical Flow Estimation\n for Autonomous Driving","summary":" Unsupervised optical flow estimation is especially hard near occlusions and\nmotion boundaries and in low-texture regions. We show that additional\ninformation such as semantics and domain knowledge can help better constrain\nthis problem. We introduce SemARFlow, an unsupervised optical flow network\ndesigned for autonomous driving data that takes estimated semantic segmentation\nmasks as additional inputs. This additional information is injected into the\nencoder and into a learned upsampler that refines the flow output. In addition,\na simple yet effective semantic augmentation module provides self-supervision\nwhen learning flow and its boundaries for vehicles, poles, and sky. Together,\nthese injections of semantic information improve the KITTI-2015 optical flow\ntest error rate from 11.80% to 8.38%. 
We also show visible improvements around\nobject boundaries as well as a greater ability to generalize across datasets.\nCode is available at\nhttps://github.com/duke-vision/semantic-unsup-flow-release.\n","authors":["Shuai Yuan","Shuzhi Yu","Hannah Kim","Carlo Tomasi"],"pdf_url":"https://arxiv.org/pdf/2303.06209v2.pdf","comment":"Accepted by ICCV-2023; Code is available at\n https://github.com/duke-vision/semantic-unsup-flow-release"},{"id":"http://arxiv.org/abs/2307.14016v3","updated":"2023-08-08T07:57:15Z","published":"2023-07-26T07:57:56Z","title":"RPG-Palm: Realistic Pseudo-data Generation for Palmprint Recognition","summary":" Palmprint recently shows great potential in recognition applications as it is\na privacy-friendly and stable biometric. However, the lack of large-scale\npublic palmprint datasets limits further research and development of palmprint\nrecognition. In this paper, we propose a novel realistic pseudo-palmprint\ngeneration (RPG) model to synthesize palmprints with massive identities. We\nfirst introduce a conditional modulation generator to improve the intra-class\ndiversity. Then an identity-aware loss is proposed to ensure identity\nconsistency against unpaired training. We further improve the B\\'ezier palm\ncreases generation strategy to guarantee identity independence. Extensive\nexperimental results demonstrate that synthetic pretraining significantly\nboosts the recognition model performance. For example, our model improves the\nstate-of-the-art B\\'ezierPalm by more than $5\\%$ and $14\\%$ in terms of\nTAR@FAR=1e-6 under the $1:1$ and $1:3$ Open-set protocol. When accessing only\n$10\\%$ of the real training data, our method still outperforms ArcFace with\n$100\\%$ real training data, indicating that we are closer to real-data-free\npalmprint recognition.\n","authors":["Lei Shen","Jianlong Jin","Ruixin Zhang","Huaen Li","Kai Zhao","Yingyi Zhang","Jingyun Zhang","Shouhong Ding","Yang Zhao","Wei Jia"],"pdf_url":"https://arxiv.org/pdf/2307.14016v3.pdf","comment":"12 pages,8 figures"},{"id":"http://arxiv.org/abs/2308.03463v2","updated":"2023-08-08T07:54:55Z","published":"2023-08-07T10:41:52Z","title":"DiffSynth: Latent In-Iteration Deflickering for Realistic Video\n Synthesis","summary":" In recent years, diffusion models have emerged as the most powerful approach\nin image synthesis. However, applying these models directly to video synthesis\npresents challenges, as it often leads to noticeable flickering contents.\nAlthough recently proposed zero-shot methods can alleviate flicker to some\nextent, we still struggle to generate coherent videos. In this paper, we\npropose DiffSynth, a novel approach that aims to convert image synthesis\npipelines to video synthesis pipelines. DiffSynth consists of two key\ncomponents: a latent in-iteration deflickering framework and a video\ndeflickering algorithm. The latent in-iteration deflickering framework applies\nvideo deflickering to the latent space of diffusion models, effectively\npreventing flicker accumulation in intermediate steps. Additionally, we propose\na video deflickering algorithm, named patch blending algorithm, that remaps\nobjects in different frames and blends them together to enhance video\nconsistency. One of the notable advantages of DiffSynth is its general\napplicability to various video synthesis tasks, including text-guided video\nstylization, fashion video synthesis, image-guided video stylization, video\nrestoring, and 3D rendering. 
In the task of text-guided video stylization, we\nmake it possible to synthesize high-quality videos without cherry-picking. The\nexperimental results demonstrate the effectiveness of DiffSynth. All videos can\nbe viewed on our project page. Source codes will also be released.\n","authors":["Zhongjie Duan","Lizhou You","Chengyu Wang","Cen Chen","Ziheng Wu","Weining Qian","Jun Huang","Fei Chao"],"pdf_url":"https://arxiv.org/pdf/2308.03463v2.pdf","comment":"9 pages, 6 figures"},{"id":"http://arxiv.org/abs/2202.04680v2","updated":"2023-08-08T07:36:57Z","published":"2022-02-09T19:03:05Z","title":"Lifting-based variational multiclass segmentation: design, analysis and\n implementation","summary":" We propose, analyze and realize a variational multiclass segmentation scheme\nthat partitions a given image into multiple regions exhibiting specific\nproperties. Our method determines multiple functions that encode the\nsegmentation regions by minimizing an energy functional combining information\nfrom different channels. Multichannel image data can be obtained by lifting the\nimage into a higher dimensional feature space using specific multichannel\nfiltering or may already be provided by the imaging modality under\nconsideration, such as an RGB image or multimodal medical data. Experimental\nresults show that the proposed method performs well in various scenarios. In\nparticular, promising results are presented for two medical applications\ninvolving classification of brain abscess and tumor growth, respectively. As\nmain theoretical contributions, we prove the existence of global minimizers of\nthe proposed energy functional and show its stability and convergence with\nrespect to noisy inputs. In particular, these results also apply to the special\ncase of binary segmentation, and these results are also novel in this\nparticular situation.\n","authors":["Nadja Gruber","Johannes Schwab","Sebastien Court","Elke Gizewski","Markus Haltmeier"],"pdf_url":"https://arxiv.org/pdf/2202.04680v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04091v1","updated":"2023-08-08T07:15:23Z","published":"2023-08-08T07:15:23Z","title":"From Unimodal to Multimodal: improving the sEMG-Based Pattern\n Recognition via deep generative models","summary":" Multimodal hand gesture recognition (HGR) systems can achieve higher\nrecognition accuracy. However, acquiring multimodal gesture recognition data\ntypically requires users to wear additional sensors, thereby increasing\nhardware costs. This paper proposes a novel generative approach to improve\nSurface Electromyography (sEMG)-based HGR accuracy via virtual Inertial\nMeasurement Unit (IMU) signals. Specifically, we trained a deep generative\nmodel based on the intrinsic correlation between forearm sEMG signals and\nforearm IMU signals to generate virtual forearm IMU signals from the input\nforearm sEMG signals at first. Subsequently, the sEMG signals and virtual IMU\nsignals were fed into a multimodal Convolutional Neural Network (CNN) model for\ngesture recognition. To evaluate the performance of the proposed approach, we\nconducted experiments on 6 databases, including 5 publicly available databases\nand our collected database comprising 28 subjects performing 38 gestures,\ncontaining both sEMG and IMU data. The results show that our proposed approach\noutperforms the sEMG-based unimodal HGR method (with increases of\n2.15%-13.10%). 
It demonstrates that incorporating virtual IMU signals,\ngenerated by deep generative models, can significantly enhance the accuracy of\nsEMG-based HGR. The proposed approach represents a successful attempt to\ntransition from unimodal HGR to multimodal HGR without additional sensor\nhardware.\n","authors":["Wentao Wei","Linyan Ren"],"pdf_url":"https://arxiv.org/pdf/2308.04091v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.09880v3","updated":"2023-08-08T07:02:16Z","published":"2023-05-17T01:27:27Z","title":"A survey of the Vision Transformers and its CNN-Transformer based\n Variants","summary":" Vision transformers have become popular as a possible substitute for\nconvolutional neural networks (CNNs) for a variety of computer vision\napplications. These transformers, with their ability to focus on global\nrelationships in images, offer large learning capacity. However, they may\nsuffer from limited generalization as they do not tend to model local\ncorrelation in images. Recently, hybridization of the convolution operation and\nthe self-attention mechanism has emerged in vision transformers to exploit\nboth local and global image representations. These hybrid vision\ntransformers, also referred to as CNN-Transformer architectures, have\ndemonstrated remarkable results in vision applications. Given the rapidly\ngrowing number of hybrid vision transformers, it has become necessary to\nprovide a taxonomy and explanation of these hybrid architectures. This survey\npresents a taxonomy of the recent vision transformer architectures and more\nspecifically that of the hybrid vision transformers. Additionally, the key\nfeatures of these architectures such as the attention mechanisms, positional\nembeddings, multi-scale processing, and convolution are also discussed. In\ncontrast to the previous survey papers that are primarily focused on individual\nvision transformer architectures or CNNs, this survey uniquely emphasizes the\nemerging trend of hybrid vision transformers. By showcasing the potential of\nhybrid vision transformers to deliver exceptional performance across a range of\ncomputer vision tasks, this survey sheds light on the future directions of this\nrapidly evolving architecture.\n","authors":["Asifullah Khan","Zunaira Rauf","Anabia Sohail","Abdul Rehman","Hifsa Asif","Aqsa Asif","Umair Farooq"],"pdf_url":"https://arxiv.org/pdf/2305.09880v3.pdf","comment":"Pages: 58, Figures: 14"},{"id":"http://arxiv.org/abs/2308.01006v3","updated":"2023-08-08T06:45:25Z","published":"2023-08-02T08:29:44Z","title":"FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of\n Autonomous Driving","summary":" Building a multi-modality multi-task neural network toward accurate and\nrobust performance is a de-facto standard in the perception task of autonomous\ndriving. However, leveraging such data from multiple sensors to jointly\noptimize the prediction and planning tasks remains largely unexplored. In this\npaper, we present FusionAD, to the best of our knowledge, the first unified\nframework that fuses the information from the two most critical sensors, camera\nand LiDAR, and goes beyond the perception task. Concretely, we first build a transformer\nbased multi-modality fusion network to effectively produce fusion based\nfeatures. In contrast to the camera-based end-to-end method UniAD, we then\nestablish fusion-aided modality-aware prediction and status-aware planning\nmodules, dubbed FMSPnP, that take advantage of multi-modality features. 
We\nconduct extensive experiments on the commonly used nuScenes benchmark dataset; our\nFusionAD achieves state-of-the-art performance, surpassing baselines by an\naverage of 15% on perception tasks like detection and tracking and by 10% on occupancy\nprediction accuracy, reducing the prediction error from 0.708 to 0.389 in ADE score\nand the collision rate from 0.31% to only 0.12%.\n","authors":["Tengju Ye","Wei Jing","Chunyong Hu","Shikun Huang","Lingping Gao","Fangzhen Li","Jingke Wang","Ke Guo","Wencong Xiao","Weibo Mao","Hang Zheng","Kun Li","Junbo Chen","Kaicheng Yu"],"pdf_url":"https://arxiv.org/pdf/2308.01006v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04079v1","updated":"2023-08-08T06:37:06Z","published":"2023-08-08T06:37:06Z","title":"3D Gaussian Splatting for Real-Time Radiance Field Rendering","summary":" Radiance Field methods have recently revolutionized novel-view synthesis of\nscenes captured with multiple photos or videos. However, achieving high visual\nquality still requires neural networks that are costly to train and render,\nwhile recent faster methods inevitably trade off speed for quality. For\nunbounded and complete scenes (rather than isolated objects) and 1080p\nresolution rendering, no current method can achieve real-time display rates. We\nintroduce three key elements that allow us to achieve state-of-the-art visual\nquality while maintaining competitive training times and importantly allow\nhigh-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution.\nFirst, starting from sparse points produced during camera calibration, we\nrepresent the scene with 3D Gaussians that preserve desirable properties of\ncontinuous volumetric radiance fields for scene optimization while avoiding\nunnecessary computation in empty space; Second, we perform interleaved\noptimization/density control of the 3D Gaussians, notably optimizing\nanisotropic covariance to achieve an accurate representation of the scene;\nThird, we develop a fast visibility-aware rendering algorithm that supports\nanisotropic splatting and both accelerates training and allows realtime\nrendering. We demonstrate state-of-the-art visual quality and real-time\nrendering on several established datasets.\n","authors":["Bernhard Kerbl","Georgios Kopanas","Thomas Leimkühler","George Drettakis"],"pdf_url":"https://arxiv.org/pdf/2308.04079v1.pdf","comment":"https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/"},{"id":"http://arxiv.org/abs/2308.04074v1","updated":"2023-08-08T06:16:37Z","published":"2023-08-08T06:16:37Z","title":"Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction\n on Monocular RGB Video","summary":" Reconstructing interacting hands from monocular RGB data is a challenging\ntask, as it involves many interfering factors, e.g. self- and mutual occlusion\nand similar textures. Previous works only leverage information from a single\nRGB image without modeling their physically plausible relation, which leads to\ninferior reconstruction results. In this work, we are dedicated to explicitly\nexploiting spatial-temporal information to achieve better interacting hand\nreconstruction. On one hand, we leverage temporal context to complement\ninsufficient information provided by the single frame, and design a novel\ntemporal framework with a temporal constraint for interacting hand motion\nsmoothness. On the other hand, we further propose an interpenetration detection\nmodule to produce kinetically plausible interacting hands without physical\ncollisions. 
Extensive experiments are performed to validate the effectiveness\nof our proposed framework, which achieves new state-of-the-art performance on\npublic benchmarks.\n","authors":["Weichao Zhao","Hezhen Hu","Wengang Zhou","Li li","Houqiang Li"],"pdf_url":"https://arxiv.org/pdf/2308.04074v1.pdf","comment":"16 pages"},{"id":"http://arxiv.org/abs/2308.04070v1","updated":"2023-08-08T06:07:49Z","published":"2023-08-08T06:07:49Z","title":"ConDistFL: Conditional Distillation for Federated Learning from\n Partially Annotated Data","summary":" Developing a generalized segmentation model capable of simultaneously\ndelineating multiple organs and diseases is highly desirable. Federated\nlearning (FL) is a key technology enabling the collaborative development of a\nmodel without exchanging training data. However, the limited access to fully\nannotated training data poses a major challenge to training generalizable\nmodels. We propose \"ConDistFL\", a framework to solve this problem by combining\nFL with knowledge distillation. Local models can extract the knowledge of\nunlabeled organs and tumors from partially annotated data from the global model\nwith an adequately designed conditional probability representation. We validate\nour framework on four distinct partially annotated abdominal CT datasets from\nthe MSD and KiTS19 challenges. The experimental results show that the proposed\nframework significantly outperforms FedAvg and FedOpt baselines. Moreover, the\nperformance on an external test dataset demonstrates superior generalizability\ncompared to models trained on each dataset separately. Our ablation study\nsuggests that ConDistFL can perform well without frequent aggregation, reducing\nthe communication cost of FL. Our implementation will be available at\nhttps://github.com/NVIDIA/NVFlare/tree/dev/research/condist-fl.\n","authors":["Pochuan Wang","Chen Shen","Weichung Wang","Masahiro Oda","Chiou-Shann Fuh","Kensaku Mori","Holger R. Roth"],"pdf_url":"https://arxiv.org/pdf/2308.04070v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.05785v2","updated":"2023-08-08T06:06:35Z","published":"2022-09-13T07:37:53Z","title":"Adversarial Coreset Selection for Efficient Robust Training","summary":" Neural networks are vulnerable to adversarial attacks: adding well-crafted,\nimperceptible perturbations to their input can modify their output. Adversarial\ntraining is one of the most effective approaches to training robust models\nagainst such attacks. Unfortunately, this method is much slower than vanilla\ntraining of neural networks since it needs to construct adversarial examples\nfor the entire training data at every iteration. By leveraging the theory of\ncoreset selection, we show how selecting a small subset of training data\nprovides a principled approach to reducing the time complexity of robust\ntraining. To this end, we first provide convergence guarantees for adversarial\ncoreset selection. In particular, we show that the convergence bound is\ndirectly related to how well our coresets can approximate the gradient computed\nover the entire training data. Motivated by our theoretical analysis, we\npropose using this gradient approximation error as our adversarial coreset\nselection objective to reduce the training set size effectively. Once built, we\nrun adversarial training over this subset of the training data. Unlike existing\nmethods, our approach can be adapted to a wide variety of training objectives,\nincluding TRADES, $\\ell_p$-PGD, and Perceptual Adversarial Training. 
We conduct\nextensive experiments to demonstrate that our approach speeds up adversarial\ntraining by 2-3 times while experiencing a slight degradation in the clean and\nrobust accuracy.\n","authors":["Hadi M. Dolatabadi","Sarah Erfani","Christopher Leckie"],"pdf_url":"https://arxiv.org/pdf/2209.05785v2.pdf","comment":"Accepted to the International Journal of Computer Vision (IJCV).\n Extended version of the ECCV2022 paper: arXiv:2112.00378. arXiv admin note:\n substantial text overlap with arXiv:2112.00378"},{"id":"http://arxiv.org/abs/2305.01160v3","updated":"2023-08-08T05:59:58Z","published":"2023-05-02T02:29:18Z","title":"Long-Tailed Recognition by Mutual Information Maximization between\n Latent Features and Ground-Truth Labels","summary":" Although contrastive learning methods have shown prevailing performance on a\nvariety of representation learning tasks, they encounter difficulty when the\ntraining dataset is long-tailed. Many researchers have combined contrastive\nlearning and a logit adjustment technique to address this problem, but the\ncombinations are done ad-hoc and a theoretical background has not yet been\nprovided. The goal of this paper is to provide the background and further\nimprove the performance. First, we show that the fundamental reason contrastive\nlearning methods struggle with long-tailed tasks is that they try to maximize\nthe mutual information maximization between latent features and input data. As\nground-truth labels are not considered in the maximization, they are not able\nto address imbalances between class labels. Rather, we interpret the\nlong-tailed recognition task as a mutual information maximization between\nlatent features and ground-truth labels. This approach integrates contrastive\nlearning and logit adjustment seamlessly to derive a loss function that shows\nstate-of-the-art performance on long-tailed recognition benchmarks. It also\ndemonstrates its efficacy in image segmentation tasks, verifying its\nversatility beyond image classification.\n","authors":["Min-Kook Suh","Seung-Woo Seo"],"pdf_url":"https://arxiv.org/pdf/2305.01160v3.pdf","comment":"ICML 2023 camera-ready"},{"id":"http://arxiv.org/abs/2308.03529v2","updated":"2023-08-08T05:29:57Z","published":"2023-08-07T12:26:34Z","title":"Feature Decoupling-Recycling Network for Fast Interactive Segmentation","summary":" Recent interactive segmentation methods iteratively take source image, user\nguidance and previously predicted mask as the input without considering the\ninvariant nature of the source image. As a result, extracting features from the\nsource image is repeated in each interaction, resulting in substantial\ncomputational redundancy. In this work, we propose the Feature\nDecoupling-Recycling Network (FDRN), which decouples the modeling components\nbased on their intrinsic discrepancies and then recycles components for each\nuser interaction. Thus, the efficiency of the whole interactive process can be\nsignificantly improved. To be specific, we apply the Decoupling-Recycling\nstrategy from three perspectives to address three types of discrepancies,\nrespectively. First, our model decouples the learning of source image semantics\nfrom the encoding of user guidance to process two types of input domains\nseparately. Second, FDRN decouples high-level and low-level features from\nstratified semantic representations to enhance feature learning. 
Third, during\nthe encoding of user guidance, current user guidance is decoupled from\nhistorical guidance to highlight the effect of current user guidance. We\nconduct extensive experiments on 6 datasets from different domains and\nmodalities, which demonstrate the following merits of our model: 1) superior\nefficiency than other methods, particularly advantageous in challenging\nscenarios requiring long-term interactions (up to 4.25x faster), while\nachieving favorable segmentation performance; 2) strong applicability to\nvarious methods serving as a universal enhancement technique; 3) well\ncross-task generalizability, e.g., to medical image segmentation, and\nrobustness against misleading user guidance.\n","authors":["Huimin Zeng","Weinong Wang","Xin Tao","Zhiwei Xiong","Yu-Wing Tai","Wenjie Pei"],"pdf_url":"https://arxiv.org/pdf/2308.03529v2.pdf","comment":"Accepted to ACM MM 2023"},{"id":"http://arxiv.org/abs/2308.04054v1","updated":"2023-08-08T05:29:26Z","published":"2023-08-08T05:29:26Z","title":"An Empirical Analysis of Range for 3D Object Detection","summary":" LiDAR-based 3D detection plays a vital role in autonomous navigation.\nSurprisingly, although autonomous vehicles (AVs) must detect both near-field\nobjects (for collision avoidance) and far-field objects (for longer-term\nplanning), contemporary benchmarks focus only on near-field 3D detection.\nHowever, AVs must detect far-field objects for safe navigation. In this paper,\nwe present an empirical analysis of far-field 3D detection using the long-range\ndetection dataset Argoverse 2.0 to better understand the problem, and share the\nfollowing insight: near-field LiDAR measurements are dense and optimally\nencoded by small voxels, while far-field measurements are sparse and are better\nencoded with large voxels. We exploit this observation to build a collection of\nrange experts tuned for near-vs-far field detection, and propose simple\ntechniques to efficiently ensemble models for long-range detection that improve\nefficiency by 33% and boost accuracy by 3.2% CDS.\n","authors":["Neehar Peri","Mengtian Li","Benjamin Wilson","Yu-Xiong Wang","James Hays","Deva Ramanan"],"pdf_url":"https://arxiv.org/pdf/2308.04054v1.pdf","comment":"Accepted to ICCV 2023 Workshop - Robustness and Reliability of\n Autonomous Vehicles in the Open-World"},{"id":"http://arxiv.org/abs/2308.03177v2","updated":"2023-08-08T05:26:45Z","published":"2023-08-06T18:07:45Z","title":"Boosting Few-shot 3D Point Cloud Segmentation via Query-Guided\n Enhancement","summary":" Although extensive research has been conducted on 3D point cloud\nsegmentation, effectively adapting generic models to novel categories remains a\nformidable challenge. This paper proposes a novel approach to improve point\ncloud few-shot segmentation (PC-FSS) models. Unlike existing PC-FSS methods\nthat directly utilize categorical information from support prototypes to\nrecognize novel classes in query samples, our method identifies two critical\naspects that substantially enhance model performance by reducing contextual\ngaps between support prototypes and query features. Specifically, we (1) adapt\nsupport background prototypes to match query context while removing extraneous\ncues that may obscure foreground and background in query samples, and (2)\nholistically rectify support prototypes under the guidance of query features to\nemulate the latter having no semantic gap to the query targets. 
Our proposed\ndesigns are agnostic to the feature extractor, rendering them readily\napplicable to any prototype-based methods. The experimental results on S3DIS\nand ScanNet demonstrate notable practical benefits, as our approach achieves\nsignificant improvements while still maintaining high efficiency. The code for\nour approach is available at\nhttps://github.com/AaronNZH/Boosting-Few-shot-3D-Point-Cloud-Segmentation-via-Query-Guided-Enhancement\n","authors":["Zhenhua Ning","Zhuotao Tian","Guangming Lu","Wenjie Pei"],"pdf_url":"https://arxiv.org/pdf/2308.03177v2.pdf","comment":"Accepted to ACM MM 2023"},{"id":"http://arxiv.org/abs/2308.04052v1","updated":"2023-08-08T05:16:51Z","published":"2023-08-08T05:16:51Z","title":"The Five-Dollar Model: Generating Game Maps and Sprites from Sentence\n Embeddings","summary":" The five-dollar model is a lightweight text-to-image generative architecture\nthat generates low dimensional images from an encoded text prompt. This model\ncan successfully generate accurate and aesthetically pleasing content in low\ndimensional domains, with limited amounts of training data. Despite the small\nsize of both the model and datasets, the generated images are still able to\nmaintain the encoded semantic meaning of the textual prompt. We apply this\nmodel to three small datasets: pixel art video game maps, video game sprite\nimages, and down-scaled emoji images and apply novel augmentation strategies to\nimprove the performance of our model on these limited datasets. We evaluate our\nmodels performance using cosine similarity score between text-image pairs\ngenerated by the CLIP VIT-B/32 model.\n","authors":["Timothy Merino","Roman Negri","Dipika Rajesh","M Charity","Julian Togelius"],"pdf_url":"https://arxiv.org/pdf/2308.04052v1.pdf","comment":"to be published in AIIDE 2023"},{"id":"http://arxiv.org/abs/2306.16670v3","updated":"2023-08-08T05:00:58Z","published":"2023-06-29T04:05:13Z","title":"End-to-End Learnable Multi-Scale Feature Compression for VCM","summary":" The proliferation of deep learning-based machine vision applications has\ngiven rise to a new type of compression, so called video coding for machine\n(VCM). VCM differs from traditional video coding in that it is optimized for\nmachine vision performance instead of human visual quality. In the feature\ncompression track of MPEG-VCM, multi-scale features extracted from images are\nsubject to compression. Recent feature compression works have demonstrated that\nthe versatile video coding (VVC) standard-based approach can achieve a BD-rate\nreduction of up to 96% against MPEG-VCM feature anchor. However, it is still\nsub-optimal as VVC was not designed for extracted features but for natural\nimages. Moreover, the high encoding complexity of VVC makes it difficult to\ndesign a lightweight encoder without sacrificing performance. To address these\nchallenges, we propose a novel multi-scale feature compression method that\nenables both the end-to-end optimization on the extracted features and the\ndesign of lightweight encoders. The proposed model combines a learnable\ncompressor with a multi-scale feature fusion network so that the redundancy in\nthe multi-scale features is effectively removed. Instead of simply cascading\nthe fusion network and the compression network, we integrate the fusion and\nencoding processes in an interleaved way. Our model first encodes a\nlarger-scale feature to obtain a latent representation and then fuses the\nlatent with a smaller-scale feature. 
This process is successively performed\nuntil the smallest-scale feature is fused and then the encoded latent at the\nfinal stage is entropy-coded for transmission. The results show that our model\noutperforms previous approaches by at least 52% BD-rate reduction and has\n$\\times5$ to $\\times27$ times less encoding time for object detection...\n","authors":["Yeongwoong Kim","Hyewon Jeong","Janghyun Yu","Younhee Kim","Jooyoung Lee","Se Yoon Jeong","Hui Yong Kim"],"pdf_url":"https://arxiv.org/pdf/2306.16670v3.pdf","comment":"13 pages, accepted by IEEE Transactions on Circuits and Systems for\n Video Technology"},{"id":"http://arxiv.org/abs/2308.04047v1","updated":"2023-08-08T04:53:52Z","published":"2023-08-08T04:53:52Z","title":"SODFormer: Streaming Object Detection with Transformer Using Events and\n Frames","summary":" DAVIS camera, streaming two complementary sensing modalities of asynchronous\nevents and frames, has gradually been used to address major object detection\nchallenges (e.g., fast motion blur and low-light). However, how to effectively\nleverage rich temporal cues and fuse two heterogeneous visual streams remains a\nchallenging endeavor. To address this challenge, we propose a novel streaming\nobject detector with Transformer, namely SODFormer, which first integrates\nevents and frames to continuously detect objects in an asynchronous manner.\nTechnically, we first build a large-scale multimodal neuromorphic object\ndetection dataset (i.e., PKU-DAVIS-SOD) over 1080.1k manual labels. Then, we\ndesign a spatiotemporal Transformer architecture to detect objects via an\nend-to-end sequence prediction problem, where the novel temporal Transformer\nmodule leverages rich temporal cues from two visual streams to improve the\ndetection performance. Finally, an asynchronous attention-based fusion module\nis proposed to integrate two heterogeneous sensing modalities and take\ncomplementary advantages from each end, which can be queried at any time to\nlocate objects and break through the limited output frequency from synchronized\nframe-based fusion strategies. The results show that the proposed SODFormer\noutperforms four state-of-the-art methods and our eight baselines by a\nsignificant margin. We also show that our unifying framework works well even in\ncases where the conventional frame-based camera fails, e.g., high-speed motion\nand low-light conditions. Our dataset and code can be available at\nhttps://github.com/dianzl/SODFormer.\n","authors":["Dianze Li","Jianing Li","Yonghong Tian"],"pdf_url":"https://arxiv.org/pdf/2308.04047v1.pdf","comment":"18 pages, 15 figures, in IEEE Transactions on Pattern Analysis and\n Machine Intelligence"},{"id":"http://arxiv.org/abs/2308.04039v1","updated":"2023-08-08T04:30:42Z","published":"2023-08-08T04:30:42Z","title":"Implicit neural representations for joint decomposition and registration\n of gene expression images in the marmoset brain","summary":" We propose a novel image registration method based on implicit neural\nrepresentations that addresses the challenging problem of registering a pair of\nbrain images with similar anatomical structures, but where one image contains\nadditional features or artifacts that are not present in the other image. To\ndemonstrate its effectiveness, we use 2D microscopy $\\textit{in situ}$\nhybridization gene expression images of the marmoset brain. 
Accurately\nquantifying gene expression requires image registration to a brain template,\nwhich is difficult due to the diversity of patterns causing variations in\nvisible anatomical brain structures. Our approach uses implicit networks in\ncombination with an image exclusion loss to jointly perform the registration\nand decompose the image into a support and residual image. The support image\naligns well with the template, while the residual image captures individual\nimage characteristics that diverge from the template. In experiments, our\nmethod provided excellent results and outperformed other registration\ntechniques.\n","authors":["Michal Byra","Charissa Poon","Tomomi Shimogori","Henrik Skibbe"],"pdf_url":"https://arxiv.org/pdf/2308.04039v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2201.01615v3","updated":"2023-08-08T04:17:55Z","published":"2022-01-05T13:51:20Z","title":"Lawin Transformer: Improving New-Era Vision Backbones with Multi-Scale\n Representations for Semantic Segmentation","summary":" The multi-level aggregation (MLA) module has emerged as a critical component\nfor advancing new-era vision back-bones in semantic segmentation. In this\npaper, we propose Lawin (large window) Transformer, a novel MLA architecture\nthat creatively utilizes multi-scale feature maps from the vision backbone. At\nthe core of Lawin Transformer is the Lawin attention, a newly designed window\nattention mechanism capable of querying much larger context windows than local\nwindows. We focus on studying the efficient and simplistic application of the\nlarge-window paradigm, allowing for flexible regulation of the ratio of large\ncontext to query and capturing multi-scale representations. We validate the\neffectiveness of Lawin Transformer on Cityscapes and ADE20K, consistently\ndemonstrating great superiority to widely-used MLA modules when combined with\nnew-era vision backbones. The code is available at\nhttps://github.com/yan-hao-tian/lawin.\n","authors":["Haotian Yan","Chuang Zhang","Ming Wu"],"pdf_url":"https://arxiv.org/pdf/2201.01615v3.pdf","comment":"The latest version has really big differences from the original\n version, which may make the reader confused. We will submit the latest\n version as another article"},{"id":"http://arxiv.org/abs/2308.03698v2","updated":"2023-08-08T03:40:53Z","published":"2023-08-07T16:14:27Z","title":"Screen-based 3D Subjective Experiment Software","summary":" Recently, widespread 3D graphics (e.g., point clouds and meshes) have drawn\nconsiderable efforts from academia and industry to assess their perceptual\nquality by conducting subjective experiments. However, lacking a handy software\nfor 3D subjective experiments complicates the construction of 3D graphics\nquality assessment datasets, thus hindering the prosperity of relevant fields.\nIn this paper, we develop a powerful platform with which users can flexibly\ndesign their 3D subjective methodologies and build high-quality datasets,\neasing a broad spectrum of 3D graphics subjective quality study. To accurately\nillustrate the perceptual quality differences of 3D stimuli, our software can\nsimultaneously render the source stimulus and impaired stimulus and allows both\nstimuli to respond synchronously to viewer interactions. Compared with amateur\n3D visualization tool-based or image/video rendering-based schemes, our\napproach embodies typical 3D applications while minimizing cognitive overload\nduring subjective experiments. 
We organized a subjective experiment involving\n40 participants to verify the validity of the proposed software. Experimental\nanalyses demonstrate that subjective tests on our software can produce\nreasonable subjective quality scores of 3D models. All resources in this paper\ncan be found at https://openi.pcl.ac.cn/OpenDatasets/3DQA.\n","authors":["Songlin Fan","Wei Gao"],"pdf_url":"https://arxiv.org/pdf/2308.03698v2.pdf","comment":"Accepted to ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2308.04020v1","updated":"2023-08-08T03:34:04Z","published":"2023-08-08T03:34:04Z","title":"Synthetic Augmentation with Large-scale Unconditional Pre-training","summary":" Deep learning based medical image recognition systems often require a\nsubstantial amount of training data with expert annotations, which can be\nexpensive and time-consuming to obtain. Recently, synthetic augmentation\ntechniques have been proposed to mitigate the issue by generating realistic\nimages conditioned on class labels. However, the effectiveness of these methods\nheavily depends on the representation capability of the trained generative\nmodel, which cannot be guaranteed without sufficient labeled training data. To\nfurther reduce the dependency on annotated data, we propose a synthetic\naugmentation method called HistoDiffusion, which can be pre-trained on\nlarge-scale unlabeled datasets and later applied to a small-scale labeled\ndataset for augmented training. In particular, we train a latent diffusion\nmodel (LDM) on diverse unlabeled datasets to learn common features and generate\nrealistic images without conditional inputs. Then, we fine-tune the model with\nclassifier guidance in latent space on an unseen labeled dataset so that the\nmodel can synthesize images of specific categories. Additionally, we adopt a\nselective mechanism to only add synthetic samples with high confidence of\nmatching to target labels. We evaluate our proposed method by pre-training on\nthree histopathology datasets and testing on a histopathology dataset of\ncolorectal cancer (CRC) excluded from the pre-training datasets. With\nHistoDiffusion augmentation, the classification accuracy of a backbone\nclassifier is remarkably improved by 6.4% using a small set of the original\nlabels. Our code is available at https://github.com/karenyyy/HistoDiffAug.\n","authors":["Jiarong Ye","Haomiao Ni","Peng Jin","Sharon X. Huang","Yuan Xue"],"pdf_url":"https://arxiv.org/pdf/2308.04020v1.pdf","comment":"MICCAI 2023"},{"id":"http://arxiv.org/abs/2308.04016v1","updated":"2023-08-08T03:24:21Z","published":"2023-08-08T03:24:21Z","title":"Hierarchical Visual Primitive Experts for Compositional Zero-Shot\n Learning","summary":" Compositional zero-shot learning (CZSL) aims to recognize unseen compositions\nwith prior knowledge of known primitives (attribute and object). Previous works\nfor CZSL often suffer from grasping the contextuality between attribute and\nobject, as well as the discriminability of visual features, and the long-tailed\ndistribution of real-world compositional data. We propose a simple and scalable\nframework called Composition Transformer (CoT) to address these issues. CoT\nemploys object and attribute experts in distinctive manners to generate\nrepresentative embeddings, using the visual network hierarchically. 
The object\nexpert extracts representative object embeddings from the final layer in a\nbottom-up manner, while the attribute expert makes attribute embeddings in a\ntop-down manner with a proposed object-guided attention module that models\ncontextuality explicitly. To remedy biased prediction caused by imbalanced data\ndistribution, we develop a simple minority attribute augmentation (MAA) that\nsynthesizes virtual samples by mixing two images and oversampling minority\nattribute classes. Our method achieves SoTA performance on several benchmarks,\nincluding MIT-States, C-GQA, and VAW-CZSL. We also demonstrate the\neffectiveness of CoT in improving visual discrimination and addressing the\nmodel bias from the imbalanced data distribution. The code is available at\nhttps://github.com/HanjaeKim98/CoT.\n","authors":["Hanjae Kim","Jiyoung Lee","Seongheon Park","Kwanghoon Sohn"],"pdf_url":"https://arxiv.org/pdf/2308.04016v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2308.04008v1","updated":"2023-08-08T03:06:10Z","published":"2023-08-08T03:06:10Z","title":"Coarse-to-Fine: Learning Compact Discriminative Representation for\n Single-Stage Image Retrieval","summary":" Image retrieval targets to find images from a database that are visually\nsimilar to the query image. Two-stage methods following retrieve-and-rerank\nparadigm have achieved excellent performance, but their separate local and\nglobal modules are inefficient to real-world applications. To better trade-off\nretrieval efficiency and accuracy, some approaches fuse global and local\nfeature into a joint representation to perform single-stage image retrieval.\nHowever, they are still challenging due to various situations to tackle,\n$e.g.$, background, occlusion and viewpoint. In this work, we design a\nCoarse-to-Fine framework to learn Compact Discriminative representation (CFCD)\nfor end-to-end single-stage image retrieval-requiring only image-level labels.\nSpecifically, we first design a novel adaptive softmax-based loss which\ndynamically tunes its scale and margin within each mini-batch and increases\nthem progressively to strengthen supervision during training and intra-class\ncompactness. Furthermore, we propose a mechanism which attentively selects\nprominent local descriptors and infuse fine-grained semantic relations into the\nglobal representation by a hard negative sampling strategy to optimize\ninter-class distinctiveness at a global scale. Extensive experimental results\nhave demonstrated the effectiveness of our method, which achieves\nstate-of-the-art single-stage image retrieval performance on benchmarks such as\nRevisited Oxford and Revisited Paris. Code is available at\nhttps://github.com/bassyess/CFCD.\n","authors":["Yunquan Zhu","Xinkai Gao","Bo Ke","Ruizhi Qiao","Xing Sun"],"pdf_url":"https://arxiv.org/pdf/2308.04008v1.pdf","comment":"Accepted to ICCV 2023"},{"id":"http://arxiv.org/abs/2308.04005v1","updated":"2023-08-08T02:48:46Z","published":"2023-08-08T02:48:46Z","title":"Few-shot medical image classification with simple shape and texture text\n descriptors using vision-language models","summary":" In this work, we investigate the usefulness of vision-language models (VLMs)\nand large language models for binary few-shot classification of medical images.\nWe utilize the GPT-4 model to generate text descriptors that encapsulate the\nshape and texture characteristics of objects in medical images. 
Subsequently,\nthese GPT-4 generated descriptors, alongside VLMs pre-trained on natural\nimages, are employed to classify chest X-rays and breast ultrasound images. Our\nresults indicate that few-shot classification of medical images using VLMs and\nGPT-4 generated descriptors is a viable approach. However, accurate\nclassification requires excluding certain descriptors from the calculation of\nthe classification scores. Moreover, we assess the ability of VLMs to evaluate\nshape features in breast mass ultrasound images. We further investigate the\ndegree of variability among the sets of text descriptors produced by GPT-4. Our\nwork provides several important insights about the application of VLMs for\nmedical image analysis.\n","authors":["Michal Byra","Muhammad Febrian Rachmadi","Henrik Skibbe"],"pdf_url":"https://arxiv.org/pdf/2308.04005v1.pdf","comment":"13 pages, 5 figures"},{"id":"http://arxiv.org/abs/2305.10044v3","updated":"2023-08-08T02:40:05Z","published":"2023-05-17T08:37:26Z","title":"Two-Stream Regression Network for Dental Implant Position Prediction","summary":" In implant prosthesis treatment, the design of the surgical guide heavily\nrelies on the manual location of the implant position, which is subjective and\ndepends on the doctor's experience. Although deep learning based methods have started to\nbe applied to address this problem, the spacing between teeth varies, and\nsome gaps may present texture characteristics similar to those of the actual\nimplant region. Both problems pose a major challenge for implant position\nprediction. In this paper, we develop a two-stream implant position regression\nframework (TSIPR), which consists of an implant region detector (IRD) and a\nmulti-scale patch embedding regression network (MSPENet), to address this\nissue. For the training of IRD, we extend the original annotation to provide\nadditional supervisory information, which contains much richer\ncharacteristics and does not introduce extra labeling costs. A multi-scale patch\nembedding module is designed for the MSPENet to adaptively extract features\nfrom the images with various tooth spacing. The global-local feature\ninteraction block is designed to build the encoder of MSPENet, which combines\nthe transformer and convolution for enriched feature representation. During\ninference, the RoI mask extracted from the IRD is used to refine the prediction\nresults of the MSPENet. Extensive experiments on a dental implant dataset\nthrough five-fold cross-validation demonstrated that the proposed TSIPR\nachieves superior performance compared to existing methods.\n","authors":["Xinquan Yang","Xuguang Li","Xuechen Li","Wenting Chen","Linlin Shen","Xin Li","Yongqiang Deng"],"pdf_url":"https://arxiv.org/pdf/2305.10044v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12676v3","updated":"2023-08-08T02:32:24Z","published":"2023-07-24T10:30:54Z","title":"Damage Vision Mining Opportunity for Imbalanced Anomaly Detection","summary":" In the past decade, balanced datasets have been used to advance\nalgorithms for classification, object detection, semantic segmentation, and\nanomaly detection in industrial applications. Specifically, for condition-based\nmaintenance, automating visual inspection is crucial to ensure high quality.\nDeterioration prognosis attempts to optimize the fine decision process for\npredictive maintenance and proactive repair. 
In civil infrastructure and living\nenvironments, damage data mining cannot avoid the imbalanced data issue because\nof rare unseen events and the high-quality status achieved by improved operations. For\nvisual inspection, deteriorated classes acquired from the surfaces of concrete and\nsteel components are occasionally imbalanced. From numerous related surveys, we\nsummarize that imbalanced data problems can be categorized into four types: 1)\nmissing range of target and label variables, 2) majority-minority class\nimbalance, 3) foreground-background spatial imbalance, and 4) long-tailed\npixel-wise class imbalance. Since 2015, there have been many imbalanced-data studies\nusing deep learning approaches, including regression, image classification,\nobject detection, and semantic segmentation. However, anomaly detection for\nimbalanced data is not yet well known. In this study, we highlight one-class\nanomaly detection applications that decide whether a sample belongs to the\nanomalous class or not, and demonstrate\nclear examples on imbalanced vision datasets: blood smear, lung infection,\nhazardous driving, wooden, concrete deterioration, river sludge, and disaster\ndamage. As illustrated in Fig.1, we provide key results on the advantage of damage vision\nmining, hypothesizing that the more effective the range of the positive ratio, the\nhigher the accuracy gain of the anomaly detection application. In our imbalanced\nstudies, compared with the balanced case of positive ratio 1/1, we find that\nthere is an applicable range of positive ratios where the accuracy is consistently high.\n","authors":["Takato Yasuno"],"pdf_url":"https://arxiv.org/pdf/2307.12676v3.pdf","comment":"21 pages, 29 figures, 18 tables"},{"id":"http://arxiv.org/abs/2308.02494v2","updated":"2023-08-08T02:32:04Z","published":"2023-07-16T19:36:19Z","title":"Adaptively Placed Multi-Grid Scene Representation Networks for\n Large-Scale Data Visualization","summary":" Scene representation networks (SRNs) have been recently proposed for\ncompression and visualization of scientific data. However, state-of-the-art\nSRNs do not adapt the allocation of available network parameters to the complex\nfeatures found in scientific data, leading to a loss in reconstruction quality.\nWe address this shortcoming with an adaptively placed multi-grid SRN (APMGSRN)\nand propose a domain decomposition training and inference technique for\naccelerated parallel training on multi-GPU systems. We also release an\nopen-source neural volume rendering application that allows plug-and-play\nrendering with any PyTorch-based SRN. Our proposed APMGSRN architecture uses\nmultiple spatially adaptive feature grids that learn where to be placed within\nthe domain to dynamically allocate more neural network resources where error is\nhigh in the volume, improving state-of-the-art reconstruction accuracy of SRNs\nfor scientific data without requiring expensive octree refining, pruning, and\ntraversal like previous adaptive models. In our domain decomposition approach\nfor representing large-scale data, we train a set of APMGSRNs in parallel on\nseparate bricks of the volume to reduce training time while avoiding the overhead\nnecessary for an out-of-core solution for volumes too large to fit in GPU\nmemory. After training, the lightweight SRNs are used for realtime neural\nvolume rendering in our open-source renderer, where arbitrary view angles and\ntransfer functions can be explored. 
A copy of this paper, all code, all models\nused in our experiments, and all supplemental materials and videos are\navailable at https://github.com/skywolf829/APMGSRN.\n","authors":["Skylar Wolfgang Wurster","Tianyu Xiong","Han-Wei Shen","Hanqi Guo","Tom Peterka"],"pdf_url":"https://arxiv.org/pdf/2308.02494v2.pdf","comment":"Accepted to IEEE VIS 2023"},{"id":"http://arxiv.org/abs/2308.03999v1","updated":"2023-08-08T02:28:50Z","published":"2023-08-08T02:28:50Z","title":"Understanding CNN Hidden Neuron Activations using Structured Background\n Knowledge and Deductive Reasoning","summary":" A major challenge in Explainable AI is in correctly interpreting activations\nof hidden neurons: accurate interpretations would provide insights into the\nquestion of what a deep learning system has internally detected as relevant on\nthe input, de-mystifying the otherwise black-box character of deep learning\nsystems. The state of the art indicates that hidden node activations can, in\nsome cases, be interpretable in a way that makes sense to humans, but\nsystematic automated methods that would be able to hypothesize and verify\ninterpretations of hidden neuron activations are underexplored. In this paper,\nwe provide such a method and demonstrate that it provides meaningful\ninterpretations. Our approach is based on using large-scale background\nknowledge approximately 2 million classes curated from the Wikipedia concept\nhierarchy together with a symbolic reasoning approach called Concept Induction\nbased on description logics, originally developed for applications in the\nSemantic Web field. Our results show that we can automatically attach\nmeaningful labels from the background knowledge to individual neurons in the\ndense layer of a Convolutional Neural Network through a hypothesis and\nverification process\n","authors":["Abhilekha Dalal","Md Kamruzzaman Sarker","Adrita Barua","Eugene Vasserman","Pascal Hitzler"],"pdf_url":"https://arxiv.org/pdf/2308.03999v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03998v1","updated":"2023-08-08T02:28:48Z","published":"2023-08-08T02:28:48Z","title":"Real-time Strawberry Detection Based on Improved YOLOv5s Architecture\n for Robotic Harvesting in open-field environment","summary":" This study proposed a YOLOv5-based custom object detection model to detect\nstrawberries in an outdoor environment. The original architecture of the\nYOLOv5s was modified by replacing the C3 module with the C2f module in the\nbackbone network, which provided a better feature gradient flow. Secondly, the\nSpatial Pyramid Pooling Fast in the final layer of the backbone network of\nYOLOv5s was combined with Cross Stage Partial Net to improve the generalization\nability over the strawberry dataset in this study. The proposed architecture\nwas named YOLOv5s-Straw. The RGB images dataset of the strawberry canopy with\nthree maturity classes (immature, nearly mature, and mature) was collected in\nopen-field environment and augmented through a series of operations including\nbrightness reduction, brightness increase, and noise adding. To verify the\nsuperiority of the proposed method for strawberry detection in open-field\nenvironment, four competitive detection models (YOLOv3-tiny, YOLOv5s,\nYOLOv5s-C2f, and YOLOv8s) were trained, and tested under the same computational\nenvironment and compared with YOLOv5s-Straw. 
The results showed that the\nhighest mean average precision of 80.3% was achieved using the proposed\narchitecture, whereas YOLOv3-tiny, YOLOv5s,\nYOLOv5s-C2f, and YOLOv8s achieved 73.4%, 77.8%, 79.8%, and 79.3%, respectively.\nSpecifically, the average precision of YOLOv5s-Straw was 82.1% in the immature\nclass, 73.5% in the nearly mature class, and 86.6% in the mature class, which\nwere 2.3% and 3.7% higher, respectively, than those of the latest YOLOv8s. The\nmodel included 8.6*10^6 network parameters with an inference speed of 18ms per\nimage, while YOLOv8s had a slower inference speed of\n21.0ms and more parameters (11.1*10^6), which indicates that the proposed\nmodel is fast enough for real-time strawberry detection and localization for\nrobotic picking.\n","authors":["Zixuan He","Salik Ram Khana","Xin Zhang","Manoj Karkee","Qin Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03998v1.pdf","comment":"20 pages; 15 figures"},{"id":"http://arxiv.org/abs/2307.02227v2","updated":"2023-08-08T02:19:48Z","published":"2023-07-05T12:08:56Z","title":"MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic\n Facial Expression Recognition","summary":" Dynamic facial expression recognition (DFER) is essential to the development\nof intelligent and empathetic machines. Prior efforts in this field mainly fall\ninto the supervised learning paradigm, which is severely restricted by the limited\nlabeled data in existing datasets. Inspired by the recent unprecedented success of\nmasked autoencoders (e.g., VideoMAE), this paper proposes MAE-DFER, a novel\nself-supervised method which leverages large-scale self-supervised pre-training\non abundant unlabeled data to largely advance the development of DFER. Since\nthe vanilla Vision Transformer (ViT) employed in VideoMAE requires substantial\ncomputation during fine-tuning, MAE-DFER develops an efficient local-global\ninteraction Transformer (LGI-Former) as the encoder. Moreover, in addition to\nthe standalone appearance content reconstruction in VideoMAE, MAE-DFER also\nintroduces explicit temporal facial motion modeling to encourage LGI-Former to\nexcavate both static appearance and dynamic motion information. Extensive\nexperiments on six datasets show that MAE-DFER consistently outperforms\nstate-of-the-art supervised methods by significant margins (e.g., +6.30\\% UAR\non DFEW and +8.34\\% UAR on MAFW), verifying that it can learn powerful dynamic\nfacial representations via large-scale self-supervised pre-training. Besides,\nit has comparable or even better performance than VideoMAE, while largely\nreducing the computational cost (about 38\\% FLOPs). We believe MAE-DFER has\npaved a new way for the advancement of DFER and can inspire more relevant\nresearch in this field and even other related tasks. Codes and models are\npublicly available at https://github.com/sunlicai/MAE-DFER.\n","authors":["Licai Sun","Zheng Lian","Bin Liu","Jianhua Tao"],"pdf_url":"https://arxiv.org/pdf/2307.02227v2.pdf","comment":"ACM MM 2023 (camera ready). Codes and models are publicly available\n at https://github.com/sunlicai/MAE-DFER"},{"id":"http://arxiv.org/abs/2308.03982v1","updated":"2023-08-08T01:59:20Z","published":"2023-08-08T01:59:20Z","title":"PARTNER: Level up the Polar Representation for LiDAR 3D Object Detection","summary":" Recently, polar-based representation has shown promising properties in\nperceptual tasks. 
In addition to Cartesian-based approaches, which separate\npoint clouds unevenly, representing point clouds as polar grids has been\nrecognized as an alternative due to (1) its advantage in robust performance\nunder different resolutions and (2) its superiority in streaming-based\napproaches. However, state-of-the-art polar-based detection methods inevitably\nsuffer from the feature distortion problem because of the non-uniform division\nof polar representation, resulting in a non-negligible performance gap compared\nto Cartesian-based approaches. To tackle this issue, we present PARTNER, a\nnovel 3D object detector in the polar coordinate. PARTNER alleviates the\ndilemma of feature distortion with global representation re-alignment and\nfacilitates the regression by introducing instance-level geometric information\ninto the detection head. Extensive experiments show overwhelming advantages in\nstreaming-based detection and different resolutions. Furthermore, our method\noutperforms the previous polar-based works with remarkable margins of 3.68% and\n9.15% on Waymo and ONCE validation set, thus achieving competitive results over\nthe state-of-the-art methods.\n","authors":["Ming Nie","Yujing Xue","Chunwei Wang","Chaoqiang Ye","Hang Xu","Xinge Zhu","Qingqiu Huang","Michael Bi Mi","Xinchao Wang","Li Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03982v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2308.03979v1","updated":"2023-08-08T01:55:44Z","published":"2023-08-08T01:55:44Z","title":"PAIF: Perception-Aware Infrared-Visible Image Fusion for Attack-Tolerant\n Semantic Segmentation","summary":" Infrared and visible image fusion is a powerful technique that combines\ncomplementary information from different modalities for downstream semantic\nperception tasks. Existing learning-based methods show remarkable performance,\nbut are suffering from the inherent vulnerability of adversarial attacks,\ncausing a significant decrease in accuracy. In this work, a perception-aware\nfusion framework is proposed to promote segmentation robustness in adversarial\nscenes. We first conduct systematic analyses about the components of image\nfusion, investigating the correlation with segmentation robustness under\nadversarial perturbations. Based on these analyses, we propose a harmonized\narchitecture search with a decomposition-based structure to balance standard\naccuracy and robustness. We also propose an adaptive learning strategy to\nimprove the parameter robustness of image fusion, which can learn effective\nfeature extraction under diverse adversarial perturbations. Thus, the goals of\nimage fusion (\\textit{i.e.,} extracting complementary features from source\nmodalities and defending attack) can be realized from the perspectives of\narchitectural and learning strategies. Extensive experimental results\ndemonstrate that our scheme substantially enhances the robustness, with gains\nof 15.3% mIOU of segmentation in the adversarial scene, compared with advanced\ncompetitors. 
The source codes are available at\nhttps://github.com/LiuZhu-CV/PAIF.\n","authors":["Zhu Liu","Jinyuan Liu","Benzhuang Zhang","Long Ma","Xin Fan","Risheng Liu"],"pdf_url":"https://arxiv.org/pdf/2308.03979v1.pdf","comment":"Accepted by ACM MM'2023;The source codes are available at\n https://github.com/LiuZhu-CV/PAIF"},{"id":"http://arxiv.org/abs/2308.03276v2","updated":"2023-08-08T01:55:32Z","published":"2023-08-07T03:35:47Z","title":"Spatialyze: A Geospatial Video Analytics System with Spatial-Aware\n Optimizations","summary":" Videos that are shot using commodity hardware such as phones and surveillance\ncameras record various metadata such as time and location. We encounter such\ngeospatial videos on a daily basis and such videos have been growing in volume\nsignificantly. Yet, we do not have data management systems that allow users to\ninteract with such data effectively.\n In this paper, we describe Spatialyze, a new framework for end-to-end\nquerying of geospatial videos. Spatialyze comes with a domain-specific language\nwhere users can construct geospatial video analytic workflows using a 3-step,\ndeclarative, build-filter-observe paradigm. Internally, Spatialyze leverages\nthe declarative nature of such workflows, the temporal-spatial metadata stored\nwith videos, and physical behavior of real-world objects to optimize the\nexecution of workflows. Our results using real-world videos and workflows show\nthat Spatialyze can reduce execution time by up to 5.3x, while maintaining up\nto 97.1% accuracy compared to unoptimized execution.\n","authors":["Chanwut Kittivorawong","Yongming Ge","Yousef Helal","Alvin Cheung"],"pdf_url":"https://arxiv.org/pdf/2308.03276v2.pdf","comment":"GitHub Repository: https://github.com/apperception-db/spatialyze"},{"id":"http://arxiv.org/abs/2301.01635v3","updated":"2023-08-08T01:45:37Z","published":"2023-01-04T14:20:14Z","title":"SPTS v2: Single-Point Scene Text Spotting","summary":" End-to-end scene text spotting has made significant progress due to its\nintrinsic synergy between text detection and recognition. Previous methods\ncommonly regard manual annotations such as horizontal rectangles, rotated\nrectangles, quadrangles, and polygons as a prerequisite, which are much more\nexpensive than using single-point. Our new framework, SPTS v2, allows us to\ntrain high-performing text-spotting models using a single-point annotation.\nSPTS v2 reserves the advantage of the auto-regressive Transformer with an\nInstance Assignment Decoder (IAD) through sequentially predicting the center\npoints of all text instances inside the same predicting sequence, while with a\nParallel Recognition Decoder (PRD) for text recognition in parallel. These two\ndecoders share the same parameters and are interactively connected with a\nsimple but effective information transmission process to pass the gradient and\ninformation. Comprehensive experiments on various existing benchmark datasets\ndemonstrate the SPTS v2 can outperform previous state-of-the-art single-point\ntext spotters with fewer parameters while achieving 19$\\times$ faster inference\nspeed. Within the context of our SPTS v2 framework, our experiments suggest a\npotential preference for single-point representation in scene text spotting\nwhen compared to other representations. Such an attempt provides a significant\nopportunity for scene text spotting applications beyond the realms of existing\nparadigms. 
Code is available at https://github.com/Yuliang-Liu/SPTSv2.\n","authors":["Yuliang Liu","Jiaxin Zhang","Dezhi Peng","Mingxin Huang","Xinyu Wang","Jingqun Tang","Can Huang","Dahua Lin","Chunhua Shen","Xiang Bai","Lianwen Jin"],"pdf_url":"https://arxiv.org/pdf/2301.01635v3.pdf","comment":"arXiv admin note: text overlap with arXiv:2112.07917"},{"id":"http://arxiv.org/abs/2307.12450v2","updated":"2023-08-08T01:42:17Z","published":"2023-07-23T22:48:07Z","title":"ProtoFL: Unsupervised Federated Learning via Prototypical Distillation","summary":" Federated learning (FL) is a promising approach for enhancing data privacy\npreservation, particularly for authentication systems. However, limited round\ncommunications, scarce representation, and scalability pose significant\nchallenges to its deployment, hindering its full potential. In this paper, we\npropose 'ProtoFL', Prototypical Representation Distillation based unsupervised\nFederated Learning to enhance the representation power of a global model and\nreduce round communication costs. Additionally, we introduce a local one-class\nclassifier based on normalizing flows to improve performance with limited data.\nOur study represents the first investigation of using FL to improve one-class\nclassification performance. We conduct extensive experiments on five widely\nused benchmarks, namely MNIST, CIFAR-10, CIFAR-100, ImageNet-30, and\nKeystroke-Dynamics, to demonstrate the superior performance of our proposed\nframework over previous methods in the literature.\n","authors":["Hansol Kim","Youngjun Kwak","Minyoung Jung","Jinho Shin","Youngsung Kim","Changick Kim"],"pdf_url":"https://arxiv.org/pdf/2307.12450v2.pdf","comment":"Accepted by ICCV 2023. Hansol Kim and Youngjun Kwak contributed\n equally to this work"},{"id":"http://arxiv.org/abs/2308.03286v2","updated":"2023-08-08T01:34:30Z","published":"2023-08-07T04:04:22Z","title":"Multi-Label Self-Supervised Learning with Scene Images","summary":" Self-supervised learning (SSL) methods targeting scene images have seen a\nrapid growth recently, and they mostly rely on either a dedicated dense\nmatching mechanism or a costly unsupervised object discovery module. This paper\nshows that instead of hinging on these strenuous operations, quality image\nrepresentations can be learned by treating scene/multi-label image SSL simply\nas a multi-label classification problem, which greatly simplifies the learning\nframework. Specifically, multiple binary pseudo-labels are assigned for each\ninput image by comparing its embeddings with those in two dictionaries, and the\nnetwork is optimized using the binary cross entropy loss. The proposed method\nis named Multi-Label Self-supervised learning (MLS). Visualizations\nqualitatively show that clearly the pseudo-labels by MLS can automatically find\nsemantically similar pseudo-positive pairs across different images to\nfacilitate contrastive learning. MLS learns high quality representations on\nMS-COCO and achieves state-of-the-art results on classification, detection and\nsegmentation benchmarks. 
At the same time, MLS is much simpler than existing\nmethods, making it easier to deploy and for further exploration.\n","authors":["Ke Zhu","Minghao Fu","Jianxin Wu"],"pdf_url":"https://arxiv.org/pdf/2308.03286v2.pdf","comment":"ICCV2023"},{"id":"http://arxiv.org/abs/2308.03977v1","updated":"2023-08-08T01:33:13Z","published":"2023-08-08T01:33:13Z","title":"PUG: Photorealistic and Semantically Controllable Synthetic Data for\n Representation Learning","summary":" Synthetic image datasets offer unmatched advantages for designing and\nevaluating deep neural networks: they make it possible to (i) render as many\ndata samples as needed, (ii) precisely control each scene and yield granular\nground truth labels (and captions), (iii) precisely control distribution shifts\nbetween training and testing to isolate variables of interest for sound\nexperimentation. Despite such promise, the use of synthetic image data is still\nlimited -- and often played down -- mainly due to their lack of realism. Most\nworks therefore rely on datasets of real images, which have often been scraped\nfrom public images on the internet, and may have issues with regards to\nprivacy, bias, and copyright, while offering little control over how objects\nprecisely appear. In this work, we present a path to democratize the use of\nphotorealistic synthetic data: we develop a new generation of interactive\nenvironments for representation learning research, that offer both\ncontrollability and realism. We use the Unreal Engine, a powerful game engine\nwell known in the entertainment industry, to produce PUG (Photorealistic Unreal\nGraphics) environments and datasets for representation learning. In this paper,\nwe demonstrate the potential of PUG to enable more rigorous evaluations of\nvision models.\n","authors":["Florian Bordes","Shashank Shekhar","Mark Ibrahim","Diane Bouchacourt","Pascal Vincent","Ari S. Morcos"],"pdf_url":"https://arxiv.org/pdf/2308.03977v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.02552v2","updated":"2023-08-08T01:30:26Z","published":"2023-08-02T03:34:44Z","title":"Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from\n Stable Diffusion","summary":" Owing to the unrestricted nature of the content in the training data, large\ntext-to-image diffusion models, such as Stable Diffusion (SD), are capable of\ngenerating images with potentially copyrighted or dangerous content based on\ncorresponding textual concepts information. This includes specific intellectual\nproperty (IP), human faces, and various artistic styles. However, Negative\nPrompt, a widely used method for content removal, frequently fails to conceal\nthis content due to inherent limitations in its inference logic. In this work,\nwe propose a novel strategy named \\textbf{Degeneration-Tuning (DT)} to shield\ncontents of unwanted concepts from SD weights. By utilizing Scrambled Grid to\nreconstruct the correlation between undesired concepts and their corresponding\nimage domain, we guide SD to generate meaningless content when such textual\nconcepts are provided as input. As this adaptation occurs at the level of the\nmodel's weights, the SD, after DT, can be grafted onto other conditional\ndiffusion frameworks like ControlNet to shield unwanted concepts. 
In addition\nto qualitatively showcasing the effectiveness of our DT method in protecting\nvarious types of concepts, a quantitative comparison of the SD before and after\nDT indicates that the DT method does not significantly impact the generative\nquality of other contents. The FID and IS scores of the model on COCO-30K\nexhibit only minor changes after DT, shifting from 12.61 and 39.20 to 13.04 and\n38.25, respectively, which clearly outperforms the previous methods.\n","authors":["Zixuan Ni","Longhui Wei","Jiacheng Li","Siliang Tang","Yueting Zhuang","Qi Tian"],"pdf_url":"https://arxiv.org/pdf/2308.02552v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03975v1","updated":"2023-08-08T01:27:55Z","published":"2023-08-08T01:27:55Z","title":"Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D\n Action Representation Learning","summary":" Self-supervised learning has proved effective for skeleton-based human action\nunderstanding, which is an important yet challenging topic. Previous works\nmainly rely on contrastive learning or masked motion modeling paradigm to model\nthe skeleton relations. However, the sequence-level and joint-level\nrepresentation learning cannot be effectively and simultaneously handled by\nthese methods. As a result, the learned representations fail to generalize to\ndifferent downstream tasks. Moreover, combining these two paradigms in a naive\nmanner leaves the synergy between them untapped and can lead to interference in\ntraining. To address these problems, we propose Prompted Contrast with Masked\nMotion Modeling, PCM$^{\\rm 3}$, for versatile 3D action representation\nlearning. Our method integrates the contrastive learning and masked prediction\ntasks in a mutually beneficial manner, which substantially boosts the\ngeneralization capacity for various downstream tasks. Specifically, masked\nprediction provides novel training views for contrastive learning, which in\nturn guides the masked prediction training with high-level semantic\ninformation. Moreover, we propose a dual-prompted multi-task pretraining\nstrategy, which further improves model representations by reducing the\ninterference caused by learning the two different pretext tasks. Extensive\nexperiments on five downstream tasks under three large-scale datasets are\nconducted, demonstrating the superior generalization capacity of PCM$^{\\rm 3}$\ncompared to the state-of-the-art works. Our project is publicly available at:\nhttps://jhang2020.github.io/Projects/PCM3/PCM3.html .\n","authors":["Jiahang Zhang","Lilang Lin","Jiaying Liu"],"pdf_url":"https://arxiv.org/pdf/2308.03975v1.pdf","comment":"Accepted by ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2308.03968v1","updated":"2023-08-08T00:46:01Z","published":"2023-08-08T00:46:01Z","title":"CheXFusion: Effective Fusion of Multi-View Features using Transformers\n for Long-Tailed Chest X-Ray Classification","summary":" Medical image classification poses unique challenges due to the long-tailed\ndistribution of diseases, the co-occurrence of diagnostic findings, and the\nmultiple views available for each study or patient. This paper introduces our\nsolution to the ICCV CVAMD 2023 Shared Task on CXR-LT: Multi-Label Long-Tailed\nClassification on Chest X-Rays. Our approach introduces CheXFusion, a\ntransformer-based fusion module incorporating multi-view images. 
The fusion\nmodule, guided by self-attention and cross-attention mechanisms, efficiently\naggregates multi-view features while considering label co-occurrence.\nFurthermore, we explore data balancing and self-training methods to optimize\nthe model's performance. Our solution achieves state-of-the-art results with\n0.372 mAP in the MIMIC-CXR test set, securing 1st place in the competition. Our\nsuccess in the task underscores the significance of considering multi-view\nsettings, class imbalance, and label co-occurrence in medical image\nclassification. Public code is available at\nhttps://github.com/dongkyuk/CXR-LT-public-solution\n","authors":["Dongkyun Kim"],"pdf_url":"https://arxiv.org/pdf/2308.03968v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04622v1","updated":"2023-08-08T23:12:33Z","published":"2023-08-08T23:12:33Z","title":"Rendering Humans from Object-Occluded Monocular Videos","summary":" 3D understanding and rendering of moving humans from monocular videos is a\nchallenging task. Despite recent progress, the task remains difficult in\nreal-world scenarios, where obstacles may block the camera view and cause\npartial occlusions in the captured videos. Existing methods cannot handle such\ndefects due to two reasons. First, the standard rendering strategy relies on\npoint-point mapping, which could lead to dramatic disparities between the\nvisible and occluded areas of the body. Second, the naive direct regression\napproach does not consider any feasibility criteria (ie, prior information) for\nrendering under occlusions. To tackle the above drawbacks, we present OccNeRF,\na neural rendering method that achieves better rendering of humans in severely\noccluded scenes. As direct solutions to the two drawbacks, we propose\nsurface-based rendering by integrating geometry and visibility priors. We\nvalidate our method on both simulated and real-world occlusions and demonstrate\nour method's superiority.\n","authors":["Tiange Xiang","Adam Sun","Jiajun Wu","Ehsan Adeli","Li Fei-Fei"],"pdf_url":"https://arxiv.org/pdf/2308.04622v1.pdf","comment":"ICCV 2023, project page:\n https://cs.stanford.edu/~xtiange/projects/occnerf/"},{"id":"http://arxiv.org/abs/2308.04605v1","updated":"2023-08-08T22:10:29Z","published":"2023-08-08T22:10:29Z","title":"PSRFlow: Probabilistic Super Resolution with Flow-Based Models for\n Scientific Data","summary":" Although many deep-learning-based super-resolution approaches have been\nproposed in recent years, because no ground truth is available in the inference\nstage, few can quantify the errors and uncertainties of the super-resolved\nresults. For scientific visualization applications, however, conveying\nuncertainties of the results to scientists is crucial to avoid generating\nmisleading or incorrect information. In this paper, we propose PSRFlow, a novel\nnormalizing flow-based generative model for scientific data super-resolution\nthat incorporates uncertainty quantification into the super-resolution process.\nPSRFlow learns the conditional distribution of the high-resolution data based\non the low-resolution counterpart. By sampling from a Gaussian latent space\nthat captures the missing information in the high-resolution data, one can\ngenerate different plausible super-resolution outputs. The efficient sampling\nin the Gaussian latent space allows our model to perform uncertainty\nquantification for the super-resolved results. 
During model training, we\naugment the training data with samples across various scales to make the model\nadaptable to data of different scales, achieving flexible super-resolution for\na given input. Our results demonstrate superior performance and robust\nuncertainty quantification compared with existing methods such as interpolation\nand GAN-based super-resolution networks.\n","authors":["Jingyi Shen","Han-Wei Shen"],"pdf_url":"https://arxiv.org/pdf/2308.04605v1.pdf","comment":"To be published in Proc. IEEE VIS 2023"},{"id":"http://arxiv.org/abs/2308.04598v1","updated":"2023-08-08T21:52:07Z","published":"2023-08-08T21:52:07Z","title":"1st Place Solution for CVPR2023 BURST Long Tail and Open World\n Challenges","summary":" Currently, Video Instance Segmentation (VIS) aims at segmenting and\ncategorizing objects in videos from a closed set of training categories that\ncontain only a few dozen categories, lacking the ability to handle diverse\nobjects in real-world videos. With the release of the TAO and BURST datasets, we have the\nopportunity to research VIS in long-tailed and open-world scenarios.\nTraditional VIS methods are evaluated on benchmarks limited to a small number\nof common classes, but practical applications require trackers that go beyond\nthese common classes, detecting and tracking rare and even never-before-seen\nobjects. Inspired by the latest MOT paper for the long-tail task (Tracking\nEvery Thing in the Wild, Siyuan Li et al.), for the BURST long-tail challenge, we\ntrain our model on a combination of LVISv0.5 and the COCO dataset using repeat\nfactor sampling. First, we train the detector with segmentation and CEM on the\nLVISv0.5 + COCO dataset. Then, we train the instance appearance similarity\nhead on the TAO dataset. Finally, our method (LeTracker) gets 14.9 HOTAall on\nthe BURST test set, ranking 1st in the benchmark. For the open-world\nchallenge, we train using only the annotations of 64 classes (the intersection of the classes in the BURST\ntrain subset and the COCO dataset, without the LVIS dataset), and, testing on the\nBURST test set, we get 61.4 OWTAall, ranking 1st in the benchmark. Our\ncode will be released to facilitate future research.\n","authors":["Kaer Huang"],"pdf_url":"https://arxiv.org/pdf/2308.04598v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04589v1","updated":"2023-08-08T21:18:23Z","published":"2023-08-08T21:18:23Z","title":"Temporal DINO: A Self-supervised Video Strategy to Enhance Action\n Prediction","summary":" The emerging field of action prediction plays a vital role in various\ncomputer vision applications such as autonomous driving, activity analysis and\nhuman-computer interaction. Despite significant advancements, accurately\npredicting future actions remains a challenging problem due to high\ndimensionality, complex dynamics and uncertainties inherent in video data.\nTraditional supervised approaches require large amounts of labelled data, which\nis expensive and time-consuming to obtain. This paper introduces a novel\nself-supervised video strategy for enhancing action prediction inspired by DINO\n(self-distillation with no labels). The Temporal-DINO approach employs two\nmodels: a 'student' processing past frames, and a 'teacher' processing both\npast and future frames, enabling a broader temporal context. During training,\nthe teacher guides the student to learn future context by only observing past\nframes. The strategy is evaluated on the ROAD dataset for the action prediction\ndownstream task using 3D-ResNet, Transformer, and LSTM architectures. 
The\nexperimental results showcase significant improvements in prediction\nperformance across these architectures, with our method achieving an average\nenhancement of 9.9% Precision Points (PP), highlighting its effectiveness in\nenhancing the backbones' capabilities of capturing long-term dependencies.\nFurthermore, our approach demonstrates efficiency regarding the pretraining\ndataset size and the number of epochs required. This method overcomes\nlimitations present in other approaches, including considering various backbone\narchitectures, addressing multiple prediction horizons, reducing reliance on\nhand-crafted augmentations, and streamlining the pretraining process into a\nsingle stage. These findings highlight the potential of our approach in diverse\nvideo-based tasks such as activity recognition, motion planning, and scene\nunderstanding.\n","authors":["Izzeddin Teeti","Rongali Sai Bhargav","Vivek Singh","Andrew Bradley","Biplab Banerjee","Fabio Cuzzolin"],"pdf_url":"https://arxiv.org/pdf/2308.04589v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04583v1","updated":"2023-08-08T21:08:42Z","published":"2023-08-08T21:08:42Z","title":"LATR: 3D Lane Detection from Monocular Images with Transformer","summary":" 3D lane detection from monocular images is a fundamental yet challenging task\nin autonomous driving. Recent advances primarily rely on structural 3D\nsurrogates (e.g., bird's eye view) that are built from front-view image\nfeatures and camera parameters. However, the depth ambiguity in monocular\nimages inevitably causes misalignment between the constructed surrogate feature\nmap and the original image, posing a great challenge for accurate lane\ndetection. To address the above issue, we present a novel LATR model, an\nend-to-end 3D lane detector that uses 3D-aware front-view features without\ntransformed view representation. Specifically, LATR detects 3D lanes via\ncross-attention based on query and key-value pairs, constructed using our\nlane-aware query generator and dynamic 3D ground positional embedding. On the\none hand, each query is generated based on 2D lane-aware features and adopts a\nhybrid embedding to enhance the lane information. On the other hand, 3D space\ninformation is injected as positional embedding from an iteratively-updated 3D\nground plane. LATR outperforms previous state-of-the-art methods on both\nsynthetic Apollo and realistic OpenLane by large margins (e.g., 11.4 gains in\nterms of F1 score on OpenLane). Code will be released at\nhttps://github.com/JMoonr/LATR.\n","authors":["Yueru Luo","Chaoda Zheng","Xu Yan","Tang Kun","Chao Zheng","Shuguang Cui","Zhen Li"],"pdf_url":"https://arxiv.org/pdf/2308.04583v1.pdf","comment":"Accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2303.06457v3","updated":"2023-08-08T21:00:21Z","published":"2023-03-11T17:14:30Z","title":"Active Visual Exploration Based on Attention-Map Entropy","summary":" Active visual exploration addresses the issue of limited sensor capabilities\nin real-world scenarios, where successive observations are actively chosen\nbased on the environment. To tackle this problem, we introduce a new technique\ncalled Attention-Map Entropy (AME). It leverages the internal uncertainty of\nthe transformer-based model to determine the most informative observations. In\ncontrast to existing solutions, it does not require additional loss components,\nwhich simplifies the training. 
Through experiments, which also mimic\nretina-like sensors, we show that such simplified training significantly\nimproves the performance of reconstruction, segmentation and classification on\npublicly available datasets.\n","authors":["Adam Pardyl","Grzegorz Rypeść","Grzegorz Kurzejamski","Bartosz Zieliński","Tomasz Trzciński"],"pdf_url":"https://arxiv.org/pdf/2303.06457v3.pdf","comment":"IJCAI 2023"},{"id":"http://arxiv.org/abs/2209.05996v3","updated":"2023-08-08T20:52:26Z","published":"2022-09-13T13:45:18Z","title":"M$^2$-3DLaneNet: Exploring Multi-Modal 3D Lane Detection","summary":" Estimating accurate lane lines in 3D space remains challenging due to their\nsparse and slim nature. Previous works mainly focused on using images for 3D\nlane detection, leading to inherent projection error and loss of geometry\ninformation. To address these issues, we explore the potential of leveraging\nLiDAR for 3D lane detection, either as a standalone method or in combination\nwith existing monocular approaches. In this paper, we propose M$^2$-3DLaneNet\nto integrate complementary information from multiple sensors. Specifically,\nM$^2$-3DLaneNet lifts 2D features into 3D space by incorporating geometry\ninformation from LiDAR data through depth completion. Subsequently, the lifted\n2D features are further enhanced with LiDAR features through cross-modality BEV\nfusion. Extensive experiments on the large-scale OpenLane dataset demonstrate\nthe effectiveness of M$^2$-3DLaneNet, regardless of the range (75m or 100m).\n","authors":["Yueru Luo","Xu Yan","Chaoda Zheng","Chao Zheng","Shuqi Mei","Tang Kun","Shuguang Cui","Zhen Li"],"pdf_url":"https://arxiv.org/pdf/2209.05996v3.pdf","comment":"update"},{"id":"http://arxiv.org/abs/2308.04571v1","updated":"2023-08-08T20:36:59Z","published":"2023-08-08T20:36:59Z","title":"Optimizing Algorithms From Pairwise User Preferences","summary":" Typical black-box optimization approaches in robotics focus on learning from\nmetric scores. However, that is not always possible, as not all developers have\nground truth available. Learning appropriate robot behavior in human-centric\ncontexts often requires querying users, who typically cannot provide precise\nmetric scores. Existing approaches leverage human feedback in an attempt to\nmodel an implicit reward function; however, this reward may be difficult or\nimpossible to effectively capture. In this work, we introduce SortCMA to\noptimize algorithm parameter configurations in high dimensions based on\npairwise user preferences. SortCMA efficiently and robustly leverages user\ninput to find parameter sets without directly modeling a reward. We apply this\nmethod to tuning a commercial depth sensor without ground truth, and to robot\nsocial navigation, which involves highly complex preferences over robot\nbehavior. We show that our method succeeds in optimizing for the user's goals\nand perform a user study to evaluate social navigation results.\n","authors":["Leonid Keselman","Katherine Shih","Martial Hebert","Aaron Steinfeld"],"pdf_url":"https://arxiv.org/pdf/2308.04571v1.pdf","comment":"Accepted at IROS 2023"},{"id":"http://arxiv.org/abs/2308.04556v1","updated":"2023-08-08T20:06:12Z","published":"2023-08-08T20:06:12Z","title":"FocalFormer3D : Focusing on Hard Instance for 3D Object Detection","summary":" False negatives (FN) in 3D object detection, {\\em e.g.}, missing predictions\nof pedestrians, vehicles, or other obstacles, can lead to potentially dangerous\nsituations in autonomous driving. 
While being fatal, this issue is understudied\nin many current 3D detection methods. In this work, we propose Hard Instance\nProbing (HIP), a general pipeline that identifies \\textit{FN} in a multi-stage\nmanner and guides the models to focus on excavating difficult instances. For 3D\nobject detection, we instantiate this method as FocalFormer3D, a simple yet\neffective detector that excels at excavating difficult objects and improving\nprediction recall. FocalFormer3D features a multi-stage query generation to\ndiscover hard objects and a box-level transformer decoder to efficiently\ndistinguish objects from massive object candidates. Experimental results on the\nnuScenes and Waymo datasets validate the superior performance of FocalFormer3D.\nThe advantage leads to strong performance on both detection and tracking, in\nboth LiDAR and multi-modal settings. Notably, FocalFormer3D achieves a 70.5 mAP\nand 73.9 NDS on nuScenes detection benchmark, while the nuScenes tracking\nbenchmark shows 72.1 AMOTA, both ranking 1st place on the nuScenes LiDAR\nleaderboard. Our code is available at\n\\url{https://github.com/NVlabs/FocalFormer3D}.\n","authors":["Yilun Chen","Zhiding Yu","Yukang Chen","Shiyi Lan","Animashree Anandkumar","Jiaya Jia","Jose Alvarez"],"pdf_url":"https://arxiv.org/pdf/2308.04556v1.pdf","comment":"Accepted by ICCV 2023"},{"id":"http://arxiv.org/abs/2308.04553v1","updated":"2023-08-08T19:52:28Z","published":"2023-08-08T19:52:28Z","title":"From Fake to Real (FFR): A two-stage training pipeline for mitigating\n spurious correlations with synthetic data","summary":" Visual recognition models are prone to learning spurious correlations induced\nby an imbalanced training set where certain groups (\\eg Females) are\nunder-represented in certain classes (\\eg Programmers). Generative models offer\na promising direction in mitigating this bias by generating synthetic data for\nthe minority samples and thus balancing the training set. However, prior work\nthat uses these approaches overlooks that visual recognition models could often\nlearn to differentiate between real and synthetic images and thus fail to\nunlearn the bias in the original dataset. In our work, we propose a novel\ntwo-stage pipeline to mitigate this issue where 1) we pre-train a model on a\nbalanced synthetic dataset and then 2) fine-tune on the real data. Using this\npipeline, we avoid training on both real and synthetic data, thus avoiding the\nbias between real and synthetic data. Moreover, we learn robust features\nagainst the bias in the first step that mitigate the bias in the second step.\nMoreover, our pipeline naturally integrates with bias mitigation methods; they\ncan be simply applied to the fine-tuning step. As our experiments prove, our\npipeline can further improve the performance of bias mitigation methods\nobtaining state-of-the-art performance on three large-scale datasets.\n","authors":["Maan Qraitem","Kate Saenko","Bryan A. Plummer"],"pdf_url":"https://arxiv.org/pdf/2308.04553v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04551v1","updated":"2023-08-08T19:45:06Z","published":"2023-08-08T19:45:06Z","title":"Improving Medical Image Classification in Noisy Labels Using Only\n Self-supervised Pretraining","summary":" Noisy labels hurt deep learning-based supervised image classification\nperformance as the models may overfit the noise and learn corrupted feature\nextractors. 
For natural image classification training with noisy labeled data,\nmodel initialization with contrastive self-supervised pretrained weights has\nshown to reduce feature corruption and improve classification performance.\nHowever, no works have explored: i) how other self-supervised approaches, such\nas pretext task-based pretraining, impact the learning with noisy label, and\nii) any self-supervised pretraining methods alone for medical images in noisy\nlabel settings. Medical images often feature smaller datasets and subtle inter\nclass variations, requiring human expertise to ensure correct classification.\nThus, it is not clear if the methods improving learning with noisy labels in\nnatural image datasets such as CIFAR would also help with medical images. In\nthis work, we explore contrastive and pretext task-based self-supervised\npretraining to initialize the weights of a deep learning classification model\nfor two medical datasets with self-induced noisy labels -- NCT-CRC-HE-100K\ntissue histological images and COVID-QU-Ex chest X-ray images. Our results show\nthat models initialized with pretrained weights obtained from self-supervised\nlearning can effectively learn better features and improve robustness against\nnoisy labels.\n","authors":["Bidur Khanal","Binod Bhattarai","Bishesh Khanal","Cristian A. Linte"],"pdf_url":"https://arxiv.org/pdf/2308.04551v1.pdf","comment":"Accepted at MICCAI 2023 DEMI Workshop"},{"id":"http://arxiv.org/abs/2308.04549v1","updated":"2023-08-08T19:38:15Z","published":"2023-08-08T19:38:15Z","title":"Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation","summary":" Transformers have become the primary backbone of the computer vision\ncommunity due to their impressive performance. However, the unfriendly\ncomputation cost impedes their potential in the video recognition domain. To\noptimize the speed-accuracy trade-off, we propose Semantic-aware Temporal\nAccumulation score (STA) to prune spatio-temporal tokens integrally. STA score\nconsiders two critical factors: temporal redundancy and semantic importance.\nThe former depicts a specific region based on whether it is a new occurrence or\na seen entity by aggregating token-to-token similarity in consecutive frames\nwhile the latter evaluates each token based on its contribution to the overall\nprediction. As a result, tokens with higher scores of STA carry more temporal\nredundancy as well as lower semantics thus being pruned. Based on the STA\nscore, we are able to progressively prune the tokens without introducing any\nadditional parameters or requiring further re-training. We directly apply the\nSTA module to off-the-shelf ViT and VideoSwin backbones, and the empirical\nresults on Kinetics-400 and Something-Something V2 achieve over 30% computation\nreduction with a negligible ~0.2% accuracy drop. The code is released at\nhttps://github.com/Mark12Ding/STA.\n","authors":["Shuangrui Ding","Peisen Zhao","Xiaopeng Zhang","Rui Qian","Hongkai Xiong","Qi Tian"],"pdf_url":"https://arxiv.org/pdf/2308.04549v1.pdf","comment":"ICCV 2023 camera ready"},{"id":"http://arxiv.org/abs/2307.10763v2","updated":"2023-08-08T19:31:20Z","published":"2023-07-20T10:53:12Z","title":"Actor-agnostic Multi-label Action Recognition with Multi-modal Query","summary":" Existing action recognition methods are typically actor-specific due to the\nintrinsic topological and apparent differences among the actors. This requires\nactor-specific pose estimation (e.g., humans vs. 
animals), leading to\ncumbersome model design complexity and high maintenance costs. Moreover, they\noften focus on learning the visual modality alone and single-label\nclassification whilst neglecting other available information sources (e.g.,\nclass name text) and the concurrent occurrence of multiple actions. To overcome\nthese limitations, we propose a new approach called 'actor-agnostic multi-modal\nmulti-label action recognition,' which offers a unified solution for various\ntypes of actors, including humans and animals. We further formulate a novel\nMulti-modal Semantic Query Network (MSQNet) model in a transformer-based object\ndetection framework (e.g., DETR), characterized by leveraging visual and\ntextual modalities to represent the action classes better. The elimination of\nactor-specific model designs is a key advantage, as it removes the need for\nactor pose estimation altogether. Extensive experiments on five publicly\navailable benchmarks show that our MSQNet consistently outperforms the prior\narts of actor-specific alternatives on human and animal single- and multi-label\naction recognition tasks by up to 50%. Code will be released at\nhttps://github.com/mondalanindya/MSQNet.\n","authors":["Anindya Mondal","Sauradip Nag","Joaquin M Prada","Xiatian Zhu","Anjan Dutta"],"pdf_url":"https://arxiv.org/pdf/2307.10763v2.pdf","comment":"Accepted at the 2023 IEEE/CVF International Conference on Computer\n Vision Workshops (ICCVW), Paris, France"},{"id":"http://arxiv.org/abs/2308.04542v1","updated":"2023-08-08T19:18:20Z","published":"2023-08-08T19:18:20Z","title":"YUDO: YOLO for Uniform Directed Object Detection","summary":" This paper presents an efficient way of detecting directed objects by\npredicting their center coordinates and direction angle. Since the objects are\nof uniform size, the proposed model works without predicting the object's width\nand height. The dataset used for this problem is presented in Honeybee\nSegmentation and Tracking Datasets project. One of the contributions of this\nwork is an examination of the ability of the standard real-time object\ndetection architecture like YoloV7 to be customized for position and direction\ndetection. A very efficient, tiny version of the architecture is used in this\napproach. Moreover, only one of three detection heads without anchors is\nsufficient for this task. We also introduce the extended Skew Intersection over\nUnion (SkewIoU) calculation for rotated boxes - directed IoU (DirIoU), which\nincludes an absolute angle difference. DirIoU is used both in the matching\nprocedure of target and predicted bounding boxes for mAP calculation, and in\nthe NMS filtering procedure. The code and models are available at\nhttps://github.com/djordjened92/yudo.\n","authors":["Đorđe Nedeljković"],"pdf_url":"https://arxiv.org/pdf/2308.04542v1.pdf","comment":"The Paper is accepted in 25th Irish Machine Vision and Image\n Processing Conference (IMVIP23)"},{"id":"http://arxiv.org/abs/2303.09472v2","updated":"2023-08-08T19:15:38Z","published":"2023-03-16T16:47:14Z","title":"DiffIR: Efficient Diffusion Model for Image Restoration","summary":" Diffusion model (DM) has achieved SOTA performance by modeling the image\nsynthesis process into a sequential application of a denoising network.\nHowever, different from image synthesis, image restoration (IR) has a strong\nconstraint to generate results in accordance with ground-truth. 
Thus, for IR,\ntraditional DMs running massive iterations on a large model to estimate whole\nimages or feature maps is inefficient. To address this issue, we propose an\nefficient DM for IR (DiffIR), which consists of a compact IR prior extraction\nnetwork (CPEN), dynamic IR transformer (DIRformer), and denoising network.\nSpecifically, DiffIR has two training stages: pretraining and training DM. In\npretraining, we input ground-truth images into CPEN$_{S1}$ to capture a compact\nIR prior representation (IPR) to guide DIRformer. In the second stage, we train\nthe DM to directly estimate the same IRP as pretrained CPEN$_{S1}$ only using\nLQ images. We observe that since the IPR is only a compact vector, DiffIR can\nuse fewer iterations than traditional DM to obtain accurate estimations and\ngenerate more stable and realistic results. Since the iterations are few, our\nDiffIR can adopt a joint optimization of CPEN$_{S2}$, DIRformer, and denoising\nnetwork, which can further reduce the estimation error influence. We conduct\nextensive experiments on several IR tasks and achieve SOTA performance while\nconsuming less computational costs. Code is available at\n\\url{https://github.com/Zj-BinXia/DiffIR}.\n","authors":["Bin Xia","Yulun Zhang","Shiyin Wang","Yitong Wang","Xinglong Wu","Yapeng Tian","Wenming Yang","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2303.09472v2.pdf","comment":"This paper is accepted by ICCV2023. Codes and models are available at\n https://github.com/Zj-BinXia/DiffIR"},{"id":"http://arxiv.org/abs/2308.04536v1","updated":"2023-08-08T18:57:03Z","published":"2023-08-08T18:57:03Z","title":"Facial Prior Based First Order Motion Model for Micro-expression\n Generation","summary":" Spotting facial micro-expression from videos finds various potential\napplications in fields including clinical diagnosis and interrogation,\nmeanwhile this task is still difficult due to the limited scale of training\ndata. To solve this problem, this paper tries to formulate a new task called\nmicro-expression generation and then presents a strong baseline which combines\nthe first order motion model with facial prior knowledge. Given a target face,\nwe intend to drive the face to generate micro-expression videos according to\nthe motion patterns of source videos. Specifically, our new model involves\nthree modules. First, we extract facial prior features from a region focusing\nmodule. Second, we estimate facial motion using key points and local affine\ntransformations with a motion prediction module. Third, expression generation\nmodule is used to drive the target face to generate videos. We train our model\non public CASME II, SAMM and SMIC datasets and then use the model to generate\nnew micro-expression videos for evaluation. Our model achieves the first place\nin the Facial Micro-Expression Challenge 2021 (MEGC2021), where our superior\nperformance is verified by three experts with Facial Action Coding System\ncertification. Source code is provided in\nhttps://github.com/Necolizer/Facial-Prior-Based-FOMM.\n","authors":["Yi Zhang","Youjun Zhao","Yuhang Wen","Zixuan Tang","Xinhua Xu","Mengyuan Liu"],"pdf_url":"https://arxiv.org/pdf/2308.04536v1.pdf","comment":"ACM Multimedia 2021"},{"id":"http://arxiv.org/abs/2308.04535v1","updated":"2023-08-08T18:57:01Z","published":"2023-08-08T18:57:01Z","title":"Estimation of Human Condition at Disaster Site Using Aerial Drone Images","summary":" Drones are being used to assess the situation in various disasters. 
In this\nstudy, we investigate a method to automatically estimate the damage status of\npeople based on their actions in aerial drone images in order to understand\ndisaster sites faster and save labor. We constructed a new dataset of aerial\nimages of human actions in a hypothetical disaster that occurred in an urban\narea, and classified the human damage status using 3D ResNet. The results\nshowed that the status with characteristic human actions could be classified\nwith a recall rate of more than 80%, while other statuses with similar human\nactions could only be classified with a recall rate of about 50%. In addition,\na cloud-based VR presentation application suggested the effectiveness of using\ndrones to understand the disaster site and estimate the human condition.\n","authors":["Tomoki Arai","Kenji Iwata","Kensho Hara","Yutaka Satoh"],"pdf_url":"https://arxiv.org/pdf/2308.04535v1.pdf","comment":"In submission to the ICCV 2023 Artificial Intelligence for\n Humanitarian Assistance and Disaster Response Workshop"},{"id":"http://arxiv.org/abs/2305.07026v3","updated":"2023-08-08T18:50:07Z","published":"2023-05-11T17:58:47Z","title":"Decentralization and Acceleration Enables Large-Scale Bundle Adjustment","summary":" Scaling to arbitrarily large bundle adjustment problems requires data and\ncompute to be distributed across multiple devices. Centralized methods in prior\nworks are only able to solve small or medium size problems due to overhead in\ncomputation and communication. In this paper, we present a fully decentralized\nmethod that alleviates computation and communication bottlenecks to solve\narbitrarily large bundle adjustment problems. We achieve this by reformulating\nthe reprojection error and deriving a novel surrogate function that decouples\noptimization variables from different devices. This function makes it possible\nto use majorization minimization techniques and reduces bundle adjustment to\nindependent optimization subproblems that can be solved in parallel. We further\napply Nesterov's acceleration and adaptive restart to improve convergence while\nmaintaining its theoretical guarantees. Despite limited peer-to-peer\ncommunication, our method has provable convergence to first-order critical\npoints under mild conditions. On extensive benchmarks with public datasets, our\nmethod converges much faster than decentralized baselines with similar memory\nusage and communication load. Compared to centralized baselines using a single\ndevice, our method, while being decentralized, yields more accurate solutions\nwith significant speedups of up to 953.7x over Ceres and 174.6x over DeepLM.\nCode: https://joeaortiz.github.io/daba.\n","authors":["Taosha Fan","Joseph Ortiz","Ming Hsiao","Maurizio Monge","Jing Dong","Todd Murphey","Mustafa Mukadam"],"pdf_url":"https://arxiv.org/pdf/2305.07026v3.pdf","comment":"Robotics: Science and Systems (RSS), 2023"},{"id":"http://arxiv.org/abs/2209.00128v3","updated":"2023-08-08T18:48:21Z","published":"2022-08-31T21:45:16Z","title":"Archangel: A Hybrid UAV-based Human Detection Benchmark with Position\n and Pose Metadata","summary":" Learning to detect objects, such as humans, in imagery captured by an\nunmanned aerial vehicle (UAV) usually suffers from tremendous variations caused\nby the UAV's position towards the objects. In addition, existing UAV-based\nbenchmark datasets do not provide adequate dataset metadata, which is essential\nfor precise model diagnosis and learning features invariant to those\nvariations. 
In this paper, we introduce Archangel, the first UAV-based object\ndetection dataset composed of real and synthetic subsets captured with similar\nimagining conditions and UAV position and object pose metadata. A series of\nexperiments are carefully designed with a state-of-the-art object detector to\ndemonstrate the benefits of leveraging the metadata during model evaluation.\nMoreover, several crucial insights involving both real and synthetic data\nduring model optimization are presented. In the end, we discuss the advantages,\nlimitations, and future directions regarding Archangel to highlight its\ndistinct value for the broader machine learning community.\n","authors":["Yi-Ting Shen","Yaesop Lee","Heesung Kwon","Damon M. Conover","Shuvra S. Bhattacharyya","Nikolas Vale","Joshua D. Gray","G. Jeremy Leong","Kenneth Evensen","Frank Skirlo"],"pdf_url":"https://arxiv.org/pdf/2209.00128v3.pdf","comment":"IEEE Access"},{"id":"http://arxiv.org/abs/2308.04529v1","updated":"2023-08-08T18:47:25Z","published":"2023-08-08T18:47:25Z","title":"Generating Modern Persian Carpet Map by Style-transfer","summary":" Today, the great performance of Deep Neural Networks(DNN) has been proven in\nvarious fields. One of its most attractive applications is to produce artistic\ndesigns. A carpet that is known as a piece of art is one of the most important\nitems in a house, which has many enthusiasts all over the world. The first\nstage of producing a carpet is to prepare its map, which is a difficult,\ntime-consuming, and expensive task. In this research work, our purpose is to\nuse DNN for generating a Modern Persian Carpet Map. To reach this aim, three\ndifferent DNN style transfer methods are proposed and compared against each\nother. In the proposed methods, the Style-Swap method is utilized to create the\ninitial carpet map, and in the following, to generate more diverse designs,\nmethods Clip-Styler, Gatys, and Style-Swap are used separately. In addition,\nsome methods are examined and introduced for coloring the produced carpet maps.\nThe designed maps are evaluated via the results of filled questionnaires where\nthe outcomes of user evaluations confirm the popularity of generated carpet\nmaps. Eventually, for the first time, intelligent methods are used in producing\ncarpet maps, and it reduces human intervention. The proposed methods can\nsuccessfully produce diverse carpet designs, and at a higher speed than\ntraditional ways.\n","authors":["Dorsa Rahmatian","Monireh Moshavash","Mahdi Eftekhari","Kamran Hoseinkhani"],"pdf_url":"https://arxiv.org/pdf/2308.04529v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04528v1","updated":"2023-08-08T18:46:16Z","published":"2023-08-08T18:46:16Z","title":"Unsupervised Camouflaged Object Segmentation as Domain Adaptation","summary":" Deep learning for unsupervised image segmentation remains challenging due to\nthe absence of human labels. The common idea is to train a segmentation head,\nwith the supervision of pixel-wise pseudo-labels generated based on the\nrepresentation of self-supervised backbones. By doing so, the model performance\ndepends much on the distance between the distributions of target datasets and\nthe pre-training dataset (e.g., ImageNet). 
In this work, we investigate a new\ntask, namely unsupervised camouflaged object segmentation (UCOS), where the\ntarget objects own a common rarely-seen attribute, i.e., camouflage.\nUnsurprisingly, we find that the state-of-the-art unsupervised models struggle\nto adapt to UCOS, due to the domain gap between the properties of generic and\ncamouflaged objects. To this end, we formulate UCOS as a source-free\nunsupervised domain adaptation task (UCOS-DA), where both source labels and\ntarget labels are absent during the whole model training process. Specifically,\nwe define a source model consisting of self-supervised vision transformers\npre-trained on ImageNet. On the other hand, the target domain includes a simple\nlinear layer (i.e., our target model) and unlabeled camouflaged objects. We\nthen design a pipeline for foreground-background-contrastive self-adversarial\ndomain adaptation, to achieve robust UCOS. As a result, our baseline model\nachieves superior segmentation performance when compared with competing\nunsupervised models on the UCOS benchmark, with a training set whose scale\nis only one tenth of that of the supervised COS counterpart.\n","authors":["Yi Zhang","Chengyi Wu"],"pdf_url":"https://arxiv.org/pdf/2308.04528v1.pdf","comment":"12 pages, 6 figures, 3 tables; Project Page:\n https://github.com/Jun-Pu/UCOS-DA ; Accepted to ICCV 2023 Workshop on OOD-CV"},{"id":"http://arxiv.org/abs/2308.04526v1","updated":"2023-08-08T18:41:38Z","published":"2023-08-08T18:41:38Z","title":"Large-Scale Multi-Hypotheses Cell Tracking Using Ultrametric Contours\n Maps","summary":" In this work, we describe a method for large-scale 3D cell-tracking through a\nsegmentation selection approach. The proposed method is effective at tracking\ncells across large microscopy datasets on two fronts: (i) It can solve problems\ncontaining millions of segmentation instances in terabyte-scale 3D+t datasets;\n(ii) It achieves competitive results with or without deep learning, which\nrequires 3D annotated data that is scarce in the fluorescence microscopy\nfield. The proposed method computes cell tracks and segments using a hierarchy\nof segmentation hypotheses and selects disjoint segments by maximizing the\noverlap between adjacent frames. We show that this method achieves\nstate-of-the-art results in 3D images from the cell tracking challenge and has\na faster integer linear programming formulation. Moreover, our framework is\nflexible and supports segmentations from off-the-shelf cell segmentation models\nand can combine them into an ensemble that improves tracking. The code is\navailable at https://github.com/royerlab/ultrack.\n","authors":["Jordão Bragantini","Merlin Lange","Loïc Royer"],"pdf_url":"https://arxiv.org/pdf/2308.04526v1.pdf","comment":"13 pages, 7 figures, 4 tables"},{"id":"http://arxiv.org/abs/2308.04515v1","updated":"2023-08-08T18:24:53Z","published":"2023-08-08T18:24:53Z","title":"Toward unlabeled multi-view 3D pedestrian detection by generalizable AI:\n techniques and performance analysis","summary":" We unveil how generalizable AI can be used to improve multi-view 3D\npedestrian detection in unlabeled target scenes. One way to increase\ngeneralization to new scenes is to automatically label target data, which can\nthen be used for training a detector model. 
In this context, we investigate two\napproaches for automatically labeling target data: pseudo-labeling using a\nsupervised detector and automatic labeling using an untrained detector (that\ncan be applied out of the box without any training). We adopt a training\nframework for optimizing detector models using automatic labeling procedures.\nThis framework encompasses different training sets/modes and multi-round\nautomatic labeling strategies. We conduct our analyses on the\npublicly-available WILDTRACK and MultiviewX datasets. We show that, by using\nthe automatic labeling approach based on an untrained detector, we can obtain\nbetter results than by directly using the untrained detector or a detector\ntrained with an existing labeled source dataset. It achieved MODA scores about 4%\nand 1% better than the best existing unlabeled method when using WILDTRACK and\nMultiviewX as target datasets, respectively.\n","authors":["João Paulo Lima","Diego Thomas","Hideaki Uchiyama","Veronica Teichrieb"],"pdf_url":"https://arxiv.org/pdf/2308.04515v1.pdf","comment":"Accepted to SIBGRAPI 2023"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2210.07774v3","updated":"2023-08-08T17:48:12Z","published":"2022-09-19T22:57:10Z","title":"Learning To Rank Diversely At Airbnb","summary":" Airbnb is a two-sided marketplace, bringing together hosts who own listings\nfor rent, with prospective guests from around the globe. Applying neural\nnetwork-based learning to rank techniques has led to significant improvements\nin matching guests with hosts. These improvements in ranking were driven by a\ncore strategy: order the listings by their estimated booking probabilities,\nthen iterate on techniques to make these booking probability estimates more and\nmore accurate. Embedded implicitly in this strategy was an assumption that the\nbooking probability of a listing could be determined independently of other\nlistings in search results. In this paper we discuss how this assumption,\npervasive throughout the commonly-used learning to rank frameworks, is false.\nWe provide a theoretical foundation correcting this assumption, followed by\nefficient neural network architectures based on the theory. Explicitly\naccounting for possible similarities between listings and reducing them to\ndiversify the search results generated a strong positive impact. We discuss these\nmetric wins as part of the online A/B tests of the theory. Our method provides\na practical way to diversify search results for large-scale production ranking\nsystems.\n","authors":["Malay Haldar","Mustafa Abdool","Liwei He","Dillon Davis","Huiji Gao","Sanjeev Katariya"],"pdf_url":"https://arxiv.org/pdf/2210.07774v3.pdf","comment":"Search ranking, Diversity, e-commerce"},{"id":"http://arxiv.org/abs/2112.06668v2","updated":"2023-08-08T16:32:12Z","published":"2021-12-13T13:42:35Z","title":"CT4Rec: Simple yet Effective Consistency Training for Sequential\n Recommendation","summary":" Sequential recommendation methods play an important role in real-world\nrecommender systems. These systems are able to capture user preferences by taking\nadvantage of historical records and then performing recommendations.\nContrastive learning (CL) is a cutting-edge technology that can assist us in\nobtaining informative user representations, but these CL-based models need\nsubtle negative sampling strategies, tedious data augmentation methods, and\nheavy hyper-parameter tuning work. 
In this paper, we introduce another way to\ngenerate better user representations and recommend more attractive items to\nusers. Particularly, we put forward an effective \\textbf{C}onsistency\n\\textbf{C}onstraint for sequential \\textbf{Rec}ommendation (C$^2$-Rec) in which\nonly two extra training objectives are used without any structural\nmodifications or data augmentation strategies. Substantial experiments have\nbeen conducted on three benchmark datasets and one real industrial dataset,\nwhich prove that our proposed method outperforms SOTA models substantially.\nFurthermore, our method needs much less training time than those CL-based\nmodels. An online A/B test on real-world recommendation systems also achieves a\n10.141\\% improvement in the click-through rate and a 10.541\\% increase in the\naverage click number per capita. The code is available at\n\\url{https://github.com/zhengrongqin/C2-Rec}.\n","authors":["Chong Liu","Xiaoyang Liu","Rongqin Zheng","Lixin Zhang","Xiaobo Liang","Juntao Li","Lijun Wu","Min Zhang","Leyu Lin"],"pdf_url":"https://arxiv.org/pdf/2112.06668v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04380v1","updated":"2023-08-08T16:31:43Z","published":"2023-08-08T16:31:43Z","title":"Your Negative May not Be True Negative: Boosting Image-Text Matching\n with False Negative Elimination","summary":" Most existing image-text matching methods adopt triplet loss as the\noptimization objective, and choosing a proper negative sample for the triplet\nof (anchor, positive, negative) is important for effectively training the\nmodel, e.g., hard negatives make the model learn efficiently and effectively.\nHowever, we observe that existing methods mainly employ the most similar\nsamples as hard negatives, which may not be true negatives. In other words, the\nsamples with high similarity but not paired with the anchor may retain\npositive semantic associations, and we call them false negatives. Repelling\nthese false negatives in triplet loss would mislead the semantic representation\nlearning and result in inferior retrieval performance. In this paper, we\npropose a novel False Negative Elimination (FNE) strategy to select negatives\nvia sampling, which could alleviate the problem introduced by false negatives.\nSpecifically, we first construct the distributions of positive and negative\nsamples separately via their similarities with the anchor, based on the\nfeatures extracted from image and text encoders. Then we calculate the false\nnegative probability of a given sample based on its similarity with the anchor\nand the above distributions via Bayes' rule, which is employed as the\nsampling weight during the negative sampling process. Since there may not exist any\nfalse negatives in a small batch, we design a memory module with momentum\nto retain a large negative buffer and implement our negative sampling strategy\nspanning over the buffer. In addition, to make the model focus on hard\nnegatives, we reassign the sampling weights for the simple negatives with a\ncut-down strategy. Extensive experiments are conducted on Flickr30K and\nMS-COCO, and the results demonstrate the superiority of our proposed false\nnegative elimination strategy. 
The code is available at\nhttps://github.com/LuminosityX/FNE.\n","authors":["Haoxuan Li","Yi Bin","Junrong Liao","Yang Yang","Heng Tao Shen"],"pdf_url":"https://arxiv.org/pdf/2308.04380v1.pdf","comment":"Accepted at ACM MM 2023"},{"id":"http://arxiv.org/abs/2308.03735v2","updated":"2023-08-08T16:20:18Z","published":"2023-08-07T17:34:58Z","title":"Randomized algorithms for precise measurement of differentially-private,\n personalized recommendations","summary":" Personalized recommendations form an important part of today's internet\necosystem, helping artists and creators to reach interested users, and helping\nusers to discover new and engaging content. However, many users today are\nskeptical of platforms that personalize recommendations, in part due to\nhistorically careless treatment of personal data and data privacy. Now,\nbusinesses that rely on personalized recommendations are entering a new\nparadigm, where many of their systems must be overhauled to be privacy-first.\nIn this article, we propose an algorithm for personalized recommendations that\nfacilitates both precise and differentially-private measurement. We consider\nadvertising as an example application, and conduct offline experiments to\nquantify how the proposed privacy-preserving algorithm affects key metrics\nrelated to user experience, advertiser value, and platform revenue compared to\nthe extremes of both (private) non-personalized and non-private, personalized\nimplementations.\n","authors":["Allegra Laro","Yanqing Chen","Hao He","Babak Aghazadeh"],"pdf_url":"https://arxiv.org/pdf/2308.03735v2.pdf","comment":"Submitted to AAAI"},{"id":"http://arxiv.org/abs/2308.04343v1","updated":"2023-08-08T15:43:59Z","published":"2023-08-08T15:43:59Z","title":"Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval","summary":" Most existing cross-modal retrieval methods employ two-stream encoders with\ndifferent architectures for images and texts, \\textit{e.g.}, CNN for images and\nRNN/Transformer for texts. Such discrepancy in architectures may induce\ndifferent semantic distribution spaces and limit the interactions between\nimages and texts, and further result in inferior alignment between images and\ntexts. To fill this research gap, inspired by recent advances of Transformers\nin vision tasks, we propose to unify the encoder architectures with\nTransformers for both modalities. Specifically, we design a cross-modal\nretrieval framework purely based on two-stream Transformers, dubbed\n\\textbf{Hierarchical Alignment Transformers (HAT)}, which consists of an image\nTransformer, a text Transformer, and a hierarchical alignment module. With such\nidentical architectures, the encoders could produce representations with more\nsimilar characteristics for images and texts, and make the interactions and\nalignments between them much easier. Besides, to leverage the rich semantics,\nwe devise a hierarchical alignment scheme to explore multi-level\ncorrespondences of different layers between images and texts. To evaluate the\neffectiveness of the proposed HAT, we conduct extensive experiments on two\nbenchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that\nHAT outperforms SOTA baselines by a large margin. Specifically, on two key\ntasks, \\textit{i.e.}, image-to-text and text-to-image retrieval, HAT achieves\n7.6\\% and 16.7\\% relative score improvement of Recall@1 on MSCOCO, and 4.4\\%\nand 11.6\\% on Flickr30k respectively. 
The code is available at\n\\url{https://github.com/LuminosityX/HAT}.\n","authors":["Yi Bin","Haoxuan Li","Yahui Xu","Xing Xu","Yang Yang","Heng Tao Shen"],"pdf_url":"https://arxiv.org/pdf/2308.04343v1.pdf","comment":"Accepted at ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2308.04258v1","updated":"2023-08-08T13:46:55Z","published":"2023-08-08T13:46:55Z","title":"Advancing Natural-Language Based Audio Retrieval with PaSST and Large\n Audio-Caption Data Sets","summary":" This work presents a text-to-audio-retrieval system based on pre-trained text\nand spectrogram transformers. Our method projects recordings and textual\ndescriptions into a shared audio-caption space in which related examples from\ndifferent modalities are close. Through a systematic analysis, we examine how\neach component of the system influences retrieval performance. As a result, we\nidentify two key components that play a crucial role in driving performance:\nthe self-attention-based audio encoder for audio embedding and the utilization\nof additional human-generated and synthetic data sets during pre-training. We\nfurther experimented with augmenting ClothoV2 captions with available keywords\nto increase their variety; however, this only led to marginal improvements. Our\nsystem ranked first in the 2023's DCASE Challenge, and it outperforms the\ncurrent state of the art on the ClothoV2 benchmark by 5.6 pp. mAP@10.\n","authors":["Paul Primus","Khaled Koutini","Gerhard Widmer"],"pdf_url":"https://arxiv.org/pdf/2308.04258v1.pdf","comment":"submitted to DCASE Workshop 2023"},{"id":"http://arxiv.org/abs/2308.04247v1","updated":"2023-08-08T13:26:36Z","published":"2023-08-08T13:26:36Z","title":"UniRecSys: A Unified Framework for Personalized, Group, Package, and\n Package-to-Group Recommendations","summary":" Recommender systems aim to enhance the overall user experience by providing\ntailored recommendations for a variety of products and services. These systems\nhelp users make more informed decisions, leading to greater user satisfaction\nwith the platform. However, the implementation of these systems largely depends\non the context, which can vary from recommending an item or package to a user\nor a group. This requires careful exploration of several models during the\ndeployment, as there is no comprehensive and unified approach that deals with\nrecommendations at different levels. Furthermore, these individual models must\nbe closely attuned to their generated recommendations depending on the context\nto prevent significant variation in their generated recommendations. In this\npaper, we propose a novel unified recommendation framework that addresses all\nfour recommendation tasks, namely personalized, group, package, or\npackage-to-group recommendation, filling the gap in the current research\nlandscape. The proposed framework can be integrated with most of the\ntraditional matrix factorization-based collaborative filtering models. The idea\nis to enhance the formulation of the existing approaches by incorporating\ncomponents focusing on the exploitation of the group and package latent\nfactors. These components also help in exploiting a rich latent representation\nof the user/item by enforcing them to align closely with their corresponding\ngroup/package representation. We consider two prominent CF techniques,\nRegularized Matrix Factorization and Maximum Margin Matrix factorization, as\nthe baseline models and demonstrate their customization to various\nrecommendation tasks. 
Experiment results on two publicly available datasets are\nreported, comparing them to other baseline approaches that consider individual\nrating feedback for group or package recommendations.\n","authors":["Adamya Shyam","Vikas Kumar","Venkateswara Rao Kagita","Arun K Pujari"],"pdf_url":"https://arxiv.org/pdf/2308.04247v1.pdf","comment":"25 pages"},{"id":"http://arxiv.org/abs/2308.04226v1","updated":"2023-08-08T12:45:01Z","published":"2023-08-08T12:45:01Z","title":"OpinionConv: Conversational Product Search with Grounded Opinions","summary":" When searching for products, the opinions of others play an important role in\nmaking informed decisions. Subjective experiences about a product can be a\nvaluable source of information. This is also true in sales conversations, where\na customer and a sales assistant exchange facts and opinions about products.\nHowever, training an AI for such conversations is complicated by the fact that\nlanguage models do not possess authentic opinions for their lack of real-world\nexperience. We address this problem by leveraging product reviews as a rich\nsource of product opinions to ground conversational AI in true subjective\nnarratives. With OpinionConv, we develop the first conversational AI for\nsimulating sales conversations. To validate the generated conversations, we\nconduct several user studies showing that the generated opinions are perceived\nas realistic. Our assessors also confirm the importance of opinions as an\ninformative basis for decision-making.\n","authors":["Vahid Sadiri Javadi","Martin Potthast","Lucie Flek"],"pdf_url":"https://arxiv.org/pdf/2308.04226v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.10046v2","updated":"2023-08-08T09:46:21Z","published":"2023-06-12T08:21:50Z","title":"Document Layout Annotation: Database and Benchmark in the Domain of\n Public Affairs","summary":" Every day, thousands of digital documents are generated with useful\ninformation for companies, public organizations, and citizens. Given the\nimpossibility of processing them manually, the automatic processing of these\ndocuments is becoming increasingly necessary in certain sectors. However, this\ntask remains challenging, since in most cases a text-only based parsing is not\nenough to fully understand the information presented through different\ncomponents of varying significance. In this regard, Document Layout Analysis\n(DLA) has been an interesting research field for many years, which aims to\ndetect and classify the basic components of a document. In this work, we used a\nprocedure to semi-automatically annotate digital documents with different\nlayout labels, including 4 basic layout blocks and 4 text categories. We apply\nthis procedure to collect a novel database for DLA in the public affairs\ndomain, using a set of 24 data sources from the Spanish Administration. The\ndatabase comprises 37.9K documents with more than 441K document pages, and more\nthan 8M labels associated to 8 layout block units. 
The results of our\nexperiments validate the proposed text labeling procedure with accuracy up to\n99%.\n","authors":["Alejandro Peña","Aythami Morales","Julian Fierrez","Javier Ortega-Garcia","Marcos Grande","Iñigo Puente","Jorge Cordova","Gonzalo Cordova"],"pdf_url":"https://arxiv.org/pdf/2306.10046v2.pdf","comment":"Accepted in ICDAR 2023 Workshop on Machine Vision and NLP for\n Document Analysis"},{"id":"http://arxiv.org/abs/2308.04086v1","updated":"2023-08-08T06:58:05Z","published":"2023-08-08T06:58:05Z","title":"Understanding and Modeling Passive-Negative Feedback for Short-video\n Sequential Recommendation","summary":" Sequential recommendation is one of the most important tasks in recommender\nsystems, which aims to recommend the next interacted item with historical\nbehaviors as input. Traditional sequential recommendation mainly\nconsiders collected positive feedback such as clicks, purchases, etc.\nHowever, in short-video platforms such as TikTok, video viewing behavior may\nnot always represent positive feedback. Specifically, the videos are played\nautomatically, and users passively receive the recommended videos. In this new\nscenario, users passively express negative feedback by skipping over videos\nthey do not like, which provides valuable information about their preferences.\nDifferent from the negative feedback studied in traditional recommender\nsystems, this passive-negative feedback can reflect users' interests and serve\nas an important supervision signal in extracting users' preferences. Therefore,\nit is essential to carefully design and utilize it in this novel recommendation\nscenario. In this work, we first conduct analyses based on a large-scale\nreal-world short-video behavior dataset and illustrate the significance of\nleveraging passive feedback. We then propose a novel method that deploys the\nsub-interest encoder, which incorporates positive feedback and passive-negative\nfeedback as supervision signals to learn the user's current active\nsub-interest. Moreover, we introduce an adaptive fusion layer to integrate\nvarious sub-interests effectively. To enhance the robustness of our model, we\nthen introduce a multi-task learning module to simultaneously optimize two\nkinds of feedback -- passive-negative feedback and traditional randomly-sampled\nnegative feedback. The experiments on two large-scale datasets verify that the\nproposed method can significantly outperform state-of-the-art approaches. The\ncode is released at https://github.com/tsinghua-fib-lab/RecSys2023-SINE.\n","authors":["Yunzhu Pan","Chen Gao","Jianxin Chang","Yanan Niu","Yang Song","Kun Gai","Depeng Jin","Yong Li"],"pdf_url":"https://arxiv.org/pdf/2308.04086v1.pdf","comment":"Accepted by RecSys'23"},{"id":"http://arxiv.org/abs/2206.12893v3","updated":"2023-08-08T06:40:15Z","published":"2022-06-26T14:51:18Z","title":"PCDF: A Parallel-Computing Distributed Framework for Sponsored Search\n Advertising Serving","summary":" Traditional online advertising systems for sponsored search follow a cascade\nparadigm of retrieval, pre-ranking, and ranking. Constrained by\nstrict requirements on online inference efficiency, it tends to be difficult to\ndeploy useful but computationally intensive modules in the ranking stage.\nMoreover, ranking models currently used in the industry assume the user click\nrelies only on the advertisement itself, which results in the ranking stage\noverlooking the impact of organic search results on the predicted\nadvertisements (ads). 
In this work, we propose a novel framework,\nPCDF (Parallel-Computing Distributed Framework), which splits the\ncomputation cost into three parts and deploys them in the pre-module in\nparallel with the retrieval stage, the middle-module for ranking ads, and the\npost-module for re-ranking ads with external items. Our PCDF effectively\nreduces the overall inference latency compared with the classic framework. The\nwhole module is trained offline end-to-end and adapts to the online learning\nparadigm. To our knowledge, we are the first to propose an end-to-end solution\nfor online training and deployment of complex CTR models from the system\nframework side.\n","authors":["Han Xu","Hao Qi","Kunyao Wang","Pei Wang","Guowei Zhang","Congcong Liu","Junsheng Jin","Xiwei Zhao","Zhangang Lin","Jinghe Hu","Jingping Shao"],"pdf_url":"https://arxiv.org/pdf/2206.12893v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04067v1","updated":"2023-08-08T06:04:17Z","published":"2023-08-08T06:04:17Z","title":"Online Distillation-enhanced Multi-modal Transformer for Sequential\n Recommendation","summary":" Multi-modal recommendation systems, which integrate diverse types of\ninformation, have gained widespread attention in recent years. However,\ncompared to traditional collaborative filtering-based multi-modal\nrecommendation systems, research on multi-modal sequential recommendation is\nstill in its nascent stages. Unlike traditional sequential recommendation\nmodels that solely rely on item identifier (ID) information and focus on\nnetwork structure design, multi-modal recommendation models need to emphasize\nitem representation learning and the fusion of heterogeneous data sources. This\npaper investigates the impact of item representation learning on downstream\nrecommendation tasks and examines the disparities in information fusion at\ndifferent stages. Empirical experiments are conducted to demonstrate the need\nto design a framework suitable for collaborative learning and fusion of diverse\ninformation. Based on this, we propose a new model-agnostic framework for\nmulti-modal sequential recommendation tasks, called Online\nDistillation-enhanced Multi-modal Transformer (ODMT), to enhance feature\ninteraction and mutual learning among multi-source input (ID, text, and image),\nwhile avoiding conflicts among different features during training, thereby\nimproving recommendation accuracy. To be specific, we first introduce an\nID-aware Multi-modal Transformer module in the item representation learning\nstage to facilitate information interaction among different features. Secondly,\nwe employ an online distillation training strategy in the prediction\noptimization stage to make multi-source data learn from each other and improve\nprediction robustness. 
Experimental results on a video content recommendation\ndataset and three e-commerce recommendation datasets demonstrate the\neffectiveness of the two proposed modules, which yield approximately a 10%\nimprovement in performance compared to baseline models.\n","authors":["Wei Ji","Xiangyan Liu","An Zhang","Yinwei Wei","Yongxin Ni","Xiang Wang"],"pdf_url":"https://arxiv.org/pdf/2308.04067v1.pdf","comment":"11 pages, 7 figures"},{"id":"http://arxiv.org/abs/2308.04033v1","updated":"2023-08-08T04:21:14Z","published":"2023-08-08T04:21:14Z","title":"Adapting Foundation Models for Information Synthesis of Wireless\n Communication Specifications","summary":" Existing approaches to understanding, developing and researching modern\nwireless communication technologies involve a time-intensive and arduous process\nof sifting through numerous webpages and technical specification documents,\ngathering the required information and synthesizing it. This paper presents\nNextGen Communications Copilot, a conversational artificial intelligence tool\nfor information synthesis of wireless communication specifications. The system\nbuilds on top of recent advancements in foundation models and consists of three\nkey additional components: a domain-specific database, a context extractor, and\na feedback mechanism. The system appends user queries with concise and\nquery-dependent contextual information extracted from a database of wireless\ntechnical specifications and incorporates tools for expert feedback and data\ncontributions. On evaluation using a benchmark dataset of queries and reference\nresponses created by subject matter experts, the system demonstrated more\nrelevant and accurate answers with an average BLEU score and BERTScore\nF1-measure of 0.37 and 0.79 respectively compared to the corresponding values\nof 0.07 and 0.59 achieved by state-of-the-art tools like ChatGPT.\n","authors":["Manikanta Kotaru"],"pdf_url":"https://arxiv.org/pdf/2308.04033v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04028v1","updated":"2023-08-08T04:06:11Z","published":"2023-08-08T04:06:11Z","title":"Top K Relevant Passage Retrieval for Biomedical Question Answering","summary":" Question answering is a task that answers factoid questions using a large\ncollection of documents. It aims to provide precise answers in response to the\nuser's questions in natural language. Question answering relies on efficient\npassage retrieval to select candidate contexts, where traditional sparse vector\nspace models, such as TF-IDF or BM25, are the de facto method. On the web,\nthere is no single article that could provide all the possible answers\navailable on the internet to the question asked by the user. The\nexisting Dense Passage Retrieval model has been trained on a Wikipedia dump from\nDec. 20, 2018, as the source documents for answering questions. Question\nanswering (QA) has made big strides with several open-domain and machine\ncomprehension systems built using large-scale annotated datasets. However, in\nthe clinical domain, this problem remains relatively unexplored. According to\nmultiple surveys, biomedical questions cannot be answered correctly from\nWikipedia articles. In this work, we build on the existing DPR framework for the\nbiomedical domain and retrieve answers from PubMed articles, which are a\nreliable source for answering medical questions. 
When evaluated on a BioASQ QA\ndataset, our fine-tuned dense retriever results in a 0.81 F1 score.\n","authors":["Shashank Gupta"],"pdf_url":"https://arxiv.org/pdf/2308.04028v1.pdf","comment":"6 pages, 5 figures. arXiv admin note: text overlap with\n arXiv:2004.04906 by other authors"},{"id":"http://arxiv.org/abs/2308.04019v1","updated":"2023-08-08T03:33:15Z","published":"2023-08-08T03:33:15Z","title":"Exploring the Spatiotemporal Features of Online Food Recommendation\n Service","summary":" Online Food Recommendation Service (OFRS) has remarkable spatiotemporal\ncharacteristics and the advantage of being able to conveniently satisfy users'\nneeds in a timely manner. There have been a variety of studies that have begun\nto explore its spatiotemporal properties, but a comprehensive and in-depth\nanalysis of the OFRS spatiotemporal features is yet to be conducted. Therefore,\nthis paper studies the OFRS based on three questions: how spatiotemporal\nfeatures play a role; why self-attention cannot be used to model the\nspatiotemporal sequences of OFRS; and how to combine spatiotemporal features to\nimprove the efficiency of OFRS. Firstly, through experimental analysis, we\nsystematically extracted the spatiotemporal features of OFRS, identified the most\nvaluable features and designed an effective combination method. Secondly, we\nconducted a detailed analysis of the spatiotemporal sequences, which revealed\nthe shortcomings of self-attention in OFRS, and proposed a more optimized\nspatiotemporal sequence method for replacing self-attention. In addition, we\nalso designed a Dynamic Context Adaptation Model to further improve the\nefficiency and performance of OFRS. Through the offline experiments on two\nlarge datasets and online experiments for a week, the feasibility and\nsuperiority of our model were proven.\n","authors":["Shaochuan Lin","Jiayan Pei","Taotao Zhou","Hengxu He","Jia Jia","Ning Hu"],"pdf_url":"https://arxiv.org/pdf/2308.04019v1.pdf","comment":"accepted by SIGIR 2023"},{"id":"http://arxiv.org/abs/2308.04017v1","updated":"2023-08-08T03:24:44Z","published":"2023-08-08T03:24:44Z","title":"Multi-Granularity Attention Model for Group Recommendation","summary":" Group recommendation provides personalized recommendations to a group of\nusers based on their shared interests, preferences, and characteristics.\nCurrent studies have explored different methods for integrating individual\npreferences and making collective decisions that benefit the group as a whole.\nHowever, most of them heavily rely on users with rich behavior and ignore\nlatent preferences of users with relatively sparse behavior, leading to\ninsufficient learning of individual interests. To address this challenge, we\npresent the Multi-Granularity Attention Model (MGAM), a novel approach that\nutilizes multiple levels of granularity (i.e., subsets, groups, and supersets)\nto uncover group members' latent preferences and mitigate recommendation noise.\nSpecifically, we propose a Subset Preference Extraction module that enhances the\nrepresentation of users' latent subset-level preferences by incorporating their\nprevious interactions with items and utilizing a hierarchical mechanism.\nAdditionally, our method introduces a Group Preference Extraction module and a\nSuperset Preference Extraction module, which explore users' latent preferences\non two levels: the group-level, which maintains users' original preferences,\nand the superset-level, which includes group-group exterior information. 
By\nincorporating the subset-level embedding, group-level embedding, and\nsuperset-level embedding, our proposed method effectively reduces group\nrecommendation noise across multiple granularities and comprehensively learns\nindividual interests. Extensive offline and online experiments have\ndemonstrated the superiority of our method in terms of performance.\n","authors":["Jianye Ji","Jiayan Pei","Shaochuan Lin","Taotao Zhou","Hengxu He","Jia Jia","Ning Hu"],"pdf_url":"https://arxiv.org/pdf/2308.04017v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04579v1","updated":"2023-08-08T20:54:59Z","published":"2023-08-08T20:54:59Z","title":"RECipe: Does a Multi-Modal Recipe Knowledge Graph Fit a Multi-Purpose\n Recommendation System?","summary":" Over the past two decades, recommendation systems (RSs) have used machine\nlearning (ML) solutions to recommend items, e.g., movies, books, and\nrestaurants, to clients of a business or an online platform. Recipe\nrecommendation, however, has not yet received much attention compared to those\napplications. We introduce RECipe as a multi-purpose recipe recommendation\nframework with a multi-modal knowledge graph (MMKG) backbone. The motivation\nbehind RECipe is to go beyond (deep) neural collaborative filtering (NCF) by\nrecommending recipes to users when they query in natural language or by\nproviding an image. RECipe consists of 3 subsystems: (1) behavior-based\nrecommender, (2) review-based recommender, and (3) image-based recommender.\nEach subsystem relies on the embedding representations of entities and\nrelations in the graph. We first obtain (pre-trained) embedding representations\nof textual entities, such as reviews or ingredients, from a fine-tuned model of\nMicrosoft's MPNet. We initialize the weights of the entities with these\nembeddings to train our knowledge graph embedding (KGE) model. For the visual\ncomponent, i.e., recipe images, we develop a KGE-Guided variational autoencoder\n(KG-VAE) to learn the distribution of images and their latent representations.\nOnce KGE and KG-VAE models are fully trained, we use them as a multi-purpose\nrecommendation framework. For benchmarking, we created two knowledge graphs\n(KGs) from public datasets on Kaggle for recipe recommendation. Our experiments\nshow that the KGE models have comparable performance to the neural solutions.\nWe also present pre-trained NLP embeddings to address important applications\nsuch as zero-shot inference for new users (or the cold start problem) and\nconditional recommendation with respect to recipe categories. We eventually\ndemonstrate the application of RECipe in a multi-purpose recommendation\nsetting.\n","authors":["Ali Pesaranghader","Touqir Sajed"],"pdf_url":"https://arxiv.org/pdf/2308.04579v1.pdf","comment":"19 pages, 8 figures, 8 tables"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2308.04431v1","updated":"2023-08-08T17:58:45Z","published":"2023-08-08T17:58:45Z","title":"When More is Less: Incorporating Additional Datasets Can Hurt\n Performance By Introducing Spurious Correlations","summary":" In machine learning, incorporating more data is often seen as a reliable\nstrategy for improving model performance; this work challenges that notion by\ndemonstrating that the addition of external datasets in many cases can hurt the\nresulting model's performance. 
In a large-scale empirical study across\ncombinations of four different open-source chest x-ray datasets and 9 different\nlabels, we demonstrate that in 43% of settings, a model trained on data from\ntwo hospitals has poorer worst group accuracy over both hospitals than a model\ntrained on just a single hospital's data. This surprising result occurs even\nthough the added hospital makes the training distribution more similar to the\ntest distribution. We explain that this phenomenon arises from the spurious\ncorrelation that emerges between the disease and hospital, due to\nhospital-specific image artifacts. We highlight the trade-off one encounters\nwhen training on multiple datasets, between the obvious benefit of additional\ndata and insidious cost of the introduced spurious correlation. In some cases,\nbalancing the dataset can remove the spurious correlation and improve\nperformance, but it is not always an effective strategy. We contextualize our\nresults within the literature on spurious correlations to help explain these\noutcomes. Our experiments underscore the importance of exercising caution when\nselecting training data for machine learning models, especially in settings\nwhere there is a risk of spurious correlations such as with medical imaging.\nThe risks outlined highlight the need for careful data selection and model\nevaluation in future research and practice.\n","authors":["Rhys Compton","Lily Zhang","Aahlad Puli","Rajesh Ranganath"],"pdf_url":"https://arxiv.org/pdf/2308.04431v1.pdf","comment":"Accepted at MLHC 2023"},{"id":"http://arxiv.org/abs/2308.04430v1","updated":"2023-08-08T17:58:15Z","published":"2023-08-08T17:58:15Z","title":"SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore","summary":" The legality of training language models (LMs) on copyrighted or otherwise\nrestricted data is under intense debate. However, as we show, model performance\nsignificantly degrades if trained only on low-risk text (e.g., out-of-copyright\nbooks or government documents), due to its limited size and domain coverage. We\npresent SILO, a new language model that manages this risk-performance tradeoff\nduring inference. SILO is built by (1) training a parametric LM on Open License\nCorpus (OLC), a new corpus we curate with 228B tokens of public domain and\npermissively licensed text and (2) augmenting it with a more general and easily\nmodifiable nonparametric datastore (e.g., containing copyrighted books or news)\nthat is only queried during inference. The datastore allows use of high-risk\ndata without training on it, supports sentence-level data attribution, and\nenables data producers to opt out from the model by removing content from the\nstore. These capabilities can foster compliance with data-use regulations such\nas the fair use doctrine in the United States and the GDPR in the European\nUnion. Our experiments show that the parametric LM struggles on domains not\ncovered by OLC. However, access to the datastore greatly improves out of domain\nperformance, closing 90% of the performance gap with an LM trained on the Pile,\na more diverse corpus with mostly high-risk text. We also analyze which\nnonparametric approach works best, where the remaining errors lie, and how\nperformance scales with datastore size. Our results suggest that it is possible\nto build high quality language models while mitigating their legal risk.\n","authors":["Sewon Min","Suchin Gururangan","Eric Wallace","Hannaneh Hajishirzi","Noah A. 
Smith","Luke Zettlemoyer"],"pdf_url":"https://arxiv.org/pdf/2308.04430v1.pdf","comment":"27 pages; 6 figures. Code, models, and data available at\n https://github.com/kernelmachine/silo-lm"},{"id":"http://arxiv.org/abs/2308.04428v1","updated":"2023-08-08T17:56:20Z","published":"2023-08-08T17:56:20Z","title":"Meta-Learning Operators to Optimality from Multi-Task Non-IID Data","summary":" A powerful concept behind much of the recent progress in machine learning is\nthe extraction of common features across data from heterogeneous sources or\ntasks. Intuitively, using all of one's data to learn a common representation\nfunction benefits both computational effort and statistical generalization by\nleaving a smaller number of parameters to fine-tune on a given task. Toward\ntheoretically grounding these merits, we propose a general setting of\nrecovering linear operators $M$ from noisy vector measurements $y = Mx + w$,\nwhere the covariates $x$ may be both non-i.i.d. and non-isotropic. We\ndemonstrate that existing isotropy-agnostic meta-learning approaches incur\nbiases on the representation update, which causes the scaling of the noise\nterms to lose favorable dependence on the number of source tasks. This in turn\ncan cause the sample complexity of representation learning to be bottlenecked\nby the single-task data size. We introduce an adaptation, $\\texttt{De-bias &\nFeature-Whiten}$ ($\\texttt{DFW}$), of the popular alternating\nminimization-descent (AMD) scheme proposed in Collins et al., (2021), and\nestablish linear convergence to the optimal representation with noise level\nscaling down with the $\\textit{total}$ source data size. This leads to\ngeneralization bounds on the same order as an oracle empirical risk minimizer.\nWe verify the vital importance of $\\texttt{DFW}$ on various numerical\nsimulations. In particular, we show that vanilla alternating-minimization\ndescent fails catastrophically even for iid, but mildly non-isotropic data. Our\nanalysis unifies and generalizes prior work, and provides a flexible framework\nfor a wider range of applications, such as in controls and dynamical systems.\n","authors":["Thomas T. C. K. Zhang","Leonardo F. Toso","James Anderson","Nikolai Matni"],"pdf_url":"https://arxiv.org/pdf/2308.04428v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04426v1","updated":"2023-08-08T17:55:30Z","published":"2023-08-08T17:55:30Z","title":"A Deep-Learning Method Using Auto-encoder and Generative Adversarial\n Network for Anomaly Detection on Ancient Stone Stele Surfaces","summary":" Accurate detection of natural deterioration and man-made damage on the\nsurfaces of ancient stele in the first instance is essential for their\npreventive conservation. Existing methods for cultural heritage preservation\nare not able to achieve this goal perfectly due to the difficulty of balancing\naccuracy, efficiency, timeliness, and cost. This paper presents a deep-learning\nmethod to automatically detect above mentioned emergencies on ancient stone\nstele in real time, employing autoencoder (AE) and generative adversarial\nnetwork (GAN). The proposed method overcomes the limitations of existing\nmethods by requiring no extensive anomaly samples while enabling comprehensive\ndetection of unpredictable anomalies. 
The method includes stages of monitoring,\ndata acquisition, pre-processing, model structuring, and post-processing.\nTaking the Longmen Grottoes' stone steles as a case study, an unsupervised\nlearning model based on AE and GAN architectures is proposed and validated with\na reconstruction accuracy of 99.74\\%. The method's evaluation revealed the\nproficient detection of seven artificially designed anomalies and demonstrated\nprecision and reliability without false alarms. This research provides novel\nideas and possibilities for the application of deep learning in the field of\ncultural heritage.\n","authors":["Yikun Liu","Yuning Wang","Cheng Liu"],"pdf_url":"https://arxiv.org/pdf/2308.04426v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.07774v3","updated":"2023-08-08T17:48:12Z","published":"2022-09-19T22:57:10Z","title":"Learning To Rank Diversely At Airbnb","summary":" Airbnb is a two-sided marketplace, bringing together hosts who own listings\nfor rent, with prospective guests from around the globe. Applying neural\nnetwork-based learning to rank techniques has led to significant improvements\nin matching guests with hosts. These improvements in ranking were driven by a\ncore strategy: order the listings by their estimated booking probabilities,\nthen iterate on techniques to make these booking probability estimates more and\nmore accurate. Embedded implicitly in this strategy was an assumption that the\nbooking probability of a listing could be determined independently of other\nlistings in search results. In this paper we discuss how this assumption,\npervasive throughout the commonly-used learning to rank frameworks, is false.\nWe provide a theoretical foundation correcting this assumption, followed by\nefficient neural network architectures based on the theory. Explicitly\naccounting for possible similarities between listings and reducing them to\ndiversify the search results generated a strong positive impact. We discuss these\nmetric wins as part of the online A/B tests of the theory. Our method provides\na practical way to diversify search results for large-scale production ranking\nsystems.\n","authors":["Malay Haldar","Mustafa Abdool","Liwei He","Dillon Davis","Huiji Gao","Sanjeev Katariya"],"pdf_url":"https://arxiv.org/pdf/2210.07774v3.pdf","comment":"Search ranking, Diversity, e-commerce"},{"id":"http://arxiv.org/abs/2308.04417v1","updated":"2023-08-08T17:34:28Z","published":"2023-08-08T17:34:28Z","title":"DiffCR: A Fast Conditional Diffusion Framework for Cloud Removal from\n Optical Satellite Images","summary":" Optical satellite images are a critical data source; however, cloud cover\noften compromises their quality, hindering image applications and analysis.\nConsequently, effectively removing clouds from optical satellite images has\nemerged as a prominent research direction. While recent advancements in cloud\nremoval primarily rely on generative adversarial networks, which may yield\nsuboptimal image quality, diffusion models have demonstrated remarkable success\nin diverse image-generation tasks, showcasing their potential in addressing\nthis challenge. This paper presents a novel framework called DiffCR, which\nleverages conditional guided diffusion with deep convolutional networks for\nhigh-performance cloud removal for optical satellite imagery. 
Specifically, we\nintroduce a decoupled encoder for conditional image feature extraction,\nproviding a robust color representation to ensure the close similarity of\nappearance information between the conditional input and the synthesized\noutput. Moreover, we propose a novel and efficient time and condition fusion\nblock within the cloud removal model to accurately simulate the correspondence\nbetween the appearance in the conditional image and the target image at a low\ncomputational cost. Extensive experimental evaluations on two commonly used\nbenchmark datasets demonstrate that DiffCR consistently achieves\nstate-of-the-art performance on all metrics, with parameter and computational\ncomplexities amounting to only 5.1% and 5.4%, respectively, of those previous\nbest methods. The source code, pre-trained models, and all the experimental\nresults will be publicly available at https://github.com/XavierJiezou/DiffCR\nupon the paper's acceptance of this work.\n","authors":["Xuechao Zou","Kai Li","Junliang Xing","Yu Zhang","Shiying Wang","Lei Jin","Pin Tao"],"pdf_url":"https://arxiv.org/pdf/2308.04417v1.pdf","comment":"13 pages, 7 figures"},{"id":"http://arxiv.org/abs/2306.09345v2","updated":"2023-08-08T17:26:58Z","published":"2023-06-15T17:59:51Z","title":"Evaluating Data Attribution for Text-to-Image Models","summary":" While large text-to-image models are able to synthesize \"novel\" images, these\nimages are necessarily a reflection of the training data. The problem of data\nattribution in such models -- which of the images in the training set are most\nresponsible for the appearance of a given generated image -- is a difficult yet\nimportant one. As an initial step toward this problem, we evaluate attribution\nthrough \"customization\" methods, which tune an existing large-scale model\ntoward a given exemplar object or style. Our key insight is that this allows us\nto efficiently create synthetic images that are computationally influenced by\nthe exemplar by construction. With our new dataset of such exemplar-influenced\nimages, we are able to evaluate various data attribution algorithms and\ndifferent possible feature spaces. Furthermore, by training on our dataset, we\ncan tune standard models, such as DINO, CLIP, and ViT, toward the attribution\nproblem. Even though the procedure is tuned towards small exemplar sets, we\nshow generalization to larger sets. Finally, by taking into account the\ninherent uncertainty of the problem, we can assign soft attribution scores over\na set of training images.\n","authors":["Sheng-Yu Wang","Alexei A. Efros","Jun-Yan Zhu","Richard Zhang"],"pdf_url":"https://arxiv.org/pdf/2306.09345v2.pdf","comment":"Updated v2 -- ICCV 2023 camera ready version. Project page:\n https://peterwang512.github.io/GenDataAttribution Code:\n https://github.com/PeterWang512/GenDataAttribution"},{"id":"http://arxiv.org/abs/2308.04412v1","updated":"2023-08-08T17:18:04Z","published":"2023-08-08T17:18:04Z","title":"Probabilistic Invariant Learning with Randomized Linear Classifiers","summary":" Designing models that are both expressive and preserve known invariances of\ntasks is an increasingly hard problem. Existing solutions tradeoff invariance\nfor computational or memory resources. In this work, we show how to leverage\nrandomness and design models that are both expressive and invariant but use\nless resources. 
Inspired by randomized algorithms, our key insight is that\naccepting probabilistic notions of universal approximation and invariance can\nreduce our resource requirements. More specifically, we propose a class of\nbinary classification models called Randomized Linear Classifiers (RLCs). We\ngive parameter and sample size conditions in which RLCs can, with high\nprobability, approximate any (smooth) function while preserving invariance to\ncompact group transformations. Leveraging this result, we design three RLCs\nthat are provably probabilistic invariant for classification tasks over sets,\ngraphs, and spherical data. We show how these models can achieve probabilistic\ninvariance and universality using less resources than (deterministic) neural\nnetworks and their invariant counterparts. Finally, we empirically demonstrate\nthe benefits of this new class of models on invariant tasks where deterministic\ninvariant neural networks are known to struggle.\n","authors":["Leonardo Cotta","Gal Yehuda","Assaf Schuster","Chris J. Maddison"],"pdf_url":"https://arxiv.org/pdf/2308.04412v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04406v1","updated":"2023-08-08T17:10:23Z","published":"2023-08-08T17:10:23Z","title":"XGBD: Explanation-Guided Graph Backdoor Detection","summary":" Backdoor attacks pose a significant security risk to graph learning models.\nBackdoors can be embedded into the target model by inserting backdoor triggers\ninto the training dataset, causing the model to make incorrect predictions when\nthe trigger is present. To counter backdoor attacks, backdoor detection has\nbeen proposed. An emerging detection strategy in the vision and NLP domains is\nbased on an intriguing phenomenon: when training models on a mixture of\nbackdoor and clean samples, the loss on backdoor samples drops significantly\nfaster than on clean samples, allowing backdoor samples to be easily detected\nby selecting samples with the lowest loss values. However, the ignorance of\ntopological feature information on graph data limits its detection\neffectiveness when applied directly to the graph domain. To this end, we\npropose an explanation-guided backdoor detection method to take advantage of\nthe topological information. Specifically, we train a helper model on the graph\ndataset, feed graph samples into the model, and then adopt explanation methods\nto attribute model prediction to an important subgraph. We observe that\nbackdoor samples have distinct attribution distribution than clean samples, so\nthe explanatory subgraph could serve as more discriminative features for\ndetecting backdoor samples. Comprehensive experiments on multiple popular\ndatasets and attack methods demonstrate the effectiveness and explainability of\nour method. Our code is available:\nhttps://github.com/GuanZihan/GNN_backdoor_detection.\n","authors":["Zihan Guan","Mengnan Du","Ninghao Liu"],"pdf_url":"https://arxiv.org/pdf/2308.04406v1.pdf","comment":"8 pages, 9 figures"},{"id":"http://arxiv.org/abs/2308.04396v1","updated":"2023-08-08T17:00:30Z","published":"2023-08-08T17:00:30Z","title":"Event Abstraction for Enterprise Collaboration Systems to Support Social\n Process Mining","summary":" One aim of Process Mining (PM) is the discovery of process models from event\nlogs of information systems. PM has been successfully applied to\nprocess-oriented enterprise systems but is less suited for communication- and\ndocument-oriented Enterprise Collaboration Systems (ECS). 
ECS event logs are\nvery fine-granular and PM applied to their logs results in spaghetti models. A\ncommon solution for this is event abstraction, i.e., converting low-level logs\ninto more abstract high-level logs before running discovery algorithms. ECS\nlogs come with special characteristics that have so far not been fully\naddressed by existing event abstraction approaches. We aim to close this gap\nwith a tailored ECS event abstraction (ECSEA) approach that trains a model by\ncomparing recorded actual user activities (high-level traces) with the\nsystem-generated low-level traces (extracted from the ECS). The model allows us\nto automatically convert future low-level traces into an abstracted high-level\nlog that can be used for PM. Our evaluation shows that the algorithm produces\naccurate results. ECSEA is a preprocessing method that is essential for the\ninterpretation of collaborative work activity in ECS, which we call Social\nProcess Mining.\n","authors":["Jonas Blatt","Patrick Delfmann","Petra Schubert"],"pdf_url":"https://arxiv.org/pdf/2308.04396v1.pdf","comment":"8 pages, 1 figure, 3 tables"},{"id":"http://arxiv.org/abs/2308.04395v1","updated":"2023-08-08T17:00:11Z","published":"2023-08-08T17:00:11Z","title":"Data Augmentation-Based Unsupervised Domain Adaptation In Medical\n Imaging","summary":" Deep learning-based models in medical imaging often struggle to generalize\neffectively to new scans due to data heterogeneity arising from differences in\nhardware, acquisition parameters, population, and artifacts. This limitation\npresents a significant challenge in adopting machine learning models for\nclinical practice. We propose an unsupervised method for robust domain\nadaptation in brain MRI segmentation by leveraging MRI-specific augmentation\ntechniques. To evaluate the effectiveness of our method, we conduct extensive\nexperiments across diverse datasets, modalities, and segmentation tasks,\ncomparing against the state-of-the-art methods. The results show that our\nproposed approach achieves high accuracy, exhibits broad applicability, and\nshowcases remarkable robustness against domain shift in various tasks,\nsurpassing the state-of-the-art performance in the majority of cases.\n","authors":["Sebastian Nørgaard Llambias","Mads Nielsen","Mostafa Mehdipour Ghazi"],"pdf_url":"https://arxiv.org/pdf/2308.04395v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04375v1","updated":"2023-08-08T16:23:46Z","published":"2023-08-08T16:23:46Z","title":"Understanding the Effect of Counterfactual Explanations on Trust and\n Reliance on AI for Human-AI Collaborative Clinical Decision Making","summary":" Artificial intelligence (AI) is increasingly being considered to assist human\ndecision-making in high-stake domains (e.g. health). However, researchers have\ndiscussed an issue that humans can over-rely on wrong suggestions of the AI\nmodel instead of achieving human AI complementary performance. In this work, we\nutilized salient feature explanations along with what-if, counterfactual\nexplanations to make humans review AI suggestions more analytically to reduce\noverreliance on AI and explored the effect of these explanations on trust and\nreliance on AI during clinical decision-making. 
We conducted an experiment with\nseven therapists and ten laypersons on the task of assessing post-stroke\nsurvivors' quality of motion, and analyzed their performance, agreement level\non the task, and reliance on AI without and with two types of AI explanations.\nOur results showed that the AI model with both salient features and\ncounterfactual explanations assisted therapists and laypersons to improve their\nperformance and agreement level on the task when `right' AI outputs are\npresented. While both therapists and laypersons over-relied on `wrong' AI\noutputs, counterfactual explanations assisted both therapists and laypersons to\nreduce their over-reliance on `wrong' AI outputs by 21\\% compared to salient\nfeature explanations. Specifically, laypersons had higher performance degrades\nby 18.0 f1-score with salient feature explanations and 14.0 f1-score with\ncounterfactual explanations than therapists with performance degrades of 8.6\nand 2.8 f1-scores respectively. Our work discusses the potential of\ncounterfactual explanations to better estimate the accuracy of an AI model and\nreduce over-reliance on `wrong' AI outputs and implications for improving\nhuman-AI collaborative decision-making.\n","authors":["Min Hun Lee","Chong Jun Chew"],"pdf_url":"https://arxiv.org/pdf/2308.04375v1.pdf","comment":"ACM CSCW 2023"},{"id":"http://arxiv.org/abs/2308.04373v1","updated":"2023-08-08T16:22:44Z","published":"2023-08-08T16:22:44Z","title":"Pelta: Shielding Transformers to Mitigate Evasion Attacks in Federated\n Learning","summary":" The main premise of federated learning is that machine learning model updates\nare computed locally, in particular to preserve user data privacy, as those\nnever leave the perimeter of their device. This mechanism supposes the general\nmodel, once aggregated, to be broadcast to collaborating and non malicious\nnodes. However, without proper defenses, compromised clients can easily probe\nthe model inside their local memory in search of adversarial examples. For\ninstance, considering image-based applications, adversarial examples consist of\nimperceptibly perturbed images (to the human eye) misclassified by the local\nmodel, which can be later presented to a victim node's counterpart model to\nreplicate the attack. To mitigate such malicious probing, we introduce Pelta, a\nnovel shielding mechanism leveraging trusted hardware. By harnessing the\ncapabilities of Trusted Execution Environments (TEEs), Pelta masks part of the\nback-propagation chain rule, otherwise typically exploited by attackers for the\ndesign of malicious samples. We evaluate Pelta on a state of the art ensemble\nmodel and demonstrate its effectiveness against the Self Attention Gradient\nadversarial Attack.\n","authors":["Simon Queyrut","Yérom-David Bromberg","Valerio Schiavoni"],"pdf_url":"https://arxiv.org/pdf/2308.04373v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.06713v2","updated":"2023-08-08T16:21:49Z","published":"2023-07-13T12:11:36Z","title":"Unsupervised Calibration through Prior Adaptation for Text\n Classification using Large Language Models","summary":" A wide variety of natural language tasks are currently being addressed with\nlarge-scale language models (LLMs). These models are usually trained with a\nvery large amount of unsupervised text data and adapted to perform a downstream\nnatural language task using methods like fine-tuning, calibration or in-context\nlearning. 
In this work, we propose an approach to adapt the prior class\ndistribution to perform text classification tasks without the need for labelled\nsamples and only few in-domain sample queries. The proposed approach treats the\nLLM as a black box, adding a stage where the model posteriors are calibrated to\nthe task. Results show that these methods outperform the un-adapted model for\ndifferent number of training shots in the prompt and a previous approach were\ncalibration is performed without using any adaptation data.\n","authors":["Lautaro Estienne"],"pdf_url":"https://arxiv.org/pdf/2307.06713v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03735v2","updated":"2023-08-08T16:20:18Z","published":"2023-08-07T17:34:58Z","title":"Randomized algorithms for precise measurement of differentially-private,\n personalized recommendations","summary":" Personalized recommendations form an important part of today's internet\necosystem, helping artists and creators to reach interested users, and helping\nusers to discover new and engaging content. However, many users today are\nskeptical of platforms that personalize recommendations, in part due to\nhistorically careless treatment of personal data and data privacy. Now,\nbusinesses that rely on personalized recommendations are entering a new\nparadigm, where many of their systems must be overhauled to be privacy-first.\nIn this article, we propose an algorithm for personalized recommendations that\nfacilitates both precise and differentially-private measurement. We consider\nadvertising as an example application, and conduct offline experiments to\nquantify how the proposed privacy-preserving algorithm affects key metrics\nrelated to user experience, advertiser value, and platform revenue compared to\nthe extremes of both (private) non-personalized and non-private, personalized\nimplementations.\n","authors":["Allegra Laro","Yanqing Chen","Hao He","Babak Aghazadeh"],"pdf_url":"https://arxiv.org/pdf/2308.03735v2.pdf","comment":"Submitted to AAAI"},{"id":"http://arxiv.org/abs/2305.19259v3","updated":"2023-08-08T16:05:55Z","published":"2023-05-30T17:47:27Z","title":"Shuffle SGD is Always Better than SGD: Improved Analysis of SGD with\n Arbitrary Data Orders","summary":" Stochastic Gradient Descent (SGD) algorithms are widely used in optimizing\nneural networks, with Random Reshuffling (RR) and Single Shuffle (SS) being\npopular choices for cycling through random or single permutations of the\ntraining data. However, the convergence properties of these algorithms in the\nnon-convex case are not fully understood. Existing results suggest that, in\nrealistic training scenarios where the number of epochs is smaller than the\ntraining set size, RR may perform worse than SGD.\n In this paper, we analyze a general SGD algorithm that allows for arbitrary\ndata orderings and show improved convergence rates for non-convex functions.\nSpecifically, our analysis reveals that SGD with random and single shuffling is\nalways faster or at least as good as classical SGD with replacement, regardless\nof the number of iterations. Overall, our study highlights the benefits of\nusing SGD with random/single shuffling and provides new insights into its\nconvergence properties for non-convex optimization.\n","authors":["Anastasia Koloskova","Nikita Doikov","Sebastian U. 
Stich","Martin Jaggi"],"pdf_url":"https://arxiv.org/pdf/2305.19259v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.03571v2","updated":"2023-08-08T16:05:01Z","published":"2023-07-07T13:06:12Z","title":"Smoothing the Edges: A General Framework for Smooth Optimization in\n Sparse Regularization using Hadamard Overparametrization","summary":" This paper presents a framework for smooth optimization of objectives with\n$\\ell_q$ and $\\ell_{p,q}$ regularization for (structured) sparsity. Finding\nsolutions to these non-smooth and possibly non-convex problems typically relies\non specialized optimization routines. In contrast, the method studied here is\ncompatible with off-the-shelf (stochastic) gradient descent that is ubiquitous\nin deep learning, thereby enabling differentiable sparse regularization without\napproximations. The proposed optimization transfer comprises an\noverparametrization of selected model parameters followed by a change of\npenalties. In the overparametrized problem, smooth and convex $\\ell_2$\nregularization induces non-smooth and non-convex regularization in the original\nparametrization. We show that the resulting surrogate problem not only has an\nidentical global optimum but also exactly preserves the local minima. This is\nparticularly useful in non-convex regularization, where finding global\nsolutions is NP-hard and local minima often generalize well. We provide an\nintegrative overview that consolidates various literature strands on\nsparsity-inducing parametrizations in a general setting and meaningfully extend\nexisting approaches. The feasibility of our approach is evaluated through\nnumerical experiments, demonstrating its effectiveness by matching or\noutperforming common implementations of convex and non-convex regularizers.\n","authors":["Chris Kolb","Christian L. Müller","Bernd Bischl","David Rügamer"],"pdf_url":"https://arxiv.org/pdf/2307.03571v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04365v1","updated":"2023-08-08T16:04:42Z","published":"2023-08-08T16:04:42Z","title":"SLEM: Machine Learning for Path Modeling and Causal Inference with Super\n Learner Equation Modeling","summary":" Causal inference is a crucial goal of science, enabling researchers to arrive\nat meaningful conclusions regarding the predictions of hypothetical\ninterventions using observational data. Path models, Structural Equation Models\n(SEMs), and, more generally, Directed Acyclic Graphs (DAGs), provide a means to\nunambiguously specify assumptions regarding the causal structure underlying a\nphenomenon. Unlike DAGs, which make very few assumptions about the functional\nand parametric form, SEM assumes linearity. This can result in functional\nmisspecification which prevents researchers from undertaking reliable effect\nsize estimation. In contrast, we propose Super Learner Equation Modeling, a\npath modeling technique integrating machine learning Super Learner ensembles.\nWe empirically demonstrate its ability to provide consistent and unbiased\nestimates of causal effects, its competitive performance for linear models when\ncompared with SEM, and highlight its superiority over SEM when dealing with\nnon-linear relationships. We provide open-source code, and a tutorial notebook\nwith example usage, accentuating the easy-to-use nature of the method.\n","authors":["Matthew J. 
Vowels"],"pdf_url":"https://arxiv.org/pdf/2308.04365v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.16565v2","updated":"2023-08-08T16:01:41Z","published":"2023-03-29T09:47:48Z","title":"PMAA: A Progressive Multi-scale Attention Autoencoder Model for\n High-performance Cloud Removal from Multi-temporal Satellite Imagery","summary":" Satellite imagery analysis plays a pivotal role in remote sensing; however,\ninformation loss due to cloud cover significantly impedes its application.\nAlthough existing deep cloud removal models have achieved notable outcomes,\nthey scarcely consider contextual information. This study introduces a\nhigh-performance cloud removal architecture, termed Progressive Multi-scale\nAttention Autoencoder (PMAA), which concurrently harnesses global and local\ninformation to construct robust contextual dependencies using a novel\nMulti-scale Attention Module (MAM) and a novel Local Interaction Module (LIM).\nPMAA establishes long-range dependencies of multi-scale features using MAM and\nmodulates the reconstruction of fine-grained details utilizing LIM, enabling\nsimultaneous representation of fine- and coarse-grained features at the same\nlevel. With the help of diverse and multi-scale features, PMAA consistently\noutperforms the previous state-of-the-art model CTGAN on two benchmark\ndatasets. Moreover, PMAA boasts considerable efficiency advantages, with only\n0.5% and 14.6% of the parameters and computational complexity of CTGAN,\nrespectively. These comprehensive results underscore PMAA's potential as a\nlightweight cloud removal network suitable for deployment on edge devices to\naccomplish large-scale cloud removal tasks. Our source code and pre-trained\nmodels are available at https://github.com/XavierJiezou/PMAA.\n","authors":["Xuechao Zou","Kai Li","Junliang Xing","Pin Tao","Yachao Cui"],"pdf_url":"https://arxiv.org/pdf/2303.16565v2.pdf","comment":"Accepted by ECAI 2023"},{"id":"http://arxiv.org/abs/2308.04341v1","updated":"2023-08-08T15:38:55Z","published":"2023-08-08T15:38:55Z","title":"Accurate, Explainable, and Private Models: Providing Recourse While\n Minimizing Training Data Leakage","summary":" Machine learning models are increasingly utilized across impactful domains to\npredict individual outcomes. As such, many models provide algorithmic recourse\nto individuals who receive negative outcomes. However, recourse can be\nleveraged by adversaries to disclose private information. This work presents\nthe first attempt at mitigating such attacks. We present two novel methods to\ngenerate differentially private recourse: Differentially Private Model (DPM)\nand Laplace Recourse (LR). Using logistic regression classifiers and real world\nand synthetic datasets, we find that DPM and LR perform well in reducing what\nan adversary can infer, especially at low FPR. 
When training dataset size is\nlarge enough, we find particular success in preventing privacy leakage while\nmaintaining model and recourse accuracy with our novel LR method.\n","authors":["Catherine Huang","Chelse Swoopes","Christina Xiao","Jiaqi Ma","Himabindu Lakkaraju"],"pdf_url":"https://arxiv.org/pdf/2308.04341v1.pdf","comment":"Proceedings of The Second Workshop on New Frontiers in Adversarial\n Machine Learning (AdvML-Frontiers @ ICML 2023)"},{"id":"http://arxiv.org/abs/2308.03629v2","updated":"2023-08-08T15:38:21Z","published":"2023-08-07T14:36:03Z","title":"MedMine: Examining Pre-trained Language Models on Medication Mining","summary":" Automatic medication mining from clinical and biomedical text has become a\npopular topic due to its real impact on healthcare applications and the recent\ndevelopment of powerful language models (LMs). However, fully-automatic\nextraction models still face obstacles to be overcome such that they can be\ndeployed directly into clinical practice for better impacts. Such obstacles\ninclude their imbalanced performances on different entity types and clinical\nevents. In this work, we examine current state-of-the-art pre-trained language\nmodels (PLMs) on such tasks, via fine-tuning including the monolingual model\nMed7 and multilingual large language model (LLM) XLM-RoBERTa. We compare their\nadvantages and drawbacks using historical medication mining shared task data\nsets from n2c2-2018 challenges. We report the findings we get from these\nfine-tuning experiments such that they can facilitate future research on\naddressing them, for instance, how to combine their outputs, merge such models,\nor improve their overall accuracy by ensemble learning and data augmentation.\nMedMine is part of the M3 Initiative \\url{https://github.com/HECTA-UoM/M3}\n","authors":["Haifa Alrdahi","Lifeng Han","Hendrik Šuvalov","Goran Nenadic"],"pdf_url":"https://arxiv.org/pdf/2308.03629v2.pdf","comment":"Open Research Project. 7 pages, 1 figure, 5 tables"},{"id":"http://arxiv.org/abs/2305.12522v2","updated":"2023-08-08T15:22:26Z","published":"2023-05-21T17:46:28Z","title":"P-NOC: Adversarial CAM Generation for Weakly Supervised Semantic\n Segmentation","summary":" To mitigate the necessity for large amounts of supervised segmentation\nannotation sets, multiple Weakly Supervised Semantic Segmentation (WSSS)\nstrategies have been devised. These will often rely on advanced data and model\nregularization strategies to instigate the development of useful properties\n(e.g., prediction completeness and fidelity to semantic boundaries) in\nsegmentation priors, notwithstanding the lack of annotated information. In this\nwork, we first create a strong baseline by analyzing complementary WSSS\ntechniques and regularizing strategies, considering their strengths and\nlimitations. We then propose a new Class-specific Adversarial Erasing strategy,\ncomprising two adversarial CAM generating networks being gradually refined to\nproduce robust semantic segmentation proposals. 
Empirical results suggest that\nour approach induces substantial improvement in the effectiveness of the\nbaseline, resulting in a noticeable improvement over both Pascal VOC 2012 and\nMS COCO 2014 datasets.\n","authors":["Lucas David","Helio Pedrini","Zanoni Dias"],"pdf_url":"https://arxiv.org/pdf/2305.12522v2.pdf","comment":"19 pages, 10 figures"},{"id":"http://arxiv.org/abs/2308.04332v1","updated":"2023-08-08T15:21:30Z","published":"2023-08-08T15:21:30Z","title":"RLHF-Blender: A Configurable Interactive Interface for Learning from\n Diverse Human Feedback","summary":" To use reinforcement learning from human feedback (RLHF) in practical\napplications, it is crucial to learn reward models from diverse sources of\nhuman feedback and to consider human factors involved in providing feedback of\ndifferent types. However, the systematic study of learning from diverse types\nof feedback is held back by limited standardized tooling available to\nresearchers. To bridge this gap, we propose RLHF-Blender, a configurable,\ninteractive interface for learning from human feedback. RLHF-Blender provides a\nmodular experimentation framework and implementation that enables researchers\nto systematically investigate the properties and qualities of human feedback\nfor reward learning. The system facilitates the exploration of various feedback\ntypes, including demonstrations, rankings, comparisons, and natural language\ninstructions, as well as studies considering the impact of human factors on\ntheir effectiveness. We discuss a set of concrete research opportunities\nenabled by RLHF-Blender. More information is available at\nhttps://rlhfblender.info/.\n","authors":["Yannick Metz","David Lindner","Raphaël Baur","Daniel Keim","Mennatallah El-Assady"],"pdf_url":"https://arxiv.org/pdf/2308.04332v1.pdf","comment":"14 pages, 3 figures"},{"id":"http://arxiv.org/abs/2307.07873v3","updated":"2023-08-08T15:13:22Z","published":"2023-07-15T19:20:49Z","title":"Why Does Little Robustness Help? Understanding Adversarial\n Transferability From Surrogate Training","summary":" Adversarial examples (AEs) for DNNs have been shown to be transferable: AEs\nthat successfully fool white-box surrogate models can also deceive other\nblack-box models with different architectures. Although a bunch of empirical\nstudies have provided guidance on generating highly transferable AEs, many of\nthese findings lack explanations and even lead to inconsistent advice. In this\npaper, we take a further step towards understanding adversarial\ntransferability, with a particular focus on surrogate aspects. Starting from\nthe intriguing little robustness phenomenon, where models adversarially trained\nwith mildly perturbed adversarial samples can serve as better surrogates, we\nattribute it to a trade-off between two predominant factors: model smoothness\nand gradient similarity. Our investigations focus on their joint effects,\nrather than their separate correlations with transferability. 
Through a series\nof theoretical and empirical analyses, we conjecture that the data distribution\nshift in adversarial training explains the degradation of gradient similarity.\nBuilding on these insights, we explore the impacts of data augmentation and\ngradient regularization on transferability and identify that the trade-off\ngenerally exists in the various training mechanisms, thus building a\ncomprehensive blueprint for the regulation mechanism behind transferability.\nFinally, we provide a general route for constructing better surrogates to boost\ntransferability which optimizes both model smoothness and gradient similarity\nsimultaneously, e.g., the combination of input gradient regularization and\nsharpness-aware minimization (SAM), validated by extensive experiments. In\nsummary, we call for attention to the united impacts of these two factors for\nlaunching effective transfer attacks, rather than optimizing one while ignoring\nthe other, and emphasize the crucial role of manipulating surrogate models.\n","authors":["Yechao Zhang","Shengshan Hu","Leo Yu Zhang","Junyu Shi","Minghui Li","Xiaogeng Liu","Wei Wan","Hai Jin"],"pdf_url":"https://arxiv.org/pdf/2307.07873v3.pdf","comment":"Accepted by IEEE Symposium on Security and Privacy (Oakland) 2024; 21\n pages, 11 figures, 13 tables"},{"id":"http://arxiv.org/abs/2302.01075v5","updated":"2023-08-08T15:12:42Z","published":"2023-02-02T13:05:27Z","title":"MonoFlow: Rethinking Divergence GANs via the Perspective of Wasserstein\n Gradient Flows","summary":" The conventional understanding of adversarial training in generative\nadversarial networks (GANs) is that the discriminator is trained to estimate a\ndivergence, and the generator learns to minimize this divergence. We argue that\ndespite the fact that many variants of GANs were developed following this\nparadigm, the current theoretical understanding of GANs and their practical\nalgorithms are inconsistent. In this paper, we leverage Wasserstein gradient\nflows which characterize the evolution of particles in the sample space, to\ngain theoretical insights and algorithmic inspiration of GANs. We introduce a\nunified generative modeling framework - MonoFlow: the particle evolution is\nrescaled via a monotonically increasing mapping of the log density ratio. Under\nour framework, adversarial training can be viewed as a procedure first\nobtaining MonoFlow's vector field via training the discriminator and the\ngenerator learns to draw the particle flow defined by the corresponding vector\nfield. We also reveal the fundamental difference between variational divergence\nminimization and adversarial training. This analysis helps us to identify what\ntypes of generator loss functions can lead to the successful training of GANs\nand suggest that GANs may have more loss designs beyond the literature (e.g.,\nnon-saturated loss), as long as they realize MonoFlow. 
Consistent empirical\nstudies are included to validate the effectiveness of our framework.\n","authors":["Mingxuan Yi","Zhanxing Zhu","Song Liu"],"pdf_url":"https://arxiv.org/pdf/2302.01075v5.pdf","comment":"ICML 2023"},{"id":"http://arxiv.org/abs/2308.04314v1","updated":"2023-08-08T15:02:50Z","published":"2023-08-08T15:02:50Z","title":"Cooperative Multi-agent Bandits: Distributed Algorithms with Optimal\n Individual Regret and Constant Communication Costs","summary":" Recently, there has been extensive study of cooperative multi-agent\nmulti-armed bandits where a set of distributed agents cooperatively play the\nsame multi-armed bandit game. The goal is to develop bandit algorithms with the\noptimal group and individual regrets and low communication between agents. The\nprior work tackled this problem using two paradigms: leader-follower and fully\ndistributed algorithms. Prior algorithms in both paradigms achieve the optimal\ngroup regret. The leader-follower algorithms achieve constant communication\ncosts but fail to achieve optimal individual regrets. The state-of-the-art\nfully distributed algorithms achieve optimal individual regrets but fail to\nachieve constant communication costs. This paper presents a simple yet\neffective communication policy and integrates it into a learning algorithm for\ncooperative bandits. Our algorithm achieves the best of both paradigms: optimal\nindividual regret and constant communication costs.\n","authors":["Lin Yang","Xuchuang Wang","Mohammad Hajiesmaili","Lijun Zhang","John C. S. Lui","Don Towsley"],"pdf_url":"https://arxiv.org/pdf/2308.04314v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12344v2","updated":"2023-08-08T14:52:39Z","published":"2023-07-23T14:43:17Z","title":"Right for the Wrong Reason: Can Interpretable ML Techniques Detect\n Spurious Correlations?","summary":" While deep neural network models offer unmatched classification performance,\nthey are prone to learning spurious correlations in the data. Such dependencies\non confounding information can be difficult to detect using performance metrics\nif the test data comes from the same distribution as the training data.\nInterpretable ML methods such as post-hoc explanations or inherently\ninterpretable classifiers promise to identify faulty model reasoning. However,\nthere is mixed evidence whether many of these techniques are actually able to\ndo so. In this paper, we propose a rigorous evaluation strategy to assess an\nexplanation technique's ability to correctly identify spurious correlations.\nUsing this strategy, we evaluate five post-hoc explanation techniques and one\ninherently interpretable method for their ability to detect three types of\nartificially added confounders in a chest x-ray diagnosis task. We find that\nthe post-hoc technique SHAP, as well as the inherently interpretable Attri-Net\nprovide the best performance and can be used to reliably identify faulty model\nbehavior.\n","authors":["Susu Sun","Lisa M. Koch","Christian F. Baumgartner"],"pdf_url":"https://arxiv.org/pdf/2307.12344v2.pdf","comment":"Accepted to MICCAI 2023"},{"id":"http://arxiv.org/abs/2207.07271v3","updated":"2023-08-08T14:51:47Z","published":"2022-07-15T03:37:59Z","title":"Set-based value operators for non-stationary Markovian environments","summary":" This paper analyzes finite state Markov Decision Processes (MDPs) with\nuncertain parameters in compact sets and re-examines results from robust MDP\nvia set-based fixed point theory. 
To this end, we generalize the Bellman and\npolicy evaluation operators to contracting operators on the value function\nspace and denote them as \\emph{value operators}. We lift these value operators\nto act on \\emph{sets} of value functions and denote them as \\emph{set-based\nvalue operators}. We prove that the set-based value operators are\n\\emph{contractions} in the space of compact value function sets. Leveraging\ninsights from set theory, we generalize the rectangularity condition in classic\nrobust MDP literature to a containment condition for all value operators, which\nis weaker and can be applied to a larger set of parameter-uncertain MDPs and\ncontracting operators in dynamic programming. We prove that both the\nrectangularity condition and the containment condition sufficiently ensure that\nthe set-based value operator's fixed point set contains its own extrema\nelements. For convex and compact sets of uncertain MDP parameters, we show\nequivalence between the classic robust value function and the supremum of the\nfixed point set of the set-based Bellman operator. Under dynamically changing\nMDP parameters in compact sets, we prove a set convergence result for value\niteration, which otherwise may not converge to a single value function.\nFinally, we derive novel guarantees for probabilistic path-planning problems in\nplanet exploration and stratospheric station-keeping.\n","authors":["Sarah H. Q. Li","Assalé Adjé","Pierre-Loïc Garoche","Behçet Açıkmeşe"],"pdf_url":"https://arxiv.org/pdf/2207.07271v3.pdf","comment":"17 pages, 11 figures, 1 table"},{"id":"http://arxiv.org/abs/2303.00500v2","updated":"2023-08-08T14:50:50Z","published":"2023-03-01T13:32:55Z","title":"Inherently Interpretable Multi-Label Classification Using Class-Specific\n Counterfactuals","summary":" Interpretability is essential for machine learning algorithms in high-stakes\napplication fields such as medical image analysis. However, high-performing\nblack-box neural networks do not provide explanations for their predictions,\nwhich can lead to mistrust and suboptimal human-ML collaboration. Post-hoc\nexplanation techniques, which are widely used in practice, have been shown to\nsuffer from severe conceptual problems. Furthermore, as we show in this paper,\ncurrent explanation techniques do not perform adequately in the multi-label\nscenario, in which multiple medical findings may co-occur in a single image. We\npropose Attri-Net, an inherently interpretable model for multi-label\nclassification. Attri-Net is a powerful classifier that provides transparent,\ntrustworthy, and human-understandable explanations. The model first generates\nclass-specific attribution maps based on counterfactuals to identify which\nimage regions correspond to certain medical findings. Then a simple logistic\nregression classifier is used to make predictions based solely on these\nattribution maps. We compare Attri-Net to five post-hoc explanation techniques\nand one inherently interpretable classifier on three chest X-ray datasets. We\nfind that Attri-Net produces high-quality multi-label explanations consistent\nwith clinical knowledge and has comparable classification performance to\nstate-of-the-art classification models.\n","authors":["Susu Sun","Stefano Woerner","Andreas Maier","Lisa M. Koch","Christian F. 
Baumgartner"],"pdf_url":"https://arxiv.org/pdf/2303.00500v2.pdf","comment":"Accepted to MIDL 2023"},{"id":"http://arxiv.org/abs/2308.04304v1","updated":"2023-08-08T14:50:05Z","published":"2023-08-08T14:50:05Z","title":"The Model Inversion Eavesdropping Attack in Semantic Communication\n Systems","summary":" In recent years, semantic communication has been a popular research topic for\nits superiority in communication efficiency. As semantic communication relies\non deep learning to extract meaning from raw messages, it is vulnerable to\nattacks targeting deep learning models. In this paper, we introduce the model\ninversion eavesdropping attack (MIEA) to reveal the risk of privacy leaks in\nthe semantic communication system. In MIEA, the attacker first eavesdrops the\nsignal being transmitted by the semantic communication system and then performs\nmodel inversion attack to reconstruct the raw message, where both the white-box\nand black-box settings are considered. Evaluation results show that MIEA can\nsuccessfully reconstruct the raw message with good quality under different\nchannel conditions. We then propose a defense method based on random\npermutation and substitution to defend against MIEA in order to achieve secure\nsemantic communication. Our experimental results demonstrate the effectiveness\nof the proposed defense method in preventing MIEA.\n","authors":["Yuhao Chen","Qianqian Yang","Zhiguo Shi","Jiming Chen"],"pdf_url":"https://arxiv.org/pdf/2308.04304v1.pdf","comment":"Accepted by 2023 IEEE Global Communications Conference (GLOBECOM)"},{"id":"http://arxiv.org/abs/2105.02796v2","updated":"2023-08-08T14:34:33Z","published":"2021-05-06T16:41:04Z","title":"Practical and Rigorous Uncertainty Bounds for Gaussian Process\n Regression","summary":" Gaussian Process Regression is a popular nonparametric regression method\nbased on Bayesian principles that provides uncertainty estimates for its\npredictions. However, these estimates are of a Bayesian nature, whereas for\nsome important applications, like learning-based control with safety\nguarantees, frequentist uncertainty bounds are required. Although such rigorous\nbounds are available for Gaussian Processes, they are too conservative to be\nuseful in applications. This often leads practitioners to replacing these\nbounds by heuristics, thus breaking all theoretical guarantees. To address this\nproblem, we introduce new uncertainty bounds that are rigorous, yet practically\nuseful at the same time. In particular, the bounds can be explicitly evaluated\nand are much less conservative than state of the art results. Furthermore, we\nshow that certain model misspecifications lead to only graceful degradation. We\ndemonstrate these advantages and the usefulness of our results for\nlearning-based control with numerical examples.\n","authors":["Christian Fiedler","Carsten W. Scherer","Sebastian Trimpe"],"pdf_url":"https://arxiv.org/pdf/2105.02796v2.pdf","comment":"Contains supplementary material and corrections to the original\n version"},{"id":"http://arxiv.org/abs/2212.04780v3","updated":"2023-08-08T14:30:05Z","published":"2022-12-09T11:18:40Z","title":"Genie: Show Me the Data for Quantization","summary":" Zero-shot quantization is a promising approach for developing lightweight\ndeep neural networks when data is inaccessible owing to various reasons,\nincluding cost and issues related to privacy. 
By exploiting the learned\nparameters ($\\mu$ and $\\sigma$) of batch normalization layers in an\nFP32-pre-trained model, zero-shot quantization schemes focus on generating\nsynthetic data. Subsequently, they distill knowledge from the pre-trained model\n(teacher) to the quantized model (student) such that the quantized model can be\noptimized with the synthetic dataset. However, thus far, zero-shot quantization\nhas primarily been discussed in the context of quantization-aware training\nmethods, which require task-specific losses and long-term optimization as much\nas retraining. We thus introduce a post-training quantization scheme for\nzero-shot quantization that produces high-quality quantized networks within a\nfew hours. Furthermore, we propose a framework called Genie~that generates data\nsuited for quantization. With the data synthesized by Genie, we can produce\nrobust quantized models without real datasets, which is comparable to few-shot\nquantization. We also propose a post-training quantization algorithm to enhance\nthe performance of quantized models. By combining them, we can bridge the gap\nbetween zero-shot and few-shot quantization while significantly improving the\nquantization performance compared to that of existing approaches. In other\nwords, we can obtain a unique state-of-the-art zero-shot quantization approach.\nThe code is available at \\url{https://github.com/SamsungLabs/Genie}.\n","authors":["Yongkweon Jeon","Chungman Lee","Ho-young Kim"],"pdf_url":"https://arxiv.org/pdf/2212.04780v3.pdf","comment":"Accepted by CVPR 2023, https://github.com/SamsungLabs/Genie"},{"id":"http://arxiv.org/abs/2308.04286v1","updated":"2023-08-08T14:29:35Z","published":"2023-08-08T14:29:35Z","title":"Comparative Analysis of the wav2vec 2.0 Feature Extractor","summary":" Automatic speech recognition (ASR) systems typically use handcrafted feature\nextraction pipelines. To avoid their inherent information loss and to achieve\nmore consistent modeling from speech to transcribed text, neural raw waveform\nfeature extractors (FEs) are an appealing approach. Also the wav2vec 2.0 model,\nwhich has recently gained large popularity, uses a convolutional FE which\noperates directly on the speech waveform. However, it is not yet studied\nextensively in the literature. In this work, we study its capability to replace\nthe standard feature extraction methods in a connectionist temporal\nclassification (CTC) ASR model and compare it to an alternative neural FE. We\nshow that both are competitive with traditional FEs on the LibriSpeech\nbenchmark and analyze the effect of the individual components. Furthermore, we\nanalyze the learned filters and show that the most important information for\nthe ASR system is obtained by a set of bandpass filters.\n","authors":["Peter Vieting","Ralf Schlüter","Hermann Ney"],"pdf_url":"https://arxiv.org/pdf/2308.04286v1.pdf","comment":"Accepted at ITG 2023"},{"id":"http://arxiv.org/abs/2308.04275v1","updated":"2023-08-08T14:17:17Z","published":"2023-08-08T14:17:17Z","title":"In-Context Alignment: Chat with Vanilla Language Models Before\n Fine-Tuning","summary":" In this note, we explore inference-time alignment through in-context\nlearning. We consider a vanilla pretrained language model Llama-2 before any\nfine-tuning and retrieve an average of 9 demonstration alignment examples when\nthe model is prompted to follow chat-style instructions. 
Compared to direct\nprompting, the in-context alignment without changing model weights leads to a\n7x increase in win-rate w.r.t. the text-davinci-003 model from OpenAI, making\nthe vanilla language model comparable to strong baselines with alignment\nfine-tuning.\n","authors":["Xiaochuang Han"],"pdf_url":"https://arxiv.org/pdf/2308.04275v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2101.08130v2","updated":"2023-08-08T14:11:40Z","published":"2021-01-19T16:14:02Z","title":"Machine learning for rapid discovery of laminar flow channel wall\n modifications that enhance heat transfer","summary":" Numerical simulation of fluids plays an essential role in modeling many\nphysical phenomena, which enables technological advancements, contributes to\nsustainable practices, and expands our understanding of various natural and\nengineered systems. The calculation of heat transfer in fluid flow in simple\nflat channels is a relatively easy task for various simulation methods.\nHowever, once the channel geometry becomes more complex, numerical simulations\nbecome a bottleneck in optimizing wall geometries. We present a combination of\naccurate numerical simulations of arbitrary, flat, and non-flat channels and\nmachine learning models predicting drag coefficient and Stanton number. We show\nthat convolutional neural networks (CNN) can accurately predict the target\nproperties at a fraction of the time of numerical simulations. We use the CNN\nmodels in a virtual high-throughput screening approach to explore a large\nnumber of possible, randomly generated wall architectures. Data Augmentation\nwas applied to existing geometries data to add generated new training data\nwhich have the same number of parameters of heat transfer to improve the\nmodel's generalization. The general approach is not only applicable to simple\nflow setups as presented here but can be extended to more complex tasks, such\nas multiphase or even reactive unit operations in chemical engineering.\n","authors":["Yuri Koide","Arjun J. Kaithakkal","Matthias Schniewind","Bradley P. Ladewig","Alexander Stroh","Pascal Friederich"],"pdf_url":"https://arxiv.org/pdf/2101.08130v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04268v1","updated":"2023-08-08T14:09:33Z","published":"2023-08-08T14:09:33Z","title":"Teacher-Student Architecture for Knowledge Distillation: A Survey","summary":" Although Deep neural networks (DNNs) have shown a strong capacity to solve\nlarge-scale problems in many areas, such DNNs are hard to be deployed in\nreal-world systems due to their voluminous parameters. To tackle this issue,\nTeacher-Student architectures were proposed, where simple student networks with\na few parameters can achieve comparable performance to deep teacher networks\nwith many parameters. Recently, Teacher-Student architectures have been\neffectively and widely embraced on various knowledge distillation (KD)\nobjectives, including knowledge compression, knowledge expansion, knowledge\nadaptation, and knowledge enhancement. With the help of Teacher-Student\narchitectures, current studies are able to achieve multiple distillation\nobjectives through lightweight and generalized student networks. Different from\nexisting KD surveys that primarily focus on knowledge compression, this survey\nfirst explores Teacher-Student architectures across multiple distillation\nobjectives. This survey presents an introduction to various knowledge\nrepresentations and their corresponding optimization objectives. 
Additionally,\nwe provide a systematic overview of Teacher-Student architectures with\nrepresentative learning algorithms and effective distillation schemes. This\nsurvey also summarizes recent applications of Teacher-Student architectures\nacross multiple purposes, including classification, recognition, generation,\nranking, and regression. Lastly, potential research directions in KD are\ninvestigated, focusing on architecture design, knowledge quality, and\ntheoretical studies of regression-based learning, respectively. Through this\ncomprehensive survey, industry practitioners and the academic community can\ngain valuable insights and guidelines for effectively designing, learning, and\napplying Teacher-Student architectures on various distillation objectives.\n","authors":["Chengming Hu","Xuan Li","Dan Liu","Haolun Wu","Xi Chen","Ju Wang","Xue Liu"],"pdf_url":"https://arxiv.org/pdf/2308.04268v1.pdf","comment":"20 pages. arXiv admin note: substantial text overlap with\n arXiv:2210.17332"},{"id":"http://arxiv.org/abs/2308.04263v1","updated":"2023-08-08T13:59:56Z","published":"2023-08-08T13:59:56Z","title":"BarlowRL: Barlow Twins for Data-Efficient Reinforcement Learning","summary":" This paper introduces BarlowRL, a data-efficient reinforcement learning agent\nthat combines the Barlow Twins self-supervised learning framework with DER\n(Data-Efficient Rainbow) algorithm. BarlowRL outperforms both DER and its\ncontrastive counterpart CURL on the Atari 100k benchmark. BarlowRL avoids\ndimensional collapse by enforcing information spread to the whole space. This\nhelps RL algorithms to utilize uniformly spread state representation that\neventually results in a remarkable performance. The integration of Barlow Twins\nwith DER enhances data efficiency and achieves superior performance in the RL\ntasks. BarlowRL demonstrates the potential of incorporating self-supervised\nlearning techniques to improve RL algorithms.\n","authors":["Omer Veysel Cagatan"],"pdf_url":"https://arxiv.org/pdf/2308.04263v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04262v1","updated":"2023-08-08T13:59:16Z","published":"2023-08-08T13:59:16Z","title":"SDLFormer: A Sparse and Dense Locality-enhanced Transformer for\n Accelerated MR Image Reconstruction","summary":" Transformers have emerged as viable alternatives to convolutional neural\nnetworks owing to their ability to learn non-local region relationships in the\nspatial domain. The self-attention mechanism of the transformer enables\ntransformers to capture long-range dependencies in the images, which might be\ndesirable for accelerated MRI image reconstruction as the effect of\nundersampling is non-local in the image domain. Despite its computational\nefficiency, the window-based transformers suffer from restricted receptive\nfields as the dependencies are limited to within the scope of the image\nwindows. We propose a window-based transformer network that integrates dilated\nattention mechanism and convolution for accelerated MRI image reconstruction.\nThe proposed network consists of dilated and dense neighborhood attention\ntransformers to enhance the distant neighborhood pixel relationship and\nintroduce depth-wise convolutions within the transformer module to learn\nlow-level translation invariant features for accelerated MRI image\nreconstruction. The proposed model is trained in a self-supervised manner. 
We\nperform extensive experiments for multi-coil MRI acceleration for coronal PD,\ncoronal PDFS and axial T2 contrasts with 4x and 5x under-sampling in\nself-supervised learning based on k-space splitting. We compare our method\nagainst other reconstruction architectures and the parallel domain\nself-supervised learning baseline. Results show that the proposed model\nexhibits improvement margins of (i) around 1.40 dB in PSNR and around 0.028 in\nSSIM on average over other architectures (ii) around 1.44 dB in PSNR and around\n0.029 in SSIM over parallel domain self-supervised learning. The code is\navailable at https://github.com/rahul-gs-16/sdlformer.git\n","authors":["Rahul G. S.","Sriprabha Ramnarayanan","Mohammad Al Fahim","Keerthi Ram","Preejith S. P","Mohanasankar Sivaprakasam"],"pdf_url":"https://arxiv.org/pdf/2308.04262v1.pdf","comment":"Accepted at MICCAI workshop MILLanD 2023 Medical Image Learning with\n noisy and Limited Data"},{"id":"http://arxiv.org/abs/2308.04258v1","updated":"2023-08-08T13:46:55Z","published":"2023-08-08T13:46:55Z","title":"Advancing Natural-Language Based Audio Retrieval with PaSST and Large\n Audio-Caption Data Sets","summary":" This work presents a text-to-audio-retrieval system based on pre-trained text\nand spectrogram transformers. Our method projects recordings and textual\ndescriptions into a shared audio-caption space in which related examples from\ndifferent modalities are close. Through a systematic analysis, we examine how\neach component of the system influences retrieval performance. As a result, we\nidentify two key components that play a crucial role in driving performance:\nthe self-attention-based audio encoder for audio embedding and the utilization\nof additional human-generated and synthetic data sets during pre-training. We\nfurther experimented with augmenting ClothoV2 captions with available keywords\nto increase their variety; however, this only led to marginal improvements. Our\nsystem ranked first in the 2023's DCASE Challenge, and it outperforms the\ncurrent state of the art on the ClothoV2 benchmark by 5.6 pp. mAP@10.\n","authors":["Paul Primus","Khaled Koutini","Gerhard Widmer"],"pdf_url":"https://arxiv.org/pdf/2308.04258v1.pdf","comment":"submitted to DCASE Workshop 2023"},{"id":"http://arxiv.org/abs/2307.11661v2","updated":"2023-08-08T13:44:12Z","published":"2023-07-21T15:49:59Z","title":"Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts","summary":" Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have\nrevolutionized visual representation learning by providing good performance on\ndownstream datasets. VLMs are 0-shot adapted to a downstream dataset by\ndesigning prompts that are relevant to the dataset. Such prompt engineering\nmakes use of domain expertise and a validation dataset. Meanwhile, recent\ndevelopments in generative pretrained models like GPT-4 mean they can be used\nas advanced internet search tools. They can also be manipulated to provide\nvisual information in any structure. In this work, we show that GPT-4 can be\nused to generate text that is visually descriptive and how this can be used to\nadapt CLIP to downstream tasks. 
We show considerable improvements in 0-shot\ntransfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD\n(~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt.\nWe also design a simple few-shot adapter that learns to choose the best\npossible sentences to construct generalizable classifiers that outperform the\nrecently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized\nfine-grained datasets. The code, prompts, and auxiliary text dataset is\navailable at https://github.com/mayug/VDT-Adapter.\n","authors":["Mayug Maniparambil","Chris Vorster","Derek Molloy","Noel Murphy","Kevin McGuinness","Noel E. O'Connor"],"pdf_url":"https://arxiv.org/pdf/2307.11661v2.pdf","comment":"Paper accepted at ICCV-W 2023. V2 contains additional comparisons\n with concurrent works"},{"id":"http://arxiv.org/abs/2308.04237v1","updated":"2023-08-08T13:03:36Z","published":"2023-08-08T13:03:36Z","title":"Federated Inference with Reliable Uncertainty Quantification over\n Wireless Channels via Conformal Prediction","summary":" Consider a setting in which devices and a server share a pre-trained model.\nThe server wishes to make an inference on a new input given the model. Devices\nhave access to data, previously not used for training, and can communicate to\nthe server over a common wireless channel. If the devices have no access to the\nnew input, can communication from devices to the server enhance the quality of\nthe inference decision at the server? Recent work has introduced federated\nconformal prediction (CP), which leverages devices-to-server communication to\nimprove the reliability of the server's decision. With federated CP, devices\ncommunicate to the server information about the loss accrued by the shared\npre-trained model on the local data, and the server leverages this information\nto calibrate a decision interval, or set, so that it is guaranteed to contain\nthe correct answer with a pre-defined target reliability level. Previous work\nassumed noise-free communication, whereby devices can communicate a single real\nnumber to the server. In this paper, we study for the first time federated CP\nin a wireless setting. We introduce a novel protocol, termed wireless federated\nconformal prediction (WFCP), which builds on type-based multiple access (TBMA)\nand on a novel quantile correction strategy. WFCP is proved to provide formal\nreliability guarantees in terms of coverage of the predicted set produced by\nthe server. Using numerical results, we demonstrate the significant advantages\nof WFCP against digital implementations of existing federated CP schemes,\nespecially in regimes with limited communication resources and/or large number\nof devices.\n","authors":["Meiyi Zhu","Matteo Zecchin","Sangwoo Park","Caili Guo","Chunyan Feng","Osvaldo Simeone"],"pdf_url":"https://arxiv.org/pdf/2308.04237v1.pdf","comment":"33 pages, 6 figures"},{"id":"http://arxiv.org/abs/2304.08134v3","updated":"2023-08-08T12:57:36Z","published":"2023-04-17T10:29:26Z","title":"Tackling Face Verification Edge Cases: In-Depth Analysis and\n Human-Machine Fusion Approach","summary":" Nowadays, face recognition systems surpass human performance on several\ndatasets. However, there are still edge cases that the machine can't correctly\nclassify. This paper investigates the effect of a combination of machine and\nhuman operators in the face verification task. 
First, we look closer at the\nedge cases for several state-of-the-art models to discover common datasets'\nchallenging settings. Then, we conduct a study with 60 participants on these\nselected tasks with humans and provide an extensive analysis. Finally, we\ndemonstrate that combining machine and human decisions can further improve the\nperformance of state-of-the-art face verification systems on various benchmark\ndatasets. Code and data are publicly available on GitHub.\n","authors":["Martin Knoche","Gerhard Rigoll"],"pdf_url":"https://arxiv.org/pdf/2304.08134v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.14353v3","updated":"2023-08-08T12:53:23Z","published":"2023-02-28T07:11:55Z","title":"A semantic backdoor attack against Graph Convolutional Networks","summary":" Graph convolutional networks (GCNs) have been very effective in addressing\nthe issue of various graph-structured related tasks, such as node\nclassification and graph classification. However, recent research has shown\nthat GCNs are vulnerable to a new type of threat called a backdoor attack,\nwhere the adversary can inject a hidden backdoor into GCNs so that the attacked\nmodel performs well on benign samples, but its prediction will be maliciously\nchanged to the attacker-specified target label if the hidden backdoor is\nactivated by the attacker-defined trigger. In this paper, we investigate\nwhether such semantic backdoor attacks are possible for GCNs and propose a\nsemantic backdoor attack against GCNs (SBAG) under the context of graph\nclassification to reveal the existence of this security vulnerability in GCNs.\nSBAG uses a certain type of node in the samples as a backdoor trigger and\ninjects a hidden backdoor into GCN models by poisoning training data. The\nbackdoor will be activated, and the GCN models will give malicious\nclassification results specified by the attacker even on unmodified samples as\nlong as the samples contain enough trigger nodes. We evaluate SBAG on four\ngraph datasets. The experimental results indicate that SBAG can achieve attack\nsuccess rates of approximately 99.9% and over 82% for two kinds of attack\nsamples, respectively, with poisoning rates of less than 5%.\n","authors":["Jiazhu Dai","Zhipeng Xiong"],"pdf_url":"https://arxiv.org/pdf/2302.14353v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04226v1","updated":"2023-08-08T12:45:01Z","published":"2023-08-08T12:45:01Z","title":"OpinionConv: Conversational Product Search with Grounded Opinions","summary":" When searching for products, the opinions of others play an important role in\nmaking informed decisions. Subjective experiences about a product can be a\nvaluable source of information. This is also true in sales conversations, where\na customer and a sales assistant exchange facts and opinions about products.\nHowever, training an AI for such conversations is complicated by the fact that\nlanguage models do not possess authentic opinions for their lack of real-world\nexperience. We address this problem by leveraging product reviews as a rich\nsource of product opinions to ground conversational AI in true subjective\nnarratives. With OpinionConv, we develop the first conversational AI for\nsimulating sales conversations. To validate the generated conversations, we\nconduct several user studies showing that the generated opinions are perceived\nas realistic. 
Our assessors also confirm the importance of opinions as an\ninformative basis for decision-making.\n","authors":["Vahid Sadiri Javadi","Martin Potthast","Lucie Flek"],"pdf_url":"https://arxiv.org/pdf/2308.04226v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04220v1","updated":"2023-08-08T12:34:32Z","published":"2023-08-08T12:34:32Z","title":"Semantic Interpretation and Validation of Graph Attention-based\n Explanations for GNN Models","summary":" In this work, we propose a methodology for investigating the application of\nsemantic attention to enhance the explainability of Graph Neural Network\n(GNN)-based models, introducing semantically-informed perturbations and\nestablishing a correlation between predicted feature-importance weights and\nmodel accuracy. Graph Deep Learning (GDL) has emerged as a promising field for\ntasks like scene interpretation, leveraging flexible graph structures to\nconcisely describe complex features and relationships. As traditional\nexplainability methods used in eXplainable AI (XAI) cannot be directly applied\nto such structures, graph-specific approaches are introduced. Attention\nmechanisms have demonstrated their efficacy in estimating the importance of\ninput features in deep learning models and thus have been previously employed\nto provide feature-based explanations for GNN predictions. Building upon these\ninsights, we extend existing attention-based graph-explainability methods\ninvestigating the use of attention weights as importance indicators of\nsemantically sorted feature sets. Through analysing the behaviour of predicted\nattention-weights distribution in correlation with model accuracy, we gain\nvaluable insights into feature importance with respect to the behaviour of the\nGNN model. We apply our methodology to a lidar pointcloud estimation model\nsuccessfully identifying key semantic classes that contribute to enhanced\nperformance effectively generating reliable post-hoc semantic explanations.\n","authors":["Efimia Panagiotaki","Daniele De Martini","Lars Kunze"],"pdf_url":"https://arxiv.org/pdf/2308.04220v1.pdf","comment":"6 pages, 4 figures"},{"id":"http://arxiv.org/abs/2211.07909v2","updated":"2023-08-08T12:30:03Z","published":"2022-11-15T05:29:58Z","title":"Selective Memory Recursive Least Squares: Recast Forgetting into Memory\n in RBF Neural Network Based Real-Time Learning","summary":" In radial basis function neural network (RBFNN) based real-time learning\ntasks, forgetting mechanisms are widely used such that the neural network can\nkeep its sensitivity to new data. However, with forgetting mechanisms, some\nuseful knowledge will get lost simply because they are learned a long time ago,\nwhich we refer to as the passive knowledge forgetting phenomenon. To address\nthis problem, this paper proposes a real-time training method named selective\nmemory recursive least squares (SMRLS) in which the classical forgetting\nmechanisms are recast into a memory mechanism. Different from the forgetting\nmechanism, which mainly evaluates the importance of samples according to the\ntime when samples are collected, the memory mechanism evaluates the importance\nof samples through both temporal and spatial distribution of samples. With\nSMRLS, the input space of the RBFNN is evenly divided into a finite number of\npartitions and a synthesized objective function is developed using synthesized\nsamples from each partition. 
In addition to the current approximation error,\nthe neural network also updates its weights according to the recorded data from\nthe partition being visited. Compared with classical training methods including\nthe forgetting factor recursive least squares (FFRLS) and stochastic gradient\ndescent (SGD) methods, SMRLS achieves improved learning speed and\ngeneralization capability, which are demonstrated by corresponding simulation\nresults.\n","authors":["Yiming Fei","Jiangang Li","Yanan Li"],"pdf_url":"https://arxiv.org/pdf/2211.07909v2.pdf","comment":"12 pages, 15 figures"},{"id":"http://arxiv.org/abs/2308.04212v1","updated":"2023-08-08T12:22:09Z","published":"2023-08-08T12:22:09Z","title":"Varying-coefficients for regional quantile via KNN-based LASSO with\n applications to health outcome study","summary":" Health outcomes, such as body mass index and cholesterol levels, are known to\nbe dependent on age and exhibit varying effects with their associated risk\nfactors. In this paper, we propose a novel framework for dynamic modeling of\nthe associations between health outcomes and risk factors using\nvarying-coefficients (VC) regional quantile regression via K-nearest neighbors\n(KNN) fused Lasso, which captures the time-varying effects of age. The proposed\nmethod has strong theoretical properties, including a tight estimation error\nbound and the ability to detect exact clustered patterns under certain\nregularity conditions. To efficiently solve the resulting optimization problem,\nwe develop an alternating direction method of multipliers (ADMM) algorithm. Our\nempirical results demonstrate the efficacy of the proposed method in capturing\nthe complex age-dependent associations between health outcomes and their risk\nfactors.\n","authors":["Seyoung Park","Eun Ryung Lee","Hyokyoung G. Hong"],"pdf_url":"https://arxiv.org/pdf/2308.04212v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2006.06926v4","updated":"2023-08-08T11:45:08Z","published":"2020-06-12T03:19:48Z","title":"Learning Bayesian Networks with Annealing Machine","summary":" Recent studies have reported that annealing machines are capable of solving\ncombinatorial optimization problems with high accuracy. Annealing machines can\npotentially be applied to score-based Bayesian network structure learning.\nHowever, the bit capacity of an annealing machine is currently limited. To\nutilize the annealing technology, converting score-based learning problems into\nquadratic unconstrained binary optimizations within the bit capacity is\nnecessary. In this paper, we propose an efficient conversion method with the\nadvanced identification of candidate parent sets and their decomposition. We\nalso provide an integer programming problem to find the decomposition that\nminimizes the number of required bits. Experimental results on $7$ benchmark\ndatasets with variables from $75$ to $223$ show that our approach requires less\nbits than the $100$K bit capacity of the fourth-generation Fujitsu Digital\nAnnealer, a fully coupled annealing machine developed with semiconductor\ntechnology. Moreover, we demonstrate that the Digital Annealer with our\nconversion method outperforms existing algorithms on score maximization. 
These\nresults highlight the utility of annealing processors in learning Bayesian\nnetworks.\n","authors":["Yuta Shikuri"],"pdf_url":"https://arxiv.org/pdf/2006.06926v4.pdf","comment":"13 pages, 5 tables, 3 figures, NeurIPS 2023 (under review)"},{"id":"http://arxiv.org/abs/2303.00286v3","updated":"2023-08-08T11:34:24Z","published":"2023-03-01T07:25:28Z","title":"Treat Different Negatives Differently: Enriching Loss Functions with\n Domain and Range Constraints for Link Prediction","summary":" Knowledge graph embedding models (KGEMs) are used for various tasks related\nto knowledge graphs (KGs), including link prediction. They are trained with\nloss functions that are computed considering a batch of scored triples and\ntheir corresponding labels. Traditional approaches consider the label of a\ntriple to be either true or false. However, recent works suggest that all\nnegative triples should not be valued equally. In line with this recent\nassumption, we posit that negative triples that are semantically valid w.r.t.\ndomain and range constraints might be high-quality negative triples. As such,\nloss functions should treat them differently from semantically invalid negative\nones. To this aim, we propose semantic-driven versions for the three main loss\nfunctions for link prediction. In an extensive and controlled experimental\nsetting, we show that the proposed loss functions systematically provide\nsatisfying results on three public benchmark KGs underpinned with different\nschemas, which demonstrates both the generality and superiority of our proposed\napproach. In fact, the proposed loss functions do (1) lead to better MRR and\nHits@10 values, (2) drive KGEMs towards better semantic awareness as measured\nby the Sem@K metric. This highlights that semantic information globally\nimproves KGEMs, and thus should be incorporated into loss functions. Domains\nand ranges of relations being largely available in schema-defined KGs, this\nmakes our approach both beneficial and widely usable in practice.\n","authors":["Nicolas Hubert","Pierre Monnin","Armelle Brun","Davy Monticolo"],"pdf_url":"https://arxiv.org/pdf/2303.00286v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04185v1","updated":"2023-08-08T11:10:42Z","published":"2023-08-08T11:10:42Z","title":"Iterative Sketching for Secure Coded Regression","summary":" In this work, we propose methods for speeding up linear regression\ndistributively, while ensuring security. We leverage randomized sketching\ntechniques, and improve straggler resilience in asynchronous systems.\nSpecifically, we apply a random orthonormal matrix and then subsample\n\\textit{blocks}, to simultaneously secure the information and reduce the\ndimension of the regression problem. In our setup, the transformation\ncorresponds to an encoded encryption in an \\textit{approximate gradient coding\nscheme}, and the subsampling corresponds to the responses of the non-straggling\nworkers; in a centralized coded computing network. This results in a\ndistributive \\textit{iterative sketching} approach for an $\\ell_2$-subspace\nembedding, \\textit{i.e.} a new sketch is considered at each iteration. We also\nfocus on the special case of the \\textit{Subsampled Randomized Hadamard\nTransform}, which we generalize to block sampling; and discuss how it can be\nmodified in order to secure the data.\n","authors":["Neophytos Charalambides","Hessam Mahdavifar","Mert Pilanci","Alfred O. Hero III"],"pdf_url":"https://arxiv.org/pdf/2308.04185v1.pdf","comment":"28 pages, 7 figures. 
arXiv admin note: substantial text overlap with\n arXiv:2201.08522"},{"id":"http://arxiv.org/abs/2111.10275v3","updated":"2023-08-08T11:05:04Z","published":"2021-11-19T15:25:06Z","title":"Composite Goodness-of-fit Tests with Kernels","summary":" Model misspecification can create significant challenges for the\nimplementation of probabilistic models, and this has led to development of a\nrange of robust methods which directly account for this issue. However, whether\nthese more involved methods are required will depend on whether the model is\nreally misspecified, and there is a lack of generally applicable methods to\nanswer this question. In this paper, we propose one such method. More\nprecisely, we propose kernel-based hypothesis tests for the challenging\ncomposite testing problem, where we are interested in whether the data comes\nfrom any distribution in some parametric family. Our tests make use of minimum\ndistance estimators based on the maximum mean discrepancy and the kernel Stein\ndiscrepancy. They are widely applicable, including whenever the density of the\nparametric model is known up to normalisation constant, or if the model takes\nthe form of a simulator. As our main result, we show that we are able to\nestimate the parameter and conduct our test on the same data (without data\nsplitting), while maintaining a correct test level. Our approach is illustrated\non a range of problems, including testing for goodness-of-fit of an\nunnormalised non-parametric density model, and an intractable generative model\nof a biological cellular network.\n","authors":["Oscar Key","Arthur Gretton","François-Xavier Briol","Tamara Fernandez"],"pdf_url":"https://arxiv.org/pdf/2111.10275v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04180v1","updated":"2023-08-08T10:42:33Z","published":"2023-08-08T10:42:33Z","title":"Studying Socially Unacceptable Discourse Classification (SUD) through\n different eyes: \"Are we on the same page ?\"","summary":" We study Socially Unacceptable Discourse (SUD) characterization and detection\nin online text. We first build and present a novel corpus that contains a large\nvariety of manually annotated texts from different online sources used so far\nin state-of-the-art Machine learning (ML) SUD detection solutions. This global\ncontext allows us to test the generalization ability of SUD classifiers that\nacquire knowledge around the same SUD categories, but from different contexts.\nFrom this perspective, we can analyze how (possibly) different annotation\nmodalities influence SUD learning by discussing open challenges and open\nresearch directions. We also provide several data insights which can support\ndomain experts in the annotation task.\n","authors":["Bruno Machado Carneiro","Michele Linardi","Julien Longhi"],"pdf_url":"https://arxiv.org/pdf/2308.04180v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.14915v2","updated":"2023-08-08T10:30:54Z","published":"2022-09-29T16:22:46Z","title":"Spiking Neural Networks for event-based action recognition: A new task\n to understand their advantage","summary":" Spiking Neural Networks (SNN) are characterised by their unique temporal\ndynamics, but the properties and advantages of such computations are still not\nwell understood. 
In order to provide answers, in this work we demonstrate how\nSpiking neurons can enable temporal feature extraction in feed-forward neural\nnetworks without the need for recurrent synapses, showing how their\nbio-inspired computing principles can be successfully exploited beyond energy\nefficiency gains and evidencing their differences with respect to conventional\nneurons. This is demonstrated by proposing a new task, DVS-Gesture-Chain\n(DVS-GC), which allows, for the first time, to evaluate the perception of\ntemporal dependencies in a real event-based action recognition dataset. Our\nstudy proves how the widely used DVS Gesture benchmark could be solved by\nnetworks without temporal feature extraction, unlike the new DVS-GC which\ndemands an understanding of the ordering of the events. Furthermore, this setup\nallowed us to unveil the role of the leakage rate in spiking neurons for\ntemporal processing tasks and demonstrated the benefits of \"hard reset\"\nmechanisms. Additionally, we also show how time-dependent weights and\nnormalization can lead to understanding order by means of temporal attention.\n","authors":["Alex Vicente-Sola","Davide L. Manna","Paul Kirkland","Gaetano Di Caterina","Trevor Bihl"],"pdf_url":"https://arxiv.org/pdf/2209.14915v2.pdf","comment":"New article superseding the one in previous versions"},{"id":"http://arxiv.org/abs/2301.10227v2","updated":"2023-08-08T10:18:04Z","published":"2023-01-02T14:17:08Z","title":"Denoising Diffusion Probabilistic Models for Generation of Realistic\n Fully-Annotated Microscopy Image Data Sets","summary":" Recent advances in computer vision have led to significant progress in the\ngeneration of realistic image data, with denoising diffusion probabilistic\nmodels proving to be a particularly effective method. In this study, we\ndemonstrate that diffusion models can effectively generate fully-annotated\nmicroscopy image data sets through an unsupervised and intuitive approach,\nusing rough sketches of desired structures as the starting point. The proposed\npipeline helps to reduce the reliance on manual annotations when training deep\nlearning-based segmentation approaches and enables the segmentation of diverse\ndatasets without the need for human annotations. This approach holds great\npromise in streamlining the data generation process and enabling a more\nefficient and scalable training of segmentation models, as we show in the\nexample of different practical experiments involving various organisms and cell\ntypes.\n","authors":["Dennis Eschweiler","Rüveyda Yilmaz","Matisse Baumann","Ina Laube","Rijo Roy","Abin Jose","Daniel Brückner","Johannes Stegmaier"],"pdf_url":"https://arxiv.org/pdf/2301.10227v2.pdf","comment":"9 pages, 2 figures"},{"id":"http://arxiv.org/abs/2301.05609v4","updated":"2023-08-08T10:04:14Z","published":"2023-01-13T15:24:40Z","title":"Co-manipulation of soft-materials estimating deformation from depth\n images","summary":" Human-robot co-manipulation of soft materials, such as fabrics, composites,\nand sheets of paper/cardboard, is a challenging operation that presents several\nrelevant industrial applications. Estimating the deformation state of the\nco-manipulated material is one of the main challenges. Viable methods provide\nthe indirect measure by calculating the human-robot relative distance. 
In this\npaper, we develop a data-driven model to estimate the deformation state of the\nmaterial from a depth image through a Convolutional Neural Network (CNN).\nFirst, we define the deformation state of the material as the relative\nroto-translation between the current robot pose and a human grasping position. The\nmodel estimates the current deformation state through a Convolutional Neural\nNetwork, specifically a DenseNet-121 pretrained on ImageNet. The delta between\nthe current and the desired deformation state is fed to the robot controller\nthat outputs twist commands. The paper describes the developed approach to\nacquire and preprocess the dataset and to train the model. The model is compared with\nthe current state-of-the-art method based on a skeletal tracker from cameras.\nResults show that our approach achieves better performance and avoids the\nvarious drawbacks caused by using a skeletal tracker. Finally, we also studied\nthe model performance according to different architectures and dataset\ndimensions to minimize the time required for dataset acquisition.\n","authors":["Giorgio Nicola","Enrico Villagrossi","Nicola Pedrocchi"],"pdf_url":"https://arxiv.org/pdf/2301.05609v4.pdf","comment":"Pre-print, Accepted to Robotics and Computer Integrated Manufacturing"},{"id":"http://arxiv.org/abs/2308.04169v1","updated":"2023-08-08T09:59:56Z","published":"2023-08-08T09:59:56Z","title":"Dual input neural networks for positional sound source localization","summary":" In many signal processing applications, metadata may be advantageously used\nin conjunction with a high dimensional signal to produce a desired output. In\nthe case of classical Sound Source Localization (SSL) algorithms, information\nfrom high-dimensional, multichannel audio signals received by many\ndistributed microphones is combined with information describing acoustic\nproperties of the scene, such as the microphones' coordinates in space, to\nestimate the position of a sound source. We introduce Dual Input Neural\nNetworks (DI-NNs) as a simple and effective way to model these two data types\nin a neural network. We train and evaluate our proposed DI-NN on scenarios of\nvarying difficulty and realism and compare it against an alternative\narchitecture, a classical Least-Squares (LS) method as well as a classical\nConvolutional Recurrent Neural Network (CRNN). Our results show that the DI-NN\nsignificantly outperforms the baselines, achieving a five times lower\nlocalization error than the LS method and two times lower than the CRNN in a\ntest dataset of real recordings.\n","authors":["Eric Grinstein","Vincent W. Neo","Patrick A. Naylor"],"pdf_url":"https://arxiv.org/pdf/2308.04169v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.02632v2","updated":"2023-08-08T09:21:40Z","published":"2023-08-04T17:44:27Z","title":"Generation of Realistic Synthetic Raw Radar Data for Automated Driving\n Applications using Generative Adversarial Networks","summary":" The main approaches for simulating FMCW radar are based on ray tracing, which\nis usually computationally intensive and does not account for background noise.\nThis work proposes a faster method for FMCW radar simulation capable of\ngenerating synthetic raw radar data using generative adversarial networks\n(GAN). The code and pre-trained weights are open-source and available on\nGitHub. This method generates 16 simultaneous chirps, which allows the\ngenerated data to be used for the further development of algorithms for\nprocessing radar data (filtering and clustering). 
This can increase the\npotential for data augmentation, e.g., by generating data in non-existent or\nsafety-critical scenarios that are not reproducible in real life. In this work,\nthe GAN was trained with radar measurements of a motorcycle and used to\ngenerate synthetic raw radar data of a motorcycle traveling in a straight line.\nFor generating this data, the distance of the motorcycle and Gaussian noise are\nused as input to the neural network. The synthetically generated radar chirps were\nevaluated using the Frechet Inception Distance (FID). Then, the Range-Azimuth\n(RA) map is calculated twice: first, based on synthetic data using this GAN\nand, second, based on real data. Based on these RA maps, an algorithm with\nadaptive threshold and edge detection is used for object detection. The results\nhave shown that the data is realistic in terms of coherent radar reflections of\nthe motorcycle and background noise based on the comparison of chirps, the RA\nmaps and the object detection results. Thus, the proposed method in this work\nhas been shown to minimize the simulation-to-reality gap for the generation of radar\ndata.\n","authors":["Eduardo C. Fidelis","Fabio Reway","Herick Y. S. Ribeiro","Pietro L. Campos","Werner Huber","Christian Icking","Lester A. Faria","Torsten Schön"],"pdf_url":"https://arxiv.org/pdf/2308.02632v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.08325v2","updated":"2023-08-08T09:08:01Z","published":"2023-06-14T07:54:53Z","title":"GCformer: An Efficient Framework for Accurate and Scalable Long-Term\n Multivariate Time Series Forecasting","summary":" Transformer-based models have emerged as promising tools for time series\nforecasting.\n However, these models cannot make accurate predictions for long input time\nseries. On the one hand, they fail to capture global dependencies within time\nseries data. On the other hand, the long input sequence usually leads to a large\nmodel size and high time complexity.\n To address these limitations, we present GCformer, which combines a\nstructured global convolutional branch for processing long input sequences with\na local Transformer-based branch for capturing short, recent signals. A\ncohesive framework for a global convolution kernel has been introduced,\nutilizing three distinct parameterization methods. The selected structured\nconvolutional kernel in the global branch has been specifically crafted with\nsublinear complexity, thereby allowing for the efficient and effective\nprocessing of lengthy and noisy input signals. Empirical studies on six\nbenchmark datasets demonstrate that GCformer outperforms state-of-the-art\nmethods, reducing MSE in multivariate time series benchmarks by 4.38% and\nmodel parameters by 61.92%. In particular, the global convolutional branch can\nserve as a plug-in block to enhance the performance of other models, with an\naverage improvement of 31.93\\%, including various recently published\nTransformer-based models. Our code is publicly available at\nhttps://github.com/zyj-111/GCformer.\n","authors":["YanJun Zhao","Ziqing Ma","Tian Zhou","Liang Sun","Mengni Ye","Yi Qian"],"pdf_url":"https://arxiv.org/pdf/2306.08325v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.02582v2","updated":"2023-08-08T08:57:20Z","published":"2023-08-01T05:31:36Z","title":"Adapt and Decompose: Efficient Generalization of Text-to-SQL via Domain\n Adapted Least-To-Most Prompting","summary":" Cross-domain and cross-compositional generalization of Text-to-SQL semantic\nparsing is a challenging task. 
Existing Large Language Model (LLM)-based\nsolutions rely on inference-time retrieval of few-shot exemplars from the\ntraining set to synthesize a run-time prompt for each Natural Language (NL)\ntest query. In contrast, we devise an algorithm which performs offline sampling\nof a minimal set of few-shot exemplars from the training data, with complete coverage of\nSQL clauses, operators and functions, and maximal domain coverage within the\nallowed token length. This allows for synthesis of a fixed Generic Prompt (GP),\nwith a diverse set of exemplars common across NL test queries, avoiding\nexpensive test-time exemplar retrieval. We further auto-adapt the GP to the\ntarget database domain (DA-GP), to better handle cross-domain generalization;\nfollowed by a decomposed Least-To-Most-Prompting (LTMP-DA-GP) to handle\ncross-compositional generalization. The synthesis of LTMP-DA-GP is an offline\ntask, to be performed once per new database with minimal human\nintervention. Our approach demonstrates superior performance on the KaggleDBQA\ndataset, designed to evaluate generalizability for the Text-to-SQL task. We\nfurther showcase consistent performance improvement of LTMP-DA-GP over GP,\nacross LLMs and databases of KaggleDBQA, highlighting the efficacy and\nmodel-agnostic benefits of our prompt-based adapt-and-decompose approach.\n","authors":["Aseem Arora","Shabbirhussain Bhaisaheb","Manasi Patwardhan","Lovekesh Vig","Gautam Shroff"],"pdf_url":"https://arxiv.org/pdf/2308.02582v2.pdf","comment":"22 Pages"},{"id":"http://arxiv.org/abs/2206.01186v2","updated":"2023-08-08T08:51:45Z","published":"2022-06-01T10:28:18Z","title":"ORC: Network Group-based Knowledge Distillation using Online Role Change","summary":" In knowledge distillation, since a single, omnipotent teacher network cannot\nsolve all problems, multiple teacher-based knowledge distillations have been\nstudied recently. However, sometimes their improvements are not as good as\nexpected because some immature teachers may transfer false knowledge to the\nstudent. In this paper, to overcome this limitation and exploit the efficacy of\nmultiple networks, we divide the networks into teacher and student\ngroups. That is, the student group is a set of immature networks\nthat require learning the teacher's knowledge, while the teacher group consists\nof the selected networks that are capable of teaching successfully. We propose\nan online role change strategy where the top-ranked networks in the student\ngroup can be promoted to the teacher group at every iteration. After\ntraining the teacher group using the error samples of the student group to\nrefine the teacher group's knowledge, we transfer the collaborative knowledge\nfrom the teacher group to the student group successfully. We verify the\nsuperiority of the proposed method on CIFAR-10, CIFAR-100, and ImageNet, where it\nachieves high performance. 
We further show the generality of our method with\nvarious backbone architectures such as ResNet, WRN, VGG, Mobilenet, and\nShufflenet.\n","authors":["Junyong Choi","Hyeon Cho","Seokhwa Cheung","Wonjun Hwang"],"pdf_url":"https://arxiv.org/pdf/2206.01186v2.pdf","comment":"Accepted at ICCV 2023; Supplementary material would be found at CVF\n Open Access"},{"id":"http://arxiv.org/abs/2308.04137v1","updated":"2023-08-08T08:50:27Z","published":"2023-08-08T08:50:27Z","title":"Comprehensive Assessment of the Performance of Deep Learning Classifiers\n Reveals a Surprising Lack of Robustness","summary":" Reliable and robust evaluation methods are a necessary first step towards\ndeveloping machine learning models that are themselves robust and reliable.\nUnfortunately, current evaluation protocols typically used to assess\nclassifiers fail to comprehensively evaluate performance as they tend to rely\non limited types of test data, and ignore others. For example, using the\nstandard test data fails to evaluate the predictions made by the classifier to\nsamples from classes it was not trained on. On the other hand, testing with\ndata containing samples from unknown classes fails to evaluate how well the\nclassifier can predict the labels for known classes. This article advocates\nbench-marking performance using a wide range of different types of data and\nusing a single metric that can be applied to all such data types to produce a\nconsistent evaluation of performance. Using such a benchmark it is found that\ncurrent deep neural networks, including those trained with methods that are\nbelieved to produce state-of-the-art robustness, are extremely vulnerable to\nmaking mistakes on certain types of data. This means that such models will be\nunreliable in real-world scenarios where they may encounter data from many\ndifferent domains, and that they are insecure as they can easily be fooled into\nmaking the wrong decisions. It is hoped that these results will motivate the\nwider adoption of more comprehensive testing methods that will, in turn, lead\nto the development of more robust machine learning methods in the future.\n Code is available at:\n\\url{https://codeberg.org/mwspratling/RobustnessEvaluation}\n","authors":["Michael W. Spratling"],"pdf_url":"https://arxiv.org/pdf/2308.04137v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.18651v3","updated":"2023-08-08T08:48:48Z","published":"2023-05-29T23:06:05Z","title":"UMD: Unsupervised Model Detection for X2X Backdoor Attacks","summary":" Backdoor (Trojan) attack is a common threat to deep neural networks, where\nsamples from one or more source classes embedded with a backdoor trigger will\nbe misclassified to adversarial target classes. Existing methods for detecting\nwhether a classifier is backdoor attacked are mostly designed for attacks with\na single adversarial target (e.g., all-to-one attack). To the best of our\nknowledge, without supervision, no existing methods can effectively address the\nmore general X2X attack with an arbitrary number of source classes, each paired\nwith an arbitrary target class. In this paper, we propose UMD, the first\nUnsupervised Model Detection method that effectively detects X2X backdoor\nattacks via a joint inference of the adversarial (source, target) class pairs.\nIn particular, we first define a novel transferability statistic to measure and\nselect a subset of putative backdoor class pairs based on a proposed clustering\napproach. 
Then, these selected class pairs are jointly assessed based on an\naggregation of their reverse-engineered trigger size for detection inference,\nusing a robust and unsupervised anomaly detector we proposed. We conduct\ncomprehensive evaluations on CIFAR-10, GTSRB, and Imagenette dataset, and show\nthat our unsupervised UMD outperforms SOTA detectors (even with supervision) by\n17%, 4%, and 8%, respectively, in terms of the detection accuracy against\ndiverse X2X attacks. We also show the strong detection performance of UMD\nagainst several strong adaptive attacks.\n","authors":["Zhen Xiang","Zidi Xiong","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2305.18651v3.pdf","comment":"Proceedings of the 40th International Conference on Machine Learning"},{"id":"http://arxiv.org/abs/2308.04126v1","updated":"2023-08-08T08:30:16Z","published":"2023-08-08T08:30:16Z","title":"OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion\n and Infinite Data Generation","summary":" This paper presents OmniDataComposer, an innovative approach for multimodal\ndata fusion and unlimited data generation with an intent to refine and\nuncomplicate interplay among diverse data modalities. Coming to the core\nbreakthrough, it introduces a cohesive data structure proficient in processing\nand merging multimodal data inputs, which include video, audio, and text. Our\ncrafted algorithm leverages advancements across multiple operations such as\nvideo/image caption extraction, dense caption extraction, Automatic Speech\nRecognition (ASR), Optical Character Recognition (OCR), Recognize Anything\nModel(RAM), and object tracking. OmniDataComposer is capable of identifying\nover 6400 categories of objects, substantially broadening the spectrum of\nvisual information. It amalgamates these diverse modalities, promoting\nreciprocal enhancement among modalities and facilitating cross-modal data\ncorrection. \\textbf{The final output metamorphoses each video input into an\nelaborate sequential document}, virtually transmuting videos into thorough\nnarratives, making them easier to be processed by large language models. Future\nprospects include optimizing datasets for each modality to encourage unlimited\ndata generation. This robust base will offer priceless insights to models like\nChatGPT, enabling them to create higher quality datasets for video captioning\nand easing question-answering tasks based on video content. OmniDataComposer\ninaugurates a new stage in multimodal learning, imparting enormous potential\nfor augmenting AI's understanding and generation of complex, real-world data.\n","authors":["Dongyang Yu","Shihao Wang","Yuan Fang","Wangpeng An"],"pdf_url":"https://arxiv.org/pdf/2308.04126v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04119v1","updated":"2023-08-08T08:19:43Z","published":"2023-08-08T08:19:43Z","title":"Constructing Custom Thermodynamics Using Deep Learning","summary":" One of the most exciting applications of AI is automated scientific discovery\nbased on previously amassed data, coupled with restrictions provided by the\nknown physical principles, including symmetries and conservation laws. Such\nautomated hypothesis creation and verification can assist scientists in\nstudying complex phenomena, where traditional physical intuition may fail. Of\nparticular importance are complex dynamic systems where their time evolution is\nstrongly influenced by varying external parameters. 
In this paper we develop a\nplatform based on a generalised Onsager principle to learn macroscopic\ndynamical descriptions of arbitrary stochastic dissipative systems directly\nfrom observations of their microscopic trajectories. We focus on systems whose\ncomplexity and sheer size render a complete microscopic description impractical,\nand for which constructing theoretical macroscopic models requires extensive domain\nknowledge or trial-and-error. Our machine learning approach addresses this by\nsimultaneously constructing reduced thermodynamic coordinates and interpreting\nthe dynamics on these coordinates. We demonstrate our method by studying\ntheoretically, and validating experimentally, the stretching of long polymer\nchains in an externally applied field. Specifically, we learn three\ninterpretable thermodynamic coordinates and build a dynamical landscape of\npolymer stretching, including (1) the identification of stable and transition\nstates and (2) the control of the stretching rate. We further demonstrate the\nuniversality of our approach by applying it to an unrelated problem in a\ndifferent domain: constructing macroscopic dynamics for spatial epidemics,\nshowing that our method addresses a wide range of scientific and technological\napplications.\n","authors":["Xiaoli Chen","Beatrice W. Soh","Zi-En Ooi","Eleonore Vissol-Gaudin","Haijun Yu","Kostya S. Novoselov","Kedar Hippalgaonkar","Qianxiao Li"],"pdf_url":"https://arxiv.org/pdf/2308.04119v1.pdf","comment":null},{"id":"http://arxiv.org/abs/1910.06832v3","updated":"2023-08-08T07:50:36Z","published":"2019-10-15T14:47:37Z","title":"Discriminator optimal transport","summary":" Within a broad class of generative adversarial networks, we show that the\ndiscriminator optimization process increases a lower bound of the dual cost\nfunction for the Wasserstein distance between the target distribution $p$ and\nthe generator distribution $p_G$. It implies that the trained discriminator can\napproximate optimal transport (OT) from $p_G$ to $p$. Based on some experiments\nand a bit of OT theory, we propose a discriminator optimal transport (DOT)\nscheme to improve generated images. We show that it improves the inception score\nand FID of unconditional GANs trained on CIFAR-10 and STL-10, as well as of a\npublic pre-trained conditional GAN model trained on ImageNet.\n","authors":["Akinori Tanaka"],"pdf_url":"https://arxiv.org/pdf/1910.06832v3.pdf","comment":"math errors corrected, note added"},{"id":"http://arxiv.org/abs/2308.04103v1","updated":"2023-08-08T07:38:44Z","published":"2023-08-08T07:38:44Z","title":"Explainable machine learning to enable high-throughput electrical\n conductivity optimization of doped conjugated polymers","summary":" The combination of high-throughput experimentation techniques and machine\nlearning (ML) has recently ushered in a new era of accelerated material\ndiscovery, enabling the identification of materials with cutting-edge\nproperties. However, the measurement of certain physical quantities remains\nchallenging to automate. Specifically, meticulous process control,\nexperimentation and laborious measurements are required to achieve optimal\nelectrical conductivity in doped polymer materials. We propose an ML approach,\nwhich relies on readily measured absorbance spectra, to accelerate the workflow\nassociated with measuring electrical conductivity. The first ML model\n(a classification model) accurately classifies samples with a conductivity >~25\nto 100 S/cm, achieving a maximum accuracy of 100%. 
For the subset of\nhighly conductive samples, we employed a second ML model (a regression model) to\npredict their conductivities, yielding an impressive test R2 value of 0.984. To\nvalidate the approach, we showed that the models, neither of which was trained on the\nsamples with the two highest conductivities (498 and 506 S/cm), were able to\ncorrectly classify and predict them in an extrapolative manner with satisfactory\nlevels of error. The proposed ML workflow results in an improvement in the\nefficiency of the conductivity measurements by 89% of the maximum achievable\nusing our experimental techniques. Furthermore, our approach addressed the\ncommon challenge of the lack of explainability in ML models by exploiting\nbespoke mathematical properties of the descriptors and ML model, allowing us to\ngain corroborated insights into the spectral influences on conductivity.\nThrough this study, we offer an accelerated pathway for optimizing the\nproperties of doped polymer materials while showcasing the valuable insights\nthat can be derived from purposeful utilization of ML in experimental science.\n","authors":["Ji Wei Yoon","Adithya Kumar","Pawan Kumar","Kedar Hippalgaonkar","J Senthilnath","Vijila Chellappan"],"pdf_url":"https://arxiv.org/pdf/2308.04103v1.pdf","comment":"33 Pages, 17 figures"},{"id":"http://arxiv.org/abs/2308.04102v1","updated":"2023-08-08T07:33:49Z","published":"2023-08-08T07:33:49Z","title":"Asynchronous Evolution of Deep Neural Network Architectures","summary":" Many evolutionary algorithms (EAs) take advantage of parallel evaluation of\ncandidates. However, if evaluation times vary significantly, many worker nodes\n(i.e.,\\ compute clients) are idle much of the time, waiting for the next\ngeneration to be created. Evolutionary neural architecture search (ENAS), a\nclass of EAs that optimizes the architecture and hyperparameters of deep neural\nnetworks, is particularly vulnerable to this issue. This paper proposes a\ngeneric asynchronous evaluation strategy (AES) that is then adapted to work\nwith ENAS. AES increases throughput by maintaining a queue of up to $K$\nindividuals ready to be sent to the workers for evaluation and proceeding to\nthe next generation as soon as $M<