diff --git a/_posts/2023-10-06-(CL_paper101)LLM4TS.md b/_posts/2023-10-06-(CL_paper101)LLM4TS.md new file mode 100644 index 000000000000..f01ff05c0d98 --- /dev/null +++ b/_posts/2023-10-06-(CL_paper101)LLM4TS.md @@ -0,0 +1,364 @@ +--- +title: (paper 101) LLM4TS; Two-stage Fine-tuning for TSF with Pretrained LLMs +categories: [TS,NLP] +tags: [] +excerpt: 2023 +--- + + + +# LLM4TS: Two-stage Fine-tuning for TSF with Pretrained LLMs (2023) + +
+ +https://arxiv.org/pdf/2308.08469.pdf + +## Contents + +0. Abstract +0. Introduction +0. Related Work + 0. In-modality Knowledge Transfer + 0. Cross-modality Knowledge Transfer + +0. Problem Formulation +0. Method + 0. Two-stage FT + 0. Details of LLM4TS + +0. Experiments + 0. MTS forecasting + 0. Few-shot learning + 0. SSL + 0. Ablation Study + + +
+ +# 0. Abstract + +Limited large-scale TS data for building robust foundation models + +$$\rightarrow$$ use **Pre-trained Large Language Models (LLMs)** to enhance TSF + + + +Details + +- time-series patching + temporal encoding + +- prioritize a two-stage fine-tuning process: + - step 1) **supervised fine-tuning** to orient the LLM towards TS + - step 2) **task-specific downstream fine-tuning** +- **Parameter-Efficient Fine-Tuning (PEFT)** techniques +- Experiment + - robust representation learner + - effective few-shot learner + +
+ +# 1. Introduction + +Background + +- Foundation models in NLP & CV + +- Limited availability of large-scale TS data to train robust foundation models + +$$\rightarrow$$ utilizing **pre-trained Large Language Models (LLMs) as powerful representation learners for TS** + +
+ +To integrate LLMs with TS data ... 2 pivotal questions + +- Q1) How can we ***input time-series data into LLMs?*** +- Q2) How can we utilize pre-trained LLMs ***without distorting their inherent features?*** + +
+ +## Q1) Input TS into LLMs + +To accommodate new data modalities within LLMs... + +$$\rightarrow$$ essential to **(1) tokenize the data** ( feat. PatchTST ) + +**(2) Channel-independence** + +
+ +Summary + +- introduce a novel approach that integrates **temporal information**, +- while employing the techniques of **patching and channel-independence** + +
+ +## Q2) Utilize LLM without distorting inherent features + +High-quality chatbots ( Instruct GPT, ChatGPT ) + +- requires strategic alignment of a pre-trained model with instruction-based data through **supervised fine-tuning** + + $$\rightarrow$$ ensures the model becomes **familiarized with target data** formats + +
+ +Introduce a **TWO-stage fine-tuning approach** + +- step 1) supervised fine-tuning + - guiding the LLM towards TS data +- step 2) downstream fine-tuning + - geared towards TSF task + +( still, there is a need to enhance the **pre-trained LLMs’ adaptability** to new data modalities **without distorting models’ inherent features** ) + +$$\rightarrow$$ two Parameter-Efficient Fine-Tuning (PEFT) techniques + +- Layer Normalization Tuning (Lu et al. 2021) +- LoRA (Hu et al. 2021) + +to optimize model flexibility without extensive parameter adjustments + +
+ +## Summary + +1. Integration of Time-Series with LLMs + - patching and channel-independence to tokenize TS data + - novel approach to integrate temporal information with patching +2. Adaptable Fine-Tuning for LLMs + - two-stage fine-tuning methodology + - step 1) supervised fine-tuning stage to align LLMs with TS + - step 2) downstream fine-tuning stage dedicated to TSF task +3. Optimized Model Flexibility + - To ensure both robustness and adaptability + - two PEFT techniques + - Layer Normalization Tuning + - LoRA + +4. Real-World Application Relevance + +
+ +# 2. Related Work + +## (1) In-modality Knowledge Transfer + +Foundation models + +- capable of transferring knowledge to downstream tasks +- transformation of LLMs into chatbots + - ex) InstructGPT, ChatGPT: employ supervised fine-tuning + +
+ +Limitation of fine-tuning + +- computational burden of refining an entire model can be significant. + +$$\rightarrow$$ solution: PEFT ( Parameter Efficient Fine Tuning ) + +
+ +**PEFT ( Parameter Efficient Fine Tuning )** + +- popular technique to reduce costs +- ex) LLaMA-Adapter (Gao et al. 2023) : achieves ChatGPT-level performance by fine-tuning a mere 0.02% of its parameters + +
+ +**LLM4TS**: integrate supervised fine-tuning & PEFT + +
+ +## (2) Cross-Modality Knowledge Transfer + +Transfer across diverse data modalities + +- ex) NLP $$\rightarrow$$ Image (Lu et al. 2021), Audio (Ghosal et al. 2023), TS (Zhou et al. 2023) +- ex) CV $$\rightarrow$$ 12 distinct modalities (Zhang et al. 2023) + +
+ +**LLM4TS**: utilize pretrained LLM expertise to address challenges in TS data + +
+ +# 3. Problem Formulation + +![figure2](/assets/img/ts/img479.png) + +
+ +# 4. Method + +LLM4TS framework + +- leveraging the pre-trained GPT-2 +- (4-1) introduce the two-stage fine-tuning training strategy +- (4-2) details + - instance normalization + - patching + - channel-independence + - three distinct encodings + +
+ +![figure2](/assets/img/ts/img480.png) + +
+ +## (1) Two-stage FT + +### a) Supervised FT: Autoregressive + +GPT-2 (Radford et al. 2019): causal language model + +$$\rightarrow$$ **supervised fine-tuning** adopts the same **autoregressive training methodology** used during its pretraining phase. + +
+ +Given 1st, 2nd, 3rd patches.. + +Predict 2nd, 3rd, 4th patches.. + +
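The shifted-by-one setup above can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's code; the helper names `make_patches` and `autoregressive_pairs`, the patch length, and the stride are all hypothetical choices.

```python
def make_patches(series, patch_len, stride):
    """Split a 1-D series into patches of length patch_len, stepping by stride."""
    return [series[i:i + patch_len]
            for i in range(0, len(series) - patch_len + 1, stride)]


def autoregressive_pairs(patches):
    """Inputs are patches 1..n-1, targets are patches 2..n (shifted by one)."""
    return patches[:-1], patches[1:]


series = list(range(8))  # toy series: 0, 1, ..., 7
patches = make_patches(series, patch_len=2, stride=2)
inputs, targets = autoregressive_pairs(patches)

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

Each input patch is paired with the patch that follows it, which mirrors next-token prediction in a causal language model.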
+ +### b) Downstream FT: Forecasting + +2 primary strategies are available + +- (1) full fine-tuning +- (2) linear probing + +
+ +Sequential approach (LP-FT) is good! + +- step 1) LP: linear probing ( epoch x 0.5 ) +- step 2) FT: full fine-tuning ( epoch x 0.5 ) + +
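As a rough sketch of the LP-FT split above, the function below assigns the first half of the epochs to linear probing and the rest to full fine-tuning. The function name `lp_ft_schedule` and the handling of odd epoch counts (the extra epoch goes to fine-tuning) are assumptions for illustration, not details from the paper.

```python
def lp_ft_schedule(total_epochs):
    """Return the training mode for each epoch: LP for the first half, FT for the rest."""
    lp_epochs = total_epochs // 2  # assumption: odd counts give the extra epoch to FT
    ft_epochs = total_epochs - lp_epochs
    return ["linear_probing"] * lp_epochs + ["full_fine_tuning"] * ft_epochs


schedule = lp_ft_schedule(10)
for epoch, mode in enumerate(schedule, start=1):
    print(epoch, mode)
```

In a real training loop, the mode for each epoch would decide which parameters have gradients enabled.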
+ +## (2) Details of LLM4TS + +### a) Instance Normalization + +- z-score norm) standard for TSF +- RevIN) further boosts accuracy + +
+ +(????) Since RevIN is designed for the unpatched TS .... 2 problems + +- (1) Denormalization is infeasible as outputs remain in the patched format rather than the unpatched format. +- (2) RevIN’s trainable affine transformation is not appropriate for AR models. + +$$\rightarrow$$ Employ "standard instance norm" during Sup FT + +
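For intuition, "standard instance norm" can be read as a per-window z-score with no trainable affine parameters. The sketch below assumes that reading; the names `instance_normalize` and `denormalize` and the `eps` value are hypothetical. The `denormalize` helper is included only to show why the window statistics have to be kept around; these notes do not say where the paper applies it.

```python
import statistics


def instance_normalize(window, eps=1e-5):
    """Z-score one input window using its own mean and std (no trainable affine)."""
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    normed = [(x - mean) / (std + eps) for x in window]
    return normed, mean, std


def denormalize(values, mean, std, eps=1e-5):
    """Map normalized values back to the original scale using the stored statistics."""
    return [v * (std + eps) + mean for v in values]


window = [1.0, 2.0, 3.0, 4.0]
normed, mean, std = instance_normalize(window)
restored = denormalize(normed, mean, std)
```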
+ +### b) Patching & Channel Independence + +- pass + +
+ +### c) Three Encodings + +(1) Token embedding + +- via 1D-convolution + +(2) Positional encoding + +- use the standard approach and employ a trainable lookup table + +(3) Temporal encoding + +- numerous studies suggest the advantage of incorporating temporal information with Transformer-based models in time-series analysis (Wen et al. 2022). +- Problem + - (1) patch = multiple timestamps = which timestamp to use...? + - (2) each timestamp carries various temporal attributes ( minute, hour, day .. ) +- Solution + - (1) designate the initial timestamp as its representative + - (2) employ a trainable lookup table for each attribute + +$$\rightarrow$$ add (1) & (2) & (3) + +
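The sum of the three encodings can be illustrated with a toy example. This is a sketch under several assumptions: a plain linear map stands in for the 1D-convolution token embedding, the lookup tables are randomly initialized rather than trained, and only two temporal attributes (hour and weekday) are used. All names and sizes (`D_MODEL`, `PATCH_LEN`, `embed_patch`, and the table names) are hypothetical.

```python
import random
from datetime import datetime, timedelta

random.seed(0)
D_MODEL = 4    # hypothetical embedding width
PATCH_LEN = 3  # hypothetical patch length


def rand_vec(n):
    return [random.uniform(-0.1, 0.1) for _ in range(n)]


def add(*vecs):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vecs)]


# Stand-in for the 1D-convolution token embedding: one linear map per patch.
W_token = [rand_vec(PATCH_LEN) for _ in range(D_MODEL)]
# Lookup tables (trainable in a real model; fixed random values here).
pos_table = [rand_vec(D_MODEL) for _ in range(16)]     # one row per patch position
hour_table = [rand_vec(D_MODEL) for _ in range(24)]    # one row per hour of day
weekday_table = [rand_vec(D_MODEL) for _ in range(7)]  # one row per day of week


def embed_patch(patch, position, timestamps):
    token = [sum(w * x for w, x in zip(row, patch)) for row in W_token]
    first = timestamps[0]  # the initial timestamp represents the whole patch
    temporal = add(hour_table[first.hour], weekday_table[first.weekday()])
    return add(token, pos_table[position], temporal)


start = datetime(2023, 1, 1, 0)
stamps = [start + timedelta(hours=h) for h in range(PATCH_LEN)]
vec = embed_patch([0.5, 1.0, 1.5], position=0, timestamps=stamps)
```

Each temporal attribute gets its own table, and the rows selected by the patch's first timestamp are summed before being added to the token and positional embeddings.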
+ +### d) Pre-Trained LLM and PEFT + +Freeze + +- most parameters, particularly those associated with the multi-head attention and feedforward layers within the Transformer block. +- many studies indicate that **retaining most parameters as non-trainable often yields better results** than training a pre-trained LLM from scratch (Lu et al. 2021; Zhou et al. 2023). + +
+ +Tune + +- employ PEFT techniques as efficient approaches +- (1) utilize the selection-based method .... **Layer Normalization Tuning (Lu et al. 2021)** + - adjust pre-existing parameters by making the affine transformation in layer normalization trainable. +- (2) employ LoRA (Low-Rank Adaptation) (Hu et al. 2021) + - reparameterization-based method that leverages low-rank representations + +
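For reference, the low-rank reparameterization behind LoRA can be written as follows. The notation is taken from Hu et al. 2021 rather than from these notes: $$W_0$$ is a frozen pre-trained weight matrix and $$r$$ is the chosen rank.

$$W = W_0 + BA, \quad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k)$$

Only $$A$$ and $$B$$ are trained, so each adapted matrix contributes $$r(d+k)$$ trainable parameters instead of $$dk$$.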
+ +Summary : **only 1.5% of the model’s total parameters** are trainable. + +
+ +### e) Output Layer + +Supervised FT + +- output remains in the form of patched TS ( tokens ) + - employ a linear layer to modify the final dimension. + +
+ +Downstream fine-tuning stage + +- transforms the patched $$\rightarrow$$ unpatched + - requiring flattening before the linear layer + +
+ +For both, use dropout immediately after the linear transformation + +
+ +# 5. Experiments + +## (1) MTS forecasting + +![figure2](/assets/img/ts/img481.png) + +
+ +## (2) Few-shot learning + +![figure2](/assets/img/ts/img482.png) + +
+ +## (3) SSL + +![figure2](/assets/img/ts/img483.png) + +
+ +## (4) Ablation Study + +### a) Supervised FT, Temporal Encoding, PEFT + +![figure2](/assets/img/ts/img484.png) + +
+ +### b) Training Strategies in Downstream FT + +![figure2](/assets/img/ts/img485.png) diff --git a/_posts/2023-10-07-(CL_paper102)DCDetector.md b/_posts/2023-10-07-(CL_paper102)DCDetector.md new file mode 100644 index 000000000000..280972e5f649 --- /dev/null +++ b/_posts/2023-10-07-(CL_paper102)DCDetector.md @@ -0,0 +1,336 @@ +--- +title: (paper 102) DCdetector; Dual Attention Contrastive Representation Learning for TS Anomaly Detection +categories: [TS] +tags: [] +excerpt: KDD 2023 +--- + + + +# DCdetector: Dual Attention Contrastive Representation Learning for TS Anomaly Detection (KDD 2023) + +
+ +https://arxiv.org/pdf/2306.10347.pdf + +## Contents + +0. Abstract +0. Introduction +0. Methodology + 0. Overall Architecture + 0. Dual Attention Contrastive Structure + 0. Representation Discrepancy + 0. Anomaly Criterion + + + +
+ +# 0. Abstract + +Challenge of TS anomaly detection + +- learn a representation map that enables effective discrimination of anomalies. + +
+ +Categories of methods + +- Reconstruction-based methods +- Contrastive learning + +
+ +### DCdetector + +- a **multi-scale dual attention** contrastive representation learning model + - utilizes a novel dual attention **asymmetric design** to create the permutated environment +- learn a **permutation invariant representation** with superior discrimination abilities + +
+ +# 1. Introduction + +### Challenges in TS-AD + +- (1) Determining what the anomalies will be like. +- (2) Anomalies are rare + - hard to get labels + - most supervised or semi-supervised methods fail to work given limited labeled training data. + +- (3) Should consider temporal, multidimensional, and non-stationary features for TS + +
+ +### TS anomaly detection methods + +( ex. statistical, classic machine learning, and deep learning based methods ) + +- Supervised and Semi-supervised methods + - cannot handle the challenge of limited labeled data +- Unsupervised methods + - without strict requirements on labeled data + - ex) one class classification-based, probabilistic based, distance-based, forecasting-based, reconstruction-based approaches + +
+ +### Examples + +- **Reconstruction-based methods** + - pros) developing rapidly: can handle complex data when combined with different machine learning models, and are interpretable ( instances that reconstruct poorly are the ones behaving abnormally ) + - cons) challenging to learn a well-reconstructed model for normal data without being obstructed by anomalies. +- **Contrastive Learning** + - outstanding performance in downstream tasks in computer vision + - effectiveness of contrastive representation learning still needs to be explored in TS-AD + +
+ +### DCdetector + +( Dual attention Contrastive representation learning anomaly detector ) + +- handle the challenges in **TS AD** + +- key idea : **normal TS points share the latent pattern** + + ( = normal points have **strong correlations** with other points <-> anomalies do not ) + +- Learning consistent representations : + - **hard** for anomalies + - **easy** for normal points +- Motivation : if normal and abnormal points’ representations are distinguishable, we can detect anomalies **without a highly qualified reconstruction model** + +
+ +### Details + +- contrastive structure with **two branches & dual attention** + - two branches share weights +- representation difference between normal and abnormal data is enlarged +- patching-based attention networks: to capture the temporal dependency +- multi-scale design: to reduce information loss during patching +- channel independence design for MTS +- does not require prior knowledge about anomalies + +
+ +# 2. Methodology + +MTS of length $$\mathrm{T}$$ : $$X=\left(x_1, x_2, \ldots, x_T\right)$$ + +- where $$x_t \in \mathbb{R}^d$$ + +
+ +Task: + +- given input TS $$\mathcal{X}$$, +- for another unknown test sequence $$\mathcal{X}_{\text {test }}$$ of length $$T^{\prime}$$ + +- we want to predict $$\mathcal{Y}_{\text {test }}=\left(y_1, y_2, \ldots, y_{T^{\prime}}\right)$$. + - $$y_t \in\{0,1\}$$ : 1 = anomaly & 0 = normal + +
+ +Inductive bias ( as Anomaly Transformer explored ) + +- ***anomalies have less connection with the whole TS than their adjacent points*** +- Anomaly Transformer: detects anomalies by association discrepancy between .. + - (1) a learned Gaussian kernel + - (2) attention weight distribution. +- DCdetector + - via a dual-attention self-supervised contrastive-type structure. + +
+ +### Comparison + +![figure2](/assets/img/ts/img486.png) + +1. Reconstruction-based approach +2. Anomaly Transformer + - observation that it is difficult to build nontrivial associations from abnormal points to the whole series. + - discrepancies + - prior discrepancy : learned with Gaussian Kernel + - association discrepancy : learned with a transformer module + - MinMax association learning & Reconstruction loss + +3. DCdetector + - concise ( does not need a specially designed Gaussian Kernel, a MinMax learning strategy, or a reconstruction loss ) + - mainly leverages the designed **CL-based dual-branch attention** for **discrepancy learning** of anomalies in different views + +
+ +## (1) Overall Architecture + +![figure2](/assets/img/ts/img487.png) + +4 main components + +1. Forward Process module +2. Dual Attention Contrastive Structure module +3. Representation Discrepancy module +4. Anomaly Criterion module. + +
+ +![figure2](/assets/img/ts/img488.png) + +### a) Forward Process module + +( channel-independent ) + +- a-1) instance normalization +- a-2) patching + +
+ +### b) Dual Attention Contrastive Structure module + +- each channel shares the same self-attention network +- representation results are concatenated as the final output $$\left(X^{\prime} \in \mathbb{R}^{N \times d}\right)$$. +- Dual Attention Contrastive Structure module + - learns the representation of inputs in different views. + +
+ +### c) Representation Discrepancy module + +Key Insight + +- normal points: share the same latent pattern even in different views (a strong correlation is not easily destroyed). +- anomalies: rare & do not have explicit patterns + +$$\rightarrow$$ difference will be slight for normal points' representations in different views and large for anomalies. + +
+ +### d) Anomaly Criterion module. + +- calculate anomaly scores based on the discrepancy between the two representations + +- use a prior threshold for AD + +
+ +## (2) Dual Attention Contrastive Structure + +TS from different views: takes .. + +- (1) patch-wise representations +- (2) in-patch representations + +
+ +Does not construct pairs like the typical contrastive methods + +- similar to the contrastive methods only using positive samples + +
+ +### a) Dual Attention + +Input time series $$\mathcal{X} \in \mathbb{R}^{T \times d}$$ are patched as $$\mathcal{X} \in \mathbb{R}^{P \times N \times d}$$ + +- $$P$$ : patch size +- $$N$$ : number of patches + +
+ +Fuse the channel information with the batch dimension ( $$\because$$ channel independence ) + +$$\rightarrow$$ becomes $$\mathcal{X} \in \mathbb{R}^{P \times N}$$. + +
+ +[ Patch-wise representation ] + +- single patch is considered as a unit + - embedding operation will be applied in the patch_size $$(P)$$ dimension +- capture dependencies among patches ( = patch-wise attention ) +- embedding shape : $$X_{\mathcal{N}} \in \mathbb{R}^{N \times d_{\text {model }}}$$. +- apply multi-head attention to $$X_{\mathcal{N}}$$ + +
+ +[ In-patch representation ] + +- dependencies of points in the same patch + - embedding operation will be applied in the number of patches $$(N)$$ dimension + +
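The two views can be pictured as the same patched series read along different axes. The sketch below is an illustration for a single channel with toy values. The helper name `to_patches` is hypothetical, and the embedding and attention steps are left out.

```python
def to_patches(series, patch_size):
    """Reshape a length-T series into N non-overlapping patches of size P."""
    if len(series) % patch_size != 0:
        raise ValueError("series length must be a multiple of patch_size")
    return [series[i:i + patch_size] for i in range(0, len(series), patch_size)]


series = [0, 1, 2, 3, 4, 5]  # T = 6
P = 2

# Patch-wise view: N tokens, each holding the P points of one patch.
patch_wise = to_patches(series, P)

# In-patch view: P tokens, each holding the N points that share a position within a patch.
in_patch = [list(column) for column in zip(*patch_wise)]

print(patch_wise)
print(in_patch)
```

In the patch-wise view the embedding acts on the $$P$$ dimension of each token, while in the in-patch view it acts on the $$N$$ dimension, matching the two bullet lists above.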
+ +Note that the $$W_{\mathcal{Q}_i}, W_{\mathcal{K}_i}$$ are **shared weights within the in-patch & patch-wise attention** + +
+ +### b) Up-sampling and Multi-scale Design + +Patch-wise attention + +- ignores the relevance among points in a patch + +In-patch attention + +- ignores the relevance among patches. + +
+ +To compare these two representations .... need upsampling! + +![figure2](/assets/img/ts/img489.png) + +
+ +Multi-scale design: + += final representation combines results from **different scales (i.e., patch sizes)**, summed over the patch list + +- Final patch-wise representation: $$\mathcal{N}$$ + - $$\mathcal{N}=\sum_{\text {Patch list }} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{N}}\right)$$, +- Final in-patch representation: $$\mathcal{P}$$ + - $$\mathcal{P}=\sum_{\text {Patch list }} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{P}}\right)$$. + +
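One plausible reading of the up-sampling and multi-scale sum is sketched below. It assumes that patch-wise outputs are repeated within each patch and that in-patch outputs are tiled across patches. Each attention output is reduced to a single number per token for readability, and the function names are hypothetical.

```python
def upsample_patch_wise(per_patch, patch_size):
    """Repeat each patch-level value patch_size times to recover length T = N * P."""
    return [value for value in per_patch for _ in range(patch_size)]


def upsample_in_patch(per_position, num_patches):
    """Tile the within-patch values num_patches times to recover length T = N * P."""
    return list(per_position) * num_patches


def multi_scale_sum(upsampled):
    """Sum equally long up-sampled outputs obtained with different patch sizes."""
    return [sum(values) for values in zip(*upsampled)]


# Toy series length T = 6 with two scales: P = 2 (N = 3) and P = 3 (N = 2).
scale_a = upsample_patch_wise([1, 2, 3], patch_size=2)
scale_b = upsample_patch_wise([10, 20], patch_size=3)
combined = multi_scale_sum([scale_a, scale_b])
tiled = upsample_in_patch([1, 2], num_patches=3)
```

After up-sampling, both branches have one value per time step, so they can be compared point by point.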
+ +### c) Contrastive Structure + +Patch-wise sample representation + +- learns a weighted combination **between sample points in the same position from each patch** + +In-patch sample representation + +- learns a weighted combination **between points within the same patch**. + +$$\rightarrow$$ Treat these two representations as "permutated multi-view representations" + +
+ +## (3) Representation Discrepancy + Kullback-Leibler divergence (KL divergence) + +- to measure the similarity of these two representations + +
+ +### Loss function definition + +( no reconstruction part is used ) + +$$\mathcal{L}\{\mathcal{P}, \mathcal{N} ; X\}=\frac{1}{2} \mathcal{D}(\mathcal{P}, \operatorname{Stopgrad}(\mathcal{N}))+\frac{1}{2} \mathcal{D}(\mathcal{N}, \operatorname{Stopgrad}(\mathcal{P}))$$. + +- Stop-gradient : to train 2 branches asynchronously + +
+ +## (4) Anomaly Criterion + +Final anomaly score of $$\mathcal{X} \in \mathbb{R}^{T \times d}$$ : +- $$\text{AnomalyScore}(X)=\frac{1}{2} \mathcal{D}(\mathcal{P}, \mathcal{N})+\frac{1}{2} \mathcal{D}(\mathcal{N}, \mathcal{P})$$. + +
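The score above, combined with the prior threshold from the Anomaly Criterion module, can be sketched as follows. This assumes that $$\mathcal{D}$$ is the KL divergence and that both representations have already been normalized into probability distributions (for example by a softmax). The function names, the `eps` smoothing, and the threshold value `delta` are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def anomaly_score(p_repr, n_repr):
    """Symmetric discrepancy between in-patch (P) and patch-wise (N) representations."""
    return 0.5 * kl_divergence(p_repr, n_repr) + 0.5 * kl_divergence(n_repr, p_repr)


def label(score, delta):
    """1 = anomaly if the score reaches the threshold delta, else 0 = normal."""
    return 1 if score >= delta else 0


delta = 0.1  # hypothetical threshold
consistent = anomaly_score([0.5, 0.5], [0.5, 0.5])    # views agree
inconsistent = anomaly_score([0.5, 0.5], [0.9, 0.1])  # views disagree
print(label(consistent, delta), label(inconsistent, delta))
```

When the two views agree the score stays near zero, and it grows as the views diverge, which is the behavior the key insight expects for anomalies.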
+ +$$y_i= \begin{cases}1: \text { anomaly } & \text { AnomalyScore }\left(X_i\right) \geq \delta \\ 0: \text { normal } & \text { AnomalyScore }\left(X_i\right)<\delta\end{cases}$$. + diff --git a/assets/img/ts/img479.png b/assets/img/ts/img479.png new file mode 100644 index 000000000000..e374f0a490be Binary files /dev/null and b/assets/img/ts/img479.png differ diff --git a/assets/img/ts/img480.png b/assets/img/ts/img480.png new file mode 100644 index 000000000000..9855ce77fa3b Binary files /dev/null and b/assets/img/ts/img480.png differ diff --git a/assets/img/ts/img481.png b/assets/img/ts/img481.png new file mode 100644 index 000000000000..8b1546c989c4 Binary files /dev/null and b/assets/img/ts/img481.png differ diff --git a/assets/img/ts/img482.png b/assets/img/ts/img482.png new file mode 100644 index 000000000000..47aed0670b80 Binary files /dev/null and b/assets/img/ts/img482.png differ diff --git a/assets/img/ts/img483.png b/assets/img/ts/img483.png new file mode 100644 index 000000000000..75324b552318 Binary files /dev/null and b/assets/img/ts/img483.png differ diff --git a/assets/img/ts/img484.png b/assets/img/ts/img484.png new file mode 100644 index 000000000000..da0a355ce140 Binary files /dev/null and b/assets/img/ts/img484.png differ diff --git a/assets/img/ts/img485.png b/assets/img/ts/img485.png new file mode 100644 index 000000000000..3a49bd24374d Binary files /dev/null and b/assets/img/ts/img485.png differ diff --git a/assets/img/ts/img486.png b/assets/img/ts/img486.png new file mode 100644 index 000000000000..8b7fbebe8531 Binary files /dev/null and b/assets/img/ts/img486.png differ diff --git a/assets/img/ts/img487.png b/assets/img/ts/img487.png new file mode 100644 index 000000000000..c56649abc46a Binary files /dev/null and b/assets/img/ts/img487.png differ diff --git a/assets/img/ts/img488.png b/assets/img/ts/img488.png new file mode 100644 index 000000000000..3b2363ab5d80 Binary files /dev/null and b/assets/img/ts/img488.png differ diff --git 
a/assets/img/ts/img489.png b/assets/img/ts/img489.png new file mode 100644 index 000000000000..7beac04b9258 Binary files /dev/null and b/assets/img/ts/img489.png differ