diff --git a/_posts/2023-10-06-(CL_paper101)LLM4TS.md b/_posts/2023-10-06-(CL_paper101)LLM4TS.md
new file mode 100644
index 000000000000..f01ff05c0d98
--- /dev/null
+++ b/_posts/2023-10-06-(CL_paper101)LLM4TS.md
@@ -0,0 +1,364 @@
+---
+title: (paper 101) LLM4TS; Two-stage Fine-tuning for TSF with Pretrained LLMs
+categories: [TS,NLP]
+tags: []
+excerpt: 2023
+---
+
+
+
+# LLM4TS: Two-stage Fine-tuning for TSF with Pretrained LLMs (2023)
+
+
+
+https://arxiv.org/pdf/2308.08469.pdf
+
+## Contents
+
+0. Abstract
+0. Introduction
+0. Related Work
+ 0. In-modality Knowledge Transfer
+ 0. Cross-modality Knowledge Transfer
+
+0. Problem Formulation
+0. Method
+ 0. Two-stage FT
+ 0. Details of LLM4TS
+
+0. Experiments
+ 0. MTS forecasting
+ 0. Few-shot learning
+ 0. SSL
+ 0. Ablation Study
+
+
+
+
+# 0. Abstract
+
+Limited large-scale TS data for building robust foundation models
+
+$$\rightarrow$$ use **Pre-trained Large Language Models (LLMs)** to enhance TSF
+
+
+
+Details
+
+- time-series patching + temporal encoding
+
+- prioritize a two-stage fine-tuning process:
+ - step 1) **supervised fine-tuning** to orient the LLM towards TS
+ - step 2) **task-specific downstream fine-tuning**
+- **Parameter-Efficient Fine-Tuning (PEFT)** techniques
+- Experiment
+ - robust representation learner
+ - effective few-shot learner
+
+
+
+# 1. Introduction
+
+Background
+
+- Foundation models in NLP & CV
+
+- Limited availability of large-scale TS data to train robust foundation models
+
+$$\rightarrow$$ utilizing **pre-trained Large Language Models (LLMs) as powerful representation learners for TS**
+
+
+
+To integrate LLMs with TS data ... 2 pivotal questions
+
+- Q1) How can we ***input time-series data into LLMs?***
+- Q2) How can we utilize pre-trained LLMs ***without distorting their inherent features?***
+
+
+
+## Q1) Input TS into LLMs
+
+To accommodate new data modalities within LLMs...
+
+$$\rightarrow$$ essential to **(1) tokenize the data** ( feat. PatchTST )
+
+**(2) Channel-independence**
+
+
+
+Summary
+
+- introduce a novel approach that integrates **temporal information**,
+- while employing the techniques of **patching and channel-independence**
+
+
+
+## Q2) Utilize LLM without distorting inherent features
+
+High-quality chatbots ( InstructGPT, ChatGPT )
+
+- require strategic alignment of a pre-trained model with instruction-based data through **supervised fine-tuning**
+
+ $$\rightarrow$$ ensures the model becomes **familiarized with target data** formats
+
+
+
+Introduce a **TWO-stage fine-tuning approach**
+
+- step 1) supervised fine-tuning
+ - guiding the LLM towards TS data
+- step 2) downstream fine-tuning
+ - geared towards TSF task
+
+( still, there is a need to enhance the **pre-trained LLMs’ adaptability** to new data modalities **without distorting models’ inherent features** )
+
+$$\rightarrow$$ two Parameter-Efficient Fine-Tuning (PEFT) techniques
+
+- Layer Normalization Tuning (Lu et al. 2021)
+- LoRA (Hu et al. 2021)
+
+to optimize model flexibility without extensive parameter adjustments
+
+
+
+## Summary
+
+1. Integration of Time-Series with LLMs
+ - patching and channel-independence to tokenize TS data
+ - novel approach to integrate temporal information with patching
+2. Adaptable Fine-Tuning for LLMs
+ - two-stage fine-tuning methodology
+ - step 1) supervised fine-tuning stage to align LLMs with TS
+ - step 2) downstream fine-tuning stage dedicated to TSF task
+3. Optimized Model Flexibility
+ - To ensure both robustness and adaptability
+ - two PEFT techniques
+ - Layer Normalization Tuning
+ - LoRA
+
+4. Real-World Application Relevance
+
+
+
+# 2. Related Work
+
+## (1) In-modality Knowledge Transfer
+
+Foundation models
+
+- capable of transferring knowledge to downstream tasks
+- transformation of LLMs into chatbots
+ - ex) InstructGPT, ChatGPT: employ supervised fine-tuning
+
+
+
+Limitation of fine-tuning
+
+- computational burden of refining an entire model can be significant.
+
+$$\rightarrow$$ solution: PEFT ( Parameter Efficient Fine Tuning )
+
+
+
+**PEFT ( Parameter Efficient Fine Tuning )**
+
+- popular technique to reduce costs
+- ex) LLaMA-Adapter (Gao et al. 2023) : achieves ChatGPT-level performance by fine-tuning a mere 0.02% of its parameters
+
+
+
+**LLM4TS**: integrate supervised fine-tuning & PEFT
+
+
+
+## (2) Cross-Modality Knowledge Transfer
+
+Transfer across diverse data modalities
+
+- ex) NLP $$\rightarrow$$ Image (Lu et al. 2021), Audio (Ghosal et al. 2023), TS (Zhou et al. 2023)
+- ex) CV $$\rightarrow$$ 12 distinct modalities (Zhang et al. 2023)
+
+
+
+**LLM4TS**: utilize pretrained LLM expertise to address challenges in TS data
+
+
+
+# 3. Problem Formulation
+
+![figure2](/assets/img/ts/img479.png)
+
+
+
+# 4. Method
+
+LLM4TS framework
+
+- leveraging the pre-trained GPT-2
+- (4-1) introduce the two-stage finetuning training strategy
+- (4-2) details
+ - instance normalization
+ - patching
+ - channel-independence
+ - three distinct encodings
+
+
+
+![figure2](/assets/img/ts/img480.png)
+
+
+
+## (1) Two-stage FT
+
+### a) Supervised FT: Autoregressive
+
+GPT-2 (Radford et al. 2019): causal language model
+
+$$\rightarrow$$ **supervised fine-tuning** adopts the same **autoregressive training methodology** used during its pretraining phase.
+
+
+
+Given 1st, 2nd, 3rd patches..
+
+Predict 2nd, 3rd, 4th patches..
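+
+The shifted-by-one patch targets above can be sketched as follows. This is a minimal illustration, not the paper's code: the helper names `make_patches` and `autoregressive_pairs`, the patch length, and the stride are all assumptions chosen for the toy example.
+
+```python
+def make_patches(series, patch_len, stride):
+    """Split a 1-D series into (possibly overlapping) patches."""
+    return [series[i:i + patch_len]
+            for i in range(0, len(series) - patch_len + 1, stride)]
+
+
+def autoregressive_pairs(patches):
+    """Inputs are patches 1..n-1, targets are patches 2..n (shifted by one)."""
+    return patches[:-1], patches[1:]
+
+
+series = list(range(8))  # toy series: 0, 1, ..., 7
+patches = make_patches(series, patch_len=2, stride=2)
+inputs, targets = autoregressive_pairs(patches)
+print(inputs)   # [[0, 1], [2, 3], [4, 5]]
+print(targets)  # [[2, 3], [4, 5], [6, 7]]
+```
+
+Each target patch is simply the next input patch, which mirrors next-token prediction in a causal language model.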
+
+
+
+### b) Downstream FT: Forecasting
+
+2 primary strategies are available
+
+- (1) full fine-tuning
+- (2) linear probing
+
+
+
+Sequential approach (LP-FT) is good!
+
+- step 1) LP: linear probing ( first half of the epochs )
+- step 2) FT: full fine-tuning ( second half of the epochs )
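+
+A minimal sketch of the LP-FT split, assuming the epoch budget is simply halved. The function name `lp_ft_schedule` is hypothetical, and giving the extra epoch to FT when the budget is odd is an assumption, not something the paper states.
+
+```python
+def lp_ft_schedule(total_epochs):
+    """Return the training mode per epoch: linear probing first, then full fine-tuning."""
+    lp_epochs = total_epochs // 2  # assumption: an odd budget gives FT the extra epoch
+    return ["LP"] * lp_epochs + ["FT"] * (total_epochs - lp_epochs)
+
+
+print(lp_ft_schedule(10))
+```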
+
+
+
+## (2) Details of LLM4TS
+
+### a) Instance Normalization
+
+- z-score norm) standard for TSF
+- RevIN) further boosts accuracy
+
+
+
+(????) Since RevIN is designed for the unpatched TS .... 2 problems
+
+- (1) Denormalization is infeasible as outputs remain in the patched format rather than the unpatched format.
+- (2) RevIN’s trainable affine transformation is not appropriate for AR models.
+
+$$\rightarrow$$ Employ "standard instance norm" during Sup FT
+
+
+
+### b) Patching & Channel Independence
+
+- pass
+
+
+
+### c) Three Encodings
+
+(1) Token embedding
+
+- via 1D-convolution
+
+(2) Positional encoding
+
+- use the standard approach and employ a trainable lookup table
+
+(3) Temporal encoding
+
+- numerous studies suggest the advantage of incorporating temporal information with Transformer-based models in time-series analysis (Wen et al. 2022).
+- Problem
+ - (1) patch = multiple timestamps = which timestamp to use...?
+ - (2) each timestamp carries various temporal attributes ( minute, hour, day .. )
+- Solution
+ - (1) designate the initial timestamp as its representative
+ - (2) employ a trainable lookup table for each attribute
+
+$$\rightarrow$$ add (1) & (2) & (3)
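+
+Written out, the sum of the three encodings for the $$p$$-th patch could look as follows. The notation is an assumption made for this note, not the paper's: $$t_p$$ is the first timestamp of patch $$p$$, $$\mathcal{A}$$ is the set of temporal attributes ( minute, hour, day .. ), $$\mathbf{E}_a$$ is the trainable lookup table of attribute $$a$$, and the per-attribute embeddings are assumed to be summed.
+
+$$\mathbf{e}_p=\mathbf{e}_p^{\text {token }}+\mathbf{e}_p^{\text {pos }}+\sum_{a \in \mathcal{A}} \mathbf{E}_a\left[a\left(t_p\right)\right]$$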
+
+
+
+### d) Pre-Trained LLM and PEFT
+
+Freeze
+
+- most parameters, particularly those associated with the multi-head attention and feed-forward layers within the Transformer block.
+- many studies indicate that **retaining most parameters as non-trainable often yields better results** than training a pre-trained LLM from scratch (Lu et al. 2021; Zhou et al. 2023).
+
+
+
+Tune
+
+- employ PEFT techniques as efficient approaches
+- (1) utilize the selection-based method .... **Layer Normalization Tuning (Lu et al. 2021)**
+ - adjust pre-existing parameters by making the affine transformation in layer normalization trainable.
+- (2) employ LoRA (Low-Rank Adaptation) (Hu et al. 2021)
+ - reparameterization-based method that leverages low-rank representations
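+
+The low-rank reparameterization behind LoRA can be sketched with NumPy as below. This is an illustration under stated assumptions, not the LLM4TS implementation: the layer size, rank, scaling `alpha`, and the name `lora_forward` are all chosen for the toy example.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+d_in, d_out, rank, alpha = 64, 64, 2, 4.0   # toy sizes, chosen for illustration
+
+W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
+A = rng.normal(size=(rank, d_in)) * 0.01    # trainable, small random init
+B = np.zeros((d_out, rank))                 # trainable, zero init
+
+
+def lora_forward(x):
+    """Frozen path plus scaled low-rank update: W x + (alpha / rank) * B A x."""
+    return W @ x + (alpha / rank) * (B @ (A @ x))
+
+
+x = rng.normal(size=d_in)
+# With B = 0 the adapted layer starts out identical to the frozen layer.
+assert np.allclose(lora_forward(x), W @ x)
+
+trainable = A.size + B.size
+print(trainable, W.size)  # 256 4096
+```
+
+Only `A` and `B` would receive gradients, so the trainable parameter count grows with the rank rather than with the full weight matrix.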
+
+
+
+Summary : **only 1.5% of the model’s total parameters** are trainable.
+
+
+
+### e) Output Layer
+
+Supervised FT
+
+- output remains in the form of patched TS ( tokens )
+ - employ a linear layer to modify the final dimension.
+
+
+
+Downstream fine-tuning stage
+
+- transforms the patched $$\rightarrow$$ unpatched
+ - requiring flattening before the linear layer
+
+
+
+For both, use dropout immediately after the linear transformation
+
+
+
+# 5. Experiments
+
+## (1) MTS forecasting
+
+![figure2](/assets/img/ts/img481.png)
+
+
+
+## (2) Few-shot learning
+
+![figure2](/assets/img/ts/img482.png)
+
+
+
+## (3) SSL
+
+![figure2](/assets/img/ts/img483.png)
+
+
+
+## (4) Ablation Study
+
+### a) Supervised FT, Temporal Encoding, PEFT
+
+![figure2](/assets/img/ts/img484.png)
+
+
+
+### b) Training Strategies in Downstream FT
+
+![figure2](/assets/img/ts/img485.png)
diff --git a/_posts/2023-10-07-(CL_paper102)DCDetector.md b/_posts/2023-10-07-(CL_paper102)DCDetector.md
new file mode 100644
index 000000000000..280972e5f649
--- /dev/null
+++ b/_posts/2023-10-07-(CL_paper102)DCDetector.md
@@ -0,0 +1,336 @@
+---
+title: (paper 102) DCdetector; Dual Attention Contrastive Representation Learning for TS Anomaly Detection
+categories: [TS]
+tags: []
+excerpt: KDD 2023
+---
+
+
+
+# DCdetector: Dual Attention Contrastive Representation Learning for TS Anomaly Detection (KDD 2023)
+
+
+
+https://arxiv.org/pdf/2306.10347.pdf
+
+## Contents
+
+0. Abstract
+0. Introduction
+0. Methodology
+ 0. Overall Architecture
+ 0. Dual Attention Contrastive Structure
+ 0. Representation Discrepancy
+ 0. Anomaly Criterion
+
+
+
+
+
+# 0. Abstract
+
+Challenge of TS anomaly detection
+
+- learn a representation map that enables effective discrimination of anomalies.
+
+
+
+Categories of methods
+
+- Reconstruction-based methods
+- Contrastive learning
+
+
+
+### DCdetector
+
+- a **multi-scale dual attention** contrastive representation learning model
+ - utilizes a novel dual attention **asymmetric design** to create the permutated environment
+- learn a **permutation invariant representation** with superior discrimination abilities
+
+
+
+# 1. Introduction
+
+### Challenges in TS-AD
+
+- (1) Determining what the anomalies will be like.
+- (2) Anomalies are rare
+ - hard to get labels
+ - most supervised or semi-supervised methods fail to work given limited labeled training data.
+
+- (3) Should consider temporal, multidimensional, and non-stationary features for TS
+
+
+
+### TS anomaly detection methods
+
+( ex. statistical, classic machine learning, and deep learning based methods )
+
+- Supervised and Semi-supervised methods
+ - cannot handle the challenge of limited labeled data
+- Unsupervised methods
+ - without strict requirements on labeled data
+ - ex) one class classification-based, probabilistic based, distance-based, forecasting-based, reconstruction-based approaches
+
+
+
+### Examples
+
+- **Reconstruction-based methods**
+ - pros) developing rapidly, thanks to their power in handling complex data when combined with different machine learning models, and their interpretability ( poorly reconstructed instances are the ones behaving abnormally ).
+ - cons) challenging to learn a well-reconstructed model for normal data without being obstructed by anomalies.
+- **Contrastive Learning**
+ - outstanding performance in downstream tasks in computer vision
+ - effectiveness of contrastive representation learning still needs to be explored in TS-AD
+
+
+
+### DCdetector
+
+( Dual attention Contrastive representation learning anomaly detector )
+
+- handle the challenges in **TS AD**
+
+- key idea : **normal TS points share the latent pattern**
+
+ ( = normal points have **strong correlations** with other points <-> anomalies do not )
+
+- Learning consistent representations :
+ - **hard** for anomalies
+ - **easy** for normal points
+- Motivation : if normal and abnormal points’ representations are distinguishable, we can detect anomalies **without a highly qualified reconstruction model**
+
+
+
+### Details
+
+- contrastive structure with **two branches & dual attention**
+ - two branches share weights
+- representation difference between normal and abnormal data is enlarged
+- patching-based attention networks: to capture the temporal dependency
+- multi-scale design: to reduce information loss during patching
+- channel independence design for MTS
+- does not require prior knowledge about anomalies
+
+
+
+# 2. Methodology
+
+MTS of length $$\mathrm{T}$$ : $$\mathcal{X}=\left(x_1, x_2, \ldots, x_T\right)$$
+
+- where $$x_t \in \mathbb{R}^d$$
+
+
+
+Task:
+
+- given input TS $$\mathcal{X}$$,
+- for another unknown test sequence $$\mathcal{X}_{\text {test }}$$ of length $$T^{\prime}$$
+
+- we want to predict $$\mathcal{Y}_{\text {test }}=\left(y_1, y_2, \ldots, y_{T^{\prime}}\right)$$.
+ - $$y_t \in\{0,1\}$$ : 1 = anomaly & 0 = normal
+
+
+
+Inductive bias ( as Anomaly Transformer explored )
+
+- ***anomalies have less connection with the whole TS than with their adjacent points***
+- Anomaly Transformer: detects anomalies by association discrepancy between ..
+ - (1) a learned Gaussian kernel
+ - (2) attention weight distribution.
+- DCdetector
+ - via a dual-attention self-supervised contrastive-type structure.
+
+
+
+### Comparison
+
+![figure2](/assets/img/ts/img486.png)
+
+1. Reconstruction-based approach
+2. Anomaly Transformer
+ - observation that it is difficult to build nontrivial associations from abnormal points to the whole series.
+ - discrepancies
+ - prior discrepancy : learned with Gaussian Kernel
+ - association discrepancy : learned with a transformer module
+ - MinMax association learning & Reconstruction loss
+
+3. DCdetector
+ - concise ( does not need a specially designed Gaussian Kernel, a MinMax learning strategy, or a reconstruction loss )
+ - mainly leverages the designed **CL-based dual-branch attention** for **discrepancy learning** of anomalies in different views
+
+
+
+## (1) Overall Architecture
+
+![figure2](/assets/img/ts/img487.png)
+
+4 main components
+
+1. Forward Process module
+2. Dual Attention Contrastive Structure module
+3. Representation Discrepancy module
+4. Anomaly Criterion module.
+
+
+
+![figure2](/assets/img/ts/img488.png)
+
+### a) Forward Process module
+
+( channel-independent )
+
+- a-1) instance normalization
+- a-2) patching
+
+
+
+### b) Dual Attention Contrastive Structure module
+
+- each channel shares the same self-attention network
+- representation results are concatenated as the final output $$\left(X^{\prime} \in \mathbb{R}^{N \times d}\right)$$.
+- Dual Attention Contrastive Structure module
+ - learns the representation of inputs in different views.
+
+
+
+### c) Representation Discrepancy module
+
+Key Insight
+
+- normal points: share the same latent pattern even in different views (a strong correlation is not easily destroyed).
+- anomalies: rare & do not have explicit patterns
+
+$$\rightarrow$$ difference will be slight for normal points representations in different views and large for anomalies.
+
+
+
+### d) Anomaly Criterion module
+
+- calculate anomaly scores based on the discrepancy between the two representations
+
+- use a prior threshold for AD
+
+
+
+## (2) Dual Attention Contrastive Structure
+
+TS from different views: takes ..
+
+- (1) patch-wise representations
+- (2) in-patch representations
+
+
+
+Does not construct pairs like the typical contrastive methods
+
+- similar to the contrastive methods only using positive samples
+
+
+
+### a) Dual Attention
+
+Input time series $$\mathcal{X} \in \mathbb{R}^{T \times d}$$ are patched as $$\mathcal{X} \in \mathbb{R}^{P \times N \times d}$$
+
+- $$P$$ : patch size
+- $$N$$ : number of patches
+
+
+
+Fuse the channel information with the batch dimension ( $$\because$$ channel independence )
+
+$$\rightarrow$$ becomes $$\mathcal{X} \in \mathbb{R}^{P \times N}$$.
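+
+The shape bookkeeping above can be checked with a short NumPy sketch. The sizes are made up, non-overlapping patches with $$T$$ divisible by $$P$$ are assumed, and the variable names are hypothetical rather than taken from the DCdetector code.
+
+```python
+import numpy as np
+
+T, d, P = 12, 3, 4            # toy sizes; T must be divisible by P here
+N = T // P                    # number of patches
+X = np.arange(T * d, dtype=float).reshape(T, d)   # (T, d)
+
+# Cut the time axis into N consecutive patches of P points each: (N, P, d),
+# then move the patch-size axis first to match the (P, N, d) layout above.
+patched = X.reshape(N, P, d).transpose(1, 0, 2)   # (P, N, d)
+
+# Channel independence: treat each channel as its own sample,
+# giving d separate series of shape (P, N).
+per_channel = patched.transpose(2, 0, 1)          # (d, P, N)
+
+print(patched.shape, per_channel.shape)
+# Point t of channel c lives at offset t % P inside patch t // P.
+assert per_channel[1, 5 % P, 5 // P] == X[5, 1]
+```
+
+In a batched setting the leading channel axis would be merged with the batch axis, so the attention network only ever sees inputs of shape $$P \times N$$.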
+
+
+
+[ Patch-wise representation ]
+
+- single patch is considered as a unit
+ - the embedding operation is applied along the patch-size $$(P)$$ dimension
+- capture dependencies among patches ( = patch-wise attention )
+- embedding shape : $$X_{\mathcal{N}} \in \mathbb{R}^{N \times d_{\text {model }}}$$.
+- apply multi-head attention to $$X_{\mathcal{N}}$$
+
+
+
+[ In-patch representation ]
+
+- dependencies of points in the same patch
+ - the embedding operation is applied along the number-of-patches $$(N)$$ dimension
+
+
+
+Note that the $$W_{\mathcal{Q}_i}, W_{\mathcal{K}_i}$$ are **shared weights within the in-patch & patch-wise attention**
+
+
+
+### b) Up-sampling and Multi-scale Design
+
+Patch-wise attention
+
+- ignores the relevance among points in a patch
+
+In-patch attention
+
+- ignores the relevance among patches.
+
+
+
+To compare these two representations .... need upsampling!
+
+![figure2](/assets/img/ts/img489.png)
+
+
+
+Multi-scale design:
+
+= final representation concatenates results in **different scales (i.e., patch sizes)**
+
+- final patch-wise representation: $$\mathcal{N}$$
+ - $$\mathcal{N}=\sum_{\text {Patch list }} \operatorname{Upsampling}\left(\text { Attn }_{\mathcal{N}}\right)$$,
+- Final in-patch representation: $$\mathcal{P}$$
+ - $$\mathcal{P}=\sum_{\text {Patch list }} \text { Upsampling }\left(\text { Attn }_{\mathcal{P}}\right)$$.
+
+
+
+### c) Contrastive Structure
+
+Patch-wise sample representation
+
+- learns a weighted combination **between sample points in the same position from each patch**
+
+In-patch sample representation
+
+- learns a weighted combination **between points within the same patch**.
+
+$$\rightarrow$$ Treat these two representations as "permutated multi-view representations"
+
+
+
+## (3) Representation Discrepancy
+Kullback-Leibler divergence (KL divergence)
+
+- to measure the similarity of these two representations
+
+
+
+### Loss function definition
+
+( no reconstruction part is used )
+
+$$\mathcal{L}\{\mathcal{P}, \mathcal{N} ; X\}=\frac{1}{2} \mathcal{D}(\mathcal{P}, \operatorname{Stopgrad}(\mathcal{N}))+\frac{1}{2} \mathcal{D}(\mathcal{N}, \operatorname{Stopgrad}(\mathcal{P}))$$.
+
+- Stop-gradient : to train 2 branches asynchronously
+
+
+
+## (4) Anomaly Criterion
+
+Final anomaly score of $$\mathcal{X} \in \mathbb{R}^{T \times d}$$ :
+- $$\text { AnomalyScore }(X)=\frac{1}{2} \mathcal{D}(\mathcal{P}, \mathcal{N})+\frac{1}{2} \mathcal{D}(\mathcal{N}, \mathcal{P})$$.
+
+
+
+$$y_i= \begin{cases}1: \text { anomaly } & \text { AnomalyScore }\left(X_i\right) \geq \delta \\ 0: \text { normal } & \text { AnomalyScore }\left(X_i\right)<\delta\end{cases}$$.
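+
+The score and threshold rule can be sketched in plain Python as below. It assumes $$\mathcal{D}$$ is the KL divergence and that each branch yields a discrete distribution per time point. The function names, the `eps` smoothing term, the toy distributions, and the value of $$\delta$$ are all assumptions for illustration.
+
+```python
+import math
+
+
+def kl(p, q, eps=1e-12):
+    """KL(p || q) for two discrete distributions given as lists."""
+    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
+
+
+def anomaly_score(p, n):
+    """Symmetric score: 0.5 * D(P, N) + 0.5 * D(N, P)."""
+    return 0.5 * kl(p, n) + 0.5 * kl(n, p)
+
+
+def detect(P, N, delta):
+    """Label each time point 1 (anomaly) if its score >= delta, else 0 (normal)."""
+    return [int(anomaly_score(p, n) >= delta) for p, n in zip(P, N)]
+
+
+# Toy per-point distributions from the two branches (made-up numbers).
+P = [[0.5, 0.5], [0.9, 0.1]]
+N = [[0.5, 0.5], [0.1, 0.9]]
+print(detect(P, N, delta=0.5))  # [0, 1]
+```
+
+The first point has identical representations in both views, so its score is zero and it is labeled normal. The second point disagrees strongly across views and exceeds the threshold.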
+
diff --git a/assets/img/ts/img479.png b/assets/img/ts/img479.png
new file mode 100644
index 000000000000..e374f0a490be
Binary files /dev/null and b/assets/img/ts/img479.png differ
diff --git a/assets/img/ts/img480.png b/assets/img/ts/img480.png
new file mode 100644
index 000000000000..9855ce77fa3b
Binary files /dev/null and b/assets/img/ts/img480.png differ
diff --git a/assets/img/ts/img481.png b/assets/img/ts/img481.png
new file mode 100644
index 000000000000..8b1546c989c4
Binary files /dev/null and b/assets/img/ts/img481.png differ
diff --git a/assets/img/ts/img482.png b/assets/img/ts/img482.png
new file mode 100644
index 000000000000..47aed0670b80
Binary files /dev/null and b/assets/img/ts/img482.png differ
diff --git a/assets/img/ts/img483.png b/assets/img/ts/img483.png
new file mode 100644
index 000000000000..75324b552318
Binary files /dev/null and b/assets/img/ts/img483.png differ
diff --git a/assets/img/ts/img484.png b/assets/img/ts/img484.png
new file mode 100644
index 000000000000..da0a355ce140
Binary files /dev/null and b/assets/img/ts/img484.png differ
diff --git a/assets/img/ts/img485.png b/assets/img/ts/img485.png
new file mode 100644
index 000000000000..3a49bd24374d
Binary files /dev/null and b/assets/img/ts/img485.png differ
diff --git a/assets/img/ts/img486.png b/assets/img/ts/img486.png
new file mode 100644
index 000000000000..8b7fbebe8531
Binary files /dev/null and b/assets/img/ts/img486.png differ
diff --git a/assets/img/ts/img487.png b/assets/img/ts/img487.png
new file mode 100644
index 000000000000..c56649abc46a
Binary files /dev/null and b/assets/img/ts/img487.png differ
diff --git a/assets/img/ts/img488.png b/assets/img/ts/img488.png
new file mode 100644
index 000000000000..3b2363ab5d80
Binary files /dev/null and b/assets/img/ts/img488.png differ
diff --git a/assets/img/ts/img489.png b/assets/img/ts/img489.png
new file mode 100644
index 000000000000..7beac04b9258
Binary files /dev/null and b/assets/img/ts/img489.png differ