diff --git a/_posts/2023-10-06-(CL_paper101)LLM4TS.md b/_posts/2023-10-06-(CL_paper101)LLM4TS.md
new file mode 100644
index 000000000000..f01ff05c0d98
--- /dev/null
+++ b/_posts/2023-10-06-(CL_paper101)LLM4TS.md
@@ -0,0 +1,364 @@
+---
+title: (paper 101) LLM4TS; Two-stage Fine-tuning for TSF with Pretrained LLMs
+categories: [TS,NLP]
+tags: []
+excerpt: 2023
+---
+
+<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>
+
+# LLM4TS: Two-stage Fine-tuning for TSF with Pretrained LLMs (2023)
+
+<br>
+
+https://arxiv.org/pdf/2308.08469.pdf
+
+## Contents
+
+0. Abstract
+0. Introduction
+0. Related Work
+   0. In-modality Knowledge Transfer
+   0. Cross-modality Knowledge Transfer
+
+0. Problem Formulation
+0. Method
+   0. Two-stage FT
+   0. Details of LLM4TS
+
+0. Experiments
+   0. MTS forecasting
+   0. Few-shot learning
+   0. SSL
+   0. Ablation Study
+
+
+<br>
+
+# 0. Abstract
+
+Limited large-scale TS data for building robust foundation models
+
+$$\rightarrow$$ use **Pre-trained Large Language Models (LLMs)** to enhance TSF
+
+
+
+Details
+
+- time-series patching + temporal encoding
+
+- prioritize a two-stage fine-tuning process: 
+  - step 1) **supervised fine-tuning** to orient the LLM towards TS
+  - step 2) **task-specific downstream fine-tuning**
+- **Parameter-Efficient Fine-Tuning (PEFT)** techniques
+- Experiment
+  - robust representation learner
+  - effective few-shot learner
+
+<br>
+
+# 1. Introduction
+
+Background
+
+- Foundation models in NLP & CV
+
+- Limited availability of large-scale TS data to train robust foundation models
+
+$$\rightarrow$$ utilizing **pre-trained Large Language Models (LLMs) as powerful representation learners for TS**
+
+<br>
+
+To integrate LLMs with TS data ... 2 pivotal questions
+
+- Q1) How can we ***input time-series data into LLMs?*** 
+- Q2) How can we utilize pre-trained LLMs ***without distorting their inherent features?***
+
+<br>
+
+## Q1) Input TS into LLMs
+
+To accommodate new data modalities within LLMs...
+
+$$\rightarrow$$ essential to **(1) tokenize the data** ( feat. PatchTST )
+
+**(2) Channel-independence** 
+
+<br>
+
+Summary
+
+- introduce a novel approach that integrates **temporal information**, 
+- while employing the techniques of **patching and channel-independence**
+
+<br>
+
+## Q2) Utilize LLM without distorting inherent features
+
+High-quality chatbots ( InstructGPT, ChatGPT )
+
+- require strategic alignment of a pre-trained model with instruction-based data through **supervised fine-tuning**
+
+  $$\rightarrow$$ ensures the model becomes **familiarized with target data** formats
+
+<br>
+
+Introduce a **TWO-stage fine-tuning approach**
+
+- step 1) supervised fine-tuning
+  - guiding the LLM towards TS data
+- step 2) downstream fine-tuning 
+  - geared towards TSF task
+
+( still, there is a need to enhance the **pre-trained LLMs’ adaptability** to new data modalities **without distorting models’ inherent features** )
+
+$$\rightarrow$$ two Parameter-Efficient Fine-Tuning (PEFT) techniques
+
+- Layer Normalization Tuning (Lu et al. 2021)
+- LoRA (Hu et al. 2021)
+
+to optimize model flexibility without extensive parameter adjustments
+
+<br>
+
+## Summary
+
+1. Integration of Time-Series with LLMs
+   - patching and channel-independence to tokenize TS data
+   - novel approach to integrate temporal information with patching
+2. Adaptable Fine-Tuning for LLMs
+   - two-stage fine-tuning methodology
+     - step 1) supervised fine-tuning stage to align LLMs with TS
+     - step 2) downstream fine-tuning stage dedicated to TSF task
+3. Optimized Model Flexibility
+   - To ensure both robustness and adaptability
+   - two PEFT techniques
+     - Layer Normalization Tuning
+     - LoRA
+
+4. Real-World Application Relevance
+
+<br>
+
+# 2. Related Work
+
+## (1) In-modality Knowledge Transfer
+
+Foundation models
+
+- capable of transferring knowledge to downstream tasks
+- transformation of LLMs into chatbots 
+  - ex) InstructGPT, ChatGPT: employ supervised fine-tuning
+
+<br>
+
+Limitation of fine-tuning
+
+- computational burden of refining an entire model can be significant. 
+
+$$\rightarrow$$ solution: PEFT ( Parameter Efficient Fine Tuning )
+
+<br>
+
+**PEFT ( Parameter Efficient Fine Tuning )**
+
+- popular technique to reduce costs
+- ex) LLaMA-Adapter (Gao et al. 2023) : achieves ChatGPT-level performance by fine-tuning a mere 0.02% of its parameters
+
+<br>
+
+**LLM4TS**: integrate supervised fine-tuning & PEFT 
+
+<br>
+
+## (2) Cross-Modality Knowledge Transfer
+
+Transfer across diverse data modalities
+
+- ex) NLP $$\rightarrow$$ Image (Lu et al. 2021), Audio (Ghosal et al. 2023), TS (Zhou et al. 2023)
+- ex) CV $$\rightarrow$$ 12 distinct modalities (Zhang et al. 2023)
+
+<br>
+
+**LLM4TS**: utilize pretrained LLM expertise to address challenges in TS data
+
+<br>
+
+# 3. Problem Formulation
+
+![figure2](/assets/img/ts/img479.png)
+
+<br>
+
+# 4. Method
+
+LLM4TS framework
+
+- leveraging the pre-trained GPT-2
+- (4-1) introduce the two-stage fine-tuning training strategy
+- (4-2) details
+  - instance normalization
+  - patching
+  - channel-independence
+  - three distinct encodings
+
+<br>
+
+![figure2](/assets/img/ts/img480.png)
+
+<br>
+
+## (1) Two-stage FT
+
+### a) Supervised FT: Autoregressive
+
+GPT-2 (Radford et al. 2019): causal language model
+
+$$\rightarrow$$ **supervised fine-tuning** adopts the same **autoregressive training methodology** used during its pretraining phase.
+
+<br>
+
+Given 1st, 2nd, 3rd patches..
+
+Predict 2nd, 3rd, 4th patches..
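+
+As a rough sketch of this objective (the notation $$p_i$$, $$\hat{p}_i$$, $$N$$, $$f_\theta$$ and the squared-error loss are illustrative assumptions, not taken from the paper), with patches $$p_1, \ldots, p_N$$ of one channel:
+
+$$\hat{p}_{i+1}=f_\theta\left(p_1, \ldots, p_i\right), \quad \mathcal{L}_{\text{sup}}=\frac{1}{N-1} \sum_{i=1}^{N-1}\left\lVert\hat{p}_{i+1}-p_{i+1}\right\rVert_2^2$$
+
+The causal attention mask of GPT-2 is what ensures that $$\hat{p}_{i+1}$$ only sees $$p_1, \ldots, p_i$$.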
+
+<br>
+
+### b) Downstream FT: Forecasting
+
+2 primary strategies are available
+
+- (1) full fine-tuning 
+- (2) linear probing 
+
+<br>
+
+The sequential approach (LP-FT), which runs both in order, is preferred:
+
+- step 1) LP: linear probing for the first half of the epochs
+- step 2) FT: full fine-tuning for the second half of the epochs
+
+<br>
+
+## (2) Details of LLM4TS
+
+### a) Instance Normalization 
+
+- z-score norm) standard for TSF
+- RevIN) further boosts accuracy
+
+<br>
+
+Since RevIN is designed for unpatched TS, using it during supervised fine-tuning raises 2 problems
+
+- (1) Denormalization is infeasible as outputs remain in the patched format rather than the unpatched format. 
+- (2) RevIN’s trainable affine transformation is not appropriate for AR models.
+
+$$\rightarrow$$ Employ "standard instance norm" during Sup FT
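+
+A minimal sketch of what standard instance normalization means here, assuming per-instance, per-channel z-scoring over an input window of length $$L$$ (the symbols $$\mu, \sigma, \epsilon$$ are my notation, not the paper's):
+
+$$\mu=\frac{1}{L} \sum_{t=1}^L x_t, \quad \sigma^2=\frac{1}{L} \sum_{t=1}^L\left(x_t-\mu\right)^2, \quad \tilde{x}_t=\frac{x_t-\mu}{\sqrt{\sigma^2+\epsilon}}$$
+
+RevIN adds a trainable affine transform $$\gamma \tilde{x}_t+\beta$$ on top of this and inverts both steps on the model output; the standard version simply drops $$\gamma, \beta$$.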
+
+<br>
+
+### b) Patching & Channel Independence
+
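+- as in PatchTST: each channel is treated as an independent univariate series, and consecutive time steps are grouped into patches that serve as input tokens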
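+
+For intuition, a sketch of the patch count under assumed symbols (input length $$L$$, patch length $$P$$, stride $$S$$, no padding):
+
+$$N=\left\lfloor\frac{L-P}{S}\right\rfloor+1$$
+
+e.g., with illustrative values $$L=336, P=16, S=8$$ : $$N=\lfloor 320 / 8\rfloor+1=41$$ tokens per channel. PatchTST first pads the end of the series with $$S$$ repeated values, which gives one extra patch ( 42 ).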
+
+<br>
+
+### c) Three Encodings
+
+(1) Token embedding
+
+- via 1D-convolution
+
+(2) Positional encoding
+
+- use the standard approach and employ a trainable lookup table
+
+(3) Temporal encoding
+
+- numerous studies suggest the advantage of incorporating temporal information with Transformer-based models in time-series analysis (Wen et al. 2022). 
+- Problem
+  - (1) a patch spans multiple timestamps, so it is unclear which timestamp should represent it
+  - (2) each timestamp carries various temporal attributes ( minute, hour, day .. )
+- Solution
+  - (1) designate the initial timestamp as its representative
+  - (2) employ a trainable lookup table for each attribute
+
+$$\rightarrow$$ the final embedding is the sum of (1), (2), and (3)
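+
+Written out for the $$i$$-th patch (a sketch; the symbols $$e_i$$, $$t_i$$, $$\mathcal{A}$$, $$E_{\text{pos}}$$, $$E_a$$ are assumptions for illustration, and I assume the per-attribute embeddings are summed):
+
+$$e_i=\underbrace{\operatorname{Conv1D}\left(p_i\right)}_{\text{token}}+\underbrace{E_{\text{pos}}[i]}_{\text{positional}}+\underbrace{\sum_{a \in \mathcal{A}} E_a\left[a\left(t_i\right)\right]}_{\text{temporal}}$$
+
+- $$t_i$$ : initial timestamp of patch $$i$$
+- $$\mathcal{A}$$ : set of temporal attributes ( minute, hour, day .. ), each with its own lookup table $$E_a$$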
+
+<br>
+
+### d) Pre-Trained LLM and PEFT
+
+Freeze
+
+- most parameters, particularly those associated with the multi-head attention and feedforward layers within the Transformer block.
+- many studies indicate that **retaining most parameters as non-trainable often yields better results** than training a pre-trained LLM from scratch (Lu et al. 2021; Zhou et al. 2023). 
+
+<br>
+
+Tune
+
+- employ PEFT techniques as efficient approaches
+- (1) utilize the selection-based method ....  **Layer Normalization Tuning (Lu et al. 2021)**
+  - adjust pre-existing parameters by making the affine transformation in layer normalization trainable.
+- (2) employ LoRA (Low-Rank Adaptation) (Hu et al. 2021)
+  - reparameterization-based method that leverages low-rank representations
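+
+For reference, the LoRA reparameterization of Hu et al. 2021 (up to a constant scaling factor), with a frozen weight $$W_0 \in \mathbb{R}^{d \times k}$$ and rank $$r \ll \min (d, k)$$ :
+
+$$W=W_0+B A, \quad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}$$
+
+Only $$A, B$$ are trained: $$r(d+k)$$ parameters instead of $$d k$$. As an illustration only ( $$r=8$$ is an assumed value, not the paper's setting ), a $$768 \times 768$$ matrix has $$589{,}824$$ weights, while its LoRA update has $$8 \times(768+768)=12{,}288$$, about 2.1% of that.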
+
+<br>
+
+Summary : **only 1.5% of the model’s total parameters** are trainable.
+
+<br>
+
+### e) Output Layer
+
+Supervised FT
+
+- output remains in the form of patched TS ( tokens )
+  - employ a linear layer to modify the final dimension. 
+
+<br>
+
+Downstream fine-tuning stage
+
+- transforms the output from the patched format to the unpatched format
+  - requiring flattening before the linear layer
+
+<br>
+
+For both, use dropout immediately after the linear transformation
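+
+In terms of shapes, per channel (a sketch with assumed symbols: $$N$$ patches, hidden size $$D$$, patch length $$P$$, forecast horizon $$T_{\text{out}}$$ ):
+
+$$\text{Supervised FT}: \mathbb{R}^{N \times D} \xrightarrow{\text{Linear}} \mathbb{R}^{N \times P}, \qquad \text{Downstream FT}: \mathbb{R}^{N \times D} \xrightarrow{\text{Flatten}} \mathbb{R}^{N D} \xrightarrow{\text{Linear}} \mathbb{R}^{T_{\text{out}}}$$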
+
+<br>
+
+# 5. Experiments
+
+## (1) MTS forecasting
+
+![figure2](/assets/img/ts/img481.png)
+
+<br>
+
+## (2) Few-shot learning
+
+![figure2](/assets/img/ts/img482.png)
+
+<br>
+
+## (3) SSL
+
+![figure2](/assets/img/ts/img483.png)
+
+<br>
+
+## (4) Ablation Study
+
+### a) Supervised FT, Temporal Encoding, PEFT
+
+![figure2](/assets/img/ts/img484.png)
+
+<br>
+
+### b) Training Strategies in Downstream FT
+
+![figure2](/assets/img/ts/img485.png)
diff --git a/_posts/2023-10-07-(CL_paper102)DCDetector.md b/_posts/2023-10-07-(CL_paper102)DCDetector.md
new file mode 100644
index 000000000000..280972e5f649
--- /dev/null
+++ b/_posts/2023-10-07-(CL_paper102)DCDetector.md
@@ -0,0 +1,336 @@
+---
+title: (paper 102) DCdetector; Dual Attention Contrastive Representation Learning for TS Anomaly Detection
+categories: [TS]
+tags: []
+excerpt: KDD 2023
+---
+
+<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>
+
+# DCdetector: Dual Attention Contrastive Representation Learning for TS Anomaly Detection (KDD 2023)
+
+<br>
+
+https://arxiv.org/pdf/2306.10347.pdf
+
+## Contents
+
+0. Abstract
+0. Introduction
+0. Methodology
+   0. Overall Architecture
+   0. Dual Attention Contrastive Structure
+   0. Representation Discrepancy
+   0. Anomaly Criterion
+
+
+
+<br>
+
+# 0. Abstract
+
+Challenge of TS anomaly detection 
+
+- learn a representation map that enables effective discrimination of anomalies. 
+
+<br>
+
+Categories of methods
+
+- Reconstruction-based methods
+- Contrastive learning 
+
+<br>
+
+### DCdetector
+
+- a **multi-scale dual attention** contrastive representation learning model
+  - utilizes a novel dual attention **asymmetric design** to create the permutated environment
+- learn a **permutation invariant representation** with superior discrimination abilities
+
+<br>
+
+# 1. Introduction
+
+### Challenges in TS-AD
+
+- (1) It is hard to determine in advance what anomalies will look like.
+- (2) Anomalies are rare
+  - hard to get labels
+  - most supervised or semi-supervised methods fail to work given limited labeled training data.
+
+- (3) Should consider temporal, multidimensional, and non-stationary features for TS
+
+<br>
+
+### TS anomaly detection methods
+
+( ex. statistical, classic machine learning, and deep learning based methods )
+
+- Supervised and Semi-supervised methods
+  - cannot handle the challenge of limited labeled data
+- Unsupervised methods 
+  - without strict requirements on labeled data
+  - ex) one class classification-based, probabilistic based, distance-based, forecasting-based, reconstruction-based approaches
+
+<br>
+
+### Examples
+
+- **Reconstruction-based methods**
+  - pros) developing rapidly thanks to (a) their power in handling complex data when combined with different machine learning models and (b) their interpretability: instances that are reconstructed poorly are the ones behaving abnormally.
+  - cons) challenging to learn a well-reconstructed model for normal data without being obstructed by anomalies. 
+- **Contrastive Learning**
+  - outstanding performance in downstream tasks in computer vision
+  - effectiveness of contrastive representation learning still needs to be explored in TS-AD
+
+<br>
+
+### DCdetector
+
+( Dual attention Contrastive representation learning anomaly detector  )
+
+- handle the challenges in **TS AD**
+
+- key idea : **normal TS points share the latent pattern**
+
+  ( = normal points have **strong correlations** with other points <-> anomalies do not )
+
+- Learning consistent representations :
+  - **hard** for anomalies
+  - **easy** for normal points
+- Motivation : if normal and abnormal points’ representations are distinguishable, we can detect anomalies **without a highly qualified reconstruction model**
+
+<br>
+
+### Details
+
+- contrastive structure with **two branches & dual attention**
+  - two branches share weights
+- representation difference between normal and abnormal data is enlarged
+- patching-based attention networks: to capture the temporal dependency 
+- multi-scale design: to reduce information loss during patching
+- channel independence design for MTS
+- does not require prior knowledge about anomalies
+
+<br>
+
+# 2. Methodology
+
+MTS of length $$T$$ : $$\mathcal{X}=\left(x_1, x_2, \ldots, x_T\right)$$
+
+- where $$x_t \in \mathbb{R}^d$$ 
+
+<br>
+
+Task: 
+
+- given input TS  $$\mathcal{X}$$, 
+- for another unknown test sequence $$\mathcal{X}_{\text {test }}$$ of length $$T^{\prime}$$ 
+
+- we want to predict $$\mathcal{Y}_{\text {test }}=\left(y_1, y_2, \ldots, y_{T^{\prime}}\right)$$. 
+  - $$y_t \in\{0,1\}$$ : 1 = anomaly & 0 = normal
+
+<br>
+
+Inductive bias ( as Anomaly Transformer explored )
+
+- ***anomalies have less connection with the whole TS than with their adjacent points***
+- Anomaly Transformer: detects anomalies by association discrepancy between ..
+  - (1) a learned Gaussian kernel 
+  - (2) attention weight distribution. 
+- DCdetector
+  - via a dual-attention self-supervised contrastive-type structure.
+
+<br>
+
+### Comparison
+
+![figure2](/assets/img/ts/img486.png)
+
+1. Reconstruction-based approach 
+2. Anomaly Transformer 
+   - observation that it is difficult to build nontrivial associations from abnormal points to the whole series. 
+   - discrepancies
+     - prior discrepancy : learned with Gaussian Kernel 
+     - association discrepancy : learned with a transformer module
+   - MinMax association learning & Reconstruction loss
+
+3. DCdetector
+   - concise ( does not need a specially designed Gaussian Kernel, a MinMax learning strategy, or a reconstruction loss )
+   - mainly leverages the designed **CL-based dual-branch attention** for **discrepancy learning** of anomalies in different views
+
+<br>
+
+## (1) Overall Architecture
+
+![figure2](/assets/img/ts/img487.png)
+
+4 main components
+
+1. Forward Process module
+2. Dual Attention Contrastive Structure module
+3. Representation Discrepancy module
+4. Anomaly Criterion module. 
+
+<br>
+
+![figure2](/assets/img/ts/img488.png)
+
+### a) Forward Process module
+
+( channel-independent )
+
+- a-1) instance normalization 
+- a-2) patching
+
+<br>
+
+### b) Dual Attention Contrastive Structure module
+
+- each channel shares the same self-attention network
+- representation results are concatenated as the final output $$\left(X^{\prime} \in \mathbb{R}^{N \times d}\right)$$. 
+- Dual Attention Contrastive Structure module
+  - learns the representation of inputs in different views.
+
+<br>
+
+### c) Representation Discrepancy module
+
+Key Insight
+
+- normal points: share the same latent pattern even in different views (a strong correlation is not easily destroyed).
+- anomalies: rare & do not have explicit patterns
+
+$$\rightarrow$$ the difference between the representations in different views will be slight for normal points and large for anomalies.
+
+<br>
+
+### d) Anomaly Criterion module
+
+- calculate anomaly scores based on the discrepancy between the two representations
+
+- use a predefined threshold for AD
+
+<br>
+
+## (2) Dual Attention Contrastive Structure
+
+TS from different views: takes ..
+
+- (1) patch-wise representations
+- (2) in-patch representations
+
+<br>
+
+Does not construct pairs like typical contrastive methods
+
+- similar to the contrastive methods only using positive samples
+
+<br>
+
+### a) Dual Attention
+
+Input time series $$\mathcal{X} \in \mathbb{R}^{T \times d}$$ is patched as $$\mathcal{X} \in \mathbb{R}^{P \times N \times d}$$
+
+- $$P$$ : patch size
+- $$N$$ : number of patches
+
+<br>
+
+Fuse the channel information with the batch dimension ( $$\because$$ channel independence )
+
+$$\rightarrow$$  becomes $$\mathcal{X} \in \mathbb{R}^{P \times N}$$. 
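+
+A quick sanity check on the shapes, assuming non-overlapping patches so that $$T=P \times N$$ (the numbers are illustrative, not from the paper): a window of $$T=105$$ with $$P=5$$ gives $$N=21$$, i.e. per channel
+
+$$\mathbb{R}^{105} \rightarrow \mathbb{R}^{5 \times 21}$$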
+
+<br>
+
+[ Patch-wise representation ]
+
+- single patch is considered as a unit
+  - the embedding operation is applied along the patch-size $$(P)$$ dimension
+- capture dependencies among patches ( = patch-wise attention )
+- embedding shape : $$X_{\mathcal{N}} \in \mathbb{R}^{N \times d_{\text {model }}}$$. 
+- apply multi-head attention to $$X_{\mathcal{N}}$$
+
+<br>
+
+[ In-patch representation ]
+
+- dependencies of points in the same patch
+  - the embedding operation is applied along the number-of-patches $$(N)$$ dimension
+
+<br>
+
+Note that $$W_{\mathcal{Q}_i}, W_{\mathcal{K}_i}$$ are **weights shared between the in-patch & patch-wise attention**
+
+<br>
+
+### b) Up-sampling and Multi-scale Design
+
+Patch-wise attention 
+
+- ignores the relevance among points in a patch
+
+In-patch attention 
+
+- ignores the relevance among patches. 
+
+<br>
+
+To compare these two representations .... need upsampling!
+
+![figure2](/assets/img/ts/img489.png)
+
+<br>
+
+Multi-scale design:
+
+= final representation concatenates results in **different scales (i.e., patch sizes)**
+
+- final patch-wise representation: $$\mathcal{N}$$
+  - $$\mathcal{N}=\sum_{\text{Patch list}} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{N}}\right)$$
+- final in-patch representation: $$\mathcal{P}$$
+  - $$\mathcal{P}=\sum_{\text{Patch list}} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{P}}\right)$$
+
+<br>
+
+### c) Contrastive Structure
+
+Patch-wise sample representation
+
+- learns a weighted combination **between sample points in the same position from each patch**
+
+In-patch sample representation
+
+- learns a weighted combination **between points within the same patch**. 
+
+$$\rightarrow$$ Treat these two representations as "permutated multi-view representations"
+
+<br>
+
+## (3) Representation Discrepancy
+Kullback-Leibler divergence (KL divergence)
+
+- used to measure the similarity between the two representations
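+
+For reference, the standard KL divergence between two discrete distributions $$p$$ and $$q$$ over the same index set (treating each position of the two representations as such a distribution is my reading of these notes):
+
+$$D_{\mathrm{KL}}(p \,\Vert\, q)=\sum_i p_i \log \frac{p_i}{q_i}$$
+
+- $$D_{\mathrm{KL}} \geq 0$$, with equality iff $$p=q$$
+- not symmetric: $$D_{\mathrm{KL}}(p \,\Vert\, q) \neq D_{\mathrm{KL}}(q \,\Vert\, p)$$ in general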
+
+<br>
+
+### Loss function definition
+
+( no reconstruction part is used )
+
+$$\mathcal{L}\{\mathcal{P}, \mathcal{N} ; X\}=\frac{1}{2} \mathcal{D}(\mathcal{P}, \operatorname{Stopgrad}(\mathcal{N}))+\frac{1}{2} \mathcal{D}(\mathcal{N}, \operatorname{Stopgrad}(\mathcal{P}))$$
+
+- Stop-gradient : to train 2 branches asynchronously
+
+<br>
+
+## (4) Anomaly Criterion
+
+Final anomaly score of $$\mathcal{X} \in \mathbb{R}^{T \times d}$$ :
+- $$\text{AnomalyScore}(X)=\frac{1}{2} \mathcal{D}(\mathcal{P}, \mathcal{N})+\frac{1}{2} \mathcal{D}(\mathcal{N}, \mathcal{P})$$
+
+<br>
+
+$$y_i= \begin{cases}1: \text{anomaly} & \text{AnomalyScore}\left(X_i\right) \geq \delta \\ 0: \text{normal} & \text{AnomalyScore}\left(X_i\right)<\delta\end{cases}$$
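+
+A toy calculation of the score (illustrative numbers, and assuming $$\mathcal{D}$$ is the KL divergence with natural log): take $$\mathcal{P}=(0.5, 0.5)$$ and $$\mathcal{N}=(0.9, 0.1)$$ at one time step.
+
+$$\mathcal{D}(\mathcal{P}, \mathcal{N})=0.5 \ln \frac{0.5}{0.9}+0.5 \ln \frac{0.5}{0.1} \approx 0.511, \quad \mathcal{D}(\mathcal{N}, \mathcal{P})=0.9 \ln \frac{0.9}{0.5}+0.1 \ln \frac{0.1}{0.5} \approx 0.368$$
+
+$$\text{AnomalyScore} \approx \frac{1}{2}(0.511+0.368) \approx 0.44$$
+
+With a hypothetical threshold $$\delta=0.1$$ this point would be flagged as an anomaly; if the two views agreed exactly ( $$\mathcal{P}=\mathcal{N}$$ ), the score would be 0.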
+
diff --git a/assets/img/ts/img479.png b/assets/img/ts/img479.png
new file mode 100644
index 000000000000..e374f0a490be
Binary files /dev/null and b/assets/img/ts/img479.png differ
diff --git a/assets/img/ts/img480.png b/assets/img/ts/img480.png
new file mode 100644
index 000000000000..9855ce77fa3b
Binary files /dev/null and b/assets/img/ts/img480.png differ
diff --git a/assets/img/ts/img481.png b/assets/img/ts/img481.png
new file mode 100644
index 000000000000..8b1546c989c4
Binary files /dev/null and b/assets/img/ts/img481.png differ
diff --git a/assets/img/ts/img482.png b/assets/img/ts/img482.png
new file mode 100644
index 000000000000..47aed0670b80
Binary files /dev/null and b/assets/img/ts/img482.png differ
diff --git a/assets/img/ts/img483.png b/assets/img/ts/img483.png
new file mode 100644
index 000000000000..75324b552318
Binary files /dev/null and b/assets/img/ts/img483.png differ
diff --git a/assets/img/ts/img484.png b/assets/img/ts/img484.png
new file mode 100644
index 000000000000..da0a355ce140
Binary files /dev/null and b/assets/img/ts/img484.png differ
diff --git a/assets/img/ts/img485.png b/assets/img/ts/img485.png
new file mode 100644
index 000000000000..3a49bd24374d
Binary files /dev/null and b/assets/img/ts/img485.png differ
diff --git a/assets/img/ts/img486.png b/assets/img/ts/img486.png
new file mode 100644
index 000000000000..8b7fbebe8531
Binary files /dev/null and b/assets/img/ts/img486.png differ
diff --git a/assets/img/ts/img487.png b/assets/img/ts/img487.png
new file mode 100644
index 000000000000..c56649abc46a
Binary files /dev/null and b/assets/img/ts/img487.png differ
diff --git a/assets/img/ts/img488.png b/assets/img/ts/img488.png
new file mode 100644
index 000000000000..3b2363ab5d80
Binary files /dev/null and b/assets/img/ts/img488.png differ
diff --git a/assets/img/ts/img489.png b/assets/img/ts/img489.png
new file mode 100644
index 000000000000..7beac04b9258
Binary files /dev/null and b/assets/img/ts/img489.png differ