diff --git a/site/content/en/blog/_index.md b/site/content/en/blog/_index.md index 2442435e4..862a02fba 100644 --- a/site/content/en/blog/_index.md +++ b/site/content/en/blog/_index.md @@ -1,6 +1,5 @@ --- title: "Crane Blog" -linkTitle: "Blog" menu: main: weight: 30 diff --git a/site/content/en/docs/Core Concept/timeseriees-forecasting-by-dsp.md b/site/content/en/docs/Core Concept/timeseries-forecasting-by-dsp.md similarity index 76% rename from site/content/en/docs/Core Concept/timeseriees-forecasting-by-dsp.md rename to site/content/en/docs/Core Concept/timeseries-forecasting-by-dsp.md index c9bd73e7d..b2eaf062c 100644 --- a/site/content/en/docs/Core Concept/timeseriees-forecasting-by-dsp.md +++ b/site/content/en/docs/Core Concept/timeseries-forecasting-by-dsp.md @@ -2,6 +2,7 @@ title: "Time Series Forecast Algorithm-DSP" description: "Introduction for DSP Algorithm" weight: 16 +math: true --- Time series forecasting refers to using historical time series data to predict future values. Time series data typically consists of time and corresponding values, such as resource usage, stock prices, or temperature. DSP (Digital Signal Processing) is a digital signal processing technique that can be used for analyzing and processing time series data. @@ -22,7 +23,7 @@ This article will introduce the implementation process and parameter settings of It is common for monitoring data to be missing at certain time points, and Crane will fill in the missing sampling points based on the surrounding data. The method is as follows: -Assume that the sampling data between the m-th and n-th sampling points are missing (m+1 P_{threshold}$, then add $k$ to the list of candidate periods. +3. Compute the FFT of the original sequence \\(\vec x(n)\\) to obtain \\(\vec X(f)\\). Traverse \\(k = 2, 3, ...\\), and if \\(P_k = \|X(k)\| > P_{threshold}\\), then add \\(k\\) to the list of candidate periods. 
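The three candidate-period steps above can be sketched in Python with NumPy. This is an illustrative sketch, not Crane's actual implementation: the function name `candidate_periods` and its parameters are hypothetical, \\(P_{max}\\) is taken to be the largest spectral amplitude of each shuffled sequence, and the bins \\(k = 0\\) and \\(k = 1\\) are assumed to be skipped in the shuffled spectra as well.

```python
import numpy as np


def candidate_periods(x, n_shuffles=100, percentile=99, seed=0):
    """Return FFT bin indices k (2 <= k <= N/2) whose amplitude exceeds
    a threshold derived from randomly permuted copies of the series.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    rng = np.random.default_rng(seed)

    # Steps 1-2: shuffle the series, take the largest spectral amplitude
    # of each shuffled copy, and use a high percentile as the threshold.
    # Assumption: k = 0 (DC) and k = 1 are excluded here too.
    p_max = []
    for _ in range(n_shuffles):
        spectrum = np.abs(np.fft.rfft(rng.permutation(x)))
        p_max.append(spectrum[2:].max())
    threshold = np.percentile(p_max, percentile)

    # Step 3: keep every bin of the original spectrum above the threshold.
    # rfft returns bins 0..N/2, i.e. only the first half of the spectrum.
    amplitude = np.abs(np.fft.rfft(x))
    return [k for k in range(2, n // 2 + 1) if amplitude[k] > threshold]
```

Each returned \\(k\\) corresponds to a period of \\(N/k\\) samples, which is multiplied by the sampling interval to get a duration. Shuffling destroys any periodic structure, so the threshold estimates how large an amplitude can arise by chance.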
#### Auto Correlation Function Auto Correlation Function (ACF) is the cross-correlation of a signal with itself at different time points. In simple terms, it is a function of the time lag between two observations that measures the similarity between them. -Crane uses circular autocorrelation function (Circular ACF), which first extends the time series of length $N$ by using $N$ as the period. This means that the sequence $\vec x(n)$ is copied over the interval $..., [-N, -1], [N, 2N-1], ...$, resulting in a new sequence $\vec x'(n)$ that is used for analysis. +Crane uses circular autocorrelation function (Circular ACF), which first extends the time series of length \\(N\\) by using \\(N\\) as the period. This means that the sequence \\(\vec x(n)\\) is copied over the interval \\(..., [-N, -1], [N, 2N-1], ...\\), resulting in a new sequence \\(\vec x'(n)\\) that is used for analysis. -The correlation coefficient between $\vec x'(n+k)$ and $\vec x'(n)$ is computed for each shift $k=1,2,3,...N/2$, where $\vec x'(n)$is shifted by k. +The correlation coefficient between \\(\vec x'(n+k)\\) and \\(\vec x'(n)\\) is computed for each shift \\(k=1,2,3,...N/2\\), where \\(\vec x'(n)\\) is shifted by \\(k\\). $$r_k={\displaystyle\sum_{i=-k}^{N-k-1} (x_i-\mu)(x_{i+k}-\mu) \over \displaystyle\sum_{i=0}^{N-1} (x_i-\mu)^2}\ \ \ \mu: mean$$ -Instead of directly computing the ACF using the definition mentioned above, Crane uses the following formula and performs two FFT operations to calculate the ACF in $O(nlogn)$ time. +Instead of directly computing the ACF using the definition mentioned above, Crane uses the following formula and performs two FFT operations to calculate the ACF in \\(O(nlogn)\\) time. 
$$\vec r = IFFT(|FFT({\vec x - \mu \over \sigma})|^2)\ \ \ \mu: mean,\ \sigma: standard\ deviation$$ -The ACF is represented graphically as shown below, where the x-axis represents the time lag $k$ and the y-axis represents the autocorrelation coefficient $r_k$, which reflects the degree of similarity between the shifted signal and the original signal. +The ACF is represented graphically as shown below, where the x-axis represents the time lag \\(k\\) and the y-axis represents the autocorrelation coefficient \\(r_k\\), which reflects the degree of similarity between the shifted signal and the original signal. ![](/images/algorithm/dsp/acf.png) @@ -106,11 +107,13 @@ Crane selects a section of the curve on each side and performs linear regression #### Predict Based on the primary cycle obtained in the previous step, Crane provides two methods to fit (predict) the time series data for the next cycle. + **maxValue** -The first method is to select the maximum value at time $t$(e.g. 6:00 PM) for each of the past few cycles, and use it as the predicted value for the next cycle at time $t$ +The first method is to select the maximum value at time \\(t\\)(e.g. 6:00 PM) for each of the past few cycles, and use it as the predicted value for the next cycle at time \\(t\\). ![](/images/algorithm/dsp/max_value.png) + **fft** The second method is to perform FFT on the original time series to obtain a frequency spectrum sequence, remove the "high-frequency noise", and then perform IFFT (inverse fast Fourier transform). The resulting time series is used as the predicted result for the next cycle. @@ -143,8 +146,6 @@ spec: sampleInterval: "60s" # The sampling interval for monitoring data is 1 minute. historyLength: "15d" # Pull the monitoring metrics from the past 15 days as the basis for prediction estimators: # Specify the prediction method, including maxValue and fft. 
Multiple estimators with different configurations can be specified for each method, and Crane will select the one with the highest fitting degree to generate the prediction results. If not specified, fft will be used by default -# maxValue: -# - marginFraction: "0.1" fft: - marginFraction: "0.2" lowAmplitudeThreshold: "1.0" @@ -173,7 +174,7 @@ The meanings of some dsp parameters in the example above are as follows: In simple terms, the fewer frequency components retained, the lower the upper frequency limit, and the higher the spectral amplitude lower limit, the smoother the predicted curve will be, but some details will be lost. Conversely, more detailed features are preserved with more frequency components retained, resulting in a more jagged curve. -Below are two predicted curves for the same time period. The blue and green lines have different highFrequencyThreshold values of $0.01$ and $0.001$, respectively. The blue curve filters out more high frequency components, resulting in a smoother curve. +Below are two predicted curves for the same time period. The blue and green lines have different highFrequencyThreshold values of \\(0.01\\) and \\(0.001\\), respectively. The blue curve filters out more high frequency components, resulting in a smoother curve. 
![](/images/algorithm/dsp/lft_0_001.png) ![](/images/algorithm/dsp/lft_0_01.png) diff --git a/site/content/zh/blog/_index.md b/site/content/zh/blog/_index.md index 2442435e4..6830a1390 100644 --- a/site/content/zh/blog/_index.md +++ b/site/content/zh/blog/_index.md @@ -1,7 +1,7 @@ --- -title: "Crane Blog" -linkTitle: "Blog" +title: "博客" menu: main: weight: 30 --- + \ No newline at end of file diff --git a/site/content/zh/docs/Core Concept/timeseriees-forecasting-by-dsp.md b/site/content/zh/docs/Core Concept/timeseries-forecasting-by-dsp.md similarity index 74% rename from site/content/zh/docs/Core Concept/timeseriees-forecasting-by-dsp.md rename to site/content/zh/docs/Core Concept/timeseries-forecasting-by-dsp.md index a440a016c..6390d3968 100644 --- a/site/content/zh/docs/Core Concept/timeseriees-forecasting-by-dsp.md +++ b/site/content/zh/docs/Core Concept/timeseries-forecasting-by-dsp.md @@ -2,6 +2,7 @@ title: "时间序列预测算法-DSP" description: "Introduction for DSP Algorithm" weight: 16 +math: true --- 时间序列预测是指使用过去的时间序列数据来预测未来的值。时间序列数据通常包括时间和相应的数值,例如资源用量、股票价格或气温。时间序列预测算法 DSP(Digital Signal Processing)是一种数字信号处理技术,可以用于分析和处理时间序列数据。 @@ -19,7 +20,7 @@ Crane使用在数字信号处理(Digital Signal Processing)领域中常用 #### 填充缺失数据 监控数据在某些时间点上缺失是很常见的现象,Crane会根据前后的数据对缺失的采样点进行填充。做法如下: -假设第$m$个与第$n$个采样点之间采样数据缺失($m+1 < n$),设在$m$和$n$点的采样值分别为$v_m$和$v_n$,令$$\Delta = {v_n-v_m \over n-m}$$,则$m$和$n$之间的填充数据依次为$v_m+\Delta , v_m+2\Delta , ...$ +假设第\\(m\\)个与第\\(n\\)个采样点之间采样数据缺失(\\(m+1 < n\\)),设在\\(m\\)和\\(n\\)点的采样值分别为\\(v_m\\)和\\(v_n\\),令$$\Delta = {v_n-v_m \over n-m}$$,则\\(m\\)和\\(n\\)之间的填充数据依次为$$v_m+\Delta , v_m+2\Delta , ...$$ ![](/images/algorithm/dsp/missing_data_fill.png) #### 去除异常点 @@ -30,26 +31,26 @@ Crane使用在数字信号处理(Digital Signal Processing)领域中常用 这些极端的异常点对于信号的周期判断会造成干扰,需要进行去除。做法如下: -选取实际序列中所有采样点的$P99.9$和$P0.1$,分别作为上、下限阈值,如果某个采样值低于下限或者高于上限,将采样点的值设置为前一个采样值。 +选取实际序列中所有采样点的\\(P99.9\\)和\\(P0.1\\),分别作为上、下限阈值,如果某个采样值低于下限或者高于上限,将采样点的值设置为前一个采样值。 ![](/images/algorithm/dsp/remove_outliers.png) #### 离散傅里叶变换 
-对监控的时间序列(设长度为$N$)做快速离散傅里叶变换(FFT),得到信号的频谱图(spectrogram),频谱图直观地表现为在各个离散点$k$处的「冲击」。 -冲击的高度为$k$对应周期分量的「幅度」,$k$的取值范围$\(0,1,2, ... N-1\)$。 +对监控的时间序列(设长度为\\(N\\))做快速离散傅里叶变换(FFT),得到信号的频谱图(spectrogram),频谱图直观地表现为在各个离散点\\(k\\)处的「冲击」。 +冲击的高度为\\(k\\)对应周期分量的「幅度」,\\(k\\)的取值范围\\(\(0,1,2, ... N-1\)\\)。 -$k = 0$对应信号的「直流分量」,对于周期没有影响,因此忽略。 +\\(k = 0\\)对应信号的「直流分量」,对于周期没有影响,因此忽略。 -由于离散傅里叶变换后的频谱序列前一半和后一半是共轭对称的,反映到频谱图上就是关于轴对称,因此只看前一半$N/2$即可。 +由于离散傅里叶变换后的频谱序列前一半和后一半是共轭对称的,反映到频谱图上就是关于轴对称,因此只看前一半\\(N/2\\)即可。 -$k$所对应的周期$$T = {N \over k} \bullet SampleInterval$$ +\\(k\\)所对应的周期$$T = {N \over k} \bullet SampleInterval$$ -要观察一个信号是不是以$T$为周期,至少需要观察两倍的$T$的长度,因此通过长度为$N$的序列能够识别出的最长周期为$N/2$。所以可以忽略$k = 1$。 +要观察一个信号是不是以\\(T\\)为周期,至少需要观察两倍的\\(T\\\)的长度,因此通过长度为\\(N\\)的序列能够识别出的最长周期为\\(N/2\\)。所以可以忽略\\(k = 1\\)。 -至此,$k$的取值范围为$(2, 3, ... , N/2)$,对应的周期为$N/2, N/3, ...$,这也就是FFT能够提供的周期信息的「分辨率」。如果一个信号的周期没有落到$N/k$上,它会散布到整个频域,导致「频率泄漏」。 +至此,\\(k\\)的取值范围为\\((2, 3, ... , N/2)\\),对应的周期为\\(N/2, N/3, ...\\),这也就是FFT能够提供的周期信息的「分辨率」。如果一个信号的周期没有落到\\(N/k\\)上,它会散布到整个频域,导致「频率泄漏」。 好在在实际生产环境中,我们通常遇到的应用(尤其是在线业务),如果有规律,都是以「天」为周期的,某些业务可能会有所谓的「周末」效应,即周末和工作日不太一样,如果扩大到「周」的粒度去观察,它们同样具有良好的周期性。 -Crane没有尝试发现任意长度的周期,而是指定几个固定的周期长度($1d、7d$)去判断。并通过截取、填充的方式,保证序列的长度$N$为待检测周期$T$的整倍数,例如:$T=1d,N=3d;T=7d,N=14d$。 +Crane没有尝试发现任意长度的周期,而是指定几个固定的周期长度(\\(1d、7d\\))去判断。并通过截取、填充的方式,保证序列的长度\\(N\\)为待检测周期\\(T\\)的整倍数,例如:$$T=1d,N=3d;T=7d,N=14d$$。 我们从生产环境中抓取了一些应用的监控指标,保存为csv格式,放到`pkg/prediction/dsp/test_data`目录下。 例如,`input0.csv`文件包括了一个应用连续8天的CPU监控数据,对应的时间序列如下图: @@ -66,24 +67,24 @@ Crane没有尝试发现任意长度的周期,而是指定几个固定的周期 上面是我们通过直觉判断的,Crane是如何挑选「候选周期」的呢? -1. 对原始序列$\vec x(n)$进行一个随机排列后得到序列$\vec x'(n)$,再对$\vec x'(n)$做FFT得到$\vec X'(k)$,令$P_{max} = argmax\|\vec X'(k)\|$。 +1. 对原始序列\\(\vec x(n)\\)进行一个随机排列后得到序列\\(\vec x'(n)\\),再对\\(\vec x'(n)\\)做FFT得到\\(\vec X'(k)\\),令\\(P_{max} = argmax\|\vec X'(k)\|\\)。 -2. 重复100次上述操作,得到100个$P_{max}$,取$P99$作为阈值$P_{threshold}$。 +2. 重复100次上述操作,得到100个\\(P_{max}\\),取\\(P99\\)作为阈值\\(P_{threshold}\\)。 -3. 
对原始序列$\vec x(n)$做FFT得到$\vec X(f)$,遍历$k = 2, 3, ...$,如果$P_k = \|X(k)\| > P_{threshold}$,则将$k$加入候选周期。 +3. 对原始序列\\(\vec x(n)\\)做FFT得到\\(\vec X(f)\\),遍历\\(k = 2, 3, ...\\),如果\\(P_k = \|X(k)\| > P_{threshold}\\),则将\\(k\\)加入候选周期。 #### 循环自相关函数 自相关函数(Auto Correlation Function,ACF)是一个信号于其自身在不同时间点的互相关。通俗的讲,它就是两次观察之间的相似度对它们之间的时间差的函数。 -Crane使用循环自相关函数(Circular ACF),先对长度为$N$的时间序列以$N$为周期做扩展,也就是在$..., [-N, -1], [N, 2N-1], ...$区间上复制$\vec x(n)$,得到一个新的序列$\vec x'(n)$。 -再依次计算将$\vec x'(n)$依次平移$k=1,2,3,...N/2$后的$\vec x'(n+k)$与$\vec x'(n)$的相关系数 +Crane使用循环自相关函数(Circular ACF),先对长度为\\(N\\)的时间序列以\\(N\\)为周期做扩展,也就是在\\(..., [-N, -1], [N, 2N-1], ...\\)区间上复制\\(\vec x(n)\\),得到一个新的序列\\(\vec x'(n)\\)。 +再依次计算将\\(\vec x'(n)\\)依次平移\\(k=1,2,3,...N/2\\)后的\\(\vec x'(n+k)\\)与\\(\vec x'(n)\\)的相关系数 $$r_k={\displaystyle\sum_{i=-k}^{N-k-1} (x_i-\mu)(x_{i+k}-\mu) \over \displaystyle\sum_{i=0}^{N-1} (x_i-\mu)^2}\ \ \ \mu: mean$$ -Crane没有直接使用上面的定义去计算ACF,而是根据下面的公式,通过两次$(I)FFT$,从而能够在$O(nlogn)$的时间内完成ACF的计算。 +Crane没有直接使用上面的定义去计算ACF,而是根据下面的公式,通过两次\\((I)FFT\\),从而能够在\\(O(nlogn)\\)的时间内完成ACF的计算。 $$\vec r = IFFT(|FFT({\vec x - \mu \over \sigma})|^2)\ \ \ \mu: mean,\ \sigma: standard\ deviation$$ -ACF的图像如下所示,横轴代表信号平移的时间长度$k$;纵轴代表自相关系数$r_k$,反应了平移信号与原始信号的「相似」程度。 +ACF的图像如下所示,横轴代表信号平移的时间长度\\(k\\);纵轴代表自相关系数\\(r_k\\),反应了平移信号与原始信号的「相似」程度。 ![](/images/algorithm/dsp/acf.png) @@ -99,7 +100,7 @@ Crane在两侧个各选取一段曲线,分别做线性回归,当回归后左 根据上一步得到的主周期,Crane提供了两种方式去拟合(预测)下一个周期的时序数据 **maxValue** -选取过去几个周期中相同时刻$t$(例如:下午6:00)中的最大值,作为下一个周期$t$时刻的预测值。 +选取过去几个周期中相同时刻\\(t\\)(例如:下午6:00)中的最大值,作为下一个周期\\(t\\)时刻的预测值。 ![](/images/algorithm/dsp/max_value.png) **fft** @@ -161,7 +162,7 @@ spec: 简单来说,保留频率分量的数量越少、频率上限越低、频谱幅度下限越高,预测出来的曲线越光滑,但会丢失一些细节;反之,曲线毛刺越多,保留更多细节。 -下面是对同一时段预测的两条曲线,蓝色、绿色的`highFrequencyThreshold`分别为$0.01$和$0.001$,蓝色曲线过滤掉了更多的高频分量,因此更为平滑。 +下面是对同一时段预测的两条曲线,蓝色、绿色的`highFrequencyThreshold`分别为\\(0.01\\)和\\(0.001\\),蓝色曲线过滤掉了更多的高频分量,因此更为平滑。 ![](/images/algorithm/dsp/lft_0_001.png) ![](/images/algorithm/dsp/lft_0_01.png) diff --git 
a/site/layouts/partials/footer.html b/site/layouts/partials/footer.html new file mode 100644 index 000000000..f654bcbca --- /dev/null +++ b/site/layouts/partials/footer.html @@ -0,0 +1 @@ +{{ if .Params.math }}{{ partial "helpers/katex.html" . }}{{ end }} \ No newline at end of file diff --git a/site/layouts/partials/helpers/katex.html b/site/layouts/partials/helpers/katex.html new file mode 100644 index 000000000..e313d5931 --- /dev/null +++ b/site/layouts/partials/helpers/katex.html @@ -0,0 +1,5 @@ + + + + + \ No newline at end of file