Master comprehensive preprocessing techniques for time series data: data cleaning, smoothing, transformation, and quality assessment, with rigorous mathematical foundations.
Missing-data mechanisms fall into three types: missing completely at random (MCAR, missingness is unrelated to any data values), missing at random (MAR, missingness depends only on observed values), and missing not at random (MNAR, missingness depends on the unobserved value itself).
Example: for monthly retail sales with December 2018 missing, seasonal interpolation (the average of the other Decembers) preserves the seasonal pattern better than imputing the global mean.
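As a minimal sketch of this difference (the sales figures and dates below are hypothetical), the seasonal fill can be done in pandas by averaging the same calendar month across years:

import pandas as pd
import numpy as np

# Hypothetical monthly retail sales with December 2018 missing
idx = pd.date_range("2016-01-01", "2019-12-01", freq="MS")
rng = np.random.default_rng(0)
sales = pd.Series(100 + 30 * (idx.month == 12) + rng.normal(0, 5, len(idx)), index=idx)
sales.loc["2018-12-01"] = np.nan

# Global-mean imputation ignores the December peak
global_fill = sales.fillna(sales.mean())

# Seasonal imputation: fill each gap with the average of the same calendar month
seasonal_fill = sales.fillna(sales.groupby(sales.index.month).transform("mean"))

print(global_fill.loc["2018-12-01"], seasonal_fill.loc["2018-12-01"])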
n-period simple moving average (SMA): $\mathrm{SMA}_t = \frac{1}{n}\sum_{i=0}^{n-1} y_{t-i}$
A centered moving average aligns the window symmetrically around $t$; for even $n$, average two adjacent $n$-period windows (a $2\times n$ MA) so the result is centered on an observation.
Simple exponential smoothing: $S_t = \alpha y_t + (1-\alpha)S_{t-1}$, with $0 < \alpha \le 1$ and one-step forecast $\hat{y}_{t+1} = S_t$.
Holt's (two-parameter) smoothing for trend: level $\ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1})$, trend $b_t = \beta(\ell_t - \ell_{t-1}) + (1-\beta)b_{t-1}$, forecast $\hat{y}_{t+h} = \ell_t + h\,b_t$.
Choose α,β by cross-validation or grid search; higher α weights recent observations more.
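A short sketch using statsmodels' Holt-Winters module; the series and the fixed α value are illustrative, not prescriptive:

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, Holt

# Hypothetical trending monthly series
idx = pd.date_range("2015-01-01", periods=60, freq="MS")
y = pd.Series(50 + 0.8 * np.arange(60) + np.random.default_rng(1).normal(0, 3, 60), index=idx)

# Simple exponential smoothing with a fixed alpha
ses = SimpleExpSmoothing(y).fit(smoothing_level=0.3, optimized=False)

# Holt's linear-trend method; alpha and beta estimated from the data
holt = Holt(y).fit()

print(ses.fittedvalues.tail(3))   # smoothed level
print(holt.forecast(6))           # 6-step-ahead forecasts from level + trend
print(holt.params)                # fitted smoothing parameters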
Compute seasonal averages per period $k$: $\bar{y}_k = \frac{1}{n_k}\sum_{t:\ \mathrm{period}(t)=k} y_t$, where $n_k$ is the number of observations falling in period $k$.
Seasonal index: $s_k = \bar{y}_k - \bar{y}$ (additive) or the ratio $s_k = \bar{y}_k / \bar{y}$ for multiplicative cases, where $\bar{y}$ is the overall mean.
Seasonal adjustment: subtract or divide by seasonal index to obtain seasonally adjusted series.
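A sketch of the additive index on hypothetical monthly data, where the period $k$ is the calendar month:

import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", periods=48, freq="MS")
rng = np.random.default_rng(2)
y = pd.Series(200 + 15 * np.sin(2 * np.pi * idx.month / 12) + rng.normal(0, 3, 48), index=idx)

# Additive seasonal index: per-month average minus the overall mean
monthly_mean = y.groupby(y.index.month).mean()
seasonal_index = monthly_mean - y.mean()

# Seasonally adjusted series: subtract each observation's monthly index
month_of_obs = pd.Series(y.index.month, index=y.index)
adjusted = y - month_of_obs.map(seasonal_index)

print(seasonal_index.round(2))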
A weakly stationary series has a constant mean, constant finite variance, and an autocovariance that depends only on the lag. Common checks include visual inspection, the ACF/PACF, and statistical tests (ADF, KPSS).
Common transformations: logarithm and Box-Cox (to stabilize variance), first differencing (to remove trend), and seasonal differencing (to remove a stable seasonal pattern).
Example: log-differencing is often used for economic series with growing variance: $\Delta \log y_t = \log y_t - \log y_{t-1}$, which is approximately the period-over-period growth rate.
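To make this concrete, a small sketch on a synthetic exponentially growing series, checking the transformed series with both ADF and KPSS from statsmodels:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

# Hypothetical series with exponential growth and level-dependent variance
rng = np.random.default_rng(3)
y = pd.Series(np.exp(0.02 * np.arange(200) + rng.normal(0, 0.05, 200).cumsum()))

# Log-difference: approximately the period-over-period growth rate
dlog_y = np.log(y).diff().dropna()

# ADF null: unit root (non-stationary); KPSS null: stationary
adf_p = adfuller(dlog_y)[1]
kpss_p = kpss(dlog_y, regression="c", nlags="auto")[1]
print(f"ADF p-value: {adf_p:.3f}, KPSS p-value: {kpss_p:.3f}")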
The z-score method detects observations that deviate significantly from the mean: $z_t = \frac{y_t - \bar{y}}{s}$, where $\bar{y}$ and $s$ are the sample mean and standard deviation.
Typically flag observations with $|z_t| > 2.5$ or $|z_t| > 3$ as outliers.
The IQR method is based on quartiles and robust to extreme values: flag observations outside $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$, where $\mathrm{IQR} = Q_3 - Q_1$.
A more robust alternative uses the median absolute deviation (modified z-score): $M_t = \frac{0.6745\,(y_t - \tilde{y})}{\mathrm{MAD}}$,
where $\mathrm{MAD} = \mathrm{median}\,|y_t - \tilde{y}|$ and $\tilde{y}$ is the sample median; observations with $|M_t| > 3.5$ are typically flagged.
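All three rules above fit in a few lines of pandas/numpy; this sketch uses synthetic data with two injected outliers (the thresholds are conventions, not fixed constants):

import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
y = pd.Series(rng.normal(100, 10, 200))
y.iloc[[20, 75]] = [180, 15]          # inject two obvious outliers

# Z-score rule
z = (y - y.mean()) / y.std()
z_flags = z.abs() > 3

# IQR rule
q1, q3 = y.quantile(0.25), y.quantile(0.75)
iqr = q3 - q1
iqr_flags = (y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)

# Modified z-score based on the median absolute deviation
med = y.median()
mad = (y - med).abs().median()
m = 0.6745 * (y - med) / mad
mad_flags = m.abs() > 3.5

print(y[z_flags | iqr_flags | mad_flags])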
Additive outliers (AO) affect only a single observation: $y_t = x_t + \omega\,I_t(T)$,
where $I_t(T) = 1$ if $t = T$, 0 otherwise, $x_t$ is the uncontaminated series, and $\omega$ is the outlier magnitude.
Innovational outliers (IO) affect the observation and propagate through the system: $y_t = x_t + \frac{\theta(B)}{\phi(B)}\,\omega\,I_t(T)$.
The impact decays according to the model's dynamics.
Level shifts (LS) are permanent changes in the series level: $y_t = x_t + \omega\,S_t(T)$,
where $S_t(T) = 1$ if $t \ge T$, 0 otherwise.
Use intervention analysis or state space methods to estimate clean series while preserving underlying patterns.
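One simplified way to sketch intervention analysis is regression with ARIMA errors: dummy regressors for a pulse (AO) and a step (level shift) enter as exogenous variables, and their estimated effects are removed from the series. The series, break dates, and ARIMA order below are hypothetical:

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(5)
n = 120
y = pd.Series(rng.normal(0, 1, n)).cumsum() + 50
y.iloc[30] += 10          # hypothetical additive outlier at t = 30
y.iloc[60:] += 8          # hypothetical level shift starting at t = 60

# Intervention regressors: pulse for the AO, step for the level shift
exog = pd.DataFrame({
    "pulse_30": (np.arange(n) == 30).astype(float),
    "step_60": (np.arange(n) >= 60).astype(float),
})

# ARIMA(0,1,1) with intervention dummies
res = SARIMAX(y, exog=exog, order=(0, 1, 1)).fit(disp=False)
print(res.params[["pulse_30", "step_60"]])   # estimated intervention magnitudes

# Cleaned series: subtract the estimated intervention effects
cleaned = y - exog @ res.params[["pulse_30", "step_60"]]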
Use robust regression techniques or M-estimators that automatically downweight outliers during parameter estimation.
For each point $x_t$, fit a weighted polynomial regression using nearby points, choosing coefficients to minimize $\sum_j w_j(t)\,\big(y_j - \beta_0 - \beta_1(x_j - x_t)\big)^2$,
where the weights $w_j(t) = \big(1 - |(x_j - x_t)/h|^3\big)^3$ (tricube kernel, zero outside the window) decrease with distance from $x_t$,
and $h$ is the bandwidth parameter controlling smoothness.
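statsmodels ships a local-linear LOWESS implementation; here the `frac` argument plays the role of the bandwidth $h$, and the data and value 0.15 are illustrative:

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(6)
x = np.arange(200, dtype=float)
y = np.sin(x / 20) + rng.normal(0, 0.2, 200)

# Locally weighted regression; frac is the share of points in each local window
smoothed = lowess(y, x, frac=0.15, return_sorted=False)
print(smoothed[:5])   # smoothed values aligned with x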
The HP filter solves: $\min_{\tau}\ \sum_{t=1}^{T}(y_t - \tau_t)^2 + \lambda \sum_{t=2}^{T-1}\big[(\tau_{t+1} - \tau_t) - (\tau_t - \tau_{t-1})\big]^2$
Balances fit to the data (first term) against the smoothness penalty (second term); larger $\lambda$ gives a smoother trend, with $\lambda = 1600$ the conventional choice for quarterly data.
The optimal trend is: $\hat{\tau} = (I + \lambda D^\top D)^{-1} y$,
where $D$ is the $(T-2)\times T$ second-difference matrix.
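In practice the filter is a one-liner; a sketch with statsmodels' hpfilter on a synthetic trending series (λ = 1600 assumes quarterly-like data):

import numpy as np
import pandas as pd
from statsmodels.tsa.filters.hp_filter import hpfilter

rng = np.random.default_rng(7)
y = pd.Series(rng.normal(0, 1, 160)).cumsum() + np.linspace(0, 20, 160)

# Decompose into cyclical component and smooth trend
cycle, trend = hpfilter(y, lamb=1600)
print(trend.tail(3))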
Measurement equation: $y_t = Z_t \alpha_t + \varepsilon_t$, $\varepsilon_t \sim N(0, H_t)$
State transition: $\alpha_{t+1} = T_t \alpha_t + R_t \eta_t$, $\eta_t \sim N(0, Q_t)$
where $\alpha_t$ is the unobserved state (trend, cycle, etc.)
The Kalman filter and smoother provide optimal smoothed estimates of the underlying level.
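A minimal sketch of the local level model (a special case of the state space form above) using statsmodels' UnobservedComponents; the data are simulated:

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.structural import UnobservedComponents

rng = np.random.default_rng(8)
level = 10 + rng.normal(0, 0.3, 200).cumsum()      # hypothetical random-walk level
y = pd.Series(level + rng.normal(0, 1.0, 200))     # noisy observations

# Local level model: y_t = mu_t + eps_t,  mu_{t+1} = mu_t + eta_t
res = UnobservedComponents(y, level="local level").fit(disp=False)

smoothed_level = res.smoothed_state[0]   # Kalman-smoothed estimate of mu_t
print(smoothed_level[:5])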
Little's test evaluates the null hypothesis that the data are MCAR: a chi-square statistic compares the observed means within each missing-data pattern to the expected means estimated (via EM) under the MCAR assumption.
Jarque-Bera test statistic: $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$
where $S$ is the sample skewness and $K$ is the sample kurtosis; under normality, $JB \sim \chi^2_2$.
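A quick check of the formula against scipy's implementation, on simulated heavy-tailed residuals:

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
resid = rng.standard_t(df=4, size=500)    # heavy-tailed residuals

jb_stat, jb_p = stats.jarque_bera(resid)
S = stats.skew(resid)
K = stats.kurtosis(resid, fisher=False)   # Pearson kurtosis (normal = 3)
manual = len(resid) / 6 * (S**2 + (K - 3) ** 2 / 4)
print(jb_stat, manual, jb_p)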
ARCH-LM test for time-varying variance: regress the squared residuals on their own lags, $\hat{\varepsilon}_t^2 = \alpha_0 + \sum_{i=1}^{q}\alpha_i\,\hat{\varepsilon}_{t-i}^2 + u_t$.
Test $H_0: \alpha_1 = \dots = \alpha_q = 0$ (no ARCH effects) with $LM = nR^2 \sim \chi^2_q$.
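A sketch of the test on simulated ARCH(1)-style residuals using statsmodels; the lag order q = 4 and the het_arch keyword nlags (recent statsmodels versions) are assumptions here:

import numpy as np
from statsmodels.stats.diagnostic import het_arch

rng = np.random.default_rng(10)
e = np.zeros(500)
for t in range(1, 500):
    # variance depends on the previous squared shock (ARCH effect)
    e[t] = rng.normal(0, np.sqrt(0.2 + 0.7 * e[t - 1] ** 2))

# LM statistic and p-value from the auxiliary regression on q = 4 lags
lm_stat, lm_p, f_stat, f_p = het_arch(e, nlags=4)
print(f"LM = {lm_stat:.2f}, p = {lm_p:.4f}")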
# Load packages
library(forecast)   # tsclean() for outlier treatment and gap filling
library(imputeTS)   # na_interpolation() for imputation
library(tseries)    # adf.test() for stationarity testing

# Data preprocessing pipeline
ts_preprocess <- function(x) {
  # 1. Outlier detection and treatment
  x_clean <- tsclean(x)

  # 2. Missing value imputation
  x_imputed <- na_interpolation(x_clean)

  # 3. Stationarity testing
  adf_test <- adf.test(x_imputed)

  # 4. Differencing if needed
  if (adf_test$p.value > 0.05) {
    x_diff <- diff(x_imputed)
  } else {
    x_diff <- x_imputed
  }

  return(x_diff)
}
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller

def preprocess_timeseries(df, col):
    # 1. Outlier detection using the IQR rule
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # 2. Replace outliers with NaN
    df_clean = df.copy()
    outliers = (df[col] < lower_bound) | (df[col] > upper_bound)
    df_clean.loc[outliers, col] = np.nan

    # 3. Interpolate missing values
    df_clean[col] = df_clean[col].interpolate(method='linear')

    # 4. Test stationarity and difference if needed
    adf_result = adfuller(df_clean[col].dropna())
    if adf_result[1] > 0.05:
        df_clean[col + '_diff'] = df_clean[col].diff()

    return df_clean
Apply interpolation and model-based imputation to sample datasets.
Compute seasonal indices and produce seasonally adjusted series.
Run ADF/KPSS tests and perform differencing transforms.