Master comprehensive preprocessing techniques for time series data: data cleaning, smoothing, transformation, and quality assessment, with rigorous mathematical foundations.
Missing values fall into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
Example: for monthly retail sales with a missing December 2018 value, seasonal interpolation (the average of the other Decembers) preserves the seasonal pattern better than imputing the global mean.
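A minimal pandas sketch of this comparison; the synthetic sales series and its dates are illustrative assumptions, not part of the original example:

import numpy as np
import pandas as pd

# Hypothetical monthly retail sales with a missing December 2018
idx = pd.date_range("2015-01-31", "2019-12-31", freq="M")
sales = pd.Series(100 + 10 * np.sin(2 * np.pi * idx.month / 12), index=idx)
sales.loc["2018-12-31"] = np.nan

# Seasonal imputation: fill the gap with the average of the other Decembers
month_means = sales.groupby(sales.index.month).transform("mean")
seasonal_fill = sales.fillna(month_means)

# Global-mean imputation for comparison (flattens the seasonal peak)
global_fill = sales.fillna(sales.mean())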
n-period simple moving average (SMA): SMA_t = (x_t + x_{t-1} + ... + x_{t-n+1}) / n
A centered moving average aligns the window around t; for even n, average two adjacent windows (a 2×n moving average) to re-center the result.
Simple exponential smoothing: s_t = α·x_t + (1 - α)·s_{t-1}, with 0 < α ≤ 1.
Holt's (two-parameter) smoothing for trend:
ℓ_t = α·x_t + (1 - α)(ℓ_{t-1} + b_{t-1})
b_t = β(ℓ_t - ℓ_{t-1}) + (1 - β)·b_{t-1}
Choose α and β by cross-validation or grid search; a higher α weights recent observations more heavily.
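A brief sketch of these smoothers with pandas and statsmodels; the synthetic series and the parameter values α = 0.3, β = 0.1 are illustrative, not recommendations:

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, Holt

x = pd.Series(np.cumsum(np.random.randn(120)) + 50)   # synthetic example series

sma = x.rolling(window=12).mean()                      # 12-period simple moving average
cma = x.rolling(window=12, center=True).mean()         # centered moving average

ses = SimpleExpSmoothing(x).fit(smoothing_level=0.3, optimized=False)
holt = Holt(x).fit(smoothing_level=0.3, smoothing_trend=0.1, optimized=False)
ses_fitted, holt_fitted = ses.fittedvalues, holt.fittedvalues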
Compute seasonal averages per period k: x̄_k = (1/n_k) · Σ x_t, summing over all observations t that fall in period k.
Seasonal index: S_k = x̄_k - x̄ (additive) or S_k = x̄_k / x̄ (multiplicative), where x̄ is the overall mean.
Seasonal adjustment: subtract the seasonal index (additive, x_t - S_k) or divide by it (multiplicative, x_t / S_k) to obtain the seasonally adjusted series.
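An illustrative computation of additive and multiplicative seasonal indices with pandas; the synthetic monthly series is an assumption for demonstration:

import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-31", periods=60, freq="M")
x = pd.Series(200 + 20 * np.sin(2 * np.pi * idx.month / 12) + np.random.randn(60), index=idx)

seasonal_avg = x.groupby(x.index.month).mean()        # average per period k (month)
additive_idx = seasonal_avg - x.mean()                # additive seasonal index
multiplicative_idx = seasonal_avg / x.mean()          # multiplicative seasonal index

# Seasonally adjusted series: subtract (additive) or divide (multiplicative)
adj_add = x - additive_idx.reindex(x.index.month).values
adj_mult = x / multiplicative_idx.reindex(x.index.month).values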
A weakly stationary series has a constant mean and an autocovariance that depends only on the lag. Common checks include visual inspection, ACF/PACF plots, and statistical tests (ADF, KPSS).
Common transformations: first differencing (x_t - x_{t-1}), seasonal differencing (x_t - x_{t-s}), and log or Box-Cox transformations to stabilize variance.
Example: log-differencing is often used for economic series with growing variance: y_t = log(x_t) - log(x_{t-1}), which approximates the period-over-period growth rate.
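A short sketch of the log-difference transform and both stationarity tests in statsmodels; the simulated exponential-growth series is illustrative:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

x = pd.Series(np.exp(np.cumsum(0.01 + 0.02 * np.random.randn(300))))  # growing, nonstationary

y = np.log(x).diff().dropna()                       # log-difference: approximate growth rate

adf_p = adfuller(y)[1]                              # ADF: H0 = unit root (nonstationary)
kpss_p = kpss(y, regression="c", nlags="auto")[1]   # KPSS: H0 = stationary
print(f"ADF p-value: {adf_p:.3f}, KPSS p-value: {kpss_p:.3f}")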
Detect observations that deviate significantly from the mean using the z-score: z_t = (x_t - x̄) / s, where x̄ is the sample mean and s the sample standard deviation.
Typically flag observations with |z_t| > 2 or |z_t| > 3 as outliers.
Based on quartiles and robust to extreme values: compute IQR = Q3 - Q1 and flag observations outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR].
More robust alternative using the median absolute deviation: the modified z-score M_t = 0.6745 · (x_t - median(x)) / MAD,
where MAD = median(|x_t - median(x)|); observations with |M_t| > 3.5 are commonly flagged.
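A compact sketch of the three detection rules above; the synthetic series with one injected spike is an assumption for illustration:

import numpy as np
import pandas as pd

x = pd.Series(np.random.randn(200))
x.iloc[50] = 8.0                              # inject an obvious outlier

z = (x - x.mean()) / x.std()
z_flags = z.abs() > 3                         # z-score rule

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)   # IQR rule

mad = (x - x.median()).abs().median()
modified_z = 0.6745 * (x - x.median()) / mad
mad_flags = modified_z.abs() > 3.5            # modified z-score (MAD) rule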
Additive outliers (AO) affect only a single observation: y_t = x_t + ω·I_t(T),
where I_t(T) = 1 if t = T, 0 otherwise.
Innovational outliers (IO) affect the observation and propagate through the system: y_t = x_t + ψ(B)·ω·I_t(T), where ψ(B) captures the model's lag structure.
Impact decays according to the model's dynamics
Level shifts (LS) are permanent changes in the series level: y_t = x_t + ω·S_t(T),
where S_t(T) = 1 if t ≥ T, 0 otherwise.
Use intervention analysis or state space methods to estimate clean series while preserving underlying patterns.
Use robust regression techniques or M-estimators that automatically downweight outliers during parameter estimation.
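As a hedged sketch of the intervention-analysis idea, the snippet below adds a level-shift indicator as an exogenous regressor to an AR(1) model in statsmodels' SARIMAX and then removes the estimated shift; the simulated series, shift date, and model order are all assumptions:

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

n, T = 200, 120
e = np.random.randn(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * y[t - 1] + e[t]              # AR(1) background process
y[T:] += 5.0                                  # level shift of size 5 at time T

step = pd.DataFrame({"level_shift": (np.arange(n) >= T).astype(float)})  # S_t(T)
fit = SARIMAX(pd.Series(y), exog=step, order=(1, 0, 0)).fit(disp=False)

omega = fit.params["level_shift"]             # estimated shift magnitude
y_clean = y - omega * step["level_shift"].to_numpy()   # series with intervention removed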
For each point t, fit a weighted polynomial regression using nearby points,
where the weights decrease with distance from t, e.g., tricube weights w_i = (1 - |d_i/h|³)³ for |d_i| < h and 0 otherwise,
and where h is the bandwidth parameter controlling smoothness.
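A minimal LOESS example via statsmodels' lowess; here frac plays the role of the bandwidth, and the noisy sine series is purely illustrative:

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

t = np.arange(200)
x = np.sin(t / 20) + 0.3 * np.random.randn(200)   # noisy signal

# frac = fraction of points in each local window; larger values give smoother fits
smoothed = lowess(x, t, frac=0.2, return_sorted=False)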
The HP filter solves: min over τ of Σ_t (x_t - τ_t)² + λ Σ_t ((τ_{t+1} - τ_t) - (τ_t - τ_{t-1}))²
This balances fit to the data (first term) against a smoothness penalty on the trend (second term); larger λ yields a smoother trend.
The optimal trend is: τ̂ = (I + λ·DᵀD)⁻¹ x
where D is the second-difference matrix.
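A short sketch using the HP filter implementation in statsmodels; the synthetic series and λ = 1600 (the conventional choice for quarterly data) are assumptions:

import numpy as np
import pandas as pd
from statsmodels.tsa.filters.hp_filter import hpfilter

x = pd.Series(np.cumsum(np.random.randn(120)) + np.linspace(0, 10, 120))

cycle, trend = hpfilter(x, lamb=1600)   # returns (cyclical component, smooth trend)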
Measurement equation: y_t = Z_t·α_t + ε_t
State transition: α_{t+1} = T_t·α_t + η_t
where α_t is the unobserved state (trend, cycle, etc.) and ε_t, η_t are noise terms.
Provides optimal smoothed estimates of the underlying level
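A minimal local-level example with statsmodels' UnobservedComponents, which runs the Kalman filter and smoother under the hood; the random-walk-plus-noise series is an assumption:

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.structural import UnobservedComponents

x = pd.Series(np.cumsum(np.random.randn(200)) + np.random.randn(200))  # level + noise

model = UnobservedComponents(x, level="local level")   # y_t = mu_t + eps_t, mu_{t+1} = mu_t + eta_t
res = model.fit(disp=False)
level_smoothed = res.smoothed_state[0]                 # Kalman-smoothed estimate of the level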
Tests the null hypothesis that the data are MCAR using a chi-square statistic comparing observed and expected covariance matrices under the MCAR assumption.
Jarque-Bera test statistic: JB = (n/6)·(S² + (K - 3)²/4)
where S is the sample skewness and K the sample kurtosis; under normality, JB is asymptotically χ² with 2 degrees of freedom.
ARCH-LM test for time-varying variance: regress the squared residuals e_t² on a constant and q of their own lags; the statistic n·R² is asymptotically χ² with q degrees of freedom.
Test H₀: no ARCH effects (all lag coefficients are zero); rejection indicates conditional heteroskedasticity.
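A brief sketch of both diagnostics on a residual series; the white-noise residuals and the lag choice (4) are illustrative:

import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import het_arch

resid = np.random.randn(500)                 # stand-in for model residuals

jb_stat, jb_p = stats.jarque_bera(resid)     # normality via skewness and kurtosis
lm_stat, lm_p, f_stat, f_p = het_arch(resid, nlags=4)   # ARCH-LM: H0 = no ARCH effects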
An example preprocessing pipeline in R:

# Load packages
library(forecast)   # tsclean() for outlier handling
library(imputeTS)   # na_interpolation() for missing values
library(tseries)    # adf.test() for stationarity testing
# Data preprocessing pipeline
ts_preprocess <- function(x) {
  # 1. Outlier detection and treatment
  x_clean <- tsclean(x)
  # 2. Missing value imputation
  x_imputed <- na_interpolation(x_clean)
  # 3. Stationarity testing (ADF: H0 = unit root)
  adf_test <- adf.test(x_imputed)
  # 4. Differencing if the unit-root null is not rejected
  if (adf_test$p.value > 0.05) {
    x_diff <- diff(x_imputed)
  } else {
    x_diff <- x_imputed
  }
  return(x_diff)
}

The same pipeline in Python:

import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller
from scipy import stats
def preprocess_timeseries(df, col):
    # 1. Outlier detection using IQR
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # 2. Replace outliers with NaN
    df_clean = df.copy()
    outliers = (df[col] < lower_bound) | (df[col] > upper_bound)
    df_clean.loc[outliers, col] = np.nan
    # 3. Interpolate missing values
    df_clean[col] = df_clean[col].interpolate(method='linear')
    # 4. Test stationarity and difference if needed
    adf_result = adfuller(df_clean[col].dropna())
    if adf_result[1] > 0.05:
        # Apply differencing
        df_clean[col + '_diff'] = df_clean[col].diff()
    return df_clean

Apply interpolation and model-based imputation to sample datasets.
Compute seasonal indices and produce seasonally adjusted series.
Run ADF/KPSS tests and perform differencing transforms.