
Time Series Preprocessing Fundamentals

Master comprehensive preprocessing techniques for time series data: data cleaning, smoothing, transformation, and quality assessment with rigorous mathematical foundations

6-8 hours · Intermediate Level · 7 sections
Learning Objectives
  • Identify common data quality issues in time series
  • Apply moving average and exponential smoothing
  • Perform seasonal adjustment and index calculation
  • Transform series to achieve stationarity
  • Prepare cleaned series for forecasting models

Missing Values & Imputation

Common Strategies
From simple interpolation to model-based imputation

Types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

  • Listwise deletion: remove rows — simple but lossy.
  • Forward/Backward fill: propagate last/next observation — works when values change slowly.
  • Linear interpolation: interpolate between neighbors — good for short gaps.
  • Seasonal interpolation: use same season in nearby periods for seasonal data.
  • Model-based imputation: use regression, Kalman smoothing or state-space models for principled estimates.

Example: For monthly retail sales with a missing December 2018, using seasonal interpolation (average of Decembers) preserves seasonal pattern better than global mean.
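
A minimal pandas sketch of these strategies on a simulated monthly series (the data and variable names here are illustrative, not from a real dataset):

import numpy as np
import pandas as pd

# Simulated monthly sales with one missing December (illustrative only)
idx = pd.date_range("2016-01-01", periods=48, freq="MS")
rng = np.random.default_rng(0)
sales = pd.Series(100 + 20 * np.sin(2 * np.pi * (idx.month - 1) / 12)
                  + rng.normal(0, 3, 48), index=idx)
sales.loc["2018-12-01"] = np.nan

ffill = sales.ffill()                         # forward fill
linear = sales.interpolate(method="linear")   # linear interpolation

# Seasonal imputation: fill each gap with the mean of the same calendar month
month_means = sales.groupby(sales.index.month).transform("mean")
seasonal = sales.fillna(month_means)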

1. Smoothing & Trend Extraction

Moving Average
n-period centered and trailing averages

n-period simple moving average (SMA):

\bar{X}_t = \frac{1}{n} \sum_{i=0}^{n-1} X_{t-i}

A centered moving average aligns the window around t; for even n the center falls between two observations, so a second 2-period average (a 2×n moving average) is applied to re-center the result.

Exponential Smoothing
Simple and double (Holt) exponential smoothing

Simple exponential smoothing:

\hat{X}_t = \alpha X_t + (1-\alpha) \hat{X}_{t-1}

Holt's (two-parameter) smoothing for trend:

\begin{aligned} \hat{X}_t &= \alpha X_t + (1-\alpha)(\hat{X}_{t-1} + T_{t-1}) \\ T_t &= \beta(\hat{X}_t - \hat{X}_{t-1}) + (1-\beta) T_{t-1} \end{aligned}

Choose α,β by cross-validation or grid search; higher α weights recent observations more.
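
As a sketch, both smoothers are available off the shelf: pandas for moving averages and statsmodels for exponential smoothing. Here the smoothing parameters are left for statsmodels to estimate by maximum likelihood rather than fixed by hand:

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, Holt

rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(0.2, 1.0, 120)))   # simulated trending series

sma_trailing = y.rolling(window=12).mean()              # trailing 12-period SMA
sma_centered = y.rolling(window=12, center=True).mean() # centered variant

ses = SimpleExpSmoothing(y).fit()   # alpha estimated by maximum likelihood
holt = Holt(y).fit()                # alpha and beta estimated jointly
smoothed = holt.fittedvalues        # one-step-ahead smoothed values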

Example visualization: monthly sales series (trend + seasonality + noise) with a moving-average trend overlay.

2. Seasonal Adjustment & Indices

Seasonal Indices
Estimate and remove seasonal component

Compute seasonal averages per period k:

\bar{X}_k = \frac{1}{n} \sum_{i=1}^n X_{ik}

Seasonal index: I_k = \bar{X}_k - \bar{X} (additive) or I_k = \bar{X}_k / \bar{X} (multiplicative), where \bar{X} is the overall mean.

Seasonal adjustment: subtract or divide by seasonal index to obtain seasonally adjusted series.
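
A compact pandas sketch of additive seasonal adjustment for a simulated monthly series (for multiplicative seasonality, divide by the index instead of subtracting):

import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", periods=60, freq="MS")
rng = np.random.default_rng(0)
y = pd.Series(100 + 10 * np.sin(2 * np.pi * (idx.month - 1) / 12)
              + rng.normal(0, 2, 60), index=idx)

# Additive seasonal index: per-month mean minus the overall mean
month_mean = y.groupby(y.index.month).transform("mean")
seasonal_index = month_mean - y.mean()

adjusted = y - seasonal_index   # seasonally adjusted series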

3. Stationarity & Transformations

Why Stationarity Matters
Most stochastic models assume stationarity

A weakly stationary series has constant mean and autocovariance that depends only on lag. Common checks include visual inspection, ACF/PACF and statistical tests (ADF, KPSS).

Common transformations:

  • Differencing: Y_t = X_t - X_{t-1} to remove a linear trend.
  • Seasonal differencing: Y_t = X_t - X_{t-s} to remove seasonal effects (s = period).
  • Log transform: stabilizes variance under multiplicative seasonality.

Example: log-differencing is often used for economic series with growing variance: Δ log X_t = log X_t - log X_{t-1}.
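
These transformations are one-liners in pandas; a sketch on a simulated series with growing variance:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = pd.Series(np.exp(np.linspace(0, 2, 120)) * (1 + rng.normal(0, 0.05, 120)))

first_diff = y.diff()         # Y_t = X_t - X_{t-1}, removes a linear trend
seasonal_diff = y.diff(12)    # Y_t = X_t - X_{t-s}, s = 12 for monthly data
log_diff = np.log(y).diff()   # Δ log X_t, roughly the period-on-period growth rate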

4. Outlier Detection and Treatment

Statistical Outlier Detection Methods
Identify anomalous observations that may distort analysis and forecasting

Z-Score Method

Detect observations that deviate significantly from the mean:

Z_t = \frac{X_t - \bar{X}}{s_X}

Typically flag observations with |Z_t| > 2.5 or |Z_t| > 3 as outliers.

Interquartile Range (IQR) Method

Based on quartiles, robust to extreme values:

IQR = Q_3 - Q_1
Outlier if X_t < Q_1 - 1.5 × IQR or X_t > Q_3 + 1.5 × IQR

Modified Z-Score (Median-based)

More robust alternative using median absolute deviation:

M_t = \frac{0.6745(X_t - \text{median}(X))}{\text{MAD}}

where MAD = median(|X_i - median(X)|)
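
A small NumPy sketch of this detector; the 3.5 cutoff is a common convention, not a fixed rule:

import numpy as np

def modified_zscore_outliers(x, threshold=3.5):
    """Flag outliers using the median absolute deviation (MAD)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    m = 0.6745 * (x - med) / mad   # assumes mad > 0 (non-degenerate data)
    return np.abs(m) > threshold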

Time Series Specific Outlier Types

Additive Outliers (AO)

Affect only a single observation:

X_t = Y_t + \omega I_t(\tau)

where I_t(τ) = 1 if t = τ, 0 otherwise

Innovational Outliers (IO)

Affect the observation and propagate through the system:

X_t = \frac{\theta(B)}{\phi(B)}(a_t + \omega I_t(\tau))

Impact decays according to the model's dynamics

Level Shifts (LS)

Permanent changes in the series level:

X_t = Y_t + \omega S_t(\tau)

where S_t(τ) = 1 if t ≥ τ, 0 otherwise

Outlier Treatment Strategies

1. Replacement Methods

  • Linear interpolation: X_t^* = (X_{t-1} + X_{t+1}) / 2
  • Median substitution: Replace with local median
  • Trend-based replacement: Use fitted trend value
  • Seasonal adjustment: Replace with seasonal expectation

2. Model-Based Adjustment

Use intervention analysis or state space methods to estimate clean series while preserving underlying patterns.

3. Robust Methods

Use robust regression techniques or M-estimators that automatically downweight outliers during parameter estimation.
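
As one illustration of the robust approach, scikit-learn's HuberRegressor fits a trend while automatically downweighting large residuals. This is a sketch on simulated data, not a full intervention analysis:

import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(1)
t = np.arange(100).reshape(-1, 1)
y = 0.5 * t.ravel() + rng.normal(0, 1, 100)
y[40] += 15   # inject an additive outlier

huber = HuberRegressor().fit(t, y)   # Huber loss downweights the outlier
robust_trend = huber.predict(t)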

5. Advanced Smoothing Techniques

Local Regression (LOESS/LOWESS)
Non-parametric smoothing method that fits local weighted regression

Mathematical Foundation

For each point t, fit a weighted polynomial regression using nearby points:

\hat{\beta}(t) = \arg\min_{\beta} \sum_{i} w_i(t) \left(X_i - \beta_0 - \beta_1 i - \beta_2 i^2 - \ldots\right)^2

where the weights w_i(t) decrease with distance from t, and the smoothed value \hat{X}_t is the fitted polynomial evaluated at t

Tricube Weight Function

w_i(t) = \begin{cases} \left(1 - \left|\frac{i-t}{h}\right|^3\right)^3 & \text{if } |i-t| < h \\ 0 & \text{otherwise} \end{cases}

where h is the bandwidth parameter controlling smoothness

Bandwidth Selection

  • Cross-validation: Minimize prediction error
  • AIC/BIC criteria: Balance fit and complexity
  • Visual inspection: Choose based on desired smoothness
  • Rule of thumb: h ≈ 0.1n to 0.3n
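
A minimal sketch using the lowess implementation in statsmodels, where the frac argument plays the role of the bandwidth expressed as a fraction of the sample size:

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
t = np.arange(200, dtype=float)
y = np.sin(t / 20) + rng.normal(0, 0.2, t.size)

# frac = 0.2 means each local fit uses 20% of the points (h ≈ 0.2n)
smoothed = lowess(y, t, frac=0.2, return_sorted=False)
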
Hodrick-Prescott (HP) Filter

Optimization Problem

The HP filter solves:

\min_{\tau} \left[ \sum_{t=1}^T (X_t - \tau_t)^2 + \lambda \sum_{t=2}^{T-1} \left[ (\tau_{t+1} - \tau_t) - (\tau_t - \tau_{t-1}) \right]^2 \right]

Balances fit to data (first term) vs. smoothness penalty (second term)

Common λ Values

  • Annual data: λ = 100
  • Quarterly data: λ = 1600
  • Monthly data: λ = 14400
  • Higher λ produces smoother trends

Matrix Solution

The optimal trend is: \hat{\tau} = (I + \lambda K'K)^{-1} X

where K is the second-difference matrix
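
statsmodels implements this solution directly; a sketch on a simulated quarterly series using the conventional quarterly λ:

import numpy as np
from statsmodels.tsa.filters.hp_filter import hpfilter

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.1, 1.0, 120))   # simulated quarterly series

cycle, trend = hpfilter(y, lamb=1600)      # lamb = 1600 for quarterly data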

Kalman Filter Smoothing

State Space Formulation

Measurement equation:

X_t = H \alpha_t + \varepsilon_t

State transition:

\alpha_{t+1} = F \alpha_t + \eta_t

where α_t is the unobserved state (trend, cycle, etc.)

Local Level Model Example

X_t = \mu_t + \varepsilon_t
\mu_{t+1} = \mu_t + \eta_t

Provides optimal smoothed estimates of the underlying level μ_t
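
The local level model can be fit with statsmodels' UnobservedComponents, which runs the Kalman filter and smoother internally; a sketch on simulated data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
mu = np.cumsum(rng.normal(0, 0.5, 200))   # random-walk level μ_t
y = mu + rng.normal(0, 1.0, 200)          # noisy observations X_t

mod = sm.tsa.UnobservedComponents(y, level='local level')
res = mod.fit(disp=False)
smoothed_level = res.smoothed_state[0]    # Kalman-smoothed estimate of μ_t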

6. Data Quality Assessment

Completeness and Consistency Checks

Missing Data Patterns

  • Missing Completely at Random (MCAR): Missing values unrelated to observed or unobserved data
  • Missing at Random (MAR): Missing values depend on observed data only
  • Missing Not at Random (MNAR): Missing values depend on unobserved data

Little's MCAR Test

Tests null hypothesis that data are MCAR using chi-square statistic comparing observed and expected covariance matrices under MCAR assumption.

Temporal Consistency

  • Check for proper time ordering and regular intervals
  • Verify absence of duplicate timestamps
  • Validate that seasonal patterns align with calendar effects
  • Ensure units and scales are consistent across time

Distributional Properties

Normality Testing

Jarque-Bera test statistic:

JB = \frac{n}{6} \left[ S^2 + \frac{(K-3)^2}{4} \right]

where S is skewness and K is kurtosis

Heteroskedasticity Detection

ARCH-LM test for time-varying variance:

\varepsilon_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \ldots + \alpha_q \varepsilon_{t-q}^2 + u_t

Test H_0: α_1 = … = α_q = 0
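
Both diagnostics are available off the shelf; a sketch on stand-in residuals (for Gaussian white noise, both p-values should be large):

import numpy as np
from scipy.stats import jarque_bera
from statsmodels.stats.diagnostic import het_arch

rng = np.random.default_rng(0)
resid = rng.normal(0, 1, 500)   # stand-in for model residuals

jb_stat, jb_pvalue = jarque_bera(resid)                          # normality
lm_stat, lm_pvalue, f_stat, f_pvalue = het_arch(resid, nlags=4)  # ARCH effects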

Stability Assessment

  • CUSUM test: Detect structural breaks in mean
  • Recursive residuals: Monitor parameter stability
  • Rolling window statistics: Track changing moments over time

7. Practical Implementation Guidelines

Preprocessing Workflow

Step-by-Step Process

  1. Data Import and Validation
    • Verify time index format and frequency
    • Check for missing timestamps or irregular spacing
    • Validate data types and ranges
  2. Exploratory Data Analysis
    • Plot the time series and identify patterns
    • Calculate descriptive statistics
    • Examine the autocorrelation structure
  3. Missing Value Treatment
    • Assess the missing data mechanism
    • Choose an appropriate imputation method
    • Validate imputation quality
  4. Outlier Detection and Treatment
    • Apply multiple detection methods
    • Investigate outlier causes
    • Decide on a treatment strategy
  5. Transformation and Smoothing
    • Apply variance-stabilizing transformations
    • Smooth if necessary for trend extraction
    • Document all transformations
  6. Stationarity Assessment
    • Perform unit root tests
    • Apply differencing if needed
    • Verify that stationarity has been achieved
  7. Final Validation
    • Review the properties of the processed series
    • Document all preprocessing steps
    • Prepare for the modeling phase

Best Practices and Common Pitfalls

✅ Best Practices

  • Always plot your data before and after preprocessing
  • Document all transformations for reproducibility
  • Validate preprocessing on out-of-sample data
  • Use domain knowledge to guide outlier treatment
  • Preserve original data alongside processed versions
  • Consider multiple imputation for uncertainty quantification
  • Test stationarity assumptions thoroughly

❌ Common Pitfalls

  • Over-smoothing and removing important signal
  • Applying transformations without checking assumptions
  • Ignoring the temporal dependence in missing value patterns
  • Using future information in preprocessing (look-ahead bias)
  • Inadequate treatment of structural breaks
  • Blindly removing all outliers without investigation
  • Failing to account for preprocessing uncertainty in forecasts

Computational Considerations

  • Use vectorized operations for large time series
  • Consider memory-efficient algorithms for long series
  • Implement robust numerical methods for matrix operations
  • Parallelize independent preprocessing steps
  • Cache intermediate results for iterative analysis

Software Implementation Examples

R Implementation Workflow

# Load packages
library(forecast)   # tsclean() for outlier handling
library(imputeTS)   # na_interpolation() for gap filling
library(tseries)    # adf.test() for unit root testing

# Data preprocessing pipeline
ts_preprocess <- function(x) {
  # 1. Outlier detection and treatment
  x_clean <- tsclean(x)

  # 2. Missing value imputation
  x_imputed <- na_interpolation(x_clean)

  # 3. Stationarity testing (ADF: H0 = unit root)
  adf_test <- adf.test(x_imputed)

  # 4. Difference if the unit root is not rejected
  if (adf_test$p.value > 0.05) {
    x_diff <- diff(x_imputed)
  } else {
    x_diff <- x_imputed
  }

  return(x_diff)
}

Python Implementation

import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller

def preprocess_timeseries(df, col):
    # 1. Outlier detection using the IQR rule
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # 2. Mask outliers as missing
    df_clean = df.copy()
    outliers = (df[col] < lower_bound) | (df[col] > upper_bound)
    df_clean.loc[outliers, col] = np.nan

    # 3. Interpolate missing values
    df_clean[col] = df_clean[col].interpolate(method='linear')

    # 4. Test stationarity (ADF: H0 = unit root)
    adf_result = adfuller(df_clean[col].dropna())

    if adf_result[1] > 0.05:
        # Unit root not rejected: apply first differencing
        df_clean[col + '_diff'] = df_clean[col].diff()

    return df_clean

Next Steps

Practice: Imputation

Apply interpolation and model-based imputation to sample datasets.

Practice: Deseasonalize

Compute seasonal indices and produce seasonally adjusted series.

Practice: Stationarity

Run ADF/KPSS tests and perform differencing transforms.