Master comprehensive preprocessing techniques for time series data: data cleaning, smoothing, transformation, and quality assessment, with rigorous mathematical foundations.
Missing values fall into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
Example: for monthly retail sales with a missing December 2018 value, seasonal interpolation (the average of the other Decembers) preserves the seasonal pattern better than imputing the global mean.
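A minimal pandas sketch of this comparison; the synthetic sales series and its dates are illustrative assumptions, not part of the original example:

import numpy as np
import pandas as pd

# Hypothetical monthly retail sales with a missing December 2018
idx = pd.date_range("2015-01-31", "2019-12-31", freq="M")
sales = pd.Series(100 + 10 * np.sin(2 * np.pi * idx.month / 12), index=idx)
sales.loc["2018-12-31"] = np.nan

# Seasonal imputation: fill the gap with the average of the other Decembers
month_means = sales.groupby(sales.index.month).transform("mean")
seasonal_fill = sales.fillna(month_means)

# Global-mean imputation for comparison (flattens the seasonal peak)
global_fill = sales.fillna(sales.mean())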
n-period simple moving average (SMA): SMA_t = (x_t + x_{t-1} + ... + x_{t-n+1}) / n
A centered moving average aligns the window around t; for even n, average two adjacent windows (a 2×n moving average) to re-center the result.
Simple exponential smoothing: s_t = α·x_t + (1 - α)·s_{t-1}, with 0 < α ≤ 1.
Holt's (two-parameter) smoothing for trend:
ℓ_t = α·x_t + (1 - α)(ℓ_{t-1} + b_{t-1})
b_t = β(ℓ_t - ℓ_{t-1}) + (1 - β)·b_{t-1}
Choose α and β by cross-validation or grid search; a higher α weights recent observations more heavily.
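A brief sketch of these smoothers with pandas and statsmodels; the synthetic series and the parameter values α = 0.3, β = 0.1 are illustrative, not recommendations:

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, Holt

x = pd.Series(np.cumsum(np.random.randn(120)) + 50)   # synthetic example series

sma = x.rolling(window=12).mean()                      # 12-period simple moving average
cma = x.rolling(window=12, center=True).mean()         # centered moving average

ses = SimpleExpSmoothing(x).fit(smoothing_level=0.3, optimized=False)
holt = Holt(x).fit(smoothing_level=0.3, smoothing_trend=0.1, optimized=False)
ses_fitted, holt_fitted = ses.fittedvalues, holt.fittedvalues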
Compute seasonal averages per period k: x̄_k = (1/n_k) · Σ x_t, summing over all observations t that fall in period k.
Seasonal index: S_k = x̄_k - x̄ (additive) or S_k = x̄_k / x̄ (multiplicative), where x̄ is the overall mean.
Seasonal adjustment: subtract the seasonal index (additive, x_t - S_k) or divide by it (multiplicative, x_t / S_k) to obtain the seasonally adjusted series.
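An illustrative computation of additive and multiplicative seasonal indices with pandas; the synthetic monthly series is an assumption for demonstration:

import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-31", periods=60, freq="M")
x = pd.Series(200 + 20 * np.sin(2 * np.pi * idx.month / 12) + np.random.randn(60), index=idx)

seasonal_avg = x.groupby(x.index.month).mean()        # average per period k (month)
additive_idx = seasonal_avg - x.mean()                # additive seasonal index
multiplicative_idx = seasonal_avg / x.mean()          # multiplicative seasonal index

# Seasonally adjusted series: subtract (additive) or divide (multiplicative)
adj_add = x - additive_idx.reindex(x.index.month).values
adj_mult = x / multiplicative_idx.reindex(x.index.month).values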
A weakly stationary series has a constant mean and an autocovariance that depends only on the lag. Common checks include visual inspection, ACF/PACF plots, and statistical tests (ADF, KPSS).
Common transformations: first differencing (x_t - x_{t-1}), seasonal differencing (x_t - x_{t-s}), and log or Box-Cox transformations to stabilize variance.
Example: log-differencing is often used for economic series with growing variance: y_t = log(x_t) - log(x_{t-1}), which approximates the period-over-period growth rate.
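A short sketch of the log-difference transform and both stationarity tests in statsmodels; the simulated exponential-growth series is illustrative:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

x = pd.Series(np.exp(np.cumsum(0.01 + 0.02 * np.random.randn(300))))  # growing, nonstationary

y = np.log(x).diff().dropna()                       # log-difference: approximate growth rate

adf_p = adfuller(y)[1]                              # ADF: H0 = unit root (nonstationary)
kpss_p = kpss(y, regression="c", nlags="auto")[1]   # KPSS: H0 = stationary
print(f"ADF p-value: {adf_p:.3f}, KPSS p-value: {kpss_p:.3f}")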
Detect observations that deviate significantly from the mean using the z-score: z_t = (x_t - x̄) / s, where x̄ is the sample mean and s the sample standard deviation.
Typically flag observations with |z_t| > 2 or |z_t| > 3 as outliers.
Based on quartiles and robust to extreme values: compute IQR = Q3 - Q1 and flag observations outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR].
More robust alternative using the median absolute deviation: the modified z-score M_t = 0.6745 · (x_t - median(x)) / MAD,
where MAD = median(|x_t - median(x)|); observations with |M_t| > 3.5 are commonly flagged.
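A compact sketch of the three detection rules above; the synthetic series with one injected spike is an assumption for illustration:

import numpy as np
import pandas as pd

x = pd.Series(np.random.randn(200))
x.iloc[50] = 8.0                              # inject an obvious outlier

z = (x - x.mean()) / x.std()
z_flags = z.abs() > 3                         # z-score rule

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)   # IQR rule

mad = (x - x.median()).abs().median()
modified_z = 0.6745 * (x - x.median()) / mad
mad_flags = modified_z.abs() > 3.5            # modified z-score (MAD) rule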
Additive outliers (AO) affect only a single observation: y_t = x_t + ω·I_t(T),
where I_t(T) = 1 if t = T, 0 otherwise.
Innovational outliers (IO) affect the observation and propagate through the system: y_t = x_t + ψ(B)·ω·I_t(T), where ψ(B) captures the model's lag structure.
Impact decays according to the model's dynamics
Level shifts (LS) are permanent changes in the series level: y_t = x_t + ω·S_t(T),
where S_t(T) = 1 if t ≥ T, 0 otherwise.
Use intervention analysis or state space methods to estimate clean series while preserving underlying patterns.
Use robust regression techniques or M-estimators that automatically downweight outliers during parameter estimation.
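As a hedged sketch of the intervention-analysis idea, the snippet below adds a level-shift indicator as an exogenous regressor to an AR(1) model in statsmodels' SARIMAX and then removes the estimated shift; the simulated series, shift date, and model order are all assumptions:

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

n, T = 200, 120
e = np.random.randn(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * y[t - 1] + e[t]              # AR(1) background process
y[T:] += 5.0                                  # level shift of size 5 at time T

step = pd.DataFrame({"level_shift": (np.arange(n) >= T).astype(float)})  # S_t(T)
fit = SARIMAX(pd.Series(y), exog=step, order=(1, 0, 0)).fit(disp=False)

omega = fit.params["level_shift"]             # estimated shift magnitude
y_clean = y - omega * step["level_shift"].to_numpy()   # series with intervention removed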
For each point t, fit a weighted polynomial regression using nearby points,
where the weights decrease with distance from t, e.g., tricube weights w_i = (1 - |d_i/h|³)³ for |d_i| < h and 0 otherwise,
and where h is the bandwidth parameter controlling smoothness.
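A minimal LOESS example via statsmodels' lowess; here frac plays the role of the bandwidth, and the noisy sine series is purely illustrative:

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

t = np.arange(200)
x = np.sin(t / 20) + 0.3 * np.random.randn(200)   # noisy signal

# frac = fraction of points in each local window; larger values give smoother fits
smoothed = lowess(x, t, frac=0.2, return_sorted=False)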
The HP filter solves: min over τ of Σ_t (x_t - τ_t)² + λ Σ_t ((τ_{t+1} - τ_t) - (τ_t - τ_{t-1}))²
This balances fit to the data (first term) against a smoothness penalty on the trend (second term); larger λ yields a smoother trend.
The optimal trend is: τ̂ = (I + λ·DᵀD)⁻¹ x
where D is the second-difference matrix.
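A short sketch using the HP filter implementation in statsmodels; the synthetic series and λ = 1600 (the conventional choice for quarterly data) are assumptions:

import numpy as np
import pandas as pd
from statsmodels.tsa.filters.hp_filter import hpfilter

x = pd.Series(np.cumsum(np.random.randn(120)) + np.linspace(0, 10, 120))

cycle, trend = hpfilter(x, lamb=1600)   # returns (cyclical component, smooth trend)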
Measurement equation: y_t = Z_t·α_t + ε_t
State transition: α_{t+1} = T_t·α_t + η_t
where α_t is the unobserved state (trend, cycle, etc.) and ε_t, η_t are noise terms.
Provides optimal smoothed estimates of the underlying level
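A minimal local-level example with statsmodels' UnobservedComponents, which runs the Kalman filter and smoother under the hood; the random-walk-plus-noise series is an assumption:

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.structural import UnobservedComponents

x = pd.Series(np.cumsum(np.random.randn(200)) + np.random.randn(200))  # level + noise

model = UnobservedComponents(x, level="local level")   # y_t = mu_t + eps_t, mu_{t+1} = mu_t + eta_t
res = model.fit(disp=False)
level_smoothed = res.smoothed_state[0]                 # Kalman-smoothed estimate of the level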
Tests the null hypothesis that the data are MCAR using a chi-square statistic comparing observed and expected covariance matrices under the MCAR assumption.
Jarque-Bera test statistic: JB = (n/6)·(S² + (K - 3)²/4)
where S is the sample skewness and K the sample kurtosis; under normality, JB is asymptotically χ² with 2 degrees of freedom.
ARCH-LM test for time-varying variance: regress the squared residuals e_t² on a constant and q of their own lags; the statistic n·R² is asymptotically χ² with q degrees of freedom.
Test H₀: no ARCH effects (all lag coefficients are zero); rejection indicates conditional heteroskedasticity.
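A brief sketch of both diagnostics on a residual series; the white-noise residuals and the lag choice (4) are illustrative:

import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import het_arch

resid = np.random.randn(500)                 # stand-in for model residuals

jb_stat, jb_p = stats.jarque_bera(resid)     # normality via skewness and kurtosis
lm_stat, lm_p, f_stat, f_p = het_arch(resid, nlags=4)   # ARCH-LM: H0 = no ARCH effects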
An example preprocessing pipeline in R:

# Load packages
library(forecast)   # tsclean() for outlier handling
library(imputeTS)   # na_interpolation() for missing values
library(tseries)    # adf.test() for stationarity testing
# Data preprocessing pipeline
ts_preprocess <- function(x) {
  # 1. Outlier detection and treatment
  x_clean <- tsclean(x)
  # 2. Missing value imputation
  x_imputed <- na_interpolation(x_clean)
  # 3. Stationarity testing (ADF: H0 = unit root)
  adf_test <- adf.test(x_imputed)
  # 4. Differencing if the unit-root null is not rejected
  if (adf_test$p.value > 0.05) {
    x_diff <- diff(x_imputed)
  } else {
    x_diff <- x_imputed
  }
  return(x_diff)
}

The same pipeline in Python:

import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller
from scipy import stats
def preprocess_timeseries(df, col):
    # 1. Outlier detection using IQR
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # 2. Replace outliers with NaN
    df_clean = df.copy()
    outliers = (df[col] < lower_bound) | (df[col] > upper_bound)
    df_clean.loc[outliers, col] = np.nan
    # 3. Interpolate missing values
    df_clean[col] = df_clean[col].interpolate(method='linear')
    # 4. Test stationarity and difference if needed
    adf_result = adfuller(df_clean[col].dropna())
    if adf_result[1] > 0.05:
        # Apply differencing
        df_clean[col + '_diff'] = df_clean[col].diff()
    return df_clean

Apply interpolation and model-based imputation to sample datasets.
Compute seasonal indices and produce seasonally adjusted series.
Run ADF/KPSS tests and perform differencing transforms.