While ordinary least squares (OLS) regression focuses on predicting the mean of a dependent variable, quantile regression steps in to model the relationship between variables across different quantiles of the data. This article unpacks the essence of quantile regression, its unique advantages, and its implementation, providing a clear understanding of its practical significance from a business point of view.
What is Quantile Regression?
Quantile regression extends the traditional regression approach to estimate the conditional quantiles of a response variable. Instead of predicting the mean, it focuses on specific quantiles, such as the median (50th percentile) or the 25th and 75th percentiles.
- Example: In salary prediction, the median salary (50th percentile) might be more insightful than the average if extreme outliers distort the data. Similarly, quantile regression can estimate lower percentiles (e.g., 10th percentile) for risk assessments or upper percentiles (e.g., 90th percentile) for performance benchmarking.
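The salary intuition can be sketched in a few lines (the figures are made up for illustration):

```python
import numpy as np

# Hypothetical salaries in $k; one executive outlier skews the mean
salaries = np.array([50, 55, 60, 65, 500])

print(np.mean(salaries))    # 146.0 -- pulled up by the outlier
print(np.median(salaries))  # 60.0  -- the "typical" salary
```

The median stays at 60 regardless of how extreme the outlier is, which is exactly the robustness quantile regression inherits.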
Difference Between Quantile Regression and Normal Regression
Here’s how quantile regression differs from ordinary least squares (OLS) regression:
| Aspect | OLS Regression | Quantile Regression |
| --- | --- | --- |
| Focus | Predicts the conditional mean of the dependent variable. | Predicts conditional quantiles (e.g., 25th percentile, median). |
| Loss Function | Minimizes squared errors (sensitive to outliers). | Minimizes quantile-specific errors (robust to outliers). |
| Use Case | Best for homogeneous data without heavy outliers. | Effective for heteroscedastic or skewed distributions. |
| Insights Provided | General trends. | Granular insights across different quantiles. |
Why Do We Need Quantile Regression?
Quantile regression is invaluable in scenarios where a single summary statistic (like the mean) fails to capture the variability or distribution of the data. Key reasons to use it include:
Robustness to Outliers: Unlike OLS, quantile regression is less influenced by extreme values.
Granular Insights: It provides a more detailed view of the relationship between variables across the entire distribution.
Heteroscedasticity Handling: Ideal for datasets where variability changes at different levels of the dependent variable.
Diverse Use Cases: Useful in finance (risk modeling), medicine (survival analysis), and retail (understanding sales distribution).
Loss Function in Quantile Regression
The loss function in quantile regression, often called the pinball loss, is designed to capture how far predictions deviate from the actual values, but with a twist: it treats under-predictions and over-predictions differently, depending on the quantile being modeled. Let’s break it down intuitively:
Key Idea:
Under-predictions (when the predicted value is less than the actual value) are penalized differently from over-predictions (when the predicted value is more than the actual value).
The weight of this penalty depends on the quantile (τ) you’re interested in:
For the median (τ = 0.5), under-predictions and over-predictions are treated equally.
For higher quantiles (e.g., τ = 0.9), under-predictions are penalized more heavily because we care more about capturing the upper end of the data.
Example:
Imagine you’re delivering pizzas and predicting delivery times:
If you’re predicting the 90th percentile delivery time (τ = 0.9):
Late deliveries (under-predictions) are a bigger problem because customers will be upset.
The loss function penalizes late predictions more to account for this.
If you’re predicting the 10th percentile delivery time (τ = 0.1):
- Early predictions (over-predictions) are the focus because you want to ensure faster delivery times.
Formula Intuition:
For a chosen quantile τ, the pinball loss is
ρ_τ(u) = τ · u if u ≥ 0, and (τ − 1) · u if u < 0,
where u is the difference between the actual and predicted values (u = actual − predicted). An under-prediction (u ≥ 0) is weighted by τ, while an over-prediction (u < 0) is weighted by 1 − τ, which is what creates the asymmetry described above.
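The asymmetry is easy to see in code (the helper `pinball_loss` below is our own illustration, not a library function):

```python
def pinball_loss(tau, u):
    """Pinball loss for quantile tau, where u = actual - predicted."""
    return tau * u if u >= 0 else (tau - 1) * u

# At tau = 0.9, under-predicting by 2 costs 9x more than over-predicting by 2
print(pinball_loss(0.9, 2))   # 1.8 (under-prediction: penalized heavily)
print(pinball_loss(0.9, -2))  # 0.2 (over-prediction: penalized lightly)

# At tau = 0.5 (the median), both directions are penalized equally
print(pinball_loss(0.5, 2))   # 1.0
print(pinball_loss(0.5, -2))  # 1.0
```

Minimizing this loss over a dataset pushes the fitted line to the chosen quantile instead of the mean.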
Business Case Study: Delivery Time Optimization
A food delivery company wants to optimize its delivery times. They need to:
Predict the median delivery time (50th percentile) to set customer expectations.
Estimate the 90th percentile delivery time to ensure that 90% of deliveries are on time.
Methodology:
1. Simulated Data
Let’s generate data where:
Delivery time depends on distance and order size.
Random noise is added to simulate real-world variability.
2. Train Quantile Regressions
Use quantile regression for the 50th and 90th percentiles.
Understand how predictions adjust to capture different parts of the delivery distribution.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Simulating data
np.random.seed(42)
n_samples = 500
distance = np.random.uniform(1, 10, n_samples) # Distance in km
order_size = np.random.uniform(1, 5, n_samples) # Number of items
noise = np.random.normal(0, 2, n_samples) # Random variability
# Delivery time as a function of distance and order size
delivery_time = 5 + 2 * distance + 0.5 * order_size + noise
# Create a DataFrame
data = pd.DataFrame({
    'distance': distance,
    'order_size': order_size,
    'delivery_time': delivery_time
})
# Independent variables (with intercept)
X = sm.add_constant(data[['distance', 'order_size']])
y = data['delivery_time']
# Train quantile regression models
quantiles = [0.5, 0.9] # Median and 90th percentile
models = {}
predictions = {}
for q in quantiles:
    model = sm.QuantReg(y, X).fit(q=q)
    models[q] = model
    predictions[q] = model.predict(X)
    print(f"Quantile {q} Regression Summary:\n")
    print(model.summary())
# Add predictions to the DataFrame
data['pred_50'] = predictions[0.5]
data['pred_90'] = predictions[0.9]
# Visualization
# Sort by distance so the prediction lines plot smoothly instead of zigzagging
data_sorted = data.sort_values('distance')
plt.figure(figsize=(12, 6))
plt.scatter(data['distance'], data['delivery_time'], alpha=0.5, label='Actual Delivery Time')
plt.plot(data_sorted['distance'], data_sorted['pred_50'], color='blue', label='Median Prediction (50th Percentile)', linewidth=2)
plt.plot(data_sorted['distance'], data_sorted['pred_90'], color='red', label='90th Percentile Prediction', linewidth=2)
plt.xlabel('Distance (km)')
plt.ylabel('Delivery Time (minutes)')
plt.title('Quantile Regression Predictions')
plt.legend()
plt.show()
Key Insights from Results:
Median Prediction (50th Percentile):
The blue line represents the "typical" delivery time, balancing early and late deliveries equally.
90th Percentile Prediction:
The red line represents a conservative estimate, capturing the time within which 90% of deliveries will be completed.
This helps the business:
Set realistic expectations (median).
Plan resources for worst-case scenarios (90th percentile).