statsmodels

statsmodels is a Python library for exploring data, estimating statistical models, and performing statistical tests. It provides classes and functions for each of these tasks, and it aims to bridge the gap between statistical computing environments like R and the general-purpose programming capabilities of Python.

Key features and capabilities of statsmodels include:

1. Linear Models: Ordinary Least Squares (OLS), Weighted Least Squares (WLS), Generalized Least Squares (GLS), and Robust Linear Models (RLM).
2. Generalized Linear Models (GLM): Supports various distributions like Poisson, Gamma, Binomial, and Gaussian with different link functions.
3. Discrete Outcome Models: Logit, Probit, Multinomial Logit, and count models such as Poisson regression.
4. Time Series Analysis: Support for time series models including AR (Autoregressive), MA (Moving Average), ARMA (Autoregressive Moving Average), ARIMA (Autoregressive Integrated Moving Average), SARIMAX (Seasonal Autoregressive Integrated Moving Average with Exogenous Regressors), and VAR (Vector Autoregressive). GARCH (Generalized Autoregressive Conditional Heteroskedasticity) models are not included in statsmodels; they are available in the separate arch package.
5. Nonparametric Methods: Kernel density estimation, local regression.
6. Hypothesis Testing: A wide array of statistical tests, including t-tests, F-tests, Chi-squared tests, and various diagnostic tests for model assumptions.
7. Descriptive Statistics: Functions for computing summary statistics of a dataset.
8. Model Diagnostics and Results Summaries: Summary tables for fitted models report coefficients, standard errors, p-values, R-squared, AIC, and BIC; diagnostic plots are available through separate plotting functions.

statsmodels integrates well with the NumPy and pandas libraries: models accept pandas DataFrames and Series as input and return results labeled with the original column names, which makes data handling and analysis more convenient. While libraries like scikit-learn focus more on predictive machine learning models, statsmodels is geared towards statistical inference, allowing users to understand the relationships between variables, perform hypothesis testing, and interpret model parameters together with their standard errors and confidence intervals. It is widely used by economists, statisticians, and data scientists performing statistical analysis in Python.
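The inference-oriented workflow described above might look like the following sketch. The simulated DataFrame and its column names (x, irrelevant, y) are hypothetical and exist only for this example; params, pvalues, and conf_int are attributes and methods of a fitted statsmodels results object.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (illustrative only): y depends on x but not on `irrelevant`.
rng = np.random.default_rng(42)
n = 200
sim = pd.DataFrame({"x": rng.normal(size=n), "irrelevant": rng.normal(size=n)})
sim["y"] = 1.0 + 2.0 * sim["x"] + rng.normal(scale=0.5, size=n)

fit = smf.ols("y ~ x + irrelevant", data=sim).fit()

# Results are pandas objects labeled with the formula's term names.
print(fit.params)                # point estimates
print(fit.pvalues)               # p-values for H0: coefficient == 0
print(fit.conf_int(alpha=0.05))  # 95% confidence intervals
```

The point of the sketch is that each coefficient comes with a measure of uncertainty, which is the information a purely predictive workflow usually does not report.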

Example Code

import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

# 1. Create some sample data
# Let's assume we want to model 'Y' based on 'X1' and 'X2'
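# X2 must not be an exact linear function of X1: perfectly collinear
# regressors leave the individual OLS coefficients unidentified.
data = {
    'Y': [10, 12, 15, 13, 18, 20, 22, 25, 23, 28],
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'X2': [0.5, 0.9, 0.7, 1.3, 1.1, 1.9, 1.5, 1.7, 2.3, 2.1]
}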
df = pd.DataFrame(data)

# The non-formula interface ('statsmodels.api') does not add an intercept
# automatically, so a constant column has to be supplied explicitly.
# The formula interface ('statsmodels.formula.api') adds an intercept
# unless it is explicitly excluded.
# df['const'] = 1  # manual alternative to sm.add_constant (used below)

print("Sample DataFrame:\n", df)
print("\n---\n")

# 2. Perform Ordinary Least Squares (OLS) regression
# Method 1: Using the formula API (similar to R)
print("OLS Regression using Formula API:")
model_formula = smf.ols('Y ~ X1 + X2', data=df)
results_formula = model_formula.fit()
print(results_formula.summary())

print("\n---\n")

# Method 2: Using the non-formula API (requires an explicit constant and separate dependent/independent variables)
print("OLS Regression using non-Formula API:")
# Define dependent and independent variables
Y = df['Y']
X = df[['X1', 'X2']]
# Add a constant (intercept) to the independent variables
X = sm.add_constant(X)

model_api = sm.OLS(Y, X)
results_api = model_api.fit()
print(results_api.summary())
