October 13, 2024

Variance Inflation Factor (VIF) in Python

The Variance Inflation Factor (VIF) quantifies how much the variance of an estimated regression coefficient is inflated by multicollinearity, that is, by correlation between a feature and the other features in the model. For feature j it is defined as VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing feature j on all the other features.
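To make the definition concrete, here is a minimal sketch that computes the VIF by hand as 1 / (1 - R^2), where R^2 comes from regressing one feature on the remaining ones plus an intercept. It uses NumPy only; the function name vif_from_definition and the sample values are assumptions made for illustration, not part of any library.

```python
import numpy as np


def vif_from_definition(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Prepend a column of ones so the auxiliary regression has an intercept
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    # Centered R-squared of the auxiliary regression
    r_squared = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r_squared)


# Hypothetical feature matrix: column 1 tracks column 0 closely, column 2 does not
sample = np.array([
    [1, 2, 5],
    [2, 3, 9],
    [3, 5, 6],
    [4, 5, 8],
    [5, 6, 7],
])

for j in range(sample.shape[1]):
    print(f"column {j}: VIF = {vif_from_definition(sample, j):.2f}")
```

With only two features, this reduces to 1 / (1 - r^2), where r is the ordinary correlation between them.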

1. Calculating VIF using Statsmodels

The statsmodels library in Python provides a straightforward way to calculate the VIF for each feature in a regression model.

1.1. Installation

If you don’t have statsmodels installed, you can install it using pip:

pip install statsmodels
    

1.2. Example Program

Here’s how to calculate VIF for each feature in a dataset:

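import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a sample DataFrame.
# The columns must not be exact linear combinations of each other,
# otherwise R^2 is 1 and the VIF is infinite.
df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 3, 5, 5, 6],
    'X3': [5, 9, 6, 8, 7]
})

# Function to calculate VIF
def calculate_vif(dataframe):
    # variance_inflation_factor does not add an intercept itself,
    # so add a constant column to the design matrix first
    design = sm.add_constant(dataframe)
    vif_data = pd.DataFrame()
    vif_data['Feature'] = dataframe.columns
    # Column 0 of the design matrix is the constant, so feature i sits at index i + 1
    vif_data['VIF'] = [variance_inflation_factor(design.values, i + 1) for i in range(dataframe.shape[1])]
    return vif_data

# Calculate VIF for the sample data
vif_result = calculate_vif(df)
print(vif_result)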
    

2. Explanation

The calculate_vif function performs the following steps:

  • Initialization: Creates an empty DataFrame to store the VIF values.
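  • VIF Calculation: Uses the variance_inflation_factor function from statsmodels to calculate the VIF for each feature. The function takes the design matrix and the index of the column for which the VIF is being calculated. It does not add an intercept itself, so the design matrix should normally include a constant column (for example, one added with add_constant); otherwise the auxiliary regressions are forced through the origin and the VIF values can be misleading.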
  • Results: Returns a DataFrame containing each feature and its corresponding VIF value.

3. Interpretation

In general:

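  • A VIF value of 1 indicates that the feature is uncorrelated with the other features.
  • A VIF value between 1 and 5 indicates moderate correlation, which is usually not a concern.
  • A VIF value between 5 and 10 is a gray area; some practitioners already use 5 as the cut-off.
  • A VIF value above 10 indicates high correlation, which suggests problematic multicollinearity.

These thresholds are rules of thumb rather than strict limits.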
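Since VIF = 1 / (1 - R^2), each threshold corresponds to a particular R^2 of the auxiliary regression. The following sketch converts a VIF back to that R^2 and attaches a rule-of-thumb label. The function names, the default cut-offs of 5 and 10, and the label used for the band between them are assumptions for illustration, not part of statsmodels.

```python
def implied_r_squared(vif):
    """R-squared of the auxiliary regression implied by a VIF (from VIF = 1 / (1 - R^2))."""
    return 1.0 - 1.0 / vif


def describe_vif(vif, moderate_limit=5.0, high_limit=10.0):
    """Label a VIF value using rule-of-thumb cut-offs (assumed defaults: 5 and 10)."""
    if vif <= 1.0 + 1e-9:
        return "no correlation"
    if vif <= moderate_limit:
        return "moderate correlation"
    if vif <= high_limit:
        return "elevated (between the usual cut-offs)"
    return "high correlation"


for value in (1.0, 2.5, 7.0, 25.0):
    print(f"VIF={value:5.1f}  R^2={implied_r_squared(value):.2f}  {describe_vif(value)}")
```

A VIF of 5 corresponds to an R^2 of 0.8, and a VIF of 10 to an R^2 of 0.9, which is one way to judge how strict a chosen cut-off is.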

4. Conclusion

Calculating the VIF for each feature in a regression model helps identify multicollinearity, so that it can be addressed before the coefficients are interpreted. The statsmodels library provides a convenient function, variance_inflation_factor, to compute it.