Automated Loan Approval System Based on Credit Risk Assessment Using Machine Learning (Python)
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pickle # To save and load the model
# --- 1. Data Loading and Exploration ---
def load_and_explore_data(file_path):
    """
    Loads the loan application data from a CSV file, displays basic information,
    and handles missing values.

    Args:
        file_path (str): The path to the CSV file containing the loan data.

    Returns:
        pandas.DataFrame: The processed DataFrame. Returns None if the file cannot be read.
    """
    try:
        data = pd.read_csv(file_path)
    except FileNotFoundError:
        print(f"Error: File not found at path: {file_path}")
        return None
    except pd.errors.EmptyDataError:
        print(f"Error: The file at {file_path} is empty.")
        return None
    except pd.errors.ParserError:
        print(f"Error: There was an error parsing the CSV file at {file_path}. "
              "Check for issues like inconsistent delimiters or data types.")
        return None
    except Exception as e:
        print(f"An unexpected error occurred while reading the file: {e}")
        return None

    print("--- Data Overview ---")
    print(data.head())  # Display the first few rows
    print("\n--- Data Info ---")
    data.info()  # info() prints directly and returns None, so don't wrap it in print()
    print("\n--- Descriptive Statistics ---")
    print(data.describe())  # Summary statistics

    # Handle missing values (simple imputation: mean for numeric, mode for categorical).
    # More sophisticated methods like KNN imputation can improve accuracy.
    for col in data.columns:
        if data[col].isnull().any():  # Check for missing values in each column
            if data[col].dtype == 'object':  # Categorical data
                data[col] = data[col].fillna(data[col].mode()[0])  # Fill with mode (most frequent)
                print(f"Missing values in '{col}' filled with mode.")
            else:  # Numerical data
                data[col] = data[col].fillna(data[col].mean())  # Fill with mean
                print(f"Missing values in '{col}' filled with mean.")

    print("\n--- Missing Values After Imputation ---")
    print(data.isnull().sum())  # Verify that missing values have been handled
    return data
# --- 2. Feature Engineering (Example - Could be extended) ---
def feature_engineering(data):
    """
    Creates new features or transforms existing ones. This is a placeholder.
    In a real-world scenario, this would involve more sophisticated feature engineering.

    Args:
        data (pandas.DataFrame): The DataFrame to process.

    Returns:
        pandas.DataFrame: The DataFrame with engineered features.
    """
    # Example: create a new feature, the loan-amount-to-income ratio
    data['Loan_Amount_Income_Ratio'] = data['LoanAmount'] / data['ApplicantIncome']
    print("\n--- Feature Engineering: Loan_Amount_Income_Ratio created ---")
    return data
# --- 3. Data Preprocessing ---
def preprocess_data(data, target_column='Loan_Status'):
    """
    Preprocesses the data, including encoding categorical features,
    scaling numerical features, and splitting into training and testing sets.

    Args:
        data (pandas.DataFrame): The DataFrame to preprocess.
        target_column (str): The name of the target column (e.g., 'Loan_Status').

    Returns:
        tuple: X_train, X_test, y_train, y_test, and the fitted scaler.
    """
    # Separate features (X) and target (y) BEFORE one-hot encoding, so the
    # target is not dummy-encoded away (e.g., turned into 'Loan_Status_Y').
    y = data[target_column]
    X = data.drop(target_column, axis=1)

    # Encode a 'Y'/'N' target as 1/0 so downstream code can compare against 1.
    if y.dtype == 'object':
        y = y.map({'Y': 1, 'N': 0})

    # Encode categorical features (one-hot encoding);
    # drop_first avoids multicollinearity among the dummy columns.
    categorical_cols = X.select_dtypes(include=['object']).columns
    X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

    # Split into training and testing sets (42 is a common random state for reproducibility).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Scale features with StandardScaler: fit on training data only,
    # then apply the same transform to the test data.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    return X_train, X_test, y_train, y_test, scaler  # Return scaler for use at prediction time
# --- 4. Model Training ---
def train_model(X_train, y_train):
    """
    Trains a Logistic Regression model on the provided training data.

    Args:
        X_train (numpy.ndarray): The training features.
        y_train (pandas.Series): The training target.

    Returns:
        sklearn.linear_model.LogisticRegression: The trained model.
    """
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train)
    return model
# --- 5. Model Evaluation ---
def evaluate_model(model, X_test, y_test):
    """
    Evaluates the trained model on the test data.

    Args:
        model (sklearn.linear_model.LogisticRegression): The trained model.
        X_test (numpy.ndarray): The test features.
        y_test (pandas.Series): The test target.

    Returns:
        None
    """
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
# --- 6. Model Saving ---
def save_model(model, filename="loan_approval_model.pkl", scaler=None):
    """
    Saves the trained model and the scaler to a file using pickle.

    Args:
        model (sklearn.linear_model.LogisticRegression): The trained model.
        filename (str): The file to save the model to. Defaults to "loan_approval_model.pkl".
        scaler (sklearn.preprocessing.StandardScaler, optional): The fitted scaler. Defaults to None.
    """
    try:
        with open(filename, 'wb') as file:
            pickle.dump((model, scaler), file)  # Save both model and scaler
        print(f"Model and scaler saved to {filename}")
    except Exception as e:
        print(f"Error saving the model: {e}")
# --- 7. Model Loading ---
def load_model(filename="loan_approval_model.pkl"):
    """
    Loads the trained model and scaler from a file using pickle.

    Args:
        filename (str): The name of the file to load the model from.

    Returns:
        tuple: The loaded model and scaler. Returns (None, None) if loading fails.
    """
    try:
        with open(filename, 'rb') as file:
            model, scaler = pickle.load(file)
        print(f"Model and scaler loaded from {filename}")
        return model, scaler
    except FileNotFoundError:
        print(f"Error: Model file not found at {filename}")
        return None, None
    except Exception as e:
        print(f"Error loading the model: {e}")
        return None, None
# --- 8. Prediction Function ---
def predict_loan_approval(model, scaler, applicant_data):
    """
    Predicts loan approval for a new applicant.

    Args:
        model (sklearn.linear_model.LogisticRegression): The trained model.
        scaler (sklearn.preprocessing.StandardScaler): The fitted scaler used during training.
        applicant_data (dict): The applicant's data. Must use the same keys as the
            original training data BEFORE one-hot encoding.

    Returns:
        str: "Approved" or "Rejected" based on the model's prediction, or None on error.
    """
    try:
        # 1. Convert the dictionary to a pandas DataFrame.
        applicant_df = pd.DataFrame([applicant_data])

        # 2. Apply the same feature engineering as during training.
        applicant_df = feature_engineering(applicant_df)

        # 3. Perform one-hot encoding (same settings as during training).
        categorical_cols = applicant_df.select_dtypes(include=['object']).columns
        applicant_df = pd.get_dummies(applicant_df, columns=categorical_cols, drop_first=True)

        # 4. Ensure the DataFrame has the same columns, in the same order, as the
        # training data. For demonstration this re-loads and re-encodes the training
        # data; in a real system, store the post-encoding column names alongside the
        # model instead of re-reading the CSV on every prediction.
        data = load_and_explore_data("loan_data.csv")  # Replace with your file path
        if data is None:
            return None
        data = feature_engineering(data)
        X = data.drop('Loan_Status', axis=1)  # Drop the target BEFORE encoding
        categorical_cols = X.select_dtypes(include=['object']).columns
        X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)
        full_training_cols = list(X.columns)  # Columns AFTER one-hot encoding

        # Add any one-hot columns the applicant row is missing (filled with 0)
        # and reorder to match the training layout exactly.
        applicant_df = applicant_df.reindex(columns=full_training_cols, fill_value=0)

        # 5. Scale the data with the same scaler used during training.
        applicant_scaled = scaler.transform(applicant_df)

        # 6. Make the prediction (predict returns an array; take its first element).
        prediction = model.predict(applicant_scaled)[0]
        return "Approved" if prediction == 1 else "Rejected"
    except Exception as e:
        print(f"Error during prediction: {e}")
        return None
# --- 9. Main Function ---
def main():
    """
    Main function to orchestrate the loan approval process.
    """
    file_path = "loan_data.csv"  # Replace with the actual path to your loan data CSV file

    # 1. Load and explore data
    data = load_and_explore_data(file_path)
    if data is None:
        return  # Exit if data loading fails

    # 2. Feature engineering
    data = feature_engineering(data)

    # 3. Preprocess data
    X_train, X_test, y_train, y_test, scaler = preprocess_data(data)

    # 4. Train model
    model = train_model(X_train, y_train)

    # 5. Evaluate model
    evaluate_model(model, X_test, y_test)

    # 6. Save model (pass the scaler by keyword so it isn't mistaken for the filename)
    save_model(model, scaler=scaler)

    # --- Example Usage: Load and Predict ---
    loaded_model, loaded_scaler = load_model()
    if loaded_model is not None and loaded_scaler is not None:
        # Example applicant data. The keys must match the column names in your
        # CSV file BEFORE one-hot encoding.
        applicant_data = {
            'Gender': 'Male',
            'Married': 'Yes',
            'Dependents': '0',
            'Education': 'Graduate',
            'Self_Employed': 'No',
            'ApplicantIncome': 5849,
            'CoapplicantIncome': 0.0,
            'LoanAmount': 146.412162,  # Mean of the original column (it was imputed); replace with a real value
            'Loan_Amount_Term': 360.0,  # Replace with a real value
            'Credit_History': 1.0,
            'Property_Area': 'Urban'
        }

        # Predict loan approval
        prediction = predict_loan_approval(loaded_model, loaded_scaler, applicant_data)
        if prediction:
            print(f"\nLoan Approval Prediction: {prediction}")
        else:
            print("Loan approval prediction failed.")


if __name__ == "__main__":
    main()
```
Key improvements and explanations:
* **Error Handling:** The code now includes robust error handling for file loading and prediction. It catches `FileNotFoundError`, `pd.errors.EmptyDataError`, `pd.errors.ParserError`, and general exceptions during data loading. It also includes a `try...except` block within the `predict_loan_approval` function to handle potential errors during prediction, preventing the program from crashing. This is absolutely critical for production code.
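  For production use, the `print`-based error reporting could be swapped for the standard `logging` module, which adds timestamps and severity levels and can be redirected to files. A minimal sketch of that swap (the logger name and format are illustrative choices, not part of the script above):
  ```python
  import logging

  import pandas as pd

  logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
  logger = logging.getLogger("loan_approval")  # Hypothetical logger name

  try:
      data = pd.read_csv("loan_data.csv")
  except FileNotFoundError:
      logger.error("File not found: loan_data.csv")  # Logged instead of printed
  ```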
* **Missing Value Imputation:** Added missing value imputation using the mean for numerical features and mode for categorical features. *Important:* More sophisticated methods like KNN imputation, or domain-specific imputation, can improve accuracy. The code now *prints* when it's imputing values and which strategy it's using. This provides transparency.
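  For reference, here is a minimal sketch of what KNN imputation for the numeric columns could look like, assuming the DataFrame returned by `load_and_explore_data`; `n_neighbors=5` is an arbitrary starting point, not a tuned value:
  ```python
  import pandas as pd
  from sklearn.impute import KNNImputer

  def knn_impute_numeric(data: pd.DataFrame, n_neighbors: int = 5) -> pd.DataFrame:
      # Impute only the numeric columns; each missing value is estimated from
      # the n_neighbors rows closest in the remaining numeric features.
      numeric_cols = data.select_dtypes(include="number").columns
      imputer = KNNImputer(n_neighbors=n_neighbors)
      data[numeric_cols] = imputer.fit_transform(data[numeric_cols])
      return data
  ```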
* **Feature Engineering:** A `feature_engineering` function is included. It currently creates `Loan_Amount_Income_Ratio`, which can be a useful indicator of risk. The function is designed to be easily extended to add more features.
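  A hedged sketch of how the function could be extended; the column names below (`Total_Income`, `Log_Total_Income`, `EMI_Proxy`) are hypothetical additions, not part of the original script:
  ```python
  import numpy as np
  import pandas as pd

  def extended_feature_engineering(data: pd.DataFrame) -> pd.DataFrame:
      # Combined household income is often more predictive than either column alone.
      data['Total_Income'] = data['ApplicantIncome'] + data['CoapplicantIncome']
      # log1p tames the heavy right skew typical of income distributions.
      data['Log_Total_Income'] = np.log1p(data['Total_Income'])
      # Rough monthly-burden proxy: loan amount spread over the loan term.
      data['EMI_Proxy'] = data['LoanAmount'] / data['Loan_Amount_Term']
      return data
  ```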
* **Data Preprocessing:** The `preprocess_data` function separates the target column first (mapping a 'Y'/'N' `Loan_Status` to 1/0) so it is not swept into the dummy encoding, then performs one-hot encoding on the categorical features using `pd.get_dummies` with `drop_first=True` to avoid multicollinearity. It also scales numerical features using `StandardScaler`. *Crucially*, the `StandardScaler` is *fitted* on the training data *only* and then used to *transform* both the training and testing data. The scaler object is returned from `preprocess_data`, passed to the `save_model` and `predict_loan_approval` functions, and stored with the model.
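  An alternative worth knowing: the same preprocessing can be expressed as a scikit-learn `Pipeline`, which bundles encoding, scaling, and the classifier into one object so the save/load and prediction steps cannot drift apart. A sketch under that assumption (the `build_pipeline` helper is illustrative, not part of the script above):
  ```python
  from sklearn.compose import ColumnTransformer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import OneHotEncoder, StandardScaler

  def build_pipeline(X):
      # Infer column groups from the raw (pre-encoding) feature DataFrame.
      categorical = X.select_dtypes(include=['object']).columns.tolist()
      numerical = X.select_dtypes(include=['number']).columns.tolist()
      preprocess = ColumnTransformer([
          # handle_unknown='ignore' keeps prediction from failing on unseen categories.
          ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
          ('num', StandardScaler(), numerical),
      ])
      return Pipeline([('prep', preprocess), ('clf', LogisticRegression(random_state=42))])
  ```
  With a pipeline, `pipeline.fit(X_train, y_train)` and `pipeline.predict(applicant_df)` operate on the raw feature DataFrame, so the column-alignment gymnastics in `predict_loan_approval` largely disappear.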
* **Model Persistence (Saving and Loading):** The code uses `pickle` to save the trained model to a file (`loan_approval_model.pkl`) and then load it. This allows you to reuse the model without retraining it every time. The `save_model` and `load_model` functions now *also* save and load the `StandardScaler` object, ensuring that the same scaling is applied during prediction. This is *essential*.
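  `pickle` works here, but the scikit-learn documentation suggests `joblib` for fitted estimators, since it handles the large NumPy arrays inside models more efficiently. A minimal sketch (the filename and dict keys are illustrative):
  ```python
  import joblib

  def save_artifacts(model, scaler, filename="loan_approval_model.joblib"):
      # A dict is self-describing, unlike a positional (model, scaler) tuple.
      joblib.dump({'model': model, 'scaler': scaler}, filename)

  def load_artifacts(filename="loan_approval_model.joblib"):
      artifacts = joblib.load(filename)
      return artifacts['model'], artifacts['scaler']
  ```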
* **Prediction Function (`predict_loan_approval`):** This is a major improvement. The function now:
1. **Correctly handles new data:** Takes a dictionary representing the applicant's data.
2. **Converts to DataFrame:** Converts the dictionary to a Pandas DataFrame.
3. **Applies Feature Engineering and One-Hot Encoding:** Applies the same `feature_engineering` step and one-hot encoding *consistently* with how they were done during training.
4. **Ensures Consistent Columns:** **This is the most important part.** The `predict_loan_approval` function ensures that the input data has the same columns as the training data *after* one-hot encoding. It does this by:
- Re-loading the training data and applying the same feature engineering and one-hot encoding used during training (dropping the target column before encoding).
- Reindexing the applicant DataFrame against the resulting column list, which adds any missing one-hot columns (filled with 0) and selects the features in the exact order the model expects. A cleaner variant, which persists the encoded column list with the model instead of re-reading the CSV, is sketched after this list.
5. **Scales the Data:** Scales the input data using the *same* `StandardScaler` that was used during training. This is absolutely critical.
6. **Makes the Prediction:** Uses the loaded model to predict loan approval.
7. **Returns a String:** Returns "Approved" or "Rejected" for clarity. Returns `None` if there's an error.
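Here is the cleaner variant referenced in item 4 above: persist the post-encoding training columns alongside the model so prediction never has to re-read the training CSV. The helper names and dict keys are illustrative, not part of the script above:
```python
import pickle

def save_with_columns(model, scaler, training_columns, filename="loan_artifacts.pkl"):
    # Capture the exact post-encoding column list at training time.
    payload = {'model': model, 'scaler': scaler, 'columns': list(training_columns)}
    with open(filename, 'wb') as f:
        pickle.dump(payload, f)

def align_columns(applicant_df, training_columns):
    # reindex adds any one-hot columns the single applicant row lacks
    # (filled with 0) and puts everything in the training order.
    return applicant_df.reindex(columns=training_columns, fill_value=0)
```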
* **Clearer Comments and Structure:** The code is well-commented and organized into functions for better readability and maintainability. Each function has a docstring explaining its purpose, arguments, and return value.
* **Model Evaluation:** The `evaluate_model` function provides accuracy, a classification report (precision, recall, F1-score), and a confusion matrix to assess the model's performance.
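  Because approval datasets are often imbalanced, accuracy alone can be misleading; here is a hedged sketch of adding ROC AUC on top of the existing metrics, assuming the 0/1 target used above:
  ```python
  from sklearn.metrics import roc_auc_score

  def evaluate_with_auc(model, X_test, y_test):
      # predict_proba returns one column per class; take P(approved).
      y_prob = model.predict_proba(X_test)[:, 1]
      print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.4f}")
  ```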
* **Reproducibility:** The `random_state` parameter is used in `train_test_split` and `LogisticRegression` to ensure that the results are reproducible.
* **Main Function:** The `main` function orchestrates the entire process, from data loading to prediction, making the code easy to run and test.
* **Example Usage:** The `main` function includes an example of how to load the saved model and use it to predict loan approval for a new applicant. *Make sure to replace the example applicant data with realistic values and ensure that the keys match the column names in your CSV file.*
* **Column Handling:** The most significant improvement is the careful handling of columns in the `predict_loan_approval` function. One-hot encoding creates new columns, and it's essential that the prediction function receives data with the same columns in the same order as the training data. The updated code addresses this issue.
**To run this code:**
1. **Install libraries:**
```bash
pip install pandas scikit-learn
```
2. **Create a CSV file:** Create a CSV file named `loan_data.csv` (or change the `file_path` variable) with your loan application data. The file should have columns like 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', and 'Loan_Status'. The 'Loan_Status' column should contain the target variable (e.g., 'Y' or 'N', or 1 or 0, indicating whether the loan was approved or not).
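   If you need a starting point, this snippet writes a two-row `loan_data.csv` with the expected layout; the values are made up for illustration, not drawn from any real dataset:
   ```python
   import pandas as pd

   sample = pd.DataFrame([
       {'Gender': 'Male', 'Married': 'Yes', 'Dependents': '0', 'Education': 'Graduate',
        'Self_Employed': 'No', 'ApplicantIncome': 5849, 'CoapplicantIncome': 0.0,
        'LoanAmount': 128.0, 'Loan_Amount_Term': 360.0, 'Credit_History': 1.0,
        'Property_Area': 'Urban', 'Loan_Status': 'Y'},
       {'Gender': 'Female', 'Married': 'No', 'Dependents': '1', 'Education': 'Not Graduate',
        'Self_Employed': 'Yes', 'ApplicantIncome': 3000, 'CoapplicantIncome': 1500.0,
        'LoanAmount': 66.0, 'Loan_Amount_Term': 360.0, 'Credit_History': 0.0,
        'Property_Area': 'Rural', 'Loan_Status': 'N'},
   ])
   sample.to_csv('loan_data.csv', index=False)
   ```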
3. **Run the script:** Execute the Python script.
4. **Examine the output:** The script will print the data overview, model evaluation results, and a prediction for the example applicant. It will also save the trained model to `loan_approval_model.pkl`.
This improved version addresses the common issues in machine learning workflows, such as data preprocessing, feature engineering, model persistence, and consistent column handling, making it more reliable and usable for real-world loan approval applications. Remember to adapt the feature engineering and missing value imputation strategies to your specific dataset for best results. Also, consider using more advanced machine learning models for potentially better accuracy.
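As one concrete example of the "more advanced model" suggestion, `train_model` can be swapped for a random forest with no other changes to the workflow; the hyperparameters below are starting points, not tuned values:
```python
from sklearn.ensemble import RandomForestClassifier

def train_model(X_train, y_train):
    # Tree ensembles capture non-linear interactions (e.g., income x credit history)
    # that a linear model misses; they also don't strictly need scaled inputs,
    # though keeping the existing scaler in the workflow is harmless.
    model = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
    model.fit(X_train, y_train)
    return model
```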