Hospital Resource Optimization through an AI-Powered Patient Readmission Risk Predictor (Python)
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix, roc_curve)
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE # For handling imbalanced datasets
# --- 1. Data Loading and Preprocessing ---
def load_and_preprocess_data(file_path):
"""
Loads the hospital dataset, handles missing values, and prepares it for modeling.
Args:
file_path (str): Path to the CSV file containing the hospital data.
Returns:
pandas.DataFrame: Processed DataFrame ready for feature engineering and model training.
"""
try:
data = pd.read_csv(file_path)
except FileNotFoundError:
print(f"Error: File not found at {file_path}")
return None
# --- Handle Missing Values ---
# Strategy: Fill numerical features with the median and categorical features with the mode
for col in data.columns:
if data[col].isnull().any():
if pd.api.types.is_numeric_dtype(data[col]): # Check if numeric
data[col].fillna(data[col].median(), inplace=True) # Impute with median
else:
data[col].fillna(data[col].mode()[0], inplace=True) # Impute with mode
# --- Convert Categorical Features ---
# Strategy: One-hot encode categorical features
categorical_cols = data.select_dtypes(include='object').columns
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True) # drop_first avoids multicollinearity
return data
# --- 2. Feature Engineering ---
def create_features(data):
"""
Generates new features from existing data to potentially improve model performance.
Args:
data (pandas.DataFrame): Input DataFrame.
Returns:
pandas.DataFrame: DataFrame with newly engineered features.
"""
# Example 1: Length of stay squared
data['length_of_stay_squared'] = data['time_in_hospital']**2
# Example 2: Interaction term between number of medications and number of diagnoses
data['medication_diagnosis_interaction'] = data['num_medications'] * data['number_diagnoses']
# Example 3: A simplified risk score (customize this based on domain knowledge!)
data['simplified_risk_score'] = (data['num_lab_procedures'] + data['num_medications'] - data['time_in_hospital'])
return data
# --- 3. Data Splitting and Scaling ---
def split_and_scale_data(data, target_variable='readmitted'):
"""
Splits the data into training and testing sets and scales numerical features.
Args:
data (pandas.DataFrame): Input DataFrame.
target_variable (str): Name of the target variable (e.g., 'readmitted').
Returns:
tuple: X_train_scaled, X_test_scaled, y_train, y_test
"""
X = data.drop(target_variable, axis=1)
y = data[target_variable]
# --- Split Data ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # Stratify to maintain class balance
# --- Scale Numerical Features ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use the same scaler fitted on training data
return X_train_scaled, X_test_scaled, y_train, y_test
# --- 4. Model Training and Evaluation ---
def train_and_evaluate_model(X_train, y_train, X_test, y_test, model_type='logistic_regression'):
"""
Trains a specified machine learning model and evaluates its performance.
Args:
X_train (numpy.ndarray): Scaled training features.
y_train (pandas.Series): Training target variable.
X_test (numpy.ndarray): Scaled testing features.
y_test (pandas.Series): Testing target variable.
model_type (str): Type of model to train ('logistic_regression', 'random_forest', 'gradient_boosting').
Returns:
tuple: Trained model, dictionary of evaluation metrics.
"""
if model_type == 'logistic_regression':
model = LogisticRegression(random_state=42, solver='liblinear', penalty='l1', C=0.1) # Example with regularization
elif model_type == 'random_forest':
model = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=10) # Example hyperparameters
elif model_type == 'gradient_boosting':
model = GradientBoostingClassifier(random_state=42, n_estimators=100, learning_rate=0.1, max_depth=5) # Example hyperparameters
else:
raise ValueError(f"Invalid model_type: {model_type}. Choose 'logistic_regression', 'random_forest', or 'gradient_boosting'")
# --- Handle Class Imbalance ---
# Using SMOTE (Synthetic Minority Oversampling Technique)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
model.fit(X_train_resampled, y_train_resampled) # Train on resampled data
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1] # Probabilities for ROC AUC
# --- Evaluate the Model ---
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='binary')
recall = recall_score(y_test, y_pred, average='binary')
f1 = f1_score(y_test, y_pred, average='binary')
roc_auc = roc_auc_score(y_test, y_pred_proba)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Model: {model_type}")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"ROC AUC: {roc_auc:.4f}")
print("Confusion Matrix:\n", conf_matrix)
evaluation_metrics = {
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1': f1,
'roc_auc': roc_auc,
'confusion_matrix': conf_matrix
}
return model, evaluation_metrics
# --- 5. Visualization (Optional) ---
def visualize_results(model, X_test, y_test, evaluation_metrics):
"""
Visualizes model performance, including the confusion matrix and ROC curve.
Args:
model: Trained machine learning model.
X_test (numpy.ndarray): Scaled testing features.
y_test (pandas.Series): Testing target variable.
evaluation_metrics (dict): Dictionary of evaluation metrics.
"""
# --- Confusion Matrix Visualization ---
conf_matrix = evaluation_metrics['confusion_matrix']
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
xticklabels=['Not Readmitted', 'Readmitted'],
yticklabels=['Not Readmitted', 'Readmitted'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()
# --- ROC Curve Visualization ---
from sklearn.metrics import roc_curve
y_pred_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC AUC = {evaluation_metrics['roc_auc']:.2f}")
plt.plot([0, 1], [0, 1], 'k--') # Diagonal line representing random guessing
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
# --- 6. Main Execution ---
if __name__ == "__main__":
# --- 1. Data Loading and Preprocessing ---
file_path = "hospital_readmission.csv" # Replace with your actual file path
data = load_and_preprocess_data(file_path)
if data is None:
exit() # Exit if data loading failed
# --- 2. Feature Engineering ---
data = create_features(data)
# --- 3. Data Splitting and Scaling ---
X_train_scaled, X_test_scaled, y_train, y_test = split_and_scale_data(data)
# --- 4. Model Training and Evaluation ---
model, evaluation_metrics = train_and_evaluate_model(X_train_scaled, y_train, X_test_scaled, y_test, model_type='logistic_regression') # Choose 'logistic_regression', 'random_forest', or 'gradient_boosting'
#model, evaluation_metrics = train_and_evaluate_model(X_train_scaled, y_train, X_test_scaled, y_test, model_type='random_forest')
#model, evaluation_metrics = train_and_evaluate_model(X_train_scaled, y_train, X_test_scaled, y_test, model_type='gradient_boosting')
# --- 5. Visualization (Optional) ---
visualize_results(model, X_test_scaled, y_test, evaluation_metrics)
print("Program completed.")
```
Key improvements and explanations are included in the code comments, but here's a more detailed breakdown:
* **Clear Structure and Modularity:** The code is broken down into functions, making it more readable, maintainable, and testable. Each function has a specific purpose.
* **Error Handling:** Includes basic error handling for file loading to prevent the program from crashing if the file isn't found.
* **Missing Value Imputation:** Handles missing values using a strategy appropriate for both numerical and categorical features. Median imputation for numerical features is more robust to outliers than mean imputation.
* **Categorical Feature Encoding:** Uses `pd.get_dummies` for one-hot encoding, the standard way to handle categorical variables in most machine learning models. `drop_first=True` avoids the "dummy variable trap" (perfect multicollinearity), which matters chiefly for linear models such as Logistic Regression.
* **Feature Engineering:** Demonstrates how to create new features from existing ones. The provided examples are basic; *real* feature engineering requires domain expertise. Consider interactions, ratios, and polynomial features based on your understanding of hospital readmissions. The simplified risk score is a placeholder; replace it with something meaningful.
* **Data Scaling:** Uses `StandardScaler` to scale the features. This matters for scale-sensitive algorithms like Logistic Regression; tree-based models such as Random Forest and Gradient Boosting are largely insensitive to feature scaling, though scaling does them no harm. *Crucially, the scaler is fitted only on the training data and then used to transform both the training and testing data.* This prevents data leakage.
* **Data Splitting:** Splits the data into training and testing sets using `train_test_split`. `stratify=y` ensures that the class distribution (readmitted vs. not readmitted) is the same in both the training and testing sets, which is important for imbalanced datasets.
* **Model Choice:** Provides three model options: Logistic Regression, Random Forest, and Gradient Boosting. Each has its pros and cons:
    * **Logistic Regression:** Simple, interpretable, good for a baseline. May not capture complex relationships.
    * **Random Forest:** Robust, handles non-linear relationships, less prone to overfitting than individual decision trees.
    * **Gradient Boosting:** Often achieves high accuracy, but can be more sensitive to hyperparameter tuning and overfitting.
* **Hyperparameter Tuning:** Includes *example* hyperparameters for each model. **Crucially, you need to tune these hyperparameters using techniques like cross-validation to optimize performance.** The given values are just starting points.
* **Class Imbalance Handling:** Uses SMOTE (Synthetic Minority Oversampling Technique) to address class imbalance. Readmission datasets are often imbalanced (far fewer readmissions than non-readmissions), and SMOTE creates synthetic samples of the minority class to balance the training set. *It is applied only to the training data to prevent data leakage.* A `class_weight`-based alternative is sketched after this list.
* **Model Evaluation:** Calculates a comprehensive set of evaluation metrics: accuracy, precision, recall, F1-score, ROC AUC, and confusion matrix. *Precision, recall, and F1-score are particularly important for imbalanced datasets, as accuracy can be misleading.* ROC AUC is good for evaluating the model's ability to discriminate between classes.
* **Visualization:** Includes functions to visualize the confusion matrix and ROC curve, which help in understanding model performance.
* **Clear Output:** Prints the evaluation metrics and the confusion matrix.
* **`if __name__ == "__main__":` block:** This ensures that the main execution code only runs when the script is executed directly (not when it's imported as a module).
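Regarding the class-imbalance point above: SMOTE is not the only option. Many scikit-learn estimators accept `class_weight='balanced'`, which reweights the loss instead of generating synthetic rows. A minimal sketch, reusing `X_train_scaled` and `y_train` from the script above:

```python
from sklearn.linear_model import LogisticRegression

# class_weight='balanced' reweights classes inversely to their frequencies,
# so no synthetic samples are created and SMOTE can be skipped entirely.
model = LogisticRegression(random_state=42, solver='liblinear',
                           class_weight='balanced')
model.fit(X_train_scaled, y_train)  # fit on the original, un-resampled training data
```

(`RandomForestClassifier` accepts the same `class_weight` argument; `GradientBoostingClassifier` does not, so SMOTE or per-sample weights via `sample_weight` would be needed there.)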
**How to Use and Customize:**
1. **Install Libraries:**
```bash
pip install pandas scikit-learn matplotlib seaborn imbalanced-learn
```
2. **Data Preparation:**
* **Replace `"hospital_readmission.csv"`** with the actual path to your dataset. The CSV file should contain the features you want to use to predict readmission and a target variable indicating whether a patient was readmitted (e.g., `readmitted = 1` for readmitted, `readmitted = 0` for not readmitted).
* **Examine your dataset:** Use `data.head()`, `data.describe()`, and `data.info()` to understand its structure, data types, and missing values.
* **Feature Selection:** *Carefully choose the features* you want to include in your model. Remove irrelevant or redundant features. Consult with healthcare professionals to understand which features are most predictive of readmission.
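   A short exploration sketch (the file path is the same placeholder as above, and the `readmitted` target name is an assumption):

   ```python
   import pandas as pd

   data = pd.read_csv("hospital_readmission.csv")  # placeholder path

   print(data.head())       # first few rows
   data.info()              # column dtypes and non-null counts (prints directly)
   print(data.describe())   # summary statistics for numeric columns
   print(data.isnull().sum().sort_values(ascending=False).head(10))  # most-missing columns
   print(data['readmitted'].value_counts(normalize=True))  # class balance of the target
   ```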
3. **Feature Engineering (Crucial):**
   * **Customize `create_features()`:** This is where you can significantly improve model performance. Based on your domain knowledge, create new features that capture relevant information (a hedged sketch follows this list). Examples include:
       * **Comorbidity scores:** Calculate a score based on the patient's existing conditions.
       * **Medication adherence:** If you have data on medication adherence, include it as a feature.
       * **Social determinants of health:** Factors like socioeconomic status, access to transportation, and social support can affect readmission risk. Include this data if you have it.
       * **Time since last discharge:** The time elapsed since the patient's previous discharge.
   * **Ensure that your feature engineering steps are reproducible.**
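   To make the comorbidity and time-since-discharge ideas concrete, here is a hedged sketch intended to live inside `create_features()`. Every column name in it (`has_diabetes`, `admission_date`, `last_discharge_date`, and so on) is a hypothetical placeholder to map onto your own schema:

   ```python
   # Hypothetical comorbidity count: these condition flags are illustrative
   # column names, not guaranteed to exist in your dataset.
   comorbidity_flags = ['has_diabetes', 'has_chf', 'has_copd', 'has_ckd']
   present = [c for c in comorbidity_flags if c in data.columns]
   data['comorbidity_count'] = data[present].sum(axis=1) if present else 0

   # Hypothetical time-since-last-discharge feature, assuming these
   # datetime columns exist in your data.
   if {'admission_date', 'last_discharge_date'} <= set(data.columns):
       data['days_since_last_discharge'] = (
           pd.to_datetime(data['admission_date'])
           - pd.to_datetime(data['last_discharge_date'])
       ).dt.days
   ```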
4. **Model Selection and Tuning:**
   * **Choose a model:** Experiment with different model types (`logistic_regression`, `random_forest`, `gradient_boosting`).
   * **Hyperparameter Tuning:** *This is essential!* Use techniques like `GridSearchCV` or `RandomizedSearchCV` to find the best hyperparameters for your chosen model. For example:
   ```python
   from sklearn.model_selection import GridSearchCV

   param_grid = {
       'n_estimators': [50, 100, 200],
       'max_depth': [5, 10, 15],
       'learning_rate': [0.01, 0.1, 0.2],
   }
   grid_search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                              param_grid, cv=3, scoring='roc_auc')  # ROC AUC suits imbalanced data
   grid_search.fit(X_train_scaled, y_train)
   best_model = grid_search.best_estimator_
   ```
   * **Cross-validation:** Use cross-validation (e.g., 5-fold) to get a more reliable estimate of model performance; a minimal sketch follows this list.
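   A minimal cross-validation sketch, reusing `X_train_scaled` and `y_train` from the script (the `roc_auc` scoring choice matches the grid search above):

   ```python
   from sklearn.model_selection import cross_val_score
   from sklearn.linear_model import LogisticRegression

   # 5-fold cross-validated ROC AUC on the training data only,
   # leaving the held-out test set untouched.
   scores = cross_val_score(LogisticRegression(solver='liblinear', random_state=42),
                            X_train_scaled, y_train, cv=5, scoring='roc_auc')
   print(f"CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
   ```

   Note that if you use SMOTE, it should be applied inside each fold (e.g., with an `imblearn.pipeline.Pipeline`) rather than before cross-validation, to avoid leaking synthetic samples into validation folds.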
5. **Evaluation:**
   * **Focus on precision, recall, and F1-score:** These metrics are more informative than accuracy for imbalanced datasets.
   * **Consider the cost of errors:** False negatives (predicting that a patient will not be readmitted when they will be) are often more costly than false positives. Adjust the model's decision threshold or use a cost-sensitive learning approach to reduce false negatives; a threshold-adjustment sketch follows this list.
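   One way to trade false positives for fewer false negatives is to lower the decision threshold on the predicted probabilities. A minimal sketch, reusing `model`, `X_test_scaled`, and `y_test` from the script (the 0.3 threshold is purely illustrative):

   ```python
   from sklearn.metrics import precision_score, recall_score

   # predict() uses a 0.5 threshold by default; lowering it flags more
   # patients as at-risk, raising recall at the cost of precision.
   threshold = 0.3  # illustrative; tune it on a validation set, not the test set
   y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
   y_pred_low = (y_pred_proba >= threshold).astype(int)

   print("Recall:", recall_score(y_test, y_pred_low))
   print("Precision:", precision_score(y_test, y_pred_low))
   ```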
6. **Deployment:**
   * **Save your trained model:** Use `pickle` or `joblib` to save the model so you can load it later (a persistence sketch follows this list).
   * **Create an API:** Build an API that accepts patient data and returns a readmission risk score.
   * **Integrate with hospital systems:** Integrate the API into the hospital's electronic health record (EHR) system so that clinicians can access the readmission risk scores.
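   A minimal persistence sketch with `joblib` (the file names are arbitrary placeholders). The fitted `StandardScaler` must be saved alongside the model, since incoming patient data needs identical preprocessing; this assumes you modify `split_and_scale_data()` to also return the fitted `scaler`:

   ```python
   import joblib

   # Persist the trained model and the fitted scaler together.
   joblib.dump(model, "readmission_model.joblib")
   joblib.dump(scaler, "readmission_scaler.joblib")

   # Later, e.g. inside an API handler (new_patient_features is hypothetical):
   model = joblib.load("readmission_model.joblib")
   scaler = joblib.load("readmission_scaler.joblib")
   risk_scores = model.predict_proba(scaler.transform(new_patient_features))[:, 1]
   ```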
**Important Considerations:**
* **Data Quality:** The quality of your data is critical. Ensure that your data is accurate, complete, and consistent.
* **Explainability:** In healthcare, it is important for models to be explainable: clinicians need to understand why the model makes a particular prediction. Logistic Regression and some tree-based models are more interpretable than complex neural networks, and techniques like SHAP values can help explain individual predictions (a SHAP sketch follows this list).
* **Bias:** Be aware of potential biases in your data. If your data is biased, the model will learn those biases and may make unfair or discriminatory predictions. Carefully audit your data and model for bias.
* **Ethical Considerations:** Use AI responsibly and ethically. Ensure that your model is used to improve patient care and not to discriminate against certain groups.
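To make the SHAP point above concrete, here is a minimal sketch using the third-party `shap` package (`pip install shap`). It assumes a fitted tree-based model (the Random Forest or Gradient Boosting options above) and a `feature_names` list kept from before scaling; both names are placeholders:

```python
import shap  # third-party package: pip install shap

# TreeExplainer supports tree ensembles such as Random Forest
# and Gradient Boosting.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test_scaled)

# Summary plot: which features push readmission risk up or down overall.
shap.summary_plot(shap_values, X_test_scaled, feature_names=feature_names)
```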
This comprehensive guide should give you a strong foundation for building an AI-powered patient readmission risk predictor. Remember that this is an iterative process. Experiment, refine, and continuously improve your model based on new data and feedback from healthcare professionals. Good luck!