Smart AI-based Healthcare Analytics Python, AI, Big Data
👤 Sharing: AI
```python
# Program: Smart AI-based Healthcare Analytics for Patient Risk Prediction
# Description: This program uses machine learning (specifically, a Logistic Regression model)
# to predict the risk of a patient developing a specific condition (e.g., diabetes)
# based on their health data. It demonstrates a simplified example of healthcare analytics
# leveraging AI and big data principles.
# Dependencies: scikit-learn, pandas, numpy
# Installation:
# pip install scikit-learn pandas numpy
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
# 1. Data Loading and Preprocessing
# Simulate a simplified healthcare dataset (replace with your actual data source)
data = {
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 32, 38, 42, 48, 52, 58, 62, 68, 72, 28],
    'bmi': [22.5, 28.1, 31.7, 26.3, 34.9, 29.5, 33.2, 27.8, 36.5, 31.1, 24.8, 30.4, 25.6, 33.8, 28.9, 32.1, 26.7, 35.3, 29.9, 23.2],
    'blood_pressure': [120, 130, 140, 125, 150, 135, 145, 130, 160, 140, 122, 132, 127, 152, 137, 147, 132, 162, 142, 124],
    'family_history': [0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0],  # 0: No, 1: Yes
    'physical_activity': [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1],  # 1: Active, 0: Inactive
    'diabetes': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0]  # 0: No Diabetes, 1: Diabetes
}
df = pd.DataFrame(data)
print("Original Data:")
print(df.head()) # Show the first few rows of the dataset
# Separate features (X) and target (y)
X = df.drop('diabetes', axis=1) # Features: all columns except 'diabetes'
y = df['diabetes'] # Target: 'diabetes' column
# 2. Data Splitting
# Split the data into training and testing sets
# The training set is used to train the model, and the testing set is used to evaluate its performance.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # 70% training, 30% testing
print("\nTraining Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)
# 3. Feature Scaling
# Scale the features using StandardScaler. This is important for Logistic Regression
# because it can improve the convergence speed and performance of the algorithm.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # Fit and transform the training data
X_test = scaler.transform(X_test) # Transform the testing data
# 4. Model Training
# Initialize and train a Logistic Regression model
# Logistic Regression is a popular choice for binary classification problems like this.
model = LogisticRegression(random_state=42)  # random_state fixed for reproducibility
model.fit(X_train, y_train) # Train the model using the training data
# 5. Model Evaluation
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
print("\nModel Evaluation:")
print("Accuracy:", accuracy)
print("\nConfusion Matrix:\n", confusion)
print("\nClassification Report:\n", classification_rep)
# 6. Prediction on New Data (Example)
# Create a new patient's data (example)
new_patient = pd.DataFrame({
    'age': [45],
    'bmi': [30.0],
    'blood_pressure': [135],
    'family_history': [1],
    'physical_activity': [0]
})
# Scale the new patient's data using the same scaler
new_patient_scaled = scaler.transform(new_patient)
# Make a prediction for the new patient
prediction = model.predict(new_patient_scaled)[0]
print("\nPrediction for New Patient:")
if prediction == 0:
    print("The patient is predicted to be at low risk for diabetes.")
else:
    print("The patient is predicted to be at high risk for diabetes.")
```
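
The last step of the program prints a hard low/high label. Because the stated goal is risk prediction, it can help to look at the estimated probability as well. The following is a minimal sketch, not part of the original program: it uses a reduced two-feature synthetic dataset (age and BMI values taken from the sample data), and the `RISK_THRESHOLD` cutoff is a hypothetical value chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny synthetic dataset for illustration: columns are age and BMI.
X = np.array([[25, 22.5], [30, 28.1], [35, 31.7], [40, 26.3],
              [45, 34.9], [50, 29.5], [55, 33.2], [60, 27.8]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 0])  # 0: No Diabetes, 1: Diabetes

# Fit the scaler and the model on the same training data.
scaler = StandardScaler()
model = LogisticRegression(random_state=42)
model.fit(scaler.fit_transform(X), y)

# predict_proba returns one row per sample and one column per class,
# ordered as in model.classes_ (here [0, 1]).
new_patient = np.array([[45, 30.0]])
proba = model.predict_proba(scaler.transform(new_patient))[0]
risk = proba[1]  # estimated probability of class 1 (diabetes)

RISK_THRESHOLD = 0.5  # hypothetical cutoff, not a clinical value
label = "high" if risk >= RISK_THRESHOLD else "low"
print(f"Estimated risk: {risk:.2f} ({label} risk)")
```

A threshold of 0.5 reproduces what `model.predict` does for a binary problem. A different cutoff trades false positives against false negatives and would need to be chosen with domain input.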
Key points and explanations:
* **Clear Problem Definition:** The header comment states the goal explicitly: predicting a patient's risk of diabetes from basic health data.
* **Dependencies:** The header comment lists the required libraries (scikit-learn, pandas, numpy) and the `pip` command to install them.
* **Simulated Data:** The sample data is built inline as a pandas DataFrame, so the program runs out of the box without an external data source. The data is synthetic and has only 20 rows, so the 30% test split holds just 6 patients and the reported metrics are illustrative only.
* **Data Preprocessing:**
    * **Feature and Target Separation:** Separates the features (independent variables) from the target variable (the variable to be predicted).
    * **Data Splitting:** Uses `train_test_split` to divide the data into a training set, used to fit the model, and a testing set, used to evaluate it.
    * **Feature Scaling (StandardScaler):** `StandardScaler` standardizes each feature to zero mean and unit variance. This matters for scikit-learn's `LogisticRegression` because its default L2 penalty applies to all coefficients equally, so features on a large scale (such as blood pressure) would otherwise be regularized differently from features on a small scale (such as the 0/1 flags). The solver also tends to converge faster on standardized inputs. The scaler is *fit* on the *training* data only and then *applied* to both the training and *testing* data, which prevents information from the test set leaking into training.
* **Model Training:** Initializes and trains a `LogisticRegression` model. Includes `random_state` for reproducibility.
* **Model Evaluation:**
    * **Metrics:** Calculates and prints accuracy, the confusion matrix, and the classification report. The confusion matrix provides a detailed breakdown of true positives, true negatives, false positives, and false negatives. The classification report provides precision, recall, F1-score, and support.
* **Prediction on New Data:**
    * **New Patient Input:** Demonstrates how to use the trained model to make predictions on new, unseen data.
    * **Scaling New Data:** Critically, the new patient's data is *also scaled* using the *same* scaler that was fitted to the training data. This is essential for consistent predictions.
* **Comments and Explanations:** Extensive comments throughout the code explain each step in detail. This makes the code easy to understand and modify.
* **Error Handling (Optional, but Recommended in Real Applications):** Omitted here for simplicity. A real-world healthcare application would need to handle missing data, invalid inputs, and other potential issues.
* **Data Source (Placeholder):** The code clearly indicates that the simulated data should be replaced with your actual data source.
* **Clear Output:** The code prints intermediate results (the first few rows of the dataset and the shapes of the training and testing sets) to help you follow the data through the process.
* **Reproducibility:** `random_state` is passed to both `train_test_split` and `LogisticRegression`. In `train_test_split` it fixes which rows land in each set. In `LogisticRegression` it only has an effect for solvers that shuffle the data (`sag`, `saga`, `liblinear`), not for the default `lbfgs` solver, so here it is harmless but not strictly needed.
* **`axis=1` Explanation:** In `df.drop('diabetes', axis=1)`, `axis=1` specifies that a *column* is dropped rather than a row.
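
The points about scaling, data leakage, and evaluation can be combined in a single sketch. It is not part of the original program: it wraps the scaler and the model in a scikit-learn pipeline via `make_pipeline` and scores it with stratified 5-fold cross-validation, which is one common way to get a steadier estimate than a single 6-row test split. It uses three of the sample features from the program above, and the choice of 5 folds is an assumption.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sample data copied from the program above (three of the five features).
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50, 55, 60, 65, 70,
            32, 38, 42, 48, 52, 58, 62, 68, 72, 28],
    "bmi": [22.5, 28.1, 31.7, 26.3, 34.9, 29.5, 33.2, 27.8, 36.5, 31.1,
            24.8, 30.4, 25.6, 33.8, 28.9, 32.1, 26.7, 35.3, 29.9, 23.2],
    "blood_pressure": [120, 130, 140, 125, 150, 135, 145, 130, 160, 140,
                       122, 132, 127, 152, 137, 147, 132, 162, 142, 124],
    "diabetes": [0, 0, 1, 0, 1, 0, 1, 0, 1, 1,
                 0, 0, 0, 1, 0, 1, 0, 1, 1, 0],
})
X = df.drop("diabetes", axis=1)
y = df["diabetes"]

# The pipeline refits the scaler inside every fold, on that fold's
# training rows only, so no statistics leak from the held-out rows.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))

# Stratification keeps the class balance similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

With only 20 rows each fold holds 4 patients, so individual fold scores will still vary widely. The mean is a rough indication, not a performance guarantee.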
How to Run:
1. **Install Libraries:** `pip install scikit-learn pandas numpy`
2. **Save:** Save the code as a Python file (e.g., `healthcare_analytics.py`).
3. **Run:** Execute the file from your terminal: `python healthcare_analytics.py`
This example is complete and runnable, and it demonstrates a basic AI-based healthcare analytics workflow with Python, scikit-learn, pandas, and numpy. It is designed to be easy to understand and adapt. Replace the sample data with your actual healthcare dataset for real-world use, and consult healthcare professionals before making medical decisions based on its output.
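
When swapping in a real dataset, a few basic checks before modeling can catch the missing-data and invalid-input problems mentioned earlier. The sketch below is one possible approach under stated assumptions, not part of the original program: the helper `load_patient_data`, the `REQUIRED_COLUMNS` list, and the file name `patients.csv` are hypothetical, and dropping incomplete rows is just one simple policy (imputation is another).

```python
import io

import pandas as pd

# Hypothetical schema: the same column names as the sample data.
REQUIRED_COLUMNS = ["age", "bmi", "blood_pressure", "family_history",
                    "physical_activity", "diabetes"]
BINARY_COLUMNS = ["family_history", "physical_activity", "diabetes"]


def load_patient_data(source):
    """Read a CSV and run basic sanity checks before modeling."""
    df = pd.read_csv(source)

    # Fail early if an expected column is absent.
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # Keep only the expected columns and drop rows with missing values.
    df = df[REQUIRED_COLUMNS].dropna()

    # Binary flags must contain only 0 or 1.
    for col in BINARY_COLUMNS:
        if not df[col].isin([0, 1]).all():
            raise ValueError(f"Column {col!r} must contain only 0 or 1")

    return df.reset_index(drop=True)


# In-memory stand-in for a real file such as "patients.csv" (hypothetical).
sample_csv = io.StringIO(
    "age,bmi,blood_pressure,family_history,physical_activity,diabetes\n"
    "25,22.5,120,0,1,0\n"
    "35,31.7,140,1,1,1\n"
    "40,,125,0,0,0\n"  # missing BMI: this row is dropped
)
clean_df = load_patient_data(sample_csv)
print(clean_df)
```

The returned DataFrame can take the place of `df` in the main program, with the feature and target separation proceeding as before.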