Smart AI-based Healthcare Analytics (Python, AI, Big Data)

```python
# Program: Smart AI-based Healthcare Analytics for Patient Risk Prediction

# Description: This program uses machine learning (specifically, a Logistic Regression model)
# to predict the risk of a patient developing a specific condition (e.g., diabetes)
# based on their health data. It demonstrates a simplified example of healthcare analytics
# leveraging AI and big data principles.

# Dependencies:  scikit-learn, pandas, numpy

# Installation:
#   pip install scikit-learn pandas numpy

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

# 1. Data Loading and Preprocessing

# Simulate a simplified healthcare dataset (replace with your actual data source)
data = {
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 32, 38, 42, 48, 52, 58, 62, 68, 72, 28],
    'bmi': [22.5, 28.1, 31.7, 26.3, 34.9, 29.5, 33.2, 27.8, 36.5, 31.1, 24.8, 30.4, 25.6, 33.8, 28.9, 32.1, 26.7, 35.3, 29.9, 23.2],
    'blood_pressure': [120, 130, 140, 125, 150, 135, 145, 130, 160, 140, 122, 132, 127, 152, 137, 147, 132, 162, 142, 124],
    'family_history': [0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0], # 0: No, 1: Yes
    'physical_activity': [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1], # 1: Active, 0: Inactive
    'diabetes': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0]  # 0: No Diabetes, 1: Diabetes
}

df = pd.DataFrame(data)

print("Original Data:")
print(df.head())  # Show the first few rows of the dataset

# Separate features (X) and target (y)
X = df.drop('diabetes', axis=1)  # Features: all columns except 'diabetes'
y = df['diabetes']              # Target: 'diabetes' column

# 2. Data Splitting

# Split the data into training and testing sets
# The training set is used to train the model, and the testing set is used to evaluate its performance.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # 70% training, 30% testing

print("\nTraining Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)

# 3. Feature Scaling

# Scale the features using StandardScaler. This is important for Logistic Regression
# because it can improve the convergence speed and performance of the algorithm.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit and transform the training data
X_test = scaler.transform(X_test)      # Transform the testing data

# 4. Model Training

# Initialize and train a Logistic Regression model
# Logistic Regression is a popular choice for binary classification problems like this.
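# Note: random_state only affects the 'sag', 'saga', and 'liblinear' solvers; the default
# 'lbfgs' solver is deterministic, so the argument is harmless but not required here.
model = LogisticRegression(random_state=42)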
model.fit(X_train, y_train)  # Train the model using the training data

# 5. Model Evaluation

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print("\nModel Evaluation:")
print("Accuracy:", accuracy)
print("\nConfusion Matrix:\n", confusion)
print("\nClassification Report:\n", classification_rep)

# 6. Prediction on New Data (Example)

# Create a new patient's data (example)
new_patient = pd.DataFrame({
    'age': [45],
    'bmi': [30.0],
    'blood_pressure': [135],
    'family_history': [1],
    'physical_activity': [0]
})

# Scale the new patient's data using the same scaler
new_patient_scaled = scaler.transform(new_patient)

# Make a prediction for the new patient
prediction = model.predict(new_patient_scaled)[0]

print("\nPrediction for New Patient:")
if prediction == 0:
    print("The patient is predicted to be at low risk for diabetes.")
else:
    print("The patient is predicted to be at high risk for diabetes.")
```
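
The script above prints a hard 0/1 label, while its stated goal is risk prediction. A minimal sketch of reporting the estimated probability instead, using scikit-learn's `predict_proba`. The 0.5 cut-off (`RISK_THRESHOLD`) is an assumption for illustration, not a clinical threshold, and for brevity the model is fit on all 20 simulated rows rather than on a training split:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Same simulated records as in the main script.
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 32, 38, 42, 48, 52, 58, 62, 68, 72, 28],
    'bmi': [22.5, 28.1, 31.7, 26.3, 34.9, 29.5, 33.2, 27.8, 36.5, 31.1,
            24.8, 30.4, 25.6, 33.8, 28.9, 32.1, 26.7, 35.3, 29.9, 23.2],
    'blood_pressure': [120, 130, 140, 125, 150, 135, 145, 130, 160, 140,
                       122, 132, 127, 152, 137, 147, 132, 162, 142, 124],
    'family_history': [0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
    'physical_activity': [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    'diabetes': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0],
})

X = df.drop(columns='diabetes')
y = df['diabetes']

# Fit the scaler and the model (on all rows here, to keep the sketch short).
scaler = StandardScaler()
model = LogisticRegression()
model.fit(scaler.fit_transform(X), y)

new_patient = pd.DataFrame({
    'age': [45],
    'bmi': [30.0],
    'blood_pressure': [135],
    'family_history': [1],
    'physical_activity': [0],
})

# predict_proba returns one row per patient: [P(class 0), P(class 1)].
risk = model.predict_proba(scaler.transform(new_patient))[0, 1]

RISK_THRESHOLD = 0.5  # assumed cut-off for illustration only
label = 'high' if risk >= RISK_THRESHOLD else 'low'
print(f"Estimated diabetes probability: {risk:.2f} ({label} risk)")
```

A probability lets the reader choose a cut-off that suits the cost of false negatives versus false positives, which a fixed 0/1 label hides.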

Key points and explanations:

* **Clear Problem Definition:** The description explicitly states the goal: predicting the risk of diabetes. This makes the code's purpose immediately understandable.
* **Dependencies:** The header comment lists the required libraries (scikit-learn, pandas, numpy) and how to install them using `pip`.
* **Simulated Data:** Includes sample data in the form of a pandas DataFrame, so the program runs out of the box without external data. The 20 hand-written rows are for illustration only; metrics computed on a six-row test set are too noisy to say anything about real-world performance.
* **Data Preprocessing:**
    * **Feature and Target Separation:** Clearly separates the features (independent variables) from the target variable (the variable to be predicted).
    * **Data Splitting:**  Uses `train_test_split` to divide the data into training and testing sets. Explains the purpose of this division.
    * **Feature Scaling (StandardScaler):** `StandardScaler` standardizes each feature to zero mean and unit variance. scikit-learn's `LogisticRegression` applies L2 regularization by default, so features on very different scales (age vs. a 0/1 flag) are penalized unevenly, and the solver may converge more slowly without scaling. The scaler is *fit* on the *training* data only and then *applied* to both the training and *testing* data, which prevents data leakage.
* **Model Training:**  Initializes and trains a `LogisticRegression` model. `random_state` is passed for consistency, although it only affects the `sag`, `saga`, and `liblinear` solvers; the default `lbfgs` solver is deterministic.
* **Model Evaluation:**
    * **Metrics:** Calculates and prints accuracy, the confusion matrix, and the classification report. The confusion matrix provides a detailed breakdown of true positives, true negatives, false positives, and false negatives. The classification report provides precision, recall, F1-score, and support.
* **Prediction on New Data:**
    * **New Patient Input:** Demonstrates how to use the trained model to make predictions on new, unseen data.
    * **Scaling New Data:** Critically, the new patient's data is *also scaled* using the *same* scaler that was fitted to the training data. This is essential for consistent predictions.
* **Comments and Explanations:**  Extensive comments throughout the code explain each step in detail.  This makes the code easy to understand and modify.
* **Error Handling (Optional, but Recommended in Real Applications):**  Omitted here for simplicity. A real-world healthcare application would need to handle missing data, invalid inputs, and other potential issues.
* **Data Source (Placeholder):**  The code clearly indicates that the simulated data should be replaced with your actual data source.
* **Clear Output:**  The code prints intermediate results (e.g., the shape of the training data, the first few rows of the dataset) to help you understand the data and the process.
* **Reproducibility:** `random_state=42` in `train_test_split` fixes the train/test split, so the results are the same on every run. The same argument on `LogisticRegression` has no effect with the default `lbfgs` solver.
* **`axis=1`:**  In `df.drop('diabetes', axis=1)`, `axis=1` specifies that a *column* is dropped rather than a row.
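
With only 20 rows, a single 70/30 split leaves six test samples, so the reported accuracy can swing widely. One possible alternative, sketched here on the same simulated data, wraps the scaler and the model in a scikit-learn pipeline and scores it with stratified 5-fold cross-validation. The scaler is then refit inside each training fold, which keeps the leakage protection described above. The choice of five folds is an assumption, not part of the original program:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated records as in the main script.
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 32, 38, 42, 48, 52, 58, 62, 68, 72, 28],
    'bmi': [22.5, 28.1, 31.7, 26.3, 34.9, 29.5, 33.2, 27.8, 36.5, 31.1,
            24.8, 30.4, 25.6, 33.8, 28.9, 32.1, 26.7, 35.3, 29.9, 23.2],
    'blood_pressure': [120, 130, 140, 125, 150, 135, 145, 130, 160, 140,
                       122, 132, 127, 152, 137, 147, 132, 162, 142, 124],
    'family_history': [0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
    'physical_activity': [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    'diabetes': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0],
})

X = df.drop(columns='diabetes')
y = df['diabetes']

# The pipeline fits the scaler on each training fold only, then scales the
# held-out fold with those statistics before scoring.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# Stratification keeps the class ratio roughly equal across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')

print("Fold accuracies:", scores.round(2))
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

The spread across folds gives a rough sense of how much a single split's accuracy depends on which rows land in the test set.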

How to Run:

1.  **Install Libraries:**  `pip install scikit-learn pandas numpy`
2.  **Save:** Save the code as a Python file (e.g., `healthcare_analytics.py`).
3.  **Run:** Execute the file from your terminal: `python healthcare_analytics.py`

This is a complete, runnable example of a basic risk-prediction workflow with Python, scikit-learn, pandas, and numpy, designed to be easy to understand and adapt. Replace the sample data with your actual healthcare dataset for real-world use, and consult healthcare professionals before making medical decisions based on its output.
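
Before swapping in a real dataset, it may help to check the input first, since the program itself does no error handling. A minimal sketch, assuming the same five feature columns as the simulated data; `validate_patient_data` is a hypothetical helper, not part of the original program or of any library, and the checks shown are only a starting point:

```python
import pandas as pd

FEATURES = ['age', 'bmi', 'blood_pressure', 'family_history', 'physical_activity']
BINARY_COLUMNS = ['family_history', 'physical_activity']


def validate_patient_data(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return the feature columns or raise ValueError."""
    missing_cols = [col for col in FEATURES if col not in df.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

    features = df[FEATURES]

    # Reject rows with missing values instead of silently passing NaN on.
    has_nan = features.isna().any()
    if has_nan.any():
        raise ValueError(f"Missing values in columns: {has_nan[has_nan].index.tolist()}")

    # The 0/1 flags must not contain any other value.
    for col in BINARY_COLUMNS:
        if not features[col].isin([0, 1]).all():
            raise ValueError(f"Column {col!r} must contain only 0 or 1")

    return features


ok = pd.DataFrame({
    'age': [45],
    'bmi': [30.0],
    'blood_pressure': [135],
    'family_history': [1],
    'physical_activity': [0],
})
print(validate_patient_data(ok).shape)  # (1, 5)

# A record with a missing BMI value is rejected with a readable message.
bad = ok.assign(bmi=[float('nan')])
try:
    validate_patient_data(bad)
except ValueError as err:
    print("Rejected:", err)
```

Calling such a check before `scaler.transform` turns a confusing downstream failure into an explicit message about which column is at fault.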