AI-driven Medical Diagnosis System (Python, AI, Deep Learning)

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Data Preparation (Simulated Medical Data)

# Let's create a synthetic dataset for demonstration.  In a real-world scenario,
# this data would come from actual medical records, lab results, imaging scans, etc.
# Here, we simulate a few common risk factors and a binary diagnosis label, mimicking a
# scenario such as screening for heart disease. Note that the label is drawn independently
# of the features, so the model has no real signal to learn from; with real data the
# features would carry genuine predictive information.

def generate_synthetic_data(num_samples=1000):
    """Generates a synthetic medical dataset.

    Args:
        num_samples: The number of samples to generate.

    Returns:
        A pandas DataFrame containing the synthetic data.
    """

    np.random.seed(42)  # for reproducibility

    data = {
        'Age': np.random.randint(20, 80, num_samples),
        'Systolic_BP': np.random.randint(100, 200, num_samples), # Systolic Blood Pressure
        'Diastolic_BP': np.random.randint(60, 120, num_samples), # Diastolic Blood Pressure
        'Cholesterol': np.random.randint(150, 300, num_samples),
        'Smoking': np.random.choice([0, 1], num_samples, p=[0.7, 0.3]),  # 0: No, 1: Yes
        'Family_History': np.random.choice([0, 1], num_samples, p=[0.5, 0.5]),  # 0: No, 1: Yes
        'Diagnosis': np.random.choice([0, 1], num_samples)  # 0: No Disease, 1: Disease
    }
    return pd.DataFrame(data)

df = generate_synthetic_data()

print("Sample of the data:")
print(df.head())
print("\nData Description:")
print(df.describe())
print("\nData Info:")
print(df.info())
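
# (Added check, not in the original walkthrough): for a diagnosis task it is worth
# inspecting the class balance up front, since heavily skewed labels would make plain
# accuracy a misleading metric.
print("\nClass balance (Diagnosis):")
print(df['Diagnosis'].value_counts(normalize=True))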


# 2. Data Preprocessing

# Separate features (X) and target (y)
X = df.drop('Diagnosis', axis=1)  # Features
y = df['Diagnosis']               # Target variable (diagnosis)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (important for neural networks)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit on training data, then transform
X_test = scaler.transform(X_test)        # Transform test data using the training data's scaling


# 3. Model Building (Deep Learning with Keras)

# Create a sequential model
model = Sequential()

# Add layers to the model
model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))  # First hidden layer (also declares the input shape), 128 neurons, ReLU activation
model.add(Dropout(0.3))  # Dropout layer to prevent overfitting
model.add(Dense(64, activation='relu'))       # Hidden layer, 64 neurons, ReLU activation
model.add(Dropout(0.3))  # Another dropout layer
model.add(Dense(1, activation='sigmoid'))      # Output layer, 1 neuron (binary classification), Sigmoid activation


# Compile the model
optimizer = Adam(learning_rate=0.001)  # Adam optimizer with a learning rate of 0.001
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()


# 4. Model Training

# Train the model
history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1, verbose=1) # Trains the model for 50 epochs. Validation split helps monitor for overfitting.

# Plot training history
plt.figure(figsize=(12, 4))

# Plot training & validation accuracy values
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

# Plot training & validation loss values
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

plt.tight_layout()
plt.show()

# 5. Model Evaluation

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5).astype(int)  # Convert probabilities to binary predictions (0 or 1)

# Print classification report and confusion matrix
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Visualize the confusion matrix
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()
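
# (Added sketch, not in the original walkthrough): accuracy alone can be misleading in
# medical settings, so ROC AUC is often reported as a threshold-independent summary of
# how well the predicted probabilities rank diseased vs. healthy cases.
from sklearn.metrics import roc_auc_score
y_pred_proba = model.predict(X_test).ravel()  # raw predicted probabilities in [0, 1]
print(f"\nROC AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")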

# 6. Prediction on New Data (Simulated)

# Simulate a new patient's data. The column order must match the training features:
# Age, Systolic_BP, Diastolic_BP, Cholesterol, Smoking, Family_History
new_patient_data = pd.DataFrame([[55, 140, 85, 220, 1, 0]], columns=X.columns)

# Scale the new data using the same scaler fitted on the training set
new_patient_data_scaled = scaler.transform(new_patient_data)

# Make a prediction
prediction = model.predict(new_patient_data_scaled)
print(f"\nPrediction for new patient: {prediction[0][0]:.4f}") # Prints the predicted probability of having the disease
if prediction[0][0] > 0.5:
    print("The model predicts the patient has the disease.")
else:
    print("The model predicts the patient does not have the disease.")
```
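
In practice, the trained model and the fitted scaler would be saved once and reloaded for inference, rather than retraining for every new patient.  A minimal sketch of that workflow is shown below; the file names are illustrative, it assumes a recent TensorFlow/Keras (for the native `.keras` format) plus the `model` and `scaler` objects from the script above, and `joblib` ships as a dependency of scikit-learn.

```python
import joblib
import pandas as pd
from tensorflow.keras.models import load_model

# Persist the trained artifacts (illustrative file names).
model.save("diagnosis_model.keras")             # native Keras format
joblib.dump(scaler, "diagnosis_scaler.joblib")  # fitted StandardScaler

# Later, e.g. in a separate inference script:
loaded_model = load_model("diagnosis_model.keras")
loaded_scaler = joblib.load("diagnosis_scaler.joblib")

feature_columns = ['Age', 'Systolic_BP', 'Diastolic_BP', 'Cholesterol', 'Smoking', 'Family_History']
new_patient = pd.DataFrame([[55, 140, 85, 220, 1, 0]], columns=feature_columns)
probability = loaded_model.predict(loaded_scaler.transform(new_patient))[0][0]
print(f"Predicted disease probability: {probability:.4f}")
```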

Key design points and explanations:

* **Clearer Data Generation:**  The `generate_synthetic_data` function is well defined and uses `numpy.random` to create plausible (though still synthetic) feature distributions, including probabilities for binary features like smoking.  Because the `Diagnosis` label is drawn independently of the features, the model has no real signal to learn and test accuracy near 50% is expected; understanding how the data is generated matters because it directly determines what the model can learn.  The seed is set for reproducibility.
* **Data Description:** Added `df.describe()` and `df.info()` to explore the dataset characteristics.
* **Data Splitting:** Uses `train_test_split` to create training and testing sets.  The `random_state` ensures reproducibility.
* **Feature Scaling:**  `StandardScaler` is used to scale the numerical features.  This is essential for deep learning models, as it helps with faster convergence and prevents features with larger ranges from dominating the learning process.  Critically, the scaler is *fit* on the training data only and then used to *transform* both the training and test data.  This avoids data leakage.
* **Model Architecture:** The model is a sequential neural network with:
    * A first hidden layer with 128 neurons and ReLU activation (the input shape is declared here).
    * Dropout layers to prevent overfitting (randomly drops neurons during training).
    * A hidden layer with 64 neurons and ReLU activation.
    * An output layer with 1 neuron and sigmoid activation (for binary classification).  Sigmoid outputs a probability between 0 and 1.
* **Optimizer:** Adam optimizer is used with a learning rate of 0.001.  Adam is a popular and effective optimization algorithm.
* **Compilation:** The model is compiled with `binary_crossentropy` loss (suitable for binary classification), the Adam optimizer, and accuracy as the metric.
* **Training:** The model is trained with `model.fit`.  The `epochs` parameter controls how many full passes are made over the training data, `batch_size` controls how many samples are processed per gradient update, and `validation_split` holds out part of the training data to monitor performance during training.  `verbose=1` prints training progress.
* **Training History Plot:** Included plotting of training and validation accuracy and loss over the epochs.  This is critical for diagnosing overfitting or underfitting.
* **Evaluation:** The model is evaluated on the test set using `model.evaluate`.
* **Prediction:** The model makes predictions on the test set using `model.predict`.  The probabilities are converted to binary predictions (0 or 1) using a threshold of 0.5.
* **Metrics:** A classification report and confusion matrix are printed to evaluate the model's performance. The confusion matrix is also visualized as a heatmap.
* **New Data Prediction:** A new patient's data is simulated, scaled, and used to make a prediction.  The prediction is then printed.
* **Comments:** Extensive comments are added to explain each step of the code.
* **Error Handling:** Error handling is kept minimal here; `generate_synthetic_data` performs no input validation.  Robust validation and error handling are essential in real-world applications.
* **Reproducibility:** Setting `np.random.seed` makes the synthetic data generation reproducible, and `random_state=42` does the same for the train/test split.  Model weight initialization and dropout are driven by TensorFlow's own random generator, so a separate TensorFlow seed is needed for fully repeatable training (see the sketch after this list).
* **Clearer Output:** The output of the program is formatted to be more readable and informative.
* **Dependencies:** The code includes the necessary imports at the beginning.
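
As noted in the reproducibility point above, `np.random.seed` only controls NumPy's random generator.  For repeatable weight initialization and dropout behaviour, TensorFlow's own seed has to be set as well, before the model is built; a minimal sketch (placed near the top of the script):

```python
import numpy as np
import tensorflow as tf

np.random.seed(42)      # NumPy: synthetic data generation
tf.random.set_seed(42)  # TensorFlow: weight initialization, dropout masks, shuffling
```

Even with both seeds set, bit-for-bit identical results across different hardware or GPU runs are not guaranteed without additional determinism settings in TensorFlow.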

How to run:

1.  **Install libraries:**
    ```bash
    pip install numpy pandas scikit-learn tensorflow matplotlib seaborn
    ```
2.  **Save the code:** Save the code as a Python file (e.g., `medical_diagnosis.py`).
3.  **Run the code:**
    ```bash
    python medical_diagnosis.py
    ```

This example provides a comprehensive, understandable walkthrough of building an AI-driven medical diagnosis system using Python, TensorFlow/Keras, and scikit-learn.  It covers data preparation, feature scaling, model architecture, training, evaluation, and prediction, with clear explanations and comments.  Remember to replace the synthetic data with real medical data for a practical application; ethical considerations and regulatory compliance are paramount when working with medical data.