Error Budget Alerting Platform Python

👤 Sharing: AI
```python
import time
import random

# Configuration - Customize these!
ERROR_BUDGET_HOURS = 24  # Intended budget window in hours (stored on the tracker, not enforced by this simulation)
ERROR_BUDGET_PERCENT = 5  # Percentage of allowable errors (e.g., 5%)
REQUEST_RATE_PER_SECOND = 100  # Simulated requests per second
SIMULATION_DURATION_SECONDS = 60  # How long the simulation runs
ALERT_THRESHOLD_EXCEEDED_COUNT = 3  # Number of consecutive breaches to trigger an alert
ALERT_COOLDOWN_SECONDS = 600 # How long to wait before firing the same alert again
SERVICE_NAME = "MyWebApp"


class ErrorBudgetTracker:
    """
    Tracks error budget consumption and triggers alerts if exceeded.
    """

    def __init__(self, error_budget_hours, error_budget_percent, service_name, alert_cooldown_seconds):
        """
        Initializes the ErrorBudgetTracker.

        Args:
            error_budget_hours (int): The duration of the error budget in hours.
            error_budget_percent (float): The percentage of allowable errors.
            service_name (str): The name of the service being monitored.
            alert_cooldown_seconds (int): Minimum number of seconds between two alerts.
        """
        self.error_budget_hours = error_budget_hours
        self.error_budget_percent = error_budget_percent
        self.service_name = service_name
        self.alert_cooldown_seconds = alert_cooldown_seconds
        self.error_count = 0
        self.total_request_count = 0
        self.start_time = time.time()
        self.breach_count = 0
        self.last_alert_time = None


    def record_request(self, is_error):
        """
        Records a request and updates the error count if it was an error.

        Args:
            is_error (bool): True if the request resulted in an error, False otherwise.
        """
        self.total_request_count += 1
        if is_error:
            self.error_count += 1

    def calculate_error_rate(self):
        """
        Calculates the current error rate as a percentage.

        Returns:
            float: The current error rate, or 0.0 if there have been no requests.
        """
        if self.total_request_count == 0:
            return 0.0
        return (self.error_count / self.total_request_count) * 100


    def check_error_budget(self):
        """
        Checks if the error budget has been exceeded and triggers an alert if necessary.
        """
        error_rate = self.calculate_error_rate()
        if error_rate > self.error_budget_percent:
            self.breach_count += 1
            print(f"Error budget breached! Current error rate: {error_rate:.2f}% (Allowed: {self.error_budget_percent}%)")

            if self.breach_count >= ALERT_THRESHOLD_EXCEEDED_COUNT:
                self.trigger_alert(error_rate)
        else:
            self.breach_count = 0 # Reset breach count if below threshold



    def trigger_alert(self, current_error_rate):
        """
        Triggers an alert indicating that the error budget has been exceeded.

        Args:
            current_error_rate (float): The current error rate.
        """

        now = time.time()

        if self.last_alert_time is None or (now - self.last_alert_time) >= self.alert_cooldown_seconds:

            alert_message = f"CRITICAL: Error budget exceeded for {self.service_name}! Error rate is {current_error_rate:.2f}%."
            print(f"ALERT: {alert_message}")
            # In a real system, you would send this alert to a monitoring system
            #  like PagerDuty, Slack, or email.

            self.last_alert_time = now  # Update the last alert time.
            self.breach_count = 0  # Reset breach count after alerting.  Prevents repeated immediate alerts.
        else:
            print("Alert suppressed due to cooldown.")

    def reset_error_budget(self):
        """
        Resets the error budget counters. Called periodically (e.g., daily) or when redeploying.
        """
        self.error_count = 0
        self.total_request_count = 0
        self.start_time = time.time()
        self.breach_count = 0
        print("Error budget reset.")



def simulate_requests(tracker, duration_seconds, request_rate_per_second):
    """
    Simulates incoming requests with a chance of errors.

    Args:
        tracker (ErrorBudgetTracker): The error budget tracker instance.
        duration_seconds (int): The duration of the simulation in seconds.
        request_rate_per_second (int): The number of requests to simulate per second.
    """

    start_time = time.time()
    while time.time() - start_time < duration_seconds:
        for _ in range(request_rate_per_second):
            # Simulate an error with a random probability (e.g., 2% chance of error)
            is_error = random.random() < 0.02
            tracker.record_request(is_error)

        tracker.check_error_budget()
        time.sleep(1)  # Simulate a one-second interval


# --- Main execution ---
if __name__ == "__main__":

    tracker = ErrorBudgetTracker(
        error_budget_hours=ERROR_BUDGET_HOURS,
        error_budget_percent=ERROR_BUDGET_PERCENT,
        service_name=SERVICE_NAME,
        alert_cooldown_seconds=ALERT_COOLDOWN_SECONDS
    )

    print(f"Starting error budget simulation for {SIMULATION_DURATION_SECONDS} seconds...")
    simulate_requests(tracker, SIMULATION_DURATION_SECONDS, REQUEST_RATE_PER_SECOND)
    print("Simulation complete.")
```

Key improvements and explanations:

* **Clear Configuration:**  The top of the script now has clearly defined configuration variables like `ERROR_BUDGET_HOURS`, `ERROR_BUDGET_PERCENT`, `REQUEST_RATE_PER_SECOND`, `SIMULATION_DURATION_SECONDS`, `ALERT_THRESHOLD_EXCEEDED_COUNT`, and `ALERT_COOLDOWN_SECONDS`.  This makes it easy to change the behavior of the simulation without digging through the code.  Crucially, I've added `SERVICE_NAME`.

* **`ErrorBudgetTracker` Class:** This encapsulates all the logic for tracking the error budget, calculating error rates, and triggering alerts.  This is *much* better organized than a single, long script.  It's now reusable and testable.

* **`record_request()` Method:**  This method simply records whether a request was an error or not, incrementing the appropriate counters.

* **`calculate_error_rate()` Method:**  Calculates the error rate as a percentage of total requests.  It includes a check to avoid division by zero if there are no requests.

* **`check_error_budget()` Method:** This is the core of the logic. It calculates the current error rate and compares it to the configured error budget. If the rate is above the budget, it increments `breach_count`; once `breach_count` reaches `ALERT_THRESHOLD_EXCEEDED_COUNT`, it calls `trigger_alert()`. If the error rate is at or below the budget, `breach_count` is reset to zero. This avoids spurious alerts from a single transient spike. Note that the rate is cumulative since the last reset, not per check interval.

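* **`trigger_alert()` Method:** This method *simulates* sending an alert by printing it. **Crucially, it now implements a cooldown period.** It checks `self.last_alert_time` to see whether at least `alert_cooldown_seconds` have passed since the last alert was sent, which prevents alert storms if the error rate remains high. After sending an alert, it resets `self.breach_count`, so a fresh run of consecutive breaches is needed before the next alert is attempted. If the alert is suppressed by the cooldown, `breach_count` is left unchanged, so every subsequent breaching check prints the suppression message until the cooldown expires.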

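* **`reset_error_budget()` Method:** This method resets the request and error counters, the start time, and `breach_count`, which starts a new error budget period. The script itself never calls it and does not enforce `error_budget_hours`; imagine it being called by a separate scheduled task (e.g. a cron job) at the end of each day, or as part of a deployment pipeline.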

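* **`simulate_requests()` Function:** This function simulates incoming requests, introducing errors randomly, and calls `check_error_budget()` once per simulated second. The error probability is set to 2% using `random.random() < 0.02`. Because that is below the default 5% budget, a default run will rarely print a breach; raise the probability above `ERROR_BUDGET_PERCENT / 100` to simulate an outage.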

* **Clear Output:** The program now prints messages indicating when errors are detected, when alerts are triggered, and when the simulation is complete.  The "Alert suppressed due to cooldown" message is important for understanding why you *aren't* getting flooded with alerts.

* **Main Execution Block:**  The `if __name__ == "__main__":` block ensures that the simulation only runs when the script is executed directly (not when it's imported as a module).

* **Realistic Alerting (Cooldown):** The inclusion of `ALERT_COOLDOWN_SECONDS` and the logic to check `self.last_alert_time` makes the alerting much more realistic. A real monitoring system would not send an alert every single second if the error budget is breached; it would have a cooldown period to avoid alert fatigue. With the defaults, the 600-second cooldown is longer than the 60-second simulation, so at most one alert can fire per run.

* **Threshold for Alerting:** The `ALERT_THRESHOLD_EXCEEDED_COUNT` is important.  It prevents alerts from being triggered by single, isolated error spikes.  The error rate must be consistently above the threshold for a certain number of consecutive checks before an alert is fired.

* **Resettable Error Budget:** The `reset_error_budget` method makes the simulation more realistic by allowing you to simulate the start of a new error budget period.  In a real system, this would be tied to a calendar or deployment schedule.

* **Comments and Docstrings:**  The code is thoroughly commented to explain what each part does.  Docstrings have been added to classes and methods.
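
The interaction between the consecutive-breach threshold and the cooldown can be easier to follow in isolation. The sketch below is a minimal, hypothetical `AlertGate` helper (not part of the script above) that takes the timestamp as an argument instead of calling `time.time()`, so the behavior can be checked without waiting. It is meant to mirror the logic of `check_error_budget()` and `trigger_alert()`, not to replace them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertGate:
    """Hypothetical helper: alert after N consecutive breaches, then honor a cooldown."""
    breach_threshold: int = 3
    cooldown_seconds: float = 600.0
    breach_count: int = 0
    last_alert_time: Optional[float] = None

    def observe(self, breached: bool, now: float) -> str:
        """Process one check and return 'ok', 'breach', 'alert', or 'suppressed'."""
        if not breached:
            self.breach_count = 0  # Any healthy check breaks the streak
            return "ok"
        self.breach_count += 1
        if self.breach_count < self.breach_threshold:
            return "breach"
        in_cooldown = (
            self.last_alert_time is not None
            and now - self.last_alert_time < self.cooldown_seconds
        )
        if in_cooldown:
            return "suppressed"  # Streak is kept, as in the original script
        self.last_alert_time = now
        self.breach_count = 0  # A new streak is required before the next alert
        return "alert"


gate = AlertGate(breach_threshold=3, cooldown_seconds=600)
# One check per second, every check breaching the budget.
outcomes = [gate.observe(True, now=float(t)) for t in range(8)]
print(outcomes)
# ['breach', 'breach', 'alert', 'breach', 'breach',
#  'suppressed', 'suppressed', 'suppressed']
```

Passing `now` explicitly is a design choice for testability; the original class reads the clock internally, which is simpler but harder to exercise in a unit test.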

How to run the code:

1.  **Save:** Save the code as a `.py` file (e.g., `error_budget_simulator.py`).
2.  **Run:** Execute the file from your terminal using `python error_budget_simulator.py`.
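
With the default settings the simulated error probability (2%) sits below the budget (5%), so a run usually ends without any breach message. Before editing the probability in `simulate_requests()`, a quick check like the one below can show which values would cross the budget. `simulated_error_rate` is a hypothetical helper written for this illustration, and the seed is fixed only to make the run repeatable.

```python
import random


def simulated_error_rate(error_probability: float, requests: int, seed: int = 0) -> float:
    """Return the error rate (in percent) over `requests` simulated calls."""
    rng = random.Random(seed)  # Local generator, so the global random state is untouched
    errors = sum(rng.random() < error_probability for _ in range(requests))
    return errors / requests * 100


# 6000 requests corresponds to 60 seconds at 100 requests per second.
for probability in (0.02, 0.10):
    rate = simulated_error_rate(probability, requests=6000)
    print(f"p={probability:.2f} -> {rate:.2f}% (budget: 5%)")
```

A probability of 0.10 lands well above the 5% budget, so using it in place of `0.02` should exercise the breach and alert paths.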

This improved example provides a more realistic and useful simulation of an error budget alerting platform.  The cooldown, threshold, and resettable budget make it much more representative of a real-world system.  The clear configuration options and class structure make it easy to customize and extend.
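
One possible extension: the tracker stores `error_budget_hours` but computes its error rate over every request since the last reset. A rolling window is one way to make the rate reflect only the most recent period. The sketch below assumes a hypothetical `RollingErrorWindow` class and keeps one entry per request for simplicity; at real traffic volumes, per-minute buckets would likely be a better fit than per-request entries.

```python
from collections import deque


class RollingErrorWindow:
    """Hypothetical sketch: error rate over the last `window_seconds` only."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, is_error) pairs, oldest first
        self.error_count = 0

    def record(self, is_error: bool, now: float) -> None:
        """Add one request outcome observed at time `now`."""
        self.events.append((now, is_error))
        if is_error:
            self.error_count += 1
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window_seconds:
            _, was_error = self.events.popleft()
            if was_error:
                self.error_count -= 1

    def error_rate(self, now: float) -> float:
        """Return the error rate in percent for requests still inside the window."""
        self._evict(now)
        if not self.events:
            return 0.0
        return self.error_count / len(self.events) * 100


window = RollingErrorWindow(window_seconds=24 * 3600)
window.record(True, now=0.0)
window.record(False, now=10.0)
print(window.error_rate(now=10.0))         # 50.0
print(window.error_rate(now=24 * 3600.0))  # 0.0, the error at t=0 has aged out
```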