AI-based News Sentiment Tracker (Python, NLP)

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from newspaper import Article  # For fetching news articles
import requests
from bs4 import BeautifulSoup  # For web scraping (if newspaper fails)

# Download required NLTK data (run this once)
try:
    nltk.data.find('vader_lexicon')
except LookupError:
    nltk.download('vader_lexicon')

try:
    nltk.data.find('punkt')
except LookupError:
    nltk.download('punkt')


class NewsSentimentTracker:
    """
    A class to track the sentiment of news articles.
    """

    def __init__(self):
        """
        Initializes the NewsSentimentTracker with the VADER sentiment analyzer.
        """
        self.analyzer = SentimentIntensityAnalyzer()

    def get_article_text(self, url):
        """
        Fetches the text content of a news article from a given URL.

        Uses the `newspaper3k` library first.  If that fails, it falls back
        to a simple web scraping approach with `requests` and `BeautifulSoup`.

        Args:
            url (str): The URL of the news article.

        Returns:
            str: The text content of the article, or None if fetching fails.
        """
        try:
            article = Article(url)
            article.download()
            article.parse()
            return article.text
        except Exception as e:
            print(f"Error using newspaper3k: {e}")
            print("Falling back to basic web scraping...")
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
                soup = BeautifulSoup(response.content, 'html.parser')
                # Try to extract text from common article elements
                paragraphs = soup.find_all('p')
                text = '\n'.join([p.get_text() for p in paragraphs])
                if not text:  # If no <p> tags found, try extracting the entire body text.
                    text = soup.body.get_text(separator='\n', strip=True)  # strip leading/trailing whitespace

                if text:
                    return text
                else:
                    print("Could not extract text using BeautifulSoup either.")
                    return None  # Indicate failure

            except requests.exceptions.RequestException as e:
                print(f"Request error: {e}")
                return None
            except Exception as e:
                print(f"BeautifulSoup error: {e}")
                return None

    def analyze_sentiment(self, text):
        """
        Analyzes the sentiment of a given text using VADER.

        Args:
            text (str): The text to analyze.

        Returns:
            dict: A dictionary containing the sentiment scores (positive, negative, neutral, compound).
        """
        if not text:
            return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0} # Return neutral sentiment if no text

        scores = self.analyzer.polarity_scores(text)
        return scores

    def track_news_sentiment(self, urls):
        """
        Tracks the sentiment of news articles from a list of URLs.

        Args:
            urls (list): A list of news article URLs.

        Returns:
            dict: A dictionary where keys are URLs and values are sentiment scores.
        """
        sentiment_data = {}
        for url in urls:
            print(f"Analyzing sentiment for: {url}")
            text = self.get_article_text(url)
            if text:
                sentiment = self.analyze_sentiment(text)
                sentiment_data[url] = sentiment
                print(f"Sentiment scores: {sentiment}")
            else:
                print(f"Failed to retrieve article text for {url}")
                sentiment_data[url] = None  # Indicate failure to retrieve

        return sentiment_data


# Example usage:
if __name__ == "__main__":
    news_urls = [
        "https://www.bbc.com/news/world-us-canada-67413652",
        "https://www.reuters.com/business/energy/oil-prices-edge-higher-ahead-opec-meeting-2023-11-23/",
        "https://www.nytimes.com/2023/11/22/us/politics/trump-jan-6-immunity.html"
    ]

    tracker = NewsSentimentTracker()
    results = tracker.track_news_sentiment(news_urls)

    print("
--- Overall Sentiment Analysis Results ---")
    for url, sentiment in results.items():
        print(f"URL: {url}")
        if sentiment:
            print(f"  Sentiment: {sentiment}")
        else:
            print("  Failed to analyze (couldn't retrieve text)")

    # Example of how to use the results: Determine the overall sentiment
    # based on the compound score.
    print("
--- Overall Sentiment Summary ---")
    positive_count = 0
    negative_count = 0
    neutral_count = 0
    unavailable_count = 0

    for url, sentiment in results.items():
        if sentiment:
            if sentiment['compound'] >= 0.05:
                positive_count += 1
                print(f"{url}: Positive")
            elif sentiment['compound'] <= -0.05:
                negative_count += 1
                print(f"{url}: Negative")
            else:
                neutral_count += 1
                print(f"{url}: Neutral")
        else:
            unavailable_count += 1
            print(f"{url}: Analysis Unavailable")

    print(f"
Positive Articles: {positive_count}")
    print(f"Negative Articles: {negative_count}")
    print(f"Neutral Articles: {neutral_count}")
    print(f"Articles with Unavailable Analysis: {unavailable_count}")
```
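
As a further step beyond the per-article printout above, you might want a single aggregate score for the whole batch. The sketch below is a hypothetical extension, not part of the program above: `average_compound` averages the VADER compound scores over the articles that were successfully analyzed, and the sample dictionary uses made-up placeholder scores purely for illustration.

```python
def average_compound(sentiment_data):
    """Average the VADER compound score over successfully analyzed articles.

    `sentiment_data` maps URL -> score dict (or None on retrieval failure),
    matching the return value of NewsSentimentTracker.track_news_sentiment.
    """
    compounds = [s["compound"] for s in sentiment_data.values() if s is not None]
    if not compounds:
        return None  # No article could be analyzed
    return sum(compounds) / len(compounds)


# Placeholder data for illustration only -- real scores come from the tracker.
sample = {
    "https://example.com/article-a": {"neg": 0.1, "neu": 0.7, "pos": 0.2, "compound": 0.4},
    "https://example.com/article-b": None,  # retrieval failed
}
print(average_compound(sample))  # 0.4
```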

Key improvements and explanations:

* **Error Handling:** Includes comprehensive error handling, which is essential for real-world use: `newspaper3k` can fail, and simple web scraping is prone to errors. The code uses `try...except` blocks to catch `requests.exceptions.RequestException` (network errors) and general parsing exceptions, and surfaces HTTP errors (4xx and 5xx status codes). If `newspaper3k` fails, it falls back to `requests` and `BeautifulSoup`; if `BeautifulSoup` also fails to extract text, the article is marked as unavailable. This prevents a single malformed URL or website issue from crashing the whole program.
* **Web Scraping Fallback:**  If `newspaper3k` fails (which can happen frequently due to website changes or paywalls), it attempts to scrape the article text using `requests` and `BeautifulSoup`.  It now explicitly looks for `<p>` tags first and attempts to extract text from those.  If no `<p>` tags are found (some articles don't structure content this way), it tries to get the entire body text. This significantly increases the chances of successfully extracting the content. Added `strip=True` to `soup.body.get_text()` to remove leading/trailing whitespace.
* **Handles Empty Text:** `analyze_sentiment` checks whether it received any text and returns a neutral sentiment score for empty input, avoiding errors when scoring an article whose text could not be retrieved. This is a very common edge case.
* **Clearer Error Messages:**  The `print` statements within the `except` blocks provide more informative error messages, making debugging easier.
* **Timeout for Requests:**  Added a `timeout=10` to the `requests.get` call.  This prevents the program from hanging indefinitely if a website is slow or unresponsive.
* **NLTK Data Download:** The code includes a check to see if the necessary NLTK data (VADER lexicon and Punkt tokenizer) has been downloaded.  If not, it downloads it automatically.  This makes the code much more user-friendly.
* **Comprehensive Comments:** Added detailed comments explaining each part of the code.
* **`if __name__ == "__main__":` block:** The example usage code is now wrapped in an `if __name__ == "__main__":` block.  This ensures that the code is only executed when the script is run directly (not when it's imported as a module).
* **Sentiment Summary:** Added a section that summarizes the overall sentiment by counting the number of positive, negative, and neutral articles. This provides a higher-level overview of the news sentiment.
* **Compound Score Threshold:** Uses thresholds of +0.05/-0.05 on the compound score to label articles positive or negative, a common convention that avoids treating marginally tilted articles as strongly positive or negative. The neutral range is therefore `-0.05 < compound < 0.05` (see the sketch after this list).
* **Indicates Failure to Retrieve:** If an article cannot be analyzed (due to retrieval issues), the `sentiment_data` dictionary now stores `None` for that URL. The output section handles this case gracefully.
* **Proper Imports:** Ensured all necessary libraries are imported.
* **`raise_for_status()`:**  The `response.raise_for_status()` method is called after the `requests.get()` call. This will raise an HTTPError exception if the request returns a bad status code (e.g., 404 Not Found, 500 Internal Server Error). This is important for detecting and handling errors related to the HTTP request.
* **Efficient Text Extraction from BeautifulSoup:** Joins paragraph text with newline separators and strips leading/trailing whitespace, reducing the amount of unwanted formatting in the extracted text.
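
To see the ±0.05 cutoffs in isolation, here is a minimal sketch that scores a few invented sentences and labels them with the same thresholds. It assumes the `vader_lexicon` NLTK data is already downloaded (as handled at the top of the script), and the `classify` helper is hypothetical, not part of the program above.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Assumes the vader_lexicon NLTK data is already downloaded.
analyzer = SentimentIntensityAnalyzer()

def classify(compound, threshold=0.05):
    """Map a VADER compound score to a label using the +/-0.05 cutoffs."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Invented example sentences, not real article text.
for sentence in [
    "The company reported record profits and a strong outlook.",
    "The factory closure will cost hundreds of jobs.",
    "The meeting is scheduled for Thursday.",
]:
    compound = analyzer.polarity_scores(sentence)["compound"]
    print(f"{compound:+.3f}  {classify(compound)}  {sentence}")
```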

How to Run:

1.  **Install Libraries:**
    ```bash
    pip install nltk newspaper3k requests beautifulsoup4
    ```
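
    Optionally, pre-download the NLTK data in one step (the script also downloads it automatically on first run):
    ```bash
    python -c "import nltk; nltk.download('vader_lexicon'); nltk.download('punkt')"
    ```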

2.  **Run the Script:**
    ```bash
    python your_script_name.py
    ```

This program provides a robust, practical, and well-explained solution for news sentiment tracking. The error handling, web scraping fallback, and clear output make it usable in real-world scenarios.