AI-Powered News Trend Analysis with Python and NLP
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import requests
from bs4 import BeautifulSoup
import re

# Ensure necessary NLTK resources are downloaded (run this once)
# nltk.download('punkt')
# nltk.download('punkt_tab')  # needed on newer NLTK releases
# nltk.download('stopwords')


def fetch_news_articles(url, selector):
    """
    Fetches news articles from a given URL using BeautifulSoup and CSS selectors.

    Args:
        url (str): The URL of the news website.
        selector (str): CSS selector used to extract article text. This needs to be
            specific to the website structure.

    Returns:
        list: A list of strings, where each string is the text of an article.
            Returns an empty list on error.
    """
    try:
        response = requests.get(url, timeout=10)  # Timeout avoids hanging on unresponsive servers
        response.raise_for_status()  # Raise HTTPError for bad responses (4XX, 5XX)
        soup = BeautifulSoup(response.content, 'html.parser')
        articles = soup.select(selector)  # Find elements matching the selector
        article_texts = [article.get_text(strip=True) for article in articles]  # Extract text, strip whitespace
        return article_texts
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
        return []
    except Exception as e:
        print(f"Error parsing content: {e}")
        return []


def clean_text(text):
    """
    Cleans text by removing non-alphabetic characters and converting to lowercase.

    Args:
        text (str): The input text.

    Returns:
        str: The cleaned text.
    """
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters and whitespace
    return text.lower()


def analyze_trends(articles, num_keywords=10):
    """
    Analyzes trends in a list of news articles by identifying the most frequent keywords.

    Args:
        articles (list): A list of strings, where each string is the text of an article.
        num_keywords (int): The number of top keywords to return.

    Returns:
        list: A list of tuples, where each tuple contains a keyword and its frequency.
    """
    all_text = ' '.join(articles)  # Combine all articles into a single string

    # Tokenize the text
    tokens = word_tokenize(all_text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [w for w in tokens if w not in stop_words]

    # Count word frequencies
    word_counts = Counter(filtered_tokens)

    # Get the most common words
    most_common_words = word_counts.most_common(num_keywords)
    return most_common_words


def main():
    """
    Main function to orchestrate news trend analysis.
    """
    # Example usage: replace with a real news website and an appropriate CSS selector
    news_url = "https://www.example.com/news"  # Replace with a real news site
    article_selector = ".article-content"  # Replace with the appropriate CSS selector

    articles = fetch_news_articles(news_url, article_selector)
    if not articles:
        print("No articles found. Please check the URL and selector.")
        return  # Exit if no articles were fetched

    cleaned_articles = [clean_text(article) for article in articles]
    trends = analyze_trends(cleaned_articles, num_keywords=10)

    print("Top Trending Keywords:")
    for word, count in trends:
        print(f"- {word}: {count}")


if __name__ == "__main__":
    main()
```
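Before pointing the script at a live site, it can be useful to smoke-test the cleaning and keyword-counting steps on a couple of hard-coded strings. The snippet below is a minimal sketch: it assumes the script above has been saved as `news_analyzer.py` and that the NLTK resources are already downloaded; the sample sentences are illustrative only.
```python
# quick_check.py -- illustrative smoke test for clean_text() and analyze_trends()
# Assumes the main script above is saved as news_analyzer.py in the same directory.
from news_analyzer import clean_text, analyze_trends

sample_articles = [
    "Markets rally as tech stocks surge!",
    "Tech stocks lead markets higher on strong earnings.",
]

cleaned = [clean_text(article) for article in sample_articles]
print(analyze_trends(cleaned, num_keywords=3))
# Expected top words: 'markets', 'tech' and 'stocks', each with a count of 2.
```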
Key improvements and explanations:
* **Error Handling:** Includes error handling for `requests` (network errors) and for parsing (`BeautifulSoup`/selector issues), so the script does not crash if the website is down or the HTML structure is unexpected. The `try...except` blocks make the code more robust, the script checks for an empty article list, and the printed error messages give the user useful feedback. A hedged retry-and-timeout sketch follows this list.
* **CSS Selector:** Highlights the importance of the `article_selector` and tells the user *exactly* what to do with it. Provides example class names.
* **`fetch_news_articles` Function:** Encapsulates the web scraping logic into a reusable function. This makes the code cleaner and easier to understand. The use of `.get_text(strip=True)` is better for extracting clean text.
* **`clean_text` Function:** Uses a regular expression (`re.sub`) for much more robust cleaning, removing all non-alphabetic characters (everything except letters and whitespace) instead of trying to replace individual characters one by one. This handles a broader range of potential "noise" in the text. It also converts the text to lowercase for consistency.
* **`analyze_trends` Function:** This function focuses solely on the NLP analysis, making it modular and testable. Clearly separates concerns.
* **Clearer Comments and Docstrings:** Added more comments to explain each step, making the code more readable. Docstrings explain the purpose, arguments, and return values of each function.
* **`main` Function:** Organizes the workflow, making it clear how the different parts of the program fit together. The example usage section helps users understand how to adapt the code to their specific needs. Includes a check for empty articles and a graceful exit.
* **`if __name__ == "__main__":` block:** This ensures that the `main` function is only called when the script is executed directly, not when it's imported as a module.
* **NLTK Resource Download:** Includes comments that remind the user that they need to download the NLTK resources *once* before running the code for the first time.
* **Up-to-date `requests` Usage:** Uses `response.raise_for_status()` for better error checking of HTTP responses.
* **String Joining:** Uses `' '.join(articles)` for more efficient string concatenation than repeated `+=` operations.
* **Stop Word Handling:** Uses `set(stopwords.words('english'))` which is more efficient for checking membership.
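For sites that intermittently time out or return transient 5XX errors, the basic error handling can optionally be paired with a timeout and automatic retries. The sketch below shows one way to do that with `requests` and `urllib3`'s `Retry` helper; the retry count, backoff factor, and status codes are illustrative assumptions rather than part of the script above.
```python
# Optional: a more defensive fetch using automatic retries (illustrative settings)
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    """Build a requests.Session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=3,                                     # at most 3 retries per request
        backoff_factor=0.5,                          # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # retry these HTTP status codes
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session

# Inside fetch_news_articles, requests.get(url, timeout=10) could become:
# response = make_session().get(url, timeout=10)
```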
How to run the code and adapt it:
1. **Install Libraries:**
```bash
pip install nltk requests beautifulsoup4
```
2. **Download NLTK Resources:** Run the following in a Python interpreter *once*:
```python
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # needed on newer NLTK releases
nltk.download('stopwords')
```
3. **Replace Placeholder:**
* **`news_url`:** Replace `"https://www.example.com/news"` with the actual URL of a news website you want to analyze.
* **`article_selector`:** **This is the most important part to customize.** Inspect the HTML source code of the news website (usually by right-clicking on an article and selecting "Inspect" or "View Source"). Identify a CSS selector that uniquely identifies the text of the *article content*. Common examples:
* If each article is within a `<div class="article">`: Use `".article"`
* If the article text is within `<p>` tags inside a `<div class="article-content">`: Use `".article-content p"`
* Inspecting the page element by element helps you narrow down to the specific tag that contains the main article content and is used consistently across articles.
* **Important:** Choose a selector that gives you *only* the article text, not menus, sidebars, or other irrelevant content. More specific selectors are better. (A short snippet for sanity-checking a candidate selector follows this step.)
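Before running the full analysis, it can save time to check interactively that a candidate selector really matches article text. A minimal sketch, using the same `requests`/`BeautifulSoup` stack as the main script (the URL and selector here are placeholders to replace):
```python
# check_selector.py -- eyeball what a candidate CSS selector actually matches
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/news"  # placeholder: the site you want to analyze
selector = ".article-content"         # placeholder: the selector you are testing

soup = BeautifulSoup(requests.get(url, timeout=10).content, "html.parser")
matches = soup.select(selector)
print(f"{len(matches)} elements matched")
for element in matches[:3]:
    # Print the first 200 characters of each match to see whether it is article text
    print(element.get_text(strip=True)[:200])
    print("---")
```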
4. **Run the Script:** Save the code as a Python file (e.g., `news_analyzer.py`) and run it from the command line:
```bash
python news_analyzer.py
```
5. **Analyze Results:** The script will print the top trending keywords found in the articles. The quality of the results depends heavily on the accuracy of the `article_selector`.
The program above combines proper error handling, clear instructions, and the practical considerations that matter for real-world usage, and it is straightforward to adapt to different news websites.