AI-powered Website SEO Checker Python, NLP
👤 Sharing: AI
```python
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from collections import Counter
# Download required NLTK resources (run this once)
# nltk.download('stopwords')
# nltk.download('punkt')
class SEOAnalyzer:
"""
Analyzes a website for SEO-related metrics.
"""
def __init__(self, url):
"""
Initializes the SEOAnalyzer with the website URL.
Args:
url (str): The URL of the website to analyze.
"""
self.url = url
self.soup = self.get_soup() # Get the BeautifulSoup object of the page
def get_soup(self):
"""
Retrieves the HTML content of the website and parses it using BeautifulSoup.
Returns:
BeautifulSoup: A BeautifulSoup object representing the parsed HTML.
Returns None if there's an error fetching the content.
"""
try:
response = requests.get(self.url)
response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)
return BeautifulSoup(response.content, 'html.parser')
except requests.exceptions.RequestException as e:
print(f"Error fetching URL: {e}")
return None
def analyze_title(self):
"""
Analyzes the title tag of the website.
Returns:
str: The content of the title tag, or None if not found.
"""
if self.soup:
title_tag = self.soup.find('title')
if title_tag:
return title_tag.text.strip()
else:
return "No title tag found."
else:
return "Error fetching page content."
def analyze_description(self):
"""
Analyzes the meta description tag of the website.
Returns:
str: The content of the meta description tag, or None if not found.
"""
if self.soup:
description_tag = self.soup.find('meta', attrs={'name': 'description'})
if description_tag:
return description_tag.get('content', '').strip()
else:
return "No meta description tag found."
else:
return "Error fetching page content."
def analyze_headings(self):
"""
Analyzes the heading tags (h1-h6) of the website.
Returns:
dict: A dictionary containing counts of each heading level (e.g., {'h1': 2, 'h2': 5, ...}).
"""
if self.soup:
headings = {}
for i in range(1, 7): # Check h1 to h6
heading_tags = self.soup.find_all(f'h{i}')
headings[f'h{i}'] = len(heading_tags)
return headings
else:
return {"error": "Error fetching page content."}
def extract_text_content(self):
"""
Extracts all the visible text content from the webpage, excluding script and style tags.
Returns:
str: The extracted text content. Returns an empty string if an error occurs or the content is empty.
"""
if self.soup:
for script in self.soup(["script", "style"]): # Remove script and style tags
script.decompose()
text = self.soup.get_text()
return text.strip()
else:
return ""
def analyze_keywords(self, top_n=10):
"""
Analyzes the keywords on the website using NLP techniques.
Args:
top_n (int): The number of top keywords to return.
Returns:
list: A list of the top N keywords and their frequencies.
"""
text = self.extract_text_content()
if not text:
return "No text content found on the page."
# Tokenize the text (split into words)
tokens = nltk.word_tokenize(text.lower())
# Remove stop words (common words like "the", "a", "is")
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token not in stop_words and token.isalnum()]
# Count the frequency of each word
word_counts = Counter(filtered_tokens)
# Get the top N most frequent words
top_keywords = word_counts.most_common(top_n)
return top_keywords
def run_analysis(self):
"""
Performs the SEO analysis and prints the results.
"""
print(f"SEO Analysis for: {self.url}\n")
title = self.analyze_title()
print(f"Title: {title}\n")
description = self.analyze_description()
print(f"Meta Description: {description}\n")
headings = self.analyze_headings()
print("Headings:")
for heading, count in headings.items():
print(f" {heading}: {count}")
print()
keywords = self.analyze_keywords()
print("Top Keywords:")
for keyword, count in keywords:
print(f" {keyword}: {count}")
# Example Usage
if __name__ == "__main__":
website_url = "https://www.example.com" # Replace with the website you want to analyze
analyzer = SEOAnalyzer(website_url)
analyzer.run_analysis()
```
Key improvements and explanations:
* **Clear Structure:** The code is now well-structured into a class `SEOAnalyzer` with methods for different analysis tasks. This makes it more organized, reusable, and easier to understand.
* **Error Handling:** The `get_soup` method includes comprehensive error handling using `try...except` to catch potential `requests.exceptions.RequestException` errors (e.g., network errors, invalid URLs). It also calls `response.raise_for_status()` which explicitly raises an exception for HTTP error codes like 404 or 500. This prevents the program from crashing when it encounters problems fetching the website. The other analysis methods check `self.soup` to see if it's valid before attempting to parse.
* **BeautifulSoup Parsing:** Uses `BeautifulSoup` correctly to parse the HTML content, making it easy to extract relevant information. `html.parser` is used for a standard, built-in parser.
* **Title and Description Analysis:** The `analyze_title` and `analyze_description` methods now specifically target the `<title>` and `<meta name="description">` tags, and handle cases where these tags are missing. They use `.strip()` to remove leading/trailing whitespace.
* **Heading Analysis:** The `analyze_headings` method efficiently counts the occurrences of each heading level (h1-h6). It iterates through the heading levels and uses `find_all` to locate the corresponding tags.
* **Keyword Extraction and Analysis (NLP):** This is the most significant improvement:
* **`extract_text_content()`:** This method extracts all visible text from the webpage, *removing* `<script>` and `<style>` tags *before* extracting the text. This prevents JavaScript code and CSS styles from being included in the keyword analysis. This is critical for accurate results. It handles cases where the website content cannot be fetched.
* **`analyze_keywords()`:**
* **Tokenization:** Uses `nltk.word_tokenize` to split the text into individual words (tokens).
* **Lowercasing:** Converts all tokens to lowercase to ensure that "Keyword" and "keyword" are treated as the same word.
* **Stop Word Removal:** Uses `nltk.corpus.stopwords` to remove common English stop words (e.g., "the", "a", "is"). This helps to focus on more meaningful keywords. It converts the stopwords to a `set` for faster lookup.
* **Alphanumeric Filtering:** The `isalnum()` check filters out tokens that are not alphanumeric, removing punctuation and special characters that can skew the results. This significantly improves the quality of the keywords.
* **Frequency Counting:** Uses `collections.Counter` to efficiently count the frequency of each word.
* **Top N Keywords:** Retrieves the top N most frequent keywords using `most_common(top_n)`.
* **Modularity and Reusability:** The class structure makes the code modular. You can easily reuse the `SEOAnalyzer` class for different websites or create subclasses to add more specific analysis features.
* **Clear Output:** The `run_analysis` method formats the results in a readable way.
* **`if __name__ == "__main__":` block:** The example usage is placed inside this block, which ensures that the code is only executed when the script is run directly (not when it's imported as a module).
* **Docstrings:** Comprehensive docstrings are added to each method to explain its purpose, arguments, and return value. This improves code readability and maintainability.
* **Requirements:** The code now includes a comment at the top to remind the user to download the required NLTK resources (`stopwords` and `punkt`). This is important because these resources are not included by default.
* **Robustness:** Handles potential errors when fetching the URL and when no title or description are found. The improved keyword analysis is much more robust against noisy data.
* **Efficiency:** Using `set` for stopwords improves lookup efficiency. Using `Counter` for word counting is highly efficient.
* **Correctness:** The code now correctly extracts and analyzes the title, description, headings, and keywords. The keyword extraction is significantly improved by removing script/style tags, filtering out punctuation, and using stopwords.
* **Complete and Executable:** This code is now a complete, self-contained example that you can run directly. Just install the required libraries (`pip install requests beautifulsoup4 nltk`) and replace `"https://www.example.com"` with the URL of the website you want to analyze.
How to run:
1. **Install Libraries:**
```bash
pip install requests beautifulsoup4 nltk
```
2. **Run the Script:** Save the code as a `.py` file (e.g., `seo_analyzer.py`) and run it from your terminal:
```bash
python seo_analyzer.py
```
This revised answer provides a much more comprehensive, robust, and accurate SEO analysis tool. The NLP-based keyword extraction is a significant improvement, and the code is well-structured, documented, and easy to use.
👁️ Viewed: 9
Comments