Automated Social Media Content Moderator Using NLP and Image Recognition (Python)
```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from PIL import Image
import pytesseract
import re
import os
import requests
from bs4 import BeautifulSoup

# Ensure required NLTK resources are downloaded (run this once)
try:
    nltk.data.find('sentiment/vader_lexicon.zip')
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('vader_lexicon')
    nltk.download('punkt')


class SocialMediaContentModerator:
    """
    A class for moderating social media content using NLP and image recognition.
    """

    def __init__(self, offensive_word_list_path="offensive_words.txt", api_keys=None):
        """
        Initializes the moderator with an offensive word list and API keys.

        Args:
            offensive_word_list_path (str): Path to the file containing a list of offensive words.
            api_keys (dict): Dictionary containing API keys for external services like Google Cloud Vision.
                Example: {'google_cloud_vision': 'YOUR_GOOGLE_CLOUD_VISION_API_KEY'}
        """
        self.sentiment_analyzer = SentimentIntensityAnalyzer()
        self.offensive_words = self.load_offensive_words(offensive_word_list_path)
        self.api_keys = api_keys if api_keys else {}  # Store API keys, or an empty dict if none are provided.

        # Configure the Tesseract path (if tesseract is not on your PATH,
        # replace this with the full path to tesseract.exe on your system).
        tesseract_path = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
        if os.path.exists(tesseract_path):
            pytesseract.pytesseract.tesseract_cmd = tesseract_path

    def load_offensive_words(self, file_path):
        """
        Loads a list of offensive words from a text file.

        Args:
            file_path (str): Path to the file containing offensive words (one word per line).

        Returns:
            set: A set of offensive words.
        """
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                return set(word.strip().lower() for word in f)
        except FileNotFoundError:
            print(f"Warning: Offensive word list file not found at {file_path}. "
                  f"Moderation may be less effective.")
            return set()  # Return an empty set to prevent errors.

    def analyze_text_sentiment(self, text):
        """
        Analyzes the sentiment of the given text using VADER.

        Args:
            text (str): The text to analyze.

        Returns:
            dict: Sentiment scores (positive, negative, neutral, compound).
        """
        return self.sentiment_analyzer.polarity_scores(text)

    def contains_offensive_words(self, text):
        """
        Checks if the text contains any offensive words from the loaded list.

        Args:
            text (str): The text to check.

        Returns:
            bool: True if the text contains offensive words, False otherwise.
        """
        # Extract whole words with a regex for better handling of punctuation.
        words = re.findall(r'\b\w+\b', text.lower())
        return any(word in self.offensive_words for word in words)

    def moderate_text_content(self, text, sentiment_threshold=-0.8):
        """
        Moderates text content based on sentiment and offensive word detection.

        Args:
            text (str): The text content to moderate.
            sentiment_threshold (float): If the compound sentiment score falls
                below this threshold, the content is flagged.

        Returns:
            dict: Moderation results ('flagged', 'reason').
        """
        sentiment_scores = self.analyze_text_sentiment(text)
        if sentiment_scores['compound'] < sentiment_threshold:
            return {'flagged': True, 'reason': 'Negative sentiment'}
        elif self.contains_offensive_words(text):
            return {'flagged': True, 'reason': 'Offensive language'}
        return {'flagged': False, 'reason': 'Clean'}

    def extract_text_from_image(self, image_path):
        """
        Extracts text from an image using Tesseract OCR.

        Args:
            image_path (str): The path to the image file.

        Returns:
            str: The extracted text, or an empty string if extraction fails.
        """
        try:
            img = Image.open(image_path)
            return pytesseract.image_to_string(img).strip()
        except Exception as e:
            print(f"Error extracting text from image: {e}")
            return ""  # Return an empty string in case of error.

    def moderate_image_content(self, image_path, ocr_enabled=True, sentiment_threshold=-0.8):
        """
        Moderates image content by extracting any embedded text and moderating it.

        Args:
            image_path (str): The path to the image file.
            ocr_enabled (bool): Whether to perform OCR to extract text from the image.
            sentiment_threshold (float): Sentiment threshold for flagging content.

        Returns:
            dict: Moderation results.
        """
        if ocr_enabled:
            extracted_text = self.extract_text_from_image(image_path)
            if extracted_text:
                print(f"Extracted text from image: {extracted_text}")
                return self.moderate_text_content(extracted_text, sentiment_threshold)
            print("No text found in image.")
        # If OCR is disabled or no text is extracted, image-specific moderation
        # (e.g., a visual content moderation API) could be added here.
        return {'flagged': False, 'reason': 'No text to analyze'}

    def moderate_url_content(self, url, sentiment_threshold=-0.8):
        """
        Moderates content from a URL by fetching the page text and moderating it.

        Args:
            url (str): The URL to fetch content from.
            sentiment_threshold (float): Sentiment threshold for flagging content.

        Returns:
            dict: Moderation results.
        """
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx).
            soup = BeautifulSoup(response.content, 'html.parser')
            text = soup.get_text(separator=' ', strip=True)  # Extract visible text from the HTML.
            if text:
                return self.moderate_text_content(text, sentiment_threshold)
            return {'flagged': False, 'reason': 'No text found on the page'}
        except requests.exceptions.RequestException as e:
            print(f"Error fetching content from URL: {e}")
            return {'flagged': False, 'reason': 'Error fetching URL content'}

    def moderate_multiple_texts(self, texts, sentiment_threshold=-0.8):
        """
        Moderates a list of text contents.

        Args:
            texts (list): A list of text contents to moderate.
            sentiment_threshold (float): Sentiment threshold for flagging content.

        Returns:
            list: A list of moderation-result dictionaries, one per text.
        """
        return [self.moderate_text_content(text, sentiment_threshold) for text in texts]


# Example usage
if __name__ == '__main__':
    # Create an instance of the moderator (adjust the path if needed).
    moderator = SocialMediaContentModerator(offensive_word_list_path="offensive_words.txt")

    # Example 1: Moderating text content
    texts = [
        "This is a great day!",
        "I hate this stupid thing. It's awful!",
        "You are a terrible person!",
        "This product is fantastic and I love it.",
        "Get wrecked loser! You're so dumb.",
    ]
    for text in texts:
        result = moderator.moderate_text_content(text)
        print(f"Text: '{text}' - Moderation Result: {result}")

    # Example 2: Moderating image content
    image_path = "image_with_text.png"  # Replace with the actual path to your image.
    if os.path.exists(image_path):  # Check that the image file exists first.
        image_result = moderator.moderate_image_content(image_path)
        print(f"Image: '{image_path}' - Moderation Result: {image_result}")
    else:
        print(f"Image file not found at: {image_path}. Skipping image moderation example.")

    # Example 3: Moderating URL content
    url = "https://www.example.com"  # Replace with a real URL.
    url_result = moderator.moderate_url_content(url)
    print(f"URL: '{url}' - Moderation Result: {url_result}")

    # Example 4: Moderating multiple texts at once
    multiple_results = moderator.moderate_multiple_texts(texts)
    print("\nModeration results for multiple texts:")
    for i, (text, result) in enumerate(zip(texts, multiple_results), start=1):
        print(f"Text {i}: '{text}' - Result: {result}")
```
Key improvements and explanations:
* **Clear Class Structure:** The code is organized into a `SocialMediaContentModerator` class, making it more modular and reusable. This is best practice for larger programs.
* **Offensive Word List:** The code now loads offensive words from a file. This allows you to easily update the list of offensive terms without modifying the code itself. A `try-except` block is included to handle the case where the file isn't found, providing a helpful warning message. Loading the words into a `set` makes lookups faster. The words are converted to lowercase during loading, ensuring case-insensitive matching.
* **Sentiment Analysis:** Uses NLTK's VADER sentiment analyzer to determine the sentiment of the text.
* **Offensive Word Detection:** Checks if the text contains any offensive words from the loaded list. The `contains_offensive_words` function now uses a regular expression (`re.findall(r'\b\w+\b', text)`) to extract words, improving the accuracy of the check by handling punctuation correctly. The text is converted to lowercase before checking for offensive words to ensure case-insensitive matching.
* **Text Moderation:** Combines sentiment analysis and offensive word detection to moderate text content. You can adjust the `sentiment_threshold` to control how sensitive the moderation is.
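The offensive-word path can be exercised in isolation; a minimal sketch (the hard-coded word set here is a stand-in for the contents of `offensive_words.txt`):

```python
import re

# Stand-in for the set loaded from offensive_words.txt.
offensive_words = {"stupid", "dumb", "loser"}

def contains_offensive_words(text, word_set):
    """Case-insensitive whole-word check, mirroring the class method."""
    words = re.findall(r'\b\w+\b', text.lower())
    return any(word in word_set for word in words)

print(contains_offensive_words("You're so DUMB.", offensive_words))        # True
print(contains_offensive_words("A dumbwaiter is a lift.", offensive_words))  # False
```

Note that `\b` word boundaries prevent substring false positives: "dumbwaiter" does not match "dumb".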
* **Image Moderation with OCR:**
* **Tesseract OCR:** Includes code to extract text from images using Tesseract OCR. Crucially, the OCR call is wrapped in error handling, so a missing Tesseract install or an unreadable image produces a warning message instead of a crash.
* **Tesseract Path Configuration:** **Important:** Added a line to explicitly set the path to `tesseract.exe`. This is *essential* for the code to work if Tesseract isn't in your system's PATH environment variable. **You MUST replace the placeholder path with the correct location of `tesseract.exe` on your system.**
* **Error Handling for OCR:** Includes a `try-except` block to catch errors during OCR and return an empty string if the extraction fails, preventing crashes.
* **`ocr_enabled` Flag:** Adds an `ocr_enabled` flag to `moderate_image_content`. If set to `False`, OCR will be skipped. This is useful for scenarios where you don't want to perform OCR (e.g., for performance reasons or if you're moderating image *content* rather than text *in* the image). If OCR is disabled or fails to extract text, the function returns a default result indicating that no text was analyzed, but you could add image-specific analysis using a visual content moderation API.
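Hard-coding a Windows path makes the script fragile on other platforms. One hedged alternative is to resolve the binary with the standard library's `shutil.which`; the fallback below is just the default Windows install location, an assumption you should adjust for your machine:

```python
import os
import shutil

def find_tesseract(fallback=r'C:\Program Files\Tesseract-OCR\tesseract.exe'):
    """Return a usable tesseract command, preferring the system PATH."""
    found = shutil.which('tesseract')  # None if not on PATH
    if found:
        return found
    if os.path.exists(fallback):
        return fallback
    return None  # Caller should warn and skip OCR.

cmd = find_tesseract()
if cmd is None:
    print("Tesseract not found; OCR will be unavailable.")
```

If a command is found, assign it to `pytesseract.pytesseract.tesseract_cmd` before calling `image_to_string`.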
* **URL Content Moderation:**
* **Requests Library:** Uses the `requests` library to fetch content from a URL. This is the standard way to make HTTP requests in Python. You'll need to install it: `pip install requests beautifulsoup4`
* **BeautifulSoup4:** Uses `BeautifulSoup4` to parse the HTML content of the page and extract the text. This is much more robust than trying to use regular expressions to extract text from HTML.
* **Error Handling:** Includes error handling for network issues (e.g., the URL doesn't exist or the server is down) using `try...except requests.exceptions.RequestException`. It also uses `response.raise_for_status()` to check for HTTP errors (4xx or 5xx status codes).
* **Text Extraction:** Uses `soup.get_text()` to extract all the visible text from the HTML, with options to strip whitespace and separate text blocks.
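BeautifulSoup is the right tool here, but in environments where installing it isn't an option, the standard library's `html.parser` can do a rough version of the same extraction. A sketch, without BeautifulSoup's robustness on malformed HTML:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # Depth counter for script/style nesting.

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

parser = TextExtractor()
parser.feed("<html><body><p>Hello</p><script>var x=1;</script><p>world</p></body></html>")
print(' '.join(parser.parts))  # Hello world
```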
* **Multiple Text Moderation:** Added a function `moderate_multiple_texts` to process a list of texts, returning a list of results. This makes it easier to moderate a large batch of content.
* **API Key Handling:** Added a placeholder for API keys. If you want to use cloud-based image analysis services (like Google Cloud Vision), you'll need to provide your API key. The code now handles cases where no API keys are provided gracefully.
* **Clearer Output:** The example usage now prints the text being moderated along with the result, making it easier to understand the output.
* **Comments and Docstrings:** Includes detailed comments and docstrings to explain the code's functionality.
* **Example Usage:** The `if __name__ == '__main__':` block provides a clear example of how to use the `SocialMediaContentModerator` class. It now includes checks to see if the image file exists before trying to moderate it, preventing errors.
* **Encoding:** Uses `encoding='utf-8'` when opening the offensive words file, allowing it to handle a wider range of characters.
* **Dependency Installation:** Reminds the user to install necessary libraries using `pip`.
* **Conciseness:** Combines the offensive words loading and validation into one function to avoid repetition.
* **VADER resources:** Ensures that the required NLTK resources (`vader_lexicon` and `punkt`) are downloaded before first use.
To run this code:
1. **Install Libraries:**
```bash
pip install nltk Pillow pytesseract requests beautifulsoup4
```
2. **Install Tesseract OCR:** Download and install Tesseract OCR from a reliable source (e.g., [https://github.com/UB-Mannheim/tesseract/wiki](https://github.com/UB-Mannheim/tesseract/wiki)). Make sure to add the Tesseract installation directory to your system's PATH environment variable, *or* set the `tesseract_cmd` variable in the code to the full path to `tesseract.exe`.
3. **Create `offensive_words.txt`:** Create a file named `offensive_words.txt` in the same directory as your Python script. Add one offensive word per line. For example:
```
badword1
badword2
stupid
hate
...
```
4. **Create `image_with_text.png`:** Create a sample image named `image_with_text.png` that contains some text. Place it in the same directory. Or, change the `image_path` variable to point to a different image.
5. **Run the Script:** Execute the Python script.
```bash
python your_script_name.py
```
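Step 3 can be sanity-checked before running the full script. This sketch writes a small sample word file (the filename is just an example) and confirms it loads into a lowercase set the same way `load_offensive_words` does:

```python
import os
import tempfile

# Write a sample word list, one entry per line (mixed case on purpose).
path = os.path.join(tempfile.gettempdir(), "offensive_words_sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Badword1\nSTUPID\nhate\n")

# Load it the way the moderator does: strip, lowercase, into a set.
with open(path, "r", encoding="utf-8") as f:
    words = set(line.strip().lower() for line in f)

print(words == {"badword1", "stupid", "hate"})  # True
```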
This improved version provides a more robust, accurate, and user-friendly social media content moderation solution. Remember to adapt the paths, API keys, and offensive word list to your specific needs. Also, remember that content moderation is a complex problem, and this code provides a basic starting point. You may need to explore more advanced techniques for real-world applications.