Automated Social Media Content Moderator Using NLP and Image Recognition (Python)

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from PIL import Image
import pytesseract
import re
import os
import requests
from bs4 import BeautifulSoup
# Ensure required NLTK resources are available (each downloads only once)
for resource, path in [('vader_lexicon', 'sentiment/vader_lexicon.zip'),
                       ('punkt', 'tokenizers/punkt')]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(resource)


class SocialMediaContentModerator:
    """
    A class for moderating social media content using NLP and image recognition.
    """

    def __init__(self, offensive_word_list_path="offensive_words.txt", api_keys=None):
        """
        Initializes the moderator with offensive word list and API keys.

        Args:
            offensive_word_list_path (str): Path to the file containing a list of offensive words.
            api_keys (dict): Dictionary containing API keys for external services like Google Cloud Vision.
                             Example: {'google_cloud_vision': 'YOUR_GOOGLE_CLOUD_VISION_API_KEY'}
        """
        self.sentiment_analyzer = SentimentIntensityAnalyzer()
        self.offensive_words = self.load_offensive_words(offensive_word_list_path)
        self.api_keys = api_keys if api_keys else {}  # Store API keys or an empty dict if none are provided.

        # If Tesseract is not on your system PATH, point pytesseract at the
        # binary explicitly (Windows example shown; adjust for your install)
        if os.name == 'nt':
            pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

    def load_offensive_words(self, file_path):
        """
        Loads a list of offensive words from a text file.

        Args:
            file_path (str): Path to the file containing offensive words (one word per line).

        Returns:
            set: A set of offensive words.
        """
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                return set(word.strip().lower() for word in f)
        except FileNotFoundError:
            print(f"Warning: Offensive word list file not found at {file_path}.  Moderation may be less effective.")
            return set()  # Return an empty set to prevent errors.


    def analyze_text_sentiment(self, text):
        """
        Analyzes the sentiment of the given text using VADER.

        Args:
            text (str): The text to analyze.

        Returns:
            dict: A dictionary containing sentiment scores (positive, negative, neutral, compound).
        """
        scores = self.sentiment_analyzer.polarity_scores(text)
        return scores

    def contains_offensive_words(self, text):
        """
        Checks if the text contains any offensive words from the loaded list.

        Args:
            text (str): The text to check.

        Returns:
            bool: True if the text contains offensive words, False otherwise.
        """
        text = text.lower()
        words = re.findall(r'\b\w+\b', text)  # Extract words using regex for better handling of punctuation
        return any(word in self.offensive_words for word in words)

    def moderate_text_content(self, text, sentiment_threshold=-0.8):
        """
        Moderates text content based on sentiment and offensive word detection.

        Args:
            text (str): The text content to moderate.
            sentiment_threshold (float): Threshold for negative sentiment.  If the compound sentiment score
                                        is below this threshold, the content is flagged.

        Returns:
            dict: A dictionary containing moderation results (flagged, reason).
        """
        sentiment_scores = self.analyze_text_sentiment(text)
        is_offensive = self.contains_offensive_words(text)

        if sentiment_scores['compound'] < sentiment_threshold:
            return {'flagged': True, 'reason': 'Negative sentiment'}
        elif is_offensive:
            return {'flagged': True, 'reason': 'Offensive language'}
        else:
            return {'flagged': False, 'reason': 'Clean'}

    def extract_text_from_image(self, image_path):
        """
        Extracts text from an image using Tesseract OCR.

        Args:
            image_path (str): The path to the image file.

        Returns:
            str: The extracted text from the image.  Returns an empty string if extraction fails.
        """
        try:
            img = Image.open(image_path)
            text = pytesseract.image_to_string(img)
            return text.strip()
        except Exception as e:
            print(f"Error extracting text from image: {e}")
            return ""  # Return an empty string on error


    def moderate_image_content(self, image_path, ocr_enabled=True, sentiment_threshold=-0.8):
        """
        Moderates image content by extracting text and moderating it.

        Args:
            image_path (str): The path to the image file.
            ocr_enabled (bool): Whether to perform OCR to extract text from the image.
            sentiment_threshold (float): Sentiment threshold for flagging content.

        Returns:
            dict: A dictionary containing moderation results.
        """
        extracted_text = ""
        if ocr_enabled:
            extracted_text = self.extract_text_from_image(image_path)
            if extracted_text:
                print(f"Extracted text from image: {extracted_text}")
                return self.moderate_text_content(extracted_text, sentiment_threshold)
            else:
                print("No text found in image.")

        # If OCR is disabled or no text is extracted, you can add image-specific moderation here
        # (e.g., using a visual content moderation API)

        return {'flagged': False, 'reason': 'No text to analyze'}  # Default if no OCR or image analysis performed

    def moderate_url_content(self, url, sentiment_threshold=-0.8):
        """
        Moderates content from a URL by fetching the text and moderating it.

        Args:
            url (str): The URL to fetch content from.
            sentiment_threshold (float): Sentiment threshold for flagging content.

        Returns:
            dict: A dictionary containing moderation results.
        """
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
            soup = BeautifulSoup(response.content, 'html.parser')
            text = soup.get_text(separator=' ', strip=True)  # Extract text from the HTML

            if text:
                return self.moderate_text_content(text, sentiment_threshold)
            else:
                return {'flagged': False, 'reason': 'No text found on the page'}

        except requests.exceptions.RequestException as e:
            print(f"Error fetching content from URL: {e}")
            return {'flagged': False, 'reason': 'Error fetching URL content'}

    def moderate_multiple_texts(self, texts, sentiment_threshold=-0.8):
        """
        Moderates a list of text contents.

        Args:
            texts (list): A list of text contents to moderate.
            sentiment_threshold (float): Sentiment threshold for flagging content.

        Returns:
            list: A list of dictionaries, each containing moderation results for a text.
        """
        results = []
        for text in texts:
            results.append(self.moderate_text_content(text, sentiment_threshold))
        return results


# Example Usage
if __name__ == '__main__':
    # Create an instance of the moderator (you might need to adjust the path)
    moderator = SocialMediaContentModerator(offensive_word_list_path="offensive_words.txt")

    # Example 1: Moderating text content
    text1 = "This is a great day!"
    text2 = "I hate this stupid thing.  It's awful!"
    text3 = "You are a terrible person!"
    text4 = "This product is fantastic and I love it."
    text5 = "Get wrecked loser! You're so dumb."


    result1 = moderator.moderate_text_content(text1)
    result2 = moderator.moderate_text_content(text2)
    result3 = moderator.moderate_text_content(text3)
    result4 = moderator.moderate_text_content(text4)
    result5 = moderator.moderate_text_content(text5)

    print(f"Text: '{text1}' - Moderation Result: {result1}")
    print(f"Text: '{text2}' - Moderation Result: {result2}")
    print(f"Text: '{text3}' - Moderation Result: {result3}")
    print(f"Text: '{text4}' - Moderation Result: {result4}")
    print(f"Text: '{text5}' - Moderation Result: {result5}")

    # Example 2: Moderating image content
    image_path = "image_with_text.png"  # Replace with the actual path to your image
    if os.path.exists(image_path):  # Check if the image file exists
        image_result = moderator.moderate_image_content(image_path)
        print(f"Image: '{image_path}' - Moderation Result: {image_result}")
    else:
        print(f"Image file not found at: {image_path}.  Skipping image moderation example.")

    # Example 3: Moderating URL content
    url = "https://www.example.com"  # Replace with a real URL
    url_result = moderator.moderate_url_content(url)
    print(f"URL: '{url}' - Moderation Result: {url_result}")

    # Example 4: Moderating multiple texts
    texts_to_moderate = [text1, text2, text3, text4, text5]
    multiple_results = moderator.moderate_multiple_texts(texts_to_moderate)
    print(f"\nModeration results for multiple texts:")
    for i, result in enumerate(multiple_results):
        print(f"Text {i+1}: '{texts_to_moderate[i]}' - Result: {result}")


```

Key improvements and explanations:

* **Clear Class Structure:** The code is organized into a `SocialMediaContentModerator` class, making it more modular and reusable.  This is best practice for larger programs.
* **Offensive Word List:**  The code now loads offensive words from a file. This allows you to easily update the list of offensive terms without modifying the code itself.  A `try-except` block is included to handle the case where the file isn't found, providing a helpful warning message.  Loading the words into a `set` makes lookups faster. The words are converted to lowercase during loading, ensuring case-insensitive matching.
* **Sentiment Analysis:** Uses NLTK's VADER sentiment analyzer to determine the sentiment of the text.
* **Offensive Word Detection:** Checks if the text contains any offensive words from the loaded list.  The `contains_offensive_words` function now uses a regular expression (`re.findall(r'\b\w+\b', text)`) to extract words, improving the accuracy of the check by handling punctuation correctly. The text is converted to lowercase before checking for offensive words to ensure case-insensitive matching.
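
  The matching strategy can be seen in isolation with a minimal, self-contained sketch (the word set below is a hypothetical stand-in for the loaded file):

  ```python
  import re

  # Hypothetical blocklist; the real moderator loads this from offensive_words.txt.
  OFFENSIVE = {"stupid", "hate"}

  def contains_offensive_words(text):
      """Case-insensitive whole-word match against a blocklist."""
      words = re.findall(r'\b\w+\b', text.lower())
      return any(word in OFFENSIVE for word in words)

  print(contains_offensive_words("I HATE this!"))        # True: matches despite case and punctuation
  print(contains_offensive_words("The hateful truth."))  # False: 'hateful' is not a whole-word match
  ```

  Note that `\b\w+\b` gives whole-word matching, so substrings like "hateful" do not trigger a false positive; a plain `"hate" in text` check would.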
* **Text Moderation:** Combines sentiment analysis and offensive word detection to moderate text content.  You can adjust the `sentiment_threshold` to control how sensitive the moderation is.
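
  The flagging decision itself is plain threshold logic; here it is sketched decoupled from VADER (the compound score is passed in as a stand-in value, not real analyzer output):

  ```python
  def moderate(compound_score, is_offensive, sentiment_threshold=-0.8):
      """Flag on strong negative sentiment first, then on offensive language."""
      if compound_score < sentiment_threshold:
          return {'flagged': True, 'reason': 'Negative sentiment'}
      if is_offensive:
          return {'flagged': True, 'reason': 'Offensive language'}
      return {'flagged': False, 'reason': 'Clean'}

  print(moderate(-0.95, False))  # {'flagged': True, 'reason': 'Negative sentiment'}
  print(moderate(0.2, True))     # {'flagged': True, 'reason': 'Offensive language'}
  print(moderate(0.2, False))    # {'flagged': False, 'reason': 'Clean'}
  ```

  One design consequence worth noting: the sentiment check runs first, so content that is both very negative and offensive reports `'Negative sentiment'` as its reason.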
* **Image Moderation with OCR:**
    * **Tesseract OCR:** Includes code to extract text from images using Tesseract OCR.  *Crucially*, it catches any exception raised while opening the image or running OCR (for example, a missing file or a misconfigured Tesseract path) and prints an error message instead of crashing.
    * **Tesseract Path Configuration:**  **Important:** Added a line to explicitly set the path to `tesseract.exe`.  This is *essential* for the code to work if Tesseract isn't in your system's PATH environment variable.  **You MUST replace the placeholder path with the correct location of `tesseract.exe` on your system.**
    * **Error Handling for OCR:**  Includes a `try-except` block to catch errors during OCR and return an empty string if the extraction fails, preventing crashes.
    * **`ocr_enabled` Flag:** Adds an `ocr_enabled` flag to `moderate_image_content`.  If set to `False`, OCR will be skipped.  This is useful for scenarios where you don't want to perform OCR (e.g., for performance reasons or if you're moderating image *content* rather than text *in* the image).  If OCR is disabled or fails to extract text, the function returns a default result indicating that no text was analyzed, but you could add image-specific analysis using a visual content moderation API.
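
  The OCR-then-moderate dispatch in `moderate_image_content` reduces to a small control-flow pattern. The sketch below uses stubbed callables (hypothetical stand-ins for `extract_text_from_image` and `moderate_text_content`) so the logic can be followed without a Tesseract install:

  ```python
  def moderate_image(extract_text, moderate_text, ocr_enabled=True):
      """If OCR is enabled and yields text, moderate that text; otherwise fall through."""
      if ocr_enabled:
          text = extract_text()
          if text:
              return moderate_text(text)
      # OCR disabled, or no text found: a visual content moderation API could go here.
      return {'flagged': False, 'reason': 'No text to analyze'}

  fake_ocr = lambda: "get wrecked loser"  # stand-in for pytesseract output
  fake_moderator = lambda text: {'flagged': True, 'reason': 'Offensive language'}

  print(moderate_image(fake_ocr, fake_moderator))                     # text path taken
  print(moderate_image(lambda: "", fake_moderator))                   # OCR found nothing
  print(moderate_image(fake_ocr, fake_moderator, ocr_enabled=False))  # OCR skipped entirely
  ```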
* **URL Content Moderation:**
    * **Requests Library:** Uses the `requests` library to fetch content from a URL.  This is the standard way to make HTTP requests in Python.  You'll need to install it: `pip install requests beautifulsoup4`
    * **BeautifulSoup4:** Uses `BeautifulSoup4` to parse the HTML content of the page and extract the text.  This is much more robust than trying to use regular expressions to extract text from HTML.
    * **Error Handling:** Includes error handling for network issues (e.g., the URL doesn't exist or the server is down) using `try...except requests.exceptions.RequestException`.  It also uses `response.raise_for_status()` to check for HTTP errors (4xx or 5xx status codes).
    * **Text Extraction:**  Uses `soup.get_text()` to extract all the visible text from the HTML, with options to strip whitespace and separate text blocks.
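
  If installing `bs4` is not an option, the same extraction can be approximated with the standard library's `html.parser`; a minimal sketch that collects visible text and skips `<script>`/`<style>` contents (which `soup.get_text()` would otherwise include unless you decompose those tags first):

  ```python
  from html.parser import HTMLParser

  class TextExtractor(HTMLParser):
      """Collect visible text, skipping <script> and <style> contents."""
      def __init__(self):
          super().__init__()
          self.parts = []
          self._skip = 0

      def handle_starttag(self, tag, attrs):
          if tag in ('script', 'style'):
              self._skip += 1

      def handle_endtag(self, tag):
          if tag in ('script', 'style') and self._skip:
              self._skip -= 1

      def handle_data(self, data):
          if not self._skip and data.strip():
              self.parts.append(data.strip())

  extractor = TextExtractor()
  extractor.feed("<html><body><h1>Hi</h1><script>var x=1;</script><p>World</p></body></html>")
  print(' '.join(extractor.parts))  # Hi World
  ```

  BeautifulSoup remains the more robust choice for messy real-world HTML; this is only a dependency-free fallback.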
* **Multiple Text Moderation:** Added a function `moderate_multiple_texts` to process a list of texts, returning a list of results. This makes it easier to moderate a large batch of content.
* **API Key Handling:**  Added a placeholder for API keys.  If you want to use cloud-based image analysis services (like Google Cloud Vision), you'll need to provide your API key.  The code now handles cases where no API keys are provided gracefully.
* **Clearer Output:**  The example usage now prints the text being moderated along with the result, making it easier to understand the output.
* **Comments and Docstrings:**  Includes detailed comments and docstrings to explain the code's functionality.
* **Example Usage:**  The `if __name__ == '__main__':` block provides a clear example of how to use the `SocialMediaContentModerator` class.  It now includes checks to see if the image file exists before trying to moderate it, preventing errors.
* **Encoding:** Uses `encoding='utf-8'` when opening the offensive words file, allowing it to handle a wider range of characters.
* **Dependency Installation:** Reminds the user to install necessary libraries using `pip`.
* **Conciseness:** Combines the offensive words loading and validation into one function to avoid repetition.
* **VADER resources:** Ensures that the required NLTK resources (the VADER lexicon and the Punkt tokenizer) are downloaded on first run.

To run this code:

1.  **Install Libraries:**
    ```bash
    pip install nltk Pillow pytesseract requests beautifulsoup4
    ```

2.  **Install Tesseract OCR:** Download and install Tesseract OCR from a reliable source (e.g., [https://github.com/UB-Mannheim/tesseract/wiki](https://github.com/UB-Mannheim/tesseract/wiki)). Make sure to add the Tesseract installation directory to your system's PATH environment variable, *or* set the `tesseract_cmd` variable in the code to the full path to `tesseract.exe`.

3.  **Create `offensive_words.txt`:** Create a file named `offensive_words.txt` in the same directory as your Python script.  Add one offensive word per line.  For example:

    ```
    badword1
    badword2
    stupid
    hate
    ...
    ```
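
    Because `load_offensive_words` normalizes each line (strips whitespace, lowercases), mixed-case or padded entries in the file still match. A quick self-contained demonstration using a temporary file:

    ```python
    import os
    import tempfile

    # Write a sample word list, then load it the same way the moderator does.
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as f:
        f.write("Stupid\nHATE \nbadword1\n")
        path = f.name

    with open(path, 'r', encoding='utf-8') as f:
        offensive = set(word.strip().lower() for word in f)
    os.unlink(path)

    print(sorted(offensive))  # ['badword1', 'hate', 'stupid']
    ```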

4.  **Create `image_with_text.png`:**  Create a sample image named `image_with_text.png` that contains some text.  Place it in the same directory.  Or, change the `image_path` variable to point to a different image.

5.  **Run the Script:**  Execute the Python script.

```bash
python your_script_name.py
```

This improved version provides a more robust, accurate, and user-friendly social media content moderation solution. Remember to adapt the paths, API keys, and offensive word list to your specific needs.  Also, remember that content moderation is a complex problem, and this code provides a basic starting point. You may need to explore more advanced techniques for real-world applications.