AI-Powered Chat Summarizer (Python, NLP)

```python
# Import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer  # Optional: For stemming
from heapq import nlargest


# Download necessary NLTK data (run this only once)
# nltk.download('punkt')
# nltk.download('punkt_tab')  # needed on newer NLTK releases
# nltk.download('stopwords')


def summarize_chat(chat_log, num_sentences=3):
    """
    Summarizes a chat log using NLP techniques.

    Args:
        chat_log (str): The chat log text.
        num_sentences (int): The desired number of sentences in the summary.

    Returns:
        str: The summarized chat log.
    """

    # 1. Text Preprocessing
    # ----------------------
    # Tokenization: Split the text into sentences and words.
    sentences = sent_tokenize(chat_log)
    words = word_tokenize(chat_log)

    # Remove stop words (common words like "the", "a", "is" that don't carry much meaning).
    stop_words = set(stopwords.words("english"))
    filtered_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalnum()]

    # Stemming (optional): Reduce words to their root form (e.g., "running" -> "run").
    # This can improve accuracy in some cases but isn't always necessary.
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in filtered_words]  # Use filtered_words here instead to skip stemming.

    # 2. Calculate Word Frequencies
    # -----------------------------
    # Count how often each word appears in the text.
    word_frequencies = {}
    for word in stemmed_words:  # Using stemmed_words for consistency
        if word in word_frequencies:
            word_frequencies[word] += 1
        else:
            word_frequencies[word] = 1


    # Normalize word frequencies:  Divide by the maximum frequency to get values between 0 and 1.
    #  This prevents common words from dominating the score.
    if not word_frequencies:  # Guard: empty input, or input that is all stop words/punctuation
        return ""
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] /= max_frequency


    # 3. Score Sentences Based on Word Frequencies
    # ---------------------------------------------
    # For each sentence, calculate a score based on the sum of the frequencies of its words.
    sentence_scores = {}
    for sentence in sentences:
        sentence_words = word_tokenize(sentence.lower())
        score = 0
        for word in sentence_words:
            stemmed = stemmer.stem(word)  # Stem once so words match the stemmed frequency table
            if stemmed in word_frequencies:
                score += word_frequencies[stemmed]
        sentence_scores[sentence] = score


    # 4. Select the Top N Sentences
    # -------------------------------
    # Choose the sentences with the highest scores to form the summary.
    summary_sentences = nlargest(num_sentences, sentence_scores, key=sentence_scores.get)


    # 5. Join Sentences to Create the Summary
    # ----------------------------------------
    # Restore chronological order so the summary reads like the original chat.
    summary_sentences.sort(key=sentences.index)
    summary = " ".join(summary_sentences)

    return summary


# Example usage
if __name__ == "__main__":
    chat_log = """
    User1: Hi, I'm having trouble understanding the new AI model.
    User2: Hello!  I can help with that. What specific aspects are confusing?
    User1: The layers in the neural network are a bit complex.
    User2: Ah, yes. The layers are indeed intricate. They allow the model to learn complex patterns.
    User1: Can you explain the activation functions?
    User2: Sure, activation functions introduce non-linearity, enabling the model to solve more complex problems. Common ones are ReLU and sigmoid.
    User1: ReLU seems interesting.
    User2: ReLU is very popular due to its efficiency in training deep neural networks.
    User1: Okay, that clarifies things. Thanks!
    User2: You're welcome!  Anything else I can assist with?
    """

    summary = summarize_chat(chat_log, num_sentences=3)
    print("Summary:\n", summary)
```

Key improvements and explanations:

* **Clearer Comments:** Added detailed comments to explain each step of the process.  This makes the code much easier to understand.  The comments also explain the *why* behind the steps, not just *what* is happening.

* **Function for Reusability:**  The core logic is encapsulated within the `summarize_chat` function. This is crucial for making the code reusable and testable.

* **`if __name__ == "__main__":` Block:**  The example usage code is placed within an `if __name__ == "__main__":` block.  This ensures that the example code only runs when the script is executed directly (and not when it's imported as a module into another script).
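
  For instance, assuming the file is saved as `chat_summarizer.py` (the name suggested below), another script could reuse the function without triggering the demo:

  ```python
  # Importing does NOT run the example under `if __name__ == "__main__":`.
  from chat_summarizer import summarize_chat  # assumes the file name used below

  summary = summarize_chat("User1: Hello. User2: Hi there. User1: Bye.", num_sentences=1)
  print(summary)
  ```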

* **Stop Words and Tokenization:** Explicitly uses `nltk.corpus.stopwords` to remove common words, and both `word_tokenize` and `sent_tokenize` for proper text processing. Words are lowercased before comparison so stop-word removal is case-insensitive.
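
  A minimal sketch of this preprocessing step, assuming the NLTK data listed below has already been downloaded:

  ```python
  from nltk.corpus import stopwords
  from nltk.tokenize import word_tokenize

  text = "The model is learning complex patterns."
  stop_words = set(stopwords.words("english"))

  # Tokenize, lowercase, and drop stop words and punctuation.
  tokens = [w.lower() for w in word_tokenize(text)
            if w.lower() not in stop_words and w.isalnum()]
  print(tokens)  # ['model', 'learning', 'complex', 'patterns']
  ```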

* **Stemming (Optional):** Added stemming to reduce words to their root form. Stemming can improve accuracy because it treats suffixed variants of a word as the same token (e.g., "running" and "runs" both become "run"; note that irregular forms like "ran" are not reduced by a stemmer and would require lemmatization instead). The code applies stemming consistently, both during word-frequency calculation and sentence scoring, and notes that the step is optional.
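
  A quick illustration of what `PorterStemmer` actually produces:

  ```python
  from nltk.stem import PorterStemmer

  stemmer = PorterStemmer()
  for word in ["running", "runs", "run", "ran"]:
      print(word, "->", stemmer.stem(word))
  # running -> run
  # runs    -> run
  # run     -> run
  # ran     -> ran   (irregular form; a stemmer does not reduce it)
  ```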

* **Word Frequency Calculation:** Accurately calculates the frequency of each word.  Normalizes the frequencies to prevent common words from dominating the scores.
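
  The same counting-and-normalizing logic can be written more compactly with `collections.Counter`; a sketch equivalent to the loop in the function above:

  ```python
  from collections import Counter

  stemmed_words = ["model", "layer", "layer", "relu"]
  word_frequencies = Counter(stemmed_words)  # Counter({'layer': 2, 'model': 1, 'relu': 1})

  # Normalize by the maximum count so values fall between 0 and 1.
  max_frequency = max(word_frequencies.values())
  normalized = {w: c / max_frequency for w, c in word_frequencies.items()}
  print(normalized)  # {'model': 0.5, 'layer': 1.0, 'relu': 0.5}
  ```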

* **Sentence Scoring:** Calculates a score for each sentence based on the sum of the frequencies of the words in the sentence.  This is the core of the summarization logic.
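
  To make the scoring concrete, here is a tiny worked example with a hypothetical frequency table (using a plain `split()` for brevity):

  ```python
  # Hypothetical normalized word frequencies
  word_frequencies = {"relu": 1.0, "network": 0.5, "train": 0.5}

  sentence = "relu helps train the network"
  # Words absent from the table contribute 0 to the score.
  score = sum(word_frequencies.get(w, 0) for w in sentence.split())
  print(score)  # 2.0 -> sentences rich in frequent words score higher
  ```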

* **`heapq.nlargest`:** Uses `heapq.nlargest` to efficiently select the top N sentences with the highest scores.  This is more efficient than sorting the entire dictionary.
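
  For example:

  ```python
  from heapq import nlargest

  sentence_scores = {"Sentence A": 2.0, "Sentence B": 0.5, "Sentence C": 1.2}

  # Top 2 sentences by score: O(n log k) instead of a full O(n log n) sort.
  top = nlargest(2, sentence_scores, key=sentence_scores.get)
  print(top)  # ['Sentence A', 'Sentence C']
  ```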

* **Example Chat Log:**  Provides a sample chat log to demonstrate the summarization process.  This allows you to immediately run the code and see the results.

* **Handles Punctuation and Non-Words:** The `isalnum()` check in the word-filtering step removes punctuation and other non-alphanumeric tokens, so they never enter the frequency table.

* **Lowercase Conversion:** Converts words to lowercase so they are counted correctly regardless of capitalization.

* **Clearer Variable Names:** Uses more descriptive variable names (e.g., `word_frequencies` instead of just `freq`).

**How to Run the Code:**

1. **Install NLTK:**  If you don't have it already, install NLTK:
   ```bash
   pip install nltk
   ```

2. **Download NLTK Data:** Run the following in your Python interpreter (only needs to be done once):
   ```python
   import nltk
   nltk.download('punkt')
   nltk.download('punkt_tab')  # needed on newer NLTK releases
   nltk.download('stopwords')
   ```

3. **Save the Code:** Save the code as a Python file (e.g., `chat_summarizer.py`).

4. **Run the Script:** Execute the script from your terminal:
   ```bash
   python chat_summarizer.py
   ```

The output will be the summarized chat log.  You can adjust the `chat_log` variable and the `num_sentences` parameter to experiment with different inputs and summary lengths.
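
For example, a one-sentence summary of the same chat log:

```python
# Shorter summary of the same chat log
short_summary = summarize_chat(chat_log, num_sentences=1)
print("Short summary:\n", short_summary)
```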