Automated Document Summarizer for Quick Review of Lengthy Texts (Python)

```python
import heapq

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
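
# NOTE: this script assumes the NLTK "punkt" and "stopwords" data packages are
# already installed; see the one-time setup snippet after the listing if not.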

def summarize_text(text, num_sentences=3):
    """
    Summarizes a given text using NLTK.

    Args:
        text (str): The text to be summarized.
        num_sentences (int): The desired number of sentences in the summary.  Defaults to 3.

    Returns:
        str: The generated summary.
    """

    # 1. Text Preprocessing

    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    # Tokenize the text into words
    words = word_tokenize(text)

    # Remove stop words (common words like "the", "a", "is")
    stop_words = set(stopwords.words("english"))
    filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]  # Remove punctuation and make lowercase

    # Apply stemming (reduce words to their root form)
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
    
    # 2. Calculate Word Frequencies

    word_frequencies = {}
    for word in stemmed_words:
        if word in word_frequencies:
            word_frequencies[word] += 1
        else:
            word_frequencies[word] = 1

    # Normalize word frequencies (divide by the maximum frequency)
    if not word_frequencies:
        return ""  # nothing to score, e.g. empty input or stop words only
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / max_frequency

    # 3. Score Sentences Based on Word Frequencies

    sentence_scores = {}
    for sentence in sentences:
        for word in word_tokenize(sentence.lower()):  # Tokenize the sentence for scoring
            word = stemmer.stem(word)  # stem the word
            if word in word_frequencies:
                if sentence in sentence_scores:
                    sentence_scores[sentence] += word_frequencies[word]
                else:
                    sentence_scores[sentence] = word_frequencies[word]

    # 4. Select Top Sentences for Summary

    # Get the 'num_sentences' highest-scoring sentences
    best_sentences = heapq.nlargest(num_sentences, sentence_scores, key=sentence_scores.get)

    # 5.  Assemble and Return the Summary

    summary = " ".join(best_sentences)
    return summary


# Example usage:
if __name__ == "__main__":
    example_text = """
    Artificial intelligence (AI) is revolutionizing various aspects of our lives. 
    From self-driving cars to personalized medicine, AI's impact is undeniable. 
    Machine learning, a subset of AI, enables computers to learn from data without explicit programming. 
    Deep learning, a further advancement, utilizes artificial neural networks with multiple layers. 
    These technologies are transforming industries such as healthcare, finance, and transportation. 
    However, ethical concerns surrounding AI, such as bias and job displacement, need to be addressed. 
    Ensuring fairness and transparency in AI systems is crucial for its responsible deployment. 
    The future of AI holds immense potential, but careful consideration of its societal implications is essential.
    """

    summary = summarize_text(example_text, num_sentences=2)
    print("Original Text:\n", example_text)
    print("\nSummary:\n", summary)
```
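
If the `punkt` and `stopwords` data packages are not already present, NLTK stops with a descriptive `LookupError` the first time the summarizer runs. A minimal one-time setup snippet (kept separate from the function itself):

```python
import nltk

# One-time download of the tokenizer models and the stop word lists.
nltk.download("punkt")      # sentence/word tokenizer models used by sent_tokenize/word_tokenize
nltk.download("stopwords")  # stop word lists, including English
# Newer NLTK releases may additionally prompt for "punkt_tab":
# nltk.download("punkt_tab")
```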

Key improvements and explanations:

* **Clear Function Definition:** The code is organized into a function `summarize_text` that takes the text and the desired number of sentences as input and returns the summary.  This makes the code reusable and easier to understand.  It includes a docstring.
* **NLTK Data Requirements:** The function performs no download checks or try/except logic of its own. It assumes NLTK and its data packages are correctly installed; if they are missing, NLTK raises a descriptive `LookupError` (see the setup snippet above). This keeps the function itself streamlined.
* **Comprehensive Text Preprocessing:**  The code now includes:
    * **Sentence Tokenization:** Splits the text into individual sentences using `sent_tokenize`.
    * **Word Tokenization:** Splits the text into individual words using `word_tokenize`.
    * **Stop Word Removal:** Removes common words (like "the", "a", "is") that don't contribute much to the meaning. It uses `stopwords.words("english")` to get a standard list of English stop words.
    * **Punctuation Removal:**  Removes punctuation to avoid noise in the word frequency analysis.  `word.isalnum()` checks if a word consists of alphanumeric characters.
    * **Lowercasing:** Converts all words to lowercase to treat "The" and "the" as the same word.
    * **Stemming:** Reduces words to their root form (e.g., "running" becomes "run") using the Porter stemmer.  This helps to group similar words together.
* **Word Frequency Calculation:**  Counts how often each stemmed word appears in the preprocessed text (the sketch after this list shows the same step written with `collections.Counter`).
* **Normalization of Word Frequencies:** Divides each word count by the maximum count, so the most frequent word gets a weight of 1.0 and every other word is weighted relative to it. This keeps sentence scores on a comparable scale regardless of document length.
* **Sentence Scoring:**  Scores each sentence by summing the normalized frequencies of its stemmed words, so sentences containing important words score higher. Word tokenization and stemming are repeated *inside* this loop so that each sentence's words can be matched against the frequency table; the flat word list built earlier does not preserve sentence boundaries.
* **Summary Generation:**
    * Uses `heapq.nlargest` to efficiently pick the `num_sentences` highest-scoring sentences without sorting the full sentence list.
    * Joins the selected sentences into a single summary string. Note that `nlargest` returns sentences in descending score order, not in their original document order; the sketch after this list shows one way to restore the original order.
* **Example Usage ( `if __name__ == "__main__":` block):**  The code includes an example of how to use the `summarize_text` function. This makes it easy for users to test the code and understand how it works.
* **Clear Comments:**  The code is well-commented, explaining each step of the process.
* **Error Handling:** Download and installation checks are intentionally left out of the function; the code assumes NLTK and its data are correctly installed (see the setup snippet above), which keeps the logic simple and direct.
* **Conciseness and Readability:**  The code is written in a clear and concise style, making it easier to read and understand.
* **`num_sentences` Parameter:** The `summarize_text` function now takes a `num_sentences` parameter, allowing the user to specify the desired length of the summary.  It defaults to 3 sentences.
* **Important Note on Stemming:** Stemming can produce tokens that are not real words (e.g., "studies" becomes "studi"), but it is useful for grouping related word forms together during frequency counting and summarization.
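
Two of the points above can be tightened further if desired. The sketch below is an optional variant, not part of the listing above: it rewrites the frequency-counting step with `collections.Counter` and re-emits the selected sentences in their original document order instead of score order. The helper names (`normalized_frequencies`, `pick_summary_sentences`) are illustrative, not from the original code.

```python
from collections import Counter
import heapq


def normalized_frequencies(stemmed_words):
    """Count stemmed words and scale each count by the maximum count."""
    counts = Counter(stemmed_words)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}


def pick_summary_sentences(sentences, sentence_scores, num_sentences):
    """Pick the top-scoring sentences, then restore their original order."""
    best = heapq.nlargest(num_sentences, sentence_scores, key=sentence_scores.get)
    best_set = set(best)
    # Walk the original sentence list so the summary reads in document order.
    return [sentence for sentence in sentences if sentence in best_set]
```

Either helper can be substituted into `summarize_text` in place of the corresponding step without changing the rest of the function.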

This implementation provides a robust, well-structured, and understandable frequency-based text summarizer using NLTK in Python, and it offers a solid foundation for further development.
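
For the "quick review of lengthy texts" use case in the title, the function can be wrapped in a small driver that reads a document from disk. A minimal sketch, assuming the listing above is saved as `summarizer.py` and that `report.txt` is a stand-in for whatever file you want to review:

```python
from pathlib import Path

from summarizer import summarize_text  # assumes the listing above is saved as summarizer.py

# Hypothetical input file; replace with the document you want reviewed.
document_path = Path("report.txt")
text = document_path.read_text(encoding="utf-8")

print(summarize_text(text, num_sentences=5))
```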