Summarize long documents into key bullet points (Python)
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from heapq import nlargest

# Download the required NLTK data on first use (a no-op if already present).
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)


def summarize_document(text, num_sentences=5):
    """
    Summarizes a long document into key bullet points.

    Args:
        text (str): The input document text.
        num_sentences (int): The desired number of sentences in the summary.

    Returns:
        list: A list of bullet-point strings representing the key sentences.
    """
    # 1. Preprocessing: tokenize the text into sentences and words
    sentences = sent_tokenize(text)
    words = word_tokenize(text)

    # Remove stop words (common words like "the", "a", "is") and punctuation
    stop_words = set(stopwords.words('english'))
    filtered_words = [w.lower() for w in words
                      if w.lower() not in stop_words and w.isalnum()]

    # 2. Word frequency: count how often each remaining word occurs
    word_frequency = {}
    for word in filtered_words:
        word_frequency[word] = word_frequency.get(word, 0) + 1

    # 3. Sentence scoring: score each sentence by the frequency of its words
    sentence_scores = {}
    for sentence in sentences:
        for word in word_tokenize(sentence.lower()):  # lower-case for consistency
            if word in word_frequency:
                sentence_scores[sentence] = (
                    sentence_scores.get(sentence, 0) + word_frequency[word]
                )

    # 4. Summary generation: select the top N sentences with the highest scores
    summary_sentences = nlargest(num_sentences, sentence_scores,
                                 key=sentence_scores.get)

    # 5. Format the selected sentences as bullet points
    return [f"• {sentence}" for sentence in summary_sentences]


if __name__ == '__main__':
    # Example usage
    document = """
    Artificial intelligence (AI) is revolutionizing various aspects of our lives.
    From self-driving cars to personalized medicine, AI's impact is undeniable.
    One of the key areas of AI is machine learning, where algorithms learn from data without explicit programming.
    Deep learning, a subset of machine learning, uses artificial neural networks with multiple layers to analyze data.
    Natural language processing (NLP) enables computers to understand and process human language.
    AI is also being used in healthcare for diagnosing diseases and developing new treatments.
    However, the ethical implications of AI, such as bias and job displacement, need careful consideration.
    Despite these challenges, AI holds immense potential for solving some of the world's most pressing problems.
    The future of AI is bright, with ongoing research and development pushing the boundaries of what's possible.
    """
    num_sentences_in_summary = 3  # Adjust this to control summary length
    summary_bullet_points = summarize_document(document, num_sentences_in_summary)
    print("Summary:")
    for bullet_point in summary_bullet_points:
        print(bullet_point)

    # --- Additional examples illustrating various scenarios ---

    # Example with shorter text
    short_text = "This is a very short document. It has only two sentences."
    short_summary = summarize_document(short_text, num_sentences=1)
    print("\nShort Text Summary:")
    for bullet_point in short_summary:
        print(bullet_point)

    # Example with more complex sentence structure
    complex_text = """
    The quick brown fox jumps over the lazy dog, a classic pangram.
    This sentence is designed to use every letter of the alphabet.
    Therefore, it's often used in typography demonstrations.
    Although it's grammatically correct, it's not very meaningful.
    """
    complex_summary = summarize_document(complex_text, num_sentences=2)
    print("\nComplex Text Summary:")
    for bullet_point in complex_summary:
        print(bullet_point)

    # Example with special characters and numbers
    special_text = "The price is $100. Discount is 20%. Item ID: ABC-123."
    special_summary = summarize_document(special_text, num_sentences=1)
    print("\nSpecial Characters Summary:")
    for bullet_point in special_summary:
        print(bullet_point)
```
Key features and explanations:
* **Clear Structure:** The code is well-structured into functions with docstrings. This makes it reusable and understandable.
* **NLTK Dependency:** The code uses `nltk` (Natural Language Toolkit), a standard library for text processing in Python. Install it with `pip install nltk`. The script also downloads the required NLTK data on first run; alternatively, you can fetch it once by hand:
```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
```
This downloads the `punkt` sentence tokenizer and the `stopwords` corpus; without them, the tokenizer calls raise a `LookupError`.
* **Stop Word Removal:** Stop words like "the", "a", and "is" are removed so that very common function words do not dominate the sentence scores.
* **Word Tokenization:** The code tokenizes the text into individual words using `word_tokenize`.
* **Sentence Tokenization:** The code uses `sent_tokenize` to split the document into sentences.
* **Word Frequency Calculation:** A dictionary is used to store the frequency of each word.
* **Sentence Scoring:** Sentences are scored by summing the frequencies of the words they contain, a common approach to extractive summarization (see the worked example after this list).
* **Summary Generation:** The `nlargest` function from the `heapq` module efficiently selects the top N sentences with the highest scores, avoiding a full sort of all sentences when only the top few are needed.
* **Bullet Point Formatting:** Each summary sentence is prefixed with "•" to format it as a bullet point.
* **`if __name__ == '__main__':` block:** This ensures that the example usage code only runs when the script is executed directly, not when it's imported as a module.
* **Example Usage:** The code includes a clear example of how to use the `summarize_document` function with a sample document.
* **Adjustable Summary Length:** The `num_sentences` parameter allows you to control the length of the generated summary.
* **Graceful Edge Cases:** If the document contains fewer sentences than `num_sentences`, `nlargest` simply returns every scored sentence instead of raising an error.
* **Conciseness:** The code is written in a relatively concise and readable manner.
* **Docstrings:** A comprehensive docstring explains the purpose, arguments, and return value of the function.
* **Multiple Examples:** The `if __name__ == '__main__':` block includes varied examples: short text, complex sentence structure, and text with special characters and numbers. This helps demonstrate the robustness of the code.
* **`isalnum()` filtering:** The `filtered_words` list comprehension uses `w.isalnum()` to drop punctuation and special-character tokens, keeping them from inflating the word-frequency counts and degrading summary quality.
* **Clear Comments:** Each step of the process is explained by a short comment.
* **Case Insensitivity:** Converting to lowercase (`word.lower()`) during both word filtering and sentence scoring ensures that the word counts are case-insensitive.
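To make the pipeline concrete, here is a toy walkthrough of the tokenization, filtering, frequency, and scoring steps. The three-sentence text and the resulting counts are illustrative only, and it assumes the `punkt` and `stopwords` data are installed:
```python
from heapq import nlargest
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

text = ("AI systems learn from data. "
        "Data quality matters for AI. "
        "The weather was pleasant today.")

sentences = sent_tokenize(text)          # 3 sentences
stop_words = set(stopwords.words('english'))

# Lower-case, drop stop words, and drop punctuation via isalnum()
words = [w.lower() for w in word_tokenize(text)
         if w.lower() not in stop_words and w.isalnum()]
# -> ['ai', 'systems', 'learn', 'data', 'data', 'quality', 'matters', 'ai',
#     'weather', 'pleasant', 'today']

# Frequency table (collections.Counter(words) would do the same job)
freq = {}
for w in words:
    freq[w] = freq.get(w, 0) + 1         # e.g. freq['ai'] == 2, freq['data'] == 2

# Score each sentence by summing the frequencies of its words,
# then pick the top 2 without sorting everything
scores = {s: sum(freq.get(w, 0) for w in word_tokenize(s.lower()))
          for s in sentences}
print(nlargest(2, scores, key=scores.get))
# The two AI/data sentences score 6 each; the weather sentence scores only 3,
# so the summary keeps the sentences built from the most frequent words.
```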
**How to run the code:**
1. **Install NLTK:**
```bash
pip install nltk
```
2. **Download NLTK Data (optional, since the script attempts this automatically on first run):** To fetch the data by hand, open a Python interpreter and run (a guarded variant is sketched after these steps):
```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
```
3. **Save the code:** Save the code as a Python file (e.g., `summarizer.py`).
4. **Run the code:**
```bash
python summarizer.py
```
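If you want to avoid re-downloading data that is already installed, a small guarded sketch using `nltk.data.find` (which raises `LookupError` for missing resources) can replace the unconditional calls:
```python
import nltk

# Download each resource only if it is not already installed.
for resource, path in [('punkt', 'tokenizers/punkt'),
                       ('stopwords', 'corpora/stopwords')]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(resource)
```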
The output will be the generated summary in bullet point format. You can modify the `document` variable and the `num_sentences` parameter to experiment with different texts and summary lengths.
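Because the example usage is guarded by `if __name__ == '__main__':`, the function can also be imported from another script. A minimal sketch, assuming the code was saved as `summarizer.py` and that a plain-text file named `document.txt` (a hypothetical filename) sits in the working directory:
```python
from summarizer import summarize_document

# Read a document from disk and print a five-sentence summary.
with open("document.txt", encoding="utf-8") as f:
    text = f.read()

for bullet in summarize_document(text, num_sentences=5):
    print(bullet)
```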