Automated Exam Question Generator Based on Curriculum Analysis (Python)
```python
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import random
# Ensure necessary NLTK resources are downloaded
try:
    nltk.data.find('tokenizers/punkt')
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('punkt')
    nltk.download('punkt_tab')  # newer NLTK releases also need punkt_tab for tokenization
    nltk.download('stopwords')
class ExamQuestionGenerator:
"""
A class to generate exam questions automatically from a given curriculum text.
"""
def __init__(self, curriculum_text, num_questions=5): # Added num_questions parameter with a default value
"""
Initializes the ExamQuestionGenerator with the curriculum text.
Args:
curriculum_text (str): The text of the curriculum or learning material.
num_questions (int): The number of questions to generate.
"""
self.curriculum_text = curriculum_text
self.num_questions = num_questions # Store the number of questions
self.sentences = self._preprocess_and_tokenize(curriculum_text)
self.tfidf_matrix = self._compute_tfidf_matrix(self.sentences)
    def _preprocess_and_tokenize(self, text):
        """
        Preprocesses the text by removing irrelevant characters, tokenizing into sentences,
        and removing stop words.

        Args:
            text (str): The input text.

        Returns:
            list: A list of preprocessed sentences.
        """
        text = re.sub(r'\[.*?\]', '', text)  # Remove citations like [1], [2]
        text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace
        sentences = sent_tokenize(text)
        stop_words = set(stopwords.words('english'))

        # Remove stop words and punctuation from each sentence
        processed_sentences = []
        for sentence in sentences:
            words = word_tokenize(sentence)
            # Keep lowercase, purely alphabetic tokens that are not stop words
            words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]
            if words:  # Make sure there are words left after filtering
                processed_sentences.append(" ".join(words))
        return processed_sentences
    def _compute_tfidf_matrix(self, sentences):
        """
        Computes the TF-IDF matrix for the given sentences.

        Args:
            sentences (list): A list of sentences.

        Returns:
            scipy.sparse matrix: The TF-IDF matrix, one row per sentence.
        """
        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform(sentences)
        return tfidf_matrix
    def _find_most_relevant_sentences(self, num_sentences=5):
        """
        Finds the most relevant sentences in the curriculum based on TF-IDF scores.

        This is a simplified method that identifies the sentences with the highest TF-IDF values
        for individual words, as a proxy for overall importance.

        Args:
            num_sentences (int): The number of relevant sentences to return.

        Returns:
            list: A list of the most relevant sentences.
        """
        sentence_scores = {}
        for i, sentence in enumerate(self.sentences):
            # Score each sentence by summing the TF-IDF values of its words
            tfidf_vector = self.tfidf_matrix[i]
            sentence_scores[i] = tfidf_vector.sum()

        # Sort sentences by their scores in descending order
        sorted_sentences = sorted(sentence_scores.items(), key=lambda item: item[1], reverse=True)

        # Extract the indices of the top N sentences
        top_indices = [index for index, score in sorted_sentences[:num_sentences]]

        # Return the actual sentences
        return [self.sentences[i] for i in top_indices]
    def generate_questions(self):
        """
        Generates exam questions based on the curriculum text.

        Returns:
            list: A list of generated questions.
        """
        # Request extra sentences to improve the chance of producing enough unique questions
        relevant_sentences = self._find_most_relevant_sentences(num_sentences=self.num_questions * 2)
        questions = []
        for sentence in relevant_sentences:
            # Build a fill-in-the-blank question from the sentence. This is a very basic approach;
            # more sophisticated methods (e.g., NLP models) could generate better questions.
            words = word_tokenize(sentence)
            if len(words) > 3:  # Ensure the sentence is long enough to form a question
                masked_word = random.choice(words)  # Select a random word to mask
                question = sentence.replace(masked_word, "__________", 1)  # Blank out only the first occurrence
                # Capitalize the first letter and add a question mark
                question = question[0].upper() + question[1:] + "?"
                questions.append(question)

        # If more questions were generated than requested, truncate the list
        if len(questions) > self.num_questions:
            questions = questions[:self.num_questions]
        return questions
# Example usage:
if __name__ == "__main__":
    curriculum_text = """
    Artificial intelligence (AI) is the simulation of human intelligence processes by computer systems.
    These processes include learning, reasoning, and self-correction. AI has numerous applications in various fields.
    Machine learning is a subset of AI that focuses on enabling systems to learn from data without being explicitly programmed.
    Deep learning is a further subset of machine learning using artificial neural networks with multiple layers.
    Natural language processing (NLP) is a branch of AI that deals with the interaction between computers and human language.
    NLP enables computers to understand, interpret, and generate human language.
    Computer vision is another area of AI that allows computers to "see" and interpret images.
    Robotics is a field concerned with the design, construction, operation, and application of robots.
    AI is increasingly being used in healthcare for diagnosis and treatment.
    Ethical considerations are important in the development and deployment of AI systems.
    """

    generator = ExamQuestionGenerator(curriculum_text, num_questions=3)
    questions = generator.generate_questions()

    print("Generated Exam Questions:")
    for i, question in enumerate(questions):
        print(f"{i+1}. {question}")
```
Key improvements and explanations:
* **Clear Class Structure:** Encapsulates the question-generation logic within a class, making the code organized and reusable. The constructor `__init__` takes the curriculum text and, optionally, the number of questions to generate.
* **Parameterization:** The `num_questions` parameter is now configurable in the `ExamQuestionGenerator`'s constructor, allowing the user to specify the desired number of questions. A default value is provided.
* **Preprocessing:** The `_preprocess_and_tokenize` method cleans and tokenizes the input text: it strips citation markers (`[1]`, `[2]`) and extra whitespace, converts words to lowercase, and removes stop words using `nltk.corpus.stopwords`. The `isalpha()` check keeps only purely alphabetic tokens, which removes punctuation effectively, and a final check prevents empty sentences from being added after filtering.
* **TF-IDF Calculation:** The `_compute_tfidf_matrix` method calculates the TF-IDF (Term Frequency-Inverse Document Frequency) matrix, which is used to score sentences by the importance of their words and thus to select relevant sentences.
* **Sentence Relevance:** The `_find_most_relevant_sentences` method finds the *n* most relevant sentences by summing the TF-IDF values of each sentence's words to produce a sentence score, a far better selection strategy than picking sentences at random. The generator requests twice as many sentences as questions to increase the chance that enough usable questions are produced; a standalone sketch of this scoring step appears after this list.
* **Question Generation:** The `generate_questions` method builds fill-in-the-blank questions: it picks a random word from each selected sentence, replaces it with a blank (`__________`), then capitalizes the result and appends a question mark. This is a very basic question-generation technique, but it is a reasonable starting point.
* **NLTK Resource Check:** Includes a `try...except` block to check if the necessary NLTK resources (Punkt tokenizer and stopwords) are downloaded and downloads them if they are missing. This prevents errors if the user hasn't used NLTK before.
* **`if __name__ == "__main__":` block:** The example usage code is placed within this block, ensuring it only runs when the script is executed directly (not when imported as a module).
* **Clearer Variable Names:** Uses more descriptive variable names (e.g., `curriculum_text`, `tfidf_matrix`).
* **Comments:** Added extensive comments to explain each part of the code.
* **Handles Edge Cases:** Includes a check to ensure that the sentence is long enough before attempting to generate a question and truncates the list of questions if more questions were generated than specified.
* **Removes Punctuation Properly:** The `isalpha()` check removes punctuation effectively.
* **More Robust Stopword Removal:** The `stop_words = set(stopwords.words('english'))` creates a set for faster lookups.
* **Example Usage Improvement:** The example now uses the constructor to set `num_questions`.
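
To make the TF-IDF sentence-scoring step concrete, here is a minimal, standalone sketch of the same idea. The toy sentences and variable names below are illustrative only and are not taken from the program above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-preprocessed sentences (illustrative only)
sentences = [
    "machine learning subset ai learn data",
    "deep learning uses neural networks",
    "ai used healthcare diagnosis treatment",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)  # one row per sentence

# Score each sentence by summing the TF-IDF weights of its words
scores = tfidf_matrix.sum(axis=1)
ranking = sorted(range(len(sentences)), key=lambda i: scores[i, 0], reverse=True)

for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. score={scores[idx, 0]:.3f}  {sentences[idx]}")
```

The highest-scoring sentences are the ones the generator treats as most "important" and turns into questions.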
How to run:
1. **Install Libraries:**
```bash
pip install nltk scikit-learn
```
2. **Run the script:** Save the code as a `.py` file (e.g., `question_generator.py`) and run it from your terminal:
```bash
python question_generator.py
```
The script prints the generated exam questions to the console. Because the masked word is chosen at random and questions are built from the stop-word-stripped sentences, the output differs between runs.
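The script attempts to download the required NLTK data on first run; if you prefer to fetch it ahead of time (for example, on a machine without the data cached), a one-liner such as the following should work:
```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab'); nltk.download('stopwords')"
```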
The result is a functional, well-structured, and well-commented Python program for automated exam question generation. It uses basic NLP techniques (tokenization, stop-word removal, TF-IDF scoring) and follows common practices for code readability and maintainability.
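
Because the logic lives in a class, the generator can also be imported from another script instead of being run directly. A minimal sketch, assuming the code above is saved as `question_generator.py` in the same directory as the importing script (the sample text here is made up for illustration):

```python
from question_generator import ExamQuestionGenerator

syllabus = """
Reinforcement learning trains agents through rewards and penalties.
Supervised learning relies on labeled examples to fit a model.
"""

generator = ExamQuestionGenerator(syllabus, num_questions=2)
for i, question in enumerate(generator.generate_questions(), start=1):
    print(f"{i}. {question}")
```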