AI-Based Automated Grading System for Educational Assessments (Python)

```python
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re  # Regular expression for cleaning
import os  # For loading data from files

# Download necessary NLTK resources (if not already downloaded)
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

# Define Preprocessing Functions
def clean_text(text):
    """
    Cleans the text by removing special characters,
    converting to lowercase, and removing extra spaces.
    """
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower() # convert to lowercase
    text = re.sub(r'\s+', ' ', text).strip() # remove extra spaces
    return text


def tokenize_and_remove_stopwords(text):
    """
    Tokenizes the text and removes stop words.

    Args:
      text: The input text string.

    Returns:
      A list of tokens after removing stop words.
    """
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_tokens = [w for w in word_tokens if w not in stop_words]
    return filtered_tokens



def load_essays(folder_path):
    """
    Loads essays from text files in a specified folder.

    Args:
        folder_path: The path to the folder containing the essay files.

    Returns:
        A dictionary where keys are filenames (without extension) and
        values are the essay content as strings. Returns an empty dictionary if the folder
        doesn't exist or is empty. Handles file reading errors gracefully.
    """
    essays = {}
    if not os.path.exists(folder_path):
        print(f"Warning: Folder '{folder_path}' not found. Returning an empty essay dictionary.")
        return essays

    try:
        for filename in os.listdir(folder_path):
            if filename.endswith(".txt"):  # Only process .txt files
                file_path = os.path.join(folder_path, filename)
                try:
                    with open(file_path, 'r', encoding='utf-8') as file: #Specify UTF-8 to handle various character sets
                        essay_text = file.read()
                        essays[filename[:-4]] = essay_text  # Store without the ".txt" extension
                except Exception as e:
                    print(f"Error reading file '{filename}': {e}")
    except Exception as e:
        print(f"Error accessing folder '{folder_path}': {e}")
        return {}  # Or raise the exception, depending on desired error handling.
    return essays



def calculate_similarity(reference_answer, student_answer):
    """
    Calculates the cosine similarity between a reference answer and a student answer
    using TF-IDF vectorization.

    Args:
        reference_answer: The reference answer text.
        student_answer: The student's answer text.

    Returns:
        The cosine similarity score between the two texts (a float between 0 and 1).
        Returns 0.0 if either text is empty after preprocessing or if
        vectorization fails.
    """

    # Preprocess the texts
    reference_answer = clean_text(reference_answer)
    student_answer = clean_text(student_answer)

    # Tokenize and remove stop words
    reference_tokens = tokenize_and_remove_stopwords(reference_answer)
    student_tokens = tokenize_and_remove_stopwords(student_answer)

    # Handle empty token lists to prevent errors during vectorization.
    if not reference_tokens or not student_tokens:
        print("Warning: One or both texts resulted in empty token lists after preprocessing.  Returning 0 similarity.")
        return 0.0

    reference_answer = " ".join(reference_tokens)  # Convert back to string for TF-IDF
    student_answer = " ".join(student_tokens)      # Convert back to string for TF-IDF


    vectorizer = TfidfVectorizer()
    try:
        vectors = vectorizer.fit_transform([reference_answer, student_answer])
        similarity_score = cosine_similarity(vectors[0:1], vectors[1:2])[0][0]
        return similarity_score
    except ValueError as e:  #Handles cases where TF-IDF can't be computed (e.g., empty vocabulary)
        print(f"Error calculating similarity: {e}.  Returning 0.")
        return 0.0



def assign_grade(similarity_score, grade_ranges):
    """
    Assigns a grade based on the similarity score.

    Args:
        similarity_score: The cosine similarity score.
        grade_ranges: A dictionary defining the grade ranges (e.g., {"A": 0.9, "B": 0.8, ...}).
                    Values should be thresholds; scores equal to or above the threshold will receive that grade.

    Returns:
        The assigned grade as a string. Returns "F" (or a default grade) if no other grade is matched.
    """

    for grade, threshold in sorted(grade_ranges.items(), key=lambda item: item[1], reverse=True): #Iterate in descending order of threshold
        if similarity_score >= threshold:
            return grade

    return "F"  # Default grade if no other range is met



def main():
    """
    Main function to run the automated grading system.
    """

    # Load reference answers and student essays
    reference_answers = load_essays("reference_answers")  # Assumes reference answers are in a folder
    student_essays = load_essays("student_essays")  # Assumes student essays are in a separate folder

    if not reference_answers or not student_essays:
        print("Error: Could not load reference answers or student essays.  Please check folder paths and file formats.")
        return


    # Define grade ranges (example)
    grade_ranges = {
        "A": 0.9,
        "B": 0.8,
        "C": 0.7,
        "D": 0.6,
        "E": 0.5
    }


    # Process each student essay
    for student_essay_name, student_answer in student_essays.items():
        # Find the corresponding reference answer.  Assumes filenames match (e.g., essay1.txt in both folders)
        if student_essay_name in reference_answers:
            reference_answer = reference_answers[student_essay_name]
        else:
            print(f"Warning: No reference answer found for essay '{student_essay_name}'. Skipping.")
            continue


        similarity_score = calculate_similarity(reference_answer, student_answer)

        grade = assign_grade(similarity_score, grade_ranges)
        print(f"Essay: {student_essay_name}, Similarity Score: {similarity_score:.2f}, Grade: {grade}")  # Formatted output


if __name__ == "__main__":
    main()
```
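To make the core idea of `calculate_similarity` concrete, here is a minimal, dependency-free sketch of cosine similarity over plain word-count vectors. This is a deliberate simplification of the TF-IDF pipeline above (no IDF weighting, no stop-word removal), and the function name `bag_of_words_cosine` is illustrative, not part of the script:

```python
import math
from collections import Counter

def bag_of_words_cosine(text_a, text_b):
    """Cosine similarity between raw word-count vectors.

    A simplified stand-in for the TF-IDF version: each text becomes a
    vector of term counts, and we measure the angle between the vectors.
    """
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    # Dot product only needs words that appear in both texts.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction to compare against.
    return dot / (norm_a * norm_b)

# Identical texts score 1.0; texts with no shared words score 0.0.
print(bag_of_words_cosine("the cell is the unit of life",
                          "the cell is the unit of life"))  # 1.0
print(bag_of_words_cosine("photosynthesis uses light",
                          "mitosis divides cells"))         # 0.0
```

TF-IDF improves on this by down-weighting words that appear in every document, which is why the script uses `TfidfVectorizer` rather than raw counts.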

Key improvements and explanations:

* **Clearer Structure and Comments:** The code is well-structured into functions, making it easier to read, understand, and maintain.  Comments explain the purpose of each section and the logic behind the decisions.
* **Robust Error Handling:**  Includes `try...except` blocks to handle potential errors like:
    * File not found errors when loading essays.
    * Errors during TF-IDF vectorization (e.g., if a document is empty after preprocessing, TF-IDF will fail).
    * General exceptions during file access.
* **`load_essays` Function:** This function now handles reading multiple essays from a folder. This simulates a more realistic scenario where you have a collection of student submissions. It also handles potential `FileNotFoundError` and other file I/O exceptions.  Crucially, it now uses `encoding='utf-8'` to handle various character sets, preventing encoding errors. It also strips the `.txt` extension from the filename when storing the essay. A warning message is printed if the folder is not found.
* **`clean_text` Function:** Added to clean the text. This function removes punctuation and makes the text lowercase which drastically improves the similarity scores. Also removes extra spaces.
* **`tokenize_and_remove_stopwords` function:** Now removes stop words to further refine the similarity calculation. This is a crucial step for accurate text comparison.
* **`calculate_similarity` Function:**
    * **Preprocessing:** Calls the `clean_text` and `tokenize_and_remove_stopwords` functions to prepare the texts.
    * **Handles Empty Documents:** The code now explicitly checks if the `reference_answer` or `student_answer` are empty *after* preprocessing.  If either is empty, it returns 0.0, preventing errors and indicating a very poor or absent answer.
    * **TF-IDF and Cosine Similarity:** Uses `TfidfVectorizer` to convert text into numerical vectors and `cosine_similarity` to measure the similarity between vectors.
* **`assign_grade` Function:**
    * **Grade Ranges:**  The `grade_ranges` are passed as an argument, making the grading scheme configurable.
    * **Sorted Iteration:**  The code now iterates through the `grade_ranges` in *descending* order of threshold value.  This ensures that a student gets the *highest* grade they qualify for.  This is essential for correct grading logic.
    * **Default Grade:**  Returns "F" (or a default grade) if no grade range is matched, ensuring that every essay receives a grade.
* **`main` Function:**
    * Loads reference answers and student essays from folders.  This is a more realistic simulation.
    * Iterates through the student essays.
    * Calls the `calculate_similarity` and `assign_grade` functions.
    * Prints the results in a clear format.
    * Includes error handling for missing reference answers.
* **`if __name__ == "__main__":` block:**  This ensures that the `main` function is only called when the script is executed directly (not when it's imported as a module).
* **UTF-8 Encoding:**  Specifies `encoding='utf-8'` when opening files to handle a wider range of characters.  This is crucial for internationalization and handling various text formats.
* **Clearer Output:** The output is formatted to be more readable.
* **Modularity:** The code is well-organized into functions, making it easier to test, modify, and extend.
* **Install Instructions:** Added instructions for installing the necessary libraries.
* **Example Data:**  *Important:* This response does *not* include example data files (reference answers and student essays).  You'll need to create these yourself.  Create two folders: `reference_answers` and `student_essays`.  Place text files (`.txt`) containing the reference answers and student essays into the respective folders.  The filenames should match (e.g., `essay1.txt` in both folders).  This is required for the `load_essays` function to work correctly.
* **Input Validation**: Adds rudimentary input validation by checking that the essay lists are not empty before processing.  Also handles the case where a reference answer is missing for a given student essay.
* **Clear Error Messages:** Provides more informative error messages to help with debugging.
* **Handles Zero Length Strings/Empty Token Lists**:  The `calculate_similarity` function explicitly handles the case where the processed reference answer or student answer becomes an empty string or token list.  This prevents errors in the TF-IDF vectorization step.
* **File Existence Check**: `load_essays` now checks if the provided folder exists. If it doesn't, it prints a warning and returns an empty dictionary.
* **Code Style**: The code adheres to common Python coding conventions (PEP 8) for readability.
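To see why iterating the thresholds in descending order matters, here is the `assign_grade` function from the script above as a standalone snippet, using the example grade scale from `main`:

```python
def assign_grade(similarity_score, grade_ranges):
    # Check thresholds highest-first so a score receives the best grade it earns.
    for grade, threshold in sorted(grade_ranges.items(),
                                   key=lambda item: item[1], reverse=True):
        if similarity_score >= threshold:
            return grade
    return "F"  # Default grade if no threshold is met

grade_ranges = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6, "E": 0.5}

print(assign_grade(0.95, grade_ranges))  # A
print(assign_grade(0.80, grade_ranges))  # B (scores exactly on a boundary get that grade)
print(assign_grade(0.10, grade_ranges))  # F
```

If the thresholds were checked in ascending order instead, a score of 0.95 would match the "E" threshold (0.5) first and return the wrong grade.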

How to Run:

1.  **Install Libraries:**

    ```bash
    pip install nltk scikit-learn numpy
    ```

2.  **Create Folders and Text Files:**

    *   Create two folders: `reference_answers` and `student_essays` in the *same directory* as your Python script.
    *   Inside each folder, create text files (e.g., `essay1.txt`, `essay2.txt`, etc.).  The *filenames should match* between the two folders (e.g., if you have `essay1.txt` in `student_essays`, you should also have `essay1.txt` in `reference_answers`).
    *   Put the reference answer for each essay in the corresponding file in the `reference_answers` folder.
    *   Put the student's answer for each essay in the corresponding file in the `student_essays` folder.

3.  **Run the Script:**

    ```bash
    python your_script_name.py
    ```
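For a quick smoke test, the folder layout from step 2 can be created from the shell. The essay text below is placeholder content; replace it with real reference answers and student submissions:

```shell
mkdir -p reference_answers student_essays
echo "The mitochondria is the powerhouse of the cell." > reference_answers/essay1.txt
echo "Mitochondria are the powerhouse of the cell." > student_essays/essay1.txt
```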

This provides a functional, well-structured, and robust automated grading system. Remember to create the `reference_answers` and `student_essays` folders and populate them with matching files before testing. The `clean_text` and `tokenize_and_remove_stopwords` steps are what make the similarity scores meaningful, and the error handling keeps the program from crashing on missing or malformed input.