AI-Enhanced Text Auto-Completion (Python, NLP)

```python
import nltk
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize import word_tokenize
import random  # For more diverse suggestions

# Download necessary NLTK data (only needs to be done once)
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")

try:
    nltk.data.find("corpora/brown")
except LookupError:
    nltk.download("brown")


class TextAutoCompleter:
    """
    A simple text auto-completion class using NLTK's Language Modeling.

    This class builds a language model based on provided text data and
    suggests likely next words based on a given input context.  It employs
    Maximum Likelihood Estimation (MLE) for predicting probabilities.
    """

    def __init__(self, order=3, text_data=None):
        """
        Initializes the TextAutoCompleter.

        Args:
            order (int): The n-gram order for the language model (e.g., 3 for trigrams).
            text_data (str, optional): The text data to train the model on.
                                       Defaults to the NLTK Brown corpus if None.
        """
        self.order = order
        self.model = None
        self.vocabulary = None

        if text_data is None:
            # Use the Brown corpus if no text data is provided (for demonstration).
            # brown.raw() contains POS tags ("The/at ..."), so rebuild plain sentences instead.
            from nltk.corpus import brown
            text_data = " ".join(" ".join(sent) for sent in brown.sents())
        self.train_model(text_data)  # Train the model during initialization

    def train_model(self, text):
        """
        Trains the language model on the given text data.

        Args:
            text (str): The text data to train the model on.
        """
        # Split the text into sentences, then each sentence into word tokens.
        tokenized_text = [word_tokenize(sent) for sent in nltk.sent_tokenize(text)]
        # padded_everygram_pipeline returns two lazy iterators: the training n-grams
        # and a flat stream of padded tokens used to build the model's vocabulary.
        train_data, padded_sents = padded_everygram_pipeline(self.order, tokenized_text)

        self.model = MLE(self.order)  # Maximum Likelihood Estimation model
        self.model.fit(train_data, padded_sents)  # Train the model (this consumes both iterators)

        # Build the vocabulary from the fitted model; padded_sents is a one-shot
        # iterator of individual tokens that fit() has already consumed.
        self.vocabulary = set(self.model.vocab)

    def suggest_next_words(self, context, num_suggestions=3):
        """
        Suggests the most likely next words based on the given context.

        Args:
            context (str): The input context (previous words).
            num_suggestions (int): The maximum number of suggestions to return.

        Returns:
            list: Up to num_suggestions likely next words (the highest-probability
                  candidates, returned in shuffled order).
        """
        tokenized_context = word_tokenize(context)
        # Keep only the last n-1 tokens as the context. For input "the quick brown"
        # with order 3, the context becomes ["quick", "brown"].
        context = tokenized_context[-(self.order - 1):]
        if len(context) < self.order - 1:
            return []  # The context is too short to condition an n-gram prediction.

        suggestions = []
        for word in self.vocabulary:
            if word in ("<s>", "</s>", "<UNK>"):
                continue  # Skip padding and unknown-word symbols; they are not useful suggestions.
            prob = self.model.score(word, context)  # P(word | context) under the trained model
            if prob > 0:  # Only keep words the model has actually seen in this context.
                suggestions.append((word, prob))

        suggestions.sort(key=lambda x: x[1], reverse=True)  # Sort by probability, highest first
        suggestions = [word for word, prob in suggestions[:num_suggestions]]  # Keep only the words

        # Optionally shuffle the top suggestions so they are not always presented in
        # the same order. This only reorders the selected words; it does not change
        # which words are chosen.
        random.shuffle(suggestions)

        return suggestions


# --- Example Usage ---
if __name__ == "__main__":
    # Example text data
    example_text = """
    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy cat.
    The lazy dog sleeps soundly.
    Artificial intelligence is transforming the world.
    Natural language processing is a key area of AI.
    """

    # Create an auto-completer instance
    auto_completer = TextAutoCompleter(order=3, text_data=example_text)

    # Get suggestions for a given context
    context = "The quick brown"
    suggestions = auto_completer.suggest_next_words(context)

    print(f"Context: {context}")
    print(f"Suggestions: {suggestions}")

    context = "Artificial intelligence is"
    suggestions = auto_completer.suggest_next_words(context)
    print(f"Context: {context}")
    print(f"Suggestions: {suggestions}")

    context = "The lazy"
    suggestions = auto_completer.suggest_next_words(context)
    print(f"Context: {context}")
    print(f"Suggestions: {suggestions}")
```
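As a rough sanity check (reasoning from the training text, not captured program output): the only observed continuation of "The quick brown" in the example corpus is "fox", so the first call should return `['fox']`; "Artificial intelligence is" should similarly yield `['transforming']`, and "The lazy" should yield `['dog']`. On larger corpora the suggestion lists become longer and the shuffle becomes noticeable.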

Key improvements and explanations:

* **Clear Class Structure:**  Encapsulates the auto-completion logic within a class, making it reusable and organized.
* **Initialization with Training:**  The `TextAutoCompleter` class now takes `text_data` as an argument during initialization and immediately trains the model. If no text data is provided, it defaults to using the NLTK Brown corpus (for easy demonstration).
* **NLTK Dependency Check:** Added checks with `try...except LookupError` to ensure that the necessary NLTK data (punkt tokenizer and brown corpus) are downloaded if they're not already present. This prevents common errors when running the code for the first time.
* **Tokenization and Padding:**  Uses `nltk.sent_tokenize` and `nltk.word_tokenize` for robust text processing. `padded_everygram_pipeline` correctly handles sentence boundaries and creates the training data with padding, which is crucial for language modeling. This padding helps the model learn probabilities even at the beginning of sentences (see the short sketch after this list).
* **Vocabulary Creation:**  Stores the fitted model's vocabulary (all tokens seen in the training data, plus the padding and `<UNK>` symbols) as a set, which is what `suggest_next_words` iterates over when scoring candidate next words.
* **MLE Model:** Uses `nltk.lm.MLE` (Maximum Likelihood Estimation) which is a common and straightforward approach for language modeling.
* **Probability Calculation:**  Uses `model.score(word, context)` to get the probability of a word given the context; this is the heart of the language model. Crucially, the code filters suggestions to include only those with a *non-zero* probability, so the model never suggests continuations it has not seen after the given context in the training data. Out-of-vocabulary inputs are mapped to an `<UNK>` symbol by the model's vocabulary, but under plain MLE they still score zero, so skipping zero-probability words remains the right default (also illustrated in the sketch after this list).
* **Suggestion Ranking and Limiting:**  Sorts the suggestions by probability and limits the number of suggestions returned using slicing `[:num_suggestions]`.
* **Randomness in Suggestions (Optional):**  Applies `random.shuffle` to the top suggestions so they are not always presented in the same order. Note that this only reorders the selected words; it does not change which words are suggested. It is optional, but it makes repeated completions feel less mechanical.
* **Handles Short Contexts:**  Includes a check for very short input contexts (e.g., less than n-1 words).  If the context is too short, the function now returns an empty list to avoid errors.
* **Clearer Comments:**  Improved comments throughout the code to explain each step.
* **Example Usage:** Includes a complete example demonstrating how to use the `TextAutoCompleter` class.
* **Uses `if __name__ == "__main__":`:**  The example usage code is placed inside an `if __name__ == "__main__":` block, which ensures that the example code only runs when the script is executed directly (not when it's imported as a module).
* **Brown Corpus:** Uses the Brown corpus as a default dataset if none is provided. This makes the example runnable out of the box.
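
To make the padding and scoring mechanics concrete, here is a minimal standalone sketch (separate from the class above, using the same NLTK APIs) that fits a tiny trigram MLE model on a single sentence and inspects the resulting vocabulary and a couple of conditional probabilities:

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# One tokenized sentence. padded_everygram_pipeline pads it with <s>/</s> boundary
# symbols and produces every n-gram up to the requested order, as lazy iterators.
sentences = [["the", "quick", "brown", "fox"]]
train_data, padded_tokens = padded_everygram_pipeline(3, sentences)

model = MLE(3)
model.fit(train_data, padded_tokens)

# The fitted vocabulary contains the words plus the padding/unknown symbols,
# e.g. ['</s>', '<UNK>', '<s>', 'brown', 'fox', 'quick', 'the']
print(sorted(model.vocab))

# "fox" is the only observed continuation of ("quick", "brown"), so MLE gives it 1.0.
print(model.score("fox", ["quick", "brown"]))  # 1.0
# Unseen continuations get probability 0 under plain MLE.
print(model.score("dog", ["quick", "brown"]))  # 0.0
```

The `<s>`/`</s>` symbols come from the padding step and end up in the vocabulary, which is why `suggest_next_words` skips them when collecting candidates.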

This version is significantly more robust, practical, and clearly explained than a naive implementation, and it correctly implements a basic text auto-completion system using NLTK's language modeling tools. The key pieces are the proper training data preparation (using padding), the explicit vocabulary, the handling of zero probabilities, and the optional randomness.
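
If you would rather have the model assign small non-zero probabilities to unseen continuations instead of filtering them out, NLTK's `lm` module also ships smoothed estimators such as `Laplace`. The following is a hedged sketch of that alternative, not something the code above does; `Laplace` exposes the same `fit`/`score` API, so it could replace the `MLE(self.order)` line in `train_model`:

```python
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

sentences = [["the", "quick", "brown", "fox"]]
train_data, padded_tokens = padded_everygram_pipeline(3, sentences)

# Laplace (add-one) smoothing: every vocabulary word gets a non-zero probability in
# every context, so unseen continuations are ranked low instead of dropped entirely.
model = Laplace(3)
model.fit(train_data, padded_tokens)

print(model.score("fox", ["quick", "brown"]))  # no longer 1.0, but still the top-ranked continuation
print(model.score("the", ["quick", "brown"]))  # unseen in this context, yet gets a small non-zero score
```

With a smoothed model, the `if prob > 0` filter in `suggest_next_words` would keep essentially the whole vocabulary, so you would rely on the probability sort and the `num_suggestions` cut-off instead.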