AI-powered Sentiment Analysis with Python, NLP, and TensorFlow

```python
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Download necessary NLTK resources (run this once)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize on newer NLTK releases

# Sample Data (Replace with your actual dataset)
# Each tuple contains (text, sentiment - 0 for negative, 1 for positive)
data = [
    ("This movie was terrible, I hated it.", 0),
    ("I loved the film, it was amazing!", 1),
    ("The acting was bad and the plot was boring.", 0),
    ("Excellent performance, a truly captivating story.", 1),
    ("Not my cup of tea, I wouldn't recommend it.", 0),
    ("A masterpiece of cinema, highly recommended!", 1),
    ("The script was weak, and the characters were bland.", 0),
    ("Brilliant direction and a fantastic cast.", 1),
    ("Waste of time and money.", 0),
    ("Absolutely stunning, a must-see!", 1),
    ("Mediocre at best, nothing special.", 0),
    ("What a wonderful experience!", 1)
]


# 1. Data Preprocessing

def preprocess_text(text):
    """
    Cleans and prepares the text data for sentiment analysis.
    """
    # Lowercasing
    text = text.lower()

    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenization
    tokens = nltk.word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Stemming (reduce words to their root form)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]

    return " ".join(tokens)  # Rejoin tokens back into a string



# Apply preprocessing to the text data
texts = [preprocess_text(text) for text, _ in data]
labels = [label for _, label in data]


# 2. Tokenization and Sequence Padding

# Tokenization: Converts words into numerical representations
tokenizer = Tokenizer(num_words=5000)  # Limit vocabulary to 5000 most frequent words
tokenizer.fit_on_texts(texts)

# Convert text to sequences of integers
sequences = tokenizer.texts_to_sequences(texts)

# Padding: Make all sequences the same length
max_length = max(len(seq) for seq in sequences)  # Find the longest sequence
padded_sequences = pad_sequences(sequences, maxlen=max_length) # Pad sequences to the max_length


# 3. Train/Test Split

X = padded_sequences  # Input data
y = np.array(labels)  # Labels (converted to a NumPy array)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training, 20% testing


# 4. Model Building (LSTM Network)

# Model Architecture
model = Sequential()

# Embedding Layer: Converts integer word indices to dense vectors
# input_dim: size of the vocabulary (total number of unique words)
# output_dim: dimensionality of the embedding vector (e.g., 128)
# input_length: length of input sequences (max_length of padded sequences);
#               deprecated in Keras 3, where it can simply be omitted
model.add(Embedding(input_dim=5000, output_dim=128, input_length=max_length))

# LSTM Layer: Processes sequences of data, capturing temporal dependencies
# Units: Number of LSTM units/neurons in the layer (e.g., 64)
model.add(LSTM(units=64, dropout=0.2, recurrent_dropout=0.2)) # Add dropout for regularization

# Dense Layer: Fully connected layer for classification
# units: Number of neurons in the layer (e.g., 1)
# activation: Activation function (sigmoid for binary classification)
model.add(Dense(units=1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary() #  View the architecture and parameters

# 5. Model Training

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))


# 6. Model Evaluation

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")


# 7. Prediction (Example Usage)

def predict_sentiment(text):
    """
    Predicts the sentiment of a given text.
    """
    # Preprocess the input text
    processed_text = preprocess_text(text)

    # Convert text to a sequence of integers using the fitted tokenizer
    sequence = tokenizer.texts_to_sequences([processed_text])

    # Pad the sequence to match the expected input length
    padded_sequence = pad_sequences(sequence, maxlen=max_length)

    # Make a prediction using the trained model
    prediction = model.predict(padded_sequence)[0][0]

    # Interpret the prediction (0.5 is the threshold for positive/negative)
    if prediction >= 0.5:
        return "Positive"
    else:
        return "Negative"



# Example usage:
test_text = "This is an absolutely fantastic product! I highly recommend it."
sentiment = predict_sentiment(test_text)
print(f"Sentiment: {sentiment}")

test_text = "This is a terrible experience, I will never purchase this again."
sentiment = predict_sentiment(test_text)
print(f"Sentiment: {sentiment}")
```

Key points and explanations:

* **Clear Structure & Comments:** The code is broken into logical sections with comments explaining each step, which makes the flow easy to follow.
* **NLTK Resource Download:** The `nltk.download()` calls ensure the required NLTK resources (stopwords and the Punkt tokenizer data) are available before the code runs, avoiding the common "Resource not found" error.
* **Data Preprocessing:** A `preprocess_text` function is implemented, encapsulating the text cleaning steps.  This includes:
    * **Lowercasing:**  Converts text to lowercase.
    * **Punctuation Removal:** Removes punctuation and special characters with a regular expression, reducing noise in the vocabulary.
    * **Tokenization:** Splits text into individual words (tokens).
    * **Stop Word Removal:** Removes common words (e.g., "the", "a", "is") that don't contribute much to sentiment.
    * **Stemming:** Reduces words to their root form (e.g., "running" -> "run").  This helps generalize the model.
* **Tokenizer Configuration:**  The `Tokenizer` is configured with `num_words=5000`. This limits the vocabulary to the 5000 most frequent words, preventing the model from being overwhelmed by rare words.
* **Padding:** The `pad_sequences` function ensures that all input sequences have the same length, which is required for feeding data into the LSTM network. The `max_length` is computed from the data rather than hard-coded. A short sketch after this list shows what tokenization and padding actually produce on toy input.
* **Model Architecture:**
    * **Embedding Layer:**  Crucial for converting word indices into dense vector representations that the LSTM can understand.  `input_dim` is set to 5000 (the vocabulary size). `output_dim` is set to 128, the dimensionality of the word vectors.
    * **LSTM Layer:** The core of the sentiment analysis model.  `units=64` specifies the number of LSTM units.  `dropout` and `recurrent_dropout` are added for regularization to prevent overfitting.
    * **Dense Layer:** The output layer with `sigmoid` activation for binary classification (positive/negative).
* **Train/Test Split:** The `train_test_split` function divides the data into training and testing sets, allowing you to evaluate the model's performance on unseen data.  `random_state=42` ensures consistent results across multiple runs.
* **Model Compilation:** The model is compiled with the `adam` optimizer, `binary_crossentropy` loss function (suitable for binary classification), and `accuracy` metric.
* **Model Training:** The model is trained using the `fit` method.  `epochs=10` specifies the number of training epochs, and `batch_size=32` controls the number of samples processed in each batch.  `validation_data` allows you to monitor the model's performance on the test set during training.
* **Model Evaluation:** The `evaluate` method calculates the loss and accuracy on the test set.
* **Prediction Function (`predict_sentiment`):**  This function takes a text input, preprocesses it, converts it to a sequence, pads it, and uses the trained model to predict its sentiment. It includes clear output (Positive or Negative).
* **Example Usage:** The code includes examples of how to use the `predict_sentiment` function to predict the sentiment of new text.
* **Descriptive Variable Names:** Variable names are descriptive, which improves readability.
* **NumPy Conversion:** Converts the labels to a NumPy array (`y = np.array(labels)`) for compatibility with TensorFlow/Keras.
* **Dropout Regularization:** `dropout` and `recurrent_dropout` are applied in the LSTM layer to reduce overfitting, which is common with LSTMs on small datasets.
* **Model Summary:** Includes `model.summary()` to print a summary of the model architecture, which is very helpful for debugging and understanding the model.
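
To make the tokenization and padding steps concrete, here is a minimal, self-contained sketch. The toy sentences and the `toy_` names are purely illustrative and not part of the pipeline above; the exact integer indices depend on the word frequencies in whatever text the tokenizer is fitted on.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pretend these are already-preprocessed texts (lowercased, stemmed, stop words removed)
toy_texts = ["love love this film", "hate this film"]

toy_tokenizer = Tokenizer(num_words=50)
toy_tokenizer.fit_on_texts(toy_texts)

print(toy_tokenizer.word_index)
# {'love': 1, 'this': 2, 'film': 3, 'hate': 4}  -- more frequent words get lower indices

toy_sequences = toy_tokenizer.texts_to_sequences(toy_texts)
print(toy_sequences)
# [[1, 1, 2, 3], [4, 2, 3]]  -- each word replaced by its integer index

toy_padded = pad_sequences(toy_sequences, maxlen=4)
print(toy_padded)
# [[1 1 2 3]
#  [0 4 2 3]]  -- shorter sequences are left-padded with zeros to a common length
```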

This example provides a complete, runnable, and well-explained demonstration of AI-powered sentiment analysis using Python, NLTK, and TensorFlow, and a solid foundation for further development. Remember to install the necessary libraries: `pip install nltk numpy scikit-learn tensorflow`.
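
One possible next step, sketched here on the assumption that the trained model will be reused outside this script (the file names `sentiment_model.keras` and `tokenizer.pickle` are hypothetical), is to persist the model and the fitted tokenizer so `predict_sentiment` can run in a later session without retraining:

```python
import pickle

from tensorflow.keras.models import load_model

# Save the trained model and the fitted tokenizer
model.save("sentiment_model.keras")   # native Keras format; use an .h5 path on older TF/Keras
with open("tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)         # the tokenizer's word index must travel with the model

# Later, in another script or session
model = load_model("sentiment_model.keras")
with open("tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)
# max_length should also be stored (or fixed as a constant) so that padding at
# prediction time matches the padding used during training.
```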