AI-powered Sentiment Analysis with Python, NLP, and TensorFlow
```python
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Download necessary NLTK resources (run this once)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by newer NLTK releases for word_tokenize
# Sample Data (Replace with your actual dataset)
# Each tuple contains (text, sentiment - 0 for negative, 1 for positive)
data = [
    ("This movie was terrible, I hated it.", 0),
    ("I loved the film, it was amazing!", 1),
    ("The acting was bad and the plot was boring.", 0),
    ("Excellent performance, a truly captivating story.", 1),
    ("Not my cup of tea, I wouldn't recommend it.", 0),
    ("A masterpiece of cinema, highly recommended!", 1),
    ("The script was weak, and the characters were bland.", 0),
    ("Brilliant direction and a fantastic cast.", 1),
    ("Waste of time and money.", 0),
    ("Absolutely stunning, a must-see!", 1),
    ("Mediocre at best, nothing special.", 0),
    ("What a wonderful experience!", 1)
]
# 1. Data Preprocessing
def preprocess_text(text):
    """
    Cleans and prepares the text data for sentiment analysis.
    """
    # Lowercasing
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # Stemming (reduce words to their root form)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    return " ".join(tokens)  # Rejoin tokens back into a string
# Apply preprocessing to the text data
texts = [preprocess_text(text) for text, _ in data]
labels = [label for _, label in data]
# 2. Tokenization and Sequence Padding
# Tokenization: Converts words into numerical representations
tokenizer = Tokenizer(num_words=5000) # Limit vocabulary to 5000 most frequent words
tokenizer.fit_on_texts(texts)
# Convert text to sequences of integers
sequences = tokenizer.texts_to_sequences(texts)
# Padding: Make all sequences the same length
max_length = max(len(seq) for seq in sequences) # Find the longest sequence
padded_sequences = pad_sequences(sequences, maxlen=max_length) # Pad sequences to the max_length
# 3. Train/Test Split
X = padded_sequences # Input data
y = np.array(labels) # Labels (converted to a NumPy array)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training, 20% testing
# 4. Model Building (LSTM Network)
# Model Architecture
model = Sequential()
# Embedding Layer: Converts integer word indices to dense vectors
# input_dim: size of the vocabulary (total number of unique words)
# output_dim: dimensionality of the embedding vector (e.g., 128)
# input_length: length of input sequences (max_length of padded sequences)
model.add(Embedding(input_dim=5000, output_dim=128, input_length=max_length))
# LSTM Layer: Processes sequences of data, capturing temporal dependencies
# Units: Number of LSTM units/neurons in the layer (e.g., 64)
model.add(LSTM(units=64, dropout=0.2, recurrent_dropout=0.2)) # Add dropout for regularization
# Dense Layer: Fully connected layer for classification
# units: Number of neurons in the layer (e.g., 1)
# activation: Activation function (sigmoid for binary classification)
model.add(Dense(units=1, activation='sigmoid'))
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Print model summary
model.summary() # View the architecture and parameters
# 5. Model Training
# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))
# 6. Model Evaluation
# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")
# 7. Prediction (Example Usage)
def predict_sentiment(text):
    """
    Predicts the sentiment of a given text.
    """
    # Preprocess the input text
    processed_text = preprocess_text(text)
    # Convert text to a sequence of integers using the fitted tokenizer
    sequence = tokenizer.texts_to_sequences([processed_text])
    # Pad the sequence to match the expected input length
    padded_sequence = pad_sequences(sequence, maxlen=max_length)
    # Make a prediction using the trained model
    prediction = model.predict(padded_sequence)[0][0]
    # Interpret the prediction (0.5 is the threshold for positive/negative)
    if prediction >= 0.5:
        return "Positive"
    else:
        return "Negative"
# Example usage:
test_text = "This is an absolutely fantastic product! I highly recommend it."
sentiment = predict_sentiment(test_text)
print(f"Sentiment: {sentiment}")
test_text = "This is a terrible experience, I will never purchase this again."
sentiment = predict_sentiment(test_text)
print(f"Sentiment: {sentiment}")
```
Key improvements and explanations:
* **Clearer Structure & Comments:** The code is now broken down into logical sections with detailed comments explaining each step. This makes it much easier to understand the flow.
* **NLTK Resource Download:** Includes `nltk.download()` calls so the required NLTK resources (stopwords, Punkt tokenizer) are fetched before the rest of the code runs, which avoids the common "Resource not found" error.
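
  For repeated runs, a small optional guard (a sketch, not part of the listing above) skips the download when the resources are already on disk:

  ```python
  import nltk

  # Only download what is missing; the lookup paths follow NLTK's standard data layout
  for resource, path in [("stopwords", "corpora/stopwords"), ("punkt", "tokenizers/punkt")]:
      try:
          nltk.data.find(path)
      except LookupError:
          nltk.download(resource)
  ```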
* **Data Preprocessing:** A `preprocess_text` function is implemented, encapsulating the text cleaning steps. This includes:
    * **Lowercasing:** Converts text to lowercase.
    * **Punctuation Removal:** Removes punctuation and special characters with a regular expression, which reduces noise in the vocabulary.
    * **Tokenization:** Splits text into individual words (tokens).
    * **Stop Word Removal:** Removes common words (e.g., "the", "a", "is") that don't contribute much to sentiment.
    * **Stemming:** Reduces words to their root form (e.g., "running" -> "run"), which helps the model generalize.
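
  As a quick sanity check (illustrative only; the exact tokens depend on the installed NLTK and stemmer versions), the function turns one of the sample reviews into a short stemmed string:

  ```python
  print(preprocess_text("This movie was terrible, I hated it."))
  # Roughly: "movi terribl hate" -- stop words dropped, remaining words stemmed
  ```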
* **Tokenizer Configuration:** The `Tokenizer` is configured with `num_words=5000`. This limits the vocabulary to the 5000 most frequent words, preventing the model from being overwhelmed by rare words.
* **Padding:** The `pad_sequences` function ensures that all input sequences have the same length. This is essential for feeding data into the LSTM network. The code now *finds* the `max_length` from the data, making it dynamic.
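
  A short sketch of what the tokenization and padding steps produce for a single preprocessed sentence (this assumes the fitted `tokenizer` and `max_length` from the listing above; the integer ids shown are hypothetical and depend on the corpus):

  ```python
  example = ["movi terribl hate"]                 # output of preprocess_text
  seq = tokenizer.texts_to_sequences(example)     # e.g. [[12, 7, 3]]
  padded = pad_sequences(seq, maxlen=max_length)  # zero-padded on the left by default
  print(padded.shape)                             # (1, max_length)
  ```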
* **Model Architecture:**
    * **Embedding Layer:** Crucial for converting word indices into dense vector representations that the LSTM can understand. `input_dim` is set to 5000 (the vocabulary size). `output_dim` is set to 128, the dimensionality of the word vectors.
    * **LSTM Layer:** The core of the sentiment analysis model. `units=64` specifies the number of LSTM units. `dropout` and `recurrent_dropout` are added for regularization to prevent overfitting.
    * **Dense Layer:** The output layer with `sigmoid` activation for binary classification (positive/negative).
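
  A quick shape check (an illustrative sketch, assuming the `model` and `max_length` defined above) confirms how data flows through the layers:

  ```python
  import numpy as np

  dummy = np.zeros((2, max_length), dtype="int32")  # a batch of 2 padded sequences
  print(model(dummy).shape)                         # (2, 1): one sigmoid score per sample
  # Internally: Embedding -> (2, max_length, 128), LSTM -> (2, 64), Dense -> (2, 1)
  ```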
* **Train/Test Split:** The `train_test_split` function divides the data into training and testing sets, allowing you to evaluate the model's performance on unseen data. `random_state=42` ensures consistent results across multiple runs.
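
  For a dataset this small (or an imbalanced one), a stratified split keeps the class ratio the same in both subsets; this is an optional variation, not what the listing above does:

  ```python
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=42, stratify=y  # preserve the positive/negative ratio
  )
  ```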
* **Model Compilation:** The model is compiled with the `adam` optimizer, `binary_crossentropy` loss function (suitable for binary classification), and `accuracy` metric.
* **Model Training:** The model is trained using the `fit` method. `epochs=10` specifies the number of training epochs, and `batch_size=32` controls the number of samples processed in each batch. `validation_data` allows you to monitor the model's performance on the test set during training.
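
  With more data, longer training usually benefits from early stopping; the sketch below (optional, not part of the original listing) halts training when the validation loss stops improving and keeps the best weights seen so far:

  ```python
  from tensorflow.keras.callbacks import EarlyStopping

  early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
  model.fit(X_train, y_train, epochs=50, batch_size=32,
            validation_data=(X_test, y_test), callbacks=[early_stop])
  ```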
* **Model Evaluation:** The `evaluate` method calculates the loss and accuracy on the test set.
* **Prediction Function (`predict_sentiment`):** This function takes a text input, preprocesses it, converts it to a sequence, pads it, and uses the trained model to predict its sentiment. It includes clear output (Positive or Negative).
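
  If the raw confidence is also useful, a small variant (a sketch reusing the `tokenizer`, `max_length`, and `model` from above; the helper name is hypothetical) can return the sigmoid score alongside the label:

  ```python
  def predict_sentiment_with_score(text):
      """Like predict_sentiment, but also returns the raw sigmoid score."""
      processed = preprocess_text(text)
      padded = pad_sequences(tokenizer.texts_to_sequences([processed]), maxlen=max_length)
      score = float(model.predict(padded, verbose=0)[0][0])
      return ("Positive" if score >= 0.5 else "Negative"), score

  label, score = predict_sentiment_with_score("What a wonderful experience!")
  print(f"{label} (score={score:.2f})")
  ```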
* **Example Usage:** The code includes examples of how to use the `predict_sentiment` function to predict the sentiment of new text.
* **Clearer Variable Names:** More descriptive variable names have been used to improve readability.
* **NumPy Conversion:** Converts the labels to a NumPy array (`y = np.array(labels)`) for compatibility with TensorFlow/Keras.
* **Dropout Regularization:** Added `dropout` and `recurrent_dropout` to the LSTM layer to reduce overfitting, which is common with LSTMs.
* **Model Summary:** Includes `model.summary()` to print a summary of the model architecture, which is very helpful for debugging and understanding the model.
This example provides a complete, runnable, and well-explained demonstration of AI-powered sentiment analysis using Python, NLTK, and TensorFlow, and it is a solid foundation for further development. Note that the 12-sample dataset is only there to keep the script self-contained; meaningful accuracy requires a much larger labeled corpus. Remember to install the necessary libraries: `pip install nltk numpy scikit-learn tensorflow`.