AI-Driven Real Estate Pricing with Python
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np
# --- 1. Data Preparation (Simulated Data) ---
# Let's create some sample real estate data. In a real-world scenario, this would
# come from a CSV file, database, or API. This simulated data includes features
# like square footage, number of bedrooms, number of bathrooms, location (represented
# by an index for simplicity), and price. The price is the target variable we want to predict.
def generate_simulated_data(n_samples=100):
    """Generates a DataFrame of simulated real estate data."""
    np.random.seed(42)  # For reproducibility
    data = {
        'SquareFootage': np.random.randint(800, 3000, n_samples),
        'Bedrooms': np.random.randint(1, 5, n_samples),
        'Bathrooms': np.random.randint(1, 4, n_samples),
        'LocationIndex': np.random.randint(1, 6, n_samples),  # Simplified location
        'Price': []  # Price will be calculated from the other features
    }
    # Calculate price from the other features (with some random noise)
    for i in range(n_samples):
        price = (
            50 * data['SquareFootage'][i] +
            15000 * data['Bedrooms'][i] +
            25000 * data['Bathrooms'][i] -
            (data['LocationIndex'][i] - 3) * 10000 +  # Lower index = better location (raises price)
            np.random.normal(0, 20000)  # Random noise
        )
        data['Price'].append(max(50000, int(price)))  # Floor the price at $50,000
    return pd.DataFrame(data)
data = generate_simulated_data()
print("First 5 rows of simulated data:")
print(data.head())
print("\nData Description:")
print(data.describe())
# --- 2. Feature Engineering (Optional but often helpful) ---
# Feature engineering is the process of creating new features from existing ones
# to potentially improve the performance of the model. Here we might create
# location dummies (one-hot encoding) to represent each location as a separate binary column.
# Alternatively, since LocationIndex is ordinal in our example, we can treat it as continuous.
# We'll stick to using the index directly for simplicity, but one-hot encoding is more
# appropriate for truly categorical location data (a runnable sketch follows this code block).
# One-hot encoding location:
# data = pd.get_dummies(data, columns=['LocationIndex'], prefix='Location')
# --- 3. Data Splitting ---
# Split the data into training and testing sets. The training set is used to train the model,
# and the testing set is used to evaluate its performance on unseen data. A common split is 80/20.
X = data[['SquareFootage', 'Bedrooms', 'Bathrooms', 'LocationIndex']] # Features
y = data['Price'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining set size:", len(X_train))
print("Testing set size:", len(X_test))
# --- 4. Model Selection and Training ---
# In this example, we'll use a simple Linear Regression model. Linear Regression assumes a linear
# relationship between the features and the target variable. More complex models (e.g., Random Forest,
# Gradient Boosting) can capture non-linear relationships but require more data and are more prone to overfitting.
# For this simple example, Linear Regression is a good starting point.
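# A non-linear alternative (sketch only, not used in this walkthrough): scikit-learn's
# RandomForestRegressor exposes the same fit/predict interface, so it could be
# swapped in here, e.g.:
#   from sklearn.ensemble import RandomForestRegressor
#   model = RandomForestRegressor(n_estimators=200, random_state=42)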
model = LinearRegression()
model.fit(X_train, y_train)
print("\nModel Coefficients:", model.coef_) # Coefficients for each feature
print("Model Intercept:", model.intercept_) # Intercept of the regression line
# --- 5. Model Evaluation ---
# Evaluate the model's performance on the testing set. Common metrics for regression problems include:
# - Mean Squared Error (MSE): The average squared difference between the predicted and actual values.
# - R-squared (R2): A measure of how well the model explains the variance in the data. It is at
#   most 1 (higher is better) and can be negative if the model fits worse than predicting the mean.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("\nMean Squared Error:", mse)
print("R-squared:", r2)
# --- 6. Visualization (Optional) ---
# Visualize the model's predictions against the actual values to spot where the model
# performs well or poorly. With multiple features the full model is hard to visualize,
# so we plot predicted vs. actual prices instead.
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7)  # Plotting actual vs. predicted prices
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs. Predicted Real Estate Prices")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red') # Adding a diagonal line for perfect prediction
plt.show()
# Example of plotting actual vs predicted values for one feature
# In this case, we plot the predicted prices against the Square Footage:
plt.figure(figsize=(8, 6))
plt.scatter(X_test['SquareFootage'], y_test, label="Actual Prices", alpha=0.5)
plt.scatter(X_test['SquareFootage'], y_pred, label="Predicted Prices", alpha=0.5)
plt.xlabel("Square Footage")
plt.ylabel("Price")
plt.title("Square Footage vs. Price")
plt.legend()
plt.show()
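# (Optional) A residual plot is another common diagnostic: roughly symmetric
# scatter around zero suggests the linear fit is reasonable.
# plt.scatter(y_pred, y_test - y_pred, alpha=0.7)
# plt.axhline(0, color='red')
# plt.xlabel("Predicted Price")
# plt.ylabel("Residual (Actual - Predicted)")
# plt.show()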
# --- 7. Prediction with New Data ---
# Use the trained model to predict the price of a new property.
def predict_price(square_footage, bedrooms, bathrooms, location_index):
    """Predicts the price of a property based on its features."""
    input_data = pd.DataFrame({
        'SquareFootage': [square_footage],
        'Bedrooms': [bedrooms],
        'Bathrooms': [bathrooms],
        'LocationIndex': [location_index]
    })
    predicted_price = model.predict(input_data)[0]
    return predicted_price
new_property_footage = 1500
new_property_bedrooms = 3
new_property_bathrooms = 2
new_property_location = 3
predicted_price = predict_price(new_property_footage, new_property_bedrooms, new_property_bathrooms, new_property_location)
print("\nPredicted price for a property with:")
print(f" - Square Footage: {new_property_footage}")
print(f" - Bedrooms: {new_property_bedrooms}")
print(f" - Bathrooms: {new_property_bathrooms}")
print(f" - Location Index: {new_property_location}")
print(f" - Predicted Price: ${predicted_price:,.2f}")
```
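For truly categorical location data, the one-hot-encoding route sketched in the comments can be made concrete. Below is a minimal sketch that reuses the `data` DataFrame from the script above and mirrors its 80/20 split and `random_state`; whether it actually helps here depends on whether location really behaves ordinally:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# One-hot encode LocationIndex into binary Location_1 ... Location_5 columns.
encoded = pd.get_dummies(data, columns=['LocationIndex'], prefix='Location')
X_enc = encoded.drop(columns=['Price'])
y_enc = encoded['Price']

X_tr, X_te, y_tr, y_te = train_test_split(X_enc, y_enc, test_size=0.2, random_state=42)
ohe_model = LinearRegression().fit(X_tr, y_tr)
print("R-squared with one-hot locations:", r2_score(y_te, ohe_model.predict(X_te)))
```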
Key features and explanations:
* **Clear Structure:** The code is divided into logical sections (Data Preparation, Feature Engineering, etc.) with comments explaining each step. This makes the code easier to read and understand.
* **Simulated Data Generation:** The `generate_simulated_data` function creates a DataFrame with realistic (though simplified) real estate features and calculates the price from those features plus random noise, so the example is self-contained without an external dataset. The `np.random.seed(42)` line ensures the same data is generated each time, making the results reproducible, and the function returns a pandas DataFrame, the standard structure for tabular data in Python's data science ecosystem.
* **Feature Engineering (Explanation):** The code comments explain the concept of feature engineering and how one-hot encoding could be used for location data. It clarifies why we're not using one-hot encoding in this *simplified* example.
* **Data Splitting:** The code explicitly splits the data into training and testing sets using `train_test_split`, which is essential for evaluating the model's generalization performance (see the cross-validation sketch below).
* **Model Training:** The code trains a Linear Regression model using the `fit` method.
* **Model Evaluation:** The code calculates and prints the Mean Squared Error (MSE) and R-squared (R2) on the testing set. These metrics provide insights into how well the model is performing.
* **Visualization:** Two visualization examples are provided:
  * A scatter plot of actual vs. predicted prices, with a diagonal line representing perfect prediction, to visually assess how well the model's predictions align with the actual prices.
  * A scatter plot of one feature (Square Footage) vs. price, with both actual and predicted values, to show the relationship between a single feature and the target variable.
* **Prediction with New Data:** A `predict_price` function is provided that takes the features of a new property as input and uses the trained model to predict its price. It then prints the predicted price in a user-friendly format.
* **Comments and Explanations:** Extensive comments are added throughout the code to explain the purpose of each line or block of code.
* **Price Floor:** The data generator uses `max(50000, int(price))` so simulated prices never drop below $50,000, keeping the data plausible even when the random noise is strongly negative.
* **Imports:** All necessary libraries are imported at the beginning of the script.
* **Readability:** Code is well-formatted and uses descriptive variable names.
* **Reproducibility:** The random seed is fixed for both data generation and `train_test_split`.
* **Pandas DataFrames:** The code makes effective use of Pandas DataFrames for data storage and manipulation. This is crucial for real-world data science tasks.
* **Conciseness:** The code is concise and avoids unnecessary complexity while still demonstrating the key steps of AI-driven real estate pricing.
Altogether, this is a complete and practical example of AI-driven real estate pricing with Python and `scikit-learn`. It is runnable, well documented, and covers every step from data preparation to model evaluation and prediction, while explaining the underlying concepts and assumptions.
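As a complement to the single train/test split, k-fold cross-validation gives a more stable estimate of generalization performance. Below is a minimal sketch that reuses the `X` and `y` defined in the script above; the 5-fold setup is an illustrative choice, not part of the original example:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R-squared: each fold is held out once for evaluation,
# and the model is retrained on the remaining folds each time.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(f"Cross-validated R-squared: {scores.mean():.3f} +/- {scores.std():.3f}")
```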