pandas

pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library built on top of the Python programming language. It is particularly well-suited for working with tabular data (like spreadsheets or SQL tables) and time series data.

The two primary data structures in pandas are:

1. Series: A one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). It can be thought of as a single column of a spreadsheet or a SQL table.
2. DataFrame: A two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects. It is the most commonly used pandas object and is fundamental for data manipulation and analysis.
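
As a minimal sketch of how the two structures relate, the snippet below builds a DataFrame from a dictionary of Series and then pulls one column back out as a Series. The names and values (`ages`, `cities`, `people`) are made up for illustration.

```python
import pandas as pd

labels = ["alice", "bob", "charlie"]

# A Series: one-dimensional values, each attached to an index label.
ages = pd.Series([24, 27, 22], index=labels, name="age")
cities = pd.Series(["New York", "Chicago", "Miami"], index=labels, name="city")

# A DataFrame built from a dictionary of Series; rows are aligned on the shared index.
people = pd.DataFrame({"age": ages, "city": cities})
print(people)

# Selecting a single column of a DataFrame gives back a Series.
age_column = people["age"]
print(type(age_column).__name__)

# Label-based lookup of a single cell: row "bob", column "city".
bob_city = people.loc["bob", "city"]
print(bob_city)
```

Because both Series share the same index, pandas lines the rows up by label rather than by position.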

pandas provides a wide array of functionalities, including:

- Data Cleaning & Preparation: Easily handle missing data, merge and join datasets, reshape data, and perform data type conversions.
- Data Exploration & Analysis: Filter and select data, perform aggregations (grouping, summing, averaging), apply functions, and calculate descriptive statistics.
- Data Input/Output: Read and write data from various formats such as CSV, Excel, SQL databases, JSON, HDF5, and more.
- Time Series Functionality: Robust tools for working with dates and times, including date range generation, frequency conversion, and window statistics.
- Integration: Seamlessly integrates with other Python libraries like NumPy (which it's built upon for high-performance numerical operations), Matplotlib (for visualization), and scikit-learn (for machine learning).
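
A few of the features listed above can be sketched in a handful of lines. The example below is illustrative only: the tables, column names, and dates are invented, and it covers merging, type conversion, time series resampling, rolling windows, and NumPy interoperability.

```python
import numpy as np
import pandas as pd

# --- Merging two tables on a shared key (made-up data) ---
employees = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "dept_id": [1, 2, 1]})
departments = pd.DataFrame({"dept_id": [1, 2], "dept": ["Research", "Sales"]})
merged = employees.merge(departments, on="dept_id", how="left")
print(merged)

# --- Data type conversion: strings to integers ---
raw = pd.Series(["1", "2", "3"])
as_int = raw.astype("int64")

# --- Time series: date range generation, frequency conversion, window statistics ---
days = pd.date_range("2024-01-01", periods=6, freq="D")
daily = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=days)
two_day_sums = daily.resample("2D").sum()      # downsample daily values into 2-day bins
rolling_mean = daily.rolling(window=3).mean()  # 3-observation moving average
print(two_day_sums)
print(rolling_mean)

# --- NumPy interoperability ---
as_array = daily.to_numpy()  # plain NumPy array of the values
roots = np.sqrt(daily)       # NumPy functions accept a Series and return a Series
print(roots.round(2))
```

The first two entries of `rolling_mean` are NaN because a full three-value window is not yet available at those positions.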

Its efficiency, broad feature set, and intuitive API have made pandas a standard tool for data scientists, analysts, and engineers, for tasks ranging from data wrangling to preparing data for statistical modeling.

Example Code

import pandas as pd
import numpy as np

# --- 1. Creating a Series ---
print("\n--- Creating a Series ---")
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

# --- 2. Creating a DataFrame ---
print("\n--- Creating a DataFrame from a dictionary ---")
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'],
    'Salary': [70000, 80000, 60000, 90000, np.nan]
}
df = pd.DataFrame(data)
print(df)

# --- 3. Basic DataFrame Operations ---
print("\n--- Displaying DataFrame Info ---")
df.info()  # info() prints its summary directly and returns None, so no print() is needed

print("\n--- Displaying Descriptive Statistics ---")
print(df.describe())

print("\n--- Selecting a single column ---")
print(df['Name'])

print("\n--- Selecting multiple columns ---")
print(df[['Name', 'City']])

print("\n--- Filtering rows based on a condition (Age > 25) ---")
filtered_df = df[df['Age'] > 25]
print(filtered_df)

print("\n--- Handling missing values (dropping rows with NaN) ---")
df_cleaned = df.dropna()
print(df_cleaned)

print("\n--- Filling missing values (filling NaN in Salary with mean) ---")
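# Pass a dict so that only the Salary column is filled, not every column containing NaN
df_filled = df.fillna({'Salary': df['Salary'].mean()})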
print(df_filled)

print("\n--- Grouping data (e.g., by City and calculating mean Age) ---")
# For this example, let's add another entry to make grouping more interesting
df_more_data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Age': [24, 27, 22, 32, 29, 27],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami', 'New York'],
    'Salary': [70000, 80000, 60000, 90000, np.nan, 75000]
})

grouped_by_city = df_more_data.groupby('City')['Age'].mean()
print("\nMean Age by City:")
print(grouped_by_city)

# --- 4. Reading data from a CSV file (hypothetical example) ---
# To run this, you'd need a 'sample.csv' file in the same directory
# Example 'sample.csv' content:
# Name,Age,City
# John,30,London
# Jane,25,Paris

try:
    df_from_csv = pd.read_csv('sample.csv')
    print("\n--- DataFrame from CSV ---")
    print(df_from_csv)
except FileNotFoundError:
    # The file is optional; skip this step when it is absent
    print("\n(sample.csv not found. Skipping CSV read example.)")
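
If no sample.csv file is at hand, the same read can be tried without touching the disk. The sketch below is one possible approach, not part of the original example: it feeds the sample content from the comments above to `pd.read_csv` through an in-memory `io.StringIO` buffer, then writes it back out with `to_csv`.

```python
import io

import pandas as pd

# The sample.csv content from the comments above, held in memory instead of on disk.
csv_text = "Name,Age,City\nJohn,30,London\nJane,25,Paris\n"

# read_csv accepts any file-like object, not only a path.
df_from_text = pd.read_csv(io.StringIO(csv_text))
print(df_from_text)
print(df_from_text.dtypes)  # Age is parsed as an integer column

# With no path given, to_csv returns the CSV as a string; index=False omits the row labels.
round_trip = df_from_text.to_csv(index=False)
print(round_trip)
```

Passing `index=False` keeps the output in the same three-column layout as the input, since the default integer index is not written.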

