Pandas Tutorial – DataFrame, Series & Data Analysis | ylearner

Pandas Series

A Series is a 1D labelled array — like a Python list with an index.

Python

import pandas as pd

# Create from list
scores = pd.Series([85, 92, 78, 95, 88], 
                   name="scores",
                   index=["Alice","Bob","Charlie","Diana","Eve"])

print(scores)
print(scores["Alice"])   # 85
print(scores.mean())     # 87.6
print(scores[scores > 90])

▶ Output

Alice      85
Bob        92
Charlie    78
Diana      95
Eve        88
Name: scores, dtype: int64
85
87.6
Bob      92
Diana    95
Name: scores, dtype: int64

Creating DataFrames

A DataFrame is a 2D table with labelled rows and columns.

Python

import pandas as pd

# From dict of lists
df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Charlie", "Diana"],
    "age":   [25, 30, 35, 28],
    "score": [88, 92, 78, 95],
    "city":  ["London", "Paris", "NYC", "London"]
})

print(df)
print(df.shape)    # (4, 4)
print(df.dtypes)

▶ Output

      name  age  score    city
0    Alice   25     88  London
1      Bob   30     92   Paris
2  Charlie   35     78     NYC
3    Diana   28     95  London
(4, 4)
name     object
age       int64
score     int64
city     object
dtype: object

Selecting Data

Use [], .loc[], and .iloc[] to access rows and columns.

Python

# Select column
print(df["name"])           # Series
print(df[["name","score"]]) # DataFrame

# Select rows by label (.loc)
print(df.loc[0])            # Row labelled 0 (the first row here)
print(df.loc[0:2, ["name","score"]])  # Labels 0 through 2 inclusive (3 rows), 2 cols

# Select rows by position (.iloc)
print(df.iloc[0:3, 0:2])   # First 3 rows, first 2 cols
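
The difference between the two indexers is easier to see when the row labels are not the default 0, 1, 2. The following is a minimal sketch, assuming a hypothetical frame `people` indexed by name (it is not the tutorial's `df`):

```python
import pandas as pd

# Hypothetical frame with string row labels, for illustration only
people = pd.DataFrame(
    {"age": [25, 30, 35, 28], "score": [88, 92, 78, 95]},
    index=["Alice", "Bob", "Charlie", "Diana"],
)

# .loc slices by label and includes the end label
by_label = people.loc["Alice":"Charlie", "score"]

# .iloc slices by position and excludes the end position
by_position = people.iloc[0:2, 1]

print(by_label.tolist())     # [88, 92, 78]
print(by_position.tolist())  # [88, 92]
```

With the default integer index the two can look interchangeable, which is why the inclusive/exclusive end point is easy to miss.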

Filtering Rows

Use boolean conditions to filter rows.

Python

# Filter
high_scorers = df[df["score"] >= 90]
print(high_scorers)

# Multiple conditions
london_high = df[(df["city"] == "London") & (df["score"] > 80)]
print(london_high[["name","score"]])

▶ Output

    name  age  score    city
1    Bob   30     92   Paris
3  Diana   28     95  London
    name  score
0  Alice     88
3  Diana     95
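
Conditions can also be combined with `|` (or), negated with `~`, or replaced by `isin` for membership tests. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `==` or `<`. A short sketch that rebuilds the same sample frame so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Charlie", "Diana"],
    "age":   [25, 30, 35, 28],
    "score": [88, 92, 78, 95],
    "city":  ["London", "Paris", "NYC", "London"],
})

# OR: rows in Paris, or with a score below 80
paris_or_low = df[(df["city"] == "Paris") | (df["score"] < 80)]

# NOT: everyone outside London
not_london = df[~(df["city"] == "London")]

# Membership test against several values at once
europe = df[df["city"].isin(["London", "Paris"])]

print(paris_or_low["name"].tolist())  # ['Bob', 'Charlie']
print(not_london["name"].tolist())    # ['Bob', 'Charlie']
print(europe["name"].tolist())        # ['Alice', 'Bob', 'Diana']
```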

GroupBy – Aggregation

Group data and compute aggregates — like SQL GROUP BY.

Python

# Average score by city
print(df.groupby("city")["score"].mean())

# Multiple aggregations
print(df.groupby("city").agg({"score": ["mean","max"], "age": "mean"}))

▶ Output

city
London    91.5
NYC       78.0
Paris     92.0
Name: score, dtype: float64
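
The dict form of `agg` produces two-level column headers. If flat column names are preferred, named aggregation is one alternative. A sketch on the same sample data; the output column names `mean_score`, `max_score`, and `mean_age` are arbitrary choices made for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Charlie", "Diana"],
    "age":   [25, 30, 35, 28],
    "score": [88, 92, 78, 95],
    "city":  ["London", "Paris", "NYC", "London"],
})

# Each keyword becomes an output column: new_name=(source_column, function)
summary = df.groupby("city").agg(
    mean_score=("score", "mean"),
    max_score=("score", "max"),
    mean_age=("age", "mean"),
)

print(summary)
```

The result has one row per city and three plainly named columns, which tends to be easier to export or merge than a multi-level header.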

Pandas: DataFrames, loc vs iloc, and the Copy Warning

Pandas builds on NumPy to give you labeled tables (DataFrame) and labeled columns (Series). As with NumPy, you normally work with vectorized operations over whole columns rather than writing a row-by-row Python loop, which is much slower.

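import pandas as pd
# "dept" and "salary" values are illustrative, added so the groupby line runs
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 25],
                   "dept": ["Eng", "Ops"], "salary": [100, 90]})

df["age"] * 2               # vectorized on the whole column
df[df["age"] > 26]          # boolean filter → rows where age > 26
df.groupby("dept")["salary"].mean()   # split-apply-combine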

loc vs iloc — label vs position

	`.loc`	`.iloc`
Selects by	label / condition	integer position
Example	`df.loc[df.age > 26, "name"]`	`df.iloc[0:2, 1]`

The SettingWithCopyWarning: chained indexing like df[df.age > 26]["age"] = 0 may edit a temporary copy rather than the real frame, so the change can be lost with only a warning to tell you. Always assign through a single .loc call: df.loc[df.age > 26, "age"] = 0.

Also watch for NaN (pandas' missing value). It propagates through elementwise arithmetic, so handle it with fillna/dropna before doing math on a column. Aggregations such as mean() skip NaN by default.
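
A small sketch of both points, assuming a three-row frame with one missing age (the extra row "Cy" is made up for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third age is missing
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [30.0, 25.0, np.nan]})

# Single .loc call: the row mask and the column are resolved together,
# so the assignment lands in df itself
df.loc[df["age"] > 26, "age"] = 0

# NaN propagates through elementwise arithmetic
doubled = df["age"] * 2

# Filling first (here with 0, an arbitrary choice) avoids the NaN result
filled = df["age"].fillna(0) * 2

print(df["age"].tolist())    # [0.0, 25.0, nan]
print(doubled.isna().sum())  # 1
print(filled.tolist())       # [0.0, 50.0, 0.0]
```

Note that the comparison `NaN > 26` is False, so the row with the missing age is left untouched by the assignment.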

🏋️ Practical Exercise

Manipulate tabular data:

Create a DataFrame from a dictionary of lists.
Select a single column and a subset of columns.
Filter rows where a numeric column exceeds a threshold.
Group by a category column and compute the mean of another column.

🔥 Challenge Exercise

Load a CSV of sales data into a DataFrame, clean it (handle missing values with fillna or dropna), add a computed column (e.g. revenue = price × quantity), then use groupby to report total revenue per region sorted descending. Finally, export the summary to a new CSV. Bonus: pivot the data with pivot_table to compare regions across months.
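
If you want a starting point, here is one possible outline of the workflow. It is only a sketch: the column names (`region`, `month`, `price`, `quantity`) and the inline data are assumptions, and `io.StringIO` stands in for a real file path so the example runs without any files.

```python
import io
import pandas as pd

# Hypothetical sales data; a real run would pass a file path to read_csv
csv_text = """region,month,price,quantity
North,Jan,10.0,3
South,Jan,20.0,2
North,Feb,10.0,5
South,Feb,20.0,
West,Jan,,4
"""
sales = pd.read_csv(io.StringIO(csv_text))

# Drop rows missing either input to the revenue calculation
clean = sales.dropna(subset=["price", "quantity"]).copy()
clean["revenue"] = clean["price"] * clean["quantity"]

# Total revenue per region, largest first
totals = clean.groupby("region")["revenue"].sum().sort_values(ascending=False)

# to_csv with no path returns the CSV text; pass a filename to write a file
summary_csv = totals.to_csv()

# Bonus: regions as rows, months as columns
pivot = clean.pivot_table(index="region", columns="month",
                          values="revenue", aggfunc="sum")

print(totals)
print(pivot)
```

Dropping incomplete rows is just one cleaning choice. Filling with a default via fillna is equally valid when the missing values can be reasonably imputed.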

📋 Summary

pandas is the standard Python library for tabular data analysis.
A Series is a 1D labeled array; a DataFrame is a 2D labeled table.
Select data with column names, loc (labels), and iloc (positions).
Filter rows with boolean conditions: df[df["age"] > 18].
groupby splits data into groups and aggregates them (sum, mean, count).
Handle missing values with dropna/fillna; read/write CSV with read_csv/to_csv.

Interview Questions on pandas

What is pandas and what are its two core data structures?
What is the difference between a Series and a DataFrame?
What is the difference between loc and iloc?
How do you filter rows based on a condition?
What does groupby do?
How do you handle missing data in pandas?
How do you read and write CSV files with pandas?

FAQ

What is the difference between a Series and a DataFrame?

A Series is a single labeled column of data (1D). A DataFrame is a table of multiple columns (2D), where each column is a Series sharing a common index. Most analysis works on DataFrames.
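
A quick way to check this relationship, using a two-row frame made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [88, 92]})

# Single brackets return one column as a Series
col = df["score"]

# Double brackets return a one-column DataFrame
sub = df[["score"]]

print(type(col).__name__)          # Series
print(col.index.equals(df.index))  # True
print(sub.shape)                   # (2, 1)
```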

What is the difference between loc and iloc?

loc selects by labels (row/column names), while iloc selects by integer position. For example, in a frame with the default integer index whose first column is "name", df.loc[0, "name"] and df.iloc[0, 0] return the same cell.

How do I handle missing values?

Detect them with isna(), drop them with dropna(), or fill them with fillna(value). Which to use depends on whether missing rows can be discarded or should be imputed.
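
The three calls side by side, on a small frame invented for the example. Filling with the column mean is just one possible imputation strategy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"],
                   "score": [88.0, np.nan, 78.0]})

# Count missing values per column
missing_per_column = df.isna().sum()

# Discard rows where score is missing
dropped = df.dropna(subset=["score"])

# Impute instead: fill score with the mean of the observed values
imputed = df.fillna({"score": df["score"].mean()})

print(missing_per_column["score"])  # 1
print(len(dropped))                 # 2
print(imputed["score"].tolist())    # [88.0, 83.0, 78.0]
```

None of these calls modifies `df` in place. Each returns a new object.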

Is pandas suitable for very large datasets?

pandas works in memory, so the practical limit depends on available RAM; datasets up to a few gigabytes are usually comfortable on a typical machine. For larger-than-memory data, consider chunked reading, Dask, Polars, or a database.
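
Chunked reading can be sketched as follows. The data here is tiny and generated inline purely for illustration; with a real file you would pass its path to `read_csv` and choose a much larger `chunksize`:

```python
import io
import pandas as pd

# Stand-in for a large file: ten rows alternating between two regions
csv_text = "region,amount\n" + "\n".join(
    f"{'North' if i % 2 == 0 else 'South'},{i}" for i in range(10)
)

# With chunksize set, read_csv yields DataFrames of at most that many rows
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    partials.append(chunk.groupby("region")["amount"].sum())

# Combine the per-chunk sums into one total per region
totals = pd.concat(partials).groupby(level=0).sum()

print(totals.to_dict())  # {'North': 20, 'South': 25}
```

This pattern works for aggregations that can be combined from partial results, such as sums and counts. A mean would need the sums and counts tracked separately.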
