📊 Data Science

NumPy Tutorial – Arrays, Operations & Data Science Basics

NumPy (Numerical Python) is the bedrock of the Python data science ecosystem. Pandas and Scikit-learn build directly on NumPy's ndarray, and TensorFlow and PyTorch are designed to interoperate with it. Understanding NumPy deeply means understanding why data science in Python is so fast and expressive. In this lesson you will go from installation to broadcasting and vectorized operations.

⏱️ 30 min read 🎯 Advanced 📅 Updated 2026

What is NumPy?

NumPy provides a fixed-type, multi-dimensional array object, the ndarray, and a large library of mathematical functions to operate on it. The key advantage over Python lists is speed: NumPy operations are implemented in C and operate on contiguous memory blocks, which often makes them 10–100x faster than equivalent Python loops.

Feature      | Python List               | NumPy ndarray
Type         | Heterogeneous (any type)  | Homogeneous (one dtype)
Memory       | Scattered (pointers)      | Contiguous block
Speed        | Slow (Python loops)       | Fast (C vectorized)
Dimensions   | Nested lists (ugly)       | Native N-D arrays
Math ops     | Must loop manually        | Element-wise by default
Memory usage | Higher                    | Lower (no boxing)
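
The memory rows of the comparison can be checked directly. The following is a minimal sketch, assuming CPython (where every float is a separate object); the variable names are illustrative, and the exact list size varies by interpreter version.

```python
import sys
import numpy as np

n = 1000
py_list = [float(i) for i in range(n)]
arr = np.arange(n, dtype=np.float64)

# The list holds pointers; each float is a separate Python object elsewhere in memory.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# The ndarray stores raw 8-byte doubles in one contiguous buffer.
array_bytes = arr.nbytes   # 1000 elements * 8 bytes

print(list_bytes, array_bytes)
```

On a typical CPython build the list total comes out several times larger than the array's 8000 bytes, which is the "boxing" cost the table refers to.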

Installing NumPy

Shell
pip install numpy

# Verify
python -c "import numpy as np; print(np.__version__)"
# 1.26.x or 2.x

The conventional import alias is np. Virtually every NumPy tutorial and downstream library uses import numpy as np.

Creating Arrays

NumPy provides many ways to create arrays. The most common starting point is converting a Python list:

Python
import numpy as np

# From a Python list: 1D array
a = np.array([1, 2, 3, 4, 5])
print(a)          # [1 2 3 4 5]
print(type(a))    # <class 'numpy.ndarray'>
print(a.dtype)    # int64 (int32 on Windows with NumPy 1.x)

# From a nested list: 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(matrix.shape)   # (3, 3)
print(matrix.ndim)    # 2
print(matrix.size)    # 9

# Specify dtype
floats = np.array([1, 2, 3], dtype=np.float64)
print(floats)    # [1. 2. 3.]
print(floats.dtype)  # float64
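
Because an ndarray has a single dtype, NumPy has to pick one when the input is mixed, and converting between dtypes can change values. A short sketch of both behaviours (the array names are illustrative):

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])        # one float promotes the whole array
print(mixed.dtype)                   # float64

# astype returns a new array; float -> int truncates toward zero
truncated = np.array([1.9, -1.9, 2.5]).astype(np.int64)
print(truncated)                     # [ 1 -1  2]

# A smaller dtype saves memory but narrows the representable range
small = ints.astype(np.int8)
print(small.itemsize)                # 1
```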

Array Creation Functions

Python
import numpy as np

# zeros: filled with 0.0
z = np.zeros((3, 4))
print(z.shape)   # (3, 4)

# ones: filled with 1.0
o = np.ones((2, 3))
print(o)
# [[1. 1. 1.]
#  [1. 1. 1.]]

# full: filled with a specific value
f = np.full((2, 2), 7)
print(f)
# [[7 7]
#  [7 7]]

# eye: identity matrix
I = np.eye(4)      # 4×4 identity
print(I.diagonal())  # [1. 1. 1. 1.]

# arange: like Python range() but returns ndarray
r = np.arange(0, 20, 2)   # start, stop, step
print(r)   # [ 0  2  4  6  8 10 12 14 16 18]

# linspace: evenly spaced values between start and stop (inclusive)
ls = np.linspace(0, 1, 5)
print(ls)   # [0.   0.25 0.5  0.75 1.  ]

# Random arrays
np.random.seed(42)   # for reproducibility
rand_uniform = np.random.rand(3, 3)      # uniform [0, 1)
rand_normal  = np.random.randn(3, 3)     # standard normal (mean=0, std=1)
rand_int     = np.random.randint(1, 100, size=(3, 4))  # random ints
print(rand_int)
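
The functions above belong to NumPy's legacy random interface, which relies on a single global seed. NumPy also provides a Generator-based interface through np.random.default_rng. The sketch below shows rough equivalents of the three calls above; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)            # independent, seeded generator

uniform = rng.random((3, 3))               # uniform on [0, 1)
normal = rng.standard_normal((3, 3))       # standard normal (mean=0, std=1)
ints = rng.integers(1, 100, size=(3, 4))   # integers in [1, 100)

# The same seed reproduces the same stream
again = np.random.default_rng(42).random((3, 3))
print(np.array_equal(uniform, again))      # True
```

Passing a generator object around explicitly tends to make results easier to reproduce than relying on global state, at the cost of slightly more verbose code.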

Key Array Attributes

Python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.shape)    # (2, 3): 2 rows, 3 columns
print(a.ndim)     # 2:      number of dimensions
print(a.size)     # 6:      total number of elements
print(a.dtype)    # int64:  data type
print(a.itemsize) # 8:      bytes per element
print(a.nbytes)   # 48:     total bytes in memory
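
These attributes are related: nbytes is always size multiplied by itemsize, so the dtype directly determines the memory footprint. A small sketch with two illustrative arrays:

```python
import numpy as np

a64 = np.zeros((2, 3), dtype=np.float64)
a32 = a64.astype(np.float32)

for arr in (a64, a32):
    # Total bytes = number of elements * bytes per element
    assert arr.nbytes == arr.size * arr.itemsize
    print(arr.dtype, arr.shape, arr.itemsize, arr.nbytes)
# float64 (2, 3) 8 48
# float32 (2, 3) 4 24
```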

Array Indexing and Slicing

NumPy indexing works like Python lists for 1D arrays, and extends naturally to multiple dimensions:

Python
import numpy as np

# 1D indexing
a = np.array([10, 20, 30, 40, 50])
print(a[0])      # 10
print(a[-1])     # 50
print(a[1:4])    # [20 30 40]
print(a[::2])    # [10 30 50]: every other element

# 2D indexing: [row, col]
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(m[0, 0])   # 1
print(m[1, 2])   # 6
print(m[2, :])   # [7 8 9]: entire row 2
print(m[:, 1])   # [2 5 8]: entire column 1
print(m[0:2, 1:3])  # [[2 3] [5 6]]: submatrix

# Fancy indexing: using arrays of indices
idx = np.array([0, 2])
print(m[idx])    # [[1 2 3] [7 8 9]]: rows 0 and 2

# Boolean (mask) indexing: extremely common in data work
data = np.array([15, 3, 42, 7, 28, 11])
mask = data > 10
print(mask)        # [ True False  True False  True  True]
print(data[mask])  # [15 42 28 11]: only elements > 10
print(data[data % 2 == 0])  # [42 28]: even numbers only
⚠️
Slices are Views, Not Copies

Basic slices return a view of the original array for efficiency, so modifying the slice modifies the original. Use .copy() if you need an independent copy: b = a[1:3].copy(). Fancy and boolean indexing, by contrast, return copies.
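
The difference is easy to observe. In this sketch (names are illustrative), np.shares_memory reports whether two arrays overlap in memory:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

view = a[1:4]              # basic slice: shares memory with a
view[0] = 99
print(a)                   # [10 99 30 40 50]

independent = a[1:4].copy()
independent[0] = -1
print(a)                   # unchanged: [10 99 30 40 50]

picked = a[[0, 2]]         # fancy indexing returns a copy
picked[0] = 0
print(a[0])                # 10

print(np.shares_memory(a, view), np.shares_memory(a, independent))  # True False
```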


Array Operations

NumPy operations are element-wise by default: they apply to every element in a single call, with the loop running in compiled code rather than in Python:

Python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Arithmetic: all element-wise
print(a + b)    # [11 22 33 44]
print(a - b)    # [-9 -18 -27 -36]
print(a * b)    # [10 40 90 160]
print(b / a)    # [10. 10. 10. 10.]
print(a ** 2)   # [1 4 9 16]
print(b % 3)    # [1 2 0 1]

# Scalar operations: the scalar is broadcast to all elements
print(a * 5)    # [5 10 15 20]
print(a + 100)  # [101 102 103 104]

# Comparison: returns boolean array
print(a > 2)    # [False False  True  True]
print(a == b)   # [False False False False]

# Universal functions (ufuncs): fast element-wise math
# (values below are rounded to 3 decimals)
print(np.sqrt(a))       # [1. 1.414 1.732 2.]
print(np.exp(a))        # [e^1 e^2 e^3 e^4]
print(np.log(b))        # [2.303 2.996 3.401 3.689]
print(np.abs(np.array([-3, -1, 2, 4])))   # [3 1 2 4]
print(np.sin(np.linspace(0, np.pi, 5)))   # approx. [0. 0.707 1. 0.707 0.]
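
To make "element-wise without a Python loop" concrete, the sketch below computes the same square roots with an explicit loop and with a ufunc, then shows how integer arrays behave under the two division operators. The variable names are illustrative.

```python
import math
import numpy as np

a = np.array([1, 2, 3, 4])

looped = np.array([math.sqrt(x) for x in a])   # one Python-level call per element
vectorized = np.sqrt(a)                        # one call; the loop runs in C

print(np.allclose(looped, vectorized))   # True

# True division always yields floats; floor division keeps integers
print((a / 2).dtype)        # float64
print(a // 2)               # [0 1 1 2]
```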

Matrix Multiplication and Dot Product

Python
import numpy as np

# Dot product of 1D vectors
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b))    # 1*4 + 2*5 + 3*6 = 32
print(a @ b)           # Same with @ operator (Python 3.5+)

# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A @ B)           # Matrix product (not element-wise!)
# [[19 22]
#  [43 50]]

print(A * B)           # Element-wise product (Hadamard)
# [[ 5 12]
#  [21 32]]

# Transpose
print(A.T)
# [[1 3]
#  [2 4]]

print(A.T @ A)         # A^T A, common in linear regression
# [[ 10 14]
#  [ 14 20]]
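
The A^T A product appears in linear regression through the normal equations, (X^T X) beta = X^T y. The following is a minimal sketch on hypothetical toy data (y = 1 + 2x exactly); for real data np.linalg.lstsq is usually the more numerically robust choice.

```python
import numpy as np

# Hypothetical toy data: y = 1 + 2x with no noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])   # shape (4, 2)

# Normal equations: (X^T X) beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta.round(6))   # [1. 2.]

# np.linalg.lstsq solves the same least-squares problem
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))   # True
```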

Reshape and Transpose

Python
import numpy as np

a = np.arange(12)
print(a)         # [ 0  1  2  3  4  5  6  7  8  9 10 11]

# reshape: change shape without changing data
m = a.reshape(3, 4)   # 3 rows, 4 columns
print(m)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# Use -1 to let NumPy infer one dimension
m2 = a.reshape(4, -1)   # 4 rows, NumPy figures out 3 columns
print(m2.shape)  # (4, 3)

# Flatten: always returns a copy
flat = m.flatten()
print(flat)   # [ 0  1  2  3  4  5  6  7  8  9 10 11]

# ravel: returns a view if possible (faster)
flat_view = m.ravel()

# Add dimensions with np.newaxis
col = a[:, np.newaxis]    # shape (12,) → (12, 1)
row = a[np.newaxis, :]    # shape (12,) → (1, 12)
print(col.shape, row.shape)  # (12, 1) (1, 12)
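
Whether reshape, ravel, and flatten return views or copies can be verified with np.shares_memory. A short sketch, reusing the same 12-element array:

```python
import numpy as np

a = np.arange(12)
m = a.reshape(3, 4)

print(np.shares_memory(a, m))             # True: reshaping a contiguous array gives a view
print(np.shares_memory(m, m.ravel()))     # True: ravel returns a view here
print(np.shares_memory(m, m.flatten()))   # False: flatten always copies

# A transposed array is not C-contiguous, so ravel has to copy
print(np.shares_memory(m, m.T.ravel()))   # False

# The new shape must hold exactly the same number of elements
try:
    a.reshape(5, 3)
except ValueError as exc:
    print("reshape failed:", exc)
```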

Broadcasting

Broadcasting is NumPy's powerful mechanism for performing operations on arrays of different shapes without copying data. It works by "stretching" the smaller array along dimensions of size 1.

Python
import numpy as np

# Scalar broadcast
a = np.array([1, 2, 3])
print(a + 10)    # [11 12 13]; 10 is broadcast to match a's shape

# 1D + 2D broadcasting
matrix = np.ones((3, 3))    # shape (3, 3)
row    = np.array([1, 2, 3])  # shape (3,), treated as (1, 3)
print(matrix + row)
# [[2. 3. 4.]
#  [2. 3. 4.]
#  [2. 3. 4.]]

# Column vector broadcast
col = np.array([[10], [20], [30]])   # shape (3, 1)
print(matrix + col)
# [[11. 11. 11.]
#  [21. 21. 21.]
#  [31. 31. 31.]]

# Practical: mean-center each column of a dataset
data = np.random.randn(100, 5)   # 100 samples, 5 features
column_means = data.mean(axis=0)  # shape (5,)
centered = data - column_means    # broadcasts: (100,5) - (5,) → (100,5)
print(centered.mean(axis=0).round(10))  # near-zero column means
💡
Broadcasting Rules

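NumPy aligns shapes from the right. Two dimensions are compatible if they are equal or if one of them is 1; otherwise the operation raises a ValueError. Missing leading dimensions are treated as 1, so a shape (3,) is treated as (1, 3) when broadcast against a 2D array.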
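
The rules can be explored without doing any arithmetic. This sketch assumes NumPy 1.20 or later, where np.broadcast_shapes is available, and also shows what happens when shapes are incompatible:

```python
import numpy as np

print(np.broadcast_shapes((3, 3), (3,)))      # (3, 3)
print(np.broadcast_shapes((3, 1), (1, 4)))    # (3, 4)
print(np.broadcast_shapes((100, 5), (5,)))    # (100, 5)

# (3, 4) vs (3,): the trailing dimensions 4 and 3 differ and neither is 1
try:
    np.ones((3, 4)) + np.ones(3)
except ValueError as exc:
    print("cannot broadcast:", exc)

# Turning the 1D array into a column of shape (3, 1) makes the shapes compatible
fixed = np.ones((3, 4)) + np.ones(3)[:, np.newaxis]
print(fixed.shape)   # (3, 4)
```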

Aggregate Functions

Python
import numpy as np

data = np.array([[4, 7, 2],
                 [1, 8, 5],
                 [9, 3, 6]])

# Global aggregates: across all elements
print(data.sum())     # 45
print(data.mean())    # 5.0
print(data.max())     # 9
print(data.min())     # 1
print(data.std())     # 2.581...
print(data.var())     # 6.666...
print(np.median(data))  # 5.0

# Axis-wise aggregates
# axis=0 → collapse rows (one result per column)
print(data.sum(axis=0))    # [14 18 13]: column sums
print(data.max(axis=0))    # [9 8 6]:    column maxima

# axis=1 → collapse columns (one result per row)
print(data.sum(axis=1))    # [13 14 18]: row sums
print(data.mean(axis=1))   # [4.33 4.67 6.0]

# Index of min/max
print(np.argmax(data))          # 6 (flat index of 9)
print(np.argmax(data, axis=0))  # [2 1 2]: row index of max per column
print(np.argmin(data, axis=1))  # [2 0 1]: col index of min per row

# Cumulative operations
print(np.cumsum(np.array([1, 2, 3, 4])))   # [1 3 6 10]
print(np.cumprod(np.array([1, 2, 3, 4])))  # [1 2 6 24]
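
Axis-wise aggregates drop the collapsed dimension, which matters when the result is broadcast back against the original array. The keepdims=True argument retains it as a size-1 dimension. A sketch using the same data as above, cast to float:

```python
import numpy as np

data = np.array([[4, 7, 2],
                 [1, 8, 5],
                 [9, 3, 6]], dtype=np.float64)

row_means = data.mean(axis=1)                      # shape (3,)
row_means_kd = data.mean(axis=1, keepdims=True)    # shape (3, 1)
print(row_means.shape, row_means_kd.shape)         # (3,) (3, 1)

# (3, 3) - (3, 1) subtracts each row's mean from that row
row_centered = data - row_means_kd
print(np.allclose(row_centered.mean(axis=1), 0))   # True
```

Subtracting the shape (3,) version instead would also run on this square array, but it would align the means with columns rather than rows and silently give the wrong result.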

NumPy vs Python Lists – Speed Comparison

Python
import numpy as np
import time

n = 1_000_000

# Python list: square each element
py_list = list(range(n))

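start = time.perf_counter()   # perf_counter is the higher-resolution clock for timing
result = [x ** 2 for x in py_list]
py_time = time.perf_counter() - start
print(f"Python list: {py_time * 1000:.1f} ms")

# NumPy array: square each element
np_array = np.arange(n, dtype=np.int64)   # explicit int64 avoids overflow where the default int is 32-bit

start = time.perf_counter()
result = np_array ** 2
np_time = time.perf_counter() - start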
print(f"NumPy array: {np_time * 1000:.1f} ms")

print(f"NumPy is {py_time / np_time:.0f}x faster")
▶ Typical Output
Python list: 187.3 ms
NumPy array: 2.1 ms
NumPy is 89x faster
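
A single timed run is noisy. One common alternative is the standard-library timeit module, taking the best of several repeats. The sketch below uses a smaller n and hypothetical helper functions; the measured times depend on the machine, so none are shown.

```python
import timeit
import numpy as np

n = 100_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

def square_list():
    return [x ** 2 for x in py_list]

def square_array():
    return np_array ** 2

# Best of several repeats reduces noise from other processes
list_time = min(timeit.repeat(square_list, number=5, repeat=3)) / 5
array_time = min(timeit.repeat(square_array, number=5, repeat=3)) / 5
print(f"list:  {list_time * 1000:.2f} ms per call")
print(f"array: {array_time * 1000:.2f} ms per call")

# Both approaches compute the same values
same = np.array_equal(np.array(square_list()), square_array())
print(same)   # True
```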

Common Data Science Operations

Python
import numpy as np

# Normalise data to [0, 1]
data = np.array([10, 25, 5, 50, 30])
normalised = (data - data.min()) / (data.max() - data.min())
print(normalised.round(3))   # [0.111 0.444 0.    1.    0.556]

# Standardise data (z-score)
standardised = (data - data.mean()) / data.std()
print(standardised.round(3)) # [-0.878  0.063 -1.192  1.631  0.376]

# Count elements satisfying a condition
print((data > 20).sum())     # 3

# Unique values and counts
arr = np.array([1, 2, 2, 3, 3, 3])
values, counts = np.unique(arr, return_counts=True)
print(values)   # [1 2 3]
print(counts)   # [1 2 3]

# Sort
arr2 = np.array([5, 1, 4, 2, 3])
print(np.sort(arr2))          # [1 2 3 4 5]
print(np.argsort(arr2))       # [1 3 4 2 0]: indices that would sort the array

# Clip values to a range
raw = np.array([-5, 0, 3, 10, 20])
clipped = np.clip(raw, 0, 10)
print(clipped)   # [0 0 3 10 10]

# Where: conditional selection
a = np.array([1, 2, 3, 4, 5])
result = np.where(a > 3, "big", "small")
print(result)    # ['small' 'small' 'small' 'big' 'big']
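
For a 2D dataset, standardisation is usually applied per column. The helper below is a hypothetical sketch, not part of NumPy: it z-scores each column and guards against constant columns, whose standard deviation is zero. The ddof parameter is passed through to np.std (ddof=1 gives the sample standard deviation, which is the Pandas default).

```python
import numpy as np

def standardise(X, ddof=0):
    """Z-score each column of a 2D array (hypothetical helper)."""
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=ddof)
    # Constant columns have std 0; divide by 1 instead so they become all zeros
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std

X = np.array([[10.0, 1.0, 7.0],
              [25.0, 1.0, 3.0],
              [ 5.0, 1.0, 8.0]])
Z = standardise(X)
print(np.allclose(Z.mean(axis=0), 0))   # True
print(Z.std(axis=0).round(10))          # [1. 0. 1.]
```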

Stacking and Splitting

Python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Stack vertically (row-wise)
v = np.vstack([a, b])
print(v.shape)   # (4, 2)

# Stack horizontally (column-wise)
h = np.hstack([a, b])
print(h.shape)   # (2, 4)

# Concatenate along an axis
c = np.concatenate([a, b], axis=0)   # same as vstack
print(c.shape)   # (4, 2)

# Split
arr = np.arange(12).reshape(4, 3)
top, bottom = np.vsplit(arr, 2)   # split into 2 equal halves
print(top.shape, bottom.shape)    # (2, 3) (2, 3)

left, middle, right = np.hsplit(arr, 3)   # split into 3 columns
print(left.shape)   # (4, 1)
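
Two related functions are worth knowing alongside these: np.stack, which joins arrays along a new axis, and np.array_split, which tolerates sizes that do not divide evenly. A short sketch:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# stack joins along a NEW axis, unlike vstack/hstack/concatenate
s = np.stack([a, b])
print(s.shape)   # (2, 2, 2)

# array_split allows uneven pieces
parts = np.array_split(np.arange(10), 3)
print([p.tolist() for p in parts])   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# np.split with an integer requires an even division
try:
    np.split(np.arange(10), 3)
except ValueError as exc:
    print("split failed:", exc)
```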

πŸ‹οΈ Practical Exercises

  1. Create a 5×5 array of random integers between 1 and 100. Find the max of each row, min of each column, and the sum of the diagonal.
  2. Generate an array of 1000 normally-distributed random numbers. Calculate the percentage of values within 1, 2, and 3 standard deviations of the mean (should be ~68%, ~95%, ~99.7%).
  3. Implement matrix multiplication from scratch using NumPy slicing and verify with @ operator.
  4. Reshape a 1D array of 24 elements into a 2×3×4 three-dimensional array and access specific elements using multi-dimensional indexing.

🔥 Challenge: Image as an Array

A grayscale image is just a 2D NumPy array where each value is a pixel intensity (0–255). Load an image using matplotlib.pyplot.imread(), which returns a NumPy array (PNG files load as floats in [0, 1], so rescale the values or adjust the limits below accordingly). Perform: (1) flip it horizontally using slicing, (2) crop to the center 50%, (3) adjust brightness by multiplying all values by 1.2 and clipping to 255, (4) convert to black and white by thresholding at 128 using np.where(). Display each result using matplotlib.pyplot.imshow().

  • What is an ndarray and how does it differ from a Python list?
  • What does it mean for NumPy operations to be vectorized?
  • Explain NumPy broadcasting with an example.
  • What is the difference between np.dot(A, B) and A * B for 2D arrays?
  • What is the difference between .flatten() and .ravel()?
  • What does axis=0 vs axis=1 mean for aggregation functions?
  • How does boolean indexing work in NumPy?
  • What is the difference between a view and a copy in NumPy?
  • How would you normalise an array to have zero mean and unit variance?

📋 Summary

  • NumPy's ndarray is a fixed-type, contiguous array that is the foundation of Python data science.
  • Create arrays with np.array(), np.zeros(), np.ones(), np.arange(), np.linspace(), and np.random.
  • Key attributes: .shape, .ndim, .size, .dtype.
  • Indexing uses [row, col] syntax; slices return views (not copies).
  • All arithmetic operations are element-wise by default.
  • Use @ or np.dot() for matrix multiplication.
  • Broadcasting allows operations on arrays of different shapes by "stretching" size-1 dimensions.
  • Aggregate functions (sum, mean, max, etc.) accept an axis parameter.
  • NumPy is typically 10–100x faster than equivalent Python list operations.

Frequently Asked Questions

When should I use NumPy vs a Python list?

Use NumPy whenever you need to perform mathematical operations on large collections of numbers; it is often 10–100x faster for numerical work. Use plain lists for small collections of mixed-type objects or when you need to frequently append/remove elements (list append is amortized O(1); np.concatenate copies the whole array, so it is O(n)).
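
When values arrive one at a time, two patterns avoid repeated copying. The sketch below is illustrative rather than prescriptive: grow a list and convert once, or preallocate when the final size is known.

```python
import numpy as np

# Pattern 1: grow a Python list, convert once at the end
values = []
for i in range(1000):
    values.append(i * 0.5)
arr_from_list = np.array(values)

# Pattern 2: preallocate when the final size is known
arr_prealloc = np.empty(1000)
for i in range(1000):
    arr_prealloc[i] = i * 0.5

# Pattern to avoid: calling np.concatenate or np.append inside the loop,
# which copies the whole array on every iteration
print(np.array_equal(arr_from_list, arr_prealloc))   # True
```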

What is the difference between np.float64 and Python's float?

Python's float is a full Python object, so every value carries per-object overhead. In an ndarray, np.float64 values are stored as raw 64-bit IEEE 754 doubles in a contiguous C array, with no per-element object overhead. NumPy also offers float32 (half the memory, less precision), which is often used in deep learning.
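
The precision difference between float32 and float64 can be inspected with np.finfo. A brief sketch:

```python
import numpy as np

eps32 = np.finfo(np.float32).eps   # about 1.19e-07
eps64 = np.finfo(np.float64).eps   # about 2.22e-16
print(eps32, eps64)

# 16,777,217 = 2**24 + 1 is not representable in float32
print(np.float32(16_777_217) == np.float32(16_777_216))   # True
print(np.float64(16_777_217) == np.float64(16_777_216))   # False

# A standalone NumPy scalar is still an object; the savings come from array storage
x = np.float64(1.5)
print(isinstance(x, float))   # True: np.float64 subclasses Python's float
```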

What is the relationship between NumPy and Pandas?

A Pandas DataFrame is built on top of NumPy arrays. Each column in a DataFrame is typically backed by a NumPy array (extension types such as nullable or Arrow-backed columns are the exception). Pandas adds labels (index), heterogeneous column types, and a rich API for tabular data on top of NumPy's numerical core.

How do I handle NaN values in NumPy?

Use np.nan to represent missing values. Most aggregate functions have "NaN-safe" versions: np.nansum(), np.nanmean(), np.nanmax(), etc. Detect NaN with np.isnan(arr), which returns a boolean mask.
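
A short sketch of these functions on an illustrative array, including two common ways to deal with the missing entries (dropping them or filling them with the NaN-safe mean):

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

print(np.isnan(arr))       # [False  True False  True False]
print(arr.mean())          # nan: ordinary aggregates propagate NaN
print(np.nanmean(arr))     # 3.0
print(np.nansum(arr))      # 9.0

# NaN never equals itself, so == cannot be used to detect it
print(np.nan == np.nan)    # False

# Drop or fill missing values using the boolean mask
clean = arr[~np.isnan(arr)]
filled = np.where(np.isnan(arr), np.nanmean(arr), arr)
print(clean)               # [1. 3. 5.]
print(filled)              # [1. 3. 3. 3. 5.]
```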