What is NumPy?
NumPy provides a fixed-type, multi-dimensional array object (the ndarray) and a large library of mathematical functions that operate on it. The key advantage over Python lists is speed: NumPy operations are implemented in C and operate on contiguous memory blocks, which often makes them 10 to 100 times faster than equivalent Python loops, depending on the operation and the array size.
| Feature | Python List | NumPy ndarray |
|---|---|---|
| Type | Heterogeneous (any type) | Homogeneous (one dtype) |
| Memory | Scattered (pointers) | Contiguous block |
| Speed | Slow (Python loops) | Fast (C vectorized) |
| Dimensions | Nested lists (ugly) | Native N-D arrays |
| Math ops | Must loop manually | Element-wise by default |
| Memory usage | Higher | Lower (no boxing) |
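The memory and type rows of the table can be checked directly. The following is a minimal sketch rather than a rigorous measurement: sys.getsizeof reports CPython-specific object sizes, and the variable names are illustrative.

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)

# A list stores pointers to separately allocated int objects,
# so count both the list itself and the objects it points to.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = arr.nbytes  # 1000 elements * 8 bytes each

print(list_bytes > array_bytes)  # True on CPython
print(array_bytes)               # 8000

# Homogeneous dtype: mixing ints and floats upcasts everything to float64
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```

The exact list size varies between Python versions, so only the comparison is meaningful here, not the absolute number.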
Installing NumPy
pip install numpy
# Verify
python -c "import numpy as np; print(np.__version__)"
# 1.26.x or 2.x
The conventional import alias is np. Virtually every NumPy tutorial and library uses import numpy as np.
Creating Arrays
NumPy provides many ways to create arrays. The most common starting point is converting a Python list:
import numpy as np
# From a Python list -> 1D array
a = np.array([1, 2, 3, 4, 5])
print(a) # [1 2 3 4 5]
print(type(a)) # <class 'numpy.ndarray'>
print(a.dtype) # int64 (int32 on Windows with NumPy 1.x)
# From a nested list -> 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(matrix.shape) # (3, 3)
print(matrix.ndim) # 2
print(matrix.size) # 9
# Specify dtype
floats = np.array([1, 2, 3], dtype=np.float64)
print(floats) # [1. 2. 3.]
print(floats.dtype) # float64
Array Creation Functions
import numpy as np
# zeros: filled with 0.0
z = np.zeros((3, 4))
print(z.shape) # (3, 4)
# ones: filled with 1.0
o = np.ones((2, 3))
print(o)
# [[1. 1. 1.]
#  [1. 1. 1.]]
# full: filled with a specific value
f = np.full((2, 2), 7)
print(f)
# [[7 7]
#  [7 7]]
# eye: identity matrix
I = np.eye(4) # 4×4 identity
print(I.diagonal()) # [1. 1. 1. 1.]
# arange: like Python range() but returns ndarray
r = np.arange(0, 20, 2) # start, stop, step
print(r) # [ 0  2  4  6  8 10 12 14 16 18]
# linspace: evenly spaced values between start and stop (inclusive)
ls = np.linspace(0, 1, 5)
print(ls) # [0. 0.25 0.5 0.75 1. ]
# Random arrays
np.random.seed(42) # for reproducibility
rand_uniform = np.random.rand(3, 3) # uniform [0, 1)
rand_normal = np.random.randn(3, 3) # standard normal (mean=0, std=1)
rand_int = np.random.randint(1, 100, size=(3, 4)) # random ints
print(rand_int)
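The np.random.seed / np.random.rand functions above belong to NumPy's legacy random interface. NumPy's documentation recommends the Generator API for new code, so here is a hedged sketch of the same three arrays drawn that way. The seed value is arbitrary, and the numbers produced will not match those from the legacy functions.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded Generator; the seed value is arbitrary

rand_uniform = rng.random((3, 3))             # uniform on [0, 1)
rand_normal = rng.standard_normal((3, 3))     # standard normal (mean=0, std=1)
rand_int = rng.integers(1, 100, size=(3, 4))  # ints in [1, 100); upper bound is exclusive

print(rand_uniform.shape, rand_normal.shape, rand_int.shape)  # (3, 3) (3, 3) (3, 4)
```

A Generator keeps its state in the rng object instead of in a global, which makes it easier to keep separate parts of a program reproducible.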
Key Array Attributes
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape) # (2, 3): 2 rows, 3 columns
print(a.ndim) # 2: number of dimensions
print(a.size) # 6: total number of elements
print(a.dtype) # int64: data type
print(a.itemsize) # 8: bytes per element
print(a.nbytes) # 48: total bytes in memory
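These attributes are related: nbytes equals size times itemsize. A small sketch (array names are illustrative) shows how the choice of dtype changes the memory footprint while the shape stays the same.

```python
import numpy as np

a64 = np.zeros((2, 3), dtype=np.float64)
a32 = a64.astype(np.float32)  # same shape, smaller element type

for arr in (a64, a32):
    # nbytes is size * itemsize for an ndarray
    print(arr.dtype, arr.size, arr.itemsize, arr.nbytes)
# float64 6 8 48
# float32 6 4 24
```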
Array Indexing and Slicing
NumPy indexing works like Python lists for 1D arrays, and extends naturally to multiple dimensions:
import numpy as np
# 1D indexing
a = np.array([10, 20, 30, 40, 50])
print(a[0]) # 10
print(a[-1]) # 50
print(a[1:4]) # [20 30 40]
print(a[::2]) # [10 30 50]: every other element
# 2D indexing: [row, col]
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(m[0, 0]) # 1
print(m[1, 2]) # 6
print(m[2, :]) # [7 8 9]: entire row 2
print(m[:, 1]) # [2 5 8]: entire column 1
print(m[0:2, 1:3]) # [[2 3] [5 6]]: submatrix
# Fancy indexing: using arrays of indices
idx = np.array([0, 2])
print(m[idx]) # [[1 2 3] [7 8 9]]: rows 0 and 2
# Boolean (mask) indexing: extremely common in data work
data = np.array([15, 3, 42, 7, 28, 11])
mask = data > 10
print(mask) # [ True False  True False  True False]
print(data[mask]) # [15 42 28]: only elements > 10
print(data[data % 2 == 0]) # [42 28]: even numbers only
NumPy slices return a view of the original array for efficiency, so modifying the slice modifies the original. Use .copy() if you need an independent copy: b = a[1:3].copy(). Fancy and boolean indexing, by contrast, always return copies.
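The difference between a view and a copy is easiest to see by writing through each one. This is a small illustrative sketch; np.shares_memory is used only to confirm which results alias the original buffer.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

view = a[1:4]          # basic slice: a view onto a's memory
view[0] = 99           # writes through to a
print(a)               # [10 99 30 40 50]

copy = a[1:4].copy()   # independent copy
copy[0] = -1           # does not affect a
print(a[1])            # 99

fancy = a[[0, 2]]      # fancy indexing returns a copy
fancy[0] = 0
print(a[0])            # 10

print(np.shares_memory(a, view), np.shares_memory(a, fancy))  # True False
```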
Array Operations
NumPy operations are element-wise by default, so they apply to every element without an explicit Python loop:
import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
# Arithmetic: all element-wise
print(a + b) # [11 22 33 44]
print(a - b) # [ -9 -18 -27 -36]
print(a * b) # [ 10  40  90 160]
print(b / a) # [10. 10. 10. 10.]
print(a ** 2) # [ 1  4  9 16]
print(b % 3) # [1 2 0 1]
# Scalar operations: the scalar is broadcast to all elements
print(a * 5) # [ 5 10 15 20]
print(a + 100) # [101 102 103 104]
# Comparison: returns a boolean array
print(a > 2) # [False False  True  True]
print(a == b) # [False False False False]
# Universal functions (ufuncs): fast element-wise math
print(np.sqrt(a)) # approx. [1. 1.414 1.732 2.]
print(np.exp(a)) # [e^1 e^2 e^3 e^4]
print(np.log(b)) # approx. [2.303 2.996 3.401 3.689]
print(np.abs(np.array([-3, -1, 2, 4]))) # [3 1 2 4]
print(np.sin(np.linspace(0, np.pi, 5))) # approx. [0. 0.707 1. 0.707 0.]
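To make "element-wise without a Python loop" concrete, the sketch below computes the same expression twice: once as a single vectorized expression and once with an explicit loop over Python floats. The expression itself is an arbitrary illustration.

```python
import math
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Vectorized: one expression over the whole array
vectorized = np.sqrt(x) + x ** 2

# Equivalent explicit loop over Python floats
looped = np.array([math.sqrt(v) + v ** 2 for v in x.tolist()])

print(np.allclose(vectorized, looped))  # True
```

Both versions produce the same values; the vectorized form simply moves the loop into NumPy's compiled code.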
Matrix Multiplication and Dot Product
import numpy as np
# Dot product of 1D vectors
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b)) # 1*4 + 2*5 + 3*6 = 32
print(a @ b) # Same with @ operator (Python 3.5+)
# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B) # Matrix product (not element-wise!)
# [[19 22]
# [43 50]]
print(A * B) # Element-wise product (Hadamard)
# [[ 5 12]
# [21 32]]
# Transpose
print(A.T)
# [[1 3]
# [2 4]]
print(A.T @ A) # A^T A: common in linear regression
# [[10 14]
#  [14 20]]
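The A^T A product mentioned above appears in the normal equations of least-squares regression, (X^T X) w = X^T y. The following is a minimal sketch on noise-free toy data; the data values and variable names are assumptions chosen so that the expected answer is obvious.

```python
import numpy as np

# Noise-free toy data generated from y = 1 + 2x (illustrative values)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])   # shape (4, 2)

# Solve (X^T X) w = X^T y without forming an explicit inverse
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w.round(6))  # [1. 2.]
```

For real data, np.linalg.lstsq(X, y, rcond=None) is usually the more numerically robust choice; the normal equations are shown here only because they use the A^T A product directly.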
Reshape and Transpose
import numpy as np
a = np.arange(12)
print(a) # [ 0 1 2 3 4 5 6 7 8 9 10 11]
# reshape: change shape without changing data
m = a.reshape(3, 4) # 3 rows, 4 columns
print(m)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
# Use -1 to let NumPy infer one dimension
m2 = a.reshape(4, -1) # 4 rows, NumPy figures out 3 columns
print(m2.shape) # (4, 3)
# flatten: always returns a copy
flat = m.flatten()
print(flat) # [ 0  1  2  3  4  5  6  7  8  9 10 11]
# ravel: returns a view if possible (faster)
flat_view = m.ravel()
# Add dimensions with np.newaxis
col = a[:, np.newaxis] # shape (12,) -> (12, 1)
row = a[np.newaxis, :] # shape (12,) -> (1, 12)
print(col.shape, row.shape) # (12, 1) (1, 12)
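Whether reshape, flatten and ravel share memory with the original can be checked with np.shares_memory. This is a small sketch of the "view if possible" behaviour described in the comments above.

```python
import numpy as np

a = np.arange(12)
m = a.reshape(3, 4)

print(np.shares_memory(a, m))            # True: reshape of a contiguous array is a view
print(np.shares_memory(m, m.flatten()))  # False: flatten always copies
print(np.shares_memory(m, m.ravel()))    # True: m is contiguous, so ravel can return a view
print(np.shares_memory(m, m.T.ravel()))  # False: the transpose is not C-contiguous, so ravel copies

m[0, 0] = 100
print(a[0])  # 100, because m is a view of a
```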
Broadcasting
Broadcasting is NumPy's mechanism for performing operations on arrays of different shapes without copying data. It works by virtually "stretching" the size-1 dimensions of the smaller array to match the larger one.
import numpy as np
# Scalar broadcast
a = np.array([1, 2, 3])
print(a + 10) # [11 12 13]: 10 is broadcast to match a's shape
# 1D + 2D broadcasting
matrix = np.ones((3, 3)) # shape (3, 3)
row = np.array([1, 2, 3]) # shape (3,), treated as (1, 3)
print(matrix + row)
# [[2. 3. 4.]
#  [2. 3. 4.]
#  [2. 3. 4.]]
# Column vector broadcast
col = np.array([[10], [20], [30]]) # shape (3, 1)
print(matrix + col)
# [[11. 11. 11.]
#  [21. 21. 21.]
#  [31. 31. 31.]]
# Practical: mean-center each column of a dataset
data = np.random.randn(100, 5) # 100 samples, 5 features
column_means = data.mean(axis=0) # shape (5,)
centered = data - column_means # broadcasts: (100, 5) - (5,) -> (100, 5)
print(centered.mean(axis=0).round(10)) # near-zero column means
NumPy aligns shapes from the right and compares them one dimension at a time. Two dimensions are compatible if they are equal or if one of them is 1; otherwise NumPy raises a ValueError. A shape (3,) is treated as (1, 3) when broadcast against a 2D array.
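The alignment rule can be explored without building any large arrays by using np.broadcast_shapes (available in NumPy 1.20 and later). The shapes below are illustrative, and the last case shows what an incompatible pair looks like.

```python
import numpy as np

print(np.broadcast_shapes((3, 3), (3,)))     # (3, 3)
print(np.broadcast_shapes((3, 1), (1, 4)))   # (3, 4)
print(np.broadcast_shapes((100, 5), (5,)))   # (100, 5)

# Incompatible: trailing dimensions are 4 and 3, and neither is 1
try:
    np.ones((3, 4)) + np.ones(3)
    compatible = True
except ValueError:
    compatible = False
print(compatible)  # False
```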
Aggregate Functions
import numpy as np
data = np.array([[4, 7, 2],
                 [1, 8, 5],
                 [9, 3, 6]])
# Global aggregates: across all elements
print(data.sum()) # 45
print(data.mean()) # 5.0
print(data.max()) # 9
print(data.min()) # 1
print(data.std()) # 2.581...
print(data.var()) # 6.666...
print(np.median(data)) # 5.0
# Axis-wise aggregates
# axis=0: collapse the rows (one result per column)
print(data.sum(axis=0)) # [14 18 13]: column sums
print(data.max(axis=0)) # [9 8 6]: column maxima
# axis=1: collapse the columns (one result per row)
print(data.sum(axis=1)) # [13 14 18]: row sums
print(data.mean(axis=1)) # approx. [4.33 4.67 6.0]
# Index of min/max
print(np.argmax(data)) # 6 (flat index of 9)
print(np.argmax(data, axis=0)) # [2 1 2]: row index of max per column
print(np.argmin(data, axis=1)) # [2 0 1]: col index of min per row
# Cumulative operations
print(np.cumsum(np.array([1, 2, 3, 4]))) # [1 3 6 10]
print(np.cumprod(np.array([1, 2, 3, 4]))) # [1 2 6 24]
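Axis-wise aggregates drop the collapsed dimension, which can get in the way when the result should be combined with the original array. The keepdims=True argument keeps that dimension with size 1. The sketch below reuses the same data to turn each row into proportions of its row total.

```python
import numpy as np

data = np.array([[4, 7, 2],
                 [1, 8, 5],
                 [9, 3, 6]])

row_sums = data.sum(axis=1, keepdims=True)   # shape (3, 1) instead of (3,)
proportions = data / row_sums                # (3, 3) / (3, 1) broadcasts row by row

print(row_sums.shape)            # (3, 1)
print(proportions.sum(axis=1))   # [1. 1. 1.]
```

Without keepdims, row_sums would have shape (3,) and the division would broadcast along the columns instead, giving the wrong result without raising an error.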
NumPy vs Python Lists: Speed Comparison
import numpy as np
import time
n = 1_000_000
# Python list: square each element
py_list = list(range(n))
start = time.perf_counter() # perf_counter is a monotonic, high-resolution timer
result = [x ** 2 for x in py_list]
py_time = time.perf_counter() - start
print(f"Python list: {py_time * 1000:.1f} ms")
# NumPy array: square each element
np_array = np.arange(n, dtype=np.int64) # explicit int64 so the squares fit on every platform
start = time.perf_counter()
result = np_array ** 2
np_time = time.perf_counter() - start
print(f"NumPy array: {np_time * 1000:.1f} ms")
print(f"NumPy is {py_time / np_time:.0f}x faster")
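A single timed run is sensitive to whatever else the machine is doing. As a hedged alternative, the sketch below uses the standard library's timeit module and takes the best of several repeats. The array size and repeat counts are arbitrary choices, and the measured ratio will vary by machine.

```python
import timeit
import numpy as np

n = 100_000  # smaller than above so the repeated runs finish quickly
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# Best of 5 repeats, 10 calls each, to reduce noise from other processes
py_best = min(timeit.repeat(lambda: [x ** 2 for x in py_list], number=10, repeat=5)) / 10
np_best = min(timeit.repeat(lambda: np_array ** 2, number=10, repeat=5)) / 10

print(f"Python list: {py_best * 1e3:.3f} ms per call")
print(f"NumPy array: {np_best * 1e3:.3f} ms per call")

# Both approaches compute the same values
same = np.array_equal(np_array ** 2, np.array([x ** 2 for x in py_list]))
print(same)  # True
```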
Common Data Science Operations
import numpy as np
# Normalise data to [0, 1]
data = np.array([10, 25, 5, 50, 30])
normalised = (data - data.min()) / (data.max() - data.min())
print(normalised.round(3)) # [0.111 0.444 0. 1. 0.556]
# Standardise data (z-score)
standardised = (data - data.mean()) / data.std()
print(standardised.round(3)) # [-0.878 0.063 -1.192 1.631 0.376]
# Count elements satisfying a condition
print((data > 20).sum()) # 3
# Unique values and counts
arr = np.array([1, 2, 2, 3, 3, 3])
values, counts = np.unique(arr, return_counts=True)
print(values) # [1 2 3]
print(counts) # [1 2 3]
# Sort
arr2 = np.array([5, 1, 4, 2, 3])
print(np.sort(arr2)) # [1 2 3 4 5]
print(np.argsort(arr2)) # [1 3 4 2 0]: indices that would sort the array
# Clip values to a range
raw = np.array([-5, 0, 3, 10, 20])
clipped = np.clip(raw, 0, 10)
print(clipped) # [0 0 3 10 10]
# where: conditional selection
a = np.array([1, 2, 3, 4, 5])
result = np.where(a > 3, "big", "small")
print(result) # ['small' 'small' 'small' 'big' 'big']
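The z-score above is computed on a 1D array. For a 2D dataset the same idea is usually applied per column, which combines axis=0 aggregates with broadcasting. This is a sketch on made-up data; note that np.std defaults to the population standard deviation (ddof=0), and ddof=1 would give the sample version.

```python
import numpy as np

# Illustrative 2D dataset: 4 samples (rows), 2 features (columns)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

mean = X.mean(axis=0)   # per-column mean, shape (2,)
std = X.std(axis=0)     # per-column population std (ddof=0 is NumPy's default)
Z = (X - mean) / std    # broadcasts (4, 2) against (2,)

print(np.allclose(Z.mean(axis=0), 0.0))  # True
print(np.allclose(Z.std(axis=0), 1.0))   # True
```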
Stacking and Splitting
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
# Stack vertically (row-wise)
v = np.vstack([a, b])
print(v.shape) # (4, 2)
# Stack horizontally (column-wise)
h = np.hstack([a, b])
print(h.shape) # (2, 4)
# Concatenate along an axis
c = np.concatenate([a, b], axis=0) # same as vstack
print(c.shape) # (4, 2)
# Split
arr = np.arange(12).reshape(4, 3)
top, bottom = np.vsplit(arr, 2) # split into 2 equal halves
print(top.shape, bottom.shape) # (2, 3) (2, 3)
left, middle, right = np.hsplit(arr, 3) # split into 3 single-column arrays
print(left.shape) # (4, 1)
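Two related functions are worth knowing alongside the ones above. np.split (and vsplit/hsplit) require the array to divide evenly, while np.array_split allows unequal pieces; and np.stack joins arrays along a new axis rather than an existing one. A short sketch with illustrative arrays:

```python
import numpy as np

arr = np.arange(10)

# np.split requires equal-sized pieces; 10 elements cannot be split into 3
try:
    np.split(arr, 3)
    split_ok = True
except ValueError:
    split_ok = False
print(split_ok)  # False

# np.array_split allows unequal pieces
parts = np.array_split(arr, 3)
print([p.tolist() for p in parts])  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# np.stack joins arrays along a NEW axis, unlike vstack/hstack/concatenate
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.stack([a, b]).shape)  # (2, 2, 2)
```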
Practical Exercises
- Create a 5×5 array of random integers between 1 and 100. Find the max of each row, min of each column, and the sum of the diagonal.
- Generate an array of 1000 normally-distributed random numbers. Calculate the percentage of values within 1, 2, and 3 standard deviations of the mean (should be ~68%, ~95%, ~99.7%).
- Implement matrix multiplication from scratch using NumPy slicing and verify with the @ operator.
- Reshape a 1D array of 24 elements into a 2×3×4 three-dimensional array and access specific elements using multi-dimensional indexing.
Challenge: Image as an Array
A grayscale image is just a 2D NumPy array where each value is a pixel intensity (0 to 255). Load an image using matplotlib.pyplot.imread(), which returns a NumPy array. Note that imread returns PNG files as floats in [0, 1], so rescale to the 0 to 255 range first if needed. Perform: (1) flip it horizontally using slicing, (2) crop to the center 50%, (3) adjust brightness by multiplying all values by 1.2 and clipping to 255, (4) convert to black and white by thresholding at 128 using np.where(). Display each result using matplotlib.pyplot.imshow().
- What is an ndarray and how does it differ from a Python list?
- What does it mean for NumPy operations to be vectorized?
- Explain NumPy broadcasting with an example.
- What is the difference between np.dot(A, B) and A * B for 2D arrays?
- What is the difference between .flatten() and .ravel()?
- What does axis=0 vs axis=1 mean for aggregation functions?
- How does boolean indexing work in NumPy?
- What is the difference between a view and a copy in NumPy?
- How would you normalise an array to have zero mean and unit variance?
Summary
- NumPy's ndarray is a fixed-type, contiguous array that is the foundation of Python data science.
- Create arrays with np.array(), np.zeros(), np.ones(), np.arange(), np.linspace(), and np.random.
- Key attributes: .shape, .ndim, .size, .dtype.
- Indexing uses [row, col] syntax; slices return views (not copies).
- All arithmetic operations are element-wise by default.
- Use @ or np.dot() for matrix multiplication.
- Broadcasting allows operations on arrays of different shapes by "stretching" size-1 dimensions.
- Aggregate functions (sum, mean, max, etc.) accept an axis parameter.
- NumPy is typically 10 to 100 times faster than equivalent Python list operations.
Frequently Asked Questions
Use NumPy whenever you need to perform mathematical operations on large collections of numbers, where it is typically 10 to 100 times faster. Use plain lists for small collections of mixed-type objects or when you need to frequently append or remove elements: list append is amortized O(1), whereas np.concatenate copies the whole array and is therefore O(n).
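When values arrive one at a time, two common patterns avoid calling np.concatenate inside a loop. The sketch below is illustrative only; the loop body stands in for whatever produces the values.

```python
import numpy as np

# Pattern 1: accumulate in a list, convert once at the end
collected = []
for i in range(5):
    collected.append(i * i)      # amortized O(1) per append
squares = np.array(collected)

# Pattern 2: preallocate when the final size is known, then fill in place
prealloc = np.empty(5, dtype=np.int64)
for i in range(5):
    prealloc[i] = i * i

print(squares)                            # [ 0  1  4  9 16]
print(np.array_equal(squares, prealloc))  # True
```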
Python's float is a full Python object, so every value carries per-object overhead. np.float64 is a raw 64-bit IEEE 754 double stored in a contiguous C array, with no per-element object overhead. NumPy also offers float32 (half the memory, less precision), which is often used in deep learning.
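The memory and precision trade-off between float64 and float32 can be inspected with np.finfo. This is a small sketch with an arbitrary array size.

```python
import numpy as np

x64 = np.ones(1000, dtype=np.float64)
x32 = x64.astype(np.float32)
print(x64.nbytes, x32.nbytes)   # 8000 4000

# Machine epsilon: the gap between 1.0 and the next representable value
print(np.finfo(np.float64).eps)  # about 2.22e-16
print(np.finfo(np.float32).eps)  # about 1.19e-07

# 0.1 is stored with less precision in float32
print(float(np.float32(0.1)) == 0.1)  # False
print(float(np.float64(0.1)) == 0.1)  # True
```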
A Pandas DataFrame is built on top of NumPy arrays. With the default dtypes, each column in a DataFrame is backed by a NumPy array. Pandas adds labels (index), heterogeneous column types, and a rich API for tabular data on top of NumPy's numerical core.
Use np.nan to represent missing values. Most aggregate functions have "NaN-safe" versions: np.nansum(), np.nanmean(), np.nanmax(), etc. Detect NaN with np.isnan(arr), which returns a boolean mask.
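A short sketch of these functions on a small illustrative array follows. Note that np.nan is a floating-point value, so the array needs a float dtype to hold it.

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

print(np.isnan(arr))           # [False  True False  True False]
print(arr.mean())              # nan: ordinary aggregates propagate NaN
print(np.nanmean(arr))         # 3.0
print(np.nansum(arr))          # 9.0
print(arr[~np.isnan(arr)])     # [1. 3. 5.]
```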