Python Machine Learning Intro – scikit-learn, Linear Regression | ylearner

Key Machine Learning Concepts

Supervised learning: learn from labeled examples (classification, regression). Unsupervised learning: find patterns in unlabeled data (clustering, dimensionality reduction). Reinforcement learning: learn from rewards and penalties.

Python

# The ML workflow:
# 1. Collect and prepare data
# 2. Split into training and test sets
# 3. Choose and train a model
# 4. Evaluate on test set
# 5. Tune and improve
# 6. Deploy

print("Data → Model → Predictions")

scikit-learn – Python's ML Toolkit

scikit-learn provides consistent APIs for 50+ ML algorithms, plus data preprocessing, model evaluation, and pipelines.

Python

# pip install scikit-learn
from sklearn.datasets import load_iris

# Load example dataset
iris = load_iris()
X, y = iris.data, iris.target

print(f"Samples: {len(X)}, Features: {X.shape[1]}")
print(f"Classes: {iris.target_names}")

▶ Output

Samples: 150, Features: 4
Classes: ['setosa' 'versicolor' 'virginica']
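
The preprocessing and pipeline features mentioned above can be sketched as follows. This is a minimal illustration, not a recommended setup: the choice of KNeighborsClassifier, the n_neighbors value, and the stratified split are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only, then applies
# the same scaling to the test data before the classifier sees it.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipeline.fit(X_train, y_train)

pipeline_accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {pipeline_accuracy:.2%}")
```

Bundling the scaler and the model into one object means a single .fit() call handles both steps, which helps keep test-set statistics out of the preprocessing.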

Training Your First Model

Split data, train a classifier, and evaluate it.

Python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")

▶ Output

Accuracy: 96.67%
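
A single accuracy number hides which classes the model confuses. As a sketch using the same split and tree settings as above, a confusion matrix and classification_report break the result down per class. The explicit labels argument is an addition for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Precision, recall, and F1 for each species.
report = classification_report(
    y_test, y_pred, labels=[0, 1, 2], target_names=iris.target_names
)
print(report)
```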

Linear Regression Example

Predict continuous values (price, temperature, salary) with linear regression.

Python

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic house price data
np.random.seed(42)
sqft = np.random.randint(500, 3000, 200).reshape(-1, 1)
price = sqft * 250 + np.random.normal(0, 15000, (200,1))

X_train, X_test, y_train, y_test = train_test_split(sqft, price, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"Predict 2000 sqft: ${model.predict([[2000]])[0][0]:,.0f}")

▶ Output

R² Score: 0.985
Predict 2000 sqft: $500,234
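
A fitted linear model can also be inspected directly. The sketch below follows the same data recipe as the example above (about 250 per square foot plus noise), but it uses NumPy's default_rng generator and a 1-D target, so its exact numbers will differ from the output shown. The variable names slope, intercept, and rmse are chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: price is roughly 250 per sqft plus Gaussian noise.
rng = np.random.default_rng(42)
sqft = rng.integers(500, 3000, size=200).reshape(-1, 1)
price = sqft.ravel() * 250 + rng.normal(0, 15000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    sqft, price, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

slope = model.coef_[0]        # learned price per square foot
intercept = model.intercept_  # learned base price
# Root mean squared error, in the same units as price.
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print(f"Slope: {slope:.1f} per sqft, intercept: {intercept:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```

The learned slope should land close to the 250 used to generate the data, and the RMSE should be on the order of the 15,000 noise level.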

The Machine Learning Workflow in Python

Machine learning finds patterns in data instead of following hand-coded rules. In Python, scikit-learn gives every model the same simple interface, so the workflow matters more than any one algorithm.

Python

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Example data so the snippet runs on its own
X, y = load_iris(return_X_y=True)

# 1. Split: never test on data you trained on
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Fit (learn), then 3. predict and 4. evaluate
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on UNSEEN data

Type	Learns from	Example
Supervised	labeled data	spam detection, price prediction
Unsupervised	unlabeled data	customer clustering
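
The unsupervised row of the table can be illustrated with a small clustering sketch. It reuses the iris measurements in place of customer data and deliberately ignores the labels. KMeans, the choice of three clusters, and the scaling step are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # labels are deliberately ignored
X_scaled = StandardScaler().fit_transform(X)

# Group the samples into three clusters based on the features alone.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X_scaled)

print(cluster_ids[:10])               # cluster assignment of the first ten samples
print(kmeans.cluster_centers_.shape)  # one center per cluster, one column per feature
```

Cluster numbers are arbitrary identifiers: they group similar samples but carry no class names.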

The cardinal rule (train/test split): always evaluate on data the model never saw during training. A model that scores 99% on its training data but fails on new data is overfitting: memorizing, not learning. The held-out test set is your honest estimate of real-world performance.

Reality check: most ML work is data preparation (cleaning, encoding, and scaling features), not choosing algorithms. Every scikit-learn model follows the same .fit() / .predict() pattern, so learning the workflow transfers across all of them.
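
The gap between training and test scores can be made visible with a small experiment. The sketch below assumes a hypothetical noisy dataset from make_classification and an unconstrained decision tree, both chosen only to make overfitting easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy dataset: flip_y randomly reassigns about 20% of the labels.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# No depth limit, so the tree keeps splitting until it fits every training row.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.2%}")
print(f"Test accuracy:  {test_acc:.2%}")
```

The tree reproduces its training labels perfectly, noise included, so only the test score says anything about how it would behave on new data.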

🏋️ Practical Exercise

Train a first model with scikit-learn:

Load a built-in dataset (e.g. load_iris) and inspect its features and target.
Split it into training and test sets with train_test_split.
Fit a classifier (e.g. LogisticRegression) with .fit().
Predict on the test set and report accuracy with accuracy_score.
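
One possible way to work through these steps is sketched below. The max_iter value, the stratified split, and the random_state are assumptions for this example. Try the exercise yourself before reading it.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: load the dataset and inspect its features and target.
iris = load_iris()
print(iris.feature_names)
print(iris.target_names)

# Step 2: split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

# Step 3: fit a classifier (max_iter raised so the solver converges).
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Step 4: predict on the test set and report accuracy.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Accuracy: {accuracy:.2%}")
```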

🔥 Challenge Exercise

Build a simple linear regression to predict a numeric target (e.g. house price from size). Split the data, train a LinearRegression model, evaluate it with R² and mean squared error, and plot the predicted line over the data points. Then deliberately train and test on the same data to observe over-optimistic scores, illustrating why a held-out test set matters.

📋 Summary

Machine learning lets models learn patterns from data instead of being explicitly programmed.
Supervised learning uses labeled data (classification, regression); unsupervised finds structure in unlabeled data.
scikit-learn provides a consistent fit()/predict() API across many algorithms.
Always split data into training and test sets to estimate real-world performance.
Overfitting is when a model memorizes training data and generalizes poorly — a gap between train and test scores reveals it.
Classification predicts categories; regression predicts continuous values.

Interview Questions on Machine Learning

What is machine learning and how does it differ from traditional programming?
What is the difference between supervised and unsupervised learning?
What is the purpose of splitting data into training and test sets?
What is overfitting and how do you detect it?
What does scikit-learn’s fit/predict API do?
What is the difference between classification and regression?
What are features and labels?

FAQ

Do I need advanced math to start machine learning?

To use libraries like scikit-learn and build working models, a basic grasp of statistics is enough. Deeper math (linear algebra, calculus) helps you understand and tune algorithms, but you can be productive while learning it gradually.

Why split data into training and test sets?

A model is only useful if it generalizes to new data. Holding out a test set the model never trained on gives an honest estimate of real-world performance and reveals overfitting.

What is overfitting?

Overfitting is when a model learns noise and quirks of the training data rather than the underlying pattern, so it scores well on training data but poorly on new data. Cross-validation, more data, and simpler models help combat it.
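
Cross-validation, mentioned above, can be sketched with cross_val_score. The five folds and the shallow decision tree are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# cv=5 trains on four folds and scores on the fifth, rotating five times.
scores = cross_val_score(model, X, y, cv=5)
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Averaging over several folds gives a steadier estimate than a single train/test split, and a large spread between folds is a hint that the model is sensitive to which rows it trained on.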

What is the difference between classification and regression?

Classification predicts a discrete category (spam vs not spam), while regression predicts a continuous number (a price). Both are supervised tasks but use different algorithms and metrics.
