Basics
SQL (Structured Query Language) is the standard language for interacting with relational databases. It lets you create and modify schema (DDL), insert and update data (DML), control access (DCL), and query data with SELECT. SQL is declarative — you describe what you want, and the database engine figures out how to get it efficiently.
WHERE filters individual rows before grouping and cannot reference aggregate functions. HAVING filters groups after GROUP BY is applied and can reference aggregates. Example: WHERE salary > 50000 keeps only rows where salary exceeds 50k before any grouping. HAVING COUNT(*) > 5 keeps only groups that contain more than 5 rows.
SELECT department, COUNT(*) AS headcount
FROM employees
WHERE status = 'active' -- filters rows before grouping
GROUP BY department
HAVING COUNT(*) > 5; -- filters groups after groupingNULL represents the absence of a value — it is not zero and not an empty string. Any comparison with NULL using = or != returns UNKNOWN (not TRUE or FALSE), which is why WHERE column = NULL never matches rows. To check for NULL, use IS NULL or IS NOT NULL. In arithmetic, NULL propagates: 5 + NULL = NULL. Use COALESCE(column, default) to substitute a fallback value for NULLs.
Constraints enforce rules on table columns. Common constraints: PRIMARY KEY — uniquely identifies each row; FOREIGN KEY — enforces referential integrity between tables; UNIQUE — prevents duplicate values; NOT NULL — prevents NULL values; CHECK — enforces a custom condition (e.g., CHECK (age >= 0)); DEFAULT — provides a fallback value when none is supplied.
DELETE removes rows one at a time, fires triggers, respects WHERE conditions, and can be rolled back (it's a DML statement logged fully). TRUNCATE removes all rows by deallocating data pages — much faster on large tables, but cannot be filtered with WHERE, does not fire row-level triggers, and in some databases (MySQL) cannot be rolled back if outside a transaction. Use DELETE when you need selective removal or trigger behavior; use TRUNCATE to empty a table quickly.
Both enforce uniqueness of values. Key differences: a table can have only ONE primary key but multiple UNIQUE keys. A PRIMARY KEY column cannot contain NULL values; a UNIQUE column can contain NULLs (and in most databases, multiple NULL values are allowed in the same UNIQUE column — since NULL is not equal to NULL). PRIMARY KEY also creates the clustered index on InnoDB (MySQL). UNIQUE creates a secondary index.
SELECT DISTINCT removes duplicate rows from the result set. It compares all selected columns together: two rows are "duplicate" only if every column value in the SELECT list is identical. DISTINCT adds sorting overhead — avoid using it as a substitute for fixing a bad query or missing JOIN condition. COUNT(DISTINCT column) counts the number of unique values in a column.
SELECT DISTINCT department FROM employees; -- unique departments
SELECT COUNT(DISTINCT customer_id) FROM orders; -- unique customers who orderedJOINs
INNER JOIN returns only rows that have a match in both tables. LEFT JOIN (LEFT OUTER JOIN) returns all rows from the left table plus matching rows from the right — unmatched right-table columns become NULL. Use LEFT JOIN when you want to include rows from the left table even if there is no matching row in the right table (e.g., customers who have placed no orders).
-- INNER JOIN: only customers who have orders
SELECT c.name, o.id FROM customers c
INNER JOIN orders o ON o.customer_id = c.id;
-- LEFT JOIN: all customers, including those with no orders
SELECT c.name, o.id FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id;A FULL OUTER JOIN returns all rows from both tables, with NULLs where there is no match. Use it when you need to compare two datasets and find rows present in either but not both — for example, reconciling two inventory systems, finding discrepancies, or identifying records missing in one source.
-- Find rows in either table with no match in the other
SELECT a.id AS a_id, b.id AS b_id
FROM table_a a
FULL OUTER JOIN table_b b ON a.key = b.key
WHERE a.id IS NULL OR b.id IS NULL;A self join joins a table to itself using two aliases. It's useful for hierarchical data — like finding an employee's manager, where both employees and managers live in the same employees table.
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;A CROSS JOIN produces the Cartesian product of two tables — every row from the left table paired with every row from the right. If table A has 10 rows and table B has 5 rows, the result has 50 rows. No ON condition is used. Practical uses include generating all combinations (e.g., all color/size combinations for a product), or pairing every date in a calendar table with every product for gap-filling queries.
Yes — you can chain as many JOINs as needed. Each JOIN adds another table to the result. The database builds the result set incrementally. Performance depends on having proper indexes on join columns and query optimizer choices.
SELECT o.id, c.name, p.name AS product, oi.quantity
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id;Aggregation
Aggregate functions operate on a set of rows and return a single summary value. The five core aggregate functions are: COUNT() — number of rows; SUM() — total; AVG() — average; MIN() — minimum value; MAX() — maximum value. They are typically used with GROUP BY to compute summaries per group.
Use GROUP BY when you want to compute an aggregate (COUNT, SUM, AVG, etc.) for each distinct value (or combination of values) in one or more columns. Every column in your SELECT list must either be in the GROUP BY clause or wrapped in an aggregate function.
SELECT department, AVG(salary) AS avg_sal, COUNT(*) AS headcount
FROM employees
GROUP BY department
ORDER BY avg_sal DESC;SQL clauses are processed in this logical order (not the order you write them): FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT. This is why you cannot use a SELECT alias in a WHERE clause — WHERE runs before SELECT assigns aliases. You can reference a SELECT alias in ORDER BY because ORDER BY runs last.
COUNT(*) counts all rows in the group, including rows with NULL values. COUNT(column) counts only rows where that column is NOT NULL. Use COUNT(*) to count rows; use COUNT(column) to count non-null values in a specific column. COUNT(DISTINCT column) counts unique non-null values.
SELECT
COUNT(*) AS total_rows,
COUNT(phone) AS rows_with_phone, -- excludes NULLs
COUNT(DISTINCT city) AS unique_cities
FROM customers;Advanced
A subquery is a SELECT statement nested inside another SQL statement. It can appear in the WHERE, FROM, or SELECT clause. Subqueries in WHERE are often used to filter rows using a value computed from another table. A correlated subquery references the outer query and re-runs for each row — powerful but can be slow. A non-correlated subquery runs once and returns a fixed result.
-- Non-correlated: find employees earning above overall average
SELECT name, salary FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Correlated: find employees earning above their DEPARTMENT average
SELECT name, department, salary FROM employees e
WHERE salary > (
SELECT AVG(salary) FROM employees
WHERE department = e.department -- references outer query
);A CTE (WITH name AS (...) SELECT ...) is a named temporary result set that exists only for the duration of the query. It improves readability by breaking a complex query into named, reusable pieces. CTEs can reference each other and even be recursive (to traverse hierarchies). Unlike a subquery in FROM, a CTE can be referenced multiple times in the same query.
WITH high_earners AS (
SELECT id, name, salary FROM employees WHERE salary > 100000
),
dept_summary AS (
SELECT department, COUNT(*) AS headcount FROM employees GROUP BY department
)
SELECT h.name, h.salary, d.headcount
FROM high_earners h
JOIN employees e ON e.id = h.id
JOIN dept_summary d ON d.department = e.department;Window functions compute a value for each row using a "window" of surrounding rows, without collapsing rows like GROUP BY does. They use the OVER() clause to define the window. Examples: ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), SUM() OVER() (running total), AVG() OVER() (moving average). They are used in analytics: ranking, running totals, comparing rows to adjacent rows.
All three assign a sequential number to rows in an ordered window. The difference is in how they handle ties: ROW_NUMBER() gives unique sequential numbers, with ties broken arbitrarily. RANK() gives the same rank to tied rows but skips subsequent ranks (1, 1, 3 — no 2). DENSE_RANK() gives the same rank to ties without gaps (1, 1, 2).
SELECT name, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
RANK() OVER (ORDER BY salary DESC) AS rnk,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
FROM employees;An index is a data structure (usually a B-tree) that speeds up row lookups by providing a fast path to rows matching a column value, instead of scanning every row. Use indexes on: primary/foreign keys, columns used in WHERE or JOIN conditions frequently, high-cardinality columns (many distinct values), and ORDER BY columns for large result sets. Avoid over-indexing — each index slows down INSERT/UPDATE/DELETE.
A view is a named saved SELECT query stored in the database. When queried, it runs the underlying SELECT and returns results as if from a table — but stores no data itself. Views simplify complex joins, enforce security by hiding sensitive columns, and provide a stable interface over changing schemas. Materialized views (PostgreSQL) store query results on disk for fast reads.
A stored procedure is a named, saved block of SQL (and procedural code) stored in the database and called by name with CALL. Procedures support IN, OUT, and INOUT parameters, conditional logic, loops, and transaction control. Use them for multi-step atomic operations, batch jobs, and business rules that should be enforced at the database layer regardless of which application calls it.
Normalization is the process of organizing database tables to minimize redundancy and prevent data anomalies. It is done through a series of normal forms: 1NF (atomic values, no repeating groups), 2NF (no partial dependencies on composite PK), 3NF (no transitive dependencies — non-key columns depend only on the PK). Most production OLTP databases aim for 3NF. Data warehouses are often intentionally denormalized for query performance.
Performance
A systematic approach: (1) Run EXPLAIN / EXPLAIN ANALYZE to see the query plan — look for full table scans (Seq Scan / type: ALL). (2) Add missing indexes on WHERE, JOIN, and ORDER BY columns. (3) Rewrite correlated subqueries as JOINs or CTEs. (4) Avoid functions on indexed columns in WHERE clauses (use WHERE created_at > '...' instead of WHERE YEAR(created_at) = 2024). (5) Fetch only the columns you need — avoid SELECT *. (6) Use LIMIT for paginated queries. (7) Consider query rewriting, partitioning, or materialized views for very large tables.
EXPLAIN (or EXPLAIN ANALYZE) shows the execution plan the database optimizer chose for a query — which indexes are used, which join algorithms, estimated row counts, and relative costs. It's the primary tool for diagnosing slow queries. Look for: Seq Scan on large tables (missing index), high estimated row counts (outdated statistics), and expensive Sort operations (index can eliminate them).
A clustered index determines the physical order of rows in the table — the table data is stored sorted by the clustered index key. In MySQL InnoDB, the primary key is always the clustered index. There can only be one clustered index per table. A non-clustered index is a separate structure that stores index key values plus a pointer to the actual row. Most indexes you create are non-clustered. Non-clustered index lookups require an extra step (following the pointer to the row), which is why Index Only Scans (when all needed data is in the index) are faster.
Using SELECT * has several problems: (1) It fetches more data than needed, wasting I/O and network bandwidth. (2) It prevents "index-only scans" — the optimizer can't serve the query from the index alone if it needs all columns. (3) It makes queries fragile — if someone adds or reorders columns in the table, your query silently returns different data. (4) In JOINs, it makes the result set ambiguous (which table did this column come from?). Always specify the exact columns you need.