Project Overview
This project uses the users, orders, order_items, and products tables from the E-Commerce Schema project. If you haven't created those tables yet, do that first. All queries are written for PostgreSQL — MySQL equivalents are noted where syntax differs.
Cohort analysis, retention, and funnel analysis are the core analytical frameworks behind product and growth decisions at most tech companies. Being able to write these queries from scratch — not just call a BI tool — is a strong signal in data analyst and analytics engineer interviews.
Cohort Analysis
Business question: Group users by the month they signed up (their cohort). How much revenue does each cohort generate over time?
A cohort is a group of users who share a common characteristic at a fixed point in time — here, the month of their first order. Cohort analysis lets you compare the behavior of different user groups over their lifetimes.
-- Step 1: Assign each user to their signup cohort (first order month)
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(created_at)) AS cohort_month
FROM orders
WHERE status != 'cancelled'
GROUP BY user_id
),
-- Step 2: Tag every order with the user's cohort and the "month offset"
-- (how many months after signup the order was placed)
order_cohorts AS (
SELECT
uc.cohort_month,
DATE_TRUNC('month', o.created_at) AS order_month,
DATE_PART('year', AGE(DATE_TRUNC('month', o.created_at), uc.cohort_month)) * 12 +
DATE_PART('month', AGE(DATE_TRUNC('month', o.created_at), uc.cohort_month))
AS months_since_signup,
o.total
FROM orders o
JOIN user_cohorts uc ON uc.user_id = o.user_id
WHERE o.status != 'cancelled'
)
-- Step 3: Aggregate by cohort and month offset
SELECT
TO_CHAR(cohort_month, 'YYYY-MM') AS cohort,
months_since_signup AS month_number,
COUNT(*) AS orders,
ROUND(SUM(total), 2) AS revenue
FROM order_cohorts
GROUP BY cohort_month, months_since_signup
ORDER BY cohort_month, months_since_signup;
Reading the result: Each row represents a cohort (e.g., "users who signed up in Jan 2023") at a given month offset (0 = their signup month, 1 = one month later, etc.). Compare revenue across cohorts at the same month_number to see if newer cohorts are spending more or less than older ones at the same stage of their lifecycle.
Retention Analysis
Business question: Of users who placed their first order in month M, what percentage returned to place another order in month M+1, M+2, M+3?
WITH first_orders AS (
-- Find each user's first order month
SELECT user_id,
DATE_TRUNC('month', MIN(created_at)) AS first_month
FROM orders
WHERE status != 'cancelled'
GROUP BY user_id
),
monthly_activity AS (
-- Find all months in which each user was active (placed an order)
SELECT DISTINCT
user_id,
DATE_TRUNC('month', created_at) AS active_month
FROM orders
WHERE status != 'cancelled'
)
-- Calculate retention per cohort
SELECT
TO_CHAR(f.first_month, 'YYYY-MM') AS cohort_month,
COUNT(DISTINCT f.user_id) AS cohort_size,
-- Month 0: should equal cohort_size (100%)
COUNT(DISTINCT CASE
WHEN ma.active_month = f.first_month THEN f.user_id END
) AS month_0,
-- Month 1: users active 1 month after signup
COUNT(DISTINCT CASE
WHEN ma.active_month = f.first_month + INTERVAL '1 month' THEN f.user_id END
) AS month_1,
-- Month 2
COUNT(DISTINCT CASE
WHEN ma.active_month = f.first_month + INTERVAL '2 months' THEN f.user_id END
) AS month_2,
-- Month 3
COUNT(DISTINCT CASE
WHEN ma.active_month = f.first_month + INTERVAL '3 months' THEN f.user_id END
) AS month_3,
-- Retention rates as percentages
ROUND(100.0 * COUNT(DISTINCT CASE
WHEN ma.active_month = f.first_month + INTERVAL '1 month' THEN f.user_id END)
/ NULLIF(COUNT(DISTINCT f.user_id), 0), 1) AS retention_month_1_pct
FROM first_orders f
LEFT JOIN monthly_activity ma ON ma.user_id = f.user_id
GROUP BY f.first_month
ORDER BY f.first_month;
Reading the result: If cohort_size is 100 and month_1 is 40, the month-1 retention is 40%. Industry benchmarks vary widely: 20-30% month-1 retention is decent for e-commerce. A declining retention curve that flattens (rather than dropping to zero) indicates a loyal core user base.
7-Day Moving Average of Daily Revenue
Business question: Daily revenue is noisy — weekends spike, Mondays dip. Smooth it out with a 7-day rolling average to see the real trend.
WITH daily_revenue AS (
-- Aggregate revenue per day
SELECT
created_at::DATE AS day,
SUM(total) AS daily_total
FROM orders
WHERE status != 'cancelled'
GROUP BY created_at::DATE
),
calendar AS (
-- Generate a complete date series so days with no orders appear as 0
-- (PostgreSQL generate_series)
SELECT generate_series(
(SELECT MIN(day) FROM daily_revenue),
(SELECT MAX(day) FROM daily_revenue),
'1 day'::INTERVAL
)::DATE AS day
)
SELECT
c.day,
COALESCE(dr.daily_total, 0) AS daily_revenue,
ROUND(
AVG(COALESCE(dr.daily_total, 0)) OVER (
ORDER BY c.day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -- 7-day window
), 2
) AS rolling_7d_avg
FROM calendar c
LEFT JOIN daily_revenue dr ON dr.day = c.day
ORDER BY c.day;
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW means exactly the 7 physical rows preceding (including current). RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW means all rows within 6 days by date value — useful when you have gaps in the data. Use ROWS for fixed-size rolling windows.
Revenue Percentile Analysis
Business question: Segment orders into revenue quartiles. What does a typical order look like in each quartile? Who are the top 10% by spend?
-- NTILE divides rows into N equal buckets by the ORDER BY column
WITH order_quartiles AS (
SELECT
id AS order_id,
user_id,
total,
NTILE(4) OVER (ORDER BY total) AS quartile,
NTILE(10) OVER (ORDER BY total) AS decile
FROM orders
WHERE status != 'cancelled'
)
SELECT
quartile,
COUNT(*) AS order_count,
ROUND(MIN(total), 2) AS min_total,
ROUND(MAX(total), 2) AS max_total,
ROUND(AVG(total), 2) AS avg_total
FROM order_quartiles
GROUP BY quartile
ORDER BY quartile;
-- Find the top 10% spenders (decile 10)
SELECT u.name, u.email, SUM(o.total) AS total_spend
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.status != 'cancelled'
GROUP BY u.id, u.name, u.email
HAVING SUM(o.total) >= (
-- The 90th percentile cutoff
SELECT PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY total)
FROM orders WHERE status != 'cancelled'
)
ORDER BY total_spend DESC;
Conversion Funnel Analysis
Business question: Of all users, how many: (1) placed at least one order, (2) placed more than one order, (3) spent over $500 total? This is a simplified purchase funnel.
WITH user_stats AS (
SELECT
u.id,
u.name,
COUNT(o.id) AS order_count,
SUM(o.total) AS total_spend
FROM users u
LEFT JOIN orders o ON o.user_id = u.id AND o.status != 'cancelled'
GROUP BY u.id, u.name
),
funnel AS (
SELECT
COUNT(*) AS total_users,
-- Step 1: Users who placed at least one order
COUNT(CASE WHEN order_count >= 1 THEN 1 END) AS placed_order,
-- Step 2: Users who came back and ordered more than once (repeat buyers)
COUNT(CASE WHEN order_count >= 2 THEN 1 END) AS repeat_buyer,
-- Step 3: High-value customers (spent over $500)
COUNT(CASE WHEN total_spend > 500 THEN 1 END) AS high_value
FROM user_stats
)
SELECT
'All Users' AS step,
total_users AS user_count,
100.0 AS pct_of_previous
UNION ALL
SELECT
'1. Placed any order',
placed_order,
ROUND(100.0 * placed_order / NULLIF(total_users, 0), 1)
FROM funnel
UNION ALL
SELECT
'2. Placed 2+ orders (repeat)',
repeat_buyer,
ROUND(100.0 * repeat_buyer / NULLIF(placed_order, 0), 1)
FROM funnel
UNION ALL
SELECT
'3. Spent > $500 (high-value)',
high_value,
ROUND(100.0 * high_value / NULLIF(placed_order, 0), 1)
FROM funnel;
Reading the result: Each step shows how many users reached that stage and the conversion rate from the previous step. A steep drop between "placed any order" and "placed 2+ orders" (e.g., only 20% repeat) indicates a retention problem. A large high-value segment (20%+ of buyers spend over your threshold) is a healthy sign of a loyal core customer base.
In a real product analytics setup, the funnel would use an events table (page_view, add_to_cart, checkout_started, order_completed) rather than just orders. The SQL pattern is the same: count distinct users at each step and divide by the previous step's count. Tools like Mixpanel and Amplitude automate this, but knowing how to write it in SQL gives you control over exactly how steps are defined.
Summary
📋 Summary
- Cohort analysis — group users by signup month, track behavior month-by-month using DATE_TRUNC and AGE functions.
- Retention — for each cohort, count users who were active in each subsequent month using conditional COUNT(DISTINCT CASE ...).
- 7-day moving average — use
AVG() OVER (ROWS BETWEEN 6 PRECEDING AND CURRENT ROW); fill gaps with generate_series + LEFT JOIN + COALESCE. - Percentile / quartile analysis — use
NTILE(n)for bucketing;PERCENTILE_CONT(p)for exact percentile values. - Funnel analysis — define funnel steps with CASE WHEN inside COUNT; chain UNION ALL to display as a vertical funnel report.
- All patterns use CTEs for readability — complex analytics queries are much harder to follow without them.
FAQ
PERCENTILE_CONT(p) (continuous) returns an interpolated value that may not exist in the actual dataset — for example, the median of [1, 3] would be 2.0. PERCENTILE_DISC(p) (discrete) returns the actual nearest value that exists in the dataset — the median of [1, 3] would be 1 or 3, not 2. For most analytics purposes, PERCENTILE_CONT is more useful because it gives a true statistical measure. Both are aggregate functions that use the WITHIN GROUP (ORDER BY ...) syntax.
Yes, with adaptations. MySQL doesn't have DATE_TRUNC — use DATE_FORMAT(date, '%Y-%m-01') to truncate to month. MySQL doesn't have generate_series — use a numbers table or a recursive CTE (MySQL 8.0+) to generate a date series. AGE intervals can be approximated with TIMESTAMPDIFF(MONTH, start, end). The logic is identical; only the date functions differ.
BI tools like Metabase, Looker, Tableau, or Redash are great for recurring dashboards, non-technical stakeholders, and standard charts. Write custom analytics SQL when: the question is novel and not covered by existing dashboards; you need exact control over how a metric is defined; you're building data pipelines or dbt models; or you're debugging a discrepancy in a BI tool's output. SQL and BI tools complement each other — SQL generates the dataset, BI visualizes it.
Always store timestamps in UTC. When querying for day/month breakdowns, convert to the user's local time zone before truncating: DATE_TRUNC('day', created_at AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York'). If you truncate in UTC, a purchase made at 11pm New York time (3am UTC the next day) will appear on the wrong day in reports. Most analytics databases (Redshift, BigQuery, Snowflake) have built-in timezone conversion functions — always document which timezone your dashboards use.