SQL Analytics Queries – Real-World Data Analysis Patterns

Project Overview

This project uses the users, orders, order_items, and products tables from the E-Commerce Schema project. If you haven't created those tables yet, do that first. All queries are written for PostgreSQL — MySQL equivalents are noted where syntax differs.

ℹ️

Why these patterns matter

Cohort analysis, retention, and funnel analysis are the core analytical frameworks behind product and growth decisions at most tech companies. Being able to write these queries from scratch — not just call a BI tool — is a strong signal in data analyst and analytics engineer interviews.

Cohort Analysis

Business question: Group users by the month they signed up (their cohort). How much revenue does each cohort generate over time?

A cohort is a group of users who share a common characteristic at a fixed point in time — here, the month of their first order. Cohort analysis lets you compare the behavior of different user groups over their lifetimes.

SQL – Cohort Analysis (PostgreSQL)

-- Step 1: Assign each user to their signup cohort (first order month)
WITH user_cohorts AS (
  SELECT
    user_id,
    DATE_TRUNC('month', MIN(created_at)) AS cohort_month
  FROM orders
  WHERE status != 'cancelled'
  GROUP BY user_id
),

-- Step 2: Tag every order with the user's cohort and the "month offset"
-- (how many months after signup the order was placed)
order_cohorts AS (
  SELECT
    uc.cohort_month,
    DATE_TRUNC('month', o.created_at)            AS order_month,
    DATE_PART('year', AGE(DATE_TRUNC('month', o.created_at), uc.cohort_month)) * 12 +
    DATE_PART('month', AGE(DATE_TRUNC('month', o.created_at), uc.cohort_month))
                                                 AS months_since_signup,
    o.total
  FROM orders o
  JOIN user_cohorts uc ON uc.user_id = o.user_id
  WHERE o.status != 'cancelled'
)

-- Step 3: Aggregate by cohort and month offset
SELECT
  TO_CHAR(cohort_month, 'YYYY-MM')  AS cohort,
  months_since_signup               AS month_number,
  COUNT(*)                          AS orders,
  ROUND(SUM(total), 2)              AS revenue
FROM order_cohorts
GROUP BY cohort_month, months_since_signup
ORDER BY cohort_month, months_since_signup;

Reading the result: Each row represents a cohort (e.g., "users who signed up in Jan 2023") at a given month offset (0 = their signup month, 1 = one month later, etc.). Compare revenue across cohorts at the same month_number to see if newer cohorts are spending more or less than older ones at the same stage of their lifecycle.

Retention Analysis

Business question: Of users who placed their first order in month M, what percentage returned to place another order in month M+1, M+2, M+3?

SQL – Retention (PostgreSQL)

WITH first_orders AS (
  -- Find each user's first order month
  SELECT user_id,
         DATE_TRUNC('month', MIN(created_at)) AS first_month
  FROM orders
  WHERE status != 'cancelled'
  GROUP BY user_id
),

monthly_activity AS (
  -- Find all months in which each user was active (placed an order)
  SELECT DISTINCT
    user_id,
    DATE_TRUNC('month', created_at) AS active_month
  FROM orders
  WHERE status != 'cancelled'
)

-- Calculate retention per cohort
SELECT
  TO_CHAR(f.first_month, 'YYYY-MM')       AS cohort_month,
  COUNT(DISTINCT f.user_id)               AS cohort_size,

  -- Month 0: should equal cohort_size (100%)
  COUNT(DISTINCT CASE
    WHEN ma.active_month = f.first_month THEN f.user_id END
  ) AS month_0,

  -- Month 1: users active 1 month after signup
  COUNT(DISTINCT CASE
    WHEN ma.active_month = f.first_month + INTERVAL '1 month' THEN f.user_id END
  ) AS month_1,

  -- Month 2
  COUNT(DISTINCT CASE
    WHEN ma.active_month = f.first_month + INTERVAL '2 months' THEN f.user_id END
  ) AS month_2,

  -- Month 3
  COUNT(DISTINCT CASE
    WHEN ma.active_month = f.first_month + INTERVAL '3 months' THEN f.user_id END
  ) AS month_3,

  -- Retention rates as percentages
  ROUND(100.0 * COUNT(DISTINCT CASE
    WHEN ma.active_month = f.first_month + INTERVAL '1 month' THEN f.user_id END)
    / NULLIF(COUNT(DISTINCT f.user_id), 0), 1) AS retention_month_1_pct

FROM first_orders f
LEFT JOIN monthly_activity ma ON ma.user_id = f.user_id
GROUP BY f.first_month
ORDER BY f.first_month;

Reading the result: If cohort_size is 100 and month_1 is 40, the month-1 retention is 40%. Industry benchmarks vary widely: 20-30% month-1 retention is decent for e-commerce. A declining retention curve that flattens (rather than dropping to zero) indicates a loyal core user base.

7-Day Moving Average of Daily Revenue

Business question: Daily revenue is noisy — weekends spike, Mondays dip. Smooth it out with a 7-day rolling average to see the real trend.

SQL – Moving Average (PostgreSQL)

WITH daily_revenue AS (
  -- Aggregate revenue per day
  SELECT
    created_at::DATE     AS day,
    SUM(total)           AS daily_total
  FROM orders
  WHERE status != 'cancelled'
  GROUP BY created_at::DATE
),

calendar AS (
  -- Generate a complete date series so days with no orders appear as 0
  -- (PostgreSQL generate_series)
  SELECT generate_series(
    (SELECT MIN(day) FROM daily_revenue),
    (SELECT MAX(day) FROM daily_revenue),
    '1 day'::INTERVAL
  )::DATE AS day
)

SELECT
  c.day,
  COALESCE(dr.daily_total, 0)   AS daily_revenue,
  ROUND(
    AVG(COALESCE(dr.daily_total, 0)) OVER (
      ORDER BY c.day
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW  -- 7-day window
    ), 2
  )                             AS rolling_7d_avg
FROM calendar c
LEFT JOIN daily_revenue dr ON dr.day = c.day
ORDER BY c.day;

💡

ROWS vs RANGE in window frames

ROWS BETWEEN 6 PRECEDING AND CURRENT ROW means exactly the 7 physical rows preceding (including current). RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW means all rows within 6 days by date value — useful when you have gaps in the data. Use ROWS for fixed-size rolling windows.

Revenue Percentile Analysis

Business question: Segment orders into revenue quartiles. What does a typical order look like in each quartile? Who are the top 10% by spend?

SQL – Percentiles with NTILE (PostgreSQL & MySQL)

-- NTILE divides rows into N equal buckets by the ORDER BY column
WITH order_quartiles AS (
  SELECT
    id          AS order_id,
    user_id,
    total,
    NTILE(4) OVER (ORDER BY total) AS quartile,
    NTILE(10) OVER (ORDER BY total) AS decile
  FROM orders
  WHERE status != 'cancelled'
)
SELECT
  quartile,
  COUNT(*)             AS order_count,
  ROUND(MIN(total), 2) AS min_total,
  ROUND(MAX(total), 2) AS max_total,
  ROUND(AVG(total), 2) AS avg_total
FROM order_quartiles
GROUP BY quartile
ORDER BY quartile;

-- Find the top 10% spenders (decile 10)
SELECT u.name, u.email, SUM(o.total) AS total_spend
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.status != 'cancelled'
GROUP BY u.id, u.name, u.email
HAVING SUM(o.total) >= (
  -- The 90th percentile cutoff
  SELECT PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY total)
  FROM orders WHERE status != 'cancelled'
)
ORDER BY total_spend DESC;

Conversion Funnel Analysis

Business question: Of all users, how many: (1) placed at least one order, (2) placed more than one order, (3) spent over $500 total? This is a simplified purchase funnel.

SQL – Funnel Analysis

WITH user_stats AS (
  SELECT
    u.id,
    u.name,
    COUNT(o.id)       AS order_count,
    SUM(o.total)      AS total_spend
  FROM users u
  LEFT JOIN orders o ON o.user_id = u.id AND o.status != 'cancelled'
  GROUP BY u.id, u.name
),

funnel AS (
  SELECT
    COUNT(*) AS total_users,

    -- Step 1: Users who placed at least one order
    COUNT(CASE WHEN order_count >= 1 THEN 1 END) AS placed_order,

    -- Step 2: Users who came back and ordered more than once (repeat buyers)
    COUNT(CASE WHEN order_count >= 2 THEN 1 END) AS repeat_buyer,

    -- Step 3: High-value customers (spent over $500)
    COUNT(CASE WHEN total_spend > 500 THEN 1 END) AS high_value
  FROM user_stats
)
SELECT
  'All Users'             AS step,
  total_users             AS user_count,
  100.0                   AS pct_of_previous

UNION ALL

SELECT
  '1. Placed any order',
  placed_order,
  ROUND(100.0 * placed_order / NULLIF(total_users, 0), 1)
FROM funnel

UNION ALL

SELECT
  '2. Placed 2+ orders (repeat)',
  repeat_buyer,
  ROUND(100.0 * repeat_buyer / NULLIF(placed_order, 0), 1)
FROM funnel

UNION ALL

SELECT
  '3. Spent > $500 (high-value)',
  high_value,
  ROUND(100.0 * high_value / NULLIF(placed_order, 0), 1)
FROM funnel;

Reading the result: Each step shows how many users reached that stage and the conversion rate from the previous step. A steep drop between "placed any order" and "placed 2+ orders" (e.g., only 20% repeat) indicates a retention problem. A large high-value segment (20%+ of buyers spend over your threshold) is a healthy sign of a loyal core customer base.

ℹ️

Extending the funnel with event data

In a real product analytics setup, the funnel would use an events table (page_view, add_to_cart, checkout_started, order_completed) rather than just orders. The SQL pattern is the same: count distinct users at each step and divide by the previous step's count. Tools like Mixpanel and Amplitude automate this, but knowing how to write it in SQL gives you control over exactly how steps are defined.

Summary

🏋️ Practical Exercise

Compute monthly revenue.
Find the top 5 products by sales.
Compute customer lifetime value.
Calculate month-over-month growth.
Find repeat customers.

🔥 Challenge Exercise

Write an analytics query that reports monthly revenue and month-over-month growth using a window function (LAG), then rank the top 3 products per month. Explain how window functions and GROUP BY combine for reporting.

📋 Summary

Cohort analysis — group users by signup month, track behavior month-by-month using DATE_TRUNC and AGE functions.
Retention — for each cohort, count users who were active in each subsequent month using conditional COUNT(DISTINCT CASE ...).
7-day moving average — use AVG() OVER (ROWS BETWEEN 6 PRECEDING AND CURRENT ROW); fill gaps with generate_series + LEFT JOIN + COALESCE.
Percentile / quartile analysis — use NTILE(n) for bucketing; PERCENTILE_CONT(p) for exact percentile values.
Funnel analysis — define funnel steps with CASE WHEN inside COUNT; chain UNION ALL to display as a vertical funnel report.
All patterns use CTEs for readability — complex analytics queries are much harder to follow without them.

Interview Questions

How do you compute a running total or month-over-month growth?
Which window functions are useful for analytics?
How do you find the top-N rows per group?
How do you calculate customer lifetime value?
How do GROUP BY and window functions differ in reports?

FAQ

What is the difference between PERCENTILE_CONT and PERCENTILE_DISC? +

PERCENTILE_CONT(p) (continuous) returns an interpolated value that may not exist in the actual dataset — for example, the median of [1, 3] would be 2.0. PERCENTILE_DISC(p) (discrete) returns the actual nearest value that exists in the dataset — the median of [1, 3] would be 1 or 3, not 2. For most analytics purposes, PERCENTILE_CONT is more useful because it gives a true statistical measure. Both are aggregate functions that use the WITHIN GROUP (ORDER BY ...) syntax.

Is cohort analysis possible in MySQL? +

Yes, with adaptations. MySQL doesn't have DATE_TRUNC — use DATE_FORMAT(date, '%Y-%m-01') to truncate to month. MySQL doesn't have generate_series — use a numbers table or a recursive CTE (MySQL 8.0+) to generate a date series. AGE intervals can be approximated with TIMESTAMPDIFF(MONTH, start, end). The logic is identical; only the date functions differ.

When should I use a BI tool instead of writing analytics SQL? +

BI tools like Metabase, Looker, Tableau, or Redash are great for recurring dashboards, non-technical stakeholders, and standard charts. Write custom analytics SQL when: the question is novel and not covered by existing dashboards; you need exact control over how a metric is defined; you're building data pipelines or dbt models; or you're debugging a discrepancy in a BI tool's output. SQL and BI tools complement each other — SQL generates the dataset, BI visualizes it.

How do I handle time zones in analytics queries? +

Always store timestamps in UTC. When querying for day/month breakdowns, convert to the user's local time zone before truncating: DATE_TRUNC('day', created_at AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York'). If you truncate in UTC, a purchase made at 11pm New York time (3am UTC the next day) will appear on the wrong day in reports. Most analytics databases (Redshift, BigQuery, Snowflake) have built-in timezone conversion functions — always document which timezone your dashboards use.