Cohort Analysis Of US Legislators Dataset Using SQL

Cohort analysis is a method used to track and analyze the behavior and performance of a specific group (called cohort) over time. It enables businesses to understand how different cohorts behave differently from one another or from a general population.

Typically, cohorts share common characteristics or experiences within a defined timeframe, such as customers who signed up during a particular month or users who joined a platform in a specific year.

For example, a subscription-based company (like Netflix) might use cohort analysis to compare the retention rates of customers who signed up in different months. This could reveal insights into whether newer cohorts exhibit different retention behaviors compared to older ones, helping the company adjust its retention strategies accordingly to improve customer lifetime value (CLTV).

Cohort analysis deals with retention, survivorship, returnship and cumulative.

Retention is concerned with whether the cohort member has a record in the time series on a particular date, expressed as a number of periods from the starting date.
The number of entities existing in the data set for a certain length of time or longer, regardless of the number or frequency of actions up to that time is called survivorship. Survivorship helps answer questions about the proportion of the population that can be expected to remain.
Returnship (repeat purchase behavior) tells whether an action has happened more than some minimum threshold of times—often simply more than once—during a fixed window of time.
Cumulative calculations are concerned with the total number or amounts measured at one or more fixed time windows, regardless of when they happened during that window. Cumulative calculations are often used in calculations of customer lifetime value (LTV or CLTV).

In the present article, we will focus only on retention in cohort analysis using SQL (on MySQL Workbench). We'll be using the US legislators dataset comprising the tables legislators_terms (44063 records) and legislators (12518 records).

We've also created a date series in a table date_dim which contains dates starting 1770-12-31 and ending 2020-12-31 (you'll have to set 'cte_max_recursion_depth' in MySQL to 100000 to override the default value of recursion depth of 1001, as show in the code's snapshot ahead).

The contents of the date_dim table can be visualized as ahead.

SELECT * FROM date_dim;

Here are the contents of the date_dim table.

As in other cohort analyses, retention analysis needs three components: the cohort definition, a time series of actions, and an aggregate metric that measures something relevant.

In our case, the cohort members will be the legislators, the time series will be the terms in office for each legislator, and the metric of interest will be the count of legislators who are still in office each period from the starting date.

Let us glance at the legislator_terms table, for starters.

SELECT * FROM legislators_terms;

Here is a snapshot of result of the above query.

We will breakdown our cohort analysis in a list of SQL queries (executed in MySQL Workbench), each SQL query drawing some insight from the previous one now and then.

1. Find out the earliest term start date (first term) of each legislator.

Eliciting every cohort member's starting date makes up one of the initial steps. The MIN function returns the earliest date on which a legislator started his/her term.

SELECT
    id_bioguide
    , MIN(term_start) AS first_term
FROM legislators_terms
GROUP BY id_bioguide;

Result contains 12518 records (all distinct legislator IDs) a snapshot of which is as follows:

2. Count the number of legislators in each period.

Since elections happen every two or six years, we’ll use years as the time interval to calculate the periods. These periods can be calculated from each legislator's first_term and term_start using the TIMESTAMPDIFF function (available in MySQL) in the outer query. Query-1 is the subquery here and is joined to legislators_terms table on id_bioguide attribute.

SELECT
    TIMESTAMPDIFF(YEAR, a.first_term, b.term_start) AS period
    , COUNT(DISTINCT a.id_bioguide) AS cohort_retained
FROM
(   SELECT
        id_bioguide
        , MIN(term_start) AS first_term
    FROM legislators_terms
    GROUP BY 1
) a JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide
GROUP BY 1;

Output is as follows. One can notice (referring to query-1 above) that there were 12518 legislators (our cohort) in the beginning.

3. What's the percentage of the cohort retained in each period?

We can calculate the retention percentages of our cohort over the periods by building upon the previous two queries (which become nested subqueries here).

In query-2, we've already obtained the period and cohort_retained in each period. Here, in the outer query, we are finding the initial cohort_size using the FIRST_VALUE window function, ordering by period.

The cohort_retained and cohort_size values then give us the percentage retention (pct_retained) of cohort over the periods.

SELECT
    period
    , FIRST_VALUE(cohort_retained) OVER(ORDER BY period) AS cohort_size
    , cohort_retained
    , cohort_retained/FIRST_VALUE(cohort_retained) OVER(ORDER BY period) * 100 AS pct_retained
FROM
(   SELECT
        TIMESTAMPDIFF(YEAR, a.first_term, b.term_start) AS period
        , MIN(term_start) AS first_term
        , COUNT(DISTINCT a.id_bioguide) AS cohort_retained
    FROM
    (   SELECT
            id_bioguide
            , MIN(term_start) as first_term
        FROM legislator_terms
        GROUP BY 1
    ) a JOIN legislator_terms b ON a.id_bioguide = b.id_bioguide GROUP BY 1
) tmp;

Here is the result.

Graphing the results, as shown below, demonstrates how the curve flattens and eventually goes to zero, as even the longest-serving legislators eventually retire or die.

4. Pivot the results of above query along periods (i.e. periods become columns as yr0, yr1, yr2, yr3 etc.)

Query-3 (above) works here as nested subqueries and pivoting the result is performed using aggregate functions in conjunction with CASE-WHEN-THEN clause in the outer query, as demonstrated ahead.

SELECT
    cohort_size
    , MAX(CASE WHEN period = 0 THEN pct_retained END) AS yr0
    , MAX(CASE WHEN period = 1 THEN pct_retained END) AS yr1
    , MAX(CASE WHEN period = 2 THEN pct_retained END) AS yr2
FROM
(   SELECT
        period
        , FIRST_VALUE(cohort_retained) OVER(ORDER BY period) AS cohort_size
        , cohort_retained
        , cohort_retained/FIRST_VALUE(cohort_retained) OVER(ORDER BY period) AS pct_retained
    FROM
    (   SELECT
            TIMESTAMPDIFF(YEAR, a.first_term, b.term_start) AS period
            , COUNT(DISTINCT a.id_bioguide) AS cohort_retained
        FROM
        (   SELECT
                id_bioguide
                , MIN(term_start) AS first_term
            FROM legislators_terms
            GROUP BY 1
        ) a JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide
        GROUP BY 1
    ) tmp1
) tmp2
GROUP BY 1;

The output is as ahead.

5. Create a data set that contains a record for each December 31 that each legislator was in office.

This is accomplished by JOINing query-1 (as subquery below) that yielded first_term, to the legislators_terms table to find the term_start and term_end for each term. A second JOIN to the date_dim table produces dates that fall between the start and end dates, restricting the returned values to c.month_name = 'December' and c.day_of_month = 31.

SELECT
    a.id_bioguide AS ID
    , a.first_term
    , b.term_start
    , b.term_end
    , c.date
    , TIMESTAMPDIFF(YEAR,  a.first_term, b.term_start) AS period
FROM
(   SELECT
        id_bioguide
        , MIN(term_start) AS first_term
    FROM legislators_terms
    GROUP BY 1
) a JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide
LEFT JOIN date_dim c ON c.date BETWEEN b.term_start AND b.term_end AND MONTHNAME(c.date) = 'December' AND DAY(c.date) = 31;

It produces the following dataset.

6. Calculate the cohort retained for each period as on December 31.

The COALESCE function sets default as 0 if and when a legislator's term starts and ends in same year resulting in 'period' as NULL.

SELECT
    COALESCE(TIMESTAMPDIFF(YEAR, a.first_term, c.date), 0) AS period
    , COUNT(DISTINCT a.id_bioguide) AS cohort_retained
FROM
(   SELECT
        id_bioguide
        , MIN(term_start) AS first_term
    FROM legislators_terms 
    GROUP BY 1
) a JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide
LEFT JOIN date_dim c ON c.date BETWEEN b.term_start AND b.term_end AND MONTHNAME(c.date) = 'December' AND DAY(c.date) = 31
GROUP BY 1;

Here is a snapshot of the output.

7. Continuing the above query, find the percentage of cohort retained using a window function.

Here, the above query serves as a subquery that returns period and cohort_retained. In the outer query, FIRST_VALUE window function is used to create a column cohort_size and populate it with the first value of the cohort_size column (12518). Recall that cohort size is basically the total number of legislators at the beginning. Values in columns cohort_retained and cohort_size are used to calculate the percentage of cohort retained.

SELECT
    period
    , FIRST_VALUE(cohort_retained) OVER(ORDER BY period) AS cohort_size
    , cohort_retained
    , cohort_retained/FIRST_VALUE(cohort_retained) OVER(ORDER BY period) AS pct_retained
FROM
(   SELECT
        COALESCE(TIMESTAMPDIFF(YEAR, a.first_term, c.date), 0) AS period
        , COUNT(DISTINCT a.id_bioguide) AS cohort_retained
    FROM
    (   SELECT
            id_bioguide
            , MIN(term_start) AS first_term
        FROM legislators_terms
        GROUP BY 1
    ) a JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide
    LEFT JOIN c0_date_dim c ON c.date BETWEEN b.term_start AND b.term_end AND MONTHNAME(c.date) = 'December' AND DAY(c.date) = 31 
    GROUP BY 1
) tmp;

Here is the desired output.

8. Find the cohort retained and percentage of cohort retention in every year.

The cohort retained in every year is returned by the following query:

SELECT
    YEAR(a.first_term) AS first_year
    , COALESCE(TIMESTAMPDIFF(YEAR,  a.first_term, c.date), 0) AS period
    , COUNT(DISTINCT a.id_bioguide) AS cohort_retained
FROM
(   SELECT
        id_bioguide
        , MIN(term_start) AS first_term
    FROM legislators_terms
    GROUP BY 1
) a JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide
LEFT JOIN date_dim c ON c.date BETWEEN b.term_start AND b.term_end AND c.month_name = 'December' AND c.day_of_month = 31
GROUP BY 1, 2;

It produces the following output.

The above query is then used as the subquery, and the cohort_size and pct_retained are calculated in the outer query as ahead (as done earlier). However, a PARTITION BY clause that includes first_year is used so that the FIRST_VALUE is calculated from only within the set of rows for that particular first_year, rather than across the whole result set from the subquery.

SELECT
    first_year
    , period
    , FIRST_VALUE(cohort_retained) OVER(PARTITION BY first_year ORDER BY period) AS cohort_size, cohort_retained
    , cohort_retained/FIRST_VALUE(cohort_retained) OVER(PARTITION BY first_year ORDER BY period) AS pct_retained
FROM
(   SELECT
        YEAR(a.first_term) as first_year
        , COALESCE(TIMESTAMPDIFF(YEAR, a.first_term, c.date), 0) AS period
        , COUNT(DISTINCT a.id_bioguide) AS cohort_retained
    FROM
    (   SELECT
            id_bioguide
            , MIN(term_start) as first_term
        FROM legislators_terms
        GROUP BY 1
    ) a JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide
    LEFT JOIN date_dim c ON c.date BETWEEN b.term_start AND b.term_end AND c.month_name = 'December' AND c.day_of_month = 31
    GROUP BY 1, 2
) tmp;

And here is the desired output.

9. Write a query to cohort the legislators by the century of their first term date.

Unlike PostgreSQL, MySQL doesn't provide an inbuilt century() function to return the correct century a date lies in. However, we can extract the year from a date using YEAR function, then extract the floor value after dividing the year by 100 and finally adding 1 to it to get the century. The following query illustrates this.

SELECT FLOOR(YEAR(c.date)/100) +1 AS century FROM date_dim c;

Here is a sample output. Notice the change of century from 18 to 19 when the year changes from 1799 to 1800. It shows the above query is correct in form and function.

Next, we are required to cohort the legislators as per the centuries their first term dates fall in. We will use the above query for extracting century, and the first query of query-8, as subquery here to get first_century, period and cohort_retained attributes. In the outer query, a PARTITION BY clause partitions the rows by first_century and orders the rows in each partition by period. Also, the working of FIRST_VALUE window function has been already covered in query-8.

SELECT
    first_century
    , period
    , FIRST_VALUE(cohort_retained) OVER(PARTITION BY first_century ORDER BY period) AS cohort_size
    , cohort_retained
    , cohort_retained/FIRST_VALUE(cohort_retained) OVER(PARTITION BY first_century ORDER BY period) AS pct_retained
FROM
(   SELECT
        FLOOR(YEAR(a.first_term)/100) + 1 AS first_century
        , COALESCE(TIMESTAMPDIFF(YEAR, a.first_term, c.date), 0) AS period
        , COUNT(DISTINCT a.id_bioguide) AS cohort_retained
    FROM
    (   SELECT
            id_bioguide
            , MIN(term_start) AS first_term
        FROM legislators_terms
        GROUP BY 1
    ) a JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide
    LEFT JOIN date_dim c ON c.date BETWEEN b.term_start AND b.term_end AND c.month_name = 'December' and c.day_of_month = 31
    GROUP BY 1,2
) aa
ORDER BY 1,2;

Here's a snapshot of the output.

10. Find the first state for each legislator in which his/her first ever term began in.

Window functions simplify our task a great deal.

SELECT
    DISTINCT id_bioguide
    , MIN(term_start) OVER(PARTITION BY id_bioguide) AS first_term
    , FIRST_VALUE(state) OVER(PARTITION BY id_bioguide ORDER BY term_start) AS first_state
FROM legislators_terms;

The desired output is as ahead.

11. Extending on the previous query (query-10), find the cohort retention by state.
We'll use query-10 as subquery, extract from it period and cohort_retained in an intermediate outer query (as done in query-9) and feed the resultant set to a final outer query which returns the retention by state.

SELECT
    first_state
    , period
    , FIRST_VALUE(cohort_retained) OVER(PARTITION BY first_state ORDER BY period) AS cohort_size
    , cohort_retained
    , cohort_retained/FIRST_VALUE(cohort_retained) OVER(PARTITION BY first_state ORDER BY period)*100 AS pct_retained
FROM
(   SELECT
        a.first_state
        , COALESCE(TIMESTAMPDIFF(YEAR, a.first_term, c.date), 0) AS period
        , COUNT(DISTINCT a.id_bioguide) AS cohort_retained
    FROM
    (   SELECT
            DISTINCT id_bioguide
            , MIN(term_start) OVER(PARTITION BY id_bioguide) AS first_term
            , FIRST_VALUE(state) OVER(PARTITION BY id_bioguide order by term_start) AS first_state
        FROM legislators_terms
    ) a JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide
    LEFT JOIN date_dim c ON c.date between b.term_start AND b.term_end AND c.month_name = 'December' and c.day_of_month = 31
    GROUP BY 1, 2
) aa;

Output:

12. Find the retention by genders of the legislators.

The legislators table has a gender field. So, we will join that table (aliased d) to query-8 above and replace the line returning first_year in query-8 with gender attribute.

SELECT
    d.gender
    , COALESCE(TIMESTAMPDIFF(YEAR, a.first_term, c.date), 0) AS period
    , COUNT(DISTINCT a.id_bioguide) AS cohort_retained
FROM
(   SELECT
        id_bioguide
        , MIN(term_start) AS first_term
    FROM legislators_terms
    GROUP BY 1
) a JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide
LEFT JOIN date_dim c ON c.date BETWEEN b.term_start AND b.term_end AND c.month_name = 'December' AND c.day_of_month = 31
JOIN legislators d ON a.id_bioguide = d.id_bioguide
GROUP BY 1,2;

Here is the output:

13. Building on the previous query, find the percentage retention by genders over the periods.

We have to partition the result of the above query by gender (ordering the partitions by periods), extract the first value of cohort size for each gender in each partition, and use it along with cohort retained to calculate the desired percentage retention.

SELECT
    gender
    , period
    , FIRST_VALUE(cohort_retained) OVER(PARTITION BY gender ORDER BY period) AS cohort_size
    , cohort_retained
    , cohort_retained/FIRST_VALUE(cohort_retained) OVER(PARTITION BY gender ORDER BY period) AS pct_retained
FROM
(   SELECT
        d.gender
        , COALESCE(TIMESTAMPDIFF(YEAR, a.first_term, c.date), 0) AS period
        , COUNT(DISTINCT a.id_bioguide) AS cohort_retained
    FROM
    (   SELECT
            id_bioguide
            , MIN(term_start) AS first_term
        FROM legislators_terms
        GROUP BY 1
    ) a JOIN legislators_terms b on a.id_bioguide = b.id_bioguide
    LEFT JOIN date_dim c ON c.date BETWEEN b.term_start AND b.term_end AND c.month_name = 'December' AND c.day_of_month = 31
    JOIN legislators d on a.id_bioguide = d.id_bioguide
    GROUP BY 1, 2
) tmp;

Here is the snapshot of the desired output. Click on it to enlarge the image. One can see that more males than females have served legislative terms.

------------------------------

Search This Blog

Data Analysis Through SQL

Cohort Analysis Of US Legislators Dataset Using SQL

Comments

Post a Comment

Popular posts from this blog

Time Series Analysis Of US Retail Sales Dataset Using SQL on MySQL Workbench