Time Series Analysis Of US Retail Sales Dataset Using SQL on MySQL Workbench

   Time series analysis helps in profiling and understanding a dataset. The result set of the analysis is a series of dates or timestamps and a numerical value. When graphing a time series, the dates or timestamps are plotted on the x-axis, and the numerical value becomes the y-axis.

    In this article, the time series analysis of the monthly US retail sales dataset using SQL on MySQL Workbench is presented. The data is used as an economic indicator to understand trends in US consumer spending patterns.

    Our dataset contains records of monthly retail sales in US from 1992 till 2020 and has been taken (in .XLS format originally) from the Census.gov website. The dataset has been converted into a format that can be easily interpreted and used for querying in MySQL Workbench. Its .CSV version was pulled as the retail_sales table in MySQL Workbench using Workbench's 'Table Data Import Wizard' utility. 
The retail_sales table has 22026 records (or rows) and 5 attributes (or columns). Sales is recorded as 'millions of USD' in the sales column.

Let us glance at the contents of the retail_sales table.

SELECT * FROM retail_sales;
(Preview of the US retail sales data)

Now, let us begin the time series analysis of this data set.

1. What's the trend of total retail and food services sales every year in the US?

This trend can be obtained by fetching only those records using the WHERE clause where the kind_of_business column has the value 'Retail and food services sales, total', extracting only the year from date, and aggregating the sales values for each record using SUM function.
SELECT 
    YEAR(sales_month) AS sales_year 
    , SUM(sales) AS total_sales 
FROM retail_sales 
WHERE kind_of_business = 'Retail and food services sales, total' 
GROUP BY 1; 
Here is the result:


The above result set can be graphed and the output will be like this:


[Note: In the above query, grouping using the index of a column in SELECT list as GROUP BY 1 is not a recommended practice, unless the SELECT list has many columns. Otherwise, it is advised to mention the grouping as GROUP BY YEAR(sales_month).]

2. Compare the yearly sales trend for a few categories that are associated with leisure activities, like book stores, sporting goods stores, and hobby stores.

Again, we have to extract year from sales_month (aliased sales_year), include the three leisure activities in the IN clause, and perform the SUM aggregation on the rows filtered thus.
SELECT 
    YEAR(sales_month) AS sales_year 
    , kind_of_business 
    , SUM(sales) AS total_sales 
FROM retail_sales 
WHERE kind_of_business IN ('Book stores' ,'Sporting goods stores', 'Hobby, toy, and game stores') 
GROUP BY 1, 2 
;
The result is as ahead.


The trend of the business-wise total sales is as graphed ahead.

Notice that the sales at sporting goods retailers started the highest among the three categories and grew much faster during the period, and by the period's end, those sales were substantially higher. Sales at sporting goods stores started declining in 2017 but rebounded in 2020.
We can see that the sales at hobby, toy, and game stores were relatively flat over this period, with a slight dip in the mid-2000s and another slight decline prior to a rebound in 2020.
The sales at book stores grew until the mid-2000s and have been on the decline since then.

3. What is the sales trend of men's clothing stores and women's clothing stores?

Here, we filter those records that have 'Men's clothing store' and 'Women's clothing store' as kind_of_business.
SELECT 
    YEAR(sales_month) AS sales_year 
    , kind_of_business 
    , SUM(sales) AS total_sales 
FROM retail_sales 
WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores') 
GROUP BY 1, 2 
;
Here's the output:


Graphing the above output, we note that the gap between men’s and women’s sales was increasing during the early to mid-2000s. Women’s clothing sales in particular dipped during the global financial crisis of 2008–2009, and sales in both categories dropped a lot during the Covid pandemic in 2020.

4. What is the gap between women's clothing stores' sales and men's clothing stores' sales over the years?

To find the gap between both types of clothing stores' sales, we have to first find out the yearly total sales of both types of stores. The query for that is as follows.
SELECT 
    YEAR(sales_month) AS sales_year 
    , SUM(CASE WHEN kind_of_business = 'Women''s clothing stores' THEN sales END) AS womens_sales 
    , SUM(CASE WHEN kind_of_business = 'Men''s clothing stores' THEN sales END) AS mens_sales 
FROM retail_sales 
WHERE kind_of_business in ('Men''s clothing stores', 'Women''s clothing stores') 
GROUP BY 1 
;
And the output is:











Using the above query as building block, we now calculate the difference between women's sales and men's sales.
SELECT      sales_year 
    , womens_sales - mens_sales AS womens_minus_mens 
    , mens_sales - womens_sales AS mens_minus_womens 
FROM 
(   SELECT 
        YEAR(sales_month) AS sales_year 
        , SUM(CASE WHEN kind_of_business = 'Women''s clothing stores' THEN sales END) AS womens_sales 
        , SUM(CASE WHEN kind_of_business = 'Men''s clothing stores' THEN sales END) AS mens_sales 
    FROM retail_sales 
    WHERE kind_of_business in ('Men''s clothing stores', 'Women''s clothing stores') AND sales_month <= '2019-12-01' 
    GROUP BY 1 
) tmp 
;
The output is as shown.


The graph of the above result-set is as follows. It shows that the gap decreased between 1992 and about 1997, began a long increase through about 2011 (with a brief dip in 2007), and then was more or less flat through 2019.







Now, we find the ratio of these categories. We’ll use men’s sales as the baseline or denominator (but women’s store sales can also be used as baseline, of course):
SELECT
    sales_year 
    , womens_sales / mens_sales AS womens_times_of_mens 
FROM 
(   SELECT 
        YEAR(sales_month) AS sales_year 
        , SUM(CASE WHEN kind_of_business = 'Women''s clothing stores' THEN sales END) AS womens_sales 
        , SUM(CASE WHEN kind_of_business = 'Men''s clothing stores' THEN sales END) AS mens_sales 
    FROM retail_sales 
    WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores') AND sales_month <= '2019-12-01' 
    GROUP BY 1 
) tmp 
;

Output:







Next, building upon the above query, we can calculate the percent difference between sales at women’s and men’s clothing stores:
SELECT 
    sales_year 
    , (womens_sales / mens_sales - 1) AS womens_pct_of_mens 
FROM 
(   SELECT 
        YEAR(sales_month) AS sales_year 
        , SUM(CASE WHEN kind_of_business = 'Women''s clothing stores' THEN sales END) AS womens_sales 
        , SUM(CASE WHEN kind_of_business = 'Men''s clothing stores' THEN sales END) AS mens_sales 
    FROM retail_sales 
    WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores') AND sales_month <= '2019-12-01' 
    GROUP BY 1 
) tmp 
;
Output of the above query:








5. For each year's total sales, calculate the percentage contribution each of men's stores sales and women's stores sales to that year's total sales (using only self-join).

We can JOIN retail_sales table, aliased as a, to retail_sales, aliased as b, on the sales_month field. We then SELECT the individual series name (kind_of_business) and sales values from alias a. Then, from alias b we sum the sales for both categories and call the result total_sales.
SELECT 
    sales_month 
    , kind_of_business 
    , sales * 100 / total_sales AS pct_total_sales 
FROM 
(   SELECT 
        a.sales_month 
        , a.kind_of_business 
        , a.sales 
        , SUM(b.sales) AS total_sales 
    FROM retail_sales a 
    JOIN retail_sales b ON a.sales_month = b.sales_month AND b.kind_of_business IN ('Men''s clothing    stores', 'Women''s clothing stores') 
    WHERE a.kind_of_business IN ('Men''s clothing stores' , 'Women''s clothing stores') 
    GROUP BY 1,2,3 
) tmp; 

It yields the following output.







6. Use window functions to find the trend in percentage of total sales in women's stores sales and men's stores sales over the period.

Our query must aggregate (i.e. SUM as window function) the sales for each type of store in each month in every year. Hence:

SELECT 
    sales_month 
    , kind_of_business 
    , sales 
    , SUM(sales) OVER(PARTITION BY sales_month) AS total_sales 
    , sales * 100 / SUM(sales) OVER (PARTITION BY sales_month) AS pct_total 
FROM retail_sales 
WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores'); 
Output:


Graphing the above output reveals the following trend in both types of stores' sales.










7. What is the percent of yearly sales each month contributes to the total sales (for the men's stores sales and women's stores sales)?

A self-JOIN and a subquery can be used but we will calculate the result here through window function.

SELECT
    sales_month
    , kind_of_business
    , sales
    , SUM(sales) OVER (PARTITION BY YEAR(sales_month), kind_of_business) AS yearly_sales
    , sales * 100 / SUM(sales) OVER (PARTITION BY YEAR(sales_month), kind_of_business) AS pct_yearly 
FROM retail_sales 
WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores') 
;

Output is as ahead.

The result was graphed for the year 2019 (graph below). Notice that the trends for both store types track fairly closely, but men’s stores had a greater percentage of their sales in January than did women’s stores. Men’s stores also saw a dip in July, while the corresponding dip in women’s store sales wasn’t until September.

8. What is the percentage change in women's stores sales year by year considering 1992 as base year?

Here, we need to first calculate the yearly sales only for women's stores and set 1992's sales as base_year_sales.
SELECT 
    sales_year 
    , sales 
    , FIRST_VALUE(sales) OVER(ORDER BY sales_year) AS base_year_sales 
    , (sales / FIRST_VALUE(sales) OVER(ORDER BY sales_year) - 1) * 100 AS pct_from_base 
FROM 
(   SELECT 
        YEAR(sales_month) AS sales_year 
        , SUM(sales) AS sales 
    FROM retail_sales 
    WHERE kind_of_business = 'Women''s clothing stores' 
    GROUP BY 1 
) tmp; 
This query yields the following output.


And, the trend of percentage change in sales is as follows:


9. Calculate the moving average sales for women's stores taking a window size of 12 months (i.e. date starts from 1993-01-01).

This can be accomplished by self-joining retail_sales table. Note that the count of records returned from alias b is included to confirm that each row averages 12 data points, a useful data quality check
SELECT 
    a.sales_month 
    , a.sales 
    , AVG(b.sales) AS moving_avg 
    , COUNT(b.sales) AS records_count 
FROM retail_sales a 
JOIN retail_sales b 
ON a.kind_of_business = b.kind_of_business 
AND b.sales_month BETWEEN a.sales_month - INTERVAL 11 MONTH 
AND a.sales_month 
AND b.kind_of_business = 'Women''s clothing stores' 
WHERE a.kind_of_business = 'Women''s clothing stores' 
AND a.sales_month >= '1993-01-01' 
GROUP BY 1, 2; 
And here is the result:


The following trend emerges when the above result is graphed.



While the monthly trend looks noisy, the smoothed moving average trend makes detecting changes such as the increase from 2003 to 2007 and the subsequent dip through 2011 easier to spot. Notice that the
extreme drop in early 2020 (owing to Covid) pulls the moving average down even after sales start to rebound later in the year.

10. Implement the previous query using window functions.

SELECT 
    sales_month 
    , AVG(sales) OVER(ORDER BY sales_month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS moving_avg 
    , COUNT(sales) OVER(ORDER BY sales_month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS records_count 
FROM retail_sales 
WHERE kind_of_business = 'Women''s clothing stores'; 

The output is as follows.


Notice that the records_count column in the result-set displays number of preceding rows counted as per the frame ROWS BETWEEN 11 PRECEDING AND CURRENT ROW. It sets to 12 after the count of previous rows equals 11; all the subsequent rows have 12 as records_count value.

11. Calculate the cumulative women's stores sales over the period.

For this, SUM is used (as window function) to find total sales YTD as of each month. The window resets according to the PARTITION BY clause, in this case the year of the sales month.
The ORDER BY clause typically includes a date field in time series analysis. Omitting the ORDER BY can lead to incorrect results due to the way the data is sorted in the underlying table, so it’s a good idea to include it in our query below.

SELECT 
    sales_month
    , sales
    , SUM(sales) OVER(PARTITION BY YEAR(sales_month) ORDER BY sales_month) AS sales_ytd 
FROM retail_sales 
WHERE kind_of_business = 'Women''s clothing stores'; 
Here is the result.



The results for years 2016 through 2020 are graphed ahead.


12. Write a query to find the percent change trend in the sales between previous month & current month for all 'book stores' businesses in the dataset.

The LAG window function will return the value in the previous row when the partition is done over kind_of_business. Sales value of the previous record is used to calculate the percentage change as implemented in the following query.

SELECT 
    kind_of_business 
    , sales_month 
    , sales 
    , LAG(sales_month) OVER(PARTITION BY kind_of_business ORDER BY sales_month) AS prev_month 
    , LAG(sales) OVER(PARTITION BY kind_of_business ORDER BY sales_month) AS prev_month_sales 
    , (sales/LAG(sales) OVER(PARTITION BY kind_of_business ORDER BY sales_month)-1)*100 AS pct_growth_from_previous 
FROM retail_sales 
WHERE kind_of_business = 'Book stores'; 
The query yields the following result.



The first row has null values in the first row of the last three columns since there is no earlier record in this dataset.

13. Extract the YoY percentage change over the period for book store sales.

This is similar to the previous query. Here, we need to first aggregate the sales to the yearly level in a sub-query and feed the intermediate result to an outer query with window functions.

SELECT 
    sales_year 
    , yearly_sales 
    , LAG(yearly_sales) OVER(ORDER BY sales_year) AS prev_year_sales 
    , (sales/LAG(yearly_sales) OVER(ORDER BY sales_year)-1)*100 AS pct_growth_from_previous 
FROM 
(   SELECT 
        YEAR(sales_month) AS sales_year 
        , SUM(sales) AS yearly_sales 
    FROM retail_sales 
    WHERE kind_of_business = 'Book stores' 
    GROUP BY 1 
) tmp; 
The output of the query is:


14. Compare the sales for book stores for current month this year to current month the previous year.

This is accomplished by partitioning the dataset by months of each year and ordering each partition by sales_month column.
SELECT 
    sales_month 
    , sales 
    , LAG(sales_month) OVER(PARTITION BY MONTH(sales_month) ORDER BY sales_month) AS prev_year_month 
    , LAG(sales) OVER(PARTITION BY MONTH(sales_month) ORDER BY sales_month) AS prev_year_sales 
FROM retail_sales 
WHERE kind_of_business = 'Book stores'; 
The query yields the following output.



15. Create a result set that has a row for each month number and month name, and a maximum sales column for each of the years 1992, 1993 and 1994. The kind of business is 'Book stores'.
SELECT 
    MONTH(sales_month) AS month_number 
    , MONTHNAME(sales_month) AS month_name 
    , MAX(CASE WHEN YEAR(sales_month) = 1992 THEN sales END) AS sales_1992 
    , MAX(CASE WHEN YEAR(sales_month) = 1993 THEN sales END) AS sales_1993 
    , MAX(CASE WHEN YEAR(sales_month) = 1994 THEN sales END) AS sales_1994 
FROM retail_sales 
WHERE kind_of_business = 'Book stores' AND sales_month BETWEEN '1992-01-01' AND '1994-12-01' 
GROUP BY 1, 2; 
Here's the result-set.



Observe how there are only 12 rows (corresponding to the months in a year); there can be as many sales columns for each year as there are number of years under consideration. Graph of the result-set is ahead.



Sales increased year to year in every month, though the increases were larger in some months than others.

16. Compare the percent of the rolling average of book stores sales over 3 prior years.

We’ll first of all inspect the returned values to confirm the SQL is working as expected.
SELECT 
    sales_month 
    , sales 
    , LAG(sales, 1) OVER (PARTITION BY MONTH(sales_month) ORDER BY sales_month) AS prev_sales_1 
    , LAG(sales, 2) OVER (PARTITION BY MONTH(sales_month) ORDER BY sales_month) AS prev_sales_2 
    , LAG(sales, 3) OVER (PARTITION BY MONTH(sales_month) ORDER BY sales_month) AS prev_sales_3 
FROM retail_sales 
WHERE kind_of_business = 'Book stores'; 
It produces the following output.



Null is returned in the top rows where no prior record exists, and we can confirm that the correct month, prior year value appears. Now we can find the percent of the rolling average of three prior periods:
SELECT 
    sales_month 
    , sales 
    , sales / ((prev_sales_1 + prev_sales_2 + prev_sales_3) / 3) * 100 AS pct_of_3_prev 
FROM 
(   SELECT 
        sales_month 
        , LAG(sales, 1) OVER (PARTITION BY MONTH(sales_month) ORDER BY sales_month) AS prev_sales_1 
        , LAG(sales, 2) OVER (PARTITION BY MONTH(sales_month) ORDER BY sales_month) AS prev_sales_2 
        , LAG(sales, 3) OVER (PARTITION BY MONTH(sales_month) ORDER BY sales_month) AS prev_sales_3 
    FROM retail_sales 
    WHERE kind_of_business = 'Book stores'; 
)
And here is the desired output:



The above query can also be re-written in a more compact way as:
SELECT 
    sales_month 
    , sales 
    , sales / AVG(sales) OVER (PARTITION BY MONTH(sales_month) ORDER BY sales_month ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS pct_of_prev_3 
FROM retail_sales 
WHERE kind_of_business = 'Book stores'; 
This produces values even for the second and third rows from the top (refer above output):

















-----------------------------------------







Comments

Popular posts from this blog

Cohort Analysis Of US Legislators Dataset Using SQL