[Data Analysis] SQL Recipe for Data Analysis Tutorial #4 - Manipulating a single table

Goal

  • Studying SQL for data analysis
  • Practicing Chapter 3 of the book
  • Studying SQL for manipulating data



Practice

Book for studying

Title 데이터 분석을 위한 SQL 레시피
Author Nagato Kasaki, Naoto Tamiya
Translator 윤인성
Publisher 한빛미디어
Chapter 3
Lesson 7


  • I’m gonna study basic SQL syntax and functions for data analysis
  • I’m gonna study how to manipulate a single table
  • I’m gonna study data aggregation and data processing
  • I’m gonna load the test data into PostgreSQL and execute the queries in the book


3.7. Manipulating a single table

  • Data aggregation
    • SQL supports a lot of functions for data aggregation
    • In particular, the SQL:2003 standard supports window functions, which perform a calculation across a set of table rows that are somehow related to the current row
  • Data processing
    • Sometimes you need to convert a table into the form you want because it makes aggregation easier



3.7.1. Make groups

  • Aggregation means making one value from multiple records, e.g. COUNT (returns the total record count) and SUM (returns the sum of all records)

Create table and insert sample data

--DROP TABLE IF EXISTS review;
-- Create review table
CREATE TABLE review (
    user_id    varchar(255)
  , product_id varchar(255)
  , score      numeric
);

-- Insert sample data
INSERT INTO review
VALUES
    ('U001', 'A001', 4.0)
  , ('U001', 'A002', 5.0)
  , ('U001', 'A003', 5.0)
  , ('U002', 'A001', 3.0)
  , ('U002', 'A002', 3.0)
  , ('U002', 'A003', 4.0)
  , ('U003', 'A001', 5.0)
  , ('U003', 'A002', 4.0)
  , ('U003', 'A003', 4.0)
;

commit;

-- Select sample data
SELECT *
  FROM review
;

screenshot001


Calculate several feature values from all records

| Function | Description |
| --- | --- |
| DISTINCT | Removes duplicates |
| SUM | Returns the sum |
| AVG | Returns the average |
| MAX | Returns the maximum value |
| MIN | Returns the minimum value |

-- Calculate several feature values from all records
SELECT COUNT(*) AS total_count
     , COUNT(DISTINCT user_id) AS user_count
     , COUNT(DISTINCT product_id) AS product_count
     , SUM(score) AS sum
     , AVG(score) AS avg
     , MAX(score) AS max
     , MIN(score) AS min
  FROM review
;

screenshot002


Calculate several feature values for each group

  • The GROUP BY clause aggregates records by the specified key
  • I’m gonna make groups by user_id
-- Calculate several feature values for each group
SELECT user_id
     , COUNT(*) AS total_count
     , COUNT(DISTINCT product_id) AS product_count
     , SUM(score) AS sum
     , AVG(score) AS avg
     , MAX(score) AS max
     , MIN(score) AS min
  FROM review
 GROUP BY user_id 
;

screenshot003


Handle aggregated values and raw values simultaneously

  • I’ll calculate the difference between each user’s individual score(score) and the average score by user(user_avg_score)
  • **OVER(PARTITION BY column_name)** : groups rows by column_name
  • OVER() : covers all records in the table
-- Calculate each user's individual score(score) and the average score by user(user_avg_score)
SELECT user_id
     , product_id
     -- User's individual score
     , score
     -- The overall average score
     , AVG(score) OVER() AS avg_score
     -- User's average score
     , AVG(score) OVER(PARTITION BY user_id) AS user_avg_score
     -- The difference
     , score - AVG(score) OVER(PARTITION BY user_id) AS user_avg_score_diff
  FROM review
;

screenshot004
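
For comparison, here is the same user average computed without window functions, using a GROUP BY subquery joined back to the detail rows. This is my own sketch, not the book’s code; it shows the extra work that the window function saves.

-- The same result without window functions: aggregate per user, then JOIN back
SELECT r.user_id
     , r.product_id
     , r.score
     , a.user_avg_score
     , r.score - a.user_avg_score AS user_avg_score_diff
  FROM review AS r
    JOIN
      (SELECT user_id
            , AVG(score) AS user_avg_score
         FROM review
        GROUP BY user_id
      ) AS a
      ON r.user_id = a.user_id
;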


-- DROP TABLE IF EXISTS popular_products;
-- Create popular product table
CREATE TABLE popular_products (
    product_id varchar(255)
  , category   varchar(255)
  , score      numeric
);

-- Insert sample data
INSERT INTO popular_products
VALUES
    ('A001', 'action', 94)
  , ('A002', 'action', 81)
  , ('A003', 'action', 78)
  , ('A004', 'action', 64)
  , ('D001', 'drama' , 90)
  , ('D002', 'drama' , 82)
  , ('D003', 'drama' , 78)
  , ('D004', 'drama' , 58)
;
     
commit;

-- Select sample data
SELECT *
  FROM popular_products
;

screenshot005


Ordering in a group

  • You can order records with ORDER BY
  • There are several functions that return a sequential order

| Function | Description |
| --- | --- |
| ROW_NUMBER | Returns a unique sequence number |
| RANK | Returns the rank in a group and skips numbers when there are ties, e.g. 1, 2, 2, 4, 5 |
| DENSE_RANK | Returns the rank in a group and doesn’t skip numbers when there are ties, e.g. 1, 2, 2, 3, 4 |

  • There are also functions that return the previous or next record

| Function | Description |
| --- | --- |
| LAG(column_name) | Returns the previous record in a group |
| LAG(column_name, n) | Returns the nth previous record in a group |
| LEAD(column_name) | Returns the next record in a group |
| LEAD(column_name, n) | Returns the nth next record in a group |
-- Ordering in a table
SELECT product_id
     , score
     -- Unique rank by score
     , ROW_NUMBER() OVER(ORDER BY score DESC) AS rownum
     -- Skip duplicate rank
     , RANK() OVER(ORDER BY score DESC) AS rank
     -- Don't skip duplicate rank
     , DENSE_RANK() OVER(ORDER BY score DESC) AS dense_rank 
     -- Previous record
     , LAG(product_id) OVER(ORDER BY score DESC) AS lag1
     -- Previous 2nd record
     , LAG(product_id, 2) OVER(ORDER BY score DESC) AS lag2
     -- Next record
     , LEAD(product_id) OVER(ORDER BY score DESC) AS lead1
     -- Next 2nd record
     , LEAD(product_id, 2) OVER(ORDER BY score DESC) AS lead2
  FROM popular_products
 ORDER BY rownum 
;

screenshot006


Combine ordering and aggregate functions

  • There are functions that return the first or last value in a window, like FIRST_VALUE and LAST_VALUE
  • And you can set the range of a window
-- Combine ordering and aggregate functions
SELECT product_id
     , score
     , ROW_NUMBER() OVER(ORDER BY score DESC) 
         AS rownum
     -- Cumulative score from the top-ranked record to the current record
     , SUM(score) 
         OVER(ORDER BY score DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
         AS cum_score
     -- Average score of the previous, current, and next records
     , AVG(score)
         OVER(ORDER BY score DESC
           ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
         AS local_avg
     -- Product_id of top ranked record
     , FIRST_VALUE(product_id) 
         OVER(ORDER BY score DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
         AS first_value
     -- Product_id of bottom ranked record
     , LAST_VALUE(product_id) 
         OVER(ORDER BY score DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
         AS last_value
  FROM popular_products
 ORDER BY rownum 
;

screenshot007


Define a window frame

  • Define a relative frame based on the current record
  • ROWS BETWEEN start AND end is the basic clause
  • You can use CURRENT ROW (the current row), n PRECEDING (n rows before), n FOLLOWING (n rows after), UNBOUNDED PRECEDING (all preceding rows), and UNBOUNDED FOLLOWING (all following rows) as start or end
  • If you don’t define a window frame, the default frame is all rows without ORDER BY, or from the first record to the current record with ORDER BY (see the sketch below)
  • PostgreSQL supports array_agg(column_name), and Hive and SparkSQL support collect_list(column_name), for aggregating rows in a frame

    Redshift supports listagg, which is similar to array_agg and collect_list; however, it cannot be used with a window frame
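
A minimal sketch of the default frame behavior (my own example, not from the book): without ORDER BY the frame covers all rows, while with ORDER BY it runs from the first record up to the current record, which turns SUM into a running total.

-- Demonstrate the default window frames (no explicit ROWS BETWEEN clause)
SELECT product_id
     , score
     -- No ORDER BY: the frame is all rows, so this is the grand total
     , SUM(score) OVER() AS total_score
     -- ORDER BY without a frame clause: a running total up to the current record
     , SUM(score) OVER(ORDER BY score DESC) AS running_score
  FROM popular_products
 WHERE category = 'action'
;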

-- Integrate product_id by window frames
SELECT product_id
     , ROW_NUMBER() OVER(ORDER BY score DESC) 
         AS rownum
     -- All records
     , array_agg(product_id)	-- collect_list(product_id)
         OVER(ORDER BY score DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
         AS whole_agg
     -- From the first record to current record
     , array_agg(product_id)
         OVER(ORDER BY score DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
         AS cum_agg
     -- From previous record to next record
     , array_agg(product_id)
         OVER(ORDER BY score DESC
           ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
         AS local_agg
  FROM popular_products
 WHERE category = 'action'
 ORDER BY rownum 
;

screenshot008


Combine PARTITION BY and ORDER BY

-- Combine PARTITION BY and ORDER BY
SELECT category
     , product_id
     , score
     -- Unique rank by category
     , ROW_NUMBER()
         OVER(PARTITION BY category ORDER BY score DESC)
         AS rownum
     -- Rank with duplication by category
     , RANK()
         OVER(PARTITION BY category ORDER BY score DESC)
         AS rank
     -- Rank with duplication, without skipping numbers
     , DENSE_RANK()
         OVER(PARTITION BY category ORDER BY score DESC)
         AS dense_rank
  FROM popular_products pp 
 ORDER BY category, rownum 
;

screenshot009


Extract top n by category

-- Extract top 2 products by category
SELECT *
  FROM 
    (SELECT category
          , product_id
          , score
          , ROW_NUMBER()
              OVER(PARTITION BY category ORDER BY score DESC)
              AS rank
      FROM popular_products pp 
   ) AS popular_products_with_rank
 WHERE rank  <= 2
 ORDER BY category, rank 
;

screenshot010


-- Extract the top ranked product by category
SELECT DISTINCT category
    -- the top ranked product_id
    , FIRST_VALUE(product_id)
        OVER(PARTITION BY category ORDER BY score DESC
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
        AS product_id
  FROM popular_products pp
;

screenshot011


3.7.3. Convert rows into columns

Create table and insert sample data

-- DROP TABLE IF EXISTS daily_kpi;
-- Create KPI(Key Performance Indicator) table
CREATE TABLE daily_kpi (
    dt        varchar(255)
  , indicator varchar(255)
  , val       integer
);

-- Insert sample data
INSERT INTO daily_kpi
VALUES
    ('2020-01-01', 'impressions', 1800)
  , ('2020-01-01', 'sessions'   ,  500)
  , ('2020-01-01', 'users'      ,  200)
  , ('2020-01-02', 'impressions', 2000)
  , ('2020-01-02', 'sessions'   ,  700)
  , ('2020-01-02', 'users'      ,  250)
;

commit;

-- Select sample data
SELECT *
  FROM daily_kpi
;

screenshot012


Convert rows into columns

  • Use GROUP BY and MAX(CASE ...) because there are no duplicate indicators per dt (see the variation below)
-- Convert rows into columns through GROUP BY and MAX(CASE)
SELECT dt
     , MAX(CASE WHEN indicator = 'impressions' THEN val END)
         AS impressions
     , MAX(CASE WHEN indicator = 'sessions' THEN val END)
         AS sessions
     , MAX(CASE WHEN indicator = 'users' THEN val END)
         AS users
  FROM daily_kpi 
 GROUP BY dt 
 ORDER BY dt 
;

screenshot013
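
If the same indicator could appear more than once per dt, MAX would silently pick one value. A variation (my own, not the book’s) that sums the duplicates instead:

-- Variation: SUM(CASE ...) aggregates duplicate indicator rows per dt
SELECT dt
     , SUM(CASE WHEN indicator = 'impressions' THEN val ELSE 0 END) AS impressions
     , SUM(CASE WHEN indicator = 'sessions'    THEN val ELSE 0 END) AS sessions
     , SUM(CASE WHEN indicator = 'users'       THEN val ELSE 0 END) AS users
  FROM daily_kpi
 GROUP BY dt
 ORDER BY dt
;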


Create purchase detail log table and insert sample data

-- DROP TABLE IF EXISTS purchase_detail_log;
-- Create purchase log table
CREATE TABLE purchase_detail_log (
    purchase_id integer
  , product_id  varchar(255)
  , price       integer
);

-- Insert sample data
INSERT INTO purchase_detail_log
VALUES
    (100001, 'A001', 3000)
  , (100001, 'A002', 4000)
  , (100001, 'A003', 2000)
  , (100002, 'D001', 5000)
  , (100002, 'D002', 3000)
  , (100003, 'A001', 3000)
;

commit;

-- Select sample data
SELECT *
  FROM purchase_detail_log
;

screenshot014


Convert rows into one row with comma delimiters

  • PostgreSQL and BigQuery support string_agg(column_name, delimiter), and Redshift supports listagg(column_name, delimiter), to combine rows into one row with delimiters
  • Hive and SparkSQL can do the same through collect_list(column_name) and concat_ws(delimiter, array)
-- Convert rows into one row with comma delimiters
SELECT purchase_id
     -- if PostgreSQL, BigQuery
     , string_agg(product_id, ',') AS product_ids
     -- if Redshift
     -- , listagg(product_id, ',') AS product_ids
     -- if Hive, SparkSQL
     -- , concat_ws(',', collect_list(product_id)) AS product_ids
     , SUM(price) AS amount
  FROM purchase_detail_log 
 GROUP BY purchase_id 
 ORDER BY purchase_id 
;

screenshot015


3.7.4. Convert columns into rows

  • It is more complicated than converting rows into columns
  • However, it is useful for data analysis and there are many cases that need it

Create quarterly sales table and insert sample data

-- DROP TABLE IF EXISTS quarterly_sales;
-- Create quarterly sales table
CREATE TABLE quarterly_sales (
    year integer
  , q1   integer
  , q2   integer
  , q3   integer
  , q4   integer
);

-- Insert sample data
INSERT INTO quarterly_sales
VALUES
    (2017, 82000, 83000, 78000, 83000)
  , (2018, 85000, 85000, 80000, 81000)
  , (2019, 92000, 81000, NULL , NULL )
;

commit;

-- Select sample data
SELECT * 
  FROM quarterly_sales
;

screenshot016


Convert columns into rows

  • You have to make a pivot table and do a CROSS JOIN with it
-- CROSS JOIN with a pivot table
SELECT q.year
     -- Print labels from Q1 to Q4
     , CASE WHEN p.idx = 1 THEN 'q1'
            WHEN p.idx = 2 THEN 'q2'
            WHEN p.idx = 3 THEN 'q3'
            WHEN p.idx = 4 THEN 'q4'
            END AS quarters
     -- Print sales from Q1 to Q4
     , CASE WHEN p.idx = 1 THEN q.q1
            WHEN p.idx = 2 THEN q.q2
            WHEN p.idx = 3 THEN q.q3
            WHEN p.idx = 4 THEN q.q4
            END AS sales
  FROM quarterly_sales AS q
    CROSS JOIN
      -- Create a sequence table matching the quarter columns in quarterly_sales
      (SELECT 1 AS idx
       UNION ALL
       SELECT 2 AS idx
       UNION ALL
       SELECT 3 AS idx
       UNION ALL
       SELECT 4 AS idx
      ) AS p
;  

screenshot017
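
In PostgreSQL you can also build the sequence with generate_series instead of chaining UNION ALL. A minimal sketch (my own, assuming PostgreSQL):

-- Same unpivot with generate_series instead of a UNION ALL pivot table
SELECT q.year
     , 'q' || p.idx AS quarters
     , CASE WHEN p.idx = 1 THEN q.q1
            WHEN p.idx = 2 THEN q.q2
            WHEN p.idx = 3 THEN q.q3
            WHEN p.idx = 4 THEN q.q4
            END AS sales
  FROM quarterly_sales AS q
    CROSS JOIN generate_series(1, 4) AS p(idx)
;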


Create purchase log table and insert sample data

-- DROP TABLE IF EXISTS purchase_log;
-- Create purchase_log
CREATE TABLE purchase_log (
    purchase_id integer
  , product_ids varchar(255)
);

-- Insert sample data
INSERT INTO purchase_log
VALUES
    (100001, 'A001,A002,A003')
  , (100002, 'D001,D002')
  , (100003, 'A001')
;

commit;

-- Select sample data
SELECT *
  FROM purchase_log
;

screenshot018


Convert an array into rows

  • Several middleware engines support table functions, which return a table
  • For example, PostgreSQL and BigQuery have the unnest(array) function, and Hive and SparkSQL have the explode(array) function; they convert an array into a table
-- Convert an array into a table
-- if PostgreSQL
SELECT unnest(ARRAY['A001', 'A002', 'A003']) AS product_id;
-- if BigQuery
-- SELECT * FROM unnest(ARRAY['A001', 'A002', 'A003']) AS product_id;
-- if Hive, SparkSQL
-- SELECT explode(ARRAY['A001', 'A002', 'A003']) AS product_id;

screenshot019


  • You have to use CROSS JOIN in PostgreSQL or BigQuery and LATERAL VIEW in Hive or SparkSQL
-- Convert a string with delimiters into a table
SELECT purchase_id
     , product_id
  FROM purchase_log AS p
    -- if PostgreSQL
    CROSS JOIN unnest(string_to_array(product_ids, ',')) AS product_id
    -- if BigQuery
    -- CROSS JOIN unnest(split(product_ids, ',')) AS product_id
    -- if Hive, SparkSQL
    -- LATERAL VIEW explode(split(product_ids, ',')) e AS product_id
;

screenshot020


  • You can use a scalar value and a table function at the same time in PostgreSQL
  • PostgreSQL supports regexp_split_to_table(), which returns a table from a string split by a regular expression
-- Convert a string with delimiters into a table without CROSS JOIN
SELECT purchase_id
     , regexp_split_to_table(product_ids, ',') AS product_id
  FROM purchase_log
;

screenshot021


Convert a string into rows in Redshift

  • Redshift doesn’t officially support an array data type
  • Therefore, this is more complicated in Redshift and requires some preprocessing
  • The number of products can be derived from the number of delimiters: e.g. 'A001,A002,A003' contains 2 commas, so it holds 1 + 2 = 3 products
-- Calculate the number of products from the number of delimiters in Redshift
SELECT purchase_id
     , product_ids
     -- Remove the commas from product_ids
     -- and derive the number of products from the number of delimiters
     , 1 + char_length(product_ids) - char_length(replace(product_ids, ',', ''))
         AS product_num
  FROM purchase_log
;

screenshot022


-- Convert a string into rows using a pivot table in Redshift
SELECT l.purchase_id
     , l.product_ids
     -- Set a sequence by the count of products
     , p.idx
     , split_part(l.product_ids, ',', p.idx) AS product_id 
  FROM purchase_log AS l
    JOIN
      (SELECT 1 AS idx
       UNION ALL
       SELECT 2 AS idx
       UNION ALL
       SELECT 3 AS idx 
      ) AS p
      -- Join while the pivot index is less than or equal to the number of products
      ON p.idx <= (1 + char_length(product_ids) - char_length(replace(product_ids, ',', '')))
;

screenshot023



References

  • 데이터 분석을 위한 SQL 레시피 - 한빛미디어
  • SQL:2003 Wiki
