Hadoop Application Architectures Ch.4 Common Hadoop Processing Patterns

2019-12-10

Examples:

Removing Duplicate Records by Primary Key

Spark
- map() each record to a (primary key, record) pair, then reduceByKey() to compact each key down to its latest record
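
The map-then-reduce step above can be sketched in plain Python. This is a minimal local illustration, not Spark code: `reduce_by_key` is a hypothetical stand-in for what `RDD.reduceByKey` does, and the field names `id`, `ts`, and `value` are assumptions made for the example.

```python
def to_pair(record):
    # map() step: key each record by its primary key (assumed field "id")
    return (record["id"], record)


def newer(a, b):
    # reduce step: keep whichever record carries the later timestamp
    return a if a["ts"] >= b["ts"] else b


def reduce_by_key(pairs, func):
    # Hypothetical local stand-in for RDD.reduceByKey:
    # merge all values that share a key using func
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


records = [
    {"id": 1, "ts": 10, "value": "a"},
    {"id": 1, "ts": 20, "value": "b"},
    {"id": 2, "ts": 5, "value": "c"},
]

deduped = reduce_by_key(map(to_pair, records), newer)
```

In PySpark the same two functions would presumably be passed as `rdd.map(to_pair).reduceByKey(newer)`. The reduce function should be associative and commutative, so records with equal timestamps would need a deterministic tie-breaker.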
SQL
- GROUP BY the primary key and SELECT MAX(TIME_STAMP) to find the latest timestamp per key
- JOIN that result back to the original table to filter out the older duplicates
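
The GROUP BY plus JOIN approach can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the table name `events` and its columns are assumptions, and SQLite stands in for whatever SQL-on-Hadoop engine is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (primary_key INTEGER, time_stamp INTEGER, value TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, 10, "a"), (1, 20, "b"), (2, 5, "c")],
)

# Inner query: latest timestamp per key.
# Outer join: keep only the rows of the original table that match it.
DEDUP_SQL = """
SELECT e.primary_key, e.time_stamp, e.value
FROM events e
JOIN (
    SELECT primary_key, MAX(time_stamp) AS max_ts
    FROM events
    GROUP BY primary_key
) latest
  ON e.primary_key = latest.primary_key
 AND e.time_stamp = latest.max_ts
ORDER BY e.primary_key
"""

rows = conn.execute(DEDUP_SQL).fetchall()
```

If two rows share both the key and the maximum timestamp, this query keeps both, so a further tie-breaking column would be needed in that case.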

Finding Peaks and Valleys

Spark
- partition by the hash of the primary key, with each partition sorted by timestamp
- mapPartitions()
  - iterate over the sorted partition to identify peaks and valleys
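
The per-partition iteration can be sketched as a plain Python generator of the kind one might hand to mapPartitions(). The tuple layout `(key, timestamp, value)` is an assumption, as is the choice to skip the first and last record of each key, since they have only one neighbor.

```python
def find_peaks_and_valleys(partition):
    """Label interior records that are strictly above or below both neighbors.

    partition: iterator of (key, timestamp, value) tuples, assumed to be
    sorted by key and then by timestamp.
    """
    prev = cur = None
    for nxt in partition:
        # Only compare when all three records belong to the same key,
        # because one partition can hold several keys.
        if prev is not None and cur is not None and prev[0] == cur[0] == nxt[0]:
            if cur[2] > prev[2] and cur[2] > nxt[2]:
                yield (cur[0], cur[1], "PEAK")
            elif cur[2] < prev[2] and cur[2] < nxt[2]:
                yield (cur[0], cur[1], "VALLEY")
        prev, cur = cur, nxt


sorted_partition = [
    ("a", 1, 1), ("a", 2, 5), ("a", 3, 2), ("a", 4, 4),
    ("b", 1, 9), ("b", 2, 3), ("b", 3, 7),
]

labels = list(find_peaks_and_valleys(iter(sorted_partition)))
```

Because the generator holds only three records at a time, it does not need to load the whole partition into memory.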
SQL
- SELECT LEAD() and LAG() OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION)
- SELECT CASE
  - WHEN VALUE > LEAD AND VALUE > LAG THEN 'PEAK'
  - WHEN VALUE < LEAD AND VALUE < LAG THEN 'VALLEY'
- Note: running multiple windowing operations in SQL increases disk I/O overhead and degrades performance
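
The LEAD/LAG query can be exercised locally with Python's `sqlite3` module, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions). The table name `readings` and the sample data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (primary_key TEXT, position INTEGER, value INTEGER)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("a", 1, 1), ("a", 2, 5), ("a", 3, 2), ("a", 4, 4),
        ("b", 1, 9), ("b", 2, 3), ("b", 3, 7),
    ],
)

# Inner query: attach the next and previous value within each key.
# Outer query: compare each value against both neighbors.
PEAK_VALLEY_SQL = """
SELECT primary_key, position,
       CASE
           WHEN value > next_value AND value > prev_value THEN 'PEAK'
           WHEN value < next_value AND value < prev_value THEN 'VALLEY'
       END AS label
FROM (
    SELECT primary_key, position, value,
           LEAD(value) OVER (PARTITION BY primary_key ORDER BY position) AS next_value,
           LAG(value)  OVER (PARTITION BY primary_key ORDER BY position) AS prev_value
    FROM readings
) AS w
ORDER BY primary_key, position
"""

rows = conn.execute(PEAK_VALLEY_SQL).fetchall()
# First and last rows of each key have a NULL neighbor, so their label is NULL.
labeled = [row for row in rows if row[2] is not None]
```

Both window expressions here share the same PARTITION BY and ORDER BY, which some engines can evaluate in a single pass. Whether that happens depends on the engine and is not guaranteed.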