Examples:
- Removing Duplicate Records by Primary Key (Compaction)
- Using Windowing Analysis
- Updating Time Series Data
Removing Duplicate Records by Primary Key
- Spark
  - `map()` to a keyed RDD, then `reduceByKey()` to compact down to one record per key
- SQL
  - `GROUP BY` the primary key and `SELECT MAX(TIME_STAMP)`, then `JOIN` back to the original table to filter out the older records
Windowing Analysis
Find the valleys and peaks.
- Spark
  - partition by the primary key's hash, sorted by timestamp
  - `mapPartitions`
    - iterate the sorted partition to find each peak and valley
- SQL
  - `SELECT` with `LEAD()` and `LAG()` `OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION)`
  - `SELECT CASE WHEN VALUE > LEAD AND VALUE > LAG THEN 'PEAK' WHEN VALUE < LEAD AND VALUE < LAG THEN 'VALLEY' END`
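The lead/lag comparison above can be sketched in plain Python over one already-sorted partition; the sample values are made up, and the `None` labels mirror `LEAD()`/`LAG()` returning NULL at the partition edges.

```python
# One partition of a toy time series, already sorted by timestamp.
values = [1, 3, 2, 5, 4]

labels = []
for i, v in enumerate(values):
    lag = values[i - 1] if i > 0 else None           # LAG(): previous value
    lead = values[i + 1] if i < len(values) - 1 else None  # LEAD(): next value
    if lag is not None and lead is not None and v > lag and v > lead:
        labels.append("PEAK")
    elif lag is not None and lead is not None and v < lag and v < lead:
        labels.append("VALLEY")
    else:
        labels.append(None)

print(labels)  # [None, 'PEAK', 'VALLEY', 'PEAK', None]
```

A `mapPartitions` implementation would do exactly this single pass per sorted partition, which is why it avoids the extra shuffle and disk I/O that stacked SQL window operations can incur.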
- Note: multiple windowing operations in SQL increase disk I/O overhead and degrade performance
Time Series Modifications
- HBase and Versioning
  - advantage:
    - modifications are very fast: simply `put` the update, and HBase keeps the old cell as a prior version
  - disadvantage:
    - penalty when retrieving historical versions
    - penalty when performing large scans or block cache reads
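A minimal in-memory stand-in (not the real HBase API) illustrates the trade-off: writes just prepend a version, but any historical read must walk the version list. All names here are illustrative.

```python
from collections import defaultdict

# rowkey -> [(write_ts, value), ...], newest first, mimicking HBase cell versions.
table = defaultdict(list)

def put(rowkey, value, ts):
    # Fast update: prepend the new version; old versions stay behind it.
    table[rowkey].insert(0, (ts, value))

def get(rowkey):
    # The current value is cheap: it is the first cell.
    return table[rowkey][0][1]

def get_as_of(rowkey, ts):
    # Historical reads pay a penalty: scan versions until one is old enough.
    for write_ts, value in table[rowkey]:
        if write_ts <= ts:
            return value
    return None

put("row1", "v1", ts=100)
put("row1", "v2", ts=200)
print(get("row1"))            # v2
print(get_as_of("row1", 150)) # v1
```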
- HBase with a RowKey of RecordKey and StartTime
  - `get` the existing record, `put` it back with an updated stop time, then `put` the new current record
  - advantage:
    - faster version retrieval
  - disadvantage:
    - slower update: requires 1 `get` and 2 `put`s
    - still has the large scan and block cache problems
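The get-and-two-puts flow can be sketched against a plain dict keyed by `(record_key, start_time)`; the schema and field names are assumptions for illustration, with a dict standing in for the HBase table.

```python
# Stand-in for an HBase table keyed by (record_key, start_time).
table = {}

def update_record(record_key, new_value, now):
    # 1 get: find the open "current" version (stop_time is None).
    current = [(k, v) for k, v in table.items()
               if k[0] == record_key and v["stop_time"] is None]
    if current:
        (key, row), = current
        # put #1: close out the old version with a stop time.
        table[key] = {**row, "stop_time": now}
    # put #2: write the new current version.
    table[(record_key, now)] = {"value": new_value, "stop_time": None}

update_record("rec1", "v1", now=100)
update_record("rec1", "v2", now=200)
# The version as of time t is the row whose [start, stop) interval contains t.
```

Version retrieval is faster because each row is a complete version addressed directly by its start time, at the cost of three round trips per update.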
- Partitions on HDFS for Current and Historical Records
  - partition the data into
    - a most-current-records partition
    - a historic-records partition
  - batch update
    - for updated "current" records, set the stop time and append them to the historic-records partition
    - add the new updates to the most-current-records partition
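The batch-update step can be sketched with the two partitions as in-memory lists; the record fields (`key`, `start`, `stop`) and the `batch_update` helper are illustrative assumptions, not a framework API.

```python
# Two partitions modeled as lists; in practice these would be HDFS directories.
current = [{"key": "a", "value": 1, "start": 10, "stop": None},
           {"key": "b", "value": 2, "start": 20, "stop": None}]
historic = []
updates = [{"key": "a", "value": 9, "start": 30}]

def batch_update(current, historic, updates, now=30):
    updated_keys = {u["key"] for u in updates}
    new_current, new_historic = [], list(historic)
    for rec in current:
        if rec["key"] in updated_keys:
            # Close out the superseded record and move it to the historic partition.
            new_historic.append({**rec, "stop": now})
        else:
            new_current.append(rec)
    # The new versions become the current partition's records.
    new_current.extend({**u, "stop": None} for u in updates)
    return new_current, new_historic

current, historic = batch_update(current, historic, updates)
```

Because HDFS files are immutable, each batch rewrites the (small) current partition and appends to the historic one, rather than updating records in place.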