[Hive] New features in Hive 3.0
Goal
- The new features of Apache Hive 3.0
Practice
What’s new in this release: Apache Hive
- Apache Hive 3.0.0 was released on 21 May 2018, and the most recent version is 3.1.2 (as of 18 Feb 2020).
- One of the most noticeable features is that Hive 3.0 supports seamless ACID (Atomicity, Consistency, Isolation, Durability) transactions, and there are several things to consider about transactional tables created by Hive versions prior to 3.0.
- Furthermore, several SQL features and useful connectors were added.
- It also enhances performance through improved LLAP and Apache Spark integration.
Hive 3 New Features
ACID v2
- Originally, Hive supported writes only by adding partitions or loading new files into existing partitions, because HDFS does not allow files to be replaced or edited in place.
- Hive has supported transactions with INSERT, UPDATE, and DELETE since version 0.13, but with several drawbacks:
- Transactional tables had to be stored in ORC format and bucketed
- Reading transactional tables was significantly slower than reading non-transactional ones
- No support for MERGE or UPSERT functionality
- ACID v2 in Hive 3 brings seamless INSERT/UPDATE/DELETE operations on existing tables, with performance as good as non-ACID tables (see the sketch after this list)
- Transactional tables don’t have to be bucketed anymore
- Non-ORC formats are supported (INSERT & SELECT only)
- Compatibility with cloud storage providers
- eg. Amazon S3
- New HiveStreaming API (v2)
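As a minimal sketch of what ACID v2 looks like in practice (table and column names here are hypothetical), a managed ORC table declared transactional supports row-level DML and MERGE without any bucketing clause:

```sql
-- Hypothetical full-ACID managed table in Hive 3: no bucketing required.
CREATE TABLE customers (
  id    INT,
  name  STRING,
  email STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level DML works directly on the transactional table.
INSERT INTO customers VALUES (1, 'Alice', 'alice@example.com');
UPDATE customers SET email = 'alice@corp.example.com' WHERE id = 1;
DELETE FROM customers WHERE id = 1;

-- MERGE (UPSERT) from a hypothetical staging table.
MERGE INTO customers AS t
USING customers_staging AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET email = s.email
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.email);
```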
LLAP (Live Long And Process) workload management
- It consists of a long-lived daemon which replaces direct interactions with the HDFS DataNode, and a tightly integrated DAG-based framework.
- Functionality such as caching, pre-fetching, some query processing and access control are moved into the daemon. Small/short queries are largely processed by this daemon directly, while any heavy lifting will be performed in standard YARN containers.
- Live Long And Process (LLAP) functionality was added in Hive 2.0, and Hive 3.0 adds new workload management features (a DDL sketch follows this list):
- Split your LLAP resources in pools
- e.g. bi (for Business Analytics) and etl (for Extract Transform and Load)
- Automatically map applications to pools
- e.g. Tableau to the bi pool
- Create triggers to dynamically move applications from one pool to another
- e.g. move long running queries to the etl pool
- Activate/deactivate your resource plans based on your users’ needs (one active at a time).
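As a sketch of the workload management DDL (pool names, fractions, the counter and its threshold are illustrative assumptions, not recommendations):

```sql
-- Hypothetical resource plan splitting LLAP resources into bi and etl pools.
CREATE RESOURCE PLAN daytime;
CREATE POOL daytime.bi  WITH ALLOC_FRACTION = 0.8, QUERY_PARALLELISM = 5;
CREATE POOL daytime.etl WITH ALLOC_FRACTION = 0.2, QUERY_PARALLELISM = 20;

-- Trigger that moves long-running queries from the bi pool to the etl pool.
CREATE TRIGGER daytime.downgrade WHEN ELAPSED_TIME > 60000 DO MOVE TO etl;
ALTER TRIGGER daytime.downgrade ADD TO POOL bi;

-- Map an application to a pool (e.g. Tableau to bi).
CREATE APPLICATION MAPPING 'tableau' IN daytime TO bi;

-- Only one resource plan can be active at a time.
ALTER RESOURCE PLAN daytime ENABLE ACTIVATE;
```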
JDBC, Druid and Kafka connectors
- JDBC connector (JdbcStorageHandler)
- It is read-only for now.
- It allows you to easily read/import data from any of the supported databases:
- MySQL, PostgreSQL, Oracle, MSSQL and Apache Derby
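A sketch of declaring such a table with the JdbcStorageHandler (the MySQL connection settings and the source table are placeholders):

```sql
-- Hypothetical read-only external table backed by a MySQL table.
CREATE EXTERNAL TABLE mysql_orders (
  order_id INT,
  amount   DOUBLE
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver"   = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url"      = "jdbc:mysql://dbhost/shop",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "secret",
  "hive.sql.table"         = "orders"
);
```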
- Kafka connector
- It can be used to query real-time data from Apache Kafka.
- You connect to Kafka data from Hive by creating an external table that maps to a Kafka topic.
- It can be used to achieve exactly-once offloading of data to Hive and real-time data transformation.
- The Hive-Kafka connector supports the following serialization and deserialization formats:
- JsonSerDe (default)
- OpenCSVSerde
- AvroSerDe
- Exactly-once ingestion from Kafka to Druid can be done through Hive.
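A sketch of mapping a Kafka topic to an external table (broker address, topic and columns are hypothetical):

```sql
-- Hypothetical external table mapped to a Kafka topic (JsonSerDe by default).
CREATE EXTERNAL TABLE kafka_clicks (
  user_id BIGINT,
  page    STRING
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "clicks",
  "kafka.bootstrap.servers" = "broker1:9092"
);

-- The connector also exposes Kafka metadata columns (__partition, __offset,
-- __timestamp, ...) that can be used to filter recent records.
SELECT user_id, page
FROM kafka_clicks
WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - INTERVAL '10' MINUTES);
```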
Hive Warehouse Connector for Apache Spark
- All interactions between Hive and Apache Spark have to go through the Hive Warehouse Connector.
- This connector takes advantage of Hive LLAP to allow streaming/ACID interaction between Hive and Spark.
- The Spark application needs to access a Hive Server Interactive (with LLAP activated) to read Hive managed tables, but it doesn’t need it to write to Hive managed tables or to read/write Hive external tables.
- Execute Query Process
  1. Driver submits the query to HiveServer
  2. HiveServer compiles the query and returns "splits" to the Driver
  3. The query is executed on LLAP
  4. Executor tasks run for each split
  5. Tasks read Arrow data from LLAP
  6. HWC returns ArrowColumnVectors to Spark
SQL features
- Constraints and default values
- A DEFAULT value can be declared for each column, and it is applied in INSERT and UPDATE statements.
- UNIQUE and NOT NULL SQL constraints were added, complementing the PRIMARY KEY and FOREIGN KEY constraints available since Hive 2.1 (see the sketch below).
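A sketch combining these constraints with a DEFAULT value (table and column names are hypothetical; in Hive, PRIMARY KEY constraints are declared DISABLE NOVALIDATE because they are informational rather than enforced):

```sql
-- Hypothetical table with constraints and a column DEFAULT.
CREATE TABLE users (
  id      INT NOT NULL,
  name    STRING NOT NULL,
  country STRING DEFAULT 'KR',
  PRIMARY KEY (id) DISABLE NOVALIDATE
);

-- country falls back to 'KR' because the column is omitted.
INSERT INTO users (id, name) VALUES (1, 'Alice');
```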
- Tables information
- Hive 3 allows easy exploration of the whole warehouse with information_schema and sys databases.
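For example, listing the tables of a database through information_schema:

```sql
-- List all tables in the 'default' database via information_schema.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'default';
```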
Improved performance
- Based on a recent TPC-DS benchmark by the MR3 team, Hive LLAP 3.1.0 is the fastest SQL-on-Hadoop system available in HDP 3.0.1.
- The benchmark compares all the SQL systems embedded with HDP3 as well as Hive on MR3 (a new execution engine for Hadoop and Kubernetes), by running a set of 99 SQL queries.
| System | Total TPC-DS running time (seconds) |
|---|---|
| HDP 3.0.1 LLAP | 5516.89 |
| Hive 3.1.0/MR3 | 6575.052 |
| HDP 3.0.1 Tez | 12227.317 |
| Presto 0.208e | 12948.224 |
| Spark 2.3.1 | 26247.471 |
Materialized views (MV)
- It can improve performance in several ways:

#### Storage for intermediate results
- Materialized view results are stored, allowing computing costs to be shared across queries.

#### Incremental rebuilding
- Updating a materialized view only computes the data that was inserted into its source tables since the last update.

#### Query rewriting
- When appropriate, the cost-based optimizer uses existing materialized views to plan queries instead of the source tables, without the user being aware of it.
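A sketch over a hypothetical sales table, showing creation and the incremental REBUILD described above (incremental rebuilding assumes a transactional source table):

```sql
-- Hypothetical materialized view; the cost-based optimizer can rewrite
-- matching queries to read from it transparently.
CREATE MATERIALIZED VIEW daily_sales AS
SELECT sale_date, SUM(amount) AS total
FROM sales
GROUP BY sale_date;

-- Recompute only the data inserted into `sales` since the last rebuild.
ALTER MATERIALIZED VIEW daily_sales REBUILD;
```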
Other features
- Surrogate Keys
- TIMESTAMP WITH LOCAL TIME ZONE data type
- Apache Parquet and text files are supported in LLAP
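A small sketch combining two of these features (table and column names are hypothetical, and the surrogate-key declaration is an assumption based on the SURROGATE_KEY() column default):

```sql
-- Hypothetical table using a SURROGATE_KEY() column default and the
-- TIMESTAMP WITH LOCAL TIME ZONE type added in Hive 3.
CREATE TABLE events (
  id          BIGINT DEFAULT SURROGATE_KEY(),
  happened_at TIMESTAMP WITH LOCAL TIME ZONE,
  payload     STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Omit the id column so the surrogate-key default is applied.
INSERT INTO events (happened_at, payload)
VALUES (CURRENT_TIMESTAMP, 'click');
```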