Apache Hive: Replication V3
This post gives a glimpse of the new features we are building in the Apache Hive replication subsystem. These features will be available in the upcoming release, Apache Hive 4.0.0.
Replication is already supported in Apache Hive and is available in the previously released version, Apache Hive 3.0.0. The upcoming version adds a significant number of enhancements to the replication feature set. In this post I will touch upon those and briefly introduce them:
Replication Policy Scheduling
Until Replication V2, an end user had to run Hive SQL commands directly in order to perform replication: the REPL DUMP command to dump the metadata from the source cluster to the staging directory (a.k.a. the dump directory), and the REPL LOAD command to read the same from the staging directory and load it onto the target cluster. There was no built-in orchestration layer in Hive to run these commands on a particular schedule. Starting from Replication V3, a user can now submit a replication policy with a desired schedule, and Hive will make sure the policy runs according to that schedule, performing replication in a more automated fashion. This feature has been captured in detail here.
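As a rough sketch of what this can look like, assuming Hive's scheduled-query facility is used to drive the REPL commands (the policy names, database names, and cron expressions below are purely illustrative, not taken from this post):

    -- On the source cluster: periodically dump the database to the staging directory.
    CREATE SCHEDULED QUERY repl_dump_sales
    CRON '0 0/30 * * * ? *'
    AS REPL DUMP sales;

    -- On the target cluster: periodically load the latest dump into the replica database.
    CREATE SCHEDULED QUERY repl_load_sales
    CRON '0 15/30 * * * ? *'
    AS REPL LOAD sales INTO sales_replica;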
Trust setup not required between clusters
The clusters in production are mostly Kerberized, and in order for the clusters to communicate during replication, a trust setup was often a prerequisite. While this was mostly fine for on-prem to on-prem replication, it was a concern for many users when the target cluster was hosted on the cloud. The concerns are mainly around security, e.g. opening a port on-prem for access from the cloud. Replication V3 removes this requirement: users can now use a commonly accessible staging area, e.g. cloud-based storage, to exchange the dump content. No other communication between the clusters is required during the replication process, so the earlier prerequisite of a trust setup between the clusters no longer applies.
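As a rough sketch, the shared staging area can be set up by pointing the replication root directory of both clusters at the same cloud bucket; this would typically be configured in hive-site.xml on each cluster, and the s3a path below is just a placeholder:

    -- Illustrative only: both clusters use the same cloud bucket as the dump/staging root.
    SET hive.repl.rootdir=s3a://my-repl-bucket/hive-repl-staging/;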
Check-pointing
All distributed systems are prone to failures, and Hive is no exception. When Hive experiences a failure during replication, check-pointing helps avoid redoing the same work upon resumption. While check-pointing was already available during the load operation, starting from Replication V3 it also takes place during the dump operation. For example, suppose there were around 100k events to be dumped, and after 90k events the dump operation failed, say because of a network outage. The next time a replication dump is attempted, it will automatically figure out which events were already dumped and resume from the 90,001st event rather than starting from the beginning.
Apache Ranger Policy Replication
Apache Hive already supports Apache Ranger to provide fine-grained data access control in Hive. Replication V3 also introduces built-in support to automatically export any Ranger policies associated with the replicated data to the target cluster. This is done as part of the replication policy schedule itself.
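For illustration, a hedged sketch of how Ranger policy replication might be switched on through policy-level overrides; the property names and the endpoint below reflect my understanding of the Hive 4 configuration rather than anything stated in this post, so treat them as placeholders to verify against the release documentation:

    -- Illustrative only: property names and the Ranger endpoint are assumptions to verify.
    REPL DUMP sales WITH (
      'hive.repl.include.authorization.metadata'='true',
      'hive.repl.authorization.provider.service.endpoint'='http://ranger-host:6080'
    );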
Apache Atlas Metadata Replication
Starting from Replication V3, Hive also supports Atlas metadata replication. The associated Atlas entities are replicated to the target cluster and are tagged with a special classification. As with Apache Ranger policy replication, this is done as part of the replication policy schedule itself.
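Similarly, a hedged sketch of enabling Atlas metadata replication through policy-level overrides; again, the property names, endpoint, and cluster names are my assumptions for illustration, not something confirmed by this post:

    -- Illustrative only: property names, the Atlas endpoint, and cluster names are assumptions.
    REPL DUMP sales WITH (
      'hive.repl.include.atlas.metadata'='true',
      'hive.repl.atlas.endpoint'='http://atlas-host:21000',
      'hive.repl.source.cluster.name'='onprem_cluster',
      'hive.repl.target.cluster.name'='cloud_cluster'
    );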
Performance Improvements
In Replication V3, a lot of performance-improvement patches have gone in. Replication mainly deals with data and metadata, and the improvements target both aspects. For example, the number of interactions between HiveServer2 and the Hive Metastore required during replication has been significantly cut down, leading to faster load operations.
Some of the performance improvements:
a) Load partitions in parallel for external tables in the bootstrap phase:
Loading the external table partitions during the REPL LOAD operation is now done in batches, with only a single HMS call made per batch. ‘hive.repl.load.partitions.batch.size’ can be used to customize the batch size (defaults to 10000); a usage sketch follows after this list.
b) Load partitions in batches for managed tables in the bootstrap phase:
Like external table partitions, managed table partitions are now also loaded in batches, which significantly reduces the number of HMS calls.
c) Check for write transactions for the db under replication at a frequent interval:
While doing a bootstrap dump, Hive now waits for open write transactions only on the database under replication, not on other databases. Also, completion of such open transactions is now checked more frequently.
d) Execute DDL tasks for tables during bootstrap in parallel:
The DDL tasks for tables are now run in parallel, which allows the tables to be loaded in parallel.
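As referenced in item a) above, here is a minimal sketch of overriding the partition-load batch size for a single load, assuming the property is passed through the WITH clause of REPL LOAD (the database names are placeholders):

    -- Illustrative only: smaller batches mean more, but lighter, HMS calls.
    REPL LOAD sales INTO sales_replica
    WITH ('hive.repl.load.partitions.batch.size'='5000');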
I hope this gives readers good visibility into what to expect in Hive Replication V3. Almost every feature mentioned above deserves a separate post to be described more thoroughly. Stay tuned!