Google Partner for 2020
Table or partition location information from Apache Hive
"TBLS" stores the information of Hive tables.
"PARTITIONS" stores the information of Hive table partitions.
"SDS" stores the information of storage location, input and output formats, SERDE etc.
Both "TBLS" and "PARTITIONS" have a foreign key referencing to SDS(SD_ID).
select TBLS.TBL_NAME, SDS.LOCATION
from SDS, TBLS
where TBLS.SD_ID = SDS.SD_ID;
+-------------+-----------------------------------------+
| TBL_NAME    | LOCATION                                |
+-------------+-----------------------------------------+
| test1       | maprfs:/user/hive/warehouse/test1       |
| passwords   | maprfs:/user/hive/warehouse/passwords   |
| parquet_par | maprfs:/user/hive/warehouse/parquet_par |
+-------------+-----------------------------------------+
select TBLS.TBL_NAME, PARTITIONS.PART_NAME, SDS.LOCATION
from SDS, TBLS, PARTITIONS
where PARTITIONS.SD_ID = SDS.SD_ID
and TBLS.TBL_ID = PARTITIONS.TBL_ID
order by 1, 2;
+-------------+---------------------------------------+--------------------------------------------------------------------------------+
| TBL_NAME    | PART_NAME                             | LOCATION                                                                       |
+-------------+---------------------------------------+--------------------------------------------------------------------------------+
| partition_t | end_date=2015-01-01/end_time=01-00-00 | maprfs:/user/hive/warehouse/partition_t/end_date=2015-01-01/end_time=01-00-00  |
| partition_t | end_date=2015-01-02/end_time=02-00-00 | maprfs:/user/hive/warehouse/partition_t/end_date=2015-01-02/end_time=02-00-00  |
+-------------+---------------------------------------+--------------------------------------------------------------------------------+
Note: The SQL above uses MySQL syntax. If your metastore runs on a different RDBMS, adjust the queries accordingly.
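As a hedged example, here is how the first query could be run directly against the metastore database from the shell, assuming a MySQL metastore database named hive reachable with a hive user (both names are placeholders; adjust to your environment):
mysql -u hive -p -D hive -e "
  select TBLS.TBL_NAME, SDS.LOCATION
  from SDS, TBLS
  where TBLS.SD_ID = SDS.SD_ID;"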
HDFS NameNode, FsImage and Edit Logs Backup and Restore
- On the node that hosts the NameNode, open the Hadoop command line shortcut (or open a command window in the Hadoop directory). As the hadoop user, go to the HDFS home directory (a command sketch for these steps appears after this list):
- Run the fsck command and fix any reported file system errors.
- Capture the complete namespace directory tree of the file system:
- Create a list of DataNodes in the cluster:
- Capture output from the fsck command:
- Save the HDFS namespace:
- Place the NameNode in safe mode, to keep HDFS from accepting any new writes:
- Save the namespace.
Warning: From this point on, HDFS should not accept any new writes. Stay in safe mode!
- Finalize the namespace:
- On the machine that hosts the NameNode, copy the following checkpoint directories into a backup directory:
- Add a new host to your Hadoop cluster.
- Add the NameNode role to the host. Make sure it has the same host name as the original NameNode.
- Create a directory path for the NameNode name.dir (for example, /dfs/nn/current), ensuring that the permissions are set correctly.
- Copy the VERSION and latest fsimage file to the /dfs/nn/current directory.
- Create the md5 file for the fsimage (an example command is included in the sketch below).
- Start the NameNode process.
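A minimal command sketch for the steps above, assuming an HDP-style hdfs CLI, a NameNode name.dir of /dfs/nn and a backup target of /backup/nn-backup; the paths, log file names and the fsimage transaction id (<TXID>) are placeholders, so adjust them to your environment:
# --- Backup (run on the NameNode host as the HDFS superuser) ---
hdfs fsck / -files -blocks -locations > dfs-old-fsck.log   # file system health report
hdfs dfs -ls -R / > dfs-old-lsr.log                        # complete namespace directory tree
hdfs dfsadmin -report > dfs-old-report.log                 # list of DataNodes in the cluster
hdfs dfsadmin -safemode enter                              # stop HDFS accepting new writes
hdfs dfsadmin -saveNamespace                               # checkpoint the namespace to a new fsimage
hdfs dfsadmin -finalizeUpgrade                             # finalize any prior HDFS upgrade
cp -r /dfs/nn/current /backup/nn-backup                    # copy the checkpoint directory (VERSION, fsimage_*, edits_*)

# --- Restore (run on the new NameNode host) ---
mkdir -p /dfs/nn/current
chown -R hdfs:hdfs /dfs/nn                                 # name.dir must be owned by the HDFS user
cp /backup/nn-backup/VERSION /backup/nn-backup/fsimage_<TXID> /dfs/nn/current/
cd /dfs/nn/current && md5sum fsimage_<TXID> > fsimage_<TXID>.md5   # the NameNode verifies this checksum on start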
Here is another way --- The Shahed Way
- Safemode will impact any HDFS clients that are trying to write to HDFS.
- The active NameNode is the source of truth for any HDFS operation.
- A good practice is to perform a backup once per month, but more often never hurts.
Jenkins vs Spinnaker Continuous Deployment
Jenkins vs Spinnaker: What are the differences?
Continuous delivery (CD) platforms help DevOps teams manage software releases in short cycles, ensuring they are delivered safely, reliably and quickly. The goal is to reduce the costs, time and associated risks connected to building, testing and delivering changes by enabling a more incremental and regular approach towards updating applications in production. I always get asked whether Spinnaker or Jenkins is the better CD platform.
Cheers
Shahed
Hive - Setting Up Hive Admin User - GCP Dataproc
Oracle to GCP Google Cloud Data Migration
Oracle to Google Cloud Platform Data Migration
Many clients have approached us, for a variety of reasons, about moving from on-premise Oracle database solutions to the Google Cloud Platform.
Here at DELIVERBI we have a range of solutions, including our rapid Oracle to GCP data migration toolkit, Data Rocket Pro. Trust us with your migration needs: DELIVERBI is an official Google partner for data migration.
Our consultants are fully trained and qualified in Oracle as well as Google products. Moving thousands of tables of data to GCP? We have the solution for you.
Call us now for more info or email migration@deliverbi.co.uk
The DELIVERBI Team
GCP Dataproc HDFS Error on Creation of Cluster
Google Dataproc HDFS Error when spinning up a cluster
Error You will Encounter
When creating a Google Dataproc cluster, you may encounter the following error. It can be caused by a bad character in a username in the SSH key configuration for the Dataproc project.
activate-component-hdfs[2799]: + exit_code=1
activate-component-hdfs[2799]: + [[ 1 -ne 0 ]]
activate-component-hdfs[2799]: + echo 1
activate-component-hdfs[2799]: + log_and_fail hdfs 'Component hdfs failed to activate' 1
activate-component-hdfs[2799]: + local component=hdfs
activate-component-hdfs[2799]: + local 'message=Component hdfs failed to activate'
activate-component-hdfs[2799]: + local error_code=1
activate-component-hdfs[2799]: + local client_error_indicator=
activate-component-hdfs[2799]: + [[ 1 -eq 2 ]]
activate-component-hdfs[2799]: + echo 'StructuredError{hdfs, Component hdfs failed to activate}'
activate-component-hdfs[2799]: StructuredError{hdfs, Component hdfs failed to activate}
activate-component-hdfs[2799]: + exit 1
The Fix
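Given the cause above, one way to track down the offending entry is to inspect the SSH usernames stored in the project-wide metadata. A minimal sketch, assuming the keys live under the ssh-keys metadata key and that jq is installed (project and file names are placeholders):
# Dump the project-wide ssh-keys metadata; each line is "<username>:<public key>"
gcloud compute project-info describe --format=json \
  | jq -r '.commonInstanceMetadata.items[] | select(.key == "ssh-keys") | .value'

# Inspect the usernames for unexpected characters, then fix the offending entry,
# for example by re-adding a cleaned list from a file:
#   gcloud compute project-info add-metadata --metadata-from-file ssh-keys=cleaned_keys.txt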
Once the offending SSH key entry has been corrected or removed from the project metadata, your cluster will get created.
Trino Graceful Shutdown on GCP Using Instance Groups
I was setting up a Trino cluster for one of my clients on GCP, and shutting down nodes was causing query errors. I have been using Trino (previously called Presto) on GCP for over 4 years now. It's an amazing product with phenomenal response times when querying Hive over HDFS (ORC) format.
I need to shut down Trino workers (nodes) during specific hours, refresh them, or resize the number of workers throughout the day or night. Clusters run 24x7 to support UK and USA time zones as well as other countries in between. Using an instance group was challenging: when you issue a command to resize an instance group, the shutdown-script method per VM was hit and miss, and it's also unpredictable.
A GCP instance group shuts down network communication on the machine when you issue a resize command and does not give the Trino worker (machine) enough time to fulfil any query that is currently running. Once you issue the shutdown for a VM within an instance group you only have a 90-120 second window. This was just not good enough for my client, as some queries can run anywhere between 1s and 1hr … So, I ruled out the shutdown-script … No good, won't work.
A little digging around later, I wrote a shell script .. not rocket science, but something that will do the trick and get the job done, as I wanted to cover a few scenarios that can run in Airflow or a cron job.
1. Shut down a Trino worker and remove the worker VM from an instance group once it has finished all its tasks (important: it must finish all its tasks, so the worker remains in a SHUT_DOWN state until all its work has finished, and it lets the coordinator know it is going to shut down so it won't take on any more work).
2. I'm able to resize the Trino cluster on demand, up or down.
3. Bring different-size workers to the Trino cluster dependent upon the time of day for cost efficiency, for example highmem16 or highmem8 machines.
The shell script is simple. It's a divide and rule approach: I tell the worker to shut down gracefully and, once it has done everything it needs or wants to do, I check whether the node is still alive and then delete the machine from the instance group.
To cover these scenarios I have split the script into multiple processes:
1. SIGNAL-SHUTDOWN – Tell the workers to shut down (pass the number of workers to shut down). I then store the VM hostnames in a file, as I'm only telling the required number of workers to shut down.
2. SHUTDOWN, as a separate process. In this part of the script I'm checking whether the Trino node on the VM is still in an ACTIVE or SHUTDOWN state. If Trino has shut down the worker completely you won't get a response back, so I take that as a kill signal, delete the VM and remove it from the list of machines that need to be removed from the instance group (a minimal sketch of this flow appears after this list). This process can be run on a timer, to keep checking in case, say, 2 of 5 nodes have not yet shut down because the queries they are running take longer than the others.
3. RESIZE. I use this just for scaling upwards, as it's literally a resize of an instance group.
4. SHUTDOWN-RESIZE. This one can also be used after SIGNAL-SHUTDOWN but will recreate the Trino worker machines after they have shut down. This can be used to refresh the worker VMs. Trino does like to be refreshed to keep everything ticking over correctly.
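For reference, this is roughly what the signal and delete steps do under the hood. A minimal sketch, assuming Trino's built-in graceful shutdown endpoint, the default HTTP port 8080, and placeholder worker, instance group and zone names (our actual script lives in the repo linked below):
WORKER=trino-worker-001          # placeholder worker VM / node hostname
GROUP=trino-worker-group         # placeholder managed instance group
ZONE=europe-west2-a              # placeholder zone

# 1. Ask the worker to shut down gracefully: it stops accepting new work,
#    finishes its running tasks, then exits.
curl -X PUT -H "Content-Type: application/json" -H "X-Trino-User: admin" \
     -d '"SHUTTING_DOWN"' "http://${WORKER}:8080/v1/info/state"

# 2. Poll until the worker no longer responds (the Trino process has exited),
#    then remove the VM from the instance group.
until ! curl -sf "http://${WORKER}:8080/v1/info/state" > /dev/null; do
  sleep 30
done
gcloud compute instance-groups managed delete-instances "${GROUP}" \
  --instances="${WORKER}" --zone="${ZONE}"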
The script takes a few simple parameters:
1. The number of Trino workers (VMs to remove)
2. Signal type (SIGNAL-SHUTDOWN, SHUTDOWN, RESIZE or SHUTDOWN-RESIZE)
3. Instance group name
4. Number of workers to resize to.
Feel free to amend the script and enhance it to your requirements. We have used Apache Airflow to orchestrate the commands, with sensors to see when the process is complete: the script issues ALLDONE once the VM list file is empty (check out the script). Keeping it simple is the key.
The Script is Available on our Git Repo : https://github.com/deliverbi/Trino-GCP
In a future post I will write up the instructions on how to create an auto-scaling Trino Cluster using instance groups on GCP.
Creating a Basic Trino Service to Start On Demand Clusters for ADHOC Large ETL Jobs on GCP Google Cloud Platform using Python Flask
We were working on a client site where a customer of ours wanted a set of Trino clusters to be used for specific high-memory ETL/ELT jobs. They wanted to send a signal to an HTTP server that would start a Trino cluster in the background. The main purpose is to overcome the static user query memory limits on a cluster that is in use 24x7. So we went to work on a quick and easy fix.
The main components required on the GCP infrastructure side are as follows.
- 1. Trino master machine VM
- 2. Dynamic instance group that can be scaled with a Trino worker image. (You can contact the DELIVERBI team for this, if not already in place.)
So now let's assume you have a cluster that can start and stop from the command line. The clusters we design can be started and stopped, and we can also state the number of workers we require at any time, with graceful shutdowns and quick start-ups.
Let's have a look at the service itself. It's a very basic idea but works a treat: invoke the starting and stopping of clusters from any ETL or orchestration tool such as Apache Airflow.
Let's get to work and put together a small virtual machine with the following components:
#Installations required Debian - Linux
apt-get install python3-pip
apt-get install jq
pip3 install --upgrade pip
pip3 install flask flask_shell2http
pip3 install waitress
You will require 2 shell scripts to start and stop the Trino Cluster.
trino-start-cluster and trino-stop-cluster; these can be found in our GitHub repo at the location above.
Now let's move on to creating a small Python Flask program. Nothing difficult: it will execute shell scripts according to the URL we call. Let's call it trino_app.py.
from flask import Flask
from flask_executor import Executor
from flask_shell2http import Shell2HTTP
# Flask application instance
app = Flask(__name__)
executor = Executor(app)
shell2http = Shell2HTTP(app=app, executor=executor, base_url_prefix="/commands/")
def my_callback_fn(context, future):
# optional user-defined callback function
print(context, future.result())
return
#shell2http.register_command(endpoint="saythis", command_name="echo", callback_fn=my_callback_fn, decorators=[])
shell2http.register_command(endpoint="m1start", command_name="bash /trinoapp/trino-start-cluster.sh 1 5", callback_fn=my_callback_fn, decorators=[])
shell2http.register_command(endpoint="m1stop", command_name="bash /trinoapp/trino-stop-cluster.sh 1 5", callback_fn=my_callback_fn, decorators=[])
shell2http.register_command(endpoint="m2start", command_name="bash /trinoapp/trino-start-cluster.sh 2 3", callback_fn=my_callback_fn, decorators=[])
shell2http.register_command(endpoint="m2stop", command_name="bash /trinoapp/trino-stop-cluster.sh 2 3", callback_fn=my_callback_fn, decorators=[])
Start the webserver to host the Flask script. Create a script called run_trino_app.sh:
#!/bin/bash
cd "$(dirname "$0")"
/usr/local/bin/waitress-serve --port=5000 trino_app:app
We can set up a systemd service so it is easier to start and stop the Flask application.
# /etc/systemd/system/trino_app.service
[Unit]
Description=My Trino API
After=network.target
[Service]
Type=simple
User=root
WorkingDirectory=/root/trinoapp
ExecStart=/root/trinoapp/run_trino_app.sh
Restart=always
[Install]
WantedBy=multi-user.target
#Start and Stop the Trino App - Background process is waitress.
systemctl start trino_app
systemctl stop trino_app
systemctl status trino_app
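After dropping in the unit file, reload systemd and (optionally) enable the service to start at boot; these are standard systemd commands:
systemctl daemon-reload
systemctl enable trino_app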
Check the service is up; you can invoke a cluster manually with the sample commands below. Here is a start command; just use m1stop (or whatever endpoint you registered for your URL path) to stop.
#Make a Call - initiate shell script.
curl -X POST http://ip:5000/commands/m1start
#Check result - You can't make another call to the same API endpoint until the result comes back. Use the key from the above call
curl http://ip:5000/commands/m1start?key=864d8355
The main server for taking the incoming HTTP API calls is all done above and should be working.
Now let's move on to the client side, or Airflow. We want the call to the API to hold while the server is coming up, as a first step in the ETL process, so we introduced a client-side script. This waits until the Trino cluster is fully up, and then the ETL process can begin. Once the ETL process has completed, as a last step we can invoke the cluster shutdown.
The client-side script, called client-side-call-example, can be found in the GitHub repo.
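If you prefer to roll your own wait loop rather than use the repo script, a minimal sketch is to poll the coordinator until it reports it has finished starting. This assumes the default HTTP port 8080, a placeholder coordinator hostname, that jq is installed, and that your Trino version exposes the "starting" flag in the /v1/info payload:
COORDINATOR=trino-master-1   # placeholder coordinator hostname
until curl -sf "http://${COORDINATOR}:8080/v1/info" | jq -e '.starting == false' > /dev/null; do
  echo "Waiting for the Trino cluster to come up..."
  sleep 15
done
echo "Trino cluster is up - ETL can start"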
We hope this helps other clients and organisations save on cluster costs by using on-demand clusters. We have graceful shutdown and automatic scaling solutions for Trino on GCP too. Our auto-scaling solution monitors the cluster and adjusts the number of workers throughout the day based on memory usage, number of queries, etc.
DELIVERBI
Shahed and Krishna
Google Big Query (BQ) Results to GCS Storage Bucket Folder
Export data from BQ to a GCS (Google Cloud Storage) bucket.
The example below exports CSV output to the bucket location given in the "uri" option.
declare unused STRING;
export data options(
uri='gs://dbi-analytics-warehouse-data-bq/a1delete/Test*.csv',
format='CSV',
header=true,
overwrite=true,
field_delimiter=';') as
select * from `dbi-gcs-storage.my_access_eu.dim_brand`;
After running it, you can schedule this query in BQ to export results to GCS whenever you like: daily, weekly, monthly, etc.
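The same export can also be run non-interactively, which makes it easy to drop into cron or Airflow alongside a BQ scheduled query. A minimal sketch using the bq CLI, assuming the EXPORT DATA statement above has been saved to a file named export_dim_brand.sql (a hypothetical filename):
# Run the EXPORT DATA script with standard SQL enabled
bq query --use_legacy_sql=false < export_dim_brand.sql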