Google Partner for 2020
Table or partition location information from Apache Hive
"TBLS" stores the information of Hive tables.
"PARTITIONS" stores the information of Hive table partitions.
"SDS" stores the information of storage location, input and output formats, SERDE etc.
Both "TBLS" and "PARTITIONS" have a foreign key referencing to SDS(SD_ID).
select TBLS.TBL_NAME, SDS.LOCATION
from SDS, TBLS
where TBLS.SD_ID = SDS.SD_ID;
+-------------+-----------------------------------------+
| TBL_NAME    | LOCATION                                |
+-------------+-----------------------------------------+
| test1       | maprfs:/user/hive/warehouse/test1       |
| passwords   | maprfs:/user/hive/warehouse/passwords   |
| parquet_par | maprfs:/user/hive/warehouse/parquet_par |
+-------------+-----------------------------------------+
select TBLS.TBL_NAME, PARTITIONS.PART_NAME, SDS.LOCATION
from SDS, TBLS, PARTITIONS
where PARTITIONS.SD_ID = SDS.SD_ID
and TBLS.TBL_ID = PARTITIONS.TBL_ID
order by 1, 2;
+-------------+---------------------------------------+--------------------------------------------------------------------------------+
| TBL_NAME    | PART_NAME                             | LOCATION                                                                       |
+-------------+---------------------------------------+--------------------------------------------------------------------------------+
| partition_t | end_date=2015-01-01/end_time=01-00-00 | maprfs:/user/hive/warehouse/partition_t/end_date=2015-01-01/end_time=01-00-00  |
| partition_t | end_date=2015-01-02/end_time=02-00-00 | maprfs:/user/hive/warehouse/partition_t/end_date=2015-01-02/end_time=02-00-00  |
+-------------+---------------------------------------+--------------------------------------------------------------------------------+
Note: The SQL above uses MySQL syntax. If your metastore runs on a different RDBMS, adjust the queries accordingly.
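As a hedged example, here is how the first query could be run directly against the metastore database from the shell, assuming a MySQL metastore database named hive reachable with a hive user (both names are placeholders; adjust to your environment):
mysql -u hive -p -D hive -e "
  select TBLS.TBL_NAME, SDS.LOCATION
  from SDS, TBLS
  where TBLS.SD_ID = SDS.SD_ID;"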
HDFS NameNode, FsImage and Edit Logs Backup and Restore
- On the node that hosts the NameNode, open the Hadoop command line shortcut (or open a command window in the Hadoop directory). As the hadoop user, go to the HDFS home directory (a command sketch for these steps appears after this list):
- Run the fsck command and fix any reported file system errors.
- Capture the complete namespace directory tree of the file system:
- Create a list of DataNodes in the cluster:
- Capture output from the fsck command:
- Save the HDFS namespace:
- Place the NameNode in safe mode, to keep HDFS from accepting any new writes:
- Save the namespace.
Warning: From this point on, HDFS should not accept any new writes. Stay in safe mode!
- Finalize the namespace:
- On the machine that hosts the NameNode, copy the following checkpoint directories into a backup directory:
- Add a new host to your Hadoop cluster.
- Add the NameNode role to the host. Make sure it has the same host name as the original NameNode.
- Create a directory path for the NameNode name.dir (for example, /dfs/nn/current), ensuring that the permissions are set correctly.
- Copy the VERSION and latest fsimage file to the /dfs/nn/current directory.
- Create the md5 file for the fsimage (an example command is included in the sketch below).
- Start the NameNode process.
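A minimal command sketch for the steps above, assuming an HDP-style hdfs CLI, a NameNode name.dir of /dfs/nn and a backup target of /backup/nn-backup; the paths, log file names and the fsimage transaction id (<TXID>) are placeholders, so adjust them to your environment:
# --- Backup (run on the NameNode host as the HDFS superuser) ---
hdfs fsck / -files -blocks -locations > dfs-old-fsck.log   # file system health report
hdfs dfs -ls -R / > dfs-old-lsr.log                        # complete namespace directory tree
hdfs dfsadmin -report > dfs-old-report.log                 # list of DataNodes in the cluster
hdfs dfsadmin -safemode enter                              # stop HDFS accepting new writes
hdfs dfsadmin -saveNamespace                               # checkpoint the namespace to a new fsimage
hdfs dfsadmin -finalizeUpgrade                             # finalize any prior HDFS upgrade
cp -r /dfs/nn/current /backup/nn-backup                    # copy the checkpoint directory (VERSION, fsimage_*, edits_*)

# --- Restore (run on the new NameNode host) ---
mkdir -p /dfs/nn/current
chown -R hdfs:hdfs /dfs/nn                                 # name.dir must be owned by the HDFS user
cp /backup/nn-backup/VERSION /backup/nn-backup/fsimage_<TXID> /dfs/nn/current/
cd /dfs/nn/current && md5sum fsimage_<TXID> > fsimage_<TXID>.md5   # the NameNode verifies this checksum on start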
Here is another way --- The Shahed Way
- Safemode will impact any HDFS clients that are trying to write to HDFS.
- The active NameNode is the source of truth for any HDFS operation.
- A good practice is to perform a backup once per month, but more often never hurts.
Jenkins vs Spinnaker Continuous Deployment
Jenkins vs Spinnaker: What are the differences?
Continuous delivery (CD) platforms help DevOps teams manage software releases in short cycles, ensuring they are delivered safely, reliably and quickly. The goal is to reduce the costs, time and associated risks connected to building, testing and delivering changes by enabling a more incremental and regular approach towards updating applications in production. I always get asked whether Spinnaker or Jenkins is the better CD platform.
Cheers
Shahed
Hive - Setting Up Hive Admin User - GCP Dataproc
Oracle to GCP Google Cloud Data Migration
Oracle to Google Cloud Platform Data Migration
Many clients have approached us, for a variety of reasons, about moving from on-premise Oracle database solutions to the Google Cloud Platform.
Here at DELIVERBI we have a range of solutions, including our rapid Oracle to GCP data migration toolkit, Data Rocket Pro. Trust us with your migration needs: DELIVERBI is an official Google partner for data migration.
Our consultants are fully trained and qualified in Oracle as well as Google products. Moving thousands of tables of data to GCP? We have the solution for you.
Call us now for more info or email migration@deliverbi.co.uk
The DELIVERBI Team
GCP Dataproc HDFS Error on Creation of Cluster
Google Dataproc HDFS Error when spinning up a cluster
Error You will Encounter
When creating a Google Dataproc cluster, you may encounter the following error. It can be caused by a bad character in a username in the SSH key configuration for the Dataproc project.
activate-component-hdfs[2799]: + exit_code=1
activate-component-hdfs[2799]: + [[ 1 -ne 0 ]]
activate-component-hdfs[2799]: + echo 1
activate-component-hdfs[2799]: + log_and_fail hdfs 'Component hdfs failed to activate' 1
activate-component-hdfs[2799]: + local component=hdfs
activate-component-hdfs[2799]: + local 'message=Component hdfs failed to activate'
activate-component-hdfs[2799]: + local error_code=1
activate-component-hdfs[2799]: + local client_error_indicator=
activate-component-hdfs[2799]: + [[ 1 -eq 2 ]]
activate-component-hdfs[2799]: + echo 'StructuredError{hdfs, Component hdfs failed to activate}'
activate-component-hdfs[2799]: StructuredError{hdfs, Component hdfs failed to activate}
activate-component-hdfs[2799]: + exit 1
The Fix
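Given the cause above, one way to track down the offending entry is to inspect the SSH usernames stored in the project-wide metadata. A minimal sketch, assuming the keys live under the ssh-keys metadata key and that jq is installed (project and file names are placeholders):
# Dump the project-wide ssh-keys metadata; each line is "<username>:<public key>"
gcloud compute project-info describe --format=json \
  | jq -r '.commonInstanceMetadata.items[] | select(.key == "ssh-keys") | .value'

# Inspect the usernames for unexpected characters, then fix the offending entry,
# for example by re-adding a cleaned list from a file:
#   gcloud compute project-info add-metadata --metadata-from-file ssh-keys=cleaned_keys.txt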
Once the offending SSH key entry has been corrected or removed from the project metadata, your cluster will get created.
Trino Graceful Shutdown on GCP Using Instance Groups
I was setting up a Trino cluster for one of my clients on GCP, and shutting down nodes was causing query errors. I have been using Trino (previously called Presto) on GCP for over 4 years now. It's an amazing product with phenomenal response times when querying Hive over HDFS (ORC) format.
I need to shut down Trino workers (nodes) during specific hours, refresh them, or resize the number of workers throughout the day or night. Clusters run 24x7 to support UK and USA time zones as well as other countries in between. Using an instance group was challenging: when you issue a command to resize an instance group, the shutdown-script method per VM was hit and miss, and it's also unpredictable.
A GCP instance group shuts down network communication on the machine when you issue a resize command and does not give the Trino worker (machine) enough time to fulfil any query that is currently running. Once you issue the shutdown for a VM within an instance group you only have a 90-120 second window. This was just not good enough for my client, as some queries can run anywhere between 1s and 1hr … So, I ruled out the shutdown-script … No good, won't work.
A little digging around later, I wrote a shell script .. not rocket science, but something that will do the trick and get the job done, as I wanted to cover a few scenarios that can run in Airflow or a cron job.
1. Shut down a Trino worker and remove the worker VM from an instance group once it has finished all its tasks (important: it must finish all its tasks, so the worker remains in a SHUT_DOWN state until all its work has finished, and it lets the coordinator know it is going to shut down so it won't take on any more work).
2. I'm able to resize the Trino cluster on demand, up or down.
3. Bring different-size workers to the Trino cluster dependent upon the time of day for cost efficiency, for example highmem16 or highmem8 machines.
The shell script is simple. It's a divide and rule approach: I tell the worker to shut down gracefully and, once it has done everything it needs or wants to do, I check whether the node is still alive and then delete the machine from the instance group.
To cover these scenarios I have split the script into multiple processes:
1. SIGNAL-SHUTDOWN – Tell the workers to shut down (pass the number of workers to shut down). I then store the VM hostnames in a file, as I'm only telling the required number of workers to shut down.
2. SHUTDOWN, as a separate process. In this part of the script I'm checking whether the Trino node on the VM is still in an ACTIVE or SHUTDOWN state. If Trino has shut down the worker completely you won't get a response back, so I take that as a kill signal, delete the VM and remove it from the list of machines that need to be removed from the instance group (a minimal sketch of this flow appears after this list). This process can be run on a timer, to keep checking in case, say, 2 of 5 nodes have not yet shut down because the queries they are running take longer than the others.
3. RESIZE. I use this just for scaling upwards, as it's literally a resize of an instance group.
4. SHUTDOWN-RESIZE. This one can also be used after SIGNAL-SHUTDOWN but will recreate the Trino worker machines after they have shut down. This can be used to refresh the worker VMs. Trino does like to be refreshed to keep everything ticking over correctly.
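For reference, this is roughly what the signal and delete steps do under the hood. A minimal sketch, assuming Trino's built-in graceful shutdown endpoint, the default HTTP port 8080, and placeholder worker, instance group and zone names (our actual script lives in the repo linked below):
WORKER=trino-worker-001          # placeholder worker VM / node hostname
GROUP=trino-worker-group         # placeholder managed instance group
ZONE=europe-west2-a              # placeholder zone

# 1. Ask the worker to shut down gracefully: it stops accepting new work,
#    finishes its running tasks, then exits.
curl -X PUT -H "Content-Type: application/json" -H "X-Trino-User: admin" \
     -d '"SHUTTING_DOWN"' "http://${WORKER}:8080/v1/info/state"

# 2. Poll until the worker no longer responds (the Trino process has exited),
#    then remove the VM from the instance group.
until ! curl -sf "http://${WORKER}:8080/v1/info/state" > /dev/null; do
  sleep 30
done
gcloud compute instance-groups managed delete-instances "${GROUP}" \
  --instances="${WORKER}" --zone="${ZONE}"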
The script takes a few simple parameters:
1. The number of Trino workers (VMs to remove)
2. Signal type (SIGNAL-SHUTDOWN, SHUTDOWN, RESIZE or SHUTDOWN-RESIZE)
3. Instance group name
4. Number of workers to resize to.
Feel free to amend the script and enhance it to your requirements. We have used Apache Airflow to orchestrate the commands, with sensors to see when the process is complete: the script issues ALLDONE once the VM list file is empty (check out the script). Keeping it simple is the key.
The Script is Available on our Git Repo : https://github.com/deliverbi/Trino-GCP
In a future post I will write up the instructions on how to create an auto-scaling Trino Cluster using instance groups on GCP.
Creating a Basic Trino Service to Start On Demand Clusters for ADHOC Large ETL Jobs on GCP Google Cloud Platform using Python Flask
We were working on a client site where a customer of ours wanted a set of Trino clusters to be used for specific high-memory ETL/ELT jobs. They wanted to send a signal to an HTTP server that would start a Trino cluster in the background. The main purpose is to overcome the static user query memory limits on a cluster that is in use 24x7. So we went to work on a quick and easy fix.
The main components required on the GCP infrastructure side are as follows.
- 1. Trino master machine VM
- 2. Dynamic instance group that can be scaled with a Trino worker image. (You can contact the DELIVERBI team for this, if not already in place.)
So now let's assume you have a cluster that can start and stop from the command line. The clusters we design can be started and stopped, and we can also state the number of workers we require at any time, with graceful shutdowns and quick start-ups.
Let's have a look at the service itself. It's a very basic idea but works a treat: invoke the starting and stopping of clusters from any ETL or orchestration tool such as Apache Airflow.
Let's get to work and put together a small virtual machine with the following components:
#Installations required Debian - Linux
apt-get install python3-pip
apt-get install jq
pip3 install --upgrade pip
pip3 install flask flask_shell2http
pip3 install waitress
You will require 2 shell scripts to start and stop the Trino Cluster.
trino-start-cluster and trino-stop-cluster; these can be found in our GitHub repo at the location above.
Now let's move on to creating a small Python Flask program. Nothing difficult: it will execute shell scripts according to the URL we call. Let's call it trino_app.py.
from flask import Flask
from flask_executor import Executor
from flask_shell2http import Shell2HTTP
# Flask application instance
app = Flask(__name__)
executor = Executor(app)
shell2http = Shell2HTTP(app=app, executor=executor, base_url_prefix="/commands/")
def my_callback_fn(context, future):
# optional user-defined callback function
print(context, future.result())
return
#shell2http.register_command(endpoint="saythis", command_name="echo", callback_fn=my_callback_fn, decorators=[])
shell2http.register_command(endpoint="m1start", command_name="bash /trinoapp/trino-start-cluster.sh 1 5", callback_fn=my_callback_fn, decorators=[])
shell2http.register_command(endpoint="m1stop", command_name="bash /trinoapp/trino-stop-cluster.sh 1 5", callback_fn=my_callback_fn, decorators=[])
shell2http.register_command(endpoint="m2start", command_name="bash /trinoapp/trino-start-cluster.sh 2 3", callback_fn=my_callback_fn, decorators=[])
shell2http.register_command(endpoint="m2stop", command_name="bash /trinoapp/trino-stop-cluster.sh 2 3", callback_fn=my_callback_fn, decorators=[])
Start the webserver to host the Flask script. Create a script called run_trino_app.sh:
#!/bin/bash
cd "$(dirname "$0")"
/usr/local/bin/waitress-serve --port=5000 trino_app:app
We can set up a systemd service so it is easier to start and stop the Flask application.
# /etc/systemd/system/trino_app.service
[Unit]
Description=My Trino API
After=network.target
[Service]
Type=simple
User=root
WorkingDirectory=/root/trinoapp
ExecStart=/root/trinoapp/run_trino_app.sh
Restart=always
[Install]
WantedBy=multi-user.target
#Start and Stop the Trino App - Background process is waitress.
systemctl start trino_app
systemctl stop trino_app
systemctl status trino_app
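After dropping in the unit file, reload systemd and (optionally) enable the service to start at boot; these are standard systemd commands:
systemctl daemon-reload
systemctl enable trino_app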
Check the service is up; you can invoke a cluster manually with the sample commands below. Here is a start command; just use m1stop (or whatever endpoint you registered for your URL path) to stop.
#Make a Call - initiate shell script.
curl -X POST http://ip:5000/commands/m1start
#Check result - You can't make another call to the same API endpoint until the result comes back. Use the key from the above call
curl http://ip:5000/commands/m1start?key=864d8355
The main server for taking the incoming HTTP API calls is all done above and should be working.
Now let's move on to the client side, or Airflow. We want the call to the API to hold while the server is coming up, as a first step in the ETL process, so we introduced a client-side script. This waits until the Trino cluster is fully up, and then the ETL process can begin. Once the ETL process has completed, as a last step we can invoke the cluster shutdown.
The client-side script, called client-side-call-example, can be found in the GitHub repo.
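If you prefer to roll your own wait loop rather than use the repo script, a minimal sketch is to poll the coordinator until it reports it has finished starting. This assumes the default HTTP port 8080, a placeholder coordinator hostname, that jq is installed, and that your Trino version exposes the "starting" flag in the /v1/info payload:
COORDINATOR=trino-master-1   # placeholder coordinator hostname
until curl -sf "http://${COORDINATOR}:8080/v1/info" | jq -e '.starting == false' > /dev/null; do
  echo "Waiting for the Trino cluster to come up..."
  sleep 15
done
echo "Trino cluster is up - ETL can start"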
We hope this helps other clients and organisations save on cluster costs by using on-demand clusters. We have graceful shutdown and automatic scaling solutions for Trino on GCP too. Our auto-scaling solution monitors the cluster and adjusts the number of workers throughout the day based on memory usage, number of queries, etc.
DELIVERBI
Shahed and Krishna
Google Big Query (BQ) Results to GCS Storage Bucket Folder
Export data from BQ to a GCS (Google Cloud Storage) bucket.
The example below exports CSV output to the bucket location given in the "uri" option.
declare unused STRING;
export data options(
uri='gs://dbi-analytics-warehouse-data-bq/a1delete/Test*.csv',
format='CSV',
header=true,
overwrite=true,
field_delimiter=';') as
select * from `dbi-gcs-storage.my_access_eu.dim_brand`;
After running it, you can schedule this query in BQ to export results to GCS whenever you like: daily, weekly, monthly, etc.
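The same export can also be run non-interactively, which makes it easy to drop into cron or Airflow alongside a BQ scheduled query. A minimal sketch using the bq CLI, assuming the EXPORT DATA statement above has been saved to a file named export_dim_brand.sql (a hypothetical filename):
# Run the EXPORT DATA script with standard SQL enabled
bq query --use_legacy_sql=false < export_dim_brand.sql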