DELIVERBI Blog: GCP Google | BIG DATA | OBIEE | OBIA | Oracle | Trino

ODI 12c Load Plan stopping Multiple Runs at the same time



ODI 12c: block the same load plan from running again on a schedule while it is still running from a previous scheduled run.



Stopping a Load Plan from running multiple instances at the same time

The Load Plan in ODI is a very handy piece of functionality for orchestrating multiple scenarios in an orderly fashion. The only drawback is that there is no mechanism to stop multiple instances of the same load plan running at the same time.
This solution attempts to provide a generic approach that can be utilised with any load plan, more like a plug and play approach.

This solution provides the additional functionality of populating a system status table that can be displayed on an OBIEE dashboard to give end users visibility of Load Plan runs. The solution also captures the parameters provided to the Load Plan and presents them in the monitoring table.
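The monitoring table can be queried directly for such a dashboard. As a purely illustrative sketch (not part of the original solution), a query against the WC_SYSTEM_STATUS_D table defined below could look like this:

-- Latest run per load plan, with status, timings and the captured parameters
SELECT job_name,
       status_indicator,
       status_description,
       start_timestamp,
       end_timestamp,
       params
FROM   wc_system_status_d s
WHERE  s.start_timestamp = (SELECT MAX(start_timestamp)
                            FROM   wc_system_status_d
                            WHERE  job_name = s.job_name)
ORDER  BY job_name;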

The following components are built and created as part of this solution:

DB Objects –
WC_SYSTEM_STATUS_D – Database Table
WC_SYSTEM_STATUS_SEQ – Database Sequence

ODI Procedures – Four ODI procedures are created. The code is given at the end of the document:

  • LP_START – Inserts row into the WC_SYSTEM_STATUS_D table and checks to see if any other instance of the same load plan is running and invokes LP_DOUBLE_RUN procedure
  • LP_COMPLETE – To mark the load plan as successfully complete
  • LP_DOUBLE_RUN – Updates the current run with status 'A' (Abort) and raises an error, which invokes LP_ERROR
  • LP_ERROR – Updates the current run with status 'F' (Failed) if the run has not already been aborted


The steps to integrate these objects with any required load plan are:

Create the first step of the load plan as scenario execution step LP_START

Create the last step of the load plan as scenario execution step LP_COMPLETE


Create two exception steps and add scenario steps as shown in the figure below 









For the root step of the load plan provide the exception step as LP_ERROR and select the Exception Behaviour as ‘Run Exception and Raise’










For the LP_START step of the load plan provide the exception step as LP_DOUBLE_RUN and select the Exception Behaviour as ‘Run Exception and Raise’









That is it. You will never see multiple instances of the same load plan running at the same time.

Here is the Code that needs to be pasted into 4 ODI Procedures that are integrated into the load plan.

Code
  

CREATE TABLE "WC_SYSTEM_STATUS_D"
   (           "JOB_ID" NUMBER NOT NULL ENABLE,
               "JOB_NAME" VARCHAR2(1000 BYTE),
               "START_TIMESTAMP" DATE,
               "END_TIMESTAMP" DATE,
               "STATUS_INDICATOR" VARCHAR2(10 BYTE),
               "STATUS_DESCRIPTION" VARCHAR2(100 BYTE),
               "PARAMS" VARCHAR2(100 BYTE),
               "GUID" VARCHAR2(1000 BYTE)
   ) ;


   CREATE SEQUENCE  "WC_SYSTEM_STATUS_SEQ"  MINVALUE 1 MAXVALUE 9999999999999999999999999999 INCREMENT BY 1 START WITH 50 CACHE 20 NOORDER  NOCYCLE  NOPARTITION ;

LP_START

declare

is_running varchar2(1) := 'C';  -- not running

CURSOR c_ind IS
SELECT status_indicator
from  wc_system_status_d
where status_indicator = 'R'
and   job_name = '<%=odiRef.getLoadPlanInstance("LOAD_PLAN_NAME")%>'
and   guid    <> '<%=odiRef.getLoadPlanInstance("BATCH_GUID")%>';

l_params varchar2(1000);

begin

begin

select LISTAGG(display_name,';') WITHIN GROUP(ORDER BY display_name) display_name
INTO l_params
from (
select t1.var_name||' - '||t1.var_value display_name
from   crlodistg_odi_repo.snp_lpi_var t1
      ,(select * from (select i_lp_inst,load_plan_name
        from crlodistg_odi_repo.snp_lpi_run
        where load_plan_name = '<%=odiRef.getLoadPlanInstance("LOAD_PLAN_NAME")%>'
        order by start_date desc) where rownum = 1) t2
where  t1.i_lp_inst = t2.i_lp_inst
);

exception when others then null;

end;

INSERT INTO wc_system_status_d
VALUES
(wc_system_status_seq.nextval,'<%=odiRef.getLoadPlanInstance("LOAD_PLAN_NAME")%>',CURRENT_TIMESTAMP,NULL,'R','Running',l_params,'<%=odiRef.getLoadPlanInstance("BATCH_GUID")%>');

commit;

OPEN c_ind;
FETCH c_ind INTO is_running;
CLOSE c_ind;

IF NVL(is_running,'C') = 'R' THEN

  select 1/0 INTO is_running from dual; -- Causing Error

END IF;

-- Intentionally no exception handler defined here: the task errors out, raising the exception step on the load plan,
-- which updates the batch status to Abort and stops the execution

end;

LP_COMPLETE

begin

UPDATE wc_system_status_d
SET    end_timestamp = CURRENT_TIMESTAMP
      ,status_indicator = 'C'
      ,status_description = 'Complete'
WHERE  guid = '<%=odiRef.getLoadPlanInstance("BATCH_GUID")%>';

end;

LP_DOUBLE_RUN

begin

UPDATE wc_system_status_d
SET    end_timestamp = CURRENT_TIMESTAMP
      ,status_indicator = 'A'
      ,status_description = 'Abort. Currently another batch running'
WHERE  guid = '<%=odiRef.getLoadPlanInstance("BATCH_GUID")%>';

end;

LP_ERROR

declare

is_running varchar2(1);

begin

select status_indicator into is_running from wc_system_status_d where guid = '<%=odiRef.getLoadPlanInstance("BATCH_GUID")%>';

if is_running = 'R' then

UPDATE wc_system_status_d
SET    end_timestamp = CURRENT_TIMESTAMP
      ,status_indicator = 'F'
      ,status_description = 'Failed'
WHERE  guid = '<%=odiRef.getLoadPlanInstance("BATCH_GUID")%>';

end if;

end;



Over and Out 

DeliverBI Team


ODI 12c Direct Load with no Flow control $ Tables from Source to Target


ODI 12c Direct Load with no Flow control $ Tables from Source to Target


 

A very basic load process with no controls apart from a truncate. Add some controls to it to make it slicker, though. We couldn't see any other simple blog on this subject with ODI 12c, and we always get people asking this question. It's basic, we agree, but it is still one of those things everyone will need now and then.
 




Copy LKM SQL to SQL and rename it to LKM SQL to SQL No Flow (Or whatever you like).

Create 2 steps


The first step truncates the target table if the option is set to true on the KM, and the next step simply inserts data from the source into the target table.

Truncate Table

1. Target Command

<% if (odiRef.getOption( "TRUNCATE" ).equals( "1" )) { %>
truncate table <%=odiRef.getTable("L","TARG_NAME","A")%>
<% } %>
 










2. Add Options

TRUNCATE



Step Commands to Load data from Source to Target table.



Target Command

insert into <%=odiRef.getTable("L", "TARG_NAME", "A")%>
(
                    <%=odiRef.getColList("", "[CX_COL_NAME]", ",\n\t", "","")%>
)
values
(
                    <%=odiRef.getColList("", ":[CX_COL_NAME]", ",\n\t", "","")%>
)

 
Source Command

<%for (int i=odiRef.getDataSetMin(); i <= odiRef.getDataSetMax(); i++){%>
<%=odiRef.getDataSet(i, "Operator")%>
select        <%=odiRef.getPop("DISTINCT_ROWS")%>
                    <%=odiRef.getColList(i, "", "[EXPRESSION]\t[ALIAS_SEP] [CX_COL_NAME]", ",\n\t", "", "")%>
from          <%=odiRef.getFrom(i)%>
where       (1=1)
<%=odiRef.getFilter(i)%>
<%=odiRef.getJrnFilter(i)%>
<%=odiRef.getJoin(i)%>
<%=odiRef.getGrpBy(i)%>
<%=odiRef.getHaving(i)%>
<%}%>

 

IKM Clone as a dummy


 

Clone IKM SQL Control Append again and rename it to whatever you like.

 

Remove all the steps.

 



 

 

Use the IKM and LKM in your mapping and it should run with no $ tables.

 

After you have created your mapping, assign the LKM and IKM you created.

 

Operator log will look like below with your named steps etc.
 


 


This is a very simple example, so I advise you to build in some more control functionality. KM customisation is always fun. I have uploaded example Knowledge Modules to the DELIVERBI document portal if you don't want to undertake these very simple tasks.

Have a look at the XML files to Import to ODI 12c as an example

Thanks

Sha & Krish

OBIEE 12c - How To Get BIAuthor and BIConsumer Application Roles Added To A Clean Installation


 

OBIEE 12c - Adding BI Author and BI Consumer Roles



During the configuration of a new install of OBIEE 12c, when prompted to choose the application that will be installed on your initial service instance, you selected:

"Clean Slate (no predefined application )".


After the configuration of the instance completes, you can only see "BI Service Administrator" in the list of roles. BI Author and BI Consumer are not present.

This document describes how to get the BI Author and BI Consumer roles added to such an instance.



Let's get it done


You can deploy the SampleAppLite example which will add the missing BI Author and BI Consumer roles.

The steps are listed below:

  1. The SampleAppLite bar file is available at:

    <ORACLE_HOME>/bi/bifoundation/samples/sampleapplite/SampleAppLite.bar

    You can set the Service Instance in your environment to SampleAppLite.bar as follows
  2. Start the WLST script tool using the command:

    <ORACLE_HOME>/oracle_common/common/bin/wlst.sh
  3. Run importServiceInstance to import the SampleAppLite.bar file.

    For example: 

      
    importServiceInstance ('<DOMAIN_HOME>','ssi','<ORACLE_HOME>/bi/bifoundation/samples/sampleapplite/SampleAppLite.bar')
      
      Where you need to replace <ORACLE_HOME> and <DOMAIN_HOME> with the physical paths of the OBIEE Oracle home and domain home.
  4. Restart the OBIEE services.
  5. Verify.
  6. Replace with your own repository and catalog, or remove them and start from scratch (clean slate).
  7. Restart the OBIEE services.

ODI 12c - OCI Connection to Studio and not JDBC ( Thick Client and not Thin )


Install Oracle 12c Client –  To Avoid the double connection needed with SCAN Addresses for JDBC – Try an OCI Connection which is Native.

We have been noticing intermittent connection issues when connecting from Studio to a SCAN-clustered Oracle database, where you would have to click the connect button twice to connect in Studio. So we installed a 12c Oracle home using native OCI and this seems to work flawlessly.

Download an Oracle Client for 12c from the Oracle website, either 32-bit or 64-bit depending on your Windows installation.



This installation is based on the installation location of : C:\Oracle\ORACLECLIENT12C – adjust the path in the below edits as and when needed as it depends on your installation location.


Open Regedit and add the following string entries within the ORACLE folder in the locations below.













The registry entries might already exist in these locations. If they do, then leave them as is.

[HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE]
inst_loc      C:\Program Files\Oracle\Inventory

I copied this path and put it into a new string value in this registry entry:

[HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\ORACLE]
inst_loc      C:\Program Files\Oracle\Inventory


The 2 registry entries need to be there for the installation to even start.

Installation Option – I used the Runtime toolset. The Oracle home location (the bottom option) was set to C:\Oracle\ORACLECLIENT12C; I left the base location as my D drive, or use whatever you want.


Implementation-Version: 12.1.0.2.0

Once this is done, a few more things are needed. Add the following line in


File Location : <<ODI_HOME>>\odi\studio\bin\odi.conf

Add the following Entry using Notepad ++ or equivalent tool

AddVMOption -Djava.library.path=C:\Oracle\ORACLECLIENT12C\BIN

The second thing is to set the system environment variables under My Computer – Advanced Settings.
Then test again to see if your ODI Studio is connecting with OCI; if it is, then you can stop here.













ORACLE_HOME= C:\Oracle\ORACLECLIENT12C

The two below are not usually needed, but add them if you get errors:
LD_LIBRARY_PATH=%ORACLE_HOME%\lib value
PATH=%ORACLE_HOME%\bin;%PATH% value


ODI Studio Connection 







In ODI Use the connection as follows

jdbc:oracle:oci:@DEVREPO

DEVREPO - Is my SID / Service Name in a standard TNS file.

Optional
The TNS file can be stored separately, with connection details for various databases, by adding an environment variable in the same way as ORACLE_HOME above:


TNS_ADMIN
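For illustration only, a matching tnsnames.ora entry in the TNS_ADMIN directory might look like the sketch below; the host, port and service name are placeholders for your own database:

DEVREPO =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = mydbhost.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = DEVREPO)
    )
  )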

ODI 12c Controlling Multiple Load Plans and Scenarios running multiple times together


ODI 12c Controlling Concurrent Execution of Scenarios and Load Plans


By default, nothing prevents two instances of the same scenario or load plan from running simultaneously.

This situation could occur in several ways. For example:

  • A load plan containing a Run Scenario Step is running in two or more instances, so the Run Scenario Step may be executed at the same time in more than one load plan instance.
  • A scenario is run from the command line, from ODI Studio, or as scheduled on an agent, while another instance of the same scenario is already running (on the same or a different agent, or in another ODI Studio session).

Concurrent executions of the same scenario or load plan apply across all remote and internal agents.

Concurrent execution of multiple instances of a scenario or load plan may be undesirable, particularly if the job involves writing data. You can control concurrent execution using the Concurrent Execution Control options.

ODI identifies a specific scenario or load plan by its internal ID, and not by the name and version. Thus, a regenerated or modified scenario or load plan having the same internal ID is still treated as the same scenario or load plan. Conversely, deleting a scenario and generating a new one with the same name and version number would be creating a different scenario (because it will have a different internal ID).

While Concurrent Execution Control can be enabled or disabled for a scenario or load plan at any time, there are implications to existing running sessions and newly invoked sessions:

  • When switching Concurrent Execution Control from disabled to enabled, existing running and queued jobs are counted as executing jobs and new job submissions are processed with the Concurrent Execution Control settings at time of job submission.
  • When switching Concurrent Execution Control from enabled to disabled for a scenario or load plan, jobs that are already submitted and in waiting state (or those that are restarted later) will carry the original Concurrent Execution Control setting values to consider and wait for running and queued jobs as executing jobs.

However, if new jobs are submitted at that point with Concurrent Execution Control disabled, they could be run ahead of already waiting jobs. As a result, a waiting job may be delayed if, at the time of polling, the system finds executing jobs that were started without Concurrent Execution Control enabled. And, after a waiting job eventually starts executing, it may still be affected by uncontrolled jobs submitted later and executing concurrently.



To limit concurrent execution of a scenario or load plan, perform the following steps:
 
 
 
 
 
 
 

  1. Open the scenario or load plan by right-clicking it in the Designer or Operator Navigators and selecting Open.
  2. Select the Definition tab and modify the Concurrent Execution Controller options:

o   Enable the Limit Concurrent Executions check box if you do not want to allow multiple instances of this scenario or load plan to be run at the same time. If Limit Concurrent Executions is disabled (unchecked), no restriction is imposed and more than one instance of this scenario or load plan can be run simultaneously.

o   If Limit Concurrent Executions is enabled, set your desired Violation Behavior:

§  Raise Execution Error: if an instance of the scenario or load plan is already running, attempting to run another instance will result in a session being created but immediately ending with an execution error message identifying the session that is currently running which caused the Concurrent Execution Control error.

§  Wait to Execute: if an instance of the scenario or load plan is already running, additional executions will be placed in a wait status and the system will poll for its turn to run. The session's status is updated periodically to show the currently running session, as well as all concurrent sessions (if any) that are waiting in line to run after the running instance is complete.

If you select this option, the Wait Polling Interval sets how often the system will check to see if the running instance has completed. You can only enter a Wait Polling Interval if Wait to Execute is selected.

If you do not specify a wait polling interval, the default for the executing agent will be used: in ODI 12.1.3, the default agent value is 30 seconds.

  3. Click Save to save your changes.

 

Big Data Processing with Hadoop and Hive on top of the Google DataProc Service


Big Data Processing with Hadoop and Hive on top of the Google DataProc Service Offering

 


 

Business Scenario

One of our clients wanted us to evaluate Hadoop for data processing of about 5000 shops. The objective of this exercise is to establish whether the current processing time (using legacy RDBMS technology) can be reduced using Hadoop. We can leverage the data from OBIEE / Looker / PowerBI for reporting purposes. We can say we are impressed with Google's offerings, as everything can be spun up in minutes with little or no configuration of hardware or software.

The technologies utilised during this exercise are
  • Linux on Google Storage
  • Hadoop / HDFS – Map Reduce / Yarn
  • Hive

High level Test Scenario


Process 20 million records of data into Hive Hadoop HDFS and make the data available for queries

We created the data in csv format, moved the csv file(s) to a file server (Sha has fused a Google bucket to the local Hadoop file system; we will detail this in another blog shortly), created an external Hive table to access this data, and created a Hive-local ORC table for subsequent processing.

Overview of Hadoop Server – We were using one Resource Master with two worker nodes in a clustered environment within Google Dataproc Services.

We preferred Hive over Spark as the volumes are not very high and the client's IT department is quite familiar with RDBMS technologies.



 

 
 
 

Detailed Steps


The real-world scenario consists of a denormalised reporting table with 110 columns, but for this test scenario we took a cut-down version. The csv file layout contains the following columns:

COST_CENTRE                  int,
SUB_COST_CENTRE                        int,
VERSION_NUMBER                        int,
TRANSACTION_NUMBER              int,
TRANSACTION_SUB_SECTION     int,
TRANSACTION_LINE        int,
GROSS_PROFIT                 float,
PROCESS_DATE                 int

 

Step 1: Transfer the source file

Move the test.csv file to the file system. The directory I used is /deliverbi/gcs_ftp/FTP_IN/test.csv

Step 2: Invoke hive shell by typing in hive at the command line

You will see the hive prompt. Type in
Show databases;
and press enter and you should see all the databases available



 
 
 
 
I am going to use the poc_hadoop database that was created for this evaluation. You can create a database with the following command:

create database poc_hadoop;

Now, I want to make the poc_hadoop database the default working database. We can achieve this with:

Use poc_hadoop;



 
 
 
Now create an external table referring to the location where the test csv file is available. Notice we gave the file directory and not the file name: data from any files dropped into this directory (the csv columns should be identical, of course) will be available in Hive directly when you query this external table.
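As a guide, the DDL for such an external table, based on the column list above and the directory from Step 1, would be along the following lines (a sketch only; adjust the location to a path that Hive can actually see, e.g. an HDFS or fuse-mounted directory):

CREATE EXTERNAL TABLE DELIVERBI_EXTERNAL_T1_F
(COST_CENTRE int, SUB_COST_CENTRE int, VERSION_NUMBER int,
 TRANSACTION_NUMBER int, TRANSACTION_SUB_SECTION int, TRANSACTION_LINE int,
 GROSS_PROFIT float, PROCESS_DATE int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/deliverbi/gcs_ftp/FTP_IN/';   -- the directory used in Step 1, not the file name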













You should see the OK after the command is executed successfully

You can get more details about any table in hive with the following command

describe formatted DELIVERBI_EXTERNAL_T1_F;

Issue the following command to see how many records this external table contains and you can see just over 20 million records

Select count(*) from DELIVERBI_EXTERNAL_T1_F;


So we can now access in Hive the csv file data that is stored on a Linux drive.

We will create a Hive internal table and populate it from the above external table

Issue the following command at Hive prompt

CREATE TABLE DELIVERBI_T1_F
(COST_CENTRE int, SUB_COST_CENTRE int, VERSION_NUMBER int,
TRANSACTION_NUMBER int, TRANSACTION_SUB_SECTION int, TRANSACTION_LINE int,
GROSS_PROFIT float, PROCESS_DATE int)
row format delimited fields terminated by ',' stored as ORC;

 
For simplicity's sake, I kept the external table and the new internal table exactly the same.



  
 











You would have also noticed that I used 'ORC' in the stored as clause. This is a Hive internal format which is highly efficient. The various types of tables that you can create in Hive are textfile, sequencefile, rcfile etc. We will discuss these in later blog posts.
 
Before we copy the data from external table to this new internal table, let us first build the sql statement
 
select cost_centre,sub_cost_centre,version_number,transaction_number,transaction_sub_section, transaction_line,gross_profit,process_date from deliverbi_external_t1_f limit 5;

 
Run the above statement. Notice the limit 5 which is to retrieve only 5 rows.
 



 







You can notice that there are no headings in the output. Execute the following set statement to show the column headers

set hive.cli.print.header=true;
 
Then run the query again and you can see the column headers this time



 
 

 
 


Now that we know we are retrieving the required data from the external table, execute the following to insert the data. Ensure the limit is removed; if not, we end up having only 5 rows in the internal table.

insert into table DELIVERBI_T1_F select cost_centre,sub_cost_centre,version_number,transaction_number,transaction_sub_section,transaction_line,gross_profit,process_date from deliverbi_external_t1_f;



  












 
The records are inserted. You can check the record count is about 20 million by issuing

select count(*) from DELIVERBI_T1_F;



 












 
We will now create a Hive internal partitioned table and populate it from the above non partitioned internal table

CREATE TABLE DELIVERBI_PART_T1_F
(SUB_COST_CENTRE int, VERSION_NUMBER int, TRANSACTION_NUMBER int, TRANSACTION_SUB_SECTION int, TRANSACTION_LINE int, GROSS_PROFIT float,PROCESS_DATE int)
PARTITIONED BY (COST_CENTRE int)
row format delimited fields terminated by ',' stored as ORC;

 

For those of you who are familiar with Oracle partitioned tables, you would notice that Oracle expects you to specify the partition column from the list of available columns, whereas Hive expects the partition column to be mentioned only once, in the PARTITIONED BY clause of the create statement.

 

Like Oracle, you can partition a Hive table by multiple columns. Just add the sub-partition column within the partitioned by clause. As simple as that.

 

You should see the OK after the table creation

 



 













Run the following command to insert the data into the new partitioned table
 

insert into table DELIVERBI_PART_T1_F partition(cost_centre) select sub_cost_centre,version_number,transaction_number,transaction_sub_section,transaction_line,gross_profit,process_date,cost_centre from deliverbi_t1_f;

 

Notice that the cost_centre column is the last one in the select statement. When inserting into a Hive partitioned table, the partitioned column should be specified as the last column in the select. If you have two columns used for partitioning then these two columns should come last (in the same order) in the select

 

Oh no. Got the following error (FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column.
To turn this off set hive.exec.dynamic.partition.mode=nonstrict)



 






 
For now, you can fix this by issuing the following set statement and then run the insert statement again
 
set hive.exec.dynamic.partition.mode=nonstrict;
 
You expect to see OK but another error (Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions.)

 



 













 

We partitioned this table by Cost Centre. In my test data set I have more than 1500 distinct Cost Centres, and there is a default limit that has been hit, resulting in the error. Run the two set statements to raise the limits before running the insert statement again.
 
set hive.exec.max.dynamic.partitions.pernode=5000;
set hive.exec.max.dynamic.partitions=5000;

 



 












 

One last step before we start retrieving the data. Analyse the partitioned table for quicker retrieval of results

ANALYZE TABLE DELIVERBI_PART_T1_F partition (cost_centre) COMPUTE STATISTICS;

 

Now we are ready to query the data

 
Issue an SQL statement at the Hive prompt for the non-partitioned table:

select count(*) from deliverbi_t1_f where cost_centre = 'y134';

 


 




 
 
Issue an SQL statement at the Hive prompt for the partitioned table:

 
select count(*) from deliverbi_part_t1_f where cost_centre = 'y134';


 






 
We can see that the retrieval from the partitioned table is quick. Can we do anything to improve this? Well, let's set something up. Run the following command at the Hive prompt and then run the same query again.

 

set hive.compute.query.using.stats=true;

 
 






That is a million times faster (metaphorically). You may think the results are in memory, so the query response was lightning fast. Let me query a few more cost centres to see if the query has indeed sped up. Here is the result.

 



 












 

It took under 5 minutes to load 20 million records into a Hive internal table, and that includes analysing the table data. I remember on one of our projects it took over 20 minutes to load approximately 20 million records into an Oracle database using SQL*Loader (utilising direct path insert for performance). So we are very pleased with what we have seen on Hadoop with Hive.

 

Watch this space for more blogs on Big Data, Google BigQuery (Dremel), Hadoop, Hive etc. DeliverBI is moving towards Big Data and Oracle Fusion solutions as well as traditional RDBMS technologies.

 

 

 

Hive Big Data Commands Reference




DELIVERBI Big Data Hive Command Reference

 
 

Today we are sharing some of the frequently used Hive commands and settings that will come in handy.

To see the status of the Hive Server2 and also view the logs etc. you can use the following URL

http://<IPAddresses>:10002/hiveserver2.jsp



 
 
 
 
 
 
 
 
 
 
 
 
 
 
As you can see here, this URL allows you to view the configuration among other things.

To List Databases use show databases;

To create a new database in the default location use create database <db_name>;

To create a new database in a specified location – create database db_final location '/storage/<db_name>';

To drop a database use drop database <db_name>;

To start using a database –  use <db_name>;

show tables; for listing the tables available in the current database

For gathering table statistics analyze table deliverbi_part_t1_f partition (cost_centre) COMPUTE STATISTICS;

To locate the storage directory – set hive.metastore.warehouse.dir;

To locate the storage directory along with other information for a specific table – describe formatted deliverbi_part_t1_f;

To show the column headers – set hive.cli.print.header=true;

To stop showing column headers, set the above property with value false

You can see the explain plan for a specific query using the following

explain select count(*) from deliverbi_t1_f;

You can adjust the number of reducers using SET mapreduce.job.reduces=5;

We will talk about sizing approach to work out optimum number of mappers & reducers in future posts.

To enable dynamic inserts into a partitioned table set hive.exec.dynamic.partition.mode=nonstrict;

For increasing the number of dynamic partitions allowed beyond the defaults, use the following set statements

set hive.exec.max.dynamic.partitions.pernode=5000;

set hive.exec.max.dynamic.partitions=5000;

You can enable the Cost Based Optimiser using set hive.cbo.enable=true;

Below are some of the performance influencing settings. You need to play around a bit to work out the best combination of these settings that fits your specific setup

set hive.compute.query.using.stats=true; 

set hive.stats.fetch.column.stats=true;

set hive.stats.fetch.partition.stats=true;

set hive.vectorized.execution.enabled=true;

set hive.vectorized.execution.reduce.enabled = true;

set hive.vectorized.execution.reduce.groupby.enabled = true;

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=16;

 

Having covered the basics of Hive now, in future posts we will touch upon a bit of scripting using Python to execute Hive commands.

Bye for now – Krishna and Shahed

Scripting Hive Commands with Python


 

Scripting Hive Commands with Python

 













In the previous posts, we touched upon basic data processing using Hive. But it is all interactive. We will discuss how to script these Hive commands using Python.

We will start with a very basic python script and add more functionality to it by the time we reach the end of this post.

We will ensure the environment is setup correctly before getting into the scripting.

If you are running a Python version below 3, install the following packages:

pip install pyhs2
pip install thrift_sasl==0.2.1
pip install sasl==0.2.1


If running Python 3 and above you might face SASL errors; in this case turn SASL off in Hive and follow the method below :)

On Error - Tsocket error with Python 3+ , Or  Could not start SASL , Or TProtocol error related to SASL...

pip3 install sasl 
pip3 install thrift
pip3 install thrift-sasl
pip3 install PyHive


Turn SASL off in Hive and use the PyHive Python libraries.
Edit the hive-site.xml file (typical path: /etc/hive/conf.dist/hive-site.xml).
Add the following entry to the file and save.

<property>
<name>hive.server2.authentication</name>
<value>NOSASL</value>
</property>


Restart HiveServer2 so that the new setting can take effect.

To start HiveServer2:
$ sudo service hive-server2 start


To stop HiveServer2:
$ sudo service hive-server2 stop



We are good to start on the scripting part.
For this post, we have created a Hive table table_test with two columns and loaded it with a few records.
The first script we will create will connect to Hive, execute a Hive command (a select statement) and show the output on screen.
We are using Python 3.4.2 and will be utilising PyHive.

Hive Output to Screen
 
I have saved the following code as hive_data_print.py file on Linux


from pyhive import hive
conn = hive.Connection(host="43.32.0.85", auth='NOSASL',database='poc_hadoop')
cursor = conn.cursor()
cursor.execute("SELECT * FROM table_test")
for result in cursor.fetchall():
    print(result)


The first line of the code imports the python library that we are going to be using for this post and going forward.
 
We define a connection handler to the Hive Database poc_hadoop and initiate an instance of this connection (lines 2 & 3)

Using this connection handle we then execute a simple Hive Command (Select Query in this case)
 
We loop and print row by row (We could do this as the data set is small) – Lines 5 & 6. Note the Print syntax could be different depending on the version of Python you are using
 
I run the script as python3 hive_data_print.py
 
And you should see the rows displayed on the screen

















That is our very first Python script. Of course it is very simple and shows the results on to the screen. Now in real life situations this is of very limited use. We typically want to store the output of the query to a file. Well, we will add functionality to our code to achieve this

 

Hive Output to Text File
 
Save the following lines of code as hive_data_to_file_simple.py file on Linux


from pyhive import hive
conn = hive.Connection(host="43.32.0.85", auth='NOSASL',database='poc_hadoop')
cursor = conn.cursor()
cursor.execute("SELECT * FROM table_test")
with open('test.txt', 'w') as f:
    for results in cursor.fetchall():
        f.write("%s\n" % str(results))



Notice there is no change to import the library, create the connection handler (lines 1 – 4).

This time we will open a file (in the current directory, but if you specify the full path, you can create the file in any location you wish) and loop through the results and save line by line – Lines 5-7
 
When I execute this code as python3 hive_data_to_file_simple.py I see a file created in my current directory with the data

 
Hive Output to csv File using pandas
 
Now we will try to enhance our script to be more versatile using pandas data frames. You need to install pandas for this, which Shahed has done using

pip3 install pandas


Save the following lines of code as hive_data_to_file_pandas.py file on Linux

from pyhive import hive
import pandas as pd
conn = hive.Connection(host="43.32.0.85", auth='NOSASL',database='poc_hadoop')
df = pd.read_sql("SELECT * FROM table_test", conn)
df.to_csv('test.csv', index=False, header=False)

When you run this script the output will be stored as csv in the current directory
 
Pandas is one of the most widely used Python libraries in data science. It is mainly used for data wrangling as it is very powerful and flexible, among many other things.

So now we know how to script Hive commands. You can export the results to file system, Google Bucket (we will cover this more in depth in future posts) or any other data source if you have the right libraries.

 
Bye for now from Krishna and Shahed


 

Google Big Query Clone / Copy DataSets / Tables - PROD --> TEST --> DEV Projects


Google Big Query Cloning of Datasets & Tables across GCS Projects.



We searched the internet and could not find a simple script for cloning / copying tables and datasets from PROD --> TEST --> DEV in Google BigQuery with ease. We needed a utility that had the option of copying complete datasets across projects within Google BigQuery. There is an option within the BigQuery UI to copy one table at a time, but that would take us forever. The client is using SDLC and working through 3 environments, and we needed to clone data from Production to our Test environment for shake-down testing of Airflow deliverables (DAGs etc.). There are probably a lot of customers out there that have started loading their data into an initial environment and now need to copy the datasets to other projects. Whatever the reason may be, we have this script that can be run to leverage some of the work, and it can easily be amended and scheduled to run on weekends or evenings.




















Technologies Used
  • Linux
  • Python3
  • Pip library: google-cloud-bigquery (from google.cloud import bigquery)

Please make sure : pip install --upgrade google-cloud-storage

Notes on API.
https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html

Make sure the GOOGLE_APPLICATION_CREDENTIALS is set for the service key you have downloaded from your Production project.

export GOOGLE_APPLICATION_CREDENTIALS=/u01/yourservicekey.json

If you don’t have a service key then a service Key can be generated from GCS console web application GUI. A service key is an alternative to a user_id.



 

 
 
 
Make sure your service key has all the permissions required to run tasks on BigQuery in the DEV & TEST projects, i.e. querying tables, creating tables and deleting tables. You can add the service key's email address to the DEV & TEST projects under IAM permissions to allow the script to copy datasets etc. Here you can see we have given the service account access to DEV with the role BigQuery Admin.



 

 
 
Check that the bq command is working on your GCS Linux instance.

Try bq ls on the GCS Linux VM instance to check connectivity (by default it lists the datasets in the current project).

The .py script has a lock so that it won't clone anything back to the Production environment, and we suggest you use it.

As for best practice, PRODUCTION -> TEST -> DEV is the best approach for cloning. One-off tables and structures are also catered for, and then DEV -> TEST -> PRODUCTION makes sense, but we recommend CI/CD pipelines or alternatives for this.


 
Run the script using the following command                          
python3 bq_dataset_migrator.py source_project source_dataset target_project target_dataset
Variables
1 - Source Project
2 - Source Data Set
3 - Target Project
4 - Target Data Set
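The full bq_dataset_migrator.py is not reproduced here; as a rough sketch of the core idea only (copy every table of a dataset across projects using the google-cloud-bigquery client, assuming the target dataset already exists, both datasets are in the same location and the service key has access to both projects):

from google.cloud import bigquery
import sys

# Usage: python3 bq_dataset_copy_sketch.py source_project source_dataset target_project target_dataset
source_project, source_dataset, target_project, target_dataset = sys.argv[1:5]

client = bigquery.Client(project=source_project)
source_ref = bigquery.DatasetReference(source_project, source_dataset)
target_ref = bigquery.DatasetReference(target_project, target_dataset)

for table in client.list_tables(source_ref):              # every table in the source dataset
    job = client.copy_table(source_ref.table(table.table_id),
                            target_ref.table(table.table_id))
    job.result()                                           # wait for the copy job to finish
    print('Copied', table.table_id)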

Hive Tez vs Presto 12 Billion Records and 250 Columns Performance


Hive Tez vs Presto as a Query Engine: performance with 12 billion rows and a 250-column table with dimensions.






Here at DELIVERBI we have been implementing quite a few Big Data projects. At one of our more recent clients we required speedy analytics using a query engine and technology that could query almost 15 billion rows of data over various partitions. The main table has 250 columns, so it is quite wide, and the data was roughly 20TB for that one table; on top of this there were joins to various dimensions. This is a more than real-world example, with a user base of around 1000 users querying data over 3 years' worth of daily partitions.

We tried various approaches including Spark, Impala, Drill, Hive MR and various other tech stacks.

We settled on the technologies below, which work seamlessly and have produced outstanding results for our client.

Hadoop, Hive, Presto, Yarn, and Airflow for orchestration of loading.

20-node cluster - all nodes 8 CPU, 52 GB memory.
We went for the divide and conquer technique and it works!


Let's begin. I will also include some tuning tips as we go along. There is a lot more involved, but we will skim over the major points.

Hive version 2.3+ - We required the Tez engine here to load data daily, and throughout the day, into various ORC tables within the Hadoop file system.

TEZ - It's quick, quicker than MR, and we used it to load data into the Hive tables. Tez improves on MapReduce by using Hadoop containers efficiently, running multiple reduce phases without intervening map phases, and making effective use of HDFS. Make sure the Tez container size fits within the Yarn container sizes.

TEZ Container Sizes are important - Roughly 80% of the Yarn Container size. These can be tuned further otherwise you will get out of memory errors.

--------------------------------------------------------------- yarn-site.xml

 <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>6144</value>
    <final>false</final>
    <source>Dataproc Cluster Properties</source>
  </property>

---------------------------------------------------------------  hive-site.xml
<property>
<name>tez.am.resource.memory.mb</name>
<value>5120</value>
</property>
<property>
<name>hive.tez.container.size</name>
<value>6144</value>
</property>

After the above settings all memory errors were gone. :)

These settings in hive-site.xml also help, and we saw increased performance. We have used many more that are in line with our client's requirements.

<property>
     <name>hive.exec.dynamic.partition.mode</name>
     <value>nonstrict</value>
</property>
<property>
     <name>hive.cbo.enable</name>
     <value>true</value>
</property>
<property>
     <name>hive.compute.query.using.stats</name>
     <value>true</value>
</property>
<property>
     <name>hive.stats.fetch.column.stats</name>
     <value>true</value>
</property>
<property>
     <name>hive.stats.fetch.partition.stats</name>
     <value>true</value>
</property>
<property>
     <name>hive.vectorized.execution.enabled</name>
     <value>true</value>
</property>
<property>
     <name>hive.exec.parallel</name>
     <value>true</value>
</property>

We have loads of other settings, especially if you will be using Hive as a query engine too.

Pre-warm some Tez Hive containers for future queries and reusability, and remove old sessions when they expire; the settings below will help with this. There are loads more settings that can be used.

<property>
 <name>hive.server2.idle.session.timeout</name>
 <value>3300000</value>
</property>
<property>
 <name>hive.server2.session.check.interval</name>
 <value>3600000</value>
</property>
<property>
 <name>hive.server2.idle.operation.timeout</name>
 <value>7200000</value>
</property>
<property>
<name>hive.server2.tez.sessions.per.default.queue</name>
<value>4</value>
</property>
<property>
<name>hive.server2.tez.initialize.default.sessions</name>
<value>true</value>
</property>
<property>
<name>hive.prewarm.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.prewarm.numcontainers</name>
<value>2</value>
</property>
<property>
<name>tez.am.container.reuse.enabled</name>
<value>true</value>
</property>


ORC table Definition - This one works very well with large data sets.

CREATE TABLE MytableFact(
)  
partitioned by (ETL_DATE date)
STORED AS ORC tblproperties ("orc.compress" = "SNAPPY" , "orc.stripe.size"="536870912" 
, "orc.row.index.stride"="50000","auto.purge"="true");


An ORC table with SNAPPY compression works very well when loading/querying either in Hive or Presto and produces results quickly.

Google ORC stripe and stride for more info. The above settings worked well for us with the data volumes we used and above.

The partitions help when you need to reload data and so does gathering statistics for the tables and columns.

To replace data within a partition we used a date column.
INSERT OVERWRITE TABLE mytablename PARTITION(ETL_DATE) will automatically wipe the affected partition(s) and reload the data; INSERT INTO would append to them instead.
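For example, a reload of one day's partitions could be written as follows (a sketch only; the column names and the staging table are hypothetical, since the real fact table has around 250 columns):

-- Dynamic partition reload: the partition column comes last in the SELECT
INSERT OVERWRITE TABLE MytableFact PARTITION (ETL_DATE)
SELECT bet_id, stake_amount, gross_profit, ETL_DATE       -- hypothetical columns
FROM   staging_mytablefact                                -- hypothetical staging table
WHERE  ETL_DATE = date '2017-06-01';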


Before we load we enable the gathering of Table level statistics automatically by setting the below in our hive session

set hive.stats.autogather=true


We also needed to gather statistics for the column level data within partitions we load

analyze table mytablename partition(etl_date='2017-06-01') compute statistics for columns;

Presto configuration is easy: just set the JVM memory on the coordinator and worker nodes to roughly 70% of the available node memory and restart everything.
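For illustration (the values here are examples, not the exact settings from this project), on a 52 GB node that could translate to something like:

etc/jvm.config (one JVM flag per line)
-Xmx36G

etc/config.properties (coordinator and workers)
query.max-memory-per-node=18GB
query.max-memory=300GB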

With the above settings and data volumes the timings are as follows


Presto 
1. Fact table query with sum and group by whole table - less than 1 minute
2. Fact Table with a date filter - less than 3 seconds.
3. Fact Table with 4 Dimensional Joins and Filters and Group By on a Date - 5 seconds

As you can see the timings are phenomenal compared to using traditional RDBMS databases with these kind of volumes and a super wide table.


Hive 

It's slower, but we are using it for batch jobs with Airflow for orchestration, as we manage our Yarn queues and resource capacity to control who can run which ETL job. We will cover capacity management in another blog. There is just so much to cover on what we implemented.

1. Fact table query with sum and group by whole table - less than 5 minutes
2. Fact Table with a date filter - less than 2 Minutes.
3. Fact Table with 4 Dimensional Joins and Filters and Group By on a Date - 5 Minutes

Pre-warming the containers takes roughly 30 seconds off each query.


We know Big Data, as we have been tuning and building our solutions on various platforms. It's the way forward.

Presto is the clear winner: it gives our user base, who are familiar with traditional SQL, the same SQL experience with a rocket under the bonnet.

On the next series of blogs we will go deeper into the technical aspects as this blog just skims the surface but outlines the main tips for performance.


Shahed Munir & Krishna Udathu @ DELIVERBI


PowerBI Custom Direct Query ODBC Connector


PowerBI Custom Direct Query ODBC Connector


At DELIVERBI we have various challenges presented to us and a client of ours uses Power BI as their main reporting and analytics tool and wanted Direct Query on any ODBC Source.
We implemented Presto, Hadoop, Hive etc., as we briefly outlined in an earlier blog. To complement this solution we needed to be able to query Presto with Direct Query for transactional-level data on the fly.

We created a Power BI custom connector for this that sits on top of a Windows ODBC connection. It's quite simple really, but we have compiled the .mez, which is the custom connector file that you will require to start importing or Direct Querying data straight away.
It's available to download from our GitHub location.


The connector supports all ODBC data sources and has been optimised for Presto and Hive too. We used the Treasure Data ODBC driver for Presto, as that is what was needed, and a MapR 2.15 ODBC driver for Hive. This connector is generic, so any ODBC driver can be used. The connector has also been tested on most versions of Power BI.
Remember to place the .mez file within your Windows Documents folder, under the two named folders (Power BI Desktop\Custom Connectors) that are the default location for Power BI to pick up custom connectors.
It will appear under Other within the Power BI application.


If you would like to know more about compiling the connector or have a custom connector created and delivered to your inbox that's skinned for a corporate look and feel then please contact us for more information.

Enjoy the connector, and hopefully it helps you like it did us. We have bypassed the source code side of things so business users can enjoy the benefit of a fully compiled connector: just store it in that location and start using it for Direct Querying any ODBC data source.

Power to the future of Big Data @ DELIVERBI

We actually Know Big Data !!


Cheers
Shahed Munir & Krishna Udathu




Hive HPLSQL setup on Google DataProc


Hive HPLSQL setup on Google DataProc




Google Dataproc Hadoop and Hive
Hive Version : 2.3.2 (Version Supports Tez Engine)

More Information on hplsql

http://www.hplsql.org/
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156


As always, we were trying to implement some functionality and it was not working with a security setup on Hive as well as the Tez execution engine etc.


Here are the Steps to get it going.



To enable HPLSQL. We can't find any documentation on this for Google Dataproc anywhere!


As we have security etc. set up on Hive, you have to edit hplsql-site.xml to connect to HiveServer2 (Thrift), and to do this you need to follow the steps below.



Find the following File

/usr/lib/hive/lib/hive-hplsql-2.3.2.jar

Copy the file to a folder and unjar it 

jar -xvf hive-hplsql-2.3.2.jar


vi the file hplsql-site.xml (You Will see it when you unjar the above file)


Amend the hive2conn connection and add the username and password etc.
I have also amended the config so I can run queries on Tez within my specific DB. It's all quite straightforward.

<property>
  <name>hplsql.conn.hive2conn</name>  <value>org.apache.hive.jdbc.HiveDriver;jdbc:hive2://localhost:10000;myuser;mypassword</value>
  <description>HiveServer2 JDBC connection</description>
</property>
<property>
  <name>hplsql.conn.init.hive2conn</name>
  <value>
     set hive.execution.engine=tez;
     use myspecificdb;
  </value>
  <description>Statements for execute after connection to the database</description>
</property>


Remove the original jar file from the folder and run the jar command within the folder where you edited the hplsql-site.xml file:

jar cvf hive-hplsql-2.3.2.jar *


Copy the jar file back to the original location

cp hive-hplsql-2.3.2.jar /usr/lib/hive/lib/.


Command to test

/usr/lib/hive/bin# ./hplsql -e "select * from mydb.mytablename"
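As a quick sanity check of the procedural features (this block is just an illustration of what HPL/SQL adds over plain HiveQL, not from the original setup), save something like the following as test.sql and run it with ./hplsql -f test.sql:

FOR i IN 1..3 LOOP
  PRINT 'iteration ' || i;
END LOOP;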


Power to the future of Big Data @ DELIVERBI

We actually Know Big Data !!


Cheers
Shahed Munir

Article 0


Google Drive API Google Sheets Extract to CSV with SERVICE KEY.JSON


A recent client had some Google Sheets stored on their Google Drive and needed them processed into the Hadoop cluster and made available in Presto for reporting.


As Google Drive uses OAuth authentication, we needed to use a service key instead to access the files, since we are processing them using Airflow and a Python script on a Linux server.

We hunted around and could not find any solid example of a script that works with the latest libraries.

This script will extract the Google workbook, and each sheet will be a separate csv file within a given location path.

There is also an alternative script available that can loop over a .txt file and process Google Sheet IDs one after the next.

Mandatory Parameters

A service key, a Google Sheet ID and a path.
Service keys can be generated for a project within Google. The email address of the service key then needs to be used to share the spreadsheet within Google Sheets.

For gdrivedownloadsheetsLoop.py: requires only the service key .json file name; the Google Sheet IDs are derived from a .txt file you place in the same directory.

Setup

Make sure your Python library for the Google API is this version (it can be higher, but this has not been tested on higher versions):
pip install google-api-python-client==1.6.2
Also make sure all other libraries are installed using pip; these are visible in the script under import.
Download Python Script gdrivedownloadsheets.py to a directory on Linux.
Amend the following:
Replace YOURSERVICEKEY.json with the filename of the service key you generated in Google: service_account_file = os.path.join(os.getcwd(), 'YOURSERVICEKEY.json')
This will be your output path for your csv files (one per worksheet within a workbook): path = "/GCSDrive/output/"


Run Command

The parameter is the Google Sheet ID, which is visible in most URLs when viewing the sheet in Google.
--------------- For running 1 workbook at a time 
python3 gdrivedownloadsheets.py 1gWxy05uEcO8a8fNAfUSIdV1OcRWxH7RnjXezoHImJLE
--------------- For running Multiple Workbooks one after another 
python3 gdrivedownloadsheetsLoop.py



Git Hub Location for Code : https://github.com/deliverbi/Gdrive-API-Google-Sheets
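The scripts in the repo above are the reference; purely to illustrate the approach (service-account auth, one csv per worksheet), a stripped-down sketch might look like this, with the key file, sheet ID and output path taken from the setup notes above and the google-auth style of credentials assumed:

import csv
import os
import sys

from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ['https://www.googleapis.com/auth/spreadsheets.readonly']
key_file = os.path.join(os.getcwd(), 'YOURSERVICEKEY.json')   # service key shared with the sheet
sheet_id = sys.argv[1]                                        # Google Sheet ID from the URL
out_path = '/GCSDrive/output/'                                # one csv per worksheet lands here

creds = service_account.Credentials.from_service_account_file(key_file, scopes=SCOPES)
sheets = build('sheets', 'v4', credentials=creds)

workbook = sheets.spreadsheets().get(spreadsheetId=sheet_id).execute()
for ws in workbook['sheets']:
    title = ws['properties']['title']
    values = sheets.spreadsheets().values().get(
        spreadsheetId=sheet_id, range=title).execute().get('values', [])
    with open(os.path.join(out_path, title + '.csv'), 'w', newline='') as f:
        csv.writer(f).writerows(values)                       # write the worksheet rows out as csv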


We will follow through with this post and blog the framework we created to Process these files directly to HDFS and through to Hive and Presto.

Power to the future of Big Data @ DELIVERBI

We actually Know Big Data !!

Cheers

Shahed Munir & Krishna Udathu




Apache Airflow 1.10.2 Maintenance Dags


Apache Airflow 1.10 / 1.10.2 + Maintenance Dags (Logs)




We have noticed that on Airflow 1.10+, the maintenance DAGs that manage database and file system logs need to be updated to accommodate extra tables etc.

We have uploaded the new dags to our GitHub Repo : https://github.com/deliverbi/Airflow-Log-Maintenence-1.10-


HADOOP Balancing the cluster


Hadoop Balancing the Cluster


Balancing a Hadoop cluster, I feel, is very important, and the nodes should have a deviation of no more than 1%. I just feel as though this helps in having a healthy Hadoop cluster. Schedule a job to run maybe once a week on a quiet day. The reason I am saying quiet is that the command I use pushes the balancer to run very fast, 100x faster than the normal way of just running the balancer.

The following command speeds up the balancer to 100x.

Run as the hdfs user (e.g. sudo -u hdfs).

Firstly set this:

hdfs dfsadmin -setBalancerBandwidth 100000000 


Then run the following command; you can run it with nohup or as a background job if you want. It roughly takes care of 350 GB in 5 minutes.

hdfs balancer -Ddfs.balancer.movedWinWidth=5400000 -Ddfs.balancer.moverThreads=1000 -Ddfs.balancer.dispatcherThreads=200 -Ddfs.datanode.balance.max.concurrent.moves=5 -Ddfs.balance.bandwidthPerSec=100000000 -Ddfs.balancer.max-size-to-move=10737418240 -threshold 1









Yarn - Long Running Jobs Alerts for Hive TEZ , Application Containers


Hadoop / Hive / Yarn / Applications Long running Jobs alert script




We are currently at a client where there are some long-running jobs in Hive that hog the resources. You can see the logs in the application history manager or Yarn etc., but we needed an alert mechanism. So we wrote this Python script to detect any over-runs; it will spit out a file with the relevant records that are over-running. The script can be run with any method really: Airflow, ODI, or a cron job. Just add a step to send a mail using xmail or whatever suits you.

We use Airflow, and if the file produced contains any records we send an email to the admins to see what's going on. Alternatively you could always issue a yarn application -kill command.
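The actual script is in the repo; as a rough sketch of the idea only (flag RUNNING applications older than a threshold via the YARN ResourceManager REST API, with the ResourceManager host and threshold as placeholders):

import requests

RM_URL = 'http://my-rm-host:8088/ws/v1/cluster/apps?states=RUNNING'   # ResourceManager REST API (placeholder host)
MAX_MINUTES = 60                                                      # flag anything running longer than this

apps = requests.get(RM_URL).json().get('apps') or {}

with open('long_running_jobs.txt', 'w') as out:
    for app in apps.get('app', []) or []:
        elapsed_min = app['elapsedTime'] / 60000.0                    # elapsedTime is reported in milliseconds
        if elapsed_min > MAX_MINUTES:
            # one line per over-running application; mail this file if it is non-empty
            out.write('%s %s %s %.0f min\n' % (app['id'], app['user'], app['name'], elapsed_min))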

The script is within our GitHub Python repository.



Over and Out

Shahed and Krishna









Apache Airflow Check Previous Run Status for a DAG


Apache Airflow Check Previous Run Status for a DAG


The Problem

We encountered a scenario where, if any previous DAG run fails, the next scheduled DAG run should not proceed. We checked around and did not find a robust solution, so I created one working with Shahed Munir.

The Solution

I came up with a very generic approach that can be applied to any DAG that is created within Apache Airflow. Here is a sample DAG that uses and applies this concept in practice. 

To enable the capability of checking any of the previous DAG runs (self-referencing), the first task needs to be created along the lines provided in the sample DAG within GitHub: https://github.com/deliverbi/Airflow_Code.git

This solution relies on an Airflow Connection to the MySQL database used to store the Airflow metadata.
In the sample DAG code this Airflow connection is called deliverbi_mysql_airflow.

Then place the sample DAG in the dags directory.
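The sample DAG in the repo is the reference; the gist of that first task, sketched here only as an illustration (Airflow 1.10-style imports, the connection id named above, and the standard Airflow metadata schema), is along these lines:

from datetime import datetime

from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator

dag = DAG('deliverbi_sample_dag', start_date=datetime(2019, 1, 1), schedule_interval='@daily')

def check_previous_runs(**context):
    """Fail this task if any earlier run of the same DAG is not marked success."""
    hook = MySqlHook(mysql_conn_id='deliverbi_mysql_airflow')
    sql = """SELECT COUNT(*) FROM dag_run
             WHERE dag_id = %s AND execution_date < %s AND state != 'success'"""
    bad_runs = hook.get_first(sql, parameters=(context['dag'].dag_id,
                                               context['execution_date']))[0]
    if bad_runs:
        raise ValueError('%s previous run(s) not successful - stopping this run' % bad_runs)

# first task of the DAG; chain your real tasks after it
check_task = PythonOperator(task_id='check_previous_dag_runs',
                            python_callable=check_previous_runs,
                            provide_context=True,
                            dag=dag)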

Notes

All previous DAG runs must be set to success for the current DAG run to proceed. This task should be the first in any DAG that is created to enable this functionality

Here is the link to GitHub Repo Click Here to go to GitHub for DeliverBI

Over and Out. Krishna

GCS HADOOP/Hive to Google Big Query Migrate Data Quickly ORC Format


GCS Dataproc Hadoop(Hive) to GCS Big Query (BQ) Quick Transfer ORC File Formats






Software Used

Google Dataproc - Google's Hadoop cluster with add ons such as spark etc.
GCS (Google Cloud Storage) - Buckets to store data
Google Big Query - Googles data query engine


The Problem

The customer needed to move over 50 TB of data in ORC file format from a GCS Dataproc cluster to Google BigQuery quickly and with a small footprint.

The Solution

On GCS - Created a Bucket that is linked to the same project as the DataProc Instance


On Dataproc (Hadoop master node) - invoked a distributed copy command to copy the data and all of its contents to the bucket I created earlier within GCS storage.

hadoop distcp hdfs://lcg-bigdata-dev-m:8020/user/hive/warehouse/lcg_deliverbi_db.db/big_table/* gs://deliverbi_bucket/deliverbi_all_tables/big_table/

The command will copy all the data from the underlying HDFS for the Hive table to the Google Storage bucket. Remember to take all the subfolders or partition folders, as these will be handled as columns in BigQuery by the CLI command used to import the data. The ORC import in BigQuery is quite clever and recognises the Hive folder layout, such as partition folders called

"my_date=20190101"
"my_date=20190102"
"my_date=20190103" -> "my_sub_partition_field=1"

These partition folders will get created as fields within the BQ Table with the cli command to import data into Big Query. This way Hive functionality of field querying is maintained within BQ.

Once the data is in the bucket, go to BigQuery and create a dataset in your project if needed. The table will get created automatically based on the ORC file format.

Open a Cloud Shell window, or use the Google Cloud SDK (https://cloud.google.com/sdk/), to issue a bq command.

This command will import the data into Big Query from the Orc files we exported into the GCS bucket using distcp.

BQ Command to import a Partitioned Table 
All files will get imported within subfolders etc 
-- url_prefix -- give the parent folder that contains the data for your table
-- BQ Table Name
-- GCS Files location

bq load --source_format=ORC --hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=gs://deliverbi_bucket/deliverbi_all_tables/big_table/ \
deliverbi_project1:deliverbi_dataset.big_table gs://deliverbi_bucket/deliverbi_all_tables/big_table/fact_bet/*

BQ Command to import a non Partitioned Table
All files will get imported within subfolders etc 
-- BQ Table Name
-- GCS Files location

bq load --source_format=ORC deliverbi_project1:deliverbi_dataset.big_table gs://deliverbi_bucket/deliverbi_all_tables/big_table/non_paritioned_data_files_folder/*




Over and Out 

Shahed Munir















GCP Google BigQuery Cost Reporting - Control your costs


Setting up the BigQuery audit logs export within GCS

Monitor Your BigQuery Costs in GCP

Google Cloud Platform Instructions




Google BigQuery is an excellent database, but costs will have to be monitored very closely. You can use the methods below to start monitoring costs on a daily basis, and find out which users are not using filters and whose every query is costing you a packet...

Set up the BigQuery log export. Do the following in a project that contains BigQuery:

Pre-Req : Create a Big Query Project and Data Set.

1. Go to the Google Cloud Platform console and select Logging -> Exports under Stackdriver Logging























2. Click Exports and then Create Export











3. Add a Sink Name and select Custom Destination as the Sink Service. The Sink Destination should be set to bigquery.googleapis.com/projects/<project-name>/datasets/<dataset-name>, adding the project and dataset names you created earlier.

Use the drop-down arrow on the filter, convert it to an advanced filter, and make sure it contains the following: resource.type="bigquery_resource"







4. Click Create Sink

If you get a permission error then that is fine (do the following). The project you have set up the export to is different to the project you have set up the logging export in. In this case the Service Account which writes the logs into the BigQuery dataset you have created will not have permission to do so.

1. Go to BigQuery in the project the logs are exported to and click on the dropdown next to the dataset you have chosen. Click Share Dataset.

2. Get the name of the service account by going to Stackdriver Logging in the project where you set up the logging export, then Exports, and copy the Writer Identity





3. Add this Writer Identity into the Share Dataset window in BigQuery from Step 1

4. Give the account Can edit access, click Add, and then Save Changes

The BigQuery audit log export is now set up. The table will be updated periodically. The BigQuery audit log table is date partitioned.

Do this for every project you have so you can get costs for all your projects; you can reuse the same destination project and dataset for all of them.



Using the Google Cloud SDK Alternative

Use Google Cloud SDK or Gcloud cli.

1. First set your project:
gcloud config set project <project-name>

2. Then run the following command, adjusting it to your target dataset and project:

gcloud beta logging sinks create <sink_name> bigquery.googleapis.com/projects/<project-name>/datasets/<dataset-name> --log-filter='resource.type="bigquery_resource"'

Using your dataset and table you can now report your costs, from the BigQuery dataset, in a tool of your choice.

For a list of your projects that are being audited:


SELECT distinct
resource.labels.project_id
FROM `<dataset-name>.cloudaudit_googleapis_com_data_access_*`
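Building on that, a per-user cost estimate can be pulled from the same audit table. The query below is a sketch: the field paths follow the standard v1 BigQuery audit-log export schema, and the $5 per TiB figure is an assumption based on on-demand pricing, so adjust it for your region.

SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS user_email,
  COUNT(*) AS query_count,
  SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes) AS total_billed_bytes,
  ROUND(SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes)
        / POW(2, 40) * 5.0, 2) AS approx_cost_usd        -- assumes ~$5 per TiB on-demand
FROM `<dataset-name>.cloudaudit_googleapis_com_data_access_*`
WHERE protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY user_email
ORDER BY approx_cost_usd DESC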

Over and Out

DELIVERBI Team

Happy New Year 2020 from DELIVERBI and DataRocketPro















Happy New Year to all our friends and customers. Hope you have a prosperous new year. Remember we are now a google partner and can help with all your Google Cloud Platform requests. As usual any questions or help please feel free to reach out to us.

Shahed and Krishna @ DELIVERBI @ DatarocketPro