Big Data Ready Enterprise (BDRE)makes big data technology adoption simpler by optimizing and integrating various big data solutions and providing them under one integrated package. BDRE provides a unified framework for a Hadoop implementation that can drastically minimize development time and fast track the Hadoop implementation. It comprises a reusable framework that can be customized as per the enterprise ecosystem. The components are loosely integrated and can be de-coupled or replaced easily with alternatives.
The primary goal of BDRE is to accelerate Bigdata implementations by supplying the essential frameworks that are most likely to be written from scratch. It can drastically reduce effort by eliminating hundreds of man hours in operational framework development. Big Data implementations however, require specialized skills, significant development effort on data loading, semantic processing, DQ, code deployment across environments etc.
Features
Operational Metadata Management
Registry of all workflow processes/templates
Parameters/configuration(key/value) for processes
Dependency information (upstream/downstream)
Batch management/tracking. Batch concept in BDRE is for tracking the data flow between workflow processes.
Run control (for delta processing/dependency check)
Execution status for jobs(dynamic metadata - with step level granularity)
File registry - can be used to register e.g. ingested files or a raw file as an output of an upstream.
Execution statistics logging (key/value)
Executed hive queries and data lineage information.
Java APIs that integrates with Big Data with non-Big Data applications alike.
Job monitoring and proactive/reactive alerting
Data ingestion framework
Tabular data from RDBMS
Streaming data from 16 types of sources (including logs, message queues and Twitter)
Arbitrary file ingestion by directory monitoring
Web Crawler
Distributed Data Manufacturing framework
Generate billons of records based on patterns and ranges
Semantic Layer Building Framework
Build the sematic layer using visual workflow creator using the data you ingested.
Supports Hive, Pig, MapReduce, Spark, R etc.
Generates Oozie workflows
Data Quality Framework
Validates your data using your rules in a distributed way
Integrated with Drools rule engine
HTML5 User Interface
Create ingestion, data generation, Crawler jobs or create Oozie workflows graphically without writing any code
One click deploy and execute jobs without SSH into the edge node.
Architecture
Installation
Overview
This section will help you build BDRE from source. Audience for this document are developers and architects who want be part of BDRE framework development or may just want to evaluate it.
General Prerequisite
For testing/development purpose and to save time, use the fully loaded Hadoop VMs from Cloudera or Hortonworks because all the required software are typically installed and configured.
A Hadoop Cluster
In this section we are using Hortonworks Sandbox 2.2.0
Git 1.9 and up
Maven 3 and up
Oracle JDK 7(and up)
BDRE is shipped with an embedded database which is okay for running the UI and evaluating and testing jobs in a single node cluster.
For production use BDRE currently supports following production scale databases.)
MySQL Server 5.1 and up
Oracle 11g Server or better
PostgreSQL
Google Chrome browser
You should be able to do the same in Mac or Windows but note that setting up a Hadoop cluster might be tricky in Windows and might more involvement. However to deploy and run the jobs we recommend a Linux system. BDRE is typically installed in Hadoop edge node in a multi-node cluster.
Login to the HDP Sandbox with the newly created openbdre user. You can perform a su openbdre to switch to this account. Please make sure you are not root user beyond this point.
[root@sandbox ~]# su openbdre
[openbdre@sandbox root]$ cd~
[openbdre@sandbox ~]$
Download Maven from a mirror, unpack and add to the PATH.
cd to the cloned source dir (so you can be in /home/openbdre/openbdre)
[openbdre@sandbox ~]# cd openbdre
Database Setup
Execute the dbsetup.sh script without any parameters as shown below. In this example, we are going to use MySQL as BDRE backend as it's already available in the HDP Sandbox. If you would like to use another database please select it accordingly.
[openbdre@sandbox ~]# sh dbsetup.sh
[openbdre@sandbox openbdre]$ sh dbsetup.sh⏎
Supported DB
1) Embedded (Default - Good for running BDRE user interface only. )
2) Oracle
3) MySQL
4) PostgreSQL
Select Database Type(Enter 1, 2, 3 , 4 or leave empty and press empty to selectthe default DB):3⏎
Enter DB username (Type username or leave it blank for default 'root'):⏎
Enter DB password (Type password or leave it blank for default '<blank>'):⏎
Enter DB hostname (Type db hostname or leave it blank for default 'localhost'):⏎
Enter DB port (Type db port or leave it blank for default '3306'):⏎
Enter DB name (Type db name or leave it blank for default 'bdre'):⏎
Enter DB schema (Type schema or leave it blank for default 'bdre'):⏎
Please confirm:
Database Type: mysql
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection URL: jdbc:mysql://localhost:3306/bdre
Database Username: root
Database Password:
Hibernate Dialect: org.hibernate.dialect.MySQLDialect
Database Schema: bdre
Are those correct? (type y or n - default y):y⏎
Database configuration written to ./md-dao/src/main/resources/db.properties
Will create DB and tables
Tables created successfully in MySQL bdre DB
Building
Now build BDRE using (note BDRE may not compile if the settings.xml is not passed from the command line so be sure to use the -s option. When building for the first time, it might take a while as maven resolves and downloads the jar libraries from different repositories.
mvn -s settings.xml clean install -P hdp22
Note: Selecting hdp22 will compile BDRE with HDP 2.2 libraries and automatically configure BDRE with Hortonworks Sandbox 2.2.0. Similarly one should be able to build this using -P cdh52 which will configure BDRE for CDH 5.2 QuickStart VM. During building it'll pick up the environment specific configurations from /databases/setup/profile.hdp22.properties.
Content of databases/setup/profile.hdp22.properties
Use Google Chrome browser from the host machine and open http://VM_IP:28850/mdui/pages/content.page
Login using admin/zaq1xsw2
Creating, Deploying and Running a Test Job
Create a RDBMS data import job from Job Setup From Template > Import from RDBMS
Change the JDBC URL/credentials to your local MySQL DB that contains some data.
Click Test Connection
Expand and select 1 table (be sure to expand the tables before selecting).
Create the jobs and see the pipeline.
Click XML, Diagram etc. and check the generated Oozie workflow XML and diagram.
Search for 'Process' in the search window and open the 'Process' page
Click deploy button on process page corresponding to the process you want to deploy. (Deploy button will show status regarding deployment of process, when you hover over the button.)
Wait for 2 minutes and the deployment will be completed by then.
After the deployment is complete and in UI the status for the process is deployed (turns green).
Click the execution button to execute the Import job.
Check the process in Oozie console http://VM_IP:11000/oozie
When the import job is complete start the data load job.
Data Ingestion
RDBMS Data Ingestion
Streaming Data Ingestion
Directory Monitoring and File Ingestion
Workflow Builder
Bulk Data Manufacturing
Demo video TBD
Web Crawler
Operational Metadata Management System
Operational Metadata Management
BDRE provides complete job/operational metadata management solution for Hadoop. At its core acts as a registry and tracker for different types of jobs running in different Hadoop clusters or as a standalone. It provides APIs to integrate with virtually any jobs.
BDRE uses RDBMS database to store all job related metadata. A set of stored procedures are there to interface will the tables which are exposed via Java APIs to manage/create/update the static and run time metadata information. Below is the data model for BDRE metadata operational database.
How to Contribute
Contribution for the enhancements in BDRE are welcome and humbly requested by us. To contribute, please navigate to our GitHub project page and fork BDRE main repository under your own account. You can make changes to your own forked repository and then open a Pull Request to merge your change with the main repo.