Metorikku is a library that simplifies writing and executing ETLs on top of Apache Spark. A user writes a simple YAML configuration file that includes SQL queries and runs Metorikku on a Spark cluster. The platform also includes a way to write tests for metrics using MetorikkuTester.
To run Metorikku you must first define 2 files.
An MQL (Metorikku Query Language) file defines the steps and queries of the ETL as well as where and what to output.
For example, a simple configuration YAML (JSON is also supported) looks as follows:
    steps:
    - dataFrameName: df1
      sql:
        SELECT *
        FROM input_1
        WHERE id > 100
    - dataFrameName: df2
      sql:
        SELECT *
        FROM df1
        WHERE id < 1000
    output:
    - dataFrameName: df2
      outputType: Parquet
      outputOptions:
        saveMode: Overwrite
        path: df2.parquet
Take a look at the examples file for further configuration examples.
Metorikku uses a YAML file to describe the run configuration. This file will include input sources, output destinations and the location of the metric config files.
For example, a simple YAML (JSON is also supported) looks as follows:
    metrics:
      - /full/path/to/your/MQL/file.yaml
    inputs:
      input_1: parquet/input_1.parquet
      input_2: parquet/input_2.parquet
    output:
      file:
        dir: /path/to/parquet/output
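Since JSON is also supported, the same run configuration can be generated programmatically. The Python sketch below builds the structure from the example and serializes it. It assumes the JSON form mirrors the YAML keys one-to-one; the variable names are illustrative only.

```python
import json

# Same keys and values as the YAML run configuration example.
run_config = {
    "metrics": ["/full/path/to/your/MQL/file.yaml"],
    "inputs": {
        "input_1": "parquet/input_1.parquet",
        "input_2": "parquet/input_2.parquet",
    },
    "output": {"file": {"dir": "/path/to/parquet/output"}},
}

# Serialize to JSON text; this could be saved and passed with -c
# in place of the YAML file (assumption: keys map one-to-one).
config_json = json.dumps(run_config, indent=2)
print(config_json)
```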
See the sample YAML configuration file for a full example with all possible values.
Currently Metorikku supports the following inputs: CSV, JSON, parquet
And the following outputs: CSV, JSON, parquet, Redshift, Cassandra, Segment, JDBC
For Redshift, s3_access_key and s3_secret can be passed through spark-submit.
There are currently 3 options to run Metorikku.
To run on a cluster, Metorikku requires Apache Spark v2.2+.
- Download the latest released JAR
- Run the following command:
spark-submit --class com.yotpo.metorikku.Metorikku metorikku.jar -c config.yaml
When using the JDBC writer, provide the path of the driver JAR in both the --jars and --driver-class-path parameters. For example, for MySQL:
spark-submit --driver-class-path mysql-connector-java-5.0.8-bin.jar --jars mysql-connector-java-5.0.8-bin.jar --class com.yotpo.metorikku.Metorikku metorikku.jar -c config.yaml
JDBC query output allows running a query for each record in the dataframe.
- query - Defines the SQL query. In the query you can address the columns of the DataFrame by their position, using the dollar sign ($) followed by the column index. For example:
INSERT INTO table_name (column1, column2, column3, ...) VALUES ($1, $2, $3, ...);
- maxBatchSize - The maximum number of queries to execute against the DB in one commit.
- minPartitions - Minimum partitions in the DataFrame - may cause repartition.
- maxPartitions - Maximum partitions in the DataFrame - may cause coalesce.
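To make the query options above more concrete, here is a minimal Python sketch of how positional placeholders and batching could behave. It is not Metorikku's implementation: `bind_row` and `batches` are hypothetical helpers, and it assumes that `$N` refers to the N-th column (1-based) of each record and that maxBatchSize caps the number of queries per commit.

```python
import re
from typing import Iterator, List, Sequence, Tuple


def bind_row(query: str, row: Sequence[object]) -> Tuple[str, List[object]]:
    """Replace $1, $2, ... with '?' and collect the matching row values.

    Hypothetical helper: assumes $N means the N-th column, counting from 1.
    """
    params: List[object] = []

    def _sub(match):
        index = int(match.group(1))
        params.append(row[index - 1])
        return "?"

    return re.sub(r"\$(\d+)", _sub, query), params


def batches(rows: Sequence[Sequence[object]], max_batch_size: int) -> Iterator[Sequence[Sequence[object]]]:
    """Yield consecutive groups of at most max_batch_size rows (one group per commit)."""
    for start in range(0, len(rows), max_batch_size):
        yield rows[start:start + max_batch_size]


query = "INSERT INTO table_name (column1, column2) VALUES ($1, $2)"
rows = [(200, "test"), (300, "test2"), (1, "test3")]

statement, params = bind_row(query, rows[0])
groups = [list(group) for group in batches(rows, max_batch_size=2)]
```

With three records and a batch size of 2, the sketch produces two groups, so two commits would be issued under these assumptions.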
Metorikku is released with a JAR that includes a bundled Spark.
- Download the latest released Standalone JAR
- Run the following command:
java -Dspark.master=local[*] -cp metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c config.yaml
It's also possible to use Metorikku inside your own software. The Metorikku library requires Scala 2.11.
- Add the following dependency to your build.sbt:
"com.yotpo" % "metorikku" % "0.0.1"
- Start Metorikku by creating an instance of com.yotpo.metorikku.config and running com.yotpo.metorikku.Metorikku.execute(config)
In order to test and fully automate the deployment of MQLs (Metorikku query language files) we added a method to run tests against MQLs.
A test consists of 2 files.
The first file defines what to test and where to get the mocked data. For example, a simple test YAML (JSON is also supported) will be:
    metric: "/path/to/metric"
    mocks:
    - name: table_1
      path: mocks/table_1.jsonl
    tests:
      df2:
      - id: 200
        name: test
      - id: 300
        name: test2
And the corresponding mocks/table_1.jsonl:
{ "id": 200, "name": "test" }
{ "id": 300, "name": "test2" }
{ "id": 1, "name": "test3" }
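As a rough illustration of what this test checks, the Python sketch below replays the two example steps (id > 100, then id < 1000) over the mock rows and compares the result with the expected df2 rows. It assumes the metric under test is the two-step example from the MQL section, with table_1 standing in for its input. The comparison logic is illustrative and is not MetorikkuTester's actual behaviour.

```python
import json

# Contents of mocks/table_1.jsonl from the example (one JSON object per line).
MOCK_JSONL = """\
{ "id": 200, "name": "test" }
{ "id": 300, "name": "test2" }
{ "id": 1, "name": "test3" }
"""

# Expected rows for df2, as listed under "tests" in the test YAML.
expected_df2 = [
    {"id": 200, "name": "test"},
    {"id": 300, "name": "test2"},
]

table_1 = [json.loads(line) for line in MOCK_JSONL.splitlines() if line.strip()]

# Step 1: SELECT * FROM <input> WHERE id > 100
df1 = [row for row in table_1 if row["id"] > 100]
# Step 2: SELECT * FROM df1 WHERE id < 1000
df2 = [row for row in df1 if row["id"] < 1000]


def canonical(rows):
    """Order-insensitive representation of a list of rows."""
    return sorted(json.dumps(row, sort_keys=True) for row in rows)


test_passed = canonical(df2) == canonical(expected_df2)
print("df2 matches expected rows:", test_passed)
```

The row with id 1 is dropped by the first step, which is why it appears in the mock file but not in the expected results.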
You can run MetorikkuTester using any of the methods above, just like a normal Metorikku run.
The main class changes from com.yotpo.metorikku.Metorikku to com.yotpo.metorikku.MetorikkuTester.
See the LICENSE file for license rights and limitations (MIT).