
Data Validator


A tool to validate data in Spark

Usage

Retrieving official releases via direct download or Maven-compatible dependency retrieval (e.g. for spark-submit)

You can make the jars available in one of two ways for the example run invocations below:

  1. Get the latest version from GitHub Packages for the project. Place the jars somewhere and pass their path to --jars when running spark-submit.

  2. You can pull in the dependency using spark-submit's --repositories, --packages, and --class options, but this requires setting spark.jars.ivySettings and providing an Ivy settings file like the one below, populated with a valid personal access token that has the read:packages scope enabled. N.b. it can be a challenge to secure this file on shared clusters; consider using a public GitHub service account instead of a token from your own personal GitHub account.

    <ivysettings>
      <settings defaultResolver="thechain">
        <credentials host="maven.pkg.github.com" realm="GitHub Package Registry"
                     username="${GITHUB_PACKAGES_USER}" passwd="${GITHUB_PACKAGES_USER_TOKEN}" />
      </settings>
      <resolvers>
        <chain name="thechain">
          <ibiblio name="central" m2compatible="true" root="https://repo1.maven.org/maven2" />
          <!-- add any other repositories here -->
          <ibiblio name="ghp-dv" m2compatible="true" root="https://maven.pkg.github.com/target/data-validator"/>
        </chain>
      </resolvers>
    </ivysettings>

    See also How do I add a GitHub Package repository when executing spark-submit --repositories?

Building locally

See CONTRIBUTING for development environment setup.

Assemble fat jar: make build or sbt clean assembly

spark-submit --master local data-validator-assembly-0.14.1.jar --help

data-validator v0.14.1
Usage: data-validator [options]

  --version
  --verbose                Print additional debug output.
  --config <value>         required validator config .yaml filename, prefix w/ 'classpath:' to load configuration from JVM classpath/resources, ex. '--config classpath:/config.yaml'
  --jsonReport <value>     optional JSON report filename
  --htmlReport <value>     optional HTML report filename
  --vars k1=v1,k2=v2...    other arguments
  --exitErrorOnFail true|false
                           optional when true, if validator fails, call System.exit(-1) Defaults to True, but will change to False in future version.
  --emailOnPass true|false
                           optional when true, sends email on validation success. Default: false
  --help                   Show this help message and exit.

If you want to build with Java 11 or newer, set the "MODERN_JAVA" environment variable. This may become the default in the future.

Example Run

With the JAR directly:

spark-submit \
  --num-executors 10 \
  --executor-cores 2 \
  data-validator-assembly-0.14.1.jar \
  --config config.yaml \
  --jsonReport report.json

Using --packages to load the dependency, having created dv-ivy.xml as suggested above and replaced the placeholders in the example:

touch empty.file && \
spark-submit \
  --class com.target.data_validator.Main \
  --packages com.target:data-validator_2.11:0.14.1 \
  --conf spark.jars.ivySettings=$(pwd)/dv-ivy.xml \
  empty.file \
  --config config.yaml \
  --jsonReport report.json

See the Example Config below for the contents of config.yaml.

Config file Description

The data-validator config file is YAML-based and has three sections: Global Settings, Table Sources, and Validators. The Table Sources and Validators sections can use variables in their configuration. These variables are replaced at runtime with the values set in the Global Settings section or via the --vars option on the command line. Variables start with $ and must be a word starting with a letter (A-Za-z) followed by zero or more letters (A-Za-z), digits (0-9), or underscores. Variables can optionally be wrapped in { }, i.e. $foo or ${foo}. See the code for the regular expression used to find them in a string. All of the table sources, and all validators except rowCount, support variables in their configuration parameters. Note: care must be taken with some substitutions; some values may require quoting the variable in the config.
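
For example, a sketch combining a simple variable with a command-line variable (the database, table, and column names are placeholders, and MIN_AGE is assumed to be supplied on the command line with --vars MIN_AGE=18):

vars:
  - name: DB
    value: census_income

tables:
  - db: $DB
    table: adult
    condition: "age > ${MIN_AGE}"
    checks: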

Global Settings

The first section is the global settings that are used throughout the program.

Variable Type Required Description
numKeyCols Int Yes The number of columns from the table schema to use to uniquely identify a row in the table.
numErrorsToReport Int Yes The number of detailed errors to include in Validator Report.
detailedErrors Boolean Yes If a check fails, run a second pass and gather numErrorsToReport examples of failures.
email EmailConfig No See Email Config.
vars Map No A map of (key, value) pairs used for variable substitution in tables config. See next section.
outputs Array No Describes where to send .json report. See Validator Output.
tables List Yes List of table sources used to load tables to validate.

Email Config

Variable Type Required Description
smtpHost String Yes The smtp host to send email message through.
subject String Yes Subject for email message.
from String Yes Email address to appear in from part of message.
to Array[String] Yes Must specify at least one email address to send the email report to.
cc Array[String] No Optional list of email addresses to send message to via cc field in message.
bcc Array[String] No Optional list of email addresses to send message to via bcc field in message.

Note that Data Validator only sends email on failure by default. To send email even on successful runs, pass --emailOnPass true to the command line.
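
A minimal email section might look like the following sketch (the host and addresses are placeholders):

email:
  smtpHost: smtp.example.com
  subject: Data Validation Summary
  from: dv-noreply@example.com
  to:
    - team@example.com
  cc:
    - manager@example.com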

Defining Variables

There are four different types of variables that you can specify: simple, environment, shell, and SQL.

Simple Variable

Simple variables are specified as name/value pairs and are very straightforward.

vars:
  - name: ENV
    value: prod

This sets the variable ENV to the value prod.

Environment Variable

Environment variables import their value from the operating system environment.

vars:
  - name: JAVA_DIR
    env: JAVA_HOME

This will set the variable JAVA_DIR to the value returned by System.getenv("JAVA_HOME"). If JAVA_HOME does not exist in the system environment, the data-validator will stop processing and exit with an error.

Shell Variable

A shell variable takes the first line of output from a shell command and stores it in a variable.

vars:
  - name: NEXT_SATURDAY
    shell: date -d "next saturday" +"%Y-%m-%d"

This will set the variable NEXT_SATURDAY to the first line of output from the shell command date -d "next saturday" +"%Y-%m-%d".

SQL Variable

A SQL variable takes the first column of the first row of the results of a Spark SQL statement.

vars:
  - name: MAX_AGE
    sql: select max(age) from census_income.adult

This runs the SQL query, which gets the max value of the column age from the table adult in the census_income database, and stores the result in MAX_AGE.

Validator Output

In addition to the --jsonReport command line option, the .yaml config has an outputs section that directs the .json event report to a file or pipes it to a program. There is currently no limit on the number of outputs.

Filename

outputs:
  - filename: /user/home/sample.json
    append: true

If the specified filename begins with / or local:///, the report is written to the local filesystem. If the filename begins with hdfs://, the report is written to the HDFS path. An optional append boolean can be specified; if it is true, the current report is appended to the end of the specified file. The default is append: false, and the file is overwritten. The filename supports variable substitution; the optional append does not. Before the validator starts processing tables, it verifies that it can create or append to the file; if it cannot, the data-validator exits with an error (non-zero value).

Pipe

outputs:
  - pipe: /path/to/program
    ignoreError: true

A pipe sends the .json event report to another program for processing. This is a very powerful feature and enables the data-validator to be integrated with virtually any other system. An optional ignoreError boolean can also be specified; if true, the exit value of the program is ignored. If false (the default) and the program exits with a non-zero status, the data-validator will fail. The pipe supports variable substitution; the optional ignoreError does not.

Before the validator starts processing tables, it checks that the pipe program is executable; if it is not, the data-validator exits with an error (non-zero value). The program must be on a local filesystem to be executed.

Table Sources

Table sources specify how to load the tables to be validated. Currently supported sources are HiveTable, OrcFile, Parquet File, and a generic format loader that uses the Spark DataFrameReader API. Each table source has three common arguments, keyColumns, condition, and checks, plus its own source-specific argument(s). The keyColumns are a list of columns used to uniquely identify a row in the table for the detailed error report when a validator fails. The condition lets the user specify a snippet of SQL to pass to the where clause. The checks argument is a list of validators to run on this table.

HiveTable

To validate a Hive table, specify the db and the table, see below.

- db: $DB
  table: table_name
  condition: "col1 < 100"
  keyColumns:
    - col1
    - col2
  checks:

OrcFile

To validate an .orc file, specify orcFile and the path to the file, see below.

- orcFile: /path/to/orc/file
  keyColumns:
    - col1
    - col2
  checks:

Parquet File

To validate a .parquet file, specify parquetFile and the path to the file, see below.

- parquetFile: /path/to/parquet/file
  keyColumns:
    - col1
    - col2
  checks:

Core spark.read fluent API specified format loader

To validate data loadable by the Spark DataFrameReader Fluent API, use something like this:

  # Some systems require a special format
  format: llama
  # You can also pass any valid options
  options:
    maxMemory: 8G
  # This is a string passed to the varargs version of DataFrameReader.load(String*)
  # If omitted, then DV will call DataFrameReader.load() without parameters.
  # The DataSource that Spark loads is expected to know how to handle this.
  loadData:
    - /path/to/something/camelid.llama
  keyColumns:
    - col1
    - col2
  condition: "col1 < 100"
  checks:

Under the hood, the above is roughly equivalent to loading a DataFrame with:

spark.read
  .format("llama")
  .option("maxMemory", "8G")
  .load("/path/to/something/camelid.llama")

Validators

The third section contains the validators. To specify a validator, first specify its type as one of the supported validators, then specify the arguments for that validator. Some of the validators support an error threshold. This option allows the user to specify the number or percentage of errors they can tolerate. In some use cases, it might not be possible to eliminate all errors in the data.

Thresholds

Thresholds can be specified as an absolute number of errors or as a percentage of the row count. If the threshold is >= 1, it is treated as an absolute number of errors. For example, 1000 would fail the check if more than 1000 rows fail the check.

If the threshold is < 1, it is treated as a fraction of the row count. For example, 0.25 would fail the check if more than rowCount * 0.25 of the rows fail the check. If the threshold ends in %, it is treated as a percentage of the row count. For example, 33% would fail the check if more than rowCount * 0.33 of the rows fail the check.
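
For example, the three forms might look like this on a nullCheck (a sketch only; the column name is a placeholder, and the fractional form is quoted because, as noted in the issues section below, an unquoted YAML float may be ignored):

checks:
  # Absolute: fail if more than 1000 rows fail the check
  - type: nullCheck
    column: col1
    threshold: 1000

  # Fraction of rowCount: fail if more than rowCount * 0.25 rows fail the check
  - type: nullCheck
    column: col1
    threshold: "0.25"

  # Percentage of rowCount: fail if more than rowCount * 0.33 rows fail the check
  - type: nullCheck
    column: col1
    threshold: 33%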

Currently supported validators are listed below:

columnMaxCheck

Takes 2 parameters, the column name and a value. The check will fail if max(column) is not equal to the value.

Arg Type Description
column String Column within table to find the max from.
value * The column max should equal this value or the check will fail. Note: The type of the value should match the type of the column. If the column is a NumericType, the value cannot be a String.
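
For example, a sketch of a columnMaxCheck entry (the column name and expected value are illustrative placeholders):

checks:
  - type: columnMaxCheck
    column: age
    value: 90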

negativeCheck

Takes a single parameter, the column name to check. The check will fail if any row has a negative value in that column.

Arg Type Description
column String Table column to be checked for negative values. If the column contains a null, the validator will fail. Note: The column must be of a NumericType or the check will fail during the config check.
threshold String See above description of threshold.

nullCheck

Takes a single parameter, the column name to check. The check will fail if any row has a null value in that column.

Arg Type Description
column String Table column to be checked for nulls. If the column contains a null, the validator will fail.
threshold String See above description of threshold.

rangeCheck

Takes 2 to 4 parameters, described below. If the value in the column doesn't fall within the range specified by (minValue, maxValue), the check will fail.

Arg Type Description
column String Table column to be checked.
minValue * Lower bound of the range, or another column in the table. Type depends on the type of the column.
maxValue * Upper bound of the range, or another column in the table. Type depends on the type of the column.
inclusive Boolean Include minValue and maxValue as part of the range.
threshold String See above description of threshold.

Note: To specify another column in the table, you must prefix the column name with a ` (backtick).
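
For example, a sketch of two rangeCheck entries (column names and bounds are placeholders; the second entry compares against another column using the backtick prefix described above, quoted here so the YAML parses):

checks:
  - type: rangeCheck
    column: age
    minValue: 0
    maxValue: 150
    inclusive: true

  - type: rangeCheck
    column: start_age
    maxValue: "`end_age"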

stringLengthCheck

Takes 2 to 4 parameters, described in the table below. If the length of the string in the column doesn't fall within the range specified by (minLength, maxLength), both inclusive, the check will fail. At least one of minLength or maxLength must be specified. The data type of column must be String.

Arg Type Description
column String Table column to be checked. The DataType of the column must be a String
minLength Integer Lower bound of the length of the string, inclusive.
maxLength Integer Upper bound of the length of the string, inclusive.
threshold String See above description of threshold.

stringRegexCheck

Takes 2 to 3 parameters, described in the table below. If the column value does not match the pattern specified by the regex, the check will fail. A value for regex must be specified. The data type of column must be String.

Arg Type Description
column String Table column to be checked. The DataType of the column must be a String
regex String POSIX regex.
threshold String See above description of threshold.

rowCount

The minimum number of rows a table must have to pass the validator.

Arg Type Description
minNumRows Long The minimum number of rows a table must have to pass.

See the Example Config below to see how the checks are configured.

uniqueCheck

This check is used to make sure all rows in the table are unique; only the columns specified are used to determine uniqueness. This is a costly check and requires an additional pass through the table.

Arg Type Description
columns Array[String] Each set of values in these columns must be unique.
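
For example, a sketch of a uniqueCheck entry (the column names are placeholders):

checks:
  - type: uniqueCheck
    columns:
      - age
      - occupation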

columnSumCheck

This check sums a column across all rows. If the sum of the column doesn't fall within the range specified by (minValue, maxValue), the check will fail.

Arg Type Description
column String The column to be checked.
minValue NumericType The lower bound of the sum. Type depends on the type of the column.
maxValue NumericType The upper bound of the sum. Type depends on the type of the column.
inclusive Boolean Include minValue and maxValue as part of the range.

Note: If bounds are non-inclusive, and the actual sum is equal to one of the bounds, the relative error percentage will be undefined.
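
For example, a sketch of a columnSumCheck entry (the column name and bounds are placeholders):

checks:
  - type: columnSumCheck
    column: age
    minValue: 1000
    maxValue: 1000000
    inclusive: true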

colstats

This check generates column statistics about the specified column.

Arg Type Description
column String The column on which to collect statistics.

These keys and their corresponding values will appear in the check's JSON summary when using the JSON report output mode:

Key Type Description
count Integer Count of non-null entries in the column.
mean Double Mean/Average of the values in the column.
min Double Smallest value in the column.
max Double Largest value in the column.
stdDev Double Standard deviation of the values in the column.
histogram Complex Summary of an equi-width histogram, counts of values appearing in 10 equally sized buckets over the range [min, max].

Example Config

---

# If keyColumns are not specified for a table, we take the first N columns of a table instead.
numKeyCols: 2

# numErrorsToReport: Number of errors per check shown in "Error Details" of report, this is to limit the size of the email.
numErrorsToReport: 5

# detailedErrors: If true, a second pass will be made for checks that fail to gather numErrorsToReport examples with offending value and keyColumns to aid in debugging
detailedErrors: true

vars:
  - name: ENV
    value: prod

  - name: JAVA_DIR
    env: JAVA_HOME

  - name: TODAY
    shell: date +"%Y-%m-%d"

  - name: MAX_AGE
    sql: SELECT max(age) FROM census_income.adult

outputs:
  - filename: /user/home/sample.json
    append: true

  - pipe: /path/to/program
    ignoreError: true

email:
  smtpHost: smtp.example.com
  subject: Data Validation Summary
  from: [email protected]
  to:
    - [email protected]
  cc:
    - [email protected], [email protected]
  bcc:
    - [email protected]

tables:
  - db: census_income
    table: adult
    # Key Columns are used when errors occur to identify a row, so they should include enough columns to uniquely identify a row.
    keyColumns:
      - age
      - occupation
    condition: educationNum >= 5
    checks:
      # rowCount - checks if the number of rows is at least minNumRows
      - type: rowCount
        minNumRows: 50000

      # negativeCheck - checks if any values are less than 0
      - type: negativeCheck
        column: age
      
      # colstats - adds basic statistics of the column to the output
      - type: colstats
        column: age
        
      # nullCheck - checks if the column is null, counts number of rows with null for this column.
      - type: nullCheck
        column: occupation

      # stringLengthCheck - checks if the length of the string in the column falls within the specified range, counts number of rows in which the length of the string is outside the specified range.
      - type: stringLengthCheck
        column: occupation
        minLength: 1
        maxLength: 5

      # stringRegexCheck - checks if the string in the column matches the pattern specified by `regex`, counts number of rows in which there is a mismatch.
      - type: stringRegexCheck
        column: occupation
        regex: ^ENGINEER$ # matches the word ENGINEER

      - type: stringRegexCheck
        column: occupation
        regex: \w # matches any alphanumeric string

Working with OOZIE Workflows

The data-validator can be used in an Oozie workflow to halt the workflow if a check doesn't pass. There are two ways to use the data-validator in Oozie, and each has its own drawbacks. The choice of method is determined by the --exitErrorOnFail {true|false} command line option.

Setting ExitErrorOnFail to True

The first option, enabled by --exitErrorOnFail=true, is to have the data-validator exit with a non-zero value when a check fails. This enables the workflow to decide how it wants to handle a failed check or error. The downside of this method is that you can never be sure whether the data-validator exited with an error because of a failed check or because of a problem with the execution of the data-validator itself. This also pollutes the Oozie workflow info with ERROR, which some might not like. This is currently the default but is likely to change with v1.0.0.

Example Oozie workflow snippet:

<action name="RunDataValidator">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>spark-submit</exec>
      <argument>--conf</argument>
      <argument>spark.yarn.maxAppAttempts=1</argument>
      <argument>--class</argument>
      <argument>com.target.data_validator.Main</argument>
      <argument>--master</argument>
      <argument>yarn</argument>
      <argument>--deploy-mode</argument>
      <argument>cluster</argument>
      <argument>--keytab</argument>
      <argument>${keytab}</argument>
      <argument>--principal</argument>
      <argument>${principal}</argument>
      <argument>--files</argument>
      <argument>config.yaml</argument>
      <argument>data-validator-assembly-0.14.1.jar</argument>
      <argument>--config</argument>
      <argument>config.yaml</argument>
      <argument>--exitErrorOnFail</argument>
      <argument>true</argument>
      <argument>--vars</argument>
      <argument>ENV=${ENV},EMAIL_REPORT=${EMAIL_REPORT},SMTP_HOST=${SMTP_HOST}</argument>
      <capture-output/>
    </shell>
    <ok to="ValidatorSuccess" />
    <error to="ValidatorErrorOrCheckFail" />
  </action>

 <action name="ValidatorErrorOrCheckFail">
  <!-- Check or data-validator failed  -->
  </action>

  <action name="ValidatorSuccess">
  <!-- Everything is wonderful!  -->
  </action>

Setting ExitErrorOnFail to False

The second option, enabled by --exitErrorOnFail=false, is to have the data-validator write DATA_VALIDATOR_STATUS=PASS or DATA_VALIDATOR_STATUS=FAIL to stdout and call System.exit(0) when it completes. This enables the workflow to distinguish between a failed check and a runtime error. The downside is that you must use the Oozie shell action with the capture-output option and run the validator in Spark's client mode. This will likely become the default behavior in v1.0.0.

Example Oozie workflow snippet:

<action name="RunDataValidator">
  <shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>spark-submit</exec>
    <argument>--conf</argument>
    <argument>spark.yarn.maxAppAttempts=1</argument>
    <argument>--class</argument>
    <argument>com.target.data_validator.Main</argument>
    <argument>--master</argument>
    <argument>yarn</argument>
    <argument>--deploy-mode</argument>
    <argument>client</argument>
    <argument>--keytab</argument>
    <argument>${keytab}</argument>
    <argument>--principal</argument>
    <argument>${principal}</argument>
    <argument>data-validator-assembly-0.14.1.jar</argument>
    <argument>--config</argument>
    <argument>config.yaml</argument>
    <argument>--exitErrorOnFail</argument>
    <argument>false</argument>
    <argument>--vars</argument>
    <argument>ENV=${ENV},EMAIL_REPORT=${EMAIL_REPORT},SMTP_HOST=${SMTP_HOST}</argument>
    <capture-output/>
  </shell>
  <ok to="ValidatorDecision" />
  <error to="VaildatorError" />
</action>

<decision name="ValidatorDecision">
  <switch>
    <case to="ValidatorCheckFail">${wf:actionData('RunDataValidator')['DATA_VALIDATOR_STATUS'] eq "FAIL"}</case>
    <case to="ValidatorCheckPass">${wf:actionData('RunDataValidator')['DATA_VALIDATOR_STATUS'] eq "PASS"}</case>
    <default to="ValidatorNeither"/>
  </switch>
</decision>

<action name="ValidatorCheckFail">
  <!-- Handle Failed Check -->
</action>

<action name="ValidatorCheckPass">
  <!-- Everything is Wonderful! -->
</action>

<action name="ValidatorFailure">
  <!-- Notify devs of validator failure -->
</action>

Other tools included

Configuration parser check

com.target.data_validator.ConfigParser has an entrypoint that will check that the configuration file is parseable. It does not validate variable substitutions since those have runtime implications.

spark-submit \
  --class com.target.data_validator.ConfigParser \
  --files config.yml \
  data-validator-assembly-0.14.1.jar \
    config.yml

If there is an error, DV will print a message and exit non-zero.

Development Tools

Generate testing data with GenTestData or sbt generateTestData

Data Validator includes a tool to generate a sample .orc file for use in local development. This repo's SBT configuration wraps the tool in a convenient SBT task: sbt generateTestData.
Running this program or task generates a file testData.orc in the current directory. You can then use the following config file to test the data-validator; the run will generate report.json and report.html.

spark-submit \
  --master "local[*]"  \
  data-validator-assembly-0.14.1.jar \
  --config local_validators.yaml \
  --jsonReport report.json  \
  --htmlReport report.html

local_validators.yaml

---
numKeyCols: 2
numErrorsToReport: 5
detailedErrors: true

tables:
  - orcFile: testData.orc

    checks:
      - type: rowCount
        minNumRows: 1000

      - type: nullCheck
        column: nullCol

History

This tool is based on methods described in Methodology for Data Validation 1.0 by Di Zio et al., published by the Essnet Validat Foundation in 2016. You can download the paper here.


data-validator's Issues

Unit tests failures after adding dependency on HiveWarehouseConnector

In order to enable data-validator for Hadoop 3, a dependency on HiveWarehouseConnector was added. After this, unit tests started failing with the following exception:

java.lang.SecurityException: class "org.codehaus.janino.JaninoRuntimeException"'s signer information does not match signer information of other classes in the same package
at java.lang.ClassLoader.checkCerts(ClassLoader.java:898)
at java.lang.ClassLoader.preDefineClass(ClassLoader.java:668)
at java.lang.ClassLoader.defineClass(ClassLoader.java:761)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:197)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)

As suggested in this SO thread, the HiveWarehouseConnector jar was added to the end of the classpath. After that, a NoClassDefFoundError showed up.

java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
[error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
[error]     at org.apache.spark.SparkContext.withScope(SparkContext.scala:693)
[error]     at org.apache.spark.SparkContext.parallelize(SparkContext.scala:710)

This seems like a typical jar hell issue. And the issue is only with the unit tests. When the unit test runs were skipped, the data-validator was successfully deployed and ran fine on Hadoop 2 and Hadoop 3.

Thresholds parsed as JSON floats are ignored

Describe the bug

When specifying a check with a threshold that will parse to a JSON float, e.g.

threshold: 0.10 # will be ignored
threshold: 10% # works
threshold: "0.10" # works

the threshold will be ignored.

To Reproduce

Configure a check with:

type: nullCheck
column: foo
threshold: 0.10

or put that into a test in NullCheckSpec.

Expected behavior

Thresholds specified as floats should work.

Move back to Travis CI

Recent changes to GitHub Actions disable it for paid orgs that are on older plans. We have to go back to Travis since this is not likely to be resolved amenably anytime soon.

Attempt to send email should be retried if it fails

Currently, if sending email fails because the email server is temporarily offline or overloaded, the only choice of action is to rerun the whole validation. This can be very expensive, and may require manual intervention if the program is running as part of an automatic workflow.

It would be better if the program detected the error in sending email and did its own wait-and-retry loop. This would be pretty cheap and much better than failing.

Migrate your Actions workflows to the new syntax

The HCL syntax in GitHub Actions will stop working on September 30, 2019. We are contacting you because you've run workflows using HCL syntax in the last week in your account with the following repos: target/data-validator.

To continue using workflows that you created with the HCL syntax, you'll need to migrate the workflow files to the new YAML syntax. Once you have your YAML workflows ready, visit your repositories and follow the prompts to upgrade. Once you upgrade, your HCL workflows will stop working.

https://help.github.com/en/articles/migrating-github-actions-from-hcl-syntax-to-yaml-syntax

createKeySelect log msg is potentially redundant and should not be at error level

For example, given:

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

outputs:
  - filename: report.json
    append: false

tables:
  - db: census_income
    table: adult
    checks:
      - type: rangeCheck
        column: age
        minValue: 40
        maxValue: 50

In the output you will see:

...
...
21/01/14 14:35:32 ERROR ValidatorTable: createKeySelect: age, workclass keyColumns: None
...
...

This is not an error. It is merely informing you what the keyColumns are for ValidatorQuickCheckError details. In the case that the keyColumns are specified in the configuration, you will end up seeing them listed twice.

Drop env dump from JSON output

data-validator may expose secrets held in environment variables in the output JSON.

private def envToJson: Json = {
  val env = System.getenv.asScala.toList.map(x => (x._1, Json.fromString(x._2)))
  Json.obj(env: _*)
}

dumps the current environment into the output JSON via

("runtimeInfo", ValidatorConfig.runtimeInfoJson(spark)),

that calls

private def runtimeInfoJson(spark: SparkSession): Json = {

which includes it here.

It's safe to dump variables that data-validator accesses but it's unwise to dump everything.

Allow format+options to be passed before Hive query

We have a use case which necessitates constructing queries a certain way:

val df = spark.read.format("internal_format").option("database", "foo").load("select * from myTable")
// ... Spark magic ...
df.write.format("internal_format").option("database", "my_select_database").save()

DV doesn't have a way to allow the user to pass arbitrary format or a map of options in this manner. A particular Target-internal use case requires this and DV cannot be used for this use case until these are supported.

A proposed solution is to allow format: String and options: Object|Map[String, String] properties:

tables:
  - db: census_income
    table: adult
    format: internal_format
    options:
      database: census_income
      hive.vectorized.execution.reduce.enabled: "false"
    keyColumns:
      - age
      - occupation
    condition: educationNum >= 5
    checks:
      - type: rowCount
        minNumRows: 50000

Enable a configuration check using com.target.data_validator.ConfigParser#main

There is an unused com.target.data_validator.ConfigParser#main that could be exposed somehow to enable configuration testing.

def main(args: Array[String]): Unit = {
  logger.info(s"Args[${args.length}]: $args")
  val filename = args(0)
  var error = false
  parseFile(filename, Map.empty) match {
    case Left(pe) => logger.error(s"Failed to parse $filename, ${pe.getMessage}"); error = true
    case Right(config) => logger.info(s"Config: $config")
  }
  System.exit(if (error) 1 else 0)
}

Ideally, this should be a separate mode but minimally we could document how to use it locally to validate a configuration.

I think it could be as simple as documenting using it like this:

spark-submit data-validator-assembly-${version}.jar config.yaml

Error % not calculated correctly for ColumnBased checks

Example

CheckType: minNumRows
minValue: 1800
actualValue: 1144

In the above example error % should be (1800 - 1144) * 100/1800 = 36.44%
However, it is calculated as (1 * 100)/1144 = 0.09%

Similar issue exists for ColumnMaxCheck.

Refactor tests using `traits`

@colindean made some good suggestions in #14 around refactoring tests using traits (See comment)
I tried to create a few new utility functions in #13, but I'd like to see if we can do something like Colin suggested, and make a pass through the tests and use any new traits or functions to make them more concise.
I suspect we can reduce the duplicate code in the tests, greatly reduce the test SLOC and make it easier to develop tests.

SQL variable substitution fails when result is a double

Observed, for example, when combined with ColumnSumCheck. See example config below:

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

vars:
  - name: MAX_AGE
    sql: SELECT CAST(MAX(age) AS DOUBLE) FROM census_income.adult

outputs:
  - filename: report.json
    append: false

email:
  smtpHost: smtp.example.com
  subject: Data Validation Summary
  from: [email protected]
  to:
    - [email protected]

tables:
  - db: census_income
    table: adult
    checks:
      - type: columnSumCheck
        column: age
        minValue: $MAX_AGE
        inclusive: true

yields:

...
...
21/01/14 09:12:01 ERROR JsonUtils$: Unimplemented dataType 'double' in column: CAST(max(age) AS DOUBLE) Please report this as a bug.
21/01/14 09:12:01 INFO ValidatorConfig: substituteVariables()
21/01/14 09:12:02 INFO Substitutable$class: Substituting Json minValue Json: "$MAX_AGE" with `null`
...
...
21/01/14 09:12:02 ERROR ColumnSumCheck$$anonfun$configCheck$1: 'minValue' defined but type is not a Number, is: Null
21/01/14 09:12:02 ERROR ValidatorTable$$anonfun$1: ConfigCheck failed for HiveTable:`census_income.adult`
...
...

Streamline configuration for the same test applied to multiple columns

Currently, if I wanted to check for null values in each of the columns (age, occupation) of a table, the checks: section of the configuration file would contain something like this:

- type: nullCheck
  column: age

- type: nullCheck
  column: occupation

Ideally, we should support a more streamlined config. Something like:

- type: nullCheck
  columns: age, occupation

We would need to decide how to handle optional parameters in the streamlined case. One option is that we do not support streamlining if any optional parameters are specified:

- type: nullCheck
  column: age
  threshold: 1%

- type: nullCheck
  column: occupation
  threshold: 5%

Another option would be to allow additional parameters to be streamlined and applied in the same order as the specified columns:

- type: nullCheck
  columns: age, occupation
  thresholds: 1%, 5%

Test fails for ConfigVarSpec

When running sbt clean assembly on terminal, the following tests are failing:

[info]   - from Json snippet
[info]   - addEntry works *** FAILED ***
[info]     sut.addEntry(ConfigVarSpec.this.spark, varSub) was true (ConfigVarSpec.scala:70)
[info]   - asJson works
[info]   - var sub in env value *** FAILED ***
[info]     sut.addEntry(ConfigVarSpec.this.spark, varSub) was true (ConfigVarSpec.scala:83)
[info]   - var sub fails when value doesn't exist

Here is the ScalaTest output:

[info] ScalaTest
[info] Run completed in 3 minutes, 2 seconds.
[info] Total number of tests run: 330
[info] Suites: completed 25, aborted 0
[info] Tests: succeeded 328, failed 2, canceled 0, ignored 0, pending 0
[info] *** 2 TESTS FAILED ***
[error] Failed: Total 330, Failed 2, Errors 0, Passed 328
[error] Failed tests:
[error]         com.target.data_validator.validator.ConfigVarSpec
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 892 s (14:52), completed 10 Jan, 2022 8:36:27 AM

Ratchet up to newer baseline

Target's internal baseline is rebasing onto these facts:

  • Ubuntu
  • JDK 17
  • Scala 2.13 (or 2.12)
  • Spark 3.5.1

#166 will handle Spark 3.5.1 and sets the stage for JDK 17. It enables Scala 2.12, too, but keeps Scala 2.11. We'll want to roll off Scala 2.11 and onboard 2.13.

I don't think we've got anything that cares about the underlying distro.

After #166 is merged, we'll need to do some testing and work to ensure operability on JDK 17, including bumping CI workflows.

Rename "tables" concept

Is your feature request related to a problem? Please describe.

We've got ValidatorTable and tables in the config, but they're not really tables in the case of orc or parquet files. Let's get rid of the tables moniker and choose something else.

Describe the solution you'd like

ValidatorDataSource and sources might be more appropriate.

N.b. this would be a breaking change.

Add a 'sum of numeric column' check

Acceptance Criteria:

  • Implement the sum of numeric column
  • Usage should be documented in the read me
  • Test coverage should be added and passing

Range check configuration at debug log level

The range check configuration should be logged at debug level to make it consistent with how other row-based checks are logged.

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

outputs:
  - filename: report.json
    append: false

tables:
  - db: census_income
    table: adult
    checks:
      - type: rangeCheck
        column: age
        minValue: 40
        maxValue: 50

yields:

21/01/14 14:34:43 INFO Main$: Logging configured!
21/01/14 14:34:43 INFO Main$: Data Validator
21/01/14 14:34:43 INFO ConfigParser$: Parsing `issue.yaml`
21/01/14 14:34:43 INFO ConfigParser$: Attempting to load `issue.yaml` from file system
21/01/14 14:34:43 INFO RangeCheck$$anonfun$fromJson$1: RangeCheckJson: {
  "type" : "rangeCheck",
  "column" : "age",
  "minValue" : 4e1,
  "maxValue" : 5e1
}
...
...

Add support for variable substitution for `minNumRows` in `rowCount` check

We should be able to support the following use case or something analogous to it:

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

vars:
  - name: NUM_ROWS
    value: 1000
  
tables:
  - db: census_income
    table: adult
    checks:
      - type: rowCount
        minNumRows: $NUM_ROWS

Currently, trying the above configuration yields a fairly non-descriptive DecodingFailure

21/01/15 12:24:03 ERROR Main$: Failed to parse config file 'issue.yaml, {}
DecodingFailure(Attempt to decode value on failed cursor, List(DownField(parquetFile), DownArray, DownField(tables)))

It is noted in the documentation that this is not currently supported.

NoSuchMethodError when running with test data and test config

$ spark-submit --master "local[*]" $(ls -t target/scala-2.11/data-validator-assembly-*.jar | head -n 1) --config local_validators.yaml --jsonReport target/testreport.json --htmlReport target/testreport.html
20/04/07 18:04:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/04/07 18:04:34 INFO Main$: Logging configured!
20/04/07 18:04:34 INFO Main$: Data Validator
20/04/07 18:04:34 INFO ConfigParser$: Parsing `local_validators.yaml`
20/04/07 18:04:34 INFO ConfigParser$: Attempting to load `local_validators.yaml` from file system
20/04/07 18:04:35 INFO ValidatorConfig: substituteVariables()
20/04/07 18:04:35 INFO Substitutable$class: Substituting filename var: ${WORKDIR}/test.json with `/Users/z003xc4/Source/OSS/target_data-validator/test.json`
20/04/07 18:04:35 INFO Main$: Checking Cli Outputs htmlReport: Some(target/testreport.html) jsonReport: Some(target/testreport.json)
20/04/07 18:04:35 INFO Main$: filename: Some(target/testreport.html) append: false
20/04/07 18:04:35 INFO Main$: CheckFile Some(target/testreport.html)
20/04/07 18:04:35 INFO Main$: Checking file 'target/testreport.html append: false failed: false
20/04/07 18:04:35 INFO Main$: filename: Some(target/testreport.json) append: true
20/04/07 18:04:35 INFO Main$: CheckFile Some(target/testreport.json)
20/04/07 18:04:35 INFO Main$: Checking file 'target/testreport.json append: true failed: false
20/04/07 18:04:35 INFO ValidatorOrcFile: Reading orc file: testData.orc
20/04/07 18:04:36 INFO Main$: Running sparkChecks
20/04/07 18:04:36 INFO ValidatorConfig: Running Quick Checks...
20/04/07 18:04:36 INFO ValidatorOrcFile: Reading orc file: testData.orc
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias$.apply$default$4(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;)Lscala/Option;
        at com.target.data_validator.ValidatorTable.createCountSelect(ValidatorTable.scala:33)
        at com.target.data_validator.ValidatorTable.quickChecks(ValidatorTable.scala:87)
        at com.target.data_validator.ValidatorConfig$$anonfun$quickChecks$1.apply(ValidatorConfig.scala:51)
        at com.target.data_validator.ValidatorConfig$$anonfun$quickChecks$1.apply(ValidatorConfig.scala:51)
        at scala.collection.immutable.List.map(List.scala:284)
        at com.target.data_validator.ValidatorConfig.quickChecks(ValidatorConfig.scala:51)
        at com.target.data_validator.Main$.runSparkChecks(Main.scala:80)
        at com.target.data_validator.Main$$anonfun$2.apply(Main.scala:106)
        at com.target.data_validator.Main$$anonfun$2.apply(Main.scala:100)
        at scala.Option.map(Option.scala:146)
        at com.target.data_validator.Main$.runChecks(Main.scala:99)
        at com.target.data_validator.Main$.loadConfigRun(Main.scala:27)
        at com.target.data_validator.Main$.main(Main.scala:170)
        at com.target.data_validator.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

local_validators.yaml:

---
numKeyCols: 2
numErrorsToReport: 5
detailedErrors: true
vars:
- name: WORKDIR
  env: PWD
tables:
- orcFile: testData.orc
  checks:
  - type: rowCount
    minNumRows: 1000
    #  - type: nullCheck
    #    column: nullCol
outputs:
- filename: ${WORKDIR}/test.json

Environment:

  • Spark version 2.4.4
  • Scala version 2.11.12

java.lang.IllegalArgumentException when using parquet file

When trying to run a config check on a parquet file, the following error can be seen:

root@lubuntu:/home/jyoti/Spark# /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --num-executors 10 --executor-cores 2 data-validator-assembly-20220111T034941.jar --config config.yaml
22/01/11 11:50:53 WARN Utils: Your hostname, lubuntu resolves to a loopback address: 127.0.1.1; using 192.168.195.131 instead (on interface ens33)
22/01/11 11:50:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/01/11 11:50:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/11 11:50:59 INFO Main$: Logging configured!
22/01/11 11:51:00 INFO Main$: Data Validator
22/01/11 11:51:01 INFO ConfigParser$: Parsing `config.yaml`
22/01/11 11:51:01 INFO ConfigParser$: Attempting to load `config.yaml` from file system
Exception in thread "main" java.lang.ExceptionInInitializerError
	at com.target.data_validator.validator.RowBased.<init>(RowBased.scala:11)
	at com.target.data_validator.validator.NullCheck.<init>(NullCheck.scala:12)
	at com.target.data_validator.validator.NullCheck$.fromJson(NullCheck.scala:37)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$decoders$2.apply(JsonDecoders.scala:16)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$decoders$2.apply(JsonDecoders.scala:16)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$2.apply(JsonDecoders.scala:32)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$2.apply(JsonDecoders.scala:32)
	at scala.Option.map(Option.scala:230)
	at com.target.data_validator.validator.JsonDecoders$$anon$7.com$target$data_validator$validator$JsonDecoders$$anon$$getDecoder(JsonDecoders.scala:32)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$apply$3.apply(JsonDecoders.scala:27)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$apply$3.apply(JsonDecoders.scala:27)
	at cats.syntax.EitherOps$.flatMap$extension(either.scala:149)
	at com.target.data_validator.validator.JsonDecoders$$anon$7.apply(JsonDecoders.scala:27)
	at io.circe.SeqDecoder.apply(SeqDecoder.scala:17)
	at io.circe.Decoder$class.tryDecode(Decoder.scala:36)
	at io.circe.SeqDecoder.tryDecode(SeqDecoder.scala:6)
	at com.target.data_validator.ConfigParser$anon$importedDecoder$macro$15$1$$anon$6.apply(ConfigParser.scala:21)
	at io.circe.generic.decoding.DerivedDecoder$$anon$1.apply(DerivedDecoder.scala:13)
	at io.circe.Decoder$$anon$28.apply(Decoder.scala:178)
	at io.circe.Decoder$$anon$28.apply(Decoder.scala:178)
	at io.circe.SeqDecoder.apply(SeqDecoder.scala:17)
	at io.circe.Decoder$class.tryDecode(Decoder.scala:36)
	at io.circe.SeqDecoder.tryDecode(SeqDecoder.scala:6)
	at com.target.data_validator.ConfigParser$anon$importedDecoder$macro$81$1$$anon$10.apply(ConfigParser.scala:28)
	at io.circe.generic.decoding.DerivedDecoder$$anon$1.apply(DerivedDecoder.scala:13)
	at io.circe.Json.as(Json.scala:106)
	at com.target.data_validator.ConfigParser$.configFromJson(ConfigParser.scala:28)
	at com.target.data_validator.ConfigParser$$anonfun$parse$1.apply(ConfigParser.scala:65)
	at com.target.data_validator.ConfigParser$$anonfun$parse$1.apply(ConfigParser.scala:65)
	at cats.syntax.EitherOps$.flatMap$extension(either.scala:149)
	at com.target.data_validator.ConfigParser$.parse(ConfigParser.scala:65)
	at com.target.data_validator.ConfigParser$.parseFile(ConfigParser.scala:60)
	at com.target.data_validator.Main$.loadConfigRun(Main.scala:23)
	at com.target.data_validator.Main$.main(Main.scala:171)
	at com.target.data_validator.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to bigint, but class Integer found.
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:219)
	at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:296)
	at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:144)
	at com.target.data_validator.validator.ValidatorBase$.<init>(ValidatorBase.scala:139)
	at com.target.data_validator.validator.ValidatorBase$.<clinit>(ValidatorBase.scala)
	... 47 more

Ran a spark-submit job as follows:

spark-submit --num-executors 10 --executor-cores 2 data-validator-assembly-20220111T034941.jar --config config.yaml

The config.yaml file has the following content:

numKeyCols: 2
numErrorsToReport: 742

tables:
  - parquetFile: /home/jyoti/Spark/userdata1.parquet
    checks:
      - type: nullCheck
        column: salary

I got the userdata1.parquet from the following github link:
https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet

Environment Details:
latest source code: data-validator-0.13.0
Lubuntu 18.04 LTS x64 version on VMWare Player
4 CPU cores and 2GB ram
Java version

yoti@lubuntu:~$ java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

lsb_release output:

jyoti@lubuntu:~$ lsb_release -a 2>/dev/null
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04 LTS
Release:	18.04
Codename:	bionic

uname -s:

jyoti@lubuntu:~$ uname -s
Linux

sbt -version:

root@lubuntu:/home/jyoti/Spark# sbt -version
downloading sbt launcher 1.6.1
[info] [launcher] getting org.scala-sbt sbt 1.6.1  (this may take some time)...
[info] [launcher] getting Scala 2.12.15 (for sbt)...
sbt version in this project: 1.6.1
sbt script version: 1.6.1

Please let me know if you need anything else.

[SECURITY] Releases are built/executed/released in the context of insecure/untrusted code

CWE-829: Inclusion of Functionality from Untrusted Control Sphere
CWE-494: Download of Code Without Integrity Check

The build files indicate that this project is resolving dependencies over HTTP instead of HTTPS. Any of these artifacts could have been intercepted via MITM to maliciously compromise them and infect the build artifacts that were produced. Additionally, if any of these JARs or other dependencies were compromised, any developers using them could remain infected even after updating to fix this.

This vulnerability has a CVSS v3.0 Base Score of 8.1/10
https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator?vector=AV:N/AC:H/PR:N/UI:N/S:U/C:H/I:H/A:H

This isn't just theoretical

POC code has existed since 2014 to maliciously compromise a JAR file inflight.
See:

MITM Attacks Increasingly Common

See:

Source Locations

resolvers += "Concurrent Conjars repository" at "http://conjars.org/repo"

distinctCountCheck as a validator

It would be great to have a distinctCountCheck validator that checks the number of distinct values in a column of a given table, and that this number matches a user provided value.

Unknown fields in the check section of `.yaml` should cause `WARN` log messages.

While testing stringLengthCheck, I accidentally referenced minLength instead of minValue.
This caused configTest() to fail for no apparent reason and took me a really long time to debug because the program was not logging any useful information. configTest() did generate a ValidatorError() event in the eventLog, but the program doesn't write the report.json or HTML report on configTest() failures.

The new Object.fromJson() constructors should log a warn for every unknown field present in the config.
In general, I do not think that unknown fields should be an error, only a warning; this helps keep the config "compatible" across versions.
Maybe create a cli option for strict config parsing.
