
Avro Validator

A pure python avro schema validator.

The default Avro library for Python provides validation of data against a schema, but the output of this validation doesn't include any information about the error: all you get is the generic the datum is not an example of the schema error message.

When working with bigger Avro schemas, it is sometimes not easy to visually find the field that has an issue.

This library provides clearer exceptions when validating data against an Avro schema, making it easier to identify the field that is not compliant with the schema and the problem with that field.

Installing

Install using pip:

$ pip install -U avro_validator

Validating data against Avro schema

The validator can be used as a console application. It receives a schema file and a data file, validates the data, and reports an error message in case of failure.

The avro_validator can also be used as a library in python code.

Console usage

In order to validate the data_to_validate.json file against schema.avsc using the avro_validator callable, just type:

$ avro_validator schema.avsc data_to_validate.json
OK

Since the data is valid according to the schema, the return message is OK.
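For reference, a minimal pair of files that would produce this result could look as follows (hypothetical content, not taken from the project):

```json
{
    "name": "example",
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"}
    ]
}
```

with a matching data_to_validate.json of `{"name": "My Name"}`.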

Error validating the data

If the data is not valid, the program returns an error message:

$ avro_validator schema.avsc data_to_validate.json
Error validating value for field [data,my_boolean_value]: The value [123] is not from one of the following types: [[NullType, BooleanType]]

This message indicates that the field my_boolean_value inside the data dictionary has the value 123, which is compatible with neither null nor boolean.
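The error above corresponds to a union field declared roughly like this (a hypothetical fragment, reconstructed from the error message; the actual schema is not shown):

```json
{"name": "my_boolean_value", "type": ["null", "boolean"]}
```

The path [data,my_boolean_value] in the message indicates that the field lives inside a record field named data.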

Command usage

It is possible to get information about the usage of avro_validator using the help option:

$ avro_validator -h

Library usage

Using schema file

When using avro_validator as a library, it is possible to pass the schema as a file:

from avro_validator.schema import Schema

schema_file = 'schema.avsc'

schema = Schema(schema_file)
parsed_schema = schema.parse()

data_to_validate = {
    'name': 'My Name'
}

parsed_schema.validate(data_to_validate)

In this example, if the data_to_validate is valid according to the schema, then the parsed_schema.validate(data_to_validate) call will return True.

Using a dict as schema

It is also possible to provide the schema as a JSON string:

import json
from avro_validator.schema import Schema

schema = json.dumps({
    'name': 'test schema',
    'type': 'record',
    'doc': 'schema for testing avro_validator',
    'fields': [
        {
            'name': 'name',
            'type': 'string'
        }
    ]
})

schema = Schema(schema)
parsed_schema = schema.parse()

data_to_validate = {
    'name': 'My Name'
}

parsed_schema.validate(data_to_validate)

In this example, the parsed_schema.validate(data_to_validate) call will return True, since the data is valid according to the schema.

Invalid data

If the data is not valid, the parsed_schema.validate will raise a ValueError, with the message containing the error description.

import json
from avro_validator.schema import Schema

schema = json.dumps({
    'name': 'test schema',
    'type': 'record',
    'doc': 'schema for testing avro_validator',
    'fields': [
        {
            'name': 'name',
            'type': 'string',
            'doc': 'Field that stores the name'
        }
    ]
})

schema = Schema(schema)
parsed_schema = schema.parse()

data_to_validate = {
    'my_name': 'My Name'
}

parsed_schema.validate(data_to_validate)

The schema expects a single field named name, but the data contains only the field my_name, making it invalid according to the schema. In this case, the validate method will raise the following error:

Traceback (most recent call last):
  File "/Users/leonardo.almeida/.pyenv/versions/avro_validator_venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-a5e8ce95d21c>", line 23, in <module>
    parsed_schema.validate(data_to_validate)
  File "/opt/dwh/avro_validator/avro_validator/avro_types.py", line 563, in validate
    raise ValueError(f'The fields from value [{value}] differs from the fields '
ValueError: The fields from value [{'my_name': 'My Name'}] differs from the fields of the record type [{'name': RecordTypeField <name: name, type: StringType, doc: Field that stores the name, default: None, order: None, aliases: None>}]

The message is detailed enough to enable the developer to pinpoint the error in the data.

Invalid schema

If the schema is not valid according to the Avro specification, the parse method will also raise a ValueError.

import json
from avro_validator.schema import Schema

schema = json.dumps({
    'name': 'test schema',
    'type': 'record',
    'doc': 'schema for testing avro_validator',
    'fields': [
        {
            'name': 'name',
            'type': 'invalid_type',
            'doc': 'Field that stores the name'
        }
    ]
})

schema = Schema(schema)
parsed_schema = schema.parse()

Since the schema tries to define the name field as invalid_type, the schema declaration is invalid, thus the following exception will be raised:

Traceback (most recent call last):
  File "/Users/leonardo.almeida/.pyenv/versions/avro_validator_venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-7f3f77000f08>", line 18, in <module>
    parsed_schema = schema.parse()
  File "/opt/dwh/avro_validator/avro_validator/schema.py", line 28, in parse
    return RecordType.build(schema)
  File "/opt/dwh/avro_validator/avro_validator/avro_types.py", line 588, in build
    record_type.__fields = {field['name']: RecordTypeField.build(field) for field in json_repr['fields']}
  File "/opt/dwh/avro_validator/avro_validator/avro_types.py", line 588, in <dictcomp>
    record_type.__fields = {field['name']: RecordTypeField.build(field) for field in json_repr['fields']}
  File "/opt/dwh/avro_validator/avro_validator/avro_types.py", line 419, in build
    field.__type = cls.__build_field_type(json_repr)
  File "/opt/dwh/avro_validator/avro_validator/avro_types.py", line 401, in __build_field_type
    raise ValueError(f'Error parsing the field [{fields}]: {actual_error}')
ValueError: Error parsing the field [name]: The type [invalid_type] is not recognized by Avro

The message clearly indicates that invalid_type is not recognized by Avro.

avro_validator's People

Contributors

helver, leocalm, manmat, pawndev, prakashautade

avro_validator's Issues

Cannot parse schema with top level array

The following schema fails with error: ValueError: The RecordType must have {'name', 'fields'} defined.

{
	"name": "test",
	"type": "array",
	"items": {
		"type": "string"
	}
}

It appears that the schema parser only accepts schemas with the top level being a record.
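Until top-level arrays are supported, one workaround is to wrap the array in a single-field record and wrap the data accordingly (a sketch; the wrapper names here are made up):

```json
{
    "name": "test_wrapper",
    "type": "record",
    "fields": [
        {"name": "items", "type": {"type": "array", "items": "string"}}
    ]
}
```

The data then has to be validated as `{"items": [...]}` rather than as a bare array.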

Schema constructor does not close the file

Schema constructor currently leaves the schema file open until it's GC'd (unless I misunderstand the semantics of open()).

Current code:

self._schema = open(schema, 'r').read()

This would be more reliable:

with open(schema, 'r') as f:
   self._schema = f.read()

Is there a way to skip the extra_fields check ?

Hi!

Thanks for your awesome, developer friendly, library.
I just have a question: I began migrating my NiFi workflow to an Airflow workflow, and I want to know if there is a way to skip the check for extra fields. I read a CSV with the pandas library, but I just want to check some columns against my Avro schema, not all fields.

Kind regards

does not support avro union, i.e., .avsc file is a JSON array?

Hello,

I am doing in command line

avro_validator union_schema.avsc producing_message.json

My union_schema.avsc is a JSON array with different dependent objects inside; an example is given below.

[
{
    "type": "record",
    "namespace": "com.company.model",
    "name": "AddressRecord",
    "fields": [
        {
            "name": "streetaddress",
            "type": "string"
        },
        {
            "name": "city",
            "type": "string"
        }
    ]
},
{
    "namespace": "com.company.model",
    "type": "record",
    "name": "person",
    "fields": [
        {
            "name": "firstname",
            "type": "string"
        },
        {
            "name": "lastname",
            "type": "string"
        },
        {
            "name": "address",
            "type": {
                "type": "array",
                "items": "com.company.model.AddressRecord"
            }
        }
    ]
}
]

When I try to validate through the command line, I get this error:

Traceback (most recent call last):
  File "/.local/bin/avro_validator", line 8, in <module>
    sys.exit(main())
  File "/.local/lib/python3.6/site-packages/avro_validator/cli.py", line 28, in main
    parsed_schema = schema.parse()
  File "/.local/lib/python3.6/site-packages/avro_validator/schema.py", line 28, in parse
    return RecordType.build(schema)
  File "/.local/lib/python3.6/site-packages/avro_validator/avro_types.py", line 647, in build
    cls._validate_json_repr(json_repr)
  File "/.local/lib/python3.6/site-packages/avro_validator/avro_types.py", line 63, in _validate_json_repr
    if cls.required_attributes.intersection(json_repr.keys()) != cls.required_attributes:
AttributeError: 'list' object has no attribute 'keys'

I couldn't find info in README in this repo about avro union. What should I do to make this work? Thanks

Valid data files with nullable fields not correctly validated

Avro records with nullable fields (based on union types) encoded as JSON are not correctly parsed by the avro_validator CLI. The error is due to an inconsistency between how this tool interprets union types and how they're encoded in JSON (link to docs).

Specifically:

For example, the union schema ["null","string","Foo"], where Foo is a
record name, would encode:

  1. null as null;
  2. string "a" as {"string": "a"}; and
  3. a Foo instance as {"Foo": {...}}, where {...} indicates the JSON encoding of a Foo instance.

The following schema includes some nullable fields, which can be used to
generate some random data.

{
  "type" : "record",
  "name" : "test",
  "namespace" : "com.example",
  "fields" : [ {
    "name" : "name",
    "type" : "string"
  }, {
    "name" : "null_name1",
    "type" : [ "null", "string" ]
  }, {
    "name" : "null_name2",
    "type" : [ "string", "null" ]
  }, {
    "name" : "num",
    "type" : "int"
  }, {
    "name" : "null_num1",
    "type" : [ "null", "int" ]
  }, {
    "name" : "null_num2",
    "type" : [ "int", "null" ]
  } ]
}

The following record fails validation against the above schema:

{
  "name": "snhepdirqromqkgllhgljumtuj",
  "null_name1": null,
  "null_name2": null,
  "num": 186374858,
  "null_num1": {
    "int": -1433093325
  },
  "null_num2": {
    "int": -1728851584
  }
}
$ avro_validator test.avsc test.json
Error validating value for field [null_num1]: The value [{'int': -1433093325}] is not from one of the following types: [[NullType, IntType]]
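One pragmatic workaround until the CLI understands the JSON union encoding is to strip the wrappers from the data before validating. The sketch below assumes a union wrapper is a single-key object keyed by a primitive type name; unwrap_unions is a made-up helper, not part of avro_validator, and it cannot disambiguate wrappers keyed by record names (like {"Foo": {...}}) without consulting the schema.

```python
PRIMITIVE_NAMES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}

def unwrap_unions(value):
    """Recursively replace Avro JSON union wrappers like {"int": 5} with 5."""
    if isinstance(value, dict):
        if len(value) == 1:
            (key, inner), = value.items()
            if key in PRIMITIVE_NAMES:
                return unwrap_unions(inner)
        return {k: unwrap_unions(v) for k, v in value.items()}
    if isinstance(value, list):
        return [unwrap_unions(v) for v in value]
    return value
```

Running the failing record through such a pre-processing step turns `{"null_num1": {"int": -1433093325}}` into `{"null_num1": -1433093325}`, which matches what the validator expects.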

logicalType is not supported

schema = json.dumps({
    'name': 'test schema',
    'type': 'record',
    'doc': 'schema for testing avro_validator',
    'fields': [
                            {
                        "name": "event_time",
                        "type": "long",
                        "logicalType": "timestamp-millis"
                    }
    ]
})

schema = Schema(schema)
parsed_schema = schema.parse()

get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/schema.py", line 28, in parse
    return RecordType.build(schema, skip_extra_keys=skip_extra_keys)
  File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 673, in build
    record_type.__fields = {
  File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 674, in <dictcomp>
    field['name']: RecordTypeField.build(
  File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 440, in build
    cls._validate_json_repr(json_repr, skip_extra_keys=skip_extra_keys)
  File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 72, in _validate_json_repr
    raise ValueError(f'The {cls.__name__} can only contains '
ValueError: The RecordTypeField can only contains {'doc', 'default', 'name', 'order', 'aliases', 'type'} keys, but does contain also {'logicalType'}

Date/Datetime Validation

Avro now accepts Python datetime objects for int/long fields with logicalTypes in the date/timestamp family. It'd be great if this validator could do the same.

Validator doesn't understand recursive schemas

ValueError: The type [Actor] is not recognized by Avro

import json
from avro_validator.schema import Schema

SCHEMA = {
    "name": "Actor",
    "type": "record",
    "fields": [
        {
            "name": "actedBy",
            "type": ["null", "Actor"],
        }
    ]
}

Schema(json.dumps(SCHEMA)).parse()

logicalType in schema

Hi there,

I'm getting this error The RecordTypeField can only contains {'order', 'aliases', 'type', 'doc', 'name', 'default'} keys, even though it looks like it's supported according to this.

They seem to have some other fields listed in the docs there, but I haven't tried those myself. Fixing this issue could resolve that as well; it would be nice to have some checking and support for the new fields.

Thanks for this project!

Problem with Schema constructor

Hi all! It looks like we're running into problem when we've got a schema as JSON in a variable that we pass into the Schema class constructor.

>       schema = Schema(json.dumps(local_schema))
18:34:32 
18:34:32 test/test_sdp_producer.py:451: 
18:34:32 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
18:34:32 lib/python3.7/site-packages/avro_validator/schema.py:17: in __init__
18:34:32     if file_path.exists():
18:34:32 /usr/local/lib/python3.7/pathlib.py:1329: in exists
18:34:32     self.stat()
18:34:32 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
18:34:32 
18:34:32 self = PosixPath('{"type": "record", "name": "Invoice", "fields": [{"name": "enterpriseEventEnvelope", "type": {"type": "reco...e": ["null", {"type": "record", "name": "DomainPayLoadRecord", "fields": [{"name": "eventId", "type": "string"}]}]}]}')
18:34:32 
18:34:32     def stat(self):
18:34:32         """
18:34:32         Return the result of the stat() system call on this path, like
18:34:32         os.stat() does.
18:34:32         """
18:34:32 >       return self._accessor.stat(self)
18:34:32 E       OSError: [Errno 36] File name too long: '{"type": "record", "name": "Invoice", "fields": [{"name": "enterpriseEventEnvelope", "type": {"type": "record", "name": "EnterpriseEventEnvelopeRecord", "fields": [{"name": "eventId", "type": "string"}]}}, {"name": "domainPayload", "type": ["null", {"type": "record", "name": "DomainPayLoadRecord", "fields": [{"name": "eventId", "type": "string"}]}]}]}'
18:34:32 
18:34:32 /usr/local/lib/python3.7/pathlib.py:1151: OSError

As far as I can tell, this has only become a problem in the last day or so, with the addition of the Path library. Perhaps we need to wrap the file_path.exists() check in a try block and fall back to the else clause when we catch an OSError?

Neither "null" in the union type nor default are respected

I have the following field declared in my schema, but missing in the data:

    {
      "name": "myField",
      "type": [
        "null",
        {
          "type": "map",
          "values": {
            "type": "string",
            "avro.java.string": "String"
          },
          "avro.java.string": "String"
        }
      ],
      "default": null
    }

I would expect the schema validator to ignore it given that it both has "null" as a part of union type and as default. However, the validator throws the following error:

ValueError: Error parsing the field [surfaceIds]: The MapType can only contains {'values', 'type'} keys
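Until non-reserved metadata attributes are accepted, a possible workaround is to strip them from the schema dict before parsing (a sketch; strip_metadata is a made-up helper, and the set of keys to remove is an assumption based on this report):

```python
METADATA_KEYS = {"avro.java.string"}

def strip_metadata(node):
    """Recursively drop metadata attributes that the parser rejects."""
    if isinstance(node, dict):
        return {k: strip_metadata(v) for k, v in node.items() if k not in METADATA_KEYS}
    if isinstance(node, list):
        return [strip_metadata(v) for v in node]
    return node
```

The cleaned dict can then be dumped back to JSON and handed to Schema.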

Failing to parse two-dimensional nested array of ints

Just found this project today and it has been very helpful, but I ran into a problem with nested arrays.
The following avro schema:

{
    "namespace": "com.company.code",
    "type": "record",
    "name": "MyAvroSchema",
    "doc" : "...",
    "fields": [
        {
            "name": "MyArray",
            "type": {
                "type": "array",
                "items": {
                    "type": {"type": "array","items": "int"}
                }
            }
        }
    ]
}

generates the following error:

Traceback (most recent call last):
  File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/Users/user/Library/Application Support/JetBrains/Toolbox/apps/PyCharm-P/ch-0/222.4459.20/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Users/user/Library/Application Support/JetBrains/Toolbox/apps/PyCharm-P/ch-0/222.4459.20/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/user/arrhythmia/file_schemas/validate_schema.py", line 5, in <module>
    parsed_schema = schema.parse()
  File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/schema.py", line 28, in parse
    return RecordType.build(schema, skip_extra_keys=skip_extra_keys)
  File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 812, in build
    record_type.__fields = {
  File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 813, in <dictcomp>
    field['name']: RecordTypeField.build(
  File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 585, in build
    field.__type = cls.__build_field_type(json_repr, custom_fields)
  File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 545, in __build_field_type
    return cls._get_field_from_json(json_repr['type'], custom_fields)
  File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 210, in _get_field_from_json
    return getattr(sys.modules[__name__], FIELD_MAPPING[field_type['type']]).build(field_type, custom_fields)
  File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 1016, in build
    array_type.__items = ArrayType._get_field_from_json(json_repr['items'], custom_fields)
  File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 210, in _get_field_from_json
    return getattr(sys.modules[__name__], FIELD_MAPPING[field_type['type']]).build(field_type, custom_fields)
TypeError: unhashable type: 'dict'

If I remove the inner array it parses correctly.
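For comparison, the Avro specification declares a two-dimensional array without wrapping the inner array in an extra {"type": ...} object; the spec form of the field would be (whether this variant parses with this library is untested here):

```json
{
    "name": "MyArray",
    "type": {
        "type": "array",
        "items": {"type": "array", "items": "int"}
    }
}
```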

Schema constructor fails on a long json string

I'm calling avro_validator.schema.Schema(s) where s is a json string and len(s)==33335.
The constructor throws the following exception:

{ValueError}stat: path too long for Windows

It's probably not correct to use os.path.isfile() to distinguish between files and JSON strings. The simplest fix would be to wrap it in a try/except and treat an exception as another indicator that it's not a file.
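The suggested fix can be sketched as a small guard (looks_like_file is a hypothetical helper, not the library's actual code):

```python
import os

def looks_like_file(schema: str) -> bool:
    """Treat any OS-level stat failure (e.g. 'path too long') as 'not a file path'."""
    try:
        return os.path.isfile(schema)
    except (OSError, ValueError):
        return False
```

With this guard, an overlong JSON string simply falls through to being parsed as an inline schema instead of crashing the constructor.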

RecordType ValueError is too strict

When using fields in my .avsc schema other than the standard:

name: a JSON string providing the name of the record (required).
namespace, a JSON string that qualifies the name;
doc: a JSON string providing documentation to the user of this schema (optional).
aliases: a JSON array of strings, providing alternate names for this record (optional).
fields: a JSON array, listing fields (required). Each field is a JSON object with the following attributes:

I get an error that these are the only fields allowed in RecordType.
Here is the line of code: https://github.com/leocalm/avro_validator/blob/master/avro_validator/avro_types.py#L67

But in the avro documentation, it states this is allowed: "...permitted as metadata..."
https://avro.apache.org/docs/1.10.2/spec.html#schemas

A Schema is represented in JSON by one of:

A JSON string, naming a defined type.
A JSON object, of the form:
{"type": "typeName" ...attributes...}
where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.

Could you either fix this, or give me write access so I can fix it myself in a branch?

Thank you.

Raise error or return False when validate

def validate(self, value: Any) -> bool:

For now, the above validate function raises an exception upon invalid data and returns True otherwise. Would it be better to have another optional parameter indicating whether to raise the exception? Something like this:

def validate(self, value: Any, raise_errors=True) -> bool:

Without this, users have to do the following to get around it if they only care about whether the data is valid (not about what exactly the error is). I think this scenario is quite common; e.g. users may only want to process data that matches the schema and simply discard the rest.

valid = False
try:
    valid = schema.validate(data)
except ValueError:
    pass
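That workaround can be wrapped once in a small helper (a sketch; is_valid is a made-up name, and validate raising ValueError on invalid data is the behaviour described above):

```python
def is_valid(parsed_schema, data) -> bool:
    """Return False instead of letting validate() raise on invalid data."""
    try:
        return parsed_schema.validate(data)
    except ValueError:
        return False
```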
