Code Monkey home page Code Monkey logo

schema-analyzer's Introduction

Build Status Codacy Badge GitHub package version GitHub stars Node.js CI npm bundle size npm downloads codecov

Schema Analyzer

An Open Source joint by Dan Levy โœจ

Analyze column type & size summary from any input JSON array

Schema Analyzer is the core library behind Dan's Schema Generator.

Features

The primary goal is to support any input JSON/CSV and infer as much as possible. More data will generally yield better results.

  • Heuristic type analysis for arrays of objects.
  • Browser-based (local, no server necessary)
  • Automatic type detection:
    • ID - Identifier column, by name and unique Integer check (detects BigInteger)
    • ObjectId (MongoDB's 96 bit/12 Byte ID. 32bit timestamp + 24bit MachineID + 16bit ProcessID + 24bit Counter)
    • UUID/GUID (Common 128 bit/16 Byte ID. Stored as a hex string, dash delimited in parts: 8, 4, 4, 4, 12)
    • Boolean (detects obvious strings true, false, Y, N)
    • Date (Smart detection via comprehensive regex pattern)
    • Timestamp (integer, number of milliseconds since unix epoch)
    • Currency (62 currency symbols supported)
    • Float (w/ scale & precision measurements)
    • Number (Integers)
    • Null (sparse column data helps w/ certain inferences)
    • Email (falls back to string)
    • String (big text and varchar awareness)
    • Array (includes min/max/avg length)
    • Object
  • Detects column size minimum, maximum and average
  • Includes data points at the 30th, 60th and 90th percentiles (for detecting outliers and enum types!)
  • Handles some error/outliers
  • Quantify # of unique values per column
  • Identify enum Fields w/ Values
  • Identify Not Null fields
  • Nested data structure & multi-table relational output.

Getting Started

npm install schema-analyzer
import { schemaBuilder } from 'schema-builder'

schemaBuilder(schemaName: String, data: Array<Object>): TypeSummary

Preview Analysis Results

What does this library's analysis look like?

It consists of 3 key top-level properties:

  • totalRows - # of rows analyzed.
  • fields: FieldTypeSummary - a map of field names with all detected types (includes meta-data for each type detected, with possible overlaps. e.g. an Email is also a String, "42" is a String and Number)

Review the raw results below

Details about each field can be found below.

{
  "totalRows": 5,
  "fields": {
    "id": {
      "types": {
        "Number": {
          "rank": 8,
          "count": 5,
          "value": { "min": 1, "mean": 3, "max": 5, "p25": 2, "p33": 2, "p50": 3, "p66": 4, "p75": 4, "p99": 5 }
        },
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 1, "mean": 1, "max": 1, "p25": 1, "p33": 1, "p50": 1, "p66": 1, "p75": 1, "p99": 1 }
        }
      }
    },
    "name": {
      "types": {
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 3, "mean": 7.2, "max": 15, "p25": 3, "p33": 3, "p50": 5, "p66": 10, "p75": 10, "p99": 15 }
        }
      }
    },
    "role": {
      "types": {
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 4, "mean": 5.4, "max": 9, "p25": 4, "p33": 4, "p50": 5, "p66": 5, "p75": 5, "p99": 9 }
        }
      }
    },
    "email": {
      "types": {
        "Email": {
          "rank": 11,
          "count": 5,
          "length": { "min": 15, "mean": 19.4, "max": 26, "p25": 15, "p33": 15, "p50": 18, "p66": 23, "p75": 23, "p99": 26 }
        }
      }
    },
    "createdAt": {
      "types": {
        "Date": {
          "rank": 4,
          "count": 4,
          "value": { "min": "2001-01-01T00:00:00.000Z", "mean": "2015-04-14T18:00:00.000Z", "max": "2020-02-02T00:00:00.000Z", "p25": "2020-02-02T00:00:00.000Z", "p33": "2020-02-02T00:00:00.000Z", "p50": "2019-12-31T00:00:00.000Z", "p66": "2019-12-31T00:00:00.000Z", "p75": "2001-01-01T00:00:00.000Z", "p99": "2001-01-01T00:00:00.000Z" }
        },
        "String": {
          "rank": 12,
          "count": 1,
          "length": { "min": 6, "mean": 6, "max": 6, "p25": 6, "p33": 6, "p50": 6, "p66": 6, "p75": 6, "p99": 6 }
        }
      }
    },
    "accountConfirmed": {
      "types": {
        "Unknown": {
          "rank": -1,
          "count": 1
        },
        "String": {
          "rank": 12,
          "count": 1,
          "length": { "min": 9, "mean": 9, "max": 9, "p25": 9, "p33": 9, "p50": 9, "p66": 9, "p75": 9, "p99": 9 }
        },
        "Boolean": {
          "rank": 3,
          "count": 4
        }
      }
    }
  }
}

Sample input dataset for the example results above

id name role email createdAt accountConfirmed
1 Eve poweruser [email protected] 01/20/2020 undefined
2 Alice user [email protected] 02/02/2020 true
3 Bob user [email protected] 12/31/2019 true
4 Elliot Alderson admin [email protected] 01/01/2001 false
5 Sam Sepiol admin [email protected] 9/9/99 true

AggregateSummary

Numeric and String types include a summary of the observed field sizes:

Number & String Range Object Details

Properties
  • min the minimum number or string length
  • max the maximum number or string length
  • mean the average number or string length
  • percentiles[25th, 33th, 50th, 66th, 75th, 99th] values from the Nth percentile number or string length

Percentile is based on input data, as-is with out sorting.

Length Range Data

Range data for the length of a String field type:

{
  "rank": 11,
  "count": 5,
  "length": { "min": 15, "mean": 19.4, "max": 26, "p25": 15, "p33": 15, "p50": 18, "p66": 23, "p75": 23, "p99": 26 }
}

This is useful for defining strict length limits or minimums, for example as SQL servers often require..

Range data for a Date fields value:

{
  "rank": 4,
  "count": 4,
  "value": { "min": "2001-01-01T00:00:00.000Z", "mean": "2015-04-14T18:00:00.000Z", "max": "2020-02-02T00:00:00.000Z", "p25": "2020-02-02T00:00:00.000Z", "p33": "2020-02-02T00:00:00.000Z", "p50": "2019-12-31T00:00:00.000Z", "p66": "2019-12-31T00:00:00.000Z", "p75": "2001-01-01T00:00:00.000Z", "p99": "2001-01-01T00:00:00.000Z" }
}

Notes

We recommend you provide at least 100+ rows. Accuracy increases greatly with 1,000 rows.

The following features require a certain minimum # of records:

  • Enumeration detection.
    • 100+ Rows Required.
    • Number of unique values must not exceed 20 or 5% of the total number of records. (100 records will identify as Enum w/ 5 values. Up to 20 are possible given 400 or 1,000+.)
  • Not Null detection.
    • where rowCount === field count

Full List of Detected Types

  • Unknown
  • ObjectId
  • UUID
  • Boolean
  • Date
  • Timestamp
  • Currency
  • Float
  • Number
  • Email
  • String
  • Array
  • Object
  • Null

schema-analyzer's People

Contributors

dependabot[bot] avatar justsml avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

thomasdfowler

schema-analyzer's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.