
DataHub


Synthetic data generation

DataHub is a set of Python libraries dedicated to the production of synthetic data for use in tests, machine learning training, statistical analysis, and other use cases (wiki). DataHub uses existing datasets to generate synthetic models. If no existing data is available, it uses user-provided scripts and data rules to generate synthetic data from out-of-the-box helper datasets.

Synthetic datasets are simply artificially manufactured sets, produced to a desired degree of accuracy. Real data does play a part in synthetic generation, depending on the realism you require. The product roadmap details the functionality planned in this respect.

DataHub's core is built predominantly around pandas data frames and object generation. A common question is: now that I have a data frame of synthetic data, what do I do with it? The pandas library comes with an array of options here, so for the time being sinking to databases is out of scope for the core library. However, see the examples in the test folder for some common patterns.

Note: As we build out a config-based synthetic spec generator, we will bring this back into scope. Please see our roadmap and issue list and get involved in the discussion.
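As a rough illustration of the "what do I do with it?" question, the sketch below sinks a data frame to a database using only pandas and the standard-library `sqlite3` module. It does not use any DataHub API; the `synthetic_df` frame, the `customers` table name, and the column names are all hypothetical, and the in-memory SQLite database is an assumption made to keep the example self-contained.

```python
import sqlite3

import pandas as pd

# Hypothetical synthetic frame; in practice this would come from your generator.
synthetic_df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "country": ["GB", "US", "DE"],
        "balance": [120.50, 89.99, 430.00],
    }
)

# An in-memory SQLite database keeps the sketch self-contained (assumption).
conn = sqlite3.connect(":memory:")

# DataFrame.to_sql accepts a sqlite3 connection directly.
synthetic_df.to_sql("customers", conn, index=False, if_exists="replace")

# Read the rows back to confirm the round trip.
loaded = pd.read_sql("SELECT * FROM customers ORDER BY customer_id", conn)
row_count = len(loaded)
conn.close()
```

For other databases, `to_sql` is typically given a SQLAlchemy engine instead of a `sqlite3` connection; the rest of the pattern stays the same.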

Key documents

  1. For information on how to get started with DataHub see our Getting Started Guide
  2. For more technical information about DataHub and how to customize it, see the Developer Guide
  3. For high-level project direction see Road Map, Requirements Gathering Approach and Delegated Action Groups.
  4. For Feature Development, Good First Issues, Help Wanted and Bug Tracking see DataHub GitHub Issues.
  5. This project uses Gravizo for all diagrams and charts as highlighted in DataHub Issue 41.

Overview of Synthetic data

  • Synthetic data is information that is artificially manufactured rather than generated by real-world events.
  • Synthetic data is created algorithmically, and can be used as a stand-in for test datasets of production data.
  • Real data does play a part in synthetic data generation, depending on how realistic you want the output to be.
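To make the points above concrete, here is a minimal sketch of one way real data can inform synthetic output: summary statistics are taken from a small sample, and new rows are drawn from them. This is not DataHub's implementation. The `synthesize` function, the column names, the sample values, and the age bounds are all assumptions chosen for illustration, and the example uses plain NumPy and pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so runs are repeatable

# Stand-in for a small "real" sample; the values are invented for illustration.
real = pd.DataFrame(
    {
        "age": [23, 35, 41, 29, 52, 38],
        "segment": ["a", "b", "a", "a", "c", "b"],
    }
)


def synthesize(real_df, n_rows, rng):
    """Draw synthetic rows that follow simple statistics of real_df."""
    # Numeric column: sample from a normal fitted to the observed mean and std.
    ages = rng.normal(real_df["age"].mean(), real_df["age"].std(), size=n_rows)
    # Data rule (assumed): ages are whole numbers between 18 and 90.
    ages = np.clip(np.round(ages), 18, 90).astype(int)

    # Categorical column: sample labels by their observed frequency.
    freq = real_df["segment"].value_counts(normalize=True)
    segments = rng.choice(freq.index.to_numpy(), size=n_rows, p=freq.to_numpy())

    return pd.DataFrame({"age": ages, "segment": segments})


synthetic = synthesize(real, 1000, rng)
```

Each column is sampled independently here, so any correlation between `age` and `segment` in the real sample is lost. That is the kind of realism trade-off the last bullet refers to.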

License

Copyright 2020 Citigroup

Distributed under the Apache License, Version 2.0.

SPDX-License-Identifier: Apache-2.0

Contributors

finos-admin, grovesy, maoo, mcleo-d, pgrovesy, zheyu-wang-tony
