🚀 Synthetic Data Generator

Switch Language: 简体中文  |  Latest API Docs |   Join Wechat Group

Colab Examples:  LLM: Data Synthesis  |   LLM: Off-Table Inference  |   Billion-Level-Data supported CTGAN

The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data.

Synthetic data does not contain any sensitive information, yet it retains the essential characteristics of the original data, making it exempt from privacy regulations such as GDPR and ADPPA.

High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc.

💥News

Our current key achievements and timelines are as follows:

🔥 Feb 20, 2024: added a single-table data synthesis model based on LLM. See the colab examples: LLM: Data Synthesis and LLM: Off-Table Feature Inference.

🔶 Dec 20, 2023: v0.1.0 released, including a CTGAN model that can process billion-level data. See our benchmark against SDV, where SDG consumed less memory and avoided crashing during training. For usage, see the colab example: Billion-Level-Data supported CTGAN.

🔆 Aug 10, 2023: First line of SDG code committed.

🎉 LLM-integrated synthetic data generation

LLMs have long been used to understand and generate many types of data, and they are also capable of generating tabular data. They additionally offer abilities that traditional approaches (GAN-based or statistical methods) cannot provide.

Our sdgx.models.LLM.single_table.gpt.SingleTableGPTModel implements two new features:

Synthetic data generation without Data

No training data is required: synthetic data can be generated from metadata alone. See our colab example.

Off-Table feature inference

Infer new column data from the existing data in the table and the LLM's own knowledge. See our colab example.

💫 Why SDG?

  • Technological advancements:
    • Supports a wide range of statistical data synthesis algorithms and also integrates an LLM-based synthetic data generation model;
    • Optimised for big data scenarios, effectively reducing memory consumption;
    • Continuously tracks the latest advances in academia and industry, and promptly adds support for excellent algorithms and models.
  • Privacy enhancements:
    • SDG supports differential privacy, anonymization and other methods to enhance the security of synthetic data.
  • Easy to extend:
    • Supports expansion of models, data processing, data connectors, etc. in the form of plug-in packages.

🌀 Quick Start

Pre-built image

You can use pre-built images to quickly experience the latest features.

docker pull idsteam/sdgx:latest

Install from PyPI

pip install sdgx

Local Install (Recommended)

Install SDG from the source code.

git clone git@github.com:hitsz-ids/synthetic-data-generator.git
cd synthetic-data-generator
pip install .
# Or install directly from git
pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git

Quick Demo of Single Table Data Generation and Metric

Demo code

from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.utils import download_demo_data

# This will download demo data to ./dataset
dataset_csv = download_demo_data()

# Create data connector for csv file
data_connector = CsvConnector(path=dataset_csv)

# Initialize synthesizer, use CTGAN model
synthesizer = Synthesizer(
    model=CTGANSynthesizerModel(epochs=1),  # For quick demo
    data_connector=data_connector,
)

# Fit the model
synthesizer.fit()

# Sample
sampled_data = synthesizer.sample(1000)
print(sampled_data)

Comparison

Real data are as follows:

>>> data_connector.read()
       age         workclass  fnlwgt  education  ...  capitalloss hoursperweek native-country  class
0        2         State-gov   77516  Bachelors  ...            0            2  United-States  <=50K
1        3  Self-emp-not-inc   83311  Bachelors  ...            0            0  United-States  <=50K
2        2           Private  215646    HS-grad  ...            0            2  United-States  <=50K
3        3           Private  234721       11th  ...            0            2  United-States  <=50K
4        1           Private  338409  Bachelors  ...            0            2           Cuba  <=50K
...    ...               ...     ...        ...  ...          ...          ...            ...    ...
48837    2           Private  215419  Bachelors  ...            0            2  United-States  <=50K
48838    4               NaN  321403    HS-grad  ...            0            2  United-States  <=50K
48839    2           Private  374983  Bachelors  ...            0            3  United-States  <=50K
48840    2           Private   83891  Bachelors  ...            0            2  United-States  <=50K
48841    1      Self-emp-inc  182148  Bachelors  ...            0            3  United-States   >50K

[48842 rows x 15 columns]

Synthetic data are as follows:

>>> sampled_data
     age workclass  fnlwgt     education  ...  capitalloss hoursperweek native-country  class
0      1       NaN   28219  Some-college  ...            0            2    Puerto-Rico  <=50K
1      2   Private  250166       HS-grad  ...            0            2  United-States   >50K
2      2   Private   50304       HS-grad  ...            0            2  United-States  <=50K
3      4   Private   89318     Bachelors  ...            0            2    Puerto-Rico   >50K
4      1   Private  172149     Bachelors  ...            0            3  United-States  <=50K
..   ...       ...     ...           ...  ...          ...          ...            ...    ...
995    2       NaN  208938     Bachelors  ...            0            1  United-States  <=50K
996    2   Private  166416     Bachelors  ...            2            2  United-States  <=50K
997    2       NaN  336022       HS-grad  ...            0            1  United-States  <=50K
998    3   Private  198051       Masters  ...            0            2  United-States   >50K
999    1       NaN   41973       HS-grad  ...            0            2  United-States  <=50K

[1000 rows x 15 columns]
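
To put a number on how closely one column of the synthetic data follows the real data, the sketch below computes the total variation distance between two categorical distributions. It is a minimal standalone illustration and not SDG's own metrics module: the function names `category_frequencies` and `total_variation_distance` are hypothetical, and the sample values are just the first five `workclass` entries from the two tables above.

```python
from collections import Counter


def category_frequencies(values):
    """Return the relative frequency of each distinct value."""
    # NaN is not equal to itself, so fold every NaN into one "<missing>" label.
    cleaned = ["<missing>" if v != v else v for v in values]
    counts = Counter(cleaned)
    total = len(cleaned)
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute frequency differences: 0 = identical, 1 = disjoint."""
    real = category_frequencies(real_values)
    synth = category_frequencies(synthetic_values)
    labels = set(real) | set(synth)
    return 0.5 * sum(abs(real.get(l, 0.0) - synth.get(l, 0.0)) for l in labels)


# First five `workclass` values from the real and synthetic tables shown above.
real_workclass = ["State-gov", "Self-emp-not-inc", "Private", "Private", "Private"]
synthetic_workclass = [float("nan"), "Private", "Private", "Private", "Private"]

print(total_variation_distance(real_workclass, synthetic_workclass))
```

The functions accept any iterable of values, so in the demo they could presumably be called with whole columns, for example `data_connector.read()["workclass"]` and `sampled_data["workclass"]`. Five rows are far too few to judge quality, and a model trained for a single epoch is expected to score poorly.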

👩‍🎓 Related Work

🤝 Join Community

The SDG project was initiated by the Institute of Data Security, Harbin Institute of Technology. If you are interested in our project, you are welcome to join our community. We welcome organizations, teams, and individuals who share our commitment to data protection and security through open source.

📄 License

The SDG open source project is released under the Apache-2.0 license. See the LICENSE file for details.
