datanymizer's Introduction

datanymizer

Powerful database anonymizer with flexible rules. Written in Rust.

Datanymizer is created & supported by Evrone. See what else we develop with Rust.

You can find more information in articles in English and Russian.

How it works

Database -> Dumper (+Faker) -> Dump.sql

You can import or process your dump with the supported database's own tools, without 3rd-party importers.

Datanymizer generates a database-native dump.

Installation

There are several ways to install pg_datanymizer; choose the option most convenient for you.

Pre-compiled binary

# Linux / macOS / Windows (MINGW and etc). Installs it into ./bin/ by default
$ curl -sSfL https://raw.githubusercontent.com/datanymizer/datanymizer/main/cli/pg_datanymizer/install.sh | sh -s

# Or a shorter way
$ curl -sSfL https://git.io/pg_datanymizer | sh -s

# Specify installation directory and version
$ curl -sSfL https://git.io/pg_datanymizer | sudo sh -s -- -b /usr/local/bin v0.2.0

# Alpine Linux (wget)
$ wget -q -O - https://git.io/pg_datanymizer | sh -s

Homebrew / Linuxbrew

# Installs the latest stable release
$ brew install datanymizer/tap/pg_datanymizer

# Builds the latest version from the repository
$ brew install --HEAD datanymizer/tap/pg_datanymizer

Docker

$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer
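A possible full invocation, assuming the image's entrypoint is the pg_datanymizer binary and config.yml sits in the current directory (the connection string is illustrative; on Linux you may need extra options for the container to reach your database host):

$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer \
    -f /app/dump.sql -c /app/config.yml \
    postgres://postgres:postgres@host.docker.internal/test_database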

Getting started with CLI dumper

First, inspect your database schema, choose fields with sensitive data, and create a config file based on it.

# config.yml
tables:
  - name: markets
    rules:
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 1
                max: 2
            - words:
                min: 1
                max: 2
  - name: franchisees
    rules:
      operator_mail:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
      operator_name:
        first_name: {}
      operator_phone:
        phone:
          format: +###########
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 2
                max: 3
            - words:
                min: 2
                max: 3
  - name: users
    rules:
      first_name:
        first_name: {}
      last_name:
        last_name: {}
  - name: customers
    rules:
      email:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
                uniq:  
                  required: true
                  try_count: 5
      phone:
        phone:
          format: +7##########
          uniq: true
      city:
        city: {}
      age:
        random_num:
          min: 10
          max: 99
      first_name:
        first_name: {}
      last_name:
        last_name: {}
      birth_date:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00

Then start making a dump from your database instance:

pg_datanymizer -f /tmp/dump.sql -c ./config.yml postgres://postgres:postgres@localhost/test_database

It creates a new dump file /tmp/dump.sql with a native SQL dump for the PostgreSQL database. You can import the fake data from this dump into a new PostgreSQL database with:

psql -U postgres -d new_database < /tmp/dump.sql

The dumper can stream the dump to STDOUT (like pg_dump), so you can use it in other pipelines:

pg_datanymizer -c ./config.yml postgres://postgres:postgres@localhost/test_database > /tmp/dump.sql
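For example, you can pipe the anonymized dump straight into psql without writing an intermediate file (database names are illustrative):

pg_datanymizer -c ./config.yml postgres://postgres:postgres@localhost/test_database | psql -U postgres -d new_database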

Additional options

Tables filter

You can specify which tables to include in or exclude from the dump.

To dump only public.markets and public.users data:

# config.yml
#...
filter:
  only:
    - public.markets
    - public.users

To ignore these tables and dump data from all the others:

# config.yml
#...
filter:
  except:
    - public.markets
    - public.users

You can also specify data and schema filters separately.

This is equivalent to the previous example.

# config.yml
#...
filter:
  data:
    except:
      - public.markets
      - public.users

To skip both schema and data for all other tables:

# config.yml
#...
filter:
  schema:
    only:
      - public.markets
      - public.users

To skip the schema for the markets table and dump data only from the users table:

# config.yml
#...
filter:
  data:
    only:
      - public.users
  schema:
    except:
      - public.markets

You can use wildcards in the filter section:

  • ? matches exactly one occurrence of any character;
  • * matches arbitrary many (including zero) occurrences of any character.
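For example, a filter like the following (the schema and table names are illustrative) dumps only the tables in the public schema whose names start with user_:

# config.yml
#...
filter:
  only:
    - public.user_*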

Dump conditions and limit

You can specify conditions (SQL WHERE statement) and limit for dumped data per table:

# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # select maximum 100 rows
      limit: 100 

Transform conditions and limit

As an additional option, you can specify SQL conditions that define which rows will be transformed (anonymized):

# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # preserve original values for some rows
      transform_condition: "NOT (first_name = 'John' AND last_name = 'Doe')"      
      # select maximum 100 rows
      limit: 100

You can use the dump_condition, transform_condition and limit options in any combination (only transform_condition; transform_condition and limit; etc).
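For example, using only transform_condition (the condition itself is illustrative):

# config.yml
tables:
  - name: people
    query:
      # all rows are dumped; rows matching the condition are anonymized,
      # rows with an @example.com address keep their original values
      transform_condition: "email NOT LIKE '%@example.com'"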

Global variables

You can specify global variables available from any template rule.

# config.yml
tables:
  users:
    bio:
      template:
        format: "User bio is {{var_a}}"
    age:
      template:
        format: "{{ _0 | float * global_multiplicator }}"
#...
globals:
  var_a: Global variable 1
  global_multiplicator: 6

Available rules

Rule          Description
email         Emails with different options
ip            IP addresses (IPv4 and IPv6 are supported)
words         Lorem words with different lengths
first_name    First name generator
last_name     Last name generator
city          City name generator
phone         Random phone numbers with a configurable format
pipeline      Use a pipeline to generate more complicated values
capitalize    Like the filter, it capitalizes the input value
template      Template engine for generating random text with embedded rules
digit         Random digit (in the range 0..9)
random_num    Random number with min and max options
password      Password with length options (min and max are supported)
datetime      DateTime strings with from and to options

...and more than 70 rules in total.

For the complete list of rules, please refer to this document.
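As a quick illustration, several of these rules can be combined in one table config (a minimal sketch; the table and field names are illustrative, and the exact option names for each rule should be checked against the rules document):

# config.yml (illustrative)
tables:
  - name: accounts
    rules:
      pin_code:
        # a single random digit (0..9); no options needed
        digit: {}
      secret:
        # the password rule; length options (min / max) as described above
        password:
          min: 12
          max: 16
      last_login_ip:
        # IPv4 / IPv6 addresses
        ip: {}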

Uniqueness

You can specify that result values must be unique (they are not unique by default). You can use short or full syntax.

Short:

uniq: true

Full:

uniq:
  required: true
  try_count: 5

Uniqueness is ensured by re-generating values when duplicates occur. You can customize the number of attempts with try_count (this is an optional field; the default number of tries depends on the rule).

Currently, uniqueness is supported by: email, ip, phone, random_num.
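For example, using uniq with random_num (a sketch; the table and field names are illustrative):

# config.yml
tables:
  - name: customers
    rules:
      referral_code:
        random_num:
          min: 100000
          max: 999999
          # regenerate on collision, up to 10 times
          uniq:
            required: true
            try_count: 10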

Locales

You can specify the locale for individual rules:

first_name:
  locale: RU

The default locale is EN but you can specify a different default locale:

tables:
  # ........  
default:
  locale: RU

We also support ZH_TW (Traditional Chinese) and RU (translation in progress).
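For example, you can combine a default locale with a per-rule override (a sketch; the table and field names are illustrative):

# config.yml
tables:
  - name: users
    rules:
      first_name:
        # uses the default locale (RU)
        first_name: {}
      city:
        city:
          # per-rule override
          locale: EN
default:
  locale: RU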

Referencing row values from templates

You can reference the values of other row fields in templates. Use prev for original values and final for anonymized ones:

tables:
  - name: some_table
    # You must specify the order of rule execution when using `final`
    rule_order:
      - greeting
      - options
    rules:
      first_name:
        first_name: {}
      greeting:
        template:
          # Keeping the first name, but anonymizing the last name   
          format: "Hello, {{ prev.first_name }} {{ final.last_name }}!"
      options:
        template:
          # Using the anonymized value again   
          format: "{greeting: \"{{ final.greeting }}\"}"

You must specify the order of rule execution with rule_order when using final. All rules not listed will be placed at the beginning (i.e., you only need to list rules that use final).

Sharing information between rows

We implemented a built-in key-value store that allows information to be exchanged between anonymized rows.

It is available via special functions in templates.

Take a look at an example:

tables:
  - name: users  
    rules:
      name:
        template:    
          # Save a name to the store as a side effect, the key is `user_names.<USER_ID>` 
          format: "{{ _1 }}{{ store_write(key='user_names.' ~ prev.id, value=_1) }}"
          rules:
            - person_name: {}
  - name: user_operations
    rules:
      user_name:          
        template:
          # Using the saved value again  
          format: "{{ store_read(key='user_names.' ~ prev.user_id) }}"

Supported databases

  • PostgreSQL
  • MySQL or MariaDB (TODO)

Documentation

Sponsors

Sponsored by Evrone

License

MIT

Development

Cross compilation

Mac to Linux

rustup target add x86_64-unknown-linux-gnu
brew tap messense/macos-cross-toolchains
brew install x86_64-unknown-linux-gnu
CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_LINKER=x86_64-linux-gnu-gcc cargo build --target x86_64-unknown-linux-gnu --release --features openssl/vendored

datanymizer's People

Contributors

akirill0v, dependabot[bot], evgeniy-r, gregwebs, janmeier, marmik18, mbeynon, mgrachev, sobolevn


datanymizer's Issues

postgresql - same user-defined type present in multiple schemas breaks the process

Folks,

I have a situation where the same user-defined-type name is re-used across two different schemas in the same database, and it breaks the process.

I think the query that constructs the columns of a table needs a small tweak:

Line 32 of schema_inspector.rs needs to change from:

const TABLE_COLUMNS_QUERY: &str = "SELECT cc.column_name, cc.ordinal_position, cc.data_type, pt.oid
                                   FROM information_schema.columns as cc
                                   JOIN pg_catalog.pg_type as pt
                                   ON cc.udt_name = pt.typname
                                   WHERE cc.table_schema = $1 and cc.table_name = $2
                                   ORDER BY cc.ordinal_position ASC";

to

const TABLE_COLUMNS_QUERY: &str = "SELECT cc.column_name, cc.ordinal_position, cc.data_type, pt.oid
                                   FROM information_schema.columns as cc
                                   JOIN pg_catalog.pg_type as pt
                                   ON cc.udt_name = pt.typname
                                   LEFT JOIN pg_catalog.pg_namespace as pn
                                   ON (
                                        cc.data_type = 'USER-DEFINED'
                                        AND cc.table_schema = pn.nspname
                                        and pn.oid = pt.typnamespace
                                      )
                                   WHERE cc.table_schema = $1 and cc.table_name = $2
                                   AND ( cc.data_type <> 'USER-DEFINED' OR pn.nspname is not null )
                                   ORDER BY cc.ordinal_position ASC";

Mainly, it ensures that the schema name is used when finding the type of a column when the type is user-defined.

In-place data modification

Hi! Is it possible to make it change data in place (without a dump-restore cycle)?
We already have an automated backups-to-staging restoration cycle (Postgres, base backups) and would like to use this tool for in-place data masking.

Fetch tables metadata takes too long

The "Fetch tables metadata..." step takes more time than a full database dump.

I have filters in place for only the 4 tables I need out of 100+, but this still takes longer than I expect.

Add tables exclusion and inclusion at the schema creation level

Currently, when a filter is added to the list of tables, it only filters the data-copying process. The schema needs to respect the filter as well. To do this, we need to add arguments to the pg_dump command invocation.

There are 2 steps that use the pg_dump CLI application:
https://github.com/datanymizer/datanymizer/blob/main/datanymizer_dumper/src/postgres/dumper.rs#L62
and
https://github.com/datanymizer/datanymizer/blob/main/datanymizer_dumper/src/postgres/dumper.rs#L134

We need to add inclusion and exclusion support to it.

  • Including
  • Excluding

Possible to exclude owner/privilege information

Using pg_dump and pg_restore you can exclude owner/privilege information using --no-owner and --no-privileges. Restoring with psql < dump, as documented, doesn't allow this kind of thing. Is this possible without having to manually text-process the dump file?

Set up CI

Set up GitHub Actions to run these commands:

  • check
  • fmt
  • build
  • clippy
  • test
  • audit

Panic on TemplateTransformerTemplate

I am getting a panic when I try to store some values in the key-value store using the example mentioned in the docs.

Config

tables:
  - name: User
    rules:
      phone:
        template:
          format: "{{ _1 }}{{ store_write key='user_names.' ~ prev.profile, value=_1 }}"
          rules:
            - phone:
                format: "+############"
...........

Output

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Msg("Failed to parse 'TemplateTransformerTemplate'"), source: Some(Error { kind: Msg(" --> 1:24\n  |\n1 | {{ _1 }}{{ store_write key='user_names.' ~ prev.profile, value=_1 }}\n  |  ^---\n  |\n  = expected `or`, `and`, `not`, `<=`, `>=`, `<`, `>`, `==`, `!=`, `+`, `-`, `*`, `/`, `%`, a filter, or a variable end (`}}`)"), source: None }) }', datanymizer_engine/src/transformers/template/mod.rs:64:59
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
/bin/bash: line 54:    28 Aborted                 ./bin/pg_datanymizer -f backup.sql -c anonymizer/config.yml $SOURCE_POSTGRES_CONNECTION_STRING --accept_invalid_certs=true

Dump includes an extra value per row for COPY on a table I have configured

It appears to impact fields that I have configuration for:

tables:
  - name: user
    rules:
      name:
        person_name: {}
      birthdate:
        datetime:
          from: 1980-01-01T00:00:00+00:00
          to: 2021-07-24T00:00:00+00:00

When loading the dump I see this error:

COPY 0
ERROR:  invalid input syntax for type timestamp with time zone: "Vanessa Wyman"
CONTEXT:  COPY user, line 1, column birthdate: "Vanessa Wyman"
COPY public.user("id", "firebase_uid", "created_at", "updated_at", "username", "bio", "verified", "flagged", "private_account", "avatar_image_id", "user_role_type", "avatar_foreground_color", "avatar_background_light_color", "avatar_background_dark_color", "name", "birthdate", "last_seen", "google_iap_uid") FROM STDIN;
2afa72c2-f06a-4bcd-88e8-6bac7b32fa80	ANDROID- 17	2021-06-14 19:09:17.508424+00	2021-06-14 19:09:17.508424+00	somename	\N	f	f	f	\N	USER	\N	\N	\N	\N	Vanessa Wyman	2004-01-13T00:14:00+00:00	\N

Make bootstrap in demo dir isn't working?

Discussed in #107

Originally posted by marcelofernandez October 5, 2021
Seems like the demo docker environment isn't working:

marcelo@marcelo:~/dev/datanymizer/demo/$ make bootstrap
[...]
   Compiling nom v5.1.2
   Compiling openssl-sys v0.9.63
error: failed to run custom build command for `openssl-sys v0.9.63`

Caused by:
  process didn't exit successfully: `/usr/src/target/release/build/openssl-sys-9793598f3bd829c9/build-script-main` (exit code: 101)
  --- stdout
  cargo:rustc-cfg=const_fn
  cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_LIB_DIR
  X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_LIB_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_LIB_DIR
  OPENSSL_LIB_DIR unset
  cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_INCLUDE_DIR
  X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_INCLUDE_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_INCLUDE_DIR
  OPENSSL_INCLUDE_DIR unset
  cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_DIR
  X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_DIR
  OPENSSL_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_NO_PKG_CONFIG
  cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS_x86_64-unknown-linux-musl
  cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS_x86_64_unknown_linux_musl
  cargo:rerun-if-env-changed=TARGET_PKG_CONFIG_ALLOW_CROSS
  cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS
  cargo:rerun-if-env-changed=PKG_CONFIG_x86_64-unknown-linux-musl
  cargo:rerun-if-env-changed=PKG_CONFIG_x86_64_unknown_linux_musl
  cargo:rerun-if-env-changed=TARGET_PKG_CONFIG
  cargo:rerun-if-env-changed=PKG_CONFIG
  cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR_x86_64-unknown-linux-musl
  cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR_x86_64_unknown_linux_musl
  cargo:rerun-if-env-changed=TARGET_PKG_CONFIG_SYSROOT_DIR
  cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR
  run pkg_config fail: "pkg-config has not been configured to support cross-compilation.\n\n                Install a sysroot for the target platform and configure it via\n                PKG_CONFIG_SYSROOT_DIR and PKG_CONFIG_PATH, or install a\n                cross-compiling wrapper for pkg-config and set it via\n                PKG_CONFIG environment variable."

  --- stderr
  thread 'main' panicked at '

  Could not find directory of OpenSSL installation, and this `-sys` crate cannot
  proceed without this knowledge. If OpenSSL is installed and this crate had
  trouble finding it,  you can set the `OPENSSL_DIR` environment variable for the
  compilation process.

  Make sure you also have the development packages of openssl installed.
  For example, `libssl-dev` on Ubuntu or `openssl-devel` on Fedora.

  If you're in a situation where you think the directory *should* be found
  automatically, please open a bug at https://github.com/sfackler/rust-openssl
  and include information about your system as well as this message.

  $HOST = x86_64-unknown-linux-gnu
  $TARGET = x86_64-unknown-linux-musl
  openssl-sys = 0.9.63

  ', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/openssl-sys-0.9.63/build/find_normal.rs:174:5
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
error: build failed

I'm trying this out in an Ubuntu 20.04 + Docker 20.10.9~3-0~ubuntu-focal setup, on an x86_64 AMD machine...

Execution order for tables

Is there a way to define an execution sequence for the tables? The values are not written to the store before the read operation starts. Swapping the sequence in the config works, but it is inconsistent and fails after some runs.

tables:
  - name: User
  ...
    template:
      format: "{{ store_read(...) }}"
  ...
  - name: Profile
  ...
    template:
      format: "{{ store_write(...) }}"
  ...

pg_datanymizer does not work in a container

I ran the demo as described in demo/README.md.
I built a container with pg_datanymizer in it and ran it via make dump.
But I get an error:

Error: invalid type: unit value, expected struct FirstNameTransformer

If I comment out the first_name block in dvdrental.yml, like this:

filter:
  only:
    - public.actor
tables:
  - name: actor
    rules:
#      first_name:
#        first_name: ~
      last_name:
        last_name: ~
      last_update:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00

Then I get the same error for last_name:

Error: invalid type: unit value, expected struct LastNameTransformer

After commenting out last_name as well, like this:

filter:
  only:
    - public.actor
tables:
  - name: actor
    rules:
#      first_name:
#        first_name: ~
#      last_name:
#        last_name: ~
      last_update:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00

Everything works fine.

When I build a container with PostgreSQL and download the pg_datanymizer CLI into it, everything works fine!

My Dockerfile for example - Link

Built container on DockerHub - agmayer/pg_datanymizer

Add template references to other transformed values of the same row

Possible syntax (look at tr):

tables:
  - name: users
    rules:
      login:
        username: {}
      preferences:
        template:
          format: '{"some_complex_json": {"field": "{{tr.login}}"}}'

Tera also allows tr["login"].

The name tr means "transformed row". We could use tr_row or transformed_row instead (though that might be too long).

We could also use just row, but what if in the future we decide to use the values of the original row too?

Adding if-conditions for column transformation

It would be good to skip transformations using if-conditions, for example:

phone:
  rules:
    conditions:
      - if: "{% condition statement-1 %}"
        rules:
           - format:
               template: "bla-bla-bla"
      - if: "{% condition statement-2 %}"
        rules:
           - format:
               template: "my cool template"
      - rules:
           - format:
               template: "bla-bla-bla"

Release v0.1.0

What needs to be done for the v0.1.0 release.

Infrastructure:

  • Set up CI (check, fmt, build, clippy, test, audit) (#7);
  • Automate release process (linux, darwin, docker) (#13);
  • Create Homebrew formula;

Improvements:

  • Make uniqueness be available for other transformers (#8);
  • Add tables exclusion and inclusion at the schema creation level (#19);

Malformed query when table has no defined fields

Our database uses timescaledb, and there's a schema called _timescaledb_cache with three tables: cache_inval_bgw_job, cache_inval_extension, and cache_inval_hypertable. For some reason there are no fields defined in these tables.


When running pg_datanymizer against this, it fails with an error that made it hard to know which table was the cause (we have many schemas and tables). Running Rust nightly with RUST_BACKTRACE=full allowed me to see the offending line, and I modified pg_datanymizer a bit to print out the offending query.

    (prior output deleted)

[Dumping: _timescaledb_internal._hyper_9_327_chunk] [|##################################################|] 0 of 0 rows [100%] (0s)
[Dumping: _timescaledb_internal._hyper_9_327_chunk] Finished in 0 seconds
[Dumping: _timescaledb_internal._hyper_15_976_chunk] [|##################################################|] 68288 of 68288 rows [100%] (0s)
[Dumping: _timescaledb_internal._hyper_15_976_chunk] Finished in 2 seconds
Error: db error: ERROR: syntax error at or near ")"

Caused by:
ERROR: syntax error at or near ")"

Stack backtrace:
0: std::backtrace_rs::backtrace::libunwind::trace
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: std::backtrace_rs::backtrace::trace_unsynchronized
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: std::backtrace::Backtrace::create
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/backtrace.rs:328:13
3: std::backtrace::Backtrace::capture
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/backtrace.rs:296:9
4: anyhow::error::<impl core::convert::From for anyhow::Error>::from
at /Users/beynon/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.33/src/error.rs:464:25
5: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Resultcore::convert::Infallible,E>>::from_residual
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/core/src/result.rs:2052:27
6: datanymizer_dumper::postgres::dumper::PgDumper<W,I>::dump_table
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/postgres/dumper.rs:109:26
7: <datanymizer_dumper::postgres::dumper::PgDumper<W,I> as datanymizer_dumper::Dumper>::data
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/postgres/dumper.rs:173:17
8: datanymizer_dumper::Dumper::dump
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/lib.rs:21:9
9: pg_datanymizer::app::App::run
at /Users/beynon/work/datanymizer/cli/pg_datanymizer/src/app.rs:34:31
10: pg_datanymizer::main
at /Users/beynon/work/datanymizer/cli/pg_datanymizer/src/main.rs:13:5

One of the offending queries is:

COPY "_timescaledb_cache"."cache_inval_hypertable"() TO STDOUT

The solution seems to be that if the column list is empty, the parentheses must be omitted:

COPY "_timescaledb_cache"."cache_inval_hypertable" TO STDOUT

I can easily skip these tables with an "except:" in the config.yaml, but then the pg_dump output is missing a table that may still be required by timescaledb to function properly.

Create example for pg_datanymizer

Create an example that introduces how to get started with the CLI tool pg_datanymizer.
Set up an environment with docker-compose, a test database dump, and an example config file.

Postgres Dump missing Create statements

When I run pg_datanymizer against a GCP-managed SQL instance, I only get a data dump; it does not include any CREATE statements for the relations.

Running pg_dump directly against the same database with the same credentials gives a full dump as expected.

The command I am running is
pg_datanymizer db --config /app/config.yml --host cloud-sql-proxy --username <redacted> --password <redacted> --file /app/dump.sql --port 5432

The config just has table rules, no filters or anything like that.

Would appreciate any guidance on this.

RUSTSEC-2020-0146: arr! macro erases lifetimes

arr! macro erases lifetimes

Details:
Package: generic-array
Version: 0.12.3
URL: fizyk20/generic-array#98
Date: 2020-04-09
Patched versions: >=0.14.0
Unaffected versions: <0.8.0

Affected versions of this crate allowed unsoundly extending
lifetimes using arr! macro. This may result in a variety of
memory corruption scenarios, most likely use-after-free.

See advisory page for additional details.

datetime format issues

      birth_month: 
        datetime:
          format: '%m'

throws error Error: Failed transform trailing input

      birth_month: 
        datetime:
          format: %m

throws error Error: unexpected character: '%' at line 68 column 19 in config.yaml

SSL support

I am trying to use datanymizer in an environment that requires an SSL connection (Heroku Postgres). It fails with an error:

$ docker run --rm -v /Users/f213/prj/education/dev-db:/app -w /app datanymizer/pg_datanymizer "postgres://user:pwd@<REDACTED>.eu-west-1.compute.amazonaws.com:5432/<REDACTED>?sslmode=require"
Error: error performing TLS handshake: no TLS implementation configured

Caused by:
    no TLS implementation configured

Any chance this great project will support SSL?

Sharing information between tables

As mentioned in the readme, there is a way to share information between rows. I was wondering: is it also possible to share information between tables? My use case needs to keep a value consistent between table1 and table2.

For example:
I have 2 tables named Table1 and Table2. Both of them have an emailId field, and I want to anonymize them while keeping the values consistent. Is that possible?
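A possible approach, sketched from the key-value store feature described in the readme (it assumes the rows in the two tables can be matched by a shared id; the table and column names follow the question):

tables:
  - name: Table1
    rules:
      emailId:
        template:
          # anonymize the email and remember it under the row's id
          format: "{{ _1 }}{{ store_write(key='emails.' ~ prev.id, value=_1) }}"
          rules:
            - email:
                kind: Safe
  - name: Table2
    rules:
      emailId:
        template:
          # reuse the value stored while dumping Table1
          format: "{{ store_read(key='emails.' ~ prev.id) }}"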

Issue with reading from key value store

Hi @evgeniy-r, I am facing some issues while using the key-value store. Please correct me if I am using it wrong.
User is connected to Profile with a foreign key in the User table. So I am using profile_id as the key while storing the data in the store, and while retrieving I am just using the id from the Profile table.

Config

tables:
  - name: User
    rules:
      phone:
        template:
          format: "{{ _1 }}{{ store_write(key='user_phonenumbers.' ~ prev.profile_id, value=_1) }}"
          rules:
            - phone:
                format: "+############"
                uniq: true
  - name: Profile
    rules:
      phone:
        # reading phone numbers from `user_phonenumbers` stored in `User`
        template:
          format: "{{ store_read(key='user_phonenumbers.' ~ prev.id) }}"

Output

...
Prepare to dump table: default$default.Profile
Error: Failed transform Failed to render 'TemplateTransformerTemplate'
ERROR: Job failed: exit code 1
FATAL: exit code 1         

Just to confirm whether the template gives an error on reading, I tried removing the template under Profile and replacing it with phone to randomize the value, and it works that way. So I am quite sure that something is wrong - either with how I am implementing it or with the anonymizer.

Error: no matching tables were found

@evgeniy-r I am facing an issue while using the CLI with my DB. I am not sure whether I am using the config correctly; can you help me rectify the issue?

Thank you in advance.

Background

I have a database with multiple schemas and I want to copy over the schema default$default and the tables under that schema. Of course, I also anonymize certain fields in the tables.

config.yml

tables:
  - name: default$default.Profile
    rules:
      firstName:
        first_name: {}
        ...
        ...
filter:
  schema:
    only:
      - default$default

Output

$ ./bin/pg_datanymizer -f backup.sql -c anonymizer/config.yml --accept_invalid_certs $POSTGRES_CONNECTION_STRING
Prepare data scheme...
pg_dump error. Command:
pg_dump --section pre-data -t default$default <POSTGRES_CONNECTION_STRING>
Output:
pg_dump: error: no matching tables were found

Feature Request: wildcards in filters

Currently, I have hundreds of tables in my database, all under different schemas.
I want to anonymize around 5-10 of my tables. I am able to achieve this using the following config:

.....
filter:
  data:
    only:
      - "default$default.Profile"
      ...

Potentially, this creates another issue, as it copies over the other schemas, which I don't want to copy.

When I try to use the following config:

.....
filter:
  schema:
    only:
      - "default$default.Profile"
      ...

Only the selected tables get copied, but not their dependencies, e.g. ENUMs used in those tables.

So I would like to request a feature in which filter accepts the wildcard * for anything under a schema, something like:

filter:
  schema:
    only:
      - "default$default.*"
    except:
      - "public.*"

Thank you in advance.

Failed after schema change.

DB:

create table "user"
(
	id serial,
	first_name text,
	last_name text,
	email_field text
);

insert into "user" (first_name, last_name, email_field) VALUES ('A', 'B', '[email protected]');
alter table "user" drop column last_name;

Config:

tables:
  - name: "user"
    rules:
      email_field:
        email:
          kind: Safe

Error:

thread 'main' panicked at 'index out of bounds: the len is 3 but the index is 3', datanymizer_engine/src/engine.rs:40:25

Add the limit when dumping data

Possible config syntax:

tables:
  - users:
    # per table
    limit: 10000
    # ........................
default:
  # global
  limit: 1000

We can also think about more flexible customization - specifying an SQL query for the table.

Malformed query repeats one field multiple times

We have a table with 30 fields.

This fails and datanymizer exits because an internal untransformed query for the dump repeats the event_type field 3 times in the COPY:

None ret from qw.copy_out(untransformed_query=

"COPY "metrics"."deployment"("deployment_id", "hypervisor_key", "specification_id", "vmware_deployment_target_id",
     "vcenter_url", "vcenter_cluster", "vcenter_folder", "clone_source_id", "event_type", "event_type", "event_type",
     "org_name", "org_id", "mto_node_id", "mto_node_path", "mto_node_org_path", "academy_event_id", "package_id",
     "package_estimated_duration_in_seconds", "course_plan_id", "deployer_user_id", "user_full_name", "user_email",
     "warden_event_id", "created_at", "last_updated", "project_id", "project_org_id", "project_org_name",
     "project_mto_node_id", "project_mto_node_path", "project_mto_node_org_path") TO STDOUT
"),

table=metrics.deployment,
count=0,
error=Error {
    kind: Db,
    cause: Some(DbError {
        severity: "ERROR",
        parsed_severity: Some(Error),
        code: SqlState(E42701),
        message: "column \"event_type\" specified more than once",
        detail: None, hint: None, position: None, where_: None, schema: None, table: None, column: None,
        datatype: None, constraint: None, file: Some("copy.c"), line: Some(5155),
        routine: Some("CopyGetAttnums")
    })
}

Working now to understand why and fix it.
