datanymizer's Introduction

datanymizer

Powerful database anonymizer with flexible rules. Written in Rust.

Datanymizer is created & supported by Evrone. See what else we develop with Rust.

You can find more information in articles in English and Russian.

How it works

Database -> Dumper (+Faker) -> Dump.sql

You can import or process your dump with the supported database's own tools, without 3rd-party importers.

Datanymizer generates a database-native dump.

Installation

There are several ways to install pg_datanymizer; choose the option most convenient for you.

Pre-compiled binary

# Linux / macOS / Windows (MINGW and etc). Installs it into ./bin/ by default
$ curl -sSfL https://raw.githubusercontent.com/datanymizer/datanymizer/main/cli/pg_datanymizer/install.sh | sh -s

# Or a shorter way
$ curl -sSfL https://git.io/pg_datanymizer | sh -s

# Specify installation directory and version
$ curl -sSfL https://git.io/pg_datanymizer | sudo sh -s -- -b /usr/local/bin v0.2.0

# Alpine Linux (wget)
$ wget -q -O - https://git.io/pg_datanymizer | sh -s

Homebrew / Linuxbrew

# Installs the latest stable release
$ brew install datanymizer/tap/pg_datanymizer

# Builds the latest version from the repository
$ brew install --HEAD datanymizer/tap/pg_datanymizer

Docker

$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer
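A possible full invocation, assuming the image's entrypoint is the pg_datanymizer binary and config.yml sits in the current directory (the connection string is illustrative; on Linux you may need extra options for the container to reach your database host):

$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer \
    -f /app/dump.sql -c /app/config.yml \
    postgres://postgres:postgres@host.docker.internal/test_database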

Getting started with CLI dumper

First, inspect your database schema, choose fields with sensitive data, and create a config file based on it.

# config.yml
tables:
  - name: markets
    rules:
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 1
                max: 2
            - words:
                min: 1
                max: 2
  - name: franchisees
    rules:
      operator_mail:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
      operator_name:
        first_name: {}
      operator_phone:
        phone:
          format: +###########
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 2
                max: 3
            - words:
                min: 2
                max: 3
  - name: users
    rules:
      first_name:
        first_name: {}
      last_name:
        last_name: {}
  - name: customers
    rules:
      email:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
                uniq:  
                  required: true
                  try_count: 5
      phone:
        phone:
          format: +7##########
          uniq: true
      city:
        city: {}
      age:
        random_num:
          min: 10
          max: 99
      first_name:
        first_name: {}
      last_name:
        last_name: {}
      birth_date:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00

Then start making a dump from your database instance:

pg_datanymizer -f /tmp/dump.sql -c ./config.yml postgres://postgres:postgres@localhost/test_database

It creates a new dump file /tmp/dump.sql with a native SQL dump for the PostgreSQL database. You can import the fake data from this dump into a new PostgreSQL database with:

psql -U postgres -d new_database < /tmp/dump.sql

The dumper can stream the dump to STDOUT (like pg_dump), so you can use it in other pipelines:

pg_datanymizer -c ./config.yml postgres://postgres:postgres@localhost/test_database > /tmp/dump.sql
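For example, you can pipe the anonymized dump straight into psql without writing an intermediate file (database names are illustrative):

pg_datanymizer -c ./config.yml postgres://postgres:postgres@localhost/test_database | psql -U postgres -d new_database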

Additional options

Tables filter

You can specify which tables to include in or exclude from the dump.

To dump only public.markets and public.users data:

# config.yml
#...
filter:
  only:
    - public.markets
    - public.users

To ignore these tables and dump data from all the others:

# config.yml
#...
filter:
  except:
    - public.markets
    - public.users

You can also specify data and schema filters separately.

This is equivalent to the previous example.

# config.yml
#...
filter:
  data:
    except:
      - public.markets
      - public.users

To skip both schema and data for all other tables:

# config.yml
#...
filter:
  schema:
    only:
      - public.markets
      - public.users

To skip the schema for the markets table and dump data only from the users table:

# config.yml
#...
filter:
  data:
    only:
      - public.users
  schema:
    except:
      - public.markets

You can use wildcards in the filter section:

  • ? matches exactly one occurrence of any character;
  • * matches arbitrary many (including zero) occurrences of any character.
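For example, a filter like the following (the schema and table names are illustrative) dumps only the tables in the public schema whose names start with user_:

# config.yml
#...
filter:
  only:
    - public.user_*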

Dump conditions and limit

You can specify conditions (SQL WHERE statement) and limit for dumped data per table:

# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # select maximum 100 rows
      limit: 100 

Transform conditions and limit

As an additional option, you can specify SQL conditions that define which rows will be transformed (anonymized):

# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # preserve original values for some rows
      transform_condition: "NOT (first_name = 'John' AND last_name = 'Doe')"      
      # select maximum 100 rows
      limit: 100

You can use the dump_condition, transform_condition and limit options in any combination (only transform_condition; transform_condition and limit; etc).
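For example, using only transform_condition (the condition itself is illustrative):

# config.yml
tables:
  - name: people
    query:
      # all rows are dumped; rows matching the condition are anonymized,
      # rows with an @example.com address keep their original values
      transform_condition: "email NOT LIKE '%@example.com'"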

Global variables

You can specify global variables available from any template rule.

# config.yml
tables:
  users:
    bio:
      template:
        format: "User bio is {{var_a}}"
    age:
      template:
        format: "{{ _0 | float * global_multiplicator }}"
#...
globals:
  var_a: Global variable 1
  global_multiplicator: 6

Available rules

Rule          Description
email         Emails with different options
ip            IP addresses (IPv4 and IPv6 are supported)
words         Lorem words with different lengths
first_name    First name generator
last_name     Last name generator
city          City name generator
phone         Random phone numbers with a configurable format
pipeline      Use a pipeline to generate more complicated values
capitalize    Like the filter, it capitalizes the input value
template      Template engine for generating random text with embedded rules
digit         Random digit (in the range 0..9)
random_num    Random number with min and max options
password      Password with length options (min and max are supported)
datetime      DateTime strings with from and to options

...and more than 70 rules in total.

For the complete list of rules, please refer to this document.
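As a quick illustration, several of these rules can be combined in one table config (a minimal sketch; the table and field names are illustrative, and the exact option names for each rule should be checked against the rules document):

# config.yml (illustrative)
tables:
  - name: accounts
    rules:
      pin_code:
        # a single random digit (0..9); no options needed
        digit: {}
      secret:
        # the password rule; length options (min / max) as described above
        password:
          min: 12
          max: 16
      last_login_ip:
        # IPv4 / IPv6 addresses
        ip: {}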

Uniqueness

You can specify that result values must be unique (they are not unique by default). You can use short or full syntax.

Short:

uniq: true

Full:

uniq:
  required: true
  try_count: 5

Uniqueness is ensured by re-generating values when duplicates occur. You can customize the number of attempts with try_count (this is an optional field; the default number of tries depends on the rule).

Currently, uniqueness is supported by: email, ip, phone, random_num.
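For example, using uniq with random_num (a sketch; the table and field names are illustrative):

# config.yml
tables:
  - name: customers
    rules:
      referral_code:
        random_num:
          min: 100000
          max: 999999
          # regenerate on collision, up to 10 times
          uniq:
            required: true
            try_count: 10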

Locales

You can specify the locale for individual rules:

first_name:
  locale: RU

The default locale is EN but you can specify a different default locale:

tables:
  # ........  
default:
  locale: RU

We also support ZH_TW (Traditional Chinese) and RU (translation in progress).
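For example, you can combine a default locale with a per-rule override (a sketch; the table and field names are illustrative):

# config.yml
tables:
  - name: users
    rules:
      first_name:
        # uses the default locale (RU)
        first_name: {}
      city:
        city:
          # per-rule override
          locale: EN
default:
  locale: RU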

Referencing row values from templates

You can reference the values of other row fields in templates. Use prev for original values and final for anonymized ones:

tables:
  - name: some_table
    # You must specify the order of rule execution when using `final`
    rule_order:
      - greeting
      - options
    rules:
      first_name:
        first_name: {}
      greeting:
        template:
          # Keeping the first name, but anonymizing the last name   
          format: "Hello, {{ prev.first_name }} {{ final.last_name }}!"
      options:
        template:
          # Using the anonymized value again   
          format: "{greeting: \"{{ final.greeting }}\"}"

You must specify the order of rule execution with rule_order when using final. All rules not listed will be placed at the beginning (i.e., you only need to list rules that use final).

Sharing information between rows

We implemented a built-in key-value store that allows information to be exchanged between anonymized rows.

It is available via special functions in templates.

Take a look at an example:

tables:
  - name: users  
    rules:
      name:
        template:    
          # Save a name to the store as a side effect, the key is `user_names.<USER_ID>` 
          format: "{{ _1 }}{{ store_write(key='user_names.' ~ prev.id, value=_1) }}"
          rules:
            - person_name: {}
  - name: user_operations
    rules:
      user_name:          
        template:
          # Using the saved value again  
          format: "{{ store_read(key='user_names.' ~ prev.user_id) }}"

Supported databases

  • PostgreSQL
  • MySQL or MariaDB (TODO)

Documentation

Sponsors

Sponsored by Evrone

License

MIT

Development

Cross compilation

Mac to Linux

rustup target add x86_64-unknown-linux-gnu
brew tap messense/macos-cross-toolchains
brew install x86_64-unknown-linux-gnu
CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_LINKER=x86_64-linux-gnu-gcc cargo build --target x86_64-unknown-linux-gnu --release --features openssl/vendored

datanymizer's People

Contributors

akirill0v, dependabot[bot], evgeniy-r, gregwebs, janmeier, marmik18, mbeynon, mgrachev, sobolevn


datanymizer's Issues

postgresql - same user-defined type present in multiple schemas breaks the process

Folks,

I have a situation where the same user-defined-type name is re-used across two different schemas in the same database, and it breaks the process.

I think the query that constructs the columns of a table needs a small tweak:

Line 32 of schema_inspector.rs needs to change from:

const TABLE_COLUMNS_QUERY: &str = "SELECT cc.column_name, cc.ordinal_position, cc.data_type, pt.oid
                                   FROM information_schema.columns as cc
                                   JOIN pg_catalog.pg_type as pt
                                   ON cc.udt_name = pt.typname
                                   WHERE cc.table_schema = $1 and cc.table_name = $2
                                   ORDER BY cc.ordinal_position ASC";

to

const TABLE_COLUMNS_QUERY: &str = "SELECT cc.column_name, cc.ordinal_position, cc.data_type, pt.oid
                                   FROM information_schema.columns as cc
                                   JOIN pg_catalog.pg_type as pt
                                   ON cc.udt_name = pt.typname
                                   LEFT JOIN pg_catalog.pg_namespace as pn
                                   ON (
                                        cc.data_type = 'USER-DEFINED'
                                        AND cc.table_schema = pn.nspname
                                        and pn.oid = pt.typnamespace
                                      )
                                   WHERE cc.table_schema = $1 and cc.table_name = $2
                                   AND ( cc.data_type <> 'USER-DEFINED' OR pn.nspname is not null )
                                   ORDER BY cc.ordinal_position ASC";

Mainly, it ensures that the schema name is used when finding the type of a column when the type is user-defined.

In-place data modification

Hi! Is it possible to make it change data in place (without a dump-restore cycle)?
We already have an automated backups-to-staging restoration cycle (Postgres, base backups) and would like to use this tool for in-place data masking.

Fetch tables metadata takes too long

The "Fetch tables metadata..." step takes more time than a full database dump.

I have filters in place for only the 4 tables I need out of 100+, but this still takes longer than I expect.

Add tables exclusion and inclusion at the schema creation level

Currently, when a filter is added to the list of tables, it only filters the data-copying process. The schema needs to respect the filter as well. To do this, we need to add arguments to the pg_dump command invocation.

There are 2 steps that use the pg_dump CLI application:
https://github.com/datanymizer/datanymizer/blob/main/datanymizer_dumper/src/postgres/dumper.rs#L62
and
https://github.com/datanymizer/datanymizer/blob/main/datanymizer_dumper/src/postgres/dumper.rs#L134

We need to add inclusion and exclusion support to it.

  • Including
  • Excluding

Possible to exclude owner/privilege information

Using pg_dump and pg_restore you can exclude owner/privilege information using --no-owner and --no-privileges. Restoring with psql < dump, as documented, doesn't allow this kind of thing. Is this possible without having to manually text-process the dump file?

Set up CI

Set up GitHub Actions to run these commands:

  • check
  • fmt
  • build
  • clippy
  • test
  • audit

Panic on TemplateTransformerTemplate

I am getting a panic when I try to store some values in the key-value store using the example mentioned in the docs.

Config

tables:
  - name: User
    rules:
      phone:
        template:
          format: "{{ _1 }}{{ store_write key='user_names.' ~ prev.profile, value=_1 }}"
          rules:
            - phone:
                format: "+############"
...........

Output

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Msg("Failed to parse 'TemplateTransformerTemplate'"), source: Some(Error { kind: Msg(" --> 1:24\n  |\n1 | {{ _1 }}{{ store_write key='user_names.' ~ prev.profile, value=_1 }}\n  |  ^---\n  |\n  = expected `or`, `and`, `not`, `<=`, `>=`, `<`, `>`, `==`, `!=`, `+`, `-`, `*`, `/`, `%`, a filter, or a variable end (`}}`)"), source: None }) }', datanymizer_engine/src/transformers/template/mod.rs:64:59
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
/bin/bash: line 54:    28 Aborted                 ./bin/pg_datanymizer -f backup.sql -c anonymizer/config.yml $SOURCE_POSTGRES_CONNECTION_STRING --accept_invalid_certs=true

Dump includes an extra value per row for COPY on a table I have configured

It appears to impact fields that I have configuration for:

tables:
  - name: user
    rules:
      name:
        person_name: {}
      birthdate:
        datetime:
          from: 1980-01-01T00:00:00+00:00
          to: 2021-07-24T00:00:00+00:00

When loading the dump I see this error:

COPY 0
ERROR:  invalid input syntax for type timestamp with time zone: "Vanessa Wyman"
CONTEXT:  COPY user, line 1, column birthdate: "Vanessa Wyman"
COPY public.user("id", "firebase_uid", "created_at", "updated_at", "username", "bio", "verified", "flagged", "private_account", "avatar_image_id", "user_role_type", "avatar_foreground_color", "avatar_background_light_color", "avatar_background_dark_color", "name", "birthdate", "last_seen", "google_iap_uid") FROM STDIN;
2afa72c2-f06a-4bcd-88e8-6bac7b32fa80	ANDROID- 17	2021-06-14 19:09:17.508424+00	2021-06-14 19:09:17.508424+00	somename	\N	f	f	f	\N	USER	\N	\N	\N	\N	Vanessa Wyman	2004-01-13T00:14:00+00:00	\N

Make bootstrap in demo dir isn't working?

Discussed in #107

Originally posted by marcelofernandez October 5, 2021
Seems like the demo docker environment isn't working:

marcelo@marcelo:~/dev/datanymizer/demo/$ make bootstrap
[...]
   Compiling nom v5.1.2
   Compiling openssl-sys v0.9.63
error: failed to run custom build command for `openssl-sys v0.9.63`

Caused by:
  process didn't exit successfully: `/usr/src/target/release/build/openssl-sys-9793598f3bd829c9/build-script-main` (exit code: 101)
  --- stdout
  cargo:rustc-cfg=const_fn
  cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_LIB_DIR
  X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_LIB_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_LIB_DIR
  OPENSSL_LIB_DIR unset
  cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_INCLUDE_DIR
  X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_INCLUDE_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_INCLUDE_DIR
  OPENSSL_INCLUDE_DIR unset
  cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_DIR
  X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_DIR
  OPENSSL_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_NO_PKG_CONFIG
  cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS_x86_64-unknown-linux-musl
  cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS_x86_64_unknown_linux_musl
  cargo:rerun-if-env-changed=TARGET_PKG_CONFIG_ALLOW_CROSS
  cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS
  cargo:rerun-if-env-changed=PKG_CONFIG_x86_64-unknown-linux-musl
  cargo:rerun-if-env-changed=PKG_CONFIG_x86_64_unknown_linux_musl
  cargo:rerun-if-env-changed=TARGET_PKG_CONFIG
  cargo:rerun-if-env-changed=PKG_CONFIG
  cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR_x86_64-unknown-linux-musl
  cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR_x86_64_unknown_linux_musl
  cargo:rerun-if-env-changed=TARGET_PKG_CONFIG_SYSROOT_DIR
  cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR
  run pkg_config fail: "pkg-config has not been configured to support cross-compilation.\n\n                Install a sysroot for the target platform and configure it via\n                PKG_CONFIG_SYSROOT_DIR and PKG_CONFIG_PATH, or install a\n                cross-compiling wrapper for pkg-config and set it via\n                PKG_CONFIG environment variable."

  --- stderr
  thread 'main' panicked at '

  Could not find directory of OpenSSL installation, and this `-sys` crate cannot
  proceed without this knowledge. If OpenSSL is installed and this crate had
  trouble finding it,  you can set the `OPENSSL_DIR` environment variable for the
  compilation process.

  Make sure you also have the development packages of openssl installed.
  For example, `libssl-dev` on Ubuntu or `openssl-devel` on Fedora.

  If you're in a situation where you think the directory *should* be found
  automatically, please open a bug at https://github.com/sfackler/rust-openssl
  and include information about your system as well as this message.

  $HOST = x86_64-unknown-linux-gnu
  $TARGET = x86_64-unknown-linux-musl
  openssl-sys = 0.9.63

  ', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/openssl-sys-0.9.63/build/find_normal.rs:174:5
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
error: build failed

I'm trying this out in an Ubuntu 20.04 + Docker 20.10.9~3-0~ubuntu-focal setup, on an x86_64 AMD machine...

Execution order for tables

Is there a way to define an execution sequence for the tables? The values are not written to the store before the read operation starts. Swapping the sequence in the config works, but it is inconsistent and fails after some runs.

tables:
  - name: User
  ...
    template:
      format: "{{ store_read(...) }}"
  ...
  - name: Profile
  ...
    template:
      format: "{{ store_write(...) }}"
  ...

pg_datanymizer does not work in a container

I ran the demo as described in demo/README.md.
I built a container with pg_datanymizer in it and ran it via make dump.
But I get an error:

Error: invalid type: unit value, expected struct FirstNameTransformer

If I comment out the first_name block in dvdrental.yml, like this:

filter:
  only:
    - public.actor
tables:
  - name: actor
    rules:
#      first_name:
#        first_name: ~
      last_name:
        last_name: ~
      last_update:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00

Then I get the same error for last_name:

Error: invalid type: unit value, expected struct LastNameTransformer

After commenting out last_name as well, like this:

filter:
  only:
    - public.actor
tables:
  - name: actor
    rules:
#      first_name:
#        first_name: ~
#      last_name:
#        last_name: ~
      last_update:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00

Everything works fine.

When I build a container with PostgreSQL and download the pg_datanymizer CLI into it, everything works fine!

My Dockerfile for example - Link

Built container on DockerHub - agmayer/pg_datanymizer

Add template references to other transformed values of the same row

Possible syntax (look at tr):

tables:
  - name: users
    rules:
      login:
        username: {}
      preferences:
        template:
          format: '{"some_complex_json": {"field": "{{tr.login}}"}}'

Tera also allows tr["login"].

The name tr means "transformed row". We could use tr_row or transformed_row instead (though that might be too long).

We could also use just row, but what if in the future we decide to use the values of the original row too?

Adding if-conditions for column transformation

It would be good to skip transformations using if-conditions, for example:

phone:
  rules:
    conditions:
      - if: "{% condition statement-1 %}"
        rules:
           - format:
               template: "bla-bla-bla"
      - if: "{% condition statement-2 %}"
        rules:
           - format:
               template: "my cool template"
      - rules:
           - format:
               template: "bla-bla-bla"

Release v0.1.0

What needs to be done for the v0.1.0 release.

Infrastructure:

  • Set up CI (check, fmt, build, clippy, test, audit) (#7);
  • Automate release process (linux, darwin, docker) (#13);
  • Create Homebrew formula;

Improvements:

  • Make uniqueness be available for other transformers (#8);
  • Add tables exclusion and inclusion at the schema creation level (#19);

Malformed query when table has no defined fields

Our database uses timescaledb, and there's a schema called _timescaledb_cache with three tables: cache_inval_bgw_job, cache_inval_extension, and cache_inval_hypertable. For some reason there are no fields defined in these tables.


When running pg_datanymizer against this, it fails with an error that made it hard to know which table was the cause (we have many schemas and tables). Running Rust nightly with RUST_BACKTRACE=full allowed me to see the offending line, and I modified pg_datanymizer a bit to print out the offending query.

    (prior output deleted)

[Dumping: _timescaledb_internal._hyper_9_327_chunk] [|##################################################|] 0 of 0 rows [100%] (0s)
[Dumping: _timescaledb_internal._hyper_9_327_chunk] Finished in 0 seconds
[Dumping: _timescaledb_internal._hyper_15_976_chunk] [|##################################################|] 68288 of 68288 rows [100%] (0s)
[Dumping: _timescaledb_internal._hyper_15_976_chunk] Finished in 2 seconds
Error: db error: ERROR: syntax error at or near ")"

Caused by:
ERROR: syntax error at or near ")"

Stack backtrace:
0: std::backtrace_rs::backtrace::libunwind::trace
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: std::backtrace_rs::backtrace::trace_unsynchronized
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: std::backtrace::Backtrace::create
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/backtrace.rs:328:13
3: std::backtrace::Backtrace::capture
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/backtrace.rs:296:9
4: anyhow::error::<impl core::convert::From for anyhow::Error>::from
at /Users/beynon/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.33/src/error.rs:464:25
5: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Resultcore::convert::Infallible,E>>::from_residual
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/core/src/result.rs:2052:27
6: datanymizer_dumper::postgres::dumper::PgDumper<W,I>::dump_table
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/postgres/dumper.rs:109:26
7: <datanymizer_dumper::postgres::dumper::PgDumper<W,I> as datanymizer_dumper::Dumper>::data
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/postgres/dumper.rs:173:17
8: datanymizer_dumper::Dumper::dump
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/lib.rs:21:9
9: pg_datanymizer::app::App::run
at /Users/beynon/work/datanymizer/cli/pg_datanymizer/src/app.rs:34:31
10: pg_datanymizer::main
at /Users/beynon/work/datanymizer/cli/pg_datanymizer/src/main.rs:13:5

One of the offending queries is:

COPY "_timescaledb_cache"."cache_inval_hypertable"() TO STDOUT

The solution seems to be that if the column list is empty, the parentheses must be omitted:

COPY "_timescaledb_cache"."cache_inval_hypertable" TO STDOUT

I can easily skip these tables with an "except:" in the config.yaml, but then the pg_dump output is missing a table that may still be required by timescaledb to function properly.

Create example for pg_datanymizer

Create an example that introduces how to get started with the CLI tool pg_datanymizer.
Set up an environment with docker-compose, a test database dump, and an example config file.

Postgres Dump missing Create statements

When I run pg_datanymizer against a GCP-managed SQL instance, I only get a data dump; it does not include any CREATE statements for the relations.

Running pg_dump directly against the same database with the same credentials gives a full dump as expected.

The command I am running is
pg_datanymizer db --config /app/config.yml --host cloud-sql-proxy --username <redacted> --password <redacted> --file /app/dump.sql --port 5432

The config just has table rules, no filters or anything like that.

Would appreciate any guidance on this.

RUSTSEC-2020-0146: arr! macro erases lifetimes

arr! macro erases lifetimes

Details:
Package: generic-array
Version: 0.12.3
URL: fizyk20/generic-array#98
Date: 2020-04-09
Patched versions: >=0.14.0
Unaffected versions: <0.8.0

Affected versions of this crate allowed unsoundly extending
lifetimes using arr! macro. This may result in a variety of
memory corruption scenarios, most likely use-after-free.

See advisory page for additional details.

datetime format issues

      birth_month: 
        datetime:
          format: '%m'

throws error Error: Failed transform trailing input

      birth_month: 
        datetime:
          format: %m

throws error Error: unexpected character: '%' at line 68 column 19 in config.yaml

SSL support

I am trying to use datanymizer in an environment that requires an SSL connection (Heroku Postgres). It fails with an error:

$ docker run --rm -v /Users/f213/prj/education/dev-db:/app -w /app datanymizer/pg_datanymizer "postgres://user:pwd@<REDACTED>.eu-west-1.compute.amazonaws.com:5432/<REDACTED>?sslmode=require"
Error: error performing TLS handshake: no TLS implementation configured

Caused by:
    no TLS implementation configured

Any chance this great project will support SSL?

Sharing information between tables

As mentioned in the readme, there is a way to share information between rows. I was wondering: is it also possible to share information between tables? My use case needs to keep a value consistent between table1 and table2.

For example:
I have 2 tables named Table1 and Table2. Both of them have an emailId field, and I want to anonymize them while keeping the values consistent. Is that possible?
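A possible approach, sketched from the key-value store feature described in the readme (it assumes the rows in the two tables can be matched by a shared id; the table and column names follow the question):

tables:
  - name: Table1
    rules:
      emailId:
        template:
          # anonymize the email and remember it under the row's id
          format: "{{ _1 }}{{ store_write(key='emails.' ~ prev.id, value=_1) }}"
          rules:
            - email:
                kind: Safe
  - name: Table2
    rules:
      emailId:
        template:
          # reuse the value stored while dumping Table1
          format: "{{ store_read(key='emails.' ~ prev.id) }}"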

Issue with reading from key value store

Hi @evgeniy-r, I am facing some issues while using the key-value store. Please correct me if I am using it wrong.
User is connected to Profile with a foreign key in the User table. So I am using profile_id as the key while storing the data in the store, and while retrieving I am just using the id from the Profile table.

Config

tables:
  - name: User
    rules:
      phone:
        template:
          format: "{{ _1 }}{{ store_write(key='user_phonenumbers.' ~ prev.profile_id, value=_1) }}"
          rules:
            - phone:
                format: "+############"
                uniq: true
  - name: Profile
    rules:
      phone:
        # reading phone numbers from `user_phonenumbers` stored in `User`
        template:
          format: "{{ store_read(key='user_phonenumbers.' ~ prev.id) }}"

Output

...
Prepare to dump table: default$default.Profile
Error: Failed transform Failed to render 'TemplateTransformerTemplate'
ERROR: Job failed: exit code 1
FATAL: exit code 1         

Just to confirm whether the template gives an error on reading, I tried removing the template under Profile and replacing it with phone to randomize the value, and it works that way. So I am quite sure that something is wrong - either with how I am implementing it or with the anonymizer.

Error: no matching tables were found

@evgeniy-r I am facing an issue while using the CLI with my DB. I am not sure whether I am using the config correctly; can you help me rectify the issue?

Thank you in advance.

Background

I have a database with multiple schemas and I want to copy over the schema default$default and the tables under that schema. Of course, I also anonymize certain fields in the tables.

config.yml

tables:
  - name: default$default.Profile
    rules:
      firstName:
        first_name: {}
        ...
        ...
filter:
  schema:
    only:
      - default$default

Output

$ ./bin/pg_datanymizer -f backup.sql -c anonymizer/config.yml --accept_invalid_certs $POSTGRES_CONNECTION_STRING
Prepare data scheme...
pg_dump error. Command:
pg_dump --section pre-data -t default$default <POSTGRES_CONNECTION_STRING>
Output:
pg_dump: error: no matching tables were found

Feature Request: wildcards in filters

Currently, I have hundreds of tables in my database, all under different schemas.
I want to anonymize around 5-10 of my tables. I am able to achieve this using the following config:

.....
filter:
  data:
    only:
      - "default$default.Profile"
      ...

Potentially, this creates another issue, as it copies over the other schemas, which I don't want to copy.

When I try to use the following config:

.....
filter:
  schema:
    only:
      - "default$default.Profile"
      ...

Only the selected tables get copied, but not their dependencies, e.g. ENUMs used in those tables.

So I would like to request a feature in which filter accepts the wildcard * for anything under a schema, something like:

filter:
  schema:
    only:
      - "default$default.*"
    except:
      - "public.*"

Thank you in advance.

Failed after schema change.

DB:

create table "user"
(
	id serial,
	first_name text,
	last_name text,
	email_field text
);

insert into "user" (first_name, last_name, email_field) VALUES ('A', 'B', '[email protected]');
alter table "user" drop column last_name;

Config:

tables:
  - name: "user"
    rules:
      email_field:
        email:
          kind: Safe

Error:

thread 'main' panicked at 'index out of bounds: the len is 3 but the index is 3', datanymizer_engine/src/engine.rs:40:25

Add the limit when dumping data

Possible config syntax:

tables:
  - users:
    # per table
    limit: 10000
    # ........................
default:
  # global
  limit: 1000

We can also think about more flexible customization - specifying an SQL query for the table.

Malformed query repeats one field multiple times

We have a table with 30 fields.

This fails and datanymizer exits because an internal untransformed query for the dump repeats the event_type field 3 times in the COPY:

None ret from qw.copy_out(untransformed_query=

"COPY "metrics"."deployment"("deployment_id", "hypervisor_key", "specification_id", "vmware_deployment_target_id",
     "vcenter_url", "vcenter_cluster", "vcenter_folder", "clone_source_id", "event_type", "event_type", "event_type",
     "org_name", "org_id", "mto_node_id", "mto_node_path", "mto_node_org_path", "academy_event_id", "package_id",
     "package_estimated_duration_in_seconds", "course_plan_id", "deployer_user_id", "user_full_name", "user_email",
     "warden_event_id", "created_at", "last_updated", "project_id", "project_org_id", "project_org_name",
     "project_mto_node_id", "project_mto_node_path", "project_mto_node_org_path") TO STDOUT
"),

table=metrics.deployment,
count=0,
error=Error {
    kind: Db,
    cause: Some(DbError {
        severity: "ERROR",
        parsed_severity: Some(Error),
        code: SqlState(E42701),
        message: "column \"event_type\" specified more than once",
        detail: None, hint: None, position: None, where_: None, schema: None, table: None, column: None,
        datatype: None, constraint: None, file: Some("copy.c"), line: Some(5155),
        routine: Some("CopyGetAttnums")
    })
}

Working now to understand why and fix it.
