
pynonymizer's Introduction

The Encyclopedia of Me 👋

I'm a developer with a colourful background in arts and technology, and a lot of experience to boot. This gives me some unique input on both sides of the creative/technical equation.

At my core, I am an engineering obsessive and I love learning how things work.

I love writing little dumb tools and exploring programming meta through new languages.

Facts

[
    "A camel can shut its nostrils in a sand storm.",
    "Flamingos produce milk.",
    "Human thigh bones are stronger than concrete.",
    "A Polar Bear can eat up to 86 penguins in one sitting."
]

pynonymizer's People

Contributors

rwnx


pynonymizer's Issues

Add compact syntax for `literal`

The literal column strategy is quite useful, and a great way to break out and drop into field-level SQL. I'd like to make this even easier to use by not requiring the verbose syntax.

It occurred to me that since subqueries require brackets around them in most providers anyway (and don't conflict with any other \w+ patterns), they'd be a distinct shorthand for literal, i.e.:

tables:
  customer:
    columns:
      name: (SELECT foo FROM bar WHERE x = y)
      modified_on: ( NOW() )
      login: unique_login

There are cases where you might not want the brackets, but I feel that's handled by gracefully falling back to the verbose syntax.
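
For reference, this is the verbose form that the compact brackets would stand in for, using the existing literal column strategy (same columns as above):

tables:
  customer:
    columns:
      name:
        type: literal
        value: (SELECT foo FROM bar WHERE x = y)
      modified_on:
        type: literal
        value: ( NOW() )
      login: unique_login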

"ERROR: could not identify an equality operator for type json" thrown when attempting to anonymize a postgres table that contains json columns

Describe the bug
When attempting to run pynonymizer against a postgres table that includes json columns, an error:

ERROR: could not identify an equality operator for type json

is thrown.

This happens even if the strategy only targets non-json columns: as long as the table contains a column of type json, the error will be thrown.

To Reproduce
Steps to reproduce the behavior:

Run pynonymizer against a postgres table with any json column. Even if you aren't anonymizing that particular json column, the error will be thrown.

An example strategyfile snippet:

tables:
  user:
    columns:
      friendly_name:
        type: fake_update
        fake_type: user_name
      phone_number:
        type: fake_update
        fake_type: phone_number

There is an additional column, lead_score, of type jsonb in the user table that is not mentioned in the strategyfile, but it still seems to trigger the "could not identify an equality operator for type json" error.

Expected behavior
A clear and concise description of what you expected to happen.

I expect that when anonymizing a table, I should be able to anonymize the other columns regardless of whether the table also contains a column of type json.

Additional context
Ran against a postgres 11.11 DB with the latest version of pynonymizer, installed via pip install pynonymizer.

Also, thank you very much for creating this tool; it has been very useful.

Not able to support randomizing a jsonb type

Describe the bug
Trying to randomize a specific portion of a json field, but there is no specific expression type to declare it as.

To Reproduce
Steps to reproduce the behavior:
In the .yaml file, have this:

  table:
    columns:
      params: 
        type: literal
        value: CASE WHEN params->>'email' ~ LOWER('website.com[[:>:]]') THEN params->>'email' ELSE CONCAT(DIGEST(LOWER(params->>'email'), 'sha1'), '@websitefaker.com') END

The structure of the table is:

     Column      |            Type             |                           Modifiers
-----------------+-----------------------------+---------------------------------------------------------------
 blob_random     | integer                     | not null default nextval('table_name_id_seq'::regclass)
 random_id       | text                        | not null
 completion_time | timestamp without time zone | default now()
 params          | jsonb                       |

The params field is a whole blob of json fields, e.g.:

{
  "blob": 173467,
  "blobby": 20274,
  "service": "random",
  "owners": [
    "[email protected]"
  ]
}

When trying to anonymize, I get ERROR: column "params" is of type jsonb but expression is of type text

Expected behavior
A clear and concise description of what you expected to happen.

The value output should change only the specified params (tested this on the local database itself, so it should work here)
Ran a SELECT CASE WHEN params->>'email' ~ LOWER('website.com[[:>:]]') THEN params->>'email' ELSE CONCAT(DIGEST(LOWER(params->>'email'), 'sha1'), '@websitefaker.com') END FROM table_name

So the output should be:

{
  "blob": 173467,
  "blobby": 20274,
  "service": "random",
  "owners": [
    "[email protected]"
  ]
}
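
Not a verified fix, but one way to sidestep the text/jsonb mismatch might be to keep the whole literal expression in jsonb, for example with jsonb_set. This is only a sketch: it rewrites the email key unconditionally, drops the CASE/word-boundary check for brevity, and uses Postgres's built-in MD5 instead of pgcrypto's DIGEST.

tables:
  table_name:
    columns:
      params:
        type: literal
        # jsonb_set returns jsonb, so the expression type matches the column type
        value: jsonb_set(params, '{email}', to_jsonb(CONCAT(MD5(LOWER(params->>'email')), '@websitefaker.com')))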

panic: runtime error: invalid memory address or nil pointer dereference

Describe the bug
Since today I have been getting the following error message. I see that it's now using:
successfully installed faker-4.17.0 pynonymizer-1.10.1 python-dateutil-2.8.1 python-dotenv-0.15.0 pyyaml-5.3.1 six-1.15.0 text-unidecode-1.3 tqdm-4.52.0

before it was:

Successfully installed faker-4.16.0 pynonymizer-1.10.1 python-dateutil-2.8.1 python-dotenv-0.15.0 pyyaml-5.3.1 six-1.15.0 text-unidecode-1.3 tqdm-4.52.0
I have done this for months:

pynonymizer --db-user "$SQL_USER_PROD" --db-password "$SQL_PASS_PROD" --db-host "127.0.0.1" --db-port "3306" --db-name "temp" --start-at ANONYMIZE_DB --stop-at ANONYMIZE_DB --strategy ./strategy.yaml

Today it raised this error which seems related to the faker upgrade:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x804a0ac]
goroutine 59 [running]:
runtime/internal/atomic.Xadd64(0xa47a0b4, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/runtime/internal/atomic/asm_386.s:105 +0xc
github.com/GoogleCloudPlatform/cloudsql-proxy/proxy/proxy.(*Client).handleConn(0xa47a060, 0xff93fb8e, 0x34, 0x868f080, 0xa6ee000)
	/workspace/proxy/proxy/client.go:129 +0x50
created by github.com/GoogleCloudPlatform/cloudsql-proxy/proxy/proxy.(*Client).Run
	/workspace/proxy/proxy/client.go:120 +0x99
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11
Failed to connect to database.

What does GET_SOURCE step do and is there a way/step to just test if pynonymizer can connect to the mysql server

Query:
I am trying to figure out if there is a way to test connectivity with a db server like MySQL without performing any actual destructive step like CREATE_DB or ANONYMIZE_DB.
I tried using the START step, but it does not check whether the connection was actually set up. I also tried the GET_SOURCE step, but that does not help either: I even tried invalid credentials to confirm the connection would fail, but there was no error output.
So what does GET_SOURCE actually do?
I looked at the provider code and did not find any reference to it.

Feature request:
Would it make sense to add a step to check connectivity, or does it already exist?
This would be useful when you are testing against different servers and want to play it safe by making sure connectivity is good before actually performing a destructive operation.

Thanks for the tool. Hoping for a quick response.

Generate default strategy file to manage changing database schemas and not fail due to schema changes

Is your feature request related to a problem? Please describe.
Most systems will have a dynamically changing database as features are added to whatever application is being used.
This requires the strategy files to be manually updated each time data-sensitive columns are added, updated, or removed.
If a manual change to the strategy files is not made, the data masking tool could break due to the column changes.
Tested this by adding an unknown table to a strategy file and it failed:

ERROR 1146 (42S02) at line 1: Table 'services.userscool' doesn't exist

where userscool is a table that does not exist in the services db.

Describe the solution you'd like
Temporary solution:

  1. Possibly not throw an exception if it cannot find a table/column. This will ensure that the tool can anonymize whatever valid tables and columns exist, and possibly give a report at the end that the following tables/columns could not be anonymized since they could not be found.

Permanent solution:
a. Add the ability for pynonymizer to generate a default strategy file for the database, so that any new schema changes will be captured in that default strategy file. It could also possibly check an exclusion file that contains tables/columns to be ignored.
b. Compare the default strategy file with the existing strategy file and reconcile the difference between the two, treating the default strategy file as the source of truth.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Improve Programmatic Invocation: convenience methods/interface

Is your feature request related to a problem? Please describe.
At present, pynonymizer only has one way to be invoked, and it is analogous to the CLI, so it has a lot of flags and kwargs to set.

Describe the solution you'd like
The main ways of invoking pynonymizer should be covered with convenience methods.

When I say "main ways" I'm referring to common workflows, like:

  • restore database
  • anonymize database
  • safe anonymize (i.e. do not drop)

This is quite an abstract request, but I think we recognise pynonymizer can do more for you if you're not using the CLI!

Broken pipe when using large dump files

Describe the bug
A clear and concise description of what the bug is.
The pynonymizer app crashes automatically if we use a big file, e.g. >15GB.

To Reproduce
Steps to reproduce the behavior:
Running the program with a large file will output the below:

  File "/home/ubuntu/.local/lib/python3.8/site-packages/pynonymizer/pynonymize.py", line 140, in pynonymize
    db_provider.restore_database(input_path)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pynonymizer/database/postgres/__init__.py", line 199, in restore_database
    batch_processor.close()

Expected behavior
A clear and concise description of what you expected to happen.
It should act as normal. This works fine with a file that's <5GB.

Truncate with foreign key checks

Is your feature request related to a problem? Please describe.
It's currently not possible to truncate a table and also delete all data associated with it. We would like to truncate, e.g., a customer table, and we also need to remove all associated data in other tables.

Describe the solution you'd like
Provide an additional table strategy which truncates with foreign key checks enabled.

Describe alternatives you've considered
none

Additional context
none

Any way to escape `:` in a yaml file

For example, if the strategy file is like:

tables:
  user_session: truncate
scripts:
  after:
    - UPDATE test_lead SET details = ' "test": false, "brokerName": "dd"'

it errors with:
yaml.scanner.ScannerError: while scanning a quoted scalar
in "test.yaml", line 5, column 82
found unexpected end of stream
in "test.yaml", line 5, column 83

If I remove the : then it works. Is there a way to escape it and allow the SQL update to use : in a column value?

Thanks
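
A sketch of one way to write that script entry so the YAML parser doesn't trip over the colon and quotes is to use a block scalar (double-quoting the whole statement and escaping the inner quotes would also work):

scripts:
  after:
    # a block scalar takes the text verbatim, so ':' and quotes need no escaping
    - >-
      UPDATE test_lead SET details = ' "test": false, "brokerName": "dd"'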

Additional Mysql connection parameters

Hi,

I would like to use the pynonymizer tool to connect to AWS RDS MySQL instances.
But for a MySQL RDS connection, we need to set MySQL SSL connection parameters.
Currently, pynonymizer provides a way to pass additional MySQL dump parameters via "additional_opts" and "dump_opts" (in MySqlDumpRunner).
But for MySQL query execution, it's not possible to provide any additional options, so we cannot configure SSL for the MySQL connection.
Maybe you could add to the MySqlCmdRunner class the same additional options as in the MySqlDumpRunner class (https://github.com/jerometwell/pynonymizer/blob/02593b18e6f2ae3b173689d518e138d002551e03/pynonymizer/database/mysql/execution.py#L16).

set column to NULL instead of '' (Empty)

Currently I can only set a Column to '' (Empty String) with UpdateColumnStrategyTypes.EMPTY.

For my Hibernate application, I would need to set a column to NULL, since NULL is different from '' in Hibernate validation.

It would be great if an additional column update strategy could be implemented, e.g. UpdateColumnStrategyTypes.NULL?

Regards,
M.
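
Until a dedicated strategy exists, one possible stopgap (assuming the literal strategy passes its value through verbatim; table/column names below are placeholders) is to emit SQL NULL via a literal:

tables:
  customer:
    columns:
      optional_field:
        type: literal
        # quoted so YAML keeps the string NULL instead of parsing a null value
        value: "NULL"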

MSSQL [Feature request]: Using remote SQL when restore and backup steps are skipped

We are using pynonymizer to anonymize data in a customer's MS SQL database before it is transferred to our infrastructure for support/development purposes. We cannot do that in our infrastructure because of GDPR. We cannot install Python on the customer's server, but we have access to the SQL Server through a VPN.

The solution is to not force the use of the local server when only the anonymization step is run, i.e. when using

start_at_step='ANONYMIZE_DB',stop_at_step='ANONYMIZE_DB'

I tested that successfully when I removed the check for db_host being None and removed the db_host = "(local)" fallback (for testing purposes, without any condition on whether only the ANONYMIZE_DB step is running).

Deprecate python 3.6, add 3.10 tests/support

Python36 is out of support, let's do some housekeeping!

  • Remove python36 tests
  • remove python36 tags from setup.py
  • Add python310 tests
  • add python310 tags to setup.py
  • remove version checks for 3.6 since all features will be supported

help with dob

Hi, I'm trying to create a db strategy yml file where the generated dob should be between certain dates, because they need to be 18 years of age or above.
Any help would be much appreciated.
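
If your pynonymizer version supports passing keyword arguments through to the Faker generator (fake_args), something along these lines might work; this is a sketch with placeholder table/column names, not a confirmed recipe:

tables:
  customer:
    columns:
      dob:
        type: fake_update
        fake_type: date_of_birth
        # forwarded to Faker's date_of_birth(minimum_age=..., maximum_age=...)
        fake_args:
          minimum_age: 18
          maximum_age: 90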

literal statement doesn't allow adding a field in the where clause

Hi Jerome,

In another question on your repo you mentioned how to define a table relation with a literal type, to update a field with an already anonymized field value from another table.

I was happy at first, but didn't check the result quite well.
It seems that this solution always returns the same value from the table instead of using the field value in the where clause.

Steps to reproduce the behavior:

  1. Add this in your strategy file
tables:
  - table_name: Customer
    schema: dbo
    type: update_columns
    columns:
      Name: company
  - table_name: Sales Header
    schema: dbo
    type: update_columns
    columns:
      Sell-to Customer Name:
        type: literal
        value: (SELECT TOP 1 [Customer].Name FROM [Customer], [Sales Header] WHERE [Customer].No_=[Sales Header].[Sell-to Customer No_])
  2. Run pynonymizer

Expected behavior
The field Sell-to Customer Name should be filled with the related Name.
But this is the result:
(screenshot omitted)

Other Scenario
I changed the value in the strategy file in the hope that the statement would use the [Sell-to Customer No_] value in the query when running:

tables:
...
      Sell-to Customer Name:
        type: literal
        value: (SELECT TOP 1 [Customer].Name FROM [Customer] WHERE [Customer].[No_]=[Sell-to Customer No_])

When running pynonymizer with this literal, this is the error in the command line:

pyodbc.Error: ('HY007', '[HY007] [Microsoft][ODBC SQL Server Driver]Associated statement is not prepared (0) (SQLNumResultCols)')

Regards,
Job

No way to set sql_require_primary_key=0 for seeding

Describe the bug
Currently there is no way of setting sql_require_primary_key for the database seeding step. For the anonymization step I used the before script, but for seeding this is not possible. Is there a way to give this input somewhere, or can it be added?

I only see:

[MYSQL] pass additional arguments to the
                        restore process (advanced use only!).
                        [$PYNONYMIZER_MYSQL_CMD_OPTS]

To Reproduce
Use a DigitalOcean managed MySQL database or manually set sql_require_primary_key=1 in a MySQL 8 database.

Expected behavior
It would be nice to have a way of setting this variable in the session.

Error:

[ANONYMIZE_DB]
creating seed table with 1 columns
ERROR 3750 (HY000) at line 1: Unable to create or change a table without a primary key, when the system variable 'sql_require_primary_key' is set. Add a primary key to the table or unset this variable to avoid this message. Note that tables without a primary key can cause performance problems in row-based replication, so please consult your DBA before changing this setting.
Traceback (most recent call last):
  File "/usr/local/bin/pynonymizer", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/site-packages/pynonymizer/cli.py", line 177, in cli

Generate random string values to not encounter duplicate entry issues

Is your feature request related to a problem? Please describe.
Currently, when you try to anonymize a column, say sys_id, of type VARCHAR, and it has duplicate foreign keys, pynonymizer throws an error:

Duplicate entry '0efg4t5fs5d-test1' for key 'api_product_id_sys_id'

where api_product_id (= 0yb6h5jfs5d) is the foreign key
and sys_id (= test1) is the column to be updated,

when using a strategy file like the one below:

tables:
  system:
    columns:
      sys_id: ( 'test1' )

How can I specify in the strategy file to generate random string values without using the faker library generators?
The faker library does not guarantee unique values: when I tried with, say, the word faker generator, it still gave me an error and failed:

ERROR 1050 (42S01) at line 1: Table '_pynonymizer_seed_fake_data' already exists

And for a large number of rows, the faker library generators add overhead (since they need to generate the value and insert it), slowing down the anonymization process.
I know that there is a RAND() function, but that generates a value from 0 to 1. Can that be used to generate a range of values like

test1
test2
test3

or

sdfsd
evbrr
dfdgd

And why does pynonymizer check the combination of the foreign key and the column to be updated, instead of the primary key and the column to be updated?

Describe the solution you'd like

Have either a way to combine a string value and the RAND() function, like ( 'test' + RAND() ), or a random string generator where you could specify the length of the string, like ( RAND_STR(5) ) where 5 is the length of the string.
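
A hedged workaround with the current tool is to lean on MySQL expressions through the same bracket syntax used above, since UUID() and RAND() are evaluated per row; this is a sketch, with no guarantee that short random strings stay collision-free:

tables:
  system:
    columns:
      # full 36-character UUID string, effectively unique per row
      sys_id: ( UUID() )
      # or a fixed-length random (not guaranteed unique) hex string, 5 chars here:
      # sys_id: ( SUBSTRING(MD5(RAND()), 1, 5) )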

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

RFC: Why is locale a CLI option

It seems to me that Faker's locale setting is intrinsically linked to the anonymization fields, so should it be placed in the strategyfile with the rest of the schema/fields config?

It seems like the only reason you'd keep them separate is if you wanted to use the strategyfile with different locales, which is a contradiction because the strategyfile is bound directly to the database (at least from the usages I've seen).

I'd like to make it clear that the strategyfile is for the anonymization process while the cli is for controlling the tool's connection and provider behaviour. It is NOT just extended config.

I was thinking of something like a top-level locale key:

locale: en-GB

tables:
  customers:
    

Discussion pls

Unable to complete execution of default process control

Describe the bug
The default process control fails at the anonymizing step with an error that the db_schema.table does not exist.
But if I run each step individually after creating the db manually, i.e. if I call pynonymizer.run

pynonymizer.run(input_path="core_db.sql", strategyfile_path="strat.yaml", output_path="anonimize.sql", db_name='core_db', db_user="test", db_password="test", verbose=True, only_step='RESTORE_DB')

pynonymizer.run(input_path="core_db.sql", strategyfile_path="strat.yaml", output_path="anonimize.sql", db_name='core_db', db_user="test", db_password="test", verbose=True, only_step='ANONYMIZE_DB')

i.e. separately, there are no errors and the respective columns get anonymized.

To Reproduce

  1. Create a python file with contents
import pynonymizer

pynonymizer.run(input_path="core_db.sql", strategyfile_path="strat.yaml",
                output_path="anonimize.sql",db_name='core_db',db_user="test",db_password="test",verbose=True)

contents of strat.yaml

tables:
  user:
    columns:
      first_name: ( RAND() )
      last_name: ( RAND() )
  2. Execute the python file

Actual behavior

mysql: [Warning] Using a password on the command line interface can be insecure.
Restoring: 100%|██████████| 233k/233k [00:00<00:00, 658kB/s]
["UPDATE `user` SET `first_name` = (''),`last_name` = ('');"]
["UPDATE `user` SET `first_name` = (''),`last_name` = ('');"]
Anonymizing user:   0%|          | 0/1 [00:00<?, ?it/s]    mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 1146 (42S02) at line 1: Table 'core_db.user' doesn't exist
Anonymizing user:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/test/Documents/DataProcessor/tools/pnonymizer/anonimize.py", line 3, in <module>
    pynonymizer.run(input_path="core_db.sql", strategyfile_path="strat.yaml",
  File "/Users/test/Documents/DataProcessor/venv/lib/python3.9/site-packages/pynonymizer/pynonymize.py", line 147, in pynonymize
    db_provider.anonymize_database(strategy)
  File "/Users/test/Documents/DataProcessor/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/__init__.py", line 159, in anonymize_database
    self.__runner.db_execute(statements)
  File "/Users/test/Documents/DataProcessor/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/execution.py", line 131, in db_execute
    self.__mask_subprocess_error(error)
  File "/Users/test/Documents/DataProcessor/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/execution.py", line 81, in __mask_subprocess_error
    raise error from None
  File "/Users/test/Documents/DataProcessor/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/execution.py", line 124, in db_execute
    subprocess.check_output(
  File "/usr/local/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/local/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mysql', '-h', '127.0.0.1', '-P', '3306', '-u', 'test', '-p******']' returned non-zero exit status 1.

Expected behavior
All the steps are executed without any error

Additional context
Add any other context about the problem here.

Improve mssql ODBC driver selection

At the moment the mssql driver auto-selection process just picks the first one that matches sql server.

Based on our understanding in #62

Prefer ODBC Driver 17 for SQL Server, possibly pattern matching to achieve the "latest" and then apply an ordering, at which point we can fall back to the old behaviour.

This needs some additional research and understanding into the ODBC drivers available on different systems.

Error with v1.18.0: `NameError: name 'logger' is not defined`

Describe the bug
With v1.18.0 I get an error when I want to run it:

Traceback (most recent call last):
  File "/usr/local/bin/pynonymizer", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/site-packages/pynonymizer/cli.py", line 161, in cli
    _warn_deprecated_env("DB_HOST", "PYNONYMIZER_DB_HOST")
  File "/usr/local/lib/python3.9/site-packages/pynonymizer/cli.py", line 123, in _warn_deprecated_env
    logger.warning("Environmental var $%s is deprecated. Use $%s", old_env, new_env)
NameError: name 'logger' is not defined

To Reproduce
My Dockerfile looks like this

FROM python:3.9-slim

RUN pip install pynonymizer \
    && apt-get update \
    && apt-get install default-mysql-client -y

Expected behavior
It should work like it did with version 1.17.0, I guess.

Define Table Relations (Feature request)

SQL database with redundant data
We have a database with redundant data.
For example there is a Customer table and a Sales Header table.
Both tables contain the fields Customer Name, Customer Address, ...

Anonymization with table relations
While anonymization is going on, it could be (very) useful to define a table relation like this:

tables:
  Customer:
    schema: dbo
    columns:
      Name: first_name
      Address: street_address
  Sales Header:
    schema: dbo
    columns:
      Customer No: [Relation: Customer.No]
      Customer Name: [Customer.Name]
      Address:  [Customer.Address]

So after processing the Customer table, we could then retrieve the already anonymized data from this table and apply it to the next ones.
The first column defines the table relation; the following columns are filled with the anonymized data from the related record.

Add badges

The old project had some neat badges!

(badge screenshot omitted)

Can we have them here, too?

Support for Third Party Faker Providers

Is your feature request related to a problem? Please describe.
I've noticed the documentation mentions that:
"You can specify a fake update with any default Faker provider."
Is it possible to add a way to utilise third-party/community Faker providers?

Describe the solution you'd like
Support for third-party/community Faker providers.

Describe alternatives you've considered
There doesn't seem to be an alternative currently.

Additional context
NA

Change/Ignore exception handling during anonymization

Possibly not throw an exception if it cannot find a table/column. This will ensure that the tool can anonymize whatever valid tables and columns exist, and possibly give a report at the end that the following tables/columns could not be anonymized since they could not be found.

Leading from #94, a missing or changed table name can cause anonymization to fail half-way. It would be better if, during the anonymization step, pynonymizer attempted to "do everything it can" even if the resulting exit code is still !=0.

In addition to this, adding an option to ignore some errors during anonymization e.g. --no-fail / --ignore-anonymization-errors (TBC) that would allow users to explicitly override the failure behaviour would go a long way!

Improve Programmatic Invocation: native python inputs

Is your feature request related to a problem? Please describe.
The current interface is based on the CLI, so it only accepts filepaths (or stdin). Really that should be extended to support python file inputs. There are other options that could benefit from being made pythonic.

Describe the solution you'd like
I like the way this is handled in click-style CLIs, where all files are abstracted into a readable interface. This might pose some problems with the existing gzip reading scenario, but I'm convinced we can get this in as a minimum.

MSSQL: All lines in table are anonymized to same values

Using only the anonymize step to anonymize an existing DB. In tables where the row count is low, the records are anonymized correctly. If the row count is higher (e.g. 1000), all records have the same values in the anonymized columns.

I would expect all tables to have random values in all rows.

Repro steps: none yet (this is an anonymization of a customer's database). Before I start to play with repro steps, I want to know if there is a known issue here, so as not to spend time on something known.

Have used:

pynonymizer.run(input_path="",output_path="",strategyfile_path="./strategy.yml", db_server=".",db_type="mssql", db_name="anon",db_user="myuser",db_password='mypwd',fake_locale="cs-CZ",dry_run=False,verbose=False,start_at_step='ANONYMIZE_DB',stop_at_step='ANONYMIZE_DB')

Allow uniqueness for any type, not just emails and usernames

Issue

Something I've run into while adding pynonymizer into my company's workflow is the inability to have table-level unique fake data. It's possible for emails and usernames, but not for anything else (like random strings/pystr for instance).

Concrete example: we have a phone_numbers table with a column number: unique varchar<10>. It's currently not possible to fill this column with fake, unique number strings. If you use pystr then you will sometimes run into unique constraint violations (and assuming the table is larger than the seed data set you may always run into issues).

For columns without length constraints you can kind of hack it together by using unique_email or unique_login, but there are a couple of limitations:

  • Things that expect a phone number will be broken at runtime
  • The addition of the character limit breaks this solution

Solution I'd like

Ideally, being able to provide an argument like unique: true to a column strategy. If this is more difficult to implement, expanding unique_email and unique_login to include more types might be sufficient for most use cases as well, although it would be much more useful to be able to pass arguments to the fakes as well as guaranteeing their uniqueness.

Not sure of any workable alternatives, besides maybe changing your schema to remove the unique constraint or allow nulls, but neither seems like an ideal solution to the problem.
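
To make the proposal concrete, the strategyfile shape being suggested might look something like this (proposed syntax only, not currently supported):

tables:
  phone_numbers:
    columns:
      number:
        type: fake_update
        fake_type: pystr
        # proposed flag: enforce table-level uniqueness for any fake_type
        unique: true
        # ideally combined with arguments passed to the faker, e.g. a length cap
        # fake_args:
        #   max_chars: 10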

Anonymizer process control does not work

Describe the bug
When I run pynonymizer using its process control system, it fails to go through the default process control flow. As per pynonymize.py, it is supposed to go through the flow below, starting with CREATE_DB, as per my understanding:

    logger.info(actions.summary(ProcessSteps.CREATE_DB))
    if not actions.skipped(ProcessSteps.CREATE_DB):
        db_provider.create_database()

    logger.info(actions.summary(ProcessSteps.RESTORE_DB))
    if not actions.skipped(ProcessSteps.RESTORE_DB):
        db_provider.restore_database(input_path)

    logger.info(actions.summary(ProcessSteps.ANONYMIZE_DB))
    if not actions.skipped(ProcessSteps.ANONYMIZE_DB):
        db_provider.anonymize_database(strategy)

    logger.info(actions.summary(ProcessSteps.DUMP_DB))
    if not actions.skipped(ProcessSteps.DUMP_DB):
        db_provider.dump_database(output_path)

    logger.info(actions.summary(ProcessSteps.DROP_DB))
    if not actions.skipped(ProcessSteps.DROP_DB):
        db_provider.drop_database()

But in reality, when I run it as follows, it does not create the db and fails.

To Reproduce
Issue 1:

 pynonymizer.run(input_path="main_sys.sql", strategyfile_path="strategy_file1.yaml",
                        db_host='< host >', db_name = 'main_sys', db_password='<password>', output_path='main_sys_anonymized.sql')

Does this imply that it did not run CREATE_DB by default, but instead ran RESTORE_DB first, since the logs state restoring, followed by logs stating Table 'main_sys.admins' does not exist?
So I tried to explicitly start from the CREATE_DB step, as shown in Issue 2 below.
Error log:

mysql: [Warning] Using a password on the command line interface can be insecure.
Restoring: 100%|██████████| 233k/233k [00:00<00:00, 658kB/s]
["UPDATE `user` SET `first_name` = ('hello'),`last_name` = ('test');"]
["UPDATE `user` SET `first_name` = ('hello'),`last_name` = ('test');"]
Anonymizing user:   0%|          | 0/1 [00:00<?, ?it/s]    mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 1146 (42S02) at line 1: Table 'main_sys.admins' doesn't exist
Anonymizing user:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/test/Documents/tools/pnonymizer/mask.py", line 3, in <module>
    pynonymizer.run(input_path="main_sys.sql", strategyfile_path="strategy.yaml",
  File "/Users/test/Documents/DataProcessor/venv/lib/python3.9/site-packages/pynonymizer/pynonymize.py", line 147, in pynonymize
    db_provider.anonymize_database(strategy)
  File "/Users/test/Documents/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/__init__.py", line 159, in anonymize_database
    self.__runner.db_execute(statements)
  File "/Users/test/Documents/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/execution.py", line 131, in db_execute
    self.__mask_subprocess_error(error)
  File "/Users/test/Documents/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/execution.py", line 81, in __mask_subprocess_error
    raise error from None
  File "/Users/test/Documents/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/execution.py", line 124, in db_execute
    subprocess.check_output(
  File "/usr/local/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/local/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mysql', '-h', '127.0.0.1', '-P', '3306', '-u', 'test', '-p******']' returned non-zero exit status 1.

Issue 2:

 pynonymizer.run(input_path="main_sys.sql", strategyfile_path="strategy_file1.yaml",
                        db_host='< host >', db_name = 'main_sys', db_password='<password>', output_path='main_sys_anonymized.sql', 
start_at_step='CREATE_DB')

When I tried to use start_at_step='CREATE_DB' in pynonymizer.run() to change the process control behaviour and ensure that the database gets created, to prevent the above error, the following error happens, which implies that it is attempting to run RESTORE_DB and then CREATE_DB, causing the failure below, even though it is supposed to run CREATE_DB first.
Error log:

Restoring: 100%|██████████| 307k/307k [00:00<00:00, 1.13MB/s]
Anonymizing user:   0%|          | 0/1 [00:00<?, ?it/s]    mysql: [Warning] Using a password on the command line interface can be insecure
["UPDATE `user` SET `last_name` = ( 'test' );"]
ERROR 1146 (42S02) at line 1: Table 'main_sys.admins' doesn't exist
Anonymizing user:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
    subprocess.check_output(
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mysql', '-h', '127.0.0.1', '-P', '3306', '-u', 'root', '-p******']' returned non-zero exit status 1.

Can you please advise how I can achieve the basic process flow via the python script, i.e.

1. CREATE_DB
2. RESTORE_DB
3. ANONYMIZE_DB
4. DUMP_DB

Thank you @rwnx.
The only way I am able to use the tool in a step-by-step manner is to specify only_step by calling pynonymizer.run for each of the values below:
CREATE_DB, RESTORE_DB, ANONYMIZE_DB, DUMP_DB

Expected behavior
As per the documentation and code, it should go through the steps in the order below as the default process control behaviour, i.e.:

    logger.info(actions.summary(ProcessSteps.CREATE_DB))
    if not actions.skipped(ProcessSteps.CREATE_DB):
        db_provider.create_database()

    logger.info(actions.summary(ProcessSteps.RESTORE_DB))
    if not actions.skipped(ProcessSteps.RESTORE_DB):
        db_provider.restore_database(input_path)

    logger.info(actions.summary(ProcessSteps.ANONYMIZE_DB))
    if not actions.skipped(ProcessSteps.ANONYMIZE_DB):
        db_provider.anonymize_database(strategy)

    logger.info(actions.summary(ProcessSteps.DUMP_DB))
    if not actions.skipped(ProcessSteps.DUMP_DB):
        db_provider.dump_database(output_path)

    logger.info(actions.summary(ProcessSteps.DROP_DB))
    if not actions.skipped(ProcessSteps.DROP_DB):
        db_provider.drop_database()

Additional context

Setting (perhaps global) to only anonymise non-empty columns

To keep my anonymised database as true-to-life as the source, I want to only update non-empty columns. To achieve that I currently do this:

  - table_name: customers
    type: update_columns
    columns:
      customer_company:
        type: fake_update
        fake_type: company
        where: customer_company!= ""
      customer_title:
        type: fake_update
        fake_type: prefix
        where: customer_title!= ""
      customer_firstname:
        type: fake_update
        fake_type: first_name
        where: customer_firstname!= ""
      customer_surname:
        type: fake_update
        fake_type: last_name
        where: customer_surname!= ""

While this works, it's a little laborious and makes the strategy file very long.

Is there a way to set "only update if not empty" globally somehow? Or, failing that, at any other level?

Typo: "running fafter script"

Present in the mssql, mysql and postgres providers: there's an additional "f" character being added in the log line for running before/after scripts.

Remove it please!

            self.logger.info(f"Running f{title} script #{i} \"{script[:50]}\"")

Add CLI options `--only-step`

I've seen quite a few people only using the anonymizer features (and not the surrounding steps). It looks like --start-at and --stop-at are getting used, but this interface seems clunky to me; it would be nicer to be able to say "only do this".

In terms of implementation I'd like to continue using the "step" system and make only-step a mutually exclusive cli option (i.e. you can't combine only-step with any of the other step options: skip-steps, start-at, stop-at).

Future iterations of the cli might make this even easier, but since at the moment we have the step system, I'd like to make it a little more useful (in a "boring" way).

This means you get to do something like

pynonymizer [...] --only-step ANONYMIZE_DB

use poetry

Using a build management system like pipenv has been a real time-saver on other projects.
I've read that poetry can handle pypi projects with relative ease, so I'd like to try that out here.

This will require changes to the workflow/actions as well as the tooling required in the project

Invoke pynonymizer programmatically

Is your feature request related to a problem? Please describe.
pynonymizer normally does exactly what is needed, i.e. anonymize certain columns/drop data using a pre-defined, clear strategy-file. However, it still needs a lot of parameters, especially if you use all the features, e.g. starting from a certain step, skipping another one, etc. If it were possible to invoke it programmatically, you could integrate it into your own workflow. For instance, I already have a command-line tool for my app that already does a lot of things (e.g. running an alembic migration, resetting the database, adding seed data, etc.). The user/developer wouldn't need to learn another tool but could run it with the command she is used to.

Describe the solution you'd like
Honestly I don't know exactly what the best method would be. Probably some pynonymizer object, to which you can assign a configuration/strategy-file (or have both in a single file). Then, once it is configured, you can just invoke pynonymizer.run() or similar and it does its job. Something like:

a = Pynonymizer()
a.config_file = os.path.join('./myconfigfile.yml')
# Assuming the strategy-file is provided already in the config
# Now configuring the steps
a.config.start_at(ANONYMIZE_DB)
a.config.skip_step([DUMP_DB, DROP_DB])
# Getting some config from an environment variable
a.config.db_password = os.environ["DB_PASSWRD"]
# Another setting is provided via the arguments of my commandline tool which uses click
a.config.db_user = click.parameters.get("my_user", default_user)  # not real code here

# Doing a final check before running
is_all_needed_config_data_available = a.config.check()
if is_all_needed_config_data_available == True:
    result = a.run()
    if result.error_code != 0:
        print(f"Something happened: {result.message}")

Describe alternatives you've considered
Invoking the tool from within your code like you would invoke another executable is, of course, possible, but it doesn't give you the amount of control or integration that you would have when invoking from Python directly.

Additional context
Great library, really, a deeply felt "Thank you!"

Accept input from stdin

The project is awesome! Thanks for all your hard work.

It would be really cool if you could accept input from stdin though. That way we could use it as part of a pipe!
Would you be able to add that please?

SQL command returned non-zero exit status 1 with uppercase column names

Hi Jerome,

I'm having an issue while using pynonymizer with uppercase table names. The generated SQL leaves the table name unquoted, so Postgres folds it to lowercase and returns an error.

subprocess.CalledProcessError: Command '['psql', '--host', 'localhost', '--port', '5432', '--username', 'postgres', '--dbname', 'strategy3_e30f5bf9e4b54379896fc2d037930bec', '--command', 'UPDATE APP AS "updatetarget" SET "APP_NAME" = ( SELECT company FROM _pynonymizer_seed_fake_data ORDER BY RANDOM(), MD5("updatetarget"::text) LIMIT 1),"CLIENT_ID" = ( SELECT md5 FROM _pynonymizer_seed_fake_data ORDER BY RANDOM(), MD5("updatetarget"::text) LIMIT 1),"CLIENT_SECRET" = ( SELECT md5 FROM _pynonymizer_seed_fake_data ORDER BY RANDOM(), MD5("updatetarget"::text) LIMIT 1);']' returned non-zero exit status 1.

strategy3_e30f5bf9e4b54379896fc2d037930bec=# UPDATE APP AS "updatetarget" SET "APP_NAME" = ( SELECT company FROM _pynonymizer_seed_fake_data ORDER BY RANDOM(), MD5("updatetarget"::text) LIMIT 1),"CLIENT_ID" = ( SELECT md5 FROM _pynonymizer_seed_fake_data ORDER BY RANDOM(), MD5("updatetarget"::text) LIMIT 1),"CLIENT_SECRET" = ( SELECT md5 FROM _pynonymizer_seed_fake_data ORDER BY RANDOM(), MD5("updatetarget"::text) LIMIT 1);
ERROR:  relation "app" does not exist
LINE 1: UPDATE APP AS "updatetarget" SET "APP_NAME" = ( SELECT compa...

Would you take a look?

Improve Programmatic Invocation: Callbacks/events/hooks

Is your feature request related to a problem? Please describe.
pynonymizer is a long, blocking process. If you're going to invoke it from another function, it would be nice to know what the hell is happening as it executes.

Describe the solution you'd like
Key places, likely conditional on steps, where you are able to receive signals to run your own code, and clearly defined failure behaviour.

Concatenate anonymized fields into one field (User Question)

Hi,

I have a contact table that holds First Name, Surname and Name.
The Name field is always filled with the First Name and Surname.
Is it possible to have the anonymized fields of First Name and Surname concatenated into the field Name?

Already tried

      Name:
        type: literal
        value: (SELECT TOP1 ([First Name] + ' ' + Surname) FROM [Contact] WHERE [Contact].[No_] = [Contact].[No_])

Cannot find module during import.

I'm installing pynonymizer with pip, and including it as part of a larger project. My project includes a tests python package on the top level, the same as pynonymizer. This causes an issue where I can't import my tests package, because it gets confused with the pynonymizer one.

Since the tests package isn't needed to run pynonymizer (it's just for testing), could it be excluded from the pip package to avoid this issue?

Thanks.

Standardize formatting/Linting

It looks like the project has inconsistent style and trips PEP8 IDE rules in more than a few places. It would be good to:

  1. set a standard for the project
  2. Perform some kind of automation to make that easier

locale from strategyfile will always be overwritten

Describe the bug
You can't set the locale via the strategyfile because it will always be overwritten by the fallback locale.

To Reproduce
Steps to reproduce the behavior:

  1. set the locale in a strategyfile to some locale with some exclusive faker modules, e.g. de_DE
  2. use a faker module exclusive to that locale, e.g. state
  3. run pynonymizer without the --fake-locale argument, but enable verbose to see that the locale will fall back to en_GB

Expected behavior
Setting the locale in the strategyfile results in that locale being used by pynonymizer.

Additional context
The issue seemingly lies in pynonymizer/pynonymize.py:47

    if fake_locale is None:
        fake_locale = "en_GB"

fake_locale will never be None beyond this point; it's either a value specified by the CLI argument/environment variable, or the fallback en_GB.

The parse_config method relies on locale_override being None to use the value set in the strategyfile.

pynonymizer/strategy/parser.py:162

    def parse_config(self, raw_config, locale_override=None):
        """
        parse a configuration dict into a DatabaseStrategy.
        :param raw_config:
        :return:
        """
        config = StrategyParser.__normalize_table_list(deepcopy(raw_config))

        locale = config.get("locale", None)
        if locale_override:
            locale = locale_override

I'd suggest putting the en_GB fallback into the line where the locale is read from the strategyfile, instead of the fallback logic in pynonymizer/pynonymize.py:47.

    def parse_config(self, raw_config, locale_override=None):
        """
        parse a configuration dict into a DatabaseStrategy.
        :param raw_config:
        :return:
        """
        config = StrategyParser.__normalize_table_list(deepcopy(raw_config))

---        locale = config.get("locale", None)
+++        locale = config.get("locale", "en_GB")
        if locale_override:
            locale = locale_override

I've created a pull request with the suggested changes: #79
