Using only the anonymize step to anon existing DB. In tables where row count is low, t

Seems to corelated with subquery caching mentioned here: <a href="https://stackoverflo

Ok, thanks, my mistake, have forgot to use the s to set the ANSI_WARNINGs... fir

hey <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url=

MSSQL: All lines in table are anonymized to same values about pynonymizer HOT 9 CLOSED

rwnx commented on August 23, 2024

MSSQL: All lines in table are anonymized to same values

from pynonymizer.

Comments (9)

rwnx commented on August 23, 2024 1

My original thought was more along the lines of "would this cause anonymization to be improperly performed when enabled?" and looking at the docs here I don't have that concern. pynonymizer's goal should be to be compatibile with the database, not the other way around 😅

If people get some truncated replacement data, that seems agreeable to me!

I'll integrate this into #48 and add a test case too.

from pynonymizer.

kine commented on August 23, 2024

Seems to corelated with subquery caching mentioned here: https://stackoverflow.com/questions/39101333/force-sql-server-to-stop-caching-subquery

from pynonymizer.

rwnx commented on August 23, 2024

Hi,
Definitely not the correct behaviour / known issue. I had a report of this in behaviour in mssql provider a few years ago but I couldn't replicate it in my mssql instance.
I want to make sure we capture this effort into integration tests for the project since the mssql provider is by far the most touchy one!

There was a similar issue in the postgres provider a while ago: https://gitlab.com/jerometwell/pynonymizer/-/issues/49 , maybe the workaround there (using literal with an uncachable subquery) might work for you until we can fix this?

from pynonymizer.

kine commented on August 23, 2024

Thanks,
I have solved the problem partially even I am not python developer...

the prevention of cached subselect(seems that when there is more than x rows the SQL will optimize the subselect because it is not depending on the main data, thus executing it only once and using the result everywhere) is to make it depending on the main data. I did it in this way:

return f"( SELECT TOP 1 {column} FROM [{SEED_TABLE_NAME}] ORDER BY NEWID(),CHECKSUM({SEED_TABLE_NAME}.{column},[{table_name}].[{field_name}]))"

in the __get_column_subquery (taken from https://dba.stackexchange.com/questions/275072/generate-random-row-order-in-a-subquery). It makes the subselect depending on the data of the query. Side-effect is, that if the caching kick in, the result is "deterministic" because same original value is replaced with same generated "pseudo-random" value because the cache, but this is for me even better ;-)

I can create PR for this, but someone must add the tests and somehow tune the code, because really I just did without deep python knowledge.

Another part of the fix was to use "SET ANSI_WARNINGS off" to disable warnings when generated value is truncated because the target field is shorter (had some problem with long city name in cs-CZ fake dataset).

from pynonymizer.

rwnx commented on August 23, 2024

Thanks for your prototype and suggestions, it's a great starting place 😇 It looks like we'll be doing roughly what we did for postgres, i.e. depending upon something in the query which makes it hard to cache, like you say!
I will pick this up for next release.

I've tried to keep strategy.scripts open for stuff like SET ANSI_WARNINGS off but maybe that's not working for you? part of pynonymizers unofficial goals is to remain flexible enough without literally supporting every feature 😅

from pynonymizer.

kine commented on August 23, 2024

Ok, thanks, my mistake, have forgot to use the scripts to set the ANSI_WARNINGs... first time I am using pynonymizer... next time I will know... ;-) (but I am not sure if it will work - depends if all is running in same session or not, because this is per session setting).

from pynonymizer.

rwnx commented on August 23, 2024

hey @kine can we chat about ANSI_WARNINGS? I looked at your implementation on your fork - I have some reservations about enabling ansi_warnings for everyone, I'm concerned that the behaviour might not be applicable to everyone. What do you think?

from pynonymizer.

kine commented on August 23, 2024

Yes, it could be "too strong" for others. But without this the process will fail if the generated value is somewhere longer than the field in table. You will need to fix that in the code. This setting is solving this without adding the length check in code.

From documentation:

SET ANSI_WARNINGS affects the following conditions:

When set to ON, if null values appear in aggregate functions, such as SUM, AVG, MAX, MIN, STDEV, STDEVP, VAR, VARP, or COUNT, a warning message is generated. When set to OFF, no warning is issued.

When set to ON, the divide-by-zero and arithmetic overflow errors cause the statement to be rolled back and an error message is generated. When set to OFF, the divide-by-zero and arithmetic overflow errors cause null values to be returned. The behavior in which a divide-by-zero or arithmetic overflow error causes null values to be returned occurs if an INSERT or UPDATE is tried on a character, Unicode, or binary column in which the length of a new value exceeds the maximum size of the column. If SET ANSI_WARNINGS is ON, the INSERT or UPDATE is canceled as specified by the ISO standard. Trailing blanks are ignored for character columns and trailing nulls are ignored for binary columns. When OFF, data is truncated to the size of the column and the statement succeeds.

Null values we are not solving, we are not dividing. Thus only the field length is crucial for us thus I think it could be safely used. E.g. possible to enable this through parameter will be nice. But still, situation, where the generated value is longer than field will kill all, thus what else we can use to solve this problem? Shoudl I open separate issue for solving the "value xxxx is longer than..."?

from pynonymizer.

kine commented on August 23, 2024

yes, data are random, I think that there will be something truncated is out of scope... :-)

from pynonymizer.

MSSQL: All lines in table are anonymized to same values about pynonymizer HOT 9 CLOSED

Comments (9)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent