Code Monkey home page Code Monkey logo

Comments (9)

rwnx avatar rwnx commented on August 23, 2024 1

My original thought was more along the lines of "would this cause anonymization to be improperly performed when enabled?" and looking at the docs here I don't have that concern. pynonymizer's goal should be to be compatibile with the database, not the other way around 😅

If people get some truncated replacement data, that seems agreeable to me!

I'll integrate this into #48 and add a test case too.

from pynonymizer.

kine avatar kine commented on August 23, 2024

Seems to corelated with subquery caching mentioned here: https://stackoverflow.com/questions/39101333/force-sql-server-to-stop-caching-subquery

from pynonymizer.

rwnx avatar rwnx commented on August 23, 2024

Hi,
Definitely not the correct behaviour / known issue. I had a report of this in behaviour in mssql provider a few years ago but I couldn't replicate it in my mssql instance.
I want to make sure we capture this effort into integration tests for the project since the mssql provider is by far the most touchy one!

There was a similar issue in the postgres provider a while ago: https://gitlab.com/jerometwell/pynonymizer/-/issues/49 , maybe the workaround there (using literal with an uncachable subquery) might work for you until we can fix this?

from pynonymizer.

kine avatar kine commented on August 23, 2024

Thanks,
I have solved the problem partially even I am not python developer...

the prevention of cached subselect(seems that when there is more than x rows the SQL will optimize the subselect because it is not depending on the main data, thus executing it only once and using the result everywhere) is to make it depending on the main data. I did it in this way:

return f"( SELECT TOP 1 {column} FROM [{SEED_TABLE_NAME}] ORDER BY NEWID(),CHECKSUM({SEED_TABLE_NAME}.{column},[{table_name}].[{field_name}]))"

in the __get_column_subquery (taken from https://dba.stackexchange.com/questions/275072/generate-random-row-order-in-a-subquery). It makes the subselect depending on the data of the query. Side-effect is, that if the caching kick in, the result is "deterministic" because same original value is replaced with same generated "pseudo-random" value because the cache, but this is for me even better ;-)

I can create PR for this, but someone must add the tests and somehow tune the code, because really I just did without deep python knowledge.

Another part of the fix was to use "SET ANSI_WARNINGS off" to disable warnings when generated value is truncated because the target field is shorter (had some problem with long city name in cs-CZ fake dataset).

from pynonymizer.

rwnx avatar rwnx commented on August 23, 2024

Thanks for your prototype and suggestions, it's a great starting place 😇 It looks like we'll be doing roughly what we did for postgres, i.e. depending upon something in the query which makes it hard to cache, like you say!
I will pick this up for next release.

I've tried to keep strategy.scripts open for stuff like SET ANSI_WARNINGS off but maybe that's not working for you? part of pynonymizers unofficial goals is to remain flexible enough without literally supporting every feature 😅

from pynonymizer.

kine avatar kine commented on August 23, 2024

Ok, thanks, my mistake, have forgot to use the scripts to set the ANSI_WARNINGs... first time I am using pynonymizer... next time I will know... ;-) (but I am not sure if it will work - depends if all is running in same session or not, because this is per session setting).

from pynonymizer.

rwnx avatar rwnx commented on August 23, 2024

hey @kine can we chat about ANSI_WARNINGS? I looked at your implementation on your fork - I have some reservations about enabling ansi_warnings for everyone, I'm concerned that the behaviour might not be applicable to everyone. What do you think?

from pynonymizer.

kine avatar kine commented on August 23, 2024

Yes, it could be "too strong" for others. But without this the process will fail if the generated value is somewhere longer than the field in table. You will need to fix that in the code. This setting is solving this without adding the length check in code.

From documentation:

SET ANSI_WARNINGS affects the following conditions:

When set to ON, if null values appear in aggregate functions, such as SUM, AVG, MAX, MIN, STDEV, STDEVP, VAR, VARP, or COUNT, a warning message is generated. When set to OFF, no warning is issued.

When set to ON, the divide-by-zero and arithmetic overflow errors cause the statement to be rolled back and an error message is generated. When set to OFF, the divide-by-zero and arithmetic overflow errors cause null values to be returned. The behavior in which a divide-by-zero or arithmetic overflow error causes null values to be returned occurs if an INSERT or UPDATE is tried on a character, Unicode, or binary column in which the length of a new value exceeds the maximum size of the column. If SET ANSI_WARNINGS is ON, the INSERT or UPDATE is canceled as specified by the ISO standard. Trailing blanks are ignored for character columns and trailing nulls are ignored for binary columns. When OFF, data is truncated to the size of the column and the statement succeeds.

Null values we are not solving, we are not dividing. Thus only the field length is crucial for us thus I think it could be safely used. E.g. possible to enable this through parameter will be nice. But still, situation, where the generated value is longer than field will kill all, thus what else we can use to solve this problem? Shoudl I open separate issue for solving the "value xxxx is longer than..."?

from pynonymizer.

kine avatar kine commented on August 23, 2024

yes, data are random, I think that there will be something truncated is out of scope... :-)

from pynonymizer.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.