PIMO : Private Input, Masked Output

PIMO is a data masking tool. It reads a JSONline stream, applies the masks defined in a YAML configuration file, and returns another JSONline stream.

You can use LINO to extract sample data from a database, which you can then use as input data for PIMO's data masking. You can also generate data with a simple yaml configuration file.

Capabilities

  • credibility : generated data is not distinguishable from real data
  • data synthesis : generate data from nothing
  • data masking, including
    • randomization : protect personal or sensitive data by writing over it
    • pseudonymization, on 3 levels
      • consistent pseudonymization : real value A is always replaced by pseudo-value X, but X can be attributed to values other than A
      • identifying pseudonymization : real value A is always replaced by pseudo-value X, and X CANNOT be attributed to values other than A
      • reversible pseudonymization : real value A can be recovered from pseudo-value X

Configuration file needed

PIMO requires a YAML configuration file to work. By default, the file is named masking.yml and is placed in the working directory. The file must respect the following format :

version: "1"
seed: 42
functions:
    # Optional define functions
masking:
  - selector:
      jsonpath: "example.example"
    mask:
      type: "argument"
    # Optional cache (coherence preservation)
    cache: "cacheName"
    # Optional custom seed for this mask
    seed:
      field: "example.example"

  # another mask on a different location
  - selector:
      jsonpath: "example.example2"
    mask:
      type: "argument"
    preserve: "null"

caches:
  cacheName:
    # Optional bijective cache (enable re-identification if the cache is dumped on disk)
    unique: true
    # Use reverse cache dictionary
    reverse: true

The keys have the following meaning:

  • version is the version of the masking file.
  • seed gives every random mask the same seed. It is optional; if it is not defined, the seed is derived from the current time to increase randomness.
  • functions defines the functions that can be used in the template, template-each, add, and add-transient masks.
  • masking defines the pipeline of masks that will be applied.
  • selector is made of a jsonpath, which defines the path of the entry to mask in the JSON input.
  • mask defines the mask that will be used for the entry defined by selector.
  • cache is optional. If the current entry is already in the cache as a key, the associated value is returned without executing the mask. Otherwise the mask is executed and a new entry is added to the cache, with the original content as key and the masked result as value. The cache has to be declared in the caches section of the YAML file.
  • preserve is optional and keeps some values unmasked. Allowed options are "null" (null values), "empty" (empty string ""), and "blank" (both empty and null values). Additionally, preserve can be used with the fromCache mask to preserve uncached values (usage: preserve: "notInCache").
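The cache and preserve rules described above can be sketched in a few lines of Python. This is only an illustration of the lookup logic, not PIMO's implementation: mask_with_cache, the dict-based cache and the sample names are hypothetical.

```python
import random

def mask_with_cache(value, mask, cache, preserve=None):
    """Apply `mask` to `value`, reusing the cached result for known inputs."""
    # preserve: "null" keeps None, "empty" keeps "", "blank" keeps both.
    if preserve in ("null", "blank") and value is None:
        return value
    if preserve in ("empty", "blank") and value == "":
        return value
    if value in cache:        # cache hit: the mask is not executed
        return cache[value]
    masked = mask(value)      # cache miss: execute the mask...
    cache[value] = masked     # ...and store original -> masked
    return masked

rng = random.Random(42)
names = ["Mickael", "Mathieu", "Marcelle"]
cache = {}
first = mask_with_cache("Dorothy", lambda _: rng.choice(names), cache)
again = mask_with_cache("Dorothy", lambda _: rng.choice(names), cache)
print(first == again)  # the second call is served from the cache
```

Because the second call never runs the mask, the same original value keeps the same masked value for the whole run, which is the coherence preservation that the cache option provides.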

Multiple masks can be applied on the same jsonpath location, like in this example :

  - selector:
      jsonpath: "example"
    masks:
      - add: "hello"
      - template: "{{.example}} World!"
      - remove: true

Masks can be applied on multiple selectors, like in this example:

  - selectors:
      - jsonpath: "example"
      - jsonpath: "example2"
    mask:
      add: "hello"

It is possible to define functions and reuse them later in the masks, like in this example:

functions:
  add20:
    params:
      - name: "i"
    body: |-
      return i + 20
  sub:
    params:
      - name: "x"
      - name: "y"
    body: |-
      return x - y
masking:
  - selector:
      jsonpath: "addValue"
    mask:
      template: '{{add20 5}}'
  - selector:
      jsonpath: "subValue"
    mask:
      template: '{{sub 10 5}}'

Possible masks

The following types of masks can be used :

  • Pure randomization masks
    • regex is to mask using a regular expression given in argument.
    • randomInt is to mask with a random int from a range with arguments min and max.
    • randomDecimal is to mask with a random decimal from a range with arguments min, max and precision.
    • randDate is to mask a date with a random date between dateMin and dateMax.
    • randomDuration is to mask a date by adding or removing a random time between Min and Max.
    • randomChoice is to mask with a random value from a list in argument.
    • weightedChoice is to mask with a random value from a list with probability, both given with the arguments choice and weight.
    • randomChoiceInUri is to mask with a random value from an external resource.
    • randomChoiceInCSV is to mask with a random value from an external CSV resource.
    • transcode is to mask a value randomly with character class preservation.
    • timeline is to generate a set of dates related to each other (by rules and constraints).
  • K-Anonymization
    • range is to mask an integer value by a range of values (e.g. replace 5 by [0,10]).
    • duration is to mask a date by adding or removing a certain number of days.
  • Re-identification and coherence preservation
    • hash is to mask with a value from a list by matching the original value, allowing to mask a value the same way every time.
    • hashInUri is to mask with a value from an external resource, by matching the original value, allowing to mask a value the same way every time.
    • hashInCSV is to mask with a value from an external CSV resource, by matching the original value, allowing to mask a value the same way every time.
    • fromCache is a mask to obtain a value from a cache.
    • ff1 mask allows the use of FPE (format-preserving encryption), which enables private-key-based re-identification.
    • sha3 masks will apply a variable length cryptographic hash (SHAKE variable-output-length hash function defined by FIPS-202) and then apply a base-conversion to the output.
  • Formatting
    • dateParser is to change a date format.
    • template is to mask a data with a template using other values from the jsonline.
    • template-each is like template but will apply on each value of an array.
    • fromjson is to convert string field values to parsed JSON, e.g. "[1,2,3]" -> [1,2,3].
  • Data structure manipulation
    • remove is to mask a field by completely removing it.
    • add is a mask to add a field to the jsonline.
    • add-transient same as add but the field is not exported in the output jsonline.
  • Others
    • constant is to mask the value by a constant value given in argument.
    • command is to mask with the output of a console command given in argument.
    • incremental is to mask data with incremental value starting from start with a step of increment.
    • sequence is to generate sequenced IDs of any format.
    • fluxUri is to replace by a sequence of values defined in an external resource.
    • replacement is to mask a data with another data from the jsonline.
    • pipe is a mask to handle complex nested array structures, it can read an array as an object stream and process it with a sub-pipeline.
    • luhn can generate valid numbers using the Luhn algorithm (e.g. french SIRET or SIREN).
    • markov can generate pseudo text based on a sample text.
    • findInCSV is to get one or multiple CSV lines that match the JSON entry value from CSV files.
    • xml can manipulate XML content within JSON values.

A full masking.yml file example, using every kind of mask, is given with the source code.

If two types of mask are entered with the same selector, the program can't extract the masking configuration and will return an error. The file wrongMasking.yml provided with the source illustrates that error.

Usage

To mask a data.json file with PIMO, run the following command :

./pimo <data.json >maskedData.json

This takes the data.json file, masks the data contained inside it and puts the result in a maskedData.json file. If the data is in an array (for example multiple names), then each element of this array will be masked using the given mask. The following flags can be used:

  • --repeat=N This flag will make pimo mask every input N-times (useful for dataset generation).
  • --skip-line-on-error This flag will totally skip a line if an error occurs masking a field.
  • --skip-field-on-error This flag will return output without a field if an error occurs masking this field.
  • --skip-log-file <filename> Skipped lines will be written to <filename>.
  • --catch-errors <filename> or -e <filename> Equivalent to --skip-line-on-error --skip-log-file <filename>.
  • --empty-input This flag will give PIMO a {} input, usable with --repeat flag.
  • --config=filename.yml This flag allows using a config file other than the default masking.yml.
  • --load-cache cacheName=filename.json This flag loads the initial cache content from a file (json line format {"key":"a", "value":"b"}).
  • --dump-cache cacheName=filename.json This flag dumps the final cache content to a file (json line format {"key":"a", "value":"b"}).
  • --verbosity <level> or -v<level> This flag increases verbosity on the stderr output, possible values: none (0), error (1), warn (2), info (3), debug (4), trace (5).
  • --debug This flag completes the logs with debug information (source file, line number).
  • --log-json Set this flag to produce JSON formatted logs (demo9 goes deeper into logging and structured logging).
  • --seed <int> Set this flag to declare the seed on the command line.
  • --mask Declare a simple masking definition on the command line (minified YAML format: --mask "value={fluxUri: 'pimo://nameFR'}", or --mask "value=[{add: ''},{fluxUri: 'pimo://nameFR'}]" for multiple masks). For advanced use cases (e.g. if caches are needed), a masking.yml file definition is preferred.
  • --repeat-until <condition> This flag will make PIMO keep masking every input until the condition is met. The condition uses the Template format. The last output verifies the condition.
  • --repeat-while <condition> This flag will make PIMO keep masking every input while the condition is met. The condition uses the Template format.
  • --stats <filename | url> This flag either outputs run statistics to the specified file or sends them to the specified url (which has to start with http or https).
  • --statsTemplate <string> This flag will have PIMO use the value as a template to generate statistics. Please use the Go templating format, and include the statistics with {{ .Stats }} (e.g. {"software":"PIMO","stats":{{ .Stats }}}).

PIMO Play

The play command will start a local website, where you will find commented examples and a playground to play with the masking configuration.

$ pimo play
⇨ http server started on [::]:3010

Then go to http://localhost:3010/ in your browser.

Examples

This section gives an example for each type of mask.

Please check the demo folder for more advanced examples.

Regex

  - selector:
      jsonpath: "phone"
    mask:
      regex: "0[1-7]( ([0-9]){2}){4}"

This example will mask the phone field of the input jsonlines with a random string respecting the regular expression.

Constant

  - selector:
      jsonpath: "name"
    mask:
      constant: "Bill"

This example will mask the name field of the input jsonlines with the value of the constant field.

RandomChoice

  - selector:
      jsonpath: "name"
    mask:
      randomChoice:
       - "Mickael"
       - "Mathieu"
       - "Marcelle"

This example will mask the name field of the input jsonlines with random values from the randomChoice list.

RandomChoiceInUri

  - selector:
      jsonpath: "name"
    mask:
      randomChoiceInUri: "file://names.txt"

This example will mask the name field of the input jsonlines with random values from the list contained in the names.txt file. The URI schemes usable with this mask are : pimo, file and http/https.

A value can be injected into the URI with the template syntax. For example, file://name_{{.gender}}.txt selects a line in name_F.txt if the current jsonline is {"gender": "F"}.

RandomChoiceInCSV

version: "1"
masking:
  - selector:
      jsonpath: "pokemon"
    mask:
      randomChoiceInCSV:
        uri: "https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv"
        header: true          # optional: csv has a header line, use it to name fields, default: false
        separator: ","        # optional: csv value separator is , (default value)
        comment: "#"          # optional: csv contains comments starting with #, if empty no comment is expected (default)
        fieldsPerRecord: 0    # optional: number of fields per record, if 0 sets it to the number of fields in the first record (default)
                              # if negative, no check is made and records may have a variable number of fields
        trim: true            # optional: trim space in values and headers, default: false

The selected field's data will be masked with random values selected from a CSV file available at the specified URL (a GitHub gist in this case).

Here is a detailed breakdown of the example configuration:

  • selector: The jsonpath: "pokemon" line means that this masking configuration is meant to apply to the field named "pokemon" in the JSON data.
  • mask: This defines the masking operation to be performed on the "pokemon" field.
  • randomChoiceInCSV: The mask will replace the value in the "pokemon" field with a random choice from the CSV file at the specified URL.
  • uri: The location of the CSV file to use for replacement values, file and http/https schemes can be used. This parameter can be a template.
  • header: This optional parameter is set to true, meaning the CSV file contains a header line that names the fields.
  • separator: This optional parameter specifies that the CSV values are separated by a comma, which is the default separator in CSV files.
  • comment: This optional parameter specifies that the CSV file may contain comments that start with a '#'.
  • fieldsPerRecord: This optional parameter is set to 0, meaning the number of fields per record will be set to the number of fields in the first record by default. If negative, no check is made and records may have a variable number of fields.
  • trim: This optional parameter is set to true, meaning any spaces in values and headers in the CSV file will be trimmed.

RandomInt

  - selector:
      jsonpath: "age"
    mask:
      randomInt:
        min: 25
        max: 32

This example will mask the age field of the input jsonlines with a random number between min and max included.

RandomDecimal

  - selector:
      jsonpath: "score"
    mask:
      randomDecimal:
        min: 0
        max: 17.23
        precision: 2

This example will mask the score field of the input jsonlines with a random float between min and max, with the number of decimal places set by the precision field.

Command

  - selector:
      jsonpath: "name"
    mask:
      command: "echo -n Dorothy"

This example will mask the name field of the input jsonlines with the output of the given command. In this case, Dorothy.

WeightedChoice

  - selector:
      jsonpath: "surname"
    mask:
      weightedChoice:
        - choice: "Dupont"
          weight: 9
        - choice: "Dupond"
          weight: 1

This example will mask the surname field of the input jsonlines with a random value from the weightedChoice list, with a probability proportional to the weight field.
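To make the effect of the weights concrete, here is a small Python sketch of a weighted draw. It is an illustration of the idea only, not PIMO's code; weighted_choice is a hypothetical helper built on the standard library's random.choices.

```python
import random
from collections import Counter

# (choice, weight) pairs taken from the configuration above
choices = [("Dupont", 9), ("Dupond", 1)]

def weighted_choice(rng, weighted):
    """Draw one value with a probability proportional to its weight."""
    values = [choice for choice, _ in weighted]
    weights = [weight for _, weight in weighted]
    return rng.choices(values, weights=weights, k=1)[0]

rng = random.Random(42)
counts = Counter(weighted_choice(rng, choices) for _ in range(10_000))
print(counts)  # "Dupont" should appear roughly nine times as often as "Dupond"
```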

Hash

  - selector:
      jsonpath: "town"
    mask:
      hash:
        - "Emerald City"
        - "Ruby City"
        - "Sapphire City"

This example will mask the town field of the input jsonlines with a value from the hash list. The value will be chosen thanks to a hashing of the original value, allowing the output to be always the same in case of identical inputs.
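The principle can be sketched as follows in Python. The hash function PIMO uses is not stated here, so SHA-256 and the optional salt parameter are assumptions made for the illustration, and hash_choice is a hypothetical name.

```python
import hashlib

towns = ["Emerald City", "Ruby City", "Sapphire City"]

def hash_choice(original, values, salt=""):
    """Pick a value from `values` based on a hash of the original value."""
    digest = hashlib.sha256((salt + original).encode("utf-8")).digest()
    # Turn the first 8 bytes of the digest into an index in the list.
    index = int.from_bytes(digest[:8], "big") % len(values)
    return values[index]

# Identical inputs always select the same replacement.
print(hash_choice("Kansas", towns) == hash_choice("Kansas", towns))
```

Since the list is usually much smaller than the set of possible inputs, several original values can map to the same replacement.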

HashInUri

  - selector:
      jsonpath: "name"
    mask:
      hashInUri: "pimo://nameFR"

This example will mask the name field of the input jsonlines with a value from the list nameFR contained in pimo, the same way as for hash mask. The different URI usable with this selector are : pimo, file and http/https.

HashInCSV

version: "1"
masking:
  - selector:
      jsonpath: "pokemon"
    mask:
      hashInCSV:
        uri: "https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv"
        header: true          # optional: csv has a header line, use it to name fields, default: false
        separator: ","        # optional: csv value separator is , (default value)
        comment: "#"          # optional: csv contains comments starting with #, if empty no comment is expected (default)
        fieldsPerRecord: 0    # optional: number of fields per record, if 0 sets it to the number of fields in the first record (default)
                              # if negative, no check is made and records may have a variable number of fields
        trim: true            # optional: trim space in values and headers, default: false

The selected field's data will be masked with values selected from a CSV file available at the specified URL (a GitHub gist in this case). The value is chosen by hashing the original value, so identical inputs always produce the same output.

See RandomChoiceInCSV for a detailed breakdown of the example configuration.

RandDate

  - selector:
      jsonpath: "date"
    mask:
      randDate:
        dateMin: "1970-01-01T00:00:00Z"
        dateMax: "2020-01-01T00:00:00Z"

This example will mask the date field of the input jsonlines with a random date between dateMin and dateMax. In this case the date will be between the 1st January 1970 and the 1st January 2020.

Duration

  - selector:
      jsonpath: "last_contact"
    mask:
      duration: "-P2D"

This example will mask the last_contact field of the input jsonlines by decreasing its value by 2 days. The duration field should match the ISO 8601 standard for durations.
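As an illustration of what applying an ISO 8601 duration to an RFC3339 date means, here is a minimal Python sketch. It only handles a subset of the duration syntax (days, hours, minutes, seconds, with an optional leading sign); parse_duration and shift are hypothetical helpers, not part of PIMO.

```python
import re
from datetime import datetime, timedelta

_DURATION = re.compile(
    r"^(?P<sign>[+-])?P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def parse_duration(text):
    """Parse a (subset of) ISO 8601 duration such as "-P2D" or "P1DT12H"."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    parts = {
        key: int(value)
        for key, value in match.groupdict(default="0").items()
        if key != "sign"
    }
    delta = timedelta(**parts)
    return -delta if match.group("sign") == "-" else delta

def shift(date_text, duration_text):
    """Add the duration to an RFC3339 UTC date and return RFC3339 again."""
    date = datetime.fromisoformat(date_text.replace("Z", "+00:00"))
    shifted = date + parse_duration(duration_text)
    return shifted.isoformat().replace("+00:00", "Z")

print(shift("2021-06-15T10:00:00Z", "-P2D"))  # 2021-06-13T10:00:00Z
```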

DateParser

  - selector:
      jsonpath: "date"
    mask:
      dateParser:
        inputFormat: "2006-01-02"
        outputFormat: "01/02/06"

This example will convert every date of the date field from the inputFormat to the outputFormat. Formats are written as Go time layouts, that is, as the representation of the reference date Mon Jan 2 15:04:05 -0700 MST 2006. Both fields are optional; when one is not defined, it defaults to RFC3339, which is the base format for PIMO, required by the duration mask and produced by the randDate mask. It is possible to use the Unix time format by specifying inputFormat: "unixEpoch" or outputFormat: "unixEpoch".
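For readers more used to strftime-style patterns, the following Python sketch shows how the two layouts of this example read. The translation table is deliberately partial (a handful of numeric tokens only) and go_layout_to_strftime and convert are hypothetical helpers written for this illustration.

```python
from datetime import datetime

# Partial mapping from reference-date tokens to strftime directives.
# "2006" must come before "06" so that the year is not split.
_TOKENS = {
    "2006": "%Y", "01": "%m", "02": "%d", "06": "%y",
    "15": "%H", "04": "%M", "05": "%S",
}

def go_layout_to_strftime(layout):
    """Translate a simple numeric reference-date layout to a strftime pattern."""
    for token, directive in _TOKENS.items():
        layout = layout.replace(token, directive)
    return layout

def convert(value, input_layout, output_layout):
    """Re-format a date the way the example configuration describes."""
    parsed = datetime.strptime(value, go_layout_to_strftime(input_layout))
    return parsed.strftime(go_layout_to_strftime(output_layout))

print(go_layout_to_strftime("2006-01-02"))             # %Y-%m-%d
print(convert("2021-12-25", "2006-01-02", "01/02/06"))  # 12/25/21
```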

RandomDuration

  - selector:
      jsonpath: "date"
    mask:
      randomDuration:
        min: "-P2D"
        max: "-P27D"

This example will mask the date field of the input jsonlines by decreasing its value by a random value between 2 and 27 days. The durations should match the ISO 8601 standard.

Incremental

  - selector:
      jsonpath: "id"
    mask:
      incremental:
        start: 1
        increment: 1

This example will mask the id field of the input jsonlines with incremental values. The first jsonline's id will be masked by 1, the second's by 2, etc...

Sequence

  - selector:
      jsonpath: "id"
    mask:
      sequence:
        format: "ERR-0000"

This example will generate the id field of the input jsonlines with sequenced values. The first jsonline's id will be masked by ERR-0000, the second's by ERR-0001, etc...

By default, the varying part of the ID is numbers, but this can be changed :

  - selector:
      jsonpath: "id"
    mask:
      sequence:
        format: "ERR-0000"
        varying: "ER"

With this configuration, the first jsonline's id will be masked by EEE-0000, the second's by EER-0000, the third by ERE-0000 etc...
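One way to reproduce the two series above is to treat every character of the format that belongs to the varying set as a digit of a counter. The Python sketch below follows that reading; it is inferred from the examples only, and what happens once the counter is exhausted is not covered (the hypothetical sequence generator simply stops).

```python
from itertools import islice

def sequence(fmt, varying="0123456789"):
    """Yield IDs by counting over the positions of fmt found in varying."""
    positions = [i for i, char in enumerate(fmt) if char in varying]
    base = len(varying)
    for counter in range(base ** len(positions)):
        chars = list(fmt)
        remainder = counter
        # Fill the varying positions from right to left, like digits.
        for position in reversed(positions):
            remainder, digit = divmod(remainder, base)
            chars[position] = varying[digit]
        yield "".join(chars)

first_default = list(islice(sequence("ERR-0000"), 2))
first_custom = list(islice(sequence("ERR-0000", varying="ER"), 3))
print(first_default)  # ['ERR-0000', 'ERR-0001']
print(first_custom)   # ['EEE-0000', 'EER-0000', 'ERE-0000']
```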

Replacement

  - selector:
      jsonpath: "name4"
    mask:
      replacement: "name"

This example will mask the name4 field of the input jsonlines with the value of the name field of the same jsonline. Place this selector after the name selector to use the masked value of name, or before the name selector to use its original value.

Template

  - selector:
      jsonpath: "mail"
    mask:
      template: "{{.surname}}.{{.name}}@gmail.com"

This example will mask the mail field of the input jsonlines according to the given template. In the masking.yml config file, place this selector after the selectors of the fields used in the template to use their masked values, or before them to use their original values. In the case of a nested json, the template must respect the following example :

  - selector:
      jsonpath: "user.mail"
    mask:
      template: "{{.user.surname}}.{{.user.name}}@gmail.com"

The format for the template should respect the text/template package : https://golang.org/pkg/text/template/

The template mask can format the fields used. The following example will create a mail address without accents, with the surname in upper case and the name in lower case:

  - selector:
      jsonpath: "user.mail"
    mask:
      template: "{{.surname | NoAccent | upper}}.{{.name | NoAccent | lower}}@gmail.com"

Available functions for templates come from http://masterminds.github.io/sprig/.

Most masks are available as template functions, named Mask followed by the capitalized mask name (for example MaskRegex or MaskRandomInt) :

  - selector:
      jsonpath: "mail"
    masks:
      - add: ""
      - template: '{{MaskRegex "[a-z]{10}"}}.{{MaskRegex "[a-z]{10}"}}.{{MaskRandomInt 0 100}}@gmail.com'

Template each

  - selector:
      jsonpath: "array"
    mask:
      template-each:
        template: "{{title .value}}"
        item: "value"

This will affect every value in the array field. The field must be an array ({"array": ["value1", "value2"]}). The item property is optional and defines the name of the current item in the templating string (defaults to "it"). There is another optional property, index: if defined, a property with the given name will be available in the templating string (e.g. : index: "idx" can be used in template with {{.idx}}).

The format for the template should respect the text/template package : https://golang.org/pkg/text/template/

See also the Template mask for other options, all functions are applicable on template-each.

Fromjson

  - selector:
      jsonpath: "targetfield"
    mask:
      fromjson: "sourcefield"

This example will mask the targetfield field of the input jsonlines with the parsed JSON from the sourcefield field of the jsonline. The string is parsed into a typed value, with JSON types mapped as follows :

  • null : nil
  • string: string
  • number: float64
  • array: slice
  • object: map
  • bool: bool

Remove

  - selector:
      jsonpath: "useless-field"
    mask:
      remove: true

This example will mask the useless-field field of the input jsonlines by completely deleting it.

Add

  - selector:
      jsonpath: "newField"
    mask:
      add: "newvalue"

This example will create the field newField containing the value newvalue. This value can be a string, a number, a boolean...

The field will be created in every input jsonline that doesn't already contain this field.

Note: add can contain template strings (see the Template mask for more information).

Add-Transient

  - selector:
      jsonpath: "newField"
    mask:
      add-transient: "newvalue"

This example will create the field newField containing the value newvalue. This value can be a string, a number, a boolean... It can also be a template.

The field will be created in every input jsonline that doesn't already contain this field, and it will be removed from the final JSONLine output.

This mask is used for temporary fields that are only available to other fields during the execution.

Note: add-transient can contain template strings (see the Template mask for more information).

FluxURI

  - selector:
      jsonpath: "id"
    mask:
      fluxUri: "file://id.csv"

This example will create an id field in every output jsonline. The values will be the ones contained in the id.csv file, in the same order as in the file. If the field already exists in the input jsonline, it will be replaced. Once every value of the file has been assigned, the remaining input jsonlines won't be modified.

FromCache

masking:
  - selector:
      jsonpath: "id"
    mask:
      fromCache: "fakeId"
caches:
  fakeId:
    unique: true
    reverse: false

This example will replace the content of the id field with the matching content in the cache fakeId. The cache has to be declared in the caches section. Cache content can be loaded from a JSON Lines file with the --load-cache fakeId=fakeId.jsonl option, or filled by the cache option on another field. If no match is found in the cache, fromCache blocks the current line and the next lines are processed until matching content enters the cache. A reverse option is available in the caches section to use the reverse cache dictionary.

FF1

  - selector:
      jsonpath: "siret"
    mask:
      ff1:
        keyFromEnv: "FF1_ENCRYPTION_KEY"
        domain: "0123456789" # all possible characters in a siret
        onError: "Invalid value = {{ .siret }}" # if set, this template will be executed on error

This example will encrypt the siret column with the private key base64-encoded in the FF1_ENCRYPTION_KEY environment variable. Use the same mask with the option decrypt: true to re-identify the unmasked value.

Characters outside of the domain can be preserved with preserve: true option.

Be sure to check the full FPE demo to get more details about this mask.

Sha3

The sha3 mask will apply a variable length cryptographic hash (SHAKE variable-output-length hash function defined by FIPS-202) and then apply a base-conversion to the output.

This is useful to mask any input data into a coherent and collision resistant ID.

version: "1"
seed: 123 # needed to salt the hash (can also be set via command line argument --seed 123)
masking:
  - selector:
      jsonpath: "email"
    mask:
      sha3:
        length: 12 # hash to N bytes, collision resistance is 2^(N*4)
        domain: "0123456789" # convert to base 10 with digits 0-9

In this example, the email will be replaced with a 29-digit collision resistant number. The collision resistance will be considered very good if the number of ID generated is less than 2^(12*8/2).
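The 29-digit figure follows from the base conversion: 12 bytes hold values up to 2^96 - 1, which takes 29 decimal digits to write. The Python sketch below illustrates the hash-then-convert idea; the choice of SHAKE256, the way the seed is mixed in and the zero padding are assumptions, so the output will not match PIMO's, and sha3_id and to_base are hypothetical names.

```python
import hashlib

def to_base(number, domain):
    """Write a non-negative integer using the symbols of `domain`."""
    base = len(domain)
    symbols = []
    while True:
        number, digit = divmod(number, base)
        symbols.append(domain[digit])
        if number == 0:
            break
    return "".join(reversed(symbols))

def sha3_id(value, seed, length=12, domain="0123456789"):
    """Hash `value` to `length` bytes, then convert the bytes to `domain`."""
    digest = hashlib.shake_256(f"{seed}:{value}".encode("utf-8")).digest(length)
    number = int.from_bytes(digest, "big")
    # Width needed to write the largest `length`-byte value in this base.
    width = len(to_base((1 << (8 * length)) - 1, domain))
    return to_base(number, domain).rjust(width, domain[0])

masked = sha3_id("dorothy@example.com", seed=123)
print(len(masked))  # 29
```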

Range

  - selector:
      jsonpath: "age"
    mask:
      range: 5

This mask will replace an integer value {"age": 27} with a range like this {"age": "[25;29]"}.
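The example is consistent with buckets aligned on multiples of the range size, which can be sketched in Python as below. The alignment rule is inferred from this single example, and to_range is a hypothetical helper.

```python
def to_range(value, size):
    """Replace an integer by the bucket of width `size` that contains it."""
    lower = (value // size) * size
    return f"[{lower};{lower + size - 1}]"

print(to_range(27, 5))  # [25;29]
```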

Pipe

If the data structure contains arrays of objects like in the example below, this mask can pipe the objects into a sub-pipeline definition.

data.jsonl

{
    "organizations": [
        {
            "domain": "company.com",
            "persons": [
                {
                    "name": "leona",
                    "surname": "miller",
                    "email": ""
                },
                {
                    "name": "joe",
                    "surname": "davis",
                    "email": ""
                }
            ]
        },
        {
            "domain": "company.fr",
            "persons": [
                {
                    "name": "alain",
                    "surname": "mercier",
                    "email": ""
                },
                {
                    "name": "florian",
                    "surname": "legrand",
                    "email": ""
                }
            ]
        }
    ]
}

masking.yml

version: "1"
seed: 42
masking:
  - selector:
      # this path points to an array of persons
      jsonpath: "organizations.persons"
    mask:
      # it will be piped to the masking pipeline definition below
      pipe:
        # the parent object (a domain) will be accessible with the "_" variable name
        injectParent: "_"
        masking:
          - selector:
              jsonpath: "name"
            mask:
              # fields inside the person object can be accessed directly
              template: "{{ title .name }}"
          - selector:
              jsonpath: "surname"
            mask:
              template: "{{ title .surname }}"
          - selector:
              jsonpath: "email"
            mask:
              # the value stored inside the parent object is accessible through "_" thanks to the parent injection
              template: "{{ lower .name }}.{{ lower .surname }}@{{ ._.domain }}"

In addition to the injectParent property, this mask also provides the injectRoot property to inject the whole data structure.

It is possible to simplify the masking.yml file by referencing an external yaml definition :

version: "1"
seed: 42
masking:
  - selector:
      jsonpath: "organizations.persons"
    mask:
      pipe:
        injectParent: "domain"
        file: "./masking-person.yml"

Be sure to check demo to get more details about this mask.

Luhn

The Luhn algorithm is a simple checksum formula used to validate a variety of identification numbers.

The luhn mask can calculate the checksum for any value.

  - selector:
      jsonpath: "siret"
    mask:
      luhn: {}

In this example, the siret value will be appended with the correct checksum, to create a valid SIRET number (french business identifier).
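For reference, the standard decimal Luhn check digit can be computed as in the Python sketch below. It illustrates the algorithm itself, not PIMO's code; luhn_check_digit and append_luhn are hypothetical names and the sample numbers are arbitrary.

```python
def luhn_check_digit(payload):
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk from the rightmost digit; double every second digit,
    # starting with the one next to the future check digit.
    for index, char in enumerate(reversed(payload)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)

def append_luhn(payload):
    """Append the check digit so that the whole number passes the Luhn test."""
    return payload + luhn_check_digit(payload)

print(append_luhn("7992739871"))  # 79927398713
```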

The mask can be configured to use a different universe of valid characters, internally using the Luhn mod N algorithm.

  - selector:
      jsonpath: "siret"
    mask:
      luhn:
        universe: "abcde"

Markov

Markov chains produce pseudo text based on a sample text.

sample.txt

I want a cheese burger
I need a cheese cake

masking.yml

  - selector:
      jsonpath: "comment"
    mask:
      markov:
        max-size: 20
        sample: "file://sample.txt"
        separator: " "

This example will mask the comment field of the input jsonlines with a random value generated by the markov mask with an order of 2. The different possibilities generated from sample.txt will be :

I want a cheese burger
I need a cheese burger
I want a cheese cake
I need a cheese cake

The separator field defines the way the sample text will be split ("" for splitting into characters, " " for splitting into words)
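The four possibilities listed above can be reproduced with a small order-2 chain. The Python sketch below is an illustration under assumptions: the start and end markers, the reading of max-size as a number of tokens, and the names build_chain, generate and all_outputs are all hypothetical, and the enumeration only terminates because this sample contains no loop.

```python
import random
from collections import defaultdict

START, END = "<start>", "<end>"

def build_chain(lines, order=2, separator=" "):
    """Map each run of `order` tokens to the set of tokens that can follow."""
    chain = defaultdict(set)
    for line in lines:
        tokens = line.split(separator) if separator else list(line)
        padded = [START] * order + tokens + [END]
        for i in range(len(padded) - order):
            chain[tuple(padded[i:i + order])].add(padded[i + order])
    return chain

def generate(chain, rng, order=2, separator=" ", max_size=20):
    """Random walk through the chain, stopping at END or after max_size tokens."""
    state, produced = (START,) * order, []
    while len(produced) < max_size:
        token = rng.choice(sorted(chain[state]))
        if token == END:
            break
        produced.append(token)
        state = state[1:] + (token,)
    return separator.join(produced)

def all_outputs(chain, order=2, separator=" "):
    """Enumerate every text the chain can produce (loop-free samples only)."""
    results = set()
    def walk(state, produced):
        for token in chain[state]:
            if token == END:
                results.add(separator.join(produced))
            else:
                walk(state[1:] + (token,), produced + [token])
    walk((START,) * order, [])
    return results

chain = build_chain(["I want a cheese burger", "I need a cheese cake"])
print(sorted(all_outputs(chain)))
print(generate(chain, random.Random(1)))
```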

Transcode

This mask produces a random string by preserving character classes from the original value.

masking.yml

- selector:
    jsonpath: "id"
  mask:
    transcode:
      classes:
      - input: "0123456789abcdefABCDEF"
        output: "0123456789abcdef"

This example will mask the original id value by replacing every character from the input class by a random character from the output class.

$ echo '{"id": "1ef619-90F"}' | pimo
{"id": "d8e203-a92"}

By default, if not specified otherwise, these classes will be used (input -> output):

  • lowercase letters -> lowercase letters
  • UPPERCASE LETTERS -> UPPERCASE LETTERS
  • Digits -> Digits

# this configuration:
- selector:
    jsonpath: "id"
  mask:
    transcode: {}
# is equivalent to:
- selector:
    jsonpath: "id"
  mask:
    transcode:
      classes:
        - input: "abcdefghijklmnopqrstuvwxyz"
          output: "abcdefghijklmnopqrstuvwxyz"
        - input: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          output: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        - input: "0123456789"
          output: "0123456789"
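
As an illustration of the class-based replacement, here is a small Python sketch. It only mirrors the behavior described above; the function name transcode and the DEFAULT_CLASSES constant are hypothetical, and PIMO's seeded random generator is replaced by Python's random module.

```python
import random

# Default classes as listed above: each character stays in its own class.
DEFAULT_CLASSES = [
    ("abcdefghijklmnopqrstuvwxyz", "abcdefghijklmnopqrstuvwxyz"),
    ("ABCDEFGHIJKLMNOPQRSTUVWXYZ", "ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
    ("0123456789", "0123456789"),
]


def transcode(value, classes=DEFAULT_CLASSES, rng=random):
    """Replace each character found in an input class by a random output character."""
    out = []
    for ch in value:
        for input_chars, output_chars in classes:
            if ch in input_chars:
                out.append(rng.choice(output_chars))
                break
        else:
            out.append(ch)  # characters outside every class are kept as-is
    return "".join(out)


hex_classes = [("0123456789abcdefABCDEF", "0123456789abcdef")]
print(transcode("1ef619-90F", hex_classes))
```

The dash in the example value belongs to no class, so it is preserved and the masked value keeps the shape of the original.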

Return to list of masks

FindInCSV

Try it

This mask compares targeted values or combinations of values from a JSON Entry with values from a CSV file, inserting the matched CSV line into the designated field of the JSON entry.

{"type_1": "fire", "name": "carmender"}

Input CSV

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
...
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...

Pokemon CSV

version: "1"
masking:
  - selector:
      jsonpath: "info"
    masks:
        - add : ""                                       # add key "info" with value "" in json Entry
        - findInCSV:
            uri: "https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv"
            exactMatch:                                  # optional: you can only use exact match or both
                csv: '{{(index . "Type 1") | lower }}'
                entry: "{{.type_1}}"
            jaccard:                                     # optional: you can only use jaccard match or both
                csv: "{{.Name | lower }}"
                entry: "{{.name |lower}}"
            expected: "at-least-one"                     # optional: only-one, at-least-one or many, by default: at-least-one
            header: true                                 # optional: csv has a header line, use it to name fields, default: false
            trim: true                                   # optional: trim space in values and headers, default: false

In this scenario, the findInCSV mask is applied to the "info" field in the JSON entry. The mask uses both exact matching and Jaccard similarity: the lines selected by the exact match are passed on to the Jaccard match. The configuration expected: "at-least-one" will return the most similar CSV line, which is then saved in the info field. If expected: "many" is used, the Jaccard match will return all matched lines in order of similarity. Using expected: "only-one" results in an error if the match yields more than one line. The Jaccard match offers flexibility in handling variations in the entry, such as differences in accents or letter case, by leveraging the Jaccard similarity metric.
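
To show why the misspelled "carmender" still finds "Charmander", here is a small Python sketch of the Jaccard similarity. How PIMO tokenizes values before comparing them is not described here, so the use of character bigrams is an assumption, and the helper names are illustrative.

```python
def bigrams(text):
    """Set of overlapping two-character substrings (assumed tokenization)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of the bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


names = ["bulbasaur", "charmander"]
best = max(names, key=lambda name: jaccard("carmender", name))
print(best)  # prints charmander
```

Here "carmender" and "charmander" share 5 bigrams out of 12 distinct ones, while "carmender" and "bulbasaur" share none.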

Here is the result of execution:

{
  "type_1": "fire",
  "name": "carmender",
  "info": {
    "#": "4",
    "Name": "Charmander",
    "Type 1": "Fire",
    "Type 2": "",
    "Total": "309",
    "HP": "39",
    "Attack": "52",
    "Defense": "43",
    "Sp. Atk": "60",
    "Sp. Def": "50",
    "Speed": "65",
    "Generation": "1",
    "Legendary": "False"
  }
}

Return to list of masks

Timeline

Try it

This mask can generate multiple dates related to each other, for example :

version: "1"
seed: 42
masking:
  - selector:
      jsonpath: "timeline"
    masks:
      - add: ""
      - timeline:
          start:
            name: "start" # name the first point in the timeline
            value: "2006-01-02T15:04:05Z" # optional : current date if not specified
          format: "2006-01-02" # output format for the timeline
          points:
            - name: "birth"
              min: "-P80Y" # lower bound for this date ISO 8601 duration
              max: "-P18Y" # upper bound for this date ISO 8601 duration
            - name: "contract"
              from: "birth" # bounded relative to "birth" (if not specified, then relative to start point)
              min: "+P18Y"
              max: "+P40Y"
            - name: "promotion"
              from: "contract"
              min: "+P0"
              max: "+P5Y"

Will generate :

$ pimo --empty-input
{"timeline":{"start":"2006-01-02","birth":"1980-12-01","contract":"2010-07-16","promotion":"2010-12-06"}}

Constraints

before and after constraints can be set to create better timelines, for example :

            - name: "begin"
              min: "P0"
              max: "+P80Y"
            - name: "end"
              min: "P0"
              max: "+P80Y"
              constraints:
                - before: "begin"

The dates begin and end will both be chosen from the same interval, but end will always be after begin.

To enforce this, the timeline mask will regenerate all dates until all constraints are met, up to 200 retries. If there are still unsatisfied constraints after 200 attempts, the mask will set the date to null.

This default behavior can be changed with the following parameters :

  • retry sets the maximum number of retries (it can be set to 0 to disable retrying)

              - timeline:
                  start:
                    name: "start"
                    value: "2006-01-02T15:04:05Z"
                  format: "2006-01-02"
                  retry: 0 # constraints will fail immediately if not satisfied

  • onError changes the default behavior, which sets the date to null if constraints cannot be satisfied; the following values are accepted :

    • default : use a default value, this is the standard behavior when onError is unset (see next item for how to change the default value)
    • reject : fail masking of the current line with an error

    onError is defined on each constraint, for example :

              - name: "begin"
                min: "P0"
                max: "+P80Y"
              - name: "end"
                min: "P0"
                max: "+P80Y"
                constraints:
                  - before: "begin"
                    onError: "reject"

  • default sets the default value to use when an error occurs; if not set, the null value is the default

              - name: "begin"
                min: "P0"
                max: "+P80Y"
              - name: "end"
                min: "P0"
                max: "+P80Y"
                constraints:
                  - after: "begin"
                default: "begin" # use begin date if constraint can't be satisfied

Epsilon

The epsilon parameter is the minimum period of time between two dates to validate a constraint.

It can be set globally on the timeline to make sure dates under constraints have a minimum amount of time between them.

            - timeline:
                start:
                  name: "today"
                  value: "2006-01-02T15:04:05Z"
                format: "2006-01-02"
                retry: 0
                epsilon: "P1Y" # minimum 1 year between dates (in constraints)

For example, this constraint will fail if begin is 2007-12-20 and end is 2008-05-21 (less than a year between dates).

            - name: "end"
              min: "P0"
              max: "+P80Y"
              constraints:
                - after: "begin"

It can be set locally on a single constraint (override global epsilon parameter).

                    constraints:
                      - after: "contract"
                        epsilon: "P0" # will override global epsilon config
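
The interplay of retry, epsilon and the null fallback can be pictured with the following Python sketch. It is a simplified model, not PIMO's algorithm: the function name draw_begin_end is hypothetical, intervals are expressed in days instead of ISO 8601 durations, and a gap exactly equal to epsilon is assumed to satisfy the constraint.

```python
import random
from datetime import date, timedelta


def draw_begin_end(start, max_days, epsilon_days=0, retry=200, rng=random):
    """Draw begin and end in [start, start + max_days]; end must follow begin by epsilon."""
    epsilon = timedelta(days=epsilon_days)
    for _ in range(retry + 1):  # one initial attempt plus `retry` retries
        begin = start + timedelta(days=rng.randint(0, max_days))
        end = start + timedelta(days=rng.randint(0, max_days))
        if end - begin >= epsilon:
            return begin, end
    return begin, None  # constraint never satisfied: fall back to null


begin, end = draw_begin_end(date(2006, 1, 2), max_days=80 * 365, epsilon_days=365)
print(begin, end)
```

With retry set to 0 the loop runs once, so an unlucky first draw immediately produces the fallback value.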

Return to list of masks

XML

Try it

The XML mask feature enhances PIMO's capabilities by enabling users to manipulate XML content within JSON values. The proposed syntax aims to align with existing masking conventions for ease of use.

Input JSON

{
    "title": "my blog note",
    "content": "<note author='John Doe'><date>10/10/2023</date>This is a note of my blog....</note>"
}

masking.yml

version: "1"
masking:
  - selector:
      jsonpath: "content"
    mask:
      xml:
        xpath: "note"
        # the parent object (a domain) will be accessible with the "_" variable name.
        injectParent: "_"
        masking:
        - selector:
            jsonpath: "@author"
          mask:
            # To use a parent value in template: {{. + injectParentName + . + jsonKey}}
            template: "{{._.title}}"
        - selector:
            jsonpath: "date"
          masks:
            - randDate:
                dateMin: "1970-01-01T00:00:00Z"
                dateMax: "2020-01-01T00:00:00Z"
            - template: "{{index . \"date\"}}"

This example masks the original attribute value with the specified template value. jsonpath: "content" points to the JSON key that contains the target XML content to be masked. The masking section applies all masks to the target attribute or tag in the XML.

The parent object (a domain) is accessible with the "_" variable name. To use a parent value in a template: {{. + injectParentName + . + jsonKey}}

For more information on parsing XML files, refer to Parsing-XML-files.

Output JSON

{
  "title": "my blog note",
  "content": "<note author='my blog note'><date>2008-06-07 04:34:17 +0000 UTC</date>This is a note of my blog....</note>"
}
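
For readers who want to picture the transformation outside PIMO, here is a rough Python equivalent using the standard xml.etree.ElementTree module. It only imitates the effect of the configuration above; the fixed date is a placeholder for the randDate result, and the serialized output uses double quotes around attributes.

```python
import xml.etree.ElementTree as ET

entry = {
    "title": "my blog note",
    "content": "<note author='John Doe'><date>10/10/2023</date>This is a note of my blog....</note>",
}

note = ET.fromstring(entry["content"])   # element selected by xpath "note"
note.set("author", entry["title"])       # "@author" receives the parent's title
note.find("date").text = "2008-06-07"    # placeholder for the randDate result
entry["content"] = ET.tostring(note, encoding="unicode")
print(entry["content"])
```

The text that follows the date tag is untouched, because only the selected attribute and tag are rewritten.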

Return to list of masks

Parsing-XML-files

To mask data in an XML file with PIMO, use the following command :

  cat data.xml | pimo xml --subscriber parentTagName=MaskName.yml > maskedData.xml

Pimo selects specific tags within a predefined parent tag, replaces their text and stores the entire data in a new XML file. These specific tags should not contain any other nested tags.

To mask values of attributes, follow the rules to define your choice in jsonpath in masking.yml.

  • For attributes of parent tag, we use: @attributeName in jsonpath.
  • For attributes of child tag, we use: childTagName@attributeName in jsonpath.

For example, consider an XML file named data.xml:

data.xml

<?xml version="1.0" encoding="UTF-8"?>
<taxes>
    <agency>
        <name>NewYork Agency</name>
        <agency_number>0032</agency_number>
    </agency>
    <account type="classic">
        <name age="25">Doe</name>
        <account_number>12345</account_number>
        <annual_income>50000</annual_income>
    </account>
    <account type="saving">
        <name age="50">Smith</name>
        <account_number>67890</account_number>
        <annual_income>60000</annual_income>
    </account>
</taxes>

In this example, you can mask the values of agency_number in the agency tag and the values of name and account_number in the account tag using the following command:

  cat data.xml | pimo xml --subscriber agency=masking_agency.yml --subscriber account=masking_account.yml > maskedData.xml

masking_agency.yml

version: "1"
seed: 42

masking:
  - selector:
      jsonpath: "agency_number"  # this is the name of tag that will be masked
    mask:
      template: '{{MaskRegex "[0-9]{4}$"}}'

masking_account.yml

version: "1"
seed: 42

masking:
  - selector:
      jsonpath: "name" # this is the name of tag that will be masked
    mask:
      randomChoiceInUri: "pimo://nameFR"
  - selector:
      jsonpath: "@type" # this is the name of parent tag's attribute that will be masked
    mask:
        randomChoice:
         - "classic"
         - "saving"
         - "securitie"
  - selector:
      jsonpath: "account_number" # this is the name of tag that will be masked
    masks:
      - incremental:
          start: 1
          increment: 1
        # incremental changes the string to an int; use a template to restore a string value in the XML file
      - template: "{{.account_number}}"
  - selector:
      jsonpath: "name@age" # this is the name of child tag's attribute that will be masked
    masks:
      - randomInt:
         min: 18
         max: 95
         # @ is not accepted in Go template field names, so use index in the template to convert the int back to a string
      - template: "{{index . \"name@age\"}}"

After executing the command with the correct configuration, here is the expected result in the file maskedData.xml:

maskedData.xml

<?xml version="1.0" encoding="UTF-8"?>
<taxes>
    <agency>
        <name>NewYork Agency</name>
        <agency_number>2308</agency_number>
    </agency>
    <account type="saving">
        <name age="33">Rolande</name>
        <account_number>1</account_number>
        <annual_income>50000</annual_income>
    </account>
    <account type="saving">
        <name age="47">Matéo</name>
        <account_number>2</account_number>
        <annual_income>60000</annual_income>
    </account>
</taxes>
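
The selector rules (tag name, @attributeName, childTagName@attributeName) can be illustrated with a short Python sketch. This is not how PIMO resolves selectors internally; the resolve and read helpers are hypothetical and only show which part of the parent tag each kind of jsonpath addresses.

```python
import xml.etree.ElementTree as ET


def resolve(parent, selector):
    """Return (element, attribute name or None) for a selector applied to parent."""
    tag, sep, attribute = selector.partition("@")
    element = parent if tag == "" else parent.find(tag)
    return element, (attribute if sep else None)


def read(parent, selector):
    """Read the text of a tag, or the value of an attribute, addressed by selector."""
    element, attribute = resolve(parent, selector)
    return element.text if attribute is None else element.get(attribute)


account = ET.fromstring(
    '<account type="classic"><name age="25">Doe</name>'
    "<account_number>12345</account_number></account>"
)
for selector in ("name", "@type", "name@age", "account_number"):
    print(selector, "->", read(account, selector))
```

Each selector is evaluated relative to one occurrence of the parent tag given to --subscriber, here a single account element.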

Return to list of masks

pimo:// scheme

Pimo embeds a useful list of fake data. URIs that begin with the pimo:// scheme point to the pseudo files below.

| name | description |
| --- | --- |
| nameEN | English female or male names |
| nameENF | English female names |
| nameENM | English male names |
| nameFR | French female or male names |
| nameFRF | French female names |
| nameFRM | French male names |
| surnameFR | French surnames |
| townFR | French town names |

The content of the built-in lists is in the maskingdata package.

Flow chart

PIMO can generate a Mermaid syntax flow chart to visualize the transformation process.

For example, the command pimo flow masking.yml > masking.mmd, run with that masking.yml file, generates the following chart :

Visual Studio Code

To integrate with Visual Studio Code, download the YAML extension.

Then, edit the yaml.schemas entry of your Visual Studio Code settings so that it contains the following configuration:

{
  "yaml.schemas": {
    "https://raw.githubusercontent.com/CGI-FR/PIMO/main/schema/v1/pimo.schema.json": "/**/*masking*.yml"
  }
}

Using this configuration, the schema will be applied to every YAML file containing the word masking in its name.

Contributors

Licence

Copyright (C) 2021 CGI France

PIMO is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

PIMO is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with PIMO. If not, see http://www.gnu.org/licenses/.

pimo's People

Contributors

adrienaury, capkicklee, chao-ma5566, dependabot[bot], giraud10, p0labrd, romandguillaume, youen

pimo's Issues

[PROPOSAL] New mask : transcode

Transcode mask

A new mask that replaces every character with a character of another class.

By default, the following transcoding character classes are used

  • lowercase letters -> lowercase letters
  • UPPERCASE LETTERS -> UPPERCASE LETTERS
  • Digits -> Digits

Example

- selector:
    jsonpath: "id"
  mask:
    transcode: {}

$ echo '{"id": "12345-ABCD-6789"}' | pimo
{"id": "30274-RPDM-2883"}

Example 2 : Define custom classes

Convert hexadecimal values to random hexadecimal values using only lowercase letters

- selector:
    jsonpath: "id"
  mask:
    transcode:
      classes:
      - input: "0123456789abcdefABCDEF"
        output: "0123456789abcdef"

$ echo '{"id": "1ef619-90F"}' | pimo
{"id": "d8e203-a92"}

Mask some characters by a star

- selector:
    jsonpath: "pseudo"
  mask:
    transcode:
      classes:
      - input: "abcdefghijklmnopqrstuvwxyz"
        output: "*"

$ echo '{"pseudo": "mark_23"}' | pimo
{"pseudo": "****_23"}

[PROPOSAL] Define masking in a one-liner

One line masking definition

Sometimes, when iterating quickly in a testing phase, it can be useful to run a pipeline without writing a masking.yml file.

Example

$ echo '{"value": ""}' | pimo --repeat 5 --mask "value=[{fluxUri: 'pimo://nameFR'}]"
{"value": "Aaron"}
{"value": "Abel"}
{"value": "Abel-François"}
{"value": "Abélard"}
{"value": "Abelin"}

This is equivalent to the use of the following masking.yml :

version: "1"
masking:
  - selector:
      jsonpath: "value"
    mask:
      fluxUri: "pimo://nameFR"

The syntax to define masks in one line can be a minified version of YAML, e.g. : https://onlineyamltools.com/minify-yaml

[BUG] Null values protection

PIMO is very sensitive to null values; most of the masks generate panic errors when encountering null values.

10:48AM INF Mask hash config=masking.yml context=stdin[1] input-line=1 output-line=1 path=prenom
panic: interface conversion: model.Entry is nil, not string

goroutine 1 [running]:
github.com/cgi-fr/pimo/pkg/hash.MaskEngine.Mask(0xc000107000, 0x375, 0x375, 0x0, 0x0, 0xc00026f600, 0x2, 0x2, 0x10, 0xc739c0, ...)
...

The default behavior when encountering a null value that can't be handled should be to ignore it. A null value is never sensitive data to anonymize.
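
The proposed behavior amounts to a guard placed in front of every mask. The Python sketch below is only a model of that idea: skip_null and hash_mask are hypothetical names, and hash_mask is a stand-in for a mask that, like the one in the stack trace, only handles strings.

```python
import hashlib


def skip_null(mask):
    """Wrap a mask so that None is passed through untouched."""
    def wrapped(value):
        return value if value is None else mask(value)
    return wrapped


def hash_mask(value):
    # Stand-in for a mask that only knows how to handle strings.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


safe_hash = skip_null(hash_mask)
print(safe_hash(None))
print(safe_hash("Pierre"))
```

Calling hash_mask(None) directly would raise an error, which is the Python analogue of the panic shown above.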

[PROPOSAL] Implicit add

Implicit add if the field is missing

Frequently, a field needs to be created, then assigned a value.

version: "1"
masking:
  - selector :
      jsonpath: "gender"
    mask:
      add: ""
  - selector :
      jsonpath: "gender"
    mask:
      randomChoice:
        - "M"
        - "F"

This can be simplified with an auto-add feature

version: "1"
masking:
  # directly use the mask that sets the value
  - selector :
      jsonpath: "gender"
    mask:
      randomChoice:
        - "M"
        - "F"

Auto-add could also be disabled by default and enabled on demand

version: "1"
masking:
  - selector :
      jsonpath: "gender"
    autoadd: true
    mask:
      randomChoice:
        - "M"
        - "F"
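
The opt-in variant can be modelled in a few lines of Python. This is a sketch of the proposal, not of existing PIMO behavior: the apply_mask helper is hypothetical, and leaving the entry unchanged when the field is missing and autoadd is off is an assumption.

```python
import random


def apply_mask(entry, key, mask, autoadd=False):
    """Apply mask to entry[key]; create the key first when autoadd is enabled."""
    if key not in entry:
        if not autoadd:
            return entry      # field missing and autoadd off: nothing to mask
        entry[key] = ""       # same effect as an explicit `add: ""` step
    entry[key] = mask(entry[key])
    return entry


rng = random.Random(42)
gender = lambda _value: rng.choice(["M", "F"])
print(apply_mask({}, "gender", gender, autoadd=True))
```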

[bug] pipe mask after a fromjson mask yield a panic error

With the following pipeline, pimo stops with a panic error.

# PIMO configuration file version
version: "1"
# Pseudo-random generator initialization (optional)
seed: 42
# Ordered list of masks to apply
masking:
  - selector:
      jsonpath: "numberOfPet"
    mask:
      add: ""

  - selector:
      jsonpath: "numberOfPet"
    mask:
      randomInt:
        min: 4
        max: 10

  - selector:
      jsonpath: "fk_pets_owner_id"
    mask:
      add: ""
  - selector:
      jsonpath: "fk_pets_owner_id"
    mask:
      template: |
        [
          {{- range  $index := until (int .numberOfPet) -}}
            {{- if $index }},{{end -}}
            {
              "id": 7
            }
          {{- end -}}
        ]
  - selector:
      jsonpath: "fk_pets_owner_id"
    mask:
      fromjson: "fk_pets_owner_id"


  - selector:
      jsonpath: "fk_pets_owner_id"
    mask:
      pipe:
        masking:
          - selector:
              jsonpath: id
            mask:
              incremental:
                start: 1
                increment: 1

pimo --empty-input > with-pipe-result.json
panic: interface conversion: interface {} is map[string]interface {}, not model.Dictionary

goroutine 1 [running]:
github.com/cgi-fr/pimo/pkg/model.CleanDictionary(...)
        /workspace/pkg/model/ordered_dict.go:95
github.com/cgi-fr/pimo/pkg/model.CleanDictionarySlice(0x873720, 0xc00000ceb8, 0xc000024468, 0x10, 0xc0001e9dd8)
        /workspace/pkg/model/ordered_dict.go:104 +0x50c
github.com/cgi-fr/pimo/pkg/pipe.MaskEngine.MaskContext(0x0, 0x0, 0x9b0f10, 0xc000039540, 0x0, 0x0, 0x0, 0x0, 0xc00000cf00, 0xc000024468, ...)
        /workspace/pkg/pipe/pipe.go:69 +0x145
github.com/cgi-fr/pimo/pkg/model.(*MaskContextEngineProcess).ProcessDictionary.func2(0xc00000cf00, 0xc00000cf00, 0xc000024468, 0x10, 0x873720, 0xc00000ceb8, 0xc0001e9d58, 0xc00017b838, 0xc0001e9d40)
        /workspace/pkg/model/process_maskcontext.go:45 +0xb5
github.com/cgi-fr/pimo/pkg/model.selector.applyContext(0xc000024468, 0x10, 0x0, 0x0, 0xc00000cf00, 0xc00000cf00, 0x873720, 0xc00000ceb8, 0xc00000e480, 0x1, ...)
        /workspace/pkg/model/selector.go:203 +0x9c
github.com/cgi-fr/pimo/pkg/model.selector.applySubContext(0xc000024468, 0x10, 0x0, 0x0, 0xc00000cf00, 0xc00000cf00, 0xc00000e480, 0x1, 0x1, 0xc00000cf30)
        /workspace/pkg/model/selector.go:196 +0x339
github.com/cgi-fr/pimo/pkg/model.selector.ApplyContext(...)
        /workspace/pkg/model/selector.go:170
github.com/cgi-fr/pimo/pkg/model.(*MaskContextEngineProcess).ProcessDictionary(0xc0001db360, 0xc00000ced0, 0x9aa800, 0xc0001db380, 0x0, 0x0)
        /workspace/pkg/model/process_maskcontext.go:44 +0x1de
github.com/cgi-fr/pimo/pkg/model.(*ProcessPipeline).Next(0xc0000395c0, 0x0)
        /workspace/pkg/model/model.go:397 +0x96
github.com/cgi-fr/pimo/pkg/model.SimpleSinkedPipeline.Run(0x9b2a28, 0xc0000395c0, 0x9aeda0, 0xc0001db3a0, 0x0, 0x0)
        /workspace/pkg/model/model.go:445 +0x15f
main.run()
        /workspace/cmd/pimo/main.go:196 +0xba9
main.main.func1(0xc0001be280, 0xc0000694a0, 0x0, 0x1)
        /workspace/cmd/pimo/main.go:94 +0x25
github.com/spf13/cobra.(*Command).execute(0xc0001be280, 0xc00001e050, 0x1, 0x1, 0xc0001be280, 0xc00001e050)
        /home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:856 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001be280, 0xc00017bf20, 0x1, 0x1)
        /home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:960 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
        /home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:897
main.main()
        /workspace/cmd/pimo/main.go:121 +0x713

[PROPOSAL] condition the execution of a mask for null or empty value

Problem

To preserve null or "" values in the output we have to use a template mask with conditional test.

  - selector:
      jsonpath: "comment"
    mask:
      template: |-
        {{if kindIs "string" .comment}}{{if eq "" .comment}}""{{else}}"Com_Fiche"{{end}}{{else}}null{{end}}

This kind of test is verbose and has to be repeated for each mask in the chain.

Proposal

This is a proposal to add a new preserve attribute to a pipeline step, set to one of the options "null", "empty", "blank" (null or empty), or "none" (default).

The equivalent mask using the preserve feature is

  - selector:
      jsonpath: "comment"
    preserve: "blank"
    mask:
      template: "Com_Fiche"
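
The semantics of the four options can be written down as a small Python sketch. It models the proposal only; is_preserved and mask_comment are hypothetical names, and None stands for a JSON null.

```python
def is_preserved(value, preserve="none"):
    """Tell whether value must be left untouched for the given preserve option."""
    if preserve == "null":
        return value is None
    if preserve == "empty":
        return value == ""
    if preserve == "blank":
        return value is None or value == ""
    return False  # "none": every value is masked


def mask_comment(value, preserve="blank"):
    """Replace the comment by a constant unless the value is preserved."""
    return value if is_preserved(value, preserve) else "Com_Fiche"


for comment in (None, "", "call me back"):
    print(repr(comment), "->", repr(mask_comment(comment)))
```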

[REFACTORING] MaskFactoryConfiguration

To inject an additional parameter into a MaskFactory, for example caches or seed, all the MaskFactory functions have to be modified ....

I propose a refactoring that changes the MaskFactory interface so that it takes a new MaskFactoryConfiguration structure as its parameter. This way, injecting new external parameters will be simpler.

Originally posted by @youen in #139 (comment)

[PROPOSAL] Chain multiple masks on the same jsonpath

Chain masks on the same jsonpath

A new dedicated mask

  - selector :
      jsonpath: "birthdate"
    mask:
      chain:
        - randDate:
            dateMin: "1960-01-01T00:00:00Z"
            dateMax: "2002-12-31T00:00:00Z"
        - dateParser:
            outputFormat: "2006-01-02"

Can be used to split a big YAML file in chunks or to reuse the same YAML on different paths

  - selector :
      jsonpath: "birthdate"
    mask:
      chain:
        definition: mask-date.yaml
  - selector :
      jsonpath: "otherdate"
    mask:
      chain:
        definition: mask-date.yaml

[REFACTOR] global variables

Some global vars in package model should be encapsulated in a root object

var maskContextFactories []MaskContextFactory
var maskFactories []MaskFactory

[PROPOSAL] option Preserve for fromCacheProcess

Current situation

FromCacheProcess presented in README.md

If no match is found in the cache, fromCache blocks the current line, and the following lines are processed until matching content goes into the cache.

Sometimes we might want to preserve the current value if no match is found (data structure manipulation use)

masking.yml

version: 1
seed: 42
masking:
  - selector:
      jsonpath: id
    mask:
      fromCache: mycache
  - selector:
      jsonpath: name
    mask:
      randomChoiceInUri: "pimo://nameFR"
caches:
  mycache: {}

data.jsonl

{"id":1,"name":"Pierre"}
{"id":2,"name":"Paul"}
{"id":3,"name":"Jacques"}

cache.jsonl

{"key":1,"value":11}
{"key":3,"value":13}

cat data.jsonl | pimo --load-cache mycache=cache.jsonl    
{"id":11,"name":"Rolande"}
{"id":13,"name":"Matéo"}

Proposal

Using the preserve option (proposed in #56), we can notify the FromCacheProcess and skip the masking when no match is found

masking.yml

version: 1
seed: 42
masking:
  - selector:
      jsonpath: id
    mask:
      fromCache: mycache
    preserve: notInCache
  - selector:
      jsonpath: name
    mask:
      randomChoiceInUri: "pimo://nameFR"
caches:
  mycache: {}

cat data.jsonl | pimo --load-cache mycache=cache.jsonl    
{"id":11,"name":"Rolande"}
{"id":2, "name":"Aaron"}
{"id":13,"name":"Matéo"}
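
The difference between the two behaviors can be sketched in Python as follows. This is a simplified model: the from_cache helper is hypothetical, the name mask is left out, and a line held back by the current behavior is simply not emitted.

```python
def from_cache(rows, cache, preserve_not_in_cache=False):
    """Replace ids found in the cache; optionally keep rows whose id is unknown."""
    out = []
    for row in rows:
        if row["id"] in cache:
            out.append({**row, "id": cache[row["id"]]})
        elif preserve_not_in_cache:
            out.append(dict(row))  # proposal: keep the original value
        # otherwise the row is held back (simplified here to: not emitted)
    return out


rows = [{"id": 1}, {"id": 2}, {"id": 3}]
cache = {1: 11, 3: 13}
print(from_cache(rows, cache))
print(from_cache(rows, cache, preserve_not_in_cache=True))
```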

[PROPOSAL] HTTP API service

HTTP Service

This is a proposal to expose PIMO's pipeline as an HTTP service.

Stateless vs Stateful

PIMO's pipeline can be stateful (if it uses a sequence mask or a cache) or stateless. For this proposal, the PIMO HTTP API is stateless and each HTTP request simulates a PIMO run. A stateful implementation will be proposed in another issue.

HTTP server

The command pimo --http starts the HTTP server on port 8000. The port is configurable with the --port option.

HTTP API

The root path is /api/v1/

GET

The HTTP GET method simulates the --empty-input option

POST

The HTTP POST method sends the input as a body to the server

[bug] [fluxUri] Cannot use cache with mask fluxUri

With the following pipeline, pimo does not save ids in cache cacheId:
masking.yml

version: "1"
seed: 42
masking:
  - selector:
      jsonpath: "ID"
    mask:
      fluxUri: "file://test-id.csv"
    cache: "cacheId"
caches:
  cacheId : {}

test-id.csv

1001
1002
1003

After executing pimo with this config, the cacheId.jsonl file is empty.

$ pimo --dump-cache cacheId=cacheId.jsonl << EOF
> {"ID":1}
> {"ID":2}
> {"ID":3}
> EOF
{"ID":1001}
{"ID":1002}
{"ID":1003}

In comparison, using the randomChoiceInUri mask saves the ids just fine

[PROPOSAL] Enrich template functions by YAML definition

Problem

Complex multi-step templates can be difficult to read. We need a solution to share data processing formulas in a single configuration element.

Proposal

  • Add a new root section, functions, in masking.yml
  • Load functions in the template engine (as it is already done for Sprig functions, or NoAccent)

version: "1"
functions:
  rangLettre:
    params:
      lettre: string
    code: |-
      return lettre - 'A' + 1;
masking:
  - selector:
      jsonpath: "rang_lettre_J"
    mask:
      template: "{{rangLettre 'J'}}"

OR (simpler) :

version: "1"
functions: |-
  func rangLettre(lettre) {
    return lettre - 'A' + 1;
  }
masking:
  - selector:
      jsonpath: "rang_lettre_J"
    mask:
      template: "{{rangLettre 'J'}}"

Extends use of function in masks

version: "1"
functions: |-
  func anonRIB(rib) {
    ...
    return anonimizedRIB;
  }
masking:
  - selector:
      jsonpath: "RIB"
    mask:
      call: "anonRIB"

With params :

version: "1"
functions: |-
  func anonRIB(rib) {
    ...
    return anonimizedRIB;
  }
masking:
  - selector:
      jsonpath: "RIB"
    mask:
      call: 
        name: "anonRIB"
        paramsFromContext:
           - "RIB"

[PROPOSAL] Luhn mask

Problem

The Luhn algorithm is a simple checksum formula, used by the French national bureau of statistics (INSEE).

When a SIREN or SIRET code is anonymized, the last digit must be recalculated with the Luhn algorithm.

Solution

- selector:
    jsonpath: "siren"
  mask:
    # perform a luhn mod10 with the specified mapping on input string (char '0' = 0, char '1' = 1, ...)
    luhn:
      mod: 10
      map: "0123456789"

If len(map) != mod the mask will report a configuration error.

The previous example contains the default values for mod and map so it could be written as :

- selector:
    jsonpath: "siren"
  mask:
    # perform a luhn mod10 on a numeric string
    luhn: {}

$ echo '{"siren": "12345678"}' | pimo
{"siren": "123456782"}

[BUG] template over slice of map cause error

pimo version : 1.12.1

This venom test fails

  - name: template with range over slice of map
    steps:
      - script: rm -f masking.yml
      - script: |-
          cat > masking.yml <<EOF
          version: "1"
          masking:
            - selector:
                jsonpath: "CLE_V2_DOCUMENT.LIEN"
              mask :
                template: '[[if eq (int .CLE_V2_DOCUMENT.ID_LOC) (int "1") ]]OB=2016001-0/0 PDF[[else if has (int .CLE_V2_DOCUMENT.ID_LOC) (list 10 12 13 14 15 16 17 18 19 20 21 22) ]]29Z3lvv1r3a90ULQmGfiwddPPWq5W4fd[[else]][[.CLE_V2_DOCUMENT.LIEN]][[end]]'
          EOF
      - script: sed -i  "s/\[\[/\{\{/g"  masking.yml
      - script: sed -i  "s/\]\]/\}\}/g"  masking.yml
      - script: |-
          pimo <<EOF
          {"CODE_FAMILLE":"DOCDE","ID_DN":"1320000522255","NOM_CLE":"idRCI","TYPE_DOC":"AEMP","VALEUR_CLE":"1003468703","CLE_V2_DOCUMENT":[{"CODE_ERREUR_INTEGRATION":"","DATE_CREATION":"2020-03-11T19:42:12+01:00","DATE_MODIFICATION":null,"ID_DN":"1320000522255","ID_LOC":1,"LIEN":"OB=2020025-328/2272 PDF","METAS":"{\"idDoc\":1320000522255,\"cDoc\":\"AEMP\",\"idGED\":\"22003111942005580020370050331894\",\"cReg\":\"025\",\"taill\":204,\"tFlux\":\"UGUD\",\"idDE\":\"3997529\",\"dtArc\":\"20200310\",\"dtArr\":\"20200310\",\"dtDif\":\"20200310\",\"icon\":\"3997529\",\"cle\":\"R\",\"nbPag\":3,\"sTypo\":\"407\",\"idRCI\":1003468703,\"cCont\":1,\"cStaW\":3,\"caRec\":\"R1\",\"cFac\":\"W\",\"dses\":\"20200310\",\"dStaW\":\"20200311\",\"hDiff\":\"12:04:22\",\"typo\":\"40\",\"cAgen\":\"02012\",\"dtTrt\":\"20200311\",\"iGedO\":\"22003102003101204234489627952860\",\"corb\":\"TORECORD\",\"cSite\":\"25906\"}","STATUT_DN":2,"SUPPRIME":0,"TYPE_DOC":"AEMP","TYPE_MIME":"application/pdf"}]}
          EOF
        assertions:
          - result.code ShouldEqual 0
          - result.systemerr ShouldBeEmpty
          - result.systemoutjson.CLE_V2_DOCUMENT.0.LIEN ShouldEqual OB=2016001-0/0 PDF
       • template-with-range-over-slice-of-map FAILURE
Testcase "template with range over slice of map", step #4: Assertion "result.code ShouldEqual 0" failed. expected: 0  got: 4 (test/workspace/masking_template.yml:75)
Testcase "template with range over slice of map", step #4: Assertion "result.systemerr ShouldBeEmpty" failed. expected '9:09PM ERR Cannot execute pipeline error="Pipeline didn't complete run: template: template:1:29: executing \"template\" at <.CLE_V2_DOCUMENT.ID_LOC>: can't evaluate field ID_LOC in type model.Entry" config=masking.yml duration="550.823µs" input-line=1 output-line=1' to be empty but it wasn't (test/workspace/masking_template.yml:76)
Testcase "template with range over slice of map", step #4: Assertion "result.systemoutjson.CLE_V2_DOCUMENT.0.LIEN ShouldEqual OB=2016001-0/0 PDF" failed. expected: OB=2016001-0/0 PDF  got: <nil> (test/workspace/masking_template.yml:77)
ERROR running target 'test-int': in step 4: executing command: exit status 2

[Proposal] new markov mask

Motivation

The regex mask can generate random strings, but it is limited to small texts and can't generate pseudo-natural language.

Solution

Use a Markov chain [1] to produce pseudo-text based on an example. Add a new markov mask with transitions as parameters

mask: 
  markov :
    # protection against infinite loops
    max-size: 20
    parameters:
       - from: "I am"
         to: "a"
         weight : 0.5
       - from: "I am"
         to: "not"
         weight : 0.5
       - from: "am a"
         to: "free"
         weight : 1
       - from: "free"
         to: "man"
         weight : 1

Parameters are extremely verbose and should not be computed by hand. They should be externalized in a JSON file.

mask: 
  markov :
    # protection against infinite loops
    max-size: 20
    parameters: free-man.json

Or, better, computed from a sample text

mask: 
  markov :
    # protection against infinite loops
    max-size: 20
    sample: free-man.txt

[1] https://en.wikipedia.org/wiki/Markov_chain#Markov_text_generators
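
As a rough illustration of the sample-based variant, here is a minimal Go sketch of a Markov text generator. It is not the proposed implementation: it uses first-order transitions (one word of context, where the YAML example above uses two), and the names `Chain`, `Build`, and `Generate` are hypothetical. The `maxSize` argument plays the role of `max-size`.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// Chain maps a word to the words observed immediately after it.
// Repeated followers act as weights: a word seen twice is twice as likely.
type Chain map[string][]string

// Build learns first-order transitions from a whitespace-separated sample.
func Build(sample string) Chain {
	words := strings.Fields(sample)
	chain := Chain{}
	for i := 0; i+1 < len(words); i++ {
		chain[words[i]] = append(chain[words[i]], words[i+1])
	}
	return chain
}

// Generate walks the chain from start. It stops at a word with no follower
// or after maxSize words (the guard against infinite loops).
// pick(n) must return an index in [0, n), for example rng.Intn.
func (c Chain) Generate(pick func(n int) int, start string, maxSize int) string {
	if maxSize <= 0 {
		return ""
	}
	out := []string{start}
	for len(out) < maxSize {
		next := c[out[len(out)-1]]
		if len(next) == 0 {
			break
		}
		out = append(out, next[pick(len(next))])
	}
	return strings.Join(out, " ")
}

func main() {
	chain := Build("I am a free man I am not a number")
	rng := rand.New(rand.NewSource(42))
	fmt.Println(chain.Generate(rng.Intn, "I", 20))
}
```

Storing repeated followers instead of explicit weights keeps the sketch short; an explicit `weight` field, as in the proposal, would be needed to load transitions from a parameters file.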

[PROPOSAL] Reverse cache masking

Problem

PIMO can map original values to masked values using the cache feature.

version: "1"
seed: 42
masking:
  - selector:
      jsonpath: "category"
    mask:
      incremental:
        start: 1
        increment: 1
    # Optional cache (coherence preservation)
    cache: "cacheCategory"

caches:
  cacheCategory:
    # Optional bijective cache (enable re-identification if the cache is dumped on disk)
    unique: true

$ pimo --dump-cache cacheCategory=categoryCache.jsonl <<EOF
{ "category": "Animal" }
{ "category": "Food" }
{ "category": "IT" }
EOF
{ "category": 1 }
{ "category": 2 }
{ "category": 3 }

PIMO creates a cache file, categoryCache.jsonl:

{ "Animal": 1 }
{ "Food": 2 }
{ "IT": 3 }

No documentation explains how to restore the original data from the masked data and the cache file.

{ "category": 1 }
{ "category": 2 }
{ "category": 3 }
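
Conceptually, restoring the original values only requires inverting a bijective cache. The Go sketch below illustrates that idea on an in-memory map; it is not PIMO code, the function name `Invert` is hypothetical, and it assumes masked values are integers as in the example above.

```go
package main

import "fmt"

// Invert swaps the keys and values of a cache. It fails when two originals
// share a masked value, because the mapping is then not reversible.
func Invert(cache map[string]int) (map[int]string, error) {
	reversed := make(map[int]string, len(cache))
	for original, masked := range cache {
		if previous, seen := reversed[masked]; seen {
			return nil, fmt.Errorf("masked value %d maps to both %q and %q", masked, previous, original)
		}
		reversed[masked] = original
	}
	return reversed, nil
}

func main() {
	// Same content as the dumped categoryCache.jsonl in the example.
	cache := map[string]int{"Animal": 1, "Food": 2, "IT": 3}
	reversed, err := Invert(cache)
	if err != nil {
		panic(err)
	}
	for _, masked := range []int{1, 2, 3} {
		fmt.Printf("{\"category\": %q}\n", reversed[masked])
	}
}
```

The duplicate check shows why the `unique: true` option matters: without a bijective cache, a masked value cannot be attributed to a single original.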

feat(template) : access to context in nested arrays

I want to be able to modify a value in nested arrays by referencing the current value with a template mask.

data.jsonl

{"elements":[{"persons":[{"name":"bob"},{"name":"john"}]}]}

Expected

$ pimo <data.jsonl

{"elements":[{"persons":[{"name":"BOB"},{"name":"JOHN"}]}]}

Solutions that do not work

Using the same path as the selector

This refers to a field that does not exist ({"elements":{"persons":{"name":"bob"}}}) and generates an error.

masking.yml

version: "1"
seed: 42
masking:
  - selector:
      jsonpath: "elements.persons.name"
    mask:
      # this Go template syntax refers to a field that is not in a nested array
      template: "{{upper .elements.persons.name}}"

Result

$ pimo <data.jsonl

template: template:1:17: executing "template" at <.elements.persons.name>: can't evaluate field persons in type model.Entry

Using go template syntax to access elements in array

This will always use the elements of index 0, and will only give the expected result for the first element bob.

masking.yml

version: "1"
seed: 42
masking:
  - selector:
      jsonpath: "elements.persons.name"
    mask:
      # this Go template syntax refers to the single value at index (0;0)
      template: "{{upper (index (index .elements 0).persons 0).name}}"

Result

$ pimo <data.jsonl

{"elements":[{"persons":[{"name":"BOB"},{"name":"BOB"}]}]}

bug [randomChoice] use different seeds for different fields

Using this configuration

version: "1"
seed: 3
masking:
  - selector:
      jsonpath: "name"
    mask:
      randomChoiceInUri: "file://../names.txt"
  - selector:
      jsonpath: "name2"
    mask:
      randomChoiceInUri: "file://../names.txt"

name and name2 are always equal.
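
The behavior is what one would expect if every mask builds its random generator from the same global seed. The Go sketch below reproduces the effect with `math/rand` and shows one possible remedy, deriving a per-field seed from the global seed and the field path. The helper names `fieldSeed` and `pick` are hypothetical, and the derivation is an assumption rather than PIMO's actual fix.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// fieldSeed derives a per-field seed from the global seed and the field path.
// This derivation is a hypothetical remedy, not PIMO's implementation.
func fieldSeed(global int64, path string) int64 {
	h := fnv.New64a()
	h.Write([]byte(path))
	return global ^ int64(h.Sum64())
}

// pick returns the index chosen by a fresh generator built from seed.
func pick(seed int64, n int) int {
	return rand.New(rand.NewSource(seed)).Intn(n)
}

func main() {
	names := []string{"Alice", "Bob", "Carol", "Dave"}

	// Same seed for both fields: the two choices are always identical.
	fmt.Println(names[pick(3, len(names))], names[pick(3, len(names))])

	// One derived seed per field: the choices are no longer forced to match,
	// yet each field stays reproducible for a given global seed.
	fmt.Println(names[pick(fieldSeed(3, "name"), len(names))],
		names[pick(fieldSeed(3, "name2"), len(names))])
}
```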

[PROPOSAL] Structured logging with -v flag

PIMO needs to log what is happening to stderr.

The log might be structured with https://github.com/sirupsen/logrus, or (better performance) : https://github.com/rs/zerolog, https://github.com/uber-go/zap

The level of verbosity is passed via the -v flag, the default value (0) does not log anything, the other possible values are :

  1. error : log only errors
  2. warn : same as level 1 + warnings that should be checked by user
  3. info : same as level 2 + information about what is processed
  4. debug : same as level 3 + debugging information, to analyse what can cause an unexpected behavior
  5. trace : same as level 4 + tracing of events in code (enter function, exit function)

Example :

$ echo "{}" | pimo -v3 > result.jsonl
INFO[0000] Reading file from disk                      definition=file://.masking.yml
INFO[0000] Begin processing of pipeline                definition=file://.masking.yml
WARN[0000] Ignoring mask because path is non-existent  definition=file://.masking.yml path=name mask=randomInt

Logs can be in JSON format with --log-json flag

$ echo "{}" | pimo -v3 --log-json > result.jsonl
{"definition":"file://.masking.yml","level":"info","msg":"Reading file from disk","time":"2014-03-10 19:57:38.562264131 -0400 EDT"}
{"definition":"file://.masking.yml","level":"info","msg":"Begin processing of pipeline","time":"2014-03-10 19:57:38.562264131 -0400 EDT"}
{"definition":"file://.masking.yml","level":"warn","msg":"Ignoring mask because path is non-existent","time":"2014-03-10 19:57:38.562264131 -0400 EDT","path":"name","mask":"randomInt"}
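
The cumulative levels described above can be expressed as a simple threshold check. This Go sketch uses only the standard library and hypothetical names (`levelNames`, `Enabled`); a real implementation would more likely delegate to one of the logging libraries mentioned.

```go
package main

import "fmt"

// levelNames lists the levels in the order of the proposed -v values 1..5.
var levelNames = []string{"error", "warn", "info", "debug", "trace"}

// Enabled reports whether a message at the given level would be logged
// for a verbosity value. Verbosity 0 logs nothing.
func Enabled(verbosity int, level string) bool {
	for i, name := range levelNames {
		if name == level {
			return verbosity >= i+1
		}
	}
	return false // unknown level names are never logged
}

func main() {
	for _, level := range levelNames {
		fmt.Printf("-v3 logs %-5s: %v\n", level, Enabled(3, level))
	}
}
```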

[PROPOSAL] Mask http

New mask HTTP

- selector:
    # the . (dot) selector should select the whole dictionary, so the http response will replace the input
    jsonpath: "."
  mask:
    http:
      method: get
      url: https://www.data.gouv.fr/api/1/users/{{.userid}}/
      # auth:
      # headers:
- selector:
    jsonpath: "first_name"
  mask:
    randomChoiceInUri: pimo://nameFR

[BUG] Implement missing flags

The README documentation mentions these flags:

--skip-line-on-error This flag will totally skip a line if an error occurs masking a field.
--skip-field-on-error This flag will return output without a field if an error occurs masking this field.

But they are not currently implemented.

Either remove them from documentation or implement them.

[PERF] Mask pipe executed N times on array of length N

Data

data.json

{
    "organizations": [
        {
            "domain": "company.com",
            "persons": [
                {
                    "name": "leona",
                    "surname": "miller",
                    "email": ""
                },
                {
                    "name": "joe",
                    "surname": "davis",
                    "email": ""
                }
            ]
        },
        {
            "domain": "company.fr",
            "persons": [
                {
                    "name": "alain",
                    "surname": "mercier",
                    "email": ""
                },
                {
                    "name": "florian",
                    "surname": "legrand",
                    "email": ""
                }
            ]
        }
    ]
}

masking.yml

version: "1"
seed: 42
masking:
  - selector:
      jsonpath: "organizations.persons"
    mask:
      pipe:
        injectParent: "org"
        masking:
          - selector:
              jsonpath: "email"
            mask:
              template: "{{.name}}.{{.surname}}@{{.org.domain}}"

Execution

Note: data.json is passed twice to pimo to generate two lines

$ cat data.json data.json | jq -c "."  | pimo --log-json -v5 >/dev/null 2> >( jq "." | mlr --ijson --opprint --barred cat)

Actual result

+-------+-------------+-------------+-----------------------+---------------+
| level | config      | line-number | path                  | message       |
+-------+-------------+-------------+-----------------------+---------------+
| info  | masking.yml | 1           | organizations.persons | Mask pipe     |
| info  | -           | 1           | email                 | Mask template |
| info  | -           | 2           | email                 | Mask template |
| info  | masking.yml | 1           | organizations.persons | Mask pipe     |
| info  | -           | 1           | email                 | Mask template |
| info  | -           | 2           | email                 | Mask template |
| info  | masking.yml | 2           | organizations.persons | Mask pipe     |
| info  | -           | 1           | email                 | Mask template |
| info  | -           | 2           | email                 | Mask template |
| info  | masking.yml | 2           | organizations.persons | Mask pipe     |
| info  | -           | 1           | email                 | Mask template |
| info  | -           | 2           | email                 | Mask template |
+-------+-------------+-------------+-----------------------+---------------+

Expected result

+-------+-------------+-------------+-----------------------+---------------+
| level | config      | line-number | path                  | message       |
+-------+-------------+-------------+-----------------------+---------------+
| info  | masking.yml | 1           | organizations.persons | Mask pipe     |
| info  | -           | 1           | email                 | Mask template |
| info  | -           | 2           | email                 | Mask template |
| info  | masking.yml | 2           | organizations.persons | Mask pipe     |
| info  | -           | 1           | email                 | Mask template |
| info  | -           | 2           | email                 | Mask template |
+-------+-------------+-------------+-----------------------+---------------+

bug: jsonpath to array component not working

data.jsonl

{"elements":[{"persons": [{"phonenumber": "027"}]}]}

masking.yml

version: "1"
seed: 42
masking:
  - selector:
      jsonpath: "elements.persons.phonenumber"
    mask:
      regex: "0[1-7]( ([0-9]){2}){4}"

Result

$ pimo <data.jsonl >dataout.jsonl

panic: interface conversion: model.Entry is []model.Entry, not map[string]model.Entry

goroutine 1 [running]:
makeit.imfr.cgi.com/makeit2/scm/lino/pimo/pkg/model.ComplexePathSelector.Write(0xc00002c8a9, 0x7, 0x92c978, 0xc00001f730, 0xc0001ba300, 0x80e3e0, 0xc00001f7d0, 0xc00011dc20)
        /workspace/pkg/model/model.go:308 +0x6c5
makeit.imfr.cgi.com/makeit2/scm/lino/pimo/pkg/model.ComplexePathSelector.Write(0xc00002c8a0, 0x8, 0x92c938, 0xc00006e840, 0xc0001ba360, 0x807240, 0xc00000c870, 0xc00000c870)
        /workspace/pkg/model/model.go:302 +0x2de
makeit.imfr.cgi.com/makeit2/scm/lino/pimo/pkg/model.(*MaskEngineProcess).ProcessDictionary(0xc00006e880, 0xc0001ba360, 0x924260, 0xc00006e8a0, 0x0, 0x0)
        /workspace/pkg/model/model.go:460 +0x2dc
makeit.imfr.cgi.com/makeit2/scm/lino/pimo/pkg/model.(*ProcessPipeline).Next(0xc0000367c0, 0x0)
        /workspace/pkg/model/model.go:606 +0x96
makeit.imfr.cgi.com/makeit2/scm/lino/pimo/pkg/model.SimpleSinkedPipeline.Run(0x92bc50, 0xc0000367c0, 0x927e40, 0xc00001f740, 0x927e40, 0xc00001f740)
        /workspace/pkg/model/model.go:648 +0x89
main.run()
        /workspace/cmd/pimo/main.go:127 +0x65b
main.main.func1(0xc000128840, 0xba2298, 0x0, 0x0)
        /workspace/cmd/pimo/main.go:76 +0x25
github.com/spf13/cobra.(*Command).execute(0xc000128840, 0xc00001e210, 0x0, 0x0, 0xc000128840, 0xc00001e210)
        /go/pkg/mod/github.com/spf13/[email protected]/command.go:846 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc000128840, 0xb71910, 0x89c08d, 0xa)
        /go/pkg/mod/github.com/spf13/[email protected]/command.go:950 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/[email protected]/command.go:887
main.main()
        /workspace/cmd/pimo/main.go:86 +0x433

[BUG] PIMO panic when using both --repeat flag and pipe mask

Version : pimo v1.4.0

  1. Configure masking.yml with a pipe mask.
  2. Execute pimo with --repeat 2

Expected

First line of input is processed twice

Actual

pimo fails with panic message

panic: interface conversion: model.Entry is []model.Dictionary, not []model.Entry

goroutine 1 [running]:
github.com/cgi-fr/pimo/pkg/pipe.MaskEngine.MaskContext(0x0, 0x0, 0x99ae70, 0xc00022a8c0, 0x0, 0x0, 0xc00002f9e0, 0x2, 0xc000321b30, 0xc0000246c6, ...)
        /workspace/pkg/pipe/pipe.go:72 +0xccb
github.com/cgi-fr/pimo/pkg/model.(MaskContextEngineProcess).ProcessDictionary.func2(0xc000321890, 0xc000321b30, 0xc0000246c6, 0xe, 0x861fc0, 0xc000191b60, 0x3, 0x6, 0xc000382420)
        /workspace/pkg/model/process_maskcontext.go:48 +0xb5
github.com/cgi-fr/pimo/pkg/model.selector.applyContext(0xc0000246c6, 0xe, 0x0, 0x0, 0xc000321890, 0xc000321b30, 0x861fc0, 0xc000191b60, 0xc0003d8540, 0x1, ...)
        /workspace/pkg/model/selector.go:177 +0x9c
github.com/cgi-fr/pimo/pkg/model.selector.applySubContext(0xc0000246c6, 0xe, 0x0, 0x0, 0xc000321890, 0xc000321b30, 0xc0003d8540, 0x1, 0x1, 0xc0000f0f58)
        /workspace/pkg/model/selector.go:170 +0x32f
github.com/cgi-fr/pimo/pkg/model.selector.applySubContext(0xc0000246c0, 0x5, 0x7fd60a5fbf58, 0xc000272280, 0xc000321890, 0xc000321890, 0xc0003d8540, 0x1, 0x1, 0xc00000c6c0)
        /workspace/pkg/model/selector.go:166 +0x298
github.com/cgi-fr/pimo/pkg/model.selector.ApplyContext(...)
        /workspace/pkg/model/selector.go:144
github.com/cgi-fr/pimo/pkg/model.(MaskContextEngineProcess).ProcessDictionary(0xc0002722c0, 0xc000321860, 0x994860, 0xc0002722e0, 0x88c7c0, 0xc0000f1101)
        /workspace/pkg/model/process_maskcontext.go:47 +0x2fe
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00022a940, 0xc0003b58c0)
        /workspace/pkg/model/model.go:385 +0x96
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00022b0c0, 0xc0000f11c8)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00022b840, 0xc0000f1220)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc0002fe600, 0x8f12e0)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc0002fef00, 0x40f3db)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc0003952c0, 0x480b4f)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc000395700, 0xc0007e2d80)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc000395b40, 0x56d852)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc000395f80, 0x1)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc000395fc0, 0x9a1820)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048c240, 0xc0002f0100)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048c500, 0x98)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048c7c0, 0xc000190100)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048c800, 0x21)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048cac0, 0xc000082000)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048cd80, 0xc0000f1558)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048d040, 0x22)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048d080, 0x98)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048d300, 0x57bced)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048d340, 0x56d742)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048d5c0, 0x7fd60a92cfff)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048d600, 0x300)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00048d880, 0xc00)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00008ecc0, 0x40d7fb)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00008f940, 0x0)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018c440, 0xc0000f17b8)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018cbc0, 0xd)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018d000, 0xc000162ebc)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018d440, 0xd)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018d880, 0x0)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018dcc0, 0x4141dc)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018dd00, 0x90e98c)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018dd40, 0xc000162eb0)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018dd80, 0xc000060020)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018ddc0, 0xc00079d770)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018de00, 0x901d9c)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.(ProcessPipeline).Next(0xc00018dec0, 0xc00079d770)
        /workspace/pkg/model/model.go:384 +0x49
github.com/cgi-fr/pimo/pkg/model.SimpleSinkedPipeline.Run(0x99c928, 0xc00018dec0, 0x998d40, 0xc000273100, 0x998d40, 0xc000273100)
        /workspace/pkg/model/model.go:434 +0x89
main.run()
        /workspace/cmd/pimo/main.go:178 +0xba9
main.main.func1(0xc000095b80, 0xc00008e700, 0x0, 0x4)
        /workspace/cmd/pimo/main.go:88 +0x25
github.com/spf13/cobra.(Command).execute(0xc000095b80, 0xc00001e0b0, 0x4, 0x4, 0xc000095b80, 0xc00001e0b0)
        /home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:856 +0x2c2
github.com/spf13/cobra.(Command).ExecuteC(0xc000095b80, 0xc4b035, 0x9058a3, 0x13)
        /home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:960 +0x375
github.com/spf13/cobra.(Command).Execute(...)
        /home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:897
main.main()
        /workspace/cmd/pimo/main.go:103 +0x633

[PROPOSAL] temp property to auto-remove in post-processing

Auto remove

Instead of doing this pattern

version: "1"
masking:
  - selector :
      jsonpath: "temp"
    mask:
      add: "temp_value"
...
<use the field>
...
  - selector:
      jsonpath: "temp"
    mask:
      remove: true

Do this

version: "1"
masking:
  - selector :
      jsonpath: "temp"
    temp: true
    mask:
      add: "temp_value"
...
<use the field>

[PROPOSAL] Templated uri for mask randomInUri

Problem

Conditional random choice is not easy. Every possibility has to be generated first, and a template then switches to the valid choice.

For example, to choose a name by gender:

  - selector:
      jsonpath: "name_F"
    mask:
      add: ""
  - selector:
      jsonpath: "name_F"
    mask:
      randomChoiceInUri: "file://names_F.txt"
  - selector:
      jsonpath: "name_M"
    mask:
      add: ""
  - selector:
      jsonpath: "name_M"
    mask:
      randomChoiceInUri: "file://names_M.txt"
  - selector:
      jsonpath: "name"
    mask:
      template: |-
        {{if eq .gender "F"}}{{.name_F}}{{else}}{{.name_M}}{{end}}
  # Remove temporary fields
  - selector:
      jsonpath: "name_F"
    mask:
      remove: true
  - selector:
      jsonpath: "name_M"
    mask:
      remove: true

This is painful for a choice between two categories and unusable for a choice between a hundred.

Solution

This issue proposes allowing templates in the URI path.

For example :

  - selector:
      jsonpath: "name"
    mask:
      randomChoiceInUri: "file://names_{{.gender}}.txt"
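
Resolving such a URI amounts to rendering the template against the current JSON line before opening the resource. The following Go sketch shows this with the standard `text/template` package. The function name `ResolveURI` is hypothetical, and failing on a missing field (via `missingkey=error`) is a design assumption, not something the proposal specifies.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// ResolveURI renders a templated URI against the current JSON line.
// A field that is absent from the line is reported as an error instead of
// silently producing a wrong file name.
func ResolveURI(uri string, line map[string]interface{}) (string, error) {
	tmpl, err := template.New("uri").Option("missingkey=error").Parse(uri)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, line); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	line := map[string]interface{}{"gender": "F"}
	uri, err := ResolveURI("file://names_{{.gender}}.txt", line)
	if err != nil {
		panic(err)
	}
	fmt.Println(uri) // file://names_F.txt
}
```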

fromJson with integer or float value

Based on the stated purpose of this mask, I tried the following:

  - name: entry float value
    steps:
      - script: rm -rf masking.yml
      - script: |-
          cat > masking.yml <<EOF
          version: "1"
          masking:
            - selector:
                jsonpath: "targetfield"
              mask:
                fromjson: "sourcefield"
          EOF
          echo '{"sourcefield": "{\"property\":\"1.2\"}", "targetfield": ""}' | pimo
        assertions:
          - result.code ShouldEqual 0
          - result.systemout ShouldEqual {"sourcefield":"{\"property\":\"1.2\"}","targetfield":{"property":1.2}}
          - result.systemerr ShouldBeEmpty

As I understand the mask, this should work, but the targetfield property is returned as a string, not a float (the same happens with an integer).

[BUG] Cache should apply on whole masking item

Example YAML

version: "1"
caches:
  mycache:
    unique: true
masking:
  - selector:
      jsonpath: "test"
    masks:
      - add: "1"
      - constant: "1"
    cache: mycache

Expected

$ pimo --empty-input
1

Actual

$ pimo --empty-input
5:21PM ERR Cannot execute pipeline error="Pipeline didn't complete run: Unique value not found" config=masking.yml duration="598.2µs" input-line=1 output-line=1

[PROPOSAL] Play : add crafted examples

Proposal to add embedded examples in the PIMO Play website. Examples will be organized by category : Generation, Anonymization, Pseudonymisation and Technical.

Generation examples

Generate first name, last name and email from an existing referential

TODO (using internal referentials nameFR and surnameFR)

Generate fake name, last name and email

TODO (using Markov mask)

Generate a fake phone number

TODO (using Regex mask)

Generate a valid NIR (french individual identification number)

TODO (using RandomDate for the birth date and Template mask for the key)

Generate a valid SIRET (french business identification number)

TODO (using Luhn mask)

Anonymization examples

We will reuse previous generation examples, but with focus on anonymization specifics.

Remove a value

TODO (using Remove mask)

Replace by a constant

TODO (using Constant mask)

Anonymize a value but preserve null, empty or blank values

TODO

Anonymize a technical ID (like a plate number)

TODO (with Transcode mask)

Pseudonymization examples

Add noise to existing data

TODO (using Random Duration mask for dates, Range mask and Template for other types)

Preserve coherence with a hash

TODO (using Hash, HashInURI or seed parameter)

Preserve coherence and enable reversibility with a cache

TODO (add a comment for reversibility)

Preserve coherence and enable reversibility with encryption

TODO (with FF1 mask)

Technical examples

Multiple masks with a single selector

TODO

Multiple selectors for a single mask

TODO

Preserve parameter

TODO

Seed parameter

TODO

Caches

TODO (unique, reverse, FromCache mask)

Change date formats

TODO (using DateParser mask)

Arrays

TODO (using TemplateEach mask)

Complex structure

TODO (using Pipe mask)

Parse raw JSON

TODO (using FromJson mask)

Temporary fields

TODO (using AddTransient mask)

Generate sequences

TODO (using Increment and FluxURI masks)

[PROPOSAL] add unixEpoch format in date parser

Problem

PIMO can't transform a unixEpoch timestamp (1647512434) into a date format (Thu Mar 17 2022 10:20:34 GMT+0000).

Solution

Add a "unixEpoch" format to the date parser.

  - selector:
      jsonpath: "date"
    mask:
      dateParser:
        inputFormat: "unixEpoch"
        outputFormat: "01/02/06"

This transforms the input

{
  "date": 1647512434
}

to output

{
  "date": "03/17/22"
}

unixEpoch can also be used as the outputFormat argument:

  - selector:
      jsonpath: "date"
    mask:
      dateParser:
        inputFormat: "01/02/06"
        outputFormat:  "unixEpoch"
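
Both directions map directly onto Go's `time` package. The sketch below uses the hypothetical helpers `FromEpoch` and `ToEpoch` and assumes timestamps are in seconds and interpreted in UTC. Note that in Go layouts `01` is the month and `02` is the day, so the layout `01/02/06` yields month/day/year.

```go
package main

import (
	"fmt"
	"time"
)

// FromEpoch formats a Unix timestamp (in seconds) in UTC with a Go layout.
func FromEpoch(seconds int64, layout string) string {
	return time.Unix(seconds, 0).UTC().Format(layout)
}

// ToEpoch parses a date with a Go layout and returns Unix seconds (UTC).
func ToEpoch(value, layout string) (int64, error) {
	t, err := time.Parse(layout, value)
	if err != nil {
		return 0, err
	}
	return t.Unix(), nil
}

func main() {
	fmt.Println(FromEpoch(1647512434, "01/02/06")) // 03/17/22 (month/day/year)
	fmt.Println(FromEpoch(1647512434, "02/01/06")) // 17/03/22 (day/month/year)

	epoch, err := ToEpoch("03/17/22", "01/02/06")
	if err != nil {
		panic(err)
	}
	fmt.Println(epoch) // 1647475200, midnight UTC of that day
}
```

The round trip is lossy when the output layout drops the time of day: converting back gives midnight UTC rather than the original timestamp.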

[PROPOSAL] From JSON mask

A new mask that can convert an existing JSON string.

- selector:
    jsonpath: "targetfield"
  mask:
    fromjson: "sourcefield"

Examples

$ echo '{"sourcefield": "null", "targetfield": ""}' | pimo
{"sourcefield": "null", "targetfield": null}
$ echo '{"sourcefield": "1", "targetfield": ""}' | pimo
{"sourcefield": "1", "targetfield": 1}
$ echo '{"sourcefield": "1.2", "targetfield": ""}' | pimo
{"sourcefield": "1.2", "targetfield": 1.2}
$ echo '{"sourcefield": "{\"property\": \"hello\"}", "targetfield": ""}' | pimo
{"sourcefield": "{\"property\": \"hello\"}", "targetfield": {"property": "hello"}}
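
The conversions in these examples match what Go's `encoding/json` does when decoding into an empty interface. This is a minimal sketch of that step with a hypothetical function name `FromJSON`, not the mask's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FromJSON parses the JSON text stored in a source field and returns the
// decoded value: nil, float64, string, bool, a slice or a map.
func FromJSON(source string) (interface{}, error) {
	var value interface{}
	if err := json.Unmarshal([]byte(source), &value); err != nil {
		return nil, err
	}
	return value, nil
}

func main() {
	for _, source := range []string{"null", "1", "1.2", `{"property": "hello"}`} {
		value, err := FromJSON(source)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-24s -> %T\n", source, value)
	}
}
```

With this decoding, a number is only produced when the source text holds a bare number: the text `"1.2"` (with quotes inside the JSON) decodes to a string.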

[PROPOSAL] Apply template on array

Input data

{"array": ["value1", "value2", "value3"]}

Expected output

{"array": ["Value1", "Value2", "Value3"]}

Problem

Existing masks don't work (pipe, template, ...)

version: "1"
masking:
  - selector:
      jsonpath: "array"
    mask:
      pipe:
        masking:
          - selector:
              jsonpath: "."
            mask:
              template: "{{toUpper .}}"

The result is:

panic: interface conversion: interface {} is string, not model.Dictionary

bug: only first mask is processed with nested arrays

data.jsonl

{"elements":[{"persons": [{"phonenumber": "027","email": "[email protected]"}]}]}

masking.yml

version: "1"
seed: 42
masking:
  - selector:
      jsonpath: "elements.persons.phonenumber"
    mask:
      regex: "0[1-7]( ([0-9]){2}){4}"
  - selector:
      jsonpath: "elements.persons.email"
    mask:
      regex: '[a-z]{10}@company\.com'

Result

{
  "elements": [
    {
      "persons": [
        {
          "email": "[email protected]",
          "phonenumber": "04 87 48 09 96"
        }
      ]
    }
  ]
}

Expected

{
  "elements": [
    {
      "persons": [
        {
          "email": "[email protected]",
          "phonenumber": "04 87 48 09 96"
        }
      ]
    }
  ]
}

[PROPOSAL] Randomize with custom seed

Problem

Sometimes, we need coherence in generated data (X always gives Y).
With a randomization-based mask such as regex, this is difficult: it requires caches and can be costly in time and memory.

Solution

Add a parameter in the masking file that lets the user force the seed used by the RNG in the masks.

Examples

  - selector:
      jsonpath: "phone"
    seed:
      template: "{{.phone}}"
    mask:
      regex: "0[1-7]( ([0-9]){2}){4}"
  - selector:
      jsonpath: "phone"
    seed:
      field: "phone"
    mask:
      regex: "0[1-7]( ([0-9]){2}){4}"

[PROPOSAL] Repeat until

Repeat until a condition is met

$ pimo --repeat-until '{{.value == 0}}'

Or for an infinite stream

$ pimo --repeat-until '{{false}}'

Combined with #17 this command can output all the names in the internal referential nameFR :

$ echo '{"value": ""}' | pimo --repeat-until '{{.value==""}}' --mask "value=[{fluxUri: 'pimo://nameFR'}]"
{"value": "Aaron"}
{"value": "Abel"}
{"value": "Abel-François"}
{"value": "Abélard"}
{"value": "Abelin"}
...

[BUG] Mask Replacement does not work with nested selectors

Problem

input.jsonl

{"fk":{"name1":"Pierre","name2":"Paul"}}

masking.yml

version: "v1"
masking:
  - selector:
      jsonpath: fk.name1
    mask:
      replacement: fk.name2

expected_output.jsonl

{"fk":{"name1":"Paul","name2":"Paul"}}

actual_output.jsonl

{"fk":{"name1":null,"name2":"Paul"}}

Solution

Change this:

type MaskEngine struct {
	Field string
}

Into this:

type MaskEngine struct {
	Field model.Selector
}
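
The observed `null` is consistent with the field name being used as a flat key on the top-level object. The Go sketch below contrasts that with a lookup that follows the dotted path. It uses plain maps and a hypothetical `Lookup` function instead of PIMO's `model.Selector`, whose API is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// Lookup follows a dot-separated path through nested objects.
// It reports false when a segment is missing or is not an object.
func Lookup(entry map[string]interface{}, path string) (interface{}, bool) {
	var current interface{} = entry
	for _, key := range strings.Split(path, ".") {
		object, ok := current.(map[string]interface{})
		if !ok {
			return nil, false
		}
		if current, ok = object[key]; !ok {
			return nil, false
		}
	}
	return current, true
}

func main() {
	entry := map[string]interface{}{
		"fk": map[string]interface{}{"name1": "Pierre", "name2": "Paul"},
	}
	_, flat := entry["fk.name2"] // a plain key lookup finds nothing
	value, nested := Lookup(entry, "fk.name2")
	fmt.Println(flat, value, nested) // false Paul true
}
```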

[PROPOSAL] Pass multiple masking config to the command line

Allow passing multiple YAML configurations, which are applied in the order given on the command line:

$ cat data.jsonl | pimo -c format.yml -c clean.yml -c masking.yml

Equivalent to

$ cat data.jsonl | pimo -c format.yml | pimo -c clean.yml | pimo -c masking.yml

[BUG] use of fromCache after a mask which causes a change in the type of the value

When a field holds a calculated value, the fromCache mask no longer works.

Input file :

{"sexe": 2.25}

Below are examples of masking.yml files where fromCache does not work :

version: "1"
seed: 42
masking:
  - selector:
      jsonpath: "sexe"
    masks:
      - constant: 2
      - fromCache: "cacheSex"

caches:
  cacheSex:
    unique: true
    reverse: true

version: "1"
seed: 42
masking:
  - selector:
      jsonpath: "sexe"
    masks:
      - template : "{{ round ( toString .sexe) 0  }}"
      - fromjson: "sexe"
      - fromCache: "cacheSex"
     
caches:
  cacheSex :
    unique: true
    reverse:  true

Cache file cacheSex.jsonl:

{ "key": "M", "value" : 2}
{ "key": "F", "value" : 1}

pimo --load-cache cacheSex=cacheSex.json -c masking.yml < input.json

Output file :

{"sexe": "M"}

[PROPOSAL] Time Series generator

Time series generator

This issue is a proposal to generate and simulate time series from a set of sensors.

Time series generator configuration

A simple configuration to generate a time series with a period of 5 seconds from 2012-04-23T18:25:00.000Z to 2012-04-23T18:25:15.000Z.

masking:
  - selector:
      jsonpath: 'timestamp'
    mask:
      timeserie:
        period: "5s" # 1s by default
        from: "2012-04-23T18:25:00.000Z" # current time by default
        to: "2012-04-23T18:25:15.000Z" # empty by default, meaning an endless time series

The following command generates a time series dataset:

$ echo '{"timestamp": "" }' |  pimo
{"timestamp": "2012-04-23T18:25:00.000Z" }
{"timestamp": "2012-04-23T18:25:05.000Z" }
{"timestamp": "2012-04-23T18:25:10.000Z" }
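
The expected output suggests that `from` is inclusive and `to` is exclusive. Under that assumption, the generation step can be sketched in Go as follows. The function name `Series` and the layout constant are hypothetical, and the endless and `simulate` modes are left out.

```go
package main

import (
	"fmt"
	"time"
)

// tsLayout matches the timestamps used in the proposal (UTC, milliseconds).
const tsLayout = "2006-01-02T15:04:05.000Z"

// Series returns the timestamps from `from` (inclusive) to `to` (exclusive),
// one per period. The period uses Go duration syntax, for example "5s".
func Series(from, to, period string) ([]string, error) {
	start, err := time.Parse(tsLayout, from)
	if err != nil {
		return nil, err
	}
	end, err := time.Parse(tsLayout, to)
	if err != nil {
		return nil, err
	}
	step, err := time.ParseDuration(period)
	if err != nil {
		return nil, err
	}
	if step <= 0 {
		return nil, fmt.Errorf("period must be positive, got %s", step)
	}
	var out []string
	for t := start; t.Before(end); t = t.Add(step) {
		out = append(out, t.Format(tsLayout))
	}
	return out, nil
}

func main() {
	stamps, err := Series("2012-04-23T18:25:00.000Z", "2012-04-23T18:25:15.000Z", "5s")
	if err != nil {
		panic(err)
	}
	for _, stamp := range stamps {
		fmt.Printf("{\"timestamp\": %q}\n", stamp)
	}
}
```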

Multi-Sensors configuration

The timeserie mask is not streamable. It starts generating data after the input stream is closed. Each input line is a sensor configuration, and the timeserie mask generates data for each sensor.

$ printf '{"id": 1, "timestamp": ""}\n{"id": 2, "timestamp": ""}\n' | pimo
{"id": 1, "timestamp": "2012-04-23T18:25:00.000Z" }
{"id": 2, "timestamp": "2012-04-23T18:25:00.000Z" }
{"id": 1, "timestamp": "2012-04-23T18:25:05.000Z" }
{"id": 2, "timestamp": "2012-04-23T18:25:05.000Z" }
{"id": 1, "timestamp": "2012-04-23T18:25:10.000Z" }
{"id": 2, "timestamp": "2012-04-23T18:25:10.000Z" }

Time Serie simulation

If the simulate option is enabled, the timeserie mask waits for one period between each generated value.

masking:
  - selector:
      jsonpath: 'timestamp'
    mask:
      timeserie:
        period: "5s" # 1s by default
        from: "2012-04-23T18:25:00.000Z" # current time by default
        to: "2012-04-23T18:25:15.000Z" # empty by default, meaning an endless time series
        simulate: true # false by default

[PROPOSAL] export a pimo play sandbox as venom test

This is a proposal to add a link on the PIMO Play page to export the current state as a non-regression test in Venom format.

We could add a drop-down list of buttons in the upper right corner (similar to https://jqplay.org/).

Buttons:

  • Share: copy link
  • Export as Venom Test

The Venom test template is:

name: "test generated  from pimoplay <current url>"
testcases:
- name: declaring cache
  steps:
  - script: rm -f masking.yml
  - script: |-
      cat > masking.yml <<EOF
      <content of the masking cell>
      EOF
  - script: |-
      cat > input.jsonl <<EOF
      <content of the input cell in jsonline format>
      EOF
  - script: |-
      cat > expected.jsonl <<EOF
      <content of the output cell in jsonline format>
      EOF
  - script: |-
      < input.jsonl pimo > result.jsonl
    assertions:
    - result.code ShouldEqual 0
  - script: |-
      diff expected.jsonl result.jsonl
    assertions:
    - result.code ShouldEqual 0
    - result.systemout ShouldBeEmpty

  • The Share button copies the current link to the clipboard.
  • The Venom Test button downloads a pimo-test.yaml file as a base64 data URL.

[PROPOSAL] Markov sample separator

Problem

Markov Mask can be used on different samples:

  • lists of words
  • paragraphs

For lists of words, we would want to read the file line by line (examples: nameFR, pokemons, etc.).
For entire paragraphs, the text can be spread over multiple lines.

Proposal

In addition to the separator parameter, which determines how the text is split (word by word, character by character, etc.), we would want a parameter that helps the mask understand the structure of the text:

  • is it a list?
  • is it paragraphs?
  • is it something else?

In any case, the markov mask should have a sensible default configuration so that it remains usable.

Originally posted by @baguettte in #81 (comment)

[PROPOSAL] multiple jsonpath in selector for same mask(s)

Proposal

Allow a list of jsonpaths in selectorType, so that the same mask can be applied to multiple fields:

masking.yml

version: "1"
masking:
  - selector:
      jsonpaths:
        - name1
        - name2
        - name3
    mask:
      randomChoiceInUri: "pimo://nameFR"

Equivalent to:

version: "1"
masking:
  - selector:
      jsonpath: name1
    mask:
      randomChoiceInUri: "pimo://nameFR"
  - selector:
      jsonpath: name2
    mask:
      randomChoiceInUri: "pimo://nameFR"
  - selector:
      jsonpath: name3
    mask:
      randomChoiceInUri: "pimo://nameFR"
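The equivalence above amounts to a simple expansion step at configuration load time. The following Python sketch shows one way it could be done on the parsed `masking` list. It is not how PIMO is implemented: the `jsonpaths` key comes from this proposal, and the function name `expand_jsonpaths` is hypothetical.

```python
import copy


def expand_jsonpaths(masking):
    """Expand entries using the proposed `jsonpaths` list into one entry per path.

    Entries that already use a single `jsonpath` are passed through unchanged.
    """
    expanded = []
    for entry in masking:
        selector = entry.get("selector", {})
        paths = selector.get("jsonpaths")
        if paths is None:
            expanded.append(entry)
            continue
        for path in paths:
            # Deep copy so each expanded entry owns its mask definition.
            clone = copy.deepcopy(entry)
            new_selector = {k: v for k, v in selector.items() if k != "jsonpaths"}
            new_selector["jsonpath"] = path
            clone["selector"] = new_selector
            expanded.append(clone)
    return expanded
```

Expanding before the masks are built would keep the rest of the pipeline unaware of the new syntax, since it only ever sees single-`jsonpath` selectors.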
