
Comments (4)

slilichenko avatar slilichenko commented on June 3, 2024

Thank you for confirming, Mehran! I updated the README to include the dataset.

from dlp-dataflow-deidentification.

slilichenko avatar slilichenko commented on June 3, 2024

Hi Mehran,

We fixed the problem with auto-sharding in batch pipelines. Could you please give it a try?

Sergei


mehran702 avatar mehran702 commented on June 3, 2024

Hi Sergei,

I did a new build with your latest fixes and was able to run the REID pipeline in batch mode successfully. The REID pipeline did indeed re-identify the data!
Many thanks for your help!

The final Gradle command I ran in batch mode looks like this:
(To run in batch mode, I just removed the two parameters --streaming and --enableStreamingEngine from the command. I'm not sure this is the right way, but it worked :-)

gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs=" \
--region=<region> \
--project=<project_id> \
--tempLocation=gs://<bucket>/temp \
--numWorkers=1 \
--maxNumWorkers=2 \
--runner=DataflowRunner \
--tableRef=<project_id>:<dataset>.<table> \
--dataset=<dataset> \
--topic=projects/<project_id>/topics/<name> \
--autoscalingAlgorithm=THROUGHPUT_BASED \
--workerMachineType=n1-standard-1 \
--deidentifyTemplateName=projects/<project_id>/locations/<location>/deidentifyTemplates/<name> \
--DLPMethod=REID \
--keyRange=1024 \
--queryPath=gs://<bucket>/<query.sql> \
--DLPParent=projects/<project_id>/locations/<location>"
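
The batch-mode change described above amounts to dropping two flags from the argument list. A minimal sketch of that filtering in shell follows. The helper name strip_streaming_flags is hypothetical and not part of the repository, and the sketch assumes each flag is passed as its own argument. It only illustrates what was removed; it does not confirm that this is the project's recommended way to select batch mode.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of dlp-dataflow-deidentification):
# prints the given pipeline arguments one per line, omitting the two
# streaming-only flags that were removed to run in batch mode.
strip_streaming_flags() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      # Drop both the bare form and any --flag=value form.
      --streaming|--streaming=*|--enableStreamingEngine|--enableStreamingEngine=*) ;;
      *) printf '%s\n' "$arg" ;;
    esac
  done
}

# Example with a few of the arguments from the command above.
batch_args="$(strip_streaming_flags \
  --runner=DataflowRunner \
  --streaming \
  --enableStreamingEngine \
  --DLPMethod=REID \
  --keyRange=1024)"
echo "$batch_args"
```

The remaining arguments can then be joined with spaces and passed inside -Pargs="..." as in the command above.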

Maybe you should update the Gradle command in the example on the front page of the repo to add the option --dataset=<dataset>. That would avoid the Exception in thread "main" java.lang.NullPointerException: Null datasetId error mentioned in my first comment.
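
Until the example is updated, a wrapper script could check for the option before launching the job. The sketch below assumes, based only on the error reported in this thread, that the Null datasetId exception comes from a missing --dataset option. The function name require_dataset_flag is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the repository):
# returns 0 if a non-empty --dataset=<value> argument is present,
# otherwise prints a message to stderr and returns 1.
require_dataset_flag() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --dataset=?*) return 0 ;;  # at least one character after '='
    esac
  done
  echo "missing --dataset=<dataset>; the pipeline reportedly fails with 'Null datasetId' without it" >&2
  return 1
}

# Example: this argument list passes the check.
if require_dataset_flag --runner=DataflowRunner --dataset=my_dataset; then
  echo "ok to launch"
fi
```

The value my_dataset is a placeholder, as are the angle-bracket values in the command above.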

/Mehran


sasirekhamsvl avatar sasirekhamsvl commented on June 3, 2024

Hi,
I used the same de-identification template for re-identification and ran the above command with similar parameters. I found that the output on Pub/Sub is not re-identified data; the same de-identified data is being written to Pub/Sub. Am I making a mistake with the parameters?
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs="
--region=asia-south1
--project=
--gcpTempLocation=gs:///temp
--tempLocation=gs:///temp
--numWorkers=4
--maxNumWorkers=10
--runner=DataflowRunner
--tableRef=:workerDetails.WorkerDetails
--dataset=workerDetails
--topic=projects//topics/dlp-reid-final
--autoscalingAlgorithm=THROUGHPUT_BASED
--workerMachineType=n1-standard-1
--deidentifyTemplateName=projects//locations/asia-south1/deidentifyTemplates/dlp-for-worker-details
--DLPMethod=REID
--keyRange=1024
--queryPath=gs:///reid_query.sql
--DLPParent=projects//locations/asia-south1"
