Comments (4)
Thank you for confirming, Mehran! I updated the README to include the dataset.
from dlp-dataflow-deidentification.
Hi Mehran,
We fixed the problem with auto-sharding in batch pipelines. Could you please give it a try?
Sergei
Hi Sergei,
I did a new build with your latest fixes and was able to run the REID pipeline in batch mode successfully. The REID pipeline did indeed re-identify the data!
Many thanks for your help!
The final Gradle command I ran in batch mode looks like this. (To run in batch mode I just removed two parameters from the command, --streaming and --enableStreamingEngine. I'm not sure if this is the right way, but it worked :-)
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs=" \
--region=<region> \
--project=<project_id> \
--tempLocation=gs://<bucket>/temp \
--numWorkers=1 \
--maxNumWorkers=2 \
--runner=DataflowRunner \
--tableRef=<project_id>:<dataset>.<table> \
--dataset=<dataset> \
--topic=projects/<project_id>/topics/<name> \
--autoscalingAlgorithm=THROUGHPUT_BASED \
--workerMachineType=n1-standard-1 \
--deidentifyTemplateName=projects/<project_id>/locations/<location>/deidentifyTemplates/<name> \
--DLPMethod=REID \
--keyRange=1024 \
--queryPath=gs://<bucket>/<query.sql> \
--DLPParent=projects/<project_id>/locations/<location>"
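The streaming-to-batch change described above amounts to filtering two flags out of the argument list. Here is a minimal POSIX shell sketch of that step, assuming the arguments are kept in a space-separated variable. The variable names `STREAMING_ARGS` and `BATCH_ARGS` are hypothetical, the argument list is shortened to placeholders, and handling the `--flag=value` spelling is a precaution rather than something the thread confirms is needed:

```shell
#!/bin/sh
# Sketch only: derive batch-mode arguments from a streaming argument list
# by dropping --streaming and --enableStreamingEngine.
# STREAMING_ARGS is a shortened, hypothetical list with placeholder values.
STREAMING_ARGS="--runner=DataflowRunner --streaming --enableStreamingEngine --DLPMethod=REID --dataset=<dataset>"

BATCH_ARGS=""
# Unquoted on purpose: split the string into one word per argument.
for arg in $STREAMING_ARGS; do
  case "$arg" in
    --streaming|--streaming=*|--enableStreamingEngine|--enableStreamingEngine=*)
      # Streaming-only flag: leave it out of the batch command.
      ;;
    *)
      # Append the argument, adding a separating space after the first one.
      BATCH_ARGS="${BATCH_ARGS:+$BATCH_ARGS }$arg"
      ;;
  esac
done

# This string is what would go inside -Pargs="...".
echo "$BATCH_ARGS"
```

The resulting string could then be passed as `-Pargs="$BATCH_ARGS"` to the same `gradle run` invocation. This simple splitting only works while no argument value contains spaces.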
Maybe you should update the Gradle command in the example on the front page of the repo to add the option --dataset=<dataset>, to avoid the Exception in thread "main" java.lang.NullPointerException: Null datasetId error, as mentioned in my first comment.
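Until the example is updated, a small pre-flight check in a wrapper script could catch the missing option before the job is launched. The sketch below is an assumption about how such a check might look. The helper name `has_dataset` and the `ARGS` variable are hypothetical and not part of the repo:

```shell
#!/bin/sh
# Sketch only: warn early when --dataset is missing or empty, instead of
# waiting for the "Null datasetId" NullPointerException reported above.
# ARGS is a shortened, hypothetical argument list with placeholder values.
ARGS="--project=<project_id> --tableRef=<project_id>:<dataset>.<table> --DLPMethod=REID"

# Returns 0 if the given argument string contains --dataset=<non-empty value>.
has_dataset() {
  for arg in $1; do
    case "$arg" in
      --dataset=?*) return 0 ;;
    esac
  done
  return 1
}

if has_dataset "$ARGS"; then
  echo "ok: --dataset is set"
else
  echo "error: --dataset=<dataset> is missing from the arguments" >&2
fi
```

With the placeholder `ARGS` above, the check takes the error branch because the list has no `--dataset` option.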
/Mehran
Hi,
I used the same de-identification template for re-identification and ran the above command with similar parameters. However, the output on Pub/Sub is not re-identified data: the same de-identified data is being written to the topic. Am I making a mistake with the parameters?
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs="
--region=asia-south1
--project=
--gcpTempLocation=gs:///temp
--tempLocation=gs:///temp
--numWorkers=4
--maxNumWorkers=10
--runner=DataflowRunner
--tableRef=:workerDetails.WorkerDetails
--dataset=workerDetails
--topic=projects//topics/dlp-reid-final
--autoscalingAlgorithm=THROUGHPUT_BASED
--workerMachineType=n1-standard-1
--deidentifyTemplateName=projects//locations/asia-south1/deidentifyTemplates/dlp-for-worker-details
--DLPMethod=REID
--keyRange=1024
--queryPath=gs:///reid_query.sql
--DLPParent=projects//locations/asia-south1"