
dataflow-opinion-analysis's Introduction

Sample: Opinion Analysis of News, Threaded Conversations, and User Generated Content

This sample uses Cloud Dataflow to build an opinion analysis processing pipeline for news, threaded conversations in forums such as Hacker News, Reddit, or Twitter, and other user-generated content such as email.

Opinion analysis can be used for lead generation, user research, or automated testimonial harvesting.

About the sample

This sample contains three types of artifacts:

  • Cloud Dataflow pipelines for ingesting and indexing textual data from sources such as relational databases, files, BigQuery datasets, and Pub/Sub topics
  • BigQuery dataset (with schema definitions and some metadata) to receive the results of the Dataflow Opinion Analysis pipelines, as well as additional transformations (via Materialized Views) to calculate trends
  • Jupyter Notebooks for creating TensorFlow models that use Sirocco-based textual embeddings as features in prediction models

Major Changes in current and past Releases

Version 0.7

  • In this version we began updating the pipelines to more recent versions of the Apache Beam SDK. Version 0.6 relied on Beam 2.2.0; version 0.7 bumps the Beam SDK to a more recent release.
  • We moved away from orchestrating pipelines with an App Engine-based solution. Pipeline orchestration is best done with Airflow or Cloud Composer.
  • We also stopped calculating trends in BigQuery by running Dataflow pipelines with embedded SQL. BigQuery Materialized Views and BigQuery Scheduled Queries are the more modern solution to this task.

How to run the sample

The steps for configuring and running this sample are as follows:

  • Set up your Google Cloud Platform project and permissions.
  • Install the tools necessary for compiling and deploying the code in this sample.
  • Create and set up a Cloud Storage bucket and Cloud Pub/Sub topics.
  • Create or verify a configuration for your project.
  • Clone the sample code.
  • Create the BigQuery dataset.
  • Deploy the Dataflow pipelines.
  • Clean up.

Prerequisites

Set up your Google Cloud Platform project and permissions

  • Select or create a Google Cloud Platform project. In the Google Cloud Console, select Create Project.

  • Enable billing for your project, if you haven't done so during project creation.

  • Enable the Dataflow, Compute Engine, Cloud Storage, BigQuery, Pub/Sub, and other APIs necessary to run the example (see the gcloud sketch after this list).
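
If you prefer the command line, these services can also be enabled with gcloud. This is a sketch; the exact set of APIs your project needs may vary:

gcloud services enable \
  dataflow.googleapis.com \
  compute.googleapis.com \
  storage.googleapis.com \
  bigquery.googleapis.com \
  pubsub.googleapis.com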

Install the tools necessary for compiling and deploying the code in this sample (specifically git, Java, and Maven), if they are not already on your system:

  • Install git. If you have Homebrew, the command is
brew install git
  • Download and install the Java Development Kit (JDK) version 1.8 or later. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.

  • Download and install Apache Maven. With Homebrew, the command is:

brew install maven

Install the Google Cloud SDK
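
To confirm the toolchain is in place before continuing, a quick sanity check:

git --version
java -version
echo $JAVA_HOME
mvn -v
gcloud version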

Create and set up a Cloud Storage bucket and Cloud Pub/Sub topics

  • Create a Cloud Storage bucket for your project. This bucket will be used for staging your code, as well as for temporary input/output files. For consistency with this sample, select the Multi-Regional storage class and the United States location.

  • Create the following folders in this bucket: staging, input, output, temp.

  • (Optional) Create the following Pub/Sub topic: documents. This topic can be used together with a streaming Dataflow pipeline. You can send textual documents to that topic, and the Dataflow Indexing pipeline will process these documents as they arrive.
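
If you prefer the command line to the Cloud Console for these steps, the bucket, folders, and topic can be created roughly as follows. This is a sketch: your-bucket-name is a placeholder, and because GCS "folders" are just object-name prefixes, zero-byte placeholder objects are used to make them visible in the Cloud Storage browser.

gsutil mb -c multi_regional -l us gs://your-bucket-name
for d in staging input output temp; do
  gsutil cp /dev/null "gs://your-bucket-name/$d/.keep"
done
gcloud pubsub topics create documents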

(Optional) Create or verify a configuration for your project

By now you have likely already created a configuration, e.g. when you initialized the Google Cloud SDK. This is another chance to change your mind and create a new configuration.

  • Authenticate with the Cloud Platform. Run the following command to get Application Default Credentials.

    gcloud auth application-default login

  • Create a new configuration for your project if it does not exist already

    gcloud init

Verify your configuration

  • Verify that the active configuration is the one you want to use

    gcloud config configurations list

Important: This tutorial uses several billable components of Google Cloud Platform. New Cloud Platform users may be eligible for a free trial.

Clone the sample code

Go to the directory where you typically store your git repos.

To clone the GitHub repository to your computer, run the following command:

git clone https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis

Go to the dataflow-opinion-analysis directory. The exact path depends on where you placed the directory when you cloned the sample files from GitHub.

cd dataflow-opinion-analysis

Activate gcloud configuration and set environment variables

Do this step every time you open a new shell, before creating the BigQuery dataset and before running your demo Dataflow jobs.

  • Check what configurations are currently available on your machine

gcloud config configurations list

  • Activate the gcloud configuration for the project where your BigQuery dataset and your Dataflow jobs are or should be located
gcloud config configurations activate <config-name>
  • [One Time Task] Go to the dataflow-opinion-analysis/scripts directory and make a copy of the set_env_vars_template.sh file
cd scripts
cp set_env_vars_template.sh set_env_vars_local.sh
chmod +x *.sh
  • [One Time Task] Edit the set_env_vars_local.sh file in your favorite text editor, e.g. nano. Specifically, set the values of the variables used for parameterizing your Dataflow pipeline. Set DATASET_ID to the name of the BigQuery dataset that will hold your analysis results (this dataset does not have to exist yet; we will create it in later steps). A good DATASET_ID is "opinions". Set GCS_BUCKET to the name of the GCS bucket that you created previously. Note that the UNSUPPORTED_SDK_OVERRIDE_TOKEN variable should only be set once you have a real token to replace it with (see below for more info). A sketch of the resulting file appears after this list.

  • Set environment variables for the rest of your shell session

Don't miss the dot at the beginning of this command!

. ./set_env_vars_local.sh
  • Return to the root directory of the repo
cd ..
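
For reference, a minimal set_env_vars_local.sh might look like the sketch below. The variable names come from this README; the values are placeholders to replace with your own.

export DATASET_ID=opinions
export GCS_BUCKET=your-bucket-name
# Leave this commented out until Dataflow returns a real token
# (see the Run demo jobs section below):
# export UNSUPPORTED_SDK_OVERRIDE_TOKEN=<token>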

Create the BigQuery dataset

  • Go to the bigquery directory where the build scripts and schema files for BigQuery tables and views are located

    cd bigquery

  • Make sure that the test scripts are executable

    chmod +x *.sh

  • Run the build_dataset.sh script to create the dataset, tables, and views. The script will use the PROJECT_ID variable from your active gcloud configuration, and create a new dataset in BigQuery named 'opinions'. In this dataset it will create several tables and views necessary for this sample.

    ./build_dataset.sh

  • [optional] Later on, if you make changes to the table schema or views, you can update the definitions of these objects by running update commands:

    ./build_tables.sh update

    ./build_views.sh update

Table schema definitions are located in the *Schema.json files in the bigquery directory. View definitions are located in the shell script build_views.sh.
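
For orientation, creating the dataset and a table by hand would amount to something like the following bq commands. This is a sketch of what the build scripts presumably do, not their literal contents; the schema file name is illustrative.

bq mk --dataset "${PROJECT_ID}:${DATASET_ID}"
bq mk --table "${PROJECT_ID}:${DATASET_ID}.document" documentSchema.json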

Prepare your machine for Dataflow job submissions

Download and install Sirocco, a framework maintained by @datancoffee.

  • Download the latest Sirocco Java framework jar file.

  • Download the latest Sirocco model jar file.

  • Go to the directory where the downloaded sirocco-sa-x.y.z.jar and sirocco-mo-x.y.z.jar files are located.

  • Install the Sirocco framework in your local Maven repository. Replace x.y.z with the downloaded version.

mvn install:install-file \
  -DgroupId=sirocco.sirocco-sa \
  -DartifactId=sirocco-sa \
  -Dpackaging=jar \
  -Dversion=x.y.z \
  -Dfile=sirocco-sa-x.y.z.jar \
  -DgeneratePom=true
  • Install the Sirocco model file in your local Maven repository. Replace x.y.z with the downloaded version.
mvn install:install-file \
  -DgroupId=sirocco.sirocco-mo \
  -DartifactId=sirocco-mo \
  -Dpackaging=jar \
  -Dversion=x.y.z \
  -Dfile=sirocco-mo-x.y.z.jar \
  -DgeneratePom=true
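
You can verify that both artifacts landed in your local Maven repository; the path layout follows the groupIds and artifactIds above:

ls ~/.m2/repository/sirocco/sirocco-sa/sirocco-sa/
ls ~/.m2/repository/sirocco/sirocco-mo/sirocco-mo/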

Run demo jobs

You can use the included news articles (from Google's blogs) and movie reviews in the src/test/resources/testdatasets directory to run demo jobs. News articles are in the TXT bag-of-properties format and movie reviews are in CSV format. More information about the formats and the meaning of the parameters is available in the Sirocco repo.

  • Upload the files in the src/test/resources/testdatasets directory into the GCS input bucket. Use the Cloud Storage browser to find the input directory you created in Prerequisites. Then, upload all files from your local src/test/resources/testdatasets directory.
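
Alternatively, the upload can be done from the shell, assuming the input folder from Prerequisites and the environment variables set earlier:

gsutil -m cp -r src/test/resources/testdatasets/* "gs://$GCS_BUCKET/input/"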

We will run a demo job that processes movie reviews in CSV format.

  • Go back to the dataflow-opinion-analysis directory

cd dataflow-opinion-analysis

  • Build the executable jar. This command should create a bundled jar in the target directory, e.g. ./target/examples-opinionanalysis-bundled-x.y.z.jar
mvn clean package
  • Run a command to deploy the indexer Dataflow pipeline to Cloud Dataflow.
scripts/run_indexer_gcs_csv_to_bigquery.sh FULLINDEX SHALLOW SHORTTEXT 1 2 "gs://$GCS_BUCKET/input/kaggle-rotten-tomato/*.csv"
  • (First Time Only) The first time you run the job, you will get an error from Dataflow:

The workflow was automatically rejected by the service because it uses an unsupported SDK Google Cloud Dataflow SDK for Java 2.2.0. Please upgrade to the latest SDK version. To override the SDK version check temporarily, please provide an override token using the experiment flag '--experiments=unsupported_sdk_temporary_override_token=<token>'. Note that this token expires on <date>.

This is because we are still working on upgrading our Beam dependencies to newer versions of Beam. To fix this error, modify your scripts/set_env_vars_local.sh script to set UNSUPPORTED_SDK_OVERRIDE_TOKEN to the token that was returned.
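
In practice this means adding a line like the following to scripts/set_env_vars_local.sh, with the token value copied from the error message (the run scripts presumably pass it to Dataflow via the --experiments flag quoted above):

export UNSUPPORTED_SDK_OVERRIDE_TOKEN=<token-from-error-message>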

Set the shell variables again.

. scripts/set_env_vars_local.sh

Resubmit the job.
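
That is, re-run the same submission command as before:

scripts/run_indexer_gcs_csv_to_bigquery.sh FULLINDEX SHALLOW SHORTTEXT 1 2 "gs://$GCS_BUCKET/input/kaggle-rotten-tomato/*.csv"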

  • In the Dataflow Console observe how a new input job is created.

  • Once the Dataflow job successfully finishes, you can review the data it wrote into your target BigQuery dataset. Use the BigQuery console to review the dataset.

  • Enter the following query to list new documents that were indexed by the Dataflow job. The sample query uses the Standard SQL dialect of BigQuery.

#standardSQL
SELECT d.CollectionItemId, s.* 
FROM opinions.sentiment s
    INNER JOIN opinions.document d ON d.DocumentHash = s.DocumentHash
WHERE SentimentTotalScore > 0
ORDER BY ProcessingDateId DESC, SentimentTotalScore DESC
LIMIT 1000;

Issues Under Investigation

  • The IndexerPipeline Dataflow job does not truncate existing content in BigQuery tables, even if --writeTruncate=true is specified. This is because the BigQuery tables are defined as partitioned tables. The workaround for truncating the content between job runs is to run the following script:
DELETE FROM opinions.document WHERE 1=1;
DELETE FROM opinions.sentiment WHERE 1=1;
DELETE FROM opinions.webresource WHERE 1=1;
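These statements can be run from a shell with the bq CLI, for example:
bq query --use_legacy_sql=false 'DELETE FROM opinions.document WHERE 1=1'
bq query --use_legacy_sql=false 'DELETE FROM opinions.sentiment WHERE 1=1'
bq query --use_legacy_sql=false 'DELETE FROM opinions.webresource WHERE 1=1'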
  • Building the project on Apple M1 hardware results in an error: Caused by: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Mac and os.arch=aarch64

This is because we are using an older version of the Beam SDK, which in turn uses an older version of snappy-java. Snappy-java version 1.1.8.2 is supposed to work on Apple M1 chips, and we will fix the problem when we upgrade to newer versions of Beam. For the time being, build the project and submit jobs on pre-M1 Mac hardware.

  • The IndexerPipeline Dataflow job is marked as 'Failed' although data gets successfully imported into BigQuery. This is caused by temporary BigQuery import files in the GCS temp folder that are sometimes not cleaned up. The IndexerPipeline stages that write to BigQuery are marked as 'Failed' as well. Since data is successfully imported into BigQuery, this issue can be ignored for the time being, until we upgrade our Beam dependencies.

If you are seeing pipeline failures, check whether you are getting the following errors in the pipeline logs:

java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.io.IOException: Error executing batch GCS request
...
Caused by: java.util.concurrent.ExecutionException: com.google.api.client.http.HttpResponseException: 404 Not Found

<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <title>Error 404 (Not Found)!!1</title>
  <p><b>404.</b> <ins>That’s an error.</ins>
  <p><ins>That’s all we know.</ins>
(inline CSS and logo markup of the Google 404 page omitted)

Clean up

Now that you have tested the sample, delete the cloud resources you created to prevent further billing for them on your account.

License

Copyright 2021 Google Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


dataflow-opinion-analysis's Issues

Controller not listening to Pub/Sub commands

This is a great and well documented example - much appreciated!

I am having trouble, however, with the Run a verification job step. When I publish the command=start_gcs_import message to the indexercommands topic, I do not get a new input job created. I tried out the cron jobs to see if those were working, and I can see a new job created for startstatscalc only - not the others. I feel like I must have messed up some config step along the way.

Do you have tips for how to debug what is happening? I am not seeing any sort of logging to help me debug, but that's where I'd think to look first. It feels like the controller must not be listening for the topics correctly though...

Thanks!

Error when running verification job with "command=start_gcs_import"

Hi @datancoffee ,

First, I would like to thank you for this great example.
I'm trying to install the project on the Google Cloud Platform but I'm facing some issues.

Version : 0.6.4

I followed the README and everything is OK up to the section Run a verification job.

When I publish the message command=start_gcs_import in my Pub/Sub topic, I see that the message is processed by the control pipeline, which tries to launch the indexer pipeline, but the launch fails with this error:

exception: "java.lang.NoSuchMethodError: org.apache.beam.sdk.common.runner.v1.RunnerApi$FunctionSpec$Builder.setPayload(Lcom/google/protobuf/ByteString;)Lorg/apache/beam/sdk/common/runner/v1/RunnerApi$FunctionSpec$Builder;
  at org.apache.beam.runners.dataflow.repackaged.org.apache.beam.runners.core.construction.WindowingStrategyTranslation.toProto(WindowingStrategyTranslation.java:224)
  at org.apache.beam.runners.dataflow.repackaged.org.apache.beam.runners.core.construction.WindowingStrategyTranslation.toProto(WindowingStrategyTranslation.java:299)
  at org.apache.beam.runners.dataflow.repackaged.org.apache.beam.runners.core.construction.WindowingStrategyTranslation.toProto(WindowingStrategyTranslation.java:285)
  at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.serializeWindowingStrategy(DataflowPipelineTranslator.java:129)
  at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.access$1500(DataflowPipelineTranslator.java:114)
  at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$5.groupByKeyHelper(DataflowPipelineTranslator.java:806)
  at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$5.translate(DataflowPipelineTranslator.java:784)
  at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$5.translate(DataflowPipelineTranslator.java:781)
  at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.visitPrimitiveTransform(DataflowPipelineTranslator.java:442)
  at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:663)
  at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
  at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
  at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
  at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
  at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
  at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
  at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
  at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:311)
  at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:245)
  at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:446)
  at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.translate(DataflowPipelineTranslator.java:386)
  at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translate(DataflowPipelineTranslator.java:173)
  at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:537)
  at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:170)
  at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
  at org.apache.beam.sdk.Pipeline.run(Pipeline.java:289)
  at com.google.cloud.dataflow.examples.opinionanalysis.ControlPipeline$ProcessCommand.startDocumentImportPipeline(ControlPipeline.java:270)
  at com.google.cloud.dataflow.examples.opinionanalysis.ControlPipeline$ProcessCommand.processElement(ControlPipeline.java:172)

Consequently, I'm not seeing any output in my BigQuery tables.
Any help on this?

PS: Launching only the indexer pipeline as specified in the version 0.6.4 Release Note is OK.

Thanks.

Getting Invalid GCS URI: gs:///temp/ while running the demo

Machine Configuration
OS: Linux Mint 21.2 x86_64
Kernel: 5.15.0-84-generic
Terminal: gnome-terminal
CPU: Intel i5-10210U (8) @ 4.200GHz
GPU: Intel CometLake-U GT2 [UHD Graphics]
Memory: 7072MiB / 15811MiB

The logs:

diptopal@diptopal-HP-348-G7-2Q1B6PA:/Projects/GCP_Competency_dev/dataflow-opinion-analysis$ scripts/run_indexer_gcs_csv_to_bigquery.sh FULLINDEX SHALLOW SHORTTEXT 1 2 "gs://$GCS_BUCKET/input/kaggle-rotten-tomato/*.csv"
[INFO] Scanning for projects...
(Maven plugin and dependency download messages omitted)
[INFO]
[INFO] ---------< com.google.cloud.dataflow:examples-opinionanalysis >---------
[INFO] Building examples-opinionanalysis 0.7.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce) @ examples-opinionanalysis ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ examples-opinionanalysis ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 2 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ examples-opinionanalysis ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ examples-opinionanalysis ---
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/maven/maven-toolchain/1.0/maven-toolchain-1.0.pom
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/maven/maven-toolchain/1.0/maven-toolchain-1.0.pom (3.4 kB at 71 kB/s)
Downloading from central: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-utils/3.0.20/plexus-utils-3.0.20.pom
Downloaded from central: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-utils/3.0.20/plexus-utils-3.0.20.pom (3.8 kB at 87 kB/s)
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/commons/commons-exec/1.3/commons-exec-1.3.pom
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/commons/commons-exec/1.3/commons-exec-1.3.pom (11 kB at 250 kB/s)
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/commons/commons-parent/35/commons-parent-35.pom
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/commons/commons-parent/35/commons-parent-35.pom (58 kB at 791 kB/s)
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/maven/maven-toolchain/1.0/maven-toolchain-1.0.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-utils/3.0.20/plexus-utils-3.0.20.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/maven/maven-toolchain/1.0/maven-toolchain-1.0.jar (33 kB at 522 kB/s)
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar (54 kB at 293 kB/s)
Downloaded from central: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-utils/3.0.20/plexus-utils-3.0.20.jar (243 kB at 881 kB/s)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/diptopal/.m2/repository/org/slf4j/slf4j-jdk14/1.7.25/slf4j-jdk14-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/diptopal/.m2/repository/sirocco/sirocco-sa/sirocco-sa/1.0.10/sirocco-sa-1.0.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
[WARNING]
java.lang.reflect.InvocationTargetException
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:293)
at java.lang.Thread.run (Thread.java:829)
Caused by: java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:233)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:162)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:52)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:142)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.createIndexerPipeline (IndexerPipeline.java:130)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.main (IndexerPipeline.java:114)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:293)
at java.lang.Thread.run (Thread.java:829)
Caused by: java.lang.reflect.InvocationTargetException
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:222)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:162)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:52)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:142)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.createIndexerPipeline (IndexerPipeline.java:130)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.main (IndexerPipeline.java:114)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:293)
at java.lang.Thread.run (Thread.java:829)
Caused by: java.lang.IllegalArgumentException: DataflowRunner requires gcpTempLocation, but failed to retrieve a value from PipelineOptions
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions (DataflowRunner.java:225)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:222)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:162)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:52)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:142)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.createIndexerPipeline (IndexerPipeline.java:130)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.main (IndexerPipeline.java:114)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:293)
at java.lang.Thread.run (Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Error constructing default value for gcpTempLocation: tempLocation is not a valid GCS path, gs:///temp/.
at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create (GcpOptions.java:247)
at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create (GcpOptions.java:228)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:156)
at com.sun.proxy.$Proxy39.getGcpTempLocation (Unknown Source)
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions (DataflowRunner.java:223)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:222)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:162)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:52)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:142)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.createIndexerPipeline (IndexerPipeline.java:130)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.main (IndexerPipeline.java:114)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:293)
at java.lang.Thread.run (Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Expected a valid 'gs://' path but was given 'gs:///temp/'
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.getGcsPath (GcsPathValidator.java:101)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.verifyPath (GcsPathValidator.java:75)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.validateOutputFilePrefixSupported (GcsPathValidator.java:60)
at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create (GcpOptions.java:245)
at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create (GcpOptions.java:228)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:156)
at com.sun.proxy.$Proxy39.getGcpTempLocation (Unknown Source)
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions (DataflowRunner.java:223)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:222)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:162)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:52)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:142)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.createIndexerPipeline (IndexerPipeline.java:130)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.main (IndexerPipeline.java:114)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:293)
at java.lang.Thread.run (Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Invalid GCS URI: gs:///temp/
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.base.Preconditions.checkArgument (Preconditions.java:191)
at org.apache.beam.sdk.util.gcsfs.GcsPath.fromUri (GcsPath.java:116)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.getGcsPath (GcsPathValidator.java:99)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.verifyPath (GcsPathValidator.java:75)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.validateOutputFilePrefixSupported (GcsPathValidator.java:60)
at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create (GcpOptions.java:245)
at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create (GcpOptions.java:228)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:156)
at com.sun.proxy.$Proxy39.getGcpTempLocation (Unknown Source)
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions (DataflowRunner.java:223)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:222)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:162)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:52)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:142)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.createIndexerPipeline (IndexerPipeline.java:130)
at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.main (IndexerPipeline.java:114)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:293)
at java.lang.Thread.run (Thread.java:829)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5.425 s
[INFO] Finished at: 2023-10-04T17:40:27+05:30
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project examples-opinionanalysis: An exception occured while executing the Java class. null: InvocationTargetException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions): DataflowRunner requires gcpTempLocation, but failed to retrieve a value from PipelineOptions: Error constructing default value for gcpTempLocation: tempLocation is not a valid GCS path, gs:///temp/. Expected a valid 'gs://' path but was given 'gs:///temp/': Invalid GCS URI: gs:///temp/ -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
diptopal@diptopal-HP-348-G7-2Q1B6PA:/Projects/GCP_Competency_dev/dataflow-opinion-analysis$
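
Note the malformed path in the error: gs:///temp/ has an empty bucket name, which usually means GCS_BUCKET was empty in the shell when the job was submitted. A quick check before resubmitting (a suggested diagnosis under that assumption, not a confirmed fix):

echo "GCS_BUCKET=$GCS_BUCKET"
# if this prints an empty value, re-source the env script:
. scripts/set_env_vars_local.sh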

ImportErrors on RedditEngagement.ipynb notebook

My apologies in advance for my lack of knowledge in a wide array of Python modules.

I have been trying to run the entire notebook in both Google Colab and Google Cloud Datalab. Neither of them seems able to run the notebook without import errors. I have tried to pip install the missing modules, but pip is not able to find them.

ImportError: No module named google
ImportError: No module named google3.pyglib
NameError: global name 'os' is not defined
-> I was able to easily fix this one, but it sort of baffles me that a notebook that seems to have been run, with all the output saved, has this error?
ImportError: No module named colabtools
