
gcs-tools's Introduction

GCS Tools


Raison d'être:

A lightweight wrapper that adds Google Cloud Storage (GCS) support to common Hadoop tools, including avro-tools, parquet-cli, proto-tools for Scio's Protobuf-in-Avro files, and magnolify-tools for Magnolify code generation, so that they can be used from regular workstations or laptops, outside of a Google Compute Engine (GCE) instance.

It uses your existing OAuth2 credentials and allows authentication via a browser.

Usage:

You can install the tools on macOS via our Homebrew tap.

brew tap spotify/public
brew install gcs-avro-tools gcs-parquet-cli gcs-proto-tools gcs-magnolify-tools
avro-tools tojson <GCS_PATH>
parquet-cli cat <GCS_PATH>
proto-tools tojson <GCS_PATH>
magnolify-tools <avro|parquet> <GCS_PATH>

Or build them yourself.

sbt assembly
java -jar avro-tools/target/scala-2.13/avro-tools-*.jar tojson <GCS_PATH>
java -jar parquet-cli/target/scala-2.13/parquet-cli-*.jar cat <GCS_PATH>
java -jar proto-tools/target/scala-2.13/proto-tools-*.jar cat <GCS_PATH>
java -jar magnolify-tools/target/scala-2.13/magnolify-tools-*.jar <avro|parquet> <GCS_PATH>

How it works:

To make avro-tools, parquet-cli, proto-tools, and magnolify-tools work with GCS, each tool bundles the GCS connector for Hadoop. Note that the GCS connector won't pick up your local gcloud configuration; it instead expects its settings in core-site.xml.
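
If you build the tools yourself, you can check which connector settings ship inside the assembled jar. This is a minimal sketch only; it assumes the assembly packages a core-site.xml, and the exact location inside the jar may differ between releases.

# Locate the bundled configuration inside the assembly jar (jar path from the build step above)
unzip -l avro-tools/target/scala-2.13/avro-tools-*.jar | grep core-site.xml
# Print it (assumes it sits at the jar root; adjust to the path the previous command reports)
unzip -p avro-tools/target/scala-2.13/avro-tools-*.jar core-site.xml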

gcs-tools's People

Contributors

ajitgogul, clairemcginty, luster, nevillelyh, regadas, rustedbones, scala-steward, syodage


gcs-tools's Issues

no JSON input found: gcloud credentials

On gcs-avro-tools 0.1.7 from Homebrew, there appear to be issues loading the application default credentials.

Exception in thread "main" java.lang.IllegalArgumentException: no JSON input found
	at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
	at com.google.api.client.util.Preconditions.checkArgument(Preconditions.java:49)
	at com.google.api.client.json.JsonParser.startParsing(JsonParser.java:222)
	at com.google.api.client.json.JsonParser.parse(JsonParser.java:379)
	at com.google.api.client.json.JsonParser.parse(JsonParser.java:335)
	at com.google.api.client.json.JsonParser.parseAndClose(JsonParser.java:165)
	at com.google.api.client.json.JsonParser.parseAndClose(JsonParser.java:147)
	at com.google.api.client.json.JsonFactory.fromInputStream(JsonFactory.java:206)
	at com.google.api.client.extensions.java6.auth.oauth2.FileCredentialStore.loadCredentials(FileCredentialStore.java:154)
	at com.google.api.client.extensions.java6.auth.oauth2.FileCredentialStore.<init>(FileCredentialStore.java:86)
	at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromFileCredentialStoreForInstalledApp(CredentialFactory.java:301)

....

When I run gcloud auth application-default login, it saves my credentials to /Users/cchow/.config/gcloud/application_default_credentials.json. Did the expected path change?
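
One thing that may be worth trying, if the tool can fall back to application default credentials (a sketch only; the bundled credential store may not honor this at all):

# Re-create the ADC file and point the standard env var at it (path as reported above)
gcloud auth application-default login
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.config/gcloud/application_default_credentials.json"
avro-tools tojson <GCS_PATH>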

No valid credential configuration discovered

Hello,
Coming from Scio's documentation, I ended up installing proto-tools through Homebrew (spotify/public/gcs-proto-tools stable 0.2.4). However, when I run it I get the following error:

$ proto-tools getschema gs://bucket/data.protobuf.avro
Exception in thread "main" java.lang.IllegalArgumentException: No valid credential configuration discovered:  [CredentialOptions{serviceAccountEnabled=false, serviceAccountPrivateKeyId=null, serviceAccountPrivateKey=null, serviceAccountEmail=null, serviceAccountKeyFile=null, serviceAccountJsonKeyFile=null, nullCredentialEnabled=false, transportType=JAVA_NET, tokenServerUrl=https://oauth2.googleapis.com/token, proxyAddress=null, proxyUsername=null, proxyPassword=null, authClientId=32555940559.apps.googleusercontent.com, authClientSecret=<redacted>, authRefreshToken=null}]
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:220)
	at com.google.cloud.hadoop.util.CredentialOptions$Builder.build(CredentialOptions.java:171)
	at com.google.cloud.hadoop.util.HadoopCredentialConfiguration.getCredentialFactory(HadoopCredentialConfiguration.java:227)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getCredential(GoogleHadoopFileSystemBase.java:1343)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createGcsFs(GoogleHadoopFileSystemBase.java:1501)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1483)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:470)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3572)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3673)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3624)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:557)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.avro.mapred.FsInput.<init>(FsInput.java:38)
	at org.apache.avro.tool.ProtoGetSchemaTool.run(ProtoGetSchemaTool.java:33)
	at org.apache.avro.tool.ProtoMain.run(ProtoMain.java:64)
	at org.apache.avro.tool.ProtoMain.main(ProtoMain.java:53)

I'm logged into my GCP project with gcloud.

The README seems to suggest that something needs to be done with the GCS connector, but I can't figure out what exactly:

  • Is it something to be installed separately? How?
  • How can I edit the core-site.xml file within a Homebrew installation? Or can it be passed on the command line?

add `proto-tools fromPb` method?

An example use case is inspecting the pipelineUrl file that Dataflow stages (which is a .pb file representing org.apache.beam.model.pipeline.v1.Pipeline) to verify coders and transforms. It would just be a wrapper around protoc's decode or decode_raw method, maybe with built-in support for common Protobuf messages like Pipeline (although we'd have to handle different schema versions for different Beam versions).
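
For reference, a rough sketch of the manual protoc workaround this would wrap; the GCS object path and the local .proto paths are placeholders, and --decode_raw needs no schema while --decode needs the Beam model .proto files on hand:

# Inspect the staged pipeline proto without a schema (prints field numbers, not names)
gsutil cat gs://<staging-bucket>/<job-dir>/pipeline.pb | protoc --decode_raw

# With the Beam pipeline .proto files available locally, --decode prints named fields instead
gsutil cat gs://<staging-bucket>/<job-dir>/pipeline.pb | \
  protoc -I <beam-model-proto-dir> --decode=org.apache.beam.model.pipeline.v1.Pipeline beam_runner_api.proto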

NoSuchMethodError when running parquet-tools locally

Hello there 👋

Following the README, I tried to build the project & use it locally but parquet-tools fails with a NoSuchMethodError:

The command:

% java -jar parquet-tools/target/scala-2.12/parquet-tools-1.10.1.jar rowcount --debug gs://path/to/parquet/file

The error:

java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$ParentTimestampUpdateIncludePredicate.create(GoogleHadoopFileSystemBase.java:790)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createOptionsBuilderFromConfig(GoogleHadoopFileSystemBase.java:2140)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1832)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1013)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:976)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.parquet.tools.command.RowCountCommand.execute(RowCountCommand.java:83)
        at org.apache.parquet.tools.Main.main(Main.java:223)
java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
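
If it helps triage: this usually indicates an older Guava on the classpath (Splitter.splitToList was added in Guava 15). A quick check of what the assembly actually bundles, as a diagnostic sketch using the jar path from the command above:

# Look for the (possibly unshaded) Guava Splitter class inside the assembly
jar tf parquet-tools/target/scala-2.12/parquet-tools-1.10.1.jar | grep 'com/google/common/base/Splitter'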

New release with parquet-tools 1.10.1?

Hello there 👋

I noticed that you updated the version of parquet-tools on master a fair amount of time ago (allowing usage of rowcount 🙏 ), but there has been no release with it.

Do you have any idea if/when you would be able to make a new release?

Since we rely heavily on Parquet files hosted on GCS, this is sorely missed!

Nonetheless, thanks for this awesome tool 👍

All latest tools fail to authenticate to GCS

STR:

1a. Install all of the latest (v0.2.2, as of Aug 29) tools, or
1b. Build latest master to parquet-cli-1.12.3.jar, proto-tools-3.21.1.jar, avro-tools-1.11.0.jar, magnolify-tools-0.4.8.jar
2. Run each of them with a basic read command like <TOOL> tojson <GCS_PATH>

Actual:
The tool launches a browser that shows a page with the message:

The version of the app you're using doesn't include the latest security features to keep you protected. Please make sure to download from a trusted source and update to the latest, most secure version.

Expected:
The tool reads the file according to spec.

proto-tools NoSuchMethodError

When calling proto-tools with either tojson or getschema, the following error is thrown:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$ParentTimestampUpdateIncludePredicate.create(GoogleHadoopFileSystemBase.java:641)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createOptionsBuilderFromConfig(GoogleHadoopFileSystemBase.java:1978)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1675)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:862)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:825)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
	at org.apache.avro.tool.Util.openFromFS(Util.java:88)
	at org.apache.avro.tool.Util.fileOrStdin(Util.java:60)
	at org.apache.avro.tool.ProtoToJsonTool.run(ProtoToJsonTool.java:48)
	at org.apache.avro.tool.ProtoMain.run(ProtoMain.java:64)
	at org.apache.avro.tool.ProtoMain.main(ProtoMain.java:53)

error running proto-tools tojson

Running proto-tools tojson throws this error:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.CodedInputStream.newInstance(Ljava/nio/ByteBuffer;)Lcom/google/protobuf/CodedInputStream;
	at me.lyh.protobuf.generic.GenericReader.read(GenericReader.scala:21)
	at org.apache.avro.tool.ProtobufReader.toJson(ProtobufReader.scala:9)
	at org.apache.avro.tool.ProtoToJsonTool.run(ProtoToJsonTool.java:59)
	at org.apache.avro.tool.ProtoMain.run(ProtoMain.java:64)
	at org.apache.avro.tool.ProtoMain.main(ProtoMain.java:53)
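
This is likely a protobuf-java version clash on the classpath. One way to see which jar the conflicting class is actually loaded from, as a diagnostic sketch assuming a locally built jar laid out as in the README:

# Print class-loading info and filter for the conflicting protobuf class
java -verbose:class -jar proto-tools/target/scala-2.13/proto-tools-*.jar tojson <GCS_PATH> 2>&1 | grep CodedInputStream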
