kuromoji-for-bigquery's Introduction

kuromoji-for-bigquery

kuromoji-for-bigquery tokenizes text in a BigQuery table with Kuromoji and Apache Beam, then stores the tokenized result in another BigQuery table.

It scales horizontally on a distributed system, since Apache Beam can run on Google Dataflow, Apache Spark, Apache Flink, and so on.

Requirements

  • Maven
  • Java 1.8+
  • Google Cloud Platform account

Version Info

  • Apache Beam: 2.42.0
  • Kuromoji: 0.7.7

How to Use

Command Line Options

Required Options

  • --project: Google Cloud Project
  • --inputDataset: Input BigQuery dataset ID
  • --inputTable: Input BigQuery table ID
  • --tokenizedColumn: Name of the column to tokenize in the input table
  • --outputDataset: Output BigQuery dataset ID
  • --outputTable: Output BigQuery table ID
  • --schema: BigQuery schema used to select columns from the input table. (Format: id:integer,name:string,value:float,ts:timestamp)
  • --tempLocation: The Cloud Storage path to use for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.
  • --gcpTempLocation: A GCS path for storing temporary files in GCP.
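
The --schema format above is a comma-separated list of name:type pairs. As a rough illustration of how such a string breaks down, here is a minimal Java sketch. The class SchemaSpecParser and its behavior (trimming whitespace, upper-casing type names, rejecting malformed pairs) are assumptions made for this example and are not taken from the kuromoji-for-bigquery source.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration; not part of kuromoji-for-bigquery. */
final class SchemaSpecParser {

  private SchemaSpecParser() {}

  /**
   * Parses a spec such as "id:integer,name:string" into an ordered
   * column-name -> type map. Upper-casing the type is a choice of this sketch.
   */
  static Map<String, String> parse(String spec) {
    Map<String, String> columns = new LinkedHashMap<>();
    for (String field : spec.split(",")) {
      String[] parts = field.trim().split(":");
      // Each entry must be exactly one non-empty name and one non-empty type.
      if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
        throw new IllegalArgumentException("Expected name:type but got: " + field);
      }
      columns.put(parts[0].trim(), parts[1].trim().toUpperCase(Locale.ROOT));
    }
    return columns;
  }

  public static void main(String[] args) {
    // The format string is the one documented for --schema.
    System.out.println(parse("id:integer,name:string,value:float,ts:timestamp"));
    // prints {id=INTEGER, name=STRING, value=FLOAT, ts=TIMESTAMP}
  }
}
```

A LinkedHashMap keeps the columns in the order they were written, which matters if the column order of the output table should follow the flag.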

Optional Options

  • --outputColumn: Output column for tokenized result in output table. (Default: token)
  • --kuromojiMode: Kuromoji Mode. (NORMAL, SEARCH, or EXTENDED) (Default: NORMAL)
  • --createDisposition: Create Disposition option for BigQuery. (CREATE_NEVER or CREATE_IF_NEEDED)
  • --writeDisposition: Write Disposition option for BigQuery. (WRITE_TRUNCATE, WRITE_APPEND or WRITE_EMPTY)
  • --runner: Apache Beam runner.
    • If you don't set this option, the pipeline runs on your local machine, not on Google Dataflow.
    • e.g. DataflowRunner
  • --numWorkers: The number of workers when you run it on top of Google Dataflow.
  • --workerMachineType: Google Dataflow worker instance type
    • e.g. n1-standard-1, n1-standard-4
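
Several of the options above accept only a fixed set of values. One way to catch a typo before a job is submitted is to check the value against an enum. The sketch below assumes hypothetical names (OptionValues and its nested enums) and accepts lower-case input as a convenience; the actual tool may validate these flags differently or be case-sensitive.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of kuromoji-for-bigquery. */
final class OptionValues {

  private OptionValues() {}

  // Allowed values as listed in the option descriptions.
  enum KuromojiMode { NORMAL, SEARCH, EXTENDED }
  enum CreateDisposition { CREATE_NEVER, CREATE_IF_NEEDED }
  enum WriteDisposition { WRITE_TRUNCATE, WRITE_APPEND, WRITE_EMPTY }

  /** Returns the enum constant for raw, or throws with the allowed values listed. */
  static <E extends Enum<E>> E parse(Class<E> type, String flag, String raw) {
    try {
      // Case-insensitive matching is an assumption of this sketch.
      return Enum.valueOf(type, raw.trim().toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException(
          "Invalid value '" + raw + "' for " + flag + "; allowed: "
              + Arrays.toString(type.getEnumConstants()), e);
    }
  }

  public static void main(String[] args) {
    System.out.println(parse(KuromojiMode.class, "--kuromojiMode", "NORMAL"));
    System.out.println(parse(WriteDisposition.class, "--writeDisposition", "write_append"));
  }
}
```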

Run the command

# compile
mvn clean package

# Run kuromoji-for-bigquery via the compiled JAR file
java -jar $(pwd)/target/kuromoji-for-bigquery-bundled-0.4.1.jar \
  --project=test-project-id \
  --schema=id:integer \
  --inputDataset=test_input_dataset \
  --inputTable=test_input_table \
  --outputDataset=test_output_dataset \
  --outputTable=test_output_table \
  --tokenizedColumn=text \
  --outputColumn=token \
  --kuromojiMode=NORMAL \
  --tempLocation=gs://test_yu/test-log/ \
  --gcpTempLocation=gs://test_yu/test-log/ \
  --maxNumWorkers=10 \
  --workerMachineType=n1-standard-2

Versions

kuromoji-for-bigquery   Apache Beam   kuromoji
0.1.0                   2.1.0         0.7.7
0.2.x                   2.20.0        0.7.7
0.3.x                   2.34.0        0.7.7
0.4.x                   2.42.0        0.7.7

License

Copyright (c) 2017 Yu Ishikawa.


kuromoji-for-bigquery's Issues

Change the version in pom.xml

Overview

Thanks to a contribution, the Apache Beam version has been upgraded, so the project version in pom.xml should be updated as well.

Security Policy violation SECURITY.md

This issue was automatically created by Allstar.

Security Policy Violation
Security policy not enabled.
A SECURITY.md file can give users information about what constitutes a vulnerability and how to report one securely so that information about a bug is not publicly visible. Examples of secure reporting methods include using an issue tracker with private issue support, or encrypted email with a published key.

To fix this, add a SECURITY.md file that explains how to handle vulnerabilities found in your repository. Go to https://github.com/yu-iskw/kuromoji-for-bigquery/security/policy to enable.

For more information, see https://docs.github.com/en/code-security/getting-started/adding-a-security-policy-to-your-repository.


This issue will auto resolve when the policy is in compliance.

Issue created by Allstar. See https://github.com/ossf/allstar/ for more information. For questions specific to the repository, please contact the owner or maintainer.

Change atilika repository URL to https

What happened

I got the following error when I ran mvn clean package.

Blocked mirror for repositories: [Atilika Open Source repository (http://www.atilika.org/nexus/content/repositories/atilika, default, releases+snapshots)]

I think this is caused by a Maven update.
Maven 3.8.1 added a feature that blocks downloads from http repositories.
The release notes explain this: https://maven.apache.org/docs/3.8.1/release-notes.html#how-to-fix-when-i-get-a-http-repository-blocked

What needs to be done

The Atilika repository provides an https URL, so we need to change the following URL from http to https.

<url>http://www.atilika.org/nexus/content/repositories/atilika</url>
