Code Monkey home page Code Monkey logo

bigquery-python's Introduction

BigQuery-Python

Simple Python client for interacting with Google BigQuery.

This client provides an API for retrieving and inserting BigQuery data by wrapping Google's low-level API client library. It also provides facilities that make it convenient to access data that is tied to an App Engine appspot, such as request logs.

Documentation

Installation

pip install bigquery-python

Basic Usage

from bigquery import get_client

# BigQuery project id as listed in the Google Developers Console.
project_id = 'project_id'

# Service account email address as listed in the Google Developers Console.
service_account = '[email protected]'

# PKCS12 or PEM key provided by Google.
key = 'key.pem'

client = get_client(project_id, service_account=service_account,
                    private_key_file=key, readonly=True)

# JSON key provided by Google
json_key = 'key.json'
 
client = get_client(json_key_file=json_key, readonly=True)

# Submit an async query.
job_id, _results = client.query('SELECT * FROM dataset.my_table LIMIT 1000')

# Check if the query has finished running.
complete, row_count = client.check_job(job_id)

# Retrieve the results.
results = client.get_query_rows(job_id)

Executing Queries

The BigQuery client allows you to execute raw queries against a dataset. The query method inserts a query job into BigQuery. By default, query method runs asynchronously with 0 for timeout. When a non-zero timeout value is specified, the job will wait for the results, and throws an exception on timeout.

When you run an async query, you can use the returned job_id to poll for job status later with check_job.

# Submit an async query
job_id, _results = client.query('SELECT * FROM dataset.my_table LIMIT 1000')

# Do other stuffs

# Poll for query completion.
complete, row_count = client.check_job(job_id)

# Retrieve the results.
if complete:
    results = client.get_query_rows(job_id)

You can also specify a non-zero timeout value if you want your query to be synchronous.

# Submit a synchronous query
try:
    _job_id, results = client.query('SELECT * FROM dataset.my_table LIMIT 1000', timeout=10)
except BigQueryTimeoutException:
    print "Timeout"

Query Builder

The query_builder module provides an API for generating query strings that can be run using the BigQuery client.

from bigquery.query_builder import render_query

selects = {
    'start_time': {
        'alias': 'Timestamp',
        'format': 'INTEGER-FORMAT_UTC_USEC'
    }
}

conditions = [
    {
        'field': 'Timestamp',
        'type': 'INTEGER',
        'comparators': [
            {
                'condition': '>=',
                'negate': False,
                'value': 1399478981
            }
        ]
    }
]

grouping = ['Timestamp']

having = [
    {
        'field': 'Timestamp',
        'type': 'INTEGER',
        'comparators': [
            {
                'condition': '==',
                'negate': False,
                'value': 1399478981
            }
        ]
    }
]

order_by ={'fields': ['Timestamp'], 'direction': 'desc'}

query = render_query(
    'dataset',
    ['table'],
    select=selects,
    conditions=conditions,
    groupings=grouping,
    having=having,
    order_by=order_by
)

job_id, _ = client.query(query)

Managing Tables

The BigQuery client provides facilities to manage dataset tables, including creating, deleting, and checking the existence of tables.

# Create a new table.
schema = [
    {'name': 'foo', 'type': 'STRING', 'mode': 'nullable'},
    {'name': 'bar', 'type': 'FLOAT', 'mode': 'nullable'}
]
created = client.create_table('dataset', 'my_table', schema)

# Delete an existing table.
deleted = client.delete_table('dataset', 'my_table')

# Check if a table exists.
exists = client.check_table('dataset', 'my_table')

There is also functionality for retrieving tables that are associated with a Google App Engine appspot, assuming table names are in the form of appid_YYYY_MM or YYYY_MM_appid. This allows tables between a date range to be selected and queried on.

# Get appspot tables falling within a start and end time.
from datetime import datetime, timedelta
range_end = datetime.utcnow()
range_start = range_end - timedelta(weeks=12)
tables = client.get_tables('dataset', 'appid', range_start, range_end)

Inserting Data

The client provides an API for inserting data into a BigQuery table. The last parameter refers to an optional insert id key used to avoid duplicate entries.

# Insert data into table.
rows =  [
    {'one': 'ein', 'two': 'zwei'}
    {'id': 'NzAzYmRiY', 'one': 'uno', 'two': 'dos'},
    {'id': 'NzAzYmRiY', 'one': 'ein', 'two': 'zwei'} # duplicate entry
]

inserted = client.push_rows('dataset', 'table', rows, 'id')

Write Query Results to Table

You can write query results directly to table. When either dataset or table parameter is omitted, query result will be written to temporary table.

# write to permanent table
job = client.write_to_table('SELECT * FROM dataset.original_table LIMIT 100',
                            'dataset',
                            'table')
try:
    job_resource = client.wait_for_job(job, timeout=60)
    print job_resource
except BigQueryTimeoutException:
    print "Timeout"

# write to permanent table with UDF in query string
external_udf_uris = ["gs://bigquery-sandbox-udf/url_decode.js"]
query = """SELECT requests, title
            FROM
              urlDecode(
                SELECT
                  title, sum(requests) AS num_requests
                FROM
                  [fh-bigquery:wikipedia.pagecounts_201504]
                WHERE language = 'fr'
                GROUP EACH BY title
              )
            WHERE title LIKE '%ç%'
            ORDER BY requests DESC
            LIMIT 100
        """
job = client.write_to_table(
  query,
  'dataset',
  'table',
  external_udf_uris=external_udf_uris
)

try:
    job_resource = client.wait_for_job(job, timeout=60)
    print job_resource
except BigQueryTimeoutException:
    print "Timeout"

# write to temporary table
job = client.write_to_table('SELECT * FROM dataset.original_table LIMIT 100')
try:
    job_resource = client.wait_for_job(job, timeout=60)
    print job_resource
except BigQueryTimeoutException:
    print "Timeout"

Import data from Google cloud storage

schema = [ {"name": "username", "type": "string", "mode": "nullable"} ]
job = client.import_data_from_uris( ['gs://mybucket/mydata.json'],
                                    'dataset',
                                    'table',
                                    schema,
                                    source_format=JOB_SOURCE_FORMAT_JSON)

try:
    job_resource = client.wait_for_job(job, timeout=60)
    print job_resource
except BigQueryTimeoutException:
    print "Timeout"

Export data to Google cloud storage

job = client.export_data_to_uris( ['gs://mybucket/mydata.json'],
                                   'dataset',
                                   'table')
try:
    job_resource = client.wait_for_job(job, timeout=60)
    print job_resource
except BigQueryTimeoutException:
    print "Timeout"

Managing Datasets

The client provides an API for listing, creating, deleting, updating and patching datasets.

# List datasets
datasets = client.get_datasets()


# Create dataset
dataset = client.create_dataset('mydataset', friendly_name="My Dataset", description="A dataset created by me")

# Get dataset
client.get_dataset('mydataset')

# Delete dataset
client.delete_dataset('mydataset')
client.delete_dataset('mydataset', delete_contents=True) # delete even if it contains data

# Update dataset
client.update_dataset('mydataset', friendly_name="mon Dataset") # description is deleted

# Patch dataset
client.patch_dataset('mydataset', friendly_name="mon Dataset") # friendly_name changed; description is preserved

# Check if dataset exists.
exists = client.check_dataset('mydataset')

Creating a schema from a sample record

from bigquery import schema_from_record

schema_from_record({"id":123, "posts": [{"id":123, "text": "tihs is a post"}], "username": "bob"})

Caveats

BigQuery flattens results with repeated records, so a result might actually map to multiple rows. This means that the row count may be larger than the actual number of results because BigQuery reports the number of unrolled rows but the returned results are rolled back up.

Contributing

Requirements to commit here:

  • Branch off master, PR back to master.
  • Your code should pass Flake8.
  • Unit test coverage is required.
  • Good docstrs are required.
  • Good commit messages are required.

bigquery-python's People

Contributors

andrewgardners avatar aneilbaboo avatar atlaspilotpuppy avatar coffenbacher avatar cpdean avatar dylanroy avatar hagino3000 avatar ilikedata avatar jasonab avatar john2x avatar kamatama41 avatar mdanieli avatar nickstanisha avatar orangain avatar paulliang1 avatar paulw54jrn avatar pirsquare avatar puhitaku avatar ruxandraburtica avatar sagarrakshe avatar scribu avatar sshrestha-datalicious avatar tylertreat avatar tylertreat-wf avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.