Code Monkey home page Code Monkey logo

tinyquery's Introduction

tinyquery

tinyquery is a partial implementation of the BigQuery API that runs completely in-memory in Python code. If you have Python code that uses the BigQuery API, you may be able to use tinyquery to make your automated tests orders of magnitude faster, more reliable, and less annoying to run.

There are still lots of missing pieces, but enough of the API and SQL language is filled in that it can be used for some production-size BigQuery pipelines.

Motivation

BigQuery is a Google service that lets you import giant data sets and run arbitrary SQL queries over them. The most common use case is to allow people to dig into their data manually using SQL, but BigQuery also lets you build complex data pipelines entirely in SQL, which has a number of advantages over other approaches like MapReduce.

One of the biggest challenges when writing a data pipeline on top of BigQuery is writing high-quality automated tests. Normally, you're left with a few less-than-perfect options:

  • Skip automated testing and rely only on manual testing.
  • Swap out the BigQuery service with a mock object that asserts the API usage and returns pre-canned results. This lets you test some basic things, but won't give you confidence in your system if most of your business logic is in SQL.
  • Run tests against some other SQL implementation that can be run locally. Since every dialect of SQL is different (and with BigQuery having a dramatically different architecture than a typical relational database), this would likely cause more trouble than it's worth.
  • Run tests against the real BigQuery service. This is probably the best option, but has a few problems:
    • The tests can get very slow as the code gets more complex, since they will involve lots of network I/O over the internet and BigQuery operations tend to have highly-variable response times anyway.
    • Anyone running your test needs to have the right credentials to access BigQuery.
    • Running the test requires an internet connection.
    • If you don't take additional measures, two people running the test at the same time could trample each other's test state and cause confusing test failures.

tinyquery lets you write a test against the real BigQuery API, then swap out BigQuery for a fake implementation for fast iteration. For example, tinyquery was used to dramatically improve a test at Khan Academy that ran a large data pipeline three times in different conditions. Originally, the test took about 8 minutes (and at its worst point took about an hour to run) and required some manual steps. After modifying the test to use tinyquery, the test now takes 2 seconds to run and can be run as part of the regular build process (or manually by any developer) without any extra work.

How to use

tinyquery is a drop-in replacement as long as you're accessing the BigQuery API using the Python client library.

Here's the basic setup code that you would use to get started:

from tinyquery import tinyquery
from tinyquery import api_client
tq_service = tinyquery.TinyQuery()
tq_api_client = api_client.TinyQueryApiClient(tq_service)

The tq_service instance gives you direct Python access to various tinyquery operations. The tq_api_client instance is an API wrapper that you can patch/inject into your production code as a replacement for the return value of apiclient.discovery.build().

If your code catches the apiclient.errors.HttpError exception, you may also want to patch tinyquery.api_client.FakeHttpError with that class.

Features

  • Almost all of the core SQL language: SELECT, FROM, WHERE, HAVING, GROUP BY, JOIN (including LEFT OUTER JOIN and CROSS JOIN), LIMIT, subqueries.
  • Many of the common functions and operators. See runtime.py for a list.
  • Importing from CSV.
  • API wrappers for creating, getting, and deleting tables, and for creating and managing query and copy jobs and getting query results.

What's missing?

  • Many features that behave uniquely with repeated and record fields:
    • FLATTEN
    • WITHIN and scoped aggregation
    • POSITION and NEST
  • Window functions and OVER/PARTITION BY.
  • Various operators and functions.
  • Lots of API operations are unsupported. The ones that are supported are missing various return fields from the API.
  • There are some edge cases in the core language where BigQuery and tinyquery differ:
    • differences involving fully-qualified column names (e.g. table_alias.column being allowed in tinyquery but not BigQuery or vice versa).
    • tinyquery allows SELECTing from multiple repeated fields more often than bigquery does.

Contributing

See CONTRIBUTING.md

License

tinyquery is licensed under the MIT License.

Changelog

1.1

Pinned dependencies to versions.

1.0

Initial release to PyPI.

tinyquery's People

Contributors

alangpierce avatar benjaminjkraft avatar cjfuller avatar csilvers avatar graingert avatar guptaragini avatar kevinbarabash avatar millar avatar nsfmc avatar samanthadoran avatar sdcooke avatar veryeli avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

tinyquery's Issues

Examples for integration testing using tinyquery

Hi Team,

I was looking to do integration testing of my project against BigQuery. Given, there is no BigQuery emulator provided by google to test locally, I landed up on tinyquery to check if it can fulfil my purpose.

However, I couldn't find examples in the documentation which can help me know how to use tinyquery. I would request you guys to add few examples to help the viewers to understand the use-case in a better way.

My use case is following:

  1. Create a table locally in the form - "project_name.dataset_name.table_name" & populate it using data from CSV or JSON file.
  2. Use the BigQuery SQL to execute on the table created above.
  3. Assert the results.

Essentially I would like to check if the BigQuery SQL that I have in my project, would run properly on actual BigQuery or not.

Thanks in advance.

Missing ORDER BY ABS and DATE_DIFF functions

I am trying to use this to test a query of the form:

   SELECT year_from_record, month_from_record, day_from_record
   FROM project_id:dataset.table
   ORDER BY ABS(DATE_DIFF(first_date_expression, second_date_expression, DAY)) ASC
   LIMIT 1

I am getting a:

   p = LexToken(LPAREN,'(',1,101)
          def p_error(p):
          raise SyntaxError('Unexpected token: %s' % p)
          SyntaxError: Unexpected token: LexToken(LPAREN,'(',1,101)

`
intead.

Extending SQL support

I've been looking for a mock implementation of the newer BigQuery Cloud Client (versus the Google API Client that tinyquery uses[1]) and Tinyquery is the closest I've found. This is really cool!

I was considering extending it to include an interface for the new client, but there are quite a few SQL features that we also need. For example support for CREATE TABLE, INSERT, FULL OUTER JOIN. Do you know if anyone has done any work in this area? I'd be glad to contribute but I'm trying to assess just how much work we'd need to put in to meet our needs. It would be great to be able to run tests without a dep on the actual BigQuery service.

1 - https://cloud.google.com/apis/docs/client-libraries-explained

Subquery ordering ignored in aggregation

On bigquery the following query returns the first item after ordering, on tinyquery the order seems to be ignored. If you point me in the right direction I could have a go at fixing this - I'm not sure I could dive in and do it straight away though!

SELECT a, FIRST(b)
FROM (
    SELECT a, b
    FROM t
    ORDER BY b
)
GROUP BY a

Python 3

Are you interested in or currently working on python 3 support? I could potentially find the time to do the conversion (maybe using six if necessary) if you're open to it.

Transfer ownership?

Hi @alangpierce ! We are not accessing bigquery from python anymore at Khan Academy, so it doesn't make sense for us to keep ownership of this project. Would you like us to transfer it to you?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.