Code Monkey home page Code Monkey logo

gsheets-db-api's Introduction

Build Status codecov

Note: shillelagh is a drop-in replacement for gsheets-db-api, with many additional features. You should use it instead. If you're using SQLAlchemy all you need to do:

$ pip uninstall gsheetsdb
$ pip install shillelagh

If you're using the DB API:

# from gsheetsdb import connect
from shillelagh.backends.apsw.db import connect

A Python DB API 2.0 for Google Spreadsheets

This module allows you to query Google Spreadsheets using SQL.

Using this spreadsheet as an example:

A B
1 country cnt
2 BR 1
3 BR 3
4 IN 5

Here's a simple query using the Python API:

from gsheetsdb import connect

conn = connect()
result = conn.execute("""
    SELECT
        country
      , SUM(cnt)
    FROM
        "https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/"
    GROUP BY
        country
""", headers=1)
for row in result:
    print(row)

This will print:

Row(country='BR', sum_cnt=4.0)
Row(country='IN', sum_cnt=5.0)

How it works

Transpiling

Google spreadsheets can actually be queried with a very limited SQL API. This module will transpile the SQL query into a simpler query that the API understands. Eg, the query above would be translated to:

SELECT A, SUM(B) GROUP BY A

Processors

In addition to transpiling, this module also provides pre- and post-processors. The pre-processors add more columns to the query, and the post-processors build the actual result from those extra columns. Eg, COUNT(*) is not supported, so the following query:

SELECT COUNT(*) FROM "https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/"

Gets translated to:

SELECT COUNT(A), COUNT(B)

And then the maximum count is returned. This assumes that at least one column has no NULLs.

SQLite

When a query can't be expressed, the module will issue a SELECT *, load the data into an in-memory SQLite table, and execute the query in SQLite. This is obviously inneficient, since all data has to be downloaded, but ensures that all queries succeed.

Installation

$ pip install gsheetsdb
$ pip install gsheetsdb[cli]         # if you want to use the CLI
$ pip install gsheetsdb[sqlalchemy]  # if you want to use it with SQLAlchemy

CLI

The module will install an executable called gsheetsdb:

$ gsheetsdb --headers=1
> SELECT * FROM "https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/"
country      cnt
---------  -----
BR             1
BR             3
IN             5
> SELECT country, SUM(cnt) FROM "https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1
pscv8ZXPtg8/" GROUP BY country
country      sum cnt
---------  ---------
BR                 4
IN                 5
>

SQLAlchemy support

This module provides a SQLAlchemy dialect. You don't need to specify a URL, since the spreadsheet is extracted from the FROM clause:

from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *

engine = create_engine('gsheets://')
inspector = inspect(engine)

table = Table(
    'https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/edit#gid=0',
    MetaData(bind=engine),
    autoload=True)
query = select([func.count(table.columns.country)], from_obj=table)
print(query.scalar())  # prints 3.0

Alternatively, you can initialize the engine with a "catalog". The catalog is a Google spreadsheet where each row points to another Google spreadsheet, with URL, number of headers and schema as the columns. You can see an example here:

A B C
1 https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/edit#gid=0 1 default
2 https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/edit#gid=1077884006 2 default

This will make the two spreadsheets above available as "tables" in the default schema.

Authentication

You can access spreadsheets that are shared only within an organization. In order to do this, first create a service account. Make sure you select "Enable G Suite Domain-wide Delegation". Download the key as a JSON file.

Next, you need to manage API client access at https://admin.google.com/${DOMAIN}/AdminHome?chromeless=1#OGX:ManageOauthClients. Add the "Unique ID" from the previous step as the "Client Name", and add https://spreadsheets.google.com/feeds as the scope.

Now, when creating the connection from the DB API or from SQLAlchemy you can point to the JSON file and the user you want to impersonate:

>>> auth = {'service_account_file': '/path/to/certificate.json', 'subject': '[email protected]'}
>>> conn = connect(auth)

gsheets-db-api's People

Contributors

betodealmeida avatar nytai avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

gsheets-db-api's Issues

Insert/update support?

Just a quick question, does this library allow for writing of data back up to Sheets, or just reading data from it?
Having played around with it with SQLAlchemy, the integration is excellent. Getting ProgrammingErrors though when trying to do anything involving insert/updates.

Handle `COUNT(*)` outside of `SELECT`

The CountStar processor should handle more cases, eg:

SELECT
    country
  , COUNT(*)
FROM
    table
WHERE
    name = 'John'
GROUP BY
    country
HAVING
    COUNT(*) > 100

SqlAlchemy

How to connect with SqlAlchemy?

I am trying to url below in Apache Superset but it is not working:

gsheets://gsheets://https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/edit#gid=0

Returns the following error:

superset_1 | 2019-10-09 17:28:18,959:ERROR:root:invalid literal for int() with base 10: ''
superset_1 | Traceback (most recent call last):
superset_1 | File "/home/superset/superset/views/core.py", line 1701, in testconn
superset_1 | database.set_sqlalchemy_uri(uri)
superset_1 | File "/home/superset/superset/models/core.py", line 836, in set_sqlalchemy_uri
superset_1 | conn = sqla.engine.url.make_url(uri.strip())
superset_1 | File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/url.py", line 229, in make_url
superset_1 | return _parse_rfc1738_args(name_or_url)
superset_1 | File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/url.py", line 288, in _parse_rfc1738_args
superset_1 | return URL(name, **components)
superset_1 | File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/url.py", line 71, in init
superset_1 | self.port = int(port)
superset_1 | ValueError: invalid literal for int() with base 10: ''

From the command line this works:

root@e87274843f8f:/home/superset# gsheetsdb
sql> select * from "https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8";
country cnt


BR 1
BR 3
IN 5
ZA 6

I await help. Thanks!

Missing functionality

Translation

  • aliases (the API reject these queries)
  • ORDER BY

SQLAlchemy

  • basic implementation
  • catalogs

Accessing private gsheets

We have some gsheets that are accessible by people from the same org only.
Is there any way to connect to these gsheets without making them public?

Getting error: "Connection failed"

I installed gsheetsdb package and tried to use it in superset and in python script but I am getting connection failed error. What should be the full uri for gsheets?

Kindly help in this issue.

Can't get time granularity using GSHEET driver in Superset.

When doing a query in Superset with time granularity, the query fails.

SELECT DATETIME(STRFTIME('%Y-%m-%dT%H:00:00', "DateTime")) AS __timestamp,
       COUNT(*) AS count
FROM "https://docs.google.com/spreadsheets/d/1x5q4jaX54Ob0aT-0rBbECv_Q9xwqiYR41LwYVSwgavE/edit?usp=sharing"
WHERE "DateTime" >= '2020-04-16 00:00:00.000000'
  AND "DateTime" < '2020-04-23 00:00:00.000000'
GROUP BY DATETIME(STRFTIME('%Y-%m-%dT%H:00:00', "DateTime"))
ORDER BY count DESC
LIMIT 10000;

image

Open spreadsheet as database, worksheets as tables?

Wouldn't it be easier to open the Google spreadsheet as a database, then use worksheets as tables? And eliminate the need for an intermediate catalog?

Like this:

from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *

engine = create_engine("gsheets://1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8")
inspector = inspect(engine)

table = Table(
    "Simple sheet",
    MetaData(bind=engine),
    autoload=True)
query = select([func.count(table.columns.country)], from_obj=table)
print(query.scalar())  # prints 3.0

Couldn't parse date string 'datetime.date(2022, 3, 25)' with SQLAlchemy

Without SQL alchemy, everything is OK. However, as I have a demand to move to real database in the future, i decided to apply SQL alchemy.
I'm using SQL alchemy and follow the instruction in here #13
However, I got the error when running table.select()

File "/Users/duyetmai/.local/share/virtualenvs/slack_bot-_UTJr7MR/lib/python3.7/site-packages/sqlalchemy/engine/result.py", line 383, in iterrows row = make_row(row) if make_row else row ValueError: Couldn't parse date string 'datetime.date(2022, 3, 25)' - value is not a string.

Here is the sheet data

image

After that, instead of using autoload=True, I tried to apply the schema

`

        Column('id', UUID(as_uuid=True), primary_key=True, default=uuid.uuid4),

        Column('username', String()),
        
        Column('start_date', Date()),
        
        Column('end_date', Date()),
        
        Column('leave_type', String()),
        
        Column('reason', String()),
        
        Column('created_time', DateTime()),
        
        Column('status', String()),
        
        Column('approver', String()),

`

However, still have the same error, so I changed all Date/DateTime Field to String and it works OK. Here is the print result after changing to string

`

             Column('start_date', String()),
            Column('end_date', String()),
             Column('created_time', String()),

`

(UUID('a91fb8ef-a456-4ab5-87ec-51505c08d831'), 'Xzz', datetime.date(2022, 3, 25), datetime.date(2022, 3, 26), 'PTO', 'PTO', datetime.datetime(2022, 3, 5, 6, 19, 42, tzinfo=datetime.timezone.utc), 'Wait for Approval', 'ABC')

(UUID('130223f1-b501-4079-95f4-8de003659a0e'), 'zzzz', datetime.date(2022, 3, 3), datetime.date(2022, 3, 18), 'PTO', 'PTO', datetime.datetime(2022, 3, 5, 6, 57, 45, tzinfo=datetime.timezone.utc), 'Approved', 'ABC')

`

Not working with newer moz-sql-parser

When I install newer versions of moz-sql-parser queries aren't parsed and the following error is returned:

Invalid URL, must be a docs.google.com URL!

Digging into the problem I found that moz_sql_parser.parse used in extract_url changed in commit 3557475 and so in all releases after 4.10.21011.

The new version escapes the dots in the URL, so instead of https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/ we get back https://docs\\.google\\.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/

For the moment I fixed it installing moz-sql-parser==4.9.21002, do you think this issue should be solved in this project or in the moz-sql-parser code?

Issue #14 may be related

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.