betodealmeida / gsheets-db-api Goto Github PK

A Python DB-API and SQLAlchemy dialect to Google Spreasheets

License: MIT License

Makefile 0.16% Python 99.84%

python db google spreadsheet spreadsheets sql api

gsheets-db-api's Introduction

Note: shillelagh is a drop-in replacement for gsheets-db-api, with many additional features. You should use it instead. If you're using SQLAlchemy all you need to do:

$ pip uninstall gsheetsdb
$ pip install shillelagh

If you're using the DB API:

# from gsheetsdb import connect
from shillelagh.backends.apsw.db import connect

A Python DB API 2.0 for Google Spreadsheets

This module allows you to query Google Spreadsheets using SQL.

Using this spreadsheet as an example:

	A	B
1	country	cnt
2	BR	1
3	BR	3
4	IN	5

Here's a simple query using the Python API:

from gsheetsdb import connect

conn = connect()
result = conn.execute("""
    SELECT
        country
      , SUM(cnt)
    FROM
        "https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/"
    GROUP BY
        country
""", headers=1)
for row in result:
    print(row)

This will print:

Row(country='BR', sum_cnt=4.0)
Row(country='IN', sum_cnt=5.0)

How it works

Transpiling

Google spreadsheets can actually be queried with a very limited SQL API. This module will transpile the SQL query into a simpler query that the API understands. Eg, the query above would be translated to:

SELECT A, SUM(B) GROUP BY A

Processors

In addition to transpiling, this module also provides pre- and post-processors. The pre-processors add more columns to the query, and the post-processors build the actual result from those extra columns. Eg, COUNT(*) is not supported, so the following query:

SELECT COUNT(*) FROM "https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/"

Gets translated to:

SELECT COUNT(A), COUNT(B)

And then the maximum count is returned. This assumes that at least one column has no NULLs.

SQLite

When a query can't be expressed, the module will issue a SELECT *, load the data into an in-memory SQLite table, and execute the query in SQLite. This is obviously inneficient, since all data has to be downloaded, but ensures that all queries succeed.

Installation

$ pip install gsheetsdb
$ pip install gsheetsdb[cli]         # if you want to use the CLI
$ pip install gsheetsdb[sqlalchemy]  # if you want to use it with SQLAlchemy

CLI

The module will install an executable called gsheetsdb:

$ gsheetsdb --headers=1
> SELECT * FROM "https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/"
country      cnt
---------  -----
BR             1
BR             3
IN             5
> SELECT country, SUM(cnt) FROM "https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1
pscv8ZXPtg8/" GROUP BY country
country      sum cnt
---------  ---------
BR                 4
IN                 5
>

SQLAlchemy support

This module provides a SQLAlchemy dialect. You don't need to specify a URL, since the spreadsheet is extracted from the FROM clause:

from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *

engine = create_engine('gsheets://')
inspector = inspect(engine)

table = Table(
    'https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/edit#gid=0',
    MetaData(bind=engine),
    autoload=True)
query = select([func.count(table.columns.country)], from_obj=table)
print(query.scalar())  # prints 3.0

Alternatively, you can initialize the engine with a "catalog". The catalog is a Google spreadsheet where each row points to another Google spreadsheet, with URL, number of headers and schema as the columns. You can see an example here:

	A	B	C
1	https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/edit#gid=0	1	default
2	https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/edit#gid=1077884006	2	default

This will make the two spreadsheets above available as "tables" in the default schema.

Authentication

You can access spreadsheets that are shared only within an organization. In order to do this, first create a service account. Make sure you select "Enable G Suite Domain-wide Delegation". Download the key as a JSON file.

Next, you need to manage API client access at https://admin.google.com/${DOMAIN}/AdminHome?chromeless=1#OGX:ManageOauthClients. Add the "Unique ID" from the previous step as the "Client Name", and add https://spreadsheets.google.com/feeds as the scope.

Now, when creating the connection from the DB API or from SQLAlchemy you can point to the JSON file and the user you want to impersonate:

>>> auth = {'service_account_file': '/path/to/certificate.json', 'subject': '[email protected]'}
>>> conn = connect(auth)

gsheets-db-api's People

Contributors

Stargazers

Watchers

Forkers

dvska puneetjaiswal jrjsmrtn jithinraj tim-werner villebro s2t2 preset-io visualphoenix jagritig onestarshang glryz juned7228 scott3j poonyisath fsantoma67

gsheets-db-api's Issues

Add support for granularity functions

We need to add proper support for granularity functions, so it can be used in Superset. The current PR (apache/superset#5915) is incorrect.

Insert/update support?

Just a quick question, does this library allow for writing of data back up to Sheets, or just reading data from it?
Having played around with it with SQLAlchemy, the integration is excellent. Getting ProgrammingErrors though when trying to do anything involving insert/updates.

Handle `COUNT(*)` outside of `SELECT`

The CountStar processor should handle more cases, eg:

SELECT
    country
  , COUNT(*)
FROM
    table
WHERE
    name = 'John'
GROUP BY
    country
HAVING
    COUNT(*) > 100

SqlAlchemy

How to connect with SqlAlchemy?

I am trying to url below in Apache Superset but it is not working:

gsheets://gsheets://https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/edit#gid=0

Returns the following error:

superset_1 | 2019-10-09 17:28:18,959:ERROR:root:invalid literal for int() with base 10: ''
superset_1 | Traceback (most recent call last):
superset_1 | File "/home/superset/superset/views/core.py", line 1701, in testconn
superset_1 | database.set_sqlalchemy_uri(uri)
superset_1 | File "/home/superset/superset/models/core.py", line 836, in set_sqlalchemy_uri
superset_1 | conn = sqla.engine.url.make_url(uri.strip())
superset_1 | File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/url.py", line 229, in make_url
superset_1 | return _parse_rfc1738_args(name_or_url)
superset_1 | File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/url.py", line 288, in _parse_rfc1738_args
superset_1 | return URL(name, **components)
superset_1 | File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/url.py", line 71, in init
superset_1 | self.port = int(port)
superset_1 | ValueError: invalid literal for int() with base 10: ''

From the command line this works:

root@e87274843f8f:/home/superset# gsheetsdb
sql> select * from "https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8";
country cnt

BR 1
BR 3
IN 5
ZA 6

I await help. Thanks!

Missing functionality

Translation

aliases (the API reject these queries)
ORDER BY

SQLAlchemy

basic implementation
catalogs

In Catalog Mode, be able to specify a table alias

In addition to specify the URL, the header lines, and the schema, add the "table" alias as well.

I was really excited when I first saw the catalog example, when I saw what I thought was the alias column in column D, but it turns out it wasn't functional.

Latest setup tools is breaking the library

We had to downgrade setuptools v51.3.3 to 51.1.2 as the library started throwing Invalid URL, must be a docs.google.com URL! error

Accessing private gsheets

We have some gsheets that are accessible by people from the same org only.
Is there any way to connect to these gsheets without making them public?

Getting error: "Connection failed"

I installed gsheetsdb package and tried to use it in superset and in python script but I am getting connection failed error. What should be the full uri for gsheets?

Kindly help in this issue.

Can't get time granularity using GSHEET driver in Superset.

When doing a query in Superset with time granularity, the query fails.

SELECT DATETIME(STRFTIME('%Y-%m-%dT%H:00:00', "DateTime")) AS __timestamp,
       COUNT(*) AS count
FROM "https://docs.google.com/spreadsheets/d/1x5q4jaX54Ob0aT-0rBbECv_Q9xwqiYR41LwYVSwgavE/edit?usp=sharing"
WHERE "DateTime" >= '2020-04-16 00:00:00.000000'
  AND "DateTime" < '2020-04-23 00:00:00.000000'
GROUP BY DATETIME(STRFTIME('%Y-%m-%dT%H:00:00', "DateTime"))
ORDER BY count DESC
LIMIT 10000;

Implement auth

Can we query private spreadsheets?

Open spreadsheet as database, worksheets as tables?

Wouldn't it be easier to open the Google spreadsheet as a database, then use worksheets as tables? And eliminate the need for an intermediate catalog?

Like this:

from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *

engine = create_engine("gsheets://1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8")
inspector = inspect(engine)

table = Table(
    "Simple sheet",
    MetaData(bind=engine),
    autoload=True)
query = select([func.count(table.columns.country)], from_obj=table)
print(query.scalar())  # prints 3.0

Couldn't parse date string 'datetime.date(2022, 3, 25)' with SQLAlchemy

Without SQL alchemy, everything is OK. However, as I have a demand to move to real database in the future, i decided to apply SQL alchemy.
I'm using SQL alchemy and follow the instruction in here #13
However, I got the error when running table.select()

File "/Users/duyetmai/.local/share/virtualenvs/slack_bot-_UTJr7MR/lib/python3.7/site-packages/sqlalchemy/engine/result.py", line 383, in iterrows row = make_row(row) if make_row else row ValueError: Couldn't parse date string 'datetime.date(2022, 3, 25)' - value is not a string.

Here is the sheet data

After that, instead of using autoload=True, I tried to apply the schema

        Column('id', UUID(as_uuid=True), primary_key=True, default=uuid.uuid4),

        Column('username', String()),
        
        Column('start_date', Date()),
        
        Column('end_date', Date()),
        
        Column('leave_type', String()),
        
        Column('reason', String()),
        
        Column('created_time', DateTime()),
        
        Column('status', String()),
        
        Column('approver', String()),

However, still have the same error, so I changed all Date/DateTime Field to String and it works OK. Here is the print result after changing to string

             Column('start_date', String()),
            Column('end_date', String()),
             Column('created_time', String()),

(UUID('a91fb8ef-a456-4ab5-87ec-51505c08d831'), 'Xzz', datetime.date(2022, 3, 25), datetime.date(2022, 3, 26), 'PTO', 'PTO', datetime.datetime(2022, 3, 5, 6, 19, 42, tzinfo=datetime.timezone.utc), 'Wait for Approval', 'ABC')

(UUID('130223f1-b501-4079-95f4-8de003659a0e'), 'zzzz', datetime.date(2022, 3, 3), datetime.date(2022, 3, 18), 'PTO', 'PTO', datetime.datetime(2022, 3, 5, 6, 57, 45, tzinfo=datetime.timezone.utc), 'Approved', 'ABC')

Not working with newer moz-sql-parser

When I install newer versions of moz-sql-parser queries aren't parsed and the following error is returned:

Invalid URL, must be a docs.google.com URL!

Digging into the problem I found that moz_sql_parser.parse used in extract_url changed in commit 3557475 and so in all releases after 4.10.21011.

The new version escapes the dots in the URL, so instead of https://docs.google.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/ we get back https://docs\\.google\\.com/spreadsheets/d/1_rN3lm0R_bU3NemO0s9pbFkY5LQPcuy1pscv8ZXPtg8/

For the moment I fixed it installing moz-sql-parser==4.9.21002, do you think this issue should be solved in this project or in the moz-sql-parser code?

Issue #14 may be related