
intake-sql's Introduction

Intake: Take 2

A general python package for describing, loading and processing data


Taking the pain out of data access and distribution

Intake is an open-source package to:

  • describe your data declaratively
  • gather data sets into catalogs
  • search catalogs and services to find the right data you need
  • load, transform and output data in many formats
  • work with third party remote storage and compute platforms

Documentation is available at Read the Docs.

Please report issues at https://github.com/intake/intake/issues

Install

Recommended method using conda:

conda install -c conda-forge intake

You can also install with pip, in which case you can choose how many of the optional dependencies to install; the simplest option has the fewest requirements:

pip install intake

Note that you may well need specific drivers and other plugins, which usually have additional dependencies of their own.

Development

  • Create a development Python environment with the required dependencies, ideally with conda. The requirements can be found in the yml files in the scripts/ci/ directory of this repo.
    • e.g. conda env create -f scripts/ci/environment-py311.yml, then conda activate test_env
  • Install intake with pip install -e .
  • Use pytest to run the tests.
  • Create a fork on GitHub to be able to submit PRs.
  • We respect, but do not enforce, PEP 8 standards; all new code should be covered by tests.

intake-sql's People

Contributors

albertdefusco, astrojuanlu, ian-r-rose, martindurant


intake-sql's Issues

sql_cat setting sql_kwargs

sql_cat creates catalog entries much like the YAML file catalog does. The sql_auto driver receives the sql_kwargs and passes them to both dask.dataframe.read_sql and pandas.read_sql. This breaks for our use case, since cat.describe() gives:

{'args': {'index': 'n_hashed',
  'metadata': {'catalog_dir': ''},
  'sql_kwargs': {'name': 'mh_populate_catalog_entries', 'ttl': 1},
  'table': 'Test2',
  'uri': 'mssql+pymssql://user:pass@server/database?charset=utf8'},
 'description': 'SQL table Test2 from mssql+pymssql://user:pass@server/database?charset=utf8',
 'direct_access': True,
 'metadata': {},
 'plugin': 'sql_auto'}

pandas.read_sql does not know how to handle these sql_kwargs, so it fails. We can work around this with:

cat.Test2(sql_kwargs={}).read()

but then the kwargs are not passed at all.

If the kwargs should be passed, we can test whether nesting them works:

cat.Test2(sql_kwargs={'params': {'name': 'mh_populate_catalog_entries', 'ttl': 1}}).read()

That apparently works.

I can open a PR with the fix, but I'm unsure why you would want to pass these params to pandas.read_sql. Please advise.
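The shape mismatch can be illustrated without a database. This is a sketch using a stand-in function with the relevant part of pandas.read_sql's signature (the function name and values are illustrative, not part of intake-sql):

```python
def fake_read_sql(sql, con, params=None):
    """Stand-in mimicking the relevant part of pandas.read_sql's
    signature: extra values are only accepted through `params`."""
    return (sql, con, params)

# Flat kwargs, as they appear in the catalog entry above:
flat = {"name": "mh_populate_catalog_entries", "ttl": 1}

# Splatting the flat dict fails, just as the catalog entry does:
try:
    fake_read_sql("SELECT * FROM Test2", "conn", **flat)
except TypeError as exc:
    error = str(exc)  # unexpected keyword argument 'name'

# Nesting under 'params' routes the values through the supported argument:
ok = fake_read_sql("SELECT * FROM Test2", "conn", **{"params": flat})

assert "name" in error
assert ok[2]["ttl"] == 1
```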

where_values not providable as lists

The catalog below contains user parameters for where_string and where_list. where_string works correctly in the where_values argument, but I can't get where_list to work there, since where_values is already a list. I tried

where_values: {{ where_list }}

but the template engine complains about this one (unhashable dict). I'm unsure whether I've missed something. If this should be fixed, please advise on a direction and I'll try to make a PR.

catalog.yaml

metadata:
  version: 1
sources:
  mh_data_anonymization:
    description: "data set used for pseudonymization in the data team"
    direct_access: 'allow'
    driver: sql_manual
    args:
      uri: "{{ param_uri }}"
      sql_expr: "SELECT {{ top }} * FROM {{table_name}}"
      where_values: ["{{ where_string }}"]
      # where_values: ["{{ where_list }}"]           # not functioning
    parameters: # User defined parameters
      param_uri:
        description: "database uri in sqlalchemy style"
        type: str
        default: env(TEST_MSSQL_URI)
      table_name:
        description: "name of the table"
        type: str
        default: "IdentityProfilingMap"
      top:
        description: "number of results to return, e.g. top='TOP 200'"
        type: str
        default: "TOP 10"
      where_string:
        description: "list of string, or list of tuples"        
        type: str
        default: "WHERE Id<126040"
      where_list:
        description: "list of string, or list of tuples"        
        type: list
        default: ["WHERE Id<126040"]
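Plain string substitution hints at why the list parameter can't round-trip through the template: Jinja-style engines render a value with str(), so a list lands in the YAML text as its Python repr. A minimal sketch (actual Jinja2 rendering is assumed to behave like str() here):

```python
# What template substitution produces for the two parameter styles:
where_string = "WHERE Id<126040"
where_list = ["WHERE Id<126040"]

rendered_string = 'where_values: ["{}"]'.format(where_string)
rendered_list = "where_values: {}".format(where_list)  # str() of the list

assert rendered_string == 'where_values: ["WHERE Id<126040"]'
assert rendered_list == "where_values: ['WHERE Id<126040']"
# The first is a clean YAML list of one string. The second only parses by
# luck of repr syntax resembling YAML flow style, and breaks as soon as the
# list contains tuples or strings with quote characters.
```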

Explicit Docs Link

It took me a while to find the link to the docs. Would you be willing to set https://intake-sql.readthedocs.io/ as the URL for the repo, or put an explicit link in the README? I eventually discovered that clicking the badge goes to the docs, but that took a while to find.

TypeError: can not serialize 'Series' object while trying to read remote intake_sql.intake_sql.SQLSource

Using intake-sql. It seems msgpack receives invalid data types in the server response.

Traceback (server side):

Traceback (most recent call last):
  File "/opt/conda/envs/intake/lib/python3.9/site-packages/tornado/web.py", line 1704, in _execute
    result = await result
  File "/opt/conda/envs/intake/lib/python3.9/site-packages/tornado/gen.py", line 216, in wrapper
    result = ctx_run(func, *args, **kwargs)
  File "/opt/conda/envs/intake/lib/python3.9/site-packages/intake/cli/server/server.py", line 329, in post
    self.write(msgpack.packb(response, **pack_kwargs))
  File "/opt/conda/envs/intake/lib/python3.9/site-packages/msgpack/__init__.py", line 35, in packb
    return Packer(**kwargs).pack(o)
  File "msgpack/_packer.pyx", line 292, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 298, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 295, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 289, in msgpack._cmsgpack.Packer._pack
TypeError: can not serialize 'Series' object

Release?

Would it be possible to get a release at some point? #5 would be super useful.

Does not support sqlalchemy 2?

I'm getting an error: TypeError: MetaData.__init__() got an unexpected keyword argument 'bind'
Reverting to SQLAlchemy 1.4 solved the problem.

where_values with where_template: what values are expected?

I am trying to work with SQLSourceManualPartition and I can't figure out what values are expected in where_values.

I have this template :

SQLSourceManualPartition(uri='postgresql://[email protected]:5432/ddr', sql_expr='select * from schema.table', where_values=[[0 ,100,200],[100,200,500]], where_template='WHERE id >= {} AND id < {} OR id = {}')

but I get a tuple index out of range error.

It could be resolved if all values were unpacked at once, at line 315:

where = [where_tmp.format(*values) for values in where]
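The suggested fix can be checked with plain str.format, independent of intake (values taken from the example above):

```python
template = "WHERE id >= {} AND id < {} OR id = {}"
values = [0, 100, 200]

# Passing the list as a single positional argument leaves the second and
# third placeholders with nothing to bind, reproducing the reported error:
try:
    template.format(values)
except IndexError:
    failed = True

# Unpacking spreads the three values across the three placeholders:
clause = template.format(*values)

assert failed
assert clause == "WHERE id >= 0 AND id < 100 OR id = 200"
```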

Accessing an SQLCatalog accesses all tables in database?

I'm trying out SQLCatalog on my database that contains a lot of tables (~270).

  mydb:
    description: My database
    driver: sql_cat
    args:
      uri: "mysql+mysqlconnector://......."
    metadata: {}

When I access the catalog with

cat = intake.open_catalog('mycat.yml')
cat.mydb

it takes ages (up to 5 minutes) for the last call to return, so it seems to introspect all tables.
Calling it again is no faster, so it does not cache its data.
Even accessing a single table from the catalog is slow.

Am I doing something wrong?

PS accessing individual tables with the sql driver seems to work fine.
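A stdlib sqlite3 sketch of where the time is likely going (this is an assumption about SQLCatalog's behavior, not a reading of its code): listing table names is one cheap query, but eagerly reflecting every table's schema costs a round trip per table, which hurts at ~270 tables.

```python
import sqlite3

# Stand-in for a database with many tables.
con = sqlite3.connect(":memory:")
for i in range(5):
    con.execute(f"CREATE TABLE t{i} (id INTEGER, payload TEXT)")

# Listing table names: a single cheap query.
names = sorted(row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
assert names == ["t0", "t1", "t2", "t3", "t4"]

# Reflecting each table's columns: one query per table, the kind of eager
# per-table work that would explain minutes of waiting on a large catalog.
schemas = {name: [col[1] for col in con.execute(f"PRAGMA table_info({name})")]
           for name in names}
assert schemas["t3"] == ["id", "payload"]
```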

Reading queries with subqueries doesn't work.

When I try to create a new data source with intake.open_sql() using e.g.
select * from table where user_id in (select id from users where id in (123, 124))
it causes:
InternalError: (pymysql.err.InternalError) (1059, "Identifier name 'select * from table where user_id in (select id from users where id in (123, 124))' is too long")
or, for other queries involving subqueries, it produces
ProgrammingError: (pymysql.err.ProgrammingError) (1103, "Incorrect table name...")

I think it might be a bug where the bracketed statement is treated as a table name.
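A common workaround with tools that substitute the expression where a table name is expected is to wrap the whole statement as a derived table with an alias. Whether intake-sql's drivers accept this is an assumption worth testing; the sketch below only builds the string:

```python
# Hypothetical sketch: give the bracketed statement an alias so that any
# downstream "SELECT * FROM <expr>" sees a valid derived table, not a bare
# parenthesised query mistaken for an identifier.
inner = ("select * from table where user_id in "
         "(select id from users where id in (123, 124))")
sql_expr = "(%s) AS sub" % inner

assert sql_expr.startswith("(select")
assert sql_expr.endswith(") AS sub")
```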
