
intake-sql's Introduction

Intake: Take 2

A general python package for describing, loading and processing data


Taking the pain out of data access and distribution

Intake is an open-source package to:

  • describe your data declaratively
  • gather data sets into catalogs
  • search catalogs and services to find the right data you need
  • load, transform and output data in many formats
  • work with third party remote storage and compute platforms

Documentation is available at Read the Docs.

Please report issues at https://github.com/intake/intake/issues

Install

Recommended method using conda:

conda install -c conda-forge intake

You can also install with pip, in which case you can choose how many of the optional dependencies to install; the simplest option has the fewest requirements:

pip install intake

Note that you may well need specific drivers and other plugins, which usually have additional dependencies of their own.

Development

  • Create a development Python environment with the required dependencies, ideally with conda. The requirements can be found in the yml files in the scripts/ci/ directory of this repo.
    • e.g. conda env create -f scripts/ci/environment-py311.yml, then conda activate test_env
  • Install intake with pip install -e .
  • Use pytest to run the tests.
  • Create a fork on GitHub to be able to submit PRs.
  • We respect, but do not enforce, PEP 8 standards; all new code should be covered by tests.

intake-sql's People

Contributors

albertdefusco, astrojuanlu, ian-r-rose, martindurant


intake-sql's Issues

sql_cat setting sql_kwargs

sql_cat creates catalog entries much like the YAML file catalog does. The sql_auto driver receives the sql_kwargs and passes them to both dask.dataframe.read_sql and pandas.read_sql. This breaks for our use case, since cat.describe() gives:

{'args': {'index': 'n_hashed',
  'metadata': {'catalog_dir': ''},
  'sql_kwargs': {'name': 'mh_populate_catalog_entries', 'ttl': 1},
  'table': 'Test2',
  'uri': 'mssql+pymssql://user:pass@server/database?charset=utf8'},
 'description': 'SQL table Test2 from mssql+pymssql://user:pass@server/database?charset=utf8',
 'direct_access': True,
 'metadata': {},
 'plugin': 'sql_auto'}

pandas.read_sql does not know how to handle these sql_kwargs, so it fails. We can work around this with:

cat.Test2(sql_kwargs={}).read()

but then the kwargs are not passed at all.

If the kwargs should be passed, we can test whether nesting them works:

cat.Test2(sql_kwargs={'params': {'name': 'mh_populate_catalog_entries', 'ttl': 1}}).read()

That apparently works.

I can open a PR with the fix, but I'm unsure why you would want to pass these params to pandas.read_sql. Please advise.
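The shape mismatch can be illustrated without a database. This is a sketch using a stand-in function with the relevant part of pandas.read_sql's signature (the function name and values are illustrative, not part of intake-sql):

```python
def fake_read_sql(sql, con, params=None):
    """Stand-in mimicking the relevant part of pandas.read_sql's
    signature: extra values are only accepted through `params`."""
    return (sql, con, params)

# Flat kwargs, as they appear in the catalog entry above:
flat = {"name": "mh_populate_catalog_entries", "ttl": 1}

# Splatting the flat dict fails, just as the catalog entry does:
try:
    fake_read_sql("SELECT * FROM Test2", "conn", **flat)
except TypeError as exc:
    error = str(exc)  # unexpected keyword argument 'name'

# Nesting under 'params' routes the values through the supported argument:
ok = fake_read_sql("SELECT * FROM Test2", "conn", **{"params": flat})

assert "name" in error
assert ok[2]["ttl"] == 1
```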

where_values not providable as lists

The catalog below contains user parameters for where_string and where_list. where_string works correctly in the where_values argument, but I can't get where_list to work there, since where_values is already a list. I tried

where_values: {{ where_list }}

but the template engine complains about this one (unhashable dict). I'm unsure whether I've missed something. If this should be fixed, please advise on a direction and I'll try to make a PR.

catalog.yaml

metadata:
  version: 1
sources:
  mh_data_anonymization:
    description: "data set used for pseudonymization in the data team"
    direct_access: 'allow'
    driver: sql_manual
    args:
      uri: "{{ param_uri }}"
      sql_expr: "SELECT {{ top }} * FROM {{table_name}}"
      where_values: ["{{ where_string }}"]
      # where_values: ["{{ where_list }}"]           # not functioning
    parameters: # User defined parameters
      param_uri:
        description: "database uri in sqlalchemy style"
        type: str
        default: env(TEST_MSSQL_URI)
      table_name:
        description: "name of the table"
        type: str
        default: "IdentityProfilingMap"
      top:
        description: "number of results to return, e.g. top='TOP 200'"
        type: str
        default: "TOP 10"
      where_string:
        description: "list of string, or list of tuples"        
        type: str
        default: "WHERE Id<126040"
      where_list:
        description: "list of string, or list of tuples"        
        type: list
        default: ["WHERE Id<126040"]
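Plain string substitution hints at why the list parameter can't round-trip through the template: Jinja-style engines render a value with str(), so a list lands in the YAML text as its Python repr. A minimal sketch (actual Jinja2 rendering is assumed to behave like str() here):

```python
# What template substitution produces for the two parameter styles:
where_string = "WHERE Id<126040"
where_list = ["WHERE Id<126040"]

rendered_string = 'where_values: ["{}"]'.format(where_string)
rendered_list = "where_values: {}".format(where_list)  # str() of the list

assert rendered_string == 'where_values: ["WHERE Id<126040"]'
assert rendered_list == "where_values: ['WHERE Id<126040']"
# The first is a clean YAML list of one string. The second only parses by
# luck of repr syntax resembling YAML flow style, and breaks as soon as the
# list contains tuples or strings with quote characters.
```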

Explicit Docs Link

It took me a while to find the link to the docs. Would you be willing to set https://intake-sql.readthedocs.io/ as the URL for the repo, or put an explicit link in the README? I eventually discovered that clicking the badge goes to the docs, but that took a while to find.

TypeError: can not serialize 'Series' object while trying to read remote intake_sql.intake_sql.SQLSource

Using intake-sql. It seems msgpack receives invalid data types in the server response.

Traceback (server side):

Traceback (most recent call last):
  File "/opt/conda/envs/intake/lib/python3.9/site-packages/tornado/web.py", line 1704, in _execute
    result = await result
  File "/opt/conda/envs/intake/lib/python3.9/site-packages/tornado/gen.py", line 216, in wrapper
    result = ctx_run(func, *args, **kwargs)
  File "/opt/conda/envs/intake/lib/python3.9/site-packages/intake/cli/server/server.py", line 329, in post
    self.write(msgpack.packb(response, **pack_kwargs))
  File "/opt/conda/envs/intake/lib/python3.9/site-packages/msgpack/__init__.py", line 35, in packb
    return Packer(**kwargs).pack(o)
  File "msgpack/_packer.pyx", line 292, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 298, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 295, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 289, in msgpack._cmsgpack.Packer._pack
TypeError: can not serialize 'Series' object

Release?

Would it be possible to get a release at some point? #5 would be super useful.

Does not support sqlalchemy 2?

I'm getting an error: TypeError: MetaData.__init__() got an unexpected keyword argument 'bind'
Reverting to SQLAlchemy 1.4 solved the problem.

where_values with where_template: what values are expected?

I am trying to work with SQLSourceManualPartition and I can't figure out what values are expected in where_values.

I have this template :

SQLSourceManualPartition(uri='postgresql://[email protected]:5432/ddr', sql_expr='select * from schema.table', where_values=[[0 ,100,200],[100,200,500]], where_template='WHERE id >= {} AND id < {} OR id = {}')

but I get a tuple index out of range error.

It could be resolved if all values were unpacked at once, at line 315:

where = [where_tmp.format(*values) for values in where]
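The suggested fix can be checked with plain str.format, independent of intake (values taken from the example above):

```python
template = "WHERE id >= {} AND id < {} OR id = {}"
values = [0, 100, 200]

# Passing the list as a single positional argument leaves the second and
# third placeholders with nothing to bind, reproducing the reported error:
try:
    template.format(values)
except IndexError:
    failed = True

# Unpacking spreads the three values across the three placeholders:
clause = template.format(*values)

assert failed
assert clause == "WHERE id >= 0 AND id < 100 OR id = 200"
```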

Accessing an SQLCatalog accesses all tables in database?

I'm trying out SQLCatalog on my database that contains a lot of tables (~270).

  mydb:
    description: My database
    driver: sql_cat
    args:
      uri: "mysql+mysqlconnector://......."
    metadata: {}

When I access the catalog with

cat = intake.open_catalog('mycat.yml')
cat.mydb

it takes ages (up to 5 minutes) for the last call to return, so it seems to introspect all tables.
Calling it again is no faster, so it does not cache its data.
Even accessing a single table from the catalog is slow.

Am I doing something wrong?

PS accessing individual tables with the sql driver seems to work fine.
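A stdlib sqlite3 sketch of where the time is likely going (this is an assumption about SQLCatalog's behavior, not a reading of its code): listing table names is one cheap query, but eagerly reflecting every table's schema costs a round trip per table, which hurts at ~270 tables.

```python
import sqlite3

# Stand-in for a database with many tables.
con = sqlite3.connect(":memory:")
for i in range(5):
    con.execute(f"CREATE TABLE t{i} (id INTEGER, payload TEXT)")

# Listing table names: a single cheap query.
names = sorted(row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
assert names == ["t0", "t1", "t2", "t3", "t4"]

# Reflecting each table's columns: one query per table, the kind of eager
# per-table work that would explain minutes of waiting on a large catalog.
schemas = {name: [col[1] for col in con.execute(f"PRAGMA table_info({name})")]
           for name in names}
assert schemas["t3"] == ["id", "payload"]
```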

Reading queries with subqueries doesn't work.

When I try to create a new data source with intake.open_sql() using e.g.
select * from table where user_id in (select id from users where id in (123, 124))
it causes:
InternalError: (pymysql.err.InternalError) (1059, "Identifier name 'select * from table where user_id in (select id from users where id in (123, 124))' is too long")
or, for other queries involving subqueries, it produces
ProgrammingError: (pymysql.err.ProgrammingError) (1103, "Incorrect table name...")

I think it might be a bug where the bracketed statement is treated as a table name.
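A common workaround with tools that substitute the expression where a table name is expected is to wrap the whole statement as a derived table with an alias. Whether intake-sql's drivers accept this is an assumption worth testing; the sketch below only builds the string:

```python
# Hypothetical sketch: give the bracketed statement an alias so that any
# downstream "SELECT * FROM <expr>" sees a valid derived table, not a bare
# parenthesised query mistaken for an identifier.
inner = ("select * from table where user_id in "
         "(select id from users where id in (123, 124))")
sql_expr = "(%s) AS sub" % inner

assert sql_expr.startswith("(select")
assert sql_expr.endswith(") AS sub")
```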
