mindsdb / dfsql Goto Github PK

View Code? Open in Web Editor NEW

51.0 51.0 9.0 1.46 MB

SQL interface to Pandas

License: GNU General Public License v3.0

Python 56.90% Jupyter Notebook 43.10%

dfsql's People

Contributors

Stargazers

Watchers

Forkers

sycomix anmyachev shivamtrivedi1 suryatmodulus personx000 vishalbelsare mobasherah12 rommeldb kkxiaotikk

dfsql's Issues

Aliasing a column that is used in WHERE silently produces wrong output

Query:

SELECT passenger_id, titanic.survived as ts FROM titanic WHERE titanic.survived = 1

syntax documentation

Looks cool, but I have no idea what SQL functions the tool supports..

Queries fail for column names with dashes

Given seq.csv:

sequence,affinity,seq-len
ARKKERLW,0.13,8
WAKWAKA,0.1,7
EEERRRKK,0.2,8

The code:

import pandas as pd
from dfsql import sql_query

df = pd.read_csv("demo/alt.csv")
res = sql_query("SELECT * FROM df WHERE seq-len > 7", df=df)
print(res)

Fails with the error:

dfsql.exceptions.QueryExecutionException: Column seq not found.

If the column is renamed seq-len -> seqlen and the query is adjusted accordingly, then no error occurs.

Exception if two values are compared in 'WHERE'

dfsql.sql_query(
            'SELECT name, status FROM predictors WHERE 1 = 0',
            ds_kwargs={'case_sensitive': False},
            reduce_output=False,
            predictors=predictors_df
        )

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/__init__.py", line 30, in sql_query
    result = ds.query(sql, reduce_output=reduce_output)
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/data_sources/base_data_source.py", line 170, in query
    return self.execute_query(query, reduce_output=reduce_output)
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/data_sources/base_data_source.py", line 543, in execute_query
    return self.execute_select(query, reduce_output=reduce_output)
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/data_sources/base_data_source.py", line 398, in execute_select
    source_df = source_df[index.values]
AttributeError: 'bool' object has no attribute 'values'

Add contributors agreement and try to apply it to post license-switch contributions

We should go through all contributions since we switch from an MIT License to a GPL-3.0 License and either:

a) Have all contributors agree to and sign something like the ASF Contributor License Agreement or alternatively remove their contributions.

b) In the future we should have some easy way of allowing anyone that contributes code to sign and agreement similar to the way the Apache foundation does it.

This is for {insert legal reasons I would make a mess of explaining}, feel free to ask or send us an email or ask a question here in case you don't agree with this policy or think it's in some way disadvantageous to Mindsdb and/or it's open source contributors.

Spontaneous error during query

Time-to-time, on random query i get error like this:

data = dfsql.sql_query(
      str(new_statement),
      ds_kwargs={'case_sensitive': False},
      reduce_output=False,
      **{'dataframe': df}
  )

error:

Traceback (most recent call last):
  File "/home/maxs/dev/mdb/venv38/sources/mindsdb/mindsdb/api/mysql/mysql_proxy/mysql_proxy.py", line 1194, in query_answer
    data = dfsql.sql_query(
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/__init__.py", line 36, in sql_query
    shutil.rmtree(tmpdir)
  File "/usr/local/lib/python3.8/shutil.py", line 722, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/local/lib/python3.8/shutil.py", line 720, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/tmp/dfsql_temp_1639132502'

[Errno 39] Directory not empty: '/tmp/dfsql_temp_1639132502'

Remove column name transformations when loading a dataframe into a datasource

Zero length string value converts to <NA>

If i do:

df = pd.DataFrame([['a', '']], columns=['a','b'])
dfsql.sql_query('select * from df', df=df)

then i get:

   a     b
0  a  <NA>

Build a license-inclusive dependency graph

Build a depenency graph for this library that includes the licenses of all of our dependencies to make sure they are compatible with GLP-3.0

Table alias and groupby produces wrong output

This test fails:

def test_complex_groupby(self, googleplay_csv, data_source_googleplay):
        sql = """SELECT sub.category, avg(reviews) AS avg_reviews
                    FROM (
                        SELECT category, CAST(reviews AS float) AS reviews
                         FROM (
                            SELECT category, reviews
                            FROM googleplaystore
                            LIMIT 100
                         )
                    ) AS sub
                    GROUP BY sub.category
                    HAVING avg_reviews > 0.4
                    LIMIT 10"""

        df = pd.read_csv(googleplay_csv)
        inner = df[['Category', 'Reviews']].iloc[:100]
        out_df = inner.groupby(['Category']).agg({'Reviews': 'mean'}).reset_index()
        out_df.columns = ['sub.category', 'avg_reviews']

        query_result = data_source_googleplay.query(sql)
        assert out_df.shape == query_result.shape
        assert (out_df.dropna().values == query_result.dropna().values).all()

Error if selecting column name is 'status'

query:

result_df = dfsql.sql_query(
                    'select name, status from predictors',
                    ds_kwargs={'case_sensitive': False},
                    reduce_output=False,
                    predictors=predictors_df
                )

error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/__init__.py", line 30, in sql_query
    result = ds.query(sql, reduce_output=reduce_output)
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/data_sources/base_data_source.py", line 171, in query
    query = parse_sql(sql)
  File "/home/maxs/dev/mdb/venv38/sources/mindsdb_sql/mindsdb_sql/__init__.py", line 25, in parse_sql
    ast = parser.parse(tokens)
  File "/home/maxs/dev/mdb/venv38/lib/python3.8/site-packages/sly-0.4-py3.8.egg/sly/yacc.py", line 2119, in parse
    tok = self.error(errtoken)
  File "/home/maxs/dev/mdb/venv38/sources/mindsdb_sql/mindsdb_sql/parser/parser.py", line 544, in error
    raise ParsingException(f"Syntax error at token {p.type}: \"{p.value}\"")
mindsdb_sql.exceptions.ParsingException: Syntax error at token STATUS: "status"

ImportError: cannot import name 'basestring' from 'pyparsing'

Hi guys, I noticed that dfsql doesn't work with new versions of pyparsing (since 3.0.0). This should probably be fixed.

KeyError if 'ORDER BY' contains table name

This query produce KeyError('COLUMNS.ORDINAL_POSITION') :

dfsql.sql_query(
    "SELECT * FROM COLUMNS WHERE COLUMNS.TABLE_SCHEMA = 'MINDSDB' AND COLUMNS.TABLE_NAME = 'predictors' ORDER BY COLUMNS.ORDINAL_POSITION",
    ds_kwargs={'case_sensitive': False},
    reduce_output=False,
    **{table_name: dataframe}
)

Dataframe has column 'ORDINAL_POSITION'

No mac (or windows/source) packages means the installation fails.

Hello! I'm getting this error when I try to install via poetry add "modin[sql]" on my Macbook (outside of a Docker image):

$ poetry add "modin[sql]"
Using version ^0.10.1 for modin

Updating dependencies
Resolving dependencies... (0.6s)
Resolving dependencies... (0.3s)

Package operations: 2 installs, 0 updates, 0 removals

  • Installing dfsql (0.3.1): Failed

  RuntimeError

  Unable to find installation candidates for dfsql (0.3.1)

  at ~/.poetry/lib/poetry/installation/chooser.py:72 in choose_for
       68│
       69│             links.append(link)
       70│
       71│         if not links:
    →  72│             raise RuntimeError(
       73│                 "Unable to find installation candidates for {}".format(package)
       74│             )
       75│
       76│         # Get the best link


Failed to add packages, reverting the pyproject.toml file to its original content.

Doing a bit of digging looks like this change was made to just publish Linux wheels only, not able to understand the context behind that change given the lack of a description. I'm happy to throw together a quick Dockerfile to get my project to compile so I can try this out but figured I'd create an issue as likely others in the near future will hit this same problem!

dfsql.exceptions.dfsqlException: Table from_tables found in from_tables, but not in the SQL query.

Hi, I'm trying to run the example in the README, but I'm not able to run it.

import pandas as pd
from dfsql import sql_query

df = pd.DataFrame({
     "animal": ["cat", "dog", "cat", "dog"],
     "height": [23,  100, 25, 71] 
 })

df.head()

sql_query('SELECT animal, height FROM "animals_df" WHERE height > 50', from_tables={"animals_df": df})

But I'm getting an error with the table name:

File "/home/ray/anaconda3/lib/python3.7/site-packages/dfsql/__init__.py", line 23, in sql_query
    raise dfsqlException(f"Table {table_name} found in from_tables, but not in the SQL query.")
dfsql.exceptions.dfsqlException: Table from_tables found in from_tables, but not in the SQL query.

I don't know if anyone is having the same issue. Thanks a lot for you help