tobymao / sqlglot
Python SQL Parser and Transpiler
Home Page: https://sqlglot.com/
License: MIT License
At the moment we get an unexpected token error when we write a query with a fully qualified name like catalog.schema.table.
See https://clickhouse.tech/docs/en/
ClickHouse is a blazing-fast vectorized columnar MPP database.
raw = '''
CREATE TABLE `t_customer_account` (
`id` int,
`customer_id` int,
`bank` varchar(100),
`account_no` varchar(100)
);
'''
parsed = sqlglot.parse(raw, sqlglot.dialects.MySQL)
I got this error:
ParseError: Expected ). Line 2, Col: 7.
CREATE TABLE `t_customer_account` (
`id` int,
`customer_id` int,
`bank` varchar(100),
`a
sqlglot version: v1.3.2
There is no token for ILIKE in sqlglot.
https://postgrespro.com/docs/postgresql/9.6/functions-matching#functions-like
The key word ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale. This is not in the SQL standard but is a PostgreSQL extension.
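The quoted behaviour can be illustrated in plain Python; below is a minimal sketch of LIKE vs ILIKE pattern semantics (the helper and its regex translation are illustrative, not sqlglot API):

```python
import re

# ILIKE is LIKE with the comparison made case-insensitive.
def sql_like(value: str, pattern: str, case_insensitive: bool = False) -> bool:
    # translate SQL wildcards: % -> .*, _ -> .
    regex = re.escape(pattern).replace("%", ".*").replace("_", ".")
    flags = re.IGNORECASE if case_insensitive else 0
    return re.fullmatch(regex, value, flags) is not None

assert sql_like("Alice", "a%", case_insensitive=True)  # ILIKE matches
assert not sql_like("Alice", "a%")                     # plain LIKE does not
```

A dialect without native ILIKE could lower `a ILIKE b` to `LOWER(a) LIKE LOWER(b)` with the same semantics.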
Hello
Thank you for the very nice tool! I want to use it to format Presto SQL in the style below, which differs from the regular style.
Would it be possible for me to send you a style guide so you could add this format?
WITH
SOME_CTE AS (
SELECT DISTINCT
id AS my_id,
otherField1 AS my_other_field_1,
otherField2 AS my_other_field_2
FROM
prod.my_other_data
WHERE
id IN (1, 2, 3)
OR
(
otherField1 = 'value1'
AND
otherField2 = 'value2'
)
),
FINAL AS (
SELECT
id_field,
field_1 AS detailed_field_1,
field_2 AS detailed_field_2,
detailed_field_3,
CASE
WHEN
cancellation_date IS NULL AND expiration_date IS NOT NULL
THEN
expiration_date
WHEN
cancellation_date IS NULL
THEN
start_date + 7
ELSE
cancellation_date
END AS cancellation_date,
rank() OVER (
PARTITION BY id_field
ORDER BY cancellation_date
) AS previous_detailed_field_3,
SUM(field_4) AS field_4_sum,
MAX(field_5) AS field_5_max
FROM
(
SELECT
*
FROM
prod.my_data
WHERE
filter = 'my_filter'
) AS A
LEFT JOIN
SOME_CTE
ON
A.id_field = SOME_CTE.my_id
WHERE
reduce(
field_1,
0,
(x, s) -> s + x,
s -> s
) > 100
GROUP BY
1,
2,
3,
4,
5
HAVING
COUNT(*) > 5
ORDER BY
4 DESC
)
SELECT
*
FROM
FINAL
UNION is currently supported, but it would be useful if INTERSECT and EXCEPT were also supported.
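For reference, the three set operators correspond to plain set algebra over result rows; a tiny Python illustration (assuming DISTINCT/set semantics, so UNION ALL is out of scope):

```python
# Rows modeled as tuples in sets.
a = {(1,), (2,), (3,)}
b = {(2,), (3,), (4,)}

assert a | b == {(1,), (2,), (3,), (4,)}  # UNION
assert a & b == {(2,), (3,)}              # INTERSECT
assert a - b == {(1,)}                    # EXCEPT
```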
The following query
select sub.year, sub.genre, count(id)
from (
select m.movieId as id,
int(regexp_extract(m.title, '\\s*\\((\\d{4})\\)\\s*"?$', 1)) as year,
explode(split(m.genres, '\\|')) as genre
from movies m
) sub
where genre <> '(no genres listed)'
and year IS NOT NULL
group by sub.year, sub.genre
order by sub.year DESC, sub.genre
is being incorrectly translated to Spark.
I get the following error:
# version: sqlglot 1.26.1
from sqlglot import parse_one
parse_one("SELECT * FROM demo where exists (select * from demo)")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/__init__.py", line 44, in parse_one
return parse(code, read=read, **opts)[0]
File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/__init__.py", line 27, in parse
return dialect.parse(code, **opts)
File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/dialects.py", line 62, in parse
return self.parser(**opts).parse(self.tokenizer().tokenize(code), code)
File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/parser.py", line 255, in parse
self.check_errors()
File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/parser.py", line 263, in check_errors
raise error
sqlglot.errors.ParseError: Expecting ). Line 1, Col: 34.
SELECT * FROM demo where exists (select * from demo)
See
sqlglot - Why is not able to parse correctly DELETE SQL Statement?
https://stackoverflow.com/q/71224391/14700552?sem=2
On the following query:
WITH regional_sales AS (
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
), top_regions AS (
SELECT region
FROM regional_sales
WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)
SELECT region,
product,
SUM(quantity) AS product_units,
SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;
execution fails with the following:
~/.local/lib/python3.9/site-packages/sqlglot/parser.py in _parse_primary(self)
650
651 if not self._match(TokenType.R_PAREN):
--> 652 self.raise_error('Expecting )', paren)
653 return exp.Paren(this=this)
654
~/.local/lib/python3.9/site-packages/sqlglot/parser.py in raise_error(self, message, token)
202 )
203 if self.error_level == ErrorLevel.RAISE:
--> 204 raise self.error
205 if self.error_level == ErrorLevel.WARN:
206 logging.error(self.error)
ParseError: Expecting ). Line 7, Col: 24.
n
FROM regional_sales
WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)
Working on a solution, should have something by tonight; I also need to handle complex types like map<> and struct.
It has been solved, but another SQL statement failed:
CREATE TABLE `t_customer_account` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`customer_id` int(11) DEFAULT NULL COMMENT 'customer id',
`bank` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT 'bank',
`account_no` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT 'account number',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin COMMENT='customer account table';
Error info:
ERROR:root:Expected ). Line 2, Col: 10.
CREATE TABLE `t_customer_account` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`customer_id` int(1
ERROR:root:Invalid expression / Unexpected token. Line 2, Col: 10.
CREATE TABLE `t_customer_account` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`customer_id` int(1
[(CREATE this:
(TABLE this:
(SCHEMA this: `t_customer_account`, expressions:
(COLUMNDEF this:
(COLUMN this: `id`, db: , table: ), kind: int, comment: )), db: ), kind: table, expression: , exists: , file_format: , temporary: )]
Originally posted by @exitNA in #20 (comment)
I believe this would be a bigger change than simply adding a new dialect, as new parsing is required for JSON extraction syntax.
Test Cases
SELECT foo:bar
SELECT foo:bar:baz::DATE
SELECT foo:bar[0]
SELECT foo:bar.baz
I'm wondering if it is possible to implement struct handling such that a query like this:
SELECT a.b.c from table_1
would parse as a struct field type.
This is the Syntax Tree Transformation example in the README:
import sqlglot
import sqlglot.expressions as exp
expression_tree = sqlglot.parse_one("SELECT a FROM x")
def transformer(node):
if isinstance(node, exp.Column) and node.args["this"].text == "a":
return sqlglot.parse_one("FUN(a)")
return node
transformed_tree = expression_tree.transform(transformer)
transformed_tree.sql()
When I run it, the transformed_tree isn't changed; the SQL is still SELECT a FROM x. I played with it a bit and found it worked to change the .text to .args["this"].
Hi Toby, I found a bug. Please review and fix it.
PRESTO SQL
CREATE TABLE schema2.table2
WITH (
bucketed_by = Array['Key'],
bucket_count = 64
)
AS
SELECT * FROM schema2.table2
sqlglot.errors.ParseError: Required keyword: 'value' missing for <class 'sqlglot.expressions.Property'>. Line 3, Col: 20.
tested on sqlglot==2.3.2
Hi
Nice lib, I found some minor issues:
This is ok:
sql_str = """
select A from dbo.tablename
union all
select B from dbo.tablename
"""
sqlglot.parse(sql_str)
This is not ok:
sql_str = """
(select A from dbo.tablename)
union all
(select B from dbo.tablename)
"""
sqlglot.parse(sql_str)
ParseError: Invalid expression / Unexpected token. Line 3, Col: 1.
Hi Toby,
As far as I know, sqlglot doesn't support parsing the UPDATE statements yet, do you have a recent plan to support it?
Or, if you don't have such a plan, I would like to send an MR to support this.
I've run into some create table statements that throw parser errors:
The error is basically always Error: Expecting )
This doesn't seem to be supported; the parser complains:
TRUNCATE TABLE mytable;
Reference: https://dev.mysql.com/doc/refman/8.0/en/truncate-table.html
OPTIMIZE TABLE doesn't even seem optional: https://dev.mysql.com/doc/refman/8.0/en/optimize-table.html
From the README:
It is currently the fastest Python SQL parser.
I have python bindings to sqlparser-rs, would you be open to adding these to the benchmark suite? I went ahead and added your library to my benchmarks and it indeed compares favorably to the other pure-python implementations.
Not to take away from your awesome project, it does a lot of other cool things that simple bindings cannot!
It seems a backslash in an ESCAPE clause causes an error.
Presto SQL
SELECT TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, DATA_TYPE,
TYPE_NAME, COLUMN_SIZE, BUFFER_LENGTH, DECIMAL_DIGITS, NUM_PREC_RADIX,
NULLABLE, REMARKS, COLUMN_DEF, SQL_DATA_TYPE, SQL_DATETIME_SUB,
CHAR_OCTET_LENGTH, ORDINAL_POSITION, IS_NULLABLE,
SCOPE_CATALOG, SCOPE_SCHEMA, SCOPE_TABLE,
SOURCE_DATA_TYPE, IS_AUTOINCREMENT, IS_GENERATEDCOLUMN
FROM system.jdbc.columns
WHERE TABLE_CAT = 'hive' AND TABLE_SCHEM LIKE 'stage' ESCAPE '\' AND TABLE_NAME LIKE '%' ESCAPE '\' AND COLUMN_NAME LIKE '%' ESCAPE '\'
ORDER BY TABLE_CAT, TABLE_SCHEM, TABLE_NAME, ORDINAL_POSITION
EXCEPTION
sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 8, Col: 55.
, IS_GENERATEDCOLUMN
FROM system.jdbc.columns
WHERE TABLE_CAT = 'hive' AND TABLE_SCHEM LIKE 'stage' ESCAPE '' AND TABLE_NAME LIKE '%' ESCAPE '' AND COLUMN_NAME LIKE '%' ESCAPE ''
Hello,
This code:
import sqlglot
sqlglot.transpile("select extract(month from '2021-01-31'::timestamp without time zone)", read='postgres', pretty=True)
raises an error:
ParseError: Expected ). Line 0, Col: 26.
select LPAD(extract(month from '2021-01-31'::timestamp without time zone)::text,
The query works without any issue in Postgresql.
Here is the complete traceback:
tb_sqlglot.txt
Thanks.
Best,
Now I get ParseError: Invalid expression / Unexpected token.
Code to reproduce:
sql = '''create table test (
a int,
b string
) partitioned by (dt string)
stored as parquet'''
sqlglot.transpile(sql, read='hive')
Error message:
ParseError: Invalid expression / Unexpected token. Line 4, Col: 3.
create table test (
a int,
b string
) partitioned by (dt string)
stored as parquet
Hi Toby,
I'm using sqlglot to parse the SQL syntax from Hive 3.0. There are some new keywords and new syntax, I have to change the lexer and parser somehow, but those keywords and syntax are not supported in MySQL or PostgreSQL at all.
So, I got this idea: what about moving the tasks of parsing those uncommon and special syntaxes into the specific dialects? We may need to change the interface a little; for example, users could call sqlglot.parse(code, dialect='hive-3') to indicate which dialect to use. Of course, if the user doesn't specify any dialect, we keep using the existing logic.
I would like to do this refactoring for sqlglot if you don't mind, or if you think this change violates the philosophy of sqlglot, I may create a new project based on sqlglot.
Please let me know if you think it's OK, thanks.
Trino and presto both support TIMESTAMP WITH TIME ZONE types as shown in the docs here.
trino> SELECT CAST('2021-01-01 18:00:00' AS TIMESTAMP(3) WITH TIME ZONE);
_col0
------------------------------------------
2021-01-01 18:00:00.000 America/New_York
I tried the following to parse the expression:
>>> p = Presto()
>>> p.parse("SELECT CAST('2021-01-01 18:00:00' AS TIMESTAMP(3) WITH TIME ZONE)")
Traceback (most recent call last):
...
sqlglot.errors.ParseError: Expected ). Line 1, Col: 51.
SELECT CAST('2021-01-01 18:00:00' AS TIMESTAMP(3) WITH TIME ZONE)
(WITH is underlined above)
sql = """
transform(array('a','b','c'), X -> upper(X))
"""
import sqlglot
sqlglot.transpile(sql)
> ["TRANSFORM(ARRAY('a', 'b', 'c'), (X) -> UPPER(X))"]
I discovered this as it was breaking some pyspark queries. Minimal pyspark example below:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
from pyspark.sql import Row
data_list = [
{"array_col": ['a', 'b']},
]
df = spark.createDataFrame(Row(**x) for x in data_list)
df.createOrReplaceTempView("df")
import sqlglot
sql = "select transform(array_col, x -> x) as transformed from df"
spark.sql(sql).toPandas()
sql = sqlglot.transpile(sql, read='spark', write='spark')
spark.sql(sql).toPandas()
Will submit a PR to handle x -> f(x) and (x, y, z) -> f(x, y, z) syntax, updating identity.sql accordingly. Tested on 2.3.0.
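A minimal sketch of the rendering rule the fix needs (a hypothetical helper, not sqlglot's actual generator code): emit parentheses around the parameter list only when the lambda has more than one parameter, since Spark rejects the parenthesized single-parameter form shown above.

```python
# Render a SQL lambda: one parameter bare, several in parentheses.
def lambda_sql(params, body):
    head = params[0] if len(params) == 1 else "(" + ", ".join(params) + ")"
    return f"{head} -> {body}"

assert lambda_sql(["x"], "f(x)") == "x -> f(x)"
assert lambda_sql(["x", "y", "z"], "f(x, y, z)") == "(x, y, z) -> f(x, y, z)"
```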
INSERT INTO test.table2
with base as (
SELECT * FROM table3
)
SELECT * FROM BASE
Hi Toby,
This should work in Presto
Thanks.
Thanks for sharing this great parser.
I would like to report a bug using Presto SQL.
The query below should work in Presto, but the parser produces an error; tested on sqlglot==2.2.7:
WITH temp_1 as (SELECT 1 FROM test2.table2),
temp_2 as (SELECT 2 FROM test2.table3)
SELECT * FROM temp_1
UNION ALL
SELECT * FROM temp_2
error
sqlglot.errors.ParseError: union does not support CTE. Line 7, Col: 19.
temp_2 as (select 2 from test2.table3)
select * from temp_1
union all
select * from temp_2
The following SQL works in Presto but produces a parse error:
CREATE TABLE schema1.table1 AS
(
WITH base AS (
SELECT *
,row_number() over(partition by col1 order by col2 asc) AS test
FROM test.table2
)
SELECT * FROM base
)
ERROR:
sqlglot.errors.ParseError: Expecting ). Line 4, Col: 9.
CREATE TABLE schema1.table1 AS
(
WITH base AS (
SELECT *
,row_number() over(partition by col1 order by co
If you remove the parentheses, it works:
CREATE TABLE schema1.table1 AS
--(
WITH base AS (
SELECT *
,row_number() over(partition by col1 order by col2 asc) AS test
FROM test.table2
)
SELECT * FROM base
--)
$ sqlite3
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> BEGIN IMMEDIATE TRANSACTION;
sqlite> COMMIT;
sqlite> ^D
$ python -c 'from sqlglot import parse; parse("BEGIN IMMEDIATE TRANSACTION", read="sqlite")'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/__init__.py", line 36, in parse
return dialect.parse(sql, **opts)
File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/dialects.py", line 60, in parse
return self.parser(**opts).parse(self.tokenizer().tokenize(sql), sql)
File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/parser.py", line 257, in parse
return self._parse(
File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/parser.py", line 308, in _parse
self.check_errors()
File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/parser.py", line 315, in check_errors
raise error
sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 1, Col: 17.
BEGIN IMMEDIATE TRANSACTION
Trino and Presto are now different projects and are likely to diverge in subsequent development: https://trino.io/blog/2020/12/27/announcing-trino.html. Is it possible to add Trino as a dialect? Since Trino's development cycle appears to be much faster (they are hosted by a startup called Starburst) compared to Facebook's, we should either copy the parser rules from Presto to Trino and carry forward with Trino, or inherit from Presto. I think with the way the Dialect classes are designed, it makes more sense to do a copy.
INSERT INTO table2 (column1, column2, column3) SELECT column1, column2, column3 FROM table1 WHERE condition
transpiles to
INSERT INTO TABLE2(column1, column2, column3) SELECT column1, column2, column3 FROM table1 WHERE condition
Note the uppercase TABLE2 and the "missing" space.
I would expect the transpiled SQL to look the same as the initial string.
Comparing to INSERT INTO w/o columns:
INSERT INTO table2 SELECT column1, column2, column3 FROM table1 WHERE condition
which parses "table2" as IDENTIFIER:
(INSERT this:
(TABLE this:
(IDENTIFIER this: table2, quoted: False)), expression:
(SELECT expressions:
...
but with columns it parses "table2" as ANONYMOUS:
(INSERT this:
(TABLE this:
(ANONYMOUS this: table2, expressions:
(COLUMN this:
(IDENTIFIER this: column1, quoted: False)),
(COLUMN this:
(IDENTIFIER this: column2, quoted: False)),
(COLUMN this:
(IDENTIFIER this: column3, quoted: False)))), expression:
(SELECT expressions:
...
It would be great if sqlglot could "flatten" a nested SQL query such that the only nested subqueries are operands of the FROM clause and are independent of each other. This mostly involves taking care of subqueries inside the WHERE clause.
There are a couple of cases here. The first case is that the subquery is independent of the outer query. In this case, the subquery can simply be evaluated as a separate SQL query and its result used in the original query.
The second case is a bit more complicated, and occurs when the subquery depends on the outer query. For example in TPCH 2 and 4: https://github.com/marsupialtail/nest-query-rewrites. In this case we need to also reflect the dependency between the outer query and the nested query in the rewrite.
I think taking care of the first case will go a long way already.
Presto and Hive have different time formats; these are not transpiled when converting.
I have been looking all over trying to find docs, and reading the docstrings, but is there a place to ask for support on this? I'm trying to use the .find method for an expression and can't get it to work. I am not putting in the right expression type to find, but I can't seem to find out what the options are. Where am I not looking?
Sample:
test = "Select *, a.ID from `table_name` as b"
parsed_output = sqlglot.parse(test, read='mysql')
print(type(parsed_output))
print(type(parsed_output[0]))
print(parsed_output[0].find(('Select',)))
Error:
File ~\AppData\Roaming\Python\Python38\site-packages\sqlglot\expressions.py:119, in Expression.find_all(self, *expression_types)
    108 """
    109 Returns a generator object which visits all nodes in this tree and only
    110 yields those that match at least one of the specified expression types.
    (...)
    116 the generator object.
    117 """
    118 for expression, _, _ in self.walk():
--> 119     if isinstance(expression, expression_types):
    120         yield expression
TypeError: isinstance() arg 2 must be a type or tuple of types
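A minimal stand-in (plain Python, no sqlglot) shows why passing the string 'Select' fails: the traceback above shows find_all() forwards its arguments straight to isinstance(), which needs expression classes, not strings.

```python
# Dummy classes standing in for sqlglot expression types.
class Select: ...
class Column: ...

def find_all(nodes, *expression_types):
    # mirrors the line in the traceback: isinstance(node, expression_types)
    return [n for n in nodes if isinstance(n, expression_types)]

nodes = [Select(), Column(), Column()]
assert len(find_all(nodes, Select)) == 1          # works: a class
assert len(find_all(nodes, Select, Column)) == 3  # works: several classes
# find_all(nodes, "Select") would raise the same TypeError as above
```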
Hi Toby Mao,
In the sqlglot latest release (2.1.1) the Oracle SQL Analytic Function LISTAGG is not supported.
For example with the following SQL Query that works perfectly when executed in ORACLE SQL Developer :
SQL_Query_ = "SELECT CLUSTERNAME, DATETYPE, MIN(CLUSTERDATE) CLUSTERDATE, LISTAGG(APPLICATION, '|') WITHIN GROUP (ORDER BY CLUSTERDATE) AS APPLICATION FROM TYPOLOGY_CLUSTERDATE_T_L WHERE DATETYPE = 'ACCEPTANCE DATE' GROUP BY CLUSTERNAME, DATETYPE"
The result of sqlglot parsing is:
sqlglot.parse_one(SQL_Query_)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\__init__.py", line 44, in parse_one
return parse(sql, read=read, **opts)[0]
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\__init__.py", line 27, in parse
return dialect.parse(sql, **opts)
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\dialects.py", line 60, in parse
return self.parser(**opts).parse(self.tokenizer().tokenize(sql), sql)
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\parser.py", line 258, in parse
return self._parse(
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\parser.py", line 309, in _parse
self.check_errors()
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\parser.py", line 316, in check_errors
raise error
sqlglot.errors.ParseError: Expecting BY. Line 1, Col: 100.
SELECT CLUSTERNAME, DATETYPE, MIN(CLUSTERDATE) CLUSTERDATE, LISTAGG(APPLICATION, '|') WITHIN GROUP (ORDER BY CLUSTERDATE) AS APPLICATION FROM TYPOLOGY_CLUSTERDATE_T_L WHERE DATETYPE = 'ACCEPTANCE DATE
It seems that after the keywords WITHIN GROUP, sqlglot is expecting the keyword BY, but that is not applicable in this case.
Could you solve this in the earliest possible release?
Thanks in advance.
Best Regards,
F. Moreira
Getting an error trying to parse the following query
q = """
SELECT zoo, animal
FROM (VALUES
('OaklandZoo', ARRAY['dog', 'cat', 'tiger']),
('SanFranciscoZoo', ARRAY['dog', 'cat'])
) AS t(zoo,animals)
"""
sqlglot.transpile(q, 'presto', 'spark')
ParseError: Invalid expression / Unexpected token. Line 5, Col: 7.
'OaklandZoo', ARRAY['dog', 'cat', 'tiger']),
('SanFranciscoZoo', ARRAY['dog', 'cat'])
) AS t(zoo,animals)
Token List
[<Token token_type: TokenType.SELECT, text: SELECT, line: 1, col: 1>,
...,
<Token token_type: TokenType.FROM, text: FROM, line: 2, col: 1>,
<Token token_type: TokenType.L_PAREN, text: (, line: 2, col: 6>,
<Token token_type: TokenType.VALUES, text: VALUES, line: 2, col: 7>,
<Token token_type: TokenType.L_PAREN, text: (, line: 3, col: 9>,
...,
<Token token_type: TokenType.R_PAREN, text: ), line: 5, col: 1>,
<Token token_type: TokenType.ALIAS, text: AS, line: 5, col: 3>,
<Token token_type: TokenType.VAR, text: t, line: 5, col: 6>,
**<Token token_type: TokenType.L_PAREN, text: (, line: 5, col: 7>,**
<Token token_type: TokenType.VAR, text: zoo, line: 5, col: 8>,
<Token token_type: TokenType.COMMA, text: ,, line: 5, col: 11>,
<Token token_type: TokenType.VAR, text: animals, line: 5, col: 12>,
<Token token_type: TokenType.R_PAREN, text: ), line: 5, col: 19>]
Hi Toby,
I'm trying to parse "CREATE OR REPLACE TEMPORARY VIEW ...", but I noticed that in generator.py you wrote CREATE{temporary}{replace}. I think that's a typo, because then the generated SQL will be "CREATE TEMPORARY OR REPLACE ...".
And in parser.py, you wrote
def _parse_create(self):
temporary = self._match(TokenType.TEMPORARY)
replace = self._match(TokenType.OR) and self._match(TokenType.REPLACE)
I guess it should be
def _parse_create(self):
replace = self._match(TokenType.OR) and self._match(TokenType.REPLACE)
temporary = self._match(TokenType.TEMPORARY)
Please let me know if I understand your code correctly.
Hey @tobymao! This is excellent work and I'm pretty sure it will help lots of people in their project.
sqlglot will for sure help us in https://github.com/MariaDB/mariadb_kernel.
We are trying to build an autocompletion and introspection feature (as part of a GSoC project), and the ability to get the parsed expression and the type of each token in the expression is exactly what we were looking for (and would eventually have implemented ourselves if no reasonable solution were found in the wild).
My question for you is: is it very complicated to make sqlglot able to parse and generate a partial expression tree for incomplete SQL statements?
Right now, if you execute parse("select a from t1 where"), you get an exception. If you execute parse("select a from"), you get a partial expression tree, which is fantastic.
Is this a bug in sqlglot, or is it designed to deal with incomplete statements? Or does the latter statement above work just by accident?
Thanks again for this great project!
Currently, the behaviour of transpile is such that the given input is mapped as follows:
Versions v1.16.0 and v1.22.0 tested.
sqlglot.transpile(
...,
read="trino",
write="trino",
identity=False,
pretty=True,
)
input
CASE col
WHEN val_a THEN mappend_val_a
ELSE mapped_val_general
END
output
IF(val_a, ...
IF(val_a, mappend_val_a, mapped_val_general)
Clearly, the comparison condition is not being included, only the value. It should be:
output
IF(col_a = val_a, ...
IF(col_a = val_a, mappend_val_a, mapped_val_general)
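The simple-CASE form compares its operand against each WHEN value, so the operand must appear in the lowered condition; a sketch with a hypothetical helper (not sqlglot's generator):

```python
# Lower `CASE <operand> WHEN <v> THEN <a> ELSE <b> END` to an IF().
def lower_simple_case(operand, when_value, then_value, else_value):
    return f"IF({operand} = {when_value}, {then_value}, {else_value})"

assert lower_simple_case("col", "val_a", "mappend_val_a", "mapped_val_general") == \
    "IF(col = val_a, mappend_val_a, mapped_val_general)"
```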
Quite a lot of the views/tables that I write use the optional parts, e.g.:
Presto
CREATE OR REPLACE VIEW {name} AS
message = 'Expected TABLE or View'
token = <Token token_type: TokenType.OR, text: OR, line: 1, col: 7, arg_key: None>
MySQL
CREATE TABLE IF NOT EXISTS {name} (
)
message = 'Invalid expression / Unexpected token'
token = <Token token_type: TokenType.L_PAREN, text: (, line: 1, col: 0, arg_key: None>
I believe both Presto and MySQL support these optional parts in the SQL syntax for tables and views.
Many thanks in advance!
It would be great to add support for TBLPROPERTIES in CREATE statements (Hive doc, Presto doc). This is useful for specifying things like table retention periods.
For simplicity, it could use an interface like Airflow's Hive query builder, where a dict of key/value pairs is just unrolled into the relevant syntax.
Seems like this would require changes to expressions.Create, Rewriter.ctas(), Generator.create_sql(), etc., and then the language-specific implementations. For Hive there's an explicit TBLPROPERTIES keyword in the CREATE statement; for Presto they're just added into the WITH clause.
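A sketch of the Airflow-style dict interface suggested above (function name and quoting are illustrative, not an existing API):

```python
# Unroll a dict of key/value pairs into Hive's TBLPROPERTIES clause.
def tblproperties(props):
    pairs = ", ".join(f"'{k}' = '{v}'" for k, v in props.items())
    return f"TBLPROPERTIES ({pairs})"

assert tblproperties({"retention": "30d"}) == "TBLPROPERTIES ('retention' = '30d')"
```

For Presto, the same dict would instead be merged into the WITH (...) property list.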
It seems CTE parsing is hard-coded to be followed by a SELECT statement, so parsing this SQL query fails:
WITH baz as (SELECT 1 as col1) UPDATE some_table SET cid = baz.col1 from baz;
Note: this is a valid Postgres query.
Error:
Traceback (most recent call last):
...
File "sqlglot\parser.py", line 263, in check_errors
raise error
sqlglot.errors.ParseError: Required keyword: 'this' missing for <class 'sqlglot.expressions.CTE'>. Line 3, Col: 1.
WITH baz as
(SELECT 1 as col1)
UPDATE some_table
SET cid = baz.col1
from baz;
I have tried to fix it myself by changing
def _parse_cte(self):
...
return self.expression(
exp.CTE,
this=self._parse_select(),
expressions=expressions,
recursive=recursive,
)
to
def _parse_cte(self):
...
return self.expression(
exp.CTE,
this=self._parse_statement(), # parse based on keyword ?!
expressions=expressions,
recursive=recursive,
)
But that only resulted in
File "sqlglot\parser.py", line 263, in check_errors
raise error
sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 5, Col: 1.
parser.py:_parse_range throws the error message 'Expected ) after IN'.
Minimal reproducible example:
from sqlglot import transpile
test_sql = "SELECT column FROM db.tbl WHERE column IN (SELECT column FROM otherdb.othertbl)"
transpile(test_sql, read="spark")
sqlglot.errors.ParseError: Expected ) after IN. Line 0, Col: 50.
On the following
WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
the parser fails with the following error:
~/.local/lib/python3.9/site-packages/sqlglot/parser.py in _parse_cte(self)
354
355 if not self._match(TokenType.ALIAS):
--> 356 self.raise_error('Expected AS after WITH')
357
358 expressions.append(self._parse_table(alias=alias))
~/.local/lib/python3.9/site-packages/sqlglot/parser.py in raise_error(self, message, token)
202 )
203 if self.error_level == ErrorLevel.RAISE:
--> 204 raise self.error
205 if self.error_level == ErrorLevel.WARN:
206 logging.error(self.error)
ParseError: Expected AS after WITH. Line 0, Col: 15.
WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT n+1
CROSS JOIN UNNEST is currently not supported when translating from Presto to Hive, e.g.:
sqlglot.transpile("""
SELECT
col1
, col2
, item
FROM
my_db.my_table
CROSS JOIN UNNEST(items) t (item)
""", read='presto', write='hive')
returns the query unmodified, but it should be
SELECT
col1
, col2
, item
FROM
my_db.my_table
LATERAL VIEW EXPLODE(items) t AS item
It would be great to add support for this expression!
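A string-level sketch of the rewrite described above (a real fix belongs in the parser/generator; the regex only covers this exact shape and is illustrative):

```python
import re

# Presto:  CROSS JOIN UNNEST(col) t (item)
# Hive:    LATERAL VIEW EXPLODE(col) t AS item
UNNEST_RE = re.compile(r"CROSS JOIN UNNEST\((\w+)\)\s+(\w+)\s*\((\w+)\)", re.IGNORECASE)

def unnest_to_lateral_view(sql):
    return UNNEST_RE.sub(r"LATERAL VIEW EXPLODE(\1) \2 AS \3", sql)

assert unnest_to_lateral_view("CROSS JOIN UNNEST(items) t (item)") == \
    "LATERAL VIEW EXPLODE(items) t AS item"
```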
Hey, amazing work on this library so far.
Are there plans to support "::" style type casts, as seen in postgres & redshift?
https://www.postgresqltutorial.com/postgresql-cast/
Here are some example scenarios I've tested in the 1.2.0 release.
import sqlglot
sqlglot.transpile("select foo::INTEGER from bar", read='postgres')
results in:
['SELECT CAST(foo AS ) AS INT FROM bar']
where I would expect:
['SELECT CAST(foo AS INT) FROM bar']
Another edge-case is combining the cast with an alias (which is what sqlglot currently interprets the :: as):
import sqlglot
sqlglot.transpile("select foo::INTEGER as baz from bar", read='postgres')
results in:
ParseError: Invalid expression / Unexpected token. Line 0, Col: 20.
select foo::INTEGER as baz from bar
where I would expect:
['SELECT CAST(foo AS INT) AS baz FROM bar']
Thanks!
Hive uses 0-based indexing, while Presto uses 1-based indexing; however, in the following example sqlglot doesn't properly translate between the two:
print(sqlglot.transpile("""
SELECT
SPLIT(str_col_with_space, ' ')[0]
FROM
my_db.my_table
""", read='hive', write='presto', pretty=True)[0])
returns the query unchanged, but it should be
SELECT
SPLIT(str_col_with_space, ' ')[1]
FROM
my_db.my_table
and conversely
print(sqlglot.transpile("""
SELECT
SPLIT(str_col_with_space, ' ')[1]
FROM
my_db.my_table
""", read='presto', write='hive', pretty=True)[0])
returns the query unchanged, but it should be
SELECT
SPLIT(str_col_with_space, ' ')[0]
FROM
my_db.my_table
This is particularly dangerous when going from presto --> hive, as the wrongly translated code will always still be syntactically correct.
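The shift itself is tiny; a sketch with the per-dialect base index (the table is an assumption drawn from the respective docs):

```python
# First array index per dialect: Hive/Spark are 0-based, Presto/Trino 1-based.
BASE = {"hive": 0, "spark": 0, "presto": 1, "trino": 1}

def shift_index(i, read, write):
    # Translate a literal subscript between dialects.
    return i - BASE[read] + BASE[write]

assert shift_index(0, "hive", "presto") == 1
assert shift_index(1, "presto", "hive") == 0
```

Only literal integer subscripts can be shifted this way; a computed index would need the arithmetic emitted into the generated SQL.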
sqlglot does not yet support Spark SQL queries that include literals specified with a postfix character to indicate the datatype.
I'd be open to trying to write a PR to implement this functionality. If you'd like me to attempt this, it'd be useful to have some high level pointers about how you think it would be best to implement. (I've not written a tokeniser or parser before, but stepping through your code, I think I have a reasonable understanding of how it works).
See here for the Spark language reference. For example, a postfix of 'Y' is a shorthand way of indicating a TINYINT literal, so you can write 2Y rather than cast(2 as tinyint):
import sqlglot
sql = '2Y'
sqlglot.parse_one(sql, read="spark").sql(dialect="spark")
Results in:
'2 as Y'
When it should result in either CAST(2 AS TINYINT), CAST(2 AS BYTE), or 2Y.
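One possible approach, sketched at the string level (the suffix table is an assumption drawn from the Spark docs linked above; a real implementation would live in the tokenizer):

```python
import re

# Spark typed-literal suffixes mapped to explicit cast targets.
SUFFIXES = {"Y": "TINYINT", "S": "SMALLINT", "L": "BIGINT", "F": "FLOAT", "D": "DOUBLE"}

def expand_typed_literal(token):
    m = re.fullmatch(r"(\d+(?:\.\d+)?)([YSLFD])", token, re.IGNORECASE)
    if not m:
        return token  # not a suffixed literal, leave untouched
    return f"CAST({m.group(1)} AS {SUFFIXES[m.group(2).upper()]})"

assert expand_typed_literal("2Y") == "CAST(2 AS TINYINT)"
assert expand_typed_literal("42") == "42"
```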
To give an example of how this can affect the result:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark_session = SparkSession(sc)
sql = "select 2Y"
print(spark_session.sql(sql).schema)
sql_parsed = sqlglot.parse_one(sql, read="spark").sql(dialect="spark")
print(spark_session.sql(sql_parsed).schema)
Results in:
StructType(List(StructField(2,ByteType,false)))
StructType(List(StructField(Y,IntegerType,false)))
I've also looked in the Spark codebase. I believe this may be the code to parse the literal syntax, and the syntax is also referenced here, as well as in various parts of the test suite like here.
I've done some research to understand how widespread the postfix syntax is, and to get a sense of what approach may be best for tokenising/parsing these queries. I recognise some of these flavours are not supported by sqlglot. The intention is to provide some examples that may help decide the most sensible approach to tokenising/parsing.
I don't think DuckDB supports postfixes. In fact, it parses postfixes in a surprising way;
DuckDB evaluates syntax like:
select 0.1d
to the table:
d |
---|
0.1 |
i.e. the result is the same as select 0.1 as d.
Note that the same query in Postgres gives the same result (a column named d with value 0.1), whereas MySQL, MS SQL Server, and SQLite all give you a syntax error.
In Hive, a postfix can be used, much like in Spark; documented here.
In Postgres, the :: operator is a 'PostgreSQL-style typecast'. This means that you can write things like select 123::smallint.
More info in the docs.
In Oracle it is possible to use postfixes, see here
In SQLite, postfixes are not allowed:
import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("select 2Y")
Thanks so much for the library. I'm using it as part of splink, a FOSS record linkage tool. At the moment Splink uses Spark as a backend, but I'd like it to be able to support both DuckDB and Spark; sqlglot looks pretty ideal for this purpose!