tobymao / sqlglot
Python SQL Parser and Transpiler
Home Page: https://sqlglot.com/
License: MIT License
At the moment we get an unexpected token error when we write a query with a fully qualified name like catalog.schema.table.
See https://clickhouse.tech/docs/en/
ClickHouse is a blazing-fast vectorized columnar MPP database.
raw = '''
CREATE TABLE `t_customer_account` (
`id` int,
`customer_id` int,
`bank` varchar(100),
`account_no` varchar(100)
);
'''
parsed = sqlglot.parse(raw, sqlglot.dialects.MySQL)
I got this error:
ParseError: Expected ). Line 2, Col: 7.
CREATE TABLE `t_customer_account` (
`id` int,
`customer_id` int,
`bank` varchar(100),
`a
sqlglot version: v1.3.2
There is no token for ILIKE in sqlglot.
https://postgrespro.com/docs/postgresql/9.6/functions-matching#functions-like
The key word ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale. This is not in the SQL standard but is a PostgreSQL extension.
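The quoted behaviour can be illustrated in plain Python; below is a minimal sketch of LIKE vs ILIKE pattern semantics (the helper and its regex translation are illustrative, not sqlglot API):

```python
import re

# ILIKE is LIKE with the comparison made case-insensitive.
def sql_like(value: str, pattern: str, case_insensitive: bool = False) -> bool:
    # translate SQL wildcards: % -> .*, _ -> .
    regex = re.escape(pattern).replace("%", ".*").replace("_", ".")
    flags = re.IGNORECASE if case_insensitive else 0
    return re.fullmatch(regex, value, flags) is not None

assert sql_like("Alice", "a%", case_insensitive=True)  # ILIKE matches
assert not sql_like("Alice", "a%")                     # plain LIKE does not
```

A dialect without native ILIKE could lower `a ILIKE b` to `LOWER(a) LIKE LOWER(b)` with the same semantics.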
Hello
Thank you for the very nice tool! I want to use it to format Presto SQL in the style below, which differs from the regular style.
Would it be possible for me to send you a style guide so you could add this format?
WITH
SOME_CTE AS (
SELECT DISTINCT
id AS my_id,
otherField1 AS my_other_field_1,
otherField2 AS my_other_field_2
FROM
prod.my_other_data
WHERE
id IN (1, 2, 3)
OR
(
otherField1 = 'value1'
AND
otherField2 = 'value2'
)
),
FINAL AS (
SELECT
id_field,
field_1 AS detailed_field_1,
field_2 AS detailed_field_2,
detailed_field_3,
CASE
WHEN
cancellation_date IS NULL AND expiration_date IS NOT NULL
THEN
expiration_date
WHEN
cancellation_date IS NULL
THEN
start_date + 7
ELSE
cancellation_date
END AS cancellation_date,
rank() OVER (
PARTITION BY id_field
ORDER BY cancellation_date
) AS previous_detailed_field_3,
SUM(field_4) AS field_4_sum,
MAX(field_5) AS field_5_max
FROM
(
SELECT
*
FROM
prod.my_data
WHERE
filter = 'my_filter'
) AS A
LEFT JOIN
SOME_CTE
ON
A.id_field = SOME_CTE.my_id
WHERE
reduce(
field_1,
0,
(x, s) -> s + x,
s -> s
) > 100
GROUP BY
1,
2,
3,
4,
5
HAVING
COUNT(*) > 5
ORDER BY
4 DESC
)
SELECT
*
FROM
FINAL
UNION is currently supported, but it would be useful if INTERSECT and EXCEPT were also supported.
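For reference, the three set operators correspond to plain set algebra over result rows; a tiny Python illustration (assuming DISTINCT/set semantics, so UNION ALL is out of scope):

```python
# Rows modeled as tuples in sets.
a = {(1,), (2,), (3,)}
b = {(2,), (3,), (4,)}

assert a | b == {(1,), (2,), (3,), (4,)}  # UNION
assert a & b == {(2,), (3,)}              # INTERSECT
assert a - b == {(1,)}                    # EXCEPT
```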
The following query
select sub.year, sub.genre, count(id)
from (
select m.movieId as id,
int(regexp_extract(m.title, '\\s*\\((\\d{4})\\)\\s*"?$', 1)) as year,
explode(split(m.genres, '\\|')) as genre
from movies m
) sub
where genre <> '(no genres listed)'
and year IS NOT NULL
group by sub.year, sub.genre
order by sub.year DESC, sub.genre
is being incorrectly translated to Spark.
I get the following error:
# version: sqlglot 1.26.1
from sqlglot import parse_one
parse_one("SELECT * FROM demo where exists (select * from demo)")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/__init__.py", line 44, in parse_one
return parse(code, read=read, **opts)[0]
File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/__init__.py", line 27, in parse
return dialect.parse(code, **opts)
File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/dialects.py", line 62, in parse
return self.parser(**opts).parse(self.tokenizer().tokenize(code), code)
File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/parser.py", line 255, in parse
self.check_errors()
File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/parser.py", line 263, in check_errors
raise error
sqlglot.errors.ParseError: Expecting ). Line 1, Col: 34.
SELECT * FROM demo where exists (select * from demo)
See
sqlglot - Why is not able to parse correctly DELETE SQL Statement?
https://stackoverflow.com/q/71224391/14700552?sem=2
On the following query:
WITH regional_sales AS (
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
), top_regions AS (
SELECT region
FROM regional_sales
WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)
SELECT region,
product,
SUM(quantity) AS product_units,
SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;
execution fails with the following:
~/.local/lib/python3.9/site-packages/sqlglot/parser.py in _parse_primary(self)
650
651 if not self._match(TokenType.R_PAREN):
--> 652 self.raise_error('Expecting )', paren)
653 return exp.Paren(this=this)
654
~/.local/lib/python3.9/site-packages/sqlglot/parser.py in raise_error(self, message, token)
202 )
203 if self.error_level == ErrorLevel.RAISE:
--> 204 raise self.error
205 if self.error_level == ErrorLevel.WARN:
206 logging.error(self.error)
ParseError: Expecting ). Line 7, Col: 24.
n
FROM regional_sales
WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)
Working on a solution, should have something by tonight; I also need to handle complex types like map<> and struct.
It has been solved, but another SQL statement failed:
CREATE TABLE `t_customer_account` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`customer_id` int(11) DEFAULT NULL COMMENT 'customer id',
`bank` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT 'bank',
`account_no` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT 'account number',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin COMMENT='customer account table';
Error info:
ERROR:root:Expected ). Line 2, Col: 10.
CREATE TABLE `t_customer_account` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`customer_id` int(1
ERROR:root:Invalid expression / Unexpected token. Line 2, Col: 10.
CREATE TABLE `t_customer_account` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`customer_id` int(1
[(CREATE this:
(TABLE this:
(SCHEMA this: `t_customer_account`, expressions:
(COLUMNDEF this:
(COLUMN this: `id`, db: , table: ), kind: int, comment: )), db: ), kind: table, expression: , exists: , file_format: , temporary: )]
Originally posted by @exitNA in #20 (comment)
I believe this would be a bigger change than simply adding a new dialect, as new parsing is required for JSON extraction syntax.
Test Cases
SELECT foo:bar
SELECT foo:bar:baz::DATE
SELECT foo:bar[0]
SELECT foo:bar.baz
I'm wondering if it is possible to implement struct handling such that a query like this:
SELECT a.b.c from table_1
would parse as a struct field type.
This is the Syntax Tree Transformation example in the README:
import sqlglot
import sqlglot.expressions as exp
expression_tree = sqlglot.parse_one("SELECT a FROM x")
def transformer(node):
if isinstance(node, exp.Column) and node.args["this"].text == "a":
return sqlglot.parse_one("FUN(a)")
return node
transformed_tree = expression_tree.transform(transformer)
transformed_tree.sql()
When I run it, the transformed_tree isn't changed; the SQL is still SELECT a FROM x. I played with it a bit and found it worked to change the .text to .args["this"].
Hi Toby, I found a bug. Please review and fix it.
PRESTO SQL
CREATE TABLE schema2.table2
WITH (
bucketed_by = Array['Key'],
bucket_count = 64
)
AS
SELECT * FROM schema2.table2
sqlglot.errors.ParseError: Required keyword: 'value' missing for <class 'sqlglot.expressions.Property'>. Line 3, Col: 20.
tested on sqlglot==2.3.2
Hi
Nice lib, I found some minor issues:
This is ok:
sql_str = """
select A from dbo.tablename
union all
select B from dbo.tablename
"""
sqlglot.parse(sql_str)
This is not ok:
sql_str = """
(select A from dbo.tablename)
union all
(select B from dbo.tablename)
"""
sqlglot.parse(sql_str)
ParseError: Invalid expression / Unexpected token. Line 3, Col: 1.
Hi Toby,
As far as I know, sqlglot doesn't support parsing the UPDATE statements yet, do you have a recent plan to support it?
Or, if you don't have such a plan, I would like to send an MR to support this.
I've run into some create table statements that throw parser errors:
The error is basically always Error: Expecting )
This doesn't seem to be supported; the parser complains:
TRUNCATE TABLE mytable;
Reference: https://dev.mysql.com/doc/refman/8.0/en/truncate-table.html
OPTIMIZE TABLE doesn't even seem optional: https://dev.mysql.com/doc/refman/8.0/en/optimize-table.html
From the README:
It is currently the fastest Python SQL parser.
I have python bindings to sqlparser-rs, would you be open to adding these to the benchmark suite? I went ahead and added your library to my benchmarks and it indeed compares favorably to the other pure-python implementations.
Not to take away from your awesome project, it does a lot of other cool things that simple bindings cannot!
It seems a backslash in an ESCAPE clause causes an error.
Presto SQL
SELECT TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, DATA_TYPE,
TYPE_NAME, COLUMN_SIZE, BUFFER_LENGTH, DECIMAL_DIGITS, NUM_PREC_RADIX,
NULLABLE, REMARKS, COLUMN_DEF, SQL_DATA_TYPE, SQL_DATETIME_SUB,
CHAR_OCTET_LENGTH, ORDINAL_POSITION, IS_NULLABLE,
SCOPE_CATALOG, SCOPE_SCHEMA, SCOPE_TABLE,
SOURCE_DATA_TYPE, IS_AUTOINCREMENT, IS_GENERATEDCOLUMN
FROM system.jdbc.columns
WHERE TABLE_CAT = 'hive' AND TABLE_SCHEM LIKE 'stage' ESCAPE '\' AND TABLE_NAME LIKE '%' ESCAPE '\' AND COLUMN_NAME LIKE '%' ESCAPE '\'
ORDER BY TABLE_CAT, TABLE_SCHEM, TABLE_NAME, ORDINAL_POSITION
EXCEPTION
sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 8, Col: 55.
, IS_GENERATEDCOLUMN
FROM system.jdbc.columns
WHERE TABLE_CAT = 'hive' AND TABLE_SCHEM LIKE 'stage' ESCAPE '' AND TABLE_NAME LIKE '%' ESCAPE '' AND COLUMN_NAME LIKE '%' ESCAPE ''
Hello,
This code:
import sqlglot
sqlglot.transpile("select extract(month from '2021-01-31'::timestamp without time zone)", read='postgres', pretty=True)
raises an error:
ParseError: Expected ). Line 0, Col: 26.
select LPAD(extract(month from '2021-01-31'::timestamp without time zone)::text,
The query works without any issue in Postgresql.
Here is the complete traceback:
tb_sqlglot.txt
Thanks.
Best,
Now I get ParseError: Invalid expression / Unexpected token.
Code to reproduce:
sql = '''create table test (
a int,
b string
) partitioned by (dt string)
stored as parquet'''
sqlglot.transpile(sql, read='hive')
Error message:
ParseError: Invalid expression / Unexpected token. Line 4, Col: 3.
create table test (
a int,
b string
) partitioned by (dt string)
stored as parquet
Hi Toby,
I'm using sqlglot to parse the SQL syntax from Hive 3.0. There are some new keywords and new syntax, I have to change the lexer and parser somehow, but those keywords and syntax are not supported in MySQL or PostgreSQL at all.
So, I got this idea: what about moving the tasks of parsing those uncommon and special syntaxes into the specific dialects? We may need to change the interface a little; for example, users could call sqlglot.parse(code, dialect='hive-3') to indicate which dialect to use. Of course, if the user doesn't specify any dialect, we keep using the existing logic.
I would like to do this refactoring for sqlglot if you don't mind, or if you think this change violates the philosophy of sqlglot, I may create a new project based on sqlglot.
Please let me know if you think it's OK, thanks.
Trino and presto both support TIMESTAMP WITH TIME ZONE types as shown in the docs here.
trino> SELECT CAST('2021-01-01 18:00:00' AS TIMESTAMP(3) WITH TIME ZONE);
_col0
------------------------------------------
2021-01-01 18:00:00.000 America/New_York
I tried the following to parse the expression:
>>> p = Presto()
>>> p.parse("SELECT CAST('2021-01-01 18:00:00' AS TIMESTAMP(3) WITH TIME ZONE)")
Traceback (most recent call last):
...
sqlglot.errors.ParseError: Expected ). Line 1, Col: 51.
SELECT CAST('2021-01-01 18:00:00' AS TIMESTAMP(3) WITH TIME ZONE)
(WITH is underlined above)
sql = """
transform(array('a','b','c'), X -> upper(X))
"""
import sqlglot
sqlglot.transpile(sql)
> ["TRANSFORM(ARRAY('a', 'b', 'c'), (X) -> UPPER(X))"]
I discovered this as it was breaking some pyspark queries. Minimal pyspark example below:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
from pyspark.sql import Row
data_list = [
{"array_col": ['a', 'b']},
]
df = spark.createDataFrame(Row(**x) for x in data_list)
df.createOrReplaceTempView("df")
import sqlglot
sql = "select transform(array_col, x -> x) as transformed from df"
spark.sql(sql).toPandas()
sql = sqlglot.transpile(sql, read='spark', write='spark')
spark.sql(sql).toPandas()
Will submit a PR to handle x -> f(x) and (x, y, z) -> f(x, y, z) syntax, updating identity.sql accordingly. Tested on 2.3.0.
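A minimal sketch of the rendering rule the fix needs (a hypothetical helper, not sqlglot's actual generator code): emit parentheses around the parameter list only when the lambda has more than one parameter, since Spark rejects the parenthesized single-parameter form shown above.

```python
# Render a SQL lambda: one parameter bare, several in parentheses.
def lambda_sql(params, body):
    head = params[0] if len(params) == 1 else "(" + ", ".join(params) + ")"
    return f"{head} -> {body}"

assert lambda_sql(["x"], "f(x)") == "x -> f(x)"
assert lambda_sql(["x", "y", "z"], "f(x, y, z)") == "(x, y, z) -> f(x, y, z)"
```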
INSERT INTO test.table2
with base as (
SELECT * FROM table3
)
SELECT * FROM BASE
Hi Toby,
This should work in Presto
Thanks.
Thanks for sharing this great parser.
I would like to report a bug using Presto SQL.
The query below should work in Presto, but the parser produces an error; tested on sqlglot==2.2.7:
WITH temp_1 as (SELECT 1 FROM test2.table2),
temp_2 as (SELECT 2 FROM test2.table3)
SELECT * FROM temp_1
UNION ALL
SELECT * FROM temp_2
error
sqlglot.errors.ParseError: union does not support CTE. Line 7, Col: 19.
temp_2 as (select 2 from test2.table3)
select * from temp_1
union all
select * from temp_2
The following SQL works in Presto but produces a parse error:
CREATE TABLE schema1.table1 AS
(
WITH base AS (
SELECT *
,row_number() over(partition by col1 order by col2 asc) AS test
FROM test.table2
)
SELECT * FROM base
)
ERROR:
sqlglot.errors.ParseError: Expecting ). Line 4, Col: 9.
CREATE TABLE schema1.table1 AS
(
WITH base AS (
SELECT *
,row_number() over(partition by col1 order by co
If you remove the parentheses, it works:
CREATE TABLE schema1.table1 AS
--(
WITH base AS (
SELECT *
,row_number() over(partition by col1 order by col2 asc) AS test
FROM test.table2
)
SELECT * FROM base
--)
$ sqlite3
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> BEGIN IMMEDIATE TRANSACTION;
sqlite> COMMIT;
sqlite> ^D
$ python -c 'from sqlglot import parse; parse("BEGIN IMMEDIATE TRANSACTION", read="sqlite")'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/__init__.py", line 36, in parse
return dialect.parse(sql, **opts)
File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/dialects.py", line 60, in parse
return self.parser(**opts).parse(self.tokenizer().tokenize(sql), sql)
File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/parser.py", line 257, in parse
return self._parse(
File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/parser.py", line 308, in _parse
self.check_errors()
File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/parser.py", line 315, in check_errors
raise error
sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 1, Col: 17.
BEGIN IMMEDIATE TRANSACTION
Trino and Presto are now different projects and are likely to diverge in subsequent development: https://trino.io/blog/2020/12/27/announcing-trino.html. Is it possible to add Trino as a dialect? Since Trino's development cycle appears to be much faster (they are hosted by a startup called Starburst) compared to Facebook's, we should either copy the parser rules from Presto to Trino and carry forward with Trino, or inherit from Presto. I think with the way the Dialect classes are designed, it makes more sense to do a copy.
INSERT INTO table2 (column1, column2, column3) SELECT column1, column2, column3 FROM table1 WHERE condition
transpiles to
INSERT INTO TABLE2(column1, column2, column3) SELECT column1, column2, column3 FROM table1 WHERE condition
Note the uppercase TABLE2 and the "missing" space.
I would expect the transpiled SQL to look the same as the initial string.
Comparing to INSERT INTO w/o columns:
INSERT INTO table2 SELECT column1, column2, column3 FROM table1 WHERE condition
which parses "table2" as IDENTIFIER:
(INSERT this:
(TABLE this:
(IDENTIFIER this: table2, quoted: False)), expression:
(SELECT expressions:
...
but with columns it parses "table2" as ANONYMOUS:
(INSERT this:
(TABLE this:
(ANONYMOUS this: table2, expressions:
(COLUMN this:
(IDENTIFIER this: column1, quoted: False)),
(COLUMN this:
(IDENTIFIER this: column2, quoted: False)),
(COLUMN this:
(IDENTIFIER this: column3, quoted: False)))), expression:
(SELECT expressions:
...
It would be great if sqlglot could "flatten" a nested SQL query such that the only nested subqueries are operands of the FROM clause and are independent of each other. This mostly involves taking care of subqueries inside the WHERE clause.
There are a couple of cases here. The first case is that the subquery is independent of the outer query. In this case, the subquery can simply be evaluated as a separate SQL query and its result used in the original query.
The second case is a bit more complicated, and occurs when the subquery depends on the outer query. For example in TPCH 2 and 4: https://github.com/marsupialtail/nest-query-rewrites. In this case we need to also reflect the dependency between the outer query and the nested query in the rewrite.
I think taking care of the first case will go a long way already.
Presto and Hive have different time formats; these are not transpiled when converting.
I have been looking all over trying to find docs, and reading the docstrings, but is there a place to ask for support on this? I'm trying to use the .find method for an expression and can't get it to work. I am not putting in the right expression type to find, but I can't seem to find out what the options are. Where am I not looking?
Sample:
test = "Select *, a.ID from `table_name` as b"
parsed_output = sqlglot.parse(test, read='mysql')
print(type(parsed_output))
print(type(parsed_output[0]))
print(parsed_output[0].find(('Select',)))
Error:
File ~\AppData\Roaming\Python\Python38\site-packages\sqlglot\expressions.py:119, in Expression.find_all(self, *expression_types)
    108 """
    109 Returns a generator object which visits all nodes in this tree and only
    110 yields those that match at least one of the specified expression types.
    (...)
    116 the generator object.
    117 """
    118 for expression, _, _ in self.walk():
--> 119     if isinstance(expression, expression_types):
    120         yield expression
TypeError: isinstance() arg 2 must be a type or tuple of types
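A minimal stand-in (plain Python, no sqlglot) shows why passing the string 'Select' fails: the traceback above shows find_all() forwards its arguments straight to isinstance(), which needs expression classes, not strings.

```python
# Dummy classes standing in for sqlglot expression types.
class Select: ...
class Column: ...

def find_all(nodes, *expression_types):
    # mirrors the line in the traceback: isinstance(node, expression_types)
    return [n for n in nodes if isinstance(n, expression_types)]

nodes = [Select(), Column(), Column()]
assert len(find_all(nodes, Select)) == 1          # works: a class
assert len(find_all(nodes, Select, Column)) == 3  # works: several classes
# find_all(nodes, "Select") would raise the same TypeError as above
```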
Hi Toby Mao,
In the sqlglot latest release (2.1.1) the Oracle SQL Analytic Function LISTAGG is not supported.
For example with the following SQL Query that works perfectly when executed in ORACLE SQL Developer :
SQL_Query_ = "SELECT CLUSTERNAME, DATETYPE, MIN(CLUSTERDATE) CLUSTERDATE, LISTAGG(APPLICATION, '|') WITHIN GROUP (ORDER BY CLUSTERDATE) AS APPLICATION FROM TYPOLOGY_CLUSTERDATE_T_L WHERE DATETYPE = 'ACCEPTANCE DATE' GROUP BY CLUSTERNAME, DATETYPE"
The result of sqlglot parsing is:
sqlglot.parse_one(SQL_Query_)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\__init__.py", line 44, in parse_one
return parse(sql, read=read, **opts)[0]
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\__init__.py", line 27, in parse
return dialect.parse(sql, **opts)
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\dialects.py", line 60, in parse
return self.parser(**opts).parse(self.tokenizer().tokenize(sql), sql)
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\parser.py", line 258, in parse
return self._parse(
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\parser.py", line 309, in _parse
self.check_errors()
File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\parser.py", line 316, in check_errors
raise error
sqlglot.errors.ParseError: Expecting BY. Line 1, Col: 100.
SELECT CLUSTERNAME, DATETYPE, MIN(CLUSTERDATE) CLUSTERDATE, LISTAGG(APPLICATION, '|') WITHIN GROUP (ORDER BY CLUSTERDATE) AS APPLICATION FROM TYPOLOGY_CLUSTERDATE_T_L WHERE DATETYPE = 'ACCEPTANCE DATE
It seems that after the keywords WITHIN GROUP, sqlglot is expecting the keyword BY, but that is not applicable in this case.
Could you solve this in the earliest possible release?
Thanks in advance.
Best Regards,
F. Moreira
Getting an error trying to parse the following query
q = """
SELECT zoo, animal
FROM (VALUES
('OaklandZoo', ARRAY['dog', 'cat', 'tiger']),
('SanFranciscoZoo', ARRAY['dog', 'cat'])
) AS t(zoo,animals)
"""
sqlglot.transpile(q, 'presto', 'spark')
ParseError: Invalid expression / Unexpected token. Line 5, Col: 7.
'OaklandZoo', ARRAY['dog', 'cat', 'tiger']),
('SanFranciscoZoo', ARRAY['dog', 'cat'])
) AS t(zoo,animals)
Token List
[<Token token_type: TokenType.SELECT, text: SELECT, line: 1, col: 1>,
...,
<Token token_type: TokenType.FROM, text: FROM, line: 2, col: 1>,
<Token token_type: TokenType.L_PAREN, text: (, line: 2, col: 6>,
<Token token_type: TokenType.VALUES, text: VALUES, line: 2, col: 7>,
<Token token_type: TokenType.L_PAREN, text: (, line: 3, col: 9>,
...,
<Token token_type: TokenType.R_PAREN, text: ), line: 5, col: 1>,
<Token token_type: TokenType.ALIAS, text: AS, line: 5, col: 3>,
<Token token_type: TokenType.VAR, text: t, line: 5, col: 6>,
**<Token token_type: TokenType.L_PAREN, text: (, line: 5, col: 7>,**
<Token token_type: TokenType.VAR, text: zoo, line: 5, col: 8>,
<Token token_type: TokenType.COMMA, text: ,, line: 5, col: 11>,
<Token token_type: TokenType.VAR, text: animals, line: 5, col: 12>,
<Token token_type: TokenType.R_PAREN, text: ), line: 5, col: 19>]
Hi Toby,
I'm trying to parse "CREATE OR REPLACE TEMPORARY VIEW ...", but I noticed that in generator.py you wrote CREATE{temporary}{replace}. I think that's a typo, because then the generated SQL will be "CREATE TEMPORARY OR REPLACE ...".
And in parser.py, you wrote
def _parse_create(self):
temporary = self._match(TokenType.TEMPORARY)
replace = self._match(TokenType.OR) and self._match(TokenType.REPLACE)
I guess it should be
def _parse_create(self):
replace = self._match(TokenType.OR) and self._match(TokenType.REPLACE)
temporary = self._match(TokenType.TEMPORARY)
Please let me know if I understand your code correctly.
Hey @tobymao! This is excellent work and I'm pretty sure it will help lots of people in their project.
sqlglot will for sure help us in https://github.com/MariaDB/mariadb_kernel.
We are trying to build an autocompletion and introspection feature (as part of a GSoC project), and the ability to get the parsed expression and the type of each token in the expression is exactly what we were looking for (and would eventually have implemented ourselves if no reasonable solution were found in the wild).
My question for you is: is it very complicated to make sqlglot able to parse and generate a partial expression tree for incomplete SQL statements?
Right now, if you execute parse("select a from t1 where"), you get an exception. If you execute parse("select a from"), you get a partial expression tree, which is fantastic.
Is this a bug in sqlglot, or is it designed to deal with incomplete statements? Or does the latter statement above work just by accident?
Thanks again for this great project!
Currently, the behaviour of transpile is such that the given input is mapped as follows:
Versions v1.16.0 and v1.22.0 tested.
sqlglot.transpile(
...,
read="trino",
write="trino",
identity=False,
pretty=True,
)
input
CASE col
WHEN val_a THEN mappend_val_a
ELSE mapped_val_general
END
output
IF(val_a, ...
IF(val_a, mappend_val_a, mapped_val_general)
Clearly, the comparison condition is not being included, only the value. It should be:
output
IF(col_a = val_a, ...
IF(col_a = val_a, mappend_val_a, mapped_val_general)
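The simple-CASE form compares its operand against each WHEN value, so the operand must appear in the lowered condition; a sketch with a hypothetical helper (not sqlglot's generator):

```python
# Lower `CASE <operand> WHEN <v> THEN <a> ELSE <b> END` to an IF().
def lower_simple_case(operand, when_value, then_value, else_value):
    return f"IF({operand} = {when_value}, {then_value}, {else_value})"

assert lower_simple_case("col", "val_a", "mappend_val_a", "mapped_val_general") == \
    "IF(col = val_a, mappend_val_a, mapped_val_general)"
```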
Quite a lot of the views/tables that I write use the optional parts, e.g.:
Presto
CREATE OR REPLACE VIEW {name} AS
message = 'Expected TABLE or View'
token = <Token token_type: TokenType.OR, text: OR, line: 1, col: 7, arg_key: None>
MySQL
CREATE TABLE IF NOT EXISTS {name} (
)
message = 'Invalid expression / Unexpected token'
token = <Token token_type: TokenType.L_PAREN, text: (, line: 1, col: 0, arg_key: None>
I believe both Presto and MySQL support these optional parts in the SQL syntax for tables and views.
Many thanks in advance!
It would be great to add support for TBLPROPERTIES in CREATE statements (Hive doc, Presto doc). This is useful for specifying things like table retention periods.
For simplicity, it could use an interface like Airflow's Hive query builder, where a dict of key/value pairs is just unrolled into the relevant syntax.
Seems like this would require changes to expressions.Create, Rewriter.ctas(), Generator.create_sql(), etc., and then the language-specific implementations. For Hive there's an explicit TBLPROPERTIES keyword in the CREATE statement; for Presto they're just added into the WITH clause.
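A sketch of the Airflow-style dict interface suggested above (function name and quoting are illustrative, not an existing API):

```python
# Unroll a dict of key/value pairs into Hive's TBLPROPERTIES clause.
def tblproperties(props):
    pairs = ", ".join(f"'{k}' = '{v}'" for k, v in props.items())
    return f"TBLPROPERTIES ({pairs})"

assert tblproperties({"retention": "30d"}) == "TBLPROPERTIES ('retention' = '30d')"
```

For Presto, the same dict would instead be merged into the WITH (...) property list.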
It seems CTE parsing is hard-coded to be followed by a SELECT statement, so parsing this SQL query fails:
WITH baz as (SELECT 1 as col1) UPDATE some_table SET cid = baz.col1 from baz;
Note: this is a valid Postgres query.
Error:
Traceback (most recent call last):
...
File "sqlglot\parser.py", line 263, in check_errors
raise error
sqlglot.errors.ParseError: Required keyword: 'this' missing for <class 'sqlglot.expressions.CTE'>. Line 3, Col: 1.
WITH baz as
(SELECT 1 as col1)
UPDATE some_table
SET cid = baz.col1
from baz;
I have tried to fix it myself by changing
def _parse_cte(self):
...
return self.expression(
exp.CTE,
this=self._parse_select(),
expressions=expressions,
recursive=recursive,
)
to
def _parse_cte(self):
...
return self.expression(
exp.CTE,
this=self._parse_statement(), # parse based on keyword ?!
expressions=expressions,
recursive=recursive,
)
But that only resulted in
File "sqlglot\parser.py", line 263, in check_errors
raise error
sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 5, Col: 1.
parser.py:_parse_range throws the error message 'Expected ) after IN'.
Minimal reproducible example:
from sqlglot import transpile
test_sql = "SELECT column FROM db.tbl WHERE column IN (SELECT column FROM otherdb.othertbl)"
transpile(test_sql, read="spark")
sqlglot.errors.ParseError: Expected ) after IN. Line 0, Col: 50.
On the following
WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
the parser fails with the following error:
~/.local/lib/python3.9/site-packages/sqlglot/parser.py in _parse_cte(self)
354
355 if not self._match(TokenType.ALIAS):
--> 356 self.raise_error('Expected AS after WITH')
357
358 expressions.append(self._parse_table(alias=alias))
~/.local/lib/python3.9/site-packages/sqlglot/parser.py in raise_error(self, message, token)
202 )
203 if self.error_level == ErrorLevel.RAISE:
--> 204 raise self.error
205 if self.error_level == ErrorLevel.WARN:
206 logging.error(self.error)
ParseError: Expected AS after WITH. Line 0, Col: 15.
WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT n+1
CROSS JOIN UNNEST is currently not supported when translating from Presto to Hive, e.g.:
sqlglot.transpile("""
SELECT
col1
, col2
, item
FROM
my_db.my_table
CROSS JOIN UNNEST(items) t (item)
""", read='presto', write='hive')
returns the query unmodified, but it should be
SELECT
col1
, col2
, item
FROM
my_db.my_table
LATERAL VIEW EXPLODE(items) t AS item
It would be great to add support for this expression!
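A string-level sketch of the rewrite described above (a real fix belongs in the parser/generator; the regex only covers this exact shape and is illustrative):

```python
import re

# Presto:  CROSS JOIN UNNEST(col) t (item)
# Hive:    LATERAL VIEW EXPLODE(col) t AS item
UNNEST_RE = re.compile(r"CROSS JOIN UNNEST\((\w+)\)\s+(\w+)\s*\((\w+)\)", re.IGNORECASE)

def unnest_to_lateral_view(sql):
    return UNNEST_RE.sub(r"LATERAL VIEW EXPLODE(\1) \2 AS \3", sql)

assert unnest_to_lateral_view("CROSS JOIN UNNEST(items) t (item)") == \
    "LATERAL VIEW EXPLODE(items) t AS item"
```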
Hey, amazing work on this library so far.
Are there plans to support "::" style type casts, as seen in postgres & redshift?
https://www.postgresqltutorial.com/postgresql-cast/
Here are some example scenarios I've tested in the 1.2.0 release.
import sqlglot
sqlglot.transpile("select foo::INTEGER from bar", read='postgres')
results in:
['SELECT CAST(foo AS ) AS INT FROM bar']
where I would expect:
['SELECT CAST(foo AS INT) FROM bar']
Another edge-case is combining the cast with an alias (which is what sqlglot currently interprets the :: as):
import sqlglot
sqlglot.transpile("select foo::INTEGER as baz from bar", read='postgres')
results in:
ParseError: Invalid expression / Unexpected token. Line 0, Col: 20.
select foo::INTEGER as baz from bar
where I would expect:
['SELECT CAST(foo AS INT) AS baz FROM bar']
Thanks!
Hive uses 0-based indexing, while Presto uses 1-based indexing; however, in the following example sqlglot doesn't properly translate between the two:
print(sqlglot.transpile("""
SELECT
SPLIT(str_col_with_space, ' ')[0]
FROM
my_db.my_table
""", read='hive', write='presto', pretty=True)[0])
returns the query unchanged, but it should be
SELECT
SPLIT(str_col_with_space, ' ')[1]
FROM
my_db.my_table
and conversely
print(sqlglot.transpile("""
SELECT
SPLIT(str_col_with_space, ' ')[1]
FROM
my_db.my_table
""", read='presto', write='hive', pretty=True)[0])
returns the query unchanged, but it should be
SELECT
SPLIT(str_col_with_space, ' ')[0]
FROM
my_db.my_table
This is particularly dangerous when going from presto --> hive, as the wrongly translated code will always still be syntactically correct.
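The shift itself is tiny; a sketch with the per-dialect base index (the table is an assumption drawn from the respective docs):

```python
# First array index per dialect: Hive/Spark are 0-based, Presto/Trino 1-based.
BASE = {"hive": 0, "spark": 0, "presto": 1, "trino": 1}

def shift_index(i, read, write):
    # Translate a literal subscript between dialects.
    return i - BASE[read] + BASE[write]

assert shift_index(0, "hive", "presto") == 1
assert shift_index(1, "presto", "hive") == 0
```

Only literal integer subscripts can be shifted this way; a computed index would need the arithmetic emitted into the generated SQL.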
sqlglot does not yet support Spark SQL queries that include literals specified with a postfix character to indicate the datatype.
I'd be open to trying to write a PR to implement this functionality. If you'd like me to attempt this, it'd be useful to have some high level pointers about how you think it would be best to implement. (I've not written a tokeniser or parser before, but stepping through your code, I think I have a reasonable understanding of how it works).
See here for the Spark language reference. For example, a postfix of 'Y' is a shorthand way of indicating a TINYINT literal, so you can write 2Y rather than cast(2 as tinyint):
import sqlglot
sql = '2Y'
sqlglot.parse_one(sql, read="spark").sql(dialect="spark")
Results in:
'2 as Y'
When it should result in either CAST(2 AS TINYINT), CAST(2 AS BYTE), or 2Y.
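One possible approach, sketched at the string level (the suffix table is an assumption drawn from the Spark docs linked above; a real implementation would live in the tokenizer):

```python
import re

# Spark typed-literal suffixes mapped to explicit cast targets.
SUFFIXES = {"Y": "TINYINT", "S": "SMALLINT", "L": "BIGINT", "F": "FLOAT", "D": "DOUBLE"}

def expand_typed_literal(token):
    m = re.fullmatch(r"(\d+(?:\.\d+)?)([YSLFD])", token, re.IGNORECASE)
    if not m:
        return token  # not a suffixed literal, leave untouched
    return f"CAST({m.group(1)} AS {SUFFIXES[m.group(2).upper()]})"

assert expand_typed_literal("2Y") == "CAST(2 AS TINYINT)"
assert expand_typed_literal("42") == "42"
```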
To give an example of how this can affect the result:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark_session = SparkSession(sc)
sql = "select 2Y"
print(spark_session.sql(sql).schema)
sql_parsed = sqlglot.parse_one(sql, read="spark").sql(dialect="spark")
print(spark_session.sql(sql_parsed).schema)
Results in:
StructType(List(StructField(2,ByteType,false)))
StructType(List(StructField(Y,IntegerType,false)))
I've also looked in the Spark codebase. I believe this may be the code to parse the literal syntax, and the syntax is also referenced here, as well as in various parts of the test suite like here.
I've done some research to understand how widespread the postfix syntax is, and to get a sense of what approach may be best for tokenising/parsing these queries. I recognise some of these flavours are not supported by sqlglot. The intention is to provide some examples that may help decide the most sensible approach to tokenising/parsing.
I don't think DuckDB supports postfixes. In fact, it parses postfixes in a surprising way;
DuckDB evaluates syntax like:
select 0.1d
to the table:
d |
---|
0.1 |
i.e. the result is the same as select 0.1 as d.
Note that the same query in Postgres gives the same result (a column named d with value 0.1), whereas MySQL, MS SQL Server, and SQLite all give you a syntax error.
In Hive, a postfix can be used, much like in Spark; documented here.
In Postgres, the :: operator is a 'PostgreSQL-style typecast'. This means that you can write things like select 123::smallint.
More info in the docs.
In Oracle it is possible to use postfixes, see here
In SQLite, postfixes are not allowed:
import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("select 2Y")
Thanks so much for the library. I'm using it as part of splink, a FOSS record linkage tool. At the moment Splink uses Spark as a backend, but I'd like it to be able to support both DuckDB and Spark; sqlglot looks pretty ideal for this purpose!