reata / sqllineage Goto Github PK
View Code? Open in Web Editor NEWSQL Lineage Analysis Tool powered by Python
License: MIT License
SQL Lineage Analysis Tool powered by Python
License: MIT License
Currently sqllineage treat every sql as DML. When there are DDLs, lineage result is weird.
create table taba like tabb;
alter table taba rename to tabb;
drop table if exists taba;
Let's add black check in GitHub Actions so that it will be automatically checked in each push, along with the existing pro-commit hook.
Example Features:
Some possible package for research:
Things to consider when choosing our visualization package:
drop table if exists tab_a;
create table if not exists tab_a like tab_b;
drop before create, the result is tab_a exists after the above statements;
If we switch the order:
create table if not exists tab_a like tab_b;
drop table if exists tab_a;
Although the above statements make no sense, still, we should output result as tab_a does not exist.
Under current circumstances, both cases will be that tab_a does not exists since tab_a is identified as temp table. This is related to #23
We should form a mature development strategy including branches, milestone, pull requests, tags management, better be documented as a contributing guide
select * from where foo="bar"
This invalid sql causes sqllineage to raise a AssertionError. We could be better. Give user some hints about possible sql syntax error.
Upon dealing with this issue, we could design a dedicated exception for sqllineage.
including:
this inferface may refer to uber's queryparser
TravisCI inherently support multiple python versions. And for now, it doesn't use tox directly. In fact, we're doing CI with TravisCI online and tox offline. Same logic repeated twice. This should be addressing in version 0.1.0
when i use LineageRunner to analyze a sql string like "select a.* from table a,table b " it throws a SQLLineageException:
SQLLineageException: An Identifier is expected, got IdentifierList[value: a, table] instead
So how can i deal with that exception?
from sqllineage.runner import * LineageRunner("select a.* from table1 a, table2 b").target_tables
`---------------------------------------------------------------------------
SQLLineageException Traceback (most recent call last)
in ()
----> 1 LineageRunner("select a.* from table1 a, table2 b").target_tables
~/anaconda3/lib/python3.6/site-packages/sqllineage/runner.py in init(self, sql, encoding, verbose)
31 if s.token_first(skip_cm=True)
32 ]
---> 33 self._lineage_results = [LineageAnalyzer().analyze(stmt) for stmt in self._stmt]
34 self._combined_lineage_result = combine(*self._lineage_results)
35 self._verbose = verbose
~/anaconda3/lib/python3.6/site-packages/sqllineage/runner.py in (.0)
31 if s.token_first(skip_cm=True)
32 ]
---> 33 self._lineage_results = [LineageAnalyzer().analyze(stmt) for stmt in self._stmt]
34 self._combined_lineage_result = combine(*self._lineage_results)
35 self._verbose = verbose
~/anaconda3/lib/python3.6/site-packages/sqllineage/core.py in analyze(self, stmt)
91 else:
92 # DML parsing logic also applies to CREATE DDL
---> 93 self._extract_from_DML(stmt)
94 return self._lineage_result
95
~/anaconda3/lib/python3.6/site-packages/sqllineage/core.py in _extract_from_DML(self, token)
143 raise SQLLineageException(
144 "An Identifier is expected, got %s[value: %s] instead"
--> 145 % (type(sub_token).name, sub_token)
146 )
147 source_table_token_flag = False
SQLLineageException: An Identifier is expected, got IdentifierList[value: table1 a, table2 b] instead`
SELECT col1 FROM (SELECT col1 FROM tab1)
Subquery without alias name is valid syntax in SparkSQL. And this parsing result says: "An Identifier is expected, got Parenthesis[value: (select col1 from tab1)] instead".
Since we're not assuming any specific SQL dialect, we should support this.
select * from tab_a
left join tab_b
on tab_a.x = tab_b.x
an extra whitespace in "left join" make tab_b undetectable.
sqllineage -e "refresh table dual;"
Statements(#): 1
Source Tables:
Target Tables:
<unknown>.dual
for things such as database info, alias name, etc
sqllineage -v -e "------------------------
dquote> select * from dual"
Statement #1: ------------------------select * from dual
table read: {Table: <unknown>.dual}
table write: {}
table rename: {}
table drop: {}
table with_: {}
==========
Summary:
Statements(#): 1
Source Tables:
<unknown>.dual
Target Tables:
We should trim the leading comment line when showing statement.
also, use The Ubuntu 20.04 (Focal Fossa) as build environment
cancel sqlparse <0.4.0 restriction.
update CI related package to highest version
Python 3.5 will reach End Of Life: 2020-09-13. We shall drop support for it before release v1.0
apart from the basic usage and introduction of sqllineage, this sphinx docs should be the first versioned docs in ReadTheDocs.
the README.md should also get update.
$ sqllineage -e "DROP TABLE IF EXISTS tab1"
Statements(#): 1
Source Tables:
Target Tables:
tab1
expect:
When a table is in target_tables, dropping table result in a removal from target_tables. Otherwise the DROP DML doesn't affect the result.
In command line interface, we currently show file-wise lineage result. It causes problem like #23 . Some self-dependent sql shouldn't identify related table as temp table. This pattern could easily be identified if we look at statement rather than the whole file.
On the other hand, if a table was created/inserted into, and later queried from, it's perfectly reasonable to be identified as temp table.
So this statement grandularity lineage result may be the right choice also concerning #29 .
We could try the operator reloading in design.
now install_requires in setup.py and requirements.txt both define sqlparse as a requirement. This duplication should be addressed in v0.1.0.
Besides, requirements.txt doesn't distinguish install requirements, development requirements and test requirements. We should find a proper way to do it.
hello reata.when i using sqllineage to analyze the follow SQL .I got a bug.
UPDATE tablea a INNER JOIN ( SELECT col2 FROM tableb GROUP BY col ) b ON b.col = a.col SET a.col1 = a.col2 * b.col2
that SQL update tableas col1 using tableb
s col2 . when i running the SQLLINGAGE it returns tableb as source table but null as target table .in my view the source table should be tableb and the target table should be tablea but not null
It's extremely painful to get graphviz & pygraphviz working in Windows, as shown in #87 .
After all, we're just using pygraphviz dot layout, without using the drawing functionality. For drawing, it's actually matplotlib, using graphviz dot layout. Maybe we can try port dot layout from C to Python.
This will greatly ease the usage of visualization.
Combine lineage result from multiple SQL files, same as combine from multiple SQL statements
Since @ekimd has started the effort for column-level lineage analysis. I create this ticket to track all the questions we have to answer before we dive into implementation details. Considering our pure code analysis approach without involving metadata, I believe we have several design choices to make.
Question No.1: What's the data structure to represent Column-Level Lineage.
Currently we're using DiGraph
in library networkx
to represent Table-Level Lineage, with table as vertex and table-level lineage as edge, which is pretty straight forward. After changing to Column-Level, what's the plan?
Question No.2: How do we deal with select *
INSERT OVERWRITE tab1
SELECT * FROM tab2;
In this case, we don't know which columns are in tab2.
Question No.3: How do we deal with column without table/alias prefix in case of join.
INSERT OVERWRITE tab1
SELECT col2
FROM tab2
JOIN tab3
ON tab2.col1 = tab3.col1
In this case, we don't know whether col2 is coming from tab2 or tab3.
Question No.4: How do we visualize column-level lineage?
take the following sql as example
SELECT 1;
-- SELECT 2;
the parsing result says that it contains two statements, instead of one.
for self-dependent SQL, it's also to insert data into new partition based on data from old one. This pattern couldn't be addressed since we consider table as the minimum entity.
Should we consider Partition-level lineage?
select
*
from -- comment
tab_a
This will raise AssertionError
hello,reata:
I got the newest code from your github to test whether SQLLINEAGE works.
it works very well for most of cartesian product SQL.
but ,when i input sql " select a.* from table a,table b" something strange happened:
LineageRunner("select a.* from source a,source b").source_tables
returns "[Table: .a]" but not " table"I run this sql just for fun ,after found this BUG .I tested more and get another "strange returns"
LineageRunner("select a.* from source1 a,source b").source_tables
returns " [Table: .source1]"
I thought , Was it the Tablenames matter? so I changed the Tablenames
LineageRunner("select * from care a,care b").source_tables
returns "[Table: .care]",It works.
LineageRunner("select * from care1 a,care b").source_tables
returns "[Table: .care, Table: .care1]" It works well.
LineageRunner("select * from care a,care b").source_tables
returns "[Table: .care]" but not "a"well,I thought that maybe some Tablenames sqllineage can`t works well.
$ python -m sqllineage.core -e "SELECT col1 FROM (SELECT col2 from tab1) dt"
Statements(#): 1
Source Tables:
dt
Target Tables:
expect source table as tab1
this should refer to some kind of standard.
For the moment, QueryParser's test case for Hive SQL seems the best choice
SELECT * FROM
tab1
causes following exception
Traceback (most recent call last):
File "/home/admin/.pyenv/versions/3.6.8/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/admin/.pyenv/versions/3.6.8/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/admin/repos/sqllineage/sqllineage/core.py", line 131, in
main()
File "/home/admin/repos/sqllineage/sqllineage/core.py", line 119, in main
print(LineageParser(sql))
File "/home/admin/repos/sqllineage/sqllineage/core.py", line 23, in init
self._extract_from_token(stmt)
File "/home/admin/repos/sqllineage/sqllineage/core.py", line 75, in _extract_from_token
assert isinstance(sub_token, Identifier)
AssertionError
Currently, If a table is both source table and target table. sqllineage will identify this table as temp table and hide it from user.
We should give use control over the display of temp table
either the NaiveLineageCombiner in sqllineage.combiners module or User-write custom Combiner
insert overwrite table tab_a
select * from tab_b
union all
select * from TAB_B
here tab_b and TAB_B will be parsed as two different tables.
Given the current situation of Travis CI, it takes ~20 minutes for a build to get start once the request is received. It's extremely slow, which greatly impact our efficiency.
Let's see if we can switch to GitHub Actions to achieve the same functionality. Given that we're using tox as CI interface, it should be fine for the switch.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.