ploomber / jupysql Goto Github PK

This project forked from catherinedevlin/ipython-sql

Better SQL in Jupyter. 📊

License: Apache License 2.0

Python 98.11% Shell 0.05% CSS 0.05% JavaScript 1.75% Dockerfile 0.03%

bigquery clickhouse data-engineering data-science duckdb hive jupyter mysql polars postgres presto python redshift snowflake spark-sql sql sqlite trino tsql

jupysql's Introduction

Tip

Deploy AI apps for free on Ploomber Cloud!

Ploomber is the fastest way to build data pipelines ⚡️. Use your favorite editor (Jupyter, VSCode, PyCharm) to develop interactively and deploy ☁️ without code changes (Kubernetes, Airflow, AWS Batch, and SLURM). Do you have legacy notebooks? Refactor them into modular pipelines with a single command.

Installation

Compatible with Python 3.7 and higher.

Install with pip:

pip install ploomber

Or with conda:

conda install ploomber -c conda-forge

Getting started

Try the tutorial:

Community

Main Features

⚡️ Get started quickly

A simple YAML API to get started quickly, a powerful Python API for total flexibility.

get-started.mp4

⏱ Shorter development cycles

Automatically cache your pipeline’s previous results and only re-compute tasks that have changed since your last execution.

shorter-cycles.mp4

☁️ Deploy anywhere

Run as a shell script in a single machine or distributively in Kubernetes, Airflow, AWS Batch, or SLURM.

deploy.mp4

📙 Automated migration from legacy notebooks

Bring your old monolithic notebooks, and we’ll automatically convert them into maintainable, modular pipelines.

refactor.mp4

I want to migrate my notebook.

Show me a demo.

Resources

About Ploomber

Ploomber is a big community of data enthusiasts pushing the boundaries of Data Science and Machine Learning tooling.

Whatever your skillset is, you can contribute to our mission. So whether you're a beginner or an experienced professional, you're welcome to join us on this journey!

Click here to know how you can contribute to Ploomber.

jupysql's People

Contributors

Stargazers

Watchers

Forkers

wstacey2019 yafimvo codesolid alex-monahan tonykploomber gtduncan jorisroovers genostack iamicey v3io pdbg neelasha23 jpjon chuwpeng123 ayushmaanshrotriya juliajyh anirudhviyer raghavsharmacode jng9134 ajneesh21 ssghost arnav1996 lfunderburk rohitpandey13 kachuma kewaljani alextongme javier daveokpare ayuryshev mehtamohit013 tegveerg wjsvec tl3119 avyfain wsshawn anupam-tiwari renilgangwal ha2179 sa4084 axd353 mz0in kaloob sazidthe1 marianamtzcelis stvhanna vaibhavpage azflin7 bryannho kevin-ploomber nmreadelf nooodle-soup erdal-pb marshallwhiteorg ned2 gilcloud maciejb luvjain krishisharma45 elgalu ramonvo74 maotech rituraj981 gauravcodepro jan-herout mordorianguy thedataincubator malaa-tech nguyenngocbinh wangjenn b1ackout wimvanleuven

jupysql's Issues

explicitly document the difference between %sql and %%sql (maybe adding an FAQ section)

This comes from this user question

PostgreSQL meta commands support

There's some issue with pgspecial, this only happens in magics.
We should reproduce and solve this in a manner that doesn't require the users to restart the kernel.

cc: @mayankkaizen @lawalogunfowora @joreyolo @mrlevitas @vioxcd @is-wschumacher @brazacz
@trunksio @henrymei @o-rayer @marcelokuadro @SimarKareer @ep3data-admin

mismatch on %sql charindex

reproduce and solve this reported issue

Obfuscate connection string

This issue discuss printing the connection string as a cell output. We should add some flag to allow the user to avoid this since it's not too secure. In a similar manner to this, we need to document this flag.

fixing CI on windows

When adding windows to the CI, it fails. We have to fix the tests, often it's just incompatibilities in the tests, but there might be cases where there's an actual incompatibility.

idea: open SQL files as notebooks

Similar to what Jupytext does, except for SQL.

Users can edit SQL files as notebooks, and the underlying SQL file is something that a database can run (i.e., only one query is active, and the rest are commented out). Integration with Ploomber. You can run these SQL files on Ploomber and select if you want to run them as notebooks or SQL. If the former, you get the output notebooks with all the charts.

update example to use local db

this example is connecting to a remote database; if users download it, they won't be able to run it; so we should update it to use local data.

we can use the same data as here: https://jupysql.readthedocs.io/en/latest/compose.html

Builds in readthedocs fail since updates in plot example

I've modified the example plot.md to consider a larger database, as detailed in #8.

For some reason, builds in readthedocs are failing: https://readthedocs.org/projects/jupysql/builds/

I created a new branch with diferent table sizes to verify if it was the problem and also tried with a bigger timeout from jupyterbook config. In all cases builds failed, so we need to verify why this is happening and build the new docs with the updated examples.

Suppress message "Returning data to local variable"

Example:

In:

%%sql result <<
select col_name
from table_name

Out:

Returning data to local variable result
I don't want this message to clutter my output for the document I am producing. It would be useful to have a setting to suppress this message.

Current workaround for this issue (needs to be applied for each cell):

%%capture
%%sql result <<
select col_name
from table_name

<< assignment for multiple lines

Not sure if this issue still happens, but we should reproduce, if it does we can fix it and if it doesn't we can close this issue.

experiment: showing our community page on errors

We want to incentivize our users to talk to us when they encounter problems by showing them our link to Slack in the exception message.

In ploomber-core 0.1, I added some custom exceptions, so we do that. Essentially, we want to replace some errors with their customized counterparts:

ValueError -> PloomberValueError
TypeError -> PloomberTypeError

see documentation here

Adding Impala support

Currently there's no support. Similar reference
We need to merge Luca's fix.

Need to make sure we have pytds compatibility

This user had issues with pytds and sqlalchemy, see example.
We should reproduce and solve this/make sure it works fine.

validating input parameter when initializing connection

When starting a new session, users can connect to a db like this:

%sql connection_string

However, the connection_string is not validated, and if I pass an invalid object:

%sql "not a connection string"

%sql 100

The following message shows:

Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
               postgresql://username:password@hostname/dbname
               or an existing connection: dict_keys([])

This is confusing as it doesn't tell me my input is wrong. It says "no connection string given", but that's incorrect.

Closing magics existing connection

Essentially there's no way of closing a magics initiated connection.
We should fix this issue thread sample.

CD for jupysql

We need to add a git actions step to allow publishing after successful CI runs.

syntax highlighting

This isn't supported in JupyterLab, and the required feature is scheduled for the 4.0 release, but there isn't an expected release date yet; see jupyterlab/jupyterlab#3869

based on this, it seems that another way to get syntax highlighting with jupyterlab-lsp, I tested it and it works:

Looks like at this point, jupyter-lsp is the way to go, so we should add instructions to the docs

cockroachdb integration

Following this issue, we should check what fixes are needed to support the db.

We should also write a guide on connecting to it, and add it to the DB's documentation

Add get started button

There should a button to get started from the repo, just like ploomber.

writing large SQL queries

The best way to write large SQL queries is via CTEs:

WITH something AS (
  SELECT * FROM some_table
  WHERE column = 1
)
SELECT *
FROM something
WHERE another_column = 2

However, when writing CTEs, we lose the ability to run the sub-expressions interactively.

We could add something like this:

%%sql --name something
SELECT * FROM some_table
WHERE column = 1

Then, to re-use the previous query:

%%sql
SELECT *
-- refer to the previous query
FROM ':something'
WHERE another_column = 2

And have the extension automatically expand the expression above to:

WITH something AS (
  SELECT * FROM some_table
  WHERE column = 1
)
SELECT *
FROM something
WHERE another_column = 2

Before submitting to the database.

better connection string parsing

we documented a workaround for connecting to SQLite databases that contain spaces (#35), however, this hasn't been fixed. so I'm opening this issue. This is related to the parsing logic (see #33 (comment)).

adding new magics for plotting

JupySQL has a feature to plot histograms and boxplots for large-scale datasets. It works by computing the summary statistics and then using those statistics to create the plot (as opposed to bringing all the data and computing locally, which can blow up memory usage).

However, only a Python API is exposed, and it'd be better to have it as magic. For example:

Plot histogram of x:

%%sql-histogram
SELECT x
FROM numbers

Plot boxplot of y:

%%sql-boxplot
SELECT y
FROM numbers

There are a few things to take into account:

the query might select more than one variable. In such case, we should throw an error (or ignore the others?)
the data type might be wrong, we should run the query anyway, so we don't have to perform validations, but if an error occurs, we should ensure the error message is clear
We should keep the Python API since I gave a talk at PyData 2022, so it's likely that some people are already using it
I'm unsure if it's best to create a new magic or re-use the existing one. It feels like the %%sql magic already has too many parameters

plotly as plotting backend

Add plotly as alternative plotting backend: https://jupysql.readthedocs.io/en/latest/plotting.html

Check rate of `--no-execute` usage on `--save`

We're using the --save, and --no-execute functions. This tells JupySQL to store the query, but skips execution.
Ideally save would only save the query given users will want to chain multiple saved queries, they'll always use the --no-execute flag. We want to track it and see if we need to reverse the functionality, meaning --save will only save, and to execute we need to add the --execute flag

Support passing existing connections

The initialization locks the context to a specific engine. See example

We'd like to allow users to connect to an engine directly by reference (similar to SQL alchemy).

buggy behavior in `%config SqlMagic.style`

I was testing the %config SqlMagic.style and realized it doesn't work well:

sometimes the text representation is empty, and changing the style does not affect it.

sample code:

%sql sqlite://

%%sql
CREATE TABLE languages (name, rating, change);
INSERT INTO languages VALUES ('Python', 14.44, 2.48);
INSERT INTO languages VALUES ('C', 13.13, 1.50);
INSERT INTO languages VALUES ('Java', 11.59, 0.40);
INSERT INTO languages VALUES ('C++', 10.00, 1.98);

%config SqlMagic.style = "PLAIN_COLUMNS"
%config SqlMagic.style = "RANDOM"
%config SqlMagic.style = "DEFAULT"
result = %sql SELECT * FROM languages LIMIT 2
print(result.pretty.get_string())

Document how to connect using config files

Support is already there but not well-documented

UDF support

Currently the package doesn't support it.
We should add a feature to accept it and execute it via the engine.

proposal: `%sqlcmd profile`

When working with a new dataset, practitioners need to explore and summarize it quickly: column values, types, distributions, etc. We could create a %sqlcmd profile magic that produces an HTML table/report of a table. Similar to pandas-profiling, except this would run the analysis on the SQL engine, making it more scalable.

Examples:

display an embedded summary:

%sqlcmd profile --table my_table

store a report:

%sqlcmd profile --table my_table --output report.html

Feedback to user in case connection hangs

We should probably have a mechanism to timeout or detach the process from the kernel to avoid user frustration.
Reference

proposal: `%sqlcmd test`

Often, practitioners need to evaluate the quality of their data (e.g., no NULLs, values within range, etc.). We could create a new magic for rapid testing:

%sqlcmd test --table data --column numbers --within 0,100 --type float
%sqlcmd test --table data --column names --no-nulls
%sqlcmd test --table data --column birth_date --strictly-greater-than 0

plots powered by the SQL engine

mssql+pyodbc connection issue, MARS_Connection=Yes variable

We should allow an easier workaround for this parameter.
The current workaround for users is setting MARS_Connection=Yes

A simple integration for oracledb

We should allow the users connecting to the ODB via the package.
In addition, we should also document the way of doing it into the docs.

Reference

color coding cell status - green for successful and red for errored result

We can follow this issue to allow users easily distinct between cells that are executed successfully vs ones that didn't.

Allowing sql highlighting

This currently doesn't work in jupyterlab.
We should tackle this issue to make the code more readable for users.

empty table should show its columns like pandas

According to this, we should support a similar structure to pandas when it comes to an empty response, show an empty table with the headers (column names).

Make sure we support ipython 8.0+

We should also validate 7.0 while we're at it

add algolia for searching

Like we did with Ploomber, we should add Algolia to JupySQL docs. Ploomber's doc is built on top of sphinx, while JupySQL's docs is on top of Jupyter Book (another layer of abstraction on top of sphinx), so there might be some exploration needed. We must document the process so we can apply it easily to the rest of our projects.

add duckdb example

https://duckdb.org/docs/guides/python/jupyter
https://pypi.org/project/duckdb-engine/

Allow sql kernel as part of the package

This is an interesting idea of having a sql-only notebook

formatting SQL to improve readability

we could have a CLI to format SQL in notebooks:

jupysql format some-notebook.ipynb

we already have sqlparse as dependency so we can use that for formatting

compatibility with SQLAlchemy 2

Connect throws

MetaData.init() got an unexpected keyword argument 'bind'

For now specifying SQLAlchemy version better, but should actually fix