
amphi-etl's People

Contributors

tgourdel


amphi-etl's Issues

Field summary/overview tool

Hello,

It would be useful to have a tool that gives some insight into the data:
-min/max
-average/mean/median/standard deviation
-type
-count non-null/count distinct
-min and max length
-etc.
And that for the selected fields, with the ability to use a sample.
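The list above can be approximated with a few lines of standard-library Python; a minimal sketch (the column values are made-up sample data):

```python
import statistics

# Toy column with a missing value; the summary below approximates the
# requested per-field overview using only the standard library.
values = [10.0, 12.5, None, 9.0]
non_null = [v for v in values if v is not None]

summary = {
    "min": min(non_null),                   # 9.0
    "max": max(non_null),                   # 12.5
    "mean": statistics.mean(non_null),      # 10.5
    "median": statistics.median(non_null),  # 10.0
    "stdev": statistics.stdev(non_null),
    "count_non_null": len(non_null),
    "count_distinct": len(set(non_null)),
}
```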

Here is an example from Alteryx, but the feature also exists in Orange Data Mining:

image

Best regards,

Simon

Multi-line SQL query seems to generate incorrect Python code and fail

When adding e.g. a Postgres Input, I entered a multi-line SQL query:

SELECT x.* FROM events x
WHERE x.event_type = 'CreateEntityEvent' AND x.event_json LIKE '%Person%';

since the interface allows it:

Screenshot 2024-06-20 at 14 01 35

Running the connector this way fails, however, with the error:

('unterminated string literal (detected at line 14)', (14, 9))

and it seems to be because the generated code incorrectly wraps the SQL query in quotes. The generated code reads:

    postgresInput1 = pd.read_sql(
        "SELECT x.* FROM events x
WHERE x.event_type = 'CreateEntityEvent' AND x.event_json LIKE '%Person%';",
        con=conn.connection
    ).convert_dtypes()

which is indeed invalid Python.

A correct wrapping could use triple double quotes """...""", like the following in this case:

    postgresInput1 = pd.read_sql(
        """SELECT x.* FROM events x
WHERE x.event_type = 'CreateEntityEvent' AND x.event_json LIKE '%Person%';""",
        con=conn.connection
    ).convert_dtypes()

This does not take escaping into account, though, which might be relevant in more general cases, e.g. r"""...""", or '''...''' if the query contains double quotes.

I've confirmed that the wrapping is indeed the issue: when I adjusted the SQL query to fit on a single line, it worked.

I'm using amphi-etl==0.4.3.
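One robust option for the generator (a hypothetical helper, not Amphi's actual code) is to emit the query with `repr()`, which always produces a valid Python string literal regardless of embedded quotes or newlines:

```python
def quote_sql(query: str) -> str:
    # repr() escapes quotes and newlines, so the generated source stays
    # syntactically valid for any query. Triple quotes are prettier but
    # still break if the query ends with a quote or contains """.
    return repr(query)

sql = (
    "SELECT x.* FROM events x\n"
    "WHERE x.event_type = 'CreateEntityEvent' AND x.event_json LIKE '%Person%';"
)
# Sketch of the line the generator would emit:
generated = f"postgresInput1 = pd.read_sql({quote_sql(sql)}, con=conn.connection)"
```

Evaluating `quote_sql(sql)` round-trips back to the original query, so the generated file always parses.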

Cloud inputs and outputs

Hi Thibaut.

I think it would be great to have cloud sources and targets, specifically data lake ones from the main cloud providers.

I know this is challenging because you would need to deal with security configurations for each cloud, but it would be useful.

Your project is not open source, as it uses Elastic License 2.0

The Elastic License 2.0 is not open-source compliant, so calling your project open-source is incorrect.

The first item in the license's Limitations section:

Limitations

You may not provide the software to third parties as a hosted or managed service, where the service provides users with access to any substantial set of the features or functionality of the software.

Directly contradicts the Open Source Initiative's definition of open-source:

1. Free Redistribution

The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale.

More commentary on this issue:

https://news.ycombinator.com/item?id=40724610
https://funnelfiasco.com/blog/2021/02/03/thoughts-on-elastic-license-v2/
https://news.ycombinator.com/item?id=36760635

I'm not making any judgment about why you chose this license; I'm merely pointing out that you advertise yourselves as open source when you're not.

Missing pyarrow and fastparquet

Trying to transform a CSV into Parquet:

image

    Error

    Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
    A suitable version of pyarrow or fastparquet is required for parquet support.
    Trying to import the above resulted in these errors:

      • Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
      • Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
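For context: pandas treats both engines as optional dependencies and tries pyarrow first, then fastparquet. A small sketch of that lookup (the fallback order is the documented pandas behavior; the snippet itself is illustrative):

```python
import importlib.util

# pandas tries 'pyarrow' first, then 'fastparquet'; either one enables
# to_parquet(). They are installed separately, e.g.:
#     pip install pyarrow
engine = next(
    (name for name in ("pyarrow", "fastparquet")
     if importlib.util.find_spec(name) is not None),
    None,
)
if engine is None:
    print("No parquet engine available; install pyarrow or fastparquet")
```

Bundling pyarrow with Amphi (or surfacing a clearer install hint) would avoid this error out of the box.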

Excel File Input more options: select engine (incl. calamine), usecols, skiprows, decimal, etc

Hello! Thanks for the nice ETL tool.

It would be great to add more options (matching the pandas library) to Excel File Input:

  1. engine - option to select a specific engine for pandas to read the Excel file ('pyxlsb', 'openpyxl', 'calamine'), setting 'calamine' as the default engine because of its amazing speed (please see python-calamine)
  2. usecols - for selecting which columns to read from the Excel file
  3. skiprows - for skipping some rows at the head of a sheet
  4. decimal - for correctly parsing European-style TEXT to numeric
  5. dtype_backend - for selecting the 'numpy' or 'pyarrow' backend
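All five map directly onto `pd.read_excel` keyword arguments. A minimal sketch, assuming pandas with openpyxl installed (the workbook content is made up; `decimal` requires pandas >= 1.4 and `engine="calamine"` pandas >= 2.2):

```python
import io
import pandas as pd

# Build a small workbook in memory (openpyxl handles the .xlsx round trip).
buf = io.BytesIO()
pd.DataFrame({"a": [1, 2], "b": ["1,5", "2,5"], "c": ["x", "y"]}).to_excel(buf, index=False)
buf.seek(0)

# The requested options, passed straight through to pandas:
df = pd.read_excel(
    buf,
    usecols="A:B",       # read only spreadsheet columns A and B
    skiprows=0,          # skip no leading rows in this toy file
    decimal=",",         # parse European-style decimals such as "1,5"
    # engine="calamine", # needs python-calamine installed
)
```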

Easy way to retrieve amphi version

Hello,

It's not that easy to find the version of amphi I'm running.

Maybe it would be nice to have it on the homepage and in an "About" item in the upper menu.

Best regards,

Simon

Ability to select several tools on canvas and copy/paste

Hello,

I would like to select some tools in my pipeline and copy/paste them into another pipeline. This would save a lot of time and help avoid mistakes.

As of today, I can't select several tools, and I can't copy/paste even one tool from one pipeline to another.

Best regards,

Simon

Excel File Output : add engine option

The default engine is openpyxl, but Excel shows a warning when opening a file saved with this engine.
When I use xlsxwriter as the engine, there is no warning:

    filter2.to_excel("c:/dev/db/localites3.xlsx", sheet_name="cp_multi_localites", engine="xlsxwriter")

Record ID tool

Hello,

A very, very simple tool that just creates an ID (usually an integer) for each record in the stream. The configuration can be quite simple: just the column name and the type.

image
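In pandas terms the whole tool is essentially one line; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Lyon", "Nice"]})

# A Record ID step: insert a 1-based integer ID as the first column;
# the column name and dtype would be the tool's two settings.
df.insert(0, "RecordID", range(1, len(df) + 1))
```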

Best regards,

Simon

Flow Zone/Containers for documentation/organization

Hello,

As we add tools to the canvas, we need to organize them. A common practice is to add some "containers" or "flow zones".

Here are some examples :

image

image

Features :
-choose color
-add title
-activate/deactivate
-collapse/expand.

Best regards,

Simon

Export workflow as SVG for documentation

What's your use case?
I would like to export the workflow as an SVG file for documentation. Screenshots are too often bad quality and not resizable (this is a common feature in ETL/dataprep software).

What's your proposed solution?
A button/entry menu to export all the workflow in SVG.

Are there any alternative solutions?
A screenshot cleaned up in Inkscape, but that's ugly.

Best regards,

Simon

Support paginating REST API input

Many REST APIs return paginated data to make data queries more manageable. It's not obvious to me if it's possible to implement that pagination through the current GUI.

This is potentially a big piece of work, as there are quite a few common pagination patterns, but that input type is pretty limited in use without any sort of pagination.

Here's some inspiration from the Airbyte docs, which had to implement these already: Pagination Strategies
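For reference, the simplest of those patterns (page-number pagination) is just a loop that stops on an empty page. A generic sketch, decoupled from any real API by taking the fetch function as a parameter (the fake API below is made up for illustration):

```python
def paginate(fetch_page, start=1):
    """Page-number pagination: call fetch_page(n) for n = start, start+1, ...
    until it returns an empty list, and concatenate the results."""
    records, page = [], start
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records

# Example with a fake in-memory API serving 5 records, 2 per page:
data = list(range(5))
fake_api = lambda n: data[(n - 1) * 2 : n * 2]
all_records = paginate(fake_api)  # [0, 1, 2, 3, 4]
```

Cursor-based and link-header pagination need more state, but fit the same "fetch until exhausted" loop shape.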

InfluxDB Connector

Connecting to an InfluxDB could be very helpful. It's used to store time-series data.

Defining a time range, bucket, and measurement to pull data from could return a dataset that can then be transformed and output further.

Excel File Output : incorrect option for sheet name

The Excel File Output component doesn't use the correct name for the sheet_name option:

it uses "sheet" instead of "sheet_name"

  • incorrect
    sqlQuery1.to_excel("c:/dev/db/localites2.xlsx", index=False, sheet='Feuil1')
  • correct
    sqlQuery1.to_excel("c:/dev/db/localites2.xlsx", index=False, sheet_name='Feuil1')

A connection and credentials manager

Hello,

Instead of always typing credentials and connection settings, it would be cool to store them in a manager (of course, not in clear text ^^)

Best regards,

Simon

Select a file: restrict to the matching file type as an option

Hello,

When using a file as an input (Excel, CSV...), I would like a drop-down menu to select the matching extensions, like this:
-first entry limited to the specific extensions (e.g. .csv for CSV files, or .xlsx and variations for Excel files)
-second entry: all files (in case the file was created with a wrong extension)

This would simplify searching for files.

Best regards,

Simon

Amphi launched on Windows: not working with computer name

Hello,

I have installed and launched amphi
image

Amphi opens a tab in my default browser (still OK);
however, the address doesn't work
image

I had to use localhost instead.

I suggest:
-using localhost as the address
-in the meantime, adding a tip to the documentation

Best regards,

Simon

Excel File Input: sheet and header arguments error

Some problems with reading Excel.

  1. Got an error when choosing the sheet option in Excel File Input:

    Error
    read_excel() got an unexpected keyword argument 'sheet'
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    Cell In[2], line 7
          1 # Source code generated by Amphi
          2 # Date: 2024-06-25 09:31:25
          3 # Additional dependencies: openpyxl
          4 import pandas as pd
    ----> 7 excelfileInput1 = pd.read_excel("dq_report.xlsx", sheet=' Sheet1', header='0').convert_dtypes()
          9 excelfileInput1

    TypeError: read_excel() got an unexpected keyword argument 'sheet'

pd.read_excel() expects a sheet_name argument instead of sheet:

    pd.read_excel("dq_report.xlsx", sheet_name='Sheet1', header=0).convert_dtypes()

  2. When we add and choose a custom sheet, the library adds a space at the start of the sheet name:

image

  3. We'll get errors for the sheet / header arguments, because all arguments are passed as strings (instead of int, string, or list):

image

image

Pipeline metadata

Hello,

It would be nice, for data governance, to have some metadata on our pipelines:
-author
-copyright
-title
-key words
-a long description
-ability to add a link for documentation
-version

Best regards,

Simon

Save the code module I wrote

Hello, it's an honor to come across such a great framework.
I wrote some custom Python code components and want to save them to the component bar, so that in the next project I can directly drag these components into use.

Choose your execution engine (in-memory/in-database)

Hello,

A dream feature would be the ability to choose the execution engine. Some dataprep/ETL tools have it (like Alteryx or Dataiku).
That means two possibilities:

  • in-memory, where the tool pulls the data onto its machine, transforms it through the different steps, and writes the result
  • in-database, where the data stays in the database and the tool only sends SQL queries to perform the different steps. This allows very big volumes in a very short time and does not depend on the resources of the ETL machine.

This is hard to develop, of course, but I think it definitely has its place on the roadmap.
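To illustrate the in-database side: the same aggregation that pandas would run in memory can be pushed down as SQL, so only the (small) result crosses the wire. A self-contained sketch using stdlib SQLite as a stand-in for any database (table and data are made up):

```python
import sqlite3

# In-database execution: the GROUP BY runs inside SQLite; the ETL tool
# never loads the raw rows, only the aggregated result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (city TEXT)")
conn.executemany("INSERT INTO events VALUES (?)",
                 [("Paris",), ("Paris",), ("Lyon",)])
rows = conn.execute(
    "SELECT city, COUNT(*) FROM events GROUP BY city ORDER BY city"
).fetchall()
```

An engine switch would amount to generating this SQL instead of the equivalent pandas `groupby` for each step.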

Best regards,

Simon

A formula transformation tool

Hello,

I would like a formula tool with these features (it looks very classic):
-modify an existing column or create a new one
-if a new one, ability to set the type (with a dropdown menu, for example)
-autocompletion
-preview
-ability to disable a formula (not usual, but pretty cool when developing)
-the language should be Python, of course

For reference, the Alteryx formula tool:

image
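Under the hood such a tool could boil down to `DataFrame.assign()` with one Python expression per output column; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 20], "qty": [2, 3]})

# One formula = one expression; assign() creates a new column or
# overwrites an existing one with the same name.
df = df.assign(total=lambda d: d["price"] * d["qty"])
```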

Best regards,

Simon

Filter Rows: how to force a numeric value

I want to filter rows based on a numeric field

I have declared:

  • Column name = VILLE_count
  • Condition = >
  • Value = 1

The generated code is:

    aggregate1 = sqlQuery1.groupby(["CODPOSTAL"]).agg(VILLE_count=('VILLE', 'count'), VILLE_min=('VILLE', 'min'), VILLE_max=('VILLE', 'max')).reset_index()
    # Filter rows based on condition
    filter2 = aggregate1.query("VILLE_count > '1'")

I got the error: TypeError: Invalid comparison between dtype=int64 and str

There is a boolean option "Enforce value as string".
Would it be possible to have an "Enforce value as number" option as well?

As a temporary fix, I added a string TypeConverter on VILLE_count between aggregate1 and filter2.
Very easy :)
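The difference is just the quoting of the literal inside the query string; a minimal reproduction with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"VILLE_count": [1, 2, 3]})

# What the generator currently emits: comparing int64 to a string
# raises TypeError, as reported above.
# df.query("VILLE_count > '1'")

# With an unquoted numeric literal the comparison works as expected:
filter2 = df.query("VILLE_count > 1")
```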

Tool searching

Hello,

There are a lot of tools, which is nice. However, especially on a small screen, you have to scroll and read them all to find the right one. A search box would be a nice feature.

image

Best regards,

Simon

Offline installer (a simple independent .exe on Windows)

Hello,

As of today, you need two things to install Amphi on Windows:
-a command prompt
-an internet connection to download every required package

However, if you want to win over non-technical, restricted-access users in corporate environments, it would be relevant to offer a classic offline installer for Windows (and if possible, even a non-admin one, like Alteryx's).

Best regards,

Simon

Custom components

Multiple asks for custom components.

Still considering the best method. Will add ideas here.

Copy Paste components

Allow copying and pasting of configured components inside a pipeline and, if possible, across different pipelines.

Time triggers

Is it possible to configure time-based triggers to run the pipeline?

If not, it could be useful to have triggers like "trigger every hour", etc.
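Until built-in triggers exist, one common workaround is to export the pipeline to a Python script and schedule it with cron; a hedged sketch (the script path is a made-up example):

```shell
# Edit the crontab with `crontab -e`, then add a line like:
# run the exported pipeline at minute 0 of every hour
0 * * * * /usr/bin/python3 /home/user/pipelines/my_pipeline.py
```

On Windows, Task Scheduler can play the same role.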
