
amphi-etl's People

Contributors

tgourdel


amphi-etl's Issues

Field summary/overview tool

Hello,

It would be useful to have a tool that gives some insight into the data:
-min/max
-average/mean/median/standard deviation
-type
-count non-null/count distinct
-min and max length
-etc.
And that for the selected fields, with the ability to use a sample.
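The list above can be approximated with a few lines of standard-library Python; a minimal sketch (the column values are made-up sample data):

```python
import statistics

# Toy column with a missing value; the summary below approximates the
# requested per-field overview using only the standard library.
values = [10.0, 12.5, None, 9.0]
non_null = [v for v in values if v is not None]

summary = {
    "min": min(non_null),                   # 9.0
    "max": max(non_null),                   # 12.5
    "mean": statistics.mean(non_null),      # 10.5
    "median": statistics.median(non_null),  # 10.0
    "stdev": statistics.stdev(non_null),
    "count_non_null": len(non_null),
    "count_distinct": len(set(non_null)),
}
```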

Here is an example from Alteryx, but the feature also exists in Orange Data Mining:

image

Best regards,

Simon

Multi-line SQL query seems to generate incorrect Python code and fail

When adding e.g. a Postgres Input, I entered a multi-line SQL query:

SELECT x.* FROM events x
WHERE x.event_type = 'CreateEntityEvent' AND x.event_json LIKE '%Person%';

since the interface allows it:

Screenshot 2024-06-20 at 14 01 35

Running the connector this way fails, however, with the error:

('unterminated string literal (detected at line 14)', (14, 9))

and it seems to be because the generated code incorrectly wraps the SQL query in quotes. The generated code reads:

    postgresInput1 = pd.read_sql(
        "SELECT x.* FROM events x
WHERE x.event_type = 'CreateEntityEvent' AND x.event_json LIKE '%Person%';",
        con=conn.connection
    ).convert_dtypes()

which is indeed invalid Python.

A correct wrapping could use triple double quotes """...""", like the following in this case:

    postgresInput1 = pd.read_sql(
        """SELECT x.* FROM events x
WHERE x.event_type = 'CreateEntityEvent' AND x.event_json LIKE '%Person%';""",
        con=conn.connection
    ).convert_dtypes()

This does not take escaping into account, though, which might be relevant in more general cases, e.g. r"""...""", or '''...''' if the query contains double quotes.

I've confirmed that the wrapping is indeed the issue: when I adjusted the SQL query to fit on a single line, it worked.

I'm using amphi-etl==0.4.3.
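One robust option for the generator (a hypothetical helper, not Amphi's actual code) is to emit the query with `repr()`, which always produces a valid Python string literal regardless of embedded quotes or newlines:

```python
def quote_sql(query: str) -> str:
    # repr() escapes quotes and newlines, so the generated source stays
    # syntactically valid for any query. Triple quotes are prettier but
    # still break if the query ends with a quote or contains """.
    return repr(query)

sql = (
    "SELECT x.* FROM events x\n"
    "WHERE x.event_type = 'CreateEntityEvent' AND x.event_json LIKE '%Person%';"
)
# Sketch of the line the generator would emit:
generated = f"postgresInput1 = pd.read_sql({quote_sql(sql)}, con=conn.connection)"
```

Evaluating `quote_sql(sql)` round-trips back to the original query, so the generated file always parses.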

Cloud inputs and outputs

Hi Thibaut.

I think it would be great to have cloud sources and targets, specifically data lake ones from the main cloud providers.

I know this is challenging because you would need to deal with security configurations for each cloud, but it would be useful.

Your project is not open source, as it uses Elastic License 2.0

The Elastic License 2.0 is not open-source compliant, so calling your project open-source is incorrect.

The first item in the license's Limitations section:

Limitations

You may not provide the software to third parties as a hosted or managed service, where the service provides users with access to any substantial set of the features or functionality of the software.

Directly contradicts the Open Source Initiative's definition of open-source:

1. Free Redistribution

The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale.

More commentary on this issue:

https://news.ycombinator.com/item?id=40724610
https://funnelfiasco.com/blog/2021/02/03/thoughts-on-elastic-license-v2/
https://news.ycombinator.com/item?id=36760635

I'm not making any judgment about why you chose this license; I'm merely pointing out that you advertise yourselves as open source when you're not.

Missing pyarrow and fastparquet

Trying to transform a CSV into Parquet:

image

    Error

    Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
    A suitable version of pyarrow or fastparquet is required for parquet support.
    Trying to import the above resulted in these errors:

      • Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
      • Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
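For context: pandas treats both engines as optional dependencies and tries pyarrow first, then fastparquet. A small sketch of that lookup (the fallback order is the documented pandas behavior; the snippet itself is illustrative):

```python
import importlib.util

# pandas tries 'pyarrow' first, then 'fastparquet'; either one enables
# to_parquet(). They are installed separately, e.g.:
#     pip install pyarrow
engine = next(
    (name for name in ("pyarrow", "fastparquet")
     if importlib.util.find_spec(name) is not None),
    None,
)
if engine is None:
    print("No parquet engine available; install pyarrow or fastparquet")
```

Bundling pyarrow with Amphi (or surfacing a clearer install hint) would avoid this error out of the box.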

Excel File Input more options: select engine (incl. calamine), usecols, skiprows, decimal, etc

Hello! Thanks for the nice ETL tool.

It would be great to add more options (matching the pandas library) to Excel File Input:

  1. engine - option to select a specific engine for pandas to read the Excel file ('pyxlsb', 'openpyxl', 'calamine'), setting 'calamine' as the default engine because of its amazing speed (please see python-calamine)
  2. usecols - for selecting which columns to read from the Excel file
  3. skiprows - for skipping some rows at the head of a sheet
  4. decimal - for correctly parsing European-style TEXT to numeric
  5. dtype_backend - for selecting the 'numpy' or 'pyarrow' backend
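All five map directly onto `pd.read_excel` keyword arguments. A minimal sketch, assuming pandas with openpyxl installed (the workbook content is made up; `decimal` requires pandas >= 1.4 and `engine="calamine"` pandas >= 2.2):

```python
import io
import pandas as pd

# Build a small workbook in memory (openpyxl handles the .xlsx round trip).
buf = io.BytesIO()
pd.DataFrame({"a": [1, 2], "b": ["1,5", "2,5"], "c": ["x", "y"]}).to_excel(buf, index=False)
buf.seek(0)

# The requested options, passed straight through to pandas:
df = pd.read_excel(
    buf,
    usecols="A:B",       # read only spreadsheet columns A and B
    skiprows=0,          # skip no leading rows in this toy file
    decimal=",",         # parse European-style decimals such as "1,5"
    # engine="calamine", # needs python-calamine installed
)
```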

Easy way to retrieve amphi version

Hello,

It's not that easy to find the version of amphi I'm running.

Maybe it would be nice to have it on the homepage and in an "About" item in the upper menu.

Best regards,

Simon

Ability to select several tools on canvas and copy/paste

Hello,

I would like to select some tools in my pipeline and copy/paste them into another pipeline. This would save a lot of time and help avoid mistakes.

As of today, I can't select several tools, and I can't copy/paste even one tool from one pipeline to another.

Best regards,

Simon

Excel File Output : add engine option

The default engine is openpyxl, but Excel shows a warning when opening a file saved with this engine.
When I use xlsxwriter as the engine, there is no warning:

    filter2.to_excel("c:/dev/db/localites3.xlsx", sheet_name="cp_multi_localites", engine="xlsxwriter")

Record ID tool

Hello,

A very, very simple tool that just creates an ID (usually an integer) for each record in the stream. The configuration can be quite simple: just the column name and the type.

image
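In pandas terms the whole tool is essentially one line; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Lyon", "Nice"]})

# A Record ID step: insert a 1-based integer ID as the first column;
# the column name and dtype would be the tool's two settings.
df.insert(0, "RecordID", range(1, len(df) + 1))
```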

Best regards,

Simon

Flow Zone/Containers for documentation/organization

Hello,

As we add tools to the canvas, we need to organize them. A common practice is to add some "containers" or "flow zones".

Here are some examples :

image

image

Features :
-choose color
-add title
-activate/deactivate
-collapse/expand.

Best regards,

Simon

Export workflow as SVG for documentation

What's your use case?
I would like to export the workflow as an SVG file for documentation. Screenshots are too often bad quality and not resizable (this is a common feature in ETL/dataprep software).

What's your proposed solution?
A button/entry menu to export all the workflow in SVG.

Are there any alternative solutions?
A screenshot cleaned up in Inkscape, but that's ugly.

Best regards,

Simon

Support paginating REST API input

Many REST APIs return paginated data to make data queries more manageable. It's not obvious to me if it's possible to implement that pagination through the current GUI.

This is potentially a big piece of work, as there are quite a few common pagination patterns, but that input type is pretty limited in use without any sort of pagination.

Here's some inspiration from the Airbyte docs, which had to implement these already: Pagination Strategies
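For reference, the simplest of those patterns (page-number pagination) is just a loop that stops on an empty page. A generic sketch, decoupled from any real API by taking the fetch function as a parameter (the fake API below is made up for illustration):

```python
def paginate(fetch_page, start=1):
    """Page-number pagination: call fetch_page(n) for n = start, start+1, ...
    until it returns an empty list, and concatenate the results."""
    records, page = [], start
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records

# Example with a fake in-memory API serving 5 records, 2 per page:
data = list(range(5))
fake_api = lambda n: data[(n - 1) * 2 : n * 2]
all_records = paginate(fake_api)  # [0, 1, 2, 3, 4]
```

Cursor-based and link-header pagination need more state, but fit the same "fetch until exhausted" loop shape.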

InfluxDB Connector

Connecting to an InfluxDB could be very helpful. It's used to store time-series data.

Defining a time range, bucket, and measurement to pull data from could return a dataset that can then be transformed and output further.

Excel File Output : incorrect option for sheet name

The Excel File Output component doesn't use the correct name for the sheet_name option:

it uses "sheet" instead of "sheet_name"

  • incorrect
    sqlQuery1.to_excel("c:/dev/db/localites2.xlsx", index=False, sheet='Feuil1')
  • correct
    sqlQuery1.to_excel("c:/dev/db/localites2.xlsx", index=False, sheet_name='Feuil1')

A connection and credentials manager

Hello,

Instead of always typing credentials and connection settings, it would be cool to store them in a manager (of course, not in clear text ^^)

Best regards,

Simon

Select a file: restrict to the matching file type as an option

Hello,

When using a file as an input (Excel, CSV...), I would like a drop-down menu to select the matching extensions, like this:
-first entry limited to the specific extensions (e.g. .csv for CSV files, or .xlsx and variations for Excel files)
-second entry: all files (in case the file was created with a wrong extension)

This would simplify searching for files.

Best regards,

Simon

Amphi launched on Windows: not working with computer name

Hello,

I have installed and launched amphi
image

Amphi opens a tab in my default browser (still OK);
however, the address doesn't work
image

I had to use localhost instead.

I suggest:
-using localhost as the address
-in the meantime, adding a tip to the documentation

Best regards,

Simon

Excel File Input: sheet and header arguments error

Some problems with reading Excel.

  1. Got an error when choosing the sheet option in Excel File Input:

    Error
    read_excel() got an unexpected keyword argument 'sheet'
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    Cell In[2], line 7
          1 # Source code generated by Amphi
          2 # Date: 2024-06-25 09:31:25
          3 # Additional dependencies: openpyxl
          4 import pandas as pd
    ----> 7 excelfileInput1 = pd.read_excel("dq_report.xlsx", sheet=' Sheet1', header='0').convert_dtypes()
          9 excelfileInput1

    TypeError: read_excel() got an unexpected keyword argument 'sheet'

pd.read_excel() expects a sheet_name argument instead of sheet:

    pd.read_excel("dq_report.xlsx", sheet_name='Sheet1', header=0).convert_dtypes()

  2. When we add and choose a custom sheet, the library adds a space at the start of the sheet name:

image

  3. We'll get errors for the sheet / header arguments, because all arguments are passed as strings (instead of int, string, or list):

image

image

Pipeline metadata

Hello,

It would be nice, for data governance, to have some metadata on our pipelines:
-author
-copyright
-title
-key words
-a long description
-ability to add a link for documentation
-version

Best regards,

Simon

Save the code module I wrote

Hello, it's an honor to come across such a great framework.
I wrote some custom Python code components and want to save them to the component bar, so that in the next project I can directly drag these components into use.

Choose your execution engine (in-memory/in-database)

Hello,

A dream feature would be the ability to choose the execution engine. Some dataprep/ETL tools have it (like Alteryx or Dataiku).
That means two possibilities:

  • in-memory, where the tool pulls the data onto its machine, transforms it through the different steps, and writes the result
  • in-database, where the data stays in the database and the tool only sends SQL queries to perform the different steps. This allows very big volumes in a very short time and does not depend on the resources of the ETL machine.

This is hard to develop, of course, but I think it definitely has its place on the roadmap.
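To illustrate the in-database side: the same aggregation that pandas would run in memory can be pushed down as SQL, so only the (small) result crosses the wire. A self-contained sketch using stdlib SQLite as a stand-in for any database (table and data are made up):

```python
import sqlite3

# In-database execution: the GROUP BY runs inside SQLite; the ETL tool
# never loads the raw rows, only the aggregated result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (city TEXT)")
conn.executemany("INSERT INTO events VALUES (?)",
                 [("Paris",), ("Paris",), ("Lyon",)])
rows = conn.execute(
    "SELECT city, COUNT(*) FROM events GROUP BY city ORDER BY city"
).fetchall()
```

An engine switch would amount to generating this SQL instead of the equivalent pandas `groupby` for each step.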

Best regards,

Simon

A formula transformation tool

Hello,

I would like a formula tool with these features (it looks very classic):
-modify an existing column or create a new one
-if a new one, ability to set the type (with a dropdown menu, for example)
-autocompletion
-preview
-ability to disable a formula (not usual, but pretty cool when developing)
-the language should be Python, of course

For reference, the Alteryx formula tool:

image
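Under the hood such a tool could boil down to `DataFrame.assign()` with one Python expression per output column; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 20], "qty": [2, 3]})

# One formula = one expression; assign() creates a new column or
# overwrites an existing one with the same name.
df = df.assign(total=lambda d: d["price"] * d["qty"])
```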

Best regards,

Simon

Filter Rows: how to force a numeric value

I want to filter rows based on a numeric field

I have declared:

  • Column name = VILLE_count
  • Condition = >
  • Value = 1

The generated code is:

    aggregate1 = sqlQuery1.groupby(["CODPOSTAL"]).agg(VILLE_count=('VILLE', 'count'), VILLE_min=('VILLE', 'min'), VILLE_max=('VILLE', 'max')).reset_index()
    # Filter rows based on condition
    filter2 = aggregate1.query("VILLE_count > '1'")

I got the error: TypeError: Invalid comparison between dtype=int64 and str

There is a boolean option "Enforce value as string".
Would it be possible to have an "Enforce value as number" option as well?

As a temporary fix, I added a string TypeConverter on VILLE_count between aggregate1 and filter2.
Very easy :)
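The difference is just the quoting of the literal inside the query string; a minimal reproduction with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"VILLE_count": [1, 2, 3]})

# What the generator currently emits: comparing int64 to a string
# raises TypeError, as reported above.
# df.query("VILLE_count > '1'")

# With an unquoted numeric literal the comparison works as expected:
filter2 = df.query("VILLE_count > 1")
```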

Tool searching

Hello,

There are a lot of tools, which is nice. However, especially on a small screen, you have to scroll and read them all to find the right one. A search box would be a nice feature.

image

Best regards,

Simon

Offline installer (a simple independent .exe on Windows)

Hello,

As of today, you need two things to install Amphi on Windows:
-a command prompt
-an internet connection to download every required package

However, if you want to win over non-technical, restricted-access users in corporate environments, it would be relevant to offer a classic offline installer for Windows (and if possible, even a non-admin one, like Alteryx's).

Best regards,

Simon

Custom components

Multiple asks for custom components.

Still considering the best method. Will add ideas here.

Copy Paste components

Allow copying and pasting of configured components inside a pipeline and, if possible, across different pipelines.

Time triggers

Is it possible to configure time-based triggers to run the pipeline?

If not, it could be useful to have triggers like "trigger every hour", etc.
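Until built-in triggers exist, one common workaround is to export the pipeline to a Python script and schedule it with cron; a hedged sketch (the script path is a made-up example):

```shell
# Edit the crontab with `crontab -e`, then add a line like:
# run the exported pipeline at minute 0 of every hour
0 * * * * /usr/bin/python3 /home/user/pipelines/my_pipeline.py
```

On Windows, Task Scheduler can play the same role.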
