artefactory / artefactory-connectors-kit

ACK is an E(T)L tool specialized in API data ingestion. It is accessible through a command-line interface. The application allows you to easily extract, stream, and load data (with minimal transformations) from the API source to the destination of your choice.

License: GNU Lesser General Public License v3.0

Dockerfile 0.06% Makefile 0.12% Python 99.80% Shell 0.02%
google-cloud-storage search-console radarly salesforce adobe-analytics amazon-s3 confluence facebook google-analytics dcm

artefactory-connectors-kit's People

Contributors

aderennes, adussarps, ali-bellamlih, bdavis9725, benoitbazouin, benoitgoujon, bibimorlet, cedric-magnan, d-tw, declin, dependabot[bot], gabrielleberanger, haypierre, jbcharrueyartefact, l2me, louisrdsc, mycaule, nathanvermeersch, pol-defont-reaulx, r-lp, senhajirhazi, tom-grivaud, vviers


artefactory-connectors-kit's Issues

Fix needed: the default .csv field_size_limit is exceeded while making requests with the DV360 reader

ERROR AND WHY:
While collecting data from the DV360 platform, I encountered this issue:

Traceback (most recent call last):
  File "nck/entrypoint.py", line 86, in <module>
    app()
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 1164, in invoke
    return _process_result(rv)
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 1101, in _process_result
    value = ctx.invoke(self.result_callback, value,
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "nck/entrypoint.py", line 68, in run
    writer.write(stream)
  File "/.../nautilus-connectors-kit/nck/writers/console_writer.py", line 44, in write
    buffer = file.read(1024)
  File "/.../nautilus-connectors-kit/nck/streams/stream.py", line 114, in readinto
    chunk = self.leftover or encode(next(iterable))
  File "/.../nautilus-connectors-kit/nck/utils/file_reader.py", line 36, in sdf_to_njson_generator
    for line in dict_reader:
  File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)

From the error message, I understood that I had to set csv.field_size_limit above the default limit of 131072.

HOW TO FIX IT AND FURTHER INVESTIGATION:
Adding the following line to the nck/utils/file_reader.py file made the error vanish, and I was able to get my result printed on the console (replace 10000000 with another limit, to be discussed):

csv.field_size_limit(10000000)

Even though this worked, I noticed that one field contained an outrageous number of IDs. Before setting this new csv.field_size_limit, it could be worth checking whether there is a mistake in the process that causes a field to contain far more IDs than it really should.
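
If the limit does have to be raised, a more defensive variant (a sketch, not what the repo currently does) is to push it as high as the platform allows:

import csv
import sys

def set_max_csv_field_size_limit():
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            break
        except OverflowError:
            # sys.maxsize can exceed the C long limit on some platforms
            limit //= 2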

As a User, I have documentation explaining how to launch a request on any reader

WHY
Today, documentation is missing for several readers:

  • Amazon S3
  • Google Cloud Storage
  • Oracle
  • MySQL
  • Radarly
  • Salesforce

As a consequence, using one of these 6 readers can be time-consuming, as you have to dig into the code and/or API documentation to understand the parameters that should be provided to the NCK command.

HOW
Create the missing documentation for each one of these 6 readers.
It should be inserted into this README and include the following sections:

  • Source API
  • How to obtain credentials
  • Quickstart (a command example; see the sketch below)
  • Parameters
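
As an illustration, each Quickstart could follow a command skeleton of this shape (command and option names are placeholders, to be replaced with each reader's actual parameters):

python nck/entrypoint.py read_<source> --<source>-<option> <VALUE> write_<destination>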

As a Developer, I have documentation explaining how the application is structured

WHY
Today, onboarding a new contributor on NCK is tough: the application structure is complex to understand, and the development of new readers/writers/streamers should follow a set of conventions that are not specified anywhere.

HOW
Create documentation describing:

  • the application structure
  • the process & conventions to follow to develop a new reader/writer/streamer

EXPLO - Try out the Airbyte solution

WHY
The solution proposed by the Airbyte start-up is very close to what we had in mind for NCK (open-source application, EL(T) approach, similar data sources, configuration through a UI or an API). Potential cons: the application is coded in Java, and there are doubts about scalability (single-node approach).

HOW
Test the solution and evaluate its pros and cons, so that we can identify what could be NCK's differentiators.

As a User or as a Contributor, I can access a dedicated section of the documentation focusing on my needs

WHY
In the documentation, the Getting Started page is a mix of information oriented towards end-users ("Launch your first NCK command", "Normalize field names") and contributors ("Set up your development environment", "Contribute").

WHAT
Create 2 distinct sections, focusing on the needs of each user profile. For instance:

For end-users

  • Installation (cloning the repo and installing requirements in a virtual environment)
  • Launch your first command
  • Normalize field names

For contributors

  • Development guidelines (linting, pre-commit hooks, TDD, documentation)
  • Application architecture: how to develop a new reader/stream/writer

As a user, I can see the HTTP request content in the logs for Google Services

WHY
Currently, we use the Discovery SDK, which hides a lot of complexity. However, for troubleshooting, it could be useful to see the HTTP endpoint and the query body when running NCK in debug mode. This information is always requested when we contact support.

HOW
Add a log statement (debug level) before executing the request with the SDK. This log should contain as much information as possible about the request: endpoint, auth method, payload, etc.
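
A sketch of such a statement, assuming the googleapiclient HttpRequest object (which exposes method, uri and body attributes):

import logging

logger = logging.getLogger(__name__)

def execute_with_debug_log(request):
    # Log as much as possible about the request before executing it
    logger.debug(
        "Google API request: method=%s uri=%s body=%s",
        request.method, request.uri, request.body,
    )
    return request.execute()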

Ignore Pipenv files in the repo

Pipenv is a fairly common framework for managing virtual environments in Python.
The .gitignore of the repo should contain Pipfile and Pipfile.lock.

As a User, I can configure/launch NCK requests through API calls

WHY
Today, interacting with NCK through a CLI does not necessarily make sense. An API interface might be a better choice (and would be easier to integrate with the web UI that we are considering developing for non-tech users).

HOW
Move the developer interface from click to FastAPI.
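
A minimal sketch of what the FastAPI surface could look like (the endpoint shape and the run_pipeline helper are hypothetical):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class JobConfig(BaseModel):
    reader: str
    writer: str
    options: dict = {}

def run_pipeline(reader: str, writer: str, options: dict):
    # Hypothetical: resolve the reader/writer by name and stream data
    raise NotImplementedError

@app.post("/jobs")
def launch_job(config: JobConfig):
    run_pipeline(config.reader, config.writer, config.options)
    return {"status": "launched"}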

As a Developer, I can rely on a standardized APIClient class to authenticate requests on any reader

WHY
Today, each reader uses its own authentication method: either a client class or a helper method.
Client classes currently live under the nck/clients directory (after the refactor, clients shared by several sources will appear under nck/clients/, and source-level clients will appear under nck/readers/<source>/client.py).

HOW

Long term goal
Create a unique APIClient class that could handle the authentication process for all readers.
A standardization effort has already been undertaken with the creation of the APIClient class; however, this class is currently only used by the Yandex readers. As we might not be able to standardize all authentication methods into a single one (some APIs are quite specific), we suggest pursuing the intermediary goal below as a start.

Intermediary goal
Group sources sharing a similar authentication process (e.g. Google sources) into a single client class.
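
For illustration, a shared token-based client could look like this minimal sketch (not the actual APIClient implementation):

import requests

class APIClient:
    def __init__(self, base_url: str, token: str):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def get(self, endpoint: str, **params):
        # Authenticated GET returning the decoded JSON payload
        response = self.session.get(f"{self.base_url}/{endpoint}", params=params)
        response.raise_for_status()
        return response.json()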

Facebook reader: facebook-field vs facebook-desired-field behavior

Description

Expected:
As a user, if I specify a list of --facebook-desired-field options, I would expect them to be automatically added as --facebook-field options too, so that data is queried for at least all desired fields.

Current behavior:
Currently, if I only specify a list of --facebook-desired-field options, I get a dictionary with the desired fields as keys and blank values, because no data is queried.
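
A minimal sketch of the expected merge (function and variable names are illustrative, not the actual reader code):

def merge_fields(fields, desired_fields):
    # Every desired field should also be queried as a regular field;
    # dict.fromkeys dedupes while preserving order
    return list(dict.fromkeys([*fields, *desired_fields]))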

Add .dockerignore to project

There is currently no .dockerignore file, which results in a lot of useless files in the Docker image: for example, the tests folder, the README.md file, and the .github folder.

This is bad because it makes the Docker image bigger and increases build time.

Furthermore, it can add potential vulnerabilities.
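
For instance, a starting point based on the files mentioned above:

# .dockerignore
.git/
.github/
tests/
README.md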

As a Developer, I can run integration tests that are based on mock data

WHY
Currently, we don't have tests that verify the end-to-end workflow, because we can't share real customers' credentials with every contributor. So, it might be a good idea to look for sandbox environments in which to run our tests.
For example, the FB Marketing API has one.

HOW
Use Docker containers to reproduce a real-world environment and run different commands. Verify the output is correct.

Harmonize the "date" parameter (date_range, start_date, end_date,...) in the media and analytics readers.

Choose and implement a common behaviour across readers for the "date" parameters (date_range, start_date, ...).

Define priorities between the start_date/end_date and date_range parameters (and others), so that reader behaviour is simpler and more understandable for the User.

For example:
With the GoogleAds Reader, the connector doesn't understand a request with start_date, end_date and date_range.

Define a standard for every reader.

As a Developer, I can see a clear distinction between a reader, a CLI, a config and a helper module for each reader source

WHY
Today, the application modules do not have a single purpose:

  • The nck/readers/<source>_reader.py modules define both a CLI command and a reader class
  • Modules under the nck/helpers/ directory are a mix of configuration variables and helper functions

HOW
To make things clearer, implement the following architecture:

- nck/
-- readers/
--- <source>/
---- reader.py
---- cli.py # nck/entrypoint.py should point to these files
---- config.py
---- helper.py

As a User, I can configure/launch NCK requests by passing a .json file to the application

WHY
Today, generating NCK CLI commands requires a lot of effort. Every project team using the application in production had to develop a “command generator”, i.e. a tool building NCK CLI commands from a set of parameters defined in a .json file. It would be much more efficient if we could directly pass such a .json file to the application to configure/launch a request, as it would avoid making costly (and unnecessary) transformations.

HOW
Allow users to pass a .json file to the application. It can be done by setting up:

  • a second CLI that would take a single argument:
    python nck/entrypoint_json.py read_dbm --config-file <path to the .json config file>

To be backward compatible, we could think of the following architecture:

- nck/
-- entrypoint/
--- entrypoint.py # Current entrypoint, allowing multiple parameters as a CLI
--- entrypoint_json.py # New entrypoint, allowing a single .json input

The second entrypoint will need to be completely separate from the first one and to implement a fully new architecture, with a general formatter usable by every reader. The validation steps will be covered in a separate issue.
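
A minimal sketch of what entrypoint_json.py could look like (build_and_run_pipeline is a hypothetical helper):

import json
import click

def build_and_run_pipeline(config: dict):
    # Hypothetical: resolve the configured reader/writer and run them
    raise NotImplementedError

@click.command()
@click.option("--config-file", type=click.Path(exists=True), required=True)
def run(config_file):
    with open(config_file) as f:
        config = json.load(f)
    build_and_run_pipeline(config)

if __name__ == "__main__":
    run()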

BUG - Fix reader class being initialized 3 times

WHY
Currently, the reader class is instantiated twice before the explicit instantiation.

This takes place in the nck/entrypoint.py file:

def process_command_pipeline(provided_commands, normalize_keys):
    provided_readers = [cmd for cmd in provided_commands if isinstance(cmd(), Reader)]
    provided_writers = [cmd for cmd in provided_commands if isinstance(cmd(), Writer)]
    _validate_provided_commands(provided_readers, provided_writers)

    reader = provided_readers[0]
    for stream in reader().read():
        for writer in provided_writers:
            if normalize_keys and issubclass(stream.__class__, JSONStream):
                writer().write(NormalizedJSONStream.create_from_stream(stream))
            else:
                writer().write(stream)

As you can see, before the reader is instantiated on the for stream in reader().read(): line, it has already been instantiated twice while building the provided_readers and provided_writers list comprehensions.

You can verify this easily by adding logs to the reader's __init__ (if none are already present) and running a command for this reader: you will see that __init__ runs 3 times.

HOW:
This is due to the cmd() calls: a quick fix would be to compare cmd, and not cmd(), to another type (not Reader or Writer), or to instantiate each command only once, as sketched below. Maybe there are better solutions, or maybe it is preferable not to edit this.
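
For instance, one possible fix (a sketch against the snippet above, not a tested patch):

# Instantiate each command a single time, then classify the instances;
# _validate_provided_commands would need to accept instances rather
# than factories
instances = [cmd() for cmd in provided_commands]
provided_readers = [i for i in instances if isinstance(i, Reader)]
provided_writers = [i for i in instances if isinstance(i, Writer)]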

I'm not familiar with this part of the project so let me know what makes more sense to you guys.

First noticed in this PR: https://github.com/artefactory/nautilus-connectors-kit/pull/110#issue-577751388

As a User, I can provide all arguments as click options for the GCS Reader, S3 Reader, GCS Writer and BQ Writer

WHY
The following readers/writers are using implicit config.<VAR> variables:

  • gcs_reader, via the object_storage_reader
    • config.PROJECT_ID
  • s3_reader, via the object_storage_reader
    • config.REGION_NAME, config.AWS_ACCESS_KEY_ID, config.AWS_SECRET_ACCESS_KEY
  • gcs_writer
    • config.PROJECT_ID
  • bq_writer
    • config.PROJECT_ID

These variables are not provided by the user as click options, but as environment variables, which are retrieved by the following snippet in the nck/config.py module:

for key, var in os.environ.items():
    locals()[key] = var

This can be quite confusing for the user, and does not match the common NCK logic.

HOW
Explicitly implement these variables as click options, as sketched below.
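
For example (a sketch; the command name is hypothetical, and envvar keeps the current environment-variable behavior as a fallback):

import click

@click.command(name="write_gcs")
@click.option("--gcs-project-id", "project_id", envvar="PROJECT_ID", required=True)
def gcs_writer(project_id):
    # project_id now comes from an explicit option, falling back to the
    # PROJECT_ID environment variable for backward compatibility
    ...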

As a Developer, I can deploy the application on any cloud serverless service

WHY
Today, each individual project willing to use the application at a production scale needs to set up and deploy the underlying infrastructure, which can be a heavy process.

HOW
We could think of a Deploy button (integrated into the repo's README, or into the web UI that we are planning to develop) that would allow us, in just one click, to deploy the serverless infrastructure needed to run the application on the cloud platform of our choice (GCP Cloud Run, Amazon EKS, etc.).

As a User, I can launch a request on any reader using a pre-defined date_range option

WHY
Currently, only specific readers accept a --date_range parameter (it depends on whether the source API offers it).

Such a pre-defined --date_range option can be very convenient, in particular when conducting tests. In this case, you just have to specify --date-range PREVIOUS_WEEK for instance, instead of providing a specific --start-date and --end-date for your request.

HOW
Make the --date_range option available to all readers, including when it is not initially provided by the source API.

Distribute NCK as a command-line tool via PyPI.

The library does not necessarily need Docker to run. Using setuptools, we could distribute the script as a PyPI package, which would then be easily usable with just:

pip install nck
nck --help

This way, the package would also be more lightweight than the corresponding Docker image.
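
A minimal setup.py sketch (assuming the existing click app object in nck/entrypoint.py, as seen in the tracebacks above):

from setuptools import setup, find_packages

setup(
    name="nck",
    version="0.1.0",  # illustrative
    packages=find_packages(),
    entry_points={
        # Exposes the click app as an `nck` console command
        "console_scripts": ["nck=nck.entrypoint:app"],
    },
)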

As a User, I can launch a request knowing that all readers follow a common behavior on date parameters

WHY
Today, the behavior of readers expecting date parameters (start_date, end_date, date_range, etc.) is not harmonized.

In particular, readers usually do not know how to prioritize the date parameters given as inputs. For instance, the Google Ads reader crashes if you try to make a request including the start_date and end_date parameters plus a date_range parameter.

HOW
Define a convention for all readers, and implement it. A possible convention is sketched below.
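
For instance (a sketch; get_date_start_and_date_stop_from_date_range is the existing helper mentioned in the date_range issues):

def resolve_date_parameters(start_date=None, end_date=None, date_range=None):
    # Explicit dates win; mixing them with a date_range is a clear
    # error instead of a crash
    if start_date and end_date:
        if date_range:
            raise ValueError("Provide either start_date/end_date or date_range, not both.")
        return start_date, end_date
    if date_range:
        return get_date_start_and_date_stop_from_date_range(date_range)
    raise ValueError("Missing date parameters.")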

Handle OSError "File name too long" with local_writer

Current Traceback:
File "<...>/nautilus-connectors-kit/nck/writers/local_writer.py", line 46, in write
with open(path, "wb") as h:
OSError: [Errno 63] File name too long: '/Users/USER/Desktop/results_CustomReport_493-698-1849_820-172-9109_927-461-4978_710-234-1866_786-452-3513_622-647-8507_704-008-6401_991-374-9639_414-678-9460_804-138-9346_351-580-1999_640-731-4384_478-167-3186_328-616-4160_381-972-5546_322-562-1358_368-269-7323_291-846-3501_315-547-3070_583-565-5443_884-258-3942_443-453-4856_400-583-8413_600-472-5847_853-384-9584_2020-11-12-12-13-21.njson'

Potential solution:
Add a "file_name" parameter to the local_writer

As a User, I can stream reports in a .csv format

WHY
Today, the only output stream format available is .njson (i.e. a file with n lines, each line being a dictionary).
This format has two downsides:

  • It does not allow us to easily conduct preliminary analysis on the output data: .njson files cannot be directly forwarded to non-tech users, and cannot be put into a pandas DataFrame without undergoing preliminary transformations.
  • Some APIs natively return data in a .csv format: in these cases, we have to convert each line to a dictionary, which can cause parsing errors.

HOW
Create a .csv streamer.
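
A minimal sketch of such a streamer's core (illustrative; not wired into the actual NCK Stream API):

import csv
import io

def dicts_to_csv_lines(records):
    # Yield csv-formatted lines (header first) from an iterable of
    # dicts sharing the same keys
    buffer = io.StringIO()
    writer = None
    for record in records:
        if writer is None:
            writer = csv.DictWriter(buffer, fieldnames=list(record))
            writer.writeheader()
        writer.writerow(record)
        yield buffer.getvalue()
        buffer.seek(0)
        buffer.truncate(0)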

Documentation is weak

We need to add documentation covering:

  • How to pass credentials to each reader and writer.
  • How to develop locally.
  • How to launch a simple command.

In the end, what is important is: how to obtain credentials for all readers and writers, and where to find the documentation of available dimensions and metrics for each reader.

CampaignSetup Request - FacebookReader

Current behavior
There is no need to use the "field" parameter to request a Campaign with the FacebookReader; you only have to request the "desired_fields" you need.
The behavior is functional but has to be optimized.

Expected behavior
A more user-friendly use of the input parameters for the FacebookReader, especially "field" and "desired_field".

Suggestion
Handle "field" and "desired_field" parameters that progressively becomes unconsistent for the user.

As a User, I can launch a request on any reader using a pre-defined date_range option [Continuation]

WHY
Currently, only specific readers accept a --date_range parameter (it depends on whether the source API offers it).

Such a pre-defined --date_range option can be very convenient, in particular when conducting tests. In this case, you just have to specify --date-range PREVIOUS_WEEK for instance, instead of providing a specific --start-date and --end-date for your request.

HOW
Make the --date_range option available to all readers, including when it is not initially provided by the source API.

Case 1 - If the --date_range option is provided by the source API
Nothing to do

Case 2 - If the --date_range option is not provided by the source API
Make the option available, using the get_date_start_and_date_stop_from_date_range() function. Here is the implementation status for the readers falling within this scope:

  • DCM reader - To be done
  • SA360 - To be done
  • MyTarget - To be done
  • Adobe 2.0 - Implemented
  • Search Console - To be done
  • Twitter Reader - To be done
  • The Trade Desk - Implemented

Also see issue #67 for a description of how each reader handles date parameters.
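
For illustration, a pre-defined range such as PREVIOUS_WEEK boils down to simple date arithmetic (a sketch, not the actual helper):

from datetime import date, timedelta

def get_previous_week_range(today=None):
    # Monday-to-Sunday window for the week before the current one
    today = today or date.today()
    start = today - timedelta(days=today.weekday() + 7)
    return start, start + timedelta(days=6)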

BUG - Fix GCS Reader (deprecated)

WHY
The gcs_reader seems to be deprecated.
Its current purpose is to read structured files such as .csv, split them, and read them as structured data.
Instead, this reader should read files in GCS as raw data, in order to transfer them without any processing to an NCK writer.

HOW
Fix behavior of the gcs_reader

As a non-technical User, I have access to a short deck allowing me to quickly understand the added value of the application

WHY
We currently only have technical documentation, which is good, but it could be interesting to have a one-pager or a short deck explaining the value the project brings and the basic things to think about when integrating it into another project. Ideally, this document should be understandable by everyone, not only devs.

HOW
Create a public deck or one-pager presenting the interest of using NCK and the value it brings, as well as the main concepts that need to be understood to integrate it into a project.

[Security] Workflow lint_and_run_tests.yml is using vulnerable action actions/checkout

The workflow lint_and_run_tests.yml references the action actions/checkout using the v1 tag. However, this reference is missing commit a6747255bd19d7a757dbdda8c654a9f84db19839, which may contain a fix for some vulnerability.
The fix missing from this action version could be related to:
(1) a CVE fix
(2) an upgrade of a vulnerable dependency
(3) a fix for a secret leak, among others.
Please consider updating the reference to the action.
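
In .github/workflows/lint_and_run_tests.yml, the reference can be pinned to a newer release tag or directly to a full commit SHA, for example the commit mentioned above:

- uses: actions/checkout@a6747255bd19d7a757dbdda8c654a9f84db19839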
