artefactory / artefactory-connectors-kit

ACK is an E(T)L tool specialized in API data ingestion. It is accessible through a command-line interface. The application allows you to easily extract, stream, and load data (with minimal transformations) from the API source to the destination of your choice.

License: GNU Lesser General Public License v3.0

Dockerfile 0.06% Makefile 0.12% Python 99.80% Shell 0.02%
google-cloud-storage search-console radarly salesforce adobe-analytics amazon-s3 confluence facebook google-analytics dcm

artefactory-connectors-kit's People

Contributors

aderennes, adussarps, ali-bellamlih, bdavis9725, benoitbazouin, benoitgoujon, bibimorlet, cedric-magnan, d-tw, declin, dependabot[bot], gabrielleberanger, haypierre, jbcharrueyartefact, l2me, louisrdsc, mycaule, nathanvermeersch, pol-defont-reaulx, r-lp, senhajirhazi, tom-grivaud, vviers


artefactory-connectors-kit's Issues

Fix needed: the default .csv field_size_limit is exceeded while making requests with the DV360 reader

ERROR AND WHY:
While collecting data from the DV360 platform, I encountered this issue:

Traceback (most recent call last):
  File "nck/entrypoint.py", line 86, in <module>
    app()
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 1164, in invoke
    return _process_result(rv)
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 1101, in _process_result
    value = ctx.invoke(self.result_callback, value,
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "nck/entrypoint.py", line 68, in run
    writer.write(stream)
  File "/.../nautilus-connectors-kit/nck/writers/console_writer.py", line 44, in write
    buffer = file.read(1024)
  File "/.../nautilus-connectors-kit/nck/streams/stream.py", line 114, in readinto
    chunk = self.leftover or encode(next(iterable))
  File "/.../nautilus-connectors-kit/nck/utils/file_reader.py", line 36, in sdf_to_njson_generator
    for line in dict_reader:
  File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)

From the error message, I understood that I had to set csv.field_size_limit above the default limit of 131072.

HOW TO FIX IT AND FURTHER INVESTIGATION:
Adding the following line to the nck/utils/file_reader.py file made the error vanish, and I was able to get my result printed on the console (replace 10000000 with another limit, to be discussed):

csv.field_size_limit(10000000)

Even though this worked, I noticed that one field contained an outrageous number of IDs. Before setting this new csv.field_size_limit, it could be worth checking whether there is a mistake in the process that causes a field to contain far more IDs than it really should.
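
If the limit does have to be raised, a more defensive variant (a sketch, not what the repo currently does) is to push it as high as the platform allows:

import csv
import sys

def set_max_csv_field_size_limit():
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            break
        except OverflowError:
            # sys.maxsize can exceed the C long limit on some platforms
            limit //= 2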

As a User, I have documentation explaining how to launch a request on any reader

WHY
Today, documentation is missing for several readers:

  • Amazon S3
  • Google Cloud Storage
  • Oracle
  • MySQL
  • Radarly
  • Salesforce

As a consequence, using one of these 6 readers can be time-consuming, as you have to dig into the code and/or API documentation to understand the parameters that should be provided to the NCK command.

HOW
Create the missing documentation for each one of these 6 readers.
It should be inserted into this README and include the following sections:

  • Source API
  • How to obtain credentials
  • Quickstart (a command example; see the sketch below)
  • Parameters
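
As an illustration, each Quickstart could follow a command skeleton of this shape (command and option names are placeholders, to be replaced with each reader's actual parameters):

python nck/entrypoint.py read_<source> --<source>-<option> <VALUE> write_<destination>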

As a Developer, I have documentation explaining how the application is structured

WHY
Today, onboarding a new contributor on NCK is tough: the application structure is complex to understand, and the development of new readers/writers/streamers should follow a set of conventions that are not specified anywhere.

HOW
Create documentation describing:

  • the application structure
  • the process & conventions to follow to develop a new reader/writer/streamer

EXPLO - Try out the Airbyte solution

WHY
The solution proposed by the Airbyte start-up is very close to what we had in mind for NCK (open-source application, EL(T) approach, similar data sources, configuration through a UI or an API). Potential cons: the application is coded in Java, and there are doubts about scalability (single-node approach).

HOW
Test the solution and evaluate its pros and cons, so that we can identify what could be NCK's differentiators.

As a User or as a Contributor, I can access a dedicated section of the documentation focusing on my needs

WHY
In the documentation, the Getting Started page is a mix of information oriented towards end-users ("Launch your first NCK command", "Normalize field names") and contributors ("Set up your development environment", "Contribute").

WHAT
Create 2 distinct sections, focusing on the needs of each user profile. For instance:

For end-users

  • Installation (cloning the repo and installing requirements in a virtual environment)
  • Launch your first command
  • Normalize field names

For contributors

  • Development guidelines (linting, pre-commit hooks, TDD, documentation)
  • Application architecture: how to develop a new reader/stream/writer

As a user, I can see the HTTP request content in the logs for Google Services

WHY
Currently, we use the Discovery SDK, which hides a lot of complexity. However, for troubleshooting, it could be useful to see the HTTP endpoint and the query body when running NCK in debug mode. This information is always requested when we contact support.

HOW
Add a log statement (debug level) before executing the request with the SDK. This log should contain as much information as possible about the request: endpoint, auth method, payload, etc.
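
A sketch of such a statement, assuming the googleapiclient HttpRequest object (which exposes method, uri and body attributes):

import logging

logger = logging.getLogger(__name__)

def execute_with_debug_log(request):
    # Log as much as possible about the request before executing it
    logger.debug(
        "Google API request: method=%s uri=%s body=%s",
        request.method, request.uri, request.body,
    )
    return request.execute()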

Ignore Pipenv files in the repo

Pipenv is a fairly common framework for managing virtual environments in Python.
The .gitignore of the repo should contain Pipfile and Pipfile.lock.

As a User, I can configure/launch NCK requests through API calls

WHY
Today, interacting with NCK through a CLI does not necessarily make sense. An API interface might be a better choice (and would be easier to integrate with the web UI that we are considering developing for non-tech users).

HOW
Move the developer interface from click to FastAPI.
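
A minimal sketch of what the FastAPI surface could look like (the endpoint shape and the run_pipeline helper are hypothetical):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class JobConfig(BaseModel):
    reader: str
    writer: str
    options: dict = {}

def run_pipeline(reader: str, writer: str, options: dict):
    # Hypothetical: resolve the reader/writer by name and stream data
    raise NotImplementedError

@app.post("/jobs")
def launch_job(config: JobConfig):
    run_pipeline(config.reader, config.writer, config.options)
    return {"status": "launched"}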

As a Developer, I can rely on a standardized APIClient class to authenticate requests on any reader

WHY
Today, each reader uses its own authentication method: either a client class or a helper method.
Client classes currently live under the nck/clients directory (after the refactor, clients shared by several sources will appear under nck/clients/, and source-level clients will appear under nck/readers/<source>/client.py).

HOW

Long term goal
Create a unique APIClient class that could handle the authentication process for all readers.
A standardization effort has already been undertaken with the creation of the APIClient class; however, this class is currently only used by the Yandex readers. As we might not be able to standardize all authentication methods into a single one (some APIs are quite specific), we suggest pursuing the intermediary goal below as a start.

Intermediary goal
Group sources sharing a similar authentication process (e.g. Google sources) into a single client class.
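
For illustration, a shared token-based client could look like this minimal sketch (not the actual APIClient implementation):

import requests

class APIClient:
    def __init__(self, base_url: str, token: str):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def get(self, endpoint: str, **params):
        # Authenticated GET returning the decoded JSON payload
        response = self.session.get(f"{self.base_url}/{endpoint}", params=params)
        response.raise_for_status()
        return response.json()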

Facebook reader: facebook-field vs facebook-desired-field behavior

Description

Expected:
As a user, if I specify a list of --facebook-desired-field options, I would expect them to be automatically added as --facebook-field options too, so that data is queried for at least all desired fields.

Current behavior:
Currently, if I only specify a list of --facebook-desired-field options, I get a dictionary with the desired fields as keys and blank values, because no data is queried.
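
A minimal sketch of the expected merge (function and variable names are illustrative, not the actual reader code):

def merge_fields(fields, desired_fields):
    # Every desired field should also be queried as a regular field;
    # dict.fromkeys dedupes while preserving order
    return list(dict.fromkeys([*fields, *desired_fields]))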

Add .dockerignore to project

There is currently no .dockerignore file, which results in a lot of useless files in the Docker image: for example, the tests folder, the README.md file, and the .github folder.

This is bad because it makes the Docker image bigger and increases build time.

Furthermore, it can add potential vulnerabilities.
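
For instance, a starting point based on the files mentioned above:

# .dockerignore
.git/
.github/
tests/
README.md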

As a Developer, I can run integration tests that are based on mock data

WHY
Currently, we don't have tests that verify the end-to-end workflow, because we can't share real customers' credentials with every contributor. So, it might be a good idea to look for sandbox environments in which to run our tests.
For example, the FB Marketing API has one.

HOW
Use Docker containers to reproduce a real-world environment and run different commands. Verify the output is correct.

Harmonize the "date" parameter (date_range, start_date, end_date,...) in the media and analytics readers.

Choose and implement a common behaviour across readers for the "date" parameters (date_range, start_date, ...).

Define priorities between the start_date/end_date and date_range parameters (and others), so that reader behaviour is simpler and more understandable for the User.

For example:
With the GoogleAds Reader, the connector doesn't understand a request with start_date, end_date and date_range.

Define a standard for every reader.

As a Developer, I can see a clear distinction between a reader, a CLI, a config and a helper module for each reader source

WHY
Today, the application modules do not have a single purpose:

  • The nck/readers/<source>_reader.py modules define both a CLI command and a reader class
  • Modules under the nck/helpers/ directory are a mix of configuration variables and helper functions

HOW
To make things clearer, implement the following architecture:

- nck/
-- readers/
--- <source>/
---- reader.py
---- cli.py # nck/entrypoint.py should point to these files
---- config.py
---- helper.py

As a User, I can configure/launch NCK requests by passing a .json file to the application

WHY
Today, generating NCK CLI commands requires a lot of effort. Every project team using the application in production had to develop a “command generator”, i.e. a tool building NCK CLI commands from a set of parameters defined in a .json file. It would be much more efficient if we could directly pass such a .json file to the application to configure/launch a request, as it would avoid making costly (and unnecessary) transformations.

HOW
Allow users to pass a .json file to the application. It can be done by setting up:

  • a second CLI that would take a single argument:
    python nck/entrypoint_json.py read_dbm --config-file <path to the .json config file>

To be backward compatible, we could think of the following architecture:

- nck/
-- entrypoint/
--- entrypoint.py # Current entrypoint, allowing multiple parameters as a CLI
--- entrypoint_json.py # New entrypoint, allowing a single .json input

The second entrypoint will need to be completely separate from the first one and to implement a fully new architecture, with a general formatter usable by every reader. The validation steps will be covered in a separate issue.
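
A minimal sketch of what entrypoint_json.py could look like (build_and_run_pipeline is a hypothetical helper):

import json
import click

def build_and_run_pipeline(config: dict):
    # Hypothetical: resolve the configured reader/writer and run them
    raise NotImplementedError

@click.command()
@click.option("--config-file", type=click.Path(exists=True), required=True)
def run(config_file):
    with open(config_file) as f:
        config = json.load(f)
    build_and_run_pipeline(config)

if __name__ == "__main__":
    run()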

BUG - Fix reader class being initialized 3 times

WHY
Currently, the reader class is instantiated twice before the explicit instantiation.

This takes place in the nck/entrypoint.py file:

def process_command_pipeline(provided_commands, normalize_keys):
    provided_readers = [cmd for cmd in provided_commands if isinstance(cmd(), Reader)]
    provided_writers = [cmd for cmd in provided_commands if isinstance(cmd(), Writer)]
    _validate_provided_commands(provided_readers, provided_writers)

    reader = provided_readers[0]
    for stream in reader().read():
        for writer in provided_writers:
            if normalize_keys and issubclass(stream.__class__, JSONStream):
                writer().write(NormalizedJSONStream.create_from_stream(stream))
            else:
                writer().write(stream)

As you can see, before the reader is instantiated on the for stream in reader().read(): line, it has already been instantiated twice while building the provided_readers and provided_writers list comprehensions.

You can verify this easily by adding logs to the reader's __init__ (if none are already present) and running a command for this reader: you will see that __init__ runs 3 times.

HOW:
This is due to the cmd() calls: a quick fix would be to compare cmd, and not cmd(), to another type (not Reader or Writer), or to instantiate each command only once, as sketched below. Maybe there are better solutions, or maybe it is preferable not to edit this.
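
For instance, one possible fix (a sketch against the snippet above, not a tested patch):

# Instantiate each command a single time, then classify the instances;
# _validate_provided_commands would need to accept instances rather
# than factories
instances = [cmd() for cmd in provided_commands]
provided_readers = [i for i in instances if isinstance(i, Reader)]
provided_writers = [i for i in instances if isinstance(i, Writer)]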

I'm not familiar with this part of the project so let me know what makes more sense to you guys.

First noticed in this PR: https://github.com/artefactory/nautilus-connectors-kit/pull/110#issue-577751388

As a User, I can provide all arguments as click options for the GCS Reader, S3 Reader, GCS Writer and BQ Writer

WHY
The following readers/writers are using implicit config.<VAR> variables:

  • gcs_reader, via the object_storage_reader
    • config.PROJECT_ID
  • s3_reader, via the object_storage_reader
    • config.REGION_NAME, config.AWS_ACCESS_KEY_ID, config.AWS_SECRET_ACCESS_KEY
  • gcs_writer
    • config.PROJECT_ID
  • bq_writer
    • config.PROJECT_ID

These variables are not provided by the user as click options, but as environment variables, which are retrieved by the following snippet in the nck/config.py module:

for key, var in os.environ.items():
    locals()[key] = var

This can be quite confusing for the user, and does not match the common NCK logic.

HOW
Explicitly implement these variables as click options, as sketched below.
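
For example (a sketch; the command name is hypothetical, and envvar keeps the current environment-variable behavior as a fallback):

import click

@click.command(name="write_gcs")
@click.option("--gcs-project-id", "project_id", envvar="PROJECT_ID", required=True)
def gcs_writer(project_id):
    # project_id now comes from an explicit option, falling back to the
    # PROJECT_ID environment variable for backward compatibility
    ...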

As a Developer, I can deploy the application on any cloud serverless service

WHY
Today, each individual project willing to use the application at a production scale needs to set up and deploy the underlying infrastructure, which can be a heavy process.

HOW
We could think of a Deploy button (integrated into the repo's README, or into the web UI that we are planning to develop) that would allow us, in just one click, to deploy the serverless infrastructure needed to run the application on the cloud platform of our choice (GCP Cloud Run, Amazon EKS, etc.).

As a User, I can launch a request on any reader using a pre-defined date_range option

WHY
Currently, only specific readers accept a --date_range parameter (it depends on whether the source API offers it).

Such a pre-defined --date_range option can be very convenient, in particular when conducting tests. In this case, you just have to specify --date-range PREVIOUS_WEEK for instance, instead of providing a specific --start-date and --end-date for your request.

HOW
Make the --date_range option available to all readers, including when it is not initially provided by the source API.

Distribute NCK as a command-line tool via PyPI.

The library does not necessarily need Docker to run. Using setuptools, we could distribute the script as a PyPI package, which would then be easily usable with just:

pip install nck
nck --help

This way, the package would also be more lightweight than the corresponding Docker image.
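
A minimal setup.py sketch (assuming the existing click app object in nck/entrypoint.py, as seen in the tracebacks above):

from setuptools import setup, find_packages

setup(
    name="nck",
    version="0.1.0",  # illustrative
    packages=find_packages(),
    entry_points={
        # Exposes the click app as an `nck` console command
        "console_scripts": ["nck=nck.entrypoint:app"],
    },
)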

As a User, I can launch a request knowing that all readers follow a common behavior on date parameters

WHY
Today, the behavior of readers expecting date parameters (start_date, end_date, date_range, etc.) is not harmonized.

In particular, readers usually do not know how to prioritize the date parameters given as inputs. For instance, the Google Ads reader crashes if you try to make a request including the start_date and end_date parameters plus a date_range parameter.

HOW
Define a convention for all readers, and implement it. A possible convention is sketched below.
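
For instance (a sketch; get_date_start_and_date_stop_from_date_range is the existing helper mentioned in the date_range issues):

def resolve_date_parameters(start_date=None, end_date=None, date_range=None):
    # Explicit dates win; mixing them with a date_range is a clear
    # error instead of a crash
    if start_date and end_date:
        if date_range:
            raise ValueError("Provide either start_date/end_date or date_range, not both.")
        return start_date, end_date
    if date_range:
        return get_date_start_and_date_stop_from_date_range(date_range)
    raise ValueError("Missing date parameters.")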

Handle OSError "File name too long" with local_writer

Current Traceback:
File "<...>/nautilus-connectors-kit/nck/writers/local_writer.py", line 46, in write
with open(path, "wb") as h:
OSError: [Errno 63] File name too long: '/Users/USER/Desktop/results_CustomReport_493-698-1849_820-172-9109_927-461-4978_710-234-1866_786-452-3513_622-647-8507_704-008-6401_991-374-9639_414-678-9460_804-138-9346_351-580-1999_640-731-4384_478-167-3186_328-616-4160_381-972-5546_322-562-1358_368-269-7323_291-846-3501_315-547-3070_583-565-5443_884-258-3942_443-453-4856_400-583-8413_600-472-5847_853-384-9584_2020-11-12-12-13-21.njson'

Potential solution:
Add a "file_name" parameter to the local_writer

As a User, I can stream reports in a .csv format

WHY
Today, the only output stream format available is .njson (i.e. a file with n lines, each line being a dictionary).
This format has two downsides:

  • It does not allow us to easily conduct preliminary analysis on the output data: .njson files cannot be directly forwarded to non-tech users, and cannot be put into a pandas DataFrame without undergoing preliminary transformations.
  • Some APIs natively return data in a .csv format: in these cases, we have to convert each line to a dictionary, which can cause parsing errors.

HOW
Create a .csv streamer.
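
A minimal sketch of such a streamer's core (illustrative; not wired into the actual NCK Stream API):

import csv
import io

def dicts_to_csv_lines(records):
    # Yield csv-formatted lines (header first) from an iterable of
    # dicts sharing the same keys
    buffer = io.StringIO()
    writer = None
    for record in records:
        if writer is None:
            writer = csv.DictWriter(buffer, fieldnames=list(record))
            writer.writeheader()
        writer.writerow(record)
        yield buffer.getvalue()
        buffer.seek(0)
        buffer.truncate(0)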

Documentation is weak

We need to add documentation covering:

  • How to pass credentials to each reader and writer.
  • How to develop locally.
  • How to launch a simple command.

In the end, what is important is: how to obtain credentials for all readers and writers, and where to find the documentation of available dimensions and metrics for each reader.

CampaignSetup Request - FacebookReader

Current behavior
There is no need to use the "field" parameter to request a Campaign with the FacebookReader; you only have to request the "desired_fields" you need.
The behavior is functional but has to be optimized.

Expected behavior
A more user-friendly use of the input parameters for the FacebookReader, especially "field" and "desired_field".

Suggestion
Handle "field" and "desired_field" parameters that progressively becomes unconsistent for the user.

As a User, I can launch a request on any reader using a pre-defined date_range option [Continuation]

WHY
Currently, only specific readers accept a --date_range parameter (it depends on whether the source API offers it).

Such a pre-defined --date_range option can be very convenient, in particular when conducting tests. In this case, you just have to specify --date-range PREVIOUS_WEEK for instance, instead of providing a specific --start-date and --end-date for your request.

HOW
Make the --date_range option available to all readers, including when it is not initially provided by the source API.

Case 1 - If the --date_range option is provided by the source API
Nothing to do

Case 2 - If the --date_range option is not provided by the source API
Make the option available, using the get_date_start_and_date_stop_from_date_range() function. Here is the implementation status for the readers falling within this scope:

  • DCM reader - To be done
  • SA360 - To be done
  • MyTarget - To be done
  • Adobe 2.0 - Implemented
  • Search Console - To be done
  • Twitter Reader - To be done
  • The Trade Desk - Implemented

Also see issue #67 for a description of how each reader handles date parameters.
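
For illustration, a pre-defined range such as PREVIOUS_WEEK boils down to simple date arithmetic (a sketch, not the actual helper):

from datetime import date, timedelta

def get_previous_week_range(today=None):
    # Monday-to-Sunday window for the week before the current one
    today = today or date.today()
    start = today - timedelta(days=today.weekday() + 7)
    return start, start + timedelta(days=6)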

BUG - Fix GCS Reader (deprecated)

WHY
The gcs_reader seems to be deprecated.
Its current purpose is to read structured files such as .csv, split them, and read them as structured data.
Instead, this reader should read files in GCS as raw data, in order to transfer them without any processing to an NCK writer.

HOW
Fix behavior of the gcs_reader

As a non-technical User, I have access to a short deck allowing me to quickly understand the added value of the application

WHY
We currently only have technical documentation, which is good, but it could be interesting to have a one-pager or a short deck explaining the value the project brings and the basic things to think about when integrating it into another project. Ideally, this document should be understandable by everyone, not only devs.

HOW
Create a public deck or one-pager presenting the interest of using NCK and the value it brings, as well as the main concepts that need to be understood to integrate it into a project.

[Security] Workflow lint_and_run_tests.yml is using vulnerable action actions/checkout

The workflow lint_and_run_tests.yml references the action actions/checkout using the v1 tag. However, this reference is missing commit a6747255bd19d7a757dbdda8c654a9f84db19839, which may contain a fix for some vulnerability.
The fix missing from this action version could be related to:
(1) a CVE fix
(2) an upgrade of a vulnerable dependency
(3) a fix for a secret leak, among others.
Please consider updating the reference to the action.
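
In .github/workflows/lint_and_run_tests.yml, the reference can be pinned to a newer release tag or directly to a full commit SHA, for example the commit mentioned above:

- uses: actions/checkout@a6747255bd19d7a757dbdda8c654a9f84db19839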
