
data-catalog

This service is a backend to the "Data Catalog" tab in Console. It is used to store, retrieve, and search metadata describing data sets downloaded into the Trusted Analytics platform.

Service dependencies

  • ElasticSearch - the backing metadata store
  • Downloader (Trusted Analytics platform) - called to delete the data sets' underlying data
  • Dataset Publisher - called to delete the data sets' Hive views

Basic handling

Initial setup

  • You need Python (of course).
  • Install pip: sudo apt-get install python-pip
  • Install Tox: sudo pip install tox

Tests

  • Change to the data-catalog directory (the project's source directory).
  • Run: tox (the first run will take a while) or tox -r (if you have added something to requirements.txt)

Configuration

Configuration is handled through environment variables. They can be set in the "env" section of the CF (Cloud Foundry) manifest. Parameters:

  • LOG_LEVEL - The application's logging level. Should be set to one of the logging levels from Python's logging module (e.g. DEBUG, INFO, WARNING, ERROR, FATAL). Defaults to DEBUG if the parameter is not set.
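For example, the "env" section of a CF manifest could look like the fragment below (a minimal sketch; the application name and chosen level are illustrative):

```yaml
applications:
  - name: data-catalog
    env:
      LOG_LEVEL: WARNING
```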

Tools

There are a few development tools for handling or setting up data in data-catalog:

  • [Local setup tool](#local-development-tools)
  • [Migration tool](tools/ELASTIC_MIGRATE_README.md)

API documentation

Development

General

  • Everything should be done in a Python virtual environment (virtualenv).
  • To switch the command line to the project's virtualenv, run source .tox/py27/bin/activate. Run deactivate to leave the virtualenv.
  • Downloading additional dependencies (libraries): pip install <library_name>
  • Install the bumpversion tool using sudo pip install bumpversion and run bumpversion patch --allow-dirty to bump the version before committing code that will go to master.

Managing requirements

  • Dependencies need to be put in requirements.txt, requirements-normal.txt and requirements-native.txt.
  • This split exists because we need to support deployments to offline environments using the Python buildpack, and some of our dependencies don't handle offline mode well.
  • requirements-normal.txt contains pure Python dependencies that may be downloaded from the Internet. It's used by Tox and when downloading dependencies for the offline package.
  • requirements-native.txt contains packages that have native components and may be downloaded from the Internet. It's used by Tox and when downloading dependencies for the offline package.
  • requirements.txt is requirements-normal.txt and requirements-native.txt combined (in that order), but with all links to source control replaced by the dependency name and version. It is used during offline package installation, alongside the dependencies downloaded to the "vendor" folder.
  • When adding new dependencies, update the requirements files appropriately.
  • pipdeptree will help you find the dependencies that you need to put in the requirements files. They need to contain the dependencies (and their dependencies, the whole trees) of the actual app code and of the tests, but not those of helper tools like pylint and pipdeptree.
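To illustrate the rules above, here is a hypothetical example of how the three files relate (all package names and versions are made up for illustration):

```text
# requirements-normal.txt -- pure Python, fetched from the Internet
-e git+https://github.com/example-org/somelib.git@v1.2.0#egg=somelib

# requirements-native.txt -- packages with native components
numpy==1.9.2

# requirements.txt -- the two files above combined (normal, then native),
# with source-control links replaced by name and version;
# used for offline installation from the "vendor" folder
somelib==1.2.0
numpy==1.9.2
```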

Local setup

  1. [Install ElasticSearch](https://www.elastic.co/downloads/elasticsearch) on your development machine.
  2. Change the cluster.name property in ELASTIC_SEARCH_DIR/config/elasticsearch.yml to a unique name. This will prevent your instance of ES from automatically merging into a cluster with others on the local network.
  3. Run ElasticSearch.
  4. Download and run the User Management app (available in the same GitHub organization as this project).
  5. Set a VCAP_SERVICES variable that would normally be set by CF: export VCAP_SERVICES='{"user-provided":[{"credentials":{"tokenKey":"http://uaa.example.com/token_key"},"tags":[],"name":"sso", "label":"user-provided"}]}'
  6. [Install the NATS service](https://nats.io/) or download and configure the Latest Events Service app (also available in the same GitHub organization). If using settings other than the NATS defaults, configure VCAP_SERVICES accordingly. Latest Events Service is configured to work with subjects that start with the 'platform.' prefix. Example settings:
  7. Run the data-catalog service locally (the first run prepares the index): python -m data_catalog.app
  8. Additional: some functions require the Downloader and Dataset Publisher apps (also from the same GitHub organization as this project).
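The VCAP_SERVICES value set in step 5 is plain JSON, so it is easy to inspect from your own scripts. A minimal Python 3 sketch (not the service's actual parsing code) that extracts the SSO token key URL from that example value:

```python
import json
import os

# Example VCAP_SERVICES value from step 5 above (normally injected by
# Cloud Foundry); the UAA host is the illustrative one from the README.
EXAMPLE_VCAP = json.dumps({
    "user-provided": [{
        "credentials": {"tokenKey": "http://uaa.example.com/token_key"},
        "tags": [],
        "name": "sso",
        "label": "user-provided",
    }]
})

def sso_token_key(vcap_json):
    """Return the UAA token key URL from the 'sso' user-provided service."""
    for service in json.loads(vcap_json).get("user-provided", []):
        if service.get("name") == "sso":
            return service["credentials"]["tokenKey"]
    return None

if __name__ == "__main__":
    vcap = os.environ.get("VCAP_SERVICES", EXAMPLE_VCAP)
    print(sso_token_key(vcap))
```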

Local development tools

  • Fill the local ElasticSearch index with data (do this after preparing the index): python -m tools.local_index_setup fill
  • Generate another set of example metadata: python -m tools.local_index_setup generate <entry_number>
  • To delete the index, run: python -m tools.local_index_setup delete

Integration with PyCharm / IntelliJ with Python plugin

  • Run tox in project's source folder to create a virtualenv.
  • If you have a new version of PyCharm/IntelliJ IDEA you might need to remove the .tox folder from exclusions in Python projects. Follow [this resolution](https://youtrack.jetbrains.com/issue/PY-16141#comment=27-1015284).
  • File -> New Project -> Python
  • In "Project SDK" choose "New" -> "Add Local".
  • Select Python executable in .tox directory created in your source folder (enable showing hidden files first).
  • Skip ahead, then set "Project Location" to the source folder.
  • Add new "Python tests (Unittests)" run configuration. Choose "All in folder" with the folder being <source folder>/tests. You can use this configuration to debug tests.
  • Go to File -> Project Structure, then mark .tox folder as excluded.

Advanced handling

Test queries on local elastic search

  • Install the Marvel plugin: bin/plugin -i elasticsearch/marvel/latest
  • Go to [http://localhost:9200/_plugin/marvel/sense/index.html](http://localhost:9200/_plugin/marvel/sense/index.html)
  • Once you fill your ElasticSearch index with example data (see the Development section), you can test queries against it.
  • List the indices: GET _cat/indices (you will also see .marvel indices produced by the plugin; to remove them run DELETE .marvel*)
  • GET trustedanalytics-meta/ lists the setup of the 'trustedanalytics-meta' index (mappings, settings, etc.)
  • Example of a query on a chosen field:
GET trustedanalytics-meta/dataset/_search
{
  "query":{
    "match": {
      "dataSample": "something"
    }
  }
}
  • Example of a fuzzy query:
GET trustedanalytics-meta/dataset/_search
{
  "query": {
    "match": {
      "title": {
        "query": "tneft",
        "fuzziness": 1
      }
    }
  }
}
  • Example of a query against our custom analyzer (called uri_analyzer): GET trustedanalytics-meta/_analyze?analyzer=uri_analyzer&text='http://some-addres.example.com/dataset'
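The search queries above can also be built and sent without Marvel. A minimal Python 3 sketch using only the standard library (it assumes a local ElasticSearch on localhost:9200 with the trustedanalytics-meta index already prepared; the helper names are illustrative, not part of the service):

```python
import json
import urllib.request

SEARCH_URL = "http://localhost:9200/trustedanalytics-meta/dataset/_search"

def build_match_query(field, text, fuzziness=None):
    """Build a 'match' query body like the examples above."""
    if fuzziness is None:
        match = {field: text}
    else:
        match = {field: {"query": text, "fuzziness": fuzziness}}
    return {"query": {"match": match}}

def search(query):
    """POST the query to the local ElasticSearch instance (requires a running ES)."""
    request = urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

if __name__ == "__main__":
    # Same fuzzy-title query as the example above; pass the result to
    # search() to run it against a live local index.
    print(json.dumps(build_match_query("title", "tneft", fuzziness=1), indent=2))
```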

Contributors

dmalecka, klaudiaj, mbultrow, pgrabusz, pmilewsk
