rfd-discovery's Introduction

rfd-discovery

Description

This project, written in Python and Cython, deals with the discovery of Relaxed Functional Dependencies (RFDs) [1] using a bottom-up approach: instead of taking a fixed threshold as input and then finding all the RFDs, this method infers the distances for the different RHS attributes by itself and then discovers the RFDs that hold for them.
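To make the idea of a relaxed dependency more concrete, here is a minimal sketch of how one could check whether a single RFD holds on a small table. It assumes the usual reading of an RFD (every pair of rows that is close on all LHS attributes must also be close on the RHS attribute) and uses the absolute difference as the distance. The function name `rfd_holds` and the toy data are hypothetical and are not part of rfd-discovery.

```python
from itertools import combinations


def rfd_holds(rows, lhs_thresholds, rhs_attr, rhs_threshold):
    """Check one candidate RFD on a list of dict rows (illustrative only).

    lhs_thresholds maps each LHS attribute to its maximum allowed distance.
    Distance is the absolute difference, which is an assumption of this sketch.
    """
    for a, b in combinations(rows, 2):
        close_on_lhs = all(
            abs(a[attr] - b[attr]) <= threshold
            for attr, threshold in lhs_thresholds.items()
        )
        # The RFD is violated by a pair that is close on the LHS but far on the RHS.
        if close_on_lhs and abs(a[rhs_attr] - b[rhs_attr]) > rhs_threshold:
            return False
    return True


# Hypothetical toy table
rows = [
    {"height": 170, "weight": 65, "shoe": 41},
    {"height": 171, "weight": 67, "shoe": 42},
    {"height": 185, "weight": 90, "shoe": 45},
]

tight = rfd_holds(rows, {"height": 2, "weight": 3}, "shoe", 1)
loose = rfd_holds(rows, {"height": 20, "weight": 30}, "shoe", 1)
print(tight, loose)
```

With tight LHS thresholds only the first two rows are compared on the RHS, so the dependency holds; with loose thresholds the third row is pulled in and violates it. A bottom-up method avoids asking the user for these thresholds up front.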

rfd-discovery takes a dataset, representing a table of a relational database, in CSV format as input and prints the set of the discovered RFDs.

The CSV file can contain values of the following types:

  • int;
  • int32;
  • int64;
  • float;
  • float64;
  • string;
  • datetime64*.

* For dates, you can use any of the formats that pandas can parse.
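Once two values are read as datetimes, a distance between them is simply a time difference. The following sketch shows one possible way to express that difference in ms, sec or min using only the standard library. The function name `datetime_distance`, the fixed input format and the unit names are assumptions made for illustration; rfd-discovery itself relies on pandas for date parsing.

```python
from datetime import datetime, timedelta

# Units this sketch supports (an assumption, mirroring "ms, sec, min").
UNITS = {
    "ms": timedelta(milliseconds=1),
    "sec": timedelta(seconds=1),
    "min": timedelta(minutes=1),
}


def datetime_distance(a, b, unit="sec", fmt="%Y-%m-%d %H:%M:%S"):
    """Return the absolute time difference between two date strings."""
    delta = abs(datetime.strptime(a, fmt) - datetime.strptime(b, fmt))
    # Dividing two timedeltas yields a plain float in the chosen unit.
    return delta / UNITS[unit]


print(datetime_distance("2017-01-01 10:00:00", "2017-01-01 10:30:00", unit="min"))
```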

Requirements

rfd-discovery is developed using Python 3.5, a C compiler (gcc or Visual Studio C++) and Cython 0.25.2; the latter is used to reduce time and memory consumption in CPU-bound operations.

To run rfd-discovery correctly, you have to install Python 3.5 and Cython 0.25. To install all the requirements correctly, you need pip 9.0 (or higher).

rfd-discovery uses the following Python libraries:

  • matplotlib✛;
  • numpy✛;
  • pandas✛;
  • tornado;
  • Cython;
  • nltk;
  • flask.

You can install these by following the Setup Section.

✛ These libraries are part of the SciPy stack.

Setup

In order to install rfd-discovery and all its requirements, you have to create a virtual environment using virtualenv on Python 3.5. To install virtualenv, run the following:

[sudo] pip3 install virtualenv on Linux/macOS, or pip install virtualenv from a prompt opened as administrator on Windows.

To create a virtual environment, in the main directory of the project run:

virtualenv venv

To activate the virtual environment, in the main directory of the project run:

source venv/bin/activate on Linux/macOS or venv\Scripts\activate on Windows.

You can check whether the virtual environment is active by verifying that the command prompt shows the prefix (venv).
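If the prompt prefix is hidden or customised, the same check can be done from Python itself. This is a small sketch, not something shipped with rfd-discovery: it relies on the fact that inside a virtual environment `sys.prefix` differs from the base interpreter prefix (older virtualenv releases expose the latter as `sys.real_prefix`, newer ones and venv as `sys.base_prefix`). The function name is hypothetical.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv sets sys.real_prefix; venv and newer virtualenv set sys.base_prefix.
    base_prefix = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base_prefix != sys.prefix


print("virtual environment active:", in_virtual_environment())
```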

To install all the requirements, run the following:

pip install -r requirements.txt

This should install, using pip, all the requirements.

To install WordNet, run:

python setup.py install

Build

Part of rfd-discovery is written in Cython, a superset of the Python programming language designed to give C-like performance with code that is mostly written in Python. This is because the operations performed by the code are mostly CPU-bound and would otherwise waste computation and memory resources.
You can compile the Cython code by running the following:

python build.py build_ext --inplace

This will generate C code from the Cython code and will try to compile it.

Note that you'll need gcc or another C compiler.

If the build phase ends without errors, you should have some .c and .pyd (or .so, depending on your OS) files. Don't worry about dealing with these, Python does it automatically :).
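To confirm that the build actually produced those files, you could list them with a few lines of Python. This is only a convenience sketch under the assumption that the generated files sit somewhere below the directory you pass in; the helper name `find_build_artifacts` is hypothetical and not part of the project.

```python
from pathlib import Path

# File suffixes mentioned above: generated C sources and compiled extensions.
ARTIFACT_SUFFIXES = {".c", ".pyd", ".so"}


def find_build_artifacts(root="."):
    """Return the sorted paths under root whose suffix looks like a build artifact."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix in ARTIFACT_SUFFIXES
    )
```

Calling `find_build_artifacts(".")` from the main directory of the project would also pick up files inside `venv`, so pointing it at the source package directory is likely more useful.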

Usage

Using rfd-discovery is easy enough. Just run the following command:

python3 main.py -c <csv-file> [options]

  • -c <csv-file>: the path of the dataset on which you want to discover RFDs.

Options:

  • -v: display the version number;
  • -s <sep>: the separator character used in your CSV file. If you don't provide it, rfd-discovery tries to infer it for you;
  • -h: indicates that the CSV file has a header row. If you don't provide it, rfd-discovery tries to infer it for you;
  • -r <rhs_index>: the column index of the RHS attribute. It must be a valid integer. You can omit it only if you also omit the LHS attributes (rfd-discovery will then look for RFDs using each attribute in turn as RHS and the remaining ones as LHS);
  • -l <lhs_index_1,lhs_index_2,...,lhs_index_k>: column indexes of the LHS attributes, separated by commas (e.g. 1,2,3). You can omit them:
    if you also omit the RHS index, rfd-discovery looks for RFDs using each attribute in turn as RHS and the remaining ones as LHS;
    if you specify a valid RHS index, the remaining attributes are used as LHS;
  • -i <index_col>: the column that contains the primary key of the dataset. When it is specified, the program does not compute distances on it. NOTE: the index column should contain unique values;
  • -d <datetime columns>: a comma-separated list of columns whose values are in datetime format. When it is specified, rfd-discovery can express the distance between two dates in time units (e.g. ms, sec, min);
  • --semantic: use a semantic distance based on WordNet for strings;
  • --human: print the RFDs to the standard output in a human-readable form;
  • --help: show help.
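The -s and -h options mention that the separator and the header row can be inferred. How rfd-discovery does this is not described here; as a rough illustration of the general technique, the standard library's `csv.Sniffer` can guess both from a text sample. The function name `guess_csv_layout`, the candidate separators and the sample data are assumptions of this sketch.

```python
import csv


def guess_csv_layout(sample, candidates=";,\t|"):
    """Guess the separator and header presence from a CSV text sample.

    Illustrative only: this uses csv.Sniffer heuristics, which may differ
    from whatever rfd-discovery does internally.
    """
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample, delimiters=candidates)
    return dialect.delimiter, sniffer.has_header(sample)


# Hypothetical sample: a header row followed by numeric data rows.
sample = "id;age;score\n1;31;7.5\n2;27;8.0\n3;45;6.5\n"
separator, has_header = guess_csv_layout(sample)
print(repr(separator), has_header)
```

These heuristics can fail on small or irregular files, which is a good reason to pass -s and -h explicitly when you already know the layout.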
Valid Examples:
Check on each combination of attributes:

python main.py -c resources/dataset.csv

Infer LHS attributes given a fixed RHS attribute index:

python main.py -c resources/dataset.csv -r 0

RHS and LHS fixed, separator and header line specified:

python main.py -c resources/dataset.csv -r 0 -l 1,2,3 -s , -h 0
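If you launch rfd-discovery from another script, the rule that LHS indexes can only be given together with an RHS index is easy to encode before building the command line. The helper below is a hypothetical sketch, not part of the project: it only assembles the argument list from the options documented above (the -h option is left out on purpose) and does not run anything.

```python
def build_command(csv_path, rhs=None, lhs=None, sep=None):
    """Assemble the argument list for main.py (illustrative helper)."""
    # LHS attributes may only be fixed when an RHS index is given as well.
    if lhs and rhs is None:
        raise ValueError("LHS indexes require an RHS index")
    command = ["python", "main.py", "-c", csv_path]
    if rhs is not None:
        command += ["-r", str(rhs)]
    if lhs:
        command += ["-l", ",".join(str(index) for index in lhs)]
    if sep is not None:
        command += ["-s", sep]
    return command


print(build_command("resources/dataset.csv", rhs=0, lhs=[1, 2, 3], sep=","))
```

The resulting list can be passed to `subprocess.run` from the main directory of the project.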

rfd-discovery's People

Contributors

antonioaltamura, dariodip, mattiatomeo

rfd-discovery's Issues

Test the distance matrix against an oracle

Let's write a test that checks whether the content of the distance matrix we build is exactly the same as that of an oracle distance matrix (e.g. the one from the PP presentation).

Find a way to compare distance columns

I read that a sub-dataframe can be selected from a dataframe. If we managed to implement the algorithm by directly selecting the sub-dataframes for the different distances, I think we would be in great shape.

E.g.: we start from the maximum distance, select the sub-dataframe for that distance on the RHS attribute, and from it select the rows that are not dominated (with a function produced by a factory, so as to keep everything very versatile); then we move to the previous distance, make the necessary comparisons, and so on.

Select only valid tests

In order to extract statistics from our data, we put some invalid tests in our tests directory.
We should delete them.
