
datasets-provanalytics-dmkd's Introduction

Provenance Network Analytics Datasets

This repository provides the datasets used in the Provenance Network Analytics paper and the code for its analyses. The code was also used to generate the charts shown in our paper. Please note that the information provided here is meant to accompany the paper, where the analytic method is described in more detail.

Overview

Provenance network analytics is a novel data analytics approach that helps infer properties of data, such as quality or trustworthiness, from their provenance. Instead of analysing application data, which are typically domain-dependent, it analyses the data's provenance as represented using the World Wide Web Consortium's domain-agnostic PROV data model. Specifically, the approach proposes a number of provenance network metrics (PNM) and applies machine learning techniques over these metrics to build predictive models for some key properties of data. Applying this method to the provenance of real-world data from three different applications, we show that provenance network analytics can, with high levels of accuracy, identify the owners of provenance documents, assess the trustworthiness of crowdsourced data, and identify instructions from chat messages in an alternate-reality game.

The notebooks and the accompanying datasets provided in this repository demonstrate how the method can be applied in a number of domains as a useful and generic tool for data analytics.
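To make the idea of turning a provenance graph into a feature vector concrete, here is a minimal sketch. It is not the metric set defined in the paper: the toy graph, the function name `simple_metrics`, and the choice of metrics (node counts per PROV type, edge count, density) are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy provenance graph: each node is tagged with its PROV type.
NODE_TYPES = {
    "e1": "entity",
    "e2": "entity",
    "a1": "activity",
    "ag1": "agent",
}

# Directed edges (source, target); the comments name the PROV relation intended.
EDGES = [
    ("e2", "a1"),   # e2 wasGeneratedBy a1
    ("a1", "e1"),   # a1 used e1
    ("a1", "ag1"),  # a1 wasAssociatedWith ag1
    ("e2", "e1"),   # e2 wasDerivedFrom e1
]


def simple_metrics(node_types, edges):
    """Summarise a graph as a flat dict of numbers (illustrative metrics only)."""
    n = len(node_types)
    m = len(edges)
    counts = Counter(node_types.values())
    # Density of a directed graph without self-loops: edges / (n * (n - 1)).
    density = m / (n * (n - 1)) if n > 1 else 0.0
    return {
        "entities": counts["entity"],
        "activities": counts["activity"],
        "agents": counts["agent"],
        "nodes": n,
        "edges": m,
        "density": density,
    }


features = simple_metrics(NODE_TYPES, EDGES)
```

A dict like `features`, computed for every graph and paired with a label, is the kind of row a classifier could be trained on; the CSV files in this repository already contain the metrics actually used in the paper.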

Installation

You do not need to install anything to view the notebooks provided in this repository (linked below). However, if you want to re-run the code on the datasets, you will need to install the required Python packages listed in requirements.txt.

The code provided with the datasets was run on Python 3.6. It may also run on other Python versions, but this is not guaranteed. To install the packages listed in requirements.txt, run the following command with pip.

pip install -r requirements.txt

Provenance Datasets

We use three datasets in our paper, which are listed below. Each dataset contains a number of provenance graphs and their labels. Due to privacy issues, we do not provide the actual provenance graphs here; we provide only the provenance network metrics (PNM) calculated from those graphs, which are what our analyses use.

  1. Provenance documents on ProvStore:
    • provstore/data.csv: the PNM of provenance documents uploaded to ProvStore and their corresponding owners (anonymised as u_1, u_2, ...)
  2. Provenance of CollabMap data:
    • collabmap/trust_values.csv: the trust value of each data entity from CollabMap (identified by the id column).
    • collabmap/depgraphs.csv: the PNM of the provenance dependency graph of each data entity (see our paper for the definition of a provenance dependency graph).
    • collabmap/ancestor-graphs.csv: the PNM of the (historical) provenance graph of each data entity (i.e. the graph records how it was generated).
  3. Provenance from the Radiation Response Game (RRG):
    • rrg/depgraphs-k.csv, e.g. rrg/depgraphs-5.csv: the PNM of the level-k provenance dependency graph of an RRG chat message (k = 1..18).
    • rrg/depgraphs.csv: the PNM of the full dependency graph of an RRG chat message (i.e. without restricting the dependency graph to k edges away from the message entity).
    • rrg/ancestor-graphs.csv: the PNM of the (historical) provenance graph of the messages.
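The file layout above can be navigated with a few small helpers. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes that the metrics files share an `id` column with `collabmap/trust_values.csv` (the source confirms the `id` column only for the trust values file).

```python
import csv


def rrg_depgraph_path(k):
    """Return the relative path of the level-k RRG dependency-graph metrics file."""
    if not 1 <= k <= 18:
        raise ValueError("k must be between 1 and 18")
    return f"rrg/depgraphs-{k}.csv"


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def join_on_id(metrics_rows, label_rows, key="id"):
    """Attach label columns to the metric rows that share the same key value.

    Assumption: both files expose the same key column (here "id").
    Rows without a matching label are dropped.
    """
    labels = {row[key]: row for row in label_rows}
    joined = []
    for row in metrics_rows:
        if row[key] in labels:
            merged = dict(row)
            merged.update(labels[row[key]])
            joined.append(merged)
    return joined
```

For example, `join_on_id(load_rows("collabmap/depgraphs.csv"), load_rows("collabmap/trust_values.csv"))` would pair each entity's metrics with its trust value, provided the shared-column assumption holds. The notebooks in this repository may load the data differently.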

IPython Notebooks

The notebooks below provide the code for the analysis of the above datasets as reported in our paper. They detail the steps we took in our experiments and also show their results.

We also provide extra materials here to help replicate the experiments and to document additional experiments we carried out that are not included in the paper due to space constraints.
