marquezproject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata

Home Page: https://marquezproject.ai

License: Apache License 2.0

Java 76.14% Dockerfile 0.04% Shell 1.32% JavaScript 0.64% HTML 1.04% TypeScript 16.73% CSS 0.05% Python 3.46% Mustache 0.19% PLpgSQL 0.38%
data-lineage data-discovery data-governance data-provenance metadata-service data-dictionary marquez metadata data-ecosystem-metadata data-ops

marquez's Introduction

Marquez is an open source metadata service for the collection, aggregation, and visualization of a data ecosystem's metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more. Marquez was released and open sourced by WeWork.

Status

Marquez is an LF AI & Data Foundation incubation project under active development, and we'd love your help!

Adopters

Want to be added? Send a pull request our way!

Quickstart

Marquez provides a simple way to collect and view dataset, job, and run metadata using OpenLineage. The easiest way to get up and running is with Docker. From the base of the Marquez repository, run:

macOS and Linux users:

$ ./docker/up.sh

Windows users:

Before cloning Marquez, configure Git to check out files with Unix-style line endings:

$ git config --global core.autocrlf false

Verify that Bash and PostgreSQL have been installed and added to the PATH variable (Git Bash is recommended).

Start all services:

$ sh ./docker/up.sh

Tip: Use the --build flag to build images from source, and/or --seed to start Marquez with sample lineage metadata. For a more complete example using the sample metadata, please follow our quickstart guide.

Note: Port 5000 is now reserved on macOS. If running locally on macOS, you can run ./docker/up.sh --api-port 9000 to configure the API to listen on port 9000 instead. Keep in mind that you will need to update the URLs below with the appropriate port number.

WEB UI

You can open http://localhost:3000 to begin exploring the Marquez Web UI. The UI enables you to discover dependencies between jobs and the datasets they produce and consume via the lineage graph, view run metadata of current and previous job runs, and much more!

HTTP API

The Marquez HTTP API listens on port 5000 for all calls and port 5001 for the admin interface. The admin interface exposes helpful endpoints like /healthcheck and /metrics. To verify the HTTP API server is running and listening on localhost, browse to http://localhost:5001. To begin collecting lineage metadata as OpenLineage events, use the LineageAPI or an OpenLineage integration.

Note: By default, the HTTP API does not require any form of authentication or authorization.
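As a rough illustration of the port layout described above, the following Python sketch builds admin-interface URLs and probes one of them. The helper names admin_url and is_healthy are made up for this example and are not part of Marquez. The sketch also assumes that /healthcheck answers with HTTP 200 when the server is healthy.

```python
import urllib.request
from urllib.error import URLError


def admin_url(path: str, host: str = "localhost", admin_port: int = 5001) -> str:
    """Build a URL on the admin interface, e.g. for /healthcheck or /metrics."""
    return f"http://{host}:{admin_port}/{path.lstrip('/')}"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200, False otherwise.

    Assumption: a healthy server replies 200 on its health endpoint.
    Connection errors, timeouts, and HTTP error statuses all yield False.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        return False
```

With the Docker setup running, is_healthy(admin_url("/healthcheck")) would be the call to try. If the admin interface listens on a different port, pass it explicitly, for example admin_url("/metrics", admin_port=8081).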

GRAPHQL

To explore metadata via GraphQL, browse to http://localhost:5000/graphql-playground. The GraphQL endpoint is currently in beta and is located at http://localhost:5000/api/v1-beta/graphql.
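For scripted access, a request against the beta endpoint might be prepared as in the sketch below. It assumes the endpoint accepts the conventional GraphQL-over-HTTP form (a JSON POST body with a "query" field), which this page does not confirm. The query { __typename } is used only because it is a GraphQL meta-field that does not depend on the Marquez schema. The function name graphql_request is hypothetical, and the request is built but not sent.

```python
import json
import urllib.request


def graphql_request(query: str, base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the beta GraphQL endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1-beta/graphql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Schema-independent query; replace it with a real query once the schema is known.
request = graphql_request("{ __typename }")
```

Passing the prepared request to urllib.request.urlopen would send it to a running server.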

Documentation

We invite everyone to help us improve and keep documentation up to date. Documentation is maintained in this repository and can be found under docs/.

Note: To begin collecting metadata with Marquez, follow our quickstart guide. Below you will find the steps to get up and running from source.

Versions and OpenLineage Compatibility

Versions of Marquez are compatible with OpenLineage unless noted otherwise. We ensure backward compatibility with a newer version of Marquez by recording events with an older OpenLineage specification version. We strongly recommend understanding how the OpenLineage specification is versioned and published.

Marquez      OpenLineage   Status
UNRELEASED   1-0-5         CURRENT
0.46.0       1-0-5         RECOMMENDED
0.45.0       1-0-5         MAINTENANCE

Note: The openlineage-python and openlineage-java libraries will have a higher version than the OpenLineage specification as they have different version requirements.

We currently maintain three categories of compatibility: CURRENT, RECOMMENDED, and MAINTENANCE. When a new version of Marquez is released, it's marked as RECOMMENDED, while the previous version enters MAINTENANCE mode (which gets bug fixes whenever possible). The unreleased version of Marquez is marked CURRENT and does not come with any guarantees. It is expected to remain compatible with OpenLineage, although there may be rare exceptions.

Modules

Marquez uses a multi-project structure and contains the following modules:

  • api: core API used to collect metadata
  • web: web UI used to view metadata
  • clients: clients that implement the HTTP API
  • chart: helm chart

Note: The integrations module was removed in 0.21.0, so please use an OpenLineage integration to collect lineage events easily.

Requirements

Note: To connect to your running PostgreSQL instance, you will need the standard psql tool.

Building

To build the entire project run:

./gradlew build

The executable can be found under api/build/libs/

Configuration

To run Marquez, you will have to define marquez.yml. The configuration file is passed to the application and used to specify your database connection. The configuration file creation steps are outlined below.

Step 1: Create Database

When creating your database using createdb, we recommend calling it marquez:

$ createdb marquez

Step 2: Create marquez.yml

With your database created, you can now copy marquez.example.yml:

$ cp marquez.example.yml marquez.yml

You will then need to set the following environment variables (we recommend adding them to your .bashrc): POSTGRES_DB, POSTGRES_USER, and POSTGRES_PASSWORD. The environment variables override the equivalent option in the configuration file.
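The override rule above (an environment variable wins over the equivalent option in the file) can be pictured with the small Python sketch below. Marquez applies the override itself when it loads marquez.yml, so this is only an illustration of the precedence. The keys "db", "user", and "password" and the function resolve_db_settings are hypothetical and do not reflect the real layout of marquez.yml.

```python
import os
from typing import Dict, Mapping

# Hypothetical mapping from environment variable to a flat settings key.
ENV_TO_KEY = {
    "POSTGRES_DB": "db",
    "POSTGRES_USER": "user",
    "POSTGRES_PASSWORD": "password",
}


def resolve_db_settings(
    file_settings: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Start from the values in the file, then let set environment variables win."""
    resolved = dict(file_settings)
    for variable, key in ENV_TO_KEY.items():
        if variable in env:
            resolved[key] = env[variable]
    return resolved
```

For example, with {"db": "marquez", "user": "buendia"} in the file and POSTGRES_USER=marquez in the environment, the resolved user is "marquez" while the database name is left untouched.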

By default, Marquez uses the following ports:

  • TCP port 8080 is available for the HTTP API server.
  • TCP port 8081 is available for the admin interface.

Note: All of the configuration settings in marquez.yml can be specified either in the configuration file or in an environment variable.

Running the HTTP API Server

$ ./gradlew :api:runShadow

Marquez listens on port 8080 for all API calls and port 8081 for the admin interface. To verify the HTTP API server is running and listening on localhost, browse to http://localhost:8081. We encourage you to familiarize yourself with the data model and APIs of Marquez. To run the web UI, please follow the steps outlined here.

Note: By default, the HTTP API does not require any form of authentication or authorization.

Related Projects

  • OpenLineage: an open standard for metadata and lineage collection

Getting Involved

Contributing

See CONTRIBUTING.md for more details about how to contribute.

Reporting a Vulnerability

If you discover a vulnerability in the project, please open an issue and attach the "security" label.


SPDX-License-Identifier: Apache-2.0 Copyright 2018-2023 contributors to the Marquez project.

marquez's Issues

Revisit CI flow

In our current flow, we run the 'test' task both as part of the build step, and then also as part of the dedicated test step. We have two choices:

  1. Skip the testing step in the build invocation by specifying -x test.
  2. Remove the individual test step altogether.

List job run states

Let's support listing / filtering jobs by run state. We'll want to update the OpenAPI spec accordingly #100 #104

GET /namespaces/:namespace/jobs?state=[completed|running|failed|aborted]
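A client-side helper for the proposed filter could look like the Python sketch below. The endpoint is only a proposal in this issue, so the helper jobs_by_state_path is hypothetical. It builds the request path and rejects states outside the four listed above.

```python
from urllib.parse import quote, urlencode

# The run states named in the proposal.
RUN_STATES = ("completed", "running", "failed", "aborted")


def jobs_by_state_path(namespace: str, state: str) -> str:
    """Build the path for listing the jobs of a namespace filtered by run state."""
    if state not in RUN_STATES:
        raise ValueError(f"unknown run state: {state!r}")
    # Percent-encode the namespace so it is safe as a single path segment.
    return f"/namespaces/{quote(namespace, safe='')}/jobs?{urlencode({'state': state})}"
```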

Design document is not publicly accessible

I arrived here after viewing this (excellent) presentation. I'm very keen to understand Marquez in more detail as it appears to align with many of my metadata goals. It'd be great to have some visibility on the design/roadmap of the project. I believe that the Google document linked to in the README might contain useful information but I do not currently have permission to view it. On following the link I see: You need permission.

Add roadmap.md

We should define a clear timeline of milestones for Marquez and break them down into phases with expected release dates.

Verify and document default Postgres timezone behavior

Create a standard for dealing with timezones in the data, specifically around whether it's required/recommended to include them in time data.

This includes how the DB schema will address timestamp info, how the Jackson mapper treats timestamps that don't include TZ info, and how Marquez treats data in the Timestamp object.

By default, it seems that Postgres does not record TZ info in its timestamp. This can cause problems down the line, so I think for now it's advisable to create types as TIMESTAMP WITH TIME ZONE instead of just TIMESTAMP.

Reference:
https://www.postgresql.org/docs/9.6/static/datatype-datetime.html
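The same ambiguity shows up on the application side. The Python sketch below is not Marquez code. It only illustrates why a timestamp without zone information is hard to reason about: a naive value cannot be ordered against an aware one, and two identical wall-clock readings in different zones are hours apart. The sample values are arbitrary.

```python
from datetime import datetime, timedelta, timezone

# The same wall-clock reading, without and with zone information.
naive = datetime(2018, 8, 27, 12, 57, 18)
aware_utc = datetime(2018, 8, 27, 12, 57, 18, tzinfo=timezone.utc)
aware_minus_4 = datetime(2018, 8, 27, 12, 57, 18, tzinfo=timezone(timedelta(hours=-4)))


def is_aware(ts: datetime) -> bool:
    """True if the timestamp carries a usable UTC offset."""
    return ts.tzinfo is not None and ts.utcoffset() is not None


def can_compare(a: datetime, b: datetime) -> bool:
    """Ordering a naive and an aware datetime raises TypeError in Python."""
    try:
        a < b
        return True
    except TypeError:
        return False


# Identical wall-clock readings, yet four hours apart as instants.
gap = aware_minus_4 - aware_utc
```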

List job runs

Let's support listing job runs. We'll want to update the OpenAPI spec accordingly #100 #104

GET /namespaces/:namespace/jobs

{
  "name": "my_first_job",
  "description": "Best job ever!",
  ...
  "runs": [
    "/jobs/runs/cfc4b5e6-c630-48d4-ad19-f2bd16c93a9d",
    "/jobs/runs/d33ef190-73bd-4a65-ab59-1bbd65364d0b",
    "/jobs/runs/5ced1097-8d59-46d8-933e-c9a688be8b8c",
    ...
  ]
}

We may want to list job runs when fetching a job by name. The endpoint above is not yet finalized, and something we can iterate on (just wanted to capture the functionality).
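If the response keeps the shape sketched above, a client could pull the run IDs out of the "runs" links as follows. This is a Python sketch against the proposed payload only. The function run_ids is hypothetical, and the sample reuses the three run links from the example with the elided fields left out.

```python
import json
from typing import List


def run_ids(job: dict) -> List[str]:
    """Return the last path segment of each entry under "runs"."""
    return [uri.rsplit("/", 1)[-1] for uri in job.get("runs", [])]


payload = """
{
  "name": "my_first_job",
  "description": "Best job ever!",
  "runs": [
    "/jobs/runs/cfc4b5e6-c630-48d4-ad19-f2bd16c93a9d",
    "/jobs/runs/d33ef190-73bd-4a65-ab59-1bbd65364d0b",
    "/jobs/runs/5ced1097-8d59-46d8-933e-c9a688be8b8c"
  ]
}
"""

job = json.loads(payload)
ids = run_ids(job)
```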

Vulnerability assessment tool scans

I recommend that we introduce vulnerability scans early in this project so that we can keep our security posture healthy. We can achieve this by performing security scans (with a tool) on PRs and rejecting them if they introduce new vulnerabilities. At a minimum, we should be scanning our dependencies since it's rather easy to do and incurs little overhead. Snyk is a solid choice for this purpose.

In the future, we'll want to scan custom code as well (the code we actually wrote), but those types of scans are usually more cumbersome to perform and orchestrate. I don't think this needs to be tackled right away, however, it should be considered carefully.

numeric IDs or UUIDs?

Does anyone have a strong opinion about numeric IDs vs. using UUIDs from the outset? Would the numeric IDs lock us into a single database master and affect how Marquez can evolve to support a distributed datastore in the future?
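To make the trade-off concrete, here is a toy Python sketch that is not tied to any Marquez code. Two independent sequence generators, standing in for two database masters, hand out the same numeric ID, whereas random UUIDs can be generated on any node without coordination. The class SequenceIds is hypothetical.

```python
import itertools
import uuid


class SequenceIds:
    """Toy stand-in for a per-database auto-increment sequence."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def next_id(self) -> int:
        return next(self._counter)


# Two uncoordinated "masters" both start counting at 1, so their IDs collide.
node_a, node_b = SequenceIds(), SequenceIds()
numeric_collision = node_a.next_id() == node_b.next_id()

# Random (version 4) UUIDs need no shared counter.
uuid_a, uuid_b = uuid.uuid4(), uuid.uuid4()
```

The cost is size and readability: a UUID is 128 bits and renders as 36 characters, compared with a short integer.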

Add docker compose

To simplify getting up and running with Marquez, we'll want to add support for docker-compose.

Note: depends on #44

Reorganize Dropwizard service

The current organization of our Dropwizard has no clear home for business logic, leaving us to push too much logic down into the DAO or Resources (without any guiding principles on when or where). We should consider re-organizing the project. We should also use patterns to reinforce good hygiene like having separate Representation objects for both requests and responses, which can help avoid unexpected bugs.

The Dropwizard docs suggest separating the business logic into a separate package from the reference objects and the resources. The Dropwizard docs also make passing mention of request / response entities, although they do not emphasize the importance of this for good service design. To that end, we probably want to introduce:

  • Controllers for request / response handling
  • Service classes
  • Request / Response entities

Thoughts?
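As a starting point for discussion, separate request and response entities might look like the sketch below. It is written in Python for brevity even though the service itself is Java. The field names, the class names, and the to_response helper are all hypothetical. The point is only that the response carries server-assigned fields that a request must not be able to set.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CreateNamespaceRequest:
    """What a client is allowed to send."""
    name: str
    description: str = ""


@dataclass(frozen=True)
class CreateNamespaceResponse:
    """What the service returns, including server-assigned fields."""
    name: str
    description: str
    created_at: str


def to_response(request: CreateNamespaceRequest, now: datetime) -> CreateNamespaceResponse:
    """Map a request to a response; the timestamp comes from the server, never the client."""
    return CreateNamespaceResponse(request.name, request.description, now.isoformat())
```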

Add CONTRIBUTING.md

Let's write up some clear guidelines for how to contribute to the project.

Branch protection on master?

FYI I just enabled branch protection on master to prevent merging PRs which have not been reviewed by Marquez owners. If there's any reason we don't want this, let's discuss!

Standardize test file names

Some test files are named Test*.java and others are *Test.java. It would be nice to standardize on one. *Test.java seems to be the more conventional choice.
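A quick way to see how the two conventions are split is to classify file names by pattern, as in this Python sketch. The function naming_style is hypothetical, and the file names in the usage note are made up for illustration.

```python
from fnmatch import fnmatchcase


def naming_style(filename: str) -> str:
    """Classify a Java file name as 'suffix' (*Test.java), 'prefix' (Test*.java), or 'other'."""
    if fnmatchcase(filename, "*Test.java"):
        return "suffix"
    if fnmatchcase(filename, "Test*.java"):
        return "prefix"
    return "other"
```

For example, naming_style("TestJobService.java") reports "prefix", which marks the file as a candidate for renaming if *Test.java becomes the standard.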

Add gradle task to apply java formatting

Now that we're catching code formatting issues during the build, it will be nice to have a gradle task which can auto apply the java formatting. Coming from Go, I loved having go fmt do this for me, so having something similar for this project will be nice.

v0.1.0 Service Layer Methods

Marquez 0.1.0

Key   Description
+     Public
-     Private

NamespaceService

Method                            Throws               Description
+ Namespace create(String name)   NamespaceException   Creates a namespace.

JobService

Method Description
+ Job create(namespace, Job)
- JobVersion createVersion(namespace, Job)
+ Job[] getAll(namespace)
+ JobVersion[] getAllVersions(namespace, jobName)
+ JobVersion getVersionLatest(namespace, jobName)

DatasetService

Method Description
+ JobRun createJobRunOutputs(namespace, jobName, runId, Dataset[])
- DatasetVersion createVersion(namespace, Dataset)
+ Dataset[] getAll(namespace)

Fix checkstyle warnings

Let's address all style warnings for code that doesn't follow the Google Java Style Guide:

$ ./gradlew checkstyleMain checkstyleTest
> Task :checkstyleMain
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:110: Abbreviation in name 'ownerDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:113: Abbreviation in name 'jobDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:116: Abbreviation in name 'jobRunDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:119: Abbreviation in name 'datasetDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:122: Abbreviation in name 'jobRunDefinitionDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:123: Abbreviation in name 'jobVersionDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/api/CreateNamespaceResponse.java:24: 'if' construct must use '{}'s. [NeedBraces]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/api/CreateNamespaceResponse.java:25: 'if' construct must use '{}'s. [NeedBraces]
.
.
.

Issues

  • Fix lines longer than 100 chars #283
  • Fix abbreviation in name #284
  • Disable Javadoc check #285
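For the abbreviation warnings, a rough pre-check can be scripted. The Python sketch below only approximates the rule by flagging any run of three or more consecutive capital letters. The real AbbreviationAsWordInName check has more nuanced handling, so treat this as a triage aid and not a reimplementation. Both function names are hypothetical.

```python
import re

# Approximation: three or more consecutive capitals anywhere in the name.
_TOO_MANY_CAPITALS = re.compile(r"[A-Z]{3,}")

# A run of capitals that is not immediately followed by a lowercase letter,
# so the capital that starts the next word is left alone.
_ABBREVIATION = re.compile(r"[A-Z]{2,}(?![a-z])")


def violates(name: str) -> bool:
    """True if the identifier would likely trigger the abbreviation warning."""
    return bool(_TOO_MANY_CAPITALS.search(name))


def suggest(name: str) -> str:
    """Rewrite abbreviations in camel case, e.g. ownerDAO becomes ownerDao."""
    return _ABBREVIATION.sub(lambda match: match.group(0).capitalize(), name)
```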

App Design / Structure

To date, we have encouraged quick iteration on features to allow for early API feedback. This ensured we understood how Marquez would collect metadata on running jobs as well as handle versioning of datasets (conversations that are still ongoing). Recently, our data model has seen many additions (namespaces!). As a result, we need to address our current app structure. Mainly, this means decoupling the DAO Layer from the Resource Layer and introducing a Service Layer to encapsulate all DAO interactions and allow each layer to evolve independently #66.

Note: App restructuring is required before opening up the project for contributions.

App Design

The multi-layer app design will consist of:

  • Resource Layer
  • Service Layer
  • DAO Layer

See Organizing Marquez Code doc for more details.
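The layering can be sketched in a few lines. The example below uses Python for brevity (the real app is Java) and an in-memory dict in place of the database. NamespaceService.create and NamespaceException are taken from the v0.1.0 service layer listing. NamespaceDao, NamespaceResource, their methods, and the status codes are hypothetical.

```python
from typing import Dict, Optional, Tuple


class NamespaceException(Exception):
    """Raised by the service layer when a namespace cannot be created."""


class NamespaceDao:
    """DAO layer: storage access only. A dict stands in for the database."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def insert(self, name: str, description: str) -> None:
        self._rows[name] = description

    def find(self, name: str) -> Optional[str]:
        return self._rows.get(name)


class NamespaceService:
    """Service layer: business rules; the only caller of the DAO."""

    def __init__(self, dao: NamespaceDao) -> None:
        self._dao = dao

    def create(self, name: str, description: str = "") -> str:
        if not name.strip():
            raise NamespaceException("namespace name must not be blank")
        self._dao.insert(name, description)
        return name


class NamespaceResource:
    """Resource layer: translates HTTP-style input and output; no storage logic."""

    def __init__(self, service: NamespaceService) -> None:
        self._service = service

    def put(self, name: str, description: str = "") -> Tuple[int, dict]:
        try:
            created = self._service.create(name, description)
        except NamespaceException as error:
            return 400, {"error": str(error)}
        return 200, {"name": created}
```

Because the resource only knows the service, the DAO can change (or be swapped for a fake in tests) without touching request handling.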

App Structure

We'll also have the following project structure:

marquez/
├── api
│   ├── exceptions
│   ├── health
│   ├── mappers
│   ├── models
│   └── validation
├── common
├── db
│   ├── mappers
│   └── models
└── service
    ├── exceptions
    ├── mappers
    └── models

For more details, see Marquez: App pkg structure
