marquezproject / marquez
Collect, aggregate, and visualize a data ecosystem's metadata
Home Page: https://marquezproject.ai
License: Apache License 2.0
To date, we have encouraged quick iteration on features to allow for early API feedback. This ensured we understood how Marquez would collect metadata on running jobs as well as handle versioning of datasets (conversations that are still ongoing). Recently, our data model has seen many additions (namespaces!). As a result, we need to address our current app structure. Mainly, this means decoupling the DAO Layer from the Resource Layer and introducing a Service Layer that encapsulates all DAO interactions, so that each layer can evolve independently #66.
Note: App restructuring is required before opening up the project for contributions.
The multi-layer app design will consist of:
See Organizing Marquez Code doc for more details.
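As a rough illustration of the layering described above, here is a minimal sketch in which a resource talks only to a service, and the service is the only caller of the DAO. All class and method names (`NamespaceDao`, `InMemoryNamespaceDao`, `NamespaceService`, `NamespaceResource`) are hypothetical and chosen for this example; the in-memory DAO stands in for a real database-backed one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// DAO layer: the only layer that knows how namespaces are stored.
// Hypothetical interface; the real Marquez DAOs may look different.
interface NamespaceDao {
  void insert(String name);

  Optional<String> findByName(String name);

  List<String> findAll();
}

// In-memory stand-in for a database-backed DAO (illustration only).
class InMemoryNamespaceDao implements NamespaceDao {
  private final Map<String, String> rows = new LinkedHashMap<>();

  @Override
  public void insert(String name) {
    rows.put(name, name);
  }

  @Override
  public Optional<String> findByName(String name) {
    return Optional.ofNullable(rows.get(name));
  }

  @Override
  public List<String> findAll() {
    return new ArrayList<>(rows.values());
  }
}

// Service layer: encapsulates every DAO interaction and holds the business rules.
class NamespaceService {
  private final NamespaceDao dao;

  NamespaceService(NamespaceDao dao) {
    this.dao = dao;
  }

  // Creating an existing namespace is treated as a no-op (an assumption of this sketch).
  String create(String name) {
    if (name == null || name.trim().isEmpty()) {
      throw new IllegalArgumentException("namespace name must not be blank");
    }
    if (!dao.findByName(name).isPresent()) {
      dao.insert(name);
    }
    return name;
  }

  List<String> getAll() {
    return dao.findAll();
  }
}

// Resource layer: translates requests into service calls and never touches the DAO.
class NamespaceResource {
  private final NamespaceService service;

  NamespaceResource(NamespaceService service) {
    this.service = service;
  }

  String put(String name) {
    return "created: " + service.create(name);
  }

  List<String> list() {
    return service.getAll();
  }
}
```

Because the resource depends only on the service, the DAO can be swapped (for example, for a test double) without changing resource code.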
We'll also have the following project structure:
marquez/
├── api
│ ├── exceptions
│ ├── health
│ ├── mappers
│ ├── models
│ └── validation
├── common
├── db
│ ├── mappers
│ └── models
└── service
  ├── exceptions
  ├── mappers
  └── models
For more details, see Marquez: App pkg structure
I arrived here after viewing this (excellent) presentation. I'm very keen to understand Marquez in more detail as it appears to align with many of my metadata goals. It'd be great to have some visibility on the design/roadmap of the project. I believe that the Google document linked to in the README might contain useful information, but I do not currently have permission to view it. On following the link I see: You need permission.
We feel it's important to have a welcoming community. Let's follow https://www.contributor-covenant.org
TestJobSerialization.java contents were lost during a merge. Needs to be restored from git history.
Currently, deleting an Owner only soft deletes the owner record, but will leave their ownerships unchanged. The Owner deletion should automatically end all of their ownerships.
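The intended behavior could be sketched as follows. This is an in-memory illustration only: `Ownership`, `OwnerService`, and their fields are hypothetical names, not the project's actual classes, and a database-backed version would presumably perform both steps in a single transaction.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical ownership record: an owner holds a job from startedAt until endedAt.
class Ownership {
  final String ownerName;
  final String jobName;
  final Instant startedAt;
  Instant endedAt; // null while the ownership is still active

  Ownership(String ownerName, String jobName, Instant startedAt) {
    this.ownerName = ownerName;
    this.jobName = jobName;
    this.startedAt = startedAt;
  }

  boolean isActive() {
    return endedAt == null;
  }
}

// Hypothetical service holding the deletion rule in one place.
class OwnerService {
  private final List<String> deletedOwners = new ArrayList<>();
  private final List<Ownership> ownerships = new ArrayList<>();

  Ownership assign(String ownerName, String jobName, Instant now) {
    Ownership ownership = new Ownership(ownerName, jobName, now);
    ownerships.add(ownership);
    return ownership;
  }

  // Soft deletes the owner AND ends every ownership that is still active.
  // Returns the number of ownerships that were ended.
  int delete(String ownerName, Instant now) {
    deletedOwners.add(ownerName);
    int ended = 0;
    for (Ownership ownership : ownerships) {
      if (ownership.ownerName.equals(ownerName) && ownership.isActive()) {
        ownership.endedAt = now;
        ended++;
      }
    }
    return ended;
  }

  boolean isDeleted(String ownerName) {
    return deletedOwners.contains(ownerName);
  }
}
```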
In our current flow, we run the 'test' task both as part of the build step, and then also as part of the dedicated test step. We have two choices:
-x test.
Let's support listing job runs. We'll want to update the OpenAPI spec accordingly #100 #104
GET /namespaces/:namespace/jobs
{
"name": "my_first_job",
"description": "Best job ever!",
...
"runs": [
"/jobs/runs/cfc4b5e6-c630-48d4-ad19-f2bd16c93a9d",
"/jobs/runs/d33ef190-73bd-4a65-ab59-1bbd65364d0b",
"/jobs/runs/5ced1097-8d59-46d8-933e-c9a688be8b8c",
...
]
}
We may want to list job runs when fetching a job by name. The endpoint above is not yet finalized, and something we can iterate on (just wanted to capture the functionality).
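For illustration, the `runs` array in the response above could be produced from run IDs with a small helper like the one below. `JobRunLinks` and `toLinks` are hypothetical names; the `/jobs/runs/` prefix is taken from the example response and, like the endpoint itself, is not final.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical helper that turns run IDs into the relative links shown in the response.
class JobRunLinks {
  private static final String RUN_PATH_PREFIX = "/jobs/runs/";

  static List<String> toLinks(List<UUID> runIds) {
    List<String> links = new ArrayList<>();
    for (UUID runId : runIds) {
      // UUID.toString() yields the canonical lowercase, hyphenated form.
      links.add(RUN_PATH_PREFIX + runId);
    }
    return links;
  }
}
```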
What are everyone's thoughts regarding how to version the API? Versioning (via URI or headers) or evolution? Or something else?
There is a good writeup that will guide us here:
https://circleci.com/docs/2.0/caching/
This will require changes across the application, so should be in a separate PR.
Defines a namespace
Let's define an API specification using OpenAPI v3.0, the most popular API specification tool around
Defines a job in docs/
We should define a clear timeline of milestones for Marquez and break them down into phases with expected release dates.
Placeholder to continue conversation in #15 re: exploring testcontainers
Let's write up some clear guidelines for how to contribute to the project.
Defines a dataset
Now that we're catching code formatting issues during the build, it will be nice to have a gradle task which can auto apply the java formatting. Coming from Go, I loved having go fmt do this for me, so having something similar for this project will be nice.
Some test files are named Test*.java and others are *Test.java. Will be nice to standardize on one. Seems like *Test.java is more conventional?
This will make transitioning to code generation with Lombok seamless.
Instead of validating UUID separately, look into: https://www.mkyong.com/webservices/jax-rs/jax-rs-path-uri-matching-example/
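The idea in the linked article is to let the path pattern itself reject malformed IDs. The sketch below shows such a pattern using plain `java.util.regex`; `UuidPath` and its members are hypothetical names, and whether the same expression can be dropped into a JAX-RS path template unchanged is an assumption to verify against the framework in use. A strict pattern matters because `UUID.fromString` is lenient and accepts inputs such as `1-1-1-1-1`.

```java
import java.util.regex.Pattern;

// Hypothetical helper: match a job run path only when the ID is a well-formed UUID.
class UuidPath {
  // Canonical 8-4-4-4-12 hexadecimal layout.
  static final String UUID_REGEX =
      "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}";

  private static final Pattern RUN_PATH = Pattern.compile("/jobs/runs/(" + UUID_REGEX + ")");

  // True only when the whole path matches, so malformed IDs never reach handler code.
  static boolean matches(String path) {
    return RUN_PATH.matcher(path).matches();
  }
}
```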
Define all API error responses using rfc7807
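RFC 7807 defines a "problem details" object with the members `type`, `title`, `status`, `detail`, and `instance`, served with the media type `application/problem+json`. A minimal sketch of such an error body follows. The `ProblemDetails` class is a hypothetical name, and the hand-rolled JSON is only there to keep the example self-contained; in the application a JSON mapper would presumably do the serialization.

```java
// Hypothetical representation of an RFC 7807 problem details object.
class ProblemDetails {
  static final String MEDIA_TYPE = "application/problem+json";

  final String type; // URI identifying the problem type; "about:blank" is the default
  final String title; // short, human-readable summary
  final int status; // HTTP status code
  final String detail; // explanation specific to this occurrence
  final String instance; // URI reference identifying this occurrence

  ProblemDetails(String type, String title, int status, String detail, String instance) {
    this.type = type;
    this.title = title;
    this.status = status;
    this.detail = detail;
    this.instance = instance;
  }

  // Hand-written serialization, kept deliberately simple for the sketch.
  String toJson() {
    return "{"
        + "\"type\":\"" + escape(type) + "\","
        + "\"title\":\"" + escape(title) + "\","
        + "\"status\":" + status + ","
        + "\"detail\":\"" + escape(detail) + "\","
        + "\"instance\":\"" + escape(instance) + "\""
        + "}";
  }

  // Escapes only backslashes and double quotes; not a complete JSON escaper.
  private static String escape(String value) {
    return value.replace("\\", "\\\\").replace("\"", "\\\"");
  }
}
```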
Does anyone have a strong opinion about numeric IDs vs. using UUIDs from the outset? Would the numeric IDs lock us into a single database master and affect how Marquez can evolve to support a distributed datastore in the future?
Create a standard for dealing with timezones in the data, specifically around whether it's required/recommended to include them in time data.
This includes how the DB schema will address timestamp info, how the Jackson mapper treats timestamps that don't include TZ info, and how Marquez treats data in the Timestamp object.
By default, it seems that Postgres does not record TZ info in its timestamp. This can cause problems down the line, so I think for now it's advisable to create types as TIMESTAMP WITH TIME ZONE instead of just TIMESTAMP.
Reference:
https://www.postgresql.org/docs/9.6/static/datatype-datetime.html
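To make the ambiguity concrete, the sketch below uses `java.time` to show that a timestamp carrying an offset maps to exactly one instant, while the same wall-clock text without an offset maps to different instants depending on the offset the reader assumes. `TimestampExample` and its method names are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

// Hypothetical helper contrasting zoned and zoneless timestamps.
class TimestampExample {
  // An ISO-8601 timestamp with an offset identifies exactly one instant.
  static Instant parseWithOffset(String text) {
    return OffsetDateTime.parse(text).toInstant();
  }

  // A timestamp without an offset has no fixed instant: the result depends
  // entirely on the offset the caller chooses to assume.
  static Instant interpret(String text, ZoneOffset assumedOffset) {
    return LocalDateTime.parse(text).toInstant(assumedOffset);
  }
}
```

For example, `interpret("2018-10-01T12:00:00", ZoneOffset.UTC)` and `interpret("2018-10-01T12:00:00", ZoneOffset.ofHours(-7))` differ by seven hours, which is the kind of silent drift a zoneless column invites.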
Key | Description |
---|---|
+ | Public |
- | Private |
Method | Throws | Description |
---|---|---|
+ Namespace create(String name) | NamespaceException | Creates a namespace. |
Method | Description |
---|---|
+ Job create(namespace, Job) | |
- JobVersion createVersion(namespace, Job) | |
+ Job[] getAll(namespace) | |
+ JobVersion[] getAllVersions(namespace, jobName) | |
+ JobVersion getVersionLatest(namespace, jobName) | |
Method | Description |
---|---|
+ JobRun createJobRunOutputs(namespace, jobName, runId, Dataset[]) | |
- DatasetVersion createVersion(namespace, Dataset) | |
+ Dataset[] getAll(namespace) | |
These tests started failing in the past couple weeks when run locally, though oddly enough did not fail when run against CircleCI.
Let's address all style warnings for code that doesn't follow the Google Java Style Guide
$ ./gradlew checkstyleMain checkstyleTest
> Task :checkstyleMain
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:110: Abbreviation in name 'ownerDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:113: Abbreviation in name 'jobDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:116: Abbreviation in name 'jobRunDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:119: Abbreviation in name 'datasetDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:122: Abbreviation in name 'jobRunDefinitionDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:123: Abbreviation in name 'jobVersionDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/api/CreateNamespaceResponse.java:24: 'if' construct must use '{}'s. [NeedBraces]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/api/CreateNamespaceResponse.java:25: 'if' construct must use '{}'s. [NeedBraces]
.
.
.
Let's keep a changelog!
This can be done once the code referencing those columns / tables is removed.
Do we need our own PreparedDbRule? We're dynamically generating the config.yml, which I'm not a fan of (and something I think we can hopefully avoid). Looking at the src for PreparedDbRule, on instantiation, it's creating the DB connection based on the details contained in DatabasePreparer. Worth exploring that as an option.
To simplify getting started / up and running with marquez, we'll want to add support for docker-compose
Note: depends on #44
Will be nice to eventually have our docs published here (automatically?) https://readthedocs.org/
This should include the ability to create a Namespace and list all the existing namespaces.
Seeing this during compilation:
...src/main/java/marquez/api/JobRunState.java uses unchecked or unsafe operations.
The current organization of our Dropwizard app has no clear home for business logic, leaving us to push too much logic down into the DAO or Resources (without any guiding principles on when or where). We should consider re-organizing the project. We should also use patterns that reinforce good hygiene, like having separate Representation objects for requests and responses, which can help avoid unexpected bugs.
The Dropwizard docs suggest separating the business logic into a package separate from the representation objects and the resources. The Dropwizard docs also make passing mention of request / response entities, although they do not emphasize the importance of this for good service design. To that end, we probably want to introduce:
Thoughts?
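To illustrate the request / response separation discussed above, here is a minimal sketch. Every name in it (`CreateNamespaceRequest`, `Namespace`, `NamespaceResponse`, `NamespaceMapper`) and every field is an assumption made for the example, not the project's actual model.

```java
import java.time.Instant;

// Request representation: only the fields a client is allowed to supply.
class CreateNamespaceRequest {
  final String owner;
  final String description;

  CreateNamespaceRequest(String owner, String description) {
    this.owner = owner;
    this.description = description;
  }
}

// Service-layer model: includes server-managed fields such as createdAt.
class Namespace {
  final String name;
  final String owner;
  final String description;
  final Instant createdAt;

  Namespace(String name, String owner, String description, Instant createdAt) {
    this.name = name;
    this.owner = owner;
    this.description = description;
    this.createdAt = createdAt;
  }
}

// Response representation: what the API exposes, with the timestamp as ISO-8601 text.
class NamespaceResponse {
  final String name;
  final String owner;
  final String description;
  final String createdAt;

  NamespaceResponse(String name, String owner, String description, String createdAt) {
    this.name = name;
    this.owner = owner;
    this.description = description;
    this.createdAt = createdAt;
  }
}

// Mapper between the representations, so neither side leaks into the other.
class NamespaceMapper {
  static Namespace toModel(String name, CreateNamespaceRequest request, Instant now) {
    return new Namespace(name, request.owner, request.description, now);
  }

  static NamespaceResponse toResponse(Namespace namespace) {
    return new NamespaceResponse(
        namespace.name,
        namespace.owner,
        namespace.description,
        namespace.createdAt.toString());
  }
}
```

Since the request type has no `createdAt` field, a client cannot set it by accident, which is one of the unexpected bugs that shared request / response objects tend to invite.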
I recommend that we introduce vulnerability scans early in this project so that we can keep our security posture healthy. We can achieve this by performing security scans (with a tool) on PRs and rejecting them if they introduce new vulnerabilities. At a minimum, we should be scanning our dependencies since it's rather easy to do and incurs little overhead. Snyk is a solid choice for this purpose.
In the future, we'll want to scan custom code as well (the code we actually wrote), but those types of scans are usually more cumbersome to perform and orchestrate. I don't think this needs to be tackled right away, however, it should be considered carefully.
For better cohesiveness and manageability, I think we should move to a feature-based package structure over the current layer-based one. I've created a branch as an example.
https://github.com/lulciuca/marquez/tree/feature-based-package-structure
Here's a nice article on the subject: http://www.javapractices.com/topic/TopicAction.do?Id=205