New Feature Proposal: Data Lineage
At a time when facts are not always reliable and data sourcing can be suspect, it is important to show where data originated. The intention of this feature set is therefore to create a digital map of the data source(s), both in the general context of the entire cached data set and in the specific context of each endpoint.
What is Data Lineage?
According to Wikipedia, "Data lineage includes the data origin, what happens to it, and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process."
General Case: /lineage endpoint
An endpoint for lineage is needed, covering both data stored in the cache and endpoints where data is gathered dynamically. The report should be digital, ideally in JSON format, so that users can programmatically trace the source and understand the processes. Each processing step should also be described in a JSON structure, and important included libraries can be referenced as the method used to capture the data. Additionally, the report should state when the last local update was run to create the cache and what was used to run it, and it should call out both static and dynamic sources. Additional details will be added to this issue, in the sections below, as the feature is designed.
Ideas
- The report should be digital, ideally in JSON format, so that users can programmatically trace the source and understand the processes
- Each processing step should also be described in a JSON structure
- Important included libraries can be referenced as the method to capture the data
- The report should include when the last local update was run to create the cache and what was used to run it
- Both static and dynamic sources should be called out
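The ideas above can be sketched as a small report builder. This is only an illustration under stated assumptions, not a proposed final schema: the function name `build_lineage_report`, the field names (`last_cache_update`, `sources`, `steps`, `kind`, `library`), and all example values are hypothetical placeholders.

```python
import json
from datetime import datetime, timezone


def build_lineage_report(last_cache_update, sources, steps):
    """Assemble a hypothetical /lineage report as a JSON-serializable dict."""
    return {
        # When the last local update was run to create the cache.
        "last_cache_update": last_cache_update.isoformat(),
        # Static and dynamic sources are called out separately.
        "sources": {
            "static": [s for s in sources if s["kind"] == "static"],
            "dynamic": [s for s in sources if s["kind"] == "dynamic"],
        },
        # Each step is itself a JSON object and may name the library used.
        "steps": steps,
    }


# Placeholder values for illustration only; not real project data.
example_sources = [
    {"name": "example-static-file", "kind": "static", "url": "https://example.com/data.csv"},
    {"name": "example-live-api", "kind": "dynamic", "url": "https://example.com/api"},
]
example_steps = [
    {"order": 1, "description": "Download the source file", "library": "requests"},
    {"order": 2, "description": "Parse rows into the cache", "library": "csv"},
]

report = build_lineage_report(
    datetime(2024, 1, 1, tzinfo=timezone.utc), example_sources, example_steps
)
print(json.dumps(report, indent=2))
```

Grouping sources by kind keeps the static/dynamic distinction visible at the top level of the report; a flat list with a `kind` field on each entry would work equally well.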
Specific Case: lineage for each query endpoint
When a query is run, each endpoint should report its data source(s), including any served from the system's own cache. Ideally, digital sources should be linkable so that those interested can follow the trail. The URLs, in particular for Wikipedia and EDGAR, should be disclosed with each query result as an additional JSON field. If locally cached data is used in the result, that should be referred to as well. Note that a result can reference the general lineage endpoint.
Ideas
- The URLs, in particular for Wikipedia and EDGAR, should be disclosed with each query result as an additional JSON field
- If locally cached data is used in the result, that should be referred to as well. Note that a result can reference the general lineage endpoint.
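One way the per-query disclosure could look is sketched below. The helper `with_lineage`, the `lineage` field and its keys (`sources`, `used_cache`, `cache_lineage`), and the sample result and URLs are all assumptions made for illustration; the actual field layout is still to be designed.

```python
import json


def with_lineage(result, source_urls, used_cache, lineage_endpoint="/lineage"):
    """Return a copy of a query result with a hypothetical "lineage" field added."""
    enriched = dict(result)  # shallow copy so the original result is not mutated
    lineage = {"sources": list(source_urls), "used_cache": used_cache}
    if used_cache:
        # Point back to the general lineage endpoint for cache details.
        lineage["cache_lineage"] = lineage_endpoint
    enriched["lineage"] = lineage
    return enriched


# Placeholder query result and URLs, for illustration only.
raw_result = {"query": "example", "value": 42}
enriched_result = with_lineage(
    raw_result,
    source_urls=[
        "https://en.wikipedia.org/wiki/Data_lineage",
        "https://www.sec.gov/edgar",
    ],
    used_cache=True,
)
print(json.dumps(enriched_result, indent=2))
```

Keeping lineage in a single extra field leaves the existing result keys untouched, so current consumers of the endpoints would not need to change.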