opendataalex / process_tracker_python
ProcessTracker is a framework for managing data integration processes.
License: GNU General Public License v3.0
Need to research adding collectors for memory and CPU utilization to further expand the basic performance cluster management capabilities built in #10.
Needs to take advantage of aws_utilities.
Like with source, need to track where the data is loading to. Can have multiple targets.
Extracts should be processed in order of date registered. While the current code should handle this, the ordering needs to be enforced in the queries that return extracts.
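A minimal sketch of the enforcement idea: rather than trusting insertion order, sort explicitly on the registration timestamp before handing extracts back. The field name `registration_datetime` is an assumption for illustration, not the actual ProcessTracker schema.

```python
from datetime import datetime

def order_extracts(extracts):
    """Return extracts sorted by registration timestamp, oldest first.

    Hypothetical helper: in a real query this would be an ORDER BY on
    the registration column rather than an in-memory sort.
    """
    return sorted(extracts, key=lambda e: e["registration_datetime"])

extracts = [
    {"filename": "b.csv", "registration_datetime": datetime(2019, 5, 2)},
    {"filename": "a.csv", "registration_datetime": datetime(2019, 5, 1)},
]
```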
Currently there are four spots that have to be updated:
Need to have the ability to set up, or wipe and reload, the data store via the CLI.
All finders should also allow for a given status, not just 'ready'; 'ready' should be its default.
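One way the finder signature could look, sketched with an assumed in-memory record shape (the real finders query the data store; the function and field names here are hypothetical):

```python
def find_extracts_by_status(extracts, status="ready"):
    """Filter extracts on status; 'ready' stays the default so existing
    callers keep their behavior, but any status can be requested."""
    return [e for e in extracts if e["status"] == status]

rows = [
    {"filename": "a.csv", "status": "ready"},
    {"filename": "b.csv", "status": "loading"},
]
```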
When ProcessTracker finds extracts, it needs a method to associate the process run with those extracts and change their status. Finders shouldn't necessarily do this by default, though they would be an obvious place for it.
Need to be able to handle the situation where an extract is dependent on another extract being completed before its children get processed.
Currently, process_tracker only handles AWS CLI-style addresses (s3://). It should also allow URL addresses that use the https method (https://<bucket>.s3.amazonaws.com/). All code working with S3 needs to be modified to support both forms.
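A sketch of normalizing both address styles down to a bucket/key pair, which the S3-handling code could then use uniformly. The function name is an assumption, not part of the library:

```python
import re

def parse_s3_path(path):
    """Return (bucket, key) for either an s3://bucket/key address or a
    virtual-hosted style https://bucket.s3.amazonaws.com/key URL."""
    cli = re.match(r"s3://([^/]+)/(.+)", path)
    if cli:
        return cli.group(1), cli.group(2)
    url = re.match(r"https://([^.]+)\.s3\.amazonaws\.com/(.+)", path)
    if url:
        return url.group(1), url.group(2)
    raise ValueError("not a recognized S3 address: %s" % path)
```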
For processes that don't require locking, there should still be some performance management, such as a maximum number of concurrent runs.
Have been recommended to switch over to PyTest. This tracks the testing investigation and, if found worthwhile, the switch.
The CLI tool should find the latest process run by a given process name and be able to change its status (provided it's not running or has failed). This would be ideal for handling processes that have gone into 'on hold' status.
The location_type field for location_lkup should be location_type_id.
Need to be able to add or delete process dependencies from the CLI tool using the process names.
As an extension to other audit information, we can also track information about locations:
Location Tracker needs to handle duplicate names gracefully. While names should still be unique, the code should try to catch the error. Currently the unique constraint on the database triggers the rollback, but that should be the last resort.
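The "last resort" pattern described above can be sketched with a look-up-first, insert-second flow where the constraint error is only caught to cover a race. This uses sqlite3 purely for illustration; the table and function names are assumptions, not the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location_lkup (location_name TEXT UNIQUE)")

def register_location(conn, name):
    """Return the id for a location name, creating it if needed.

    The SELECT handles the common duplicate case; the IntegrityError
    handler only fires if another writer inserts the same name between
    our SELECT and INSERT, so the rollback path truly is a last resort.
    """
    row = conn.execute(
        "SELECT rowid FROM location_lkup WHERE location_name = ?", (name,)
    ).fetchone()
    if row:
        return row[0]
    try:
        cur = conn.execute(
            "INSERT INTO location_lkup (location_name) VALUES (?)", (name,)
        )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return conn.execute(
            "SELECT rowid FROM location_lkup WHERE location_name = ?", (name,)
        ).fetchone()[0]
```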
Tracking of sources/targets should be also at the object level, and not just the general source/target name.
Quality-of-life improvement: add a function to get a file's contents. The question is how far this goes (i.e., which file types?) and whether it should be part of the framework at all. Noting it here to give it further thought.
Extracts should be associated with a dataset type. Instead of its own object, maybe associate it with Source?
SQLAlchemy supports relational databases other than PostgreSQL. Need to add support for those.
Need to be able to read config file from s3 location. This is to support using tools like lambda.
Should provide the ability for filename lookups to be done via regular expression.
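The lookup could accept a pattern and filter on it, along these lines (function name assumed for illustration):

```python
import re

def find_extracts_matching(filenames, pattern):
    """Return the filenames that match the given regular expression."""
    regex = re.compile(pattern)
    return [f for f in filenames if regex.search(f)]
```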
Need to build a user interface so that audits can be easily reviewed. Also should allow for management of Process Tracking framework.
Need to verify if the CLI tool performs a cascade delete when deleting lookup objects. Need to prevent it if it does.
An error is thrown in Lambda when the settings manager can't find a config file. Creating a local config should only be attempted within the CLI - not triggered just by importing ExtractTracker or ProcessTracker (or any part of the library, really).
When importing process_tracker, lots of work is kicked off by the settings manager. Certain things should not be initialized if they are only going to throw an error (or better yet, fix them so they work as intended without throwing errors).
TravisCI fails on tagged builds even though the build itself is successful. This is due to a race condition: whichever database build finishes first triggers the deploy instead of waiting for both to finish.
Found with the initial beta release that the package is not packaged correctly. More than just the core classes are importable.
Need to have the ability to register more than one source to a process.
Need to be able to manage total capacity with process runs.
Once the initial version is ready to go, need to research and integrate CI/CD so that going forward new releases can be automatically tested, built, and uploaded to PyPI.
Instead of registering each extract, a user may want to wait and register them in one go by location name and/or path.
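Bulk registration by path could start from a directory scan like the sketch below, with each found file then registered in one pass. The function name and glob-based approach are assumptions, not the library's actual mechanism:

```python
from pathlib import Path

def extracts_in_location(path, pattern="*"):
    """Return the sorted filenames under a location path that match a
    glob pattern, ready to be registered as extracts in one go."""
    return sorted(p.name for p in Path(path).glob(pattern) if p.is_file())
```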
Need to add the ability for ProcessTracker to check for dependencies and if they are in a state to block the process from running.
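The check could reduce to: given a process's upstream dependencies and the latest status of each, return whichever parents are in a blocking state. The data shapes and the set of blocking statuses here are illustrative assumptions:

```python
def blocking_dependencies(process_name, dependencies, statuses):
    """Return upstream processes whose latest run status should block
    this process from running.

    `dependencies` maps a process name to its parent processes;
    `statuses` maps a process name to its most recent run status.
    """
    blocking = {"running", "failed"}
    return [
        parent
        for parent in dependencies.get(process_name, [])
        if statuses.get(parent) in blocking
    ]
```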
Need to provide some options for working with non-relational databases. Requires some research as to which to support.
Need to provide an option to reset the data store in the event that becomes necessary.
Need to add ability to track extract low and high dates, record counts, etc.
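The low/high date and record count fields could be derived in one pass over an extract's records, along these lines (field names assumed for illustration):

```python
from datetime import date

def extract_audit_fields(records, date_field):
    """Compute low date, high date, and record count for an extract's
    records, keyed on the given date column name."""
    dates = [r[date_field] for r in records]
    return {
        "extract_low_date": min(dates),
        "extract_high_date": max(dates),
        "extract_record_count": len(records),
    }
```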
Need to go through the entire project and add/fix/enhance logging capabilities so that logging works as expected. Logging should also write to a log file, not just the console.
Need to make find_ready_extracts_by_location easier to understand. Currently the variable location points to location_name and not the filepath. Need to provide the option for one or the other.
Need a way to put dependent processes into 'on hold' status until the failure is resolved.
Just realized that the lookups should return Extract objects, not filepath + filename strings. The string form can be generated via repr or another function; otherwise the lookups have to be done again to modify the record.
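A stand-in for what a finder could return instead of bare strings: an object that carries the id and status (so it can be modified directly) and can still produce the full filepath on demand. The class and field names are hypothetical, not the actual Extract model:

```python
import os
from dataclasses import dataclass

@dataclass
class ExtractRecord:
    """Illustrative record a finder could return: the caller can update
    status without a second lookup, and derive the path when needed."""
    extract_id: int
    filename: str
    location_path: str
    status: str = "ready"

    def full_filepath(self):
        # The filepath string becomes a derived value, not the return type.
        return os.path.join(self.location_path, self.filename)
```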
Process needs to have the ability to be assigned to a specific cluster. That allows for resource allocation management.
Need to add documentation on how to use and what features are available.
Need to move data_store into utilities where it belongs. Also need to set configuration correctly and not just in verify_and_connect.
Thought of an edge case: in the unlikely event that a parent extract file and its child extract file(s) are in the same extract set with a recorded dependency, the process will error out on the child extract. Need to add a check for that and bypass the dependency checker.
Need to have the ability to encrypt data store passwords. This way they are not stored in plain text in the config file.
Need to have a command line tool to be able to initialize process tracking data store and allow for adding default items (Actors, Tools, etc.).
Need to have the ability to upgrade the data store through the CLI tool. Also need to determine a policy for when to deprecate an upgrade (we can't keep every upgrade in every version).
Idea came initially from working on other audit fields for process. Would be nice to have options for extracts as well. This one can be big though because we're starting to approach data profiling territory.
Need to provide some initialization defaults. Also need to verify that location type is being set correctly.