opendataalex / process_tracker_python
ProcessTracker is a framework for managing data integration processes.
License: GNU General Public License v3.0
Need to research adding collectors for memory and CPU utilization to further expand the basic performance cluster management capabilities built in #10.
Needs to take advantage of aws_utilities.
Like with source, need to track where the data is loading to. Can have multiple targets.
Extracts should be processed in order of date registered. While the current code should handle this, the ordering needs to be enforced in the queries that return extracts.
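A minimal sketch of the enforcement idea: rather than trusting insertion order, sort explicitly on the registration timestamp before handing extracts back. The field name `registration_datetime` is an assumption for illustration, not the actual ProcessTracker schema.

```python
from datetime import datetime

def order_extracts(extracts):
    """Return extracts sorted by registration timestamp, oldest first.

    Hypothetical helper: in a real query this would be an ORDER BY on
    the registration column rather than an in-memory sort.
    """
    return sorted(extracts, key=lambda e: e["registration_datetime"])

extracts = [
    {"filename": "b.csv", "registration_datetime": datetime(2019, 5, 2)},
    {"filename": "a.csv", "registration_datetime": datetime(2019, 5, 1)},
]
```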
Currently there are four spots that have to be updated:
Need to have the ability to set up, or wipe and reload, the data store via the CLI.
All finders should also allow for a given status, not just 'ready'; 'ready' should be its default.
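One way the finder signature could look, sketched with an assumed in-memory record shape (the real finders query the data store; the function and field names here are hypothetical):

```python
def find_extracts_by_status(extracts, status="ready"):
    """Filter extracts on status; 'ready' stays the default so existing
    callers keep their behavior, but any status can be requested."""
    return [e for e in extracts if e["status"] == status]

rows = [
    {"filename": "a.csv", "status": "ready"},
    {"filename": "b.csv", "status": "loading"},
]
```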
When ProcessTracker finds extracts, it needs a method to associate the process run with those extracts and change their status. Finders shouldn't necessarily do this by default, though they would be an obvious place for it.
Need to be able to handle the situation where an extract is dependent on another extract being completed before its children get processed.
Currently, process_tracker only handles AWS CLI-style addresses (s3://). It should also allow URL addresses that use the https method (https://<bucket>.s3.amazonaws.com/). All code working with S3 needs to be modified to support both forms.
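A sketch of normalizing both address styles down to a bucket/key pair, which the S3-handling code could then use uniformly. The function name is an assumption, not part of the library:

```python
import re

def parse_s3_path(path):
    """Return (bucket, key) for either an s3://bucket/key address or a
    virtual-hosted style https://bucket.s3.amazonaws.com/key URL."""
    cli = re.match(r"s3://([^/]+)/(.+)", path)
    if cli:
        return cli.group(1), cli.group(2)
    url = re.match(r"https://([^.]+)\.s3\.amazonaws\.com/(.+)", path)
    if url:
        return url.group(1), url.group(2)
    raise ValueError("not a recognized S3 address: %s" % path)
```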
For processes that don't require locking, there should still be some performance management, such as a maximum number of concurrent runs.
Have been recommended to switch over to PyTest. This tracks the testing investigation and, if found worthwhile, the switch.
The CLI tool should find the latest process run by a given process name and be able to change its status (provided it's not running or has failed). This would be ideal for handling processes that have gone into 'on hold' status.
The location_type field for location_lkup should be location_type_id.
Need to be able to add or delete process dependencies from the CLI tool using the process names.
As an extension to other audit information, we can also track information about locations:
Location Tracker needs to handle duplicate names gracefully. While names should still be unique, the code should try to catch the error. Currently the unique constraint on the database triggers the rollback, but that should be the last resort.
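The "last resort" pattern described above can be sketched with a look-up-first, insert-second flow where the constraint error is only caught to cover a race. This uses sqlite3 purely for illustration; the table and function names are assumptions, not the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location_lkup (location_name TEXT UNIQUE)")

def register_location(conn, name):
    """Return the id for a location name, creating it if needed.

    The SELECT handles the common duplicate case; the IntegrityError
    handler only fires if another writer inserts the same name between
    our SELECT and INSERT, so the rollback path truly is a last resort.
    """
    row = conn.execute(
        "SELECT rowid FROM location_lkup WHERE location_name = ?", (name,)
    ).fetchone()
    if row:
        return row[0]
    try:
        cur = conn.execute(
            "INSERT INTO location_lkup (location_name) VALUES (?)", (name,)
        )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return conn.execute(
            "SELECT rowid FROM location_lkup WHERE location_name = ?", (name,)
        ).fetchone()[0]
```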
Tracking of sources/targets should be also at the object level, and not just the general source/target name.
Quality-of-life improvement: add a function to get a file's contents. The question is how far this goes (i.e., which file types?) and whether it should be part of the framework at all. Noting it here to give it further thought.
Extracts should be associated with a dataset type. Instead of its own object, maybe associate it with Source?
SQLAlchemy supports relational databases other than PostgreSQL. Need to add support for those.
Need to be able to read config file from s3 location. This is to support using tools like lambda.
Should provide the ability for filename lookups to be done via regular expression.
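The lookup could accept a pattern and filter on it, along these lines (function name assumed for illustration):

```python
import re

def find_extracts_matching(filenames, pattern):
    """Return the filenames that match the given regular expression."""
    regex = re.compile(pattern)
    return [f for f in filenames if regex.search(f)]
```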
Need to build a user interface so that audits can be easily reviewed. Also should allow for management of Process Tracking framework.
Need to verify if the CLI tool performs a cascade delete when deleting lookup objects. Need to prevent it if it does.
An error is thrown in Lambda when the settings manager can't find a config file. Creating a local config should only be attempted within the CLI - not triggered just by importing ExtractTracker or ProcessTracker (or any part of the library, really).
When importing process_tracker, lots of work is kicked off by the settings manager. Certain things should not be initialized if they are only going to throw an error (or better yet, fix them so they work as intended without throwing errors).
TravisCI fails on tagged builds even though the build itself is successful. This is due to a race condition: whichever database build finishes first triggers the deploy instead of waiting for both to finish.
Found with the initial beta release that the package is not packaged correctly. More than just the core classes are importable.
Need to have the ability to register more than one source to a process.
Need to be able to manage total capacity with process runs.
Once the initial version is ready to go, need to research and integrate CI/CD so that going forward new releases can be automatically tested, built, and uploaded to PyPI.
Instead of registering each extract, a user may want to wait and register them in one go by location name and/or path.
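Bulk registration by path could start from a directory scan like the sketch below, with each found file then registered in one pass. The function name and glob-based approach are assumptions, not the library's actual mechanism:

```python
from pathlib import Path

def extracts_in_location(path, pattern="*"):
    """Return the sorted filenames under a location path that match a
    glob pattern, ready to be registered as extracts in one go."""
    return sorted(p.name for p in Path(path).glob(pattern) if p.is_file())
```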
Need to add the ability for ProcessTracker to check for dependencies and if they are in a state to block the process from running.
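The check could reduce to: given a process's upstream dependencies and the latest status of each, return whichever parents are in a blocking state. The data shapes and the set of blocking statuses here are illustrative assumptions:

```python
def blocking_dependencies(process_name, dependencies, statuses):
    """Return upstream processes whose latest run status should block
    this process from running.

    `dependencies` maps a process name to its parent processes;
    `statuses` maps a process name to its most recent run status.
    """
    blocking = {"running", "failed"}
    return [
        parent
        for parent in dependencies.get(process_name, [])
        if statuses.get(parent) in blocking
    ]
```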
Need to provide some options for working with non-relational databases. Requires some research as to which to support.
Need to provide an option to reset the data store in the event that becomes necessary.
Need to add ability to track extract low and high dates, record counts, etc.
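The low/high date and record count fields could be derived in one pass over an extract's records, along these lines (field names assumed for illustration):

```python
from datetime import date

def extract_audit_fields(records, date_field):
    """Compute low date, high date, and record count for an extract's
    records, keyed on the given date column name."""
    dates = [r[date_field] for r in records]
    return {
        "extract_low_date": min(dates),
        "extract_high_date": max(dates),
        "extract_record_count": len(records),
    }
```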
Need to go through the entire project and add/fix/enhance logging capabilities so that logging works as expected. Logging should also write to a log file, not just the console.
Need to make find_ready_extracts_by_location easier to understand. Currently the variable location points to location_name and not the filepath. Need to provide the option for one or the other.
Need a way to put dependent processes into 'on hold' status until the failure is resolved.
Just realized that the lookups should return Extract objects, not filepath + filename strings. The string form can be generated via repr or another function; otherwise the lookups have to be done again to modify the record.
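A stand-in for what a finder could return instead of bare strings: an object that carries the id and status (so it can be modified directly) and can still produce the full filepath on demand. The class and field names are hypothetical, not the actual Extract model:

```python
import os
from dataclasses import dataclass

@dataclass
class ExtractRecord:
    """Illustrative record a finder could return: the caller can update
    status without a second lookup, and derive the path when needed."""
    extract_id: int
    filename: str
    location_path: str
    status: str = "ready"

    def full_filepath(self):
        # The filepath string becomes a derived value, not the return type.
        return os.path.join(self.location_path, self.filename)
```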
Process needs to have the ability to be assigned to a specific cluster. That allows for resource allocation management.
Need to add documentation on how to use and what features are available.
Need to move data_store into utilities where it belongs. Also need to set configuration correctly and not just in verify_and_connect.
Thought of an edge case: in the unlikely event that a parent extract file and its child extract file(s) are in the same extract set with a recorded dependency, the process will error out on the child extract. Need to add a check for that and bypass the dependency checker.
Need to have the ability to encrypt data store passwords. This way they are not stored in plain text in the config file.
Need to have a command line tool to be able to initialize process tracking data store and allow for adding default items (Actors, Tools, etc.).
Need to have the ability to upgrade the data store through the CLI tool. Also need to determine a policy for when to deprecate an upgrade (we can't keep every upgrade in every version).
Idea came initially from working on other audit fields for process. Would be nice to have options for extracts as well. This one can be big though because we're starting to approach data profiling territory.
Need to provide some initialization defaults. Also need to verify that location type is being set correctly.