ual-re / rebach Goto Github PK

Python-based tool to enable data preservation to a cloud-hosted storage solution

License: MIT License

Python 100.00%

rebach's Introduction

About Us

Research Engagement (RE) helps UA faculty, researchers, and students with every stage of their research. We have three functional units:

Research Incubator
Data Cooperative
Scholarly Communications

Members of the Data Cooperative:

GitHub	Name	Role/Title
	Jeff Oliver	Data Science Specialist
	Fernando Rios	Data Management Specialist
	Kiri Carini	GIS Specialist
	Jonathan Ratliff	Research Data Repository Assistant

Experience

Specialty	Tools/Resources
Data Science
Data Management & Publishing
GIS

rebach's People

Contributors

Stargazers

Watchers

Forkers

astrochun zoidy

rebach's Issues

Bug: bagger fails on Collections

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

Bagger currently can't process collections because it tries to extract the license from the metadata (which doesn't exist).

Steps To Reproduce

Process any collection

Display warning message for articles without curation folder

Description:
Currently, after fetching article details from ReDATA, the script checks for the presence of a curation folder related to the fetched article. If a curation folder is found, its name is displayed on the console. However, if no curation folder exists, nothing is displayed.

The requirement is to enhance the script to display a warning message on the console for article IDs that do not have a curation folder. Displaying a warning message for articles without a curation folder helps in identifying and addressing missing folders.
The warning message can provide information such as "No curation folder found for Article ID: [article_id]."

Steps to see existing behavior

Execute the script to fetch article details from ReDATA.
Observe the console output for articles that have a curation folder.
Note that no warning message is displayed for articles without a curation folder.

Solution:

Modify the script to include an additional step after checking for the curation folder.
If a curation folder is not found, display a warning message on the console indicating the absence of the curation folder for the respective article ID.
The warning message should be clear and informative, enabling users to identify articles without a curation folder easily.

Merge ReBACH-Bagger with ReBACH

Merge the ReBACH Bagger tool with the main ReBACH repo.

Bug: connection errors during file download are not handled

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

If an http error happens during the download of a file from figshare, the error is not handled and the program exits. The error should be handled and the download retried.

Steps To Reproduce

During the download of a file, if the connection is interrupted, the error will be thrown and the program will exit.

e.g., the error requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) is thrown on the following line.
''''
File "/home/u17/frios/ReBACH/figshare/Article.py", line 352, in __download_files
with requests.get(file['download_url'], stream=True, allow_redirects=True,
'''

Need a try catch around all blocks containing file downloads

Implement spec use case 3.5 (dry run mode)

Is there an existing issue for this?

I have searched the existing issues

Description

Currently, Use Case 3.5 from the specification is not implemented. When implemented, ReBACH should have the possibility of running in dry run mode

Suggested Implementation

Look at the bagger dry run mode as well. See #37

Ability to specify bagger config to use in the main ReBACH config

Is there an existing issue for this?

I have searched the existing issues

Description

Currently, ReBACH just uses the default config file that bagger returns. It would be useful to be able to specify in the main ReBACH config which bagger config to use. This is useful to more explicitly switch between e.g. production and testing configs.

Suggested Implementation

L51 of Integration.py should explicitly pass in the config to the get_args function, as read from the main ReBACH config. If the config is not specified, then it will fall back to the current implementation (not passing in any arguments to get_args

Bug: hashing files results in out of memory error

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

Performing a hash check using hashlib results in out of memory errors for large files due to the entire file being read into memory. This is similar to Issue #28

Steps To Reproduce

Process a dataset that contains large files, observe the process being killed with an out of memory error.

Incorrect metadata directory and filename in bag.py file

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

metadata_dir = f'v{version}/METADATA/'
metadata_filename = f'preservation_final_{article_id}.json'

In the above chunk of code which is there in bagger/bag.py file, the variable 'version' holds the value 'v01'. Therefore, adding 'v' would result in 'vv01'. To correct this, the redundant 'v' should be removed. Additionally, in the preservation staging storage, the metadata filename does not have the phrase 'preservation_final_'. Hence 'preservation_final_' should be removed.

Steps To Reproduce

Refer the below chunk of code in bagger/bag.py file

metadata_dir = f'v{version}/METADATA/'
metadata_filename = f'preservation_final_{article_id}.json'

Bug: crash when using dart workflows that do not upload bags

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

When a DART workflow that does not include an upload step, the program crashes

Steps To Reproduce

Edit the workflow.json file and delete the storageServices key to disable any uploads. Observe the program crashes

Missing first_depositor_full_name in preservation storage folder creation

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

Description:
When creating a folder in the preservation storage for each deposit in ReDATA, there is a specific format that needs to be followed: [article_id]_[version]_[first_depositor_full_name]_[hash]. However, in some cases, the 'first_depositor_full_name' is missing, and instead, it is displayed as '_' (underscore symbol). This issue occurs because the first depositor full name is fetched from the Figshare API response, specifically from the 'url_name' field. However, in certain cases, the 'url_name' field is set to '_' instead of the actual name.

Expected Behavior:
The folder name should consistently follow the format [article_id]_[version]_[first_depositor_full_name]_[hash] for all deposits.

Actual Behavior:
For certain deposits, the 'first_depositor_full_name' is displayed as '_' in the folder name, instead of the actual depositor name.

Solution:
Instead of relying on the 'url_name' field from the Figshare API response to retrieve the first depositor's full name, consider retrieving the 'full_name' field instead. Since 'full_name' contains spaces, replace them with underscores after fetching.

Steps To Reproduce

Execute the script by providing the article ID that has the depositor's full name as "USA National Phenology Network." Example: 20736700.
Go to the preservation storage location and search for the folder created for the article.
Notice that the first depositor's full name is missing from the folder name, and it is displayed as '_'.

Post Process Script Not Executed

This issue involves addressing existing discrepancy between the ReBACH Software Specification document and the current implementation

As per the ReBACH Software Specification document (refer 'Data and Metadata Integration Requirements – Function' section in the document), it was stated to execute a post-processing external script after performing step 3 to step 6. However, upon review, it was found that no code was present to execute a post-processing script. To address this, a code has been added to execute a post-processing script. The addition of the post-processing script is achieved by calling the method 'post_process_script_function'.

Enhance ReBACH to accept specific article and collection IDs for selective processing

Is there an existing issue for this?

I have searched the existing issues

Description

Currently, when executing the ReBACH software, it processes all articles and collections on Figshare. There is no option to select specific article and collection IDs for processing. This can be time-consuming and unnecessary, as not all articles and collections need to be processed every time.

Suggested Implementation

Note: This enhancement needs to be finished prior to the completion of another enhancement #27

The ReBACH software is enhanced to include a new feature. An option is introduced to specify which articles and collections to process. This is done by using a command-line argument called '--Ids' followed by a list of IDs in the format "[1, 2, 3]". This argument is optional. If no '--Ids' argument is provided, the software will continue processing all articles and collections by default. When the '--Ids' argument is used, only the specified articles and collections will be processed and uploaded to Wasabi. However, this enhancement focuses on accepting specific article and collection IDs from a command-line argument, and the implementation of processing and uploading specific articles and collections will be addressed separately in issue #27

To execute the software with specific IDs, use a command like this:
python app.py .env.ini --Ids "[article and/or collection Ids]"

Ex: python app.py .env.ini --Ids "[22800527, 21642077]"

Bug: missing item in bagger config

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

The delete option is missing from the example config. This makes it impossible to set the flag to false (which is the default) unless bagger is executed standalone.

Steps To Reproduce

See default.example.toml

Bug: Sometimes API requests fail but retry doesn't work

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

Sometimes, API requests fail and the code will retry. However, the retry attempt sometimes produces unexpected results. For example, during the item fetching stage, if a timeout error occurs, the code will retry but instead, the item fetching process stops as if there were no more items to fetch.

This issue may be fixed by implementing more robust retry code. Instead of the existing solution, use the built-in retry functionality of the requests library. This was already implemented for file download retries in #83. This issue is to implement the same for all uses of the requests library.

Steps To Reproduce

See above.

Upload bag with changed curation files

Is there an existing issue for this?

I have searched the existing issues

Description

Currently, if a preservation bag exists on preservation storage, only changes in the Figshare data/metadata will result in detecting that a bag being processed is different than the corresponding bag on preservation storage (via the hash in the bag name).

This means that if there are any changes to any other part of the bagged content that is not coming from the Figshare side, (e.g., curation metadata), ReBACH will show a message saying that the bag being created is a duplicate of an existing bag and will not upload it to preservation storage. This is undesirable sometimes since curation files may be added/updated later. However, replacing the existing file when curation data changes is not desirable ALL the time (since it could be the result of an error)

Suggested Implementation

Implement in two phases
~~1. Add a check to see if the bag to be uploaded is a different size than the one in preservation if the hash in the bag name is the same. Display a warning if not (to allow checking the logs)~~
2. Add a config and/or commandline flag to enable overwriting existing bags with the same name

Edit: phase 1 isn't possible because Dart handles bag creation and upload so there is no easy way to check the bag size before it's uploaded. Therefore, the only way updated curation files can be uploaded is to overwrite the bag without the check (overwriting is already possible by setting the flag in the bagger config).

Bagger: remove host_bucket config item

Is there an existing issue for this?

I have searched the existing issues

Description

The host_bucket entry in the toml configuration for bagger can be constructed from the host and bucket items. To simplify the configuration and code, it should be removed

Suggested Implementation

Refer to

https://github.com/UAL-RE/ReBACH/blob/main/bagger/config/default.example.toml
ReBACH/bagger/wasabi.py

Line 31 in 619cbbd

f"s3host='{self.s3host}', s3hostbucket='{self.s3hostbucket}' )"
ReBACH/bagger/bag.py

Line 47 in 619cbbd

s3hostbucket=config['Wasabi']['host_bucket'])

Bug: number of articles fetched incorrectly counts articles that are not 'approved'

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

In the initial fetching of articles from the figshare API, the code appears to count articles that have not been published in the fetched article count (i.e., articles who's curation_status was not 'approved'). The count of article versions fetched appears to be correct. This means that when this issue appears, the total number of articles fetched is greater than the article versions fetched (this should never be the case)

Steps To Reproduce

Run app.py with no arguments to process all data on Figshare.

Bug: Bag gets uploaded despite folder validation failed

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

Sometimes, a bag is created and uploaded despite the preservation package folder validation failing.

Steps To Reproduce

On staging, article id 7873476

2023-10-27 19:16:44,205:INFO: ------- Processing article 7873476 version 2.
2023-10-27 19:16:44,205:INFO: Pre-processing script finished successfully.
2023-10-27 19:16:44,226:INFO: Checking if /opt/redata/mnt/preservation_staging/7873476_XXXXXXXX exists.
2023-10-27 19:16:44,267:INFO: Exists and is not empty, checking contents.
2023-10-27 19:16:44,329:INFO: Comparing Figshare file hashes against existing local files.
2023-10-27 19:16:44,371:INFO: /830051024_salaries-ipeds.csv file exists (hash match).
2023-10-27 19:16:44,391:ERROR: /opt/redata/mnt/preservation_staging/7873476_v02_XXXXXXX/v02/DATA/830054620_salaries-ipeds.csv does not exist.
2023-10-27 19:16:44,392:ERROR: Validation failed, deleting /opt/redata/mnt/preservation_staging/7873476_v02_XXXXXXX.
2023-10-27 19:16:44,532:INFO: /opt/redata/mnt/preservation_staging/7873476_v02_XXXXXXX deleted due to failed validations.
2023-10-27 19:16:44,533:INFO: Checking required files exist in associated curation folder /opt/redata/mnt/curation_testing/4.Published/.
2023-10-27 19:16:44,533:INFO: Curation files exist. Continuing execution.
2023-10-27 19:16:44,534:INFO: Checking and creating empty directories in preservation storage.
2023-10-27 19:16:44,902:INFO: Copying curation UAL_RDM files to preservation UAL_RDM folder.
2023-10-27 19:16:46,005:INFO: Copied curation files to preservation folder.
2023-10-27 19:16:46,005:INFO: Saving json in metadata folder for each version.
2023-10-27 19:16:46,172:INFO: Config file: bagger/config/default.toml
2023-10-27 19:16:46,172:INFO: Overriding bagger log file location logs with /opt/redata/mnt/logs from ReBACH Env file
2023-10-27 19:16:46,173:INFO: Processing preservation package '7873476_v02_XXXXXXX'
19:16:47 -     INFO: Job succeeded: 7873476_v02_XXXXXXXX.tar
2023-10-27 19:16:47,984:INFO: Status: SUCCESS.
2023-10-27 19:16:47,984:INFO: Exit code: 0.
2023-10-27 19:16:47,984:INFO: Preservation package '7873476_v02_XXXXXX' processed successfully

After the error, the files were deleted so the files should be redownloaded. However, no attempt is made to redownload the files resulting in an incorrect bag with missing files, despite the log showing a successful bag upload.

Chore: update redata-commons dependency

Is there an existing issue for this?

I have searched the existing issues

Description

Update redata in requirements.txt to 0.51. This release updates the requests and urllib3 libraries with possible breaking changes.

requests 2.29 introduces a change to how chunked downloads are handled
requests 2.30 introduces urllib3 2.0 which itself may have breaking changes

Suggested Implementation

ReBACH uses chunked downloads. Therefore, the following tests are needed.

test downloading large files (anything more than a few tens of MB)

Bug: Investigate possible bug in processing logic when files exist

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

Investigate possible bug where previously downloaded datasets contain all expected files and all file hashes match is not processed further. See

ReBACH/figshare/Article.py

Line 562 in 4463f8d

process_article = False

May also need to improve the log message in

ReBACH/figshare/Article.py

Line 564 in 4463f8d

self.logs.write_log_in_file('error', f"{file_path} does not exist.", True)

Steps To Reproduce

TBD

Feature: Add terminal coloring to console log messages

Is there an existing issue for this?

I have searched the existing issues

Description

Currently, it's difficult to quickly pick out errors and warnings in the logs output to the console. Adding color will make them easy to identify.

Suggested Implementation

Color only the message type (errors, warnings), not the entire log line
Color ERROR red, WARNING yellow, INFO remains uncolored
The code must detect whether the console is capable of outputting ANSI escape sequences on the platforms ReBACH may be expected to run or tested on (Windows 10+, Mac, Ubuntu)

Implement pre-processing script to determine if bag has been uploaded to preservation

Is there an existing issue for this?

I have searched the existing issues

Description

If the combined data and curation metadata folder does not exist in preservation staging storage, the software will re-download it. To avoid a lot of extra downloading when the record has already been preserved, check to make sure the bag isn't already preserved

Suggested Implementation

In the pre-processing script section

check to see if a bag with the same name as the package currently being processed exists in preservation storage.
if it does, skip further processing of the package (i.e., don't download any data files) and delete any files/folders already created
if it doesn't, continue processing the package (i.e., downloading the files, creating and uploading the bag)

Add option to continue file processing if there is an error

Is there an existing issue for this?

I have searched the existing issues

Description

In some cases, there may be errors when processing certain article IDs. Instead of aborting the entire process that precludes non-error IDs from being processed, add an option to allow skipping error items.

Suggested Implementation

Add a new flag on the command line that allows skipping errors when processing items. Do not make it an option in the config file since usually, we do want to catch errors. This option should only be used when ReBACH is executed interactively.

Update bagger docs to include testing

Is there an existing issue for this?

I have searched the existing issues

Description

Bagger has some unit tests in the /tests folder that are not documented. Update the readme to include how to run these tests

Suggested Implementation

No response

Bug: curation schema is too strict

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

The ualrdm-curationdirectory-schema.json schema is too strict in that it causes validation to fail when files we don't really care about for preservation purposes are missing.

For example, validation will fail if the METADATA folder does not contain the downloaded metadata json files. These will be regenerated on preservation so it doesn't matter if they're not there.

Steps To Reproduce

Run the validator on any curation directory with missing contents of the METADATA folder.

Check if a preservation archive exists in preservation storage if matching is unsuccessful

Is there an existing issue for this?

I have searched the existing issues

Description

Currently, if a dataset in figshare does not have a corresponding curation folder, a warning is returned. There could be several reasons why the curation folder doesn't exist

The curation folder was inadvertently deleted or renamed. The existing warning will catch this.
The item is already preserved in storage therefore the folder was deleted. If the item is already preserved, this is not a warning/error condition.

Therefore for 2., we can skip showing a warning if the item is already preserved. Note, this will require reconciling the functionality with #61

Suggested Implementation

No response

Keep track of how many items were successfully bagged

Is there an existing issue for this?

I have searched the existing issues

Description

The only way to see the items that were successful or had errors are to scroll up through the log. It would be useful to have a summary of how many items were successfully uploaded and how many were not. This info could be compared with how many should have been uploaded

Suggested Implementation

No response

Implement preprocessing script to delete temporary files before bagging

Is there an existing issue for this?

I have searched the existing issues

Description

Currently, temporary files (e.g., from Word, etc) in the UAL_RDM folder are included in bags. These should be deleted prior to bagging since preserving them is not needed.

Suggested Implementation

Implementation will likely be similar to #22

Files that should be deleted (there may be others)

Files starting with ~$ and ending in any of .docx, .doc, .xlsx, .xls
Files named .DS_Store
Files starting with ._.

Selective processing and uploading of articles and collections mentioned in the command-line argument

Currently, the ReBACH software processes and upload all articles and collections on Figshare, which can be time-consuming. With this enhancement, the software will be able to selectively process and upload articles and collections based on their provided IDs.

It is crucial to complete Enhancement #44 before implementing this enhancement because the former handles the extraction of article and collection IDs from command line argument. This enhancement handles processing and uploading of extracted article and collection Ids

In addition to this enhancement, the following new features have been added:

Calculation of required space when article IDs are explicitly provided.
Counting and display of articles for which curation folders exist and do not exist.

Bagger: Use host and bucket from the main configuration in the DART workflow

Is there an existing issue for this?

I have searched the existing issues

Description

Currently, DART uploads to the specified host and bucket defined in the workflow file. The config file for bagger also defines the host and bucket for other reason (see the bagger readme file). This is confusing because it would seem that DART should also use the host and bucket specified by the main config (as it does with the Wasabi credentials).

Suggested Implementation

Add a new configuration option to the bagger toml config that specifies whether the DART workflow file should be parsed. If the option is set, the json for the workflow should be parsed to replace the host and bucket with those given in the toml file.

Bug: Certain errors cause bags to be created and uploaded anyways

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

The ReBACH specification states that the bag creation process should be atomic (all steps must succeed or else the bag creation process should error out for that item). Refer to 4.4.1 steps 6 and 7 of the spec.

When there is an error copying curation files from curation storage, the software currently displays an error however bag creation continues and the incomplete bag is uploaded. Instead, bag creation should fail and should be deleted.

Steps To Reproduce

Pick a dataset to upload. Then,

Remove read permissions of one or more files in the UAL_RDM folder
Run ReBACH
Observe errors in the log after the line :INFO:Log - Copying files to preservation folder.
Note that the bag creation and upload process continues

Feature: Implement schema-based directory validation

Is there an existing issue for this?

I have searched the existing issues

Description

Currently, ReBACH verifies the structure of the UAL_RDM curation folder via hard-coded if statements. For improved flexibility, these checks should be done by validating against a schema.

Related: #57

Suggested Implementation

This could be implemented as a pre-processing script (the framework for which is already implemented). Sample validation schemas and code is available.

Bug: ReBACH fails on preserve1

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

ReBACH fails to run on preserve1 when curation storage is mounted as readonly. Reported by @pavithraarizona

Steps To Reproduce

Execute ReBACH on preserve1. Observe the error

The script encountered an error stating "The curation storage location specified in the config file could not be reached or read.".

Adding Slack notifications

Is there an existing issue for this?

I have searched the existing issues

Description

To make the software run better in an unattended fashion, generate alerts for Slack when the entire process completes. In case of success, a success notification should be generated. To check for errors, the software should scan the logs for errors, generate a summary and post to Slack.

Suggested Implementation

A new module will need to be written that executes at the very end of the process. New configuration settings need to be added for the Slack API.

Define method 'post_process_script_function' for directory structure integration with ReBACH bagger

This issue involves defining the method 'post_process_script_function' for integrating the directory structure with the ReBACH bagger.

The purpose of the method and the lines of code within it is to invoke the ReBACH bagger. The ReBACH Software Specification document specifies that this method must be passed three arguments: the path to the preservation package, the result code of the pre-processing script, and whether an error was encountered in steps 3 to 6. However, as of now, the method has been implemented to take only two parameters, and the third parameter has not yet been implemented.

Note: Steps 3 to 6 are defined in the 'Data and Metadata Integration Requirements – Function' section of the ReBACH Software Specification document.

Bug: schemas don't accept article_id's different than 8 digits

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

The ualrdm-curationdirectory-schema.json and ualrdm-preservationdirectory-schema.json schemas only accept article_ids with exactly 8 digits. This causes validation to fail when the article_id is a different length (e.g., on the staging server)

The specification only says the article_id must be an integer.

Steps To Reproduce

Create a curation or preservation folder with an article_id with number of digits <>8

Bug: size calculation is wrong when there are no matching items

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

When the IDs to match are given via the --ids parameter and none of them match an actual ID, the size calculation for the associated curation folders is wrong (should be zero but it isn't)

Steps To Reproduce

run the software with --ids 2229467800 (this is an invalid ID)
Observe the following: Total size of the curated folders for the matched articles: 338189063285 bytes (it looks like it's taking the size of the entire curation folder but it should actually be zero)

Script Execution Error: "Killed" message when downloading larger files of articles from ReDATA

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

Issue Description: During the execution of a script, an error occurred while attempting to download larger files of articles from the ReDATA platform. Instead of completing the download process, the script encounters a "killed" message, causing the execution to stop abruptly.

This is a P1 priority bug

Expected Behavior: The script should successfully download the larger files from ReDATA without any interruptions or errors.

Actual Behavior: While executing a script, a "killed" message is displayed, abruptly terminating the script execution. Consequently, the larger files are not downloaded, preventing the expected behavior of script.

Additional Information:
The download operation is performed using the following line of code:
filecontent = requests.get(file['download_url'], headers={'Authorization': 'token ' + self.api_token},
allow_redirects=True)

Applying the solution mentioned in the following URL may help in resolving the issue: https://gist.github.com/wasi0013/ab73f314f8070951b92f6670f68b2d80

Steps To Reproduce

Execute the script by providing the article ID that has a file of size more than 800 MB from ReDATA. Example: 22122197
Observe the script execution and error occurrence.

Folder for DART settings, etc.

We are examining the use of DART for our digital preservation as it has a number of built-in functions.

For the purpose of testing and sharing of set-ups, we should:

Create a branch called 01_dart
Create a folder called dart within the 01_dart branch
Upload a profile settings, and configuration that we will use (@zoidy has some initial set-ups) into dart folder

Bug: log messages and locations inconsistent

Is there an existing issue for this?

I have searched the existing issues

Description of the bug

Certain log messages appear as Info in the console but as Error in the log.
Additionally, the bagger log location should be overridable like the host/bucket of #47.
Finally to make the main logs and bagger logs sort together, the bagger log file name should be changed to the format of the main app.

Steps To Reproduce

Run the script. Note some logging messages during bagging show up as Info in the console but as Error in the log files
N/a
N/a

Update readthedocs

Is there an existing issue for this?

I have searched the existing issues

Description

Update readthedocs repositories with links to the appropriate things in this repo

Suggested Implementation

No response

Allow case insensitive comparisons when validating

Is there an existing issue for this?

I have searched the existing issues

Description

Validations for the curation directory structure contents are case sensitive. Allow for case insensitive checking of certain files. This prevents many directories from being deemed invalid because the code can't find the required files. In the future, validation should be done via schemas but for now, this will do.

Suggested Implementation

No response

Add additional log messages

Is there an existing issue for this?

I have searched the existing issues

Description

It is sometimes not always clear from the logs whether an action was taken or not. Improve the logs to add additional messages for some actions

Suggested Implementation

No response

Update bagit profile and dart workflows

Is there an existing issue for this?

I have searched the existing issues

Description

The ReBACH bagit profile needs to be revised with production settings. The associated dart workflows should be re-created with this profile and the latest dart version as well.

Suggested Implementation

No response