The CNN-Diffusion-MRIBrain-Segmentation repository was developed by Senthil Palanivelu, Suheyla Cetin Karayumak, Tashrif Billah, Sylvain Bouix, and Yogesh Rathi at Brigham and Women's Hospital (Harvard Medical School).
The CNN-Diffusion-MRIBrain-Segmentation-HCP repository was adapted from the CNN-Diffusion-MRIBrain-Segmentation Pipeline and is designed to run the HCP (Human Connectome Project) Masking Pipeline on a list of subjects. The pipeline involves several steps, including copying subject data from an S3 bucket, applying a brain masking process using a CNN model, and uploading the processed data back to the S3 bucket. The pipeline can be run in parallel using multiple cores, and it can process a specified number of subjects in each batch.
Running this adapted pipeline requires a GPU instance, so we used a g4dn.2xlarge EC2 instance with the updated AMI, Brainmasking_v9.16.2023.
- Adapted Pipeline Outline
- Prerequisites
- Configuration File Setup
- Running the Pipeline
- Monitoring and Logs
- Troubleshooting
- Contributing
- License
- Initialization: The pipeline is initialized with necessary configuration parameters, such as file paths, subject lists, and processing parameters.
- Copying subject data: Subject data is copied from the S3 bucket to the local file system for processing.
- Preparing input files: An input text file is created containing the list of subjects and their corresponding file paths for the CNN masking process.
- Running the CNN masking process: A brain masking process is applied to the subject data using a CNN model.
- Moving subject data to the 'processed' directory: The processed data is moved to a designated 'processed' directory.
- Cleaning up additional files: Additional files generated during the processing are moved to a separate directory.
- Uploading subject data to S3: The processed data is uploaded back to the S3 bucket.
- Uploading additional files to S3: Additional files generated during the processing are also uploaded to the S3 bucket.
- Verifying the upload and logging: The script verifies if the processed data has been successfully uploaded to the S3 bucket and logs the status.
- Deleting local data: Local data is deleted before starting another batch of subjects.
- Repeating the process: The pipeline runs in a loop, processing a specified number of subjects in each batch until all subjects have been processed.
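The "Verifying the upload and logging" step can be pictured as a comparison between the files on local disk and the keys present in the bucket. The sketch below is only an illustration of that idea, not the pipeline's actual code: the function name `find_missing_uploads` is hypothetical, and the remote keys are passed in as plain strings so the example does not depend on any S3 client.

```python
from pathlib import Path


def find_missing_uploads(local_root, remote_keys):
    """Return local files (relative POSIX paths) that are absent from remote_keys.

    Hypothetical helper: remote_keys is assumed to hold object keys relative
    to the same root that local_root represents.
    """
    local_root = Path(local_root)
    expected = {
        path.relative_to(local_root).as_posix()
        for path in local_root.rglob("*")
        if path.is_file()
    }
    # Anything expected locally but not listed remotely still needs uploading.
    return sorted(expected - set(remote_keys))
```

An empty result would indicate that every local file has a matching key, which is the condition under which deleting the local batch is safe.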
The Python script uses the argparse library to parse command-line arguments. Arguments can also be set in a configuration file, which keeps terminal invocations short. The script uses the multiprocessing library to parallelize certain tasks and speed up processing.
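One common way to combine argparse with a configuration file is to read the file first and let any flag given on the command line take precedence. The following is a minimal sketch of that pattern, not the script's real implementation. The `--config_loc`, `-dr`, and `-r` flags come from this README; the long names `--dry_run` and `--run`, the `--batch_size` flag, and the use of a `[DEFAULT]` section are assumptions made for illustration.

```python
import argparse
import configparser


def build_parser():
    parser = argparse.ArgumentParser(description="HCP masking pipeline (sketch)")
    parser.add_argument("--config_loc", default="hcp_config.ini")
    # -dr and -r both write to the same destination; None means "not given".
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("-dr", "--dry_run", dest="dry_run",
                      action="store_true", default=None)
    mode.add_argument("-r", "--run", dest="dry_run",
                      action="store_false", default=None)
    # Hypothetical override flag, shown only to illustrate the precedence rule.
    parser.add_argument("--batch_size", type=int, default=None)
    return parser


def resolve_settings(argv, config_text):
    """Merge config-file values with command-line overrides."""
    args = build_parser().parse_args(argv)
    config = configparser.ConfigParser()
    config.read_string(config_text)
    section = config["DEFAULT"]  # assumption: settings live in [DEFAULT]
    settings = {
        "dry_run": section.getboolean("dry_run"),
        "batch_size": section.getint("batch_size"),
    }
    for key in settings:
        cli_value = getattr(args, key)
        if cli_value is not None:  # only override when the flag was passed
            settings[key] = cli_value
    return settings
```

Using `None` as the default for every flag is what lets the code tell "flag omitted" apart from "flag set to False".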
The adapted pipeline requires a GPU instance to run. We used a g4dn.2xlarge EC2 instance with our prebuilt AMI, Brainmasking_v9.16.2023.
If you do not have access to our AMI, you can create your own g4dn.2xlarge EC2 instance using the Deep Learning AMI GPU TensorFlow 2.13 (Amazon Linux 2), then follow the CNN-Masking-Pipeline instructions to set up the environment and install the necessary packages within the new instance.
- S3path: A Pythonic file-system interface to Amazon S3
S3path has been added as a dependency in the environment_gpu.yml file. It should be installed automatically when you create the environment by following the previous instructions.
If you are using a different environment, you can install the additional package by running the following command:
pip install s3path
The hcp_config.ini file is essential for setting up and running the masking pipeline. It contains the settings that control the behavior of the pipeline. Below is an explanation of each variable in the configuration file. These settings can also be overridden by passing command-line arguments to the script.
- s3_bucket_hcp_root: The root directory of the HCP data on AWS S3. Example: s3://nda-enclave-c3371/HCP/
- caselist_file: Location of the case list file. Example: /home/ec2-user/CNN-Diffusion-MRIBrain-Segmentation/pipeline/caselists/PDC_SUBJECTS.txt
- group_name: Name of the group for the study. Example: PDC
- hcp_data_root: Root directory for HCP data. Example: /data/HCP/
- dry_run: Whether to perform a dry run (True/False). Example: True
- bids_study_root: Root directory for BIDS study. Example: /data/HCP/
- log_loc: Location for the processed log. Example: /home/ec2-user/logs/%(group_name)s/processed.log
- temp_log_loc: Location for the temporary log. Example: /home/ec2-user/logs/%(group_name)s/temp.log
- start_index: Starting index for processing. Example: 1
- end_index: Ending index for processing. Example: 1000
- batch_size: Batch size for processing. Example: 50
- input_text: Location of the input text file. Example: /home/ec2-user/process_list.txt
- model_folder: Location of the model folder. Example: /home/ec2-user/model_folder
- additional_files_loc: Location for additional files. Example: /data/HCP/processed/%(group_name)s/AdditionalFiles
- masking_script: Location of the masking script. Example: /home/ec2-user/CNN-Diffusion-MRIBrain-Segmentation/pipeline/run_pipeline.py
- multiprocessing: Whether to use multiprocessing (True/False). Example: True
- appendage: Appendage for the output file name. Example: _MR
- file_substring: Substring for the output file name. Example: _EdEp
- output_file_name: Name of the output file. Example: harmonization
To modify these settings, open the configuration file in a text editor and change the values accordingly. Save the file after making your changes.
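The `%(group_name)s` placeholders in the example values match the basic interpolation syntax of Python's configparser module. Assuming the file is read that way, the sketch below shows how those placeholders would expand. The `[DEFAULT]` section header and the helper name `load_paths` are assumptions for illustration, and the real file may use a different section name.

```python
import configparser

# Values copied from the examples in this README; the section name is assumed.
EXAMPLE_CONFIG = """
[DEFAULT]
group_name = PDC
log_loc = /home/ec2-user/logs/%(group_name)s/processed.log
temp_log_loc = /home/ec2-user/logs/%(group_name)s/temp.log
additional_files_loc = /data/HCP/processed/%(group_name)s/AdditionalFiles
"""


def load_paths(config_text):
    """Return the interpolated path settings from an INI string."""
    config = configparser.ConfigParser()  # basic interpolation is the default
    config.read_string(config_text)
    section = config["DEFAULT"]
    keys = ("log_loc", "temp_log_loc", "additional_files_loc")
    return {key: section[key] for key in keys}


paths = load_paths(EXAMPLE_CONFIG)
print(paths["log_loc"])  # /home/ec2-user/logs/PDC/processed.log
```

Changing only `group_name` would therefore redirect both log files and the additional-files directory without editing each path.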
The batch_size setting determines how many subjects are downloaded and processed at a time. A caselist can therefore contain 1000 subjects while the instance only needs enough local storage for one batch of batch_size subjects. After each batch, the processed data is uploaded to the S3 bucket and the local data is deleted before the next batch starts.
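The interaction between start_index, end_index, and batch_size can be sketched as a simple slicing loop. This is an illustration only: the helper name `iter_batches` is hypothetical, and the code assumes the indices are 1-based and inclusive, which this README does not state.

```python
def iter_batches(subjects, start_index, end_index, batch_size):
    """Yield successive batches of subject IDs.

    Assumption: start_index and end_index are 1-based and inclusive;
    an end_index past the end of the list is simply clamped.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    selected = subjects[max(start_index - 1, 0):end_index]
    for offset in range(0, len(selected), batch_size):
        yield selected[offset:offset + batch_size]


subjects = [f"subject_{i:03d}" for i in range(1, 8)]
for batch in iter_batches(subjects, start_index=1, end_index=1000, batch_size=3):
    print(batch)  # three batches of sizes 3, 3, and 1
```

Only one batch is held on disk at a time, so peak storage scales with batch_size rather than with the length of the caselist.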
The caselist_file
should contain a list of subject identifiers, one per line. These should not be full paths but rather just the unique identifiers for each subject you wish to process.
For example, the caselist_file
might look like:
subject_001
subject_002
subject_003
...
The pipeline uses this list to identify which subjects to process. Each subject should have its own folder inside the group folder named by the group_name variable in the configuration file. The group folder should be located in the S3 bucket specified by the s3_bucket_hcp_root variable.
The subjects are copied from the S3 bucket to the local EC2 instance and stored in the directory specified by the hcp_data_root variable. They are processed in the order they appear in the caselist_file.
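A caselist in this format can be read with a few lines of Python. The sketch below is a hypothetical reader, not the pipeline's own: the name `read_caselist` is invented here, and skipping blank lines and repeated identifiers is a design choice assumed for the example. It rejects entries containing a slash, since the identifiers should not be full paths.

```python
from pathlib import Path


def read_caselist(path):
    """Read subject IDs, one per line, preserving file order.

    Assumptions: blank lines are ignored and a repeated ID is kept only once.
    """
    subjects = []
    seen = set()
    for line in Path(path).read_text().splitlines():
        subject = line.strip()
        if not subject or subject in seen:
            continue
        if "/" in subject:
            raise ValueError(f"expected a subject ID, not a path: {subject!r}")
        seen.add(subject)
        subjects.append(subject)
    return subjects
```

Preserving order matters because subjects are processed in the order they appear in the file.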
This pipeline assumes that the subjects are organized in the following way:
hcp_root/group_name/subject_001/unprocessed/Diffusion
The Diffusion folder should contain the AP and PA diffusion images for the subject, along with a bvals file and a bvecs file for each image. Each bvals and bvecs file should share the base name of its diffusion image and differ only in the extension (bvals or bvecs, respectively).
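A quick pre-flight check of a subject's Diffusion folder might look like the sketch below. It is a hypothetical helper, not part of the pipeline: the name `missing_gradient_files` is invented, the `.nii.gz` image suffix is an assumption (this README does not name the image format), and the `.bvals` and `.bvecs` extensions follow the wording above.

```python
from pathlib import Path


def missing_gradient_files(diffusion_dir, image_suffix=".nii.gz"):
    """List the bvals/bvecs file names expected for each image but not found.

    Assumption: diffusion images end in image_suffix (default .nii.gz).
    """
    missing = []
    for image in sorted(Path(diffusion_dir).glob(f"*{image_suffix}")):
        base = image.name[: -len(image_suffix)]
        for extension in (".bvals", ".bvecs"):
            candidate = image.with_name(base + extension)
            if not candidate.exists():
                missing.append(candidate.name)
    return missing
```

Running a check of this kind during a dry run would surface incomplete subjects before any GPU time is spent on them.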
The path to the config file can be set using the --config_loc flag when running the pipeline script. If no path is specified, the pipeline looks for the config file in the CNN-Diffusion-MRIBrain-Segmentation/pipeline directory. The default config file is hcp_config.ini.
There are two ways to run the pipeline:
- Using a Screen Session
- Running in the Background with nohup
Before running the pipeline in either mode, make sure to:
Activate the conda environment where all the dependencies for the pipeline are installed:
conda activate dmri_seg
Navigate to the directory containing the hcp_masking_pipeline.py
script:
cd /path/to/CNN-Diffusion-MRIBrain-Segmentation/pipeline
To start a new screen session, run:
screen -S hcp_masking_session
To detach from the screen session, press Ctrl + A followed by D.
To re-attach to the session, run:
screen -r hcp_masking_session
Now you can run the pipeline script:
# Dry run to check that the subjects are valid. If `dry_run` is already `True` in the config file, the `-dr` flag can be omitted.
python hcp_masking_pipeline.py --config_loc /path/to/hcp_config.ini -dr
# To actually run the pipeline, set `dry_run` to `False` in the config file or pass the `-r` flag
python hcp_masking_pipeline.py --config_loc /path/to/hcp_config.ini -r
If you prefer not to use a screen session, you can run the script in the background and redirect its output to a log file:
nohup python hcp_masking_pipeline.py --config_loc /path/to/hcp_config.ini > /path/to/log/file.log 2>&1 &
Note: If dry_run is set to True in the config file and you wish to run the pipeline, you'll need to pass the -r flag to set it to False.
To view the log file, run:
tail -f /path/to/log/file.log
To monitor the pipeline, you can use the htop command to view the CPU and memory usage of the pipeline script:
htop
The pipeline script will output two log files:
- temp.log: This log file contains the output from the pipeline script. It is overwritten each time the pipeline script is run.
- processed.log: This log file contains the output from the pipeline script. It is appended to each time the pipeline script is run.
The location of the log files can be set using the temp_log_loc and log_loc variables in the configuration file, respectively.
The processed log file will contain the following information for each subject:
- The subject ID
- Whether the subject was successfully processed or not
- The time and date the subject was processed
The temp log is primarily used to hold the copy of the log retrieved from S3 before the newly processed subjects are appended to the processed log.
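An append-only log holding a subject ID, a status, and a timestamp could be written as in the sketch below. The tab-separated layout, the status words, and the function name `append_processed_entry` are assumptions made for illustration; the actual format of processed.log is not specified here.

```python
from datetime import datetime
from pathlib import Path


def append_processed_entry(log_path, subject_id, success, when=None):
    """Append one line per subject: ID, status, ISO timestamp.

    Hypothetical format; the real processed.log layout may differ.
    """
    when = when or datetime.now()
    status = "success" if success else "failed"
    line = f"{subject_id}\t{status}\t{when.isoformat(timespec='seconds')}\n"
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" keeps entries from earlier runs, matching the append behavior.
    with log_path.open("a") as handle:
        handle.write(line)
    return line
```

Opening the file in append mode is what distinguishes this behavior from temp.log, which is overwritten on each run.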
If you run into any issues while running the pipeline, please check the following:
- Make sure the prerequisites are installed and configured correctly
- Make sure the configuration file is set up correctly
- Make sure the caselist file is set up correctly
- Make sure the subjects are organized correctly in the S3 bucket
If you are still having issues, please open an issue on GitHub or email the author Ryan Zurrin.
If you would like to contribute to this project, please feel free to submit a pull request or open an issue.
This repository is licensed under the MIT License.