matsengrp / phip-flow Goto Github PK
View Code? Open in Web Editor NEWA Nextflow pipeline to align, merge, and organize large PhIP-Seq datasets
License: MIT License
A Nextflow pipeline to align, merge, and organize large PhIP-Seq datasets
License: MIT License
Are you able to share any examples that use the summarize_by_organism term? I'm not sure what the "public_epitopes_csv" should look like or how it factors into the analysis.
Thanks!
Ben
@sminot pointed out an error occurring when running the virscan library. Unfortunately, this error is slightly more nuanced than a simple formatting error.
Running phipflow
Traceback (most recent call last):
File "/usr/local/bin/phipflow", line 33, in <module>
sys.exit(load_entry_point('phippery==0.1', 'console_scripts', 'phipflow')())
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/phippery-0.1-py3.7.egg/phippery/cli.py", line 137, in load_from_counts_tsv
File "/usr/local/lib/python3.7/site-packages/phippery-0.1-py3.7.egg/phippery/phipdata.py", line 81, in stitch_dataset
File "/usr/local/lib/python3.7/site-packages/pandas-1.3.4-py3.7-linux-x86_64.egg/pandas/core/ops/common.py", line 69, in new_method
return method(self, other)
File "/usr/local/lib/python3.7/site-packages/pandas-1.3.4-py3.7-linux-x86_64.egg/pandas/core/arraylike.py", line 32, in __eq__
return self._cmp_method(other, operator.eq
File "/usr/local/lib/python3.7/site-packages/pandas-1.3.4-py3.7-linux-x86_64.egg/pandas/core/indexes/base.py", line 6042, in _cmp_method
raise ValueError("Lengths must match to compare")
ValueError: Lengths must match to compare
If you want oligo sequence replicates to have the exact same alignment counts, one simple solution is to:
-a
parameter to the alignment optionsThis really has to do with repeat oligonucleotide sequences in the peptide libraries, and how we should expect the pipeline to report counts for them. The error is caused because in collect_phip_data processing step, the phippery python API first merges the counts files (i.e. the two tsv's coming from sam_to_counts idxstats command) and then expects the merged count's (now an enrichment matrix) shape to be exactly ( len(peptide_table) , len(sample_table) ). In the case of your example data, the counts files have exactly 115753 entries each, and the original vir3 peptide table has 128257 entries.
This discrepancy makes sense because the formatting of the current virscan lib made peptide id's a one-to-one mapping with unique Oligo sequences -- which is a very reasonable thing to do. However, the pipeline currently uses those peptide id's to create headers in the fasta file for building an index and later retrieving alignment counts statistics for each id. It does this in a naive way meaning each row in the peptide table gets a fasta entry. The index which is built from this fasta then treats those as identical, and only reports for a single of those unique id's.
Option 1 - The counts will be equal (replicated) for each sequence which is identical when a single alignment is reported. At first this may seem like the desired behavior because (I assume - but could be wrong) sequences are de-duplicated before the Oligo sequences are synthesized into a library. So, I guess one would think that b/c the sequence is homogeneous across certain strains, we should count it as a binding event for all peptide tile homologs in the face of ambiguity.
small note: Additionally one thing that comes to mind with this approach is the possibility of functional homologs? maybe this is dealt with as well during library design.
Option 2 - Another opinion that has been expressed is that we should be randomly distributing the number of measured antibody-peptide binding events (i.e. alignments) across the sequence replicates when evaluating the data. In other words, one alignment to one count of enrichment across the entire library. I guess the argument is that these measurements for any one peptide in the assay are not independent and thus, we don't want to artificially conflate counts for any singular measurements. This is the approach we have taken so far with the Overbaugh lab. However, I haven't really had to dig in to this behavior with these smaller "artisan" phage libraries b/c we have low abundance of sequence repeats. For reference, The Virscan library has a mean of 1.1 and a max of 53, sequence replicates in the library.
The defaults for both Bowtie and Bowtie2 are:
The process by which bowtie chooses an alignment to report is randomized in order to avoid “mapping bias” - the phenomenon whereby an aligner systematically fails to report a particular class of good alignments, causing spurious “holes” in the comparative assembly. Whenever bowtie reports a subset of the valid alignments that exist, it makes an effort to sample them randomly. This randomness flows from a simple seeded pseudo-random number generator and is deterministic in the sense that Bowtie will always produce the same results for the same read when run with the same initial “seed” value (see --seed option).
Bowtie 2's search for alignments for a given read is "randomized." That is, when Bowtie 2 encounters a set of equally-good choices, it uses a pseudo-random number to choose. For example, if Bowtie 2 discovers a set of 3 equally-good alignments and wants to decide which to report, it picks a pseudo-random integer 0, 1 or 2 and reports the corresponding alignment. Arbitrary choices can crop up at various points during alignment.
They both seem to suggest that we should expect the alignments to one replicate should be randomly distributed across the replicate entries. I though this would be the case since they would be "equally-good" choices. After running the pipeline on some example simulations I actually found that we have been assigning the count always to the last replicate sequence in the fasta. I'm really not sure why this is the case. However, when using the -a
parameter and resetting the index we get the behavior or replicate counts for replicate sequences.
Looking forward to hearing thoughts on this. (cc @ksung25)
I am trying to use phip-viz to view the pipeline output but running into an issue.
When I startup the app (with or without the -v parameter set), e.g.,
docker run -p 8501:8501 quay.io/matsengrp/phip-viz
docker run -p 8501:8501 -v $PWD:/app/data/ quay.io/matsengrp/phip-viz
docker run -p 8501:8501 -v <path_to_dotphipfile>/ quay.io/matsengrp/phip-viz
docker run -p 8501:8501 -v <path_to_app/data/dotphipfile>/ quay.io/matsengrp/phip-viz
I get the same error,
File "/usr/local/lib/python3.7/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 556, in _run_script
exec(code, module.__dict__)
File "/app/streamlit_app.py", line 167, in <module>
f"{fp}" for fp in os.listdir("./data")
I tried pulling the docker image a few different ways (implicitly on run and explicitly with a direct download). I also tried updating streamlite by reinstalling through conda (which shouldn't work but I thought I'd try). But I keep getting the same error. It seems like I'm supposed to be able to upload a dataset but the upload button seems to be dead too (below). Is this a wd issue or is this app defunct?
Often, there are batch-levels for which we would like to normalize separately.
Allow the user to specify the list sample column groups for which the workflow should be run in parallel. This should be done by a separate process which wraps the entire main workflow with multiple call to that workflow after splitting the sample table.
We should have a single column for relative path to sample demultiplexed fastq. Here's the reminder to merge these requirements
Wewill create a primary processes to parse the configuration provided by the user to validate and prep input to generate_counts pipeline.
The psuedo workflow code will look something like this:
nextflow.enable.dsl=2
include { ALIGN_COUNTS as generate_alignment_counts } from './workflows/alignment-counts-workflow.nf'
include { cpm_fold_enrichment as compute_cpm_fold_enrichment } from './workflows/enrichment-workflow.nf'
include { to_tall as write_ds_to_tall_csv } from './workflows/io.nf'
include { to_wide as write_ds_to_wide_csv } from './workflows/io.nf'
workflow {
// validate and set flags (emission channels) for downstream
// analysis to run. generate or modify sample table depending on input
parse_config()
// generate raw dataset
generate_alignment_counts(
parse_config.out.sample_table,
parse_config.out.peptide_table,
)
// example running cpm-fold-enrichment
compute_cpm_fold_enrichment(
generate_alignment_counts.out,
parse_config.out.foo // named channel in the emimssion block, optional depending on parse.
)
// some other downstream analysis
fit_predict_neg_binom(...)
// merge the separate analysis
merge_analysis()
// downstream analysis
if( params.output_tall )
write_ds_to_tall_csv(ds)
if( params.output_wide )
write_ds_to_wide_csv(ds)
}
After finalizing the CLI for model fitting on the Neg-Binom, let's add the subworkflow as an option in the config file
this will become the new home for the phip-seq pipeline, for now. Let's remove these deprecated notebooks and add the Nextflow file here.
According to https://www.nextflow.io/docs/latest/dsl2.html , we should migrate file objects to use the path enclosure
I am running the phip-flow pipeline pancov demo using:
nextflow run matsengrp/phip-flow -r V1.11 -profile docker \
--sample_table ../code/phip-flow/data/pan-cov-example/sample_table_with_beads_and_lib2.csv \
--peptide_table ../code/phip-flow/data/pan-cov-example/peptide_table.csv \
--output_tall_csv true \
--output_wide_csv true \
--results "$(date -I)" \
-resume
While the run terminated with several outputs, there were also multiple errors.
nextflow.pancov.test.log.log
Is this expected behavior? Is this resulting from an incomplete install?
Right now I have to link each seq_dir factor in the sample metadata to the full path where that dir actually is in the config file. This is confusing and tedious. Instead let's make a requirement that each seq_dir factor to exist in ${baseDir}/NGS. As a symbolic link or in reality.
We should have some infrastructure to tun the pipeline with a few different simulated datasets and assert we are getting the expected output from the pipeline.
In #56 We tried running BEER analysis in the phip-flow pipeline. The actual analysis was unsuccessful - however - we're able to create the R object for users so they may try their hand at running the analysis themselves. We need to export the object and add the relevant pipeline parameters to the add_beer
branch.
TODO
saveRDS(object, file = "my_data.rds")
to run_beer.Rscriptnote to self to add some samtools alignment stats output from the pipeline. As well as some code in phippery
to make sense of it
Create a dataset of reasonable size and reasonably showcases the annotations one might want
Add a process which checks for:
the input table for specific columns necessary or running the workflow specified by the parameters provided in the config file. The column names it looks for are configures by the column name config file. (which also must be completed).
Checks that all the filepaths in the sample table exist and throw useful errors that don't.
We will move the code currently sitting in phippery
phip-flow cli to script in a bin/ directly.
BEER has presented a very nice methodology for analyzing significant binding events. This method and software is described here . The documentation for the R interface can be found here. We would like to integrate this analysis into the phip-flow pipeline. Here's an outline about how I imagine this happening:
The process location in DAG
A quick and easy way to go about this is simply extend the DAG like the AGG sub workflow does. From here, we can pretty easily get all the inputs we need for the BEER analysis, i.e. counts for beads and empirical IP's. If we extend from the binary, we can use the python API utilities to easy isolate the inputs for BEER as well. I'm agnostic to whether or not we use the Python API to do this.
Dependencies
Looks like we will need [PhIPData(https://www.bioconductor.org/packages/release/bioc/vignettes/PhIPData/inst/doc/PhIPData.html), and BEER R bioconductor packages. I'm very unfamiliar with R packaging, but I imagine extending this image to include the R dependencies should not be too bad.
Transformation
The PhIPData Object is a near identical data structure to the xarray phipdata we use in phippery (hilarious how similar their organization of the data is to ours)
Theirs:
Our xarray:
So this shouldn't be hard. Essentially we need to read in the sample, peptide, and peptide counts table in an R script in able to run BEER. I think this simply requires readr matrix objects from the wide output of phippery. I'm very bad with R, but it also seems that we may need to specify data types in each of the sample and peptide annotation tables when their read in.
Output
Let's just stick with outputting whatever is easiest from the R pipeline for now? I imagine we could tie things back into the xarray dataset pretty easily, but this is not necessary for the manuscript re-submission.
the configuration file should specify which alignment parameters we want to use.
Currently, we put all alignment options in a template with access to certain variables. Let's be sure to move the peptide size and read length to the config file for users to easily run the default alignment option tailored to their data.
A little TLC to https://matsengrp.github.io/phippery/the-rest-of-it.html#contributing is required
There's a mention of git-- perhaps the right thing to do is to just rename this "how to contribute" and say that we follow the tskit procedure.
Maybe a little snippet about modifying specific steps in the pipeline, too
Also add something, "we're happy to assist with contributions?" We welcome them etc.
Add a param which specifies whether you would like to optionally output the pickle dump'd binary.
Need to document the configuration file, and each of it's parameters.
Once we get an example, we need to organize template files from https://github.com/matsengrp/phip-flow-template into a single directory, in this repo
Hi,
I am currently running
nextflow run matsengrp/phip-flow -r main \
--sample_table sample_table.csv \
--peptide_table peptide_table.csv \
--read_length 189 \
--oligo_tile_length 189 \
--output_tall_csv true \
--output_wide_csv true \
--results "$(date -I)" \
-profile docker
and I am getting the following error
executor > local (29)
[01/9a1585] process > ALIGN:validate_sample_table (1) [100%] 1 of 1 ✔
[5d/c9e72c] process > ALIGN:validate_peptide_table (1) [100%] 1 of 1 ✔
[c7/a14c72] process > ALIGN:generate_fasta_reference (1) [100%] 1 of 1 ✔
[9b/051e5f] process > ALIGN:generate_index (1) [100%] 1 of 1 ✔
[df/5f63eb] process > ALIGN:short_read_alignment (12) [ 12%] 1 of 8, failed: 1
set -euo pipefail
STREAM_FILE_CMD=zcat
FASTQ=D152-YO-PCD-BREAST_S16_L001_R1_001.fastq
INDEX=peptide_index/peptide
ALIGN_OUT_FN=9.sam
READ_LENGTH=189
PEPTIDE_LENGTH=117
CPUS=1
MM=2
OP_ARGS="--tryhard --nomaqround --norc --best --sam --quiet"
if [ ${PEPTIDE_LENGTH} -lt ${READ_LENGTH} ]; then
let TRIM3=${READ_LENGTH}-${PEPTIDE_LENGTH}
else
TRIM3=0
fi
echo $OP_ARGS
$STREAM_FILE_CMD $FASTQ | bowtie \
--trim3 $TRIM3 \
--threads $CPUS \
-n $MM \
-l $PEPTIDE_LENGTH \
$OP_ARGS \
-x $INDEX - > $ALIGN_OUT_FN
Command exit status:
1
Command error:
--tryhard --nomaqround --norc --best --sam --quiet
gzip: D152-YO-PCD-BREAST_S16_L001_R1_001.fastq.gz: No such file or directory
Error: reads file does not look like a FASTQ file
Command: /bowtie-1.3.1-linux-x86_64/bowtie-align-s --wrapper basic-0 --trim3 72 --threads 1 -n 2 -l 117 --tryhard --nomaqround --norc --best --sam --quiet -x peptide_index/peptide -
What could I change to make it work?
Thanks!
I am currently running:
$ nextflow run matsengrp/phip-flow -r V1.11 \
--sample_table ../data-raw/20230919_virscan_demo.samples.debug.csv \
--peptide_table ../data-raw/VIR3_clean.csv \
--read_length 101 --peptide_tile_length 200 \
--run_zscore_fit_predict \
--run_cpm_enr_workflow \
--summarize_by_organism true \
--peptide_org_col Organism \
--results 20230919_virscan_demo_phipflow_"$(date -I)" \
-resume
N E X T F L O W ~ version 23.10.0
Launching `https://github.com/matsengrp/phip-flow` [cheeky_wescoff] DSL2 - revision: a60f8ebf29 [V1.11]
P H I P - F L O W!
Matsen, Overbaugh, and Minot Labs
Fred Hutchinson CRC, Seattle WA
================================
sample_table : ../data-raw/20230919_virscan_demo.samples.debug.csv
peptide_table : ../data-raw/VIR3_clean.csv
results : 20230919_virscan_demo_phipflow_2023-10-31
executor > local (1)
[3a/a4e2cb] process > ALIGN:validate_sample_table (1) [100%] 1 of 1, cached: 1 ✔
[4f/2c6603] process > ALIGN:validate_peptide_table (1) [100%] 1 of 1, cached: 1 ✔
[a9/249fba] process > ALIGN:generate_fasta_reference (1) [100%] 1 of 1, cached: 1 ✔
[43/2583fe] process > ALIGN:generate_index (1) [100%] 1 of 1, cached: 1 ✔
[94/82af45] process > ALIGN:short_read_alignment (24) [100%] 26 of 26, cached: 26 ✔
[07/ed5cb7] process > ALIGN:sam_to_counts (24) [100%] 26 of 26, cached: 26 ✔
[35/765f97] process > ALIGN:sam_to_stats (23) [100%] 26 of 26, cached: 26 ✔
[f9/4fd8f0] process > ALIGN:collect_phip_data (1) [100%] 1 of 1, cached: 1 ✔
[9f/ec1433] process > ALIGN:replicate_counts (1) [100%] 1 of 1, cached: 1 ✔
[57/6a8991] process > STATS:counts_per_million (1) [100%] 1 of 1, cached: 1 ✔
[be/a7ec54] process > STATS:size_factors (1) [100%] 1 of 1, cached: 1 ✔
[ce/5c30dd] process > STATS:edgeR_BEER_workflows:to_csv (1) [100%] 1 of 1, cached: 1 ✔
[f1/bb5bd8] process > STATS:edgeR_BEER_workflows:run_edgeR (1) [100%] 1 of 1, cached: 1 ✔
[85/a5caa9] process > STATS:edgeR_BEER_workflows:append_assay_csvs_to_xarray (1) [100%] 1 of 1, cached: 1 ✔
[c1/a530b7] process > STATS:edgeR_BEER_workflows:publish_rds (1) [100%] 1 of 1, cached: 1 ✔
[a9/6d470e] process > STATS:cpm_fold_enrichment (1) [100%] 1 of 1, cached: 1 ✔
[58/301d3b] process > STATS:fit_predict_zscore (1) [100%] 1 of 1, cached: 1 ✔
[99/68f470] process > STATS:merge_binary_datasets [100%] 1 of 1, cached: 1 ✔
[a7/20c77b] process > DSOUT:dump_binary [100%] 1 of 1, cached: 1 ✔
[bd/a239b9] process > DSOUT:dump_wide_csv [100%] 1 of 1, cached: 1 ✔
[- ] process > DSOUT:dump_tall_csv -
[af/456d3e] process > AGG:split_samples [ 0%] 0 of 1
[- ] process > AGG:aggregate_organisms -
[- ] process > AGG:join_organisms -
ERROR ~ Error executing process > 'AGG:split_samples'
Caused by:
Process `AGG:split_samples` terminated with an error exit status (1)
Command executed [/home/bek321/.nextflow/assets/matsengrp/phip-flow/templates/split_samples.py]:
#!/usr/bin/env python3
from collections import defaultdict
import os
from typing import List
import pandas as pd
import logging
def setup_logging() -> logging.Logger:
"""Set up logging."""
# Set up logging
logFormatter = logging.Formatter(
'%(asctime)s %(levelname)-8s [split_samples] %(message)s'
)
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# Also write to STDOUT
consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logFormatter)
logger.addHandler(consoleHandler)
return logger
logger = setup_logging()
# The user must specify a CSV containing the sample mapping
sample_mapping_fp = "data_sample_annotation_table.csv"
logger.info(f"Reading in sample mapping from: {sample_mapping_fp}")
assert os.path.exists(sample_mapping_fp)
# Read in the table
df = pd.read_csv(sample_mapping_fp, index_col=0)
logger.info(f"Sample mapping table has {df.shape[0]:,} rows and {df.shape[1]:,} columns")
# The user must specify the column used to group replicates
# from the same sample
sample_grouping_col = ""
msg = f"Column '{sample_grouping_col}' not found ({', '.join(df.columns.values)})"
assert sample_grouping_col in df.columns.values, msg
# Write out a file containing the unique list of sample names
df.reindex(
columns=[sample_grouping_col]
).drop_duplicates(
).to_csv(
"sample_list",
header=None,
index=None
)
Command exit status:
1
Command output:
(empty)
Command error:
2023-10-31 23:27:39,491 INFO [split_samples] Reading in sample mapping from: data_sample_annotation_table.csv
Traceback (most recent call last):
File ".command.sh", line 32, in <module>
assert os.path.exists(sample_mapping_fp)
AssertionError
Work dir:
/n/scratch3/users/b/bek321/virscan_hipc/data-final/work/af/456d3e3af0b2ad7c60707f28466a6e
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
-- Check '.nextflow.log' file for details
The error seems to suggest that the sample table does not exist but navigating to the working directory, the link appears to be active and the file exists (see head below)
(phipflow) [bek321@compute-e-16-231 456d3e3af0b2ad7c60707f28466a6e]$ ls -lhat
total 592K
-rw-rw-r-- 1 bek321 bek321 1 Oct 31 23:27 .exitcode
drwxrwxr-x 2 bek321 bek321 626 Oct 31 23:27 .
-rw-rw-r-- 1 bek321 bek321 342 Oct 31 23:27 .command.err
-rw-rw-r-- 1 bek321 bek321 342 Oct 31 23:27 .command.log
-rw-rw-r-- 1 bek321 bek321 0 Oct 31 23:27 .command.out
lrwxrwxrwx 1 bek321 bek321 108 Oct 31 23:27 data_zscore.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_zscore.csv.gz
lrwxrwxrwx 1 bek321 bek321 114 Oct 31 23:27 data_size_factors.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_size_factors.csv.gz
lrwxrwxrwx 1 bek321 bek321 125 Oct 31 23:27 data_sample_annotation_table.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_sample_annotation_table.csv.gz
lrwxrwxrwx 1 bek321 bek321 126 Oct 31 23:27 data_peptide_annotation_table.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_peptide_annotation_table.csv.gz
lrwxrwxrwx 1 bek321 bek321 112 Oct 31 23:27 data_enrichment.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_enrichment.csv.gz
lrwxrwxrwx 1 bek321 bek321 115 Oct 31 23:27 data_edgeR_logpval.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_edgeR_logpval.csv.gz
lrwxrwxrwx 1 bek321 bek321 113 Oct 31 23:27 data_edgeR_logfc.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_edgeR_logfc.csv.gz
lrwxrwxrwx 1 bek321 bek321 112 Oct 31 23:27 data_edgeR_hits.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_edgeR_hits.csv.gz
lrwxrwxrwx 1 bek321 bek321 105 Oct 31 23:27 data_cpm.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_cpm.csv.gz
lrwxrwxrwx 1 bek321 bek321 108 Oct 31 23:27 data_counts.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_counts.csv.gz
-rw-rw-r-- 1 bek321 bek321 0 Oct 31 23:27 .command.begin
-rw-rw-r-- 1 bek321 bek321 1.4K Oct 31 23:27 .command.sh
-rw-rw-r-- 1 bek321 bek321 4.6K Oct 31 23:27 .command.run
drwxrwxr-x 7 bek321 bek321 240 Oct 31 23:27 ..
(phipflow) [bek321@compute-e-16-231 456d3e3af0b2ad7c60707f28466a6e]$ zcat data_sample_annotation_table.csv.gz | head
sample_id,replicate_id,control_status,sample_ID,sample_name,fastq_filepath,raw_total_sequences,reads_mapped,error_rate,average_quality,percent_mapped,percent_peptides_detected,percent_peptides_between_10_and_100
0,97,library,S1,c1,../data-raw/fastq//IDX097-c1-1_S1_R1_001.fastq.gz,10552,1268,0.002811007,31.1,12.016679302501895,0.7656502179218289,0.0031187381585410547
1,98,beads_only,S2,c2,../data-raw/fastq//IDX098-c2-1_S2_R1_001.fastq.gz,14031,1911,0.002440275,31.6,13.61984177891811,0.9769447281629852,0.007017160856717372
2,100,beads_only,S4,c4,../data-raw/fastq//IDX100-c4-1_S4_R1_001.fastq.gz,202207,28633,0.001710275,33.5,14.16024173248206,6.888512907677553,0.28380517242723596
3,101,beads_only,S5,c5,../data-raw/fastq//IDX101-c5-1_S5_R1_001.fastq.gz,11258,1596,0.002146455,32.4,14.176585539172143,0.8677888926140483,0.0031187381585410547
4,102,beads_only,S6,c6,../data-raw/fastq//IDX102-c6-1_S6_R1_001.fastq.gz,6102,857,0.001756068,33.0,14.044575549000326,0.4545560866073587,0.0015593690792705273
5,103,empirical,S7,v1,../data-raw/fastq//IDX103-v1-1_S7_R1_001.fastq.gz,4370,656,0.001841343,33.0,15.011441647597254,0.36411268000966807,0.0023390536189057906
6,104,empirical,S8,v2,../data-raw/fastq//IDX104-v2-1_S8_R1_001.fastq.gz,4785,745,0.00162137,33.5,15.569487983281086,0.3999781688328903,0.0031187381585410547
7,105,library,S9,c1,../data-raw/fastq//IDX105-c1-2_S9_R1_001.fastq.gz,74117,10804,0.001795265,33.4,14.576952655935886,3.8212339287524264,0.025729589807963697
8,107,beads_only,S11,c3,../data-raw/fastq//IDX107-c3-2_S11_R1_001.fastq.gz,4591,698,0.001687991,33.1,15.203659333478544,0.36333299547003284,0.0031187381585410547
Any idea what may be going on here?
update the rest of the sample table paths in data/pan-cov-example/ . At least the ones we want in the repo.
Currently, there are a few scripts such as run_gizmo_meg.sh
and gizmo_config
that are specific to our experiments. These should be moved to https://github.com/matsengrp/phippery-experiments in a sub-module containing this repo. Generally we should separate our experiments from the pipeline itself.
We should gzip csv outputs to save space.
I am seeing surprisingly low % mapping. The COVID example gives a % mapping of around 70%, while several experiments we have run give closer to 15%. My colleagues have run these data manually without phipflow and tell me that they have alignment close to 70%, so I suspect I'm messing up some input parameter that is somehow creating an alignment issue. But I can't figure out what that parameter might be.
Any thoughts on what may be causing the low alignment?
Two calls that produced the same result:
nextflow run matsengrp/phip-flow -r V1.10 \
--sample_table ../data-raw/20231219_samples.csv \
--peptide_table ../data-raw/peptide_table/VIR3_clean.csv \
--read_length 75 --peptide_tile_length 168 \
--run_zscore_fit_predict \
--run_cpm_enr_workflow \
--summarize_by_organism true \
--peptide_org_col Organism \
--sample_grouping_col sample_name \
--results 20231219_phipflow_"$(date -I)" \
-resume
nextflow run matsengrp/phip-flow -r V1.10 \
--sample_table ../data-raw/20231219_samples.DEBUG.csv \
--peptide_table ../data-raw/peptide_table/VIR3_clean.csv \
--read_length 75 --peptide_tile_length 168 \
--peptide_seq_col Prot \
--sample_grouping_col sample_name \
--results 20231219_phipflow_debug_"$(date -I)" \
-resume
We already have an example data set. Let's make a formal test and a git workflow that asserts this passes before a merge.
I'm working on getting a test of your analysis suite running using nextflow and a Singularity container. I'm running the container in a virtual environment that also has phippery installed. When the container runs using the defaults (that I think should run the test data set for validation), the first several steps run fine; however, the nextflow analysis errors out when trying to run merge-counts-stats.py script. A phippery module is not being found. The error message from nextflow is:
Error executing process > 'ALIGN:collect_phip_data (1)'
Caused by:
Process ALIGN:collect_phip_data (1)
terminated with an error exit status (1)
Command executed:
merge-counts-stats.py -st validated_sample_table.csv -pt validated_peptide_table.csv -cfp ".counts" -sfp ".stats" -o data.phip
Command exit status:
1
Command output:
(empty)
Command error:
Traceback (most recent call last):
File "/home/dob/.nextflow/assets/matsengrp/phip-flow/bin/merge-counts-stats.py", line 8, in
import phippery.phipdata as phipdata
ModuleNotFoundError: No module named 'phippery.phipdata'
I was unsuccessful in manually locating the phippery.phipdata module that the script it trying to import. Is the module currently included in the package? Has the module been renamed to something else? What would you suggest to either fix or troubleshoot the problem?
Thanks.
eventually, we need to switch to the new API, -> https://www.nextflow.io/blog/2020/dsl2-is-here.html
We've decided that there is no more reason to maintain the Neg. Binomial workflow -- so it needs to either be deprecated (with a warning) or removed completely. I'm leaning towards the latter
TODO
We are most generally aligning reads that are larger than the peptide-encoding reference sequences. This brings up interest in using local alignment, as seen here.
Currently, we use end-to-end alignment bu trim the low quality (right side, 3 prime) end. This could, potentially, be missing many alignments of interest.
We will write a sub-workflow which first checks for a columns "sample_id" or "peptide_id" in the tables, and adds them if they are not there.
Ultimately, we could replace both the sample table and the peptide table parameters with a single manifest file (CSV) which will look almost exactly the same as the sample table -- except now there is a required "peptide_reference" that points to a relative filepath to the the correct peptide table for each respective sample manifest. With a little tweaking the pipeline could then run library workflows in parallel i.e. If there are N unique peptide tables in the manifest "peptide_reference" column, across all samples, then the workflow will be run in parallel and produce N different data sets.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.