matsengrp / phip-flow Goto Github PK

View Code? Open in Web Editor NEW

10.0 10.0 6.0 28.18 MB

A Nextflow pipeline to align, merge, and organize large PhIP-Seq datasets

License: MIT License

Nextflow 33.51% Shell 5.96% Python 50.65% R 9.89%

phip-flow's People

Contributors

Stargazers

Watchers

Forkers

sminot j3551ca bennuru standardgalactic afonsoguerra

phip-flow's Issues

[improvement] Examples with summarize_by_organism

Are you able to share any examples that use the summarize_by_organism term? I'm not sure what the "public_epitopes_csv" should look like or how it factors into the analysis.

Thanks!
Ben

Replicate Oligo nt Sequence Behavior

@sminot pointed out an error occurring when running the virscan library. Unfortunately, this error is slightly more nuanced than a simple formatting error.

Running phipflow
Traceback (most recent call last):
File "/usr/local/bin/phipflow", line 33, in <module>
sys.exit(load_entry_point('phippery==0.1', 'console_scripts', 'phipflow')())
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/phippery-0.1-py3.7.egg/phippery/cli.py", line 137, in load_from_counts_tsv
File "/usr/local/lib/python3.7/site-packages/phippery-0.1-py3.7.egg/phippery/phipdata.py", line 81, in stitch_dataset
File "/usr/local/lib/python3.7/site-packages/pandas-1.3.4-py3.7-linux-x86_64.egg/pandas/core/ops/common.py", line 69, in new_method
return method(self, other)
File "/usr/local/lib/python3.7/site-packages/pandas-1.3.4-py3.7-linux-x86_64.egg/pandas/core/arraylike.py", line 32, in __eq__
return self._cmp_method(other, operator.eq
File "/usr/local/lib/python3.7/site-packages/pandas-1.3.4-py3.7-linux-x86_64.egg/pandas/core/indexes/base.py", line 6042, in _cmp_method
raise ValueError("Lengths must match to compare")
ValueError: Lengths must match to compare

TLDR

If you want oligo sequence replicates to have the exact same alignment counts, one simple solution is to:

reset the peptide_id to have a unique int for each row.
add -a parameter to the alignment options

More Detail

This really has to do with repeat oligonucleotide sequences in the peptide libraries, and how we should expect the pipeline to report counts for them. The error is caused because in collect_phip_data processing step, the phippery python API first merges the counts files (i.e. the two tsv's coming from sam_to_counts idxstats command) and then expects the merged count's (now an enrichment matrix) shape to be exactly ( len(peptide_table) , len(sample_table) ). In the case of your example data, the counts files have exactly 115753 entries each, and the original vir3 peptide table has 128257 entries.

This discrepancy makes sense because the formatting of the current virscan lib made peptide id's a one-to-one mapping with unique Oligo sequences -- which is a very reasonable thing to do. However, the pipeline currently uses those peptide id's to create headers in the fasta file for building an index and later retrieving alignment counts statistics for each id. It does this in a naive way meaning each row in the peptide table gets a fasta entry. The index which is built from this fasta then treats those as identical, and only reports for a single of those unique id's.

Options for handling repeat sequences

Option 1 - The counts will be equal (replicated) for each sequence which is identical when a single alignment is reported. At first this may seem like the desired behavior because (I assume - but could be wrong) sequences are de-duplicated before the Oligo sequences are synthesized into a library. So, I guess one would think that b/c the sequence is homogeneous across certain strains, we should count it as a binding event for all peptide tile homologs in the face of ambiguity.

small note: Additionally one thing that comes to mind with this approach is the possibility of functional homologs? maybe this is dealt with as well during library design.

Option 2 - Another opinion that has been expressed is that we should be randomly distributing the number of measured antibody-peptide binding events (i.e. alignments) across the sequence replicates when evaluating the data. In other words, one alignment to one count of enrichment across the entire library. I guess the argument is that these measurements for any one peptide in the assay are not independent and thus, we don't want to artificially conflate counts for any singular measurements. This is the approach we have taken so far with the Overbaugh lab. However, I haven't really had to dig in to this behavior with these smaller "artisan" phage libraries b/c we have low abundance of sequence repeats. For reference, The Virscan library has a mean of 1.1 and a max of 53, sequence replicates in the library.

Current defaults & potential bug

The defaults for both Bowtie and Bowtie2 are:

The process by which bowtie chooses an alignment to report is randomized in order to avoid “mapping bias” - the phenomenon whereby an aligner systematically fails to report a particular class of good alignments, causing spurious “holes” in the comparative assembly. Whenever bowtie reports a subset of the valid alignments that exist, it makes an effort to sample them randomly. This randomness flows from a simple seeded pseudo-random number generator and is deterministic in the sense that Bowtie will always produce the same results for the same read when run with the same initial “seed” value (see --seed option).

Bowtie 2's search for alignments for a given read is "randomized." That is, when Bowtie 2 encounters a set of equally-good choices, it uses a pseudo-random number to choose. For example, if Bowtie 2 discovers a set of 3 equally-good alignments and wants to decide which to report, it picks a pseudo-random integer 0, 1 or 2 and reports the corresponding alignment. Arbitrary choices can crop up at various points during alignment.

They both seem to suggest that we should expect the alignments to one replicate should be randomly distributed across the replicate entries. I though this would be the case since they would be "equally-good" choices. After running the pipeline on some example simulations I actually found that we have been assigning the count always to the last replicate sequence in the fasta. I'm really not sure why this is the case. However, when using the -a parameter and resetting the index we get the behavior or replicate counts for replicate sequences.

Looking forward to hearing thoughts on this. (cc @ksung25)

streamlit phip-viz error on startup

I am trying to use phip-viz to view the pipeline output but running into an issue.

When I startup the app (with or without the -v parameter set), e.g.,

docker run -p 8501:8501 quay.io/matsengrp/phip-viz
docker run -p 8501:8501 -v $PWD:/app/data/ quay.io/matsengrp/phip-viz
docker run -p 8501:8501 -v <path_to_dotphipfile>/ quay.io/matsengrp/phip-viz
docker run -p 8501:8501 -v <path_to_app/data/dotphipfile>/ quay.io/matsengrp/phip-viz

I get the same error,

File "/usr/local/lib/python3.7/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 556, in _run_script
    exec(code, module.__dict__)
File "/app/streamlit_app.py", line 167, in <module>
    f"{fp}" for fp in os.listdir("./data")

I tried pulling the docker image a few different ways (implicitly on run and explicitly with a direct download). I also tried updating streamlite by reinstalling through conda (which shouldn't work but I thought I'd try). But I keep getting the same error. It seems like I'm supposed to be able to upload a dataset but the upload button seems to be dead too (below). Is this a wd issue or is this app defunct?

batch-specific workflows

Often, there are batch-levels for which we would like to normalize separately.

Allow the user to specify the list sample column groups for which the workflow should be run in parallel. This should be done by a separate process which wraps the entire main workflow with multiple call to that workflow after splitting the sample table.

Merge seq_dir and fastq_filename

We should have a single column for relative path to sample demultiplexed fastq. Here's the reminder to merge these requirements

parse_config sub-workflow

Wewill create a primary processes to parse the configuration provided by the user to validate and prep input to generate_counts pipeline.

The psuedo workflow code will look something like this:

nextflow.enable.dsl=2
include { ALIGN_COUNTS as generate_alignment_counts } from './workflows/alignment-counts-workflow.nf'
include { cpm_fold_enrichment as compute_cpm_fold_enrichment } from './workflows/enrichment-workflow.nf'
include { to_tall as write_ds_to_tall_csv } from './workflows/io.nf'
include { to_wide as write_ds_to_wide_csv } from './workflows/io.nf'

workflow {

    // validate and set flags (emission channels) for downstream
    // analysis to run. generate or modify sample table depending on input
    parse_config()

    // generate raw dataset
    generate_alignment_counts(
        parse_config.out.sample_table,
        parse_config.out.peptide_table,
    )

    // example running cpm-fold-enrichment 
    compute_cpm_fold_enrichment(
        generate_alignment_counts.out,
        parse_config.out.foo // named channel in the emimssion block, optional depending on parse.
    )

    // some other downstream analysis
    fit_predict_neg_binom(...)
    
    // merge the separate analysis
    merge_analysis()

    // downstream analysis
    if( params.output_tall ) 
        write_ds_to_tall_csv(ds)

    if( params.output_wide )
        write_ds_to_wide_csv(ds)
}

Model fitting subworkflow

After finalizing the CLI for model fitting on the Neg-Binom, let's add the subworkflow as an option in the config file

Restructure.

this will become the new home for the phip-seq pipeline, for now. Let's remove these deprecated notebooks and add the Nextflow file here.

Change the "file" type cast to "Path"

According to https://www.nextflow.io/docs/latest/dsl2.html , we should migrate file objects to use the path enclosure

Unexpected behavior during nextflow run

I am running the phip-flow pipeline pancov demo using:

nextflow run matsengrp/phip-flow -r V1.11 -profile docker \
        --sample_table ../code/phip-flow/data/pan-cov-example/sample_table_with_beads_and_lib2.csv \
        --peptide_table ../code/phip-flow/data/pan-cov-example/peptide_table.csv \
        --output_tall_csv true \
        --output_wide_csv true \
        --results "$(date -I)" \
        -resume

While the run terminated with several outputs, there were also multiple errors.
nextflow.pancov.test.log.log
Is this expected behavior? Is this resulting from an incomplete install?

Re-factor the config setup.

Right now I have to link each seq_dir factor in the sample metadata to the full path where that dir actually is in the config file. This is confusing and tedious. Instead let's make a requirement that each seq_dir factor to exist in ${baseDir}/NGS. As a symbolic link or in reality.

Add automated simulation testing.

We should have some infrastructure to tun the pipeline with a few different simulated datasets and assert we are getting the expected output from the pipeline.

Export `PhIPData` Object from pipeline

In #56 We tried running BEER analysis in the phip-flow pipeline. The actual analysis was unsuccessful - however - we're able to create the R object for users so they may try their hand at running the analysis themselves. We need to export the object and add the relevant pipeline parameters to the add_beer branch.

TODO

Add saveRDS(object, file = "my_data.rds") to run_beer.Rscript
Change the name to export_phipData.Rscript or something more appropriate
Add documentation
Update release

Alignments stats

note to self to add some samtools alignment stats output from the pipeline. As well as some code in phippery to make sense of it

Make example dataset from Caitlin et. al. 2021

Create a dataset of reasonable size and reasonably showcases the annotations one might want

Add input checks

Add a process which checks for:

the input table for specific columns necessary or running the workflow specified by the parameters provided in the config file. The column names it looks for are configures by the column name config file. (which also must be completed).
Checks that all the filepaths in the sample table exist and throw useful errors that don't.

Integrating "phip-flow" command code directly into bin/

We will move the code currently sitting in phippery phip-flow cli to script in a bin/ directly.

Move scripts
Make new container for scripts
integrate and test

BEER Analysis into workflow

BEER -> phip-flow

@sminot.

BEER has presented a very nice methodology for analyzing significant binding events. This method and software is described here . The documentation for the R interface can be found here. We would like to integrate this analysis into the phip-flow pipeline. Here's an outline about how I imagine this happening:

The process location in DAG
A quick and easy way to go about this is simply extend the DAG like the AGG sub workflow does. From here, we can pretty easily get all the inputs we need for the BEER analysis, i.e. counts for beads and empirical IP's. If we extend from the binary, we can use the python API utilities to easy isolate the inputs for BEER as well. I'm agnostic to whether or not we use the Python API to do this.

Dependencies
Looks like we will need [PhIPData(https://www.bioconductor.org/packages/release/bioc/vignettes/PhIPData/inst/doc/PhIPData.html), and BEER R bioconductor packages. I'm very unfamiliar with R packaging, but I imagine extending this image to include the R dependencies should not be too bad.

Transformation
The PhIPData Object is a near identical data structure to the xarray phipdata we use in phippery (hilarious how similar their organization of the data is to ours)
Theirs:

Our xarray:

So this shouldn't be hard. Essentially we need to read in the sample, peptide, and peptide counts table in an R script in able to run BEER. I think this simply requires readr matrix objects from the wide output of phippery. I'm very bad with R, but it also seems that we may need to specify data types in each of the sample and peptide annotation tables when their read in.

Output
Let's just stick with outputting whatever is easiest from the R pipeline for now? I imagine we could tie things back into the xarray dataset pretty easily, but this is not necessary for the manuscript re-submission.

Add more alignment schemes based on experiment type.

the configuration file should specify which alignment parameters we want to use.

Move read, ad peptide length to config

Currently, we put all alignment options in a template with access to certain variables. Let's be sure to move the peptide size and read length to the config file for users to easily run the default alignment option tailored to their data.

Add documentation how to contribute.

A little TLC to https://matsengrp.github.io/phippery/the-rest-of-it.html#contributing is required

There's a mention of git-- perhaps the right thing to do is to just rename this "how to contribute" and say that we follow the tskit procedure.

Maybe a little snippet about modifying specific steps in the pipeline, too
Also add something, "we're happy to assist with contributions?" We welcome them etc.

Remove automated output of binary

Add a param which specifies whether you would like to optionally output the pickle dump'd binary.

Document config files

Need to document the configuration file, and each of it's parameters.

README Description of inputs and config

Organize files

Once we get an example, we need to organize template files from https://github.com/matsengrp/phip-flow-template into a single directory, in this repo

ALIGN:short_read_alignment (10) - ERROR

Hi,

I am currently running

nextflow run matsengrp/phip-flow -r main \
        --sample_table sample_table.csv \
        --peptide_table peptide_table.csv \
        --read_length 189 \
        --oligo_tile_length 189 \
        --output_tall_csv true \
        --output_wide_csv true \
        --results "$(date -I)" \
        -profile docker

and I am getting the following error

executor >  local (29)
[01/9a1585] process > ALIGN:validate_sample_table (1)                        [100%] 1 of 1 ✔
[5d/c9e72c] process > ALIGN:validate_peptide_table (1)                       [100%] 1 of 1 ✔
[c7/a14c72] process > ALIGN:generate_fasta_reference (1)                     [100%] 1 of 1 ✔
[9b/051e5f] process > ALIGN:generate_index (1)                               [100%] 1 of 1 ✔
[df/5f63eb] process > ALIGN:short_read_alignment (12)                        [ 12%] 1 of 8, failed: 1

  set -euo pipefail
  
  STREAM_FILE_CMD=zcat
  FASTQ=D152-YO-PCD-BREAST_S16_L001_R1_001.fastq
  INDEX=peptide_index/peptide
  ALIGN_OUT_FN=9.sam
  READ_LENGTH=189
  PEPTIDE_LENGTH=117
  CPUS=1
  MM=2
  OP_ARGS="--tryhard --nomaqround --norc --best --sam --quiet"
  
  if [ ${PEPTIDE_LENGTH} -lt ${READ_LENGTH} ]; then
      let TRIM3=${READ_LENGTH}-${PEPTIDE_LENGTH}
  else
      TRIM3=0
  fi
  
  echo $OP_ARGS
  
  $STREAM_FILE_CMD $FASTQ | bowtie \
    --trim3 $TRIM3 \
    --threads $CPUS \
    -n $MM \
    -l $PEPTIDE_LENGTH \
    $OP_ARGS \
    -x $INDEX - > $ALIGN_OUT_FN

Command exit status:
  1

Command error:
  --tryhard --nomaqround --norc --best --sam --quiet
  gzip: D152-YO-PCD-BREAST_S16_L001_R1_001.fastq.gz: No such file or directory
  Error: reads file does not look like a FASTQ file
  Command: /bowtie-1.3.1-linux-x86_64/bowtie-align-s --wrapper basic-0 --trim3 72 --threads 1 -n 2 -l 117 --tryhard --nomaqround --norc --best --sam --quiet -x peptide_index/peptide -

I already verified that the files "missing" are present in the directory

What could I change to make it work?
Thanks!

[possible bug] summarize_by_organism cannot find sample table

I am currently running:

$ nextflow run matsengrp/phip-flow -r V1.11  \
        --sample_table ../data-raw/20230919_virscan_demo.samples.debug.csv \
        --peptide_table ../data-raw/VIR3_clean.csv \
        --read_length 101 --peptide_tile_length 200 \
        --run_zscore_fit_predict \
        --run_cpm_enr_workflow \
        --summarize_by_organism true \
        --peptide_org_col Organism \
        --results 20230919_virscan_demo_phipflow_"$(date -I)" \
        -resume

N E X T F L O W  ~  version 23.10.0
Launching `https://github.com/matsengrp/phip-flow` [cheeky_wescoff] DSL2 - revision: a60f8ebf29 [V1.11]
P H I P - F L O W!
Matsen, Overbaugh, and Minot Labs
Fred Hutchinson CRC, Seattle WA
================================
sample_table    : ../data-raw/20230919_virscan_demo.samples.debug.csv
peptide_table   : ../data-raw/VIR3_clean.csv
results         : 20230919_virscan_demo_phipflow_2023-10-31

executor >  local (1)
[3a/a4e2cb] process > ALIGN:validate_sample_table (1)                            [100%] 1 of 1, cached: 1 ✔
[4f/2c6603] process > ALIGN:validate_peptide_table (1)                           [100%] 1 of 1, cached: 1 ✔
[a9/249fba] process > ALIGN:generate_fasta_reference (1)                         [100%] 1 of 1, cached: 1 ✔
[43/2583fe] process > ALIGN:generate_index (1)                                   [100%] 1 of 1, cached: 1 ✔
[94/82af45] process > ALIGN:short_read_alignment (24)                            [100%] 26 of 26, cached: 26 ✔
[07/ed5cb7] process > ALIGN:sam_to_counts (24)                                   [100%] 26 of 26, cached: 26 ✔
[35/765f97] process > ALIGN:sam_to_stats (23)                                    [100%] 26 of 26, cached: 26 ✔
[f9/4fd8f0] process > ALIGN:collect_phip_data (1)                                [100%] 1 of 1, cached: 1 ✔
[9f/ec1433] process > ALIGN:replicate_counts (1)                                 [100%] 1 of 1, cached: 1 ✔
[57/6a8991] process > STATS:counts_per_million (1)                               [100%] 1 of 1, cached: 1 ✔
[be/a7ec54] process > STATS:size_factors (1)                                     [100%] 1 of 1, cached: 1 ✔
[ce/5c30dd] process > STATS:edgeR_BEER_workflows:to_csv (1)                      [100%] 1 of 1, cached: 1 ✔
[f1/bb5bd8] process > STATS:edgeR_BEER_workflows:run_edgeR (1)                   [100%] 1 of 1, cached: 1 ✔
[85/a5caa9] process > STATS:edgeR_BEER_workflows:append_assay_csvs_to_xarray (1) [100%] 1 of 1, cached: 1 ✔
[c1/a530b7] process > STATS:edgeR_BEER_workflows:publish_rds (1)                 [100%] 1 of 1, cached: 1 ✔
[a9/6d470e] process > STATS:cpm_fold_enrichment (1)                              [100%] 1 of 1, cached: 1 ✔
[58/301d3b] process > STATS:fit_predict_zscore (1)                               [100%] 1 of 1, cached: 1 ✔
[99/68f470] process > STATS:merge_binary_datasets                                [100%] 1 of 1, cached: 1 ✔
[a7/20c77b] process > DSOUT:dump_binary                                          [100%] 1 of 1, cached: 1 ✔
[bd/a239b9] process > DSOUT:dump_wide_csv                                        [100%] 1 of 1, cached: 1 ✔
[-        ] process > DSOUT:dump_tall_csv                                        -
[af/456d3e] process > AGG:split_samples                                          [  0%] 0 of 1
[-        ] process > AGG:aggregate_organisms                                    -
[-        ] process > AGG:join_organisms                                         -
ERROR ~ Error executing process > 'AGG:split_samples'

Caused by:
  Process `AGG:split_samples` terminated with an error exit status (1)

Command executed [/home/bek321/.nextflow/assets/matsengrp/phip-flow/templates/split_samples.py]:

  #!/usr/bin/env python3

  from collections import defaultdict
  import os
  from typing import List
  import pandas as pd
  import logging


  def setup_logging() -> logging.Logger:
      """Set up logging."""

      # Set up logging
      logFormatter = logging.Formatter(
          '%(asctime)s %(levelname)-8s [split_samples] %(message)s'
      )
      logger = logging.getLogger()
      logger.setLevel(logging.INFO)

      # Also write to STDOUT
      consoleHandler = logging.StreamHandler()
      consoleHandler.setFormatter(logFormatter)
      logger.addHandler(consoleHandler)

      return logger

  logger = setup_logging()

  # The user must specify a CSV containing the sample mapping
  sample_mapping_fp = "data_sample_annotation_table.csv"
  logger.info(f"Reading in sample mapping from: {sample_mapping_fp}")
  assert os.path.exists(sample_mapping_fp)

  # Read in the table
  df = pd.read_csv(sample_mapping_fp, index_col=0)
  logger.info(f"Sample mapping table has {df.shape[0]:,} rows and {df.shape[1]:,} columns")

  # The user must specify the column used to group replicates
  # from the same sample
  sample_grouping_col = ""

  msg = f"Column '{sample_grouping_col}' not found ({', '.join(df.columns.values)})"
  assert sample_grouping_col in df.columns.values, msg

  # Write out a file containing the unique list of sample names
  df.reindex(
      columns=[sample_grouping_col]
  ).drop_duplicates(
  ).to_csv(
      "sample_list",
      header=None,
      index=None
  )

Command exit status:
  1

Command output:
  (empty)

Command error:
  2023-10-31 23:27:39,491 INFO     [split_samples] Reading in sample mapping from: data_sample_annotation_table.csv
  Traceback (most recent call last):
    File ".command.sh", line 32, in <module>
      assert os.path.exists(sample_mapping_fp)
  AssertionError

Work dir:
  /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/af/456d3e3af0b2ad7c60707f28466a6e

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

The error seems to suggest that the sample table does not exist but navigating to the working directory, the link appears to be active and the file exists (see head below)

(phipflow) [bek321@compute-e-16-231 456d3e3af0b2ad7c60707f28466a6e]$ ls -lhat
total 592K
-rw-rw-r-- 1 bek321 bek321    1 Oct 31 23:27 .exitcode
drwxrwxr-x 2 bek321 bek321  626 Oct 31 23:27 .
-rw-rw-r-- 1 bek321 bek321  342 Oct 31 23:27 .command.err
-rw-rw-r-- 1 bek321 bek321  342 Oct 31 23:27 .command.log
-rw-rw-r-- 1 bek321 bek321    0 Oct 31 23:27 .command.out
lrwxrwxrwx 1 bek321 bek321  108 Oct 31 23:27 data_zscore.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_zscore.csv.gz
lrwxrwxrwx 1 bek321 bek321  114 Oct 31 23:27 data_size_factors.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_size_factors.csv.gz
lrwxrwxrwx 1 bek321 bek321  125 Oct 31 23:27 data_sample_annotation_table.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_sample_annotation_table.csv.gz
lrwxrwxrwx 1 bek321 bek321  126 Oct 31 23:27 data_peptide_annotation_table.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_peptide_annotation_table.csv.gz
lrwxrwxrwx 1 bek321 bek321  112 Oct 31 23:27 data_enrichment.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_enrichment.csv.gz
lrwxrwxrwx 1 bek321 bek321  115 Oct 31 23:27 data_edgeR_logpval.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_edgeR_logpval.csv.gz
lrwxrwxrwx 1 bek321 bek321  113 Oct 31 23:27 data_edgeR_logfc.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_edgeR_logfc.csv.gz
lrwxrwxrwx 1 bek321 bek321  112 Oct 31 23:27 data_edgeR_hits.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_edgeR_hits.csv.gz
lrwxrwxrwx 1 bek321 bek321  105 Oct 31 23:27 data_cpm.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_cpm.csv.gz
lrwxrwxrwx 1 bek321 bek321  108 Oct 31 23:27 data_counts.csv.gz -> /n/scratch3/users/b/bek321/virscan_hipc/data-final/work/bd/a239b981851083b8ceab677cf2c224/data_counts.csv.gz
-rw-rw-r-- 1 bek321 bek321    0 Oct 31 23:27 .command.begin
-rw-rw-r-- 1 bek321 bek321 1.4K Oct 31 23:27 .command.sh
-rw-rw-r-- 1 bek321 bek321 4.6K Oct 31 23:27 .command.run
drwxrwxr-x 7 bek321 bek321  240 Oct 31 23:27 ..

(phipflow) [bek321@compute-e-16-231 456d3e3af0b2ad7c60707f28466a6e]$ zcat data_sample_annotation_table.csv.gz | head
sample_id,replicate_id,control_status,sample_ID,sample_name,fastq_filepath,raw_total_sequences,reads_mapped,error_rate,average_quality,percent_mapped,percent_peptides_detected,percent_peptides_between_10_and_100
0,97,library,S1,c1,../data-raw/fastq//IDX097-c1-1_S1_R1_001.fastq.gz,10552,1268,0.002811007,31.1,12.016679302501895,0.7656502179218289,0.0031187381585410547
1,98,beads_only,S2,c2,../data-raw/fastq//IDX098-c2-1_S2_R1_001.fastq.gz,14031,1911,0.002440275,31.6,13.61984177891811,0.9769447281629852,0.007017160856717372
2,100,beads_only,S4,c4,../data-raw/fastq//IDX100-c4-1_S4_R1_001.fastq.gz,202207,28633,0.001710275,33.5,14.16024173248206,6.888512907677553,0.28380517242723596
3,101,beads_only,S5,c5,../data-raw/fastq//IDX101-c5-1_S5_R1_001.fastq.gz,11258,1596,0.002146455,32.4,14.176585539172143,0.8677888926140483,0.0031187381585410547
4,102,beads_only,S6,c6,../data-raw/fastq//IDX102-c6-1_S6_R1_001.fastq.gz,6102,857,0.001756068,33.0,14.044575549000326,0.4545560866073587,0.0015593690792705273
5,103,empirical,S7,v1,../data-raw/fastq//IDX103-v1-1_S7_R1_001.fastq.gz,4370,656,0.001841343,33.0,15.011441647597254,0.36411268000966807,0.0023390536189057906
6,104,empirical,S8,v2,../data-raw/fastq//IDX104-v2-1_S8_R1_001.fastq.gz,4785,745,0.00162137,33.5,15.569487983281086,0.3999781688328903,0.0031187381585410547
7,105,library,S9,c1,../data-raw/fastq//IDX105-c1-2_S9_R1_001.fastq.gz,74117,10804,0.001795265,33.4,14.576952655935886,3.8212339287524264,0.025729589807963697
8,107,beads_only,S11,c3,../data-raw/fastq//IDX107-c3-2_S11_R1_001.fastq.gz,4591,698,0.001687991,33.1,15.203659333478544,0.36333299547003284,0.0031187381585410547

Any idea what may be going on here?

update exmaple sample tables paths

update the rest of the sample table paths in data/pan-cov-example/ . At least the ones we want in the repo.

Remove all non-relavent scripts.

Currently, there are a few scripts such as run_gizmo_meg.sh and gizmo_config that are specific to our experiments. These should be moved to https://github.com/matsengrp/phippery-experiments in a sub-module containing this repo. Generally we should separate our experiments from the pipeline itself.

gzip csv outputs

We should gzip csv outputs to save space.

Quantify the impact of the library batch by comparing between tech and bio replicates

https://matsengrp.slack.com/archives/CUJGLNNQJ/p1590779872013300

Low alignment

I am seeing surprisingly low % mapping. The COVID example gives a % mapping of around 70%, while several experiments we have run give closer to 15%. My colleagues have run these data manually without phipflow and tell me that they have alignment close to 70%, so I suspect I'm messing up some input parameter that is somehow creating an alignment issue. But I can't figure out what that parameter might be.

Any thoughts on what may be causing the low alignment?

Two calls that produced the same result:

nextflow run matsengrp/phip-flow -r V1.10  \
        --sample_table ../data-raw/20231219_samples.csv \
        --peptide_table ../data-raw/peptide_table/VIR3_clean.csv \
        --read_length 75 --peptide_tile_length 168 \
        --run_zscore_fit_predict \
        --run_cpm_enr_workflow \
        --summarize_by_organism true \
        --peptide_org_col Organism \
        --sample_grouping_col sample_name \
        --results 20231219_phipflow_"$(date -I)" \
        -resume

nextflow run matsengrp/phip-flow -r V1.10  \
        --sample_table ../data-raw/20231219_samples.DEBUG.csv \
        --peptide_table ../data-raw/peptide_table/VIR3_clean.csv \
        --read_length 75 --peptide_tile_length 168 \
        --peptide_seq_col Prot \
        --sample_grouping_col sample_name \
        --results 20231219_phipflow_debug_"$(date -I)" \
        -resume

snipette of the peptide table:

e_id | Aclstr50 | Bclstr50 | Entry | Gene names | Gene ontology (GO) | Gene ontology IDs | Genus | Organism | Protein names | seq | Species | Subcellular location | Version (entry) | Version (sequence) | end | id | oligo | source | Prot_Start | Prot -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- 0 | 8652 | 15383 | A0A126 | US5 | suppression by virus of host apoptotic process | GO:0019050 | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein J | MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNTSATASPGDNATSIDAGSTITAAAPPGHSTPWPALPTDLALPLVIGGLCALTLAAMGAGALLHRCCRRCARRRQNVSSVSA | Papiine herpesvirus 2 | 13 | 1 | 56 | 1 | aGGAATTCCGCTGCGTatgcgcagcttgctgtttgtggtcggtgcttgggtcgctgctctcgtcaccaaccttacccctgatgcagctcttgcaagtggtactacaaccaccgctgccgcagggaacacatctgcaacagcttctccaggtgacaacgccacaagcatcgacgctggcagtacaCAGGgaagagctcgaa | Vir2 | 1 | MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNTSATASPGDNATSIDAGST 1 | 15383 | 15384 | A0A126 | US5 | suppression by virus of host apoptotic process | GO:0019050 | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein J | MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNTSATASPGDNATSIDAGSTITAAAPPGHSTPWPALPTDLALPLVIGGLCALTLAAMGAGALLHRCCRRCARRRQNVSSVSA | Papiine herpesvirus 2 | 13 | 1 | 84 | 2 | aTGAATTCGGAGCGGTactacaaccaccgctgccgcagggaacacatctgcaacagcttctccaggtgacaacgccacaagcatcgacgctggcagtacaattaccgctgccgctcctccaggtcattcaacaccttggcctgcactcccaactgatctcgcacttccactcgttatcgggggtCACTGCACTCGAGACa | Vir2 | 29 | TTTTAAAGNTSATASPGDNATSIDAGSTITAAAPPGHSTPWPALPTDLALPLVIGG 2 | 15384 | 3254 | A0A126 | US5 | suppression by virus of host apoptotic process | GO:0019050 | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein J | MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNTSATASPGDNATSIDAGSTITAAAPPGHSTPWPALPTDLALPLVIGGLCALTLAAMGAGALLHRCCRRCARRRQNVSSVSA | Papiine herpesvirus 2 | 13 | 1 | 112 | 3 | aGGAATTCCGCTGCGTattaccgctgccgctcctccaggtcattcaacaccttggcctgcactcccaactgatctcgcacttccactcgttatcgggggtttgtgcgccctcacactcgcagcaatgggcgccggggcattgcttcatcgctgctgccgccgctgcgcacgccgccgccagaatCAGGgaagagctcgaa | Vir2 | 57 | ITAAAPPGHSTPWPALPTDLALPLVIGGLCALTLAAMGAGALLHRCCRRCARRRQN 3 | 3254 | | A0A126 | US5 | suppression by virus of host apoptotic process | GO:0019050 | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein J | MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNTSATASPGDNATSIDAGSTITAAAPPGHSTPWPALPTDLALPLVIGGLCALTLAAMGAGALLHRCCRRCARRRQNVSSVSA | Papiine herpesvirus 2 | 13 | 1 | 118 | 4 | aTGAATTCGGAGCGGTttgtgcgccctcacactcgcagcaatgggcgccggggcattgcttcatcgctgctgccgccgctgcgcacgccgccgccagaatgtctcctcagtcagtgcttaatctgagctcagtccacgagcagatccgtgtgtgtaggtagattacagcatttctcgcgacggcCACTGCACTCGAGACa | Vir2 | 85 | LCALTLAAMGAGALLHRCCRRCARRRQNVSSVSA 4 | 32531 | | A0A130 | US4 | | | | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein G (Fragment) | RDRGPSRSRVRYTRLAASEA | Papiine herpesvirus 2 | 8 | 1 | 20 | 5 | aGGAATTCCGCTGCGTcgcgatcgcggcccttctcgctctcgcgtgcgctacacccgcctggctgcctcagaagcttaatctgagctcagtccacgagcagatccgtgtgtgtaggtagattacagcatttctcgcgacggcgaaacccactaccgtacttgggtcaggtcgatccatgttcctCAGGgaagagctcgaa | Vir2 | 1 | RDRGPSRSRVRYTRLAASEA 5 | 6167 | 2421 | A0A132 | US6 | | | | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein D (Fragment) | MGFGAAAALLALAVALARVPAGGGAYVPVDRALTRVSPNRFRGSSLPPPEQKTDPPDVRRVYH | Papiine herpesvirus 2 | 11 | 1 | 56 | 6 | aGGAATTCCGCTGCGTatggggtttggcgccgcagcagcactgttggctctggcagttgcactcgcccgcgtgcctgcaggcggcggggcatatgtcccagtggaccgcgcactcacacgcgttagcccaaaccgcttccgcggttcatccctgccacctcctgaacaaaagaccgaccctcctCAGGgaagagctcgaa | Vir2 | 1 | MGFGAAAALLALAVALARVPAGGGAYVPVDRALTRVSPNRFRGSSLPPPEQKTDPP 6 | 2421 | | A0A132 | US6 | | | | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein D (Fragment) | MGFGAAAALLALAVALARVPAGGGAYVPVDRALTRVSPNRFRGSSLPPPEQKTDPPDVRRVYH | Papiine herpesvirus 2 | 11 | 1 | 63 | 7 | aTGAATTCGGAGCGGTgtggaccgcgcactcacacgcgttagcccaaaccgcttccgcggttcatccctgccacctcctgaacaaaagaccgaccctcctgacgtccgccgcgtgtatcattaatctgagctcagtccacgagcagatccgtgtgtgtaggtagattacagcatttctcgcgacCACTGCACTCGAGACa | Vir2 | 29 | VDRALTRVSPNRFRGSSLPPPEQKTDPPDVRRVYH 7 | 4099 | 4100 | A0ERK8 | VAC_DPP16_017 VAC_DPP16_226 VAC_TP3_013 VAC_TP5_013 VACV-DUKE-017 VACV_TT12_011 VACV_TT8_011 | Vaccinia virus | VACV-DUKE-017 (Zinc finger-Like protein) (Zinc finger-like) (Zinc finger-like protein) | MHYPKYYINITKINPHLANQFRAWKKRIAGRDYITNLSKDTGIQQSKLTETIRNCQKNRNIYGLYIHYNLVINWITDVIINQY | Vaccinia virus | 10 | 1 | 56 | 8 | aGGAATTCCGCTGCGTatgcactaccctaaatactacatcaacatcacaaaaatcaatcctcaccttgctaaccagtttcgcgcctggaaaaaacgcatcgccggtcgcgattacatcactaaccttagcaaggataccggcatccaacagagcaaacttacagaaaccattcgcaattgtcagCAGGgaagagctcgaa | Vir2 | 1 | MHYPKYYINITKINPHLANQFRAWKKRIAGRDYITNLSKDTGIQQSKLTETIRNCQ 8 | 4100 | 6440 | A0ERK8 | VAC_DPP16_017 VAC_DPP16_226 VAC_TP3_013 VAC_TP5_013 VACV-DUKE-017 VACV_TT12_011 VACV_TT8_011 | Vaccinia virus | VACV-DUKE-017 (Zinc finger-Like protein) (Zinc finger-like) (Zinc finger-like protein) | MHYPKYYINITKINPHLANQFRAWKKRIAGRDYITNLSKDTGIQQSKLTETIRNCQKNRNIYGLYIHYNLVINWITDVIINQY | Vaccinia virus | 10 | 1 | 83 | 9 | aTGAATTCGGAGCGGTgccggtcgcgattacatcactaaccttagcaaggataccggcatccaacagagcaaacttacagaaaccattcgcaattgtcagaagaatcgcaatatctatggcctctacattcattacaatctggtgattaattggatcaccgacgtgattattaaccagtattaaCACTGCACTCGAGACa | Vir2 | 29 | AGRDYITNLSKDTGIQQSKLTETIRNCQKNRNIYGLYIHYNLVINWITDVIINQY 9 | 6500 | 3255 | A0ERN6 | VACV-DUKE-045 | | | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 56 | 10 | aGGAATTCCGCTGCGTatgggcgcaactattagcattttggcctcctatgataaccctaatctcttcaccgccatgattttgatgagtccacttgtcaatgccgatgctgttagtcgcctgaacctcctggcagctaagctgatgggcacaatcacccctaatgcccctgtggggaaactctgcCAGGgaagagctcgaa | Vir2 | 1 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLC 10 | 3255 | 3256 | A0ERN6 | VACV-DUKE-045 | | | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 84 | 11 | aTGAATTCGGAGCGGTgccgatgctgttagtcgcctgaacctcctggcagctaagctgatgggcacaatcacccctaatgcccctgtggggaaactctgccctgaatccgtctcacgcgatatggacaaggtttataagtatcagtacgaccctttgatcaatcacgagaaaatcaaagcagggCACTGCACTCGAGACa | Vir2 | 29 | ADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAG 11 | 3256 | 3093 | A0ERN6 | VACV-DUKE-045 | | | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 112 | 12 | aGGAATTCCGCTGCGTcctgaatccgtctcacgcgatatggacaaggtttataagtatcagtacgaccctttgatcaatcacgagaaaatcaaagcagggtttgcctcacaagttcttaaagctactaacaaagtgcgcaagattatctccaagattaatacaccacctaccttgatcctgcagCAGGgaagagctcgaa | Vir2 | 57 | PESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQ 12 | 3093 | 9021 | A0ERN6 | VACV-DUKE-045 | | | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 140 | 13 | aTGAATTCGGAGCGGTtttgcctcacaagttcttaaagctactaacaaagtgcgcaagattatctccaagattaatacaccacctaccttgatcctgcaggggaccaataatgaaatcagtgacgttctcggcgcatactactttatgcaacacgctaactgcaatcgcgagatcaaaatctacCACTGCACTCGAGACa | Vir2 | 85 | FASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIY 13 | 9021 | 14672 | A0ERN6 | VACV-DUKE-045 | | | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 168 | 14 | aGGAATTCCGCTGCGTgggaccaataatgaaatcagtgacgttctcggcgcatactactttatgcaacacgctaactgcaatcgcgagatcaaaatctacgaaggtgccaaacatcacctgcataaggagactgacgaagtcaagaaatctgtgatgaaggagatcgaaacttggattttcaatCAGGgaagagctcgaa | Vir2 | 113 | GTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFN 14 | 14672 | | A0ERN6 | VACV-DUKE-045 | | | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 171 | 15 | aTGAATTCGGAGCGGTgaaggtgccaaacatcacctgcataaggagactgacgaagtcaagaaatctgtgatgaaggagatcgaaacttggattttcaatcgcgttaaataatctgagctcagtccacgagcagatccgtgtgtgtaggtagattacagcatttctcgcgacggcgaaacccacCACTGCACTCGAGACa | Vir2 | 141 | EGAKHHLHKETDEVKKSVMKEIETWIFNRVK 15 | 8653 | 4310 | A0ES12 | VACV-DUKE-171 | | | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 56 | 16 | aGGAATTCCGCTGCGTatgattccattgttgtttatcttgttctattttgctaacgggatcgaatggcataaatttgaaacttccgaagagattattagcacatatttgcttgatgacgttctttataccggcgtgaacggcgctgtctacactttttcaaataacaagcttaataagacaggcCAGGgaagagctcgaa | Vir2 | 1 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTG 16 | 4310 | 4311 | A0ES12 | VACV-DUKE-171 | | | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 84 | 17 | aTGAATTCGGAGCGGTacatatttgcttgatgacgttctttataccggcgtgaacggcgctgtctacactttttcaaataacaagcttaataagacaggcttgacaaacactaattatattaccacatcaatcaaggttgaagacgccgataaagatactcttgtctgcggcaccaacaacggtCACTGCACTCGAGACa | Vir2 | 29 | TYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNG 17 | 4311 | 6501 | A0ES12 | VACV-DUKE-171 | | | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 112 | 18 | aGGAATTCCGCTGCGTttgacaaacactaattatattaccacatcaatcaaggttgaagacgccgataaagatactcttgtctgcggcaccaacaacggtaaccctaaatgctggaaaatcgacggctcagatgatcctaaacaccgcgggcgcggttacgccccatatcagaactccaaagtcCAGGgaagagctcgaa | Vir2 | 57 | LTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKV 18 | 6501 | 6502 | A0ES12 | VACV-DUKE-171 | | | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 140 | 19 | aTGAATTCGGAGCGGTaaccctaaatgctggaaaatcgacggctcagatgatcctaaacaccgcgggcgcggttacgccccatatcagaactccaaagtcacaatcatcagtcacaacggttgtgtcttgtcagatattaatatcagtaaggagggtatcaaacgctggcgccgctttgacgggCACTGCACTCGAGACa | Vir2 | 85 | NPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDG 19 | 6502 | 6168 | A0ES12 | VACV-DUKE-171 | | | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 168 | 20 | aGGAATTCCGCTGCGTacaatcatcagtcacaacggttgtgtcttgtcagatattaatatcagtaaggagggtatcaaacgctggcgccgctttgacgggccttgtggctatgacctctttactgcagacaacgtcatcccaaaagatggcttgcgcggggcttttgttgataaggatggcacaCAGGgaagagctcgaa | Vir2 | 113 | TIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGT 20 | 6168 | 4312 | A0ES12 | VACV-DUKE-171 | | | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 196 | 21 | aTGAATTCGGAGCGGTccttgtggctatgacctctttactgcagacaacgtcatcccaaaagatggcttgcgcggggcttttgttgataaggatggcacatacgataaagtgtatatcttgttcaccgataccattggttctaagcgcatcgtgaagatcccttatatcacccagatgtgccttCACTGCACTCGAGACa | Vir2 | 141 | PCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCL 21 | 4312 | 4313 | A0ES12 | VACV-DUKE-171 | | | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 224 | 22 | aGGAATTCCGCTGCGTtacgataaagtgtatatcttgttcaccgataccattggttctaagcgcatcgtgaagatcccttatatcacccagatgtgccttaatgatgaagggggcccttcaagtttgagttctcaccgctggtccacttttctcaaagtcgagcttgaatgtgacatcgacggtCAGGgaagagctcgaa | Vir2 | 169 | YDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDG 22 | 4313 | | A0ES12 | VACV-DUKE-171 | | | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 228 | 23 | aTGAATTCGGAGCGGTaatgatgaagggggcccttcaagtttgagttctcaccgctggtccacttttctcaaagtcgagcttgaatgtgacatcgacggtcgctcataccgctaatctgagctcagtccacgagcagatccgtgtgtgtaggtagattacagcatttctcgcgacggcgaaaccCACTGCACTCGAGACa | Vir2 | 197 | NDEGGPSSLSSHRWSTFLKVELECDIDGRSYR 23 | 14673 | 4314 | A0ES13 | VACV-DUKE-172 | | | Vaccinia virus | VACV-DUKE-172 | MNTIKQSFSTSKLEGYTKQLPSPAPGICLPAGKVVPHTTFEVIEQYNVLDDIIKPLSNQPIFEGPSGVKWFDIKEKENEHREYRIYFIKENSIYSFDTKSKQTRSSQVDARLFSVMVTSKPLFIADIGIGVGMPQMKKILKM | Vaccinia virus | 17 | 1 | 56 | 24 | aGGAATTCCGCTGCGTatgaacactatcaaacagtcattctcaacatctaaactcgaagggtataccaaacagctcccaagtccagcaccagggatttgcttgccagcaggtaaggtggtgcctcacacaacattcgaggttatcgaacaatacaatgtcttggatgatattattaaacctctgCAGGgaagagctcgaa | Vir2 | 1 | MNTIKQSFSTSKLEGYTKQLPSPAPGICLPAGKVVPHTTFEVIEQYNVLDDIIKPL 24 | 4314 | 4315 | A0ES13 | VACV-DUKE-172 | | | Vaccinia virus | VACV-DUKE-172 | MNTIKQSFSTSKLEGYTKQLPSPAPGICLPAGKVVPHTTFEVIEQYNVLDDIIKPLSNQPIFEGPSGVKWFDIKEKENEHREYRIYFIKENSIYSFDTKSKQTRSSQVDARLFSVMVTSKPLFIADIGIGVGMPQMKKILKM | Vaccinia virus | 17 | 1 | 84 | 25 | aTGAATTCGGAGCGGTttgccagcaggtaaggtggtgcctcacacaacattcgaggttatcgaacaatacaatgtcttggatgatattattaaacctctgtccaatcaacctattttcgaaggtccatctggcgttaagtggttcgacatcaaggaaaaggaaaatgagcatcgcgagtaccgcCACTGCACTCGAGACa | Vir2 | 29 | LPAGKVVPHTTFEVIEQYNVLDDIIKPLSNQPIFEGPSGVKWFDIKEKENEHREYR 25 | 4315 | 4316 | A0ES13 | VACV-DUKE-172 | | | Vaccinia virus | VACV-DUKE-172 | MNTIKQSFSTSKLEGYTKQLPSPAPGICLPAGKVVPHTTFEVIEQYNVLDDIIKPLSNQPIFEGPSGVKWFDIKEKENEHREYRIYFIKENSIYSFDTKSKQTRSSQVDARLFSVMVTSKPLFIADIGIGVGMPQMKKILKM | Vaccinia virus | 17 | 1 | 112 | 26 | aGGAATTCCGCTGCGTtccaatcaacctattttcgaaggtccatctggcgttaagtggttcgacatcaaggaaaaggaaaatgagcatcgcgagtaccgcatctacttcatcaaagaaaatagtatctacagcttcgacaccaaatctaagcaaacacgcagtagccaagtcgatgcacgccttCAGGgaagagctcgaa | Vir2 | 57 | SNQPIFEGPSGVKWFDIKEKENEHREYRIYFIKENSIYSFDTKSKQTRSSQVDARL

Fix broken images

Add an automated test.

We already have an example data set. Let's make a formal test and a git workflow that asserts this passes before a merge.

phip-flow command error when running merge-counts-stats.py

I'm working on getting a test of your analysis suite running using nextflow and a Singularity container. I'm running the container in a virtual environment that also has phippery installed. When the container runs using the defaults (that I think should run the test data set for validation), the first several steps run fine; however, the nextflow analysis errors out when trying to run merge-counts-stats.py script. A phippery module is not being found. The error message from nextflow is:

Error executing process > 'ALIGN:collect_phip_data (1)'

Caused by:
Process ALIGN:collect_phip_data (1) terminated with an error exit status (1)

Command executed:

merge-counts-stats.py -st validated_sample_table.csv -pt validated_peptide_table.csv -cfp ".counts" -sfp ".stats" -o data.phip

Command exit status:
1

Command output:
(empty)

Command error:
Traceback (most recent call last):
File "/home/dob/.nextflow/assets/matsengrp/phip-flow/bin/merge-counts-stats.py", line 8, in
import phippery.phipdata as phipdata
ModuleNotFoundError: No module named 'phippery.phipdata'

I was unsuccessful in manually locating the phippery.phipdata module that the script it trying to import. Is the module currently included in the package? Has the module been renamed to something else? What would you suggest to either fix or troubleshoot the problem?

Thanks.

DSL 2

eventually, we need to switch to the new API, -> https://www.nextflow.io/blog/2020/dsl2-is-here.html

Deprecate or Remove Negative Binomial workflow

We've decided that there is no more reason to maintain the Neg. Binomial workflow -- so it needs to either be deprecated (with a warning) or removed completely. I'm leaning towards the latter

TODO

Remove the NegBinom code from phippery
Remove the workflow processes relevant to NegBinom in the pipeline
Remove the relevant documentation for NB
Update phippery version
Update phip-flow version

Local Alignment

We are most generally aligning reads that are larger than the peptide-encoding reference sequences. This brings up interest in using local alignment, as seen here.

Currently, we use end-to-end alignment bu trim the low quality (right side, 3 prime) end. This could, potentially, be missing many alignments of interest.

remove integer id requirement for sample and peptide table.

We will write a sub-workflow which first checks for a columns "sample_id" or "peptide_id" in the tables, and adds them if they are not there.

Multi-library Manifest

Ultimately, we could replace both the sample table and the peptide table parameters with a single manifest file (CSV) which will look almost exactly the same as the sample table -- except now there is a required "peptide_reference" that points to a relative filepath to the the correct peptide table for each respective sample manifest. With a little tweaking the pipeline could then run library workflows in parallel i.e. If there are N unique peptide tables in the manifest "peptide_reference" column, across all samples, then the workflow will be run in parallel and produce N different data sets.