felicityallen / jacks

Analysis package for processing counts from genome-wide CRISPR/Cas9 screens

License: Apache License 2.0

crispr crispr-cas9 crispr-analysis genome-wide-data bioinformatics

jacks's Introduction

In the subdirectories here you will find:

jacks:

The JACKS python package (please see jacks/README.txt for usage instructions)

2018_paper_materials:

Scripts and README files with location of results and data for the JACKS 2018 paper.

reference_grna_efficacies:

Trained values for JACKS's gRNA efficacies for the Avana, GeCKOv2, Yusa 1.0, TKOv1 and Whitehead libraries as generated for the 2018 JACKS paper. These can be used with run_JACKS.py to evaluate screens on these libraries without re-running the full analysis.

jacks's Issues

KeyError issue

Hi,

Whenever I try to run JACKS I keep getting a KeyError.
Here is the error output:

[2019-05-23 16:36:55,172] jacks: INFO Loading sample specification
[2019-05-23 16:36:55,172] jacks: INFO Loading gene mappings
[2019-05-23 16:36:55,186] jacks: INFO Loading data and pre-processing
Traceback (most recent call last):
  File "/home/alexander/bioTools/JACKS/jacks/run_JACKS.py", line 7, in <module>
    runJACKSFromArgs()
  File "/home/alexander/bioTools/JACKS/jacks/jacks/jacks_io.py", line 441, in runJACKSFromArgs
    args.outprefix, args.reffile, args.n_pseudo, args.count_prior)
  File "/home/alexander/bioTools/JACKS/jacks/jacks/jacks_io.py", line 431, in runJACKS
    outprefix, apply_w_hp=apply_w_hp, norm_type=norm_type, ctrl_genes=ctrl_genes, fdr=fdr, fdr_thresh_type=fdr_thresh_type, n_pseudo=n_pseudo, count_prior=count_prior )
  File "/home/alexander/bioTools/JACKS/jacks/jacks/jacks_io.py", line 383, in load_data_and_run
    data, meta, sample_ids, genes, gene_index = loadDataAndPreprocess(sample_spec, gene_spec,ctrl_spec=ctrl_spec,normtype=norm_type, ctrl_geneset=ctrl_geneset, prior=count_prior)
  File "/home/alexander/bioTools/JACKS/jacks/jacks/preprocess.py", line 147, in loadDataAndPreprocess
    counts.append([np.log2(eval(row[colname])+prior) for sample_id, colname in sample_spec[filename]])
  File "/home/alexander/bioTools/JACKS/jacks/jacks/preprocess.py", line 147, in <listcomp>
    counts.append([np.log2(eval(row[colname])+prior) for sample_id, colname in sample_spec[filename]])
KeyError: 'd15Cas9k2'

Here is a copy of the input

python ~/bioTools/JACKS/jacks/run_JACKS.py smallCount.tab smallRep.tab guideMapping.tab --ctrl_sample_hdr=Control

Here is what the smallRep.tab looks like

Replicate Sample Control
d15Cas9k1 d15Cas9 d15Wt
d15Cas9k2 d15Cas9 d15Wt
d15WTk1 d15Wt d15Wt
d15WTk2 d15Wt d15Wt

Here is a sample of the smallCount.tab file

sgRNA d15Cas9k1 d15Cas9k2 d15WTk1 d15WTk2
PMVK_NM_006556.3_154936638 3214 3573 4319 4463
PLK2_NM_006622.3_58457530 6003 5871 8658 6865
CHEK1_NM_001114121.2_125627805 3948 2687 5182 5021
MAP4K5_NM_006575.4_50485610 3688 3408 5612 5653
CKS2_NM_001827.2_89311315 6709 7779 9329 8668
AK4_NM_013410.3_65148450 5115 4971 6398 6486

Any help would be deeply appreciated.
Thank you.
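
A KeyError of this kind means a replicate name from the replicate map was not found among the parsed column headers of the count file. The following is a minimal diagnostic sketch, not part of JACKS: the helper name missing_replicate_columns is hypothetical, and the stray trailing space in the example header is invented purely to show one way a name can fail to match. The thread does not establish the actual cause.

```python
import csv
import io


def missing_replicate_columns(count_file, replicate_file, rep_hdr="Replicate", delimiter="\t"):
    """Return replicate names from the replicate map that are not count-file headers.

    Hypothetical helper for illustration; it only compares names exactly,
    the way a dictionary lookup on the parsed header would.
    """
    count_header = next(csv.reader(count_file, delimiter=delimiter))
    replicates = [row[rep_hdr] for row in csv.DictReader(replicate_file, delimiter=delimiter)]
    return [rep for rep in replicates if rep not in count_header]


# Invented example: the second count column carries a trailing space.
counts = io.StringIO("sgRNA\td15Cas9k1\td15Cas9k2 \nPMVK\t3214\t3573\n")
reps = io.StringIO(
    "Replicate\tSample\tControl\n"
    "d15Cas9k1\td15Cas9\td15Wt\n"
    "d15Cas9k2\td15Cas9\td15Wt\n"
)
print(missing_replicate_columns(counts, reps))  # ['d15Cas9k2']
```

Running the same check on the real files (opened with open(path, newline="")) would show whether every name in the Replicate column matches a count-file header exactly.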

data_err of zero with one replicate

When there are no replicates the data error used in jacks.infer_JACKS_gene() is zero. I think the line data_err[SP.isnan(data_err)] = 2.0 is supposed to prevent this, but io_preprocess.calc_posterior_sd() returns zeroes, not NaNs, with single reps.
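
One way to cover both cases is to treat zeros the same way as NaNs. This is a sketch of the idea only, written with plain Python lists rather than the arrays the package uses; guard_data_err is a hypothetical name, and the fallback of 2.0 is taken from the line quoted above.

```python
import math


def guard_data_err(data_err, fallback=2.0):
    """Replace NaN or zero error estimates with a fallback value.

    Hypothetical helper: a zero error (as reported for single replicates)
    is handled exactly like a NaN.
    """
    return [fallback if (math.isnan(e) or e == 0.0) else e for e in data_err]


print(guard_data_err([0.0, float("nan"), 0.35]))  # [2.0, 2.0, 0.35]
```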

RuntimeWarning: Mean of empty slice

Hello,
I am trying to use JACKS and running into following issues:

python ~/Research/Programs/Jacks/JACKS/jacks/run_JACKS.py Count_Matrix.tab Exp_Summary_JACKS.tab sgRNA_Mapping_File.tab --sgrna_hdr=sgrna --gene_hdr=Gene --ctrl_sample_hdr=Sample --outprefix JACKS
[2021-11-09 18:05:01,363] jacks: INFO     Loading sample specification
[2021-11-09 18:05:01,363] jacks: INFO     Loading gene mappings
[2021-11-09 18:05:01,365] jacks: INFO     Loading data and pre-processing
[2021-11-09 18:05:01,424] jacks: INFO     Applying median normalisation
[2021-11-09 18:05:01,471] jacks: INFO     Collating 0 samples
[2021-11-09 18:05:01,487] jacks: INFO     Running JACKS inference
/home/.conda/envs/jacksenv/lib/python3.10/site-packages/scipy/_lib/deprecation.py:20: RuntimeWarning: Mean of empty slice
  return fun(*args, **kwargs)
/home/Research/Programs/Jacks/JACKS/jacks/jacks/infer.py:88: RuntimeWarning: Mean of empty slice.
  LOG.debug("After init, mean absolute error=%.3f, <x>=%.1f <w>=%.1f lower bound=%.1f"%(SP.nanmean(abs(y.T-SP.outer(w1,x1))).mean(), x1.mean(), w1.mean(), bound))
/home/.conda/envs/jacksenv/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/Research/Programs/Jacks/JACKS/jacks/jacks/infer.py:115: RuntimeWarning: Mean of empty slice.
  LOG.debug("After W update, <w>=%.1f, mean absolute error=%.3f"%(w1.mean(), SP.nanmean(abs(y.T-SP.outer(w1,x1))).mean()))
/home/Research/Programs/Jacks/JACKS/jacks/jacks/infer.py:96: RuntimeWarning: Mean of empty slice.
  LOG.debug("Iter %d/%d. lb: %.1f err: %.3f x:%.2f+-%.2f w:%.2f+-%.2f xw:%.2f"%(i+1, n_iter, bound, SP.nanmean(abs(y.T-SP.outer(w1,x1))).mean(), x1.mean(), SP.median((x2-x1**2)**0.5), w1.mean(), SP.median((w2-w1**2)**0.5), x1.mean()*w1.mean()))
/home/.conda/envs/jacksenv/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3440: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
[2021-11-09 18:05:01,801] jacks: INFO     Writing JACKS results
/gpfs/fs1/home/Research/Programs/Jacks/JACKS/jacks/jacks/jacks_io.py:28: RuntimeWarning: Mean of empty slice
  ordered_genes = [(np.nanmean(jacks_results[gene][4]),gene) for gene in jacks_results]
/home/Research/Programs/Jacks/JACKS/jacks/jacks/jacks_io.py:137: RuntimeWarning: Mean of empty slice
  ordered_genes = [(np.nanmean(jacks_results[gene][4]),gene) for gene in jacks_results]

Here are snippets of how various files look:
head -n +4 Count_Matrix.tab

sgRNA   P23H1   P23H2   Control1        Control2        Control3
Amfr_sgRNA1     150     44      602     530     302
Amfr_sgRNA2     141     24      380     350     162
Amfr_sgRNA3     203     21      443     435     303

head Exp_Summary_JACKS.tab

Replicate       Sample
P23H1   P23H
P23H2   P23H
Control1        CTRL
Control2        CTRL
Control3        CTRL

head -n +3 sgRNA_Mapping_File.tab

sgrna   Gene
Sec24d_sgRNA2   Sec24d
Gm30534_sgRNA3  Gm30534

I am not sure why any array slice would be empty when the mean is taken (assuming that is the error). The counts file is tab-separated, and so are the other files. I do see "Collating 0 samples" in the log; could this be the issue?

Any help will be much appreciated.

Thanks,
D
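
One possible reading of the "Collating 0 samples" line, offered as an assumption rather than a confirmed diagnosis: the command passes --ctrl_sample_hdr=Sample, so if that option names the column holding each replicate's control (as the Control column does in the KeyError issue above), every sample is listed as its own control and nothing is left to test. The sketch below counts the samples that differ from their control; non_control_samples is a hypothetical helper and does not reproduce how JACKS collates samples.

```python
import csv
import io


def non_control_samples(replicate_file, sample_hdr="Sample", ctrl_hdr="Control", delimiter="\t"):
    """Return the sorted sample names whose control is a different sample.

    Hypothetical helper: assumes ctrl_hdr names a column that gives the
    control sample for each replicate row.
    """
    rows = csv.DictReader(replicate_file, delimiter=delimiter)
    return sorted({row[sample_hdr] for row in rows if row[sample_hdr] != row[ctrl_hdr]})


as_posted = "Replicate\tSample\nP23H1\tP23H\nControl1\tCTRL\n"
with_ctrl = "Replicate\tSample\tControl\nP23H1\tP23H\tCTRL\nControl1\tCTRL\tCTRL\n"

# Using the Sample column as the control column leaves no test samples.
print(non_control_samples(io.StringIO(as_posted), ctrl_hdr="Sample"))   # []
# A separate (invented) Control column leaves P23H to be compared against CTRL.
print(non_control_samples(io.StringIO(with_ctrl), ctrl_hdr="Control"))  # ['P23H']
```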

Statistics for positive selection

Thank you for your great tool. Would it be possible to provide statistics for positive selection, either in addition to negative selection or as an alternative using a command line flag?

This would make JACKS more versatile and would cover the analysis of enrichment screens.

problem with generated p-values

Hello,

I am using JACKS to analyse a CRISPR screen where I want to select positively enriched genes compared to a control.
Everything works very well in terms of beta scores, with a positive control getting the largest beta score as expected. But the p-values are the opposite of what I would expect, with our positive controls never getting low p-values.
I have used as control genes both the NEGv1.txt nonessential gene list you provide and three control genes from the TKOv3 library we use. I encountered the same problem with both.
I was wondering whether there is something I am missing, or whether I am not using the right control genes?

Best,
Anne
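
The behaviour described here is what a one-sided test for depletion would produce, although the thread does not confirm that this is how JACKS computes its p-values. As a generic illustration, and not the package's method, the sketch below computes an empirical p-value in each tail from a set of control-gene scores. The function name, the add-one correction and the example scores are all assumptions.

```python
def empirical_pvalues(score, control_scores):
    """Return (p_depleted, p_enriched) for one gene score.

    Each p-value is the add-one-corrected fraction of control-gene scores
    at least as extreme as `score` in that direction.
    """
    n = len(control_scores)
    p_depleted = (1 + sum(c <= score for c in control_scores)) / (n + 1)
    p_enriched = (1 + sum(c >= score for c in control_scores)) / (n + 1)
    return p_depleted, p_enriched


controls = [-0.2, -0.1, 0.0, 0.1, 0.2]  # invented control-gene scores
p_dep, p_enr = empirical_pvalues(1.5, controls)
print(f"depleted tail: {p_dep:.3f}, enriched tail: {p_enr:.3f}")
```

A strongly enriched gene gets a p-value near 1 in the depletion tail and a small one in the enrichment tail, which is the pattern reported for the positive controls.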

Normalization in the presence of large fraction of dropouts

The median normalization of read counts can give unexpected results if applied to experiments where a large fraction of gRNAs are not represented. A proposed solution is to add an additional parameter into the normalization step that aligns samples using information only from gRNAs with a minimum read count, e.g. 10.
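
The proposal can be sketched as follows. This is an illustration of the idea and not the JACKS implementation: the function name, the pseudocount of 1 and the example counts are assumptions, and only the min_count threshold of 10 comes from the suggestion above.

```python
import math
from statistics import median


def median_normalise(sample_counts, min_count=10, pseudocount=1.0):
    """Log2-transform one sample's counts and centre them on a median.

    The median is taken only over gRNAs with at least `min_count` reads,
    so dropouts do not drag the centre towards zero.
    """
    logged = [math.log2(c + pseudocount) for c in sample_counts]
    kept = [v for c, v in zip(sample_counts, logged) if c >= min_count]
    if not kept:
        raise ValueError("no gRNA reaches min_count; cannot normalise this sample")
    centre = median(kept)
    return [v - centre for v in logged]


counts = [0, 0, 0, 15, 31, 63]  # half of the gRNAs dropped out
print(median_normalise(counts, min_count=0))   # [-2.0, -2.0, -2.0, 2.0, 3.0, 4.0]
print(median_normalise(counts, min_count=10))  # [-5.0, -5.0, -5.0, -1.0, 0.0, 1.0]
```

With min_count=0 the median falls between the dropouts and the represented gRNAs, so every represented gRNA looks enriched. Restricting the median to represented gRNAs centres the sample on them instead.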

p-values or FDR

Thank you for providing this tool. I would like to know whether the p-values reported in the output files are raw or corrected p-values. I was unable to find documentation of the output files, and it was not clear to me from the methods in the paper.

Thanks for your time!
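
If the reported values turn out to be raw p-values (the thread does not say either way), a standard Benjamini-Hochberg adjustment can be applied to them afterwards. The sketch below is a generic implementation of that procedure and is not taken from JACKS; the example p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.5]  # invented raw p-values
print([round(q, 4) for q in benjamini_hochberg(raw)])
```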

run_jacks_reference.py use of createSampleSpec needs updating

run_jacks_reference.py calls createSampleSpec from run_jacks.py without the ctrl_sample_or_hdr argument, and it doesn't expect the tuple return value.

I got around this by adding fixed X parts of rj_reference.py to run_jacks.py, see attached, but also changed it to take arguments from within Python which might not suit everyone.

jacks_front.py.zip

LICENSE, Versioning and GitHub Releases

Hello,

Can you please add a LICENSE to this repository?
Strictly speaking, without any LICENSE information no one is allowed to use the code available here.

What is the Versioning and GitHub Release policy?
We would like to use it but are not sure how to reference a specific version without this information being present in the repository.

What is the current version?
jacks/setup.py talks about 0.2 -- is that the version that should be used for this repository?

No entrypoints for scripts

The scripts:

jacks/run_JACKS.py and
jacks/plot_heatmap.py

are supposed to be available to a user. Installing via setup.py doesn't make them available.

I think those should be entry points and packaged accordingly. I think https://packaging.python.org/specifications/entry-points/#use-for-scripts might be a good starting point.
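
As a rough sketch of what that could look like in jacks/setup.py: the run_JACKS target below uses runJACKSFromArgs in jacks/jacks_io.py, which is the function run_JACKS.py calls in the traceback from the KeyError issue. The plot_heatmap target is hypothetical and assumes the script is moved into the package and exposes a main() function. This is not the repository's actual packaging.

```python
# Sketch of an entry_points mapping for jacks/setup.py (assumptions noted inline).
ENTRY_POINTS = {
    "console_scripts": [
        # runJACKSFromArgs appears in jacks/jacks_io.py in the traceback quoted earlier.
        "run_JACKS = jacks.jacks_io:runJACKSFromArgs",
        # Hypothetical target: assumes plot_heatmap.py lives in the package with a main().
        "plot_heatmap = jacks.plot_heatmap:main",
    ]
}

# In setup.py this would be passed as: setup(..., entry_points=ENTRY_POINTS)
for spec in ENTRY_POINTS["console_scripts"]:
    print(spec)
```

After installation, each name on the left of the "=" would become a command on the user's PATH.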

Slightly confusing header names in documentation

Hi guys,
Very minor documentation issue: I found it a bit confusing that the example column header names in the initial run_JACKS.py example in the documentation don't match the column names used in the example files below it, e.g.:

python run_JACKS.py countfile replicatemapfile sgrnamappingfile --rep_hdr=replicate_hdr

(if I am understanding it correctly) means that, in replicatemapfile, the replicate column header is named "replicate_hdr". In the example replicatemapfile, the column header is "Replicate Header" (and so on for the other column headers in the example replicatemapfile). It would be easier to see what's going on quickly if they were the same, so that you can e.g. easily see that the argument to --rep_hdr matches the name of the first column in replicatemapfile.
Dan

Run JACKS from fold change

Hi,

Is there a way to use the JACKS API to run the analysis starting from a log2 fold change table (and a corresponding table of precomputed variances)? That is, to input an sgRNA by screen table, like the heatmap shown in Figure 1 of the paper.

Thanks,
Peter

Please provide a PyPI version

It seems that the jacks subdirectory is prepared as it has a setup.py file.

Can you please publish it on PyPI so that the installation process is more developer friendly?

How many cell lines needed for P-value calculation?

My data set so far contains only 2 cell lines and 1 pDNA control. JACKS could generate the per-gene results and stds, but no P-values were generated. I'm guessing that's because I have too few cell lines in the data set. Is that correct? How many cell lines are needed for it to calculate P-values? Thanks!