
toolkit's Introduction


ngs-toolkit

This is my NGS analysis toolkit: ngs_toolkit.

Head to the documentation to see how to install and use the toolkit, and have a look at the catalogue of available functions.

Install with:

pip install ngs-toolkit

You might need to add a --user flag to the above command.

toolkit's People

Contributors

afrendeiro, fwzhao


toolkit's Issues

Implement HOMER motif analysis through de novo discovery, reduction and overrepresentation

From @cschmidl:

HOMER reducing motifs:

1. motif search as usual

2. Create a combined motif file (the example below concatenates six input files); each run's homerMotifs.all.motifs is the file with its results

cat \
    /Volumes/NGSdata/projects/project/HOMER/motifs/differentialExpression.Down_G1_vs_G2.txt/homerMotifs.all.motifs \
    /Volumes/NGSdata/projects/project/HOMER/motifs/differentialExpression.Down_G1_vs_G3.txt/homerMotifs.all.motifs \
    /Volumes/NGSdata/projects/project/HOMER/motifs/differentialExpression.Down_G2_vs_G3.txt/homerMotifs.all.motifs \
    /Volumes/NGSdata/projects/project/HOMER/motifs/differentialExpression.Up_G1_vs_G2.txt/homerMotifs.all.motifs \
    /Volumes/NGSdata/projects/project/HOMER/motifs/differentialExpression.Up_G1_vs_G3.txt/homerMotifs.all.motifs \
    /Volumes/NGSdata/projects/project/HOMER/motifs/differentialExpression.Up_G2_vs_G3.txt/homerMotifs.all.motifs \
    > /Volumes/NGSdata/projects/project/HOMER/motifs/project_homerMotifs.all.motifs

3. compareMotifs.pl [options]

compareMotifs.pl /Volumes/NGSdata/projects/project/HOMER/motifs/project_homerMotifs.all.motifs /Volumes/NGSdata/projects/project/HOMER/motifs/project_opening_reduced -info 0.6 -nofacts -pvalue 1e-25 -known /Users/christianschmidl/src/homer/data/knownTFs/vertebrates/known.motifs

4. make a known motif file (not 100% sure why I egreped this, maybe have a quick look at the .motif file)

for i in $(ls /Volumes/NGSdata/projects/project/HOMER/motifs/project_all_reduced/homerResults/*.motif | egrep -v similar | egrep -v RV)
do
    cat $i >> /Volumes/NGSdata/projects/project/HOMER/motifs/project_all_reduced/all.motifs
done

5. Check for enrichment of the defined known motifs: the usual motif search, but specifying the self-made known motif file as the motif source

PROJECTDIR=/Volumes/NGSdata/projects/project
HOMERDIR=${PROJECTDIR}/HOMER
GENOME="mm10"
MOTIFFILE="/Volumes/NGSdata/projects/project/HOMER/motifs/project_all_reduced/all.motifs"
OPTIONS="-nomotif "

for SAMPLE in $(ls ${HOMERDIR}/peaks | egrep differentialExpression.)
do
    echo "findMotifsGenome.pl ${HOMERDIR}/peaks/$SAMPLE $GENOME ${HOMERDIR}/motifs/reducedMotifEnrichment_$(basename "${SAMPLE}" .txt) $OPTIONS -mknown $MOTIFFILE"
done

6. Extract the p-values and motif names from all motif searches and build a dataframe with motifs in the first column and, in the remaining columns, the p-value and fold change (input vs. background) from each motif search, for visualization. A heatmap is nice, but I have also seen bubble plots where bubble size encodes the p-value and color the fold change, or vice versa. No idea how to plot this though... In general I think this is a good approach to cross-compare several samples!

Enrichr step of run_enrichment_jobs missing time

toolkit/ngs_toolkit/general.py

Lines 1920 to 1941 in ffe682e

# Enrichr
if "enrichr" in steps:
    files = glob(results_dir + "/*/*.gene_symbols.txt")
    for file in files:
        dir_ = os.path.dirname(file)
        name = os.path.basename(dir_)
        output_ = file.replace(".gene_symbols.txt", ".enrichr.csv")
        if os.path.exists(get_this_file_or_timestamped(output_)) and (not overwrite):
            continue
        jobs.append(
            [
                name + "_enrichr",
                os.path.join(dir_, name + ".enrichr.log"),
                os.path.join(dir_, name + ".enrichr.sh"),
                ("shortq", 1, 4000),
                "{e} -m ngs_toolkit.recipes.enrichr {f} {o}".format(
                    e=sys.executable, f=file, o=output_
                ),
            ]
        )
for jobname, log_file, job_file, (partition, cores, mem, time), task in jobs:

This leads to an unpacking error in the last line: the Enrichr jobs carry a three-element resource tuple, but the loop unpacks four values (partition, cores, mem, time).
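A minimal reproduction, plus a sketch of the fix. The "01:00:00" time string is an assumption for illustration, not the project's actual default.

```python
# The job list above builds a 3-tuple of resources, while the loop unpacks four values.
jobs = [["x_enrichr", "x.log", "x.sh", ("shortq", 1, 4000), "cmd"]]
failed = False
try:
    for jobname, log_file, job_file, (partition, cores, mem, time), task in jobs:
        pass
except ValueError:  # "not enough values to unpack"
    failed = True

# Sketch of the fix: emit a 4-element resource tuple that includes a time limit
jobs = [["x_enrichr", "x.log", "x.sh", ("shortq", 1, 4000, "01:00:00"), "cmd"]]
for jobname, log_file, job_file, (partition, cores, mem, time), task in jobs:
    pass
```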

Handle sample and group attributes between functions better

There are errors when using ngs_toolkit.Analysis.get_sample_colours within ngs_toolkit.general.unsupervised if group attributes are not given in the same order as they exist in the levels of the relevant matrix's MultiIndex columns, or if all attributes are blank (NaN).

Reported by @cschmidl

divvy 0.4.0 not supported

Reported by @sreichl:
divvy 0.4.0 not supported since divvy.logging is no longer present.

TODO: fix the divvy logging import in a backwards-compatible manner
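A hedged sketch of one backwards-compatible option: try the old divvy.logging import and fall back when it is absent. Falling back to the stdlib logging module is an assumption here; what ngs_toolkit should actually use instead is an open question.

```python
# divvy >= 0.4 dropped divvy.logging; catch the ImportError and fall back.
try:
    from divvy import logging as divvy_logging  # divvy < 0.4
except ImportError:
    import logging as divvy_logging  # fallback (assumption: stdlib logging suffices)

logger = divvy_logging.getLogger("ngs_toolkit")
```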

Package name

name         pro            con
ngs_toolkit  explicit       long; underscore/hyphen; polluted namespace
seeq         unique; small  already on PyPI; ambiguous
...

pybedtools - bedtool cannot turn into dataframe if NAs present in intersect result

toolkit/ngs_toolkit/atacseq.py

Lines 1294 to 1298 in 2e27787

annot = bed.intersect(
    states, wa=True, wb=True, f=frac, loj=True
)
try:
    annot = annot.to_dataframe(usecols=[0, 1, 2, 6])

e.g.

bed
chr1 9844 10460
chr1 180534 181797

states (lift-over from hg38, found in projects/pad-map/data/external/E032...etc.)
chr1 10000 10800 9_Het
chr1 10800 16000 15_Quies
chr1 16000 16200 1_TssA
chr1 16200 19000 5_TxWk
chr1 19000 96080 15_Quies
chr1 96276 96476 15_Quies
chr1 97276 177200 15_Quies

annot
chr1 9844 10460 chr1 10000 10800 9_Het
chr1 180534 181797 . -1 -1 .
chr1 585847 586452 chr1 586020 770220 15_Quies
chr1 629689 630568 chr1 586020 770220 15_Quies

Quick fix from Andre from earlier:

annot.to_dataframe(disable_auto_names=True, header=None, usecols=['a', 'b', 'c', 'g'], names=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
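The failure mode can be reproduced with pandas alone (no pybedtools needed): a left-outer-join intersect writes "." and -1 placeholders into some rows, and reading with explicit names so that label-based usecols works, mirroring the quick fix above, handles them. The column labels here are illustrative.

```python
import io
import pandas as pd

# Two rows of loj intersect output; the second row has no overlap placeholders.
loj = (
    "chr1\t9844\t10460\tchr1\t10000\t10800\t9_Het\n"
    "chr1\t180534\t181797\t.\t-1\t-1\t.\n"
)
# Naming all seven columns up front, then selecting by label, avoids the
# dtype/column-inference failure when NAs are present.
df = pd.read_csv(
    io.StringIO(loj), sep="\t", header=None,
    names=list("abcdefg"), usecols=["a", "b", "c", "g"],
)
print(df)
```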

R dataframe conversion bugs

df = pd.DataFrame(np.asarray(r_df)).T
df.columns = [str(x) for x in r_df.colnames]
df.index = [str(x) for x in r_df.rownames]

With my version of R (etc.), I had to change to the following to work

def r2pandas_df(r_df):
    import numpy as np
    df = pd.DataFrame(np.asarray(r_df))  # No transposition
    df.columns = [s for s in r_df.columns.tolist()]
    df.index = [s for s in r_df.index.tolist()]
    return df

Perhaps this entire function is deprecated? I didn't really find a need to convert an R data frame to a pandas DataFrame.
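One version-tolerant sketch (an assumption, not the library's actual fix): with rpy2's pandas2ri conversion active the object arriving here may already be a DataFrame, while older rpy2 objects expose .colnames/.rownames and arrive column-major. Checking for that up front avoids the transpose guesswork.

```python
import numpy as np
import pandas as pd

def r2pandas_df(r_df):
    if isinstance(r_df, pd.DataFrame):
        return r_df  # already converted (e.g. pandas2ri active), nothing to do
    # Legacy rpy2 path: column-major array plus colnames/rownames attributes
    df = pd.DataFrame(np.asarray(r_df)).T
    df.columns = [str(x) for x in r_df.colnames]
    df.index = [str(x) for x in r_df.rownames]
    return df

already = pd.DataFrame({"a": [1.0, 2.0]})
```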

Recipes

Premade recipes of common analyses, ready to run.

Usage: projectmanager run --recipe="atac-seq_analysis" metadata/project_config.yaml

Bug in fisher test (differential overlap)

First, sorry I am not quoting the exact lines in ngs_toolkit; GitHub's loading of the page is perpetually stuck at "fetching commit history" for me for some reason, so I can't quote the lines directly.

Before I forget: the calculation of column "d" is wrong.

Currently: d = total regions (i.e. matrix_raw.shape[0]) - (size1 + size2 + intersect)

It should actually be:

d = total regions - (a + b + c)   (exclusively group1, exclusively group2, and intersect)

or alternatively:

d = total regions - (size1 - intersect) - (size2 - intersect) - intersect
===> hence d = total regions - size1 - size2 + intersect

I found the error by getting negative values for column d in my comparison, and I suspect this came from an order of operations issue.
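A quick numeric check of the correction above, with made-up counts: the current formula effectively double-subtracts the intersect and can go negative.

```python
total = 100
size1, size2, intersect = 40, 50, 20
a = size1 - intersect  # exclusively group1
b = size2 - intersect  # exclusively group2
c = intersect          # shared by both groups

d_wrong = total - (size1 + size2 + intersect)  # current code: can go negative
d_right = total - size1 - size2 + intersect    # proposed fix

print(d_wrong, d_right)  # -10 30
```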

Incorrect inference of multi-index in the case of non-str index

def fix_dataframe_header(df, force_dtypes=float):
    # First try to see whether it is not a MultiIndex after all
    if df.index.isna().any().sum() == 1:
        df.columns = df.loc[df.index.isna(), :].squeeze()
        df.columns.name = None
        df.index.name = None
        return df.loc[~df.index.isna()]
    # If so, get potential categorical columns and prepare MultiIndex with them
    cols = list()
    for i in df.index[:300]:
        try:
            df.loc[i, :].astype(float)
        except ValueError:
            cols.append(i)
    if len(cols) == 0:
        pass
    elif len(cols) == 1:
        df.columns = df.loc[cols[0]]
        df = df.loc[~df.index.isin(cols)]
    else:
        df.columns = pd.MultiIndex.from_arrays(df.loc[cols].values, names=cols)
        df = df.loc[~df.index.isin(cols)]
    df.index.name = None
    if force_dtypes is not None:
        return df.astype(force_dtypes)
    else:
        return df

Context:
Upon analysis.load_data() of matrix_norm, inference of MultiIndex levels excludes non-str levels, e.g. boolean entries (in my case, toggle/exclude) and numeric factors (e.g. cell_count), regardless of whether numeric attributes were designated in annotate_samples.

See below for what I mean by the bug:
In [41]: analysis.matrix_norm.index
Out[41]:
Index(['exclude', 'toggle', 'cell_count', 'chr1:9843-10840',
'chr1:15983-16492', 'chr1:180528-181802', 'chr1:191092-192113',
'chr1:267741-268289', 'chr1:585847-586457', 'chr1:605341-605844',
...
'chrY:56825538-56826039', 'chrY:56836469-56837142',
'chrY:56841935-56842436', 'chrY:56843501-56844002',
'chrY:56845972-56846473', 'chrY:56849062-56849563',
'chrY:56855527-56856028', 'chrY:56870612-56871192',
'chrY:56873351-56874110', 'chrY:56879337-56879838'],
dtype='object', length=211111)

analysis.collect_coverage - unsorted indexes when fast_and_unsafe = True

if fast_and_unsafe:
    _LOGGER.warning("Using a concatenation method that is not 100% safe.")
    matrix_raw = pd.concat(matrix_raw, axis=1, sort=False)
    matrix_raw = matrix_raw.loc[:, ~matrix_raw.columns.duplicated()].set_index(["chrom", "start", "end"])

For the consensus coverage bed of each sample, the index order of matrix_raw is not in natural sort order (the sample-specific coverage bed produced under distributed mode does not have its regions in natural sort order either).

See below:

top n of sample-specific bed

In [26]: bed = pd.read_csv(f'{analysis.data_dir}/{analysis.samples[1].name}/coverage/{analysis.samples[1].name}.peak_set_coverage.bed', sep='\t', header=None)

In [27]: bed.head()
Out[27]:
0 1 2 3
0 chr1 234880931 234882653 16
1 chr1 243268961 243270032 2
2 chr1 39845476 39846665 96
3 chr1 61865588 61866129 0
4 chr1 77594208 77594852 2

top n of matrix_raw in distributed=True vs. distributed = False or fast_and_unsafe = False

In [20]: df = analysis.measure_coverage(samples=s[0:2], assign=False, save=False)

In [21]: df.head()
Out[21]:
ATAC_KB_PID-1377_PBMC ATAC_KB_PID-1377_Tcell
chr1:9843-10840 18 26
chr1:15983-16492 3 1
chr1:180528-181802 18 18
chr1:191092-192113 8 5
chr1:267741-268289 8 4

In [22]: df = analysis.collect_coverage(samples=s[0:2], assign=False, save=False)
100%|█████████████████████████████████████████████████████| 2/2 [00:00<00:00, 18.73it/s]

In [23]: df.head()
Out[23]:
variable ATAC_KB_PID-1377_PBMC ATAC_KB_PID-1377_Tcell
region
chr1:9843-10840 18 26
chr1:15983-16492 3 1
chr1:180528-181802 18 18
chr1:191092-192113 8 5
chr1:267741-268289 8 4

matrix_raw if distributed and fast_and_unsafe are True

In [24]: df = analysis.collect_coverage(samples=s[0:2], assign=False, save=False, fast_and_unsafe=True)
100%|█████████████████████████████████████████████████████| 2/2 [00:00<00:00, 17.64it/s]
ngs_toolkit:atacseq:L857 (collect_coverage) [WARNING] > Using a concatenation method that is not 100% safe.

In [25]: df.head()
Out[25]:
ATAC_KB_PID-1377_PBMC ATAC_KB_PID-1377_Tcell
region
chr1:234880931-234882653 12 16
chr1:243268961-243270032 3 2
chr1:39845476-39846665 24 96
chr1:61865588-61866129 1 0
chr1:77594208-77594852 5 2
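Restoring natural sort order after the fast concatenation could be sketched as below. The "chr:start-end" region format is taken from the outputs above; this is a standalone illustration, not the ngs_toolkit implementation.

```python
import re

def natural_key(region):
    """Sort key for 'chr:start-end' strings: numeric chromosomes first, then start."""
    chrom, coords = region.split(":")
    start = int(coords.split("-")[0])
    m = re.match(r"chr(\d+)$", chrom)
    # Numeric chromosomes sort numerically; chrX/chrY/etc. sort lexically after them
    return (0, int(m.group(1)), start) if m else (1, chrom, start)

regions = ["chr1:234880931-234882653", "chr1:9843-10840", "chrY:56825538-56826039"]
print(sorted(regions, key=natural_key))
```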

Inconsistent shape of region_annotation vs. region_annotation_b

>>> a.get_peak_genomic_location()

Running analysis.get_peak_genomic_location() leads to inconsistent number of regions between background and real peak set.

Dataset: projects/pad-map (feel free to use)

See:

In [15]: [f'{i}: shape {getattr(analysis, i).shape}' for i in analysis.__dict__ if isinstance(getattr(analysis, i), pd.core.frame.DataFrame)]
Out[15]:
['comparison_table: shape (425, 5)',
'gene_annotation: shape (211108, 6)',
'closest_tss_distances: shape (211468, 6)',
'region_annotation: shape (211105, 4)', <-----
'region_annotation_b: shape (211104, 4)', <----
'region_annotation_mapping: shape (395590, 4)',
'region_annotation_b_mapping: shape (339561, 4)',
'chrom_state_annotation: shape (211108, 4)',
'chrom_state_annotation_b: shape (211108, 4)',
'chrom_state_annotation_mapping: shape (250895, 4)',
'chrom_state_annotation_b_mapping: shape (223218, 4)']

Bug in structure of for loop iterating through gene_set_librar(ies)

# pivot table
if ("correlation" not in plot_types) and ("heatmap" not in plot_types):
    return

Structure currently as follows:

for gene_set_library in gene_set_libraries:
    plot barplot
    plot scatterplot
    if extra steps (i.e. heatmap, correlation) not in steps:
        return  # don't continue to pivot

Hence, if heatmap and/or correlation are not in steps, the barplots/scatterplots are plotted only for the very first gene_set_library, not for all of them. This is a problem for me because I currently want a different top_n for the barplot than for the rest. A simple fix is to make the check at "if extra steps..." a positive condition rather than a negation.
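The positive-check restructuring could look like this sketch. The plot_* entries are placeholders standing in for the actual plotting calls, not ngs_toolkit functions: per-library plots always run, and only the pivot/heatmap work is gated on the requested plot types.

```python
def plot_enrichments(gene_set_libraries, plot_types):
    plotted = []
    for lib in gene_set_libraries:
        plotted.append(("barplot", lib))       # always plotted, for every library
        plotted.append(("scatterplot", lib))
        if ("correlation" in plot_types) or ("heatmap" in plot_types):
            plotted.append(("heatmap", lib))   # pivot + heatmap only when requested
    return plotted

out = plot_enrichments(["GO", "KEGG"], plot_types=["barplot"])
```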

Motif vs. Meme

def collect_differential_enrichment(
    self,
    steps=["region", "lola", "motif", "homer", "homer_consensus", "enrichr"],
    directional=True,
    permissive=True,
    output_dir="{results_dir}/differential_analysis_{data_type}/enrichments",
    input_prefix="differential_analysis",
    output_prefix="differential_analysis",
    differential=None,
):
    """
    Collect the results of enrichment analysis ran after a differential analysis.

    Parameters
    ----------
    steps : :obj:`list`, optional
        Steps of the enrichment analysis to collect results for.
        Defaults to ["region", "lola", "meme", "homer", "enrichr"].

Is there a reason that differential_enrichment recognizes "meme" while collect_differential_enrichment recognizes "motif" as one of the steps? The two are also used somewhat interchangeably within collect_differential_enrichment.

ATACAnalysis feature request - make chrom_state_file / build more explicit

It would be really nice if the statement of which regulatory build was used were more explicit in chrom_state_file.
For example - this could mean:
results/analysis.chrom_state_annotation.csv --> results/analysis.E002.chrom_state_annotation.csv
attribute E002_chrom_state_annotation for analysis object

Pros:

  • more immediately apparent in the results directory
  • allows multiple chromatin-state builds to be analyzed for mixed samples (e.g. PBMCs)

Cons:

  • too specialized?
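The proposed naming can be sketched in a couple of lines, with the build ID ("E002", from the example above) embedded in both the CSV name and the analysis attribute:

```python
build = "E002"  # regulatory-build / epigenome ID, as in the example above
csv_name = f"analysis.{build}.chrom_state_annotation.csv"
attr_name = f"{build}_chrom_state_annotation"
print(csv_name, attr_name)
```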

numpy version might be too low

Hi Andre,
With numpy version=1.15.0, while running ATACSeqAnalysis('blah').get_peak_gene_annotation(tss='different blah'), I encountered:
ModuleNotFoundError - No module named 'numpy.core._multiarray_umath'

A bit of googling and tab completion revealed that version 1.15 of numpy does not have this module (obviously) and this was fixed by upgrading to 1.16.

Thought I'd let you know (I checked the dev version and didn't see a change in the requirements).
Cheers,
F
