When trying to follow the <a href="https://nbviewer.jupyter.org/github/colomemaria/epi

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Problem building scATAC-seq count matrice from bams about episcanpy HOT 13 CLOSED

colomemaria commented on August 26, 2024

Problem building scATAC-seq count matrice from bams

from episcanpy.

Comments (13)

DaneseAnna commented on August 26, 2024

Hi,
I haven't seen this error before, but I am curious to figure it out. I am trying to reproduce your error.
You used the bam files from Buenrostro et al. 2018 or did you aligned the data yourself? I am wondering if there is a difference in the bam header.

Does it breaks right away ?

from episcanpy.

hildemann commented on August 26, 2024

Thanks for the quick support!
Yes I used the bam files from Buenrostro et al. 2018 however downloaded them without headers (sam-dump default options).
Other than that the bam files are sorted and indexed. Are the headers needed for the tutorial?
Yes it breaks right away.

from episcanpy.

DaneseAnna commented on August 26, 2024

Hi, sorry for the delay!
I managed to reproduce your error but I don't know what is wrong, yet.

Using sam-dump I downloaded a very strange one of your bam file. However the format is ver strange (not the usual sample layout). Can you check if your bam file look similar to mine ? It might be the reason why it doesn't work.

head_SRR5356154.txt

from episcanpy.

hildemann commented on August 26, 2024

Hi,
yes, it looks identical. Yes it's a very odd layout, which I didn't notice until now.

from episcanpy.

hildemann commented on August 26, 2024

Hi Anna,

is there any news on this issue, has the source been indentified yet? Or maybe put differently: What would be your way of downloading the Buenrostro bams and integrating them into Episcanpy?
Or do you suspect the problem to be in the format of the data?

Thanks a lot for your help,

from episcanpy.

DaneseAnna commented on August 26, 2024

Hi,

Sorry, I did not think about it anymore as the problem seems to be the file format when downloaded from NCBI. There is a couple of options available to solve this, one option is to run re-aligned the data but it is long and fastidious. Another option is to download the aligned bam files from Chen et al. 2019 benchmarking (https://github.com/pinellolab/scATAC-benchmarking). They make the data available to download here https://www.dropbox.com/sh/8o8f0xu6cvr46sm/AAB6FMIDvHqnG6h7athgcm5-a/Buenrostro_2018.tar.gz?dl=0

Just one more thing. This dataset is plate based and it produce one cell per bam file. Thus it takes up a lot of memory and take a long time to build. Other methods available like the one from 10x Genomics produce multiplex bam files that can be used to build larger count matrices in a easier way.

I hope it helps, we already build matrices using the Buenrostro dataset. I can share a script with you if you need.

from episcanpy.

hildemann commented on August 26, 2024

Hi Anna,

thanks a lot for pointing out this github repository. Now having looked into the Buenrostro_2018.tar.gz, i tried to work with two of the bam files as followed (the rest of notebook is identical to your tutorial):

path_to_play_data ='~/buenrostroBams/sc-bams_nodup/'
file_annot_name = '~/samData/GSE96769_PeakFile_20160207.bed'

As annotation file i used the original datasets Peakfile from ncbi.
At first only trying on two of the bam files
# list of the bam files you want to build a count matrix for
list_cells =['BM1077-CLP-Frozen-160106-13.dedup.st.bam', 'BM1077-CLP-Frozen-160106-14.dedup.st.bam',]

peaks = epi.ct.load_features(file_annot_name)
peaks_names = epi.ct.name_features(peaks)

And eventually the command:
epi.ct.bld_atac_mtx(list_bam_files=list_cells, loaded_feat=peaks, output_file_name='test_ATAC_mtx.txt', path=path_to_play_data, writing_option='w', header=peaks_names)

Results yet again in :
`---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in
----> 1 epi.ct.bld_atac_mtx(list_bam_files=list_cells,
2 loaded_feat=peaks,
3 output_file_name='test_ATAC_mtx.txt',
4 path=path_to_play_data,
5 writing_option='w',

~/anaconda3/envs/bsh/lib/python3.8/site-packages/episcanpy/count_matrix/_atac_mtx.py in bld_atac_mtx(list_bam_files, loaded_feat, output_file_name, path, writing_option, header, mode, check_sq, chromosomes)
94 #for read in samfile.fetch(until_eof=True):
95 for read in samfile:
---> 96 line = str(read).split('\t')
97 if line[2][3:] in chromosomes:
98 keep_lines.append(line[2:4])

~/anaconda3/envs/bsh/lib/python3.8/site-packages/bamnostic/core.py in str(self)
462
463 def str(self):
--> 464 return self.repr()
465
466 def _range_popper(self, interval_start, interval_stop=None, front=True):

~/anaconda3/envs/bsh/lib/python3.8/site-packages/bamnostic/core.py in repr(self)
444 self.read_name,
445 "{}".format(self.flag),
--> 446 "{}".format(self.reference_name, self.tid),
447 "{}".format(self.pos + 1),
448 "{}".format(self.mapq),

AttributeError: 'AlignedSegment' object has no attribute 'reference_name'`

Now does this have to do with the Annotation Peak file? I thought coming from the official ncbi page there should be nothing wrong with that.

However, coming back to your kind offer it would be great if you were to share the script, with which you build the matrices from the dataset. Probably that would make things much clearer for me

from episcanpy.

DaneseAnna commented on August 26, 2024

Hey, sorry for the delay. It has been a couple of very busy days.

I think it is still the Buenrostro header that is a little weird.
You can either use the pybedtool branch and the following script:

!pip install git+https://github.com/colomemaria/epiScanpy@pybedtools
import episcanpy.api as epi
windows = epi.ct.make_windows(5000)
bamfile = ['/Users/annadanese/Desktop/BM1077-CMP-Frozen-160106-83.dedup.st.bam']
epi.ct.bld_atac_mtx_pybedtools(list_bam_files=bamfile,
loaded_feat=windows,
output_file_name='test.h5ad')

OR: we used the following script when we worked with these data. https://www.dropbox.com/s/i4hbhvzns5sfd8y/run_buenrostro_1_build_array.sh?dl=0

Let me know if any of it is working !

Best,
Anna

from episcanpy.

hildemann commented on August 26, 2024

Hi @DaneseAnna ,

no worries, your help has been excellent. So after trying out what you suggested and working around a few minor problems i now have a anndata object with correct objects and the properties:

"AnnData object with n_obs × n_vars = 3 × 491436
obs: 'cell_name', 'label', 'cell_type', 'facs_label'
uns: 'omic'"

Created on three bam files. All the objects are as they should be however, editing variables or loading the gene/transcript annotation doesn't work. The variables are still the numbers 0 to 491435, which makes it hard for me to do an analysis with the object.

I downloaded the data as suggested in tutorial with:
!wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz -O data/data_tutorial_buenrostro/gencode.v19.annotation.gtf.gz
!cd data/data_tutorial_buenrostro; gunzip gencode.v19.annotation.gtf

Then running the command (copied from the tutorial under section "Load gene/transcript annotation"):
epi.tl.find_genes(adata, gtf_file='data/data_tutorial_buenrostro/gencode.v19.annotation.gtf', key_added='transcript_annotation', upstream=2000, feature_type='transcript', annotation='HAVANA', raw=False)

Threw

IndexError Traceback (most recent call last)
in
----> 1 epi.tl.find_genes(adata,
2 gtf_file='data/data_tutorial_buenrostro/gencode.v19.annotation.gtf',
3 key_added='transcript_annotation',
4 upstream=2000,
5 feature_type='transcript',

~/anaconda3/envs/bsh/lib/python3.8/site-packages/episcanpy/tools/find_genes.py in find_genes(adata, gtf_file, key_added, upstream, feature_type, annotation, raw)
43 line = line.split('')
44 if line[0] not in raw_adata_features.keys():
---> 45 raw_adata_features[line[0]] = [[int(line[1]),int(line[2]), feature_index]]
46 else:
47 raw_adata_features[line[0]].append([int(line[1]),int(line[2]), feature_index])

IndexError: list index out of range

Is this a familiar error to you? Or do you know why it is thrown?

Thanks a lot for the help

from episcanpy.

DaneseAnna commented on August 26, 2024

Hi, sorry ! I did not see your message.
Which variable did you use? windows = epi.ct.make_windows(5000) or something else?
Theoretically, the name of the variable, if specified, should be saved in adata.var_names.

Best,
Anna

from episcanpy.

hildemann commented on August 26, 2024

No worries @DaneseAnna
Well, i tried to specify the variable names by giving it:

"file_annot_name = '/nfs/home/students/vhildemann/samData/GSE96769_PeakFile_20160207.bed'
peaks = epi.ct.load_features(file_annot_name)
peaks_names = epi.ct.name_features(peaks)"

and passing option "loaded_feat=peaks" into the "bld_atac_mtx_pybedtools" command

However still there are these numbers only & adding transcript annotation fails, bc of the index error.

adata.var_names gives me:

Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
...
'491426', '491427', '491428', '491429', '491430', '491431', '491432',
'491433', '491434', '491435'],
dtype='object', length=491436)

Thanks

from episcanpy.

DaneseAnna commented on August 26, 2024

Ok, this shouldn't be. I think it is an error specific to the pybedtool function. I will fix it.
In the mean time you can change it by writing adata.var_names = peaks. That should fix the problem as long as they are the same length.
Then, epi.tl.find_genes should work if you have variable names. However it takes a long time to run and you might prefer to use it on a smaller feature space. Or, you can build a gene activity matrix that will give you a new AnnData object with genes as the feature space. It's api.tl.geneactivity

Best,
Anna

from episcanpy.

hildemann commented on August 26, 2024

Hi @DaneseAnna ,

thank you for the suggestions ! They solved most of the problems. Finally the analysis is running almost as it should be. Now what is a interpretational question i couldn't solve on my own is, in the tutorial:
What the function 'epi.pp.normalize_total(adata)' does and how it changes the visualizations of the data?
( umaps line 42 compared to umaps line 49)
In the documentation i couldn't find a clear explanation on that.

However apart from that, still the last command: 'epi.pl.rank_feat_groups_matrixplot(adata)'
is throwing:

"Warning: dendogram data not found(using key=dendogram_louvain).Running 'sc.tl.dendogram' with default parameters.
For fine tuning it is recommended to run 'sc.tl.dendogram' independently. "

Followed by:

ValueError Traceback (most recent call last)
in
----> 1 epi.pl.rank_feat_groups_matrixplot(adata)

~/anaconda3/envs/bsh/lib/python3.8/site-packages/episcanpy/plotting/_scanpy_plotting.py in rank_feat_groups_matrixplot(adata, groups, n_features, groupby, key, show, save)
468 Are passed to :func:~scanpy.pl.matrixplot.
469 """
--> 470 sc.pl.rank_genes_groups_matrixplot(adata, groups, n_genes=n_features, groupby=groupby,
471 key=key, show=show, save=save)
472

~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_tools/init.py in rank_genes_groups_matrixplot(adata, groups, n_genes, groupby, key, show, save, **kwds)
637 """
638
--> 639 _rank_genes_groups_plot(
640 adata,
641 plot_type='matrixplot',

~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_tools/init.py in _rank_genes_groups_plot(adata, plot_type, groups, n_genes, groupby, key, show, save, **kwds)
406 elif plot_type == 'matrixplot':
407 from .._anndata import matrixplot
--> 408 matrixplot(adata, gene_names, groupby, var_group_labels=group_names,
409 var_group_positions=group_positions, show=show, save=save, **kwds)
410

~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_anndata.py in matrixplot(adata, var_names, groupby, use_raw, log, num_categories, figsize, dendrogram, gene_symbols, var_group_positions, var_group_labels, var_group_rotation, layer, standard_scale, swap_axes, show, save, **kwds)
2181
2182 if dendrogram:
-> 2183 dendro_data = _reorder_categories_after_dendrogram(
2184 adata,
2185 groupby,

~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_anndata.py in _reorder_categories_after_dendrogram(adata, groupby, dendrogram, var_names, var_group_labels, var_group_positions)
3109 """
3110
-> 3111 key = _get_dendrogram_key(adata, dendrogram, groupby)
3112
3113 dendro_info = adata.uns[key]

~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_anndata.py in _get_dendrogram_key(adata, dendrogram_key, groupby)
3195 "tuning it is recommended to run sc.tl.dendrogram independently."
3196 )
-> 3197 dendrogram(adata, groupby, key_added=dendrogram_key)
3198
3199 if 'dendrogram_info' not in adata.uns[dendrogram_key]:

~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/tools/_dendrogram.py in dendrogram(adata, groupby, n_pcs, use_rep, var_names, use_raw, cor_method, linkage_method, optimal_ordering, key_added, inplace)
130 corr_matrix, method=linkage_method, optimal_ordering=optimal_ordering
131 )
--> 132 dendro_info = sch.dendrogram(z_var, labels=categories, no_plot=True)
133
134 # order of groupby categories

~/anaconda3/envs/bsh/lib/python3.8/site-packages/scipy/cluster/hierarchy.py in dendrogram(Z, p, truncate_mode, color_threshold, get_leaves, orientation, labels, count_sort, distance_sort, show_leaf_counts, no_plot, no_labels, leaf_font_size, leaf_rotation, leaf_label_func, show_contracted, link_color_func, ax, above_threshold_color)
3275 "'bottom', or 'right'")
3276
-> 3277 if labels and Z.shape[0] + 1 != len(labels):
3278 raise ValueError("Dimensions of Z and labels must be consistent.")
3279

~/anaconda3/envs/bsh/lib/python3.8/site-packages/pandas/core/indexes/base.py in nonzero(self)
2147
2148 def nonzero(self):
-> 2149 raise ValueError(
2150 f"The truth value of a {type(self).name} is ambiguous. "
2151 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Is this a common error? Both running 'sc.tl.dendogram' or 'a.all()' independently, hasn't worked for me.

Thanks a lot!

from episcanpy.

Problem building scATAC-seq count matrice from bams about episcanpy HOT 13 CLOSED

Comments (13)

IndexError: list index out of range

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent