Comments (13)
Hi,
I haven't seen this error before, but I am curious to figure it out. I am trying to reproduce your error.
You used the bam files from Buenrostro et al. 2018 or did you aligned the data yourself? I am wondering if there is a difference in the bam header.
Does it breaks right away ?
from episcanpy.
Thanks for the quick support!
Yes I used the bam files from Buenrostro et al. 2018 however downloaded them without headers (sam-dump default options).
Other than that the bam files are sorted and indexed. Are the headers needed for the tutorial?
Yes it breaks right away.
from episcanpy.
Hi, sorry for the delay!
I managed to reproduce your error but I don't know what is wrong, yet.
Using sam-dump I downloaded a very strange one of your bam file. However the format is ver strange (not the usual sample layout). Can you check if your bam file look similar to mine ? It might be the reason why it doesn't work.
from episcanpy.
Hi,
yes, it looks identical. Yes it's a very odd layout, which I didn't notice until now.
from episcanpy.
Hi Anna,
is there any news on this issue, has the source been indentified yet? Or maybe put differently: What would be your way of downloading the Buenrostro bams and integrating them into Episcanpy?
Or do you suspect the problem to be in the format of the data?
Thanks a lot for your help,
from episcanpy.
Hi,
Sorry, I did not think about it anymore as the problem seems to be the file format when downloaded from NCBI. There is a couple of options available to solve this, one option is to run re-aligned the data but it is long and fastidious. Another option is to download the aligned bam files from Chen et al. 2019 benchmarking (https://github.com/pinellolab/scATAC-benchmarking). They make the data available to download here https://www.dropbox.com/sh/8o8f0xu6cvr46sm/AAB6FMIDvHqnG6h7athgcm5-a/Buenrostro_2018.tar.gz?dl=0
Just one more thing. This dataset is plate based and it produce one cell per bam file. Thus it takes up a lot of memory and take a long time to build. Other methods available like the one from 10x Genomics produce multiplex bam files that can be used to build larger count matrices in a easier way.
I hope it helps, we already build matrices using the Buenrostro dataset. I can share a script with you if you need.
from episcanpy.
Hi Anna,
thanks a lot for pointing out this github repository. Now having looked into the Buenrostro_2018.tar.gz, i tried to work with two of the bam files as followed (the rest of notebook is identical to your tutorial):
path_to_play_data ='~/buenrostroBams/sc-bams_nodup/'
file_annot_name = '~/samData/GSE96769_PeakFile_20160207.bed'
As annotation file i used the original datasets Peakfile from ncbi.
At first only trying on two of the bam files
# list of the bam files you want to build a count matrix for
list_cells =['BM1077-CLP-Frozen-160106-13.dedup.st.bam', 'BM1077-CLP-Frozen-160106-14.dedup.st.bam',]
peaks = epi.ct.load_features(file_annot_name)
peaks_names = epi.ct.name_features(peaks)
And eventually the command:
epi.ct.bld_atac_mtx(list_bam_files=list_cells, loaded_feat=peaks, output_file_name='test_ATAC_mtx.txt', path=path_to_play_data, writing_option='w', header=peaks_names)
Results yet again in :
`---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in
----> 1 epi.ct.bld_atac_mtx(list_bam_files=list_cells,
2 loaded_feat=peaks,
3 output_file_name='test_ATAC_mtx.txt',
4 path=path_to_play_data,
5 writing_option='w',
~/anaconda3/envs/bsh/lib/python3.8/site-packages/episcanpy/count_matrix/_atac_mtx.py in bld_atac_mtx(list_bam_files, loaded_feat, output_file_name, path, writing_option, header, mode, check_sq, chromosomes)
94 #for read in samfile.fetch(until_eof=True):
95 for read in samfile:
---> 96 line = str(read).split('\t')
97 if line[2][3:] in chromosomes:
98 keep_lines.append(line[2:4])
~/anaconda3/envs/bsh/lib/python3.8/site-packages/bamnostic/core.py in str(self)
462
463 def str(self):
--> 464 return self.repr()
465
466 def _range_popper(self, interval_start, interval_stop=None, front=True):
~/anaconda3/envs/bsh/lib/python3.8/site-packages/bamnostic/core.py in repr(self)
444 self.read_name,
445 "{}".format(self.flag),
--> 446 "{}".format(self.reference_name, self.tid),
447 "{}".format(self.pos + 1),
448 "{}".format(self.mapq),
AttributeError: 'AlignedSegment' object has no attribute 'reference_name'`
Now does this have to do with the Annotation Peak file? I thought coming from the official ncbi page there should be nothing wrong with that.
However, coming back to your kind offer it would be great if you were to share the script, with which you build the matrices from the dataset. Probably that would make things much clearer for me
from episcanpy.
Hey, sorry for the delay. It has been a couple of very busy days.
I think it is still the Buenrostro header that is a little weird.
You can either use the pybedtool branch and the following script:
!pip install git+https://github.com/colomemaria/epiScanpy@pybedtools
import episcanpy.api as epi
windows = epi.ct.make_windows(5000)
bamfile = ['/Users/annadanese/Desktop/BM1077-CMP-Frozen-160106-83.dedup.st.bam']
epi.ct.bld_atac_mtx_pybedtools(list_bam_files=bamfile,
loaded_feat=windows,
output_file_name='test.h5ad')
OR: we used the following script when we worked with these data. https://www.dropbox.com/s/i4hbhvzns5sfd8y/run_buenrostro_1_build_array.sh?dl=0
Let me know if any of it is working !
Best,
Anna
from episcanpy.
Hi @DaneseAnna ,
no worries, your help has been excellent. So after trying out what you suggested and working around a few minor problems i now have a anndata object with correct objects and the properties:
"AnnData object with n_obs × n_vars = 3 × 491436
obs: 'cell_name', 'label', 'cell_type', 'facs_label'
uns: 'omic'"
Created on three bam files. All the objects are as they should be however, editing variables or loading the gene/transcript annotation doesn't work. The variables are still the numbers 0 to 491435, which makes it hard for me to do an analysis with the object.
I downloaded the data as suggested in tutorial with:
!wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz -O data/data_tutorial_buenrostro/gencode.v19.annotation.gtf.gz
!cd data/data_tutorial_buenrostro; gunzip gencode.v19.annotation.gtf
Then running the command (copied from the tutorial under section "Load gene/transcript annotation"):
epi.tl.find_genes(adata, gtf_file='data/data_tutorial_buenrostro/gencode.v19.annotation.gtf', key_added='transcript_annotation', upstream=2000, feature_type='transcript', annotation='HAVANA', raw=False)
Threw
IndexError Traceback (most recent call last)
in
----> 1 epi.tl.find_genes(adata,
2 gtf_file='data/data_tutorial_buenrostro/gencode.v19.annotation.gtf',
3 key_added='transcript_annotation',
4 upstream=2000,
5 feature_type='transcript',
~/anaconda3/envs/bsh/lib/python3.8/site-packages/episcanpy/tools/find_genes.py in find_genes(adata, gtf_file, key_added, upstream, feature_type, annotation, raw)
43 line = line.split('')
44 if line[0] not in raw_adata_features.keys():
---> 45 raw_adata_features[line[0]] = [[int(line[1]),int(line[2]), feature_index]]
46 else:
47 raw_adata_features[line[0]].append([int(line[1]),int(line[2]), feature_index])
IndexError: list index out of range
Is this a familiar error to you? Or do you know why it is thrown?
Thanks a lot for the help
from episcanpy.
Hi, sorry ! I did not see your message.
Which variable did you use? windows = epi.ct.make_windows(5000) or something else?
Theoretically, the name of the variable, if specified, should be saved in adata.var_names.
Best,
Anna
from episcanpy.
No worries @DaneseAnna
Well, i tried to specify the variable names by giving it:
"file_annot_name = '/nfs/home/students/vhildemann/samData/GSE96769_PeakFile_20160207.bed'
peaks = epi.ct.load_features(file_annot_name)
peaks_names = epi.ct.name_features(peaks)"
and passing option "loaded_feat=peaks" into the "bld_atac_mtx_pybedtools" command
However still there are these numbers only & adding transcript annotation fails, bc of the index error.
adata.var_names gives me:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
...
'491426', '491427', '491428', '491429', '491430', '491431', '491432',
'491433', '491434', '491435'],
dtype='object', length=491436)
Thanks
from episcanpy.
Ok, this shouldn't be. I think it is an error specific to the pybedtool function. I will fix it.
In the mean time you can change it by writing adata.var_names = peaks. That should fix the problem as long as they are the same length.
Then, epi.tl.find_genes should work if you have variable names. However it takes a long time to run and you might prefer to use it on a smaller feature space. Or, you can build a gene activity matrix that will give you a new AnnData object with genes as the feature space. It's api.tl.geneactivity
Best,
Anna
from episcanpy.
Hi @DaneseAnna ,
thank you for the suggestions ! They solved most of the problems. Finally the analysis is running almost as it should be. Now what is a interpretational question i couldn't solve on my own is, in the tutorial:
What the function 'epi.pp.normalize_total(adata)' does and how it changes the visualizations of the data?
( umaps line 42 compared to umaps line 49)
In the documentation i couldn't find a clear explanation on that.
However apart from that, still the last command: 'epi.pl.rank_feat_groups_matrixplot(adata)'
is throwing:
"Warning: dendogram data not found(using key=dendogram_louvain).Running 'sc.tl.dendogram' with default parameters.
For fine tuning it is recommended to run 'sc.tl.dendogram' independently. "
Followed by:
ValueError Traceback (most recent call last)
in
----> 1 epi.pl.rank_feat_groups_matrixplot(adata)
~/anaconda3/envs/bsh/lib/python3.8/site-packages/episcanpy/plotting/_scanpy_plotting.py in rank_feat_groups_matrixplot(adata, groups, n_features, groupby, key, show, save)
468 Are passed to :func:~scanpy.pl.matrixplot
.
469 """
--> 470 sc.pl.rank_genes_groups_matrixplot(adata, groups, n_genes=n_features, groupby=groupby,
471 key=key, show=show, save=save)
472
~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_tools/init.py in rank_genes_groups_matrixplot(adata, groups, n_genes, groupby, key, show, save, **kwds)
637 """
638
--> 639 _rank_genes_groups_plot(
640 adata,
641 plot_type='matrixplot',
~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_tools/init.py in _rank_genes_groups_plot(adata, plot_type, groups, n_genes, groupby, key, show, save, **kwds)
406 elif plot_type == 'matrixplot':
407 from .._anndata import matrixplot
--> 408 matrixplot(adata, gene_names, groupby, var_group_labels=group_names,
409 var_group_positions=group_positions, show=show, save=save, **kwds)
410
~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_anndata.py in matrixplot(adata, var_names, groupby, use_raw, log, num_categories, figsize, dendrogram, gene_symbols, var_group_positions, var_group_labels, var_group_rotation, layer, standard_scale, swap_axes, show, save, **kwds)
2181
2182 if dendrogram:
-> 2183 dendro_data = _reorder_categories_after_dendrogram(
2184 adata,
2185 groupby,
~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_anndata.py in _reorder_categories_after_dendrogram(adata, groupby, dendrogram, var_names, var_group_labels, var_group_positions)
3109 """
3110
-> 3111 key = _get_dendrogram_key(adata, dendrogram, groupby)
3112
3113 dendro_info = adata.uns[key]
~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_anndata.py in _get_dendrogram_key(adata, dendrogram_key, groupby)
3195 "tuning it is recommended to run sc.tl.dendrogram
independently."
3196 )
-> 3197 dendrogram(adata, groupby, key_added=dendrogram_key)
3198
3199 if 'dendrogram_info' not in adata.uns[dendrogram_key]:
~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/tools/_dendrogram.py in dendrogram(adata, groupby, n_pcs, use_rep, var_names, use_raw, cor_method, linkage_method, optimal_ordering, key_added, inplace)
130 corr_matrix, method=linkage_method, optimal_ordering=optimal_ordering
131 )
--> 132 dendro_info = sch.dendrogram(z_var, labels=categories, no_plot=True)
133
134 # order of groupby categories
~/anaconda3/envs/bsh/lib/python3.8/site-packages/scipy/cluster/hierarchy.py in dendrogram(Z, p, truncate_mode, color_threshold, get_leaves, orientation, labels, count_sort, distance_sort, show_leaf_counts, no_plot, no_labels, leaf_font_size, leaf_rotation, leaf_label_func, show_contracted, link_color_func, ax, above_threshold_color)
3275 "'bottom', or 'right'")
3276
-> 3277 if labels and Z.shape[0] + 1 != len(labels):
3278 raise ValueError("Dimensions of Z and labels must be consistent.")
3279
~/anaconda3/envs/bsh/lib/python3.8/site-packages/pandas/core/indexes/base.py in nonzero(self)
2147
2148 def nonzero(self):
-> 2149 raise ValueError(
2150 f"The truth value of a {type(self).name} is ambiguous. "
2151 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is this a common error? Both running 'sc.tl.dendogram' or 'a.all()' independently, hasn't worked for me.
Thanks a lot!
from episcanpy.
Related Issues (20)
- Problems when generating matrix (episcanpy.ct.bld_mtx_fly())
- In DNA methylation tutorial, the data for matrices construction and secondary processing are not consistant. HOT 1
- 'IndexError: list index out of range' when running epi.tl.geneactivity() HOT 3
- numpy>=1.21.2 requires python >= 3.7 but install docs show python 3.6
- Error with normalize_total() HOT 2
- How to generate count matrix in Episcanpy
- Potential edit required in building count matrix (methylation) HOT 1
- Plans to include other QC parameters for scATAC-seq data?
- Future plans
- episcanpy.tl.silhouette calls sklearn.metrics.silhouette_score with wrong arguments HOT 1
- Error in Generating Methylation Count Matrices HOT 2
- Questions about running time for epi.tl.geneactivity
- installation should say python 3.7 else SyntaxError: future feature annotations is not defined
- Type in import of `tss_enrichment_score` HOT 1
- Import Error when Importing on Python 3.9
- Do we have method which can find tf-gene or peak-gene relation? HOT 1
- Problems in importing the api after PyPI installation HOT 2
- Bug in silhouette score calculation
- Error running "epi.ct.save_sparse_mtx" HOT 1
- Error when loading 10X Cellranger output with read_ATAC_10x() HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from episcanpy.