Code Monkey home page Code Monkey logo

gse84920.parser's Introduction

GSE84920.parser: Parsers for the GSE84920 Hi-C Data Sets of Ramani et al. (2017)

This R package provides a shell script and an R API for working with the Ramani et al. (2017) Hi-C data set.

The Ramani data set is published on NCBI's Gene Expression Omnibus (GEO) in the GEO series GSE84920 (titled 'Massively multiplex single-cell Hi-C'), which contains:

GEO Sample GEO Title Cell Types
GSM2254215 Combinatorial scHi-C Library ML1 human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski')
GSM2254216 Combinatorial scHi-C Library ML2 human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski')
GSM2254217 Combinatorial scHi-C Library ML3 human ('GM12878', 'K562'), mouse ('MEF', 'Patski')
GSM2254218 Combinatorial scHi-C Library PL1 human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski')
GSM2254219 Combinatorial scHi-C Library PL2 human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski')
GSM2438426** Combinatorial scHi-C Library ML4 human ('Asynchronous', 'Nocadazole'), mouse ('Patski')

Data

The subsetted example GSM2254215_ML1 files in the system.file("extdata", package = "GSE84920.parser") folder were obtained by first downloading relevant GEO files and truncating using the following Bash script:

url_path="https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM2254215&format=file"
sample=GSM2254215_ML1
types=(percentages validPairs assignments)
tmpfile=$(mktemp)
mkdir -p inst/extdata/
for type in "${types[@]}"; do
  src=$sample.$type.txt.gz
  dest=inst/extdata/$sample.rows=1-1000.$type.txt
  curl "$url_path&file=$src" -o "$tmpfile"
  zcat "$tmpfile" | head -1000 > "$dest"
  gzip "$dest"
done
rm -- "$tmpfile"

R API

This R package provides functions for reading the above GEO data set, and other data with the same file formats, into R. It also includes a small subset of the ML1 data for the purpose of illustrating and testing the different functions:

> library(GSE84920.parser)
> path <- system.file("extdata", package = "GSE84920.parser")
> dir(path)
[1] "combo_hg19_mm10.genomesize"                   
[2] "GSM2254215_ML1.rows=1-1000_assignments.txt.gz"
[3] "GSM2254215_ML1.rows=1-1000.percentages.txt.gz"
[4] "GSM2254215_ML1.rows=1-1000.validPairs.txt.gz" 

Lets look at the content of each of these *.txt.gz files:

> file <- file.path(path, "GSM2254215_ML1.rows=1-1000_assignments.txt.gz")
> a <- read_assignments(file)
> a
# A tibble: 1,000 x 4
   readname                   left_inner_barcode right_inner_barcouter_barcode
   <chr>                      <chr>              <chr>             <chr>        
 1 D00584:136:HMTLJBCXX:1:11CTCTCACG           CTCTCACG          TCAGATGC     
 2 D00584:136:HMTLJBCXX:1:11GCACCATG           GCACCATG          GTGTAGCA     
 3 D00584:136:HMTLJBCXX:1:11AGGTGCGA           AGGTGCGA          GTATCTAT     
 4 D00584:136:HMTLJBCXX:1:11GCCTTAGG           GCCTTAGG          CAGCATAT     
 5 D00584:136:HMTLJBCXX:1:11CACCTGTG           CACCTGTG          TACTAAGC     
 6 D00584:136:HMTLJBCXX:1:11CCGCTACG           CCGCTACG          CAGCATAT     
 7 D00584:136:HMTLJBCXX:1:11GCCTCGAA           GCCTCGAA          GTATCTAT     
 8 D00584:136:HMTLJBCXX:1:11CTGGTCAC           CTGGTCAC          TTGACCAT     
 9 D00584:136:HMTLJBCXX:1:11CTGCGTAG           CTGCGTAG          TATCTTGT     
10 D00584:136:HMTLJBCXX:1:11CACGACCT           CACGACCT          GATGATCC     
# … with 990 more rows
> p <- read_percentages(file)
> p
# A tibble: 1,000 x 16
   hg19_frac mm10_frac hg19_count mm10_count pair_count inner_barcode outer_barcode is_observed Col10 dpnii_1x
       <dbl>     <dbl>      <int>      <int>      <int> <chr>         <chr>         <chr>       <chr>    <int>
 1     1.000 0.0000356      28050          1      28052 ACCACCAC      TCAGATGC      True        All      55603
 2     0.680 0.320          19081       8970      28052 ACCACCAC      TCAGATGC      Randomized  All      55603
 3     1.000 0.0000370      27010          1      28052 ACCACCAC      TCAGATGC      True        Long     55603
 4     0.683 0.317          18444       8555      28052 ACCACCAC      TCAGATGC      Randomized  Long     55603
 5     0     1                  0          1          1 CATAGCGC      ACTTGATA      True        All          2
 6     1     0                  1          0          1 CATAGCGC      ACTTGATA      Randomized  All          2
 7     0     1                  0          1          1 CATAGCGC      ACTTGATA      True        Long         2
 8     1     0                  1          0          1 CATAGCGC      ACTTGATA      Randomized  Long         2
 9     1     0                  2          0          2 GGCCGTTC      GCCATTAA      True        All          4
10     1     0                  2          0          2 GGCCGTTC      GCCATTAA      Randomized  All          4
# … with 990 more rows, and 6 more variables: dpnii_2x <int>, dpnii_3x <int>, dpnii_4x <int>,
#   cistrans_ratio <dbl>, hela_allele_frac <dbl>, celltype <chr>
> file <- file.path(path, "GSM2254215_ML1.rows=1-1000.validPairs.txt.gz")
> vp <- read_validpairs(file)
> vp
# A tibble: 1,000 x 17
   chr_a start_a  end_a chr_b start_b  end_b readname  col8  col9 col10 col11 inner_barcode outer_barcode
   <chr>   <int>  <int> <chr>   <int>  <int> <chr>    <int> <int> <chr> <chr> <chr>         <chr>        
 1 mous2.28e7 2.28e7 mous5.33e7 5.33e7 D00584:37    42 +     -     ATCCGCGG      GTCATAGT     
 2 huma8.86e7 8.86e7 huma8.91e7 8.91e7 D00584:42    42 -     -     GAGGAGCA      CGATGACA     
 3 huma1.27e8 1.27e8 huma1.27e8 1.27e8 D00584:42    42 +     +     GCTACGGT      AGTCGTAT     
 4 huma4.21e7 4.21e7 huma4.27e7 4.27e7 D00584:42    42 +     +     AGGTGCGA      ATACATGT     
 5 huma2.09e8 2.09e8 huma2.32e8 2.32e8 D00584:42    35 +     -     GCCTCGAA      GAGTACGT     
 6 huma5.57e6 5.57e6 huma1.50e8 1.50e8 D00584:42    42 +     -     GCTCGCTA      CTAGTGAA     
 7 mous4.12e7 4.12e7 mous4.12e7 4.12e7 D00584:42    42 +     -     GAGGAGCA      CGTTACTT     
 8 mous6.54e7 6.54e7 mous6.59e7 6.59e7 D00584:42    42 -     +     TCCGGACA      TGTCTGCA     
 9 huma1.82e8 1.82e8 huma1.82e8 1.82e8 D00584:42    42 +     -     CAGGCTTG      GATATAAC     
10 mous5.86e7 5.86e7 mous6.33e7 6.33e7 D00584:42    42 +     +     TCACGAGC      TGAGGCAA     
# … with 990 more rows, and 4 more variables: col14 <chr>, col15 <int>, col16 <chr>, col17 <int>

Reading the full Ramani HiC files can take quite a while, particularly the ones with "assignments" and "validPairs" data. Because of this, you might want to import these data (once) into a local database and work with the data from there. See help("import_assignments", package = "GSE84920.parser") for an example showing how to import into an SQLite database on file, which is easy since it requires zero setup.

Footnote:
**: The ML4 files are of slightly different file format than the other sets. Specifically, the *.validPairs.txt.gz file has an additional three columns appended at the end - GSE84920.parser::read_validpairs() will not be able to read those data because of that.

Shell Script API

Splitting whole-genome files into chromosome-pair files

The package also provides a Bash script for splitting a raw HiC-count data file into chromosome-pair files, e.g. HiC sequence data files in GSE35156 (Dixon, et al., 2012). This split.sh script is located under system.file("scripts", package = "GSE84920.parser").

$ split.sh 
Split HiC Count Data File into Chromosome-Pair Files

Usage:
 ./split.sh [options] path/to/*.summary.txt.gz

Options:
 --help       Display this help
 --version    Display the version
 --dryrun     Don't run anything
 --intraonly  Only process intra-chromosome pairs

Examples:
 ./split.sh --intraonly hicData/GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary.txt.gz

This outputs *.tsv.gz files:

  1. GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary_chr1_vs_chr1.tsv.gz
  2. GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary_chr2_vs_chr2.tsv.gz
...
 25. GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary_chrM_vs_chrM.tsv.gz

to folder hicData/GSE35156_GSM862720/ with names 

Details:
The produced files are names '<name>,<chr_i>_vs_<chr_j>.tsv.gz' where <name>
is the name of input '*.summary.txt.gz' file without the extension. The
files are written to folder 'hicData/<gse_id>_<gsm_id>/' where <gse_id> and
<gsm_id> are inferred from the <name>. This folder is automatically created,
if missing.

Version: 0.1.2
Copyright: Henrik Bengtsson (2017-2019)
License: GPL (>= 3.0)
Source: https://github.com/HenrikBengtsson/ramani

Comment: This split.sh script is unrelated to the Ramani et al. (2017) study. It might be better fitted for the TopDomData R package.

References

  1. Ramani, V., Deng, X., Qiu, R., Gunderson, K. L., Steemers, F. J., Disteche, C. M., ..., Shendure, J. (2017). Massively multiplex single-cell Hi-C. Nature methods, 14(3), 263–266. doi:10.1038/nmeth.4155, PMC5330809.

  2. Dixon JR, Selvaraj S, Yue F, Kim A, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 2012 Apr 11; 485(7398):376-80, doi: 10.1038/nature11082, PMCID: PMC3356448, PMID: 22495300.

Installation

R package GSE84920.parser is only available via GitHub and can be installed in R as:

remotes::install_github("HenrikBengtsson/GSE84920.parser", ref="master")

Pre-release version

To install the pre-release version that is available in Git branch develop on GitHub, use:

remotes::install_github("HenrikBengtsson/GSE84920.parser", ref="develop")

This will install the package from source.

Contributions

This Git repository uses the Git Flow branching model (the git flow extension is useful for this). The develop branch contains the latest contributions and other code that will appear in the next release, and the master branch contains the code of the latest release.

Contributing to this package is easy. Just send a pull request. When you send your PR, make sure develop is the destination branch on the GSE84920.parser repository. Your PR should pass R CMD check --as-cran, which will also be checked by Travis CI and AppVeyor CI when the PR is submitted.

Software status

Resource GitHub GitHub Actions Travis CI AppVeyor CI
Platforms: Multiple Multiple Linux & macOS Windows
R CMD check Build status Build status
Test coverage Coverage Status

gse84920.parser's People

Contributors

henrikbengtsson avatar

Watchers

 avatar  avatar  avatar

gse84920.parser's Issues

Use a less generic package name

I picked the package name 'ramani' only during a study project because it was easy to use and remember and because it refers to the data set used in well-known Ramani et al. (2017) publication.

However, since this package is very specific to the GSE84920 data set and because someone else might need the name 'ramani' for an R package, I think this package should be renamed to it's slightly less generic. Maybe some version of:

  • ramani.2017
  • ramani.hic.2017
  • ramani.gse84920
  • gse84920.io
  • gse84920.parser
  • GSE84920Parser
    ?

R packages can only contain symbols a-z, A-Z, 0-9, and period.

README: Illustrate what data look like

Illustrate what data look like, e.g.

> data <- readRDS("inst/compiledData/human,HAP1,unique,chr=22.rds")
> tibble::as_tibble(data)
# A tibble: 112,704 x 9
   chr_a  start_a    end_a chr_b start_b   end_b celltype cell_id      name     
   <chr>    <int>    <int> <chr>   <int>   <int> <chr>    <fct>        <chr>    
 1 22    16304723 16304853 22     1.64e7  1.64e7 HAP1     GGTCAGTG-TGGSM225422 22    16344591 16344666 22     1.71e7  1.71e7 HAP1     AAGCCGGT-CTGSM225423 22    16357581 16357715 22     1.77e7  1.77e7 HAP1     TCGACTGC-TTGSM225424 22    16433346 16433395 22     1.71e7  1.71e7 HAP1     ACCACCAC-TGGSM225425 22    16433811 16433879 22     1.71e7  1.71e7 HAP1     TTGTGCCG-CGGSM225426 22    16499667 16499748 22     1.75e7  1.75e7 HAP1     CGCGCAAT-CTGSM225427 22    16551301 16551348 22     2.19e7  2.19e7 HAP1     CGACATGG-CAGSM225428 22    16554345 16554502 22     1.79e7  1.79e7 HAP1     AACGGTCG-TGGSM225429 22    16848715 16848855 22     1.69e7  1.69e7 HAP1     AAGCCGGT-AAGSM2254210 22    16852229 16852468 22     1.69e7  1.69e7 HAP1     GCTGAGAC-CCGSM22542# … with 112,694 more rows

Parsing errors: 17 cols instead of 20 cols for GSM2438426_ML4.validPairs.txt.gz

Warning message:                                                                                                                                        
35304186 parsing failures.                                                                                                                              
row col   expected     actual                                                file                                                                       
  1  -- 17 columns 20 columns 'hicData/GSE84920/GSM2438426_ML4.validPairs.txt.gz'                                                                       
  2  -- 17 columns 20 columns 'hicData/GSE84920/GSM2438426_ML4.validPairs.txt.gz'                                                                       
  3  -- 17 columns 20 columns 'hicData/GSE84920/GSM2438426_ML4.validPairs.txt.gz'                                                                       
  4  -- 17 columns 20 columns 'hicData/GSE84920/GSM2438426_ML4.validPairs.txt.gz'                                                                       
  5  -- 17 columns 20 columns 'hicData/GSE84920/GSM2438426_ML4.validPairs.txt.gz'                                                                       
... ... .......... .......... ...................................................                                                                       
See problems(...) for more details. 

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.