GSE84920.parser: Parsers for the GSE84920 Hi-C Data Sets of Ramani et al. (2017)

This R package provides a shell script and an R API for working with the Ramani et al. (2017) Hi-C data set.

The Ramani data set is published on NCBI's Gene Expression Omnibus (GEO) in the GEO series GSE84920 (titled 'Massively multiplex single-cell Hi-C'), which contains:

GEO Sample	GEO Title	Cell Types
GSM2254215	Combinatorial scHi-C Library ML1	human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski')
GSM2254216	Combinatorial scHi-C Library ML2	human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski')
GSM2254217	Combinatorial scHi-C Library ML3	human ('GM12878', 'K562'), mouse ('MEF', 'Patski')
GSM2254218	Combinatorial scHi-C Library PL1	human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski')
GSM2254219	Combinatorial scHi-C Library PL2	human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski')
GSM2438426**	Combinatorial scHi-C Library ML4	human ('Asynchronous', 'Nocadazole'), mouse ('Patski')

Data

The subsetted example GSM2254215_ML1 files in the system.file("extdata", package = "GSE84920.parser") folder were obtained by first downloading relevant GEO files and truncating using the following Bash script:

url_path="https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM2254215&format=file"
sample=GSM2254215_ML1
types=(percentages validPairs assignments)
tmpfile=$(mktemp)
mkdir -p inst/extdata/
for type in "${types[@]}"; do
  src=$sample.$type.txt.gz
  dest=inst/extdata/$sample.rows=1-1000.$type.txt
  curl "$url_path&file=$src" -o "$tmpfile"
  zcat "$tmpfile" | head -1000 > "$dest"
  gzip "$dest"
done
rm -- "$tmpfile"

R API

This R package provides functions for reading the above GEO data set, and other data with the same file formats, into R. It also includes a small subset of the ML1 data for the purpose of illustrating and testing the different functions:

> library(GSE84920.parser)
> path <- system.file("extdata", package = "GSE84920.parser")
> dir(path)
[1] "combo_hg19_mm10.genomesize"                   
[2] "GSM2254215_ML1.rows=1-1000_assignments.txt.gz"
[3] "GSM2254215_ML1.rows=1-1000.percentages.txt.gz"
[4] "GSM2254215_ML1.rows=1-1000.validPairs.txt.gz"

Lets look at the content of each of these *.txt.gz files:

> file <- file.path(path, "GSM2254215_ML1.rows=1-1000_assignments.txt.gz")
> a <- read_assignments(file)
> a
# A tibble: 1,000 x 4
   readname                   left_inner_barcode right_inner_barc… outer_barcode
   <chr>                      <chr>              <chr>             <chr>        
 1 D00584:136:HMTLJBCXX:1:11… CTCTCACG           CTCTCACG          TCAGATGC     
 2 D00584:136:HMTLJBCXX:1:11… GCACCATG           GCACCATG          GTGTAGCA     
 3 D00584:136:HMTLJBCXX:1:11… AGGTGCGA           AGGTGCGA          GTATCTAT     
 4 D00584:136:HMTLJBCXX:1:11… GCCTTAGG           GCCTTAGG          CAGCATAT     
 5 D00584:136:HMTLJBCXX:1:11… CACCTGTG           CACCTGTG          TACTAAGC     
 6 D00584:136:HMTLJBCXX:1:11… CCGCTACG           CCGCTACG          CAGCATAT     
 7 D00584:136:HMTLJBCXX:1:11… GCCTCGAA           GCCTCGAA          GTATCTAT     
 8 D00584:136:HMTLJBCXX:1:11… CTGGTCAC           CTGGTCAC          TTGACCAT     
 9 D00584:136:HMTLJBCXX:1:11… CTGCGTAG           CTGCGTAG          TATCTTGT     
10 D00584:136:HMTLJBCXX:1:11… CACGACCT           CACGACCT          GATGATCC     
# … with 990 more rows

> p <- read_percentages(file)
> p
# A tibble: 1,000 x 16
   hg19_frac mm10_frac hg19_count mm10_count pair_count inner_barcode outer_barcode is_observed Col10 dpnii_1x
       <dbl>     <dbl>      <int>      <int>      <int> <chr>         <chr>         <chr>       <chr>    <int>
 1     1.000 0.0000356      28050          1      28052 ACCACCAC      TCAGATGC      True        All      55603
 2     0.680 0.320          19081       8970      28052 ACCACCAC      TCAGATGC      Randomized  All      55603
 3     1.000 0.0000370      27010          1      28052 ACCACCAC      TCAGATGC      True        Long     55603
 4     0.683 0.317          18444       8555      28052 ACCACCAC      TCAGATGC      Randomized  Long     55603
 5     0     1                  0          1          1 CATAGCGC      ACTTGATA      True        All          2
 6     1     0                  1          0          1 CATAGCGC      ACTTGATA      Randomized  All          2
 7     0     1                  0          1          1 CATAGCGC      ACTTGATA      True        Long         2
 8     1     0                  1          0          1 CATAGCGC      ACTTGATA      Randomized  Long         2
 9     1     0                  2          0          2 GGCCGTTC      GCCATTAA      True        All          4
10     1     0                  2          0          2 GGCCGTTC      GCCATTAA      Randomized  All          4
# … with 990 more rows, and 6 more variables: dpnii_2x <int>, dpnii_3x <int>, dpnii_4x <int>,
#   cistrans_ratio <dbl>, hela_allele_frac <dbl>, celltype <chr>

> file <- file.path(path, "GSM2254215_ML1.rows=1-1000.validPairs.txt.gz")
> vp <- read_validpairs(file)
> vp
# A tibble: 1,000 x 17
   chr_a start_a  end_a chr_b start_b  end_b readname  col8  col9 col10 col11 inner_barcode outer_barcode
   <chr>   <int>  <int> <chr>   <int>  <int> <chr>    <int> <int> <chr> <chr> <chr>         <chr>        
 1 mous…  2.28e7 2.28e7 mous…  5.33e7 5.33e7 D00584:…    37    42 +     -     ATCCGCGG      GTCATAGT     
 2 huma…  8.86e7 8.86e7 huma…  8.91e7 8.91e7 D00584:…    42    42 -     -     GAGGAGCA      CGATGACA     
 3 huma…  1.27e8 1.27e8 huma…  1.27e8 1.27e8 D00584:…    42    42 +     +     GCTACGGT      AGTCGTAT     
 4 huma…  4.21e7 4.21e7 huma…  4.27e7 4.27e7 D00584:…    42    42 +     +     AGGTGCGA      ATACATGT     
 5 huma…  2.09e8 2.09e8 huma…  2.32e8 2.32e8 D00584:…    42    35 +     -     GCCTCGAA      GAGTACGT     
 6 huma…  5.57e6 5.57e6 huma…  1.50e8 1.50e8 D00584:…    42    42 +     -     GCTCGCTA      CTAGTGAA     
 7 mous…  4.12e7 4.12e7 mous…  4.12e7 4.12e7 D00584:…    42    42 +     -     GAGGAGCA      CGTTACTT     
 8 mous…  6.54e7 6.54e7 mous…  6.59e7 6.59e7 D00584:…    42    42 -     +     TCCGGACA      TGTCTGCA     
 9 huma…  1.82e8 1.82e8 huma…  1.82e8 1.82e8 D00584:…    42    42 +     -     CAGGCTTG      GATATAAC     
10 mous…  5.86e7 5.86e7 mous…  6.33e7 6.33e7 D00584:…    42    42 +     +     TCACGAGC      TGAGGCAA     
# … with 990 more rows, and 4 more variables: col14 <chr>, col15 <int>, col16 <chr>, col17 <int>

Reading the full Ramani HiC files can take quite a while, particularly the ones with "assignments" and "validPairs" data. Because of this, you might want to import these data (once) into a local database and work with the data from there. See help("import_assignments", package = "GSE84920.parser") for an example showing how to import into an SQLite database on file, which is easy since it requires zero setup.

Footnote:
**: The ML4 files are of slightly different file format than the other sets. Specifically, the *.validPairs.txt.gz file has an additional three columns appended at the end - GSE84920.parser::read_validpairs() will not be able to read those data because of that.

Shell Script API

Splitting whole-genome files into chromosome-pair files

The package also provides a Bash script for splitting a raw HiC-count data file into chromosome-pair files, e.g. HiC sequence data files in GSE35156 (Dixon, et al., 2012). This split.sh script is located under system.file("scripts", package = "GSE84920.parser").

$ split.sh 
Split HiC Count Data File into Chromosome-Pair Files

Usage:
 ./split.sh [options] path/to/*.summary.txt.gz

Options:
 --help       Display this help
 --version    Display the version
 --dryrun     Don't run anything
 --intraonly  Only process intra-chromosome pairs

Examples:
 ./split.sh --intraonly hicData/GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary.txt.gz

This outputs *.tsv.gz files:

  1. GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary_chr1_vs_chr1.tsv.gz
  2. GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary_chr2_vs_chr2.tsv.gz
...
 25. GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary_chrM_vs_chrM.tsv.gz

to folder hicData/GSE35156_GSM862720/ with names 

Details:
The produced files are names '<name>,<chr_i>_vs_<chr_j>.tsv.gz' where <name>
is the name of input '*.summary.txt.gz' file without the extension. The
files are written to folder 'hicData/<gse_id>_<gsm_id>/' where <gse_id> and
<gsm_id> are inferred from the <name>. This folder is automatically created,
if missing.

Version: 0.1.2
Copyright: Henrik Bengtsson (2017-2019)
License: GPL (>= 3.0)
Source: https://github.com/HenrikBengtsson/ramani

Comment: This split.sh script is unrelated to the Ramani et al. (2017) study. It might be better fitted for the TopDomData R package.

References

Ramani, V., Deng, X., Qiu, R., Gunderson, K. L., Steemers, F. J., Disteche, C. M., ..., Shendure, J. (2017). Massively multiplex single-cell Hi-C. Nature methods, 14(3), 263–266. doi:10.1038/nmeth.4155, PMC5330809.
Dixon JR, Selvaraj S, Yue F, Kim A, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 2012 Apr 11; 485(7398):376-80, doi: 10.1038/nature11082, PMCID: PMC3356448, PMID: 22495300.

Installation

R package GSE84920.parser is only available via GitHub and can be installed in R as:

remotes::install_github("HenrikBengtsson/GSE84920.parser", ref="master")

Pre-release version

To install the pre-release version that is available in Git branch develop on GitHub, use:

remotes::install_github("HenrikBengtsson/GSE84920.parser", ref="develop")

This will install the package from source.

Contributions

This Git repository uses the Git Flow branching model (the git flow extension is useful for this). The develop branch contains the latest contributions and other code that will appear in the next release, and the master branch contains the code of the latest release.

Contributing to this package is easy. Just send a pull request. When you send your PR, make sure develop is the destination branch on the GSE84920.parser repository. Your PR should pass R CMD check --as-cran, which will also be checked by Travis CI and AppVeyor CI when the PR is submitted.

Software status

Resource	GitHub	GitHub Actions	Travis CI	AppVeyor CI
Platforms:	Multiple	Multiple	Linux & macOS	Windows
R CMD check
Test coverage

henrikbengtsson / gse84920.parser Goto Github PK

gse84920.parser's Introduction

GSE84920.parser: Parsers for the GSE84920 Hi-C Data Sets of Ramani et al. (2017)

Data

R API

Shell Script API

Splitting whole-genome files into chromosome-pair files

References

Installation

Pre-release version

Contributions

Software status

gse84920.parser's People

Contributors

Watchers

gse84920.parser's Issues

Useful tools

Recommend Projects

Recommend Topics

Recommend Org