hackseq / 2017_project_5

Developing advanced R tutorials for genomic data analysis
Home Page: https://hackseq.github.io/2017_project_5/
License: MIT License
I envisioned that we would pick a genomic dataset (or a set of related datasets) that we can all use so that the tutorials are consistent with one another. It then makes it easier to string the tutorials together as part of a longer workshop.
Who has ideas of good datasets that can be analyzed in different ways for each topic we end up selecting?
I will be working on creating a simplified eQTL analysis using biomaRt and data.table.
This will be a naive approach to eQTL analysis using the two packages above.
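As a rough sketch of the kind of code involved (the chromosome, attributes, and the `expr` table below are placeholders, not the actual tutorial dataset):

```r
library(biomaRt)
library(data.table)

# Fetch gene annotations from Ensembl (assumes internet access)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
genes <- as.data.table(getBM(
  attributes = c("ensembl_gene_id", "chromosome_name", "start_position"),
  filters    = "chromosome_name",
  values     = "22",
  mart       = mart
))

# Join expression results to annotations by gene ID
# (`expr` is a hypothetical data.table of expression values)
# results <- genes[expr, on = "ensembl_gene_id"]
```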
@hackseq/2017_team_5
Comment below with topics that you would like to see included in this set of intermediate/advanced R tutorials for genomic data analysis. You don't need to know the topic, because someone else might be able to write the tutorial. Also, a topic can be anything in R that can be used for genomic data analysis, e.g. an R package.
We'll then create separate GitHub issues for each topic that will get assigned to someone.
For many types of genomic data, most of the information can be stored as matrices. The most striking example is SNP data, which can be stored as matrices with thousands to hundreds of thousands of rows (samples) and hundreds of thousands to tens of millions of columns (SNPs) (Bycroft et al. 2017). This results in datasets of gigabytes to terabytes.
Other fields in genomics, such as proteomics or expression analysis, also store data as matrices, potentially larger than available memory.
To address large data size in R, we can use memory-mapping for accessing large matrices stored on disk instead of in RAM. This has existed in R for several years thanks to package bigmemory (Kane, Emerson, and Weston 2013).
More recently, two packages that use the same principle as bigmemory have been developed: bigstatsr and bigsnpr (Privé, Aschard, and Blum 2017). Package bigstatsr implements many statistical tools for several types of Filebacked Big Matrices (FBMs), making it usable for any type of genomic data that can be encoded as a matrix. The statistical tools in bigstatsr include implementations of multivariate sparse linear models, Principal Component Analysis (PCA), matrix operations, and numerical summaries. Package bigsnpr implements algorithms specific to the analysis of SNP arrays, building on features already implemented in bigstatsr.
In this small tutorial, we’ll see the potential benefits of using memory-mapping instead of standard R matrices in memory, by using bigstatsr and bigsnpr.
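A minimal sketch of the idea, assuming bigstatsr is installed (the matrix here is random toy data, not a real genomic dataset):

```r
library(bigstatsr)

# Create a Filebacked Big Matrix (FBM): the data live on disk,
# not in RAM, and are accessed via memory-mapping
X <- FBM(nrow = 1000, ncol = 500, init = rnorm(1000 * 500))

# Column-wise sums and variances, computed without loading X into memory
stats <- big_colstats(X)
head(stats$sum)

# Partial SVD (for PCA) directly on the FBM
svd <- big_randomSVD(X, fun.scaling = big_scale(), k = 10)
```

The same calls scale to matrices much larger than RAM, since only the chunks being worked on are paged in.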
You can find the first version of the tutorial there.
Hi all,
I'll be working on producing a tutorial on the subject of unsupervised ML/clustering packages. The subtopics I have are as follows:
I'll produce visualisations for each at the end of the section.
If anyone has any recommendations for packages they've used that didn't have a vignette or comprehensive online tutorial, hit me up!
@hackseq/2017_team_5
What topic (e.g. R package) will you showcase?
Mostly the package 'xlsx'. I will be showing a way of integrating results from multiple genomic analyses or pipelines and producing a multi-tab Excel workbook that stores all the data tables and figures in a single Excel file (or looping the pipeline to generate multiple Excel files).
Why do you think it's worthwhile learning about this topic/package?
It is common that, at the end stage of a genomic study, one arrives at a point with 50 genes of interest and asks: what do we know about these genes? What were the results of the SNV, expression, or pathway analyses I performed for each of them?
My goal is to generate a single Excel file for each gene, containing all the analysis results about that gene, including data tables, figures, literature text-mining, etc., as an integrated report.
It's important to state the motivation so it's clear to readers why they should read further.
What dataset will you use?
Preferably, pick a dataset that is relatively small.
If the dataset is large, you can subset it to make it smaller (e.g. subset on chromosomes 20-22).
Try to leverage datasets that are used elsewhere in this set of tutorials.
It would be nice if there was a permalink for downloading the dataset. If a custom dataset is created (e.g. subsetting an existing large dataset), it would be great if the custom version was hosted somewhere (e.g. FigShare) so you can provide a permalink.
I will be using the results generated by all the analyses we are doing, or a list of genes found significant in one of the analyses we showcase.
What software dependencies need to be installed?
R packages are usually easy to install, so it's okay to install a few R packages.
Other command-line tools might be harder to set up on certain systems (e.g. Windows), so try to limit the number of external tool dependencies.
Mostly the package 'xlsx'; currently there is no very useful tutorial on this package on the web.
The package 'animation' is sometimes needed when an R process automatically stores a figure in PDF format, which then has to be converted to PNG before it can be imported into an Excel sheet.
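For the PDF-to-PNG step, a minimal sketch using animation's `im.convert()` (which wraps ImageMagick's `convert`, so ImageMagick must be installed; the file names are placeholders):

```r
library(animation)

# Convert a PDF figure to PNG so it can be embedded in an Excel sheet
# (requires ImageMagick on the system PATH; file names are placeholders)
im.convert("volcano_plot.pdf", output = "volcano_plot.png")
```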
What will you cover in your tutorial?
This roughly corresponds to an outline of what you will accomplish in your tutorial using the dataset you picked.
How to create a workbook and worksheets, and how to export results (data tables, figures, text) into the .xlsx file.
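The basic workflow can be sketched as follows (sheet names and file names are placeholders; the xlsx package requires Java via rJava):

```r
library(xlsx)  # requires Java via rJava

# One workbook per gene, one sheet per analysis (names are placeholders)
wb <- createWorkbook()

tab <- createSheet(wb, sheetName = "SNV_results")
addDataFrame(head(mtcars), sheet = tab, row.names = FALSE)

fig <- createSheet(wb, sheetName = "Figures")
# addPicture() embeds an existing image file (e.g. a PNG) into a sheet
# addPicture("volcano_plot.png", sheet = fig)

saveWorkbook(wb, "gene_report.xlsx")
```

Looping this over a list of genes yields one integrated report file per gene.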
A biological analysis is sometimes more appropriately called a pipeline, because it generally consists of many steps using many different pieces of software and data formats. Yet these analysis pipelines are becoming very complex and usually make use of many bash/perl scripts. For people like me who don't know much bash or perl, those scripts can be really hard to understand.
What is important in these pipelines? To list what comes to my mind:
I think we can do each of these operations in R.
And I think we should.
The main reason would be to put your whole analysis in a single notebook containing all your code, results, and possibly some writing. Using notebooks is good practice and makes a fully reproducible analysis possible, which will be a standard in the years to come. Another reason is simply that it's easier!
In this tutorial, I'll show an example of a moderately complex analysis of the 1000 Genomes data, all in R.
You can find the first version of the tutorial there.
Hello fellow team members!
My name is Bruno Grande. I'm a PhD candidate in Molecular Biology and Biochemistry working in cancer genomics at Simon Fraser University in the Vancouver area. If the name sounds familiar, it's because I'm the one who proposed this project, "Developing advanced R tutorials for genomic data analysis". While that technically makes me team leader, I want this project to be a collaboration between people with similar interests in R and genomics. It's great to see that the idea of creating high-quality R tutorials focused on genomics sparked interest in quite a few of you, especially among remote participants.
To start us off, how about we all introduce ourselves to one another with the following?
I'll kick off other discussions soon in preparation for Hackseq, which is starting in less than two days! Feel free to include any questions you may have in your reply.
How do you store results of an analysis?
When you vary several parameters, compare many methods, and want to keep all the results of an analysis, your code can become quite complex.
In this tutorial, we'll make a comparison of machine learning methods for predicting disease based on small SNP data. We'll show how to use the tidyverse set of packages to make the analysis easier by using consistent data structures and functional programming. We'll use tibbles with list-columns.
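The core pattern can be sketched like this (`fit_model()` and `snp_data` are hypothetical placeholders standing in for the tutorial's actual methods and dataset):

```r
library(tidyverse)

# Cross every method with every parameter setting, then store each
# fitted model and its accuracy in list-columns of a single tibble
results <- crossing(
  method = c("logistic", "random_forest"),
  n_snps = c(100, 1000)
) %>%
  mutate(
    fit      = map2(method, n_snps, ~ fit_model(snp_data, .x, .y)),
    accuracy = map_dbl(fit, ~ .x$accuracy)
  )
```

Keeping models, parameters, and metrics in one tibble means the whole comparison can be filtered, sorted, and plotted with ordinary dplyr/ggplot2 verbs.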
You can find the first version of the tutorial there.