hackseq / 2017_project_5

Developing advanced R tutorials for genomic data analysis
Home Page: https://hackseq.github.io/2017_project_5/
License: MIT License
I envisioned that we would pick a genomic dataset (or a set of related datasets) that we can all use so that the tutorials are consistent with one another. It then makes it easier to string the tutorials together as part of a longer workshop.
Who has ideas of good datasets that can be analyzed in different ways for each topic we end up selecting?
I will be working on creating a simplified eQTL analysis using biomaRt and data.table.
This will be a naive approach to eQTL analysis using the two packages above.
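As a rough sketch of the kind of code involved (the chromosome, attributes, and the `expr` table below are placeholders, not the actual tutorial dataset):

```r
library(biomaRt)
library(data.table)

# Fetch gene annotations from Ensembl (assumes internet access)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
genes <- as.data.table(getBM(
  attributes = c("ensembl_gene_id", "chromosome_name", "start_position"),
  filters    = "chromosome_name",
  values     = "22",
  mart       = mart
))

# Join expression results to annotations by gene ID
# (`expr` is a hypothetical data.table of expression values)
# results <- genes[expr, on = "ensembl_gene_id"]
```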
@hackseq/2017_team_5
Comment below with topics that you would like to see included in this set of intermediate/advanced R tutorials for genomic data analysis. You don't need to know the topic, because someone else might be able to write the tutorial. Also, a topic can be anything in R that can be used for genomic data analysis, e.g. an R package.
We'll then create separate GitHub issues for each topic that will get assigned to someone.
For many types of genomic data, most of the information can be stored as matrices. The most striking example is SNP data, which can be stored as matrices with thousands to hundreds of thousands of rows (samples) and hundreds of thousands to tens of millions of columns (SNPs) (Bycroft et al. 2017). This results in datasets of gigabytes to terabytes.
Other fields in genomics, such as proteomics or expression analysis, also store data as matrices, potentially larger than available memory.
To address large data size in R, we can use memory-mapping for accessing large matrices stored on disk instead of in RAM. This has existed in R for several years thanks to package bigmemory (Kane, Emerson, and Weston 2013).
More recently, two packages that use the same principle as bigmemory have been developed: bigstatsr and bigsnpr (Privé, Aschard, and Blum 2017). Package bigstatsr implements many statistical tools for several types of Filebacked Big Matrices (FBMs), making it usable for any type of genomic data that can be encoded as a matrix. The statistical tools in bigstatsr include implementations of multivariate sparse linear models, Principal Component Analysis (PCA), matrix operations, and numerical summaries. Package bigsnpr implements algorithms specific to the analysis of SNP arrays, building on features already implemented in bigstatsr.
In this small tutorial, we’ll see the potential benefits of using memory-mapping instead of standard R matrices in memory, by using bigstatsr and bigsnpr.
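A minimal sketch of the idea, assuming bigstatsr is installed (the matrix here is random toy data, not a real genomic dataset):

```r
library(bigstatsr)

# Create a Filebacked Big Matrix (FBM): the data live on disk,
# not in RAM, and are accessed via memory-mapping
X <- FBM(nrow = 1000, ncol = 500, init = rnorm(1000 * 500))

# Column-wise sums and variances, computed without loading X into memory
stats <- big_colstats(X)
head(stats$sum)

# Partial SVD (for PCA) directly on the FBM
svd <- big_randomSVD(X, fun.scaling = big_scale(), k = 10)
```

The same calls scale to matrices much larger than RAM, since only the chunks being worked on are paged in.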
You can find the first version of the tutorial there.
Hi all,
I'll be working on producing a tutorial on the subject of unsupervised ML/clustering packages. The subtopics I have are as follows:
I'll produce visualisations for each at the end of the section.
If anyone has any recommendations for packages they've used that didn't have a vignette or comprehensive online tutorial, hit me up!
@hackseq/2017_team_5
What topic (e.g. R package) will you showcase?
Mostly the package 'xlsx'. I will be showing a way of integrating results from multiple genomic analyses or pipelines and producing a multi-tab Excel workbook that stores all the data tables and figures in a single Excel file (or looping the pipeline to generate multiple Excel files).
Why do you think it's worthwhile learning about this topic/package?
It is common that, at the end stage of a genomic study, one arrives at a point with 50 genes of interest and asks: what do we know about these genes? What were the results of the SNV, expression, or pathway analyses I performed for each of them?
My goal is to generate a single Excel file for each gene, containing all the analysis results about that gene, including data tables, figures, literature text-mining, etc., as an integrated report.
It's important to state the motivation so it's clear to readers why they should read further.
What dataset will you use?
Preferably, pick a dataset that is relatively small.
If the dataset is large, you can subset it to make it smaller (e.g. subset on chromosomes 20-22).
Try to leverage datasets that are used elsewhere in this set of tutorials.
It would be nice if there was a permalink for downloading the dataset. If a custom dataset is created (e.g. subsetting an existing large dataset), it would be great if the custom version was hosted somewhere (e.g. FigShare) so you can provide a permalink.
I will be using the results generated by all the analyses we are doing, or a list of genes found significant in one of the analyses we showcase.
What software dependencies need to be installed?
R packages are usually easy to install, so it's okay to install a few R packages.
Other command-line tools might be harder to set up on certain systems (e.g. Windows), so try to limit the number of external tool dependencies.
Mostly the package 'xlsx'; currently there is no very useful tutorial on this package on the web.
The package 'animation' is sometimes needed when an R process automatically stores a figure in PDF format, which then has to be converted to PNG before it can be imported into an Excel sheet.
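For the PDF-to-PNG step, a minimal sketch using animation's `im.convert()` (which wraps ImageMagick's `convert`, so ImageMagick must be installed; the file names are placeholders):

```r
library(animation)

# Convert a PDF figure to PNG so it can be embedded in an Excel sheet
# (requires ImageMagick on the system PATH; file names are placeholders)
im.convert("volcano_plot.pdf", output = "volcano_plot.png")
```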
What will you cover in your tutorial?
This roughly corresponds to an outline of what you will accomplish in your tutorial using the dataset you picked.
How to create a workbook and worksheets, and how to export results (data tables, figures, text) into the .xlsx file.
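The basic workflow can be sketched as follows (sheet names and file names are placeholders; the xlsx package requires Java via rJava):

```r
library(xlsx)  # requires Java via rJava

# One workbook per gene, one sheet per analysis (names are placeholders)
wb <- createWorkbook()

tab <- createSheet(wb, sheetName = "SNV_results")
addDataFrame(head(mtcars), sheet = tab, row.names = FALSE)

fig <- createSheet(wb, sheetName = "Figures")
# addPicture() embeds an existing image file (e.g. a PNG) into a sheet
# addPicture("volcano_plot.png", sheet = fig)

saveWorkbook(wb, "gene_report.xlsx")
```

Looping this over a list of genes yields one integrated report file per gene.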
A biological analysis is sometimes more appropriately called a pipeline, because it generally consists of many steps using many different pieces of software and data formats. Yet these analysis pipelines are becoming very complex and usually make use of many bash/perl scripts. For people like me who don't know much bash or perl, those scripts can be really hard to understand.
What is important in these pipelines? To list what comes to my mind:
I think we can do each of these operations in R.
And I think we should.
The main reason would be to put your whole analysis in a single notebook containing all your code, results, and possibly some writing. Using notebooks is good practice and makes a fully reproducible analysis possible, which will be a standard in the years to come. Another reason is simply that it's easier!
In this tutorial, I'll show an example of a moderately complex analysis of the 1000 Genomes data, all in R.
You can find the first version of the tutorial there.
Hello fellow team members!
My name is Bruno Grande. I'm a PhD candidate in Molecular Biology and Biochemistry working in cancer genomics at Simon Fraser University in the Vancouver area. If the name sounds familiar, it's because I'm the one who proposed this project, "Developing advanced R tutorials for genomic data analysis". While that technically makes me team leader, I want this project to be a collaboration between people with similar interests in R and genomics. It's great to see that the idea of creating high-quality R tutorials focused on genomics sparked interest in quite a few of you, especially among remote participants.
To start us off, how about we all introduce ourselves to one another with the following?
I'll kick off other discussions soon in preparation for Hackseq, which is starting in less than two days! Feel free to include any questions you may have in your reply.
How do you store results of an analysis?
When you vary several parameters, compare many methods, and want to keep all the results of an analysis, your code can become quite complex.
In this tutorial, we'll make a comparison of machine learning methods for predicting disease based on small SNP data. We'll show how to use the tidyverse set of packages to make the analysis easier by using consistent data structures and functional programming. We'll use tibbles with list-columns.
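The core pattern can be sketched like this (`fit_model()` and `snp_data` are hypothetical placeholders standing in for the tutorial's actual methods and dataset):

```r
library(tidyverse)

# Cross every method with every parameter setting, then store each
# fitted model and its accuracy in list-columns of a single tibble
results <- crossing(
  method = c("logistic", "random_forest"),
  n_snps = c(100, 1000)
) %>%
  mutate(
    fit      = map2(method, n_snps, ~ fit_model(snp_data, .x, .y)),
    accuracy = map_dbl(fit, ~ .x$accuracy)
  )
```

Keeping models, parameters, and metrics in one tibble means the whole comparison can be filtered, sorted, and plotted with ordinary dplyr/ggplot2 verbs.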
You can find the first version of the tutorial there.