
2017_project_5's Issues

Picking a dataset

I envisioned that we would pick a genomic dataset (or a set of related datasets) that we can all use so that the tutorials are consistent with one another. It then makes it easier to string the tutorials together as part of a longer workshop.

Who has ideas of good datasets that can be analyzed in different ways for each topic we end up selecting?

Simplified eQTL analysis in R

I will be working on creating a simplified eQTL analysis using biomaRt and data.table.
This will be a naive approach to eQTL analysis using these two packages.
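As a rough idea of the shape this could take, here is a minimal sketch, assuming hypothetical long-format input files and a simple per-pair linear model; the actual tutorial may look quite different:

    library(data.table)
    library(biomaRt)

    # Hypothetical inputs:
    #   genotypes.csv:  sample, snp_id, allele_count (0/1/2)
    #   expression.csv: sample, gene_id, expression
    geno <- fread("genotypes.csv")
    expr <- fread("expression.csv")

    # Pair up every SNP with every gene measured in the same sample
    dat <- merge(geno, expr, by = "sample", allow.cartesian = TRUE)

    # Naive eQTL: for each SNP-gene pair, regress expression on allele count
    eqtl <- dat[, {
      fit <- summary(lm(expression ~ allele_count))$coefficients
      list(beta    = fit["allele_count", "Estimate"],
           p_value = fit["allele_count", "Pr(>|t|)"])
    }, by = .(snp_id, gene_id)]

    # Annotate gene IDs with symbols via biomaRt
    mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
    ann <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                 filters = "ensembl_gene_id",
                 values = unique(eqtl$gene_id), mart = mart)
    eqtl <- merge(eqtl, ann, by.x = "gene_id", by.y = "ensembl_gene_id",
                  all.x = TRUE)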

@hackseq/2017_team_5

Topic brainstorm

Comment below with topics that you would like to see included in this set of intermediate/advanced R tutorials for genomic data analysis. You don't need to know the topic well yourself; someone else might be able to write the tutorial. Also, a topic can be anything in R that can be used for genomic data analysis, e.g. an R package.

We'll then create separate GitHub issues for each topic that will get assigned to someone.

Showing how to use R & memory-mapping to analyze data encoded as large matrices

For many types of genomic data, most of the information can be stored as matrices. The most striking example is SNP data, which can be stored as matrices with thousands to hundreds of thousands of rows (samples) and hundreds of thousands to tens of millions of columns (SNPs) (Bycroft et al. 2017). This results in datasets ranging from gigabytes to terabytes.

Other fields in genomics, such as proteomics or expression studies, also use data stored as matrices that are potentially larger than available memory.

To handle such large data in R, we can use memory-mapping to access large matrices stored on disk instead of in RAM. This has been possible in R for several years thanks to the bigmemory package (Kane, Emerson, and Weston 2013).

More recently, two packages that use the same principle as bigmemory have been developed: bigstatsr and bigsnpr (Privé, Aschard, and Blum 2017). Package bigstatsr implements many statistical tools for several types of Filebacked Big Matrices (FBMs), making it usable for any type of genomic data that can be encoded as a matrix. The statistical tools in bigstatsr include implementations of multivariate sparse linear models, Principal Component Analysis (PCA), matrix operations, and numerical summaries. Package bigsnpr implements algorithms specific to the analysis of SNP arrays, building on the features already implemented in bigstatsr.

In this small tutorial, we’ll see the potential benefits of using memory-mapping instead of standard R matrices in memory, by using bigstatsr and bigsnpr.
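As a preview, here is a minimal sketch of the kind of workflow involved (the matrix dimensions and random fill are arbitrary placeholders):

    library(bigstatsr)

    # Create a 10,000 x 50,000 file-backed matrix (FBM) stored on disk
    X <- FBM(10e3, 50e3, backingfile = tempfile())

    # Fill it block by block with genotype-like values (0/1/2),
    # never holding the whole matrix in RAM at once
    big_apply(X, a.FUN = function(X, ind) {
      X[, ind] <- sample(0:2, nrow(X) * length(ind), replace = TRUE)
      NULL
    })

    # Column-wise sums and variances computed on the memory-mapped data
    stats <- big_colstats(X)
    head(stats$sum / nrow(X))  # column means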


You can find the first version of the tutorial here.

Unsupervised clustering

Hi all,

I'll be working on producing a tutorial on the subject of unsupervised ML/clustering packages. The subtopics I have are as follows:

  • PCA (+ scree)
  • NMF
  • Hierarchical clustering
  • Consensus clustering
  • tSNE

I'll produce visualisations for each at the end of the section.
If anyone has any recommendations for packages they've used that didn't have a vignette or comprehensive online tutorial, hit me up!
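In the meantime, here is a minimal sketch of two of the subtopics (PCA with a scree plot, and hierarchical clustering) using base R on simulated data; the actual tutorial will likely use real genomic data and dedicated packages:

    set.seed(42)
    # Simulated expression-like matrix: 100 samples x 20 genes
    mat <- matrix(rnorm(100 * 20), nrow = 100,
                  dimnames = list(paste0("sample", 1:100), paste0("gene", 1:20)))

    # PCA + scree plot
    pca <- prcomp(mat, scale. = TRUE)
    screeplot(pca, type = "lines", main = "Scree plot")

    # Hierarchical clustering of samples on Euclidean distances
    hc <- hclust(dist(mat), method = "ward.D2")
    clusters <- cutree(hc, k = 3)
    plot(hc, labels = FALSE, main = "Hierarchical clustering of samples")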

@hackseq/2017_team_5

Integration of Genomic Analytic Results as a Multi-tab Excel Sheet with Package 'xlsx'

What topic (e.g. R package) will you showcase?
Mostly about the package 'xlsx'. I will show a way of integrating results from multiple genomic analyses or pipelines and producing a multi-tab Excel sheet that stores all the data tables and figures in a single Excel file (or looping the pipeline to generate multiple Excel files).

Why do you think it's worthwhile learning about this topic/package?
It is common at the end stage of a genomic study to arrive at a point with, say, 50 genes of interest and ask: what do we know about these genes? What were the results of the SNV, expression, or pathway analyses I performed for each of these genes?
My goal is to generate a single Excel file for each gene, containing all the analysis results about this gene, including data tables, figures, literature text-mining, etc., as an integrated report.

It's important to state the motivation so it's clear to readers why they should read further.
What dataset will you use?
Preferably, pick a dataset that is relatively small.
If the dataset is large, you can subset it to make it smaller (e.g. subset on chromosomes 20-22).
Try to leverage datasets that are used elsewhere in this set of tutorials.
It would be nice if there were a permalink for downloading the dataset. If a custom dataset is created (e.g. by subsetting an existing large dataset), it would be great if the custom version were hosted somewhere (e.g. FigShare) so you can provide a permalink.

I will be using the results generated by all the analyses we are doing, or a list of genes found significant in certain analyses we showcase.

What software dependencies need to be installed?
R packages are usually easy to install, so it's okay to install a few R packages.
Other command-line tools might be harder to set up on certain systems (e.g. Windows), so try to limit the number of external tool dependencies.

Mostly the package 'xlsx'; currently, there is no very useful tutorial for this package on the web.
The package 'animation' is sometimes used when an R process automatically stores a figure in .PDF format, which then needs to be converted to .PNG to be imported into an Excel sheet.

What will you cover in your tutorial?
This roughly corresponds to an outline of what you will accomplish in your tutorial using the dataset you picked.

How to create a workbook and worksheets, and how to export results (data tables, figures, text) to the .xlsx file.
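As a preview, here is a minimal sketch of the core workflow (the file and sheet names are hypothetical, and mtcars is just placeholder data):

    library(xlsx)

    wb <- createWorkbook()

    # Sheet 1: a data table of results
    sheet1 <- createSheet(wb, sheetName = "SNV_results")
    addDataFrame(head(mtcars), sheet1, row.names = FALSE)

    # Sheet 2: a figure, saved as .PNG and then embedded
    png("expression_plot.png"); plot(1:10); dev.off()
    sheet2 <- createSheet(wb, sheetName = "Expression_plot")
    addPicture("expression_plot.png", sheet2, startRow = 2, startColumn = 2)

    # Write everything to a single multi-tab Excel file
    saveWorkbook(wb, "GENE1_report.xlsx")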

Showing how to use R to make a fully reproducible pipeline

A biological analysis is sometimes more appropriately called a pipeline, because it generally consists of many steps using many different tools and data formats. These analysis pipelines are becoming very complex and usually make use of many bash/perl scripts. For people like me who don't really know much bash or perl, it can be really hard to understand those scripts.

What is important in these pipelines? To list what comes to my mind:

  • use the command line
  • manipulate files
  • use regular expressions
  • visualize results
  • report results

I think we can do each of these operations in R.
And I think we should.

The main reason would be to put all your analysis in a single notebook containing all your code, results, and possibly some writing. Using notebooks is good practice and makes it possible to have a fully reproducible analysis, which will be a standard in the years to come. Another reason is simply that it's easier!

In this tutorial, I'll show an example of a moderately complex analysis of the 1000 Genomes data, all in R.
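To give a flavour, here is a minimal sketch of each of the five operations done from R (the PLINK call and all file names are hypothetical):

    # 1. Use the command line (hypothetical PLINK call)
    system2("plink", args = c("--bfile", "mydata", "--freq"))

    # 2. Manipulate files
    dir.create("results", showWarnings = FALSE)
    file.copy("plink.frq", "results/plink.frq")

    # 3. Use regular expressions
    vcf_files <- list.files(pattern = "\\.vcf$")
    chrs <- sub("^chr([0-9XY]+)\\..*$", "\\1", vcf_files)

    # 4. Visualize results
    freq <- read.table("results/plink.frq", header = TRUE)
    hist(freq$MAF, main = "Minor allele frequencies")

    # 5. Report results: render the whole notebook
    # rmarkdown::render("analysis.Rmd")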


You can find the first version of the tutorial here.

Introductions

Hello fellow team members!

My name is Bruno Grande. I'm a PhD candidate in Molecular Biology and Biochemistry working in cancer genomics at Simon Fraser University in the Vancouver area. If the name sounds familiar, it's because I'm the one who proposed this project, "Developing advanced R tutorials for genomic data analysis". While that technically makes me team leader, I want this project to be a collaboration between people with similar interests in R and genomics. It's great to see that the idea of creating high-quality R tutorials focused on genomics sparked interest in quite a few of you, especially among remote participants.

To start us off, how about we all introduce ourselves to one another with the following?

  • Your name
  • Your position (e.g. PhD, MSc, postdoc, staff)
  • Institutional affiliation
  • Whether you will be an in-person or remote participant
  • If you're a remote participant, where you will be joining us from (city and time zone)
  • What motivated you to work on this project?

I'll kick off other discussions soon in preparation for Hackseq, which is starting in less than two days! Feel free to include any questions you may have in your reply.

Showing how to store results in a consistent data structure

How do you store results of an analysis?

When you have varying parameters, when you compare many methods, and when you want to keep all the results of an analysis, your code can become quite complex.

In this tutorial, we'll make a comparison of machine learning methods for predicting disease based on small SNP data. We'll show how to use the tidyverse set of packages to make the analysis easier by using consistent data structures and functional programming. We'll use tibbles with list-columns.
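As a preview of the idea, here is a minimal sketch where one tibble row holds one method/parameter combination together with its data and fitted model (my_snp_data, fit_model, and compute_auc are hypothetical placeholders):

    library(tidyverse)

    # One row per experiment; list-columns hold the data and fitted models
    results <- crossing(
      method  = c("logistic", "random_forest"),
      n_train = c(100, 500)
    ) %>%
      mutate(
        data  = map(n_train, ~ slice_sample(my_snp_data, n = .x)),  # hypothetical dataset
        model = map2(method, data, ~ fit_model(.x, .y)),            # hypothetical helper
        auc   = map_dbl(model, compute_auc)                         # hypothetical helper
      )

    # The whole comparison lives in one consistent data structure
    results %>% arrange(desc(auc))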


You can find the first version of the tutorial here.
