twang15 / platoacademy Goto Github PK

Free thoughts live

multiomics statistics machine-learning high-performance-computing bioinformatics medicine

platoacademy's Introduction

PlatoAcademy

An academic space for free thinking in mathematics, computer science, medicine and beyond.

Tools

Slack: platoacademy.slack.com
Zoom: announced in Slack weekly or biweekly
Email: you already have it
Wiki: write down your problems and thoughts
GitHub repo: papers, slides, books, and beyond

platoacademy's People

Forkers

bbandaru

platoacademy's Issues

Books

Li Hang, Statistical Machine Learning (李航统计机器学习)

Statistical models for NLP

Stanford CS-229:

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include: supervised learning (generative/discriminative learning, parametric/non-parametric learning, neural networks, support vector machines); unsupervised learning (clustering, dimensionality reduction, kernel methods); learning theory (bias/variance tradeoffs, practical advice); reinforcement learning and adaptive control. The course will also discuss recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing.
CS-229 on Coursera: https://www.coursera.org/learn/machine-learning/home/welcome

AI for everyone: https://www.coursera.org/learn/ai-for-everyone

### Todos

Feature selection via exhaustive search.
Estimate search time
Try several other linear (SVM) and non-linear (random forest, extra trees, gradient boosting, XGBoost) models
Model interpretation for the best linear model (via statistics, hypothesis testing, LIME, and Shapley values)
Metrics: auc, accuracy, sensitivity, specificity, ppv
Model selection via nested cv
Model comparison in terms of auc (p-value), accuracy, speed
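
The threshold-based metrics in the list above all follow from the four cells of a confusion matrix. A minimal sketch in plain Python, assuming binary 0/1 labels; the function name `binary_metrics` and the example labels are made up for illustration (AUC is left out because it needs scores rather than hard predictions):

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, specificity and PPV for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # A ratio with an empty denominator is undefined; report None.
        return num / den if den else None

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
    }


# Illustrative labels only: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```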

Bioinformatics 3: ChIP-seq analysis

Steps in data analysis

Preprocessing:
i) Bad quality -> Tool: Use “FASTQ Quality Filter” and/or “FASTQ Quality Trimmer”
ii) Flagged Kmer Content: About 100% of the first six bases are the same sequence -> Tool: Use “FASTQ Trimmer”
Quality control: Run fastqc on the processed samples to see if the problem has been removed. Tool: fastqc

Library complexity: the fraction of unique fragments present in a given library. A proxy is to look at the sequence duplication levels on the FastQC report.
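
As a rough illustration of the definition above, library complexity can be computed as distinct fragments divided by total fragments. This is a sketch, not what FastQC does internally; the function name `library_complexity` and the toy fragment tuples are assumptions for the example:

```python
def library_complexity(fragments):
    """Fraction of distinct fragments among all sequenced fragments."""
    fragments = list(fragments)
    if not fragments:
        raise ValueError("no fragments given")
    return len(set(fragments)) / len(fragments)


# Hypothetical mapped fragments as (chromosome, start, strand) tuples.
fragments = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # duplicate of the first fragment
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
]
complexity = library_complexity(fragments)  # 3 distinct out of 4 total
print(complexity)
```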

Low library complexity may be an indicator that:
– A new sample and a new library should be prepared.
– We have to find a better Ab to perform the IP.
– We cannot sequence the same sample any further because we will not find new sequences.
In certain experimental settings a low library complexity is expected, e.g. when profiling a protein that binds to a small subset of the genome.

Mapping (alignment): Treat IP and control the same way (preprocessing and mapping). Tool: bowtie 1 or bowtie 2 (use end-to-end mode) or bwa
– map the reads and remove unmapped reads
– filter mapped reads by mapping quality score
Peak calling
i) Read extension and signal profile generation: Estimation of the fragment length using Strand cross-correlation analysis
ii) Peak assignment and evaluation
– Look for fold enrichment of the sample over input or expected background
– Estimate the significance of the fold enrichment using

• Poisson distribution
• negative binomial distribution
• background distribution from input DNA
• model background data to adjust for local variation (MACS). By default, MACS filters out redundant tags at the same location and on the same strand, keeping at most 1 tag. The tag file format can be “BED”, “SAM”, “BAM” or “BOWTIE”; the default is “BED”.
iii) Look at your mapped reads and peaks in a genome browser to verify peak calling thresholds
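
The Poisson option in the peak-calling steps above can be sketched as follows: treat the scaled control count in a window as the expected rate and ask how likely it is to see at least the observed IP count. This is a simplified illustration with a single fixed rate, not the local background model MACS uses; `poisson_sf`, `peak_enrichment` and the `scale` parameter are hypothetical names for this example.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def peak_enrichment(ip_count, control_count, scale=1.0):
    """Fold enrichment and Poisson p-value of IP reads over scaled control reads."""
    expected = control_count * scale
    if expected <= 0:
        raise ValueError("expected background must be positive")
    fold = ip_count / expected
    return fold, poisson_sf(ip_count, expected)


# Illustrative window: 20 IP reads against 5 control reads at equal depth.
fold, p_value = peak_enrichment(20, 5)
print(fold, p_value)
```

Subtracting the lower tail from 1 loses precision for extremely small p-values, so a production tool would work with log-probabilities instead.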

Peak analysis and interpretation
i) Link peaks to genes: Bed tools (intersectBed, closestBed, coverageBed, slopBed)

Link peaks to nearby genes (intersectBed)
Link peaks to closest genes (closestBed)
ii) Infer possible biological consequences of the binding

Comparing ChIP-seq across samples

intersectBed (finds the subset of peaks common to 2 samples or unique to one of them)
macs2 bdgdiff (find peaks present only in one of the samples)
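
The idea behind splitting peaks into common and sample-specific sets can be sketched in a few lines. This is not how bedtools is implemented (it is a quadratic scan, fine only for small lists); it assumes BED-style 0-based, half-open intervals, and `overlaps` and `split_peaks` are hypothetical helper names:

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def split_peaks(peaks_a, peaks_b):
    """Split peaks_a into those overlapping any peak in peaks_b and those that do not."""
    common, unique = [], []
    for peak in peaks_a:
        if any(overlaps(peak, other) for other in peaks_b):
            common.append(peak)
        else:
            unique.append(peak)
    return common, unique


# Made-up peaks for two samples.
sample_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
sample_b = [("chr1", 150, 250), ("chr2", 150, 300)]
common, unique_to_a = split_peaks(sample_a, sample_b)
print(common)
print(unique_to_a)
```

Note that the chr2 peaks only touch at position 150; with half-open coordinates they share no base, so the sample A peak counts as unique.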

Visualizing ChIP-seq reads with ngsplot
AnalysisofChIP-seqData2016.pdf

Ensemble learning

Bioinformatics tutorial

This issue gives an overview of bioinformatics.

Readings

The Actual Difference Between Statistics and Machine Learning

Human-Computer Interaction

Books

understanding machine learning from theory to algorithms

Data Standardization before modeling

Two seemingly conflicting goals: interpretability and feature importance

Much multiple linear regression software reports standardised coefficients, which are equivalent to the unstandardised coefficients you would get after manually standardising both the predictors and the response variable (although it sounds like you are talking about standardising only the predictors).
In cases where the metric does have meaning to the person interpreting the regression equation, unstandardised coefficients are often more informative.
You can always convert standardised coefficients to unstandardised coefficients if you know the mean and standard deviation of the predictor variable in the original sample.
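
The conversion mentioned above can be written out explicitly. A minimal sketch, assuming only the predictors were standardised as z = (x - mean) / sd and the response was left in its original units; the function name `unstandardize` and all numbers are illustrative:

```python
def unstandardize(intercept_std, coefs_std, means, sds):
    """Map coefficients fitted on z-scored predictors back to original units.

    Substituting z_j = (x_j - mean_j) / sd_j into
    y = a + sum(b_j * z_j) gives slope b_j / sd_j for x_j and
    intercept a - sum(b_j * mean_j / sd_j).
    """
    coefs = [b / s for b, s in zip(coefs_std, sds)]
    intercept = intercept_std - sum(b * m for b, m in zip(coefs, means))
    return intercept, coefs


# Made-up model fitted on two standardised predictors.
intercept, coefs = unstandardize(
    intercept_std=10.0,
    coefs_std=[2.0, -3.0],
    means=[5.0, 1.0],
    sds=[4.0, 0.5],
)

# Both parameterisations give the same prediction for x = [9.0, 2.0],
# whose z-scores are [1.0, 2.0].
pred_std = 10.0 + 2.0 * 1.0 + (-3.0) * 2.0
pred_raw = intercept + coefs[0] * 9.0 + coefs[1] * 2.0
print(intercept, coefs, pred_std, pred_raw)
```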

Researchers

Yidi Sun, Bioinformatics, CAS
Su-In Lee, U of Washington
Scott Lundberg, MSR
Christoph Molnar, U of Munich

Interpretable machine learning: https://christophm.github.io/book/

Andrew Ng

CS229a
CS229: http://cs229.stanford.edu/syllabus-spring2021.html
CS230

Jerome H. Friedman (https://statweb.stanford.edu/~jhf/)

The Elements of Statistical Learning
Gradient boosting

Molecular Biology

Signal processing

Statistics 4: 2. Law of Large Numbers

The law of large numbers states that if the same experiment is repeated independently a large number of times, the average of the results converges to the expected value. The theorem concerns only large numbers of trials: the average over a small number of repetitions may differ substantially from the expected value. As trials accumulate, the average tends to become a more precise estimate of the expected value.
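
A small simulation makes the statement concrete. The sketch below rolls a fair six-sided die (expected value 3.5) and tracks the running average; the function name and the fixed seed are choices made for this example only.

```python
import random


def running_mean_of_die_rolls(n_trials, seed=0):
    """Return the running average after each of n_trials fair die rolls."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    total = 0
    means = []
    for i in range(1, n_trials + 1):
        total += rng.randint(1, 6)
        means.append(total / i)
    return means


means = running_mean_of_die_rolls(100_000)
for n in (10, 100, 1_000, 100_000):
    # Early averages can be far from 3.5; later ones settle close to it.
    print(n, means[n - 1])
```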

Linear models

Linear regression in R
Formula syntax

The : operator gives the interaction between two terms, while * gives main effects plus their interaction (A*B = A + B + A:B). The / operator is for nesting: it keeps the numerator and adds its interaction with every term in the denominator (e.g. A/(B+C) = A + A:B + A:C). The | reads as "grouped by": in a mixed model, (1|station) is a random intercept grouped by station, which is how nesting is specified.
(1|station/tow) expands to (1|station)+(1|station:tow), that is, a random intercept for station plus one for each tow within station.
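
The expansion rules can be mimicked with a toy term expander. This is only an illustration in Python of how R expands a*b to a + b + a:b and a/b to a + a:b; it is not R's formula parser, and all three function names are hypothetical:

```python
def expand_crossing(a, b):
    """a*b -> main effects plus interaction."""
    return [a, b, f"{a}:{b}"]


def expand_nesting(outer, inner_terms):
    """outer/(t1 + t2 + ...) -> outer plus its interaction with each inner term."""
    return [outer] + [f"{outer}:{t}" for t in inner_terms]


def expand_nested_random_intercept(outer, inner):
    """(1|outer/inner) -> random intercepts for outer and for inner within outer."""
    return [f"(1|{outer})", f"(1|{outer}:{inner})"]


crossed = expand_crossing("A", "B")
nested = expand_nesting("A", ["B", "C"])
random_terms = expand_nested_random_intercept("station", "tow")
print(" + ".join(crossed))
print(" + ".join(nested))
print(" + ".join(random_terms))
```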

10 Assumptions for Linear regression
Logistic regression

introduction: https://lulab1.gitbook.io/training/part-ii.-machine-learning-skills/3.deep-learning-basics

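For statistical description, in addition to the ROC curve plot, studies in which the diagnostic indicator is a multi-class or continuous variable should also report sensitivity, specificity, the Youden index, and positive and negative predictive values at different cutoff values; the cutoff should be chosen and reported after weighing sensitivity against specificity. Studies of a model's classification performance should report the model expression. Take care that the ROC curve is consistent with the tabulated data and the text. For statistical inference, the AUC of the ROC curve and its 95% confidence interval (CI) must be reported. The AUC is mainly used to compare diagnostic value. To evaluate whether a single method has diagnostic value, compare its AUC with 0.5 and report the test statistic Z and its P value. To evaluate several methods, compare the AUCs of the ROC curves; this involves multiple comparisons, so the significance level α must be adjusted, and the chosen α should be reported along with Z and P. When several ROC curves cross, comparing AUCs alone may not reflect the true situation, so the comparison strategy needs care. Hu Liangping et al. suggest comparing the partial AUCs of these curves, or the sensitivity at a fixed false-positive rate; in that case, report the basis for and value of the chosen false-positive rate, the sensitivity at that value, and the corresponding statistic and P value. In recent years, the risk-stratification-based NRI (net reclassification improvement) and IDI (integrated discrimination improvement) have received attention; they reflect improvement in classification performance and offer more practical guidance than the AUC, so reporting NRI and IDI is recommended.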
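
A sketch of the quantities discussed above: the AUC, a Z test of the AUC against 0.5 with a 95% CI, and the Youden index. The standard error here uses the Hanley-McNeil approximation, which is one common choice and an assumption of this example (the text does not name a method); the function names and scores are made up, and with so few samples the normal approximation is crude.

```python
import math


def auc(pos_scores, neg_scores):
    """Probability that a positive case outscores a negative one (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_z_test(a, n_pos, n_neg):
    """Z, two-sided P for H0: AUC = 0.5, and 95% CI, using the Hanley-McNeil SE."""
    q1 = a / (2 - a)
    q2 = 2 * a * a / (1 + a)
    var = (a * (1 - a) + (n_pos - 1) * (q1 - a * a)
           + (n_neg - 1) * (q2 - a * a)) / (n_pos * n_neg)
    if var <= 0:
        raise ValueError("standard error is zero; the test is undefined")
    se = math.sqrt(var)
    z = (a - 0.5) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal tail
    return z, p, (a - 1.96 * se, a + 1.96 * se)


def youden_index(sensitivity, specificity):
    """Youden's J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


# Illustrative scores for 3 diseased and 3 healthy subjects.
pos = [0.9, 0.8, 0.4]
neg = [0.1, 0.5, 0.3]
a = auc(pos, neg)
z, p, ci = auc_z_test(a, len(pos), len(neg))
j = youden_index(0.75, 0.5)
print(a, z, p, ci, j)
```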

RNA seq: from 0 to 1

What is mRNA, rRNA and tRNA?

mRNA: Messenger RNA (mRNA) molecules carry the coding sequences for protein synthesis and are called transcripts
rRNA: ribosomal RNA (rRNA) molecules form the core of a cell's ribosomes (the structures in which protein synthesis takes place)
tRNA: transfer RNA (tRNA) molecules carry amino acids to the ribosomes during protein synthesis

Machine Learning application in medicine and biology

https://www.zhihu.com/question/60067846
