platoacademy's Introduction

PlatoAcademy

An academic space for free thinking in mathematics, computer science, medicine and beyond.

Tools

  1. Slack: platoacademy.slack.com
  2. Zoom: announced in Slack weekly or biweekly
  3. Emails: you already have that
  4. Wiki: write your problems and thoughts
  5. GitHub repo: papers, slides, books, and beyond

platoacademy's People

Contributors

twang15

Forkers

bbandaru

platoacademy's Issues

NLP and ASR Learning

Books

  1. Li Hang (李航), Statistical Machine Learning
  • Statistical models for NLP
  2. Stanford CS-229:
  • This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include: supervised learning (generative/discriminative learning, parametric/non-parametric learning, neural networks, support vector machines); unsupervised learning (clustering, dimensionality reduction, kernel methods); learning theory (bias/variance tradeoffs, practical advice); reinforcement learning and adaptive control. The course will also discuss recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing.
  • CS-229 on Coursera: https://www.coursera.org/learn/machine-learning/home/welcome
  3. AI for Everyone: https://www.coursera.org/learn/ai-for-everyone

### Todos

  1. Feature selection via exhaustive search.
  2. Estimate the search time.
  3. Try several other linear (SVM) and non-linear (random forest, extra trees, gradient boosting, XGBoost) models.
  4. Model interpretation for the best linear model (via statistics, hypothesis testing, LIME, and Shapley values).
  5. Metrics: AUC, accuracy, sensitivity, specificity, PPV.
  6. Model selection via nested cross-validation.
  7. Model comparison in terms of AUC (p-value), accuracy, and speed.
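
For todos 1 and 2, a rough sketch of how the cost of an exhaustive search could be estimated before running it. The function names and the per-fit timing are hypothetical; in practice the time per fit would be measured on the actual model and data.

```python
from itertools import combinations
from math import comb


def count_subsets(n_features, max_size=None):
    """Count the non-empty feature subsets with at most max_size features."""
    if max_size is None:
        max_size = n_features
    return sum(comb(n_features, k) for k in range(1, max_size + 1))


def estimate_search_seconds(n_features, seconds_per_fit, max_size=None):
    """Estimated wall time if every subset costs one model fit."""
    return count_subsets(n_features, max_size) * seconds_per_fit


def iter_subsets(features, max_size=None):
    """Yield every candidate subset, smallest subsets first."""
    if max_size is None:
        max_size = len(features)
    for k in range(1, max_size + 1):
        yield from combinations(features, k)


# Assumed numbers for illustration only: 20 features, 0.5 s per fit.
print(count_subsets(20))
print(estimate_search_seconds(20, 0.5) / 3600, "hours")
```

The count doubles with every added feature (2^n - 1 subsets in total), so capping the subset size with `max_size` is often the only way to keep an exhaustive search feasible.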

Bioinformatics 3: ChIP-seq analysis

Steps in data analysis

  1. Preprocessing:
    i) Bad quality -> Tool: Use “FASTQ Quality Filter” and/or “FASTQ Quality Trimmer”
    ii) Flagged Kmer Content: About 100% of the first six bases are the same sequence -> Tool: Use “FASTQ Trimmer”

  2. Quality control: Run fastqc on the processed samples to see if the problem has been removed. Tool: fastqc

Library complexity: the fraction of unique fragments present in a given library. A proxy is the sequence duplication level shown in the FastQC report.

Low library complexity may indicate that:
– A new sample and a new library should be prepared.
– A better antibody (Ab) is needed to perform the IP.
– The same sample cannot be sequenced further, because no new sequences will be found.
In certain experimental settings low library complexity may be expected, e.g. when profiling a protein that binds to a small subset of the genome.
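
As a minimal illustration of the definition above, library complexity can be computed as the number of distinct fragments divided by the total number of fragments. Representing a fragment as a `(chromosome, start, strand)` tuple is an assumption made for this sketch; real pipelines derive this from aligned reads.

```python
def library_complexity(fragments):
    """Fraction of fragments that are unique (1.0 means no duplicates)."""
    fragments = list(fragments)
    if not fragments:
        raise ValueError("at least one fragment is required")
    return len(set(fragments)) / len(fragments)


# Hypothetical fragments: the first two are duplicates of each other.
example = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
]
print(library_complexity(example))  # 3 unique out of 4 fragments
```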

  3. Mapping (alignment): Treat IP and control the same way (preprocessing and mapping). Tool: bowtie 1 or bowtie 2 (use end-to-end mode) or bwa
    – map the reads and remove unmapped reads
    – filter mapped reads by mapping quality score

  4. Peak calling
    i) Read extension and signal profile generation: estimate the fragment length using strand cross-correlation analysis
    ii) Peak assignment and evaluation
    – Look for fold enrichment of the sample over input or expected background
    – Estimate the significance of the fold enrichment using one of:
      • Poisson distribution
      • negative binomial distribution
      • background distribution from input DNA
      • a model of the background data that adjusts for local variation (MACS). By default, MACS filters out redundant tags at the same location and on the same strand, allowing at most 1 tag. The tag file format is “BED”, “SAM”, “BAM” or “BOWTIE”; the default is “BED”.
    iii) Look at your mapped reads and peaks in a genome browser to verify peak calling thresholds

  5. Peak analysis and interpretation
    i) Link peaks to genes: BEDTools (intersectBed, closestBed, coverageBed, slopBed)
      • Link peaks to nearby genes (intersectBed)
      • Link peaks to closest genes (closestBed)
    ii) Infer possible biological consequences of the binding

  6. Comparing ChIP-seq across samples
    • intersectBed (finds the subset of peaks common to 2 samples or unique to one of them)
    • macs2 bdgdiff (finds peaks present in only one of the samples)

  7. Visualizing ChIP-seq reads with ngsplot
    AnalysisofChIP-seqData2016.pdf
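
To make the peak evaluation step concrete, here is a small sketch of fold enrichment and a Poisson significance test for a single candidate region. It is not how MACS is implemented (MACS estimates a local background rate); the pseudocount and the use of a single expected background count are assumptions for illustration.

```python
import math


def fold_enrichment(ip_count, control_count, pseudocount=1.0):
    """Ratio of IP reads to control reads in a region, with a pseudocount."""
    return (ip_count + pseudocount) / (control_count + pseudocount)


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via the complement of the CDF."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


# Hypothetical region: 29 IP reads where the background predicts about 5.
print(fold_enrichment(29, 4))
print(poisson_sf(29, 5.0))
```

A small tail probability means the observed IP count is unlikely under the background rate alone. For very small p-values the complement loses precision, so a production tool would work in log space instead.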

Books

Understanding Machine Learning: From Theory to Algorithms

Data Standardization before modeling

Two seemingly conflicting goals: interpretability and feature importance

  1. Much software for multiple linear regression reports standardised coefficients. These are equivalent to the unstandardised coefficients obtained after manually standardising both the predictors and the response variable (although the question here appears to concern standardising only the predictors).
  2. When the metric has meaning to the person interpreting the regression equation, unstandardised coefficients are often more informative.
  3. You can always convert standardised coefficients to unstandardised coefficients if you know the mean and standard deviation of the predictor variable in the original sample.
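
Point 3 can be checked numerically. The sketch below assumes the simplest case: one predictor, standardised on its own (the response is left in its original units). If the fit on the standardised predictor is y = a_z + b_z * z with z = (x - mean) / sd, then the coefficients on the original scale are b = b_z / sd and a = a_z - b * mean. The data values are made up.

```python
from statistics import mean, stdev


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def unstandardise(intercept_z, slope_z, x_mean, x_sd):
    """Map coefficients fitted on z = (x - x_mean) / x_sd back to x units."""
    slope = slope_z / x_sd
    intercept = intercept_z - slope * x_mean
    return intercept, slope


# Made-up data for illustration.
x = [150.0, 160.0, 170.0, 180.0, 190.0]
y = [52.0, 58.0, 69.0, 77.0, 84.0]

x_mean, x_sd = mean(x), stdev(x)
z = [(xi - x_mean) / x_sd for xi in x]

a_z, b_z = fit_line(z, y)            # fit on the standardised predictor
a, b = unstandardise(a_z, b_z, x_mean, x_sd)
a_direct, b_direct = fit_line(x, y)  # fit on the original predictor

print(a, b)
print(a_direct, b_direct)
```

Both routes give the same line, so standardising for comparability of coefficients does not cost interpretability as long as the mean and standard deviation are kept.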

Statistics 4: 2. Law of Large Numbers

The law of large numbers states that if the same experiment or study is repeated independently a large number of times, the average of the results of the trials tends to be close to the expected value. Note that the theorem applies only to a large number of trials: the average of an experiment repeated a small number of times might differ substantially from the expected value. As more trials are added, however, the average becomes an increasingly precise estimate.
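
A quick simulation illustrates the statement. The example assumes a fair six-sided die, whose expected value is 3.5; the function name and the fixed seed are choices made for this sketch.

```python
import random


def mean_of_die_rolls(n_rolls, seed=0):
    """Average of n_rolls simulated fair die rolls (seeded for repeatability)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_rolls):
        total += rng.randint(1, 6)
    return total / n_rolls


# The sample mean typically moves toward 3.5 as the number of rolls grows.
for n in (10, 1_000, 100_000):
    print(n, mean_of_die_rolls(n))
```

With only 10 rolls the average can easily be off by half a point or more, while with 100,000 rolls it is usually within a few hundredths of 3.5.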

Linear models

Linear regression in R
Formula syntax

  1. The : operator denotes an interaction between two terms, while * expands to the main effects plus their interaction (A*B = A + B + A:B). The / operator denotes nesting: it keeps the term on the left and interacts it with every term on the right (e.g. A/(B+C) = A + A:B + A:C). The | operator means roughly "grouped by", so 1|station is an intercept grouped by station, and written in parentheses, (1|station), it is a random effect. That is how nesting is specified.
  2. (1|station/tow) expands to (1|station) + (1|station:tow): the main effect of station plus the interaction between tow and station.
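
The expansion rules can be spelled out with a toy string expander. This is only an illustration in Python, not R's formula parser; it follows R's documented rule that a/b is shorthand for a + a:b, and all function names are hypothetical.

```python
def expand_star(a, b):
    """A*B: both main effects plus their interaction."""
    return [a, b, f"{a}:{b}"]


def expand_nest(outer, inner_terms):
    """A/(B+C): the outer term plus its interaction with each inner term."""
    return [outer] + [f"{outer}:{term}" for term in inner_terms]


def expand_random_nest(outer, inner):
    """(1|outer/inner): random intercepts for outer and for inner within outer."""
    return [f"(1|{outer})", f"(1|{outer}:{inner})"]


print(" + ".join(expand_star("A", "B")))
print(" + ".join(expand_nest("A", ["B", "C"])))
print(" + ".join(expand_random_nest("station", "tow")))
```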

10 Assumptions for Linear regression
Logistic regression

AUC: description and inference

RNA seq: from 0 to 1

What are mRNA, rRNA, and tRNA?

  1. mRNA: Messenger RNA (mRNA) molecules carry the coding sequences for protein synthesis and are called transcripts
  2. rRNA: ribosomal RNA (rRNA) molecules form the core of a cell's ribosomes (the structures in which protein synthesis takes place)
  3. tRNA: transfer RNA (tRNA) molecules carry amino acids to the ribosomes during protein synthesis
