datasciencemasters's Introduction

The Open-Source Data Science Masters

This is a fork of this, experimenting with different curriculum topics and themes.

License here.

The Open Source Data Science Curriculum

History

Fundamentals

Intro to Data Science [UW / Coursera](https://www.coursera.org/course/datasci)
	Topics: Python NLP on Twitter API, Distributed Computing Paradigm, MapReduce/Hadoop & Pig Script, SQL/NoSQL, Relational Algebra, Experiment design, Statistics, Graphs, Amazon EC2, Visualization.

Skills

Matrices and Linear Algebra fundamentals
	Linear Algebra / Levandosky [Stanford / Book](http://www.amazon.com/Linear-Algebra-Steven-Levandosky/dp/0536667470/ref=sr_1_1?ie=UTF8&qid=1376546498&sr=8-1&keywords=linear+algebra+levandosky#)
	Coding the Matrix: Linear Algebra through Computer Science Applications [Brown / Coursera](https://www.coursera.org/course/matrix)
Hash Functions, Binary Tree, O(n)
Relational Algebra
DB Basics
Inner, Outer, Cross, Theta join
CAP Theorem
Tabular data
Entropy
Data Frames and Series
Sharding
OLAP
Multidimensional Data Model
	ETL
Reporting vs. BI vs. Analytics
JSON & XML
NoSQL
Regex
Vendor Landscape
Env setup
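
The join types listed above can be tried without any setup using Python's built-in `sqlite3` module. The sketch below is illustrative only: the `users` and `orders` tables, their columns, and their rows are made up for this example and are not part of the curriculum.

```python
import sqlite3

# In-memory database; tables and rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'bob'), (3, 'cy');
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 1, 20.0), (12, 2, 5.0);
""")


def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Inner join: only users that have at least one matching order.
inner = rows(
    "SELECT u.name, o.total FROM users u "
    "JOIN orders o ON o.user_id = u.id ORDER BY o.id"
)

# Left outer join: every user, with NULL (None) where no order matches.
left_outer = rows(
    "SELECT u.name, o.total FROM users u "
    "LEFT JOIN orders o ON o.user_id = u.id ORDER BY u.id, o.id"
)

# Cross join: every user paired with every order (3 x 3 rows).
cross = rows("SELECT u.name, o.id FROM users u CROSS JOIN orders o")

# Theta join: the join condition is an arbitrary comparison, not equality.
theta = rows(
    "SELECT u.name, o.id FROM users u "
    "JOIN orders o ON o.total > 10 * u.id ORDER BY u.id, o.id"
)

for label, result in [("inner", inner), ("left outer", left_outer),
                      ("cross", cross), ("theta", theta)]:
    print(label, result)
```

Comparing the `inner` and `left_outer` results shows the practical difference: the user with no orders appears only in the outer join, with `None` in place of the missing total.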

Maths and Stats

Skills

Descriptive statistics
Exploratory Data Analysis
Histograms
Percentiles and outliers
Probability theory
Bayes Theorem
Random Variables
Cumulative Distribution Function (CDF)
Distributions: continuous (Normal/Gaussian), discrete (Poisson)
Skewness
ANOVA
Probability Density Functions

Central Limit Theorem
Monte Carlo Method
Hypothesis testing
p-value
Chi squared test
Estimation
Confidence intervals (CI)
MLE
Kernel Density Estimate
Regression
Covariance
Correlation
Pearson Coefficient
Causation
Least squares fit
Euclidean Distance
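
Several of the regression topics above (covariance, the Pearson coefficient, and a least squares fit) are closely related, which a few lines of plain Python can make concrete. This is a minimal sketch under stated assumptions: the helper names and the sample `xs`/`ys` values are invented for illustration, and it uses the sample covariance (dividing by n - 1).

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pearson(xs, ys):
    """Pearson coefficient: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


def least_squares_fit(xs, ys):
    """Slope and intercept of the line minimising squared vertical error."""
    slope = covariance(xs, ys) / covariance(xs, xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept


# Hypothetical measurements that lie close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson(xs, ys)
slope, intercept = least_squares_fit(xs, ys)
print(f"r = {r:.4f}, fit: y = {slope:.2f}x + {intercept:.2f}")
```

Note that the fitted slope reuses the same covariance that appears in the Pearson coefficient; only the scaling differs.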

Computing

Toolbox / Programming Languages / Software stacks

Skills

Unix CLI: install programs and packages
Bash basics
	cat, grep, wget etc
	piping
	understand stdio
Python
Regex
MS Excel w/ Analysis ToolPak
Java
R, RStudio, Rattle
IBM SPSS
Weka, Knime, RapidMiner
Hadoop distribution of choice
Spark, Storm
Flume, Scribe, Chukwa
Nutch, Talend, Scraperwiki
Webscraper, Flume, Sqoop
tm, RWeka, NLTK
RHIPE
D3.js, ggplot2, Shiny
IBM Languageware
Cassandra, MongoDB

Algorithms, data structures and databases

Programming

Skills

Variables
Vectors
Matrices
Arrays
Factors
Lists
Data Frames
Reading CSV data
Reading Raw data
Manipulate Data Frames
Functions
Factor Analysis

Applied methods

Data Munging and integration

The art of converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools. Expect to spend 80% of your workday doing some sort of data wrangling.

Skills

Dimensionality & Numerosity Reduction
Normalization
Data Scrubbing
Handling missing values
Unbiased estimators
Binning sparse values
Feature Extraction
Denoising
Sampling
Stratified Sampling
Principal Component Analysis
Summary of Data Formats
Data Discovery
Data Sources & Acquisition
Data Integration
Data Fusion
Transformation and enrichment
Data survey
Google OpenRefine
How Much Data
Using ETL
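
To give a feel for what day-to-day wrangling looks like, here is a small sketch covering two of the skills above: handling missing values and normalization. It is only one possible approach. The CSV content, the column names, and the choice of mean imputation followed by min-max scaling are all assumptions made for this example.

```python
import csv
import io

# Hypothetical raw export with gaps in the age and income columns.
RAW = """name,age,income
ann,34,52000
bob,,48000
cy,29,
dee,41,61000
"""


def parse_number(text):
    """Convert a CSV cell to float, or None when the cell is empty."""
    text = text.strip()
    return float(text) if text else None


def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    fill = sum(known) / len(known)
    return [fill if v is None else v for v in values]


def min_max(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


records = list(csv.DictReader(io.StringIO(RAW)))
ages = impute_mean([parse_number(r["age"]) for r in records])
incomes = min_max(impute_mean([parse_number(r["income"]) for r in records]))

for record, age, income in zip(records, ages, incomes):
    print(record["name"], round(age, 1), round(income, 3))
```

Mean imputation is the simplest option and can distort the distribution of a column; whether it is appropriate depends on why the values are missing.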

Visualization

Skills

Data Exploration in R (Hist, boxplot etc)
Uni, Bi and multivariate Viz
ggplot2
Histogram & Pie (Uni)
Tree and Tree Map
Scatter Plot
Line Charts
Survey Plot
Timeline
Decision Tree
D3.js
InfoVis
IBM ManyEyes
Tableau

Data mining and analysis

Machine Learning

Skills

Numerical Var
Categorical Var
Supervised Learning
Unsupervised Learning
Concepts, Inputs and Attributes
Training and Test Data
Classifier
Prediction
Lift
Overfitting
Bias and variance
Classification
	Trees and classification
	Classification rate
	Decision trees
	Boosting
	Naive Bayes Classifiers
	K-Nearest neighbour
Regression
	Logistic regression
	Ranking
	Linear regression
	Perceptron
Clustering
	Hierarchical clustering
	K-means clustering
Neural Networks
Sentiment analysis
Collaborative Filtering
Tagging

Text Mining / NLP

Skills

Corpus
Named Entity Recognition
Text Analysis
UIMA
Term Document Matrix
Term Frequency and weight
Support Vector Machines
Association rules
Market Basket Analysis
Feature Extraction
Use Mahout
Use Weka
Use NLTK
Classify Text
Vocabulary Mapping
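
The ideas of a corpus, a term-document matrix, and term frequency can be sketched with the standard library alone, before reaching for NLTK or similar tools. The three-sentence corpus and the simple letters-only tokenizer below are assumptions made for this example; real text usually needs more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical corpus of three tiny documents.
docs = ["The cat sat on the mat.", "The dog sat.", "Cats and dogs!"]


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


# One term-count table per document.
counts = [Counter(tokenize(doc)) for doc in docs]

# Sorted vocabulary across the whole corpus.
vocabulary = sorted(set().union(*counts))

# Term-document matrix: one row per term, one column per document.
matrix = [[count[term] for count in counts] for term in vocabulary]


def term_frequency(term, doc_index):
    """Occurrences of a term divided by the document's total token count."""
    total = sum(counts[doc_index].values())
    return counts[doc_index][term] / total


for term, row in zip(vocabulary, matrix):
    print(f"{term:5} {row}")
```

Because nothing here stems or normalizes words, `cat` and `cats` end up as separate rows, which is the kind of problem vocabulary mapping is meant to address.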

Big Data

Map reduce fundamentals
Hadoop
HDFS
Data Replication Principles
Setup Hadoop (IBM / Cloudera / HortonWorks)
Name & Data nodes
Job and task tracker
M/R Programming
Sqoop: Loading Data in HDFS
Flume, Scribe: For Unstructured Data
SQL with Pig
DWH with Hive
Scribe, Chukwa For Weblog
Using Mahout
Zookeeper, Avro
Storm: Hadoop Realtime
Rhadoop, RHIPE
rmr
Cassandra
MongoDB, Neo4j
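
Before setting up a cluster, the MapReduce fundamentals can be explored on a single machine. The sketch below imitates the map, shuffle, and reduce phases for the classic word count in plain Python. It is a teaching aid under stated assumptions, not Hadoop's API: the function names and the two input lines are invented for illustration, and everything runs in one process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


# Hypothetical input split into lines, as a mapper would receive it.
lines = ["big data big ideas", "data beats ideas"]

mapped = chain.from_iterable(map_phase(line) for line in lines)
word_counts = dict(
    reduce_phase(key, values) for key, values in shuffle(mapped).items()
)
print(word_counts)
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data across the network, but the contract of each phase is the same as in this sketch.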

General Resources:

Contribute

Please Share and Contribute Your Ideas -- it's Open Source!

A note on direction

This is an introduction geared toward those with at least a minimum understanding of programming, and (perhaps obviously) an interest in the components of Data Science (like statistics and distributed computing). Out of personal preference and need for focus, the curriculum assumes and mainly uses Python tools and resources, except where marked as R, Java etc.

datasciencemasters's People

Contributors

clarecorthell, jxx, dawny33
