
ida's Introduction

README

Ivaylo Petev and I use this repository to teach an undergraduate introduction to data analysis. The course is online.

If you are reading the course on its online pages, just replace the .html extension of a page with .R to download the underlying code.
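As a minimal sketch of that substitution in R (the address below is a hypothetical placeholder, not the course's actual URL):

```r
# Hypothetical page address; substitute the page you are actually reading.
page <- "http://example.org/ida/010_intro.html"

# Swap a trailing ".html" for ".R" to get the address of the underlying script.
script <- sub("\\.html$", ".R", page)
script
```

The resulting address can then be opened in a browser or passed to download.file().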

HOWTO

The course pages are formatted in R Markdown syntax and were converted to HTML with knitr 1.4:

install.packages("knitr")
citation("knitr")

The knitting routine is in the .Rprofile. To compile the whole course, set the IDA folder as your working directory and then run ida.build(), which takes a bit more than five minutes on a fiber-optic connection.

Other files are called from the code/ and data/ folders. Most datasets are downloaded on the fly if they are missing from the data/ folder, so make sure that you are online while running the scripts.
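The download-if-missing pattern described above might look roughly like the sketch below. The helper name get_data, its arguments and the example address are illustrative assumptions; this is not the function defined in the course's .Rprofile.

```r
# Download a dataset only when it is missing from the data/ folder.
# `get_data` and its arguments are illustrative, not the course's own code.
# `fetch` defaults to utils::download.file and can be swapped out for testing.
get_data <- function(file, url, dir = "data", fetch = utils::download.file) {
  path <- file.path(dir, file)
  if (!file.exists(path)) {
    # Create the folder if needed, then fetch the file in binary mode.
    dir.create(dir, showWarnings = FALSE, recursive = TRUE)
    fetch(url, destfile = path, mode = "wb")
  }
  path  # return the local path whether or not a download happened
}

# Usage (hypothetical URL, left commented out because it needs a network):
# csv <- read.csv(get_data("example.csv", "http://example.org/example.csv"))
```

Returning the local path in both cases lets the call sit directly inside read.csv() or similar, so a script behaves the same on the first and on later runs.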

The whole course was coded and taught with RStudio. The code was run on R 2.15.2, 2.15.3, 3.0.0 and 3.0.1, on a MacBook Air running OS X 10.8 and 10.9. Most plots use ggplot2 version 0.9.3.1 (noted here in case compatibility breaks at some point).

CREDITS

Thanks to the Sciences Po Reims staff, who offered invaluable support, and to the small group of students who enrolled in (and survived) the course. The R-2013-Lyon slides have a bit more detail on the practicals.

Bits and pieces of the code were posted to Gist, RPubs and Stack Overflow during development. Thanks to the great R developer and user communities that live online, among which we are now proud to count ourselves.

If you share the spirit of all this, you should consider joining the Foundation for Open Access Statistics and checking out places like OpenCPU, the Open Knowledge Foundation and other initiatives in open access, open data, open source and open science.

HISTORY

Aug 2013: better data management, with large or multiple-file datasets read from ZIP archives. Switched datasets to .csv thanks to GitHub.

Jul 2013: typos and broken links. Removed some functions in .Rprofile that are now part of the questionr package.

Jun 2013: first draft. Everything kind of works, Sessions 5-7 are unlisted, the code/ folder contains a few more exercises. That's it for now!

May 2013: added more course content and better resolution (100dpi) for all plots.

Apr 2013: added a lot of course content and cleaner plots. Also added the R-2013-Lyon folder for a conference presentation on the course.

Mar 2013: reviewed course structure: fewer files, more code, tons of new examples and exercises.

Feb 2013: more efficient .Rprofile functions and improved knitr routine, tidier code on the early sessions.

Jan 2013: first release.

First release: January 2013.
Last revised: August 2013.

ida's People

Contributors

ajdm, briatte, joeeoj

ida's Issues

04. Data

  • 4.0 Basic data operators
    • (Tidy) data frames, glimpse, rename and select, mutate, recode
    • Plain text, read_lines, stringr functions
    • JSON with jsonlite (or NDJSON)
    • XML/HTML trees with xml2
  • 4.1 Imports and Exports
    • Files
    • Databases
    • Files from Outer Space: Downloading a Google Spreadsheet
    • Files from Outer Space: Web Scraping with httr and rvest (and xml2)
  • 4.2. Reshapes and aggregates (N.B. things below are really case studies)
    • Split-Apply-Combine and tidiness
    • Reshaping: long and wide (always prefer long)
    • Aggregating (aggregate, group_by, summarise), slicing (apply, slice etc.)
    • Coercing to, and binding, data frames
  • 4.3. Practice
    • Currently: Estimates of Congressional ideology
    • >> MOVED from 4.2. Visualizing the U.S. housing market by city (Case-Shiller Index)
    • >> MOVED from 4.2. Visualizing U.S. homicide trends by weapon type
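To illustrate the reshaping and aggregating items in 4.2, here is a small base R sketch. The data frame is made up for the example and is not one of the course datasets; the outline also mentions group_by and summarise, which would be the dplyr counterpart of the aggregate() call.

```r
# Toy long-format data: one row per city and year (values are made up).
prices <- data.frame(
  city  = c("A", "A", "B", "B"),
  year  = c(2012, 2013, 2012, 2013),
  index = c(100, 110, 90, 99)
)

# Aggregate: mean index per city, using formula notation.
by_city <- aggregate(index ~ city, data = prices, FUN = mean)

# Reshape long to wide: one row per city, one column per year
# (columns are named index.2012 and index.2013).
wide <- reshape(prices, idvar = "city", timevar = "year", direction = "wide")

by_city
wide
```

The long format is what aggregate() consumes directly, which is one reason to prefer it as the default and widen only for presentation.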

TODO:

  • Include Twitter API example.

Notes:

  • Complex data object structures should be covered in Section 02 'Objects'

10. Maps

cartography + sf: http://rgeomatic.hypotheses.org/1149
linemaps: https://rgeomatic.hypotheses.org/1156

  • Keep an introductory part with simple choropleth maps in ggplot2
  • Keep geocoding example with ggmap + Google Maps
  • Also show Leaflet
    • include mention of leaflet.extras
    • include mention of leaflet.esri
  • Practice:

The list above was way more detailed, but it seems that my hard work a few days ago on detailing this issue got lost in the ether…

In brief, also use many of the posts by

Map projections:

https://xkcd.com/977/

Excellent multi-part tutorial by Bhaskar V. Karambelkar:

GeoSpatial Data Visualization in R

Some steps match those of Robin Lovelace:

https://github.com/Robinlovelace/Creating-maps-in-R

Bookmarks:

Cite:

Cite, French:

License

What license is this tutorial under?
I would like to reuse it for teaching purposes, and I am unclear if I can, e.g., leave out some of the examples in the forked version.

Thanks for the clarification.

Customer Churn Code

Dear Matt Dancho:

I am running your Customer Churn code, but am getting the following error, and cannot fix it. Can you help me please? Thank you.

# Run explain() on explainer
explanation <- lime::explain(
  x_test_tbl[1:10,],
  explainer = explainer,
  n_labels = 1,
  n_features = 4,
  kernel_width = 0.5
)

Error in dimnames<-.data.frame(*tmp*, value = list(n)) :
  invalid 'dimnames' given for data frame

03. Functions

  • 3.0 Functions, e.g.
    • Math
    • Data
    • Models – introduce formula notation
    • quickly mention plots, HTTP calls, whatever
  • 3.1 Control flow
    • The Human Narrative: # comments
    • The Computer Narrative: %>% pipe operator
    • Conditions: if/else, if_else, mutate_if, when
    • Sanity Checks: stopifnot – quickly mention unit tests for packages
  • 3.2 Iteration (with mentions of parallelization)
    • Loops: for, while – mention parallelization with foreach
    • Vectorization: sapply, lapply, mapply, map_* – parallelization with mclapply
    • Need for Speed? mention benchmarking
    • Reproducible Code:
      • Reproducibility = 10% code, 90% human documentation
      • Writing pseudo-makefiles for a data analysis folder
      • Makefiles + cron (with demo scraper)
  • Practice
    • Currently: Computing the Herfindahl-Hirschman Index
    • Add: Examples of calls using Stan, Julia, Python, shell?
    • Add: Game of Life example to show iteration and matrix computations (Petr Keil)
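The current practice item touches several points of this outline at once: a small function, a stopifnot sanity check, and vectorized iteration with sapply. A minimal sketch, assuming the index is computed as the sum of squared shares; the helper name hhi and the input values are made up and are not taken from the course code.

```r
# Herfindahl-Hirschman Index: sum of squared market shares.
# `hhi` is an illustrative helper, not the function used in the course.
hhi <- function(x) {
  stopifnot(is.numeric(x), all(x >= 0), sum(x) > 0)  # sanity checks
  shares <- x / sum(x)  # normalise raw sizes to shares that sum to 1
  sum(shares^2)
}

# Made-up sizes for three hypothetical markets.
markets <- list(monopoly = 10, duopoly = c(5, 5), fragmented = rep(1, 10))

# Vectorized iteration: apply hhi() to each element of the list.
result <- sapply(markets, hhi)
result
```

With shares on a 0-1 scale the index runs from 1/n for n equal players up to 1 for a monopoly; multiplying shares by 100 before squaring gives the 0-10,000 scale used in some applications.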

Use semantic versioning and release as 2.0.0

Let's assume that v1.0 was the Jan 2013 release, and that the Aug 2013 release was v1.1.

  • v1.2 -- just run the existing code properly
  • v1.3 -- start improvements
  • v2.0 -- new major 2017 release?

08. Models

TODO_FIRST: determine whether this section should be

  1. about modelling per se (in which case, show many models),
  2. or about general model classes (linear, nonlinear, hierarchical/multilevel, temporal/spatial effects and SE clustering, bootstrapped, Bayesian), plus tips and tricks (e.g. ggfortify, Zelig)

I'm slowly drifting towards Option 2, covering only the basic modelling stuff, and citing examples of text models (topic models), network models (ERGM, SAOM), etc.

  • 8.0. Linear models
    • Current example: Markus Gesmann's prediction of London Olympics 100m men's sprint results
  • 8.1. Linear correlation
    • Visualizing linear relationships
    • Measuring linear correlations
    • Correlation matrices
    • Scatterplot matrices
  • 8.2. Linear equations (changed title; also, not yet sub-sectioned)
    • Ordinary Least Squares (Legendre published the method of least squares in 1805.)
    • Results:
      • residuals
      • fitted values
    • Generalization, e.g.
      • to add dummies (show that)
      • or lagged values (leave it to Section on 'Time Series')
    • Presenting results:
      • Tables: texreg
      • Marginal FX plots (margins)
  • 8.3. Advanced Modelling (leave anything to do with 'Time Series' or 'Networks')
    • Nonlinear equations
    • Corrected standard errors
      • Robust SEs (jackknife, sandwich), FE, RE
      • Bootstrapped SEs
    • Quick word on a few model 'classes'
      • Spatial / Gravity
      • Econometrics: 2SLS, DiD, Oaxaca decomposition
      • Lasso, regularization
      • Machine Learning, random forests, neural networks…
    • Bayesian models with Stan

Note: Section 8.3. really should be a collection of examples.

References:

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning
  • Shalizi, Advanced Data Analysis from an Elementary Point of View (ADAEPoV)
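For the items in 8.1 and 8.2 (linear correlation, OLS, residuals, fitted values), a minimal sketch with base R could look like this. It uses the built-in cars dataset as a stand-in; it is not the Olympics sprint example currently used in the section.

```r
# Fit an ordinary least squares model on the built-in `cars` dataset,
# using formula notation: stopping distance as a function of speed.
fit <- lm(dist ~ speed, data = cars)

coef(fit)             # intercept and slope
head(fitted(fit))     # fitted values
head(residuals(fit))  # residuals

# Fitted values and residuals add back up to the observed outcome.
all.equal(unname(fitted(fit) + residuals(fit)), cars$dist)

# Linear correlation between the two variables; in a simple regression
# with an intercept, its square equals the model's R-squared.
r <- cor(cars$speed, cars$dist)
all.equal(r^2, summary(fit)$r.squared)
```

Adding a dummy or another predictor only changes the right-hand side of the formula, which is why introducing formula notation early pays off here.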
