Code Monkey home page Code Monkey logo

s15_zhu_data_lecture_notes's Introduction

Lecture Notes for Data Engineering w/ Ken Anderson

###GitHub Repo

Anderson is asking us to make this as a first assignment. I'm excited for the class.

####General Flow of Class:

Thursday:

  • New Area
  • Intro
  • Looking up frameworks

Tuesday: Presentations

####Data analytics

  • Sampling ... Machine Learning
  • Statistics largely based on sampling. If you already know the answer, not really statistics.

Storage

  • Data modeling is very hard - i.e. data formatting in facebook. New posts will have a different format than the old posts. They wont convert the old posts into the new data type.
  • NoSQL
    • Graph
    • Key-Value Stores
    • Columnar

####Twitter

  • Twitter Clients -> Twitter <- Tweet
  • Sometimes twitter clients use different encoding than utf-8
  • Twitter uses utf-8
  • End tweet is half utf-8 and half different encoding (hard to perform analytics on)
    • i.e. python would expect it to be all utf-8, blows up if different

####Info Viz

  • D3 (visualiztion - graphs)
  • excel / R (data analysis)

####Data Lifecycle 0. Question

  • Curation / Triage Persistence - prioritization of data sources
  • Which source will give me the answer?
  1. Collection / Generation
  2. Cleanup
  3. Storage
  4. Processing / Analysis
  5. Query / visualize / ACT (data transformed to knowledge we can act upon)
  • This usually gives you new questions

This lifecycle is in many startup companies that do big data Teams of people within each step.

####Request Response Cycle http:// http request methods

  • GET
  • POST
  • PUT
  • DELETE

Some request comes out, associated with url server recieves url maps request to file and returns back a response

Lecture 2 - 1/15/15

Markup

Type of Markup Language

Two Types:

  • Standard Markdown (SM)
  • GitHub Flavored Markdown (GFM)
What Can I Do With Markdown
  • Style, word formatting
  • Images, Links
  • Code Blocks

Review

Web Browser --> Web Server, Map Server --> handler (usually separate entity)

REST - Representation State Transfer

  • Resources
    • URI (URL is a subclass of URI)
    • CRUD
      • Create
      • Read
      • Update
      • Destroy

HOMEWORK

DATABASE?

ids? input and output and ERRORS??

##Restful Web Services

  • Architectural style for web services
    • Invented by Roy Fielding
  • Approach to developing web services
  • Service provides access to a linked set of resources
  • For each resource you can perform operations on it simliar to the main operations

API Examples

GET /api/1.0/users
//Retrieve list of users

GET /api/1.0/users/0
//Retrieve details of user 0

POST /api/1.0/users
//create a new user
  • /api/1.0 is convention so that you can update your api
    • i.e. /api/1.1/users
    • keep your other api url running for transition
PUT /api/1.0/users/0
Update user 0

DELETE /api/1.0/users/0
Delete user 0

GET /api/1.0/search?q=tattersail
Perform a search with the query tattersail.

Discussion - Request Response cycle

  • Each operation may produce a result - Synchronously
    • With RESTful services, JSON format is king
  • JSON - highly compressible
  • Asynchronously - Use AJAX calls.
  • POST and PUT methods typically send data
    • Also in JSON format
    • May be in the URL or in the body of the HTTP Request
      • GET, data may appear as query params
  • Other formats are possible: HTML and XML are typical
  • If a request needs to be authenticated
    • the authentication data appears in HTTP headers
    • i.e. used for DELETE - dont want just anyone to be able to delete users

Discussion - How do you think operations on two resources are handled?

GET /api/1.0/posts/0/comments/1
Get the first comment on post 0

POST /api/1.0/posts/0/comments
Create a new comment on post 0

Issues

  • Security: How do you authenticate users?
  • Identity: How are ids assigned to resources?
  • Failure: How do we handle failure sitatuions?
    • In example, handle it in JSON
    • Could have used http status codes
    • Most services use combination of both
  • Persistence: How are resources stored?

EXAMPLE

  • Contacts web service
  • implemented in both ruby and javascript
  • technologies used:
    • Sinatra
    • Rspec - Ruby testing
    • Typhoeus - http requests (libcurl wrapper)
    • Node - wrapper around chrome js engine (v8)
    • Express - alike sinatra
def handle_request(method, uri, data = nil)
    url      = "#{base_uri}/#{uri}"
    info     = { "Content-Type" => "application/json" }
    params   = {method: method, body: data, headers: info }
    request  = Typhoeus::Request.new(url, params)
    request.run
    response = request.response

    raise NotConnected if response.code == 0 # not connected
    if response.code == 200 # successful request response cycle
      result = JSON.parse(response.body) 
      #puts result.inspect
      if result["status"]
        yield result["data"]
      else
        raise FailureResult.new(result["error"])
      end
    else
      raise NotOk
    end
    raise ServiceError
  end

Lecture 4 - 1/22/15

Git Presentation

  • Stages of file: Untracked, Unmodified, Modified, Staged, Remote
  • git
    • init - initialize
    • clone - clone a repo
    • branch - new branch
    • checkout - move into another branch
    • add - add files
    • commit - commits stuff
    • merge - merge a branch with another branch
  • Resolve Conflicts
    • find conflicts
    • <<<<< start
    • ===== differnces
    • end of conflicts

GitHub Presentation

  • Pull requests are new for GitHub
  • Forking creates repo under your own head
  • Cloning create repo with collaborators of that repo community

####NODE.JS

  • Most code in node is pacckaged inside of a module
  • http is a core module, provided by Node itself
  • npm (node package manager) - gives access to extend node
// Load the http module to create an http server.
var http = require('http');

// Configure our HTTP server to respond with Hello World to all requests.
var server = http.createServer(function (request, response) {
  response.writeHead(200, {"Content-Type": "text/plain"});
  response.end("Hello World\n");
});

// Listen on port 8000, IP defaults to 127.0.0.1
server.listen(8000);

// Put a friendly message on the terminal
console.log("Server running at http://127.0.0.1:8000/");

Express

  • Web application framework written in javascript for use in Node.js

Express example

  • In class demo...

AngularJS

Client-side web application framework written in Javascript for use in most web browsers

Core Concepts

  • Data Binding

    • The value of an HTML tag can be associated with a model object. When one changes, Angular updates the other automatically.
  • Controllers

    • State and methods that can be accessed within that section of the page
    • Can modularize your web app and decompose data and functionality into small, manageable chunks
  • Services

  • Directives

  • Embeddeble

  • Injectable

this Keyword

  • the value of this is determined on how you call it.
  • implicit binding, explicit binding, etc.

requireJS

  • kicks off loading of app specific code
  • requireJS then starts to look at dependencies and starts to execute them in order
  • requireJS can configure things like:
    • bootstrap
    • angular
    • ngRoute
    • jQuery
  • requireJS also lets you load js that you wrote such as:
    • routes.js
    • /controllers

Twitter - Consumer Keys, Access Tokens

  • OAuth allows app to act on behalf of users
    • i.e. post tweets etc
  • Might ship your application with consumer keys.
  • When user launches app, send them to twitter to login and grant access
  • Twitter will then send an access token/secret for application to store on behalf

TwitterRequest

  • is the top level class of a tweet
  • contains helpers:
    • logging
    • rates
    • params
    • props

Contract

  • TwitterRequest has a public collect method that yields data back to its caller.
  • Subclasses of twitterequest need:
    • url
    • request_name
    • twitter_endpoint -> rate limits
    • success -> handler for successful response for twitter
  • Subclasses may implement:
    • error
    • authorization

Rates

  • ensures that rates are checked on each request
def make_request
  check_rates
  request = Typhoeus::Request.new(url, options)
  log.info("REQUESTING: #{request.base_url}?#{display_params}")
  response = request.run
  @rate_count = @rate_count - 1
  response
end

def check_rates
  refresh_rates if @@rates.size == 0
  refresh_rates if Time.now > twitter_window
  return if @rate_count > 0
  delta = twitter_window - Time.now
  log.info "Sleeping for #{delta} seconds"
  sleep delta
  refresh_rates
end

Web Analytics Application - Scaling Problems

  • Problem: Can't keep up with write demand, which affects read requests
  • Not a perfect solution: Batch updates to database after getting some amount requests.
    • Add a queue between web server and a database. Attach a signle worker to queue and that will start working on the queue.
    • Problem -> even if you have 20 different workers you still have a bottleneck, the database. It cannot handle the load that the workers want to do.
  • Not a perfect solution: Vertically scaling -> will fail at some point and takes a lot of money.
    • Doesn't solve underlying problem i.e. one machine that is the bottle neck
  • In relational world -> to solve, shard the databse.
    1. need multiple copies of the database
    2. then partition your data across those databases with a partition strategy
    • Often take an Md5 hash of some aspect of the input data and then mod that value by the number of shards
    • Then write data to indicated shard
    • Do the same thing for reads to locate the data needed to fill request
    1. NOTE: need a good hash function that distributes the reads and writes evenly
  • Problems:
    • Sharding is application level -> you have to manage number of shards
    • If shards changes, have to remap entire db and turn your app off.
    • if you make a mistake when resharding, takes time to fix.

NoSQL to the rescue

  • NoSQL db avoid mutable data
    • Can't lose correct data because once written, it is immutable and cant be updated
    • if a val changes write a new immutable copy
  • Fault tolerance
    • if disk error occurs, NoSQL db switches to its replica automatically
    • reshards db automatically
    • when old machine comes back and it reshards again.
    • can expect performance to go down while rebalancing occurs

Types of NoSQL DBs

  • Key Value
  • Graphs
  • Columnar
  • Documents

Key Value

  • Key value is a simple database that when presented with a string (key) returns an arbitrarily large set of data (value)
  • Have no query language. Just act like hash tables.
  • Vals are untyped, you can store any type of data in these databases
  • Benefits -> simplicity

Graph Stores

  • store graph structures rather than table/row/column
  • probide structural query languages
    • examples: find all pairs of person nodes who have at least 3 children together, live in CO, married more than 15 years
  • provide ability to do graph traversals efficiently
  • Examples -> Neo4J, Titan, Infinite Graph, Info Grid

Columnar Store

  • Column family stores
  • able to scale to enormous amounts of data
  • often able to achieve very fast writes, while also maintaining reasonable read performance
    • Column Family: table of related data
    • Colum families consist of rows that have unique row keys
    • Rows consist of columns (potentially millions of them)
    • Columns consist of a key and a value
    • Value itself might be a JSON map that in turn has keys and values
  • Hash tables all the way down
  • tries its best to keep a whole row on disc so that its a single stream -> only thing holding it down network latency and disc latency

Document stores

  • Like key-value but a little more structure
  • insert documents ( a bag of key-value pairs )
  • each document gets indexed in a variety of ways
  • docs can be found via queries on any attr
  • documents can be grouped into collecitons
  • collections can be grouped into databases
  • Each database is then used by a particular applicaiton to get its work done

CouchDB

  • Document Database
    • Implemented in Erlang. Lot of use in telecommunications
    • Massive concurrency, fault tolerance, distributed systems
    • All of these features are on display in the design of CouchDB
  • CouchDB's design embraces the web
    • High availability trades consistency for EVENTUAL consistency

Document Model

  • Document databases: self-contained data
  • CouchDB stores documents
  • Each document contains everything that might be needed by an application.
    • Avoid foreign keys etc, each document meant to stand on its own.
    • Usually no references
  • No schema enforced, each document can have a different set of attributes

CAP Theorem - PICK 2

  • Consistency -> All clients see the same data even in the presence of concurrent updates.
  • Availability -> All clients able to read or write the data store when they want
  • Partition Tolerance -> DB can be split across multiple servers

Choices

  1. Consistency and Availability -> what relational dbs provide, low partition tolerance.
  2. Availiability and Partition Tolerance -> Provides the ability to scale horizontally and always be abailible for requests.
  • Can only guarantee eventual consistency
  • 3 clients issue same query and get different results -> can design around this
  • i.e. don't really need total consistency all the time (facebook likes)
  1. Consistency and Partition Tolerance -> provide consistency across multiple dbs, but not always available for client requests

Couch DB choose the second option

Specifics

  • CouchDB uses B-tree storage engine.
    • allows searches, insertions, and deletions in log time.
  • Employs MapReduce over B-Tree to compute views of the data allowing for parallel and incremental computation.
  • No Locking
    • same idea that git uses -> multiple users can edit a repo.
    • Each read of a doc returns a version of the document that was the latest when read started
    • document can be written while it is being read, the next read will then return the new document
  • Validation -> validation functions can be written in javascript
    • Each time an update for a document is submitted the proposed change is passed to the validation function
    • the validation function then chooses to accept or deny the update
  • Merge Conflicts
    • Can encounter merge conflicts.

Lecture 14 - 2/17/15

MongoDB Student Presentation

MongoDB Specifics

Indexes enabling queries

MongoDB supports full text searching

Can weight some search factors more than others

OR, phrase matches, negations.... can get fancy and complicated

GeoJSON can specify points, lineStrings or Polygons.

geojson.io can be used to create a bounding box, integrated into queries

Map Reduce - mongo offeres the ability to create new collections from old celltions using a map function and a reduce function. It can also take a query, if so, it first finds the results of the query and then applies MapReduce to the result set.

Basic map function that can be used to count up the number of times a screen name appears in a collection

function () {
  emit(this.user.screen_name, {count: 1});
}

A reduce function takes a key and a set of produced documents that were previously emited It must return a document that can be sent back into the reduce function in a subsequent phase For our example, that means it must generate documents of the form {count: X} where x is a number

function (key, docs) {
  var total = 0;
  for (var i = 0; i < docs.length; i++) {
    total += docs[i].count;
  }
  return {count : total};
}

End Goal: At the end of the operation, you will have a set of documents in a new collection. Each document will have one key (in this case a screen_name) and a value that is equal to the number of times it appeared in the collection.

Lecture 18 - 3/17/15

Solr

Simple way to get search functionality for your DB. Works well with ruby. Fast, reliable, and has many scalable features.

On top of Lucene (which uses Java, portable) as a RESTful web server API.

When integrating into the model, you can add searchable fields for the classes. Controller can also get pagination. The view is where the return values of the query are listed and can be interacted with using a search box.

Ruby convention over configuration.

Pitfalls: Solr reindexes from the beginning, started by default in production environment, solr configs and git

Redis

Key-value store DB.

Memory-oriented. Fast.

Keys: string, concise, max 512MB

Values: strings, can contain any info, max 512MB

Can contain multiple types. The data is loosely typed.

Features: sorted sets, bitmaps, hyperloglogs, persistence (snapshotting or append-only file)

Kafka

Helpful for processing data coming over the network in real-time.

Distributed, fault-tolerant, high-throughput, publish-subscribe, messaging system.

Runs on Apache ZooKeeper, written in Scala. Keeps all messages for up to N days.

Similar to RabbitMQ, Flume, database, Redis pub/sub, supercomputer.

Lecture 19

###Something here...

Lecture 20

Neo4J

Graph Database: nodes and edges

Relationships can have properties, can be single or bi-directional

More likely to run on single server than cluster.

Works well with Germlin and Cypher

Aggregated oriented databases - Neo4J is not one which is why it's hard to shard

Apache HBase

Runs on top of Hadoop (MapReduce and HDFS).

Opensource column-oriented Hadoop Database. Influenced by Google file system -> Google big table.

Schemas contain column families which contain columns. Columns within column families are contiguous.

Great for scalability

Lecture 21

Riak

Cassandra

Fast, scalable, growing in popularity

Lecture 22

JS

I know nothing about JS

Scope, Modules, explict stuff in JS

Ruby on Rails

Just a huge live demo.

Lecture 23

Hadoop

HDFS Architecture bunch of nodes

Spark

Alongside Hadoop

Rest are just student presentations...

s15_zhu_data_lecture_notes's People

Contributors

zhued avatar

Watchers

 avatar

s15_zhu_data_lecture_notes's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.