###GitHub Repo
Anderson asked us to create this repo as our first assignment. I'm excited for the class.
####General Flow of Class:
Thursday:
- New Area
- Intro
- Looking up frameworks
Tuesday: Presentations
####Data analytics
- Sampling ... Machine Learning
- Statistics is largely based on sampling. If you already know the answer, it is not really statistics.
- Data modeling is very hard - e.g. data formatting at Facebook. New posts will have a different format than the old posts. They won't convert the old posts into the new data type.
- NoSQL
- Graph
- Key-Value Stores
- Columnar
- Twitter Clients -> Twitter <- Tweet
- Sometimes Twitter clients use an encoding other than UTF-8
- Twitter uses UTF-8
- The resulting tweet is half UTF-8 and half another encoding (hard to perform analytics on)
- e.g. Python would expect it to be all UTF-8 and blows up if it is not
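The mixed-encoding problem above can be caught before any analysis runs. A minimal sketch in JavaScript (Node), assuming the tweet arrives as raw bytes; `isValidUtf8` is a hypothetical helper name and the sample bytes are made up for illustration:

```javascript
// Returns true only if every byte sequence in `bytes` is valid UTF-8.
// TextDecoder with { fatal: true } throws a TypeError on malformed input
// instead of silently substituting a replacement character.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// "café" encoded entirely as UTF-8 ("é" is the two bytes 0xC3 0xA9).
const clean = Buffer.from("café", "utf8");

// "caf" in UTF-8 followed by "é" as the single Latin-1 byte 0xE9:
// the kind of half-and-half payload described above.
const mixed = Buffer.concat([Buffer.from("caf", "utf8"), Buffer.from([0xe9])]);

console.log(isValidUtf8(clean)); // true
console.log(isValidUtf8(mixed)); // false
```

Checking up front lets the cleanup step decide what to do with a bad tweet (skip it, re-decode it, log it) instead of crashing mid-analysis.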
####Info Viz
- D3 (visualization - graphs)
- Excel / R (data analysis)
####Data Lifecycle
- Question (step 0)
- Curation / Triage Persistence - prioritization of data sources
- Which source will give me the answer?
- Collection / Generation
- Cleanup
- Storage
- Processing / Analysis
- Query / visualize / ACT (data transformed to knowledge we can act upon)
- This usually gives you new questions
This lifecycle shows up in many startup companies that do big data, with teams of people working on each step.
####Request Response Cycle
http:// - HTTP request methods:
- GET
- POST
- PUT
- DELETE
A request goes out, associated with a URL. The server receives the URL, maps the request to a file, and returns a response.
Markdown is a type of markup language.
Two Types:
- Standard Markdown (SM)
- GitHub Flavored Markdown (GFM)
- Style, word formatting
- Images, Links
- Code Blocks
Web Browser --> Web Server, Map Server --> handler (usually separate entity)
- Resources
- URI (URL is a subclass of URI)
- CRUD
- Create
- Read
- Update
- Destroy
DATABASE?
ids? input and output and ERRORS??
##Restful Web Services
- Architectural style for web services
- Invented by Roy Fielding
- Approach to developing web services
- Service provides access to a linked set of resources
- For each resource you can perform operations on it similar to the main CRUD operations
```
GET /api/1.0/users
// Retrieve list of users
GET /api/1.0/users/0
// Retrieve details of user 0
POST /api/1.0/users
// Create a new user
```
- The /api/1.0 prefix is a convention that lets you version your API
- e.g. /api/1.1/users
- keep the old API URL running during the transition
```
PUT /api/1.0/users/0
// Update user 0
DELETE /api/1.0/users/0
// Delete user 0
GET /api/1.0/search?q=tattersail
// Perform a search with the query tattersail.
```
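The routes above map HTTP methods onto CRUD operations. A rough sketch of that mapping in JavaScript, assuming an in-memory array instead of a real database and a hypothetical `handle` function that returns a result instead of writing to a real HTTP response (the sample users are made up):

```javascript
// Sample data, made up for illustration. Ids are array positions.
const users = [{ name: "tattersail" }, { name: "paran" }];

// Maps an HTTP method + path onto a CRUD operation on `users`.
function handle(method, path, data) {
  const url = new URL(path, "http://localhost"); // base is only needed for parsing
  const [api, version, resource, id] = url.pathname.split("/").filter(Boolean);
  if (api !== "api" || version !== "1.0") return { status: 404, body: null };

  if (resource === "search" && method === "GET") {
    const q = url.searchParams.get("q") || "";
    return { status: 200, body: users.filter((u) => u && u.name.includes(q)) };
  }
  if (resource !== "users") return { status: 404, body: null };

  if (id === undefined) {
    if (method === "GET") return { status: 200, body: users.filter(Boolean) }; // Read all
    if (method === "POST") {                                                   // Create
      users.push(data);
      return { status: 201, body: { id: users.length - 1 } };
    }
    return { status: 405, body: null };
  }

  const index = Number(id);
  if (!users[index]) return { status: 404, body: null };
  if (method === "GET") return { status: 200, body: users[index] };            // Read one
  if (method === "PUT") {                                                      // Update
    users[index] = data;
    return { status: 200, body: users[index] };
  }
  if (method === "DELETE") {                                                   // Destroy
    users[index] = null; // keep the other ids stable
    return { status: 204, body: null };
  }
  return { status: 405, body: null };
}

console.log(handle("GET", "/api/1.0/users/0").body);                    // { name: 'tattersail' }
console.log(handle("POST", "/api/1.0/users", { name: "kalam" }).body);  // { id: 2 }
console.log(handle("GET", "/api/1.0/search?q=tattersail").body.length); // 1
```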
- Each operation may produce a result - Synchronously
- With RESTful services, JSON format is king
- JSON - highly compressible
- Asynchronously - Use AJAX calls.
- POST and PUT methods typically send data
- Also in JSON format
- May be in the URL or in the body of the HTTP Request
- GET, data may appear as query params
- Other formats are possible: HTML and XML are typical
- If a request needs to be authenticated
- the authentication data appears in HTTP headers
- e.g. for DELETE - you don't want just anyone to be able to delete users
```
GET /api/1.0/posts/0/comments/1
// Get the first comment on post 0
POST /api/1.0/posts/0/comments
// Create a new comment on post 0
```
- Security: How do you authenticate users?
- Identity: How are ids assigned to resources?
- Failure: How do we handle failure situations?
- In the example, failures are reported in the JSON body
- Could have used HTTP status codes
- Most services use a combination of both
- Persistence: How are resources stored?
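Combining HTTP status codes with a JSON-level status might look roughly like this on the client side. This is a sketch, not the class implementation: the `status` / `data` / `error` field names follow the Ruby client shown later in these notes, and `unwrap` is a hypothetical helper name.

```javascript
// Error type for failures the service reports inside a 200 response.
class FailureResult extends Error {}

// statusCode: HTTP status of the response; rawBody: response body as a string.
// Returns the "data" payload, or throws to show which layer failed.
function unwrap(statusCode, rawBody) {
  if (statusCode !== 200) {
    throw new Error(`HTTP error: ${statusCode}`); // transport-level failure
  }
  const result = JSON.parse(rawBody);
  if (!result.status) {
    throw new FailureResult(result.error); // application-level failure
  }
  return result.data;
}

console.log(unwrap(200, '{"status": true, "data": {"id": 0}}')); // { id: 0 }

try {
  unwrap(200, '{"status": false, "error": "no such user"}');
} catch (err) {
  console.log(err instanceof FailureResult, err.message); // true no such user
}
```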
- Contacts web service
- implemented in both Ruby and JavaScript
- technologies used:
- Sinatra
- RSpec - Ruby testing
- Typhoeus - HTTP requests (libcurl wrapper)
- Node - runtime built around Chrome's JavaScript engine (V8)
- Express - similar to Sinatra
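```ruby
require "json"
require "typhoeus"

# Send a JSON request to the service and yield the "data" payload on success.
# NotConnected, NotOk, FailureResult, and ServiceError are the service's own
# exception classes, defined elsewhere; base_uri is also defined elsewhere.
def handle_request(method, uri, data = nil)
  url     = "#{base_uri}/#{uri}"
  info    = { "Content-Type" => "application/json" }
  params  = { method: method, body: data, headers: info }
  request = Typhoeus::Request.new(url, params)
  request.run
  response = request.response
  raise NotConnected if response.code == 0 # not connected
  raise NotOk unless response.code == 200  # anything but 200 is a failure
  result = JSON.parse(response.body)       # successful request response cycle
  if result["status"]
    yield result["data"]
  else
    raise FailureResult.new(result["error"])
  end
rescue JSON::ParserError
  # Assumption: the trailing `raise ServiceError` in the notes belonged to a
  # rescue clause; as written it would also fire after every successful yield.
  raise ServiceError
end
```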
- Stages of file: Untracked, Unmodified, Modified, Staged, Remote
- git
- init - initialize
- clone - clone a repo
- branch - new branch
- checkout - move into another branch
- add - add files
- commit - record staged changes
- merge - merge a branch with another branch
- Resolve Conflicts
- find conflicts
- `<<<<<<<` marks the start of a conflict
- `=======` separates the two differing versions
- `>>>>>>>` marks the end of a conflict
- Pull requests are new for GitHub
- Forking creates a copy of the repo under your own account
- Cloning creates a local copy of a repo that you share with that repo's collaborators
####NODE.JS
- Most code in Node is packaged inside of a module
- http is a core module, provided by Node itself
- npm (node package manager) - gives access to packages that extend Node
```javascript
// Load the http module to create an HTTP server.
var http = require('http');

// Configure our HTTP server to respond with Hello World to all requests.
var server = http.createServer(function (request, response) {
  response.writeHead(200, {"Content-Type": "text/plain"});
  response.end("Hello World\n");
});

// Listen on port 8000 on the loopback address 127.0.0.1.
// (Without the host argument, Node accepts connections on all interfaces.)
server.listen(8000, "127.0.0.1");

// Put a friendly message on the terminal
console.log("Server running at http://127.0.0.1:8000/");
```
- Web application framework written in JavaScript for use in Node.js
- In class demo...
Client-side web application framework written in JavaScript for use in most web browsers
- Data Binding
  - The value of an HTML tag can be associated with a model object. When one changes, Angular updates the other automatically.
- Controllers
  - State and methods that can be accessed within that section of the page
  - Can modularize your web app and decompose data and functionality into small, manageable chunks
- Services
- Directives
- Embeddable
- Injectable
- The value of `this` is determined by how the function is called.
- implicit binding, explicit binding, etc.
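A small sketch of how the call site decides `this`. The object and function names are made up for illustration:

```javascript
function whoAmI() {
  "use strict"; // in strict mode an unbound `this` is undefined, not the global object
  return this === undefined ? "nobody" : this.name;
}

const alice = { name: "alice", whoAmI };
const bob = { name: "bob" };

console.log(alice.whoAmI());     // implicit binding: "alice"
console.log(whoAmI.call(bob));   // explicit binding: "bob"
console.log(whoAmI.bind(bob)()); // hard binding via bind: "bob"

const detached = alice.whoAmI;   // the reference loses its object
console.log(detached());         // default binding: "nobody"
```

The same function body gives four different answers; only the way it is called changes.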
- kicks off loading of app specific code
- requireJS then starts to look at dependencies and starts to execute them in order
- requireJS can configure things like:
- bootstrap
- angular
- ngRoute
- jQuery
- requireJS also lets you load js that you wrote such as:
- routes.js
- /controllers
- OAuth allows an app to act on behalf of users
- e.g. post tweets
- You might ship your application with consumer keys.
- When the user launches the app, send them to Twitter to log in and grant access
- Twitter will then send an access token/secret for the application to store on the user's behalf
- TwitterRequest is the top-level class for a request to Twitter
- contains helpers:
- logging
- rates
- params
- props
- TwitterRequest has a public collect method that yields data back to its caller.
- Subclasses of TwitterRequest need:
- url
- request_name
- twitter_endpoint -> rate limits
- success -> handler for a successful response from Twitter
- Subclasses may implement:
- error
- authorization
- make_request ensures that rates are checked on each request
```ruby
# Issue one request, making sure the rate limit is respected first.
def make_request
  check_rates
  request = Typhoeus::Request.new(url, options)
  log.info("REQUESTING: #{request.base_url}?#{display_params}")
  response = request.run
  @rate_count -= 1
  response
end

# Block until a request is allowed in the current rate-limit window.
def check_rates
  refresh_rates if @@rates.size == 0
  refresh_rates if Time.now > twitter_window
  return if @rate_count > 0
  delta = twitter_window - Time.now
  log.info "Sleeping for #{delta} seconds"
  sleep delta
  refresh_rates
end
```
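The decision inside `check_rates` can be isolated as a pure function, which makes it easy to test without sleeping. A sketch in JavaScript; `waitMs` and its parameters are hypothetical names, not part of the class code:

```javascript
// How long to wait before the next request is allowed, in milliseconds.
//   remaining    - requests left in the current rate-limit window
//   windowEndsAt - timestamp (ms) at which the window resets
//   now          - current timestamp (ms)
function waitMs(remaining, windowEndsAt, now) {
  if (now > windowEndsAt) return 0; // window already reset: refresh and go
  if (remaining > 0) return 0;      // still have budget in this window
  return windowEndsAt - now;        // out of budget: sleep until the reset
}

console.log(waitMs(5, 10000, 4000));  // 0
console.log(waitMs(0, 10000, 4000));  // 6000
console.log(waitMs(0, 10000, 12000)); // 0
```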
- Problem: Can't keep up with write demand, which affects read requests
- Not a perfect solution: batch updates to the database after collecting some number of requests.
- Add a queue between the web server and the database. Attach a single worker to the queue, and it will start working through the queue.
- Problem -> even if you have 20 different workers you still have a bottleneck: the database. It cannot handle the load that the workers generate.
- Not a perfect solution: vertical scaling -> will fail at some point and costs a lot of money.
- Doesn't solve the underlying problem, i.e. one machine is the bottleneck
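The queue-plus-batching idea above can be sketched in a few lines. This is an in-memory illustration under simple assumptions: `BatchingQueue` is a hypothetical class, and `writeBatch` stands in for a real bulk insert into the database.

```javascript
// Collects writes and flushes them to the database in batches.
class BatchingQueue {
  constructor(batchSize, writeBatch) {
    this.batchSize = batchSize;
    this.writeBatch = writeBatch; // stand-in for a real bulk-insert call
    this.pending = [];
  }

  // Called by the web server for every incoming write request.
  enqueue(record) {
    this.pending.push(record);
    if (this.pending.length >= this.batchSize) this.flush();
  }

  // Called by the worker: one database round trip for many records.
  flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.writeBatch(batch);
  }
}

const batches = []; // pretend database: records which batches were written
const queue = new BatchingQueue(3, (batch) => batches.push(batch));
for (let i = 0; i < 7; i++) queue.enqueue({ id: i });
queue.flush(); // write the leftover partial batch

console.log(batches.map((b) => b.length)); // [ 3, 3, 1 ]
```

Seven writes become three database round trips, which eases the load but, as the notes say, leaves the single database as the bottleneck.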
- In the relational world -> to solve this, shard the database.
- need multiple copies of the database
- then partition your data across those databases with a partition strategy
- Often take an MD5 hash of some aspect of the input data and then mod that value by the number of shards
- Then write data to indicated shard
- Do the same thing for reads to locate the data needed to fill request
- NOTE: need a good hash function that distributes the reads and writes evenly
- Problems:
- Sharding is application level -> you have to manage the number of shards
- If the number of shards changes, you have to remap the entire DB and turn your app off.
- If you make a mistake when resharding, it takes time to fix.
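The hash-then-mod partition strategy, and the resharding problem it causes, can be sketched with Node's built-in `crypto` module. The `shardFor` name and the `user:N` keys are assumptions made for illustration:

```javascript
const crypto = require("crypto");

// Pick a shard by hashing the key with MD5 and taking the result mod shardCount.
function shardFor(key, shardCount) {
  const digest = crypto.createHash("md5").update(key).digest(); // 16-byte Buffer
  return digest.readUInt32BE(0) % shardCount; // first 4 bytes as an unsigned int
}

// Hypothetical keys, e.g. user ids.
const keys = Array.from({ length: 1000 }, (_, i) => `user:${i}`);

// Reads use the same function as writes, so a key always maps to the same shard.
console.log(shardFor("user:42", 4) === shardFor("user:42", 4)); // true

// Count how many keys land on a different shard when going from 4 to 5 shards.
const moved = keys.filter((k) => shardFor(k, 4) !== shardFor(k, 5)).length;
console.log(`${moved} of ${keys.length} keys would have to move`);
```

With plain mod-N placement, a key keeps its shard when going from 4 to 5 shards only if its hash mod 20 is below 4, so roughly four out of five keys move. That is why changing the shard count means remapping nearly the entire database.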
- NoSQL DBs avoid mutable data
- Can't lose correct data because once written, it is immutable and can't be updated
- if a value changes, write a new immutable copy
- Fault tolerance
- if a disk error occurs, the NoSQL DB switches to its replica automatically
- reshards the DB automatically
- when the old machine comes back, it reshards again.
- can expect performance to go down while rebalancing occurs
- Key Value
- Graphs
- Columnar
- Documents
- A key-value store is a simple database that, when presented with a string (key), returns an arbitrarily large set of data (value)
- Has no query language; acts just like a hash table.
- Values are untyped, so you can store any type of data in these databases
- Benefits -> simplicity
- Graph databases store graph structures rather than table/row/column
- provide structural query languages
- examples: find all pairs of person nodes who have at least 3 children together, live in CO, married more than 15 years
- provide ability to do graph traversals efficiently
- Examples -> Neo4J, Titan, InfiniteGraph, InfoGrid
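To make "graph traversal" concrete, here is a breadth-first traversal over an in-memory adjacency list. It is a plain-JavaScript sketch, not how any of the databases above are queried; the `follows` graph and the `reachable` function are made up for illustration.

```javascript
// Adjacency list: node -> neighbours.
const follows = new Map([
  ["ann", ["bob", "cat"]],
  ["bob", ["dan"]],
  ["cat", ["dan"]],
  ["dan", ["eve"]],
  ["eve", []],
]);

// Breadth-first traversal: every node reachable from `start` within `maxHops`.
function reachable(graph, start, maxHops) {
  const hops = new Map([[start, 0]]); // node -> distance from start
  const queue = [start];
  while (queue.length > 0) {
    const node = queue.shift();
    if (hops.get(node) === maxHops) continue; // do not expand past the limit
    for (const next of graph.get(node) || []) {
      if (!hops.has(next)) {
        hops.set(next, hops.get(node) + 1);
        queue.push(next);
      }
    }
  }
  hops.delete(start);
  return [...hops.keys()];
}

console.log(reachable(follows, "ann", 2)); // [ 'bob', 'cat', 'dan' ]
```

A graph database stores the edges directly, so each hop is a pointer lookup rather than a join across tables.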
- Column family stores
- able to scale to enormous amounts of data
- often able to achieve very fast writes, while also maintaining reasonable read performance
- Column Family: table of related data
- Column families consist of rows that have unique row keys
- Rows consist of columns (potentially millions of them)
- Columns consist of a key and a value
- Value itself might be a JSON map that in turn has keys and values
- Hash tables all the way down
- tries its best to keep a whole row together on disk so that it can be read as a single stream -> the only things holding it back are network latency and disk latency
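"Hash tables all the way down" can be pictured with nested maps. A minimal in-memory sketch, assuming made-up row keys and column keys; `put` and `get` are hypothetical helpers, not a real column-store API:

```javascript
// column family -> row key -> column key -> value
const tweetsByUser = new Map();

function put(family, rowKey, columnKey, value) {
  if (!family.has(rowKey)) family.set(rowKey, new Map());
  family.get(rowKey).set(columnKey, value);
}

function get(family, rowKey, columnKey) {
  const row = family.get(rowKey);
  return row ? row.get(columnKey) : undefined;
}

// One row per user, one column per tweet (keyed by timestamp).
// Each value is itself a map of keys and values.
put(tweetsByUser, "user:1", "2014-01-01T10:00", { text: "hello", retweets: 2 });
put(tweetsByUser, "user:1", "2014-01-02T09:30", { text: "again", retweets: 0 });
put(tweetsByUser, "user:2", "2014-01-01T12:00", { text: "hi", retweets: 5 });

console.log(tweetsByUser.get("user:1").size);                          // 2
console.log(get(tweetsByUser, "user:2", "2014-01-01T12:00").retweets); // 5
```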
- Document databases: like key-value stores but with a little more structure
- insert documents (a bag of key-value pairs)
- each document gets indexed in a variety of ways
- docs can be found via queries on any attribute
- documents can be grouped into collections
- collections can be grouped into databases
- Each database is then used by a particular application to get its work done
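A toy version of a document collection that indexes every attribute on insert, so that documents can be found by any of them. This is a sketch under simplifying assumptions (flat documents, exact-match queries); `Collection` is a hypothetical class and the contacts are made up:

```javascript
// A collection is a list of schemaless documents plus per-attribute indexes.
class Collection {
  constructor() {
    this.docs = [];
    this.indexes = new Map(); // attribute -> (value -> array of documents)
  }

  insert(doc) {
    this.docs.push(doc);
    for (const [attr, value] of Object.entries(doc)) {
      if (!this.indexes.has(attr)) this.indexes.set(attr, new Map());
      const index = this.indexes.get(attr);
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc);
    }
  }

  // Look up documents by any attribute without scanning the whole collection.
  find(attr, value) {
    const index = this.indexes.get(attr);
    return (index && index.get(value)) || [];
  }
}

const contacts = new Collection();
contacts.insert({ name: "ann", state: "CO" });
contacts.insert({ name: "bob", state: "CO", phone: "555-0100" }); // extra attribute is fine
contacts.insert({ name: "cat", state: "NY" });

console.log(contacts.find("state", "CO").map((d) => d.name)); // [ 'ann', 'bob' ]
console.log(contacts.find("phone", "555-0100").length);       // 1
```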
- Document Database
- Implemented in Erlang, a language with a lot of use in telecommunications
- Massive concurrency, fault tolerance, distributed systems
- All of these features are on display in the design of CouchDB
- CouchDB's design embraces the web
- High availability trades consistency for EVENTUAL consistency
- Document databases: self-contained data
- CouchDB stores documents
- Each document contains everything that might be needed by an application.
- Avoid foreign keys etc, each document meant to stand on its own.
- Usually no references
- No schema enforced, each document can have a different set of attributes
- Consistency -> All clients see the same data even in the presence of concurrent updates.
- Availability -> All clients able to read or write the data store when they want
- Partition Tolerance -> DB can be split across multiple servers
- Consistency and Availability -> what relational DBs provide; low partition tolerance.
- Availability and Partition Tolerance -> provides the ability to scale horizontally and always be available for requests.
- Can only guarantee eventual consistency
- 3 clients issue the same query and get different results -> can design around this
- e.g. don't really need total consistency all the time (Facebook likes)
- Consistency and Partition Tolerance -> provides consistency across multiple DBs, but not always available for client requests
CouchDB chooses the second option (availability and partition tolerance).
- CouchDB uses B-tree storage engine.
- allows searches, insertions, and deletions in logarithmic time.
- Employs MapReduce over B-Tree to compute views of the data allowing for parallel and incremental computation.
- No Locking
- same idea that git uses -> multiple users can edit a repo.
- Each read of a doc returns a version of the document that was the latest when read started
- document can be written while it is being read, the next read will then return the new document
- Validation -> validation functions can be written in javascript
- Each time an update for a document is submitted the proposed change is passed to the validation function
- the validation function then chooses to accept or deny the update
- Merge Conflicts
- Can encounter merge conflicts.
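The validation step described above can be sketched as follows. In CouchDB the function is stored as `validate_doc_update` in a design document and rejects an update by throwing an object with a `forbidden` field; the rule itself (a required `name`) and the `tryUpdate` harness are assumptions made so the example runs locally.

```javascript
// Shape of a CouchDB validation function. CouchDB calls it with the proposed
// document, the current document, the user context, and the security object.
function validateDocUpdate(newDoc, oldDoc, userCtx, secObj) {
  if (newDoc._deleted) return; // allow deletions
  if (typeof newDoc.name !== "string" || newDoc.name.length === 0) {
    throw { forbidden: "Document must have a non-empty name" };
  }
}

// Local stand-in for the database calling the function on each update.
function tryUpdate(newDoc, oldDoc) {
  try {
    validateDocUpdate(newDoc, oldDoc, { name: "demo-user", roles: [] }, {});
    return "accepted";
  } catch (err) {
    return `denied: ${err.forbidden}`;
  }
}

console.log(tryUpdate({ name: "ann" }, null));       // accepted
console.log(tryUpdate({ phone: "555-0100" }, null)); // denied: Document must have a non-empty name
```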
Indexes enabling queries
MongoDB supports full text searching
Can weight some search factors more than others
OR, phrase matches, negations.... can get fancy and complicated
GeoJSON can specify points, lineStrings or Polygons.
geojson.io can be used to create a bounding box, integrated into queries
Map Reduce - Mongo offers the ability to create new collections from old collections using a map function and a reduce function. It can also take a query; if so, it first finds the results of the query and then applies MapReduce to the result set.
Basic map function that can be used to count up the number of times a screen name appears in a collection:
```javascript
function () {
  emit(this.user.screen_name, {count: 1});
}
```
A reduce function takes a key and a set of produced documents that were previously emitted. It must return a document that can be sent back into the reduce function in a subsequent phase. For our example, that means it must generate documents of the form {count: X} where X is a number.
```javascript
function (key, docs) {
  var total = 0;
  for (var i = 0; i < docs.length; i++) {
    total += docs[i].count;
  }
  return {count : total};
}
```
End Goal: At the end of the operation, you will have a set of documents in a new collection. Each document will have one key (in this case a screen_name) and a value that is equal to the number of times it appeared in the collection.
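To see that end result without a running MongoDB, the two functions can be driven by a tiny in-memory stand-in. This is a sketch only: `mapReduce` is a hypothetical driver, `emit` is passed in as a parameter instead of being the global that MongoDB provides, and the tweets are made up.

```javascript
// The map and reduce functions from the notes, lightly adapted.
function map(emit) {
  emit(this.user.screen_name, { count: 1 });
}

function reduce(key, docs) {
  let total = 0;
  for (const doc of docs) total += doc.count;
  return { count: total };
}

// Tiny in-memory driver: group emitted values by key, then reduce each group.
function mapReduce(collection, mapFn, reduceFn) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of collection) mapFn.call(doc, emit); // `this` is the document

  const output = [];
  for (const [key, values] of groups) {
    output.push({ _id: key, value: reduceFn(key, values) });
  }
  return output;
}

const tweets = [
  { user: { screen_name: "ann" } },
  { user: { screen_name: "bob" } },
  { user: { screen_name: "ann" } },
];

console.log(mapReduce(tweets, map, reduce));
// [ { _id: 'ann', value: { count: 2 } }, { _id: 'bob', value: { count: 1 } } ]
```

Because reduce returns the same `{count: X}` shape it receives, its output can safely be fed back into another reduce pass, which is the requirement stated above.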
Simple way to get search functionality for your DB. Works well with Ruby. Fast, reliable, and has many scalability features.
Built on top of Lucene (written in Java, so portable) and exposed as a RESTful web server API.
When integrating into the model, you can add searchable fields to the classes. The controller can also handle pagination. The view is where the results of the query are listed and can be interacted with using a search box.
Ruby: convention over configuration.
Pitfalls: Solr reindexes from the beginning, is started by default in the production environment, Solr configs and git
Key-value store DB.
Memory-oriented. Fast.
Keys: string, concise, max 512MB
Values: strings, can contain any info, max 512MB
Can contain multiple types. The data is loosely typed.
Features: sorted sets, bitmaps, hyperloglogs, persistence (snapshotting or append-only file)
Helpful for processing data coming over the network in real-time.
Distributed, fault-tolerant, high-throughput publish-subscribe messaging system.
Relies on Apache ZooKeeper; written in Scala. Keeps all messages for up to N days.
Similar to RabbitMQ, Flume, database, Redis pub/sub, supercomputer.
###Something here...
Graph Database: nodes and edges
Relationships can have properties and can be uni- or bidirectional
More likely to run on a single server than on a cluster.
Works well with Gremlin and Cypher
Aggregate-oriented databases - Neo4J is not one, which is why it's hard to shard
Runs on top of Hadoop (MapReduce and HDFS).
Open-source, column-oriented Hadoop database. Influenced by Google File System -> Google Bigtable.
Schemas contain column families which contain columns. Columns within column families are contiguous.
Great for scalability
Fast, scalable, growing in popularity
I know nothing about JS
Scope, modules, explicit stuff in JS
Just a huge live demo.
HDFS architecture: a bunch of nodes
Alongside Hadoop