This is an assignment for the class Programmieren 3 at the University of Applied Sciences Rosenheim.

Assignment 13: Map-Reduce and Collect

This assignment covers some of the more advanced concepts of the Java 8 Streams API. Specifically, it covers the following topics:

Map-Reduce
Collecting
Grouping-By

We'll use Map-Reduce to implement the classic word count example. As sample data, the repository contains about 3,000 tweets by Donald Trump, which we will analyze in this assignment. As an alternative, the assignment also contains a generator which uses the Twitter API to fetch the tweets live. To use this generator, you have to do some additional configuration.

A clever data scientist discovered that most of the angrier tweets came from Android, whereas the nicer ones were written on an iPhone (first article and follow-up article). We will see whether we can use Java's Streams to group the tweets by the kind of client that was used to create them.

Setup

Create a fork of this repository (button in the upper right corner)
Clone the project (get the link by clicking the green Clone or download button)
Import the project into IntelliJ
Read the whole assignment spec!

Remark: the given test suite is incomplete and won't succeed after the checkout!

Generators

To be able to analyze the tweets we need a generator which loads the given tweets (from a JSON file) and exposes them as a Stream.

The following UML shows the class structure of the generators and the factory which exposes them.

Remark: the UML is not complete; it is only meant as an implementation hint and for orientation.

Hint: the dependency on GSON has already been added, and GSON exposes the following method:

Gson gson = new Gson();
Tweet[] tweets = gson.fromJson(reader, Tweet[].class);

Here, reader is an instance of the abstract class Reader. To access files from the resources folder, use something like the following snippet; it returns an InputStream, which you can wrap in an InputStreamReader:

getClass().getResourceAsStream("/path/to/trump_tweets.json");

Side note: this only works in a non-static context; in a static context you have to write something like this: MyClass.class.getResourceAsStream(...).
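The two hints above can be combined roughly as follows. This is only a sketch: the class name ResourceReaderSketch and the method openResource are made up for illustration, and UTF-8 is an assumption about the encoding of the JSON file.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the repository.
class ResourceReaderSketch {

    // Static context: there is no `this`, so the class literal replaces getClass().
    static Reader openResource(String path) {
        InputStream in = ResourceReaderSketch.class.getResourceAsStream(path);
        if (in == null) {
            // getResourceAsStream returns null when the resource cannot be found
            throw new IllegalArgumentException("Resource not found: " + path);
        }
        // Wrap the byte stream in a Reader (UTF-8 is an assumption)
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }
}
```

The returned Reader could then be handed to gson.fromJson(reader, Tweet[].class) as shown in the hint.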

To implement the OfflineTweetStreamGenerator follow these steps:

Create the class OfflineTweetStreamGenerator.
Implement the interface TweetStreamGenerator.
Implement the method getTweetStream by using GSON and the helper method Arrays.stream(T[] array) to load the JSON file and create a new Stream from the deserialized tweet array.

Those who want to play around with some additional language features could use try-with-resources for the deserialization of the tweets.
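A minimal sketch of the try-with-resources idea, using only the JDK. The method parse is a stand-in for gson.fromJson(reader, Tweet[].class) and simply treats every line as one "tweet"; all names in this block are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.stream.Stream;

class TryWithResourcesSketch {

    // Stand-in for the GSON call: one line of input becomes one array element.
    static String[] parse(Reader reader) throws IOException {
        return new BufferedReader(reader).lines().toArray(String[]::new);
    }

    static Stream<String> getTweetStream(Reader source) {
        // The reader is closed automatically when the try block is left,
        // even if parse(...) throws.
        try (Reader reader = source) {
            String[] tweets = parse(reader);
            // The array is already fully loaded, so the stream remains
            // usable after the reader has been closed.
            return Arrays.stream(tweets);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling getTweetStream(new StringReader("first\nsecond")) would yield a stream with the two elements first and second.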

And those who want a more functional way may be interested in this Gist.

Collecting

So far, Streams are nice to have, but printing all results to the command line with the terminal operation forEach(...) is not really practical.

To be able to process the data further, we need to collect it, e.g. in a List<>, a Set<> or any other Collection.

Fortunately the Streams API already defines a method .collect(...) and the JDK contains the utility class Collectors which defines the most common collectors.
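As a small illustration of collect with two of the predefined collectors, here is a sketch that works on plain strings instead of tweets. The class and method names are made up for this example.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CollectSketch {

    // Collects the transformed elements into a List, keeping encounter order.
    static List<String> toUpperCaseList(Stream<String> words) {
        return words.map(String::toUpperCase).collect(Collectors.toList());
    }

    // Collects into a Set, which drops duplicates.
    static Set<String> distinctWords(Stream<String> words) {
        return words.collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        System.out.println(toUpperCaseList(Stream.of("map", "reduce", "collect"))); // [MAP, REDUCE, COLLECT]
        System.out.println(distinctWords(Stream.of("a", "b", "a")).size());         // 2
    }
}
```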

With the OfflineTweetStreamGenerator working we can start to analyze the tweets in the class TrumpTweetStats. The following UML contains the signatures of the methods to be implemented in this assignment.

Don't worry, the method stubs are given already! The UML is only meant to keep the signatures in mind while you're reading the spec.

The first two tasks are the implementation of the methods:

calculateSourceAppStats
calculateTweetsBySourceApp

`calculateSourceAppStats`

This method groups the tweets by the app they were created with and counts how many tweets were created with each app. In SQL you would write it like this:

SELECT source, count(*) FROM tweets
GROUP BY source

If you want to try it on your own: the repository contains a SQL script that creates a tweet table and inserts all the tweets into it. The script has been tested on MSSQL and PostgreSQL, but if you want to use it on MySQL or MariaDB, some additional work might be necessary.

Every Tweet instance has two methods to access the source:

getSource()
getSourceApp()

The first one returns a string like this: <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>. The second one extracts the actual name of the app, in this case Twitter for iPhone. It does not matter for the assignment which one you choose, but the second one is a little bit prettier when the result is printed.
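The SQL query above corresponds roughly to a groupingBy collector with counting as the downstream collector. The following is a sketch under assumptions: the nested Tweet class is a minimal stand-in for the repository's class, and the return type Map<String, Long> is a guess, since the real signature is given by the method stubs.

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SourceStatsSketch {

    // Minimal stand-in for the repository's Tweet class; only the source app is modeled.
    static class Tweet {
        private final String sourceApp;

        Tweet(String sourceApp) {
            this.sourceApp = sourceApp;
        }

        String getSourceApp() {
            return sourceApp;
        }
    }

    // Rough equivalent of: SELECT source, count(*) FROM tweets GROUP BY source
    static Map<String, Long> countBySourceApp(Stream<Tweet> tweets) {
        return tweets.collect(
                Collectors.groupingBy(Tweet::getSourceApp, Collectors.counting()));
    }

    public static void main(String[] args) {
        Stream<Tweet> tweets = Stream.of(
                new Tweet("Twitter for iPhone"),
                new Tweet("Twitter for Android"),
                new Tweet("Twitter for iPhone"));
        // Contains 2 for "Twitter for iPhone" and 1 for "Twitter for Android"
        System.out.println(countBySourceApp(tweets));
    }
}
```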

Side note: the task is intended to be solved with collect, but it's also possible to do it with reduce!

`calculateTweetsBySourceApp`

This method is very similar to calculateSourceAppStats but instead of just counting the tweets it collects them as a Set<Tweet> for further analysis.
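Only the downstream collector changes compared to counting. The sketch below again uses a stand-in Tweet class (the text field and all names are assumptions for illustration) and collects each group into a Set.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TweetsBySourceSketch {

    // Minimal stand-in for the repository's Tweet class.
    static class Tweet {
        private final String sourceApp;
        private final String text;

        Tweet(String sourceApp, String text) {
            this.sourceApp = sourceApp;
            this.text = text;
        }

        String getSourceApp() {
            return sourceApp;
        }

        String getText() {
            return text;
        }
    }

    // Groups by app and keeps the tweets themselves instead of a count.
    static Map<String, Set<Tweet>> groupBySourceApp(Stream<Tweet> tweets) {
        return tweets.collect(
                Collectors.groupingBy(Tweet::getSourceApp, Collectors.toSet()));
    }
}
```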

Whereas calculateSourceAppStats was very easy to express in SQL, this method has no direct counterpart in plain SQL, because a relational result set cannot hold a map of collections (i.e. a tuple of tuples)!

Side note: the task is intended to be solved with collect, but it's also possible to do it with reduce!

Map-Reduce

A classic algorithm used to process huge amounts of data is Map-Reduce. As the name indicates, the algorithm consists of two steps:

Map - transform the data in parallel
Reduce - retrieve all interim results and aggregate them
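The two steps can be illustrated with a deliberately tiny example that is unrelated to the word count task: every text is mapped to its length, and the lengths are then reduced to a single sum. The class and method names are made up for this sketch.

```java
import java.util.stream.Stream;

class MapReduceSketch {

    static int totalLength(Stream<String> texts) {
        return texts
                .map(String::length)       // map: transform each element independently
                .reduce(0, Integer::sum);  // reduce: aggregate the interim results
    }

    public static void main(String[] args) {
        System.out.println(totalLength(Stream.of("Hello", "World"))); // 10
    }
}
```

Because each element is mapped independently and Integer::sum is associative, the same pipeline would also work on a parallel stream.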

If you're looking for examples of Map-Reduce, the first hit will most likely be the word count problem. It's relatively simple to implement as there's not much transformation required and it demonstrates the concept very well.

We want to analyze which words are the most common in the given tweets. The following flow chart is meant as a guide for implementing the Map-Reduce algorithm.

The text of the tweet can be split like this:

String[] split = "Hello World".split("( )+");
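Applied to a whole stream of texts, the split can be combined with flatMap so that every word becomes its own stream element. This is a sketch with made-up names; the filter for empty strings is an addition, needed because split produces an empty leading element when the text starts with a space.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SplitSketch {

    static List<String> words(Stream<String> texts) {
        return texts
                // one text becomes many words
                .flatMap(text -> Arrays.stream(text.split("( )+")))
                // drop the empty string that a leading space produces
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words(Stream.of("Hello   World", " foo bar"))); // [Hello, World, foo, bar]
    }
}
```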

The reduce method takes three arguments. The first is the initial value of the so-called accumulator (the API calls it the identity); for this task, an instance of HashMap<> or LinkedHashMap<> seems to be a good idea. The second is the reduction step, which should be an instance of BiFunction<>: it receives the accumulator and a single value and returns the accumulator after the value has been processed (e.g. inserted into a list). The third is a combiner, which merges two accumulator values; it is only invoked for parallel streams, so you won't need it to do any real work this time.
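The three parts described above might look like this for a stream of words. This is one possible sketch, not the reference solution: the names are made up, and the real calculateWordCount works on tweets rather than on words. Note that sharing one mutable map as the starting value is only safe for sequential streams.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

class WordCountReduceSketch {

    static Map<String, Integer> countWords(Stream<String> words) {
        // Starting value of the accumulator
        Map<String, Integer> initial = new HashMap<>();
        return words.reduce(
                initial,
                // Reduction step: add one word to the map and hand the map on
                (counts, word) -> {
                    counts.merge(word, 1, Integer::sum);
                    return counts;
                },
                // Combiner: merges two partial maps; not invoked for sequential streams
                (left, right) -> {
                    right.forEach((word, count) -> left.merge(word, count, Integer::sum));
                    return left;
                });
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Stream.of("a", "b", "a"));
        System.out.println(counts.get("a")); // 2
        System.out.println(counts.get("b")); // 1
    }
}
```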

Debugging hint: the latest IntelliJ IDEA ships with the plugin Java Stream Debugger. The plugin visualizes how the stream is transformed step by step (including the actual data). That's very helpful if something unexpected is happening!

Another hint: the given 3225 tweets are too many to debug your Stream comfortably if something is going wrong. It may help to limit the Stream you're passing to the method calculateWordCount to e.g. 200 elements!
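Limiting a stream is a single call to limit. The helper below is a hypothetical sketch (its name and parameters are assumptions) that shows the effect on a stream of 3225 numbers instead of tweets.

```java
import java.util.stream.IntStream;
import java.util.stream.Stream;

class LimitSketch {

    // Returns a stream that contains at most maxSize elements of the input.
    static <T> Stream<T> sample(Stream<T> elements, long maxSize) {
        return elements.limit(maxSize);
    }

    public static void main(String[] args) {
        Stream<Integer> all = IntStream.rangeClosed(1, 3225).boxed();
        System.out.println(sample(all, 200).count()); // 200
    }
}
```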

Last but not least: the Map-Reduce algorithm is a little bit tricky when you implement it for the first time. If you're stuck, Google can help you!

There are already some unit tests but it might be a good idea to extend the test suite.

Using the Twitter API

To use Twitter4j, you have to configure it by setting the OAuth consumer key, consumer secret, access token and access token secret in a file called twitter4j.properties in the root of your resources folder (right next to the files stopwords.txt and trump_tweets.json). The file should have the following structure:

debug=true
oauth.consumerKey=<dummy>
oauth.consumerSecret=<dummy>
oauth.accessToken=<dummy>
oauth.accessTokenSecret=<dummy>

To get these tokens you need to register a Twitter app. Then you have to fill in some basic information about the "app" you're creating.

After the registration of your new app you'll be able to retrieve the required information. Copy the given structure to a new file twitter4j.properties and replace the <dummy> strings with your actual keys and secrets and you should be able to fetch the tweets live from the API.
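If the live generator does not work, it can help to check that no placeholder was left in the file. The following helper is purely hypothetical and not part of Twitter4j or the repository; it only uses java.util.Properties to list the keys from the structure above that are missing or still set to <dummy>.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class TwitterConfigCheckSketch {

    // The four OAuth keys from the file structure shown above
    static final String[] REQUIRED_KEYS = {
        "oauth.consumerKey",
        "oauth.consumerSecret",
        "oauth.accessToken",
        "oauth.accessTokenSecret"
    };

    // Returns the required keys that are absent, empty or still "<dummy>".
    static List<String> missingKeys(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key, "").trim();
            if (value.isEmpty() || value.equals("<dummy>")) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```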
