This is an assignment for the class Programmieren 3 at the University of Applied Sciences Rosenheim.
This assignment covers some of the more advanced concepts of the Java 8 Streams API.
The concepts specifically covered by this assignment are:
- Map-Reduce
- Collecting
- Grouping-By
We'll use Map-Reduce to implement the classic word count example. As sample data, the repository contains about 3,000 tweets by Donald Trump, which we will analyze in this assignment. As an alternative, the assignment also contains a generator that uses the Twitter API to fetch the tweets live. To use this generator, some additional configuration is required.
A clever data scientist discovered that most of the angrier tweets came from Android, whereas the nicer ones were written on an iPhone (first article and follow-up article).
We will see whether we can use Java's Streams to group the tweets by the kind of client that was used to create them.
- Create a fork of this repository (button in the upper right corner)
- Clone the project (get the link by clicking the green Clone or download button)
- Import the project into IntelliJ
- Read the whole assignment spec!
Remark: the given test suite is incomplete and won't succeed after the checkout!
To analyze the tweets, we need a generator that loads the given tweets (from a JSON file) and exposes them as a Stream.
The following UML shows the class structure of the generators and the factory which exposes them.
Remark: the UML is not complete; it is only meant as an implementation hint and for orientation.
Hint: the dependency on GSON is already added and GSON exposes the following method:
Gson gson = new Gson();
Tweet[] tweets = gson.fromJson(reader, Tweet[].class);
Here, reader is an instance of the abstract class Reader.
To access files in the resources folder, use something like this snippet:
getClass().getResourceAsStream("/path/to/trump_tweets.json");
Side note: this is only possible in a non-static context; otherwise you have to write something like this: MyClass.class.getResourceAsStream(...).
To implement the OfflineTweetStreamGenerator, follow these steps:
- Create the class OfflineTweetStreamGenerator.
- Implement the interface TweetStreamGenerator.
- Implement the method getTweetStream by using GSON and the helper method Arrays.stream(T[] array) to load the JSON file and create a new Stream from the deserialized tweet array.
Those who want to play around with some additional language features could use try-with-resources for the deserialization of the tweets.
And those who want a more functional way may be interested in this Gist.
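The loading step could look roughly like the following sketch. To stay runnable without GSON on the classpath, it takes a stand-in `Function<Reader, T[]>` where the real generator would call `gson.fromJson(reader, Tweet[].class)`. The class name `ArrayStreamLoader` and the methods `load` and `splitOnCommas` are hypothetical and not part of the repository.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.function.Function;
import java.util.stream.Stream;

class ArrayStreamLoader {

    // Parses an array from the reader and exposes it as a Stream.
    // `parser` is a stand-in (assumption) for a call such as
    // gson.fromJson(reader, Tweet[].class).
    static <T> Stream<T> load(Reader source, Function<Reader, T[]> parser) {
        // try-with-resources closes the reader even if parsing throws
        try (Reader reader = source) {
            T[] items = parser.apply(reader);
            return Arrays.stream(items);
        } catch (IOException e) {
            // close() may throw; wrap it so callers need no checked exception
            throw new UncheckedIOException(e);
        }
    }

    // Toy parser for the demo: reads the whole input and splits it on commas
    static String[] splitOnCommas(Reader reader) {
        StringBuilder content = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                content.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString().split(",");
    }

    public static void main(String[] args) {
        Stream<String> items = load(new StringReader("a,b,c"), ArrayStreamLoader::splitOnCommas);
        items.forEach(System.out::println);
    }
}
```

Because the array is fully read before the reader is closed, the returned Stream does not depend on the reader staying open.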
Until now, Streams have been nice to have, but printing all results to the command line with the terminal operation forEach(...) is not really practical.
To process the data we need to collect it, e.g. in a List<>, a Set<> or any other Collection.
Fortunately, the Streams API already defines a method .collect(...) and the JDK contains the utility class Collectors, which defines the most common collectors.
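As a small illustration of collecting, the following sketch turns a Stream of strings into a List and into a Set with the collectors from the JDK. The class name `CollectSketch` and the sample values are made up for this example.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CollectSketch {

    // Collects all elements into a List (duplicates are kept)
    static List<String> asList(Stream<String> stream) {
        return stream.collect(Collectors.toList());
    }

    // Collects all elements into a Set (duplicates are dropped)
    static Set<String> asSet(Stream<String> stream) {
        return stream.collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        System.out.println(asList(Stream.of("a", "b", "a")).size()); // prints 3
        System.out.println(asSet(Stream.of("a", "b", "a")).size());  // prints 2
    }
}
```

Note that a Stream can only be consumed once, which is why the example creates a fresh Stream for each call.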
With the OfflineTweetStreamGenerator working, we can start to analyze the tweets in the class TrumpTweetStats.
The following UML contains the signatures of the methods to be implemented in this assignment.
Don't worry, the method stubs are given already! The UML is only meant to keep the signatures in mind while you're reading the spec.
The first two tasks are the implementation of the methods:
- calculateSourceAppStats
- calculateTweetsBySourceApp
The method calculateSourceAppStats groups the tweets by the app they were created with and counts how many tweets were created with each app. In SQL you would write it like this:
SELECT source, count(*) FROM tweets
GROUP BY source
If you want to try it yourself: the repository contains a SQL script that creates a tweet table and inserts all the tweets into it.
The script has been tested on MSSQL and PostgreSQL but if you want to use it on MySQL or MariaDB some additional work might be necessary.
Every Tweet instance has two methods to access the source:
- getSource()
- getSourceApp()
The first one returns a string like this: <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>.
The second one extracts the actual name of the app: Twitter for iPhone.
It does not matter for the assignment which one you choose, but the second one is a little bit prettier when the result is printed.
Side note: the task is meant to be solved with collect but it's also possible to do it with reduce!
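The counterpart of the SQL query above can be sketched with `Collectors.groupingBy` and `Collectors.counting`. To keep the sketch independent of the repository, it works on plain strings that stand in for the values returned by `getSourceApp()`; the class name `SourceCountSketch` and the method `countBySource` are hypothetical.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SourceCountSketch {

    // Equivalent of: SELECT source, count(*) ... GROUP BY source
    static Map<String, Long> countBySource(Stream<String> sources) {
        return sources.collect(
                // the key is the source itself, the value is the number of occurrences
                Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> stats = countBySource(Stream.of(
                "Twitter for iPhone", "Twitter for Android", "Twitter for iPhone"));
        stats.forEach((source, count) -> System.out.println(source + ": " + count));
    }
}
```

With real tweets, the classifier would presumably be a method reference to one of the two source getters instead of `Function.identity()`.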
The method calculateTweetsBySourceApp is very similar to calculateSourceAppStats, but instead of just counting the tweets it collects them as a Set<Tweet> for further analysis.
Whereas the method calculateSourceAppStats is very easy to implement with SQL, this method is impossible to implement in SQL because SQL does not define a map of lists (i.e. a tuple of tuples)!
Side note: the task is meant to be solved with collect but it's also possible to do it with reduce!
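Grouping into sets instead of counting only requires a different downstream collector. The following sketch uses a minimal stand-in class `SimpleTweet` (an assumption for illustration, not the `Tweet` class of the repository) that offers just a `getSourceApp()` and a `getText()` method.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class GroupBySourceSketch {

    // Hypothetical stand-in for the Tweet class of the assignment
    static final class SimpleTweet {
        private final String sourceApp;
        private final String text;

        SimpleTweet(String sourceApp, String text) {
            this.sourceApp = sourceApp;
            this.text = text;
        }

        String getSourceApp() { return sourceApp; }
        String getText() { return text; }
    }

    // Groups the tweets by source app and keeps the tweets themselves as a Set
    static Map<String, Set<SimpleTweet>> groupBySource(Stream<SimpleTweet> tweets) {
        return tweets.collect(
                Collectors.groupingBy(SimpleTweet::getSourceApp, Collectors.toSet()));
    }

    public static void main(String[] args) {
        Map<String, Set<SimpleTweet>> grouped = groupBySource(Stream.of(
                new SimpleTweet("Twitter for Android", "first"),
                new SimpleTweet("Twitter for iPhone", "second"),
                new SimpleTweet("Twitter for Android", "third")));
        grouped.forEach((app, set) -> System.out.println(app + ": " + set.size()));
    }
}
```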
A classical algorithm used to process huge amounts of data is Map-Reduce. As the name already indicates the algorithm consists of two steps:
- Map - transform the data in parallel
- Reduce - retrieve all interim results and aggregate them
If you're looking for examples of Map-Reduce, the first hit will most likely be the word count problem. It's relatively simple to implement as there's not much transformation required and it demonstrates the concept very well.
We want to analyze which words are the most common in the given tweets. The following flow chart is meant as orientation for implementing the Map-Reduce algorithm.
The text of the tweet can be split like this:
String[] split = "Hello World".split("( )+");
The reduce method requires a so-called accumulator (it is passed as the identity argument of reduce).
For this method, an instance of HashMap<> or LinkedHashMap<> seems to be a good idea.
The next part is the reduction step and should be an instance of BiFunction<>.
It's a function where the accumulator and a single value are passed in and the accumulator is returned after the value is processed (e.g. inserted into a list).
The last part is a combiner.
It's meant to combine two accumulator values but you won't need it this time.
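Put together, the three parts could look like the following sketch, which counts words in plain strings rather than in tweets. The class name `WordCountSketch` and the method `countWords` are hypothetical. Note that the sketch mutates the map it passes in as the first argument, so it is only suitable for sequential streams, where the combiner is never called.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

class WordCountSketch {

    static Map<String, Integer> countWords(Stream<String> texts) {
        // Part 1: the accumulator that collects the interim results
        Map<String, Integer> accumulator = new LinkedHashMap<>();
        return texts
                // Map step: split every text into its words
                .flatMap(text -> Arrays.stream(text.split("( )+")))
                // A leading space yields an empty first token, drop it
                .filter(word -> !word.isEmpty())
                .reduce(accumulator,
                        // Part 2: the reduction step, add one word to the map
                        (counts, word) -> {
                            counts.merge(word, 1, Integer::sum);
                            return counts;
                        },
                        // Part 3: the combiner, unused for sequential streams
                        (left, right) -> left);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Stream.of("Hello World", "Hello  again"));
        System.out.println(counts); // prints {Hello=2, World=1, again=1}
    }
}
```

A LinkedHashMap keeps the words in the order of their first occurrence, which makes the printed result easier to follow than with a HashMap.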
Debugging hint: the latest IntelliJ IDEA ships with the plugin Java Stream Debugger. The plugin visualizes how the stream is transformed step by step (including the actual data). That's very helpful if something unexpected is happening!
Another hint: the given 3225 tweets are a bit too many to debug your Stream if something goes wrong. It may help to limit the Stream you're passing to the method calculateWordCount to e.g. 200 elements!
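Limiting a Stream is a single call to `limit`. The sketch below shows it on an infinite Stream of numbers; the class name `LimitSketch` and the method `firstElements` are made up for this example.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LimitSketch {

    // Keeps only the first `max` elements of the stream
    static <T> List<T> firstElements(Stream<T> stream, long max) {
        return stream.limit(max).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stream.iterate produces an infinite stream: 1, 2, 3, ...
        Stream<Integer> numbers = Stream.iterate(1, i -> i + 1);
        System.out.println(firstElements(numbers, 200).size()); // prints 200
    }
}
```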
Last but not least: the Map-Reduce algorithm is a little bit tricky when you implement it for the first time. If you're stuck, Google can help you!
There are already some unit tests but it might be a good idea to extend the test suite.
To use Twitter4j, you have to configure it by setting the OAuth consumer key, consumer secret, access token and access token secret in a file called twitter4j.properties in the root of your resources folder (right next to the files stopwords.txt and trump_tweets.json).
The file should have the following structure:
debug=true
oauth.consumerKey=<dummy>
oauth.consumerSecret=<dummy>
oauth.accessToken=<dummy>
oauth.accessTokenSecret=<dummy>
To get these tokens you need to register a Twitter app. Then you have to fill in some basic information about the "app" you're creating.
After the registration of your new app you'll be able to retrieve the required information.
Copy the given structure to a new file twitter4j.properties and replace the <dummy> strings with your actual keys and secrets, and you should be able to fetch the tweets live from the API.