
csc8101's Introduction

CSC8101 web log analytics coursework, academic year 2014/15

This coursework introduces you to real-time analytics and basic stream processing.

Some analytics applications require that reports or visualizations are available on demand, usually with sub-second response times, i.e. an experience comparable to browsing most modern dynamically rendered web sites.

A typical application domain for such analytics is system log or metric processing, such as for monitoring activity on a group of servers.

Meeting such response time targets, particularly under high load, usually requires storing the data in such a way as to minimise the disk read activity necessary for query execution. This in turn can require denormalising the data and pre-calculating summary statistics.

In this exercise you are provided with web server activity data, comprising a time ordered list of the site access activity. Client IP addresses have been replaced with unique identifiers for anonymity.

For part one, the required query is:

  • for a given set of URLs, start hour and end hour, show the total number of accesses for each URL during each hour of the period.

You should write code to process the log data and write a database table, such that each application query can be satisfied efficiently with minimal disk reads.
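One way to keep the query cheap is to pre-aggregate: every log line becomes a single counter increment keyed by (URL, hour), and the query reads one row per URL per hour. The sketch below shows only the bucketing arithmetic. The class name `HourBuckets`, the table name `url_hits_by_hour` in the comment, and the choice to treat the end hour as inclusive are assumptions for illustration, not part of the coursework specification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps timestamps to hour buckets (epoch milliseconds, UTC).
class HourBuckets {
    static final long HOUR_MS = 60 * 60 * 1000L;

    // Round a timestamp down to the start of its hour.
    static long floorToHour(long epochMillis) {
        return epochMillis - Math.floorMod(epochMillis, HOUR_MS);
    }

    // All hour buckets from the start hour to the end hour (inclusive, by assumption).
    // A query client would read one counter per (url, hour) pair in this list.
    static List<Long> hoursBetween(long startMillis, long endMillis) {
        List<Long> hours = new ArrayList<>();
        long last = floorToHour(endMillis);
        for (long h = floorToHour(startMillis); h <= last; h += HOUR_MS) {
            hours.add(h);
        }
        return hours;
    }
}

// On the write side, each log line would then map to one increment, for example
// (hypothetical table and column names):
//   UPDATE url_hits_by_hour SET hits = hits + 1 WHERE url = ? AND hour = ?
```

Because the processor holds no per-URL or per-hour state of its own, its memory use stays constant regardless of the data volume, the number of distinct URLs or the time span.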

Your log processing code must not use an amount of memory proportional to the amount of data being processed, the number of distinct URLs or the data time span.

You may assume entries in the origin file are in time order, although this is not a realistic real world assumption.

You are not required to deal with fault tolerance concerns beyond basic exception handling.

You are not required to provide command line argument parsing for your query client, but the query parameters should be easily editable in your source file.

The log line format is: client_id timestamp "method url version" status size

Hints:

  • URLs are encoded and thus don't contain spaces. Timestamps contain exactly one space.
  • The timestamp format is [dd/MMM/yyyy:HH:mm:ss z].
  • The database server is capable of handling multiple requests concurrently.
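The first two hints mean a line can be split on single spaces into exactly eight tokens. A minimal parsing sketch under that reading follows. The class name `LogLine`, its field names, and the assumption that status and size are always numeric are illustrative; the sample line used in the usage note is made up to match the stated format.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical value class for one parsed log line.
class LogLine {
    final String clientId;
    final long timestampMillis;
    final String method;
    final String url;
    final String version;
    final int status;
    final long size;

    private LogLine(String clientId, long timestampMillis, String method,
                    String url, String version, int status, long size) {
        this.clientId = clientId;
        this.timestampMillis = timestampMillis;
        this.method = method;
        this.url = url;
        this.version = version;
        this.status = status;
        this.size = size;
    }

    // Expected shape: client_id [dd/MMM/yyyy:HH:mm:ss z] "method url version" status size
    static LogLine parse(String line) {
        String[] t = line.split(" ");
        if (t.length != 8) {
            throw new IllegalArgumentException("unexpected token count: " + line);
        }
        // The timestamp spans two tokens: "[dd/MMM/yyyy:HH:mm:ss" and "z]".
        String ts = t[1].substring(1) + " " + t[2].substring(0, t[2].length() - 1);
        // SimpleDateFormat is not thread-safe, so this sketch creates one per call.
        SimpleDateFormat fmt = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss z", Locale.ENGLISH);
        long millis;
        try {
            millis = fmt.parse(ts).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad timestamp: " + ts, e);
        }
        String method = t[3].substring(1);                        // drop leading quote
        String version = t[5].substring(0, t[5].length() - 1);    // drop trailing quote
        return new LogLine(t[0], millis, method, t[4], version,
                Integer.parseInt(t[6]), Long.parseLong(t[7]));
    }
}
```

For example, `LogLine.parse("34600 [30/Apr/1998:21:30:17 +0000] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736")` yields a `url` of `/images/hm_bg.jpg` and a `status` of 200. In a hot loop it may be worth reusing one formatter per thread rather than allocating one per line.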

For part two, you are required to identify user sessions and the activity in them. A session is a group of accesses by the same client; consecutive sessions from that client are separated by a defined interval of inactivity.

Unlike part one, this requires maintaining some state in the stream processor, since you can't immediately identify the end of a session when you see it - it's only apparent after the inactivity period (30 minutes).

The required query is:

  • For a given client id, show all sessions with their start time, end time, number of accesses and approximate number of distinct URLs accessed.
  • Provide an approximate number of distinct URLs over all the sessions for the given client id.

Your log processing code may use an amount of memory approximately proportional to the number of concurrent sessions, but not the total number of sessions, total clients or number of accesses per session.
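A minimal sketch of one way to bound the session state, assuming an access-ordered `LinkedHashMap` keyed by client id: the least recently active session is always the eldest entry, so it can be checked for expiry whenever a new session is inserted. The class names `Session` and `SessionTracker`, the `closed` list (a stand-in for a database write) and the strict "more than 30 minutes" comparison are all assumptions made for illustration. Distinct-URL counting is left out here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-session state: constant size regardless of accesses.
class Session {
    final String clientId;
    final long start;
    long lastAccess;
    long accesses;

    Session(String clientId, long t) {
        this.clientId = clientId;
        this.start = t;
        this.lastAccess = t;
    }
}

class SessionTracker extends LinkedHashMap<String, Session> {
    static final long TIMEOUT_MS = 30 * 60 * 1000L;

    // Stand-in for writing a finished session to the database.
    final List<Session> closed = new ArrayList<>();
    private long now;

    SessionTracker() {
        super(16, 0.75f, true); // true = access order, eldest = least recently active
    }

    void onAccess(String clientId, long t) {
        now = t;
        Session s = get(clientId); // in access order, get() moves the entry to the newest end
        if (s != null && t - s.lastAccess > TIMEOUT_MS) {
            remove(clientId);      // same client returned after the timeout: close the old session
            closed.add(s);
            s = null;
        }
        if (s == null) {
            s = new Session(clientId, t);
            put(clientId, s);      // inserting a new key triggers removeEldestEntry
        }
        s.lastAccess = t;
        s.accesses++;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Session> eldest) {
        if (now - eldest.getValue().lastAccess > TIMEOUT_MS) {
            closed.add(eldest.getValue());
            return true;
        }
        return false;
    }
}
```

Note that `removeEldestEntry` evicts at most one entry per insertion of a new key, so expired sessions can linger briefly during quiet periods, and whatever remains in the map when the stream ends still needs to be flushed.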

Hints:

  • LinkedHashMap#removeEldestEntry
  • HyperLogLog(0.05)
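The 0.05 in the second hint is presumably a target relative standard deviation for the distinct-URL estimate. Assuming the usual HyperLogLog error estimate of roughly 1.04/sqrt(m) for m registers, the sketch below works out how many registers that implies. The class name `HllSizing` is hypothetical, and a particular library may size its registers with a slightly different constant.

```java
// Hypothetical helper: back-of-envelope HyperLogLog sizing.
class HllSizing {
    // Smallest power-of-two register count whose estimated error is at most rsd,
    // using the standard approximation error ~= 1.04 / sqrt(m).
    static int registersFor(double rsd) {
        double needed = Math.pow(1.04 / rsd, 2);
        int m = 1;
        while (m < needed) {
            m <<= 1;
        }
        return m;
    }

    // Estimated relative standard error for m registers.
    static double errorFor(int m) {
        return 1.04 / Math.sqrt(m);
    }
}
```

Under these assumptions a 5% target needs 512 registers, so each open session carries a small fixed-size sketch no matter how many URLs it touches, which is what keeps memory independent of the number of accesses per session.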

Deliverables

Java code to populate and query the database. The processing code should be MessageHandler plus any additional source files you create and call from it. The processing code should populate the tables for both parts 1 and 2 in a single pass over the log data. The query code may be individual classes, each having either a main method or unit tests (or both).

Code will be judged on correctness, maintainability and performance. Note the last two are not mutually exclusive, just hard to achieve at the same time. Your code may use existing 3rd party libraries provided they are open source and you comply with the licence terms.


csc8101's People

Contributors

jhalliday


csc8101's Issues

Consumer problems

My consumer seems to start the job properly and do the work. However, it does not consume all the messages from the queue.
-- Meters ----------------------------------------------------------------------
throughput
count = 457546
mean rate = 4836.02 events/second
1-minute rate = 4113.17 events/second
5-minute rate = 2027.04 events/second
15-minute rate = 1347.94 events/second

INFO myclient_ip-172-31-11-102-1423232252518-a037b72d_watcher_executor - [myclient_ip-172-31-11-102-1423232252518-a037b72d], stopping watcher executor thread for consumer myclient_ip-172-31-11-102-1423232252518-a037b72d

As you can see, the myclient group has not consumed all the messages it should have.

ubuntu@ip-172-31-11-102:~/Big-data-cassandra$ ~/kafka/bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 --group myclient
Group Topic Pid Offset logSize Lag Owner
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
myclient csc8101 0 268803 50489553 50220750 none
myclient csc8101 1 259671 49879429 49619758 none
myclient csc8101 2 259346 50039695 49780349 none
myclient csc8101 3 237359 49591323 49353964 none

Re-preparing already prepared query

[Cassandra Java Driver worker-0] WARN com.datastax.driver.core.Cluster - Re-preparing already prepared query Please note that preparing the same query more than once is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once.

I am calling session.prepare on my query and then session.executeAsync. Since I am using counters, my queries will often be the same. Is there a way to check whether the query is already prepared and just use the prepared one?
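One common way to avoid re-preparing is to prepare each distinct query string once and keep the result, either in a field initialised at start-up or in a small cache. The sketch below shows the cache variant in a driver-independent form: `StatementCache` is a hypothetical class, and the `preparer` function stands in for whatever call does the preparing (with the DataStax driver that would presumably be `session::prepare`, with values bound separately on each execution).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical cache: prepares each distinct query string at most once.
class StatementCache<T> {
    private final ConcurrentMap<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> preparer;

    StatementCache(Function<String, T> preparer) {
        this.preparer = preparer;
    }

    // Returns the cached statement, preparing it only on first use of this query text.
    T get(String cql) {
        return cache.computeIfAbsent(cql, preparer);
    }
}
```

For this to help, the query text must be identical across calls, which means using `?` placeholders for the URL, hour and so on rather than concatenating values into the string.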
